CS211^1
CS 211: Computer Architecture
Introduction to ILP Processors & Concepts
Course Outline
• Introduction: Trends, Performance models
• Review of computer organization and ISA implementation
• Overview of Pipelining
• ILP Processors: Superscalar Processors
  – Next! ILP Intro and Superscalar
• ILP: EPIC/VLIW Processors
• Compiler optimization techniques for ILP processors – getting max performance out of ILP design
• Part 2: Other components - memory, I/O.
Introduction to Instruction Level Parallelism (ILP)
• What is ILP?
  – Processor and compiler design techniques that speed up execution by causing individual machine operations to execute in parallel
• ILP is transparent to the user
  – Multiple operations executed in parallel even though the system is handed a single program written with a sequential processor in mind
• Same execution hardware as a normal RISC machine
  – May be more than one of any given type of hardware
Architectures for Instruction-Level Parallelism
Scalar Pipeline (baseline)
• Instruction Parallelism = D
• Operation Latency = 1
• Peak IPC = 1
[Figure: scalar pipeline diagram with stages IF, DE, EX, WB; successive instructions plotted against time in cycles (of baseline machine), with a span labeled D]
Superpipelined Machine
Superpipelined Execution
• IP = DxM
• OL = M minor cycles
• Peak IPC = 1 per minor cycle (M per baseline cycle)
[Figure: superpipelined pipeline diagram with stages IF, DE, EX, WB for instructions 1 to 5; major cycle = M minor cycles]
Superscalar Machines
Superscalar (Pipelined) Execution
• IP = DxN
• OL = 1 baseline cycle
• Peak IPC = N per baseline cycle
[Figure: superscalar pipeline diagram with stages IF, DE, EX, WB; groups of instructions labeled N]
Superscalar and Superpipelined
Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e. if n = m then both have about the same IPC.
Superscalar Parallelism
• Operation Latency: 1
• Issuing Rate: N
• Superscalar Degree (SSD): N (determined by issue rate)
Superpipeline Parallelism
• Operation Latency: M
• Issuing Rate: 1
• Superpipelined Degree (SPD): M (determined by operation latency)
[Figure: SUPERSCALAR vs. SUPERPIPELINED execution timelines, time in cycles (of base machine) from 0 to 13; key: IFetch, Dcode, Execute, Writeback]
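As a rough check on the definitions above, the peak figures for the three machine types can be tabulated. This is only a minimal Python sketch: the function names and the dictionary layout are illustrative assumptions, and peak IPC is counted per baseline cycle.

```python
# Peak figures for the three pipeline organizations described above.
# depth = D (baseline pipeline depth), m = superpipelined degree,
# n = superscalar degree. Names are illustrative, not from the slides.

def scalar(depth):
    return {"instruction_parallelism": depth, "peak_ipc": 1}

def superpipelined(depth, m):
    # M minor cycles per baseline cycle, one issue per minor cycle.
    return {"instruction_parallelism": depth * m, "peak_ipc": m}

def superscalar(depth, n):
    # N instructions issued per baseline cycle.
    return {"instruction_parallelism": depth * n, "peak_ipc": n}

if __name__ == "__main__":
    print(scalar(4))
    print(superpipelined(4, 3))
    print(superscalar(4, 3))
```

With equal degree (m = n) the two organizations report the same peak numbers, which is the point of the comparison; real machines differ once stalls are taken into account.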
Limitations of Inorder Pipelines
• CPI of inorder pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e. when NxM approaches the average distance between dependent instructions
• Forwarding is no longer effective
  ⇒ must stall more often
  – Pipeline may never be full due to frequent dependency stalls!!
[Figure: inorder pipeline diagram with stages IF, DE, EX, WB, numbered 1 to 9]
ILP: Instruction-Level Parallelism
• ILP is a measure of the amount of inter-dependencies between instructions
• Average ILP = no. of instructions / no. of cycles required
code1: ILP = 1, i.e. must execute serially
  r1 ← r2 + 1
  r3 ← r1 / 17
  r4 ← r0 - r
code2: ILP = 3, i.e. can execute at the same time
  r1 ← r2 + 1
  r3 ← r9 / 17
  r4 ← r0 - r
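The average ILP figure can be reproduced with a small calculation. The sketch below assumes unit latency, unlimited hardware, and only true (read-after-write) dependences. The last source register of each snippet is cut off in these notes, so `r3` (code1) and `r10` (code2) are assumed values, chosen to agree with the stated ILP of 1 and 3.

```python
def average_ilp(instrs):
    """instrs: list of (dest, [sources]) in program order.

    Assumes unit latency and unlimited functional units, and tracks only
    true (RAW) dependences. Returns instructions / cycles required.
    """
    ready = {}   # register -> cycle after which its newest value is available
    finish = 0
    for dest, srcs in instrs:
        # An instruction starts once all of its source values are ready.
        start = max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = start + 1
        finish = max(finish, start + 1)
    return len(instrs) / finish

# Last operands (r3, r10) are assumptions; see the note above.
code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

if __name__ == "__main__":
    print(average_ilp(code1), average_ilp(code2))
```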
Inter-instruction Dependences
• Data dependence: Read-after-Write (RAW)
• Anti-dependence: Write-after-Read (WAR)
• Output dependence: Write-after-Write (WAW)
• Control dependence
[Each of the three data cases is illustrated on the slide by instructions of the form r ← r op r; the register numbers did not survive extraction]
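The three register dependence types can be told apart mechanically from the destination and source registers of two instructions. The following is a minimal sketch; the tuple layout and all register numbers in the examples are hypothetical, since the slide's own operands are not legible here.

```python
def classify(first, second):
    """Return the dependences of `second` on `first`.

    Each instruction is (dest, [sources]); `first` comes earlier in
    program order. Register names are plain strings.
    """
    d1, s1 = first
    d2, s2 = second
    found = []
    if d1 in s2:
        found.append("RAW")   # second reads what first writes
    if d2 in s1:
        found.append("WAR")   # second overwrites what first reads
    if d1 == d2:
        found.append("WAW")   # both write the same register
    return found

if __name__ == "__main__":
    # Hypothetical operands, for illustration only.
    print(classify(("r3", ["r1", "r2"]), ("r5", ["r3", "r4"])))
    print(classify(("r3", ["r1", "r2"]), ("r1", ["r4", "r5"])))
    print(classify(("r3", ["r1", "r2"]), ("r3", ["r6", "r7"])))
```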
Scope of ILP Analysis
  r1 ⇐ r2 + 1
  r3 ⇐ r1 / 17
  r4 ⇐ r0 - r3
  r11 ⇐ r12 + 1
  r13 ⇐ r19 / 17
  r14 ⇐ r0 - r
ILP=
ILP=1
Out-of-order execution permits more ILP to be exploited
Questions Facing ILP System Designers
• What gives rise to instruction-level parallelism in conventional, sequential programs?
• How is the potential parallelism identified and enhanced, and how much is there?
• What must be done in order to exploit the parallelism that has been identified?
• How should the work of identifying, enhancing and exploiting the parallelism be divided between the hardware and software (the compiler)?
• What are the alternatives in selecting the architecture of an ILP processor?
Sequential Processor
[Figure: Sequential Instructions → Processor → Execution unit]
ILP Processors: Superscalar
[Figure: Sequential Instructions → Superscalar Processor with Scheduling Logic]
Instruction scheduling/parallelism extraction done by hardware
ILP Processors: EPIC/VLIW
[Figure: Serial Program (C code) → compiler → Scheduled Instructions → EPIC Processor]
ILP Architectures
• Between the compiler and the run-time hardware, the following functions must be performed
  – Dependencies between operations must be determined
  – Operations that are independent of any operation that has not yet completed must be determined
  – Independent operations must be scheduled to execute at some particular time, on some specific functional unit, and must be assigned a register into which the result may be deposited.
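The three functions listed above (find dependences, find operations whose predecessors have completed, bind each one to a cycle and a functional unit) can be illustrated with a toy greedy scheduler. This is a sketch under stated assumptions, not how any particular compiler or processor does it: unit latency, dependences given explicitly, no register assignment, and every unit type used by an operation must exist in `units`.

```python
def list_schedule(ops, units):
    """ops: list of (name, unit_type, [names it depends on]) in program order.
    units: dict unit_type -> number of units of that type.

    Returns dict name -> (cycle, unit_type, unit_index). Assumes unit
    latency and an acyclic dependence graph.
    """
    done_cycle = {}
    placement = {}
    remaining = list(ops)
    cycle = 0
    while remaining:
        cycle += 1
        free = dict(units)            # units still unused in this cycle
        still_waiting = []
        for name, utype, deps in remaining:
            # Ready only if every predecessor finished in an earlier cycle.
            ready = all(d in done_cycle and done_cycle[d] < cycle for d in deps)
            if ready and free.get(utype, 0) > 0:
                index = units[utype] - free[utype]
                free[utype] -= 1
                placement[name] = (cycle, utype, index)
                done_cycle[name] = cycle
            else:
                still_waiting.append((name, utype, deps))
        remaining = still_waiting
    return placement

if __name__ == "__main__":
    ops = [("a", "alu", []), ("b", "alu", []),
           ("c", "mul", ["a", "b"]), ("d", "alu", ["a"])]
    print(list_schedule(ops, {"alu": 2, "mul": 1}))
```

Whether this work happens in the compiler or in hardware is exactly what separates the architecture classes discussed next.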
Dependence Architecture
• Compiler or programmer communicates to the hardware the dependencies between instructions
  – removes the need to scan the program in sequential order (the bottleneck for superscalar processors)
• Hardware determines at run-time when to schedule the instruction
Dependence Architecture Example
• Dataflow processors are representative of Dependence architectures
  – Execute instruction at earliest possible time subject to availability of input operands and functional units
  – Dependencies communicated by providing with each instruction a list of all successor instructions
  – As soon as all input operands of an instruction are available, the hardware fetches the instruction
  – The instruction is executed as soon as a functional unit is available
• Few Dataflow processors currently exist
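The firing rule described above (successor lists, fetch when all operands are available, execute when a unit is free) can be sketched in a few lines. This is an illustrative model only: unit latency, identical functional units, and a graph given as a plain dictionary are all assumptions.

```python
from collections import deque

def dataflow_run(successors, num_units):
    """successors: dict instruction -> list of successor instructions
    (every instruction must appear as a key).

    Returns dict instruction -> cycle in which it executed, assuming unit
    latency and `num_units` identical functional units.
    """
    # Count how many operands each instruction is still waiting for.
    pending = {n: 0 for n in successors}
    for succs in successors.values():
        for s in succs:
            pending[s] += 1

    ready = deque(n for n in successors if pending[n] == 0)
    fired = {}
    cycle = 0
    while ready:
        cycle += 1
        batch = [ready.popleft() for _ in range(min(num_units, len(ready)))]
        for n in batch:
            fired[n] = cycle
        # Results are delivered to successors at the end of the cycle.
        for n in batch:
            for s in successors[n]:
                pending[s] -= 1
                if pending[s] == 0:
                    ready.append(s)
    return fired

if __name__ == "__main__":
    graph = {"a": ["c"], "b": ["c"], "c": []}
    print(dataflow_run(graph, num_units=2))
    print(dataflow_run(graph, num_units=1))
```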
Independence Architecture
• By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle
• The set of independent operations is far greater than the set of dependent operations
  – Only a subset of independent operations are specified
• The compiler may additionally specify on which functional unit and in which cycle an operation is executed
  – The hardware needs to make no run-time decisions
Independence Architecture Example
• EPIC/VLIW processors are examples of Independence architectures
  – Specify exactly which functional unit each operation is executed on and when each operation is issued
  – Operations are independent of other operations issued at the same time as well as those that are in execution
  – Compiler emulates at compile time what a dataflow processor does at run-time
Compiler vs. Processor
[Figure: division of work between compiler and hardware for Superscalar, Dataflow, Indep. Arch., VLIW and TTA. Steps: Frontend and Optimizer, Determine Dependences, Determine Independences, Bind Operations to Function Units, Bind Transports to Busses, Execute]
B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.
VLIW and Superscalar
• basic structure of VLIW and superscalar consists of a number of EUs, each capable of parallel operation on data fetched from register file
• VLIW and superscalar require highly multiported register files
  – limit on register ports places inherent limitation on maximum number of EUs
VLIW & Superscalar - Differences
• presentation of instructions:
  – VLIW receive multi-operation instructions
  – Superscalar accept traditional sequential stream but can issue more than one instruction
• VLIW needs very long instructions in order to specify what each EU should do
• Superscalar receive stream of conventional instructions
VLIW & Superscalar - Differences
• Decode and Issue unit in superscalar issues multiple instructions for the EUs
  – Have to figure out dependencies and independent instructions
• VLIW expect dependency free code whereas superscalar typically do not expect this.
  – Superscalars cope with dependencies using hardware
Superscalar Processors
Superscalar Terminology
• Basic
  – Superscalar: able to issue > 1 instruction / cycle
  – Superpipelined: deep, but not superscalar pipeline. E.g., MIPS R4000 has 8 stages
  – Branch prediction: logic to guess whether or not branch will be taken, and possibly branch target
• Advanced
  – Out-of-order: able to issue instructions out of program order
  – Speculation: execute instructions beyond branch points, possibly nullifying later
  – Register renaming: able to dynamically assign physical registers to instructions
  – Retire unit: logic to keep track of instructions as they complete.
Superscalar Execution Example
Single Adder, Data Dependence – In Order
• Assumptions
  – Single FP adder takes 2 cycles
  – Single FP multiplier takes 5 cycles
  – Can issue add & multiply together
  – Must issue in-order
  – Operand order: in, in, out
v: addt $f2, $f4, $f10
w: mult $f10, $f6, $f10
x: addt $f10, $f8, $f12
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f
[Figure: in-order issue timeline for v, w, x, y, z (single adder, data dependence)]
[Figure: data flow graph of the five operations over $f2, $f4, $f6, $f8, $f12; critical path = 9 cycles]
Adding Advanced Features
• Out Of Order Issue
  – Can start y as soon as adder available
  – Must hold back z until $f10 not busy & adder available
v: addt $f2, $f4, $f
w: mult $f10, $f6, $f10
x: addt $f, $f8, $f
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f
[Figure: out-of-order issue timeline for v, w, x, y, z]
• With Register Renaming
v: addt $f2, $f4, $f10a
w: mult $f10a, $f6, $f10a
x: addt $f10a, $f8, $f12
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f
[Figure: issue timeline for v, w, x, y, z with register renaming]
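The renaming idea can be sketched as a table that hands out a fresh physical register on every write. This is a minimal illustration, not a description of any real rename unit: the `p0, p1, ...` naming, the unbounded supply of physical registers, and the three-instruction example program are all assumptions.

```python
def rename(instrs):
    """instrs: list of (dest, src1, src2) architectural register names,
    in program order. Returns the same program over physical names.

    Every write gets a fresh physical register, so only true (RAW)
    dependences remain. Assumes an unlimited supply of physical registers.
    """
    mapping = {}      # architectural name -> current physical name
    next_phys = 0
    out = []
    for dest, s1, s2 in instrs:
        # Sources read the newest mapping; a register never written so far
        # keeps its architectural name (its initial value).
        p1 = mapping.get(s1, s1)
        p2 = mapping.get(s2, s2)
        pd = f"p{next_phys}"
        next_phys += 1
        mapping[dest] = pd
        out.append((pd, p1, p2))
    return out

if __name__ == "__main__":
    # Hypothetical program: the third instruction reuses f10 as destination.
    program = [("f10", "f2", "f4"), ("f12", "f10", "f8"), ("f10", "f4", "f8")]
    for line in rename(program):
        print(line)
```

After renaming, the third instruction no longer has to wait for the first two, because it writes a different physical register.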
Flow Path Model of Superscalars
[Figure: flow path model. Instruction Flow: I-cache, Branch Predictor, FETCH, Instruction Buffer, DECODE. Register Data Flow: EXECUTE (Integer, Floating-point, Media, Memory), Reorder Buffer (ROB), COMMIT. Memory Data Flow: Store Queue, D-cache]
Superscalar Pipeline Design
[Figure: pipeline stages and the buffers between them: Fetch, Instruction Buffer, Decode, Dispatch Buffer, Dispatch, Issuing Buffer, Execute, Completion Buffer, Complete, Store Buffer, Retire; Instruction Flow and Data Flow are marked]
Inorder Pipelines
[Figure: Intel i: single pipeline IF, D1, D2, EX, WB. Intel Pentium: two parallel pipelines, U-Pipe and V-Pipe, each with IF, D, D, EX, WB]
Inorder pipeline, no WAW no WAR (almost always true)
Out-of-order Pipelining 101
[Figure: pipeline IF, ID, RD, EX, WB in which EX splits into parallel units of different depth: INT, Fadd, Fmult, LD/ST]
Program Order
  Ia: F1 ← F2 x F3
  .....
  Ib: F1 ← F4 + F5
Out-of-order WB
  Ib: F1 ← “F4 + F5”
  ......
  Ia: F1 ← “F2 x F3”
What is the value of F1? WAW!!!
More Hardware Features to Support ILP
• Pipelining
  – Advantages
    » Relatively low cost of implementation - requires latches within functional units
    » With pipelining, ILP can be doubled, tripled or more
  – Disadvantages
    » Adds delays to execution of individual operations
    » Increased latency eventually counterbalances increase in ILP
Hardware Features to Support ILP
• Additional Functional Units
  – Advantages
    » Does not suffer from increased latency bottleneck
  – Disadvantages
    » Amount of functional unit hardware proportional to degree of parallelism
    » Interconnection network and register file size proportional to square of number of functional units
Hardware Features to Support ILP
• Instruction Issue Unit
  – Care must be taken not to issue an instruction if another instruction upon which it is dependent is not complete
  – Requires complex control logic in Superscalar processors
  – Virtually trivial control logic in VLIW processors
Hardware Features to Support ILP
• Speculative Execution
  – Little ILP typically found in basic blocks
    » a straight-line sequence of operations with no intervening control flow
  – Multiple basic blocks must be executed in parallel
    » Execution may continue along multiple paths before it is known which path will be executed
Hardware Features to Support ILP
• Requirements for Speculative Execution
  – Terminate unnecessary speculative computation once the branch has been resolved
  – Undo the effects of the speculatively executed operations that should not have been executed
  – Ensure that no exceptions are reported until it is known that the excepting operation should have been executed
  – Preserve enough execution state at each speculative branch point to enable execution to resume down the correct path if the speculative execution happened to proceed down the wrong one.
Hardware Features to Support ILP
• Speculative Execution
  – Expensive in hardware
  – Alternative is to perform speculative code motion at compile time
    » Move operations from subsequent blocks up past branch operations into preceding blocks
  – Requires less demanding hardware
    » A mechanism to ensure that exceptions caused by speculatively scheduled operations are reported if and only if flow of control is such that they would have been executed in the non-speculative version of the code
    » Additional registers to hold the speculative execution state
Introduction to S/W Techniques for ILP
Instruction Level Parallelism (ILP)
• ILP: Overlap execution of unrelated instructions
• How to extract parallelism from program?
  – Beyond single block to get more instruction level parallelism
• Who does instruction scheduling and parallelism extraction?
  – Software or Hardware or mix?
  – Superscalar processors require H/W solutions, but can also use some compiler help
• What new hardware features are required to support more ILP?
  – Different requirements for Superscalar and EPIC
Quick recall of data hazards..
• True/flow dependencies - RAW
• Name dependencies WAR, WAW
  – Also known as false dependencies, output dep
Data Dependence and Hazards
• InstrJ is data dependent on InstrI: InstrJ tries to read operand before InstrI writes it
    I: add r1,r2,r3
    J: sub r4,r1,r
• or InstrJ is data dependent on InstrK which is dependent on InstrI
• Caused by a “True Dependence” (compiler term)
• If true dependence caused a hazard in the pipeline, called a Read After Write (RAW) hazard
Name Dependence #1: Anti-dependence
• Name dependence: when 2 instructions use same register or memory location, called a name, but no flow of data between the instructions associated with that name; 2 versions of name dependence
• InstrJ writes operand before InstrI reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r
  Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”
• If anti-dependence caused a hazard in the pipeline, called a Write After Read (WAR) hazard
Name Dependence #2: Output dependence
• InstrJ writes operand before InstrI writes it.
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”
• If output dependence caused a hazard in the pipeline, called a Write After Write (WAW) hazard
ILP and Data Hazards
• HW/SW must preserve program order:
  – order instructions would execute in if executed sequentially one at a time as determined by original source program
  – Does this mean we can never change order of execution of instructions?
    » Ask - What happens if we change the order of an instruction
    » Does result change?
ILP and Data Hazards
• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
• Instructions involved in a name dependence can execute simultaneously if name used in instructions is changed so instructions do not conflict
  – Register renaming resolves name dependence for regs
  – Either by compiler or by HW
Control Dependencies
• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order
    if p1 {S1;};
    if p2 {S2;}
• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Control Dependence Ignored
• Control dependence need not be preserved
  – willing to execute instructions that should not have been executed, thereby violating the control dependences, if can do so without affecting correctness of the program
• Instead, 2 properties critical to program correctness are exception behavior and data flow
Example FP Loop: Where are the Hazards?
Loop: LD   F0,0(R1)    ;F0=vector element
      ADDD F4,F0,F     ;add scalar from F
      SD   0(R1),F     ;store result
      SUBI R1,R1,      ;decrement pointer 8B (DW)
      BNEZ R1,Loop     ;branch R1!=zero
      NOP              ;delayed branch slot
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op
FP ALU op                      Store double
Load double                    FP ALU op
Load double                    Store double
Integer op                     Integer op
• Where are the stalls?
FP Loop Hazards
Loop: LD   F0,0(R1)    ;F0=vector element
      ADDD F4,F0,F     ;add scalar in F
      SD   0(R1),F     ;store result
      SUBI R1,R1,      ;decrement pointer 8B (DW)
      BNEZ R1,Loop     ;branch R1!=zero
      NOP              ;delayed branch slot
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op
FP ALU op                      Store double
Load double                    FP ALU op
Load double                    Store double
Integer op                     Integer op
FP Loop Showing Stalls
1 Loop: LD   F0,0(R1)    ;F0=vector element
2       stall
3       ADDD F4,F0,F     ;add scalar in F
4       stall
5       stall
6       SD   0(R1),F     ;store result
7       SUBI R1,R1,      ;decrement pointer 8B (DW)
8       BNEZ R1,Loop     ;branch R1!=zero
9       stall            ;delayed branch slot
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op
FP ALU op                      Store double
Load double                    FP ALU op
• 9 clocks: Rewrite code to minimize stalls?
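The 9-clock count can be reproduced with a small in-order issue model. The sketch below is an illustration under assumptions: the latency values (1 stall between a load and a dependent FP ALU op, 2 between an FP ALU op and a dependent store) are inferred from the stall listing above, and the operands F2, F4 and the trailing delay-slot cycle are filled in as assumptions because the listing is truncated.

```python
# (producer kind, consumer kind) -> stall cycles needed between them.
# Assumed values, inferred from the stall listing above.
LATENCY = {("load", "fpalu"): 1, ("fpalu", "store"): 2}

def schedule(instrs):
    """instrs: list of (name, kind, dest, [sources]) in program order.
    Returns a list of (clock, name) for single-issue, in-order execution."""
    produced = {}   # register -> (clock its producer issued, producer kind)
    clock = 0
    timeline = []
    for name, kind, dest, srcs in instrs:
        earliest = clock + 1
        for s in srcs:
            if s in produced:
                when, pkind = produced[s]
                earliest = max(earliest, when + 1 + LATENCY.get((pkind, kind), 0))
        clock = earliest
        timeline.append((clock, name))
        if dest:
            produced[dest] = (clock, kind)
    return timeline

# Operands F2, F4 are assumptions (truncated in the listing).
loop = [
    ("LD",   "load",   "F0", ["R1"]),
    ("ADDD", "fpalu",  "F4", ["F0", "F2"]),
    ("SD",   "store",  None, ["F4", "R1"]),
    ("SUBI", "int",    "R1", ["R1"]),
    ("BNEZ", "branch", None, ["R1"]),
]

if __name__ == "__main__":
    timeline = schedule(loop)
    print(timeline)
    # One more clock for the unfilled delayed-branch slot.
    print("clocks per iteration:", timeline[-1][0] + 1)
```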
Revised FP Loop Minimizing Stalls
1 Loop: LD   F0,0(R1)
2       stall
3       ADDD F4,F0,F
4       SUBI R1,R1,
5       BNEZ R1,Loop     ;delayed branch
6       SD   8(R1),F     ;altered when move past SUBI
Swap BNEZ and SD by changing address of SD
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op
FP ALU op                      Store double
Load double                    FP ALU op
• 6 clocks: Unroll loop 4 times to make code faster?
Unroll Loop Four Times (straightforward way)
1  Loop: LD   F0,0(R1)
2        ADDD F4,F0,F
3        SD   0(R1),F       ;drop SUBI & BNEZ
4        LD   F6,-8(R1)
5        ADDD F8,F6,F
6        SD   -8(R1),F      ;drop SUBI & BNEZ
7        LD   F10,-16(R1)
8        ADDD F12,F10,F
9        SD   -16(R1),F     ;drop SUBI & BNEZ
10       LD   F14,-24(R1)
11       ADDD F16,F14,F
12       SD   -24(R1),F
13       SUBI R1,R1,#       ;alter to 4*
14       BNEZ R1,LOOP
15       NOP
15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration
Assumes R1 is multiple of 4
Rewrite loop to minimize stalls?
Unrolled Loop That Minimizes Stalls
1  Loop: LD   F0,0(R1)
2        LD   F6,-8(R1)
3        LD   F10,-16(R1)
4        LD   F14,-24(R1)
5        ADDD F4,F0,F
6        ADDD F8,F6,F
7        ADDD F12,F10,F
8        ADDD F16,F14,F
9        SD   0(R1),F
10       SD   -8(R1),F
11       SD   -16(R1),F
12       SUBI R1,R1,#
13       BNEZ R1,LOOP
14       SD   8(R1),F
14 clock cycles, or 3.5 per iteration
When safe to move instructions?
• What assumptions made when moved code?
  – OK to move store past SUBI even though changes register
  – OK to move loads before stores: get right data?
  – When is it safe for compiler to do such changes?
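The clock counts quoted for the four versions of the loop can be put side by side. This is only the arithmetic from the slides in a small Python sketch; the function and dictionary names are illustrative assumptions.

```python
def cycles_per_iteration(issue_slots, stalls, iterations_per_pass):
    """Clocks per original loop iteration for one pass through the loop body."""
    return (issue_slots + stalls) / iterations_per_pass

versions = {
    # 5 instructions + 4 stall clocks (1 after LD, 2 after ADDD, 1 delay slot)
    "original": cycles_per_iteration(5, 4, 1),
    # 5 instructions + 1 stall after LD, SD moved into the delay slot
    "scheduled": cycles_per_iteration(5, 1, 1),
    # 15 slots + 4 x (1 + 2) stalls, covering 4 iterations
    "unrolled": cycles_per_iteration(15, 4 * (1 + 2), 4),
    # 14 slots, no stalls, covering 4 iterations
    "unrolled and scheduled": cycles_per_iteration(14, 0, 4),
}

if __name__ == "__main__":
    for name, cycles in versions.items():
        print(f"{name}: {cycles} clocks per iteration")
```

Note that 27 / 4 is 6.75, which the slide rounds to 6.8.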
Compiler Perspectives on Code Movement
• Definitions: compiler concerned about dependencies in program, whether or not a HW hazard depends on a given pipeline
• Try to schedule to avoid hazards
• (True) Data dependencies (RAW if a hazard for HW)
  – Instruction i produces a result used by instruction j, or
  – Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
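Why memory is the hard case can be seen from a conservative alias check. The sketch below is an assumption-laden illustration, not a real compiler analysis: references are `(offset, base_register)` pairs, accesses are the same size, and the base register is assumed not to change between the two references (which is exactly what fails across loop iterations).

```python
def may_alias(a, b, known=None):
    """a, b: (offset, base_register) memory references.
    known: optional dict of register values known at compile time.

    Returns False only when the two addresses are provably different;
    otherwise the references must be treated as possibly overlapping.
    """
    known = known or {}
    (off_a, base_a), (off_b, base_b) = a, b
    if base_a == base_b:
        # Same base, unchanged in between: only the offsets matter.
        return off_a == off_b
    if base_a in known and base_b in known:
        return known[base_a] + off_a == known[base_b] + off_b
    return True   # cannot tell at compile time: assume they may overlap

if __name__ == "__main__":
    print(may_alias((100, "R4"), (20, "R6")))                       # unknown
    print(may_alias((100, "R4"), (20, "R6"), {"R4": 0, "R6": 80}))  # same address
    print(may_alias((0, "R1"), (-8, "R1")))                         # distinct
```

A load may be moved above a store only when the check returns False, which is why the unrolled loop relies on all references sharing the base R1 with distinct offsets.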
Where are the data dependencies?
1 Loop: LD   F0,0(R1)
2       ADDD F4,F0,F2
3       SUBI R1,R1,8
4       BNEZ R1,Loop     ;delayed branch
5       SD   8(R1),F     ;altered when move past SUBI