Computer Systems Architecture
CMSC 411
Unit 4 – Instruction-level
parallelism
Alan Sussman
March 28, 2006
CMSC 411 - Alan Sussman 2
What we already know about
pipelining
• We need to avoid structural hazards, data hazards
and control hazards in order to get optimal
performance from the pipeline
- Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
• Accomplish this by techniques such as
- instruction reordering
- hardware modifications to detect branches earlier
- compiler approaches to reduce branch delays
• Each technique reduces one or more of the stall
components of the Pipeline CPI
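The CPI breakdown above can be tried out numerically. The sketch below is only an illustration: the function name and every stall value are assumed for the example, not taken from the lecture.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI = ideal CPI + average stall cycles per instruction from each source."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls

# Hypothetical average stall cycles per instruction (assumed values)
baseline = pipeline_cpi(1.0, 0.05, 0.30, 0.20)

# Suppose instruction reordering halves the data hazard stalls (also assumed)
reordered = pipeline_cpi(1.0, 0.05, 0.15, 0.20)

print(baseline, reordered)
```

Each technique on this slide attacks one of the stall terms, so its benefit shows up as a smaller value for that argument.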
What's new in Units 4 & 4b
• Some instructions can be executed independently
of others. In fact, they could be executed in
parallel if the hardware were available
• Idea is to take advantage of this instruction-level
parallelism to make major changes to how
compilers generate code (Unit 4b) and how the
MIPS pipeline (like those of other modern processors)
executes code (Unit 4)
New techniques
• Hardware
- dynamic pipeline scheduling
- with scoreboarding
- with register renaming (Tomasulo)
- dynamic branch prediction
- issuing multiple instructions per cycle
- speculation
- dynamic memory disambiguation
• Software (compiler)
- loop unrolling
- compiler dependence analysis
- software pipelining and trace scheduling
- compiler speculation
Data dependences (background)
• If 2 instructions are parallel, they can execute
simultaneously in the pipeline without stalls
– assuming no structural hazards
• If 2 instructions are dependent, they must be
executed in order, but may sometimes be
partially overlapped
Data dependences (cont.)
• Instruction j is data dependent on instruction i if
either
- instruction i produces a result that may be used by instruction j, or
- instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (transitivity)
• Dependence implies a chain of one or more data
hazards between the 2 instructions
- potentially causing a pipelined processor to stall
• Dependences are properties of programs
- and whether one causes a stall depends on the properties of the pipeline organization
Computer Systems Architecture
CMSC 411
Unit 4 – Instruction-level
parallelism
Alan Sussman
April 6, 2006
Administrivia
• Midterm returned Tuesday
• Next project posted by tomorrow
– talk about it on Tuesday
• Read A.8 and 3.1-3.7, 3.
Last time
• Virtual memory protection
- main memory shared by multiple processes
- each process has its own virtual address space that no other process can touch or see
- enforced by base/bound registers and OS/kernel that manages page tables
• Hardware/software instruction-level parallelism
- data dependences – prevent instructions from executing in parallel
  - direct vs. indirect dependences
  - property of programs, and may cause a pipeline stall
Data dependences (cont.)
• A dependence
– indicates the possibility of a hazard
– determines the order results must be produced
– sets an upper bound on available parallelism
• Ways to overcome dependences
– maintain the dependence, but avoid the hazard
(schedule the code – hardware or software)
– eliminate the dependence – by transforming the
code (software)
Name dependences
• A name dependence occurs when 2 instructions use
the same register or memory location (harder to detect),
called a name, but there is no flow of data between them
• 2 types
- antidependence between instruction i and instruction j when j writes a register or memory location that i reads
- output dependence occurs when i and j write the same register or memory location
• In both cases, the instructions can execute at the
same time, or be reordered, if the name is changed
so the instructions don’t conflict
- statically by compiler or dynamically by hardware (usually only for registers – why?)
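To make the renaming idea concrete, here is a minimal sketch that gives every destination a fresh name, so later writes can no longer conflict with earlier reads or writes. The `rename` helper and the `T0, T1, ...` names are hypothetical, and only registers are renamed (no memory locations).

```python
def rename(instrs):
    """instrs: list of (dest, src1, src2) register names in program order.
    Returns the same instructions with every destination given a fresh name."""
    mapping = {}     # architectural register -> most recent fresh name
    renamed = []
    for count, (dest, *srcs) in enumerate(instrs):
        # Sources read the newest name of the register, which preserves true (RAW) dependences
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = f"T{count}"
        mapping[dest] = new_dest
        renamed.append((new_dest, *new_srcs))
    return renamed

code = [
    ("F10", "F0", "F6"),   # DIV.D F10,F0,F6  reads F6
    ("F8",  "F2", "F6"),   # SUB.D F8,F2,F6   reads F6
    ("F6",  "F8", "F2"),   # ADD.D F6,F8,F2   writes F6: antidependence on DIV.D and SUB.D
]
print(rename(code))
```

After renaming, the ADD.D writes `T2` instead of F6, so it no longer has to wait for the earlier reads of F6, while its true dependence on the SUB.D result (now `T1`) is kept.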
Hazards
• True data dependences correspond to RAW
data hazards
• Output dependences correspond to WAW
hazards
• Antidependences correspond to WAR
hazards
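The correspondence above can be checked mechanically for a pair of instructions. This is a sketch with a made-up `Instr` record that tracks only register names; instruction i is assumed to come before j in program order.

```python
from typing import NamedTuple, Optional, Set, Tuple

class Instr(NamedTuple):
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read

def hazards(i: Instr, j: Instr) -> Set[str]:
    """Hazard types that reordering i and j (i earlier in program order) could cause."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")      # true data dependence
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")      # antidependence
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")      # output dependence
    return found

add_s1 = Instr("F0", ("F2", "F4"))    # ADD.D  F0, F2, F4
mult_s2 = Instr("F2", ("F6", "F8"))   # MULT.D F2, F6, F8
add_s4 = Instr("F0", ("F8", "F6"))    # ADD.D  F0, F8, F6
div = Instr("F10", ("F0", "F6"))      # DIV.D  F10, F0, F6

print(hazards(add_s1, mult_s2), hazards(add_s1, add_s4), hazards(add_s1, div))
```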
Functional units
• One segment of the scoreboard keeps track of, for
each functional unit:
- Busy (yes or no)
- Op: operation to perform
- Fi: name of destination register
- Fj, Fk: names of source registers
- Qj, Qk: functional units producing the sources
- Rj, Rk: ready flags for Fj and Fk
  - "yes" if the source is ready and still needed
  - "no" if the source is not ready, or is no longer needed
Scoreboard for functional units
Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 No
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Sub F8 F6 F2 Integer Yes No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register status
• One segment of the scoreboard keeps track
of which registers are expecting outputs
from which functional units
Register  F0     F2       F4  F6  F8   F10     F12  …  F
FU        Mult1  Integer          Add  Divide
• Next, an example of the scoreboard method
The bookkeeping
Each functional unit keeps track of:
Unit Opn Fj Fk Qj Qk
For our convenience, we’ll also indicate:
Inst Stage
And we’ll assume these initial register contents and flags:
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  0   3   4   6   8   10   12   14
Flags     0   0   0   0   0   0    0    0
Scoreboard - bookkeeping
Unit   Opn  Fj  Fk  Qj  Qk  Inst  Stage
Mult1  x    x   x   x   x   x     x
Mult   x    x   x   x   x   x     x
Add1   x    x   x   x   x   x     x
Add    x    x   x   x   x   x     x
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  0   3   4   6   8   10   12   14
Flags     0   0   0   0   0   0    0    0
Computer Systems Architecture
CMSC 411
Unit 4 – Instruction-level
parallelism
Alan Sussman
April 11, 2006
Administrivia
• Midterm returned Thursday
– answers posted
• Project posted
– due April 27
Last time
• Data dependences
- true, or flow, dependences – RAW hazards
- name dependences
  - antidependences – WAR hazards
  - output dependences – WAW hazards
• Control dependences – statement orderings with respect to branches
- must preserve data flow and exception behavior
• Dynamic pipeline scheduling
- start each instruction as early as possible – once operands are available
- instructions can complete out-of-order
- scoreboard and Tomasulo scheduling
• Scoreboard
- split ID stage into issue and read operands stages
- instructions issue in order, can pass each other in read operands stage, enter functional units when ready
- keeps track of which pipe stage each inst. is in, state of each functional unit, state of registers (which functional unit will supply data)
Time 0
S1: ADD.D F0, F2, F4 (assume 2 cycle add)
Unit   Opn    Fj  Fk  Qj  Qk  Inst  Stage
Mult1  x      x   x   x   x   x     x
Mult   x      x   x   x   x   x     x
Add1   Add.d  3   4   0   0   S1    Issue
Add    x      x   x   x   x   x     x
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  0   3   4   6   8   10   12   14
Flags     a1  0   0   0   0   0    0    0
Time 1
S2: MULT.D F2, F6, F8 (assume 4 cycle mult)
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Issue
Mult   x       x   x   x   x   x     x
Add1   Add.d   3   4   0   0   S1    Exec
Add    x       x   x   x   x   x     x
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  0   3   4   6   8   10   12   14
Flags     a1  m1  0   0   0   0    0    0
Time 2
S3: MULT.D F10, F0, F
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult   Mult.d  x   x   a1  m   S3    Issue
Add1   Add.d   3   4   0   0   S1    Exec
Add    x       x   x   x   x   x     x
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  0   3   4   6   8   10   12   14
Flags     a1  m1  0   0   0   m2   0    0
Time 3
S4: ADD.D F0, F8, F6 (stalled - WAW hazard on F)
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult   Mult.d  7   x   0   m   S3    Issue
Add1   Add.d   3   4   0   0   S1    Write
Add    x       x   x   x   x   x     x
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  7   3   4   6   8   10   12   14
Flags     0   m1  0   0   0   m2   0    0
Tomasulo’s dynamic scheduling
method
• Origins: IBM 360/91, 3 years after CDC
• Scoreboard controls the progress of each
instruction and makes sure that hazards
such as RAW, etc. are avoided
• In Tomasulo's method, these functions are
decentralized in reservation stations
Tomasulo’s method
• Main differences with scoreboard method:
– decentralized execution control
– decentralized hazard detection
– register renaming by having operands passed
to reservation stations
- eliminates WAW and WAR hazards
Instruction status
• Only 3 kinds of pipeline stages, not 4:
- Issue: send instruction to an empty reservation station, in FIFO order, along with any register contents it needs
  - also renames registers, removing WAR and WAW hazards
- Execute: wait for all operands, check for RAW hazards, then execute the instruction
  - if multiple instructions are ready in the same cycle for the same functional unit, for f.p. units choose one arbitrarily
  - for loads/stores, compute effective address first; execute loads from load buffer as soon as memory unit is available; stores wait for the value to be stored before it is sent to the memory unit
- Write result: put result on the common data bus (CDB) to get it back to a register and to any other reservation stations that need it
  - stores write to memory here too
Instruction status (cont.)
• Note differences from scoreboard:
– no check for WAW and WAR hazards, since
renaming prevents them
– CDB broadcasts results, so don't have to wait
for registers to be filled
– Load and Store are also treated as functional
units
Advantages over scoreboard
• Distribution of hazard detection logic
– from distributed reservation stations and CDB
– can release multiple instructions at once from
single result (if other operands available)
• Eliminates stalls for WAW and WAR
hazards
– from renaming registers using reservation
stations
– and from storing operands into reservation
station as soon as they are available
Bookkeeping for Tomasulo method
• Each reservation station keeps track of
– Busy flag
– Op: the operation to be performed
– Qj, Qk: sources of operands
- already present
- or to come from another reservation station (functional unit)
– Vj, Vk: values of the operands
– A: the memory address info for a load or store
(eventually the effective address)
Bookkeeping (cont.)
• Register hardware keeps track of which
reservation station will fill each register
– Qi – number of reservation station containing
the operation whose result will be stored into
the register
• The load and store buffers have
– busy flag
– value to be stored
– A: effective address
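The bookkeeping fields can be pictured as a small record per reservation station, with a CDB broadcast that fills in any waiting operand. This is a simplified sketch, not the hardware algorithm itself: the class and function names are hypothetical, and the A field and the load/store buffers are left out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None    # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None      # station that will produce the operand (None = value present)
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None

def broadcast(tag, value, stations, reg_status, regs):
    """Put (tag, value) on the CDB: every waiting station and register picks it up."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]     # register no longer waits on any station

# A multiply waiting on Add1 (for F0) and on Mult1 (for F2); values are illustrative
regs = {"F0": 0.0, "F2": 3.0}
reg_status = {"F0": "Add1", "F2": "Mult1"}     # the Qi field, per register
mult2 = ReservationStation("Mult2", busy=True, op="MULT.D", qj="Add1", qk="Mult1")

broadcast("Add1", 7.0, [mult2], reg_status, regs)
print(mult2, regs, reg_status)
```

After the broadcast, Mult2 holds 7.0 in Vj but is still not ready, because Qk still names Mult1.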
MIPS FP with Tomasulo’s algorithm
Figure 3.
Instructions (Fig. 3.3)
Instruction      Issue  Execute  Write Result
L.D F6,34(R2)    x      x        x
L.D F2,45(R3)    x      x
MUL.D F0,F2,F4   x
SUB.D F8,F2,F6   x
DIV.D F10,F0,F6  x
ADD.D F6,F8,F2   x
Reservation Stations (Fig. 3.3)
Name   Busy  Op    Vj  Vk             Qj    Qk    A
Load1  no
Load2  yes   Load                                 45+R[R3]
Add1   yes   SUB       Mem[34+R[R2]]  Load
Add2   yes   ADD                      Add1  Load
Add3   no
Mult1  yes   MUL       R[F4]          Load
Mult2  yes   DIV       Mem[34+R[R2]]  Mult
Register Status (Fig. 3.3)
Field  F0     F2     F4  F6    F8    F10   F12  …  F
Qi     Mult1  Load2      Add2  Add1  Mult
Computer Systems Architecture
CMSC 411
Unit 4 – Instruction-level
parallelism
Alan Sussman
April 13, 2006
Time 1
S2: MULT.D F2, F6, F8 (assume 4 cycle mult)
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Issue
Mult   x       x   x   x   x   x     x
Add1   Add.d   3   4   0   0   S1    Exec
Add    x       x   x   x   x   x     x
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  0   3   4   6   8   10   12   14
Flags     a1  m1  0   0   0   0    0    0
Time 2
S3: MULT.D F10, F0, F
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult   Mult.d  x   x   a1  m   S3    Issue
Add1   Add.d   3   4   0   0   S1    Exec
Add    x       x   x   x   x   x     x
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  0   3   4   6   8   10   12   14
Flags     a1  m1  0   0   0   m2   0    0
Time 3
S4: ADD.D F0, F8, F6
NOT STALLED - F0’s contents already passed to all units that need it (to m2), so F0 itself is not really necessary
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult   Mult.d  7   x   0   m   S3    Issue
Add1   Add.d   3   4   0   0   S1    Write
Add    Add.d   8   6   0   0   S4    Issue
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  7   3   4   6   8   10   12   14
Flags     a2  m1  0   0   0   m2   0    0
Time 4
S5: ADD.D F4, F6, F
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult   Mult.d  7   x   0   m   S3    Issue
Add1   Add.d   6   6   0   0   S5    Issue
Add    Add.d   8   6   0   0   S4    Exec
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  7   3   4   6   8   10   12   14
Flags     a2  m1  a1  0   0   m2   0    0
Time 5
S6: ADD.D F2, F2, F6 (stalled)
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult   Mult.d  7   x   0   m   S3    Issue
Add1   Add.d   6   6   0   0   S5    Exec
Add    Add.d   8   6   0   0   S4    Exec
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  7   3   4   6   8   10   12   14
Flags     a2  m1  a1  0   0   m2   0    0
Time 6
S6: ADD.D F2, F2, F6 (stalled)
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Write
Mult   Mult.d  7   48  0   0   S3    Issue
Add1   Add.d   6   6   0   0   S5    Exec
Add    Add.d   8   6   0   0   S4    Write
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  7   48  4   6   8   10   12   14
Flags     a2  0   a1  0   0   m2   0    0
Time 7
S6: ADD.D F2, F2, F6 (stalled)
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  x       x   x   x   x   x     x
Mult   Mult.d  7   48  0   0   S3    Exec
Add1   Add.d   6   6   0   0   S5    Write
Add    Add.d   8   6   0   0   S4    Write
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  14  48  4   6   8   10   12   14
Flags     0   0   a1  0   0   m2   0    0
Time 8
S6: ADD.D F2, F2, F6
Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  x       x   x   x   x   x     x
Mult   Mult.d  7   48  0   0   S3    Exec
Add1   Add.d   6   6   0   0   S5    Write
Add    Add.d   48  6   0   0   S6    Issue
Register  F0  F2  F4  F6  F8  F10  F12  F
Contents  14  48  12  6   8   10   12   14
Flags     0   a2  0   0   0   m2   0    0
Loop-based example
• To show power of eliminating WAR and WAW
hazards through dynamic register renaming
Loop: L.D    F0,0(R1)
      MUL.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDUI R1,R1,-
      BNE    R1,R2,Loop
• If branches are predicted taken, multiple loop
iterations can execute at the same time – dynamic
loop unrolling
Example (Fig. 3.6)
• Assume all instructions issued in two successive
iterations of loop, but none of f.p. loads/stores or
operations has completed
- don’t show integer ALU op, assume branch predicted as taken
• Once system reaches this state, 2 copies of loop
could be sustained, with a CPI close to 1.
- if multiplies take less than 4 cycles to complete
Instruction status (Fig. 3.6)
Instruction     From iteration  Issue  Execute  Write Result
L.D F0,0(R1)    1               x      x
MUL.D F4,F0,F2  1               x
S.D F4,0(R1)    1               x
L.D F0,0(R1)    2               x      x
MUL.D F4,F0,F2  2               x
S.D F4,0(R1)    2               x
Reservation stations (Fig. 3.6)
Name    Busy  Op     Vj       Vk     Qj    Qk    A
Load1   yes   Load                               R[R1]+
Load2   yes   Load                               R[R1]-
Add1    no
Add2    no
Add3    no
Mult1   yes   MUL             R[F2]  Load
Mult2   yes   MUL             R[F2]  Load
Store1  yes   Store  R[R1]                 Mult
Store2  yes   Store  R[R1]-8               Mult
Branch-prediction buffers (cont.)
• When to access the buffer:
- fetch the contents during the ID pipeline cycle
- update the contents after the branch is resolved
• What is stored in the buffer:
- perhaps one bit per branch, indicating whether the branch was last taken or not
  - predict taken if it was taken last time
  - otherwise predict not taken
Example: If the branch is for an inner loop, the prediction will be wrong on the 1st and last iterations, each time the loop is executed
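The inner-loop example can be simulated in a few lines. This sketch assumes a loop branch that is taken 9 times and then falls through, with the loop executed 3 times; those counts and the function name are made up for illustration.

```python
def one_bit_mispredictions(outcomes, initial=False):
    """Count mispredictions of a 1-bit predictor: predict whatever the branch did last time."""
    prediction = initial
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken      # remember only the most recent outcome
    return misses

one_execution = [True] * 9 + [False]    # taken 9 times, then loop exit (assumed counts)
misses = one_bit_mispredictions(one_execution * 3)
print(misses)
```

Under these assumptions the predictor misses twice per execution of the loop: once on the first iteration (the bit still says "not taken" from the previous exit) and once on the last.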
Branch-prediction buffers (cont.)
• What is stored in the buffer (cont.):
– perhaps a counter for each branch. If have 3
bits, for example, start the count at 4
- add one every time the branch is taken
- subtract one every time the branch is not taken
- predict taken if the count is 4 or more
- predict not taken if the count is 3 or less.
– perhaps a correlating predictor that counts for
this branch and for 1 or more earlier ones
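The counter scheme above can be sketched as a saturating counter. The class name and the loop pattern (9 taken, then 1 not taken, repeated 3 times) are assumptions for illustration; the start value and threshold follow the slide's rule for 3 bits (start at 4, predict taken at 4 or more) and generalize it to other widths.

```python
class SaturatingCounter:
    """n-bit counter predictor: count up on taken, down on not taken, saturating at both ends."""
    def __init__(self, bits=3):
        self.max = (1 << bits) - 1
        self.threshold = 1 << (bits - 1)   # 4 for a 3-bit counter
        self.count = self.threshold        # start the count at the threshold

    def predict(self):
        return self.count >= self.threshold

    def update(self, taken):
        if taken:
            self.count = min(self.max, self.count + 1)
        else:
            self.count = max(0, self.count - 1)

def count_mispredictions(counter, outcomes):
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses

loop_branch = ([True] * 9 + [False]) * 3    # assumed inner-loop behavior
misses_2bit = count_mispredictions(SaturatingCounter(bits=2), loop_branch)
misses_3bit = count_mispredictions(SaturatingCounter(bits=3), loop_branch)
print(misses_2bit, misses_3bit)
```

With this pattern, a single not-taken outcome at loop exit is not enough to flip the prediction, so only the exit itself is mispredicted on each execution of the loop.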
How good are they? – Fig. 3.
For a 4096 entry 2-bit prediction buffer
Computer Systems Architecture
CMSC 411
Unit 4 – Instruction-level
parallelism
Alan Sussman
April 18, 2006
Administrivia
• Midterm questions?
• Project 1 grades emailed (with minor error)
• Project 2 questions?
• anyone looking for a summer job?
Last time
• Tomasulo dynamic scheduling
- detailed example showed benefit of forwarding results directly to reservation stations, no WAR/WAW hazards
- still get structural hazards (not enough functional units, CDB)
- loop example showed instructions from multiple loop iterations in execution at same time, to get CPI close to 1.
- need care with loads/stores to same address to avoid WAR, WAW hazards
  - check contents of load/store buffers (res. stations)
• Branch prediction
- to reduce control hazards
- branch prediction buffer – keep track of recent branch history, to predict which way it will go in future
  - fetch during ID stage, to get next instruction
  - update after branch resolved
Correlating branch predictors
• To improve prediction accuracy, look at the recent behavior of branches other than the one being predicted
• Take advantage of correlation between behavior of different branches
- also called two-level predictors
Simple example:
    if (d == 0) d = 1;
    if (d == 1) …
MIPS code, with d in R1:
      BNEZ   R1,L1    ; branch b1
      DADDIU R1,R0,#
L1:   DADDIU R3,R1,#-
      BNEZ   R3,L2    ; branch b2
      …
L2:
Example (cont.) – Fig. 3.
Initial value of d  d==0?  b1         Value of d before b2  d==1?  b2
0                   yes    not taken  1                     yes    not taken
1                   no     taken      1                     yes    not taken
2                   no     taken      2                     no     taken
b1 not taken → b2 not taken
Example – 1 bit predictor
Predictor initialized to not taken – Fig. 3.
d=?  b1 prediction  b1 action  New b1 prediction  b2 prediction  b2 action  New b2 prediction
2    NT             T          T                  NT             T          T
0    T              NT         NT                 T              NT         NT
2    NT             T          T                  NT             T          T
0    T              NT         NT                 T              NT         NT
All branches mispredicted!
1-bit prediction with 1 bit correlation
Initialized to not taken/not taken – Fig. 3.
The prediction actually used is shown in [brackets]
d=?  b1 prediction  b1 action  New b1 prediction  b2 prediction  b2 action  New b2 prediction
2    [NT]/NT        T          T/NT               NT/[NT]        T          NT/T
0    T/[NT]         NT         T/NT               [NT]/T         NT         NT/T
2    [T]/NT         T          T/NT               NT/[T]         T          NT/T
0    T/[NT]         NT         T/NT               [NT]/T         NT         NT/T
Only misprediction is on first iteration/row
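Both tables can be reproduced with a short simulation of the two branches b1 and b2 as d alternates between 2 and 0. This is a sketch: the function names are hypothetical, and it assumes the global history starts out as "not taken", which the slides do not state.

```python
def branch_outcomes(d):
    """Return (b1_taken, b2_taken) for one pass through the two if statements."""
    b1_taken = d != 0        # b1 (BNEZ on d) skips "d = 1" when d is nonzero
    if not b1_taken:
        d = 1
    b2_taken = d != 1        # b2 is taken when d != 1
    return b1_taken, b2_taken

def misses_one_bit(ds):
    pred = {"b1": False, "b2": False}      # initialized to not taken
    misses = 0
    for d in ds:
        for name, taken in zip(("b1", "b2"), branch_outcomes(d)):
            misses += pred[name] != taken
            pred[name] = taken
    return misses

def misses_one_one(ds):
    # pred[name][last]: 1-bit prediction used when the previous branch had outcome `last`
    pred = {name: {False: False, True: False} for name in ("b1", "b2")}
    last = False             # assumption: initial history is "not taken"
    misses = 0
    for d in ds:
        for name, taken in zip(("b1", "b2"), branch_outcomes(d)):
            misses += pred[name][last] != taken
            pred[name][last] = taken
            last = taken
    return misses

ds = [2, 0, 2, 0]
print(misses_one_bit(ds), misses_one_one(ds))
```

On this sequence the plain 1-bit predictor misses all 8 branches, while the predictor with 1 bit of correlation misses only the 2 branches in the first row.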
Correlating predictors (cont.)
• Predictor on last slide is a (1,1) predictor
- uses behavior of last branch to choose from among a pair of 1-bit branch predictors
• In general, an (m,n) predictor uses behavior of last
m branches to choose from 2^m branch predictors
- each is an n-bit predictor for a single branch
- simple hardware to do this – can keep global history of most recent m branches in an m-bit shift register, and each bit records whether branch taken/not taken
- and index branch prediction buffer using low-order bits of branch address with m-bit global history
Comparing predictors
• Correlating vs. standard 2-bit scheme
- using same number of state bits
• Number of state bits for (m,n) predictor is:
- 2^m × n × # prediction entries selected by the branch address
• 2-bit predictor with no global history is a (0,2)
predictor
- (0,2) predictor from Fig. 3.8 had 4K entries selected by branch address
  - total number of bits is 2^0 × 2 × 4K = 8K bits
- (2,2) predictor from Fig. 3.14 has 64 entries, with 4 entries per branch address
  - total number of bits is 2^2 × 2 × 16 = 128 bits
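The state-bit formula is easy to check with a helper. The function name is hypothetical; the last line is an extra calculation, not from the slides, showing how many branch-address entries a (2,2) predictor could have within the same 8K-bit budget.

```python
def predictor_bits(m, n, entries):
    """State bits of an (m,n) predictor: 2^m counters of n bits for each branch-address entry."""
    return (2 ** m) * n * entries

bits_0_2 = predictor_bits(0, 2, 4 * 1024)   # (0,2) predictor with 4K entries
bits_2_2 = predictor_bits(2, 2, 16)         # (2,2) predictor: 64 counters / 4 per address = 16 entries

# Entries selected by branch address for a (2,2) predictor with the same 8K bits
entries_same_budget = bits_0_2 // ((2 ** 2) * 2)

print(bits_0_2, bits_2_2, entries_same_budget)
```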
BTB algorithm – Fig. 3.
BTB penalties – Fig. 3.
Instruction in buffer  Prediction  Actual branch  Penalty cycles
yes                    taken       taken          0
yes                    taken       not taken      2
no                                 taken          2
no                                 not taken      0
Multiple Issue
Issuing multiple instructions at
one time
• Scoreboard and Tomasulo’s method reduce stalls
from data hazards
• Branch buffers reduce stalls from control hazards
• Now talk about getting more parallelism by
issuing several instructions at once
• Two currently used methods:
- superscalar processors
- VLIW (very long instruction word) processors
5 approaches – Fig. 3.
Common name                Issue structure  Hazard detection  Scheduling           Interesting feature        Examples
Superscalar (static)       dynamic          hardware          static               in-order execution         Sun UltraSparc II/III
Superscalar (dynamic)      dynamic          hardware          dynamic              some o-o-o execution       IBM Power
Superscalar (speculative)  dynamic          hardware          dynamic, with spec.  o-o-o, with speculation    PIII/4, R10K, Alpha 21264
VLIW/LIW                   static           software          static               no hazards bet. packets    Trimedia, Intel i
EPIC                       mostly static    mostly software   mostly static        explicit dep. by compiler  Itanium
Superscalar MIPS
• Idea: issue several non-interfering instructions at
each clock cycle
• The hardware has the burden of detecting
interfering instructions
• Can be statically scheduled (using compiler
techniques), or dynamically scheduled (using
scoreboard or Tomasulo’s algorithm)
• For example, look at a superscalar MIPS processor
that can issue two instructions at once:
- one floating point instruction
- one non-floating point (integer or branch) instruction
Superscalar MIPS (cont.)
• 3 steps in instruction fetch and issue:
– fetch 2 instructions from cache
– determine whether 0, 1, or 2 can issue
- look at hazards from earlier instructions issued, and from within the 2 fetched instructions
- since 1 integer and 1 FP instruction can issue, mostly just look at opcodes for hazards within fetched pair, unless integer instruction is FP load/store/move – possible RAW hazard
- also need another read/write port on FP register file, to allow loads/stores to issue with FP operations, and (lots) more bypass paths
– issue the instructions to the correct functional
unit
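The pair check in the second step can be sketched as below. This is a deliberately simplified model with a hypothetical `issue_count` function: it looks only at the two fetched instructions, so it never returns 0 (that case comes from hazards with earlier instructions, which are not modeled), and it checks only for a RAW hazard inside the pair.

```python
def issue_count(first, second):
    """How many of a fetched pair (in program order) can issue together: 1 or 2.
    Each instruction is a dict with 'kind' ('fp' or 'int'), 'dest', and 'srcs'."""
    if first["kind"] == second["kind"]:
        return 1    # only one FP slot and one non-FP slot per cycle
    if first["dest"] is not None and first["dest"] in second["srcs"]:
        return 1    # RAW hazard inside the pair, e.g. FP load feeding an FP operation
    return 2

# FP loads count as non-FP ("int") instructions here, as on the slide
load_f0 = {"kind": "int", "dest": "F0", "srcs": ("R1",)}        # L.D   F0, 0(R1)
load_f6 = {"kind": "int", "dest": "F6", "srcs": ("R1",)}        # L.D   F6, -8(R1)
add_f4 = {"kind": "fp", "dest": "F4", "srcs": ("F0", "F2")}     # ADD.D F4, F0, F2

print(issue_count(load_f0, add_f4), issue_count(load_f6, add_f4))
```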
Computer Systems Architecture
CMSC 411
Unit 4 – Instruction-level
parallelism
Alan Sussman
April 20, 2006
Administrivia
• Midterm questions?
• Project 2 questions?
– due next Thursday, April 27
• Guest lecturer Tuesday, (short) class next
Thursday to finish up Unit 4b
Last time
• Branch prediction
- for a given branch instruction, local predictor uses history of that branch inst.
- correlating predictor uses history of all recently executed branches
- two-level predictor combines both – also called a tournament predictor
- branch target buffer also stores and uses address of next predicted instruction, to reduce branch penalty further
• Multiple instruction issue
- to issue and complete more than one instruction per cycle
- superscalar and VLIW are dynamic and static techniques, respectively
- superscalar MIPS from example can issue 0, 1, or 2 instructions per cycle, depending on structural and data hazards
Example
x is an array of 1000 double precision numbers, indexed from 1 to 1000
    for i=1000, 999, ..., 1
        x[i] = x[i] + s;
Unroll the loop – execute 4 iterations before branch (for now don’t worry about loop test)
Loop: L.D   F0,0(R1)       ; x[i] = x[i] + s
      ADD.D F4,F0,F2       ; uses F0 and F
      S.D   F4,0(R1)
      L.D   F6,-8(R1)      ; x[i-1] = x[i-1] + s
      ADD.D F8,F6,F2       ; uses F6 and F
      S.D   F8,-8(R1)
      L.D   F10,-16(R1)    ; x[i-2] = x[i-2] + s
      ADD.D F12,F10,F2     ; uses F10 and F
      S.D   F12,-16(R1)
      L.D   F14,-24(R1)    ; x[i-3] = x[i-3] + s
      ADD.D F16,F14,F2     ; uses F14 and F
      S.D   F16,-24(R1)
      DSUBI R1,R1,#32      ; point to next element
      BNEZ  R1,Loop
Example (cont.)
On single issue MIPS pipeline, takes 14 cycles
Cycle  FP instruction      non-FP instruction
1                          L.D F0, 0(R1)
2                          L.D F6, -8(R1)
3      ADD.D F4, F0, F2    L.D F10, -16(R1)
4      ADD.D F8, F6, F2    L.D F14, -24(R1)
5      ADD.D F12, F10, F2  DSUBI R1, R1, #32 (changes later offsets)
6      ADD.D F16, F14, F2  S.D F4, 32(R1)
7                          S.D F8, 24(R1)
8                          S.D F12, 16(R1)
9                          BNEZ R1, Loop (delayed branch)
10                         S.D F16, 8(R1)
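As a sanity check on the transformation itself, the sketch below shows in Python that unrolling by 4 computes the same result as the original loop, and works out cycles per element from the 14-cycle and 10-cycle figures on these slides. The function names are hypothetical, Python lists are indexed from 0 rather than 1, and the array length is assumed to be a multiple of 4.

```python
def add_scalar(x, s):
    """Original loop: one element per iteration, from the last element down to the first."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s

def add_scalar_unrolled(x, s):
    """Same loop unrolled 4 times: one loop test per 4 elements."""
    for i in range(len(x) - 1, -1, -4):
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s

a = [float(k) for k in range(1000)]
b = list(a)
add_scalar(a, 2.5)
add_scalar_unrolled(b, 2.5)

# Cycles per element for the unrolled loop body (4 elements per pass)
single_issue = 14 / 4    # 14 cycles on the single issue pipeline
dual_issue = 10 / 4      # 10 cycles in the two-issue schedule above
print(a == b, single_issue, dual_issue)
```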
Hardware speculation (cont.)
• 3 key ideas
- dynamic branch prediction , to choose which instructions to execute
- speculation , to allow executing instructions before branches resolved (and effects undone if speculation was wrong)
- dynamic scheduling , to deal with scheduling combinations of basic blocks
• Essentially a data flow execution
- operations execute as soon as operands available
• We’ll use Tomasulo’s algorithm for dynamic
scheduling, and just for FP unit
Extensions to Tomasulo’s algorithm
• Separate bypassing of results between
instructions from actual completion of an
instruction
– so that an instruction can execute and produce
results for other instructions, without
performing any updates that can’t be undone –
until know that the instruction should have been
executed (control dependences resolved)
– when instruction is no longer speculative:
- update the register file, or memory
- call this instruction commit
Implementing speculation
• Allow instructions to execute out-of-order, but
make them commit in order
- and prevent any action that can’t be undone (update state, take an exception, etc.) until instruction commits
• Requires some changes to standard pipeline
sequence, and hardware buffers to hold results of
instructions that have completed execution, but
have not committed yet
- a reorder buffer
- also used to pass results between speculated instructions
Reorder buffer (ROB)
• Provides registers, like reservation stations
• Holds instruction result from time operation
completes until instruction commits
• But no writeback to register file until instruction
commits
- so ROB supplies operands in this time interval
• Similar to store buffer in Tomasulo’s algorithm
- in design shown, store buffer integrated into ROB
Reorder buffer (cont.)
• 4 fields in an ROB entry
– instruction type – branch, store, register op
(ALU or load)
– destination – register number or memory
address
– value – result of instruction
– ready – set when instruction completes
execution, and value is ready
MIPS with Tomasulo + speculation
Fig. 3.
Instruction execution – 4 steps
• Issue
– get instruction from instruction queue
– issue if both reservation station and empty ROB
slot available
– send operands to reservation station if available
in either registers or ROB
– send ROB entry number to reservation station
for writing result
– if reservations stations or ROB full, instruction
stalls
Instruction execution (cont.)
• Execute
– If an operand is not yet available, monitor CDB
until it appears
– When all operands available at reservation
station, execute the operation
– Instructions can take multiple cycles in this
stage, loads require 2 steps (compute effective
address and memory access), stores only need
to compute effective address
Instruction execution (cont.)
• Write result
– When result available, write on CDB (with
ROB tag set in issue step), and from CDB to
ROB, plus any reservation stations waiting
– Mark reservation station available
– For stores, if value to be stored is available,
write into Value field of ROB entry for store
- otherwise, wait until it appears on CDB, then update Value field of ROB entry
Instruction execution (cont.)
• Commit
- For “normal” case (instruction reaches head of ROB and its result is in the buffer), update the register with the result and remove instruction from ROB
- For a store, update memory instead of result register
- For an incorrectly predicted branch, the speculation was wrong – flush the ROB and reservation stations, and restart execution at correct branch target
  - a correctly predicted branch is simply finished
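The commit step can be sketched with the ROB as a FIFO queue. This is only an illustration: the `ROBEntry` fields mirror the four fields listed earlier plus a hypothetical `mispredicted` flag, and refetching from the correct branch target and flushing the reservation stations are not modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    kind: str                        # "register", "store", or "branch"
    dest: Optional[str]              # register name or memory address
    value: Optional[float] = None
    ready: bool = False
    mispredicted: bool = False       # only meaningful for branches (hypothetical field)

def commit(rob, regs, memory):
    """Commit from the head of the ROB, in order, until the head is not ready."""
    committed = 0
    while rob and rob[0].ready:
        entry = rob.popleft()
        committed += 1
        if entry.kind == "branch":
            if entry.mispredicted:
                rob.clear()          # flush every later, speculative entry
                break
        elif entry.kind == "store":
            memory[entry.dest] = entry.value
        else:
            regs[entry.dest] = entry.value
    return committed

regs, memory = {}, {}
rob = deque([
    ROBEntry("register", "F0", 6.0, ready=True),
    ROBEntry("register", "F8"),                    # still executing
    ROBEntry("register", "F6", 1.0, ready=True),   # finished early, but must wait its turn
])
first_pass = commit(rob, regs, memory)
print(first_pass, regs, len(rob))
```

In this example only the first entry commits: the third instruction has finished executing out of order, but it stays in the ROB until the second one is ready.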
Example – Fig. 3.
L.D F6,34(R2)
L.D F2,45(R3)
MUL.D F0,F2,F
SUB.D F8,F6,F
DIV.D F10,F0,F
ADD.D F6,F8,F
• Assume 2 cycle add, 10
cycle multiply, 40 cycle
divide
• Status tables will show
when MUL.D is ready to
commit
Example (cont.)
Reservation stations
Name   Busy  Op     Vj             Vk             Qj  Qk  Dest  A
Load1  no
Load2  no
Add1   no
Add2   no
Add3   no
Mult1  no    MUL.D  Mem[45+R[R3]]  R[F4]                  #
Mult2  yes   DIV.D                 Mem[34+R[R2]]  #3      #