Computer Systems Architecture
CMSC 411
Unit 3 – Instruction Pipelining
Alan Sussman
February 13, 2003
CMSC 411 - Alan Sussman 2
Administrivia
- HW #2 due today – solution posted soon
- Quiz on Tuesday – Units 1 & 2
- HW #1 problem 1.17d
- MFLOPs with coprocessor
- answer shows that MFLOPs is computed as (# fp ops)/(time for fp ops) = (# fp ops)/(total time – time for integer ops)
- that is correct – don’t count integer ops against MFLOPs
- but both are counted in MIPS (both integer and fp ops are instructions!)
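The distinction between the two rates can be sketched in a few lines of Python. The workload numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from problem 1.17d.

```python
def mflops(fp_ops, total_time_s, integer_time_s):
    """MFLOPs: fp operations divided by the time spent on fp operations only."""
    fp_time_s = total_time_s - integer_time_s
    return fp_ops / fp_time_s / 1e6


def mips(fp_ops, integer_ops, total_time_s):
    """MIPS: every instruction (integer and fp) counts, over the total time."""
    return (fp_ops + integer_ops) / total_time_s / 1e6


# Hypothetical workload (assumed values, not from the homework).
fp_ops = 200_000_000
integer_ops = 600_000_000
total_time_s = 10.0
integer_time_s = 6.0

print(mflops(fp_ops, total_time_s, integer_time_s))  # 50.0
print(mips(fp_ops, integer_ops, total_time_s))       # 80.0
```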
Last time
- Compiler/architecture interaction
- providing a good target for the compiler can make a huge difference in performance – up to a factor of 10 on an f.p. intensive application
- provide regularity, primitives, make costs of code sequences easy to determine
- MIPS/MIPS64 architectures
- load/store, 64 bits (with 32-bit ops), 3 instruction formats for MIPS64 (all 32 bits), immediate and displacement addressing modes
So far
- What we mean by computer performance
- How to measure it
- How instruction sets are designed
- How the design influences performance
What’s next
- A variety of hardware and compiler techniques to speed the execution of programs
- What is pipelining? (Section A.1)
- How does MIPS divide instructions into stages or cycles? (A.1)
- What kinds of overheads are there in pipelining? (A.1)
- How much speedup do we get? (A.1)
- What are structural hazards, data hazards, and control hazards? (A.2)
- How are these techniques used to reduce stalls:
  - data forwarding? (A.2)
  - instruction reordering? (A.2)
  - compiler approaches to reduce branch delays? (A.2)
What is pipelining?
- Pipelining is an implementation technique
whereby multiple instructions are
overlapped in execution
- In other words, at any given moment in the
execution of a computer program, many
different instructions are at various stages of
completion!
Throughput
- The number of instructions that complete
per unit time
- Instructions take many clock cycles
- Ideally, every clock cycle, we want a new
instruction to begin (and end)
- This is how we will improve throughput
A MIPS implementation without
pipelining
- Recall from CMSC 311 that instructions
execute in different stages or cycles
- Instruction fetch cycle (IF): fetch the instruction from memory and update the program counter (PC) to point to the next instruction. Note: We’re not using the NPC register that the book introduces.
  IR ← Mem[PC]
  PC ← PC + 4
MIPS w/o pipelining (cont.)
- Instruction decode cycle (ID): Put the operands in pipeline registers A and B. Sign-extend the low order 16 bits of the IR and store in pipeline register Imm. (This sometimes holds the "immediate" constant.)
  A ← Regs[IR6..10]
  B ← Regs[IR11..15]
  Imm ← ((IR16)^16 ## IR16..31)
  (Here (IR16)^16 is bit 16 of the IR replicated 16 times, and ## is concatenation.)
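As a rough sketch of what the ID cycle extracts, the Python below pulls the two register fields and the sign-extended immediate out of a 32-bit instruction word, numbering bit 0 as the most significant bit as the IR6..10 notation does. The function names and the sample word are illustrative assumptions, not part of the course material.

```python
def bits(ir, first, last):
    """Return bits first..last of a 32-bit word, where bit 0 is the MSB."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def sign_extend_16(value):
    """Interpret a 16-bit field as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def decode(ir):
    """Extract the fields that the ID cycle uses."""
    return {
        "a_reg": bits(ir, 6, 10),                # register number read into A
        "b_reg": bits(ir, 11, 15),               # register number read into B
        "imm": sign_extend_16(bits(ir, 16, 31)), # value stored in Imm
    }


# Hypothetical word: arbitrary 6-bit opcode, register fields 2 and 1,
# and a 16-bit immediate field holding -4.
ir = (0x18 << 26) | (2 << 21) | (1 << 16) | (-4 & 0xFFFF)
print(decode(ir))  # {'a_reg': 2, 'b_reg': 1, 'imm': -4}
```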
MIPS w/o pipelining (cont.)
- Execution cycle (EX) : Use the ALU
- If memory reference: ALUOutput ← A + Imm
- If register-register ALU instruction: ALUOutput ← A op B
- If register-immediate ALU instruction: ALUOutput ← A op Imm
- If branch instruction: compute the branch address and check the branch condition:
  ALUOutput ← PC + (Imm << 2)
  Cond ← (A op 0)
  (but PC or Imm should be adjusted down by 4 to make this work right).
MIPS w/o pipelining (cont.)
- Memory access cycle (MEM): finish loads, stores, and branches:
  Load: LMD ← Mem[ALUOutput]
  Store: Mem[ALUOutput] ← B
  Branch: if Cond then PC ← ALUOutput else PC is ok
MIPS w/o pipelining (cont.)
- Write-back cycle (WB): update the registers
  Register-register ALU instruction: Regs[IR16..20] ← ALUOutput
  Register-immediate ALU instruction: Regs[IR11..15] ← ALUOutput
  Load instruction: Regs[IR11..15] ← LMD
Example 2 (cont.)
- Time for pipelined MIPS implementation:
We have to synchronize the stages, so we
need to run the clock at 10 ns
- 1st instruction takes 50 ns. The others each
finish 1 cycle later than the preceding one.
- Time = 50 ns + 99*10 ns = 1040 ns
- Speedup = 4000/1040 ≈ 3.85
Even more realistic case
- Example 3: The original MIPS implementation doesn't always need to use the MEM cycle
  - IF: 10 ns
  - ID: 8 ns
  - EX: 7 ns
  - MEM: 10 ns
  - WB: 5 ns
- Suppose that only 30% of instructions use memory access. So, on average, for every 100 instructions, we have about 70 that use 4 stages and 30 that use 5.
Example 3 (cont.)
- Time for original MIPS implementation:
- 70 instructions × 30 ns per instruction + 30 instructions × 40 ns per instruction = 3300 ns
- Time for pipelined MIPS implementation: We have to synchronize the stages, so we need to run the clock at 10 ns, and we need 5 cycles for every instruction.
  - 1st instruction takes 50 ns. The others each finish 1 cycle later than the preceding one
  - Time = 50 ns + 99*10 ns = 1040 ns
- Speedup = 3300/1040 ≈ 3.17
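The arithmetic in Examples 2 and 3 can be checked with a short Python sketch. It assumes the stage times listed for Example 3 also apply to Example 2 (where every instruction uses all five stages), and the function names are illustrative.

```python
# Stage times in ns, as listed for Example 3.
STAGE_NS = {"IF": 10, "ID": 8, "EX": 7, "MEM": 10, "WB": 5}


def unpipelined_time(n_instructions, mem_fraction):
    """Total time when only a fraction of instructions need the MEM cycle."""
    full = sum(STAGE_NS.values())       # 40 ns: all five stages
    short = full - STAGE_NS["MEM"]      # 30 ns: MEM skipped
    n_mem = round(n_instructions * mem_fraction)
    return n_mem * full + (n_instructions - n_mem) * short


def pipelined_time(n_instructions):
    """Clock is set by the slowest stage; one instruction finishes per cycle."""
    clock = max(STAGE_NS.values())      # 10 ns
    depth = len(STAGE_NS)               # 5 stages
    return depth * clock + (n_instructions - 1) * clock


pipelined = pipelined_time(100)                      # 1040 ns
for label, mem_fraction in [("Example 2", 1.0), ("Example 3", 0.3)]:
    original = unpipelined_time(100, mem_fraction)   # 4000 ns, then 3300 ns
    print(f"{label}: {original} ns vs {pipelined} ns, "
          f"speedup {original / pipelined:.2f}")
```

Both speedups fall well short of the ideal factor of 5 for a 5-stage pipeline, which is the overhead the next slide summarizes.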
Overhead of pipelining
- We just summarized the two major overhead costs in pipelining:
  - making the time for every stage equal the time for the longest stage
  - making the time for every instruction equal the time for the longest instruction (not quite true, but true for a wide range of instructions)
- Unfortunately, the speedup of pipelining is reduced even further by hazards that cause “bubbles” in the pipeline
Pipeline hazards cause stalls
- When some instruction is unable to
complete on schedule, we must
- finish the earlier instructions on schedule
- delay the later instructions
- This is called stalling the pipeline
Pipeline hazards
- What causes delays in instruction completion?
- Structural hazards are hardware delays
  Example: memory does not respond to a request as fast as it is expected to
- Data hazards arise when data can be predicted to be unready at the time it is needed
  Example: an instruction needs a register that a previous instruction is still modifying
- Control hazards arise when we need to do something other than incrementing the PC by 4
  Example: conditional branch, jump
Pipeline hazards (cont.)
Pipeline hazards reduce throughput and speedup even more!
Fig. A. Structural hazard – a load with 1 memory port for data/instructions
Clock cycle
Inst #   1      2      3      4      5      6      7      8      9      10
Load     IF     ID     EX     MEM    WB
i+1             IF     ID     EX     MEM    WB
i+2                    IF     ID     EX     MEM    WB
i+3                           stall  IF     ID     EX     MEM    WB
i+4                                         IF     ID     EX     MEM    WB
i+5                                                IF     ID     EX     MEM
i+6                                                       IF     ID     EX
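The stall in the diagram can be reproduced with a small scheduling sketch in Python. It is a simplified model under stated assumptions: only loads use the single memory port during MEM, an instruction that fetches in cycle c reaches MEM in cycle c + 3, and the function name is made up for illustration.

```python
def fetch_cycles(is_load):
    """Return the IF cycle of each instruction when instruction fetch and
    data access share one memory port.

    is_load[k] is True if instruction k is a load.
    """
    mem_busy = set()   # cycles in which a load occupies the port in MEM
    cycles = []
    cycle = 0
    for load in is_load:
        cycle += 1
        while cycle in mem_busy:    # port taken by a data access: stall IF
            cycle += 1
        cycles.append(cycle)
        if load:
            mem_busy.add(cycle + 3) # IF, ID, EX, then MEM three cycles later
    return cycles


# A load followed by six non-memory instructions, as in the diagram.
print(fetch_cycles([True] + [False] * 6))  # [1, 2, 3, 5, 6, 7, 8]
```

Instruction i+3 would have fetched in cycle 4, but the load is using the memory port for its MEM stage then, so i+3 and everything after it slips by one cycle.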
Pipeline hazards (cont.)
- Example 4: In Example 3, we had, on average, 70 instructions that use 4 stages and 30 that use 5
- Time for original MIPS implementation = 3300 ns
- Suppose that 5 of those instructions involve branches. So 5 times, we need to wait until the ID cycle of one instruction is complete before starting the IF cycle of the next instruction.
- Therefore, the next instruction will start 2 cycles later, not 1. So add 5 cycles to the time.
Example 4 (cont.)
- Time for pipelined MIPS implementation:
- 1st instruction takes 50 ns. The others each finish 1 cycle later than the preceding one, but there is a 5 cycle hazard penalty
- Time = 50 ns + 99*10 ns + 5*10 ns = 1090 ns
- Speedup = 3300/1090 ≈ 3.03
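A minimal Python sketch of the Example 4 calculation, treating each branch as one extra stall cycle. The function name and parameter names are illustrative assumptions.

```python
def pipelined_time_ns(n_instructions, clock_ns, depth, stall_cycles=0):
    """Fill the pipeline once, then finish one instruction per cycle,
    plus one extra cycle for every stall."""
    total_cycles = depth + (n_instructions - 1) + stall_cycles
    return total_cycles * clock_ns


original_ns = 3300                                   # from Example 3
with_stalls = pipelined_time_ns(100, 10, 5, stall_cycles=5)
print(with_stalls)                                   # 1090
print(f"{original_ns / with_stalls:.2f}")            # 3.03
```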
Data hazards
- A data hazard occurs when a piece of data
is not available when it is needed
- Perhaps there was a cache miss : we expected the value to be in cache, but instead we need to find it in memory
- Perhaps it is involved in a previous computation that has not yet completed
Example – Figure A.
Types of data hazards
- RAW : read after write
- One instruction writes a value. A later instruction reads it. Problem: an old value may be read.
- WAW : write after write
- One instruction writes a value. A later instruction writes in the same location. Problem: the final value may be the first, rather than the second.
- WAR : write after read
- One instruction reads a value. A later instruction writes in the same location. Problem: the value read may be the changed value rather than the original. This ordinarily cannot happen.
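The three definitions translate directly into register comparisons. The Python sketch below classifies the potential hazards between an earlier and a later instruction; representing an instruction as a (destination, sources) pair is an assumption made for illustration.

```python
def classify_hazards(first, second):
    """Return the data hazards the later instruction has on the earlier one.

    Each instruction is a (dest, sources) pair; dest may be None
    (for example, a store writes no register).
    """
    dest1, sources1 = first
    dest2, sources2 = second
    hazards = []
    if dest1 is not None and dest1 in sources2:
        hazards.append("RAW")   # second reads what first writes
    if dest1 is not None and dest1 == dest2:
        hazards.append("WAW")   # both write the same register
    if dest2 is not None and dest2 in sources1:
        hazards.append("WAR")   # second writes what first reads
    return hazards


dadd = ("R1", ["R2", "R3"])                            # DADD R1,R2,R3
print(classify_hazards(dadd, ("R4", ["R1", "R5"])))    # ['RAW']
print(classify_hazards(dadd, ("R1", ["R6", "R7"])))    # ['WAW']
print(classify_hazards(dadd, ("R2", ["R6", "R7"])))    # ['WAR']
```

Whether a potential hazard actually causes a wrong result or a stall depends on the pipeline: this check only finds the register overlap.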
Sometimes forwarding not enough
- Example : Data needs to be loaded from memory at least two instructions before use in order to avoid a stall – Figure A.
Forwarding (cont.)
- Compilers need to be smart enough to prevent stalls when possible
Example : a = b + c + d; e = d - f;
- Need to make sure that the first ADD operation delays until b and c are loaded
  LD   R1, b
  LD   R2, c
  LD   R3, d        ; ADD can’t be done yet
  DADD R4,R1,R2
  DADD R4,R3,R4     ; ok by forwarding
  LD   R5, f        ; need to start this before a = b + c + d completes
  SD   a, R4
  DSUB R6,R3,R5
  SD   e, R6        ; ok by forwarding
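To see what the reordering buys, the Python sketch below counts load-use stalls in two orderings of the code for a = b + c + d; e = d - f. It assumes full forwarding, so the only stall is one cycle when an instruction uses the result of the load immediately before it. The naive ordering and the register numbers are assumptions chosen to match the data flow of the example.

```python
def load_use_stalls(program):
    """Count 1-cycle stalls: an instruction that reads the destination of
    the load directly before it must wait even with forwarding.

    program is a list of (opcode, dest, sources) tuples.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        opcode, dest, _ = prev
        if opcode == "LD" and dest in cur[2]:
            stalls += 1
    return stalls


# Hypothetical straightforward translation: each load right before its use.
naive = [
    ("LD", "R1", []), ("LD", "R2", []), ("DADD", "R4", ["R1", "R2"]),
    ("LD", "R3", []), ("DADD", "R4", ["R3", "R4"]), ("SD", None, ["R4"]),
    ("LD", "R5", []), ("DSUB", "R6", ["R3", "R5"]), ("SD", None, ["R6"]),
]

# Scheduled order: loads hoisted so no use directly follows its load.
scheduled = [
    ("LD", "R1", []), ("LD", "R2", []), ("LD", "R3", []),
    ("DADD", "R4", ["R1", "R2"]), ("DADD", "R4", ["R3", "R4"]),
    ("LD", "R5", []), ("SD", None, ["R4"]),
    ("DSUB", "R6", ["R3", "R5"]), ("SD", None, ["R6"]),
]

print(load_use_stalls(naive))      # 3
print(load_use_stalls(scheduled))  # 0
```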
Forwarding (cont.)
- Rules for interchanging instructions:
- must be in same block (i.e., no branches between them)
- must check graph of dependencies to make sure they are independent
How the MIPS pipeline
introduces stalls
- Data hazards are checked during instruction
decode (ID): if a hazard exists, the EX
cycle is delayed (i.e., the instruction is not
issued) and a "no-op" is issued instead
- The ID cycle also determines whether data
forwarding is needed
Control hazards
- Question : When do we find out that the PC
needs to be modified?
- Answer : In pipeline stage ID of a branch
instruction
- So, if a branch is taken (i.e., if the PC is
modified), then we have to wait until the next
cycle before we can fetch the correct
instruction
Control hazards (cont.)
Branch inst.        IF   ID   EX   MEM  WB
Branch successor         IF   IF   ID   EX   MEM  WB
Successor                          IF   ID   EX   MEM
Successor                               IF   ID   EX
Wastes 1 clock cycle
Example
- If 30% of instructions are branches, then
instead of executing 1 instruction per cycle,
we have 70% of instructions executing in 1
cycle and 30% of instructions executing in 2
cycles
- An average of .7 + .6 = 1.3 cycles per
instruction
- Worse by 30%
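The same average can be written as a one-line formula: ideal CPI of 1 plus the branch frequency times the stall cycles per branch. The Python sketch below uses the slide's numbers; the function name is illustrative, and the one-cycle penalty per branch is the slide's assumption.

```python
def cpi_with_branch_stalls(branch_fraction, stall_cycles_per_branch=1):
    """Average cycles per instruction when every branch adds stall cycles
    on top of an ideal CPI of 1."""
    return 1.0 + branch_fraction * stall_cycles_per_branch


cpi = cpi_with_branch_stalls(0.30)
print(f"CPI = {cpi:.1f}")                       # CPI = 1.3
print(f"slowdown = {(cpi - 1.0) * 100:.0f}%")   # slowdown = 30%
```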
Compiler approaches to branch
delays
- Freeze or flush the pipeline when we
determine that a branch is taken - refer back
to Figure A.11 (a stall is inserted)
- Predict not taken : continue to begin
execution of instructions as if the branch is
not taken, but change them to a "no-op" if
the branch is taken
Predict not taken scheme – Fig. A.
Untaken branch  IF   ID   EX   MEM  WB
Inst. i+1            IF   ID   EX   MEM  WB
Inst. i+2                 IF   ID   EX   MEM  WB
Inst. i+3                      IF   ID   EX   MEM  WB
Inst. i+4                           IF   ID   EX   MEM  WB

Taken branch    IF   ID   EX   MEM  WB
Inst. i+1            IF   idle idle idle idle
Branch target             IF   ID   EX   MEM  WB
B.t. + 1                       IF   ID   EX   MEM  WB
B.t. + 2                            IF   ID   EX   MEM  WB
Compiler approaches (cont.)
- Predict taken : Good if most of the
branches are from loops
- Schedule using branch delay slots,
reordering the code to test the branch earlier
Branch delay slot – Fig. A.
Scheduling branch delay slot
- If taken from before branch
- branch must not depend on rescheduled instruction
- always improves performance
- If taken from branch target
- must be OK to execute rescheduled instructions if branch not taken, and may need to duplicate insts.
- performance improved when branch taken
- If taken from fall through
- must be OK to execute insts. if branch taken
- improves performance when branch not taken
Categorizing exceptions – Fig. A. 27
Exception type                   Synch. vs. asynch.  User request vs. coerce  User maskable vs. not  Within vs. between instructions  Resume vs. terminate
I/O device request               Asynch              Coerced                  Not                    Between                          Resume
Invoke OS                        Synch               User req.                Not                    Between                          Resume
Tracing instructions             Synch               User req.                Maskable               Between                          Resume
Breakpoint                       Synch               User req.                Maskable               Between                          Resume
Integer overflow                 Synch               Coerced                  Maskable               Within                           Resume
Floating pt. overflow/underflow  Synch               Coerced                  Maskable               Within                           Resume
Categorizing exceptions (cont.)
Exception type                   Synch. vs. asynch.  User request vs. coerce  User maskable vs. not  Within vs. between instructions  Resume vs. terminate
Page fault                       Synch               Coerced                  Not                    Within                           Resume
Misaligned memory access         Synch               Coerced                  Maskable               Within                           Resume
Mem. prot. violation             Synch               Coerced                  Not                    Within                           Resume
Undefined instruction            Synch               Coerced                  Not                    Within                           Terminate
Hardware malfunction             Asynch              Coerced                  Not                    Within                           Terminate
Power failure                    Asynch              Coerced                  Not                    Within                           Terminate
The most difficult exceptions...
- ... are those that occur within EX or MEM stages and need to be handled in a restartable way
- Why difficult? Handling one includes:
- the next IF gets a "trap instruction"
- until the trap is taken, turn off all "writes" for the faulting instruction and those that follow it.
- what does the trap do?
- The trap transfers control to the exception handling routine in the operating system, which saves the PC of the faulting instruction and handles the fault
- the task is then resumed, using the saved PC and the MIPS instruction RFE or something like it
- Note : May need to save several PCs if delayed branches are involved
Exceptions (cont.)
- Ideally, the pipeline can be interrupted so that
instructions before the fault complete. Then
we want to restart execution just after the
faulting instruction - precise exception
handling
- This is the right way to do it, but sometimes
architects/manufacturers take shortcuts
When do MIPS exceptions occur?
- IF
- page fault on instruction fetch
- misaligned memory access
- memory protection violation
- ID
- undefined or illegal opcode
- EX
- arithmetic exception
- MEM
- page fault on data fetch/store
- misaligned memory access
- memory protection violation
- WB : None!
Computer Systems Architecture
CMSC 411
Unit 3 – Instruction Pipelining
Alan Sussman
February 25, 2003
Administrivia
- HW #3 due next Tuesday, March 4
- Quiz today – last 25 minutes of class
- No office hours today or tomorrow
Last time
- Forwarding
- to use ALU or load result before WB
- compiler reorders instructions to prevent stalls, use forwarding
- Control hazards lead to branch delays
- because branch target isn’t computed until ID
- one (partial) solution is for compiler to schedule branch delay slots
- Exceptions
- machine must save pipeline state, handle exception (with OS), and restart where exception occurred
- precise vs. imprecise exception handling
- program generated ones can occur in all pipe stages except WB
Examples of exception handling
LD    IF   ID   EX   MEM  WB
ADD        IF   ID   EX   MEM  WB
- Handle the MEM fault first, then restart
LD    IF   ID   EX   MEM  WB
ADD        IF   ID   EX   MEM  WB
- IF fault occurs first, even though LD will fault later
- But for precise exceptions, must handle LD fault first
How is this done?
- Answer : Don't handle exceptions until the WB stage
  - each instruction has an associated status vector that keeps track of faults
  - any bit set in the status vector turns off register writes and memory writes
  - in WB stage, the status vector is checked and any fault is handled
- So, since instructions reach WB in proper order, faults for earlier instructions are handled before faults for later instructions
- Unfortunately, will need to violate this later (for instructions that don’t reach WB in proper order)
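A timing-only Python sketch of this idea: faults are recorded when they occur but are acted on only when the instruction reaches WB. It assumes one instruction issues per cycle with no stalls, and it does not model the squashing of later instructions once a fault is handled. The function name and data layout are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def exception_order(instructions):
    """instructions: (name, faulting_stage or None) pairs in program order.

    Returns (detection_order, handling_order): the order in which faults
    are raised in the pipeline, and the order in which they are handled
    if handling is deferred to each instruction's WB stage.
    """
    detected = []   # (cycle the fault is recorded in the status vector, name)
    handled = []    # (cycle the instruction reaches WB, name)
    for issue_cycle, (name, fault_stage) in enumerate(instructions, start=1):
        if fault_stage is None:
            continue
        detected.append((issue_cycle + STAGES.index(fault_stage), name))
        handled.append((issue_cycle + STAGES.index("WB"), name))
    return ([name for _, name in sorted(detected)],
            [name for _, name in sorted(handled)])


# LD faults in MEM (cycle 4); the following ADD faults in IF (cycle 2).
detection, handling = exception_order([("LD", "MEM"), ("ADD", "IF")])
print(detection)  # ['ADD', 'LD']
print(handling)   # ['LD', 'ADD']
```

The ADD's fault is detected first, but because LD reaches WB first, LD's fault is the one handled first, which is what precise exceptions require.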
Commitment
- When an instruction is guaranteed to complete, it is committed
- Life is easier if no instruction changes the machine state before it is committed
- In MIPS, commitment occurs at the end of the MEM stage - that’s why register update occurs in the stage after that
- Some machines muddy the state before commitment, and the exception handler must do its best to restore the state that existed before the instruction started
Complications caused by long
instructions
- So far, all MIPS instructions take 5 cycles
- But haven't talked yet about the floating
point instructions
- Take it on faith that floating point
instructions are inherently slower than
integer arithmetic instructions
- doubters may consult Appendix H in H&P online
FP stalls from RAW hazards – Fig. A.
Inst.            1     2     3     4     5     6     7     8     9
L.D F4,0(R2)     IF    ID    EX    MEM   WB
MUL.D F0,F4,F          IF    ID    stall M1    M2    M3    M4    M5
ADD.D F2,F0,F                IF    stall ID    stall stall stall stall
S.D F2,0(R2)                       IF    stall stall stall stall stall

Inst.            10    11    12    13    14    15    16    17
L.D
MUL.D            M6    M7    MEM   WB
ADD.D            stall stall A1    A2    A3    A4    MEM
S.D              stall stall ID    EX    stall stall stall MEM
Long instructions (cont.)
- It is possible that two instructions enter the WB stage at the same time
ADD.D   IF   ID   A1   A2   A3   A4   MEM  WB
LD           IF   ID   ALU  MEM  WB
DADD              IF   ID   ALU  MEM  WB
DADD                   IF   ID   ALU  MEM  WB
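A small Python sketch that finds the collision in the diagram: it computes the cycle in which each instruction reaches WB and reports any cycle claimed by more than one instruction. It assumes one instruction issues per cycle with no stalls; the table of pipeline lengths is read off the diagram, and the names are illustrative.

```python
from collections import defaultdict

# Number of stages from IF through WB, read off the diagram.
PIPELINE_LENGTH = {"ADD.D": 8, "LD": 5, "DADD": 5}


def wb_conflicts(opcodes):
    """Issue one instruction per cycle starting at cycle 1 and return
    {cycle: [instructions]} for every cycle with more than one in WB."""
    in_wb = defaultdict(list)
    for issue_cycle, opcode in enumerate(opcodes, start=1):
        wb_cycle = issue_cycle + PIPELINE_LENGTH[opcode] - 1
        in_wb[wb_cycle].append(f"{opcode}#{issue_cycle}")
    return {cycle: names for cycle, names in in_wb.items() if len(names) > 1}


print(wb_conflicts(["ADD.D", "LD", "DADD", "DADD"]))
# {8: ['ADD.D#1', 'DADD#4']}
```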
Long instructions (cont.)
- Instructions can finish in the wrong order
- This can cause WAW hazards
- see p. A-52 of H&P for an example
- This violation of WB ordering defeats the
previous strategy for precise exception
handling
WAW structural hazard – Fig. A.
MUL.D F0,F4,F    IF   ID   M1   M2   M3   M4   M5   M6   M7   MEM  WB
…                     IF   ID   EX   MEM  WB
…                          IF   ID   EX   MEM  WB
ADD.D F2,F4,F                   IF   ID   A1   A2   A3   A4   MEM  WB
…                                    IF   ID   EX   MEM  WB
…                                         IF   ID   EX   MEM  WB
L.D F2,0(R2)                                   IF   ID   EX   MEM  WB
How to detect hazards in ID
- Early detection would prevent trouble
- Check for structural hazards :
- will the divide unit clear in time?
- will WB be possible when we need it?
- Check for RAW data hazards :
- will all source registers be available when needed?
- Check for WAW data hazards :
- Is the destination register for any ADD.D, multiply or divide instructions the same register as the destination for this instruction?
- If anything dangerous could happen, delay the execute cycle so no conflict occurs
Precise exception handling for
long instructions
- Suppose
- ADD.D completes,
- then SUB.D has a floating-point exception,
- then DIV.D detects an exception
- Big trouble, because ADD.D has destroyed
register F10
Example:
  DIV.D F0, F2, F
  ADD.D F10, F10, F
  SUB.D F12, F12, F
Possible fixes
- Give up and just do imprecise exception handling
  - tempting, but very annoying to users
- Delay WB until all previous instructions complete
  - since so many instructions can be active, this is expensive - requires a lot of supporting hardware
- Write, to memory, a history file of register and memory changes so can undo instructions if necessary
  - or keep a future file of computed results that are waiting for MEM or WB
Possible fixes (cont.)
- Let the exception handler finish the
instructions in the pipeline and then restart
the pipe at the next instruction
- Have the floating point units diagnose
exceptions in their first or second stages,
so we can handle them by methods that work
well for handling integer exceptions
Computer Systems Architecture
CMSC 411
Unit 3 – Instruction Pipelining
Alan Sussman
February 27, 2003
Administrivia
- Quiz returned Tuesday
- answers posted on web page
- Read Chapter 3 – Unit 4 on instruction-level
parallelism
- HW #3 due Tuesday, March 4
- HW #4 posted soon
Last time
- Exception handling
- for 5 stage pipeline, handle them in WB stage, to keep proper order
- after the instruction commits
- Long instructions – generally means f.p. instructions
- higher latency, and initiation interval, than other (integer) instructions
- typically means the EX stage is multiple cycles, and sometimes not pipelined (e.g., divider)
- can get 2 (or more) instructions trying to enter WB at same time – structural hazard
- can get instructions finishing in wrong order – may cause WAW hazards, messing up precise exception handling
- detect hazards in ID stage, and delay EX if a conflict occurs
- finally, several ways to handle exceptions – history/future file, software (OS exception handler), diagnose early in pipeline, …
A case study: MIPS R4000
pipeline design
- MIPS64 architecture, with deeper 8 stage
pipeline
- to get higher clock rates
- extra stages come from memory accesses
- technique called superpipelining
R4000 pipeline performance
- 4 major causes of pipeline stalls
- load stalls – from using load result 1 or 2 cycles after load
- branch stalls – 2 cycles on every taken branch, or empty branch delay slot
- FP result stalls – RAW hazards for an FP operand
- FP structural stalls – from conflicts for functional units in FP pipeline
SPEC92 benchmarks
Assuming a perfect cache – 5 integer and five FP programs
Dynamically scheduled pipelines
- We’ll cover this, and the scoreboard
technique, in Unit 4
- need some general background first
Pitfalls
- Unexpected hazards do occur …
- for example, when a branch is taken before a previous instruction finishes
- Extensive pipelining can slow a machine
down, or lead to worse cost-performance
- more complex hardware can cause a longer clock cycle, killing the benefits of more pipelining
Pitfalls (cont.)
- A poor compiler can make a good
machine look bad
- compiler writers need to understand the architecture in order to - optimize efficiently and - avoid hazards
- better to eliminate useless instructions, than make them run faster