L22 – Pipelining the Beta 1
4/30/
Pipelining the Beta
betta ('be-t&) n. Any of various species of small, brightly colored, long-finned freshwater fishes of the genus Betta, found in southeast Asia.
beta (‘bA-t&, ‘bE-) n. 1. The second letter of the Greek alphabet. 2. The exemplary computer system used in 6.004.
I don’t think they mean the fish... maybe they’ll give me partial credit...
modified 4/27/09 11:
Lab #7 due Tonight!
6.004 – Spring 2009
CPU Performance
We’ve got a working Beta… can we make it fast?

MIPS = Millions of Instructions/Second
Freq = Clock Frequency, MHz
CPI = Clocks per Instruction

$\mathrm{MIPS} = \dfrac{\mathrm{Freq}}{\mathrm{CPI}}$

To Increase MIPS:
1. DECREASE CPI.
   - RISC simplicity reduces CPI to 1.0.
   - CPI below 1.0? Tough... you’ll see multiple instruction issue machines in 6.823.
2. INCREASE Freq.
   - Freq limited by delay along longest combinational path; hence
   - PIPELINING is the key to improved performance through fast clocks.
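As a quick numerical illustration of MIPS = Freq / CPI, here is a minimal sketch. The function name `mips` and the clock rates are hypothetical values picked for the example, not figures for the Beta.

```python
def mips(freq_mhz: float, cpi: float) -> float:
    """Millions of instructions per second for a clock in MHz and a given CPI."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return freq_mhz / cpi

# Hypothetical numbers, chosen only to illustrate the two levers.
baseline = mips(freq_mhz=100.0, cpi=1.0)       # CPI already at 1.0
faster_clock = mips(freq_mhz=400.0, cpi=1.0)   # same CPI, 4x the clock
print(baseline, faster_clock)
```

With CPI held at 1.0, the only remaining lever in this model is the clock frequency, which is why the rest of the lecture concentrates on shortening the longest combinational path.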
Beta Timing
[Figure: timing diagram of one Beta clock cycle, from CLK edge to CLK edge. Steps shown: New PC, PC+4, Fetch Inst., Control Logic, Read Regs, RA2SEL mux, ASEL mux, BSEL mux, ALU, Fetch data, +OFFSET, =0?, WDSEL mux, PCSEL mux, RF setup, PC setup, Mem setup. A second panel redraws the same steps as a “precedence graph”, with PC+4, ALU and +OFFSET marked. Example instructions: LDR(X,R3), LD(R1,10,R0).]

Wanted: longest paths

Complications:
• some apparent paths aren’t “possible”
• operations have variable execution times (eg, ALU)
• time axis is not to scale (eg, t_PD,MEM is very big!)
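Finding the longest path through a precedence graph can be sketched as a memoized search over a DAG. The edges and the delay numbers below are illustrative assumptions (a load-like path with invented delays), not the actual Beta timing; only the step names come from the slide.

```python
from functools import lru_cache

# Hypothetical propagation delays (arbitrary time units) for a few steps.
DELAY = {
    "Fetch Inst.": 4, "Control Logic": 1, "Read Regs": 2, "BSEL mux": 1,
    "ALU": 3, "Fetch data": 4, "WDSEL mux": 1, "RF setup": 1,
}
# Assumed precedence edges: a step may start only after its predecessors finish.
SUCC = {
    "Fetch Inst.": ("Control Logic", "Read Regs"),
    "Control Logic": ("BSEL mux",),
    "Read Regs": ("BSEL mux",),
    "BSEL mux": ("ALU",),
    "ALU": ("Fetch data", "WDSEL mux"),
    "Fetch data": ("WDSEL mux",),
    "WDSEL mux": ("RF setup",),
}

@lru_cache(maxsize=None)
def longest_from(node):
    """Return (total delay, path) of the slowest path that starts at node."""
    best_delay, best_path = 0, ()
    for nxt in SUCC.get(node, ()):
        delay, path = longest_from(nxt)
        if delay > best_delay:
            best_delay, best_path = delay, path
    return DELAY[node] + best_delay, (node,) + best_path

total, path = longest_from("Fetch Inst.")
print(total, " -> ".join(path))
```

A real analysis would also have to discard paths that no instruction can actually exercise, which is the first complication listed on the slide.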
Why isn’t this a 20-minute lecture?

We’ve learned how to pipeline combinational circuits. What’s the big deal?

1. The Beta isn’t combinational…
   • Explicit state in register file, memory;
   • Hidden state in PC.
2. Consecutive operations – instruction executions – interact:
   • Jumps, branches dynamically change instruction sequence
   • Communication through registers, memory

Our goals:
• Move slow components into separate pipeline stages, running clock faster
• Maintain instruction semantics of unpipelined Beta as far as possible
Ultimate Goal: 5-Stage Pipeline

GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)

APPROACH: structure processor as 5-stage pipeline:

IF    Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to
RF    Register File stage: Reads source operands from register file, passes them to
ALU   ALU stage: Performs indicated operation, passes result to
MEM   Memory stage: If it’s a LD, use ALU result as an address, pass mem data (or ALU result if not LD) to
WB    Write-Back stage: writes result back into register file.
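To make "increase clock speed to barely include slowest components" concrete, the following sketch compares clock periods under a deliberately simple model. All delays and the register overhead are hypothetical numbers, and the model ignores everything except stage delay and a fixed per-register cost.

```python
# Hypothetical stage delays in picoseconds (not measured Beta values).
STAGE_DELAY_PS = {"IF": 4000, "RF": 2000, "ALU": 3000, "MEM": 4000, "WB": 1000}
REGISTER_OVERHEAD_PS = 500  # assumed setup + clock-to-Q cost per pipeline register

def unpipelined_period(delays):
    """One long cycle: every stage's logic sits on a single combinational path."""
    return sum(delays.values())

def pipelined_period(delays, overhead):
    """The clock must barely include the slowest stage plus register overhead."""
    return max(delays.values()) + overhead

t_old = unpipelined_period(STAGE_DELAY_PS)
t_new = pipelined_period(STAGE_DELAY_PS, REGISTER_OVERHEAD_PS)
print(f"unpipelined {t_old} ps, pipelined {t_new} ps, speedup {t_old / t_new:.2f}x")
```

Under these assumed numbers the speedup is well short of 5x, because the stages are unbalanced and the pipeline registers add their own delay.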
First Steps:
A Simple 2-Stage Pipeline
[Figure: Beta datapath split into two pipeline stages, IF and EXE. Pipeline registers holding the PC and the instruction (labelled PC EXE and IR EXE) separate instruction fetch (PC, +4, Instruction Memory) from execution (Register File, ALU, Data Memory). Control signals shown: PCSEL, RA2SEL, ASEL, BSEL, ALUFN, WDSEL, WASEL, WERF, Wr.]
2-Stage Pipelined Beta Operation
Consider a sequence of instructions:

...
ADDC(r1, 1, r2)
SUBC(r1, 1, r3)
XOR(r1, r5, r1)
MUL(r2, r6, r0)
...

Executed on our 2-stage pipeline:

TIME (cycles)   i      i+1    i+2    i+3    i+4
Pipeline IF     ADDC   SUBC   XOR    MUL
         EXE           ADDC   SUBC   XOR    MUL
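The staggered overlap in a hazard-free pipeline diagram can be generated mechanically: instruction k occupies stage s during cycle k + s. The sketch below assumes no stalls or branches, and the function name `pipeline_diagram` is made up for this example.

```python
def pipeline_diagram(instrs, stages):
    """Map each stage name to the list of instructions it holds, one entry per cycle."""
    n_cycles = len(instrs) + len(stages) - 1
    rows = {}
    for s, stage in enumerate(stages):
        row = [""] * n_cycles
        for k, name in enumerate(instrs):
            row[k + s] = name          # instruction k reaches stage s in cycle k + s
        rows[stage] = row
    return rows

rows = pipeline_diagram(["ADDC", "SUBC", "XOR", "MUL"], ["IF", "EXE"])
for stage, row in rows.items():
    print(f"{stage:4}" + "".join(f"{name:6}" for name in row))
```

Passing a longer stage list, for example four stage names, produces the deeper diagram in the same way.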
Pipeline Control Hazards
BUT consider instead:

LOOP: ADD(r1, r3, r3)
      CMPLEC(r3, 100, r0)
      BT(r0, LOOP)
      XOR(r3, -1, r3)
      MUL(r1, r2, r2)

       i     i+1   i+2   i+3
IF     ADD   CMP   BT    XOR
EXE          ADD   CMP   BT

Cycle i+3 is the cycle where the branch decision is made… but we’ve already fetched the following instruction, which should be executed only if branch is not taken!
Branch Alternative 2b(i)
Put USEFUL instructions in the branch delay slots; remember they will be executed whether the branch is taken or not

LOOP:  ADD(r1,r3,r3)
LOOPx: CMPLEC(r3,100,r0)
       BT(r0,LOOPx)
       ADD(r1,r3,r3)
       SUB(r3,r1,r3)     (We need to add this silly instruction to UNDO the effects of that last ADD)
       XOR(r3,-1,r3)
       MUL(r1,r2,r2)
       ...

       i     i+1   i+2   i+3   i+4   i+5   i+6
IF     ADD   CMP   BT    ADD   CMP   BT    ADD
EXE          ADD   CMP   BT    ADD   CMP   BT
(Branch taken)

Pros: only two “extra” instructions are executed (on last iteration)
Cons: finding “useful” instructions that are always executed is difficult; clever rewrite may be required. Program executes differently on naïve unpipelined implementation.
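The claim that the delay-slot rewrite computes the same result at the cost of two extra instructions can be checked with a toy interpreter. This is a sketch under assumptions: it models only the five opcodes used here, treats the slide's XOR(r3,-1,r3) as XOR with a literal, and uses invented starting register values; `run` is a hypothetical helper, not part of any 6.004 tool.

```python
def run(program, labels, regs, delay_slot):
    """Interpret a tiny Beta-like program; return (final registers, instructions executed)."""
    regs = dict(regs)
    pc, pending, executed = 0, None, 0
    while pc < len(program):
        op, *args = program[pc]
        next_pc = pc + 1
        if pending is not None:              # this instruction sits in a delay slot
            next_pc, pending = pending, None
        if op == "BT":
            ra, label = args
            if regs[ra] != 0:
                if delay_slot:
                    pending = labels[label]  # redirect only after the next instruction
                else:
                    next_pc = labels[label]
        else:
            a, b, c = args
            x = regs[a]
            y = regs[b] if isinstance(b, str) else b   # register name or literal
            if op == "ADD":
                regs[c] = x + y
            elif op == "SUB":
                regs[c] = x - y
            elif op == "MUL":
                regs[c] = x * y
            elif op == "XOR":
                regs[c] = x ^ y
            elif op == "CMPLEC":
                regs[c] = 1 if x <= y else 0
            else:
                raise ValueError(f"unknown opcode {op}")
        executed += 1
        pc = next_pc
    return regs, executed

ORIGINAL = [
    ("ADD", "r1", "r3", "r3"),       # LOOP
    ("CMPLEC", "r3", 100, "r0"),
    ("BT", "r0", "LOOP"),
    ("XOR", "r3", -1, "r3"),
    ("MUL", "r1", "r2", "r2"),
]
REWRITTEN = [
    ("ADD", "r1", "r3", "r3"),       # LOOP
    ("CMPLEC", "r3", 100, "r0"),     # LOOPx
    ("BT", "r0", "LOOPx"),
    ("ADD", "r1", "r3", "r3"),       # delay slot: always executed
    ("SUB", "r3", "r1", "r3"),       # undoes the final ADD
    ("XOR", "r3", -1, "r3"),
    ("MUL", "r1", "r2", "r2"),
]
START = {"r0": 0, "r1": 7, "r2": 3, "r3": 0}   # invented initial values

plain, n_plain = run(ORIGINAL, {"LOOP": 0}, START, delay_slot=False)
slotted, n_slotted = run(REWRITTEN, {"LOOPx": 1}, START, delay_slot=True)
print(plain == slotted, n_slotted - n_plain)
```

Running REWRITTEN with `delay_slot=False` gives a different answer, which is the "executes differently on naïve unpipelined implementation" drawback.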
Branch Alternative 2b(ii)
Put USEFUL instructions in the branch delay slots; annul them if branch doesn’t behave as predicted

LOOP:  ADD(r1, r3, r3)
LOOPx: CMPLEC(r3, 100, r0)
       BT.taken(r0, LOOPx)
       ADD(r1, r3, r3)
       XOR(r3, -1, r3)
       MUL(r1, r2, r2)

       i     i+1   i+2   i+3   i+4   i+5   i+6
IF     ADD   CMP   BT    ADD   CMP   BT    ADD
EXE          ADD   CMP   BT    ADD   CMP   BT
(Branch taken)

Pros: only one instruction is annulled (on last iteration); about 70% of branch delay slots can be filled with useful instructions
Cons: Program executes differently on naïve unpipelined implementation; not really useful with more than one delay slot.
Architectural Issue: Branch Decision Timing

BETA approach:
- SIMPLE branch condition logic ... Test for Reg[Ra] = 0!
- ADVANTAGE: early decision, single delay slot

ALTERNATIVES: (eg, if Reg[Ra] > Reg[Rb])
- MORE powerful, but
- LATER decision (hence more delay slots)

[Figure: five-stage pipeline. Each stage has its own instruction register and control logic (CL): Instruction Fetch (IF); Register File, RF(read); ALU, with operands A and B; Memory, with result Y; Write Back, RF(write), with result Y.]

Suppose decision were made in the ALU stage ... then there would be 2 branch delay slots (and instructions to annul!)

Wow! I guess those guys really were thinking when they made up all those instructions
4-Stage Beta Pipeline

[Figure: Beta datapath split into four stages: Instruction Fetch, Register File, ALU, Write Back. Pipeline registers labelled RF, ALU and MEM carry the PC, IR, A, B, D and Y values between stages. The register file is drawn twice, once for reads in the Register File stage and once for writes in Write Back (NB: SAME RF AS ABOVE!). Control signals shown: PCSEL, RA2SEL, ASEL, BSEL, ALUFN, WDSEL, WASEL, WERF, R/W.]

Treat register file as two separate devices: combinational READ, clocked WRITE at end of pipe.

What other information do we have to pass down pipeline?
- PC (return addresses)
- instruction fields (decoding)

What sort of improvement should we expect in cycle time?
4-Stage Beta Operation
Consider a sequence of instructions:

...
ADDC(r1, 1, r2)
SUBC(r1, 1, r3)
XOR(r1, r5, r1)
MUL(r2, r6, r0)
...

Executed on our 4-stage pipeline:

TIME (cycles)   i      i+1    i+2    i+3    i+4    i+5    i+6
Pipeline IF     ADDC   SUBC   XOR    MUL
         RF            ADDC   SUBC   XOR    MUL
         ALU                  ADDC   SUBC   XOR    MUL
         WB                          ADDC   SUBC   XOR    MUL
Pipeline “Data Hazard”

BUT consider instead:

ADD(r1, r2, r3)
CMPLEC(r3, 100, r0)
MULC(r1, 100, r4)
SUB(r1, r2, r5)

       i     i+1   i+2   i+3   i+4   i+5   i+6
IF     ADD   CMP   MUL   SUB
RF           ADD   CMP   MUL   SUB
ALU                ADD   CMP   MUL   SUB
WB                       ADD   CMP   MUL   SUB

r3 fetched: cycle i+2 (CMP in RF). r3 available: after cycle i+3 (ADD in WB).

Oops! CMP is trying to read Reg[R3] during cycle i+2 but ADD doesn’t write its result into Reg[R3] until the end of cycle i+3!
Data Hazard Solution 1
“Program around it”... document weirdo semantics, declare it a software problem.
- Breaks sequential semantics!
- Costs code efficiency.

EXAMPLE: Rewrite

ADD(r1, r2, r3)
CMPLEC(r3, 100, r0)
MULC(r1, 100, r4)
SUB(r1, r2, r5)

as

ADD(r1, r2, r3)
MULC(r1, 100, r4)
SUB(r1, r2, r5)
CMPLEC(r3, 100, r0)

How often can we do this?

Programmer’s fallback: Insert NOPs (sigh!)
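Whether a given instruction order is safe can be checked mechanically. The sketch below assumes the 4-stage IF/RF/ALU/WB pipeline, where a result written at the end of WB is not visible to the next two instructions' register reads; `Instr` and `raw_hazards` are hypothetical names, and the check ignores the case where an intervening instruction overwrites the same register.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "name dest srcs")

def raw_hazards(program, window=2):
    """List (producer, consumer, register) triples where the consumer reads too early.

    With IF/RF/ALU/WB, a value written at the end of WB is visible to a later
    instruction's RF read only if that instruction issues more than `window`
    slots after the producer.
    """
    found = []
    for k, producer in enumerate(program):
        for consumer in program[k + 1 : k + 1 + window]:
            if producer.dest in consumer.srcs:
                found.append((producer.name, consumer.name, producer.dest))
    return found

HAZARDOUS = [
    Instr("ADD", "r3", ("r1", "r2")),
    Instr("CMPLEC", "r0", ("r3",)),
    Instr("MULC", "r4", ("r1",)),
    Instr("SUB", "r5", ("r1", "r2")),
]
# Same instructions with CMPLEC moved to the end, as in the rewrite above.
REORDERED = [HAZARDOUS[0], HAZARDOUS[2], HAZARDOUS[3], HAZARDOUS[1]]

print(raw_hazards(HAZARDOUS))
print(raw_hazards(REORDERED))
```

The reordering works here only because MULC and SUB are independent of r3 and r0; when no such independent instructions exist, NOPs are the fallback.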
Data Hazard Solution 2
Stall the pipeline:
Freeze IF, RF stages for 2 cycles, inserting NOPs into ALU-stage instruction register

       i     i+1   i+2   i+3   i+4   i+5   i+6
IF     ADD   CMP   MUL   MUL   MUL   SUB
RF           ADD   CMP   CMP   CMP   MUL   SUB
ALU                ADD   NOP   NOP   CMP   MUL
WB                       ADD   NOP   NOP   CMP

Drawback: NOPs mean “wasted” cycles
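The stall rule can be sketched as a cycle-by-cycle simulation of the four stages. This is an illustrative model, not the Beta's control logic: it assumes the register file write happens only at the end of WB, so the RF stage stalls whenever one of its source registers is the destination of the instruction currently in ALU or WB. `Instr` and `simulate` are hypothetical names.

```python
from collections import namedtuple

Instr = namedtuple("Instr", "name dest srcs")
NOP = Instr("NOP", None, ())

def simulate(program):
    """Return one dict per cycle mapping stage name to the opcode it holds (or None)."""
    todo = list(program)
    if_, rf, alu, wb = (todo.pop(0) if todo else None), None, None, None
    history = []
    while any(s is not None for s in (if_, rf, alu, wb)):
        history.append({k: (v.name if v else None)
                        for k, v in (("IF", if_), ("RF", rf), ("ALU", alu), ("WB", wb))})
        # Results still in flight: the write happens only at the end of WB.
        busy = {s.dest for s in (alu, wb) if s is not None and s.dest is not None}
        stall = rf is not None and any(src in busy for src in rf.srcs)
        wb = alu
        if stall:
            alu = NOP                  # IF and RF stay frozen, a NOP enters ALU
        else:
            alu, rf = rf, if_
            if_ = todo.pop(0) if todo else None
    return history

PROGRAM = [
    Instr("ADD", "r3", ("r1", "r2")),
    Instr("CMP", "r0", ("r3",)),
    Instr("MUL", "r4", ("r1",)),
    Instr("SUB", "r5", ("r1", "r2")),
]
for cycle, stages in enumerate(simulate(PROGRAM)):
    print(cycle, stages)
```

For this four-instruction sequence the simulation takes two cycles more than the hazard-free case, which is the cost of the two NOPs.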