Pipelining the Beta: Improving CPU Performance through Pipelining in 6.004, Spring 2009, Slides of Computer Fundamentals

These slides cover pipelining in computer architecture, focusing on the Beta CPU used in MIT's 6.004 course during the Spring 2009 semester. They present the basics of pipelining, its benefits, and the complications that arise from consecutive operations, branching, and data hazards. The ultimate goal is a 5-stage pipeline that maintains a CPI near 1.0 while increasing clock speed.

Typology: Slides

2012/2013

Uploaded on 04/18/2013

palmoni
L22 – Pipelining the Beta 1
6.004 – Spring 2009 4/30/09
Pipelining the Beta
betta ('be-t&) n. Any of various species of
small, brightly colored, long-finned
freshwater fishes of the genus Betta,
found in southeast Asia.
beta (‘bA-t&, ‘bE-) n. 1. The second letter
of the Greek alphabet. 2. The exemplary
computer system used in 6.004.
I don’t think they mean the fish...
maybe they’ll
give me partial
credit...
modified 4/27/09 11:17
Lab #7 due Tonight!
L22 – Pipelining the Beta 2
6.004 – Spring 2009 4/30/09
CPU Performance
We’ve got a working Beta… can we make it fast?
MIPS = Millions of Instructions/Second
Freq = Clock Frequency, MHz
CPI = Clocks per Instruction
MIPS = Freq / CPI
To Increase MIPS:
1. DECREASE CPI.
- RISC simplicity reduces CPI to 1.0.
- CPI below 1.0? Tough... you’ll see multiple instruction issue
machines in 6.823.
2. INCREASE Freq.
- Freq limited by delay along longest combinational path; hence
- PIPELINING is the key to improved performance through fast clocks.

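The MIPS relation on the CPU Performance slide can be checked with a few lines of Python. This is only a sketch: the function name `mips` and the clock and CPI figures are hypothetical values chosen for illustration, not measurements of the Beta.

```python
def mips(freq_mhz: float, cpi: float) -> float:
    """MIPS = Freq / CPI, with Freq in MHz and CPI in clocks per instruction."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return freq_mhz / cpi

# Hypothetical numbers, chosen only for illustration.
unpipelined = mips(freq_mhz=100.0, cpi=1.0)
pipelined = mips(freq_mhz=400.0, cpi=1.1)  # faster clock, slightly worse CPI

print(f"unpipelined: {unpipelined:.1f} MIPS")
print(f"pipelined:   {pipelined:.1f} MIPS")
```

Under these assumed numbers, a 4x faster clock still wins clearly even though CPI drifts slightly above 1.0, which is the trade the lecture is aiming for.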
Beta Timing
[Figure: timing diagram of one Beta clock cycle, from CLK edge to CLK edge, with blocks New PC, PC+4, Fetch Inst., Control Logic, Read Regs, RA2SEL mux, ASEL mux, BSEL mux, ALU, +OFFSET, =0?, Fetch data, WDSEL mux, PCSEL mux, followed by RF setup, PC setup, and Mem setup. A second panel redraws the same blocks as a “precedence graph”. Example instructions marked on the diagram: LD(R1,10,R0), LDR(X,R3).]
Wanted: longest paths
Complications:
- some apparent paths aren’t “possible”
- operations have variable execution times (eg, ALU)
- time axis is not to scale (eg, tPD,MEM is very big!)

Why isn’t this a 20-minute lecture?
1. The Beta isn’t combinational…
Explicit state in register file, memory;
Hidden state in PC.
2. Consecutive operations – instruction executions – interact:
Jumps, branches dynamically change instruction sequence
Communication through registers, memory
Our goals:
Move slow components into separate pipeline stages, running clock faster
Maintain instruction semantics of unpipelined Beta as far as possible
We’ve learned how to pipeline combinational circuits.
What’s the big deal?

Ultimate Goal: 5-Stage Pipeline
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)
APPROACH: structure processor as 5-stage pipeline:
IF   Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to
RF   Register File stage: Reads source operands from register file, passes them to
ALU  ALU stage: Performs indicated operation, passes result to
MEM  Memory stage: If it’s a LD, use ALU result as an address, pass mem data (or ALU result if not LD) to
WB   Write-Back stage: writes result back into register file.

First Steps: A Simple 2-Stage Pipeline
[Figure: datapath of the 2-stage pipelined Beta, split into an IF stage and an EXE stage. The IF stage holds the PC, the +4 adder, and the Instruction Memory. Pipeline registers labeled PC and IR (with superscript EXE) separate the stages. The EXE stage holds the Register File, ALU, and Data Memory, with control signals PCSEL, RA2SEL, ASEL, BSEL, ALUFN, WDSEL, WASEL, WERF, and Wr, and PCSEL inputs including JT, ILLOP, and XAdr.]

2-Stage Pipelined Beta Operation
Consider a sequence of instructions:
    ...
    ADDC(r1, 1, r2)
    SUBC(r1, 1, r3)
    XOR(r1, r5, r1)
    MUL(r2, r6, r0)
    ...
Executed on our 2-stage pipeline:
TIME (cycles)   i      i+1    i+2    i+3    i+4    i+5    i+6
Pipeline  IF    ADDC   SUBC   XOR    MUL
          EXE          ADDC   SUBC   XOR    MUL
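The occupancy diagram for a hazard-free pipeline follows a simple rule: instruction k sits in stage s during cycle k + s. The sketch below builds such a diagram for the instruction sequence on this slide. The helper names `pipeline_diagram` and `render` are hypothetical, and the model assumes no stalls or branches.

```python
def pipeline_diagram(instrs, stages):
    """Return {stage: [name or None per cycle]} for an ideal pipeline.

    Assumes no hazards: instruction k occupies stage s during cycle k + s.
    """
    n_cycles = len(instrs) + len(stages) - 1
    rows = {stage: [None] * n_cycles for stage in stages}
    for k, name in enumerate(instrs):
        for s, stage in enumerate(stages):
            rows[stage][k + s] = name
    return rows


def render(rows):
    """Format the diagram as text, one row per stage, one column per cycle."""
    n = len(next(iter(rows.values())))
    header = ["i" if c == 0 else f"i+{c}" for c in range(n)]
    lines = ["      " + "".join(f"{h:<6}" for h in header)]
    for stage, cells in rows.items():
        lines.append(f"{stage:<6}" + "".join(f"{(c or ''):<6}" for c in cells))
    return "\n".join(lines)


rows = pipeline_diagram(["ADDC", "SUBC", "XOR", "MUL"], ["IF", "EXE"])
print(render(rows))
```

Passing `["IF", "RF", "ALU", "WB"]` as the stage list gives the corresponding 4-stage picture under the same no-hazard assumption.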

Pipeline Control Hazards
BUT consider instead:
    LOOP: ADD(r1, r3, r3)
          CMPLEC(r3, 100, r0)
          BT(r0, LOOP)
          XOR(r3, -1, r3)
          MUL(r1, r2, r2)
TIME (cycles)   i     i+1   i+2   i+3   i+4   i+5   i+6
IF              ADD   CMP   BT    XOR
EXE                   ADD   CMP   BT
With BT in EXE (cycle i+3): This is the cycle where the branch decision is made… but we’ve already fetched the following instruction which should be executed only if branch is not taken!

Branch Alternative 2b(i)
Put USEFUL instructions in the branch delay slots; remember they will be executed whether the branch is taken or not
    LOOP:  ADD(r1,r3,r3)
    LOOPx: CMPLEC(r3,100,r0)
           BT(r0,LOOPx)
           ADD(r1,r3,r3)
           SUB(r3,r1,r3)    <-- We need to add this silly instruction to UNDO the effects of that last ADD
           XOR(r3,-1,r3)
           MUL(r1,r2,r2)
           ...
TIME (cycles)   i     i+1   i+2   i+3   i+4   i+5   i+6
IF              ADD   CMP   BT    ADD   CMP   BT    ADD
EXE                   ADD   CMP   BT    ADD   CMP   BT
(Branch taken)
Pros: only two “extra” instructions are executed (on last iteration)
Cons: finding “useful” instructions that are always executed is difficult; clever rewrite may be required. Program executes differently on naïve unpipelined implementation.
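To see why the extra SUB is needed, it helps to run both versions of the loop. The sketch below is a toy interpreter, not the Beta itself: the function `run`, the tuple encoding of instructions, and the starting register values are all assumptions made for illustration. It supports only the four opcodes the loop uses, and models a single branch delay slot by always executing the instruction after a BT.

```python
def run(program, regs, delay_slot):
    """Execute a list of (label, instruction) pairs until the PC runs off the end.

    delay_slot=False: a taken BT redirects the PC immediately (unpipelined Beta).
    delay_slot=True:  the instruction after BT always executes, then the PC is redirected.
    Returns (final registers, number of instructions executed).
    """
    labels = {lab: i for i, (lab, _) in enumerate(program) if lab}
    regs = dict(regs)
    pc, executed, pending = 0, 0, None
    while pc < len(program):
        op, *a = program[pc][1]
        # If a taken branch is pending, this instruction is its delay slot.
        next_pc = pc + 1 if pending is None else pending
        pending = None
        if op == "ADD":
            regs[a[2]] = regs[a[0]] + regs[a[1]]
        elif op == "SUB":
            regs[a[2]] = regs[a[0]] - regs[a[1]]
        elif op == "CMPLEC":
            regs[a[2]] = int(regs[a[0]] <= a[1])
        elif op == "BT":
            if regs[a[0]] != 0:
                if delay_slot:
                    pending = labels[a[1]]
                else:
                    next_pc = labels[a[1]]
        else:
            raise ValueError(f"unsupported opcode {op}")
        pc = next_pc
        executed += 1
    return regs, executed


original = [
    ("LOOP", ("ADD", "r1", "r3", "r3")),
    (None,   ("CMPLEC", "r3", 100, "r0")),
    (None,   ("BT", "r0", "LOOP")),
]
rewritten = [
    ("LOOP",  ("ADD", "r1", "r3", "r3")),
    ("LOOPx", ("CMPLEC", "r3", 100, "r0")),
    (None,    ("BT", "r0", "LOOPx")),
    (None,    ("ADD", "r1", "r3", "r3")),  # delay slot: always executed
    (None,    ("SUB", "r3", "r1", "r3")),  # undoes the final extra ADD
]

start = {"r0": 0, "r1": 7, "r3": 0}       # hypothetical starting values
regs_a, count_a = run(original, start, delay_slot=False)
regs_b, count_b = run(rewritten, start, delay_slot=True)
print(regs_a["r3"], regs_b["r3"], count_b - count_a)  # prints: 105 105 2
```

Both versions leave the same value in r3, and the rewritten loop executes exactly two more instructions, matching the "Pros" line. Running `rewritten` with `delay_slot=False` would never terminate in this model, since the branch would jump straight back to LOOPx without ever reaching the ADD; that is the "executes differently on naïve unpipelined implementation" problem in concrete form.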

Branch Alternative 2b(ii)
Put USEFUL instructions in the branch delay slots; annul them if branch doesn’t behave as predicted
    LOOP:  ADD(r1, r3, r3)
    LOOPx: CMPLEC(r3, 100, r0)
           BT.taken(r0, LOOPx)
           ADD(r1, r3, r3)
           XOR(r3, -1, r3)
           MUL(r1, r2, r2)
TIME (cycles)   i     i+1   i+2   i+3   i+4   i+5   i+6
IF              ADD   CMP   BT    ADD   CMP   BT    ADD
EXE                   ADD   CMP   BT    ADD   CMP   BT
(Branch taken)
Pros: only one instruction is annulled (on last iteration); about 70% of branch delay slots can be filled with useful instructions
Cons: Program executes differently on naïve unpipelined implementation; not really useful with more than one delay slot.
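The 70% figure suggests a rough way to estimate what delay slots cost. The sketch below uses a back-of-envelope model that is an assumption of this example, not something stated on the slide: every delay slot that is not filled with useful work counts as one wasted cycle. The function name `effective_cpi` and the 20% branch fraction are likewise hypothetical.

```python
def effective_cpi(branch_fraction, delay_slots=1, useful_fraction=0.7):
    """Estimate CPI when some branch delay slots do no useful work.

    Assumed model: base CPI is 1.0, and each delay slot that is not filled
    with a useful instruction (or is annulled) costs one extra cycle.
    """
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must be between 0 and 1")
    if not 0.0 <= useful_fraction <= 1.0:
        raise ValueError("useful_fraction must be between 0 and 1")
    wasted_per_branch = delay_slots * (1.0 - useful_fraction)
    return 1.0 + branch_fraction * wasted_per_branch


# Hypothetical workload: one instruction in five is a branch.
print(f"{effective_cpi(0.2):.2f}")                 # one slot, 70% filled
print(f"{effective_cpi(0.2, delay_slots=2):.2f}")  # two slots, same fill rate
```

Under this model the fill rate would have to hold for every slot, which is optimistic for a second slot; the slide's remark that annulment is "not really useful with more than one delay slot" points the same way.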

Architectural Issue: Branch Decision Timing
BETA approach:
  • SIMPLE branch condition logic ... Test for Reg[Ra] = 0!
  • ADVANTAGE: early decision, single delay slot
ALTERNATIVES:
  • Compare-and-branch... (eg, if Reg[Ra] > Reg[Rb])
  • MORE powerful, but
  • LATER decision (hence more delay slots)
[Figure: block diagram of the 5-stage pipeline: Instruction Fetch (IF), Register File (RF, read), ALU, Memory, and Write Back (RF, write). The instruction is passed from stage to stage, each stage has its own control logic (CL), and the ALU takes inputs A and B and produces Y.]
Suppose decision were made in the ALU stage ... then there would be 2 branch delay slots (and instructions to annul!)
Wow! I guess those guys really were thinking when they made up all those instructions
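The relationship between decision stage and delay slots can be stated as a one-line rule. The sketch below assumes one instruction is fetched per cycle and that the fetch stage is redirected at the end of the cycle in which the branch occupies the deciding stage; the function name `branch_delay_slots` is hypothetical.

```python
FIVE_STAGE = ["IF", "RF", "ALU", "MEM", "WB"]
TWO_STAGE = ["IF", "EXE"]


def branch_delay_slots(stages, decision_stage):
    """Count instructions fetched behind a branch before its outcome is known.

    Assumes one fetch per cycle, so the count equals the number of stages
    the branch has moved past IF when the decision is made.
    """
    return stages.index(decision_stage) - stages.index("IF")


print(branch_delay_slots(FIVE_STAGE, "RF"))   # Beta: test Reg[Ra] = 0 early
print(branch_delay_slots(FIVE_STAGE, "ALU"))  # compare-and-branch alternative
print(branch_delay_slots(TWO_STAGE, "EXE"))   # the 2-stage pipeline
```

This reproduces the slide's numbers: a decision in the RF stage gives a single delay slot, while a decision in the ALU stage gives two.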

4-Stage Beta Pipeline
Treat register file as two separate devices: combinational READ, clocked WRITE at end of pipe.
What other information do we have to pass down pipeline?
  PC (return addresses)
  instruction fields (decoding)
What sort of improvement should expect in cycle time?
[Figure: datapath of the 4-stage pipelined Beta with stages Instruction Fetch, Register File, ALU, and Write Back. Pipeline registers labeled PC, IR, A, B, D, and Y (with stage superscripts RF, ALU, and MEM) carry values between stages. The Register File appears twice, once for reading and once for writing, with the note "NB: SAME RF AS ABOVE!". Control signals shown: PCSEL, RA2SEL, ASEL, BSEL, ALUFN, WDSEL, WASEL, WERF, and the Data Memory R/W.]

4-Stage Beta Operation
Consider a sequence of instructions:
    ...
    ADDC(r1, 1, r2)
    SUBC(r1, 1, r3)
    XOR(r1, r5, r1)
    MUL(r2, r6, r0)
    ...
Executed on our 4-stage pipeline:
TIME (cycles)   i      i+1    i+2    i+3    i+4    i+5    i+6
Pipeline  IF    ADDC   SUBC   XOR    MUL
          RF           ADDC   SUBC   XOR    MUL
          ALU                 ADDC   SUBC   XOR    MUL
          WB                         ADDC   SUBC   XOR    MUL

Pipeline “Data Hazard”
BUT consider instead:
    ADD(r1, r2, r3)
    CMPLEC(r3, 100, r0)
    MULC(r1, 100, r4)
    SUB(r1, r2, r5)
TIME (cycles)   i     i+1   i+2   i+3   i+4   i+5   i+6
IF              ADD   CMP   MUL   SUB
RF                    ADD   CMP   MUL   SUB
ALU                         ADD   CMP   MUL   SUB
WB                                ADD   CMP   MUL   SUB
Diagram annotations: r3 fetched (CMP in RF, cycle i+2); r3 available (after ADD in WB, cycle i+3).
Oops! CMP is trying to read Reg[R3] during cycle i+2 but ADD doesn’t write its result into Reg[R3] until the end of cycle i+3!
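The hazard can be found mechanically by comparing the cycle in which each instruction reads its sources with the cycle in which an earlier instruction writes them. The sketch below assumes the 4-stage timing described above (read in RF, write at the end of WB, one instruction issued per cycle, no stalls). The function name `raw_hazards` and the (name, destination, sources) encoding are hypothetical.

```python
STAGES = ["IF", "RF", "ALU", "WB"]


def raw_hazards(program):
    """Find reads that happen before the producing instruction has written.

    program: list of (name, dest, sources), issued one per cycle with no stalls.
    Instruction k occupies stage s during cycle k + s (cycle 0 stands for "i").
    Returns (consumer, register, read_cycle, write_cycle) for each stale read.
    """
    rf, wb = STAGES.index("RF"), STAGES.index("WB")
    found = []
    for c, (cname, _, sources) in enumerate(program):
        read_cycle = c + rf
        for p in range(c):
            _, dest, _ = program[p]
            write_cycle = p + wb  # the value lands at the END of this cycle
            if dest in sources and read_cycle <= write_cycle:
                found.append((cname, dest, read_cycle, write_cycle))
    return found


program = [
    ("ADD",    "r3", ("r1", "r2")),
    ("CMPLEC", "r0", ("r3",)),
    ("MULC",   "r4", ("r1",)),
    ("SUB",    "r5", ("r1", "r2")),
]
print(raw_hazards(program))  # prints: [('CMPLEC', 'r3', 2, 3)]
```

The single result says that CMPLEC reads r3 in cycle i+2 while ADD writes it at the end of cycle i+3, which is the situation the slide calls out.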

Data Hazard Solution 1
“Program around it”... document weirdo semantics, declare it a software problem.
  • Breaks sequential semantics!
  • Costs code efficiency.
EXAMPLE: Rewrite
    ADD(r1, r2, r3)
    CMPLEC(r3, 100, r0)
    MULC(r1, 100, r4)
    SUB(r1, r2, r5)
as
    ADD(r1, r2, r3)
    MULC(r1, 100, r4)
    SUB(r1, r2, r5)
    CMPLEC(r3, 100, r0)
How often can we do this?
Programmer’s fallback: Insert NOPs (sigh!)
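The NOP fallback can be automated, which also shows what the reordering buys. The sketch below assumes the 4-stage pipeline, where a consumer must trail its producer by at least three instruction slots, so any producer within the previous two slots forces padding. The function name `insert_nops`, the `WINDOW` constant, and the tuple encoding are hypothetical.

```python
WINDOW = 2                      # slots behind a producer that are unsafe to read from
NOP = ("NOP", None, ())


def insert_nops(program, window=WINDOW):
    """Pad a program with NOPs until no instruction reads a register
    written by one of the `window` instructions immediately before it.

    program: list of (name, dest, sources).
    """
    out = []
    for instr in program:
        sources = instr[2]
        while any(prev[1] is not None and prev[1] in sources
                  for prev in out[-window:]):
            out.append(NOP)
        out.append(instr)
    return out


original = [
    ("ADD",    "r3", ("r1", "r2")),
    ("CMPLEC", "r0", ("r3",)),
    ("MULC",   "r4", ("r1",)),
    ("SUB",    "r5", ("r1", "r2")),
]
rewritten = [original[0], original[2], original[3], original[1]]

for label, prog in (("original", original), ("rewritten", rewritten)):
    padded = insert_nops(prog)
    print(label, [name for name, _, _ in padded])
```

The original order needs two NOPs between ADD and CMPLEC, while the rewritten order needs none, because MULC and SUB fill those two slots with useful work.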

Data Hazard Solution 2
Stall the pipeline:
Freeze IF, RF stages for 2 cycles, inserting NOPs into ALU-stage instruction register
TIME (cycles)   i     i+1   i+2   i+3   i+4   i+5   i+6
IF              ADD   CMP   MUL   MUL   MUL   SUB
RF                    ADD   CMP   CMP   CMP   MUL   SUB
ALU                         ADD   NOP   NOP   CMP   MUL
WB                                ADD   NOP   NOP   CMP
Drawback: NOPs mean “wasted” cycles
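The stall rule can be simulated cycle by cycle. The sketch below is a simplified model of the 4-stage pipeline, not the Beta's actual control logic: it stalls whenever the instruction in RF reads a register that an instruction currently in ALU or WB has yet to write. The function name `simulate_with_stalls` and the tuple encoding are hypothetical.

```python
NOP = ("NOP", None, ())


def simulate_with_stalls(program):
    """Return one {stage: instruction} snapshot per cycle.

    program: list of (name, dest, sources). On a stall, IF and RF are frozen
    and a NOP is inserted into the ALU stage.
    """
    pending = list(program)  # instructions not yet fetched
    IF = pending.pop(0) if pending else None
    RF = ALU = WB = None
    trace = []
    while any(s is not None for s in (IF, RF, ALU, WB)):
        trace.append({"IF": IF, "RF": RF, "ALU": ALU, "WB": WB})
        in_flight = [s for s in (ALU, WB) if s is not None]
        stall = RF is not None and any(
            p[1] is not None and p[1] in RF[2] for p in in_flight
        )
        if stall:
            WB, ALU = ALU, NOP  # IF and RF keep their instructions
        else:
            WB, ALU, RF = ALU, RF, IF
            IF = pending.pop(0) if pending else None
    return trace


program = [
    ("ADD",    "r3", ("r1", "r2")),
    ("CMPLEC", "r0", ("r3",)),
    ("MULC",   "r4", ("r1",)),
    ("SUB",    "r5", ("r1", "r2")),
]
trace = simulate_with_stalls(program)
for stage in ("IF", "RF", "ALU", "WB"):
    row = [(snap[stage][0] if snap[stage] else "-") for snap in trace]
    print(f"{stage:<4}", " ".join(f"{name:<6}" for name in row))
```

In this model the four instructions take nine cycles to drain instead of the seven an unstalled 4-stage pipeline would need, so the two stall cycles show up directly as the "wasted" cycles the slide mentions.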