Pipelining the Beta: Improving CPU Performance through Pipelining in 6.004, Spring 2009, Slides of Computer Fundamentals

Baddi University of Emerging Sciences and Technologies Computer Fundamentals

These slides introduce pipelining in computer architecture, using the Beta CPU from MIT's 6.004 course (Spring 2009). They cover the basics of pipelining, its benefits, and the complications that arise from interactions between consecutive instructions, branching, and data hazards. The ultimate goal is a 5-stage pipeline that keeps CPI near 1.0 while increasing clock speed.

Typology: Slides

2012/2013

Uploaded on 04/18/2013

palmoni 🇮🇳


L22 – Pipelining the Beta 1

6.004 – Spring 2009 4/30/09

Pipelining the Beta

betta ('be-t&) n. Any of various species of

small, brightly colored, long-finned

freshwater fishes of the genus Betta,

found in southeast Asia.

beta (‘bA-t&, ‘bE-) n. 1. The second letter

of the Greek alphabet. 2. The exemplary

computer system used in 6.004.

I don’t think they mean the ﬁsh...

maybe they’ll

give me partial

credit...

modiﬁed 4/27/09 11:17

Lab #7 due Tonight!


CPU Performance

We’ve got a working Beta… can we make it fast?

MIPS = Millions of Instructions/Second

Freq = Clock Frequency, MHz

CPI = Clocks per Instruction

MIPS = Freq / CPI

To Increase MIPS:

1. DECREASE CPI.

- RISC simplicity reduces CPI to 1.0.

- CPI below 1.0? Tough... you’ll see multiple instruction issue machines in 6.823.

2. INCREASE Freq.

- Freq limited by delay along longest combinational path; hence

- PIPELINING is the key to improved performance through fast clocks.
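To make the relationship concrete, here is a minimal Python sketch of MIPS = Freq / CPI. The function name `mips` and the sample figures (a 100 MHz clock that pipelining doubles to 200 MHz) are assumptions chosen for illustration; they are not numbers from the lecture.

```python
def mips(freq_mhz: float, cpi: float) -> float:
    """Millions of instructions per second for a clock in MHz and a given CPI."""
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return freq_mhz / cpi

# Hypothetical numbers: CPI stays at 1.0 while the clock rate doubles.
baseline = mips(freq_mhz=100.0, cpi=1.0)
faster_clock = mips(freq_mhz=200.0, cpi=1.0)
print(baseline, faster_clock)
```

With CPI held at 1.0, any gain has to come from the clock, which is why the rest of the lecture concentrates on shortening the longest combinational path.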


Beta Timing

[Figure: timing of one Beta clock cycle between two CLK edges, showing New PC, PC+4, Fetch Inst., Control Logic, Read Regs, the RA2SEL, ASEL and BSEL muxes, ALU, Fetch data, +OFFSET, =0?, the WDSEL and PCSEL muxes, and the RF, PC and Mem setup times. The same operations are then redrawn as a “precedence graph”, with LD(R1,10,R0) and LDR(X,R3) as example instructions.]

Wanted: longest paths

Complications:

• some apparent paths aren’t “possible”

• operations have variable execution times (eg, ALU)

• time axis is not to scale (eg, tPD,MEM is very big!)
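Finding the longest path through a precedence graph can be sketched in a few lines of Python. The block names below are taken from the figure, but the edges and the delay values are hypothetical, picked only so the example runs; the real Beta delays are not given in these slides.

```python
from functools import lru_cache

# Hypothetical propagation delays (ns) for a few blocks named in the figure.
DELAY = {
    "PC": 1, "Fetch Inst.": 10, "Control Logic": 2,
    "Read Regs": 4, "ALU": 6, "Fetch data": 10, "WDSEL mux": 1,
}

# Hypothetical precedence edges: block -> blocks that must wait for it.
SUCCESSORS = {
    "PC": ["Fetch Inst."],
    "Fetch Inst.": ["Control Logic", "Read Regs"],
    "Control Logic": ["ALU", "WDSEL mux"],
    "Read Regs": ["ALU"],
    "ALU": ["Fetch data", "WDSEL mux"],
    "Fetch data": ["WDSEL mux"],
    "WDSEL mux": [],
}

@lru_cache(maxsize=None)
def longest_from(node: str) -> tuple[int, tuple[str, ...]]:
    """Return (total delay, path) of the slowest path starting at node."""
    best = (0, ())
    for nxt in SUCCESSORS[node]:
        candidate = longest_from(nxt)
        if candidate[0] > best[0]:
            best = candidate
    return DELAY[node] + best[0], (node,) + best[1]

delay, path = longest_from("PC")
print(delay, " -> ".join(path))
```

This sketch treats every path in the graph as possible and every delay as fixed, so it ignores the first two complications listed above; a real timing analysis would have to exclude paths that no instruction can exercise.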


Why isn’t this a 20-minute lecture?

1. The Beta isn’t combinational…

• Explicit state in register file, memory;

• Hidden state in PC.

2. Consecutive operations – instruction executions – interact:

• Jumps, branches dynamically change instruction sequence

• Communication through registers, memory

Our goals:

• Move slow components into separate pipeline stages, running clock faster

• Maintain instruction semantics of unpipelined Beta as far as possible

We’ve learned how to pipeline combinational circuits.

What’s the big deal?




Ultimate Goal: 5-Stage Pipeline

GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)

APPROACH: structure processor as 5-stage pipeline:

IF: Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to

RF: Register File stage: Reads source operands from register file, passes them to

ALU: ALU stage: Performs indicated operation, passes result to

MEM: Memory stage: If it’s a LD, use ALU result as an address, pass mem data (or ALU result if not LD) to

WB: Write-Back stage: writes result back into register file.
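The goal of a clock that "barely includes" the slowest component can be illustrated with a small Python sketch. The per-stage delays and the 1 ns pipeline-register overhead are hypothetical values, not figures from the lecture.

```python
# Hypothetical combinational delay of each stage, in ns (not from the lecture).
STAGE_DELAY_NS = {"IF": 10, "RF": 4, "ALU": 6, "MEM": 10, "WB": 2}

def unpipelined_period(stages: dict) -> int:
    """Without pipelining, one clock cycle must cover every stage in sequence."""
    return sum(stages.values())

def pipelined_period(stages: dict, register_overhead_ns: int = 0) -> int:
    """With pipelining, the clock only has to cover the slowest single stage."""
    return max(stages.values()) + register_overhead_ns

slow = unpipelined_period(STAGE_DELAY_NS)
fast = pipelined_period(STAGE_DELAY_NS, register_overhead_ns=1)
print(f"unpipelined: {slow} ns, pipelined: {fast} ns, speedup: {slow / fast:.2f}x")
```

Under these assumed numbers the memories set the clock period, which matches the slide's point that the slowest components determine how fast the pipelined clock can run.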


First Steps: A Simple 2-Stage Pipeline

[Figure: datapath of the 2-stage pipelined Beta, split into an IF stage and an EXE stage. The IF stage holds the PC, the +4 adder, the PCSEL mux (inputs include JT, XAdr, ILLOP) and the instruction memory. Pipeline registers carry the PC and the instruction (IR) into the EXE stage, which holds the register file (with RA2SEL, WASEL, WERF), the ASEL and BSEL muxes, the ALU (ALUFN, Z), the data memory and the WDSEL mux.]


2-Stage Pipelined Beta Operation

Consider a sequence of instructions:

...
ADDC(r1, 1, r2)
SUBC(r1, 1, r3)
XOR(r1, r5, r1)
MUL(r2, r6, r0)
...

Executed on our 2-stage pipeline:

TIME (cycles)   i      i+1    i+2    i+3    i+4
IF              ADDC   SUBC   XOR    MUL    ...
EXE                    ADDC   SUBC   XOR    MUL
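A pipeline diagram like the one above can be generated mechanically. The helper below is a minimal sketch, assuming an ideal pipeline with no hazards or stalls, so that each instruction simply moves one stage to the right every cycle; the function name `pipeline_diagram` is made up for this example.

```python
def pipeline_diagram(instrs, stages):
    """Return {stage: [instruction occupying that stage in each cycle]}.

    Assumes an ideal pipeline: instruction k enters the first stage in
    cycle k and advances exactly one stage per cycle.
    """
    n_cycles = len(instrs) + len(stages) - 1
    rows = {}
    for depth, stage in enumerate(stages):
        row = [""] * n_cycles
        for k, name in enumerate(instrs):
            row[k + depth] = name
        rows[stage] = row
    return rows

rows = pipeline_diagram(["ADDC", "SUBC", "XOR", "MUL"], ["IF", "EXE"])
for stage, row in rows.items():
    print(f"{stage:5}" + "".join(f"{name:6}" for name in row))
```

Passing a longer stage list, such as `["IF", "RF", "ALU", "WB"]`, produces the same staircase shape for a deeper pipeline.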


Pipeline Control Hazards

BUT consider instead:

LOOP: ADD(r1, r3, r3)
      CMPLEC(r3, 100, r0)
      BT(r0, LOOP)
      XOR(r3, -1, r3)
      MUL(r1, r2, r2)

TIME (cycles)   i      i+1    i+2    i+3
IF              ADD    CMP    BT     XOR
EXE                    ADD    CMP    BT

This is the cycle where the branch decision is made… but we’ve already fetched the following instruction, which should be executed only if branch is not taken!


Branch Alternative 2b(i)

Put USEFUL instructions in the branch delay slots; remember they will be executed whether the branch is taken or not

LOOP:  ADD(r1, r3, r3)
LOOPx: CMPLEC(r3, 100, r0)
       BT(r0, LOOPx)
       ADD(r1, r3, r3)
       SUB(r3, r1, r3)
       XOR(r3, -1, r3)
       MUL(r1, r2, r2)
       ...

We need to add this silly instruction (the SUB) to UNDO the effects of that last ADD

TIME (cycles)   i      i+1    i+2    i+3    i+4    i+5
IF              ADD    CMP    BT     ADD    CMP    BT
EXE                    ADD    CMP    BT     ADD    CMP

(Branch taken)

Pros: only two “extra” instructions are executed (on last iteration)

Cons: finding “useful” instructions that are always executed is difficult; clever rewrite may be required. Program executes differently on naïve unpipelined implementation.
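The claim that the rewritten loop computes the same result can be checked with a small Python model. This is a sketch under stated assumptions: both functions model only the ADD/CMPLEC/BT/SUB part of the code and return r3 before the XOR and MUL, the function names are invented here, and r1 must be positive for the loop to terminate.

```python
def original_loop(r1: int, r3: int) -> int:
    """Original loop with ordinary sequential (unpipelined) semantics."""
    while True:
        r3 = r3 + r1        # LOOP: ADD(r1, r3, r3)
        r0 = r3 <= 100      # CMPLEC(r3, 100, r0)
        if not r0:          # BT(r0, LOOP) falls through
            return r3

def delay_slot_loop(r1: int, r3: int) -> int:
    """Rewritten loop on a machine that always executes the slot after BT."""
    r3 = r3 + r1            # LOOP:  ADD(r1, r3, r3)
    while True:
        r0 = r3 <= 100      # LOOPx: CMPLEC(r3, 100, r0)
        r3 = r3 + r1        # delay slot: ADD(r1, r3, r3), runs either way
        if not r0:          # BT(r0, LOOPx) not taken: leave the loop
            break
    r3 = r3 - r1            # SUB(r3, r1, r3) undoes the final extra ADD
    return r3

print(original_loop(7, 0), delay_slot_loop(7, 0))
```

The delay-slot ADD does useful work on every iteration except the last, where the SUB cancels it; those are the two "extra" instructions the slide counts.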


Branch Alternative 2b(ii)

Put USEFUL instructions in the branch delay slots; annul them if branch doesn’t behave as predicted

LOOP:  ADD(r1, r3, r3)
LOOPx: CMPLEC(r3, 100, r0)
       BT.taken(r0, LOOPx)
       ADD(r1, r3, r3)
       XOR(r3, -1, r3)
       MUL(r1, r2, r2)

TIME (cycles)   i      i+1    i+2    i+3    i+4    i+5
IF              ADD    CMP    BT     ADD    CMP    BT
EXE                    ADD    CMP    BT     ADD    CMP

(Branch taken)

Pros: only one instruction is annulled (on last iteration); about 70% of branch delay slots can be filled with useful instructions

Cons: Program executes differently on naïve unpipelined implementation; not really useful with more than one delay slot.


Architectural Issue: Branch Decision Timing

BETA approach:

SIMPLE branch condition logic ... Test for Reg[Ra] = 0!

ADVANTAGE: early decision, single delay slot

ALTERNATIVES:

Compare-and-branch... (eg, if Reg[Ra] > Reg[Rb])

• MORE powerful, but

• LATER decision (hence more delay slots)

[Figure: the pipeline stages Instruction Fetch (IF), Register File (RF read), ALU, Memory and Write Back (RF write), each with its own control logic (CL) driven by the instruction passed down the pipeline.]

Suppose decision were made in the ALU stage ... then there would be 2 branch delay slots (and instructions to annul!)

Wow! I guess those guys really were thinking when they made up all those instructions


4-Stage Beta Pipeline

[Figure: datapath of the 4-stage pipelined Beta with Instruction Fetch, Register File, ALU and Write Back stages. Pipeline registers carry the PC and the instruction (IR) from stage to stage, along with the A, B, D and Y values. The register file drawn in the Write Back stage is marked "NB: SAME RF AS ABOVE!".]

Treat register file as two separate devices: combinational READ, clocked WRITE at end of pipe.

What other information do we have to pass down pipeline?

• PC (return addresses)

• instruction fields (decoding)

What sort of improvement should we expect in cycle time?


4-Stage Beta Operation

Consider a sequence of instructions:

...
ADDC(r1, 1, r2)
SUBC(r1, 1, r3)
XOR(r1, r5, r1)
MUL(r2, r6, r0)
...

Executed on our 4-stage pipeline:

TIME (cycles)   i      i+1    i+2    i+3    i+4    i+5    i+6
IF              ADDC   SUBC   XOR    MUL
RF                     ADDC   SUBC   XOR    MUL
ALU                           ADDC   SUBC   XOR    MUL
WB                                   ADDC   SUBC   XOR    MUL


Pipeline “Data Hazard”

BUT consider instead:

ADD(r1, r2, r3)
CMPLEC(r3, 100, r0)
MULC(r1, 100, r4)
SUB(r1, r2, r5)

TIME (cycles)   i      i+1    i+2    i+3    i+4    i+5    i+6
IF              ADD    CMP    MUL    SUB
RF                     ADD    CMP    MUL    SUB
ALU                           ADD    CMP    MUL    SUB
WB                                   ADD    CMP    MUL    SUB

r3 fetched: cycle i+2 (CMP in RF). r3 available: end of cycle i+3 (ADD in WB).

Oops! CMP is trying to read Reg[R3] during cycle i+2 but ADD doesn’t write its result into Reg[R3] until the end of cycle i+3!
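The timing argument above generalizes into a simple check, sketched here in Python. The tuple encoding of instructions, the function name `find_hazards` and the `window` parameter are assumptions made for this example; the window of 2 follows from the slide's timing (a result written at the end of cycle k+3 is first readable by an instruction in RF during cycle k+4), assuming no bypassing.

```python
# Each instruction is (name, source registers, destination register).
PROGRAM = [
    ("ADD",    ["r1", "r2"], "r3"),
    ("CMPLEC", ["r3"],       "r0"),
    ("MULC",   ["r1"],       "r4"),
    ("SUB",    ["r1", "r2"], "r5"),
]

def find_hazards(program, window=2):
    """List (writer index, reader index, register) read-after-write conflicts.

    In the 4-stage pipeline, instruction k writes its result at the end of
    cycle k+3, while instruction j reads its operands during cycle j+1,
    so j sees a stale value whenever 0 < j - k <= window.
    """
    hazards = []
    for j, (_, sources, _) in enumerate(program):
        for k in range(max(0, j - window), j):
            dest = program[k][2]
            if dest in sources:
                hazards.append((k, j, dest))
    return hazards

print(find_hazards(PROGRAM))
```

The sketch compares register names only; it does not model the Beta's always-zero register or hazards through memory.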


Data Hazard Solution 1

“Program around it”... document weirdo semantics, declare it a software problem.

- Breaks sequential semantics!

- Costs code efficiency.

EXAMPLE: Rewrite

ADD(r1, r2, r3)
CMPLEC(r3, 100, r0)
MULC(r1, 100, r4)
SUB(r1, r2, r5)

as

ADD(r1, r2, r3)
MULC(r1, 100, r4)
SUB(r1, r2, r5)
CMPLEC(r3, 100, r0)

How often can we do this?

Programmer’s fallback: Insert NOPs (sigh!)
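The NOP fallback can be automated; the Python sketch below shows one way, under the same assumptions as the slide's 4-stage timing (no bypassing, so a reader must trail its writer by at least three instruction slots). The tuple encoding and the name `insert_nops` are hypothetical, and a real assembler pass would also try reordering first, as in the rewrite above.

```python
# Each instruction is (name, source registers, destination register or None).
PROGRAM = [
    ("ADD",    ["r1", "r2"], "r3"),
    ("CMPLEC", ["r3"],       "r0"),
    ("MULC",   ["r1"],       "r4"),
    ("SUB",    ["r1", "r2"], "r5"),
]

NOP = ("NOP", [], None)

def insert_nops(program, window=2):
    """Pad with NOPs so no instruction reads a register written by either
    of the `window` instructions immediately before it."""
    out = []
    for name, sources, dest in program:
        needed = 0
        for back in range(1, min(window, len(out)) + 1):
            prev_dest = out[-back][2]
            if prev_dest is not None and prev_dest in sources:
                # Push the reader far enough away from this writer.
                needed = max(needed, window - back + 1)
        out.extend([NOP] * needed)
        out.append((name, sources, dest))
    return out

print([name for name, _, _ in insert_nops(PROGRAM)])
```

On the original ordering this costs two wasted cycles, while the reordered sequence from the example needs none, which is the code-efficiency cost the slide refers to.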
