Pipelining in Digital Circuits: A Case Study from MIT 6.004, Slides of Computer Fundamentals

Part of the MIT 6.004 Spring 2009 course materials, focusing on pipelining in digital circuits. The slides discuss the concept of pipelining, its advantages and disadvantages, and various pipelining methodologies, with examples and summaries to help students understand the concept.

Typology: Slides

2012/2013

Uploaded on 04/18/2013

palmoni
L08 - Pipelining 1
6.004 – Spring 2009 3/3/09
Pipelining
what Seymour Cray taught the laundry industry
I’ve got 3 months worth of laundry to do tonight…
Funny, considering that he’s only got one outfit…
Due Thursday: Lab #3
modified 2/23/09 10:45
L08 - Pipelining 2
Forget circuits… let’s solve a “Real Problem”
Device: Washer
Function: Fill, Agitate, Spin
WasherPD = 30 mins
Device: Dryer
Function: Heat, Spin
DryerPD = 60 mins
INPUT:
dirty laundry
OUTPUT:
6 more weeks
L08 - Pipelining 3
One load at a time
Everyone knows that the real reason that MIT students put off doing laundry so long is not because they procrastinate, are lazy, or even have better things to do. The fact is, doing one load at a time is not smart.
Total = WasherPD + DryerPD = 90 mins
L08 - Pipelining 4
Doing N loads of laundry
Here’s how they do laundry at Harvard, the “combinational” way.
Total = N*(WasherPD + DryerPD) = N*90 mins
(Of course, this is just an urban legend. No one at Harvard actually does laundry. The butlers all arrive on Wednesday morning, pick up the dirty laundry and return it all pressed and starched in time for afternoon tea.)
[Figure: loads done one after another, Step 1 through Step 4. Figures by MIT OpenCourseWare.]

L08 - Pipelining 5

Doing N Loads… the MIT way

MIT students “pipeline” the laundry process. That’s why we wait!

Total = N * Max(WasherPD, DryerPD) = N*60 mins

Actually, it’s more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.
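The two totals can be checked with a short calculation. This is a minimal sketch, assuming the 30-minute washer and 60-minute dryer from the earlier slide; the function names are illustrative and not part of the course material.

```python
WASHER_PD = 30  # minutes, from the "Real Problem" slide
DRYER_PD = 60   # minutes, from the "Real Problem" slide

def sequential_total(n_loads):
    """One load at a time: each load uses the washer, then the dryer."""
    return n_loads * (WASHER_PD + DRYER_PD)

def pipelined_total(n_loads):
    """Wash load i+1 while load i is drying.

    The first load takes the full washer + dryer time (startup transient);
    every later load finishes one bottleneck period after the previous one.
    """
    if n_loads == 0:
        return 0
    return WASHER_PD + DRYER_PD + (n_loads - 1) * max(WASHER_PD, DRYER_PD)

for n in (1, 4, 12):
    print(n, sequential_total(n), pipelined_total(n))
```

With these numbers the pipelined total simplifies to N*60 + 30, which matches the startup-transient correction on the slide.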

L08 - Pipelining 6

Performance Measures

Latency: The delay from when an input is established until the output associated with that input becomes valid.

(Harvard Laundry = 90 mins)
(MIT Laundry = 120 mins)

Throughput: The rate at which inputs or outputs are processed.

(Harvard Laundry = 1/90 outputs/min)
(MIT Laundry = 1/60 outputs/min)

Assuming that the wash is started as soon as possible and waits (wet) in the washer until dryer is available.
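The 120-minute latency follows from the wet wash waiting for the dryer. The sketch below is one way to model that, assuming a single washer and a single dryer and that each wash starts as soon as the washer is empty; `simulate` and its parameters are hypothetical names chosen for illustration.

```python
def simulate(n_loads, washer_pd=30, dryer_pd=60):
    """Return (start, finish) minutes per load for one washer and one dryer.

    A finished wash waits (wet) in the washer until the dryer is free,
    so the next wash cannot start before that hand-off.
    """
    times = []
    washer_free = 0  # minute at which the washer can accept a load
    dryer_free = 0   # minute at which the dryer can accept a load
    for _ in range(n_loads):
        start = washer_free
        to_dryer = max(start + washer_pd, dryer_free)  # wait if dryer is busy
        finish = to_dryer + dryer_pd
        washer_free = to_dryer  # washer empties at the hand-off
        dryer_free = finish
        times.append((start, finish))
    return times

times = simulate(5)
latencies = [finish - start for start, finish in times]
gaps = [b[1] - a[1] for a, b in zip(times, times[1:])]
print(latencies)  # the first load is the startup transient
print(gaps)       # minutes between successive outputs
```

In this model every load after the first takes 120 minutes from start to finish, while a finished load still emerges every 60 minutes.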

L08 - Pipelining 7

Okay, back to circuits…

[Figure: combinational circuit with blocks F, G, and H; input X, output P(X)]

For combinational logic: latency = t_PD, throughput = 1/t_PD.

We can’t get the answer faster, but are we making effective use of our hardware at all times?

[Timing diagram: X, F(X), G(X), P(X)]

F & G are “idle”, just holding their outputs stable while H performs its computation

L08 - Pipelining 8

Pipelined Circuits

use registers to hold H’s input stable!

[Figure: F, G, and H with pipeline registers; delays F = 15, G = 20, H = 25]

Now F & G can be working on input X_{i+1} while H is performing its computation on X_i. We’ve created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.

Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers:

                    latency          throughput
unpipelined         45 ns            1/45 ns
2-stage pipeline    50 ns (worse)    1/25 ns (better)

[Figure: Step 1, Step 2, Step 3, ... Figure by MIT OpenCourseWare.]
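The table entries can be reproduced with a few lines of arithmetic. This sketch assumes F and G operate in parallel and both feed H, which is consistent with the 45 ns unpipelined figure (20 + 25); the function names are illustrative only.

```python
from fractions import Fraction

def unpipelined(f_pd, g_pd, h_pd):
    """Latency (ns) and throughput (1/ns) with no registers."""
    t_pd = max(f_pd, g_pd) + h_pd  # slower of F, G, then H
    return t_pd, Fraction(1, t_pd)

def two_stage(f_pd, g_pd, h_pd):
    """Stage 1 = {F, G}, stage 2 = H, with ideal zero-delay registers."""
    clock = max(f_pd, g_pd, h_pd)  # the slowest stage sets the clock period
    return 2 * clock, Fraction(1, clock)

print(unpipelined(15, 20, 25))
print(two_stage(15, 20, 25))
```

The pipelined latency is two full clock periods even though stage 1 needs only 20 ns, which is why latency gets slightly worse while throughput improves.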

L08 - Pipelining 13

Pipeline Example

[Figure: circuit with components A, B, and C between input X and output Y; numeric labels 2, 1, 1, 2]

            LATENCY    THROUGHPUT
0-pipe:
1-pipe:
2-pipe:
3-pipe:

OBSERVATIONS:
  • 1-pipeline improves neither L nor T.
  • T improved by breaking long combinational paths, allowing faster clock.
  • Too many stages cost L, don’t improve T.
  • Back-to-back registers are often required to keep pipeline well-formed.

L08 - Pipelining 14

Pipelining Summary

Advantages:

  • Allows us to increase thruput, by breaking up long combinational paths and (hence) increasing clock frequency

Disadvantages:

  • May increase latency...
  • Only as good as the weakest link: slowest step constrains system thruput.

Isn’t there a way around this “weak link” problem?

This bottleneck is the only problem

L08 - Pipelining 15

Pipelined Components

but... but... How can I pipeline a clothes dryer???

Pipelined systems can be hierarchical:

  • Replacing a slow combinational component with a k-pipe version may increase clock frequency
  • Must account for new pipeline stages in our plan

[Figure: A’ (2-pipe), B, and C between input X and output Y; pipeline stages numbered 1 to 4. 4-stage pipeline, thruput=]

L08 - Pipelining 16

How do 6.004 Aces do Laundry?

They work around the bottleneck. First, they find a place with twice as many dryers as washers.

Throughput = ______ loads/min

Latency = ______ mins/load

[Figure: Step 1, Step 2, Step 3, Step 4. Figure by MIT OpenCourseWare.]
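One way to fill in the blanks is a steady-state rate calculation. The sketch below is a simplified model, not the slide's own derivation: it assumes the 30-minute washer and 60-minute dryer from earlier, assumes washes start as soon as a washer is empty, and uses the hypothetical function name `steady_state`.

```python
from fractions import Fraction

def steady_state(washers, dryers, washer_pd=30, dryer_pd=60):
    """Return (loads per minute, minutes per load) in steady state."""
    washer_rate = Fraction(washers, washer_pd)
    dryer_rate = Fraction(dryers, dryer_pd)
    throughput = min(washer_rate, dryer_rate)  # the bottleneck sets the rate
    # Each washer starts a load once every washers/throughput minutes. If the
    # dryers are the bottleneck, the load sits wet in the washer for the rest
    # of that cycle before it can move on.
    time_in_washer = max(washer_pd, washers / throughput)
    return throughput, time_in_washer + dryer_pd

print(steady_state(1, 1))  # one washer, one dryer
print(steady_state(1, 2))  # twice as many dryers as washers
```

Under these assumptions, doubling the dryers gives 1/30 loads/min with a 90-minute latency, compared with 1/60 loads/min and 120 minutes for a single dryer.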

L08 - Pipelining 17

Back to our bottleneck...

Recall our earlier example...

[Figure: pipelined circuit with components A (4 ns), B (3 ns), C (8 ns), D (4 ns), E (2 ns), F (5 ns)]

T = 1/8ns
L = 24ns

  • C – the slowest component – limits clock period to 8 ns.
  • HENCE throughput limited to 1/8ns.

We could improve throughput by
  • Finding a pipelined version of C; OR ...
  • interleaving multiple copies of C!

L08 - Pipelining 18

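Circuit Interleaving

We can simulate a pipelined version of a slow component by replicating the critical element and alternating inputs between the various copies.

[Figure: input X_i feeds two latches (D, G, Q), one in front of each copy of C (C_0 and C_1); a mux with inputs 0 and 1 selects between the copies to form C’, and an output register (D, Q) produces C(X_{i-2}). A flip-flop clocked by clk generates the select signal Q.]

This is a simple 2-state FSM that alternates between 0 and 1 on each clock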

L08 - Pipelining 19

Circuit Interleaving

[Figure: the interleaved circuit from the previous slide, with waveforms for clk, Q, the C_0 and C_1 outputs (C_even, C_odd), and the mux output]

When Q is 1 the lower path is combinational (the latch is open), yet the output of the upper path will be enabled onto the input of the output register ready for the NEXT clock edge.

Meanwhile, the other latch maintains the input from the last clock.

“It acts like a 2-stage pipeline”

L08 - Pipelining 20

Circuit Interleaving

[Figure: five snapshots of the interleaved circuit as inputs X_0, X_1, X_2, X_3 are presented on successive clocks; C(X_0) is at the output while X_2 is at the input, and C(X_1) while X_3 is at the input]

Latency = 2 clocks

  • Clock period 0: X_0 presented at input, propagates thru upper latch, C_0.
  • Clock period 1: X_1 presented at input, propagates thru lower latch, C_1. C_0(X_0) propagates to register inputs.
  • Clock period 2: X_2 presented at input, propagates thru upper latch, C_0. C_0(X_0) loaded into register, appears at output.

N registers… N-way interleave

2-Clock Martinizing
“In by t_i, out by t_{i+2}”

N-way interleaving is equivalent to N pipeline stages...
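The clock-by-clock behavior can be mimicked in software. This is a behavioral sketch only, assuming each copy of C needs as many clock periods as there are copies to finish; it abstracts the latches, mux, and select FSM into list operations, and `interleaved` is a hypothetical name.

```python
def interleaved(c, inputs, ways=2):
    """Return the output-register contents during each clock period.

    c      : the slow function being replicated
    inputs : X_0, X_1, ... presented one per clock
    ways   : number of copies of C (None marks "no valid data yet")
    """
    held = [None] * ways  # input held by the latch in front of each copy
    outputs = []
    for clock in range(len(inputs) + ways):
        sel = clock % ways  # simple FSM: 0, 1, ..., ways-1, 0, ...
        # Copy `sel` has had `ways` clocks to work on its held input,
        # so the output register loads that result on this edge.
        outputs.append(c(held[sel]) if held[sel] is not None else None)
        # The same copy now latches the newly presented input.
        held[sel] = inputs[clock] if clock < len(inputs) else None
    return outputs

print(interleaved(lambda x: x * x, [1, 2, 3, 4]))
# → [None, None, 1, 4, 9, 16]
```

Each result appears two clocks after its input while a new input is still accepted on every clock, which is the sense in which the 2-way interleave behaves like a 2-stage pipeline.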

L08 - Pipelining 25

Self-timed Example

a glimpse of an asynchronous, locally-timed discipline

[Figure: components A, B, and C passing A(X) with handshake captions “here’s X…” and “…Got it.”]

Elegant, timing-independent design:
  • Each component specifies its own time constraints
  • Local adaptation to special cases (eg, multiplication by 0)
  • Module performance improvements automatically exploited
  • Can be made asynchronous (no clock at all!) or synchronous

L08 - Pipelining 26

Control Structure Taxonomy

Globally Timed, Synchronous:
  Centralized clocked FSM generates all control signals.
  Easy to design but fixed-sized interval can be wasteful (no data-dependencies in timing)

Globally Timed, Asynchronous:
  Central control unit tailors current time slice to current tasks.
  Large systems lead to very complicated timing generators… just say no!

Locally Timed, Synchronous:
  Start and Finish signals generated by each major subsystem, synchronously with global clock.
  The best way to build large systems that have independently-timed components.

Locally Timed, Asynchronous:
  Each subsystem takes asynchronous Start, generates asynchronous Finish (perhaps using local clock).
  The “next big idea” for the last several decades: a lot of design work to do in general, but extra work is worth it in special cases

L08 - Pipelining 27

Summary

  • Latency (L) = time it takes for given input to arrive at output
  • Throughput (T) = rate at which new outputs appear
  • For combinational circuits: L = t_PD of circuit, T = 1/L
  • For K-pipelines (K > 0):
    • always have register on output(s)
    • K registers on every path from input to output
    • Inputs available shortly after clock i, outputs available shortly after clock (i+K)
    • T = 1/(t_PD,REG + t_PD of slowest pipeline stage + t_SETUP)
      • more throughput → split slowest pipeline stage(s)
      • use replication/interleaving if no further splits possible
    • L = K / T
      • pipelined latency ≥ combinational latency