Computer Architecture: Understanding Pipelining and Data Hazards - Prof. Richard Whaley, Study notes of Computer Architecture and Organization

An in-depth analysis of computer architecture, focusing on pipelining and data hazards. It covers various concepts such as pipeline stages, pipeline registers, structural hazards, data hazards, and control hazards. The document also discusses techniques to reduce branch stalls and memory port as structural hazards. It includes numerous figures and examples to illustrate the concepts.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

1. Appendix A Concepts
  • Basics of pipelining
  • Hazards:
    – Structural
    – Data
    – Control
  • Pipeline implementation
    – Data path
    – Control
    – Exceptions
  • Multicycle operations
  • Precise exceptions/interrupts
2. Pipelining
  • Pipelining like an assembly line
    – Fine-grained parallelism between pipe stages
    – Each step (pipe step) takes 1 cycle
    – Different steps from different instructions are processed in parallel
  • Pipelining improves throughput
    – Individual instructions will take slightly longer due to pipeline overhead
3. Simple Integer Pipeline
1. IF (instruction fetch): fetch inst from i-cache and increment PC
2. ID (inst decode):
  • Decode inst
  • Read vals from register file
  • Check for equality on those vals (for branch)
  • Sign-extend immed val
  • Calc PC-relative target address using immed
3. EX (execution/effective address):
  • For mem, calc EA (reg+imm)
  • Perform ALU oper on two input regs
  • Perform ALU oper on reg and imm
4. MEM (memory access): ld val from or store val to d-cache
5. WB (write back): update reg file with result of op or ld
4. Simple RISC Pipeline
  • Time moves left-to-right, inst order top-to-bottom
  • If we handle each inst sequentially, execution is unpipelined
  • Notice at clock 5, we have 5 inst in flight in parallel
    – Ideally, will give 5-fold speedup over unpipelined code!

                 Clock number
Inst #           1     2     3     4     5     6     7     8     9
inst i           IF    ID    EX    MEM   WB
inst i+1               IF    ID    EX    MEM   WB
inst i+2                     IF    ID    EX    MEM   WB
inst i+3                           IF    ID    EX    MEM   WB
inst i+4                                 IF    ID    EX    MEM   WB
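
The chart above can be generated mechanically, which also shows where the "ideally 5-fold" figure comes from. The sketch below is only an illustration: the function names are made up here, and it assumes one instruction enters the pipe per cycle with no stalls and no pipeline overhead.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_chart(num_insts, stages=STAGES):
    """One row per instruction; row k is shifted right by k cycles."""
    total_cycles = num_insts + len(stages) - 1
    rows = []
    for k in range(num_insts):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[k + s] = name
        rows.append(row)
    return rows

def ideal_speedup(num_insts, depth=len(STAGES)):
    """Unpipelined cycles / pipelined cycles (idealized: no stalls, no overhead)."""
    unpipelined = num_insts * depth      # each inst runs all stages alone
    pipelined = num_insts + depth - 1    # fill the pipe once, then 1 inst per cycle
    return unpipelined / pipelined

for k, row in enumerate(pipeline_chart(5)):
    print(f"inst i+{k}  " + " ".join(f"{cell:<3}" for cell in row))

print(ideal_speedup(5))        # short run: pipe fill time still matters
print(ideal_speedup(100_000))  # long run: approaches the pipe depth
```

For only five instructions the speedup is well below 5, because the pipe spends four cycles filling. The 5-fold figure is the limit for long instruction streams.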


Pipeline as series of time-shifted data paths

  • Cycle 5 shows what happens when pipe is full:
    – Each pipe stage processing diff inst
  • Running all stages in parallel may require duplicate hardware
  • Reg read in ID stage occurs in last half of cycle, reg write in WB occurs in first, so no conflict

Computer Architecture, Henn & Patt, Fig A.2, pg A-

Reducing Branch Stall / Understanding Pipelining

Computer Architecture, Henn & Patt, Fig A.24, pg A-

  • MUX (multiplexor): Used to decide which of N inputs gets passed thru
  • Data & control must be passed between stages
  • Internal regs (pipeline registers or latches) are used
    – Named for stages they connect
  • Pipe regs are state elements & retain vals between pipe stages
    – Other state elts include memory, general regs, the PC, etc.
    – Any retained info must be passed thru pipe regs as long as needed
      ∗ Dest reg passed through to last pipereg
  • Branch haz reduced by moving zero test & branch-targ calc into ID stage
    – Requires extra hardware (adder, zero test)

Pipelining Limitations

  • Lower bounds on cycle time:
    – Overhead from info flow in piperegs
    – Imbalance between stages (IF, MEM, EX often fat)
    – Minimum realizable cycle time due to clock skew
  • N-deep pipe give N-fold speedup?

    $\text{speedup} = \dfrac{\text{pipedepth}}{1 + \text{pipe stalls per inst}}$

  • Pipeline hazards:
    – Structural: resource conflicts
    – Data: result dependencies
    – Control: changing PC (branch)
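
As a quick numeric check of the speedup relation on this slide (pipe depth divided by one plus stalls per instruction), here is a small sketch. The stall rates are hypothetical inputs chosen for illustration, not measurements from the notes.

```python
def pipeline_speedup(depth, stalls_per_inst):
    """Speedup over the unpipelined machine: depth / (1 + stalls per inst)."""
    return depth / (1.0 + stalls_per_inst)

# Hypothetical stall rates for a 5-deep pipe.
for stalls in (0.0, 0.25, 1.0):
    print(stalls, pipeline_speedup(5, stalls))
```

With no stalls the 5-deep pipe gets its full 5-fold speedup. One stall per instruction on average halves it.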

Structural Hazards

  • Structural hazard: Resource conflict due to hardware not supporting all possible combinations of overlapping instructions.
  • Why do structural hazards occur?
    – A resource not duped enough (mem)
    – FU not fully pipelined (div)
  • Why not design hardware so they never occur?
    – For rare inst, pipe will be empty, so non-pipelined inst better
    – If case is rare, not worth expense
  • We will see that MIPS detects all stalls in ID, which requires extra hardware

Example of RAW Data Hazard

  • Which of these insts have data hazard with DADD?

DADD R1, R2, R
DSUB R4, R1, R5
AND  R6, R1, R
OR   R8, R1, R
XOR  R10, R1, R

Computer Architecture, H&P, Fig A.6, pg A-
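
One way to reason about the question above is by distance from the producer. The sketch below assumes the 5-stage pipe from these notes with no forwarding, and a register file that writes in the first half of a cycle and reads in the second half. Operands that are cut off in the notes appear as the placeholder "R?"; the function and field names are illustrative only.

```python
from collections import namedtuple

Inst = namedtuple("Inst", "op dest srcs")

def raw_hazards(insts, producer_index=0):
    """Classify each later instruction's dependence on the producer.

    With the producer's IF in cycle 1, its WB is in cycle 5; a consumer
    `distance` slots later reads registers (ID) in cycle 2 + distance.
    """
    producer = insts[producer_index]
    result = {}
    for distance, inst in enumerate(insts[producer_index + 1:], start=1):
        if producer.dest not in inst.srcs:
            result[inst.op] = "no dependence"
        elif distance <= 2:
            result[inst.op] = "RAW hazard"            # ID happens before WB
        elif distance == 3:
            result[inst.op] = "safe: split-cycle reg file"
        else:
            result[inst.op] = "safe: write already done"
    return result

# "R?" marks an operand that is truncated in the notes.
program = [
    Inst("DADD", "R1", ("R2", "R?")),
    Inst("DSUB", "R4", ("R1", "R5")),
    Inst("AND", "R6", ("R1", "R?")),
    Inst("OR", "R8", ("R1", "R?")),
    Inst("XOR", "R10", ("R1", "R?")),
]

for op, verdict in raw_hazards(program).items():
    print(op, verdict)
```

Under these assumptions only DSUB and AND read R1 too early. OR reads it in the same cycle that DADD writes it, and XOR reads it afterwards.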

Ameliorating Data Hazards through Forwarding

  • Forwarding (AKA bypassing/short-circuiting) overcomes many data hazards.
  • Vals forwd frm piperegs directly after computation
    – simple ex, forwarded only to same unit
  • Data paths forward frm EX/MEM, MEM/WB piperegs
  • Comparators check if reg src# of inst matches dest of prev inflight inst
    – If so, control logic chooses forwarded val
  • Why fwdng not needed for OR?

Computer Architecture, H&P, Fig A.7, pg A-

Forwarding from MEM

  • Can forward between any stages
  • If fwd from MEM, succ data access does not stall

DADD R1, R2, R
LD   R4, 0(R1)
SD   R4, 12(R1)

  • Would DADD stall?

LD   R4, 0(R1)     IF    ID    EX    DM    WB
DADD R5, R4, R           IF    ID    EX    DM    WB

Computer Architecture, H&P, Fig A.8, pg A-

Forwarding Limitations

LD   R1, 0(R2)     IF    ID    EX    DM    WB
DSUB R4, R1, R           IF    ID    EX    DM    WB
AND  R6, R1, R                 IF    ID    EX    DM    WB
OR   R8, R1, R                       IF    ID    EX    DM    WB

  • Cannot forward into the past
    – End cycle 4 to beg 4 will not work
  • DSUB must stall

Computer Architecture, H&P, Fig A.9, pg A-

Stalling for Correct Execution

  • Must stall DSUB at clock 4 so LD can produce its result
  • Why does stall affect subsequent insts?
  • Why does stall not affect prior insts?

                 Clock number
Inst             1     2     3     4     5     6     7     8     9
LD   R1,0(R2)    IF    ID    EX    MEM   WB
DSUB R4,R1,R           IF    ID    EX    MEM   WB
AND  R6,R1,R                 IF    ID    EX    MEM   WB
OR   R8,R1,R                       IF    ID    EX    MEM   WB

LD   R1,0(R2)    IF    ID    EX    MEM   WB
DSUB R4,R1,R           IF    ID    stall EX    MEM   WB
AND  R6,R1,R                 IF    stall ID    EX    MEM   WB
OR   R8,R1,R                       stall IF    ID    EX    MEM   WB

Forwarding Adds Complexity to Hardware

  • Forwarding to ALU requires 3 extra inputs to MUX and 3 new paths to these inputs
  • Value comes from:
    1. ID/EX reg (frm reg file)
    2. EX/MEM (ALU result of prev inst)
    3. MEM/WB (ALU or load result of prev inst)
  • Control of MUX requires comparing src reg# of current inst with dest reg#s of prev inst in ID/EX, EX/MEM and MEM/WB piperegs

Computer Architecture, H&P, Fig A.23, pg A-
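
The MUX control described above can be sketched as a selection function for one ALU operand. This is a simplified illustration, not the figure's actual logic: the dictionary fields `writes_reg` and `dest` are hypothetical names, and the sketch assumes the load interlock has already stalled any consumer sitting directly behind a load.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Choose where one ALU operand comes from.

    ex_mem / mem_wb describe the instructions currently held in those
    pipeline registers, with hypothetical fields:
      'writes_reg' (bool) and 'dest' (register number).
    Returns 'EX/MEM', 'MEM/WB', or 'ID/EX' (value read from the reg file).
    """
    if src_reg == 0:
        return "ID/EX"        # R0 is hardwired to zero in MIPS, never forwarded
    if ex_mem["writes_reg"] and ex_mem["dest"] == src_reg:
        return "EX/MEM"       # most recent producer wins
    if mem_wb["writes_reg"] and mem_wb["dest"] == src_reg:
        return "MEM/WB"
    return "ID/EX"

older = {"writes_reg": True, "dest": 1}
newer = {"writes_reg": True, "dest": 1}
store = {"writes_reg": False, "dest": 1}

print(forward_select(1, newer, older))   # both match: take the newer result
print(forward_select(1, store, older))   # EX/MEM inst writes no register
print(forward_select(2, newer, older))   # no match: use the reg file value
```

Checking EX/MEM before MEM/WB matters when two in-flight instructions write the same register, since the later one holds the value the consumer should see.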

Load RAW Interlock for Integer Pipeline

  • Pass frm ID to EX: inst is issued
  • All data haz det in ID!
  • Comparators det if two reg# the same
  • Only prob comes with load in EX and use in ID, as shown in table
  • Insert bubble if read in ID, load in EX, and read# matches dest#

Code              Result                          Action
LD   R1,45(R2)    No dep                          R1 not used after EX, so no action
DADD R5,R6,R
DSUB R8,R6,R7
OR   R9,R6,R

LD   R1,45(R2)    Stall for depend                Comparators det use of R1 in DADD, stall DADD
DADD R5,R1,R                                      (and succ inst) before DADD enters EX
DSUB R8,R6,R7
OR   R9,R6,R

LD   R1,45(R2)    Depend defeated by forwarding   Comp detect use of R1 in DSUB, forward ld val
DADD R5,R6,R                                      in time for DSUB to enter EX
DSUB R8,R1,R7
OR   R9,R6,R

LD   R1,45(R2)    Depend, but accesses in order   Read of R1 by OR in 2nd half of ID, while write
DADD R5,R6,R                                      occurred in 1st half (WB of LD)
DSUB R8,R6,R7
OR   R9,R1,R
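
The four table rows differ only in how far the first reader of R1 sits behind the load, so the "Result" column can be reproduced from that distance. The sketch below is an illustration with made-up function names. Operands truncated in the notes are written as the placeholder "R?".

```python
def needs_bubble(inst_in_ex_is_load, load_dest, srcs_in_id):
    """Interlock condition from the notes: load in EX, read in ID,
    and a read reg# matches the load's dest reg#."""
    return inst_in_ex_is_load and load_dest in srcs_in_id

def classify(load_dest, following_srcs):
    """following_srcs: source-register tuples of the insts after the load,
    in program order. Returns the table's Result for the first reader."""
    for distance, srcs in enumerate(following_srcs, start=1):
        if load_dest in srcs:
            if distance == 1:
                return "stall for depend"
            if distance == 2:
                return "depend defeated by forwarding"
            return "depend, but accesses in order"
    return "no dep"

# Sources of DADD, DSUB, OR for each table row ("R?" = truncated in notes).
rows = [
    [("R6", "R?"), ("R6", "R7"), ("R6", "R?")],
    [("R1", "R?"), ("R6", "R7"), ("R6", "R?")],
    [("R6", "R?"), ("R1", "R7"), ("R6", "R?")],
    [("R6", "R?"), ("R6", "R7"), ("R1", "R?")],
]

for row in rows:
    print(classify("R1", row), "| bubble:", needs_bubble(True, "R1", row[0]))
```

Only the second row trips `needs_bubble`, which matches the notes: the interlock fires only when the reader is in ID while the load is in EX.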

Instruction Scheduling

  • Some dependency stalls can be defeated by instruction scheduling
  • Simple scheduling may be done in hardware
  • More complex scheduling (like this) may be done by compiler

Original:

LW    R1, 0(R4)
LW    R2, 4(R4)
DADD  R3, R2, R
SW    R3, 8(R4)
DADDI R4, R4,

Scheduled:

LW    R1, 0(R4)
LW    R2, 4(R4)
DADDI R4, R4,
DADD  R3, R2, R
SW    R3, 4(R4)
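
The benefit of the reordering above can be checked by counting load-use stalls in each version. This sketch assumes the 5-stage pipe with forwarding, where a load followed immediately by a reader of its result costs one bubble. The tuple encoding is made up for illustration, and "R?" stands for the operand truncated in the notes.

```python
def count_load_use_stalls(program):
    """program: list of (op, dest, srcs). Adds one stall whenever an
    instruction reads a register loaded by the instruction right before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

original = [
    ("LW", "R1", ("R4",)),
    ("LW", "R2", ("R4",)),
    ("DADD", "R3", ("R2", "R?")),   # reads R2 right after its load
    ("SW", None, ("R3", "R4")),
    ("DADDI", "R4", ("R4",)),
]

scheduled = [
    ("LW", "R1", ("R4",)),
    ("LW", "R2", ("R4",)),
    ("DADDI", "R4", ("R4",)),       # independent inst fills the load delay
    ("DADD", "R3", ("R2", "R?")),
    ("SW", None, ("R3", "R4")),
]

print(count_load_use_stalls(original), count_load_use_stalls(scheduled))
```

Moving DADDI up changes R4 before the store executes, which is why the store offset in the scheduled version differs from the original.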

Pipeline for ‘Predict Untaken’ w/o Delay Slot

  • If pred wrong, change op to noop before state change
  • Predict ‘taken’ requires delay slot or computing branch target in IF!
  • Why is this better than always delaying?
  • What are advantages of predicting taken/untaken?

                 Clock number
Inst             1     2     3     4     5     6     7     8
untaken br       IF    ID    EX    MEM   WB
inst i+1               IF    ID    EX    MEM   WB
inst i+2                     IF    ID    EX    MEM   WB
inst i+3                           IF    ID    EX    MEM   WB

taken br         IF    ID    EX    MEM   WB
inst i+1               IF    nop   nop   nop   nop
targ                         IF    ID    EX    MEM   WB
targ+1                             IF    ID    EX    MEM   WB
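
One way to approach the "why is this better than always delaying?" question is to compare effective CPI under the two policies. The sketch below takes the one-cycle taken-branch penalty from the chart above. The branch frequency and taken fraction are hypothetical inputs, and other stalls are ignored.

```python
def cpi_always_stall(branch_freq, penalty=1, base_cpi=1.0):
    """Every branch pays the penalty, taken or not."""
    return base_cpi + branch_freq * penalty

def cpi_predict_untaken(branch_freq, taken_frac, penalty=1, base_cpi=1.0):
    """Only taken (mispredicted) branches pay the penalty."""
    return base_cpi + branch_freq * taken_frac * penalty

# Hypothetical workload: 20% branches, 60% of them taken.
print(cpi_always_stall(0.20))
print(cpi_predict_untaken(0.20, 0.60))
```

Predicting untaken never does worse than always stalling, and it recovers the full penalty on every branch that falls through.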

Delay Slots

  • Instruction in delay slot executed regardless of branch outcome
  • If delay slot full, no bubble in pipe regardless of prediction
  • Compiler must find valid inst(s)
    – Best choice is to use independent inst from above branch
    – Otherwise, use inst from fall-thru or target, but only if inst doesn’t change program behavior when executed uselessly (no interrupt, no side-effect)
  • Some archs allow for a cancelling or nullifying branch, which nops delay slot inst if branch-contained prediction is wrong
  • Additional restrictions on delay slot inst (branch)
  • Can cause complications (return from interrupt)
  • # of delay slots is a pipeline detail visible in ISA
  • Less common these days

Exceptions/interrupts

  • Exceptions (AKA interrupt): Unscheduled events that change the normal execution order of instructions (e.g., by calling an exception handler).
  • Examples: I/O, call to system space, int overflow, FP anomaly, memory protection violation, etc.
  • Must save machine state, handle exception, & if possible, restart normal execution
  • Exceptions more complex with pipelining, as mult inst in flight
  • Precise exceptions guarantee inst before the fault are completed, and inst following it are not allowed to change machine state until after exception is handled.

Classifying Exceptions

  • Synchronous: event occurs at same place every execution.
  • User requested: Prog directly asks for exception.
  • Maskable: User can override exception.
  • Within: Excep occurs during the internal exec of an inst, not in between separate inst.
  • Resume: execution resumes after handler, else terminate

Exception          Sync/Async   User/Coerced   User Mask   Within/Between   Resume/Term
I/O dev req        async        coerced        NO          between          resume
OS call            sync         user           NO          between          resume
Inst trace         sync         user           YES         between          resume
breakpoint         sync         user           YES         between          resume
ioverflow          sync         coerced        YES         within           resume
fp over/und        sync         coerced        YES         within           resume
page fault         sync         coerced        NO          within           resume
misalign mem       sync         coerced        YES         within           resume
mem prot viol      sync         coerced        NO          within           resume
Undef inst         sync         coerced        NO          within           resume
Hardware malfunc   async        coerced        NO          within           terminate
Power failure      async        coerced        NO          within           terminate

Exceptions in MIPS Integer Pipe

  • Most exceptions involve memory (segfault)
  • Exception raised by inst i + x could occur before the exception from inst i!
    – Start to see why precise exceptions hard

Stage   Possible Exceptions
IF      Page fault (inst fetch), misaligned mem access, mem prot viol
ID      Undefined/illegal opcode
EX      Arithmetic exception
MEM     Page fault (data), misaligned mem access, mem prot viol
WB      None
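
The stage table explains how a later instruction can fault first: its faulting stage comes earlier in the pipe. The sketch below is a small illustration with made-up names, assuming one instruction enters the pipe per cycle with no stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def fault_cycle(inst_index, stage):
    """Clock cycle (1-based) in which instruction `inst_index` (0-based)
    occupies `stage`, with one instruction issued per cycle."""
    return inst_index + STAGES.index(stage) + 1

def raise_order(faults):
    """faults: {inst_index: faulting stage}. Order in which the hardware
    would see the faults raised."""
    return sorted(faults, key=lambda i: (fault_cycle(i, faults[i]), i))

def precise_order(faults):
    """Order in which precise exceptions must be handled: program order."""
    return sorted(faults)

# inst 0: data page fault in MEM; inst 1: instruction page fault in IF.
faults = {0: "MEM", 1: "IF"}
print(raise_order(faults), precise_order(faults))
```

Here inst 1 faults in cycle 2 while inst 0 does not fault until cycle 4, so a precise-exception scheme has to hold back the later instruction's fault until the earlier one has been dealt with.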

Multiple Cycle Operations

  • Many arithmetic operations traditionally performed in multiple cycles:
    – Integer & floating point multiplies (pipelined)
    – Integer & floating point divides (unpipelined)
    – fp adds and subtracts (pipelined)
  • Multicycle ops tend to require too much hardware and/or too long a clock cycle to be worth making single cycle
    – If expense no object, can we make all ops single-cycle without slowing down clock?
  • Multicycle unpipelined ops even more so, and often relatively rare

Adding the Floating Point Pipes

  • latency: # of intervening cycles between an inst that produces a result and inst that uses it
    – Since ops consume data in EX: $\text{latency} = \#\,\text{stages in EX} - 1 = \text{pipelen}_{fu} - 1$
  • repeat/initiation interval: # cycles that must elapse between issuing two inst of same type
  • Fully pipelined mul/add, unpiped div (why?). How many new preg?

Comp Arch, Henn & Patt, Fig A.31, pg A-

FU        Lat   Repeat
iALU
i/f DM
fadd
fmul
f/i div

Pipelining of Independent FP Instructions

  • No stalls (indep operands, units)
  • Stage where operands consumed in italics
    – Operands consumed in order
  • Stage where result available (forwarding) is underlined
    – Add result before mul, load before add!
    – Operands available out-of-order
  • What data does store need in EX & MEM?

MUL.D   IF  ID  M   M   M   M   M   M   M   MEM WB
ADD.D       IF  ID  A   A   A   A   MEM WB
L.D             IF  ID  EX  MEM WB
S.D                 IF  ID  EX  MEM
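
The out-of-order completion in the chart above follows directly from the number of execute stages per unit. The sketch below reads those counts off the chart (7 M stages, 4 A stages, 1 EX stage) and applies the latency rule from these notes to the FP arithmetic ops only. Names are illustrative, and one instruction is assumed to issue per cycle.

```python
# Execute-stage counts read off the chart above.
EX_STAGES = {"MUL.D": 7, "ADD.D": 4, "L.D": 1}

def fp_latency(op):
    """latency = pipelen_fu - 1, for ops whose result comes out of EX."""
    return EX_STAGES[op] - 1

def wb_cycle(op, issue_index):
    """WB cycle when IF happens in cycle issue_index + 1:
    IF, ID, the execute stages, MEM, then WB."""
    if_cycle = issue_index + 1
    return if_cycle + 1 + EX_STAGES[op] + 1 + 1

issued = ["MUL.D", "ADD.D", "L.D"]
finish = {op: wb_cycle(op, k) for k, op in enumerate(issued)}
print(finish)
print(sorted(finish, key=finish.get))   # completion order
print(fp_latency("MUL.D"), fp_latency("ADD.D"))
```

The instructions issue in program order but write back in the reverse order, which is the "add result before mul, load before add" observation on the slide.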

Actual Pipeline: R4000

  • IF: PC select, init ifetch
  • IS: complete ifetch
  • RF: inst decode, reg fetch, icache hit det, hazard check
  • EX: EA calc, ALU op, branch target, cond. eval
  • DF: init dfetch
  • DS: complete cache access
  • TC: tag check (det if cache hit)
  • WB: write back to regfile
  • We see caches slower than ALU on this arch

Comp Arch, Henn & Patt, Fig A.37, pg A-

R4000 Pipe Causes 2-cycle Load/use Delay

Data available at end of DS (cache hit)

If TC shows miss, pipe must be backed up

Comp Arch, Henn & Patt, Fig A.38, pg A-

Two-cycle LD/Use Delay in R4000

  • Forwarding from DS stage

                 Clock number
Inst             1      2      3      4      5      6      7      8      9
LD R             IF     IS     RF     EX     DF     DS     TC     WB
DADD R2,R1...           IF     IS     RF     stall  stall  EX     DF     DS
DSUB R3,R1...                  IF     IS     stall  stall  RF     EX     DF
OR R4,R1...                           IF     stall  stall  IS     RF     EX

  1. Is forwarding needed to defeat hazard on DSUB?
  2. Is forwarding needed to defeat hazard on OR?
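
The two-cycle delay falls out of the stage positions: load data is ready at the end of DS, but a dependent instruction wants it at the start of EX. The sketch below generalizes that arithmetic so the 5-stage pipe and the R4000 can be compared. The function name and parameters are made up for illustration, and forwarding from the data-ready stage is assumed.

```python
def stalls_needed(stages, data_ready_stage, consume_stage, distance):
    """Bubbles a consumer `distance` slots behind a load needs on its own,
    given forwarding from the end of `data_ready_stage`."""
    ready_cycle = stages.index(data_ready_stage) + 1            # load's IF = cycle 1
    unstalled_use = stages.index(consume_stage) + 1 + distance  # consumer's EX cycle
    return max(0, ready_cycle + 1 - unstalled_use)

FIVE_STAGE = ["IF", "ID", "EX", "MEM", "WB"]
R4000 = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]

print([stalls_needed(FIVE_STAGE, "MEM", "EX", d) for d in (1, 2, 3)])
print([stalls_needed(R4000, "DS", "EX", d) for d in (1, 2, 3)])
```

The function reports what each consumer would need in isolation. In the chart, DSUB and OR sit behind DADD in an in-order pipe, so they inherit DADD's two bubbles regardless.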

Three-cycle Branch Delay in R4000

  • R4000 predicts ‘not taken’, computes branch targ in EX
  • Br targ forwarded from EX (4th stage, so 3 cycle delay)
  • Compiler can reduce taken branches
  • Why not have 3 delay slots?

                      Clock number
Inst                  1    2    3    4    5    6    7    8    9
taken branch (i)      IF   IS   RF   EX   DF   DS   TC   WB
delay slot (i+1)           IF   IS   RF   EX   DF   DS   TC   WB
stall (i+2)                     IF   IS   nop  nop  nop  nop  nop
stall (i+3)                          IF   nop  nop  nop  nop  nop
branch targ                               IF   IS   RF   EX   DF

untaken branch (i)    IF   IS   RF   EX   DF   DS   TC   WB
delay slot (i+1)           IF   IS   RF   EX   DF   DS   TC   WB
inst i+2                        IF   IS   RF   EX   DF   DS   TC
inst i+3                             IF   IS   RF   EX   DF   DS

Pipelining effects on CPI

  • First 5 int benchmarks
    – br stalls dominate
  • Last 5 fp benchmarks
    – Result (data) stalls dominate
  • Transforming code can help:
    – Compiler can ameliorate
    – Hand-tuner can eliminate

Comp Arch, Henn & Patt, Fig A.48, pg A-

[Figure: CPI for 10 SPEC92 benchmarks (perfect cache)]

Fallacies and Pitfalls

  • Pitfall: Rearranged execution sequences may cause unexpected hazards.
    – Filling delay slot causes WAW hazard:

      BNEZ  R1, L1
      DIV.D F0, F2, F4    ; in delay slot
      .....
L1:   L.D   F0, 28(R1)

  • Pitfall: Extensive pipelining in search of higher clock speeds can impact other aspects of design, leading to worse cost-performance
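
The WAW pitfall above comes down to write-back timing: the long-running DIV.D in the delay slot can finish after the L.D at the branch target, so the stale quotient overwrites the loaded value in F0. The sketch below illustrates this. The 24 execute cycles for the divide are a hypothetical figure chosen for the example, and the function names are made up.

```python
DIV_EX_CYCLES = 24   # hypothetical length of the unpipelined divide

def write_cycle(issue_cycle, ex_cycles):
    """WB cycle: IF at issue_cycle, then ID, ex_cycles of execute, MEM, WB."""
    return issue_cycle + 1 + ex_cycles + 2

def waw_hazard(first, second):
    """first/second: (dest, issue_cycle, ex_cycles) in program order.
    Hazard if both write the same register and the earlier instruction
    does not write strictly before the later one."""
    same_dest = first[0] == second[0]
    return same_dest and write_cycle(*first[1:]) >= write_cycle(*second[1:])

div_in_slot = ("F0", 1, DIV_EX_CYCLES)   # DIV.D F0, F2, F4 in the delay slot
load_at_target = ("F0", 2, 1)            # L.D F0, 28(R1) at L1

print(write_cycle(*div_in_slot[1:]), write_cycle(*load_at_target[1:]))
print(waw_hazard(div_in_slot, load_at_target))
```

With a single-cycle instruction in the delay slot the writes would land in program order, so the hazard is specific to moving a long-latency op there.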