






An in-depth analysis of computer architecture, focusing on pipelining and data hazards. It covers various concepts such as pipeline stages, pipeline registers, structural hazards, data hazards, and control hazards. The document also discusses techniques to reduce branch stalls and memory port as structural hazards. It includes numerous figures and examples to illustrate the concepts.







(^) Appendix A Concepts
Basics of pipelining
Multicycle operations
Precise exceptions/interrupts
(^) Pipelining
Parallelism between pipe stages
Each step (pipe step) takes 1 cycle
Different steps from different instructions are processed in parallel
Individual instructions will take slightly longer due to pipeline overhead
(^) Simple Integer Pipeline
IF (instruction fetch): fetch inst from i-cache and increment PC
ID (inst decode):
  Decode inst
  Read vals from register file
  (for branch) check for equal on those vals
  Sign extend immed val
  Calc pc-rel target address using immed
EX (execution/effective address):
  For mem, calc effective address (reg+imm)
  Perform ALU oper on two input regs
  Perform oper on reg and imm
MEM (memory access): ld val from or store val to d-cache
WB (write back): update reg file with result of op or ld
(^) Simple RISC Pipeline
Time moves left-to-right, inst order top-to-bottom
If we handle each inst sequentially, unpipelined
[Table: clock number (columns) vs. inst # (rows): inst i, inst i + 1, inst i + 2, inst i + 3, inst i + 4]
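The overlapped timing described above can be sketched in a few lines of Python. This is only an illustration under the assumption of an ideal five-stage pipe with no stalls; the function names `pipeline_diagram` and `format_diagram` are made up for this sketch.

```python
# Sketch of an ideal (stall-free) five-stage pipeline timing table.
# Assumption: one instruction enters IF every clock and never stalls.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_insts, stages=STAGES):
    """Return one row per instruction; row[c] is the stage held in clock c+1."""
    total_cycles = num_insts + len(stages) - 1
    rows = []
    for i in range(num_insts):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i occupies stage s in clock i+s+1
        rows.append(row)
    return rows


def format_diagram(rows):
    """Render rows as text: time left-to-right, inst order top-to-bottom."""
    header = f"{'inst #':<10}" + " ".join(f"{c + 1:>4}" for c in range(len(rows[0])))
    lines = [header]
    for i, row in enumerate(rows):
        label = "inst i" if i == 0 else f"inst i+{i}"
        lines.append(f"{label:<10}" + " ".join(f"{cell:>4}" for cell in row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_diagram(pipeline_diagram(5)))
```

Reading down the column for clock 5 in the printed table gives WB, MEM, EX, ID, IF: every stage is busy with a different instruction, which is the "pipe is full" condition.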
(^) Pipeline as series of time-shifted data paths
Cycle 5 shows what happens when pipe is full:
Each pipe stage processing diff inst
Running all stages in parallel may require duplicate hardware
Reg read in ID stage occurs in last half of cycle, reg write in WB occurs in first, so no conflict
Computer Architecture, Henn & Patt, Fig A.2, pg A-
(^) Reducing Branch Stall / Understanding Pipelining
Computer Architecture, Henn & Patt, Fig A.24, pg A-
MUX (multiplexor): Used to decide which of N inputs gets passed thru
Data & control must be passed between stages
Internal regs (pipeline registers or latches) are named for stages they connect
(^) Pipelining Limitations
Overhead from info flow in piperegs
Minimum realizable cycle time due to clock skew
N-deep pipe give N-fold speedup?
$\text{speedup} = \dfrac{\text{pipedepth}}{1 + \text{pipe stalls per inst}}$
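A minimal numeric sketch of the speedup relation, assuming the ideal CPI is 1 and ignoring clock overhead (the function name `pipeline_speedup` is hypothetical):

```python
def pipeline_speedup(pipe_depth, stalls_per_inst):
    """speedup = pipedepth / (1 + pipe stalls per inst)."""
    if pipe_depth < 1 or stalls_per_inst < 0:
        raise ValueError("pipe_depth must be >= 1 and stalls_per_inst >= 0")
    return pipe_depth / (1.0 + stalls_per_inst)


if __name__ == "__main__":
    # Illustrative inputs only: a 5-deep pipe with 0 and 0.25 stalls per inst.
    print(pipeline_speedup(5, 0.0))
    print(pipeline_speedup(5, 0.25))
```

With no stalls an N-deep pipe gives the full N-fold speedup; any nonzero stall rate pulls it below N.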
Pipeline hazards
Structural: resource conflicts
Data: result dependencies
Control: changing PC (branch)
(^) Structural Hazards
Structural hazard
Resource conflict due to hardware not supporting all
possible combinations of overlapping instructions.
We will see that MIPS detects all stalls in ID, which requires extra hardware
(^) Example of RAW Data Hazard
Which of these insts have a data hazard with DADD?
DADD R1, R2, R
DSUB R4, R1, R5
AND R6, R1, R
OR R8, R1, R
XOR R10, R1, R
Computer Architecture, H&P, Fig A.6, pg A-
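The question above can be checked mechanically. The sketch below assumes a five-stage pipe with no forwarding, where the register file is written in the first half of WB and read in the second half of ID. The trailing register numbers are truncated in these notes, so R3, R7, R9 and R11 below are placeholders chosen for illustration; `Inst` and `raw_hazards` are hypothetical names.

```python
from typing import NamedTuple, Tuple


class Inst(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def raw_hazards(program, producer_index=0):
    """List ops that read the producer's dest before it is written back.

    Assumes no forwarding and no stalls: inst k is in ID in clock k+2 and
    in WB in clock k+5. A read in the same clock as the write is safe
    because the write happens in the first half of the cycle.
    Simplification: later redefinitions of the register are not tracked.
    """
    producer = program[producer_index]
    write_clock = producer_index + 5
    hazards = []
    for k in range(producer_index + 1, len(program)):
        read_clock = k + 2
        if producer.dest in program[k].srcs and read_clock < write_clock:
            hazards.append(program[k].op)
    return hazards


# Register numbers R3, R7, R9, R11 are assumed (truncated in the source).
PROGRAM = [
    Inst("DADD", "R1", ("R2", "R3")),
    Inst("DSUB", "R4", ("R1", "R5")),
    Inst("AND", "R6", ("R1", "R7")),
    Inst("OR", "R8", ("R1", "R9")),
    Inst("XOR", "R10", ("R1", "R11")),
]

if __name__ == "__main__":
    print(raw_hazards(PROGRAM))
```

Under these assumptions only DSUB and AND read R1 too early; OR reads it in the same clock DADD writes it, and XOR reads it a clock later.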
(^) Ameliorating Data Hazards through Forwarding
Forwarding (bypassing/short-circuiting) overcomes many data hazards.
Data paths forward from piperegs
Comparators check if reg#s match; if so, control logic chooses forwarded val
Why fwdng not needed for OR?
Computer Architecture, H&P, Fig A.7, pg A-
(^) Forwarding from MEM
Can forward between any stages
If fwd from MEM, succ data access does not stall
DADD R1, R2, R
LD R4, 0(R1)
SD R4, 12(R1)
Would DADD stall?
LD   R4, 0(R1)   IF ID EX DM WB
DADD R5, R4, R      IF ID EX DM WB
Computer Architecture, H&P, Fig A.8, pg A-
(^) Forwarding Limitations
LD   R1, 0(R2)   IF ID EX DM WB
DSUB R4, R1, R      IF ID EX DM WB
AND  R6, R1, R         IF ID EX DM WB
OR   R8, R1, R            IF ID EX DM WB
DSUB must stall
Computer Architecture, H&P, Fig A.9, pg A-
(^) Stalling for Correct Execution
Must stall DSUB @ 4 to produce LD result
Why does stall affect subsequent insts?
Why does stall not affect prior insts?
[Table: clock number (columns) vs. inst (rows), with three stall entries]
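One way to see why the stall delays every later instruction but no earlier one is to compute when each instruction enters each stage. The sketch below assumes full forwarding, one instruction per stage, and that a loaded value can be forwarded only after the load's MEM stage. The second source registers (R5, R7, R9) are placeholders, since they are truncated in these notes, and `stage_entry_cycles` is a hypothetical name.

```python
IF, ID, EX, MEM, WB = range(5)


def stage_entry_cycles(program):
    """program: list of (op, dest, srcs). Return, per inst, the clock in
    which it enters IF, ID, EX, MEM and WB (clocks start at 1)."""
    table = []
    for k, (op, dest, srcs) in enumerate(program):
        prev = table[k - 1] if k else None
        enter = [0] * 5
        # An inst can be fetched once its predecessor has moved on to ID.
        enter[IF] = prev[ID] if prev else 1
        for s in range(ID, WB + 1):
            earliest = enter[s - 1] + 1
            if prev and s < WB:
                # Stage s is free only once the predecessor has left it.
                earliest = max(earliest, prev[s + 1])
            if s == EX and prev:
                prev_op, prev_dest, _ = program[k - 1]
                if prev_op == "LD" and prev_dest in srcs:
                    # Loaded value exists only after the load finishes MEM.
                    earliest = max(earliest, prev[MEM] + 1)
            enter[s] = earliest
        table.append(enter)
    return table


# R5, R7, R9 are assumed register numbers (truncated in the source).
PROGRAM = [
    ("LD", "R1", ("R2",)),
    ("DSUB", "R4", ("R1", "R5")),
    ("AND", "R6", ("R1", "R7")),
    ("OR", "R8", ("R1", "R9")),
]

if __name__ == "__main__":
    for (op, _, _), row in zip(PROGRAM, stage_entry_cycles(PROGRAM)):
        print(f"{op:<5}", row)
```

In this model DSUB reaches EX one clock late, and because stages are occupied in order, AND and OR are each pushed back by the same clock, while LD (already ahead in the pipe) is unaffected.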
(^) Forwarding Adds Complexity to Hardware
Forwarding to ALU requires 3 extra MUX inputs and 3 new paths to these inputs (... result of prev inst)
Control of MUX requires comparing src reg# of current inst with dest reg#s of prev inst in ID/EX, and MEM/WB piperegs
Computer Architecture, H&P, Fig A.23, pg A-
(^) Load RAW Interlock for Integer Pipeline
Pass from ID to EX: inst is issued
All data haz det in ID
Comparators det if two reg# the same
Only prob comes with load in EX and use in ID, as shown in table
Insert bubble if read in ID, load in EX, and read# matches dest#

Code | Result | Action
LD R1,45(R2); DADD R5,R6,R; DSUB R8,R6,R7; OR R9,R6,R | No dep | R1 not used after EX, so no action
LD R1,45(R2); DADD R5,R1,R; DSUB R8,R6,R7; OR R9,R6,R | Stall for depend | Comparators det use of R1 in DADD, stall DADD (and succ inst) before DADD enters EX
LD R1,45(R2); DADD R5,R6,R; DSUB R8,R1,R7; OR R9,R6,R | Depend defeated by forwarding | Comp detect use of R1 in DSUB, forward ld val in time for DSUB to enter EX
LD R1,45(R2); DADD R5,R6,R; DSUB R8,R6,R7; OR R9,R1,R | Depend, but accesses in order | Read of R1 by OR in 2nd half of ID, while write occurred in 1st half (WB of LD)
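The four cases in the table reduce to how far behind the load the first use of its destination register sits. A small sketch of that classification follows; it is not the actual MIPS control logic. The source registers marked as assumed are placeholders for digits truncated in these notes, and `classify_load_dependence` is a hypothetical name.

```python
def classify_load_dependence(load_dest, following_srcs):
    """Classify the first use of load_dest among the next three insts.

    following_srcs: list of source-register tuples, one per inst after
    the load, in program order.
    """
    actions = {
        1: "stall",                    # use in ID while load is in EX
        2: "forward",                  # load value forwarded in time for EX
        3: "ordered register access",  # write in 1st half, read in 2nd half
    }
    for distance, srcs in enumerate(following_srcs[:3], start=1):
        if load_dest in srcs:
            return actions[distance]
    return "no dependence"


if __name__ == "__main__":
    # LD R1,45(R2) followed by DADD, DSUB, OR.
    # "R7" in the DADD and OR tuples is an assumed register number.
    print(classify_load_dependence("R1", [("R6", "R7"), ("R6", "R7"), ("R6", "R7")]))
    print(classify_load_dependence("R1", [("R1", "R7"), ("R6", "R7"), ("R6", "R7")]))
    print(classify_load_dependence("R1", [("R6", "R7"), ("R1", "R7"), ("R6", "R7")]))
    print(classify_load_dependence("R1", [("R6", "R7"), ("R6", "R7"), ("R1", "R7")]))
```

Only the distance-1 case needs a bubble, which is why the interlock check can be limited to a load in EX and a reader in ID.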
(^) Instruction Scheduling
Some dependency stalls can be defeated by instruction scheduling
Simple scheduling may be done in hardware
More complex scheduling (like this) may be done by compiler
(^) Pipeline for ‘Predict Untaken’ w/o Delay Slot
If pred wrong, change op to noop before state change
Predict ’taken’ requires delay slot or computing branch target in IF!
Why is this better than always delaying?
What are advantages of predicting taken/untaken?
[Table: clock number (columns) vs. inst (rows)]
Untaken case rows: untaken br, inst i+, inst i+, inst i+
Taken case rows: taken br, inst i+ (nop nop nop nop), targ, targ+
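The cost of predicting untaken can be estimated with the usual stall accounting: only taken branches pay a penalty. A hedged sketch, assuming an ideal CPI of 1 and a one-cycle penalty per taken branch (one fetched inst turned into a nop); the function name and the sample frequencies are illustrative, not measured data.

```python
def predict_untaken_cpi(branch_freq, taken_frac, penalty_cycles=1, base_cpi=1.0):
    """CPI = base CPI + (branch frequency x fraction taken x penalty)."""
    if not (0.0 <= branch_freq <= 1.0 and 0.0 <= taken_frac <= 1.0):
        raise ValueError("frequencies must lie in [0, 1]")
    return base_cpi + branch_freq * taken_frac * penalty_cycles


def always_stall_cpi(branch_freq, penalty_cycles=1, base_cpi=1.0):
    """Baseline for comparison: every branch pays the penalty."""
    return predict_untaken_cpi(branch_freq, 1.0, penalty_cycles, base_cpi)


if __name__ == "__main__":
    # Made-up workload: 20% branches, 60% of them taken.
    print(predict_untaken_cpi(0.20, 0.60))
    print(always_stall_cpi(0.20))
```

Comparing the two numbers shows why prediction beats always delaying: untaken branches cost nothing when the guess is right.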
(^) Delay Slots
Instruction in delay slot executed regardless of branch outcome
If delay slot full, no bubble in pipe regardless of prediction
Can use inst from fall-thru or target, but only if inst doesn't change program behavior when executed uselessly (no interrupt, no side-effect)
Some archs allow for a cancelling or nullifying branch, which nops delay slot inst if branch-contained prediction is wrong
Additional restrictions on delay slot inst (branch)
Can cause complications (return from interrupt)
Less common these days
(^) Exceptions/interrupts
Exceptions (AKA interrupt): Unscheduled events that change the normal execution order of instructions (e.g., by calling an exception handler).
Examples: I/O, call to system space, int overflow, FP anomaly, memory protection violation, etc.
Must save machine state, handle exception, & if possible, restart normal execution
Exceptions more complex with pipelining, as mult inst in flight
Precise exceptions guarantee inst before the fault are completed, and inst following it are not allowed to change machine state until after exception is handled.
(^) Classifying Exceptions
Synchronous: event occurs at same place every execution.
User requested: Prog directly asks for exception.
Maskable: User can override exception.
Within: Excep occurs during the internal exec of an inst, not in between separate inst.
Resume: execution resumes after handler, else terminate
Exception        | Sync/Async | User/Coerced | User Mask | Within/Between | Resume/Term
I/O dev req      | async      | coerced      | NO        | between        | resume
OS call          | sync       | user         | NO        | between        | resume
Inst trace       | sync       | user         | YES       | between        | resume
breakpoint       | sync       | user         | YES       | between        | resume
ioverflow        | sync       | coerced      | YES       | within         | resume
fp over/und      | sync       | coerced      | YES       | within         | resume
page fault       | sync       | coerced      | NO        | within         | resume
misalign mem     | sync       | coerced      | YES       | within         | resume
mem prot viol    | sync       | coerced      | NO        | within         | resume
Undef inst       | sync       | coerced      | NO        | within         | resume
Hardware malfunc | async      | coerced      | NO        | within         | terminate
Power failure    | async      | coerced      | NO        | within         | terminate
(^) Exceptions in MIPS Integer Pipe
Most exceptions involve memory (segfault)
Exception raised by inst i + x could occur before the exception from inst i!
Start to see why precise exceptions hard
Stage | Possible Exceptions
IF    | Page fault (inst fetch), misaligned mem access, mem prot viol
ID    | Undefined/illegal opcode
EX    | Arithmetic exception
MEM   | Page fault (data), misaligned mem access, mem prot viol
WB    | None
(^) Multiple Cycle Operations
... (pipelined)
Integer & floating point divides (unpipelined)
fp adds and subtracts (pipelined)
Multicycle ops tend to require too much hardware and/or too long a clock cycle to be worth making single cycle
If expense no object, can we make all ops single-cycle without slowing down clock?
Multicycle unpipelined ops even more so, and often relatively rare
(^) Adding the Floating Point Pipes
latency (ops consume data in EX) = # of stages in EX - 1 = $\text{pipelen}_{fu} - 1$
repeat/initiation interval: # cycles that must elapse between issuing two inst of same type
Fully pipelined mul/add, unpiped div (why?). How many new preg?
Comp Arch, Henn & Patt, Fig A.31, pg A-
[Table: Lat and Repeat for each unit: iALU, i/f DM, fadd, fmul, f/i div]
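The latency relation can be written as a one-line helper. The stage counts used in the example (1 for the integer EX, 4 for the FP adder, 7 for the FP multiplier) are taken from the pipeline diagram later in these notes; the function name `fu_latency` is an assumption of this sketch.

```python
def fu_latency(ex_stages):
    """latency = number of stages in EX - 1 (operands are consumed in EX)."""
    if ex_stages < 1:
        raise ValueError("a functional unit needs at least one EX stage")
    return ex_stages - 1


if __name__ == "__main__":
    for unit, stages in [("iALU", 1), ("fadd", 4), ("fmul", 7)]:
        print(unit, fu_latency(stages))
```

A single-stage EX has latency 0, meaning a dependent instruction can follow immediately with forwarding; longer units force dependent instructions to wait that many extra cycles.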
(^) Pipelining of Independent FP Instructions
No stalls (indep operands, units)
Stage where operands consumed in italics
Operands consumed in order
Operands available out-of-order
What data does store need in EX & MEM?
MUL.D  IF ID M  M  M  M  M  M  M  MEM WB
ADD.D     IF ID A  A  A  A  MEM WB
L.D          IF ID EX MEM WB
S.D             IF ID EX MEM
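The in-order issue, out-of-order completion pattern in the diagram can be reproduced by counting stages. This sketch assumes one independent instruction issues per clock with no stalls. The stage lists mirror the rows above (including S.D ending at MEM, as in these notes); `PIPES` and `completion_clocks` are hypothetical names.

```python
# Stage sequence for each op, copied from the diagram in the notes.
PIPES = {
    "MUL.D": ["IF", "ID"] + ["M"] * 7 + ["MEM", "WB"],
    "ADD.D": ["IF", "ID"] + ["A"] * 4 + ["MEM", "WB"],
    "L.D": ["IF", "ID", "EX", "MEM", "WB"],
    "S.D": ["IF", "ID", "EX", "MEM"],
}


def completion_clocks(ops):
    """Issue one independent op per clock (first IF in clock 1) and return
    (op, clock of its final stage) pairs in issue order."""
    # Op number n (0-based) is in IF in clock n+1, so its last stage
    # falls in clock n + len(stages).
    return [(op, n + len(PIPES[op])) for n, op in enumerate(ops)]


if __name__ == "__main__":
    for op, clock in completion_clocks(["MUL.D", "ADD.D", "L.D", "S.D"]):
        print(f"{op:<6} finishes in clock {clock}")
```

MUL.D issues first but finishes last, so results become available in a different order than the instructions were issued.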
(^) Actual Pipeline:
IF: PC select, init ifetch
IS: complete ifetch
RF: inst decode, reg fetch, icache hit det, hazard check
EX: EA calc, ALU op, branch target, cond. eval
DF: init dfetch
DS: complete cache access
TC: tag check (det if cache hit)
WB: write back to regfile
We see caches slower than ALU on this arch
Comp Arch, Henn & Patt, Fig A.37, pg A-
(^) R4000 Pipe Causes 2-cycle Load/use Delay
Data available at end of DS (cache hit)
If TC shows miss, pipe must be backed up
Comp Arch, Henn & Patt, Fig A.38, pg A-
(^) Two-cycle LD/Use Delay in R
Forwarding from DS stage
[Table: clock number (columns) vs. inst (rows), with six stall entries]
(^) Three-cycle Branch Delay in R
R4000 predicts ‘not taken’, computes branch targ in EX
Br targ forwarded from EX (4th stage, so 3 cycle delay)
Compiler can reduce taken branches
Why not have 3 delay slots?
[Table: clock number (columns) vs. inst (rows)]
Taken case:
  taken branch (i)
  delay slot (i + 1)
  stall (i + 2): nop nop nop nop nop
  stall (i + 3): nop nop nop nop nop
  branch targ
Untaken case:
  untaken branch (i): IF IS RF EX DF DS TC WB
  delay slot (i + 1)
  inst i + 2
  inst i + 3
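Both R4000 delays follow from where stages sit in the pipe, and the same arithmetic gives the one-cycle delays of the five-stage pipe. A rough sketch, assuming full forwarding, that a load's data is ready at the end of the named stage, and that a branch redirects fetch in the clock after the named stage; the function names are hypothetical.

```python
MIPS5 = ["IF", "ID", "EX", "MEM", "WB"]
R4000 = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def load_use_delay(stages, data_ready_stage, use_stage="EX"):
    """Stall cycles for an inst that uses a load result right behind the load.

    The user trails the load by one stage, so it must wait until the load
    has finished data_ready_stage before it can enter use_stage.
    """
    return stages.index(data_ready_stage) - stages.index(use_stage)


def branch_delay(stages, resolve_stage):
    """Fetch slots after the branch that pass before the target can be fetched."""
    return stages.index(resolve_stage) - stages.index("IF")


if __name__ == "__main__":
    print("R4000 load/use delay:", load_use_delay(R4000, "DS"))
    print("R4000 branch delay:  ", branch_delay(R4000, "EX"))
    print("5-stage load/use:    ", load_use_delay(MIPS5, "MEM"))
    print("5-stage branch:      ", branch_delay(MIPS5, "ID"))
```

Deeper pipes push the data-ready and branch-resolve stages further from the front, which is where the extra delay cycles come from.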
(^) Pipelining effects on CPI
CPI for 10 SPEC92 benchmarks (perfect cache)
Comp Arch, Henn & Patt, Fig A.48, pg A-
(^) Fallacies and Pitfalls
.....DIV.D F0, F2, F4 ; in delay slot
Pitfall:
Extensive pipelining in search of higher clock speeds can
impact other aspects of design, leading to worse cost-performance