Lecture 7: Pipelining & Instruction Level Parallelism in MIPS R4000 - Prof. Alan L. Sussman, Assignments of Computer Science

A lecture note from the CMSC 411 Computer Systems Architecture course, focusing on pipelining and instruction-level parallelism in the MIPS R4000 processor. It covers the stages of the MIPS R4000 pipeline, forwarding techniques, and pipeline performance, and also discusses the pitfalls of extensive pipelining and the importance of instruction-level parallelism.

Uploaded on 07/29/2009

CMSC 411
Computer Systems Architecture
Lecture 7
Basic Pipelining (cont.) &
Instruction Level Parallelism
Alan Sussman
als@cs.umd.edu
Administrivia
• Questions about HW #1?
• HW #2, on pipelining, due next Thursday, Feb. 26
• Start reading Chapter 2 of H&P
CMSC 411 - 7 (from Patterson) 2
A case study: MIPS R4000
• MIPS64 architecture, with a deeper 8-stage pipeline
  - to get higher clock rates
  - extra stages come from memory accesses
  - a technique called superpipelining
MIPS R4000 pipeline stages
• IF: 1st half of instruction fetch
  - PC selection and start of instruction cache access
• IS: 2nd half of instruction fetch
  - complete instruction cache access
• RF: instruction decode, register fetch, hazard checking, instruction cache hit detection
• EX: execution
  - includes effective address computation, ALU operation, branch target computation, and condition evaluation

MIPS R4000 pipeline (cont.)
• DF: 1st half of data fetch
  - 1st half of data cache access
• DS: 2nd half of data fetch
  - complete data cache access
• TC: tag check
  - determine whether the data cache access hit
• WB: write back for loads and ALU operations


MIPS R4000 pipeline (cont.)
• A 2-cycle load delay
  - might need to restart ADDD’s ALU

MIPS R4000 pipeline (cont.)
• A 3-cycle branch delay: 1 delay slot + 2-cycle stall for a taken branch (an untaken branch costs just the delay slot)

Forwarding
• Deeper pipeline increases number of levels of forwarding for ALU operations
  - 4 possible sources for an ALU bypass: EX/DF, DF/DS, DS/TC, TC/WB

Pitfalls
• Unexpected hazards do occur
  - for example, when a branch is taken before a previous instruction finishes
• Extensive pipelining can slow a machine down, or lead to worse cost-performance
  - more complex hardware can cause a longer clock cycle, killing the benefits of more pipelining

Pitfalls (cont.)
• A poor compiler can make a good machine look bad
  - compiler writers need to understand the architecture in order to
    » optimize efficiently and
    » avoid hazards
  - better to eliminate useless instructions than make them run faster

INSTRUCTION-LEVEL PARALLELISM


Outline
• ILP
• Compiler techniques to increase ILP
• Loop Unrolling
• Static Branch Prediction
• Dynamic Branch Prediction
• Overcoming Data Hazards with Dynamic Scheduling
• (Start) Tomasulo Algorithm
• Conclusion

Recall from Pipelining
• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
  - Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
  - Structural hazards: HW cannot support this combination of instructions
  - Data hazards: Instruction depends on result of prior instruction still in the pipeline
  - Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Instruction Level Parallelism
• Instruction-Level Parallelism (ILP): overlap the execution of instructions to improve performance
• 2 approaches to exploit ILP:
  1) Rely on hardware to help discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power), and
  2) Rely on software technology to find parallelism, statically at compile-time (e.g., Itanium 2/IA-64)

Instruction-Level Parallelism (ILP)
• Basic Block (BB) ILP is quite small
  - BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
  - average dynamic branch frequency 15% to 25% => 4 to 7 instructions execute between a pair of branches
  - Plus, instructions in a BB are likely to depend on each other
• Need ILP across multiple basic blocks
• Simplest: loop-level parallelism to exploit parallelism among iterations of a loop. E.g.,

    for (i=1; i<=1000; i=i+1)
        x[i] = x[i] + y[i];

Loop-Level Parallelism
• Exploit loop-level parallelism by “unrolling loop” either by
  1. dynamic via branch prediction, or
  2. static via loop unrolling by compiler
  (Another way is vectors, to be covered later)
• Determining dependences is critical
• If 2 instructions are
  - parallel, they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
  - dependent, they are not parallel and must be executed in order, although they may often be partially overlapped