Compiler Optimization & Processor Pipelining in CS411: Exploring ILP - Prof. Alan L. Sussm, Study notes of Computer Science

University of Maryland Computer Science

Prof. Alan L. Sussman

This document from CMSC 411 by Alan Sussman introduces techniques that exploit instruction-level parallelism, both in how compilers generate code and in how processor pipelines execute it. Topics include dynamic pipeline scheduling, register renaming, dynamic branch prediction, loop unrolling, compiler dependence analysis, software pipelining, and trace scheduling. The document also covers data dependences, functional units, and hazards in pipelining.

Typology: Study notes

Pre 2010

Uploaded on 02/13/2009

CMSC 411 - A. Sussman (from D. O'Leary) 1

Computer Systems Architecture

CMSC 411

Unit 4 – Instruction-level

parallelism

Alan Sussman

March 28, 2006

CMSC 411 - Alan Sussman 2

What we already know about

pipelining

• We need to avoid structural hazards, data hazards

and control hazards in order to get optimal

performance from the pipeline

– Pipeline CPI = Ideal pipeline CPI + Structural stalls +

Data hazard stalls + Control stalls

• Accomplish this by techniques such as

– instruction reordering

– hardware modifications to detect branches earlier

– compiler approaches to reduce branch delays

• Each technique reduces one or more of the stall

components of the Pipeline CPI
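
As a rough illustration of the CPI formula above, the sketch below adds up the stall components for a hypothetical pipeline and shows how shrinking one component changes the total. The function name and every stall value are assumptions made up for the example, not measurements from the course.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Pipeline CPI; each stall term is average stall cycles per instruction."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


# Hypothetical stall rates (assumed for illustration only).
before = pipeline_cpi(1.0, 0.05, 0.30, 0.15)

# Suppose better branch handling removes two thirds of the control stalls.
after = pipeline_cpi(1.0, 0.05, 0.30, 0.05)

# Same instruction count and clock, so speedup is the ratio of the CPIs.
speedup = before / after
```

Each technique in the list attacks one of the three stall arguments, which is why the formula is a useful way to compare them.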

CMSC 411 - Alan Sussman 3

What's new in Units 4 & 4b

• Some instructions can be executed independently

of others. In fact, they could be executed in

parallel if there was the hardware to do it

• Idea is to take advantage of this instruction level

parallelism to do major rearrangements to how

compilers generate code (Unit 4b) and how the

MIPS (and other modern processors) pipeline

executes code (Unit 4)

CMSC 411 - Alan Sussman 4

New techniques

• Hardware

– dynamic pipeline scheduling

• with scoreboarding

• with register renaming (Tomasulo)

– dynamic branch prediction

– issuing multiple instructions per cycle

– speculation

– dynamic memory disambiguation

• Software (compiler)

– loop unrolling

– compiler dependence analysis

– software pipelining and trace scheduling

– compiler speculation

CMSC 411 - Alan Sussman 5

Data dependences (background)

• If 2 instructions are parallel, can execute

simultaneously in pipeline without stalls

– assuming no structural hazards

• If 2 instructions are dependent, must be

executed in order, but may sometimes be

partially overlapped

CMSC 411 - Alan Sussman 6

Data dependences (cont.)

• Instruction j is data dependent on instruction i if either

– instruction i produces a result that may be used by instruction j, or

– instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (transitivity)

• Dependence implies a chain of one or more data

hazards between the 2 instructions

– potentially causing a pipelined processor to stall

• Dependences are properties of programs

– and whether one causes a stall depends on the

properties of the pipeline organization

Computer Systems Architecture

CMSC 411

Unit 4 – Instruction-level

parallelism

Alan Sussman

April 6, 2006

CMSC 411 - Alan Sussman 8

Administrivia

• Midterm returned Tuesday

• Next project posted by tomorrow

– talk about it on Tuesday

• Read A.8 and 3.1-3.7, 3.

CMSC 411 - Alan Sussman 9

Last time

• Virtual memory protection

– main memory shared by multiple processes
– each process has its own virtual address space that no other process can touch or see
– enforced by base/bound registers and OS/kernel that manages page tables

• Hardware/software instruction-level parallelism

– data dependences – prevent instructions from executing in parallel
  • direct vs. indirect dependences
  • property of programs, and may cause a pipeline stall

CMSC 411 - Alan Sussman 10

Data dependences (cont.)

• A dependence

– indicates the possibility of a hazard

– determines the order results must be produced

– sets an upper bound on available parallelism

• Ways to overcome dependences

– maintain the dependence, but avoid the hazard

(schedule the code – hardware or software)

– eliminate the dependence – by transforming the

code (software)

CMSC 411 - Alan Sussman 11

Name dependences

• When 2 instructions use the same register or memory location (harder to detect), called a name, but no flow of data between them

• 2 types

– antidependence between instruction i and instruction j when j writes a register or memory location that i reads
– output dependence occurs when i and j write the same register or memory location

• In both cases, the instructions can execute at the same time, or be reordered, if the name is changed so the instructions don’t conflict

– statically by compiler or dynamically by hardware (usually only for registers – why?)

CMSC 411 - Alan Sussman 12

Hazards

• True data dependences correspond to RAW

data hazards

• Output dependences correspond to WAW

hazards

• Antidependences correspond to WAR

hazards
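
The correspondence between dependences and hazards, and the effect of changing names, can be sketched in a few lines of Python. This is a minimal illustration, not the course's method: the `Instr` tuple, the `hazards` and `rename` helpers, and the fresh names `P0`, `P1`, ... are all hypothetical, and the four-instruction program is made up for the example.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def hazards(i, j):
    """Hazards between an earlier instruction i and a later instruction j."""
    found = []
    if i.dest in j.srcs:
        found.append("RAW")  # true (flow) dependence: j reads what i writes
    if j.dest in i.srcs:
        found.append("WAR")  # antidependence: j writes what i reads
    if i.dest == j.dest:
        found.append("WAW")  # output dependence: both write the same name
    return found


def rename(program):
    """Give every result a fresh name and redirect later reads to it."""
    latest = {}  # architectural register -> most recent fresh name
    renamed = []
    for n, ins in enumerate(program):
        srcs = tuple(latest.get(s, s) for s in ins.srcs)
        fresh = f"P{n}"
        latest[ins.dest] = fresh
        renamed.append(Instr(ins.op, fresh, srcs))
    return renamed


# Hypothetical program (assumed for illustration).
s1 = Instr("ADD.D", "F0", ("F2", "F4"))
s2 = Instr("MUL.D", "F2", ("F6", "F8"))   # WAR with s1 on F2
s3 = Instr("ADD.D", "F0", ("F8", "F6"))   # WAW with s1 on F0
s4 = Instr("MUL.D", "F10", ("F0", "F2"))  # RAW on s3 (F0) and s2 (F2)

r1, r2, r3, r4 = rename([s1, s2, s3, s4])
```

After renaming, the WAR and WAW pairs no longer conflict, while the true dependence of `s4` on `s3` is still there: renaming removes name dependences only.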

CMSC 411 - Alan Sussman 19

Functional units

• One segment of the scoreboard keeps track of, for

each functional unit:

– Busy (yes or no)
– Op: operation to perform
– Fi: name of destination register
– Fj, Fk: names of source registers
– Qj, Qk: functional units producing the sources
– Rj, Rk: ready flags for Fj and Fk:
  • "yes" if the source is ready and still need it
  • "no" if the source is not ready, or if no longer need it

CMSC 411 - Alan Sussman 20

Scoreboard for functional units

Name     Busy  Op    Fi   Fj  Fk  Qj       Qk       Rj   Rk
Integer  Yes   Load  F2   R3                        No
Mult1    Yes   Mult  F0   F2  F4  Integer           No   Yes
Mult2    No
Add      Yes   Sub   F8   F6  F2           Integer  Yes  No
Divide   Yes   Div   F10  F0  F6  Mult1             No   Yes

CMSC 411 - Alan Sussman 21

Register status

• One segment of the scoreboard keeps track

of which registers are expecting outputs

from which functional units

Register  F0     F2       F4  F6  F8   F10     F12  …  F
FU        Mult1  Integer          Add  Divide

• Next, an example of the scoreboard method

CMSC 411 - Alan Sussman 22

The bookkeeping

Each functional unit keeps track of:

Unit  Opn  Fj  Fk  Qj  Qk

For our convenience, we’ll also indicate:

Inst  Stage

And we’ll assume these initial register contents and flags:

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     0   3   4   6   8   10   12   14
Flag      0   0   0   0   0   0    0    0

CMSC 411 - Alan Sussman 23

Scoreboard - bookkeeping

Unit   Opn  Fj  Fk  Qj  Qk  Inst  Stage
Mult1  x    x   x   x   x   x     x
Mult2  x    x   x   x   x   x     x
Add1   x    x   x   x   x   x     x
Add2   x    x   x   x   x   x     x

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     0   3   4   6   8   10   12   14
Flag      0   0   0   0   0   0    0    0

Computer Systems Architecture

CMSC 411

Unit 4 – Instruction-level

parallelism

Alan Sussman

April 11, 2006

CMSC 411 - Alan Sussman 25

Administrivia

• Midterm returned Thursday

– answers posted

• Project posted

– due April 27

CMSC 411 - Alan Sussman 26

Last time

• Data dependences
– true, or flow, dependences – RAW hazards
– name dependences
  • antidependences – WAR hazards
  • output dependences – WAW hazards
• Control dependences – statement orderings with respect to branches
– must preserve data flow and exception behavior
• Dynamic pipeline scheduling
– start each instruction as early as possible – once operands are available
– instructions can complete out-of-order
– scoreboard and Tomasulo scheduling
• Scoreboard
– split ID stage into issue and read operands stages
– instructions issue in order, can pass each other in read operands stage, enter functional units when ready
– keeps track of which pipe stage each inst. is in, state of each functional unit, state of registers (which functional unit will supply data)

CMSC 411 - Alan Sussman 27

Time 0

S1: ADD.D F0, F2, F4 (assume 2-cycle add)

Unit   Opn    Fj  Fk  Qj  Qk  Inst  Stage
Mult1  x      x   x   x   x   x     x
Mult2  x      x   x   x   x   x     x
Add1   Add.d  3   4   0   0   S1    Issue
Add2   x      x   x   x   x   x     x

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     0   3   4   6   8   10   12   14
Flag      a1  0   0   0   0   0    0    0

CMSC 411 - Alan Sussman 28

Time 1

S2: MULT.D F2, F6, F8 (assume 4-cycle mult)

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Issue
Mult2  x       x   x   x   x   x     x
Add1   Add.d   3   4   0   0   S1    Exec
Add2   x       x   x   x   x   x     x

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     0   3   4   6   8   10   12   14
Flag      a1  m1  0   0   0   0    0    0

CMSC 411 - Alan Sussman 29

Time 2

S3: MULT.D F10, F0, F

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult2  Mult.d  x   x   a1  m   S3    Issue
Add1   Add.d   3   4   0   0   S1    Exec
Add2   x       x   x   x   x   x     x

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     0   3   4   6   8   10   12   14
Flag      a1  m1  0   0   0   m2   0    0

CMSC 411 - Alan Sussman 30

Time 3

S4: ADD.D F0, F8, F6 (stalled – WAW hazard on F)

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult2  Mult.d  7   x   0   m   S3    Issue
Add1   Add.d   3   4   0   0   S1    Write
Add2   x       x   x   x   x   x     x

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     7   3   4   6   8   10   12   14
Flag      0   m1  0   0   0   m2   0    0

CMSC 411 - Alan Sussman 37

Tomasulo’s dynamic scheduling

method

• Origins: IBM 360/91, 3 years after CDC

• Scoreboard controls the progress of each

instruction and makes sure that hazards

such as RAW, etc. are avoided

• In Tomasulo's method, these functions are

decentralized in reservation stations

CMSC 411 - Alan Sussman 38

Tomasulo’s method

• Main differences with scoreboard method:

– decentralized execution control

– decentralized hazard detection

– register renaming by having operands passed to reservation stations
  • eliminates WAW and WAR hazards

CMSC 411 - Alan Sussman 39

Instruction status

• Only 3 kinds of pipeline stages, not 4:

– Issue: send instruction to an empty reservation station, in FIFO order, along with any register contents it needs
  • also renames registers, removing WAR and WAW hazards
– Execute: wait for all operands, check for RAW hazards, then execute the instruction
  • if multiple instructions ready in same cycle for same functional unit, for f.p. units choose one arbitrarily
  • for loads/stores, compute effective address first, execute loads from load buffer as soon as memory unit available, for stores wait for value to be stored before send to memory unit
– Write result: put result on the common data bus (CDB) to get it back to a register and to any other reservation stations that need it
  • stores write to memory here too

CMSC 411 - Alan Sussman 40

Instruction status (cont.)

• Note differences from scoreboard:

– no check for WAW and WAR hazards, since

renaming prevents them

– CDB broadcasts results, so don't have to wait

for registers to be filled

– Load and Store are also treated as functional

units

CMSC 411 - Alan Sussman 41

Advantages over scoreboard

• Distribution of hazard detection logic

– from distributed reservation stations and CDB

– can release multiple instructions at once from

single result (if other operands available)

• Eliminates stalls for WAW and WAR

hazards

– from renaming registers using reservation

stations

– and from storing operands into reservation

station as soon as they are available

CMSC 411 - Alan Sussman 42

Bookkeeping for Tomasulo method

• Each reservation station keeps track of

– Busy flag

– Op: the operation to be performed

– Qj, Qk: sources of operands
  • already present
  • or to come from another reservation station (functional unit)

– Vj, Vk: values of the operands

– A: the memory address info for a load or store

(eventually the effective address)

CMSC 411 - Alan Sussman 43

Bookkeeping (cont.)

• Register hardware keeps track of which

reservation station will fill each register

– Qi – number of reservation station containing

the operation whose result will be stored into

the register

• The load and store buffers have

– busy flag

– value to be stored

– A: effective address
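
The bookkeeping above can be sketched as a small data structure plus a common-data-bus broadcast. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, only the Busy, Op, Vj/Vk and Qj/Qk fields are modeled (no A field, no load/store buffers), and the operand values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # producing station, or None if Vj is present
    qk: Optional[str] = None

    def ready(self) -> bool:
        """An instruction may execute once both operands are present."""
        return self.busy and self.qj is None and self.qk is None


def broadcast(stations, register_status, registers, producer, value):
    """Put (producer, value) on the CDB: waiting stations and registers take it."""
    for rs in stations:
        if rs.busy and rs.qj == producer:
            rs.vj, rs.qj = value, None
        if rs.busy and rs.qk == producer:
            rs.vk, rs.qk = value, None
    for reg in list(register_status):
        if register_status[reg] == producer:  # Qi names this producer
            registers[reg] = value
            register_status[reg] = None


# Hypothetical state: a multiply waiting on results from Add1 and Mult1.
mult2 = ReservationStation("Mult2", busy=True, op="MUL.D", qj="Add1", qk="Mult1")
stations = [mult2]
registers = {"F0": 0.0, "F2": 3.0}
register_status = {"F0": "Add1", "F2": "Mult1"}

broadcast(stations, register_status, registers, "Add1", 7.0)
ready_after_add = mult2.ready()    # still waiting on Mult1
broadcast(stations, register_status, registers, "Mult1", 48.0)
ready_after_mult = mult2.ready()   # both operands captured
```

Because the waiting station copies the value straight off the bus, it does not need to read the register file later, which is what lets a later writer of the same register proceed.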

CMSC 411 - Alan Sussman 44

MIPS FP with Tomasulo’s algorithm

Figure 3.

CMSC 411 - Alan Sussman 45

Instructions (Fig. 3.3)

Instruction      Issue  Execute  Write Result
L.D F6,34(R2)    x      x        x
L.D F2,45(R3)    x      x
MUL.D F0,F2,F4   x
SUB.D F8,F2,F6   x
DIV.D F10,F0,F6  x
ADD.D F6,F8,F2   x

CMSC 411 - Alan Sussman 46

Reservation Stations (Fig. 3.3)

Name   Busy  Op    Vj  Vk             Qj    Qk    A
Load1  no
Load2  yes   Load                                 45+R[R3]
Add1   yes   SUB       Mem[34+R[R2]]  Load
Add2   yes   ADD                      Add1  Load
Add3   no
Mult1  yes   MUL       R[F4]          Load
Mult2  yes   DIV       Mem[34+R[R2]]  Mult

CMSC 411 - Alan Sussman 47

Register Status (Fig. 3.3)

Field  F0     F2     F4  F6    F8    F10   F12  …  F
Qi     Mult1  Load2      Add2  Add1  Mult

Computer Systems Architecture

CMSC 411

Unit 4 – Instruction-level

parallelism

Alan Sussman

April 13, 2006

CMSC 411 - Alan Sussman 55

Time 1

S2: MULT.D F2, F6, F8 (assume 4-cycle mult)

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Issue
Mult2  x       x   x   x   x   x     x
Add1   Add.d   3   4   0   0   S1    Exec
Add2   x       x   x   x   x   x     x

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     0   3   4   6   8   10   12   14
Flag      a1  m1  0   0   0   0    0    0

CMSC 411 - Alan Sussman 56

Time 2

S3: MULT.D F10, F0, F

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult2  Mult.d  x   x   a1  m   S3    Issue
Add1   Add.d   3   4   0   0   S1    Exec
Add2   x       x   x   x   x   x     x

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     0   3   4   6   8   10   12   14
Flag      a1  m1  0   0   0   m2   0    0

CMSC 411 - Alan Sussman 57

Time 3

S4: ADD.D F0, F8, F6

NOT STALLED - F0’s contents already passed to all units that need it!

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult2  Mult.d  7   x   0   m   S3    Issue
Add1   Add.d   3   4   0   0   S1    Write
Add2   Add.d   8   6   0   0   S4    Issue

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     7   3   4   6   8   10   12   14
Flag      a2  m1  0   0   0   m2   0    0

Note: to m2, but F0 not really necessary

CMSC 411 - Alan Sussman 58

Time 4

S5: ADD.D F4, F6, F

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult2  Mult.d  7   x   0   m   S3    Issue
Add1   Add.d   6   6   0   0   S5    Issue
Add2   Add.d   8   6   0   0   S4    Exec

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     7   3   4   6   8   10   12   14
Flag      a2  m1  a1  0   0   m2   0    0

CMSC 411 - Alan Sussman 59

Time 5

S6: ADD.D F2, F2, F6 (stalled)

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Exec
Mult2  Mult.d  7   x   0   m   S3    Issue
Add1   Add.d   6   6   0   0   S5    Exec
Add2   Add.d   8   6   0   0   S4    Exec

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     7   3   4   6   8   10   12   14
Flag      a2  m1  a1  0   0   m2   0    0

CMSC 411 - Alan Sussman 60

Time 6

S6: ADD.D F2, F2, F6 (stalled)

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  Mult.d  6   8   0   0   S2    Write
Mult2  Mult.d  7   48  0   0   S3    Issue
Add1   Add.d   6   6   0   0   S5    Exec
Add2   Add.d   8   6   0   0   S4    Write

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     7   48  4   6   8   10   12   14
Flag      a2  0   a1  0   0   m2   0    0

CMSC 411 - Alan Sussman 61

Time 7

S6: ADD.D F2, F2, F6 (stalled)

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  x       x   x   x   x   x     x
Mult2  Mult.d  7   48  0   0   S3    Exec
Add1   Add.d   6   6   0   0   S5    Write
Add2   Add.d   8   6   0   0   S4    Write

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     14  48  4   6   8   10   12   14
Flag      0   0   a1  0   0   m2   0    0

CMSC 411 - Alan Sussman 62

Time 8

S6: ADD.D F2, F2, F6

Unit   Opn     Fj  Fk  Qj  Qk  Inst  Stage
Mult1  x       x   x   x   x   x     x
Mult2  Mult.d  7   48  0   0   S3    Exec
Add1   Add.d   6   6   0   0   S5    Write
Add2   Add.d   48  6   0   0   S6    Issue

Register  F0  F2  F4  F6  F8  F10  F12  F
Value     14  48  12  6   8   10   12   14
Flag      0   a2  0   0   0   m2   0    0

CMSC 411 - Alan Sussman 63

Loop-based example

• To show power of eliminating WAR and WAW

hazards from dynamic register renaming

Loop: L.D F0, 0(R1)
      MUL.D F4,F0,F2
      S.D F4,0(R1)
      DADDUI R1,R1,-
      BNE R1,R2,Loop

• If branches are predicted taken, multiple loop iterations can execute at same time – dynamic loop unrolling

CMSC 411 - Alan Sussman 64

Example (Fig. 3.6)

• Assume all instructions issued in two successive iterations of loop, but none of f.p. loads/stores or operations has completed

– don’t show integer ALU op, assume branch predicted as taken

• Once system reaches this state, 2 copies of loop could be sustained, with a CPI close to 1.

– if multiplies take less than 4 cycles to complete

CMSC 411 - Alan Sussman 65

Instruction status (Fig. 3.6)

Instruction     From iteration  Issue  Execute  Write Result
L.D F0,0(R1)    1               x      x
MUL.D F4,F0,F2  1               x
S.D F4,0(R1)    1               x
L.D F0,0(R1)    2               x      x
MUL.D F4,F0,F2  2               x
S.D F4,0(R1)    2               x

CMSC 411 - Alan Sussman 66

Reservation stations (Fig. 3.6)

Name    Busy  Op     Vj       Vk     Qj    Qk    A
Load1   yes   Load                               R[R1]+
Load2   yes   Load                               R[R1]-
Add1    no
Add2    no
Add3    no
Mult1   yes   MUL             R[F2]  Load
Mult2   yes   MUL             R[F2]  Load
Store1  yes   Store  R[R1]                 Mult
Store2  yes   Store  R[R1]-8               Mult

CMSC 411 - Alan Sussman 73

Branch-prediction buffers (cont.)

• When to access the buffer:

– fetch the contents during the ID pipeline cycle
– update the contents after the branch is resolved

• What is stored in the buffer:

– perhaps one bit per branch, indicating whether the branch was last taken or not
  • predict taken if it was taken last time
  • otherwise predict not taken

Example: If the branch is for an inner loop, the prediction will be wrong on the 1st and last iterations, each time the loop is executed

CMSC 411 - Alan Sussman 74

Branch-prediction buffers (cont.)

• What is stored in the buffer (cont.):

– perhaps a counter for each branch. If have 3 bits, for example, start the count at 4
  • add one every time the branch is taken
  • subtract one every time the branch is not taken
  • predict taken if the count is 4 or more
  • predict not taken if the count is 3 or less

– perhaps a correlating predictor that counts for this branch and for 1 or more earlier ones
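
The difference between the one-bit scheme and the counter scheme shows up on an inner-loop branch. The sketch below is a minimal simulation under assumptions: the class names and the `mispredictions` helper are hypothetical, the counter is assumed to saturate at 0 and at its maximum value (the slide gives only the start value and the threshold), and the loop is a made-up one whose branch is taken 9 times and then falls through, executed 3 times.

```python
class OneBitPredictor:
    """Predict whatever the branch did last time; start at not taken."""

    def __init__(self):
        self.last_taken = False

    def predict(self):
        return self.last_taken

    def update(self, taken):
        self.last_taken = taken


class CounterPredictor:
    """Counter per branch: with 3 bits, start at 4 and predict taken if >= 4."""

    def __init__(self, bits=3):
        self.top = (1 << bits) - 1        # assumed saturation point
        self.threshold = 1 << (bits - 1)
        self.count = self.threshold

    def predict(self):
        return self.count >= self.threshold

    def update(self, taken):
        if taken:
            self.count = min(self.top, self.count + 1)
        else:
            self.count = max(0, self.count - 1)


def mispredictions(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Hypothetical inner loop: taken 9 times, then not taken; run 3 times.
inner_loop = ([True] * 9 + [False]) * 3

one_bit_misses = mispredictions(OneBitPredictor(), inner_loop)
counter_misses = mispredictions(CounterPredictor(bits=3), inner_loop)
```

On this trace the one-bit predictor misses on the first and last iteration of every loop execution, while the counter misses only on the loop exit, because a single not-taken outcome does not flip its prediction.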

CMSC 411 - Alan Sussman 75

How good are they? – Fig. 3.

For a 4096 entry 2-bit prediction buffer

Computer Systems Architecture

CMSC 411

Unit 4 – Instruction-level

parallelism

Alan Sussman

April 18, 2006

CMSC 411 - Alan Sussman 77

Administrivia

• Midterm questions?

• Project 1 grades emailed (with minor error)

• Project 2 questions?

• anyone looking for a summer job?

CMSC 411 - Alan Sussman 78

Last time

• Tomasulo dynamic scheduling
– detailed example showed benefit of forwarding results directly to reservation stations, no WAR/WAW hazards
– still get structural hazards (not enough functional units, CDB)
– loop example showed instructions from multiple loop iterations in execution at same time, to get CPI close to 1.
– need care with loads/stores to same address to avoid WAR, WAW hazards
  • check contents of load/store buffers (res. stations)
• Branch prediction
– to reduce control hazards
– branch prediction buffer – keep track of recent branch history, to predict which way it will go in future
  • fetch during ID stage, to get next instruction
  • update after branch resolved

CMSC 411 - Alan Sussman 79

Correlating branch predictors

• To improve prediction accuracy, look at behavior of recent other branches than the one trying to predict
• Take advantage of correlation between behavior of different branches
– also called two-level predictors

Simple example:

if (d == 0) d = 1;
if (d == 1) …

MIPS code, with d in R1:

      BNEZ R1,L1       ; branch b1
      DADDIU R1,R0,#
L1:   DADDIU R3,R1,#-
      BNEZ R3,L2       ; branch b2
      …
L2:

CMSC 411 - Alan Sussman 80

Example (cont.) – Fig. 3.

Initial value of d  d==0?  b1         Value of d before b2  d==1?  b2
0                   yes    not taken  1                     yes    not taken
1                   no     taken      1                     yes    not taken
2                   no     taken      2                     no     taken

b1 not taken → b2 not taken

CMSC 411 - Alan Sussman 81

Example – 1 bit predictor

Predictor initialized to not taken – Fig. 3.

d=?  b1 prediction  b1 action  New b1 prediction  b2 prediction  b2 action  New b2 prediction
2    NT             T          T                  NT             T          T
0    T              NT         NT                 T              NT         NT
2    NT             T          T                  NT             T          T
0    T              NT         NT                 T              NT         NT

All branches mispredicted!

CMSC 411 - Alan Sussman 82

1-bit prediction with 1 bit correlation

Initialized to not taken/not taken – Fig. 3.

d=?  b1 prediction  b1 action  New b1 prediction  b2 prediction  b2 action  New b2 prediction
2    NT/NT          T          T/NT               NT/NT          T          NT/T
0    T/NT           NT         T/NT               NT/T           NT         NT/T
2    T/NT           T          T/NT               NT/T           T          NT/T
0    T/NT           NT         T/NT               NT/T           NT         NT/T

Only misprediction is on first iteration/row
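
Both tables can be reproduced with a short simulation of the b1/b2 example, with d alternating between 2 and 0. This is a sketch under assumptions: the `CorrelatingPredictor` class and `run` helper are hypothetical names, `m` is the number of global-history bits and `n` the counter width, counters live in a dictionary keyed by a branch label rather than by low-order address bits, and the initial history is taken to be "not taken".

```python
class CorrelatingPredictor:
    """m bits of global history select one of 2**m n-bit counters per branch."""

    def __init__(self, m, n):
        self.m, self.n = m, n
        self.history = 0      # last m branch outcomes, newest in the low bit
        self.counters = {}    # (branch label, history) -> counter, default 0

    def predict(self, branch):
        count = self.counters.get((branch, self.history), 0)
        return count >= (1 << (self.n - 1))

    def update(self, branch, taken):
        key = (branch, self.history)
        count = self.counters.get(key, 0)
        top = (1 << self.n) - 1
        self.counters[key] = min(top, count + 1) if taken else max(0, count - 1)
        # Shift the outcome into the history register, keeping m bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def run(predictor, d_values):
    """Count mispredictions of b1 and b2 for a sequence of initial d values."""
    misses = 0
    for d in d_values:
        b1_taken = d != 0          # BNEZ R1,L1
        if not b1_taken:
            d = 1                  # fall through: d = 1
        b2_taken = d != 1          # BNEZ R3,L2 with R3 = d - 1
        for branch, taken in (("b1", b1_taken), ("b2", b2_taken)):
            if predictor.predict(branch) != taken:
                misses += 1
            predictor.update(branch, taken)
    return misses


one_bit_misses = run(CorrelatingPredictor(m=0, n=1), [2, 0, 2, 0])
correlated_misses = run(CorrelatingPredictor(m=1, n=1), [2, 0, 2, 0])
```

With no history every one of the 8 branch executions is mispredicted; with one bit of history only the two predictions in the first row miss.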

CMSC 411 - Alan Sussman 83

Correlating predictors (cont.)

• Predictor on last slide is a (1,1) predictor

– uses behavior of last branch to choose from among a pair of 1-bit branch predictors

• In general, an (m,n) predictor uses behavior of last m branches to choose from 2^m branch predictors

– each is an n-bit predictor for a single branch
– simple hardware to do this – can keep global history of most recent m branches in an m-bit shift register, and each bit records whether branch taken/not taken
  • and index branch prediction buffer using low-order bits of branch address with m-bit global history

CMSC 411 - Alan Sussman 84

Comparing predictors

• Correlating vs. standard 2-bit scheme

– using same number of state bits

• Number of state bits for (m,n) predictor is:

– 2^m × n × # prediction entries to select from with the branch address

• 2-bit predictor with no global history is a (0,2) predictor

– (0,2) predictor from Fig. 3.8 had 4K entries selected by branch address
  • total number of bits is 2^0 × 2 × 4K = 8K bits
– (2,2) predictor from Fig. 3.14 has 64 entries, with 4 entries per branch address
  • total number of bits is 2^2 × 2 × 16 = 128 bits
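
The state-bit count is easy to check numerically. The helper below is a hypothetical one-liner that evaluates 2^m × n × (entries selected by the branch address) for the two predictors mentioned above, taking 4K to mean 4096.

```python
def predictor_state_bits(m, n, entries_selected_by_address):
    """State bits of an (m,n) predictor: 2**m counters of n bits per entry."""
    return (2 ** m) * n * entries_selected_by_address


bits_0_2 = predictor_state_bits(0, 2, 4096)  # (0,2) predictor with 4K entries
bits_2_2 = predictor_state_bits(2, 2, 16)    # (2,2) predictor, 64 counters total
```

The second call uses 16 because 64 total entries with 4 per branch address leaves 16 entries to be selected by the address bits.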

CMSC 411 - Alan Sussman 91

BTB algorithm – Fig. 3.

CMSC 411 - Alan Sussman 92

BTB penalties – Fig. 3.

Instruction in buffer  Prediction  Actual branch  Penalty cycles
yes                    taken       taken          0
yes                    taken       not taken      2
no                                 taken          2
no                                 not taken      0
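
Given the penalty table, an average branch penalty follows from how often each penalized row occurs. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the hit rate, prediction accuracy and taken fraction are made-up inputs rather than figures from the course.

```python
def expected_branch_penalty(hit_rate, accuracy, taken_rate_on_miss, penalty=2):
    """Average penalty cycles per branch, from the two penalized table rows."""
    # In the buffer and predicted taken, but actually not taken.
    wrong_in_buffer = hit_rate * (1.0 - accuracy)
    # Not in the buffer, but actually taken.
    taken_not_in_buffer = (1.0 - hit_rate) * taken_rate_on_miss
    return (wrong_in_buffer + taken_not_in_buffer) * penalty


# Hypothetical rates (assumed): 90% buffer hits, 90% of hits predicted
# correctly, and 60% of the branches missing from the buffer are taken.
example_penalty = expected_branch_penalty(0.9, 0.9, 0.6)
```

The two rows with a penalty of 0 contribute nothing, so only the misprediction and the taken-but-absent cases appear in the sum.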

Multiple Issue

CMSC 411 - Alan Sussman 94

Issuing multiple instructions at

one time

• Scoreboard and Tomasulo’s method reduce stalls

from data hazards

• Branch buffers reduce stalls from control hazards

• Now talk about getting more parallelism by

issuing several instructions at once

• Two currently used methods:

– superscalar processors
– VLIW (very long instruction word) processors

CMSC 411 - Alan Sussman 95

5 approaches – Fig. 3.

Common name                Issue structure  Hazard detection  Scheduling           Interesting feature        Examples
Superscalar (static)       dynamic          hardware          static               in-order execution         Sun UltraSparc II/III
Superscalar (dynamic)      dynamic          hardware          dynamic              some o-o-o execution       IBM Power
Superscalar (speculative)  dynamic          hardware          dynamic, with spec.  o-o-o, with speculation    PIII/4, R10K, Alpha 21264
VLIW/LIW                   static           software          static               no hazards bet. packets    Trimedia, Intel i
EPIC                       mostly static    mostly software   mostly static        explicit dep. by compiler  Itanium

CMSC 411 - Alan Sussman 96

Superscalar MIPS

• Idea: issue several non-interfering instructions at

each clock cycle

• The hardware has the burden of detecting

interfering instructions

• Can be statically scheduled (using compiler

techniques), or dynamically scheduled (using

scoreboard or Tomasulo’s algorithm)

• For example, look at a superscalar MIPS processor

that can issue two instructions at once:

– one floating point instruction
– one non-floating point (integer or branch) instruction

CMSC 411 - Alan Sussman 97

Superscalar MIPS (cont.)

• 3 steps in instruction fetch and issue:

– fetch 2 instructions from cache

– determine whether 0, 1, or 2 can issue

  • look at hazards from earlier instructions issued, and from within the 2 fetched instructions
  • since 1 integer and 1 FP instruction can issue, mostly just look at opcodes for hazards within fetched pair, unless integer instruction is FP load/store/move – possible RAW hazard
  • also need another read/write port on FP register file, to allow loads/stores to issue with FP operations, and (lots) more bypass paths

– issue the instructions to the correct functional

unit

Computer Systems Architecture

CMSC 411

Unit 4 – Instruction-level

parallelism

Alan Sussman

April 20, 2006

CMSC 411 - Alan Sussman 99

Administrivia

• Midterm questions?

• Project 2 questions?

– due next Thursday, April 27

• Guest lecturer Tuesday, (short) class next

Thursday to finish up Unit 4b

CMSC 411 - Alan Sussman 100

Last time

• Branch prediction
– for a given branch instruction, local predictor uses history of that branch inst.
– correlating predictor uses history of all recently executed branches
– two-level predictor combines both – also called a tournament predictor
– branch target buffer also stores and uses address of next predicted instruction, to reduce branch penalty further
• Multiple instruction issue
– to issue and complete more than one instruction per cycle
– superscalar and VLIW are dynamic and static techniques, respectively
– superscalar MIPS from example can issue 0, 1, or 2 instructions per cycle, depending on structural and data hazards

CMSC 411 - Alan Sussman 101

Example

x is an array of 1000 double precision numbers, indexed from 1 to 1000

for i=1000, 999, ..., 1
    x[i] = x[i] + s;

Unroll the loop – execute 4 iterations before branch (for now don’t worry about loop test)
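
At the source level, the transformation can be sketched in Python before looking at machine code. This is an illustration only: the function names are hypothetical, the array is 0-indexed (unlike the 1-indexed array in the example), and the unrolled version assumes the length is a multiple of 4 so no cleanup loop is needed.

```python
def add_scalar(x, s):
    """Original loop: one element and one loop test per iteration."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s


def add_scalar_unrolled4(x, s):
    """Unrolled by 4: four independent element updates per loop test."""
    if len(x) % 4 != 0:
        raise ValueError("this sketch assumes a length that is a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4  # one index update and one branch for four elements


original = [float(k) for k in range(1000)]
unrolled = list(original)
add_scalar(original, 2.5)
add_scalar_unrolled4(unrolled, 2.5)
```

The four updates in the unrolled body touch different elements, so they carry no data dependences on each other; that independence is what a scheduler can exploit.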

Loop: L.D F0, 0(R1)        ; x[i] = x[i] + s
      ADD.D F4, F0, F2     ; uses F0 and F2
      S.D F4, 0(R1)
      L.D F6, -8(R1)       ; x[i-1] = x[i-1] + s
      ADD.D F8, F6, F2     ; uses F6 and F2
      S.D F8, -8(R1)
      L.D F10, -16(R1)     ; x[i-2] = x[i-2] + s
      ADD.D F12, F10, F2   ; uses F10 and F2
      S.D F12, -16(R1)
      L.D F14, -24(R1)     ; x[i-3] = x[i-3] + s
      ADD.D F16, F14, F2   ; uses F14 and F2
      S.D F16, -24(R1)
      DSUBI R1, R1, #32    ; point to next element
      BNEZ R1, Loop

CMSC 411 - Alan Sussman 102

Example (cont.)

On single issue MIPS pipeline, takes 14 cycles

Cycle  FP instruction       non-FP instruction
1                           L.D F0, 0(R1)
2                           L.D F6, -8(R1)
3      ADD.D F4, F0, F2     L.D F10, -16(R1)
4      ADD.D F8, F6, F2     L.D F14, -24(R1)
5      ADD.D F12, F10, F2   DSUBI R1, R1, #32 (changes later offsets)
6      ADD.D F16, F14, F2   S.D F4, 32(R1)
7                           S.D F8, 24(R1)
8                           S.D F12, 16(R1)
9                           BNEZ R1, Loop (delayed branch)
10                          S.D F16, 8(R1)

CMSC 411 - Alan Sussman 109

Hardware speculation (cont.)

• 3 key ideas

– dynamic branch prediction, to choose which instructions to execute
– speculation, to allow executing instructions before branches resolved (and effects undone if speculation was wrong)
– dynamic scheduling, to deal with scheduling combinations of basic blocks

• Essentially a data flow execution

– operations execute as soon as operands available

• We’ll use Tomasulo’s algorithm for dynamic

scheduling, and just for FP unit

CMSC 411 - Alan Sussman 110

Extensions to Tomasulo’s algorithm

• Separate bypassing of results between

instructions from actual completion of an

instruction

– so that an instruction can execute and produce

results for other instructions, without

performing any updates that can’t be undone –

until know that the instruction should have been

executed (control dependences resolved)

– when instruction is no longer speculative:

  • update the register file, or memory
  • call this instruction commit

CMSC 411 - Alan Sussman 111

Implementing speculation

• Allow instructions to execute out-of-order, but

make them commit in order

– and prevent any action that can’t be undone (update state, take an exception, etc.) until instruction commits

• Requires some changes to standard pipeline

sequence, and hardware buffers to hold results of

instructions that have completed execution, but

have not committed yet

– a reorder buffer
– also used to pass results between speculated instructions

CMSC 411 - Alan Sussman 112

Reorder buffer (ROB)

• Provides registers, like reservation stations

• Holds instruction result from time operation

completes until instruction commits

• But no writeback to register file until instruction

commits

– so ROB supplies operands in this time interval

• Similar to store buffer in Tomasulo’s algorithm

– in design shown, store buffer integrated into ROB

CMSC 411 - Alan Sussman 113

Reorder buffer (cont.)

• 4 fields in an ROB entry

– instruction type – branch, store, register op

(ALU or load)

– destination – register number or memory

address

– value – result of instruction

– ready – set when instruction completes

execution, and value is ready

CMSC 411 - Alan Sussman 114

MIPS with Tomasulo + speculation

Fig. 3.

CMSC 411 - Alan Sussman 115

Instruction execution – 4 steps

• Issue

– get instruction from instruction queue

– issue if both reservation station and empty ROB

slot available

– send operands to reservation station if available

in either registers or ROB

– send ROB entry number to reservation station

for writing result

– if reservations stations or ROB full, instruction

stalls

CMSC 411 - Alan Sussman 116

Instruction execution (cont.)

• Execute

– If an operand is not yet available, monitor CDB

until it appears

– When all operands available at reservation

station, execute the operation

– Instructions can take multiple cycles in this

stage, loads require 2 steps (compute effective

address and memory access), stores only need

to compute effective address

CMSC 411 - Alan Sussman 117

Instruction execution (cont.)

• Write result

– When result available, write on CDB (with

ROB tag set in issue step), and from CDB to

ROB, plus any reservation stations waiting

– Mark reservation station available

– For stores, if value to be stored is available,

write into Value field of ROB entry for store

  • otherwise, wait until it appears on CDB, then update Value field of ROB entry

CMSC 411 - Alan Sussman 118

Instruction execution (cont.)

• Commit

– For “normal” case (instruction reaches head of ROB and its result is in the buffer), update the register with the result and remove instruction from ROB
– For a store, update memory instead of result register
– For an incorrectly predicted branch, the speculation was wrong – flush the ROB and reservation stations, and restart execution at correct branch target
  • correctly predicted branch is just finished
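
The commit rules can be sketched with a queue of reorder-buffer entries. This is a minimal model, not the course's design: the `ROBEntry` class and `commit` function are hypothetical, the `mispredicted` flag is an added assumption beyond the four ROB fields listed earlier, reservation stations are not modeled, and the example values are made up.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ROBEntry:
    kind: str                     # "register", "store" or "branch"
    dest: Optional[str] = None    # register name or memory address
    value: Optional[float] = None
    ready: bool = False
    mispredicted: bool = False    # assumed extra flag, used for branches only


def commit(rob, registers, memory):
    """Commit finished instructions from the head of the ROB, in order."""
    committed = 0
    while rob and rob[0].ready:
        entry = rob.popleft()
        committed += 1
        if entry.kind == "branch":
            if entry.mispredicted:
                rob.clear()       # flush everything issued after the branch
                break
        elif entry.kind == "store":
            memory[entry.dest] = entry.value
        else:
            registers[entry.dest] = entry.value
    return committed


registers, memory = {"F0": 0.0, "F6": 0.0}, {}

# A slow multiply at the head blocks a finished add behind it.
mul = ROBEntry("register", "F0")
add = ROBEntry("register", "F6", value=9.0, ready=True)
rob = deque([mul, add])
blocked = commit(rob, registers, memory)      # head not ready yet

mul.value, mul.ready = 12.0, True
released = commit(rob, registers, memory)     # both commit, in order

# A mispredicted branch discards the speculative work behind it.
rob = deque([ROBEntry("branch", ready=True, mispredicted=True),
             ROBEntry("register", "F6", value=-1.0, ready=True)])
commit(rob, registers, memory)
```

The add finishes executing first but only updates F6 after the multiply has committed, and the instruction behind the mispredicted branch never changes register state.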

CMSC 411 - Alan Sussman 119

Example – Fig. 3.

L.D F6,34(R2)

L.D F2,45(R3)

MUL.D F0,F2,F

SUB.D F8,F6,F

DIV.D F10,F0,F

ADD.D F6,F8,F

• Assume 2 cycle add, 10

cycle multiply, 40 cycle

divide

• Status tables will show

when MUL.D is ready to

commit

CMSC 411 - Alan Sussman 120

Example (cont.)

Reservation stations

Name   Busy  Op     Vj             Vk             Qj  Qk  Dest  A
Load1  no
Load2  no
Add1   no
Add2   no
Add3   no
Mult1  no    MUL.D  Mem[45+R[R3]]  R[F4]                  #
Mult2  yes   DIV.D                 Mem[34+R[R2]]  #3      #
