Dynamic Branch Prediction: Techniques and Challenges, Study notes of Computer Architecture and Organization

University of California - Davis Computer Architecture and Organization

These notes cover dynamic branch prediction techniques, including 2-bit counters, tournament predictors, and branch history tables. They discuss the problems with static branch prediction, the benefits of dynamic prediction, and the accuracy of different schemes. They also cover the use of local history and the correlation between branches.

Typology: Study notes

Pre 2010

Uploaded on 07/30/2009

Branch Prediction and Multiple-Issue Processors

Venkatesh Akella

EEC 270

Winter 2004

Based on material provided by Prof. Al Davis and Prof. David Culler

Branch Prediction

•Size of basic blocks is limited to 4-7 instructions
•Delayed branches are not a solution in multiple-issue processors
•Why? It is hard to find independent instructions, and remember the mess they create for precise exceptions
•To resolve a branch you need two things: (a) the branch target address and (b) the branch direction
•Prediction deals with (b), i.e. getting the direction
•Branch penalty (BP) is governed by (a)
•Deeper pipeline – bad news, as the BP is higher

Static Branch Prediction

•Let the compiler figure out the branch direction for each branch instruction

Three strategies:

a) Always Predict Taken - Misprediction is 34%

b) Forward Not Taken; Backward Taken - Misprediction is 10% - 40%

c) Profile-driven – using realistic benchmarks and real data, determine the direction for each branch – Hennessy & McFarling and Larus and Ball

Dynamic Branch Prediction

•Run time
•Hardware assisted
•Intuition – branch direction is not random; branches are BIMODAL, i.e. either strongly taken or strongly not taken
•One-bit Branch Prediction Buffer or Branch History Table (BHT) – Smith 1981

[Figure: K bits of the PC index a table of 1-bit entries; 1 = Taken, 0 = Not Taken]

•The past is a good indicator of the future
•Update the BHT when you make a mistake

What are the problems?

a) Aliasing due to the limited size of the BHT (a tag can be stored to avoid this problem)

b) 1-bit history may not be sufficient. Eg: consider a loop that iterates 10 times – you will mispredict 2/10 (once on loop exit and once on re-entering the loop), so accuracy is 80%
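The 2-in-10 figure can be checked with a small simulation. This is only a sketch: the function name `simulate_one_bit`, the initial state, and the choice of running the loop 100 times are assumptions made for illustration, not part of the slides.

```python
def simulate_one_bit(outcomes, initial=1):
    """Count mispredictions of a single 1-bit BHT entry."""
    state = initial              # 1 = predict taken, 0 = predict not taken
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
            state = taken        # the entry is rewritten only on a mistake
    return misses


# Loop-closing branch: taken 9 times, then not taken on exit.
# The whole loop is executed 100 times (illustrative assumption).
loop = [1] * 9 + [0]
outcomes = loop * 100
misses = simulate_one_bit(outcomes)
print(misses, "mispredictions out of", len(outcomes))
```

Apart from the very first pass, every execution of the loop costs two mispredictions: one at loop exit and one on the first iteration of the next execution. That gives 199 mispredictions in 1000 branches here, i.e. roughly 80% accuracy.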


Better solution: a 2-bit scheme that changes the prediction only after mispredicting twice
Adds hysteresis to the decision-making process

Dynamic Branch Prediction (Jim Smith, 1981)

[Figure: state diagram of the 2-bit scheme, with Predict Taken and Predict Not Taken states linked by T (taken) and NT (not taken) transitions]

2-bit Counters

Up to 93.5% accuracy
If K is sufficiently large, each branch maps to a unique counter
Can store tags if you want to avoid aliasing
How about m-bit counters?
- Doesn’t benefit much

[Figure: K bits of the PC index a table of 2^K 2-bit saturating counters]
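A table of 2^K 2-bit saturating counters can be sketched as follows. The class name `TwoBitBHT`, the weakly-taken initial value, and the use of `pc >> 2` as the index (which assumes 4-byte instructions) are illustrative assumptions rather than details from the slides.

```python
class TwoBitBHT:
    """Branch history table of 2**k 2-bit saturating counters."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.counters = [2] * (1 << k)   # 0,1 = predict not taken; 2,3 = predict taken

    def _index(self, pc):
        return (pc >> 2) & self.mask     # assumes 4-byte instructions

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)


bht = TwoBitBHT(k=10)
branch_pc = 0x1000                       # hypothetical branch address
outcomes = ([True] * 9 + [False]) * 100  # 10-iteration loop, run 100 times
misses = 0
for taken in outcomes:
    if bht.predict(branch_pc) != taken:
        misses += 1
    bht.update(branch_pc, taken)
print(misses, "mispredictions out of", len(outcomes))
```

On the same 10-iteration loop, the counter only drops from 3 to 2 at loop exit, so the first iteration of the next execution is still predicted taken. The cost falls to one misprediction per loop execution (100 in 1000). Two branches whose addresses differ by a multiple of 2^(K+2) bytes share a counter in this sketch, which is the aliasing problem mentioned earlier.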

How do you improve further?

Can we capture the actual history of the specific branch and use that to make our prediction? – LOCAL HISTORY
Can we capture the sequential correlation between branches? – GLOBAL HISTORY
Can we do both? Make multiple predictions and choose the right prediction based on the context of the particular branch – TOURNAMENT predictors

Using Local History

Consider the simple for loop FOR (I=1; I<5; I++) { something … }
If the branch is at the end of the loop body, it has the following pattern: (1110)^N
The sequence of the branch history is 11101110111011101110 ……
Basically, if we know what the branch did the last three times, we can predict EXACTLY what it will do next.

[Figure: k bits of the PC index a local history table (LHT) of 3-bit local histories; the selected history indexes a pattern history table (PHT) of 2^3 counters (1-bit, 2-bit, or m-bit). Slide annotations: 97.1% accuracy, 2 memory accesses]
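The two-level lookup in the figure can be sketched as below. The class name `LocalHistoryPredictor`, the table sizes, the single shared PHT, and the initial counter values are assumptions chosen for illustration.

```python
class LocalHistoryPredictor:
    """Two-level predictor: per-branch history selects a 2-bit counter."""

    def __init__(self, k=4, history_bits=3):
        self.pc_mask = (1 << k) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.lht = [0] * (1 << k)              # local history table
        self.pht = [2] * (1 << history_bits)   # pattern history table (2-bit counters)

    def _slot(self, pc):
        return (pc >> 2) & self.pc_mask        # assumes 4-byte instructions

    def predict(self, pc):
        return self.pht[self.lht[self._slot(pc)]] >= 2

    def update(self, pc, taken):
        slot = self._slot(pc)
        history = self.lht[slot]
        c = self.pht[history]
        self.pht[history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the newest outcome into the branch's history register.
        self.lht[slot] = ((history << 1) | int(taken)) & self.hist_mask


predictor = LocalHistoryPredictor()
branch_pc = 0x40                               # hypothetical branch address
outcomes = [True, True, True, False] * 50      # the (1110)^N pattern
misses = 0
for taken in outcomes:
    if predictor.predict(branch_pc) != taken:
        misses += 1
    predictor.update(branch_pc, taken)
print(misses, "mispredictions out of", len(outcomes))
```

With these initial values the only misprediction is the first loop exit. After that, the history 111 always selects a counter that has learned "not taken", and the other three histories select counters that have learned "taken".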

Accuracy of Different Schemes (Figure 3.15, p. 206)

[Figure: frequency of mispredictions for three predictors: 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2,2) BHT]

Re-evaluating Correlation

Several of the SPEC benchmarks have fewer than a dozen branches responsible for 90% of taken branches:

program     branch %   static   # = 90%
compress    14%        236
eqntott     25%        494
gcc         15%        9531
mpeg        10%        5598
real gcc    13%        17361    3214

Real programs + OS are more like gcc
Small benefits beyond benchmarks for correlation? Problems with branch aliases?

BHT Accuracy

Mispredict because either:
- Wrong guess for that branch
- Got branch history of wrong branch when indexing the table
4096-entry table: misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
For SPEC92, 4096 entries are about as good as an infinite table

Tournament Predictors

Motivation for hybrid branch predictors: the 2-bit predictor failed on important branches; by adding global information, performance improved
Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine with a selector
Hopes to select the right predictor for the right branch (or right context of branch)

Tournament Predictor in Alpha 21264

4K 2-bit counters to choose from among a global predictor and a local predictor
Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
- 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
Local predictor consists of a 2-level predictor:
- Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
- Next level: selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
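The structure described above can be sketched in a few dozen lines. The table sizes follow the slide; everything else is an assumption for illustration: the class and helper names, the initial counter values, indexing the chooser by global history, indexing the local history table with `pc >> 2`, and training the chooser only when the two components disagree.

```python
def bump(counter, taken, maximum):
    """Saturating increment on taken, decrement on not taken."""
    return min(counter + 1, maximum) if taken else max(counter - 1, 0)


class TournamentPredictor:
    def __init__(self):
        self.global_history = 0             # outcomes of the last 12 branches
        self.global_counters = [1] * 4096   # 2-bit, indexed by global history
        self.chooser = [1] * 4096           # 2-bit, >= 2 means "use global" (assumed indexing)
        self.local_history = [0] * 1024     # 10-bit history per entry
        self.local_counters = [3] * 1024    # 3-bit, indexed by local history

    def storage_bits(self):
        return 4096 * 2 + 4096 * 2 + 1024 * 10 + 1024 * 3

    def _components(self, pc):
        slot = (pc >> 2) & 1023             # assumed index into the local history table
        local_pred = self.local_counters[self.local_history[slot]] >= 4
        global_pred = self.global_counters[self.global_history] >= 2
        return slot, local_pred, global_pred

    def predict(self, pc):
        _, local_pred, global_pred = self._components(pc)
        use_global = self.chooser[self.global_history] >= 2
        return global_pred if use_global else local_pred

    def update(self, pc, taken):
        slot, local_pred, global_pred = self._components(pc)
        gh, lh = self.global_history, self.local_history[slot]
        if local_pred != global_pred:
            # Move the selector toward whichever component was right.
            self.chooser[gh] = bump(self.chooser[gh], global_pred == taken, 3)
        self.global_counters[gh] = bump(self.global_counters[gh], taken, 3)
        self.local_counters[lh] = bump(self.local_counters[lh], taken, 7)
        self.global_history = ((gh << 1) | int(taken)) & 0xFFF
        self.local_history[slot] = ((lh << 1) | int(taken)) & 0x3FF


def count_misses(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


predictor = TournamentPredictor()
pattern = [True, True, True, False]
warmup_misses = count_misses(predictor, 0x400, pattern * 100)
steady_misses = count_misses(predictor, 0x400, pattern * 100)
print(predictor.storage_bits(), warmup_misses, steady_misses)
```

`storage_bits()` reproduces the 29K-bit total (29 × 1024 = 29,696). On a single repeating 1110 branch both components learn the pattern during warm-up, so the second batch of 400 outcomes is predicted without error in this sketch.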

% of predictions from local predictor in Tournament Prediction Scheme

Accuracy of Branch Prediction

Profile: branch profile from last execution (static in that it is encoded in the instruction, but profile-based)

fig 3.

Accuracy v. Size (SPEC89)

Branch Folding

Instead of storing the next PC or BTA (branch target address), how about storing the target instruction itself, or multiple instructions if it is a multi-issue processor?

Eg: L2 : b L

L : add R1, R2, R

At the address corresponding to L2, you store the add instruction instead of the unconditional branch instruction b L

ZERO cycle BRANCH (eliminated one instruction altogether)

L2: Add r1,r2, r

Advanced Approaches

Trace Caches – aggressive prefetching
Return Address Caches – jr $Ra – when $Ra is the return address of a procedure.
85% of indirect jumps are due to procedure returns.
BTB does not work very well because the procedure is called from many different places
So you use a separate stack cache to push $Ra values and pop them off

Special Case Return Addresses

Register indirect branch: hard to predict address
SPEC89: 85% of such branches are procedure returns
Since procedures follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries has a small miss rate
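A return address stack of this kind can be sketched as below. The class name, the method names, and the overflow policy (dropping the oldest entry when the buffer is full) are assumptions for illustration; the slides only give the 8 to 16 entry size.

```python
class ReturnAddressStack:
    """Small hardware-style stack that predicts procedure return targets."""

    def __init__(self, entries=8):
        self.entries = entries
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.entries:
            self.stack.pop(0)            # overflow: drop the oldest entry (assumed policy)
        self.stack.append(return_address)

    def predict_return(self):
        # None means "no prediction available" (stack empty).
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(entries=8)
calls = [0x100 + 4 * i for i in range(10)]     # 10 nested calls, hypothetical addresses
for address in calls:
    ras.on_call(address)

correct = 0
for address in reversed(calls):                # returns happen in reverse call order
    if ras.predict_return() == address:
        correct += 1
print(correct, "of", len(calls), "returns predicted")
```

With 10 nested calls and 8 entries, the 8 innermost returns are predicted and the 2 outermost miss. This is why a small stack is enough unless call depth regularly exceeds its size.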

Pitfall: Sometimes bigger and dumber is better

21264 uses tournament predictor (29 Kbits)
Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
SPEC95 benchmarks: 21264 outperforms
- 21264 avg. 11.5 mispredictions per 1000 instructions
- 21164 avg. 16.5 mispredictions per 1000 instructions
Reversed for transaction processing (TP)!
- 21264 avg. 17 mispredictions per 1000 instructions
- 21164 avg. 15 mispredictions per 1000 instructions
TP code much larger & 21164 holds 2X branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)

Dynamic Branch Prediction Summary

Prediction becoming important part of scalar execution
Branch History Table: 2 bits for loop accuracy
Correlation: Recently executed branches correlated with next branch.
- Either different branches
- Or different executions of same branches
Tournament Predictor: more resources to competitive solutions and pick between them
Branch Target Buffer: include branch address & prediction
Predicated Execution can reduce number of branches, number of mispredicted branches
Return address stack for prediction of indirect jump

Multiple Issue

Goal: how to reduce CPI below 1.
Consider two consecutive blocks of instructions

Gj = {i1, i2, i3, i4} and Gi = {i5, i6, i7, i8}

Gj is already in execution

1. Fetch Gi

2. Check for all structural hazards that instructions in Gj may introduce

3. Check for data hazards between instructions in Gi and between Gi and Gj

4. Read operands and execute
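The data-hazard check in step 3 can be sketched as a comparison of register names. The `Instr` tuple, the function names, and the example registers are hypothetical; structural hazards (step 2) depend on the functional units available and are not modeled here.

```python
from collections import namedtuple

# dest is a register name or None; srcs is a tuple of register names.
Instr = namedtuple("Instr", "name dest srcs")


def data_hazards(earlier, later):
    """List (kind, earlier_name, later_name) for every register conflict."""
    hazards = []
    for a in earlier:
        for b in later:
            if a.dest is not None and a.dest in b.srcs:
                hazards.append(("RAW", a.name, b.name))
            if a.dest is not None and a.dest == b.dest:
                hazards.append(("WAW", a.name, b.name))
            if b.dest is not None and b.dest in a.srcs:
                hazards.append(("WAR", a.name, b.name))
    return hazards


def hazards_within(group):
    """Hazards between each instruction and the later ones in the same group."""
    found = []
    for i, a in enumerate(group):
        found += data_hazards([a], group[i + 1:])
    return found


# Gj is already executing; Gi has just been fetched (two instructions each for brevity).
gj = [Instr("i1", "r1", ("r2", "r3")), Instr("i2", "r6", ("r7", "r8"))]
gi = [Instr("i5", "r4", ("r1", "r9")), Instr("i6", "r2", ("r4", "r5"))]

print(data_hazards(gj, gi))     # conflicts between Gj and Gi
print(hazards_within(gi))       # conflicts inside Gi
```

The number of comparisons grows with the product of the group sizes, which is one reason wider issue makes the issue stage harder to fit in a single cycle.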

Flavors of Multiple Issue Processors

Vector = execute a loop in parallel – directly on array data structures
Superscalar
- Static = in-order execution (if I5 has a problem, HALT)
  • Eg: SUN ULTRA SPARC II/III
- Dynamic = out-of-order execution (let I6 go if I5 has a resource conflict)
  » No Speculation – if I5 is a branch, do not allow I6 till the branch is resolved
    • IBM Power 2
  » With Speculation – allow I6 but be prepared to roll back (Pentium 3, Pentium 4, Alpha 21264, MIPS R10K)
VLIW – Compiler determines what to execute in parallel (Trimedia)
- EPIC (basis for Itanium)

Multiple Issue Headaches

Increased I-Cache fetch BW
Alignment problems may not allow 4 instructions to be fetched
Need to check for more hazards
Branches – 25% of instructions are branches, so you need to resolve a branch every cycle!
Increased ports on register file and memory

So, how do we proceed?

1. Pipeline the issue unit into 2 stages

2. Restricted issue, eg: one int and one FP

Dynamic Scheduling in Superscalar

How to issue two instructions and keep in-order instruction issue for Tomasulo? The easy way:
- Assume 1 integer + 1 floating point
- 1 Tomasulo control for integer, 1 for floating point
Issue 2X Clock Rate, so that issue remains in order
Only loads/stores might cause dependency between integer and FP issue:
- Replace load reservation station with a load queue; operands must be read in the order they are fetched
- Load checks addresses in Store Queue to avoid RAW violation
- Store checks addresses in Load Queue to avoid WAR,WAW

Register renaming, virtual registers versus Reorder Buffers

Alternative to Reorder Buffer is a larger virtual set of registers and register renaming
Virtual registers hold both architecturally visible registers + temporary values
- replace functions of reorder buffer and reservation station
Renaming process maps names of architectural registers to registers in virtual register set
- Changing subset of virtual registers contains architecturally visible registers
Simplifies instruction commit: mark register as no longer speculative, free register with old value
Adds 40-80 extra registers: Alpha, Pentium,…
- Size limits no. instructions in execution (used until commit)
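The renaming and commit steps can be sketched with a map table and a free list. The class name `Renamer`, the register counts, and the method names are assumptions for illustration and do not describe any particular processor.

```python
class Renamer:
    """Maps architectural registers onto a larger set of virtual registers."""

    def __init__(self, num_arch, num_virtual):
        # Initially architectural register ri lives in virtual register i.
        self.map = {f"r{i}": i for i in range(num_arch)}
        self.free = list(range(num_arch, num_virtual))

    def rename(self, dest, srcs):
        """Return (new_dest, renamed_srcs, old_dest) for one instruction."""
        renamed_srcs = [self.map[s] for s in srcs]   # read sources before remapping dest
        if not self.free:
            raise RuntimeError("no free virtual register: issue must stall")
        new_dest = self.free.pop(0)
        old_dest = self.map[dest]
        self.map[dest] = new_dest
        return new_dest, renamed_srcs, old_dest

    def commit(self, old_dest):
        """At commit, the register holding the old value can be reused."""
        self.free.append(old_dest)


renamer = Renamer(num_arch=4, num_virtual=8)
first = renamer.rename("r1", ["r2", "r3"])    # r1 = r2 op r3
second = renamer.rename("r1", ["r1", "r2"])   # r1 = r1 op r2
print(first, second)
renamer.commit(first[2])                      # first instruction commits
```

The two writes to r1 land in different virtual registers, so the WAW hazard between them disappears, while the second instruction still reads the value produced by the first. Running out of free registers stalls issue, which is the size limit noted above.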

How much to speculate?

Speculation Pro: uncover events that would otherwise stall the pipeline (cache misses)
Speculation Con: speculation is costly if an exceptional event occurs when the speculation was incorrect
Typical solution: speculation allows only low-cost exceptional events (1st-level cache miss)
When an expensive exceptional event occurs (2nd-level cache miss or TLB miss), the processor waits until the instruction causing the event is no longer speculative before handling the event
Assuming single branch per cycle: future may speculate across multiple branches!

Limits to ILP

Conflicting studies of amount
- Benchmarks (vectorized Fortran FP vs. integer C programs)
- Hardware sophistication
- Compiler sophistication
How much ILP is available using existing mechanisms with increasing HW budgets?
Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
- Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
- Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
- Motorola AltiVec: 128 bit ints and FPs
- Supersparc Multimedia ops, etc.

Limits to ILP

Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
   2 & 3 => machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis – addresses are known & a store can be moved before a load provided addresses not equal
Also: unlimited number of instructions issued/clock cycle; perfect caches; 1 cycle latency for all instructions (FP *,/);

Upper Limit to ILP: Ideal Machine (Figure 3.35 p. 242)

[Figure: IPC on the ideal machine. Integer: 18 - 60; FP: 75 - 150]

More Realistic HW: Branch Impact Figure 3.

Change from an infinite window to a window of 2000 instructions and a maximum issue of 64 instructions per clock cycle

[Figure: IPC by branch predictor: Perfect, Tournament, BHT (512), Profile, No prediction. FP: 15 - 45; Integer: 6 - 12]

More Realistic HW: Renaming Register Impact Figure 3.

Change: 2000 instr window, 64 instr issue, 8K 2-level prediction

[Figure: IPC by number of renaming registers: Infinite, 256, 128, 64, 32, None. Integer: 5 - 15; FP: 11 - 45]

SPEC 2000 Performance 3/2001 Source: Microprocessor Report, www.MPRonline.com

[Figure: performance chart annotated with the ratios 1.6X, 3.8X, 1.2X, 1.7X, 1.5X]

Conclusion

1985-2000: 1000X performance
- Moore’s Law transistors/chip => Moore’s Law for Performance/MPU
Hennessy: industry been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore’s Law to get 1.55X/year
- Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution
ILP limits: To make performance progress in future need to have explicit parallelism from programmer vs. implicit parallelism of ILP exploited by compiler, HW?
- Otherwise drop to old rate of 1.3X per year?
- Less than 1.3X because of processor-memory performance gap?
Impact on you: if you care about performance, better think about explicitly parallel algorithms vs. rely on ILP?
