Dynamic Branch Prediction: Techniques and Challenges, Study notes of Computer Architecture and Organization

These notes cover dynamic branch prediction techniques, including 2-bit counters, tournament predictors, and branch history tables; the problems with static branch prediction and the benefits of dynamic prediction; the accuracy of different schemes; and the use of local history and the correlation between branches.

Uploaded on 07/30/2009

Branch Prediction and Multiple-Issue Processors

Venkatesh Akella

EEC 270

Winter 2004

Based on Material provided by Prof. Al Davis and Prof. David Culler

Branch Prediction

  • Size of basic blocks limited to 4-7 instructions
  • Delayed branches are not a solution in multiple-issue processors
  • Why? It is hard to find independent instructions, and remember the mess they create for precise exceptions
  • To resolve a branch you need two things: (a) the branch target address and (b) the branch direction
  • Prediction deals with (b), i.e. getting the direction
  • Branch penalty is governed by (a)
  • Deeper pipeline – bad news, as the branch penalty (BP) is higher

Static Branch Prediction

  • Let the compiler figure out the branch direction for each branch instruction

Three strategies:

a) Always Predict Taken – misprediction is 34%

b) Forward Not Taken; Backward Taken – misprediction is 10% - 40%

c) Profile-driven – using realistic benchmarks and real data, determine the direction of each branch – Hennessy & McFarling and Larus and Ball
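
Strategy (b) can be sketched in a few lines. This is only an illustration: it assumes the branch address and its target address are both known as integers, and the name `predict_btfn` is hypothetical, not something from the slides.

```python
def predict_btfn(branch_pc: int, target_pc: int) -> bool:
    """Return True (predict taken) for a backward branch, False for a forward one."""
    # A backward branch jumps to a lower address, which usually closes a loop.
    return target_pc < branch_pc


# A loop-closing branch at 0x120 that jumps back to 0x100 is predicted taken.
backward_guess = predict_btfn(0x120, 0x100)
# A forward branch at 0x120 that skips ahead to 0x140 is predicted not taken.
forward_guess = predict_btfn(0x120, 0x140)
print(backward_guess, forward_guess)
```

The rule needs no run-time state, which is why it counts as a static scheme.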

Dynamic Branch Prediction

  • Run time
  • Hardware assisted
  • Intuition – branch direction is not random; branches are BIMODAL, i.e. either strongly taken or strongly not taken
  • One-bit Branch Prediction Buffer or Branch History Table (BHT) – Smith 1981

[Figure: K bits of the PC index a table of 1-bit entries (1 = Taken, 0 = Not Taken).]

  • The past is a good indicator of the future
  • Update the BHT when you make a mistake

What are the problems?

a) Aliasing due to the limited size of the BHT (a tag can be stored to avoid this problem)

b) 1-bit history may not be sufficient. E.g., consider a loop that iterates 10 times: you will mispredict 2 out of every 10 branches, so accuracy is 80%
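
The 80% figure for the 10-iteration loop can be checked with a small simulation. This is a minimal sketch: the table size, the branch address `0x40`, and the function name `simulate_one_bit` are illustrative assumptions, and the loop-closing branch is assumed to be taken 9 times and then fall through once per pass.

```python
def simulate_one_bit(outcomes, table_bits=4, pc=0x40):
    """Run a 1-bit BHT on the outcome stream of one branch; return the mispredictions."""
    table = [0] * (1 << table_bits)        # 1 = predict taken, 0 = predict not taken
    index = pc & ((1 << table_bits) - 1)   # the low K bits of the PC select the entry
    mispredictions = 0
    for taken in outcomes:
        if table[index] != taken:
            mispredictions += 1
            table[index] = taken           # the entry changes only on a mistake
    return mispredictions


one_pass = [1] * 9 + [0]                   # taken 9 times, then the loop exits
runs = 100
misses = simulate_one_bit(one_pass * runs)
accuracy = 1 - misses / (len(one_pass) * runs)
print(misses, accuracy)
```

Each pass costs two mispredictions: one on loop exit, and one on the first iteration of the next pass, because the exit flipped the bit.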


Dynamic Branch Prediction (Jim Smith, 1981)

  • Better solution: a 2-bit scheme that changes the prediction only after two consecutive mispredictions
  • Adds hysteresis to the decision-making process

[Figure: 2-bit prediction state diagram with two Predict Taken states and two Predict Not Taken states; transitions are labeled T (taken) and NT (not taken).]

2-bit Counters

  • Up to 93.5% accuracy
  • If K is sufficiently large, each branch maps to a unique counter
  • Can store tags if you want to avoid aliasing
  • How about m-bit counters?
    • They don’t add much benefit

[Figure: K bits of the PC index a table of 2^K 2-bit saturating counters.]
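
A table of 2-bit saturating counters indexed by the low K bits of the PC might look like the sketch below. The class name `TwoBitBHT`, the initial counter value, and the branch address are assumptions made for illustration. Counter values 0 and 1 predict not taken; 2 and 3 predict taken.

```python
class TwoBitBHT:
    """2^k two-bit saturating counters indexed by the low k bits of the PC."""

    def __init__(self, k=10):
        self.mask = (1 << k) - 1
        self.counters = [1] * (1 << k)     # start every counter at weakly not taken

    def predict(self, pc):
        return self.counters[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(bht, pc, outcomes):
    """Predict, compare, then train on each outcome; return the number of misses."""
    misses = 0
    for taken in outcomes:
        if bht.predict(pc) != taken:
            misses += 1
        bht.update(pc, taken)
    return misses


loop_pass = [True] * 9 + [False]           # the 10-iteration loop again
bht = TwoBitBHT(k=4)
count_mispredictions(bht, 0x40, loop_pass)            # warm-up pass
steady_misses = count_mispredictions(bht, 0x40, loop_pass * 100)
print(steady_misses)
```

After warm-up only the loop exit is mispredicted, so the same loop that cost the 1-bit scheme 2 misses in 10 now costs 1 in 10.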

How do you improve further?

  • Can we capture the actual history of the specific branch and use that to make our prediction? – LOCAL HISTORY
  • Can we capture the sequential correlation between branches? – GLOBAL HISTORY
  • Can we do both?
  • Make multiple predictions and choose the right prediction based on the context of the particular branch – TOURNAMENT predictors

Using Local History

Consider the simple for loop

FOR (I=1; I<5; I++) { something … }

If the branch is at the end of the loop body, it has the following pattern: (1110)^N. The sequence of the branch history is 11101110111011101110 … Basically, if we know what the branch did the last three times, we can predict EXACTLY what it will do next.

[Figure: k bits of the PC select a 3-bit local history in the LHT/PHT, which in turn selects a 1-bit, 2-bit, or m-bit counter. Annotations: 2 memory accesses; accuracy 97.1%.]
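
The (1110)^N example can be reproduced with a two-level sketch: a per-branch history register selects one of 2^3 counters. The class name `LocalHistoryPredictor`, the table sizes, and the choice of 2-bit counters are illustrative assumptions, not the exact organization in the figure.

```python
class LocalHistoryPredictor:
    """Per-branch 3-bit history selects a 2-bit counter in a pattern table."""

    def __init__(self, k=6, history_bits=3):
        self.pc_mask = (1 << k) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.histories = [0] * (1 << k)             # local history table
        self.counters = [1] * (1 << history_bits)   # pattern table of 2-bit counters

    def predict(self, pc):
        history = self.histories[pc & self.pc_mask]
        return self.counters[history] >= 2

    def update(self, pc, taken):
        slot = pc & self.pc_mask
        history = self.histories[slot]
        c = self.counters[history]
        self.counters[history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome into the branch's history register.
        self.histories[slot] = ((history << 1) | int(taken)) & self.hist_mask


def run(predictor, pc, outcomes):
    """Predict, compare, then train on each outcome; return the number of misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


pattern = [True, True, True, False]        # 1110, one pass of the FOR loop
lhp = LocalHistoryPredictor()
warmup_misses = run(lhp, 0x40, pattern * 10)
steady_misses_local = run(lhp, 0x40, pattern * 100)
print(warmup_misses, steady_misses_local)
```

Each of the four 3-bit histories that occur in the stream (111, 110, 101, 011) is always followed by the same outcome, so once the counters are trained this toy loop is predicted without error. The 97.1% figure on the slide comes from real benchmarks, not from a single loop like this one.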

Accuracy of Different Schemes (Figure 3.15, p. 206)

[Figure: frequency of mispredictions for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT.]

Re-evaluating Correlation

  • Several of the SPEC benchmarks have fewer than a dozen branches responsible for 90% of taken branches:

program     branch %   static   # = 90%
compress    14%        236      (missing)
eqntott     25%        494      (missing)
gcc         15%        9531     (missing)
mpeg        10%        5598     (missing)
real gcc    13%        17361    3214

  • Real programs + OS are more like gcc
  • Small benefits beyond benchmarks for correlation? Problems with branch aliases?

BHT Accuracy

  • Mispredict because either:
    • Wrong guess for that branch
    • Got branch history of the wrong branch when indexing the table
  • With a 4096-entry table, misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
  • For SPEC92, 4096 entries are about as good as an infinite table

Tournament Predictors

  • Motivation for hybrid branch predictors: the 2-bit predictor failed on important branches; by adding global information, performance improved
  • Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine them with a selector
  • Hopes to select the right predictor for the right branch (or the right context of a branch)
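
The selector idea can be sketched as follows. This is a simplified illustration, not the organization of any real processor: the global component is indexed by a global history register, the local component is just a PC-indexed 2-bit counter, and the selector is a PC-indexed 2-bit counter trained only when the two components disagree. The names `TournamentPredictor`, `bump`, and `run_tournament` are hypothetical.

```python
def bump(counter, up, top=3):
    """Saturating increment (up=True) or decrement (up=False) of a small counter."""
    return min(top, counter + 1) if up else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self, index_bits=10, history_bits=10):
        size = 1 << index_bits
        self.mask = size - 1
        self.hist_mask = (1 << history_bits) - 1
        self.global_history = 0
        self.global_table = [1] * (1 << history_bits)  # indexed by global history
        self.local_table = [1] * size                  # indexed by PC
        self.chooser = [1] * size                      # >= 2 means trust the global side

    def predict(self, pc):
        i = pc & self.mask
        global_guess = self.global_table[self.global_history] >= 2
        local_guess = self.local_table[i] >= 2
        return global_guess if self.chooser[i] >= 2 else local_guess

    def update(self, pc, taken):
        i = pc & self.mask
        h = self.global_history
        global_guess = self.global_table[h] >= 2
        local_guess = self.local_table[i] >= 2
        if global_guess != local_guess:
            # Move the selector toward whichever component was right.
            self.chooser[i] = bump(self.chooser[i], global_guess == taken)
        self.global_table[h] = bump(self.global_table[h], taken)
        self.local_table[i] = bump(self.local_table[i], taken)
        self.global_history = ((h << 1) | int(taken)) & self.hist_mask


def run_tournament(predictor, pc, outcomes):
    """Predict, compare, then train on each outcome; return the number of misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


alternating = [True, False]                # a branch that flips direction every time
tp = TournamentPredictor()
run_tournament(tp, 0x40, alternating * 50)             # warm-up
steady_misses_tournament = run_tournament(tp, 0x40, alternating * 50)
print(steady_misses_tournament)
```

An alternating branch defeats the plain 2-bit counter, which keeps oscillating between its two middle states, but the history-indexed component learns it, and the selector settles on that component for this branch.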

Tournament Predictor in Alpha 21264

  • 4K 2-bit counters to choose from among a global predictor and a local predictor
  • Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
    • 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
  • Local predictor consists of a 2-level predictor:
    • Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted
    • Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
  • Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
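
The storage total can be checked term by term. The variable names below are illustrative; the sizes are the ones listed for the 21264 above.

```python
K = 1024

chooser_bits = 4 * K * 2          # 4K selector entries, 2 bits each
global_bits = 4 * K * 2           # 4K global entries, 2 bits each
local_history_bits = 1 * K * 10   # 1K local history entries, 10 bits each
local_counter_bits = 1 * K * 3    # 1K local counters, 3 bits each

total_bits = chooser_bits + global_bits + local_history_bits + local_counter_bits
print(total_bits // K, "Kbits")
```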

% of predictions from local predictor in Tournament Prediction Scheme

Accuracy of Branch Prediction

  • Profile: branch profile from last execution (static in that it is encoded in the instruction, but based on a profile)

fig 3.

Accuracy v. Size (SPEC89)

Branch Folding

  • Branch folding: instead of storing the next PC (the BTA), how about storing the target instruction itself, or multiple instructions if it is a multi-issue processor?

Eg:

L2 : b L

L : add R1, R2, R

At the address corresponding to L2, you store the add instruction instead of the unconditional branch instruction b L:

L2: Add r1,r2, r

ZERO cycle BRANCH (eliminated one instruction altogether)

Advanced Approaches

  • Trace caches – aggressive prefetching
  • Return address caches – jr $Ra, when $Ra is the return address of a procedure
    • 85% of indirect jumps are due to procedure returns
    • BTB does not work very well because a procedure is called from many different places
    • So you use a separate stack cache to push $Ra values and pop them off

Special Case Return Addresses

  • Register-indirect branches: hard to predict the target address
  • In SPEC89, 85% of such branches are procedure returns
  • Since procedures follow a stack discipline, save return addresses in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate
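
A return address stack of this kind can be sketched as a small bounded stack. The class name `ReturnAddressStack`, the overflow policy (drop the oldest entry), and the addresses in the example are assumptions for illustration.

```python
class ReturnAddressStack:
    """Small buffer that predicts procedure-return targets in LIFO order."""

    def __init__(self, entries=8):
        self.entries = entries
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.entries:
            self.stack.pop(0)              # full: forget the oldest return address
        self.stack.append(return_address)

    def predict_return(self):
        # An empty stack means there is no prediction to offer.
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(entries=8)
ras.on_call(0x1004)                        # main calls f
ras.on_call(0x2008)                        # f calls g
first_return = ras.predict_return()        # g returns into f
second_return = ras.predict_return()       # f returns into main
print(hex(first_return), hex(second_return))
```

The same procedure called from many places gets the correct target each time, which is the case a BTB handles poorly. A miss occurs only when the call depth exceeds the number of entries.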

Pitfall: Sometimes bigger and dumber is better

  • 21264 uses a tournament predictor (29 Kbits)
  • Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
  • SPEC95 benchmarks: 21264 outperforms
    • 21264 avg. 11.5 mispredictions per 1000 instructions
    • 21164 avg. 16.5 mispredictions per 1000 instructions
  • Reversed for transaction processing (TP)!
    • 21264 avg. 17 mispredictions per 1000 instructions
    • 21164 avg. 15 mispredictions per 1000 instructions
  • TP code is much larger & the 21164 holds 2X the branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)

Dynamic Branch Prediction Summary

  • Prediction becoming important part of scalar execution
  • Branch History Table: 2 bits for loop accuracy
  • Correlation: recently executed branches are correlated with the next branch
    • Either different branches
    • Or different executions of the same branch
  • Tournament Predictor: more resources to competitive solutions and pick between them
  • Branch Target Buffer: include branch address & prediction
  • Predicated Execution can reduce number of branches, number of mispredicted branches
  • Return address stack for prediction of indirect jump

Multiple Issue

  • Goal: reduce CPI below 1
  • Consider two consecutive blocks of instructions

Gj = {i1, i2, i3, i4} and Gi = {i5, i6, i7, i8}

Gj is already in execution

1. Fetch Gi

2. Check for all structural hazards that instructions in Gj may introduce

3. Check for data hazards between instructions in Gi and Gj, and within Gi

4. Read operands and execute
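
The data-hazard check between an in-flight group and a newly fetched group can be sketched as a pairwise comparison of register names. The `Instr` record, the function `data_hazards`, and the example instructions are hypothetical; the sketch covers only hazards across the two groups, not hazards inside Gi.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    name: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


def data_hazards(earlier, later):
    """List (kind, earlier instr, later instr, register) for each cross-group hazard."""
    hazards = []
    for old in earlier:
        for new in later:
            if old.dest is not None and old.dest in new.srcs:
                hazards.append(("RAW", old.name, new.name, old.dest))
            if old.dest is not None and old.dest == new.dest:
                hazards.append(("WAW", old.name, new.name, old.dest))
            if new.dest is not None and new.dest in old.srcs:
                hazards.append(("WAR", old.name, new.name, new.dest))
    return hazards


# Two instructions per group keep the example short.
Gj = [Instr("i1", "r1", ("r2", "r3")), Instr("i2", "r4", ("r1", "r5"))]
Gi = [Instr("i5", "r6", ("r1", "r7")), Instr("i6", "r2", ("r8", "r9"))]
found = data_hazards(Gj, Gi)
print(found)
```

The number of comparisons grows with the product of the group sizes, which is one reason wider issue makes the check more expensive.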

Flavors of Multiple Issue Processors

  • Vector – execute a loop in parallel, directly on array data structures
  • Superscalar
    • Static = in-order execution (if I5 has a problem, HALT)
      • Eg: SUN ULTRA SPARC II/III
    • Dynamic = out-of-order execution (let I6 proceed if I5 has a resource conflict)
      • No speculation – if I5 is a branch, do not allow I6 till the branch is resolved (eg: IBM Power 2)
      • With speculation – allow I6 but be prepared to roll back (Pentium 3, Pentium 4, Alpha 21264, MIPS R10K)
  • VLIW – compiler determines what to execute in parallel (Trimedia)
    • EPIC (basis for Itanium)

Multiple Issue Headaches

  • Increased I-cache fetch bandwidth
  • Alignment problems may not allow 4 instructions to be fetched
  • Need to check for more hazards
  • Branches – 25% of instructions are branches, so you need to resolve a branch every cycle!
  • Increased ports on register file and memory

So, how do we proceed?

1. Pipeline the issue unit into 2 stages

2. Restricted issue, eg: one int and one FP

Dynamic Scheduling in Superscalar: The easy way

  • How to issue two instructions and keep in-order instruction issue for Tomasulo?
    • Assume 1 integer + 1 floating point
    • 1 Tomasulo control for integer, 1 for floating point
  • Issue at 2X clock rate, so that issue remains in order
  • Only loads/stores might cause a dependency between integer and FP issue:
    • Replace load reservation station with a load queue; operands must be read in the order they are fetched
    • Load checks addresses in Store Queue to avoid RAW violation
    • Store checks addresses in Load Queue to avoid WAR, WAW

Register renaming, virtual registers versus Reorder Buffers

  • Alternative to the reorder buffer is a larger virtual set of registers and register renaming
  • Virtual registers hold both architecturally visible registers + temporary values
    • replace functions of reorder buffer and reservation station
  • Renaming process maps names of architectural registers to registers in the virtual register set
    • Changing subset of virtual registers contains the architecturally visible registers
  • Simplifies instruction commit: mark register as no longer speculative, free register with old value
  • Adds 40-80 extra registers: Alpha, Pentium, …
    • Size limits no. of instructions in execution (used until commit)
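
The renaming step can be sketched with a map from architectural names to virtual (physical) register numbers plus a free list. This is a minimal illustration under stated assumptions: the class name `RenameTable`, the register names, and the register counts are hypothetical, and speculation recovery is left out.

```python
class RenameTable:
    """Maps architectural register names to entries of a larger virtual register set."""

    def __init__(self, arch_regs, num_physical):
        # Initially architectural register number i lives in physical register i.
        self.mapping = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, srcs):
        """Return (new physical dest, physical sources, old physical dest)."""
        phys_srcs = [self.mapping[s] for s in srcs]   # read sources before remapping dest
        if not self.free:
            raise RuntimeError("no free register: issue must stall")
        old = self.mapping[dest]
        new = self.free.pop(0)
        self.mapping[dest] = new
        return new, phys_srcs, old

    def commit(self, old_phys):
        # At commit the previous value of the destination is dead; recycle its register.
        self.free.append(old_phys)


table = RenameTable(["r1", "r2", "r3"], num_physical=6)
first = table.rename("r1", ["r2", "r3"])    # r1 = r2 op r3
second = table.rename("r1", ["r1", "r2"])   # r1 = r1 op r2
print(first, second)
```

Both instructions write `r1`, yet they receive different physical destinations, so the WAW hazard disappears, and the second instruction reads the first one's result through the updated mapping. The free list running empty is the point at which the register count limits the number of instructions in execution.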

How much to speculate?

  • Speculation pro: uncover events that would otherwise stall the pipeline (cache misses)
  • Speculation con: speculation is costly if an exceptional event occurs when the speculation was incorrect
  • Typical solution: speculation allows only low-cost exceptional events (1st-level cache miss)
  • When an expensive exceptional event occurs (2nd-level cache miss or TLB miss), the processor waits until the instruction causing the event is no longer speculative before handling the event
  • Assuming a single branch per cycle: future processors may speculate across multiple branches!

Limits to ILP

  • Conflicting studies of amount
    • Benchmarks (vectorized Fortran FP vs. integer C programs)
    • Hardware sophistication
    • Compiler sophistication
  • How much ILP is available using existing mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
    • Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
    • Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
    • Motorola AltiVec: 128 bit ints and FPs
    • Supersparc Multimedia ops, etc.

Limits to ILP

Initial HW Model here; MIPS compilers.

Assumptions for ideal/perfect machine to start:

  1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided
  2. Branch prediction – perfect; no mispredictions
  3. Jump prediction – all jumps perfectly predicted
     2 & 3 => machine with perfect speculation & an unbounded buffer of instructions available
  4. Memory-address alias analysis – addresses are known & a store can be moved before a load provided addresses not equal

Also: unlimited number of instructions issued/clock cycle; perfect caches; 1 cycle latency for all instructions (FP *,/)

Upper Limit to ILP: Ideal Machine (Figure 3.35 p. 242)

[Figure: IPC on the ideal machine. Integer: 18 - 60; FP: 75 - 150.]

More Realistic HW: Branch Impact Figure 3.

Change from an infinite window to a window of 2000 instructions to examine and a maximum issue of 64 instructions per clock cycle

[Figure: IPC with each branch predictor: Perfect, Tournament, BHT (512), Profile, No prediction. FP: 15 - 45; Integer: 6 - 12.]

More Realistic HW: Renaming Register Impact Figure 3.

Change: 2000 instr window, 64 instr issue, 8K 2 level Prediction

[Figure: IPC with each number of renaming registers: Infinite, 256, 128, 64, 32, None. Integer: 5 - 15; FP: 11 - 45.]

SPEC 2000 Performance 3/2001 Source: Microprocessor Report, www.MPRonline.com

[Figure: SPEC 2000 performance chart; annotated ratios: 1.6X, 3.8X, 1.2X, 1.7X, 1.5X.]

Conclusion

  • 1985-2000: 1000X performance
    • Moore’s Law transistors/chip => Moore’s Law for Performance/MPU
  • Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore’s Law to get 1.55X/year
    • Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution
  • ILP limits: to make performance progress in future, need to have explicit parallelism from programmer vs. implicit parallelism of ILP exploited by compiler, HW?
    • Otherwise drop to old rate of 1.3X per year?
    • Less than 1.3X because of processor-memory performance gap?
  • Impact on you: if you care about performance, better think about explicitly parallel algorithms vs. rely on ILP?