Branch Prediction and Multiple-Issue Processors
Venkatesh Akella
EEC 270
Winter 2004
Based on Material provided by Prof. Al Davis and Prof. David Culler
Branch Prediction
- Size of basic blocks limited to 4-7 instructions
- Delayed branches are not a solution in multiple-issue processors
- Why? It is hard to find independent instructions, and remember the mess they create for precise exceptions
- To resolve a branch you need two things: (a) the branch target address and (b) the branch direction
- Prediction deals with (b), i.e. getting the direction
- Branch penalty is governed by (a)
- Deeper pipeline: bad news, as the branch penalty is higher
Static Branch Prediction
- Let the compiler figure out the branch direction for each branch instruction
Three strategies:
a) Always Predict Taken - Misprediction is 34%
b) Forward Not Taken; Backward Taken --- Misprediction is 10% - 40%
c) Profile-driven: using realistic benchmarks and real data, determine the direction for each branch (Hennessy & McFarling; Larus & Ball)
Dynamic Branch Prediction
- Run time
- Hardware assisted
- Intuition: branch direction is not random; branches are BIMODAL, i.e. either strongly taken or strongly not taken
- One-bit Branch Prediction Buffer or Branch History Table (BHT), Smith 1981
  - K bits of the PC index the table; each entry is 1 = Taken, 0 = Not Taken
  - The past is a good indicator of the future
  - Update the BHT when you make a mistake
- What are the problems?
  a) Aliasing due to the limited size of the BHT (a tag can be stored to avoid this problem)
  b) 1-bit history may not be sufficient. E.g., consider a loop that iterates 10 times: you will mispredict 2/10, so accuracy is 80%
- Better solution: a 2-bit scheme where the prediction changes only after two mispredictions in a row
  - Adds hysteresis to the decision-making process
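The 80% figure for the 10-iteration loop, and the benefit of the second bit, can be checked with a small simulation. This is a minimal sketch, not the scheme from any particular processor: the function name `simulate`, the choice to start in the strongest "taken" state, and the 100 repetitions of the loop are all assumptions made for illustration.

```python
def simulate(outcomes, bits):
    """Count mispredictions of one saturating counter of the given width."""
    max_state = (1 << bits) - 1
    threshold = 1 << (bits - 1)   # predict taken when state >= threshold
    state = max_state             # assumption: start in the strongest "taken" state
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= threshold
        if predicted_taken != taken:
            misses += 1
        # Saturating update toward the actual outcome.
        state = min(state + 1, max_state) if taken else max(state - 1, 0)
    return misses


# Loop-closing branch of a 10-iteration loop: taken 9 times, then not taken.
# The loop is executed 100 times so that warm-up barely matters.
loop = ([True] * 9 + [False]) * 100

one_bit_misses = simulate(loop, bits=1)
two_bit_misses = simulate(loop, bits=2)
print(one_bit_misses / len(loop), two_bit_misses / len(loop))
```

With one bit, the predictor is wrong on the loop exit and again on the first iteration of the next execution (about 2 misses per 10 branches). With two bits, only the loop exit is mispredicted (1 miss per 10 branches).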
Dynamic Branch Prediction (Jim Smith, 1981)
[Figure: state diagram of the 2-bit predictor, with two Predict Taken states and two Predict Not Taken states connected by T (taken) and NT (not taken) transitions]
2-bit Counters
- Up to 93.5% accuracy
- If K is sufficiently large, each branch maps to a unique counter
- Can store tags if you want to avoid aliasing
- How about m-bit counters?
- Doesn’t benefit much
[Figure: K bits of the PC index a table of 2^K 2-bit saturating counters]
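To make the indexing and the aliasing problem concrete, here is a hedged sketch of a table of 2^K 2-bit saturating counters. The class name `BranchHistoryTable`, the 4-byte instruction assumption (dropping the low 2 PC bits), the initial counter value, and the example addresses are all hypothetical choices for illustration.

```python
class BranchHistoryTable:
    """2^k two-bit saturating counters indexed by low-order PC bits (no tags)."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.counters = [1] * (1 << k)   # assumption: start weakly not-taken

    def index(self, pc):
        # Assumption: 4-byte aligned instructions, so the low 2 bits carry no information.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2   # True means "predict taken"

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)


# Two hypothetical branches whose addresses differ only above bit 5.
branch_a, branch_b = 0x1000, 0x1040

small = BranchHistoryTable(k=4)    # 16 counters: the two branches alias
large = BranchHistoryTable(k=10)   # 1024 counters: each gets its own counter

for table in (small, large):
    table.update(branch_a, True)
    table.update(branch_a, True)

print(small.predict(branch_b), large.predict(branch_b))
```

In the small table, training on `branch_a` changes the prediction for `branch_b` because both map to the same counter. In the larger table `branch_b` keeps its own untouched counter.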
How do you improve further?
- Can we capture the actual history of the specific branch and use that to make our
prediction? – LOCAL HISTORY
- Can we capture the sequential correlation between branches – GLOBAL HISTORY
- do both?
- Make multiple predictions and choose the right prediction based on the context of the
particular branch – TOURNAMENT predictors
Using Local History
Consider the simple for loop
FOR (I=1; I<5; I++) { something … }
If the branch is at the end of the loop body, it has the following pattern: (1110)^N
The sequence of the branch history is 11101110111011101110 ……
Basically, if we know what the branch did the last three times, we can predict EXACTLY what it will do next.
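[Figure: k bits of the PC index the local history table (LHT); the 3-bit local history then selects one of 2^3 counters (1-bit, 2-bit or m-bit) in the pattern history table (PHT)]
- Accuracy: 97.1%
- Cost: 2 memory accesses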
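The claim that three bits of local history are enough for the (1110)^N pattern can be checked with a short simulation. This is a sketch for a single branch only: the function name `run_local_predictor`, the initial counter values, and the use of 2-bit counters in the second level are assumptions, not details taken from the slides.

```python
def run_local_predictor(outcomes, history_bits=3):
    """Two-level predictor for ONE branch: its recent outcomes select a 2-bit counter."""
    mask = (1 << history_bits) - 1
    history = 0                               # last `history_bits` outcomes of this branch
    counters = [1] * (1 << history_bits)      # assumption: start weakly not-taken
    mispredicted = []
    for taken in outcomes:
        predicted_taken = counters[history] >= 2
        mispredicted.append(predicted_taken != taken)
        c = counters[history]
        counters[history] = min(c + 1, 3) if taken else max(c - 1, 0)
        history = ((history << 1) | int(taken)) & mask
    return mispredicted


# Branch history 1110 1110 1110 ... from the 4-iteration loop.
pattern = [True, True, True, False] * 50
misses = run_local_predictor(pattern)
print(sum(misses), "mispredictions, all during warm-up:", not any(misses[8:]))
```

After a short warm-up, every 3-bit history (111, 110, 101, 011) is always followed by the same outcome, so the selected counter is never wrong again.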
Accuracy of Different Schemes (Figure 3.15, p. 206)
[Figure: frequency of mispredictions for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT]
Re-evaluating Correlation
- Several of the SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:

  program     branch %   static   # = 90%
  compress    14%        236      (missing)
  eqntott     25%        494      (missing)
  gcc         15%        9531     (missing)
  mpeg        10%        5598     (missing)
  real gcc    13%        17361    3214
- Real programs + OS more like gcc
- Small benefits beyond benchmarks for correlation? problems with branch aliases?
BHT Accuracy
- Mispredict because either:
  - Wrong guess for that branch
  - Got branch history of the wrong branch when indexing the table
- 4096-entry table: programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
- For SPEC92, 4096 about as good as infinite table
Tournament Predictors
- Motivation for hybrid branch predictors: the 2-bit predictor failed on important branches; by adding global information, performance improved
- Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine them with a selector
- Hopes to select the right predictor for the right branch (or the right context of a branch)
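A selector can be sketched as one more table of 2-bit counters that is trained only when the two component predictors disagree. The code below is a simplified illustration, not the design of any real machine: the class name `Tournament`, the table sizes, the 4-bit global history, and the use of a plain per-PC 2-bit counter as the "local" component are all assumptions.

```python
def bump(counter, up, max_value=3):
    """Saturating increment or decrement of a small counter."""
    return min(counter + 1, max_value) if up else max(counter - 1, 0)


class Tournament:
    def __init__(self, entries=1024, history_bits=4):
        self.entries = entries
        self.hmask = (1 << history_bits) - 1
        self.ghist = 0                           # outcomes of the most recent branches
        self.local = [1] * entries               # per-PC 2-bit counters
        self.glob = [1] * (1 << history_bits)    # 2-bit counters indexed by global history
        self.chooser = [1] * entries             # >= 2 means "trust the global predictor"

    def _components(self, pc):
        i = pc % self.entries
        return i, self.local[i] >= 2, self.glob[self.ghist] >= 2

    def predict(self, pc):
        i, p_local, p_global = self._components(pc)
        return p_global if self.chooser[i] >= 2 else p_local

    def update(self, pc, taken):
        i, p_local, p_global = self._components(pc)
        if p_local != p_global:                  # train the selector only on disagreement
            self.chooser[i] = bump(self.chooser[i], p_global == taken)
        self.local[i] = bump(self.local[i], taken)
        self.glob[self.ghist] = bump(self.glob[self.ghist], taken)
        self.ghist = ((self.ghist << 1) | int(taken)) & self.hmask


def count_misses(predictor, pc, outcomes):
    misses = []
    for taken in outcomes:
        misses.append(predictor.predict(pc) != taken)
        predictor.update(pc, taken)
    return misses


# A branch that alternates taken / not taken defeats a lone 2-bit counter,
# but is perfectly predictable from recent history.
alternating = [True, False] * 20
misses = count_misses(Tournament(), pc=0, outcomes=alternating)
print(sum(misses), "mispredictions in", len(alternating), "branches")
```

For this branch the selector quickly learns to trust the history-based component, and all mispredictions fall in the first few executions.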
Tournament Predictor in Alpha 21264
- 4K 2-bit counters to choose from among a global predictor and a local predictor
- Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
  - 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
- Local predictor consists of a 2-level predictor:
  - Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
  - Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
- Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
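The 29K-bit total follows directly from the four table sizes listed for the 21264. The arithmetic is spelled out below as a quick check; the variable names are illustrative only.

```python
K = 1024

selector_bits = 4 * K * 2        # 4K 2-bit chooser counters
global_bits = 4 * K * 2          # 4K 2-bit global predictor entries
local_history_bits = 1 * K * 10  # 1K entries of 10-bit local history
local_counter_bits = 1 * K * 3   # 1K 3-bit saturating counters

total_bits = selector_bits + global_bits + local_history_bits + local_counter_bits
print(total_bits, "bits =", total_bits // K, "Kbits")
```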
% of predictions from local predictor in
Tournament Prediction Scheme
Accuracy of Branch Prediction
- Profile: branch profile from last execution (static in that it is encoded in the instruction, but based on a profile)
fig 3.
Accuracy v. Size (SPEC89)
Branch Folding
- Branch folding: instead of storing the next PC or BTA, how about storing the target instruction itself, or multiple instructions if it is a multi-issue processor
Eg: L2 : b L
L : add R1, R2, R
At address corresponding to L2, you store the add instruction instead of the unconditional
branch instruction b L
ZERO cycle BRANCH
(eliminated one instruction altogether)
L2: Add r1,r2, r
Advanced Approaches
- Trace Caches – aggressive prefetching
- Return Address Caches: jr $Ra, when $Ra is the return address of a procedure
  - 85% of indirect jumps are due to procedure returns
  - BTB does not work very well because a procedure is called from many different places
  - So, you use a separate stack cache: push $Ra values on calls and pop them off on returns
Special Case Return Addresses
- Register Indirect branch hard to predict address
- SPEC89 85% such branches for procedure return
- Since stack discipline for procedures, save
return address in small buffer that acts like a stack: 8 to 16 entries has small miss rate
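A return address stack of this kind can be sketched in a few lines. The class name `ReturnAddressStack`, the policy of discarding the oldest entry on overflow, and the example addresses are assumptions for illustration; real designs differ in how they handle overflow and mis-speculated calls.

```python
class ReturnAddressStack:
    """Small hardware-style stack that predicts the target of procedure returns."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        if len(self.entries) == self.depth:
            self.entries.pop(0)          # assumption: overflow discards the oldest entry
        self.entries.append(return_address)

    def predict_return(self):
        # An empty stack gives no prediction (None); a real front end would fall back
        # to some other mechanism.
        return self.entries.pop() if self.entries else None


ras = ReturnAddressStack(depth=8)
ras.on_call(0x400104)   # hypothetical: main calls f
ras.on_call(0x400208)   # hypothetical: f calls g
inner = ras.predict_return()   # return from g
outer = ras.predict_return()   # return from f
print(hex(inner), hex(outer))
```

Because calls and returns nest, the most recently pushed address is the right prediction for the next return, no matter how many different call sites the procedure has.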
Pitfall: Sometimes bigger and dumber
is better
- 21264 uses tournament predictor (29 Kbits)
- Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
- SPEC95 benchmarks: 21264 outperforms
  - 21264 avg. 11.5 mispredictions per 1000 instructions
  - 21164 avg. 16.5 mispredictions per 1000 instructions
- Reversed for transaction processing (TP)!
  - 21264 avg. 17 mispredictions per 1000 instructions
  - 21164 avg. 15 mispredictions per 1000 instructions
- TP code is much larger & the 21164 holds 2X the branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)
Dynamic Branch Prediction Summary
- Prediction becoming important part of scalar execution
- Branch History Table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
  - Either different branches
  - Or different executions of the same branch
- Tournament Predictor: more resources to competitive solutions and pick between them
- Branch Target Buffer: include branch address & prediction
- Predicated Execution can reduce number of branches, number of mispredicted branches
- Return address stack for prediction of indirect jump
Multiple Issue
- Goal: reduce CPI below 1
- Consider two consecutive blocks of instructions
Gj = {i1, i2, i3, i4} and Gi = {i5, i6, i7, i8}
Gj is already in execution
1. Fetch Gi
2. Check for all structural hazards that instructions in Gj may introduce
3. Check for data hazards between instructions in Gi, and between Gi and Gj
4. Read operands and execute
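The data-hazard check in step 3 amounts to comparing register names between the group in flight and the group being issued. The sketch below covers only the cross-group case (Gj against Gi) and only register operands; the `Instr` record, the function name `data_hazards`, and the example instructions are hypothetical.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination register and a tuple of sources.
Instr = namedtuple("Instr", ["name", "dest", "srcs"])


def data_hazards(earlier, later):
    """List register hazards between an in-flight group and a newly fetched group."""
    found = []
    for old in earlier:
        for new in later:
            if old.dest is not None and old.dest in new.srcs:
                found.append(("RAW", old.name, new.name))   # new reads what old writes
            if old.dest is not None and old.dest == new.dest:
                found.append(("WAW", old.name, new.name))   # both write the same register
            if new.dest is not None and new.dest in old.srcs:
                found.append(("WAR", old.name, new.name))   # new overwrites what old reads
    return found


Gj = [Instr("i1", "r1", ("r2", "r3")), Instr("i2", "r4", ("r5",))]
Gi = [Instr("i5", "r6", ("r1",)), Instr("i6", "r5", ("r7",))]
hazards = data_hazards(Gj, Gi)
print(hazards)
```

The number of comparisons grows with the product of the two group sizes, which is one reason the issue logic becomes a bottleneck as issue width increases.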
Flavors of Multiple Issue Processors
- Vector = execute a loop in parallel, directly on array data structures
- Superscalar
  - Static = in-order execution (if I5 has a problem, HALT)
    - Eg: SUN ULTRA SPARC II/III
  - Dynamic = out-of-order execution (let I6 go if I5 has a resource conflict)
    - No Speculation: if I5 is a branch, do not allow I6 till the branch is resolved (IBM Power 2)
    - With Speculation: allow I6 but be prepared to roll back (Pentium 3, Pentium 4, Alpha 21264, MIPS R10K)
- VLIW: compiler determines what to execute in parallel (Trimedia)
Multiple Issue Headaches
- Increased I-Cache Fetch BW
- Alignment problems may not allow 4
instructions to be fetched
- Need to check for more hazards
- Branches – 25% of instructions are
branches, so you need to resolve a branch every cycle!
- Increased ports on register file and memory
So, how do we proceed
1. Pipeline the Issue unit into 2 stages
2. Restricted Issue eg: one int and one FP
Dynamic Scheduling in Superscalar: The Easy Way
- How to issue two instructions and keep in-order instruction issue for Tomasulo?
  - Assume 1 integer + 1 floating point
  - 1 Tomasulo control for integer, 1 for floating point
- Issue 2X Clock Rate, so that issue remains in order
- Only loads/stores might cause dependency between integer and FP issue:
- Replace load reservation station with a load queue; operands must be read in the order they are fetched
- Load checks addresses in Store Queue to avoid RAW violation
- Store checks addresses in Load Queue to avoid WAR,WAW
Register renaming, virtual registers
versus Reorder Buffers
- Alternative to Reorder Buffer is a larger virtual set of registers and register renaming
- Virtual registers hold both architecturally visible registers + temporary values
  - Replace functions of reorder buffer and reservation station
- Renaming process maps names of architectural registers to registers in virtual register set
- Changing subset of virtual registers contains architecturally visible registers
- Simplifies instruction commit: mark register as no longer speculative, free register with old value
- Adds 40-80 extra registers: Alpha, Pentium,…
- Size limits no. instructions in execution (used until commit)
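The renaming process can be illustrated with a map table and a free list. This is a deliberately simplified sketch: the function name `rename`, the register naming (`r` for architectural, `p` for virtual/physical), and the table sizes are assumptions, and registers are never returned to the free list here, whereas a real design frees the old mapping at commit.

```python
def rename(instructions, num_arch_regs=8, num_phys_regs=16):
    """Rename (dest, srcs) instructions so every write gets a fresh register."""
    # Initially architectural register r<i> lives in virtual register p<i>.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    free = [f"p{i}" for i in range(num_arch_regs, num_phys_regs)]
    renamed = []
    for dest, srcs in instructions:
        new_srcs = tuple(mapping[s] for s in srcs)   # read the current mappings first
        if not free:
            raise RuntimeError("out of registers: issue would have to stall")
        new_dest = free.pop(0)
        mapping[dest] = new_dest                     # later readers see the new name
        renamed.append((new_dest, new_srcs))
    return renamed


program = [
    ("r1", ("r2", "r3")),   # r1 = r2 op r3
    ("r4", ("r1", "r2")),   # r4 = r1 op r2   (RAW on r1)
    ("r1", ("r5", "r6")),   # r1 = r5 op r6   (WAW with the first, WAR with the second)
]
renamed = rename(program)
print(renamed)
```

After renaming, the third instruction writes a different register from the first, so the WAW and WAR hazards disappear while the true RAW dependence is preserved. The size of the free list also bounds how many instructions can be in flight.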
How much to speculate?
- Speculation Pro: uncover events that would otherwise stall the pipeline (cache misses)
- Speculation Con: speculation is costly if an exceptional event occurs when the speculation was incorrect
- Typical solution: speculation allows only low-cost exceptional events (1st-level cache miss)
- When an expensive exceptional event occurs (2nd-level cache miss or TLB miss), the processor waits until the instruction causing the event is no longer speculative before handling the event
- Assuming single branch per cycle: future may speculate across multiple branches!
Limits to ILP
- Conflicting studies of amount
  - Benchmarks (vectorized Fortran FP vs. integer C programs)
  - Hardware sophistication
  - Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on processor performance curve?
  - Intel MMX, SSE (Streaming SIMD Extensions): 64 bit ints
  - Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock
  - Motorola AltiVec: 128 bit ints and FPs
  - Supersparc Multimedia ops, etc.
Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming: infinite virtual registers => all register WAW & WAR hazards are avoided
2. Branch prediction: perfect; no mispredictions
3. Jump prediction: all jumps perfectly predicted
   2 & 3 => machine with perfect speculation & an unbounded buffer of instructions available
4. Memory-address alias analysis: addresses are known & a store can be moved before a load provided addresses not equal
Also: unlimited number of instructions issued/clock cycle; perfect caches; 1 cycle latency for all instructions (FP *,/)
Upper Limit to ILP: Ideal Machine (Figure 3.35 p. 242)
Integer: 18 - 60
FP: 75 - 150
IPC
More Realistic HW: Branch Impact Figure 3.
Change from Infinite window to examine to
2000 and maximum issue of 64
instructions per clock cycle
[Figure: IPC by predictor: Perfect, Tournament, BHT (512), Profile, No prediction]
FP: 15 - 45
Integer: 6 - 12
IPC
More Realistic HW:
Renaming Register Impact Figure 3.
Change: 2000 instr window, 64 instr issue, 8K 2-level prediction
[Figure: IPC by number of renaming registers: Infinite, 256, 128, 64, 32, None]
Integer: 5 - 15
FP: 11 - 45
IPC
SPEC 2000 Performance 3/2001 Source: Microprocessor Report, www.MPRonline.com
[Figure: SPEC 2000 performance chart with annotated ratios 1.6X, 3.8X, 1.2X, 1.7X, 1.5X]
Conclusion
- 1985-2000: 1000X performance – Moore’s Law transistors/chip => Moore’s Law for Performance/MPU
- Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and (real) Moore’s Law to get 1.55X/year
  - Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution, ...
- ILP limits: to make performance progress in future, need explicit parallelism from programmer vs. implicit parallelism of ILP exploited by compiler, HW?
  - Otherwise drop to old rate of 1.3X per year?
  - Less than 1.3X because of processor-memory performance gap?
- Impact on you: if you care about performance, better think about explicitly parallel algorithms vs. rely on ILP?