Superscalar Processors:
Branch Prediction
Dynamic Scheduling
Superscalar Processors
Superscalar: A Sequential Architecture
A superscalar processor is a representative ILP implementation of a sequential architecture
- For every instruction issued by a superscalar processor, the hardware must check whether the operands interfere with the operands of any other instruction that is (1) already in execution, (2) already issued but waiting for completion of interfering instructions that would have been executed earlier in a sequential program, or (3) being issued concurrently but would have been executed earlier in the sequential execution of the program
- A superscalar processor issues multiple instructions per cycle
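The operand interference check described above can be sketched as a pairwise comparison of register names. This is only an illustration: the `Inst` record, the register names, and the RAW/WAR/WAW labels are assumptions introduced here, not something the notes define.

```python
from typing import NamedTuple, Set, Tuple

class Inst(NamedTuple):
    """Hypothetical instruction record: two source registers and one destination."""
    name: str
    srcs: Tuple[str, ...]
    dst: str

def hazards(earlier: Inst, later: Inst) -> Set[str]:
    """Return the dependences that `later` has on `earlier` (program order)."""
    found = set()
    if earlier.dst in later.srcs:
        found.add("RAW")  # true dependence: later reads what earlier writes
    if later.dst in earlier.srcs:
        found.add("WAR")  # false (anti) dependence: later overwrites an input of earlier
    if later.dst == earlier.dst:
        found.add("WAW")  # false (output) dependence: both write the same register
    return found

def can_issue(candidate: Inst, older: list) -> bool:
    """A candidate may issue only if it interferes with no older, unfinished instruction."""
    return all(not hazards(e, candidate) for e in older)

a = Inst("a", ("r1", "r2"), "r3")
b = Inst("b", ("r3", "r4"), "r5")   # reads r3, which a writes
c = Inst("c", ("r6", "r7"), "r8")   # touches none of a's registers
```

The number of comparisons grows with both the issue width and the number of instructions in flight, which is why the check becomes expensive in wide machines.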
Superscalar Terminology
Basic
- Superscalar: able to issue > 1 instruction / cycle
- Superpipelined: deep, but not superscalar, pipeline. E.g., MIPS R5000 has 8 stages
- Branch prediction: logic to guess whether or not a branch will be taken, and possibly the branch target
Advanced
- Out-of-order: able to issue instructions out of program order
- Speculation: execute instructions beyond branch points, possibly nullifying them later
- Register renaming: able to dynamically assign physical registers to instructions
- Retire unit: logic to keep track of instructions as they complete
Superscalar Execution Example
Single Adder, Data Dependence, In-Order Issue
Assumptions
- Single FP adder takes 2 cycles
- Single FP multiplier takes 5 cycles
- Can issue add & multiply together
- Must issue in order
- Operand order: in, in, out
v: addt $f2, $f4, $f
w: mult $f10, $f6, $f
x: addt $f10, $f8, $f
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f
[Figure: in-order issue timeline for v, w, x, y (single adder, data dependence)]
Data Flow
[Figure: data-flow graph of the additions (v, x, y, z) and the multiplication (w) over the $f registers]
Critical Path = 9 cycles
Adding Advanced Features
Out Of Order Issue
- Can start y as soon as adder available
- Must hold back z until $f10 not busy & adder available
[Figure: out-of-order issue timeline for v, w, x, y, z]
v: addt $f2, $f4, $f
w: mult $f10, $f6, $f
x: addt $f10, $f8, $f
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f
Adding Advanced Features
With Register Renaming
[Figure: issue timeline for v, w, x, y, z with register renaming]
v: addt $f2, $f4, $f10a
w: mult $f10a, $f6, $f10a
x: addt $f10a, $f8, $f
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f
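A minimal sketch of register renaming, assuming an unlimited pool of physical registers and the in, in, out operand order used above. The instruction tuples and the `p0`, `p1`, ... names are hypothetical and are not the registers of the slide example.

```python
from itertools import count

def rename(program):
    """Rename (op, src1, src2, dst) tuples so every write gets a fresh physical register.

    Assumption: the supply of physical registers is unlimited; real hardware
    recycles a finite pool as instructions retire.
    """
    fresh = count()
    mapping = {}  # architectural register -> most recent physical register

    def lookup(reg):
        # A register read before any write gets a physical name on first use.
        if reg not in mapping:
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    renamed = []
    for op, s1, s2, dst in program:
        p1, p2 = lookup(s1), lookup(s2)    # read sources before remapping dst
        mapping[dst] = f"p{next(fresh)}"   # new name removes WAR/WAW conflicts
        renamed.append((op, p1, p2, mapping[dst]))
    return renamed

# Three writers of r5: only the add -> mul pair is a true dependence.
prog = [("add", "r1", "r2", "r5"),
        ("mul", "r5", "r3", "r5"),
        ("add", "r2", "r3", "r5")]
out = rename(prog)
```

After renaming, the third instruction no longer has to wait for the multiply to finish with `r5`, while the multiply still reads the value produced by the first add.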
Flow Path Model of Superscalars
[Figure: flow path model. Instruction flow: I-cache, branch predictor, FETCH, instruction buffer, DECODE. Register data flow: integer, floating-point, media, and memory units in EXECUTE, with the reorder buffer (ROB). Memory data flow: store queue, COMMIT, D-cache.]
[Figure: Contrasting decoding and instruction issue in a scalar and a 4-way superscalar processor. Scalar issue: Icache, instruction buffer, decode / issue (typical FX-pipeline layout F, D/I, ...). Superscalar issue: Icache, instruction buffer, decode / issue (F, D, I, ...).]
Superscalar Processors: Tasks
parallel decoding
superscalar instruction issue
parallel instruction execution
- preserving sequential consistency of exception processing
- preserving sequential consistency of execution
Superscalar Issues to be considered
Parallel decoding – a more complex task than in scalar processors.
- A high issue rate can lengthen the decoding cycle, so predecoding is used.
- Predecoding: partial decoding performed while instructions are loaded into the instruction cache.
Superscalar instruction issue – a higher issue rate gives rise to higher processor performance, but it also amplifies the restrictive effects of control and data dependencies on processor performance.
- To overcome these problems, designers use advanced techniques such as shelving, register renaming, and speculative branch processing.
Superscalar issues
Parallel instruction execution task – gives rise to the task of “preserving the sequential consistency of instruction execution”. While instructions are executed in parallel, they usually complete out of order with respect to the sequential program order.
Preservation of sequential consistency of exception processing task
Pre-Decoding
A superscalar processor has more EUs than a scalar processor, and therefore a higher number of instructions in execution
- more dependency check comparisons needed
Predecoding – as the I-cache is being loaded, a predecode unit performs a partial decoding and appends a number of decode bits to each instruction. These bits usually indicate:
- the instruction class
- the type of resources required for execution
- the fact that branch target addresses have been calculated
- Predecoding is used in the PowerPC 601, MIPS R8000, and SuperSparc
The Principle of Predecoding
[Figure: instructions flow from the second-level cache (or memory) through the predecode unit into the Icache, typically 128 bits/cycle into the predecode unit and e.g. 148 bits/cycle out of it (see note 1). When instructions are written into the Icache, the predecode unit appends 4-7 bits to each RISC instruction.]
Note 1: In the AMD K5, which is an x86-compatible CISC processor, the predecode unit appends 5 bits to each byte.
Superscalar Instruction Issues
The issue policy specifies how false data dependencies and unresolved control dependencies are coped with during instruction issue
- the design options are either to avoid them during instruction issue, by using register renaming and speculative branch processing respectively, or not
False data dependencies between register data may be removed by register renaming
Speculative Execution
- Expensive in hardware
- Alternative is to perform speculative code motion at compile time
- Move operations from subsequent blocks up past branch operations into preceding blocks
- Requires less demanding hardware
- A mechanism to ensure that exceptions caused by speculatively scheduled operations are reported if and only if flow of control is such that they would have been executed in the non-speculative version of the code
- Additional registers to hold the speculative execution state
Hardware Features to Support ILP
Next... Superscalar Processor Design
How to deal with instruction flow
- Dynamic Branch prediction
How to deal with register/data flow
Solutions studied:
- Dynamic branch prediction algorithms
- Dynamic scheduling using Tomasulo method
Summary of discussions
ILP processors
Superscalar has hardware logic for extracting
parallelism
- Solutions for stalls etc. must be provided in hardware
Stalls play an even greater role in ILP processors
Software solutions, such as code scheduling through
code movement, can lead to improved execution
times
- More sophisticated techniques needed
- Can we provide some H/W support to help the compiler – leads to EPIC/VLIW
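Superscalar Pipeline Design
[Figure: superscalar pipeline stages and the buffers between them: Fetch, instruction buffer, Decode, dispatch buffer, Dispatch, issuing buffer, Execute, completion buffer, Complete, store buffer, Retire. Instruction flow and data flow are marked alongside.]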
Instruction Fetch Bandwidth Solutions
The ability to fetch several instructions per cycle from the cache is crucial to superscalar performance
- Use an instruction fetch buffer to prefetch instructions
- Fetch multiple instructions in one cycle to support the s-wide issue of superscalar processors
Design the instruction cache (I-Cache) to support this
- Shall discuss solutions when Memory design is covered
Instruction Decoding Issues
Primary tasks:
- Identify individual instructions
- Determine instruction types
- Detect inter-instruction dependences
Predecoding
- Identify inst classes
- Add more bits to instruction after fetching
Two important factors:
- Instruction set architecture
- Width of parallel pipeline
Why Branches: CFG and Branches
Basic blocks and their constituent instructions must be stored in sequential locations in memory
- In mapping a CFG to linear consecutive memory locations, additional unconditional branches must be added
Encountering branches (conditional and unconditional) at run time induces deviations from the implied sequential control flow and consequent disruptions to the sequential fetching of instructions
- These disruptions cause stalls in the instruction fetch (IF) stage and reduce overall IF bandwidth
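Mapping CFG to Linear Instruction Sequence
[Figure: a control flow graph with basic blocks A, B, C, D and two linear layouts of those blocks in memory; the conditional branches and the added unconditional branch are marked.]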
Branch Types and Implementation
Types of Branches
- Conditional or Unconditional?
- Subroutine Call (aka Link), needs to save PC?
- How is the branch target computed?
- Static Target e.g. immediate, PC-relative
- Dynamic targets e.g. register indirect
What’s So Bad About Branches?
Performance Penalties
- Use up execution resources
- Fragmentation of I-Cache lines
- Disruption of sequential control flow
- Need to determine branch direction (conditional branches)
- Need to determine branch target
Robs instruction fetch bandwidth and ILP
Branch Actions
When branches occur, disruption to IF occurs
For unconditional branches
- Subsequent instruction cannot be fetched until target address determined
For conditional branches
- Machine must wait for resolution of branch condition
- And if branch taken then wait till target address computed
Branch inst executed by the branch functional unit
Note: Cost in superscalar/ILP processors = width (parallelism) x stall cycles
- 3 stall cycles on a 4-wide machine = 12 lost instruction issue slots
CPU Performance..
Recall: CPU time = IC * CPI * Clk
- CPI = ideal CPI + stall cycles/inst
- Minimizing CPI implies minimize stall cycles
- Stall cycles from branch instructions
- How to determine the number of stall cycles
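The relations above can be written out as a small calculation. The function names and the sample numbers (ideal CPI, branch fraction, stall cycles per branch) are assumptions chosen for illustration; only the formulas come from the notes.

```python
def lost_issue_slots(width: int, stall_cycles: int) -> int:
    """Cost of a stall on a superscalar machine: width x stall cycles."""
    return width * stall_cycles

def stall_cycles_per_inst(branch_fraction: float, stalls_per_branch: float) -> float:
    """Assumed model: only branches stall, and each costs the same number of cycles."""
    return branch_fraction * stalls_per_branch

def cpi(ideal_cpi: float, stalls_per_inst: float) -> float:
    """CPI = ideal CPI + stall cycles per instruction."""
    return ideal_cpi + stalls_per_inst

def cpu_time(ic: float, cpi_value: float, clk: float) -> float:
    """CPU time = IC * CPI * Clk, with Clk the clock cycle time."""
    return ic * cpi_value * clk

# Hypothetical 4-wide machine: ideal CPI 0.25, 20% branches, 3 stall cycles each.
stalls = stall_cycles_per_inst(0.20, 3)
effective = cpi(0.25, stalls)
```

With these made-up numbers the stall term is larger than the ideal CPI itself, which illustrates why wide machines are so sensitive to branch stalls.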
Branch penalties/stall cycles
When branch occurs two parts needed:
- Branch target address ( BTA ) has to be computed
- Branch condition resolution
Addressing modes will affect BTA delay
- For PC relative, BTA can be generated during Fetch stage for 1 cycle penalty
- For Register indirect, BTA generated after decode stage (to access register) = 2 cycle penalty
- For register indirect with offset = 3 cycle penalty
For branch condition resolution, the penalty depends on the method
- If condition code registers used, then penalty =
- If ISA permits comparison of 2 registers, then the condition is an output of the ALU => 3 cycles
Penalty will be the max of the penalties for condition resolution and BTA
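The "max of the two penalties" rule can be sketched as follows. The BTA penalties are the ones listed above; the condition penalty is passed in by the caller because only the register-comparison case (3 cycles) is given here, and the mode names are labels invented for this sketch.

```python
# BTA penalties in cycles, per addressing mode, as listed above.
BTA_PENALTY = {
    "pc_relative": 1,
    "register_indirect": 2,
    "register_indirect_offset": 3,
}

def branch_penalty(addressing_mode: str, condition_penalty: int) -> int:
    """Overall penalty is the larger of the BTA delay and the condition delay.

    condition_penalty is supplied by the caller; 0 models an unconditional
    branch, which has no condition to resolve (an assumption of this sketch).
    """
    return max(BTA_PENALTY[addressing_mode], condition_penalty)

# Conditional PC-relative branch comparing two registers: the condition dominates.
p1 = branch_penalty("pc_relative", 3)
# Unconditional register-indirect jump: only the target computation matters.
p2 = branch_penalty("register_indirect", 0)
```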
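Condition Resolution
[Figure: pipeline (Fetch, decode buffer, Decode, dispatch buffer, Dispatch, reservation stations, Issue, Execute, Finish, completion buffer, Complete, store buffer, Retire) with the branch unit, showing where the condition is resolved: from the condition code register (CC reg.) or from a general-purpose register value comparison (GP reg. value comp.), each with its own stall count.]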
Determining Branch Target
Problem: Cannot fetch subsequent instructions until
branch target is determined
Minimize delay
- Generate branch target early in the pipeline
Make use of delay
- Bias for not taken
- Predict branch target
PC-relative vs Register Indirect targets
Keys to Branch Prediction
Target Address Generation
- Access register
- PC, GP register, Link register
- Perform calculation
- +/- offset, auto incrementing/decrementing
⇒ Target Speculation
Condition Resolution
- Access register
- Condition code register, data register, count register
- Perform calculation
- Comparison of data register(s)
⇒ Condition Speculation
History based Branch Target
Speculation – Branch Target Buffer
If you have seen this branch instruction before, can
you figure out the target address faster?
How to organize the “history table”?
History based Branch Target
Speculation – Branch Target Buffer
Use branch target buffer (BTB) to store previous
branch target address
BTB is a small fully associative cache
- Accessed during instruction fetch using PC
BTB can have three fields
- Branch instruction address ( BIA )
- Branch target address (BTA)
- History bits
When the PC matches the BIA of an entry, there is a hit in the BTB
- A hit in BTB Implies inst being fetched is branch inst
- The BTA field can be used to fetch next instruction if particular branch is predicted to be taken
- Note: branch inst is still fetched and executed for validation/recovery
A small “cache-like” memory in the instruction fetch stage
Remembers previously executed branches, their addresses,
information to aid prediction, and most recent target
addresses
Instruction fetch stage compares current PC against those
in BTB to “guess” nPC
- If matched then prediction is made else nPC=PC+
- If predict taken then nPC=target address in BTB else nPC=PC+
When branch is actually resolved, BTB is updated
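A minimal sketch of the BTB lookup and update described above. Several things here are assumptions: the table is an unbounded dictionary rather than a small associative cache, the history is a single "last outcome" bit, and the sequential step `inst_size` defaults to 4 bytes.

```python
class BranchTargetBuffer:
    """Toy BTB keyed by branch instruction address (BIA)."""

    def __init__(self, inst_size: int = 4):
        self.inst_size = inst_size   # assumed fixed instruction size in bytes
        self.entries = {}            # BIA -> (BTA, predict_taken)

    def next_pc(self, pc: int) -> int:
        """Guess nPC during fetch, before the instruction is even decoded."""
        entry = self.entries.get(pc)
        if entry is None:                 # miss: treat as a non-branch
            return pc + self.inst_size
        target, predict_taken = entry     # hit: pc holds a known branch
        return target if predict_taken else pc + self.inst_size

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Record the resolved outcome and the most recent target."""
        self.entries[pc] = (target, taken)

btb = BranchTargetBuffer()
first_guess = btb.next_pc(0x100)     # never seen: falls through
btb.update(0x100, taken=True, target=0x40)
second_guess = btb.next_pc(0x100)    # now redirected to the stored target
```

The branch is still executed after the guess; `update` stands for the validation step that corrects the entry when the guess was wrong.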
[Figure: Branch Target Buffer (BTB). The current PC is compared against the branch instruction address (tag) field; each entry also holds the branch history and the most recent branch target.]
Branch Condition Speculation
Biased For Not Taken
- Does not affect the instruction set architecture
- Not effective in loops
Software Prediction
- Encode an extra bit in the branch instruction
- Predict not taken: set bit to 0
- Predict taken: set bit to 1
- Bit set by compiler or user; can use profiling
- Static prediction, same behavior every time
Prediction Based on Branch Offsets
- Positive offset: predict not taken
- Negative offset: predict taken
Prediction Based on History
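The static schemes above can be compared on a simple loop branch. The outcome trace is hypothetical (a backward loop branch taken nine times, then falling through once), and the function names are introduced only for this sketch.

```python
def predict_not_taken() -> bool:
    """Biased for not taken: always guess fall-through."""
    return False

def predict_by_hint_bit(hint_bit: int) -> bool:
    """Software prediction: a bit in the branch instruction (1 = taken)."""
    return hint_bit == 1

def predict_by_offset(offset: int) -> bool:
    """Offset-based: negative (backward) offsets are predicted taken."""
    return offset < 0

def accuracy(prediction: bool, outcomes: list) -> float:
    """Fraction of outcomes matched by one fixed (static) prediction."""
    return sum(prediction == o for o in outcomes) / len(outcomes)

# Hypothetical loop-closing branch with offset -16: taken 9 times, then exits.
loop_outcomes = [True] * 9 + [False]
acc_not_taken = accuracy(predict_not_taken(), loop_outcomes)
acc_offset = accuracy(predict_by_offset(-16), loop_outcomes)
```

On this trace the not-taken bias is right only on the loop exit, which is the sense in which it is "not effective in loops".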
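Branch Instruction Speculation
[Figure: pipeline (Fetch, decode buffer, Decode, dispatch buffer, Dispatch, reservation stations, Issue, Execute, Finish, completion buffer) with a branch predictor (using a BTB) at the fetch stage. The predictor takes the PC and supplies the speculative target and speculative condition; the FA-mux selects between nPC(seq.) = PC+ and nPC = BP(PC) and sends nPC to the Icache. The branch unit feeds back the BTB update (target addr. and history).]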
Branch Prediction Function
Based on opcode only (%)
IBM1 IBM2 IBM3 IBM4 DEC CDC
Based on history of branch
- Branch prediction function F (X1, X2, .... )
- Use up to 5 previous branches for history (%)
History  IBM1  IBM2  IBM3  IBM4  DEC   CDC
0        64.1  64.4  70.4  54.0  73.8  77.
1        91.9  95.2  86.6  79.7  96.5  82.
2        93.3  96.5  90.8  83.4  97.5  90.
3        93.7  96.7  91.2  83.5  97.7  93.
4        94.5  97.0  92.0  83.7  98.1  95.
5        94.7  97.1  92.2  83.9  98.2  95.
Prediction accuracy approaches its maximum with as few as 2 preceding branch occurrences used as history
Results (%)
IBM1  IBM2  IBM3  IBM4  DEC   CDC
93.3  96.5  90.8  83.4  97.5  90.
Example Prediction Algorithm
[Figure: predictor state diagram. Each state is labeled with the last two branches (TT, NT, TN, NN) and the next prediction (T or N); arcs labeled T and N move between states.]
How does prediction algo work?
While (i > 0) do /* Branch 1 */
    If (x>y) then /* Branch 2 */
        {then part} /* no changes to x,y in this code */
    else {else part}
    i= i-1;
Two branches in this code: B1, B2
How many times is each executed?
Example Prediction Algorithm
[Figure: the predictor state diagram again, with states labeled by the last two branches and the next prediction.]
Assume history bits = TN for B1, TT for B2
How does prediction algo work?
i=100; x=30; y=50;
While (i > 0) do /* Branch 1 */
    If (x>y) then /* Branch 2 */
        {then part} /* no changes to x,y in this code */
    else {else part}
    i= i-1;
Using the same 2-bit predictor for all branches:
Prediction for B1:?
Prediction for B2:?
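One way to work through the exercise is to simulate it. The sketch below makes several assumptions that should be checked against the state diagram: the history is read as (older, newer) outcome, the predictor guesses "taken" unless both remembered outcomes were not taken, each branch keeps its own two history bits, and a branch counts as taken when its condition is true.

```python
def predict(history):
    """history = (older, newer); True means taken. Assumed rule: predict N only after NN."""
    return history[0] or history[1]

def count_mispredictions(outcomes, history):
    """Run the predictor over a branch's outcomes, starting from the given history."""
    wrong = 0
    for actual in outcomes:
        if predict(history) != actual:
            wrong += 1
        history = (history[1], actual)   # shift in the newest outcome
    return wrong

def trace_loop(i, x, y):
    """Record the condition outcomes of B1 (i > 0) and B2 (x > y) for the loop above."""
    b1, b2 = [], []
    while True:
        cond = i > 0
        b1.append(cond)
        if not cond:
            break
        b2.append(x > y)   # x and y never change inside the loop
        i -= 1
    return b1, b2

T, N = True, False
b1, b2 = trace_loop(100, 30, 50)
b1_wrong = count_mispredictions(b1, (T, N))   # initial history TN for B1
b2_wrong = count_mispredictions(b2, (T, T))   # initial history TT for B2
```

Under these assumptions B1 is evaluated 101 times and mispredicted only on loop exit, while B2 is evaluated 100 times and mispredicted twice while its history drains from TT to NN.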
Other Prediction Algorithms
[Figure: state diagrams of two 2-bit predictors, the saturation counter and the hysteresis counter, with taken (T) and not-taken (N) transitions and the prediction (t or n) shown in each state.]
Combining prediction accuracy with the BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide a net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.
IBM RS/6000 Study [Nair, 1992]
Five different branch types
- b: unconditional branch
- bl: branch and link (subroutine calls)
- bc: conditional branch
- bcr: conditional branch using link register (subroutine returns)
- bcc: conditional branch using count register (system calls)
A separate branch functional unit overlaps branch instructions with other instructions
Two causes for branch stalls
- Unresolved conditions
- Branches downstream too close to unresolved branches
Number of Counter Bits Needed
Branch history table size:
- Direct-mapped array of 2^k entries
- Programs, like gcc, can have over 7000 conditional branches
- In collisions, multiple branches share the same predictor
- Variation of branch penalty with branch history table size levels out at 1024
Prediction Accuracy (Overall CPI Overhead)
Benchmark  3-bit         2-bit         1-bit         0-bit
spice2g6   97.0 (0.009)  97.0 (0.009)  96.2 (0.013)  76.6 (0.031)
doduc      94.2 (0.003)  94.3 (0.003)  90.2 (0.004)  69.2 (0.022)
gcc        89.7 (0.025)  89.1 (0.026)  86.0 (0.033)  50.0 (0.128)
espresso   89.5 (0.045)  89.1 (0.047)  87.2 (0.054)  58.5 (0.176)
eqntott    89.3 (0.028)  87.2 (0.033)  82.9 (0.046)  78.4 (0.049)
li         88.3 (0.042)  86.8 (0.048)  82.5 (0.063)  62.4 (0.142)
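A direct-mapped branch history table with 2-bit counters can be sketched as below. Treating each entry as a saturating counter, starting it at "weakly not taken", and indexing with the low PC bits above a 4-byte instruction offset are all assumptions of this sketch.

```python
class BranchHistoryTable:
    """Direct-mapped table of 2^k two-bit saturating counters (no tags)."""

    def __init__(self, k: int, inst_size: int = 4):
        self.mask = (1 << k) - 1
        # Drop the byte-offset bits; assumes inst_size is a power of two.
        self.shift = inst_size.bit_length() - 1
        self.counters = [1] * (1 << k)   # 0..3; start at weakly not taken

    def index(self, pc: int) -> int:
        return (pc >> self.shift) & self.mask

    def predict(self, pc: int) -> bool:
        """Counter values 2 and 3 predict taken."""
        return self.counters[self.index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# With k=2 there are only 4 entries, so 0x00 and 0x10 collide.
bht = BranchHistoryTable(k=2)
before = bht.predict(0x10)
bht.update(0x00, True)
bht.update(0x00, True)
after = bht.predict(0x10)   # changed by a different branch that shares the entry
```

The last line shows the collision effect: because there are no tags, a branch inherits whatever state another branch left in the shared counter.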
Accuracy of Different Schemes
[Figure: bar chart of the frequency of mispredictions (0% to 20%) for nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li, comparing three schemes: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries (2,2).]
Mis-speculation Recovery
Eliminate Incorrect Path
- Must ensure that the mis-speculated instructions produce no side effects
Start New Correct Path
- Must have remembered the alternate (non-predicted) path
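[Figure: tree of speculated branches, each with a not-taken (NT) and a taken (T) path; instructions fetched along a predicted path carry that branch's speculation tag.]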
Mis-speculation Recovery
Eliminate Incorrect Path
- Use branch tag(s) to deallocate completion buffer entries occupied by speculative instructions (now determined to be mis-speculated).
- Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations
How expensive is a misprediction?
Start New Correct Path
- Update PC with computed branch target (if it was predicted NT)
- Update PC with sequential instruction address (if it was predicted T)
- Can begin speculation once again when encounter a new branch
How soon can you restart?
Trailing Confirmation
Trailing Confirmation
- When branch is resolved, remove/deallocate speculation tag
- Permit completion of branch and following instructions
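Tag-based recovery and trailing confirmation can be sketched with a toy completion buffer. The class, its method names, and the tag strings are hypothetical; the sketch only models which entries are squashed or allowed to complete, not the refetch from the correct path.

```python
class CompletionBuffer:
    """Entries in program order, each carrying the tags of unresolved branches it depends on."""

    def __init__(self):
        self.entries = []   # list of (instruction name, frozenset of speculation tags)

    def add(self, name, tags=()):
        self.entries.append((name, frozenset(tags)))

    def mispredict(self, tag):
        """Eliminate the incorrect path: drop every entry carrying the failed branch's tag."""
        self.entries = [(n, t) for n, t in self.entries if tag not in t]

    def confirm(self, tag):
        """Trailing confirmation: the branch resolved as predicted, so remove its tag."""
        self.entries = [(n, t - {tag}) for n, t in self.entries]

    def completable(self):
        """In-order completion: leading entries that are no longer speculative."""
        done = []
        for name, tags in self.entries:
            if tags:
                break
            done.append(name)
        return done

buf = CompletionBuffer()
buf.add("i1")
buf.add("i2", ["tag1"])
buf.add("i3", ["tag1", "tag2"])
buf.confirm("tag1")        # first branch was predicted correctly
ready = buf.completable()
buf.mispredict("tag2")     # second branch was wrong: i3 is squashed
```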
[Figure: the same tree of speculated branches (NT and T paths) with speculation tags, showing a tag being deallocated once its branch is resolved.]
Impediments to Parallel/Wide Fetching
Average Basic Block Size
- integer code: 4-6 instructions
- floating-point code: 6-10 instructions
Branch Prediction Mechanisms
- must make multiple branch predictions per cycle
- potentially multiple predicted taken branches
Conventional I-Cache Organization – discuss later
- must fetch from multiple predicted taken targets per cycle
- must align and collapse multiple fetch groups per cycle
… Trace Caching!!
Recap..
CPU time = IC * CPI * Clk
- CPI = ideal CPI + stall cycles/instruction
- Stall cycles due to (1) control hazards and (2) data hazards
What did branch prediction do?
- Tries to reduce number of stall cycles from control hazards
What about stall cycles from data hazards
Next- Register Dataflow and Dynamic
Scheduling
Branch prediction provides a solution to the control flow problem and increases instruction flow bandwidth
- Stalls due to control flow change can decrease performance
Next step is flow in the execute stage – register data
flow
- Parallel execution of instructions
- Keep dependencies in mind
- Remove false dependencies, honor true dependencies
- “infinite” register set can remove false dependencies
- Go back and look at the nature of true dependencies using the data flow diagram of a computation