superscalar, Study notes of Advanced Computer Architecture

Osmania University Advanced Computer Architecture

good material for ACA

Typology: Study notes

2013/2014

Uploaded on 06/03/2014

nagesh 🇮🇳

Superscalar Processors:

Branch Prediction

Dynamic Scheduling

Superscalar Processors

Superscalar: A Sequential Architecture

A superscalar processor is a representative ILP implementation of a sequential architecture

- For every instruction it issues, the hardware of a superscalar processor must check whether the instruction's operands interfere with the operands of any other instruction that is
• (1) already in execution,
• (2) already issued but waiting for interfering instructions to complete, where those instructions would have executed earlier in the sequential program, or
• (3) being issued concurrently but would have executed earlier in the sequential execution of the program
- A superscalar processor issues multiple instructions per cycle
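The operand check described above can be sketched as a pairwise comparison of source and destination registers. This is a minimal illustration, not the slides' own method: the `Instr` record, the register names and the conservative `can_issue` policy are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    srcs: tuple   # registers read
    dst: str      # register written

def hazards(earlier: Instr, later: Instr) -> set:
    """Register hazards that `later` has on `earlier` (program order)."""
    found = set()
    if earlier.dst in later.srcs:
        found.add("RAW")   # true dependence: later reads what earlier writes
    if later.dst in earlier.srcs:
        found.add("WAR")   # anti dependence: later overwrites an input of earlier
    if later.dst == earlier.dst:
        found.add("WAW")   # output dependence: both write the same register
    return found

def can_issue(candidate: Instr, in_flight: list) -> bool:
    """Conservative check: issue only if no hazard exists against any earlier instruction."""
    return all(not hazards(e, candidate) for e in in_flight)

# Hypothetical instructions for illustration only.
a = Instr("a", ("r1", "r2"), "r3")
b = Instr("b", ("r3", "r4"), "r5")   # reads r3, which a produces
c = Instr("c", ("r6", "r7"), "r8")   # independent of a and b

print(hazards(a, b))          # {'RAW'}
print(can_issue(b, [a]))      # False
print(can_issue(c, [a, b]))   # True
```

The number of comparisons grows with both the issue width and the number of instructions in flight, which is why the check has to be done in hardware.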

Superscalar Terminology

Basic

Superscalar: able to issue > 1 instruction / cycle
Superpipelined: deep, but not superscalar, pipeline. E.g., MIPS R5000 has 8 stages
Branch prediction: logic to guess whether or not a branch will be taken, and possibly the branch target

Advanced

Out-of-order: able to issue instructions out of program order
Speculation: execute instructions beyond branch points, possibly nullifying later
Register renaming: able to dynamically assign physical registers to instructions
Retire unit: logic to keep track of instructions as they complete


Superscalar Execution Example

Single Adder, Data Dependence, In Order

Assumptions

Single FP adder takes 2 cycles
Single FP multiplier takes 5 cycles
Can issue add & multiply together
Must issue in-order
Operand order is in, in, out

v: addt $f2, $f4, $f
w: mult $f10, $f6, $f
x: addt $f10, $f8, $f
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f

[Figure: data-flow graph of instructions v to z and the in-order issue timeline (single adder, data dependence). Critical path = 9 cycles.]

Adding Advanced Features

Out Of Order Issue

Can start y as soon as adder available
Must hold back z until $f10 not busy & adder available

v: addt $f2, $f4, $f
w: mult $f10, $f6, $f
x: addt $f10, $f8, $f
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f

[Figure: issue timeline of v, w, x, y, z with out-of-order issue.]

Adding Advanced Features

With Register Renaming

v: addt $f2, $f4, $f10a
w: mult $f10a, $f6, $f10a
x: addt $f10a, $f8, $f
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f

[Figure: issue timeline of v, w, x, y, z with register renaming.]
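Register renaming can be sketched as a table that maps each architectural register to its most recent physical register. The sketch below is one simple scheme under stated assumptions: the `rename` function, the `p0, p1, ...` naming and the instruction sequence are hypothetical and do not reproduce the slide's exact registers.

```python
from itertools import count

def rename(program):
    """program: list of (name, src1, src2, dst) tuples in program order.
    Returns the list with architectural registers replaced by physical ones."""
    fresh = (f"p{i}" for i in count())
    mapping = {}                      # architectural -> current physical register

    def lookup(reg):
        if reg not in mapping:        # first use of a live-in value
            mapping[reg] = next(fresh)
        return mapping[reg]

    renamed = []
    for name, s1, s2, dst in program:
        p1, p2 = lookup(s1), lookup(s2)   # read sources before remapping dst
        mapping[dst] = next(fresh)        # every write gets a new physical register
        renamed.append((name, p1, p2, mapping[dst]))
    return renamed

# Hypothetical sequence in (in, in, out) order.
prog = [
    ("v", "f2", "f4", "f10"),
    ("w", "f10", "f6", "f10"),   # RAW on v, WAW with v
    ("y", "f4", "f6", "f4"),     # WAR with v (v reads f4)
    ("z", "f4", "f8", "f10"),    # RAW on y, WAW with w
]
for ins in rename(prog):
    print(ins)
# ('v', 'p0', 'p1', 'p2')
# ('w', 'p2', 'p3', 'p4')
# ('y', 'p1', 'p3', 'p5')
# ('z', 'p5', 'p6', 'p7')
```

After renaming, every destination is unique, so the WAW and WAR hazards disappear, while the true dependences (w on v, z on y) survive as shared physical register names.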

Flow Path Model of Superscalars

[Figure: flow path model of a superscalar processor. Instruction flow: I-cache, FETCH with branch predictor, instruction buffer, DECODE. Register data flow: EXECUTE on integer, floating-point, media and memory units, with the reorder buffer (ROB). Memory data flow: COMMIT, store queue, D-cache.]

[Figure: contrasting decoding and instruction issue in a scalar and a 4-way superscalar processor. Scalar issue: typical FX-pipeline layout F, D/I, ... fed from the I-cache. Superscalar issue: F, D, I, ... fed from the I-cache through an instruction buffer.]

Superscalar Processors: Tasks

parallel decoding

superscalar instruction issue

parallel instruction execution

preserving sequential consistency of execution
preserving sequential consistency of exception processing

Superscalar Issues to be considered

Parallel decoding: a more complex task than in scalar processors.

A high issue rate can lengthen the decoding cycle, so predecoding is used: partial decoding is performed while instructions are loaded into the instruction cache.

Superscalar instruction issue: a higher issue rate gives rise to higher processor performance, but it also amplifies the restrictive effects of control and data dependencies on performance.

To overcome these problems, designers use advanced techniques such as shelving, register renaming, and speculative branch processing.

Superscalar issues

Parallel instruction execution task: also called "preservation of the sequential consistency of instruction execution". When instructions are executed in parallel, they usually complete out of order with respect to the sequential program order

Preservation of the sequential consistency of exception processing task

Pre-Decoding

A superscalar processor has more execution units (EUs) than a scalar processor, and therefore a higher number of instructions in execution
More dependency check comparisons are needed

Predecoding: as the I-cache is being loaded, a predecode unit performs a partial decoding and appends a number of decode bits to each instruction. These bits usually indicate:
the instruction class
the type of resources required for execution
the fact that branch target addresses have been calculated
Predecoding is used in the PowerPC 601, MIPS R8000 and SuperSPARC

[Figure: The Principle of Predecoding. Instructions pass from the second-level cache (or memory) through the predecode unit into the I-cache. The input is typically 128 bits/cycle. When instructions are written into the I-cache, the predecode unit appends 4-7 bits to each RISC instruction, giving e.g. 148 bits/cycle (1).]

(1) In the AMD K5, which is an x86-compatible CISC processor, the predecode unit appends 5 bits to each byte

Superscalar Instruction Issues

The issue policy specifies how false data dependencies and unresolved control dependencies are coped with during instruction issue
The design options are either to avoid them during instruction issue, by using register renaming and speculative branch processing respectively, or not

False data dependencies between register data may be removed by register renaming

Speculative Execution

Expensive in hardware
Alternative is to perform speculative code motion at compile time
- Move operations from subsequent blocks up past branch operations into preceding blocks
Requires less demanding hardware:
- A mechanism to ensure that exceptions caused by speculatively scheduled operations are reported if and only if flow of control is such that they would have been executed in the non-speculative version of the code
- Additional registers to hold the speculative execution state

Hardware Features to Support ILP

Next: Superscalar Processor Design

How to deal with instruction flow

Dynamic Branch prediction

How to deal with register/data flow

Register renaming

Solutions studied:

Dynamic branch prediction algorithms
Dynamic scheduling using Tomasulo method

Summary of discussions

ILP processors

VLIW/EPIC, Superscalar

Superscalar has hardware logic for extracting parallelism

Solutions for stalls etc. must be provided in hardware

Stalls play an even greater role in ILP processors

Software solutions, such as code scheduling through code movement, can lead to improved execution times

More sophisticated techniques needed
Can we provide some H/W support to help the compiler? This leads to EPIC/VLIW

Superscalar Pipeline Design

[Figure: superscalar pipeline stages and the buffers between them: Fetch, instruction buffer, Decode, dispatch buffer, Dispatch, issuing buffer, Execute, completion buffer, Complete, store buffer, Retire. Labels: Instruction Flow, Data Flow.]


Instruction Fetch Bandwidth Solutions

The ability to fetch multiple instructions per cycle from the cache is crucial to superscalar performance

Use an instruction fetch buffer to prefetch instructions
Fetch multiple instructions in one cycle to support the s-wide issue of superscalar processors

Design the instruction cache (I-cache) to support this

Solutions are discussed when memory design is covered

Instruction Decoding Issues

Primary tasks:

Identify individual instructions
Determine instruction types
Detect inter-instruction dependences

Predecoding

Identify instruction classes
Add more bits to instruction after fetching

Two important factors:

Instruction set architecture
Width of parallel pipeline


Why Branches: CFG and Branches

Basic blocks and their constituent instructions must be stored in sequential locations in memory

In mapping a CFG to linear, consecutive memory locations, additional unconditional branches must be added

Encountering branches (conditional and unconditional) at run time induces deviations from the implied sequential control flow and consequent disruptions to the sequential fetching of instructions

These disruptions cause stalls in the instruction fetch (IF) stage and reduce overall IF bandwidth

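Mapping CFG to Linear Instruction Sequence

[Figure: a control flow graph with basic blocks A, B, C and D and its mapping to a linear instruction sequence. Labels: conditional branches, unconditional branch.]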

Branch Types and Implementation

Types of Branches

Conditional or Unconditional?
Subroutine Call (aka Link), needs to save PC?
How is the branch target computed?
- Static Target e.g. immediate, PC-relative
- Dynamic targets e.g. register indirect

What’s So Bad About Branches?

Performance Penalties

Use up execution resources
Fragmentation of I-Cache lines
Disruption of sequential control flow
- Need to determine branch direction (conditional branches)
- Need to determine branch target

Robs instruction fetch bandwidth and ILP

Branch actions

When branches occur, instruction fetch (IF) is disrupted

For unconditional branches

Subsequent instruction cannot be fetched until target address determined

For conditional branches

Machine must wait for resolution of branch condition
And if branch taken then wait till target address computed

Branch instructions are executed by the branch functional unit

Note: cost in superscalar/ILP processors = width (parallelism) × stall cycles

3 stall cycles on a 4-wide machine = 12 lost issue slots

CPU Performance

Recall: CPU time = IC * CPI * Clk

CPI = ideal CPI + stall cycles/inst
Minimizing CPI implies minimizing stall cycles
Stall cycles from branch instructions
- How to determine the number of stall cycles?
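The relations above can be turned into a small calculator. This is a sketch under assumptions: the function names are invented for illustration, the branch stall term is modelled as branch fraction × misprediction rate × penalty, and the numbers in the usage lines are made up rather than taken from the notes.

```python
def lost_issue_slots(width: int, stall_cycles: int) -> int:
    """Cost of a stall on a superscalar machine: width x stall cycles."""
    return width * stall_cycles

def effective_cpi(ideal_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: float) -> float:
    """CPI = ideal CPI + stall cycles per instruction (branch stalls only)."""
    stall_per_instr = branch_fraction * mispredict_rate * penalty_cycles
    return ideal_cpi + stall_per_instr

def cpu_time(instr_count: float, cpi: float, clock_period: float) -> float:
    """CPU time = IC * CPI * Clk."""
    return instr_count * cpi * clock_period

print(lost_issue_slots(4, 3))   # 12, the 4-wide example from the notes

# Illustrative numbers only: 20% branches, 10% mispredicted, 3-cycle penalty.
cpi = effective_cpi(1.0, 0.2, 0.1, 3)
print(round(cpi, 2))            # 1.06
print(cpu_time(1_000_000, cpi, 1e-9))
```

Because the stall term is a product, halving the misprediction rate or the penalty has the same effect on CPI.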

Branch penalties/stall cycles

When a branch occurs, two things are needed:

Branch target address (BTA) has to be computed
Branch condition has to be resolved

Addressing modes affect the BTA delay

For PC-relative, BTA can be generated during the fetch stage, for a 1-cycle penalty
For register indirect, BTA is generated after the decode stage (to access the register), for a 2-cycle penalty
For register indirect with offset, a 3-cycle penalty

The delay for branch condition resolution depends on the method

If condition code registers used, then penalty =
If the ISA permits comparison of 2 registers, the condition is the output of the ALU, for a 3-cycle penalty

The penalty is the max of the penalties for condition resolution and BTA
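The max rule can be written as a one-line function. In this sketch the BTA penalties come from the list above, while the condition-resolution penalty is left as a caller-supplied parameter; the names `BTA_PENALTY` and `branch_penalty` are hypothetical.

```python
# BTA penalties in cycles, as listed in the notes.
BTA_PENALTY = {
    "pc_relative": 1,
    "register_indirect": 2,
    "register_indirect_offset": 3,
}

def branch_penalty(addressing_mode: str, condition_penalty: int) -> int:
    """Overall penalty = max(BTA penalty, condition-resolution penalty).
    Pass condition_penalty=0 for an unconditional branch."""
    return max(BTA_PENALTY[addressing_mode], condition_penalty)

# Register-compare condition (3 cycles) with a PC-relative target:
print(branch_penalty("pc_relative", 3))         # 3
# Unconditional register-indirect branch:
print(branch_penalty("register_indirect", 0))   # 2
```

The two delays overlap in the pipeline, which is why the penalty is the larger of the two rather than their sum.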

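Condition Resolution

[Figure: pipeline with Fetch, decode buffer, Decode, dispatch buffer, Dispatch, reservation stations, Issue, Execute (branch unit), Finish, completion buffer, Complete, store buffer and Retire, marking where the branch condition is resolved from a CC register and from a GP register value comparison, each annotated with its stall.]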

Determining Branch Target

Problem: Cannot fetch subsequent instructions until

branch target is determined

Minimize delay

Generate branch target early in the pipeline

Make use of delay

Bias for not taken
Predict branch target

PC-relative vs Register Indirect targets

Keys to Branch Prediction

Target Address Generation

Access register
- PC, GP register, Link register
Perform calculation
- +/- offset, auto incrementing/decrementing

⇒ Target Speculation

Condition Resolution

Access register
- Condition code register, data register, count register
Perform calculation
- Comparison of data register(s)

⇒ Condition Speculation

History based Branch Target Speculation: Branch Target Buffer

If you have seen this branch instruction before, can you figure out the target address faster?

Create history table

How to organize the “history table”?

Use a branch target buffer (BTB) to store previous branch target addresses

BTB is a small fully associative cache

Accessed during instruction fetch using PC

BTB can have three fields

Branch instruction address (BIA)
Branch target address (BTA)
History bits

When the PC matches the BIA of an entry, there is a hit in the BTB

A hit in the BTB implies that the instruction being fetched is a branch instruction
The BTA field can be used to fetch the next instruction if that branch is predicted to be taken
Note: the branch instruction is still fetched and executed for validation/recovery

A small "cache-like" memory in the instruction fetch stage

Remembers previously executed branches: their addresses, information to aid prediction, and most recent target addresses

Instruction fetch stage compares the current PC against those in the BTB to "guess" the next PC (nPC)

If matched, a prediction is made; else nPC = next sequential address
If predicted taken, nPC = target address in BTB; else nPC = next sequential address

When the branch is actually resolved, the BTB is updated
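The lookup and update steps can be sketched as a small class. Several simplifications are assumptions of this sketch rather than part of the notes: a Python dict stands in for the fully associative array (with unlimited capacity), the history field is reduced to the last outcome, and instructions are taken to be a fixed 4 bytes.

```python
class BranchTargetBuffer:
    def __init__(self, instr_size: int = 4):
        self.entries = {}             # BIA -> (BTA, predict_taken)
        self.instr_size = instr_size  # assumed fixed-length instructions

    def next_pc(self, pc: int) -> int:
        """Guess nPC during instruction fetch."""
        entry = self.entries.get(pc)
        if entry is None:                 # miss: treat as a non-branch
            return pc + self.instr_size
        target, predict_taken = entry
        return target if predict_taken else pc + self.instr_size

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Called when the branch resolves: remember the most recent
        target and the last outcome (a 1-bit history, assumed here)."""
        self.entries[pc] = (target, taken)

btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))   # 0x104: first encounter, BTB miss
btb.update(0x100, taken=True, target=0x40)
print(hex(btb.next_pc(0x100)))   # 0x40: hit, predicted taken
btb.update(0x100, taken=False, target=0x40)
print(hex(btb.next_pc(0x100)))   # 0x104: hit, predicted not taken
```

A real BTB has a fixed number of entries and a replacement policy, so a miss can also mean that a known branch was evicted.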

[Figure: Branch Target Buffer (BTB). The current PC is compared against the branch instruction address (tag) field; each entry also holds the branch target (most recent) and the branch history.]

Branch Condition Speculation

Biased For Not Taken

Does not affect the instruction set architecture
Not effective in loops

Software Prediction

Encode an extra bit in the branch instruction
- Predict not taken: set bit to 0
- Predict taken: set bit to 1
Bit set by compiler or user; can use profiling
Static prediction, same behavior every time

Prediction Based on Branch Offsets

Positive offset: predict not taken
Negative offset: predict taken

Prediction Based on History
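The offset rule and the not-taken bias can be compared on a loop-closing branch. This is a minimal sketch: the function names, the offsets and the 100-iteration trace are assumptions chosen for illustration.

```python
def predict_by_offset(offset: int) -> bool:
    """Static rule from the notes: negative (backward) offset -> predict taken,
    positive (forward) offset -> predict not taken. True means taken."""
    return offset < 0

def accuracy(prediction: bool, outcomes: list) -> float:
    """Fraction of outcomes that match a fixed static prediction."""
    return sum(o == prediction for o in outcomes) / len(outcomes)

# A backward branch closing a 100-iteration loop: taken 99 times, then falls through.
loop_branch = [True] * 99 + [False]

print(accuracy(predict_by_offset(-16), loop_branch))  # 0.99
print(accuracy(False, loop_branch))                   # 0.01 with a not-taken bias
```

The second line shows why a fixed not-taken bias is not effective in loops, as noted above.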

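Branch Instruction Speculation

[Figure: the branch predictor (using a BTB) is accessed with the PC during fetch and supplies a speculative target and a speculative condition. A mux selects between the sequential next PC and the predicted target, nPC = BP(PC), and nPC is sent to the I-cache. The branch unit in the execute stage sends a BTB update (target address and history). Pipeline shown: Fetch, decode buffer, Decode, dispatch buffer, Dispatch, reservation stations, Issue, Execute, Finish, completion buffer.]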

Branch Prediction Function

Based on opcode only (%)

IBM1 IBM2 IBM3 IBM4 DEC CDC

Based on history of branch

Branch prediction function F (X1, X2, .... )
Use up to 5 previous branches for history (%)

History   IBM1   IBM2   IBM3   IBM4   DEC    CDC
0         64.1   64.4   70.4   54.0   73.8   77.
1         91.9   95.2   86.6   79.7   96.5   82.
2         93.3   96.5   90.8   83.4   97.5   90.
3         93.7   96.7   91.2   83.5   97.7   93.
4         94.5   97.0   92.0   83.7   98.1   95.
5         94.7   97.1   92.2   83.9   98.2   95.

Prediction accuracy approaches maximum with as few as 2 preceding branch occurrences used as history

Results (%) with 2 branches of history

IBM1   IBM2   IBM3   IBM4   DEC    CDC
93.3   96.5   90.8   83.4   97.5   90.

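Example Prediction Algorithm

[Figure: state diagram of a predictor whose states are the outcomes of the last two branches (TT, NT, TN, NN; T = taken, N = not taken), with the next prediction for each state.]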

How does prediction algo work?

while (i > 0) do            /* Branch 1 */
    if (x > y) then         /* Branch 2 */
        {then part}         /* no changes to x, y in this code */
    else
        {else part}
    i = i - 1;

Two branches in this code: B1, B2

How many times is each executed?


Assume history bits = TN for B1, TT for B2

How does prediction algo work?

i = 100; x = 30; y = 50;
while (i > 0) do            /* Branch 1 */
    if (x > y) then         /* Branch 2 */
        {then part}         /* no changes to x, y in this code */
    else
        {else part}
    i = i - 1;

Using the same 2-bit predictor for all branches:

Prediction for B1: ?

Prediction for B2: ?

[Figure: state diagrams of alternative 2-bit predictors. States are labelled with T/N histories and predictions t/n; transitions marked "?" are left open.]

Other Prediction Algorithms

Combining prediction accuracy with BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide a net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.

Saturation Counter

Hysteresis Counter
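A 2-bit saturating counter can be simulated on the loop example from the earlier slide (i = 100, x = 30, y = 50). This sketch makes several assumptions: it models a plain saturating counter, not the last-two-outcomes state machine drawn in the figures; each branch has its own counter starting in the weakly-taken state; and "taken" for B1 means the loop body runs, while "taken" for B2 means x > y.

```python
class SaturatingCounter:
    """2-bit counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""
    def __init__(self, state: int = 2):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Count up on taken, down on not taken, saturating at 0 and 3.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

def mispredictions(outcomes, counter) -> int:
    wrong = 0
    for taken in outcomes:
        if counter.predict() != taken:
            wrong += 1
        counter.update(taken)
    return wrong

b1 = [True] * 100 + [False]   # loop test succeeds 100 times, then fails
b2 = [False] * 100            # x > y is never true (30 > 50)

print(mispredictions(b1, SaturatingCounter()))  # 1: only the loop exit
print(mispredictions(b2, SaturatingCounter()))  # 1: only the first encounter
```

With two bits, a single anomalous outcome such as a loop exit moves the counter only one step, so the prediction for the next execution of the loop does not flip.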

IBM RS/6000 Study [Nair, 1992]

Five different branch types

b: unconditional branch
bl: branch and link (subroutine calls)
bc: conditional branch
bcr: conditional branch using link register (subroutine returns)
bcc: conditional branch using count register (system calls)

Separate branch functional unit to overlap execution of branch instructions with other instructions

Two causes for branch stalls

Unresolved conditions
Branches downstream too close to unresolved branches

Number of Counter Bits Needed

Branch history table size: direct-mapped array of 2^k entries
Programs like gcc can have over 7000 conditional branches
In collisions, multiple branches share the same predictor
Variation of branch penalty with branch history table size levels out at 1024 entries

Benchmark prediction accuracy in % (overall CPI overhead)

Benchmark   3-bit          2-bit          1-bit          0-bit
li          88.3 (0.042)   86.8 (0.048)   82.5 (0.063)   62.4 (0.142)
eqntott     89.3 (0.028)   87.2 (0.033)   82.9 (0.046)   78.4 (0.049)
espresso    89.5 (0.045)   89.1 (0.047)   87.2 (0.054)   58.5 (0.176)
gcc         89.7 (0.025)   89.1 (0.026)   86.0 (0.033)   50.0 (0.128)
doduc       94.2 (0.003)   94.3 (0.003)   90.2 (0.004)   69.2 (0.022)
spice2g6    97.0 (0.009)   97.0 (0.009)   96.2 (0.013)   76.6 (0.031)
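The effect of collisions in a direct-mapped table can be shown with a small simulation. Everything specific here is an assumption for illustration: the class name, indexing by low-order PC bits after dropping a 2-bit byte offset, 2-bit counters initialised to weakly taken, and the two branch addresses in the trace.

```python
class BranchHistoryTable:
    """Direct-mapped array of 2**k two-bit saturating counters."""
    def __init__(self, k: int, offset_bits: int = 2):
        self.mask = (1 << k) - 1
        self.offset_bits = offset_bits     # assumes 4-byte instructions
        self.counters = [2] * (1 << k)     # start weakly taken

    def index(self, pc: int) -> int:
        return (pc >> self.offset_bits) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self.index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)

def run(table, trace) -> int:
    """Return the number of mispredictions over a (pc, taken) trace."""
    wrong = 0
    for pc, taken in trace:
        if table.predict(pc) != taken:
            wrong += 1
        table.update(pc, taken)
    return wrong

# Two hypothetical branches, one always taken and one never taken, interleaved.
trace = [(0x1000, True), (0x1040, False)] * 50

print(run(BranchHistoryTable(k=4), trace))    # 50: both PCs map to entry 0
print(run(BranchHistoryTable(k=10), trace))   # 1: separate entries
```

In the 16-entry table the two branches keep pushing the shared counter in opposite directions, which is the collision effect described above; the larger table removes it.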

Accuracy of Different Schemes

[Figure: bar chart of the frequency of mispredictions (axis 0% to 20%) on nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott and li for three predictors: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries (2,2).]

Mis-speculation Recovery

Eliminate Incorrect Path

Must ensure that the mis-speculated instructions produce no side effects

Start New Correct Path

Must have remembered the alternate (non-predicted) path

[Figure: tree of speculated branch paths (NT/T at each level), with speculative instructions labelled by speculation tags such as tag3.]

Mis-speculation Recovery

Eliminate Incorrect Path

Use branch tag(s) to deallocate completion buffer entries occupied by speculative instructions (now determined to be mis-speculated).
Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations

How expensive is a misprediction?

Start New Correct Path

Update PC with computed branch target (if it was predicted NT)
Update PC with sequential instruction address (if it was predicted T)
Can begin speculation once again when a new branch is encountered

How soon can you restart?

Trailing Confirmation

When branch is resolved, remove/deallocate speculation tag
Permit completion of branch and following instructions
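Tag-based recovery and trailing confirmation can be sketched on a list standing in for the completion buffer. The `Entry` record, the `resolve` function and the tag names are hypothetical; a real machine would also invalidate the decode and dispatch buffers and the reservation stations, as described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    name: str
    tags: frozenset   # tags of unresolved branches this instruction follows

def resolve(buffer, tag, predicted_correctly: bool):
    """Return the buffer contents after the branch owning `tag` resolves."""
    if predicted_correctly:
        # Trailing confirmation: drop the tag, keep the instructions.
        return [Entry(e.name, e.tags - {tag}) for e in buffer]
    # Mis-speculation: deallocate every entry that carries the tag.
    return [e for e in buffer if tag not in e.tags]

buf = [
    Entry("i1", frozenset()),                   # not speculative
    Entry("i2", frozenset({"tag1"})),           # after branch 1
    Entry("i3", frozenset({"tag1", "tag2"})),   # after branches 1 and 2
]

print([e.name for e in resolve(buf, "tag2", False)])  # ['i1', 'i2']
print([e.name for e in resolve(buf, "tag1", False)])  # ['i1']
print([sorted(e.tags) for e in resolve(buf, "tag1", True)])
# [[], [], ['tag2']]
```

An instruction may complete only once its tag set is empty, which corresponds to permitting completion of the branch and the instructions that follow it.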


Impediments to Parallel/Wide Fetching

Average Basic Block Size

integer code: 4-6 instructions
floating-point code: 6-10 instructions

Branch Prediction Mechanisms

must make multiple branch predictions per cycle
potentially multiple predicted taken branches

Conventional I-Cache Organization (discussed later)

must fetch from multiple predicted taken targets per cycle
must align and collapse multiple fetch groups per cycle

One answer: trace caching

Recap..

CPU time = IC * CPI * Clk

CPI = ideal CPI + stall cycles/instruction
Stall cycles due to (1) control hazards and (2) data hazards

What did branch prediction do?

Tries to reduce number of stall cycles from control hazards

What about stall cycles from data hazards?

Next..


Next: Register Dataflow and Dynamic Scheduling

Branch prediction provides a solution to the control flow problem and increases instruction flow bandwidth

Stalls due to control flow changes can decrease performance

The next step is flow in the execute stage: register data flow

Parallel execution of instructions
Keep dependencies in mind
- Remove false dependencies, honor true dependencies
- An "infinite" register set can remove false dependencies
Go back and look at the nature of true dependencies using the data flow diagram of a computation
