superscalar, Study notes of Advanced Computer Architecture

good material for ACA

Typology: Study notes

2013/2014

Uploaded on 06/03/2014

nagesh
Superscalar Processors: Branch Prediction and Dynamic Scheduling

Superscalar: A Sequential Architecture

‹ A superscalar processor is a representative ILP implementation of a sequential architecture

  • For every instruction it issues, the hardware must check whether the operands interfere with the operands of any other instruction that is (1) already in execution, (2) already issued but waiting for the completion of interfering instructions that would have executed earlier in the sequential program, or (3) being issued concurrently but would have executed earlier in the sequential execution of the program
  • A superscalar processor issues multiple instructions per cycle
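The operand interference check described above can be sketched as a comparison of register names between an earlier and a later instruction. This is a minimal Python illustration rather than a hardware description; the `Instr` record, the function name `hazards`, and the sample registers are assumptions made for this example, and RAW/WAR/WAW are the usual textbook names for the three kinds of interference.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: source registers and one destination."""
    name: str
    srcs: Tuple[str, ...]
    dest: str


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the register hazards that `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")  # true dependence: later reads what earlier writes
    if later.dest in earlier.srcs:
        found.add("WAR")  # anti-dependence: later overwrites an input of earlier
    if later.dest == earlier.dest:
        found.add("WAW")  # output dependence: both write the same register
    return found


# Illustrative registers only (not taken from the slides).
a = Instr("a", ("r1", "r2"), "r3")
b = Instr("b", ("r3", "r4"), "r1")
c = Instr("c", ("r5", "r6"), "r3")

print(hazards(a, b))  # contains RAW (on r3) and WAR (on r1)
print(hazards(a, c))  # contains WAW (on r3)
```

A superscalar issue stage has to perform this comparison against every instruction in flight and every older instruction in the same issue group, which is why the number of comparators grows quickly with issue width.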
Superscalar Terminology

‹ Basic

  • Superscalar: able to issue more than one instruction per cycle
  • Superpipelined: a deep, but not superscalar, pipeline (e.g., MIPS R5000 has 8 stages)
  • Branch prediction: logic to guess whether or not a branch will be taken, and possibly the branch target

‹ Advanced

  • Out-of-order: able to issue instructions out of program order
  • Speculation: execute instructions beyond branch points, possibly nullifying them later
  • Register renaming: able to dynamically assign physical registers to instructions
  • Retire unit: logic to keep track of instructions as they complete

Superscalar Execution Example

Single Adder, Data Dependence – In Order

‹ Assumptions

  • Single FP adder takes 2 cycles
  • Single FP multiplier takes 5 cycles
  • Can issue an add and a multiply together
  • Must issue in order
  • Operand order is in, in, out

v: addt $f2, $f4, $f
w: mult $f10, $f6, $f
x: addt $f10, $f8, $f
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f

[Figure: issue timeline for v, w, x, y (single adder, data dependence, in order)]

Data Flow

[Figure: data flow graph of the adds (v, x, y, z) and the multiply (w)]

Critical Path = 9 cycles

Adding Advanced Features

‹ Out Of Order Issue

  • Can start y as soon as adder available
  • Must hold back z until $f10 not busy & adder available

v: addt $f2, $f4, $f
w: mult $f10, $f6, $f
x: addt $f10, $f8, $f
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f

[Figure: issue timeline for v, w, x, y, z with out-of-order issue]

‹ With Register Renaming

v: addt $f2, $f4, $f10a
w: mult $f10a, $f6, $f10a
x: addt $f10a, $f8, $f
y: addt $f4, $f6, $f
z: addt $f4, $f8, $f

[Figure: issue timeline for v, w, x, y, z with register renaming]
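How renaming removes false dependences can be sketched as follows. This is a minimal Python illustration, not the mechanism of any particular processor: every destination write gets a fresh physical name, and later reads use the most recent name. The three-instruction program is hypothetical (the destination registers in the listing above are truncated in the source), as are the function name `rename` and the `p0, p1, ...` naming scheme.

```python
def rename(program):
    """Rename registers in a list of (op, src1, src2, dest) tuples.

    Each write allocates a fresh physical register, so two writes to the
    same architectural register no longer conflict (WAW), and a later
    write cannot clobber a value an earlier instruction still reads (WAR).
    """
    mapping = {}      # architectural register -> current physical register
    next_free = 0
    renamed = []
    for op, src1, src2, dest in program:
        # Sources read the latest mapping; unmapped registers keep their name.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        phys = f"p{next_free}"
        next_free += 1
        mapping[dest] = phys
        renamed.append((op, s1, s2, phys))
    return renamed


# Hypothetical program in "in, in, out" operand order.
program = [
    ("add", "r2", "r4", "r10"),
    ("mul", "r10", "r6", "r10"),  # reads and rewrites r10
    ("add", "r4", "r6", "r10"),   # independent of the first two, but reuses r10
]
for instr in rename(program):
    print(instr)
```

After renaming, the third instruction writes its own physical register and no longer has to wait for the first two, while the true dependence of the multiply on the first add is preserved through `p0`.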

Flow Path Model of Superscalars

[Figure: flow path model of a superscalar. Components: I-cache, branch predictor, FETCH, instruction buffer, DECODE, EXECUTE (integer, floating-point, media, and memory units), reorder buffer (ROB), store queue, COMMIT, D-cache. The three flow paths are instruction flow, register data flow, and memory data flow]

[Figure: Contrasting decoding and instruction issue in a scalar and a 4-way superscalar processor. Scalar issue (typical FX-pipeline layout F, D/I, ...): I-cache, instruction buffer, decode/issue. Superscalar issue (F, D, I, ...): I-cache, instruction buffer, decode/issue]

Superscalar Processors: Tasks

‹ Parallel decoding

‹ Superscalar instruction issue

‹ Parallel instruction execution

  • Preserving sequential consistency of exception processing
  • Preserving sequential consistency of execution

Superscalar Issues to be considered

‹ Parallel decoding – a more complex task than in scalar processors.

  • A high issue rate can lengthen the decoding cycle, so predecoding is used
  • Predecoding: partial decoding performed while instructions are loaded into the instruction cache

‹ Superscalar instruction issue – a higher issue rate gives rise to higher processor performance, but it also amplifies the restrictive effects of control and data dependencies on processor performance.

  • To overcome these problems, designers use advanced techniques such as shelving, register renaming, and speculative branch processing

Superscalar issues

‹ Parallel instruction execution task – also called “preservation of the sequential consistency of instruction execution”. Because instructions are executed in parallel, they usually complete out of order with respect to sequential program order

‹ Preservation of sequential consistency of exception processing task

Pre-Decoding

‹ More EUs than in scalar processors, and therefore a higher number of instructions in execution

  • More dependency check comparisons needed

‹ Predecoding – as the I-cache is being loaded, a predecode unit performs a partial decoding and appends a number of decode bits to each instruction. These bits usually indicate:

  • the instruction class
  • the type of resources required for execution
  • the fact that branch target addresses have been calculated
  • Predecoding is used in the PowerPC 601, MIPS R8000, and SuperSparc

[Figure: The Principle of Predecoding. Second-level cache (or memory) → predecode unit → I-cache]

  • Instructions typically arrive from the second-level cache (or memory) at 128 bits/cycle
  • When instructions are written into the I-cache, the predecode unit appends 4-7 bits to each RISC instruction, giving e.g. 148 bits/cycle
  • In the AMD K5, which is an x86-compatible CISC processor, the predecode unit appends 5 bits to each byte
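One reading of the 128 and 148 bits/cycle figures is four 32-bit RISC instructions per cycle, each extended by 5 predecode bits. The short Python sketch below checks that arithmetic; the 32-bit instruction width, the choice of 5 bits from the quoted 4-7 bit range, and the function name are assumptions made for illustration.

```python
def predecoded_bits_per_cycle(fetch_bits, unit_bits, predecode_bits):
    """Bits written into the I-cache per cycle after predecode bits are
    appended to every unit (an instruction, or a byte for x86)."""
    units = fetch_bits // unit_bits
    return units * (unit_bits + predecode_bits)


# Assumed: 32-bit RISC instructions, 128 bits fetched per cycle.
for extra in range(4, 8):
    print(extra, predecoded_bits_per_cycle(128, 32, extra))

# Byte-granular predecoding in the style described for the AMD K5:
# 5 bits per 8-bit byte over the same 128-bit fetch.
print(predecoded_bits_per_cycle(128, 8, 5))
```

Under these assumptions, 5 bits per instruction reproduces the 148 bits/cycle example, while appending 5 bits to every byte costs considerably more storage per fetched line.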

Superscalar Instruction Issues

‹ The issue policy specifies how false data dependencies and unresolved control dependencies are coped with during instruction issue

  • The design options are either to avoid them during instruction issue, by using register renaming and speculative branch processing respectively, or not

‹ False data dependencies between register data may be removed by register renaming

‹ Speculative Execution

  • Expensive in hardware
  • The alternative is to perform speculative code motion at compile time
    • Move operations from subsequent blocks up past branch operations into preceding blocks
  • This requires less demanding hardware:
    • A mechanism to ensure that exceptions caused by speculatively scheduled operations are reported if and only if the flow of control is such that they would have been executed in the non-speculative version of the code
    • Additional registers to hold the speculative execution state

Hardware Features to Support ILP

Next: Superscalar Processor Design

‹ How to deal with instruction flow

  • Dynamic Branch prediction

‹ How to deal with register/data flow

  • Register renaming

‹ Solutions studied:

  • Dynamic branch prediction algorithms
  • Dynamic scheduling using Tomasulo method

Summary of discussions

‹ ILP processors

  • VLIW/EPIC, Superscalar

‹ Superscalar has hardware logic for extracting parallelism

  • Solutions for stalls etc. must be provided in hardware

‹ Stalls play an even greater role in ILP processors

‹ Software solutions, such as code scheduling through code movement, can lead to improved execution times

  • More sophisticated techniques needed
  • Can we provide some H/W support to help the compiler? This leads to EPIC/VLIW

Superscalar Pipeline Design

[Figure: superscalar pipeline. Fetch → Instruction Buffer → Decode → Dispatch Buffer → Dispatch → Issuing Buffer → Execute → Completion Buffer → Complete → Store Buffer → Retire. Labels: Instruction Flow, Data Flow]


Instruction Fetch Bandwidth Solutions

‹ The ability to fetch several instructions per cycle from the cache is crucial to superscalar performance

  • Use an instruction fetch buffer to prefetch instructions
  • Fetch multiple instructions in one cycle to support the s-wide issue of superscalar processors

‹ Design the instruction cache (I-cache) to support this

  • Solutions are discussed when memory design is covered

Instruction Decoding Issues

‹ Primary tasks:

  • Identify individual instructions
  • Determine instruction types
  • Detect inter-instruction dependences

‹ Predecoding

  • Identify inst classes
  • Add more bits to instruction after fetching

‹ Two important factors:

  • Instruction set architecture
  • Width of parallel pipeline


Why Branches: CFG and Branches

‹ Basic blocks and their constituent instructions must be stored in sequential locations in memory

  • In mapping a CFG to linear, consecutive memory locations, additional unconditional branches must be added

‹ Encountering branches (conditional and unconditional) at run time induces deviations from the implied sequential control flow and consequent disruptions to the sequential fetching of instructions

  • These disruptions cause stalls in the instruction fetch (IF) stage and reduce overall IF bandwidth

Mapping CFG to Linear Instruction Sequence

[Figure: a CFG with basic blocks A, B, C, D and its layout as linear instruction sequences; the conditional branches and the added unconditional branch are marked]

Branch Types and Implementation

‹ Types of Branches

  • Conditional or Unconditional?
  • Subroutine Call (aka Link), needs to save PC?
  • How is the branch target computed?
    • Static Target e.g. immediate, PC-relative
    • Dynamic targets e.g. register indirect

What’s So Bad About Branches?

‹ Performance Penalties

  • Use up execution resources
  • Fragmentation of I-cache lines
  • Disruption of sequential control flow
    • Need to determine branch direction (conditional branches)
    • Need to determine branch target

‹ Branches rob instruction fetch bandwidth and ILP

Branch Actions

‹ When branches occur, instruction fetch (IF) is disrupted

‹ For unconditional branches

  • Subsequent instructions cannot be fetched until the target address is determined

‹ For conditional branches

  • The machine must wait for resolution of the branch condition
  • If the branch is taken, it must also wait until the target address is computed

‹ Branch instructions are executed by the branch functional unit

‹ Note: cost in superscalar/ILP processors = width (parallelism) × stall cycles

  • 3 stall cycles on a 4-wide machine = 12 lost issue slots
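The cost rule above is a simple product, shown here as a small Python sketch. The function name `lost_issue_slots` is an assumption made for illustration.

```python
def lost_issue_slots(issue_width, stall_cycles):
    """Issue slots wasted when a machine of the given width stalls."""
    return issue_width * stall_cycles


print(lost_issue_slots(4, 3))  # the 4-wide example above: 12
print(lost_issue_slots(1, 3))  # a scalar pipeline with the same stall: 3
```

The same 3-cycle stall therefore costs four times as much potential work on the 4-wide machine as on a scalar pipeline, which is why branch handling matters more as issue width grows.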

CPU Performance..

‹ Recall: CPU time = IC * CPI * Clk

  • CPI = ideal CPI + stall cycles/inst
  • Minimizing CPI implies minimize stall cycles
  • Stall cycles from branch instructions
    • How to determine the number of stall cycles
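The CPI relation above can be turned into a small calculator. The Python sketch below is illustrative only: the model for branch stall cycles per instruction (branch fraction × misprediction rate × penalty) is a common textbook approximation assumed here rather than something stated in these notes, and all numeric inputs are hypothetical.

```python
def branch_stall_cycles_per_instr(branch_fraction, mispredict_rate, penalty_cycles):
    """Assumed model: average stall cycles per instruction caused by branches."""
    return branch_fraction * mispredict_rate * penalty_cycles


def cpu_time(instruction_count, ideal_cpi, stall_cycles_per_instr, clock_period_s):
    """CPU time = IC * CPI * Clk, with CPI = ideal CPI + stall cycles/inst."""
    cpi = ideal_cpi + stall_cycles_per_instr
    return instruction_count * cpi * clock_period_s


# Hypothetical workload: 20% branches, 10% of them mispredicted, 3-cycle penalty.
stalls = branch_stall_cycles_per_instr(0.20, 0.10, 3)
print(round(stalls, 3))

# Hypothetical machine: 1e9 instructions, ideal CPI of 1.0, 1 ns clock period.
print(round(cpu_time(1_000_000_000, 1.0, stalls, 1e-9), 3))
```

Lowering either the misprediction rate or the penalty reduces the stall term directly, which is the motivation for the branch prediction techniques that follow.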

Branch penalties/stall cycles

‹ When a branch occurs, two things are needed:

  • The branch target address (BTA) has to be computed
  • The branch condition has to be resolved

‹ Addressing modes affect the BTA delay

  • For PC-relative, the BTA can be generated during the fetch stage, for a 1-cycle penalty
  • For register indirect, the BTA is generated after the decode stage (to access the register), for a 2-cycle penalty
  • For register indirect with offset, a 3-cycle penalty

‹ The branch condition resolution delay depends on the method

  • If condition code registers are used, then penalty =
  • If the ISA permits comparison of 2 registers, then the condition is the output of the ALU, for a 3-cycle penalty

‹ The penalty is the maximum of the penalties for condition resolution and BTA
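The rule that the overall penalty is the maximum of the two delays can be sketched in Python as below. The BTA penalties are the ones quoted above; the dictionary keys, the function name, and the convention that an unconditional branch has a condition penalty of 0 are assumptions made for illustration. The condition-code penalty is left as a parameter because its value is not given above.

```python
# BTA generation penalties (cycles) by addressing mode, as quoted above.
BTA_PENALTY = {
    "pc_relative": 1,
    "register_indirect": 2,
    "register_indirect_offset": 3,
}

# Penalty when the condition comes from an ALU comparison of two registers.
REGISTER_COMPARE_PENALTY = 3


def branch_penalty(addressing_mode, condition_penalty):
    """Overall stall: the slower of target generation and condition resolution."""
    return max(BTA_PENALTY[addressing_mode], condition_penalty)


# Conditional PC-relative branch resolved by a register comparison.
print(branch_penalty("pc_relative", REGISTER_COMPARE_PENALTY))

# Unconditional register-indirect branch (assumed condition penalty of 0).
print(branch_penalty("register_indirect", 0))
```

In the first case the condition, not the target, is the bottleneck, so speeding up target generation alone would not reduce the stall.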

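Condition Resolution

[Figure: condition resolution in the pipeline. Fetch → Decode Buffer → Decode → Dispatch Buffer → Dispatch → Reservation Stations → Issue → Execute (Branch) → Finish → Completion Buffer → Complete → Store Buffer → Retire. The branch condition is resolved from the CC reg. (Stall=) or by GP reg. value comp. (Stall=)]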

Determining Branch Target

Problem: Cannot fetch subsequent instructions until the branch target is determined

‹ Minimize delay

  • Generate branch target early in the pipeline

‹ Make use of delay

  • Bias for not taken
  • Predict branch target

PC-relative vs Register Indirect targets

Keys to Branch Prediction

‹ Target Address Generation

  • Access register
    • PC, GP register, Link register
  • Perform calculation
    • +/- offset, auto incrementing/decrementing

⇒ Target Speculation

‹ Condition Resolution

  • Access register
    • Condition code register, data register, count register
  • Perform calculation
    • Comparison of data register(s)

⇒ Condition Speculation

History based Branch Target Speculation – Branch Target Buffer

‹ If you have seen this branch instruction before, can you figure out the target address faster?

  • Create history table

‹ How to organize the “history table”?

‹ Use a branch target buffer (BTB) to store the previous branch target address

‹ BTB is a small fully associative cache

  • Accessed during instruction fetch using PC

‹ BTB can have three fields

  • Branch instruction address ( BIA )
  • Branch target address (BTA)
  • History bits

‹ When the PC matches the BIA of an entry, there is a hit in the BTB

  • A hit in the BTB implies that the instruction being fetched is a branch instruction
  • The BTA field can be used to fetch the next instruction if that branch is predicted to be taken
  • Note: the branch instruction is still fetched and executed for validation/recovery

‹ A small “cache-like” memory in the instruction fetch stage

‹ Remembers previously executed branches, their addresses, information to aid prediction, and most recent target addresses

‹ The instruction fetch stage compares the current PC against those in the BTB to “guess” nPC

  • If matched, then a prediction is made, else nPC=PC+
  • If predicted taken, then nPC=target address in BTB, else nPC=PC+

‹ When the branch is actually resolved, the BTB is updated
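The BTB lookup and update described above can be sketched as a toy Python model. It is a simplification under stated assumptions: the table is an unbounded dictionary rather than a small associative cache, the history is a single bit (the last outcome), and the sequential next PC is taken to be PC + 4 for fixed 4-byte instructions. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy fully associative BTB keyed by branch instruction address (BIA)."""

    def __init__(self, instr_bytes=4):
        self.entries = {}               # BIA -> (BTA, predict_taken)
        self.instr_bytes = instr_bytes  # assumed fixed instruction size

    def next_pc(self, pc):
        """Guess nPC during instruction fetch."""
        entry = self.entries.get(pc)
        if entry is None:               # miss: not known to be a branch
            return pc + self.instr_bytes
        target, predict_taken = entry
        return target if predict_taken else pc + self.instr_bytes

    def update(self, pc, taken, target):
        """Record a resolved branch: most recent target and last outcome."""
        self.entries[pc] = (target, taken)


btb = BranchTargetBuffer()
print(hex(btb.next_pc(0x100)))   # miss, falls through sequentially

btb.update(0x100, taken=True, target=0x200)
print(hex(btb.next_pc(0x100)))   # hit, predicted taken, uses stored target

btb.update(0x100, taken=False, target=0x200)
print(hex(btb.next_pc(0x100)))   # hit, predicted not taken, sequential
```

A real BTB has limited capacity and a replacement policy, and usually keeps more than one history bit per entry; the branch is still executed so that a wrong guess can be detected and repaired.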

[Figure: Branch Target Buffer (BTB). The current PC is compared with the Branch Inst. Address (tag) field; each entry also holds Branch History and Branch Target (Most Recent)]

Branch Condition Speculation

‹ Biased For Not Taken

  • Does not affect the instruction set architecture
  • Not effective in loops

‹ Software Prediction

  • Encode an extra bit in the branch instruction
    • Predict not taken: set bit to 0
    • Predict taken: set bit to 1
  • Bit set by compiler or user; can use profiling
  • Static prediction, same behavior every time

‹ Prediction Based on Branch Offsets

  • Positive offset: predict not taken
  • Negative offset: predict taken

‹ Prediction Based on History
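The static schemes listed above (everything except history-based prediction) each reduce to a one-line rule. The Python sketch below states them side by side; the function names and the sample offsets are assumptions made for illustration, and `True` stands for "predict taken".

```python
def predict_not_taken():
    """Biased for not taken: always fetch the fall-through path."""
    return False


def predict_from_hint(hint_bit):
    """Software prediction: a bit in the branch set by the compiler or user."""
    return hint_bit == 1


def predict_from_offset(offset):
    """Offset-based: backward branches (negative offset) are predicted taken."""
    return offset < 0


# A loop-closing branch jumps backward (hypothetical offset of -16 bytes).
print(predict_from_offset(-16))
# A forward branch, e.g. skipping over an else part (hypothetical +24 bytes).
print(predict_from_offset(24))
```

The offset rule captures why "biased for not taken" does poorly in loops: the loop-closing branch is backward and taken on every iteration but the last.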

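Branch Instruction Speculation

[Figure: branch speculation in the pipeline (Fetch → Decode Buffer → Decode → Dispatch Buffer → Dispatch → Reservation Stations → Issue → Execute (Branch) → Finish → Completion Buffer). The branch predictor (using a BTB) takes the PC and supplies the speculated target and speculated condition; an FA-mux selects between nPC(seq.) = PC+ and nPC = BP(PC) and sends nPC to the I-cache; the resolved branch feeds the BTB update (target addr. and history)]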

Branch Prediction Function

‹ Based on opcode only (%)

IBM1 IBM2 IBM3 IBM4 DEC CDC

‹ Based on history of branch

  • Branch prediction function F (X1, X2, ...)
  • Use up to 5 previous branches for history (%)

History  IBM1  IBM2  IBM3  IBM4  DEC   CDC
0        64.1  64.4  70.4  54.0  73.8  77.
1        91.9  95.2  86.6  79.7  96.5  82.
2        93.3  96.5  90.8  83.4  97.5  90.
3        93.7  96.7  91.2  83.5  97.7  93.
4        94.5  97.0  92.0  83.7  98.1  95.
5        94.7  97.1  92.2  83.9  98.2  95.

‹ Prediction accuracy approaches its maximum with as few as 2 preceding branch occurrences used as history

Results (%)

IBM1  IBM2  IBM3  IBM4  DEC   CDC
93.3  96.5  90.8  83.4  97.5  90.

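Example Prediction Algorithm

[Figure: state diagram of the example prediction algorithm. Each state is labeled with the last two branch outcomes (TT, NT, TN, NN) and the next prediction (T or N); transitions are labeled with the actual outcome (T or N)]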

How does prediction algo work?

While (i > 0) do             /* Branch 1 */
    If (x > y) then          /* Branch 2 */
        {then part}          /* no changes to x,y in this code */
    else {else part}
    i = i - 1;

Two branches in this code: B1, B2

How many times is each executed?

‹ Assume history bits = TN for B1, TT for B2

How does prediction algo work?

i = 100; x = 30; y = 50;
While (i > 0) do             /* Branch 1 */
    If (x > y) then          /* Branch 2 */
        {then part}          /* no changes to x,y in this code */
    else {else part}
    i = i - 1;

Using the same 2-bit predictor for all branches:

  • Prediction for B1: ?
  • Prediction for B2: ?

[Figure: the prediction state diagram traced for B1 and B2, with predicted (t, n) and actual (T, N) outcomes marked]
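The loop example can be traced with a short Python simulation. This is a sketch under stated assumptions, not the slide's exact state machine: the predictor is assumed to predict N only when the last two outcomes were both N, each branch keeps its own two history bits (TN for B1 and TT for B2, as given above), and "T" is taken to mean that the branch condition evaluated true. How conditions map to taken or not-taken machine branches depends on the compiler.

```python
def predict(history):
    """Predict 'N' only when the last two outcomes were both 'N' (assumed rule)."""
    return "N" if history == "NN" else "T"


def count_mispredictions(history, outcomes):
    """Run one branch through the predictor; return (misses, final history)."""
    misses = 0
    for actual in outcomes:
        if predict(history) != actual:
            misses += 1
        history = history[1] + actual   # shift in the newest outcome
    return misses, history


# i = 100, x = 30, y = 50, and the loop body never changes x or y.
b1_outcomes = ["T"] * 100 + ["N"]   # i > 0 holds 100 times, then fails once
b2_outcomes = ["N"] * 100           # x > y (30 > 50) never holds

b1_misses, _ = count_mispredictions("TN", b1_outcomes)
b2_misses, _ = count_mispredictions("TT", b2_outcomes)
print("B1:", b1_misses, "mispredictions in", len(b1_outcomes), "executions")
print("B2:", b2_misses, "mispredictions in", len(b2_outcomes), "executions")
```

Under these assumptions B1 is mispredicted only at loop exit, and B2 is mispredicted only while its initial TT history is being replaced by NN.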

Other Prediction Algorithms

‹ Combining prediction accuracy with the BTB hit rate (86.5% for 128 sets of 4 entries each), branch prediction can provide a net prediction accuracy of approximately 80%. This implies a 5-20% performance enhancement.

[Figures: Saturation Counter; Hysteresis Counter]
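Since the counter diagrams did not survive, here is a minimal Python sketch of a 2-bit saturating counter in its common textbook form. The state encoding (0 and 1 predict not taken, 2 and 3 predict taken), the initial state, and the loop workload are assumptions made for illustration and may differ from the slide's figure.

```python
class SaturatingCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 taken."""

    MAX_STATE = 3

    def __init__(self, state=2):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at the ends.
        if taken:
            self.state = min(self.state + 1, self.MAX_STATE)
        else:
            self.state = max(self.state - 1, 0)


# Hypothetical workload: a loop branch taken 9 times, then not taken once,
# with the whole loop executed 3 times.
outcomes = ([True] * 9 + [False]) * 3

counter = SaturatingCounter()
misses = 0
for taken in outcomes:
    if counter.predict() != taken:
        misses += 1
    counter.update(taken)
print(misses, "mispredictions in", len(outcomes), "branches")
```

The single not-taken outcome at loop exit only moves the counter from 3 to 2, so the prediction stays "taken" when the loop is re-entered. A predictor that remembers only the last outcome would miss both at loop exit and at re-entry.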

IBM RS/6000 Study [Nair, 1992]

‹ Five different branch types

  • b: unconditional branch
  • bl: branch and link (subroutine calls)
  • bc: conditional branch
  • bcr: conditional branch using link register (subroutine returns)
  • bcc: conditional branch using count register (system calls)

‹ Separate branch functional unit to overlap branch instructions with other instructions

‹ Two causes for branch stalls

  • Unresolved conditions
  • Branches downstream too close to unresolved branches

Number of Counter Bits Needed

‹ Branch history table size: direct-mapped array of 2^k entries

‹ Programs like gcc can have over 7000 conditional branches

‹ In collisions, multiple branches share the same predictor

‹ Variation of branch penalty with branch history table size levels out at 1024 entries

Prediction Accuracy (Overall CPI Overhead)

Benchmark  3-bit          2-bit          1-bit          0-bit
li         88.3 (0.042)   86.8 (0.048)   82.5 (0.063)   62.4 (0.142)
eqntott    89.3 (0.028)   87.2 (0.033)   82.9 (0.046)   78.4 (0.049)
espresso   89.5 (0.045)   89.1 (0.047)   87.2 (0.054)   58.5 (0.176)
gcc        89.7 (0.025)   89.1 (0.026)   86.0 (0.033)   50.0 (0.128)
doduc      94.2 (0.003)   94.3 (0.003)   90.2 (0.004)   69.2 (0.022)
spice2g6   97.0 (0.009)   97.0 (0.009)   96.2 (0.013)   76.6 (0.031)
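Collisions in a direct-mapped branch history table follow from how the table is indexed. The Python sketch below assumes the index is taken from the low-order bits of the instruction address with 4-byte instructions; the function name, the instruction size, and the sample addresses are hypothetical.

```python
def bht_index(pc, k, instr_bytes=4):
    """Index into a direct-mapped table of 2**k entries using low PC bits."""
    return (pc // instr_bytes) % (1 << k)


k = 10                           # 2**10 = 1024 entries
branch_a = 0x0040_0000           # hypothetical branch address
branch_b = branch_a + (1 << k) * 4   # exactly 2**k instructions later
branch_c = branch_a + 4          # the very next instruction

print(bht_index(branch_a, k), bht_index(branch_b, k), bht_index(branch_c, k))
```

Here the first two branches map to the same entry and would share one predictor, while the neighboring branch gets its own. With over 7000 conditional branches and only 1024 entries, some sharing of this kind is unavoidable.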

Accuracy of Different Schemes

[Figure: frequency of mispredictions (axis 0% to 20%) on nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li for three schemes: 4,096 entries with 2 bits per entry, unlimited entries with 2 bits per entry, and 1,024 entries (2,2)]

Mis-speculation Recovery

‹ Eliminate Incorrect Path

  • Must ensure that the mis-speculated instructions produce no side effects

‹ Start New Correct Path

  • Must have remembered the alternate (non-predicted) path

[Figure: tree of predicted branches (NT/T at each level); instructions on each speculated path are labeled with speculation tags]

Mis-speculation Recovery

‹ Eliminate Incorrect Path

  • Use branch tag(s) to deallocate completion buffer entries occupied by speculative instructions (now determined to be mis-speculated).
  • Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations

How expensive is a misprediction?

‹ Start New Correct Path

  • Update PC with computed branch target (if it was predicted NT)
  • Update PC with sequential instruction address (if it was predicted T)
  • Can begin speculation once again when encounter a new branch

How soon can you restart?

Trailing Confirmation

‹ Trailing Confirmation

  • When branch is resolved, remove/deallocate speculation tag
  • Permit completion of branch and following instructions
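The use of speculation tags for recovery and trailing confirmation can be sketched with a toy completion buffer in Python. This is an illustration under assumptions, not a description of a specific design: each entry carries the set of tags of all unresolved branches it depends on, completion order is ignored, and the class, method, and tag names are hypothetical.

```python
class CompletionBuffer:
    """Toy completion buffer whose entries carry speculation tags."""

    def __init__(self):
        self.entries = []   # list of (instruction name, set of tags)

    def dispatch(self, name, tags):
        self.entries.append((name, set(tags)))

    def confirm(self, tag):
        """Branch resolved as predicted: remove the tag, keep the instructions."""
        for _, tags in self.entries:
            tags.discard(tag)

    def squash(self, tag):
        """Branch mispredicted: deallocate every entry carrying the tag."""
        self.entries = [(n, t) for n, t in self.entries if tag not in t]

    def completable(self):
        """Instructions with no outstanding speculation tags."""
        return [n for n, t in self.entries if not t]


buf = CompletionBuffer()
buf.dispatch("i1", [])                  # before any predicted branch
buf.dispatch("i2", ["tag1"])            # after the first predicted branch
buf.dispatch("i3", ["tag1", "tag2"])    # after a second, nested prediction

buf.confirm("tag1")                     # first branch was predicted correctly
print(buf.completable())

buf.squash("tag2")                      # second branch was mispredicted
print([name for name, _ in buf.entries])
```

Confirming the first branch lets `i2` complete, while squashing on the second tag removes only `i3`, the instruction fetched down the wrong path.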

[Figure: tree of predicted branches (NT/T at each level) with speculation tags, showing tags being removed as branches are confirmed]

Impediments to Parallel/Wide Fetching

‹ Average Basic Block Size

  • integer code: 4-6 instructions
  • floating-point code: 6-10 instructions

‹ Branch Prediction Mechanisms

  • must make multiple branch predictions per cycle
  • potentially multiple predicted taken branches

‹ Conventional I-Cache Organization – discuss later

  • must fetch from multiple predicted taken targets per cycle
  • must align and collapse multiple fetch groups per cycle

…Trace Caching!!

Recap..

‹ CPU time = IC * CPI * Clk

  • CPI = ideal CPI + stall cycles/instruction
  • Stall cycles due to (1) control hazards and (2) data hazards

‹ What did branch prediction do?

  • Tries to reduce number of stall cycles from control hazards

‹ What about stall cycles from data hazards

  • Next..


Next: Register Dataflow and Dynamic Scheduling

‹ Branch prediction provides a solution for handling the control flow problem and increases instruction flow bandwidth

  • Stalls due to control flow changes can decrease performance

‹ The next step is flow in the execute stage: register data flow

  • Parallel execution of instructions
  • Keep dependencies in mind
    • Remove false dependencies, honor true dependencies
    • An “infinite” register set can remove false dependencies
  • Go back and look at the nature of true dependencies using the data flow diagram of a computation
