Memory Dependence Prediction and Load Value Prediction in Computer Architecture, Lecture notes of Computer Science

Decision is based on analysis or profile information – 90% of backward-going branches are taken – 50% of forward-going branches are not taken

Typology: Lecture notes

2018/2019

Uploaded on 10/29/2019

CS252
Graduate Computer Architecture
Lecture 14
Prediction (Con’t)
(Dependencies, Load Values, Data Values)
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
http://www-inst.eecs.berkeley.edu/~cs252
3/12/2007 cs252-S07, Lecture 14 2
Review: Yeh and Patt classification
[Figure: three two-level predictor organizations. GAg: GBHR indexing a GPHT. PAg: PABHR indexing a GPHT. PAp: PABHR indexing a PAPHT.]
GAg: Global History Register, Global History Table
PAg: Per-Address History Register, Global History Table
PAp: Per-Address History Register, Per-Address History Table
Review: Other Global Variants
GAs: Global History Register, Per-Address (Set Associative) History Table
Gshare: Global History Register, Global History Table with simple attempt at anti-aliasing
[Figure: GAs, with the GBHR indexing a PAPHT; GShare, with the GBHR and the branch address together indexing a GPHT.]
Review: Tournament Predictors
Motivation for correlating branch predictors: the 2-bit predictor failed on important branches; adding global information improved performance
Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine with a selector
Use the predictor that tends to guess correctly
[Figure: addr and history feed Predictor A and Predictor B; a selector picks between them.]

Review: Memory Dependence Prediction
• Important to speculate? Two extremes:
  – Naïve Speculation: always let load go forward
  – No Speculation: always wait for dependencies to be resolved
• Compare Naïve Speculation to No Speculation
  – False Dependency: wait when don’t have to
  – Order Violation: result of speculating incorrectly
• Goal of prediction:
  – Avoid false dependencies and order violations

From “Memory Dependence Prediction using Store Sets”, Chrysos and Emer.


Premise: Past indicates Future
• Basic premise is that past dependencies indicate future dependencies
  – Not always true! Hopefully true most of the time
• Store Set: set of store insts that affect a given load
  – Example:

Addr  Inst
0     Store C
4     Store A
8     Store B
12    Store C
28    Load B    ⇒ Store set { PC 8 }
32    Load D    ⇒ Store set { (null) }
36    Load C    ⇒ Store set { PC 0, PC 12 }
40    Load B    ⇒ Store set { PC 8 }

• Idea: Store set for a load starts empty. If a load ever goes forward and this causes a violation, add the offending store to the load’s store set
• Approach: For each indeterminate load:
  – If a store from its store set is in the pipeline, stall
  – Else let it go forward
• Does this work?
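The learning rule on this slide can be sketched in a few lines of Python. This is an illustrative software model, not the hardware from the paper: the class name `StoreSetPredictor` and its methods are hypothetical, and the store sets are unbounded dictionaries keyed by instruction PC.

```python
from collections import defaultdict


class StoreSetPredictor:
    """Unbounded store-set tracking keyed by instruction PC (hypothetical model)."""

    def __init__(self):
        # load PC -> set of store PCs that have caused a violation for that load
        self.store_sets = defaultdict(set)

    def should_stall(self, load_pc, stores_in_flight):
        """Stall if any store from the load's store set is still in the pipeline."""
        return bool(self.store_sets[load_pc] & set(stores_in_flight))

    def record_violation(self, load_pc, store_pc):
        """The load went ahead of a store it depended on: remember that store."""
        self.store_sets[load_pc].add(store_pc)


pred = StoreSetPredictor()

# First encounter: every store set is empty, so the load speculates.
first_try = pred.should_stall(36, [0, 4, 8, 12])

# Violations observed at run time, using the PCs from the example above.
pred.record_violation(28, 8)    # Load B depends on Store B
pred.record_violation(36, 0)    # Load C depends on the Store C at PC 0
pred.record_violation(36, 12)   # ... and on the Store C at PC 12

load_c_stalls = pred.should_stall(36, [4, 12])         # PC 12 still in flight
load_d_stalls = pred.should_stall(32, [0, 4, 8, 12])   # Load D has an empty set
```

After training, Load C waits only while one of its two stores is in flight, and Load D never waits.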


How well does “infinite” tracking work?
• “Infinite” here means to place no limits on:
  – Number of store sets
  – Number of stores in a given set
• Seems to do pretty well
  – Note: “Not Predicted” means load had empty store set
  – Only Applu and Xlisp seem to have false dependencies


How to track Store Sets in reality?
• SSIT (Store Set ID Table): Assigns loads and stores to a Store Set ID (SSID)
  – Notice that this requires each store to be in only one store set!
• LFST (Last Fetched Store Table): Maps SSIDs to the most recently fetched store
  – When a load is fetched, allows it to find the most recent store in its store set that is executing (if any) ⇒ allows stalling until that store has finished
  – When a store is fetched, allows it to wait for the previous store in its store set
    » Pretty much the same type of ordering as enforced by the ROB anyway
    » Transitivity ⇒ loads end up waiting for all active stores in the store set
• What if a store needs to be in two store sets?
  – Allow store sets to be merged together deterministically
    » Two loads, multiple stores get the same SSID
• Want periodic clearing of SSIT to avoid:
  – Problems with aliasing across the program
  – Out of control merging
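A minimal sketch of how the two tables might interact, under stated assumptions: the class `StoreSetTables`, its method names and the string tags are all hypothetical, the tables are Python dicts rather than fixed-size PC-indexed arrays, and the merge rule used here (the smaller SSID wins) is one possible deterministic choice.

```python
class StoreSetTables:
    """Toy SSIT + LFST model (hypothetical; dicts stand in for hardware tables)."""

    def __init__(self):
        self.ssit = {}       # instruction PC -> store set ID (SSID)
        self.lfst = {}       # SSID -> tag of the most recently fetched store
        self.next_ssid = 0

    def on_violation(self, load_pc, store_pc):
        """Place the load and the offending store in the same store set."""
        l = self.ssit.get(load_pc)
        s = self.ssit.get(store_pc)
        if l is None and s is None:
            ssid = self.next_ssid          # allocate a fresh store set
            self.next_ssid += 1
        elif l is None or s is None:
            ssid = l if l is not None else s
        else:
            ssid = min(l, s)               # assumed merge rule: smaller SSID wins
        self.ssit[load_pc] = ssid
        self.ssit[store_pc] = ssid

    def fetch_store(self, store_pc, tag):
        """Return the tag of the earlier store to wait for (or None)."""
        ssid = self.ssit.get(store_pc)
        if ssid is None:
            return None
        prev = self.lfst.get(ssid)
        self.lfst[ssid] = tag              # this store is now the last fetched
        return prev

    def fetch_load(self, load_pc):
        """Return the tag of the store this load must wait for (or None)."""
        ssid = self.ssit.get(load_pc)
        return None if ssid is None else self.lfst.get(ssid)

    def store_done(self, store_pc, tag):
        """Clear the LFST entry if this store is still the last fetched one."""
        ssid = self.ssit.get(store_pc)
        if ssid is not None and self.lfst.get(ssid) == tag:
            del self.lfst[ssid]

    def clear(self):
        """Periodic clearing, to limit aliasing and runaway merging."""
        self.ssit.clear()
        self.lfst.clear()


t = StoreSetTables()
t.on_violation(36, 0)             # Load C (PC 36) violated against Store C (PC 0)
t.on_violation(36, 12)            # ... and against Store C (PC 12): same SSID
w0 = t.fetch_store(0, "s0")       # nothing earlier in the set
w1 = t.fetch_store(12, "s1")      # must wait for "s0"
wl = t.fetch_load(36)             # waits for "s1", and transitively for "s0"
t.store_done(12, "s1")
after = t.fetch_load(36)          # no active store left in the set
```

Because each store waits for the previous store in its set, the load only needs to track the single most recent one.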


Accuracy of LCT
• Question of accuracy is about how well we avoid:
  – Predicting unpredictable loads
  – Not predicting predictable loads
• How well does this work?
  – Difference between “Simple” and “Limit”: history depth
    » Simple: depth 1
    » Limit: depth 16
  – Limit tends to classify more things as predictable (since this works more often)
• Basic Principle:
  – Often works better to have one structure decide on the basic “predictability” of structure
  – Independent of prediction structure
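One common way to build such a classifier is a saturating counter per load. The sketch below is an assumption rather than something stated on this slide: the class `LoadClassificationTable`, the 2-bit default and the mapping of counter values to the three classes are all illustrative.

```python
class LoadClassificationTable:
    """Per-load saturating counters that classify loads (hypothetical sketch)."""

    def __init__(self, bits=2):
        self.max = (1 << bits) - 1
        self.counters = {}               # load PC -> saturating counter

    def classify(self, pc):
        c = self.counters.get(pc, 0)
        if c == self.max:
            return "constant"            # saturated: candidate for the CVU
        if c == self.max - 1:
            return "predictable"
        return "unpredictable"

    def update(self, pc, prediction_correct):
        """Count up on a correct value prediction, down on a wrong one."""
        c = self.counters.get(pc, 0)
        c = min(c + 1, self.max) if prediction_correct else max(c - 1, 0)
        self.counters[pc] = c


lct = LoadClassificationTable()
states = []
for ok in [True, True, True, False]:
    lct.update(0x40, ok)
    states.append(lct.classify(0x40))
```

Keeping this decision in its own table means the value table can be changed without touching the classification logic, which is the separation the slide argues for.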


Constant Value Unit
• Idea: Identify a load instruction as “constant”
  – Can ignore cache lookup (no verification)
  – Must enforce by monitoring result of stores to remove “constant” status
• How well does this work?
  – Seems to identify 6-18% of loads as constant
  – Must be unchanging enough to cause LCT to classify as constant
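The store-monitoring rule can be modelled as a small address-indexed structure. This is a sketch under assumptions: the class `ConstantVerificationUnit` and its methods are hypothetical names, and exact address matching stands in for whatever granularity real hardware would use.

```python
class ConstantVerificationUnit:
    """Tracks which loads are currently trusted as constant (hypothetical sketch)."""

    def __init__(self):
        self.entries = {}    # data address -> set of load PCs marked constant

    def mark_constant(self, load_pc, addr):
        self.entries.setdefault(addr, set()).add(load_pc)

    def is_constant(self, load_pc, addr):
        """True means the predicted value may be used without a cache lookup."""
        return load_pc in self.entries.get(addr, set())

    def on_store(self, addr):
        """A store to addr revokes constant status for all loads of that address."""
        self.entries.pop(addr, None)


cvu = ConstantVerificationUnit()
cvu.mark_constant(0x40, 0x1000)
before = cvu.is_constant(0x40, 0x1000)
cvu.on_store(0x2000)                      # unrelated address: status kept
still = cvu.is_constant(0x40, 0x1000)
cvu.on_store(0x1000)                      # matching address: status revoked
after_store = cvu.is_constant(0x40, 0x1000)
```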


Load Value Architecture
• LCT/LVPT in fetch stage
• CVU in execute stage
  – Used to bypass cache entirely
  – (Know that result is good)
• Results: Some speedups
  – 21264 seems to do better than PowerPC
  – Authors think this is because a small first-level cache and in-order execution make the CVU more useful


Data Value Prediction
• Why do it?
  – Can “Break the DataFlow Boundary”
  – Before: Critical path = 4 operations (probably worse)
  – After: Critical path = 1 operation (plus verification)
[Figure: dataflow graph over A, B, X and Y using + and / operations, shown before and after intermediate values are replaced by guesses.]


Data Value Predictability
• “The Predictability of Data Values”
  – Yiannakis Sazeides and James Smith, Micro 30, 1997
• Three different types of patterns:
  – Constant (C)
  – Stride (S)
  – Non-Stride (NS)
• Combinations:
  – Repeated Stride (RS)
  – Repeated Non-Stride (RNS)


Computational Predictors
• Last Value Predictors
  – Predict that instruction will produce same value as last time
  – Requires some form of hysteresis. Two subtle alternatives:
    » Saturating counter incremented/decremented on success/failure; replace the value when the count is below threshold
    » Keep old value until new value seen frequently enough
  – Second version predicts a constant when the value appears temporarily constant
• Stride Predictors
  – Predict next value by adding the most recent value to the difference of the two most recent values:
    » If v_{n-1} and v_{n-2} are the two most recent values, then predict next value will be: v_{n-1} + (v_{n-1} - v_{n-2})
    » The value (v_{n-1} - v_{n-2}) is called the “stride”
  – Important variations in hysteresis:
    » Change stride only if saturating counter falls below threshold
    » Or “two-delta” method. Two strides maintained.
      • First (S1) always updated by difference between two most recent values
      • Other (S2) used for computing predictions
      • When S1 seen twice in a row, then S1 → S2
• More complex predictors:
  – Multiple strides for nested loops
  – Complex computations for complex loops (polynomials, etc!)
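The two-delta method is easiest to follow on a concrete sequence. The sketch below is illustrative only: the class `TwoDeltaStridePredictor` and its field names are hypothetical, and both strides are assumed to start at zero.

```python
class TwoDeltaStridePredictor:
    """Two-delta stride predictor for one instruction (hypothetical sketch)."""

    def __init__(self):
        self.last = None   # most recent value
        self.s1 = 0        # always tracks the latest difference
        self.s2 = 0        # stride actually used for predictions

    def predict(self):
        return None if self.last is None else self.last + self.s2

    def update(self, value):
        if self.last is not None:
            delta = value - self.last
            if delta == self.s1:
                self.s2 = delta      # S1 seen twice in a row: S1 -> S2
            self.s1 = delta
        self.last = value


p = TwoDeltaStridePredictor()
predictions = []
for v in [0, 1, 2, 3, 0, 1, 2, 3]:   # a loop counter that restarts once
    predictions.append(p.predict())
    p.update(v)
```

When the counter wraps from 3 back to 0, S1 changes but S2 does not, so the predictor mispredicts only once at the wrap and is correct again on the very next value. A predictor that replaced its stride on every update would mispredict twice there.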


Context Based Predictors
• Context Based Predictor
  – Relies on tables to do the trick
  – Classified according to the order: an “n-th” order model takes the last n values and uses them to produce a prediction
    » So a 0th order predictor will be entirely frequency based
• Consider sequence: a a a b c a a a b c a a a
  – Next value is?
• “Blending”: Use prediction of highest order available
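The question on this slide can be answered by running the sequence through a small table-based model. This is a sketch under assumptions: the class `ContextPredictor` is hypothetical, each order keeps a frequency count per context, and blending is implemented as "use the highest order whose context has been seen before".

```python
from collections import Counter, defaultdict


class ContextPredictor:
    """Context-based predictor with simple blending (hypothetical sketch)."""

    def __init__(self, max_order=2):
        self.max_order = max_order
        # One table per order: context tuple -> Counter of values that followed it
        self.tables = [defaultdict(Counter) for _ in range(max_order + 1)]
        self.history = []

    def _context(self, order):
        # The last `order` values; an empty tuple for order 0
        return tuple(self.history[len(self.history) - order:])

    def predict(self):
        # Blending: try the highest order first, fall back to lower orders
        for order in range(self.max_order, -1, -1):
            if order > len(self.history):
                continue
            counts = self.tables[order].get(self._context(order))
            if counts:
                return counts.most_common(1)[0][0]
        return None

    def update(self, value):
        for order in range(self.max_order + 1):
            if order <= len(self.history):
                self.tables[order][self._context(order)][value] += 1
        self.history.append(value)


by_order = {}
for k in range(4):
    p = ContextPredictor(max_order=k)
    for v in "aaabcaaabcaaa":
        p.update(v)
    by_order[k] = p.predict()
```

With this sequence, orders 0, 1 and 2 all predict `a`, because `a` is the most frequent value overall and the most frequent successor of both `a` and `a a`. Only the order-3 model, whose context `a a a` has always been followed by `b`, predicts `b`.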


Which is better?
• Stride-based:
  – Learns faster
  – Less state
  – Much cheaper in terms of hardware!
  – Runs into errors for any pattern that is not an infinite stride
• Context-based:
  – Much longer to train
  – Performs perfectly once trained
  – Much more expensive hardware