ILP Architecture Classifications and Their Impact on Instruction Processing, Study notes of Advanced Computer Architecture

These notes survey architecture classifications for instruction-level parallelism (ILP) in computer systems, including sequential, dependence, and independence architectures. They cover the characteristics and limitations of each architecture, give examples of hardware implementations, and explain why preserving program order and handling hazards matters when increasing ILP.

Typology: Study notes

2013/2014

Uploaded on 06/03/2014

nagesh
nagesh 🇮🇳

CS 211: Computer Architecture
Introduction to ILP Processors & Concepts

Course Outline
• Introduction: Trends, Performance models
• Review of computer organization and ISA implementation
• Overview of Pipelining
• ILP Processors: Superscalar Processors
  – Next! ILP Intro and Superscalar
• ILP: EPIC/VLIW Processors
• Compiler optimization techniques for ILP processors – getting max performance out of ILP design
• Part 2: Other components - memory, I/O.

Introduction to Instruction Level Parallelism (ILP)
• What is ILP?
  – Processor and Compiler design techniques that speed up execution by causing individual machine operations to execute in parallel
• ILP is transparent to the user
  – Multiple operations executed in parallel even though the system is handed a single program written with a sequential processor in mind
• Same execution hardware as a normal RISC machine
  – May be more than one of any given type of hardware

Architectures for Instruction-Level Parallelism
Scalar Pipeline (baseline)
Instruction Parallelism = D
Operation Latency = 1
Peak IPC = 1
[Figure: scalar pipeline timing diagram; stages IF, DE, EX, WB; x-axis: time in cycles (of baseline machine); y-axis: successive instructions; D marks the pipeline depth]

Superpipelined Machine
Superpipelined Execution
IP = D x M
OL = M minor cycles
Peak IPC = 1 per minor cycle (M per baseline cycle)
[Figure: superpipelined timing diagram; stages IF, DE, EX, WB; major cycle = M minor cycles]

Superscalar Machines
Superscalar (Pipelined) Execution
IP = D x N
OL = 1 baseline cycle
Peak IPC = N per baseline cycle
[Figure: superscalar timing diagram; stages IF, DE, EX, WB; N instructions start in each cycle]

Superscalar and Superpipelined
Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e. if n = m then both have about the same IPC.

Superscalar Parallelism
  Operation Latency: 1
  Issuing Rate: N
  Superscalar Degree (SSD): N (determined by issue rate)

Superpipeline Parallelism
  Operation Latency: M
  Issuing Rate: 1
  Superpipelined Degree (SPD): M (determined by operation latency)

[Figure: SUPERSCALAR and SUPERPIPELINED execution compared; x-axis: time in cycles (of base machine), 0 to 13; key: IFetch, Dcode, Execute, Writeback]
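The IP and peak IPC figures quoted for the three machine classes can be put side by side in a few lines. This is only a sketch: `Machine` and `classify` are hypothetical names, and the combined case (both n > 1 and m > 1) is an extrapolation from the "N x M" product mentioned for inorder pipelines, not something the slides tabulate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    instruction_parallelism: int  # instructions in flight needed to keep the pipeline full
    peak_ipc: int                 # best-case instructions per baseline cycle


def classify(depth: int, n: int = 1, m: int = 1) -> Machine:
    """Describe a pipeline of the given depth D.

    n is the superscalar degree (instructions issued per cycle),
    m is the superpipelined degree (minor cycles per baseline cycle).
    """
    if depth < 1 or n < 1 or m < 1:
        raise ValueError("depth, n and m must all be at least 1")
    if n == 1 and m == 1:
        name = "scalar pipeline (baseline)"
    elif n == 1:
        name = "superpipelined"
    elif m == 1:
        name = "superscalar"
    else:
        name = "superscalar and superpipelined"  # extrapolated case
    # IP = D x N for superscalar, D x M for superpipelined, D for the baseline.
    return Machine(name, depth * n * m, n * m)


if __name__ == "__main__":
    for machine in (classify(4), classify(4, m=3), classify(4, n=3)):
        print(machine)
```

With equal degrees (n = m) the superscalar and superpipelined machines report the same peak IPC, which is the point of the comparison slide.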

Limitations of Inorder Pipelines
• CPI of inorder pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e. when N x M approaches the average distance between dependent instructions
• Forwarding is no longer effective ⇒ must stall more often
  – Pipeline may never be full due to frequent dependency stalls!!
[Figure: inorder pipeline timing diagram; stages IF, DE, EX, WB]

ILP: Instruction-Level Parallelism
• ILP is a measure of the amount of inter-dependencies between instructions
• Average ILP = no. of instructions / no. of cycles required

code1: ILP = 1, i.e. must execute serially
    r1 ← r2 + 1
    r3 ← r1 / 17
    r4 ← r0 - r3

code2: ILP = 3, i.e. can execute at the same time
    r1 ← r2 + 1
    r3 ← r9 / 17
    r4 ← r0 - r
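The "instructions / cycles" ratio can be checked mechanically. The sketch below assumes unit latency, unlimited functional units, and true (read-after-write) dependences only; `average_ilp` is a hypothetical helper. The last operand of code2 is cut off in the notes, so `r10` is used here purely as a stand-in for any register not written by the earlier instructions.

```python
def average_ilp(instructions):
    """Average ILP = number of instructions / number of cycles required.

    instructions: list of (dest, sources) tuples in program order.
    Assumes unit latency and unlimited hardware; only RAW dependences
    delay an instruction.
    """
    if not instructions:
        return 0.0
    ready_cycle = {}  # register -> cycle in which its newest value is produced
    last_cycle = 0
    for dest, sources in instructions:
        # Run one cycle after the latest producer of any source register.
        cycle = 1 + max((ready_cycle.get(s, 0) for s in sources), default=0)
        ready_cycle[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(instructions) / last_cycle


code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
# "r10" is an assumed operand; the source listing is truncated at this point.
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

if __name__ == "__main__":
    print(average_ilp(code1), average_ilp(code2))
```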

Inter-instruction Dependences
• Data dependence: Read-after-Write (RAW)
    r3 ← r1 op r2
    r5 ← r3 op r4
• Anti-dependence: Write-after-Read (WAR)
    r3 ← r1 op r2
    r1 ← r4 op r5
• Output dependence: Write-after-Write (WAW)
    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7
• Control dependence
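The three register dependence types reduce to simple set checks on what each instruction reads and writes. The following is a minimal sketch with hypothetical names (`Instr`, `dependences`) and illustrative register numbers; it looks only at a pair of instructions and ignores control dependence.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str       # register written
    sources: tuple  # registers read


def dependences(first: Instr, second: Instr) -> set:
    """Return the dependence types of `second` on the earlier `first`."""
    found = set()
    if first.dest in second.sources:
        found.add("RAW")  # second reads what first writes (true dependence)
    if second.dest in first.sources:
        found.add("WAR")  # second overwrites what first reads (anti-dependence)
    if first.dest == second.dest:
        found.add("WAW")  # both write the same register (output dependence)
    return found


if __name__ == "__main__":
    producer = Instr("r3", ("r1", "r2"))
    print(dependences(producer, Instr("r5", ("r3", "r4"))))  # true dependence
    print(dependences(producer, Instr("r1", ("r4", "r5"))))  # anti-dependence
    print(dependences(producer, Instr("r3", ("r6", "r7"))))  # output dependence
```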

Scope of ILP Analysis

    r1 ⇐ r2 + 1
    r3 ⇐ r1 / 17
    r4 ⇐ r0 - r3
    r11 ⇐ r12 + 1
    r13 ⇐ r19 / 17
    r14 ⇐ r0 - r

ILP=
ILP=1

Out-of-order execution permits more ILP to be exploited

Questions Facing ILP System Designers
• What gives rise to instruction-level parallelism in conventional, sequential programs?
• How is the potential parallelism identified and enhanced, and how much is there?
• What must be done in order to exploit the parallelism that has been identified?
• How should the work of identifying, enhancing and exploiting the parallelism be divided between the hardware and software (the compiler)?
• What are the alternatives in selecting the architecture of an ILP processor?

Sequential Processor
[Figure: sequential instructions feed a processor containing execution units]

ILP Processors: Superscalar
[Figure: sequential instructions feed a superscalar processor; scheduling logic inside the processor does instruction scheduling/parallelism extraction in hardware]

ILP Processors: EPIC/VLIW
[Figure: a serial program (C code) goes through the compiler, which produces scheduled instructions for the EPIC processor]

ILP Architectures
• Between the compiler and the run-time hardware, the following functions must be performed
  – Dependencies between operations must be determined
  – Operations that are independent of any operation that has not yet completed must be determined
  – Independent operations must be scheduled to execute at some particular time, on some specific functional unit, and must be assigned a register into which the result may be deposited.

Dependence Architecture
• Compiler or programmer communicates to the hardware the dependencies between instructions
  – removes the need to scan the program in sequential order (the bottleneck for superscalar processors)
• Hardware determines at run-time when to schedule the instruction

Dependence Architecture Example
• Dataflow processors are representative of Dependence architectures
  – Execute instruction at earliest possible time subject to availability of input operands and functional units
  – Dependencies communicated by providing with each instruction a list of all successor instructions
  – As soon as all input operands of an instruction are available, the hardware fetches the instruction
  – The instruction is executed as soon as a functional unit is available
• Few Dataflow processors currently exist
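The firing rule described for dataflow processors (each instruction carries a list of successors and becomes eligible once all of its input operands have arrived) can be sketched as follows. This is an illustration only: `dataflow_order` and the four-node graph are hypothetical, and functional-unit availability is ignored, so every ready instruction fires immediately.

```python
from collections import deque


def dataflow_order(successors, operand_count):
    """Return the order in which instructions fire under the dataflow rule.

    successors: instruction -> list of instructions that consume its result.
    operand_count: instruction -> number of operands it still waits for.
    """
    remaining = dict(operand_count)
    ready = deque(name for name, count in remaining.items() if count == 0)
    fired = []
    while ready:
        node = ready.popleft()
        fired.append(node)
        # Deliver the result to every successor named by this instruction.
        for succ in successors.get(node, []):
            remaining[succ] -= 1
            if remaining[succ] == 0:
                ready.append(succ)  # all operands present: eligible to fire
    return fired


if __name__ == "__main__":
    successors = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
    operand_count = {"a": 0, "b": 0, "c": 2, "d": 1}
    print(dataflow_order(successors, operand_count))
```

No program counter appears anywhere in the loop; ordering comes only from operand arrival, which is why the hardware does not need to scan the program sequentially.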

Independence Architecture
• By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle
• The set of independent operations is far greater than the set of dependent operations
  – Only a subset of independent operations are specified
• The compiler may additionally specify on which functional unit and in which cycle an operation is executed
  – The hardware needs to make no run-time decisions

Independence Architecture Example
• EPIC/VLIW processors are examples of Independence architectures
  – Specify exactly which functional unit each operation is executed on and when each operation is issued
  – Operations are independent of other operations issued at the same time as well as those that are in execution
  – Compiler emulates at compile time what a dataflow processor does at run-time
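To make "the compiler emulates at compile time what a dataflow processor does at run-time" concrete, here is a minimal greedy scheduler that packs operations into long instruction words. It is a sketch under stated assumptions, not how any particular VLIW compiler works: unit latency, true dependences only, operations listed in program order, and hypothetical names (`schedule`, the unit kinds `alu` and `mul`).

```python
def schedule(ops, units):
    """Pack operations into instruction words, one word per cycle.

    ops: list of (name, unit_kind, deps) in program order, where deps names
         earlier operations whose results this one needs.
    units: unit_kind -> number of functional units of that kind.
    Returns a list of words; each word lists the operations issued together.
    """
    cycle_of = {}  # operation name -> cycle it was placed in
    words = []     # words[c] = names issued in cycle c
    used = []      # used[c] = unit_kind -> units already claimed in cycle c
    for name, kind, deps in ops:
        if units.get(kind, 0) < 1:
            raise ValueError(f"no functional unit of kind {kind!r}")
        # Earliest legal cycle: one after the latest producer (unit latency).
        cycle = max((cycle_of[d] + 1 for d in deps), default=0)
        while True:
            while len(words) <= cycle:
                words.append([])
                used.append({})
            if used[cycle].get(kind, 0) < units[kind]:
                break
            cycle += 1  # that kind of unit is full this cycle; try the next
        words[cycle].append(name)
        used[cycle][kind] = used[cycle].get(kind, 0) + 1
        cycle_of[name] = cycle
    return words


if __name__ == "__main__":
    ops = [
        ("a", "alu", []),
        ("b", "alu", []),
        ("c", "mul", ["a"]),
        ("d", "alu", ["a", "b"]),
        ("e", "alu", []),
    ]
    print(schedule(ops, {"alu": 2, "mul": 1}))
```

Every operation in a word is independent of the others in that word, so hardware running such a schedule would need no dependence checks at issue time.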

Compiler vs. Processor
[Figure: the steps Frontend and Optimizer, Determine Dependences, Determine Independences, Bind Operations to Function Units, Bind Transports to Busses, and Execute, showing for Superscalar, Dataflow, Indep. Arch., VLIW and TTA which steps are done by the compiler and which by the hardware]
B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.

VLIW and Superscalar
• basic structure of VLIW and superscalar consists of a number of EUs, each capable of parallel operation on data fetched from register file
• VLIW and superscalar require highly multiported register files
  – limit on register ports places inherent limitation on maximum number of EUs

VLIW & Superscalar - Differences
• presentation of instructions:
  – VLIW receive multi-operation instructions
  – Superscalar accept traditional sequential stream but can issue more than one instruction
• VLIW needs very long instructions in order to specify what each EU should do
• Superscalar receive stream of conventional instructions

VLIW & Superscalar - Differences
• Decode and Issue unit in superscalar issues multiple instructions for the EUs
  – Have to figure out dependencies and independent instructions
• VLIW expect dependency free code whereas superscalar typically do not expect this.
  – Superscalars cope with dependencies using hardware

Superscalar Processors

Superscalar Terminology
• Basic
  – Superscalar: Able to issue > 1 instruction / cycle
  – Superpipelined: Deep, but not superscalar pipeline. E.g., MIPS R5000 has 8 stages
  – Branch prediction: Logic to guess whether or not branch will be taken, and possibly branch target
• Advanced
  – Out-of-order: Able to issue instructions out of program order
  – Speculation: Execute instructions beyond branch points, possibly nullifying later
  – Register renaming: Able to dynamically assign physical registers to instructions
  – Retire unit: Logic to keep track of instructions as they complete.

Superscalar Execution Example
Single Order, Data Dependence – In Order
• Assumptions
  – Single FP adder takes 2 cycles
  – Single FP multiplier takes 5 cycles
  – Can issue add & multiply together
  – Must issue in-order
  – Operand format: in, in, out

    v: addt $f2, $f4, $f10
    w: mult $f10, $f6, $f10
    x: addt $f10, $f8, $f12
    y: addt $f4, $f6, $f
    z: addt $f4, $f8, $f

[Figure: data flow graph of v, w, x, y, z over registers $f2, $f4, $f6, $f8 and $f12; critical path = 9 cycles]
[Figure: in-order issue timeline of v, w, x, y, z (single adder, data dependence)]
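The 9-cycle critical path follows from the stated latencies: v (add, 2 cycles) feeds w (multiply, 5 cycles), which feeds x (add, 2 cycles). A small sketch of that calculation is below. It follows true dependences only and ignores the single-adder and in-order issue constraints; `critical_path` is a hypothetical helper, and `dst_y` and `dst_z` are placeholder destination names because the destinations of y and z are cut off in the listing.

```python
LATENCY = {"addt": 2, "mult": 5}  # cycles, from the slide's assumptions


def critical_path(program):
    """Return (length in cycles, finish time per instruction).

    program: list of (label, op, src1, src2, dest) in program order.
    Only read-after-write dependences delay an instruction.
    """
    finish_of_reg = {}  # register -> cycle when its newest value is ready
    finish = {}
    for label, op, src1, src2, dest in program:
        start = max(finish_of_reg.get(src1, 0), finish_of_reg.get(src2, 0))
        finish[label] = start + LATENCY[op]
        finish_of_reg[dest] = finish[label]
    return max(finish.values()), finish


PROGRAM = [
    ("v", "addt", "$f2", "$f4", "$f10"),
    ("w", "mult", "$f10", "$f6", "$f10"),
    ("x", "addt", "$f10", "$f8", "$f12"),
    ("y", "addt", "$f4", "$f6", "dst_y"),  # placeholder destination
    ("z", "addt", "$f4", "$f8", "dst_z"),  # placeholder destination
]

if __name__ == "__main__":
    length, finish = critical_path(PROGRAM)
    print(length, finish)
```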

Adding Advanced Features
• Out Of Order Issue
  – Can start y as soon as adder available
  – Must hold back z until $f10 not busy & adder available

    v: addt $f2, $f4, $f10
    w: mult $f10, $f6, $f10
    x: addt $f10, $f8, $f12
    y: addt $f4, $f6, $f
    z: addt $f4, $f8, $f

• With Register Renaming

    v: addt $f2, $f4, $f10a
    w: mult $f10a, $f6, $f10a
    x: addt $f10a, $f8, $f12
    y: addt $f4, $f6, $f
    z: addt $f4, $f8, $f

[Figure: issue timelines of v, w, x, y, z with out-of-order issue and with register renaming]
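Register renaming can be sketched as a single pass that gives every result a fresh physical register and redirects later reads through a mapping table. The code below is an illustration only: `rename` and the `p0, p1, ...` names are hypothetical, the pool of physical registers is treated as unlimited, and the destinations of y and z are assumptions (`$f14` is a placeholder; `$f10` for z is suggested by the note that z must wait until `$f10` is not busy).

```python
import itertools


def rename(program):
    """Rename destinations so that no two instructions write the same register.

    program: list of (op, src1, src2, dest) using architectural names.
    Returns the same instructions rewritten with physical register names.
    """
    fresh = (f"p{i}" for i in itertools.count())
    mapping = {}  # architectural register -> newest physical register
    renamed = []
    for op, src1, src2, dest in program:
        # Sources read the newest mapping; unmapped names are live-in values.
        s1 = mapping.get(src1, src1)
        s2 = mapping.get(src2, src2)
        mapping[dest] = next(fresh)  # every write gets its own register
        renamed.append((op, s1, s2, mapping[dest]))
    return renamed


PROGRAM = [
    ("addt", "$f2", "$f4", "$f10"),   # v
    ("mult", "$f10", "$f6", "$f10"),  # w
    ("addt", "$f10", "$f8", "$f12"),  # x
    ("addt", "$f4", "$f6", "$f14"),   # y, destination assumed
    ("addt", "$f4", "$f8", "$f10"),   # z, destination assumed
]

if __name__ == "__main__":
    for instr in rename(PROGRAM):
        print(instr)
```

After renaming, z no longer shares a destination with v and w, so only the true dependences v → w → x remain.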

Flow Path Model of Superscalars
[Figure: instruction flow (I-cache, branch predictor, FETCH, instruction buffer, DECODE), register data flow (integer, floating-point, media and memory units in EXECUTE), and memory data flow (reorder buffer (ROB), store queue, COMMIT, D-cache)]

Superscalar Pipeline Design
[Figure: pipeline stages Fetch, Decode, Dispatch, Execute, Complete and Retire, separated by the instruction buffer, dispatch buffer, issuing buffer, completion buffer and store buffer; instruction flow and data flow are marked]

Inorder Pipelines
[Figure: Intel i pipeline with stages IF, D1, D2, EX, WB; Intel Pentium with two parallel pipes (U-pipe and V-pipe), each with fetch, two decode stages, execute and writeback]
Inorder pipeline, no WAW no WAR (almost always true)

Out-of-order Pipelining 101
[Figure: pipeline with IF, ID and RD stages, parallel EX units of different depth (INT, Fadd, Fmult, LD/ST), and WB]

Program Order
    Ia: F1 ← F2 x F3
    .....
    Ib: F1 ← F4 + F5

Out-of-order WB
    Ib: F1 ← “F4 + F5”
    ......
    Ia: F1 ← “F2 x F3”

What is the value of F1? WAW!!!
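The WAW problem can be reproduced with a toy model in which each instruction writes F1 when it completes. The completion cycles below are made-up numbers chosen only so that the multiply finishes after the add; `final_value` is a hypothetical helper.

```python
def final_value(writes, in_order_writeback):
    """Return the value left in the register after all writes land.

    writes: list of (program_index, completion_cycle, value), all targeting
            the same register.
    in_order_writeback: if True, results are written in program order;
            otherwise in order of completion.
    """
    if in_order_writeback:
        ordered = sorted(writes, key=lambda w: w[0])
    else:
        ordered = sorted(writes, key=lambda w: w[1])
    value = None
    for _, _, result in ordered:
        value = result  # each write overwrites the previous one
    return value


WRITES = [
    (0, 5, "F2 x F3"),  # Ia: multiply, assumed to complete in cycle 5
    (1, 3, "F4 + F5"),  # Ib: add, assumed to complete in cycle 3
]

if __name__ == "__main__":
    print("in-order WB:    ", final_value(WRITES, True))
    print("out-of-order WB:", final_value(WRITES, False))
```

Program order requires F1 to end up holding Ib's result; with out-of-order writeback the slower Ia lands last and leaves the stale product in F1.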

More Hardware Features to Support ILP
• Pipelining
  – Advantages
    » Relatively low cost of implementation - requires latches within functional units
    » With pipelining, ILP can be doubled, tripled or more
  – Disadvantages
    » Adds delays to execution of individual operations
    » Increased latency eventually counterbalances increase in ILP

Hardware Features to Support ILP
• Additional Functional Units
  – Advantages
    » Does not suffer from increased latency bottleneck
  – Disadvantages
    » Amount of functional unit hardware proportional to degree of parallelism
    » Interconnection network and register file size proportional to square of number of functional units

Hardware Features to Support ILP
• Instruction Issue Unit
  – Care must be taken not to issue an instruction if another instruction upon which it is dependent is not complete
  – Requires complex control logic in Superscalar processors
  – Virtually trivial control logic in VLIW processors

Hardware Features to Support ILP
• Speculative Execution
  – Little ILP typically found in basic blocks
    » a straight-line sequence of operations with no intervening control flow
  – Multiple basic blocks must be executed in parallel
    » Execution may continue along multiple paths before it is known which path will be executed

Hardware Features to Support ILP
• Requirements for Speculative Execution
  – Terminate unnecessary speculative computation once the branch has been resolved
  – Undo the effects of the speculatively executed operations that should not have been executed
  – Ensure that no exceptions are reported until it is known that the excepting operation should have been executed
  – Preserve enough execution state at each speculative branch point to enable execution to resume down the correct path if the speculative execution happened to proceed down the wrong one.

Hardware Features to Support ILP
• Speculative Execution
  – Expensive in hardware
  – Alternative is to perform speculative code motion at compile time
    » Move operations from subsequent blocks up past branch operations into preceding blocks
  – Requires less demanding hardware
    » A mechanism to ensure that exceptions caused by speculatively scheduled operations are reported if and only if flow of control is such that they would have been executed in the non-speculative version of the code
    » Additional registers to hold the speculative execution state

Introduction to S/W Techniques for ILP

Instruction Level Parallelism (ILP)
• ILP: Overlap execution of unrelated instructions
• How to extract parallelism from program?
  – Beyond single block to get more instruction level parallelism
• Who does instruction scheduling and parallelism extraction?
  – Software or Hardware or mix?
  – Superscalar processors require H/W solutions, but can also use some compiler help
• What new hardware features are required to support more ILP..?
  – Different requirements for Superscalar and EPIC

Quick recall of data hazards..
• True/flow dependencies - RAW
• Name dependencies WAR, WAW
  – Also known as false dependencies, output dep
• Instr J is data dependent on Instr I
  – Instr J tries to read operand before Instr I writes it
• or Instr J is data dependent on Instr K which is dependent on Instr I

Data Dependence and Hazards
• Caused by a “True Dependence” (compiler term)
• If true dependence caused a hazard in the pipeline, called a Read After Write (RAW) hazard

    I: add r1,r2,r3
    J: sub r4,r1,r

Name Dependence #1: Anti-dependence
• Name dependence: when 2 instructions use same register or memory location, called a name, but no flow of data between the instructions associated with that name; 2 versions of name dependence
• Instr J writes operand before Instr I reads it
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”
• If anti-dependence caused a hazard in the pipeline, called a Write After Read (WAR) hazard

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r

Name Dependence #2: Output dependence
• Instr J writes operand before Instr I writes it.
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”
• If output dependence caused a hazard in the pipeline, called a Write After Write (WAW) hazard

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r

ILP and Data Hazards
• HW/SW must preserve program order:
  – order instructions would execute in if executed sequentially one at a time as determined by original source program
  – Does this mean we can never change order of execution of instructions?
    » Ask - What happens if we change the order of an instruction
    » Does result change?

ILP and Data Hazards
• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
• Instructions involved in a name dependence can execute simultaneously if name used in instructions is changed so instructions do not conflict
  – Register renaming resolves name dependence for regs
  – Either by compiler or by HW

Control Dependencies
• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

    if p1 {S1;};
    if p2 {S2;}

• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Control Dependence Ignored
• Control dependence need not be preserved
  – willing to execute instructions that should not have been executed, thereby violating the control dependences, if can do so without affecting correctness of the program
• Instead, 2 properties critical to program correctness are exception behavior and data flow

Example FP Loop: Where are the Hazards?

Loop: LD    F0,0(R1)    ;F0=vector element
      ADDD  F4,F0,F2    ;add scalar from F2
      SD    0(R1),F     ;store result
      SUBI  R1,R1,8     ;decrement pointer 8B (DW)
      BNEZ  R1,Loop     ;branch R1!=zero
      NOP               ;delayed branch slot

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op
FP ALU op                       Store double
Load double                     FP ALU op
Load double                     Store double
Integer op                      Integer op

• Where are the stalls?

FP Loop Hazards

Loop: LD    F0,0(R1)    ;F0=vector element
      ADDD  F4,F0,F2    ;add scalar in F2
      SD    0(R1),F     ;store result
      SUBI  R1,R1,8     ;decrement pointer 8B (DW)
      BNEZ  R1,Loop     ;branch R1!=zero
      NOP               ;delayed branch slot

FP Loop Showing Stalls

1 Loop: LD    F0,0(R1)    ;F0=vector element
2       stall
3       ADDD  F4,F0,F2    ;add scalar in F2
4       stall
5       stall
6       SD    0(R1),F     ;store result
7       SUBI  R1,R1,8     ;decrement pointer 8B (DW)
8       BNEZ  R1,Loop     ;branch R1!=zero
9       stall             ;delayed branch slot

• 9 clocks: Rewrite code to minimize stalls?

Revised FP Loop Minimizing Stalls

1 Loop: LD    F0,0(R1)
2       stall
3       ADDD  F4,F0,F2
4       SUBI  R1,R1,8
5       BNEZ  R1,Loop     ;delayed branch
6       SD    8(R1),F     ;altered when move past SUBI

Swap BNEZ and SD by changing address of SD
6 clocks: Unroll loop 4 times to make code faster?
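The 9-clock and 6-clock counts can be reproduced with a very small in-order, single-issue model. Everything here is a sketch under assumptions: the latency values (one stall between a load and a dependent FP ALU op, two between an FP ALU op and a dependent store, none elsewhere) are chosen to match the stalls shown in the listings, `F4` as the stored register is assumed because that operand is cut off in the notes, and `issue_cycles` is a hypothetical helper.

```python
# (producer kind, consumer kind) -> stall cycles needed between them.
# Assumed values, chosen to match the stalls shown in the listings.
LATENCY = {("load", "fp"): 1, ("fp", "store"): 2}


def issue_cycles(program):
    """Return the 1-based issue cycle of each instruction.

    program: list of (text, kind, dest, sources) in issue order.
    In-order, one instruction per cycle; an instruction waits until each
    producer of its sources has issued and that producer's latency elapsed.
    """
    produced = {}  # register -> (issue cycle, kind) of its latest producer
    cycles = []
    cycle = 0
    for _text, kind, dest, sources in program:
        cycle += 1
        for reg in sources:
            if reg in produced:
                p_cycle, p_kind = produced[reg]
                stall = LATENCY.get((p_kind, kind), 0)
                cycle = max(cycle, p_cycle + stall + 1)
        if dest is not None:
            produced[dest] = (cycle, kind)
        cycles.append(cycle)
    return cycles


ORIGINAL = [
    ("LD F0,0(R1)", "load", "F0", ["R1"]),
    ("ADDD F4,F0,F2", "fp", "F4", ["F0", "F2"]),
    ("SD 0(R1),F4", "store", None, ["F4", "R1"]),  # F4 assumed
    ("SUBI R1,R1,8", "int", "R1", ["R1"]),
    ("BNEZ R1,Loop", "branch", None, ["R1"]),
    ("NOP", "nop", None, []),  # delayed branch slot
]

REVISED = [
    ("LD F0,0(R1)", "load", "F0", ["R1"]),
    ("ADDD F4,F0,F2", "fp", "F4", ["F0", "F2"]),
    ("SUBI R1,R1,8", "int", "R1", ["R1"]),
    ("BNEZ R1,Loop", "branch", None, ["R1"]),
    ("SD 8(R1),F4", "store", None, ["F4", "R1"]),  # fills the delay slot
]

if __name__ == "__main__":
    print("original:", issue_cycles(ORIGINAL)[-1], "clocks")
    print("revised: ", issue_cycles(REVISED)[-1], "clocks")
```

Moving SUBI and BNEZ between ADDD and SD hides the two-cycle FP-to-store latency, and placing SD in the branch delay slot removes the trailing stall.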

Unroll Loop Four Times (straightforward way)

1  Loop: LD    F0,0(R1)
2        ADDD  F4,F0,F
3        SD    0(R1),F       ;drop SUBI & BNEZ
4        LD    F6,-8(R1)
5        ADDD  F8,F6,F
6        SD    -8(R1),F      ;drop SUBI & BNEZ
7        LD    F10,-16(R1)
8        ADDD  F12,F10,F
9        SD    -16(R1),F     ;drop SUBI & BNEZ
10       LD    F14,-24(R1)
11       ADDD  F16,F14,F
12       SD    -24(R1),F
13       SUBI  R1,R1,#       ;alter to 4
14       BNEZ  R1,LOOP
15       NOP

15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration
Assumes R1 is multiple of 4
Rewrite loop to minimize stalls?

Unrolled Loop That Minimizes Stalls

1  Loop: LD    F0,0(R1)
2        LD    F6,-8(R1)
3        LD    F10,-16(R1)
4        LD    F14,-24(R1)
5        ADDD  F4,F0,F
6        ADDD  F8,F6,F
7        ADDD  F12,F10,F
8        ADDD  F16,F14,F
9        SD    0(R1),F
10       SD    -8(R1),F
11       SD    -16(R1),F
12       SUBI  R1,R1,#
13       BNEZ  R1,LOOP
14       SD    8(R1),F

14 clock cycles, or 3.5 per iteration
When safe to move instructions?

• What assumptions made when moved code?
  – OK to move store past SUBI even though changes register
  – OK to move loads before stores: get right data?
  – When is it safe for compiler to do such changes?
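At source level, the transformation behind these listings is a loop that adds a scalar to every vector element, unrolled so that four elements are handled per trip. The sketch below shows the idea in Python with hypothetical function names; like the assembly version, it assumes the element count is a multiple of 4.

```python
def add_scalar(x, s):
    """Reference loop: one element per iteration, walking down from the end."""
    for i in range(len(x) - 1, -1, -1):
        x[i] += s


def add_scalar_unrolled4(x, s):
    """Same computation with the loop body replicated four times.

    The index update and loop test run once per four elements, which is
    the overhead the unrolled assembly removes (the dropped SUBI and BNEZ).
    """
    if len(x) % 4 != 0:
        raise ValueError("element count must be a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] += s
        x[i - 1] += s
        x[i - 2] += s
        x[i - 3] += s
        i -= 4  # one pointer update for four elements


if __name__ == "__main__":
    a = [float(v) for v in range(8)]
    b = list(a)
    add_scalar(a, 0.5)
    add_scalar_unrolled4(b, 0.5)
    print(a == b)
```

The four statements in the unrolled body touch different elements, so a scheduler is free to interleave their loads, adds and stores, which is what the stall-minimizing listing does.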

Compiler Perspectives on Code Movement
• Definitions: compiler concerned about dependencies in program, whether or not a HW hazard depends on a given pipeline
• Try to schedule to avoid hazards
• (True) Data dependencies (RAW if a hazard for HW)
  – Instruction i produces a result used by instruction j, or
  – Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory:
  – Does 100(R4) = 20(R6)?
  – From different loop iterations, does 20(R6) = 20(R6)?
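The memory question ("Does 100(R4) = 20(R6)?") is hard because the answer depends on register values that the compiler often does not know. A minimal sketch of that three-way outcome follows; `same_address` is a hypothetical helper, and treating two references through the same base register as comparable assumes the register is not modified between them, which is exactly what fails across loop iterations.

```python
def same_address(off_a, base_a, off_b, base_b, known):
    """Decide whether off_a(base_a) and off_b(base_b) name the same address.

    known: register -> value, for registers whose contents are known.
    Returns True or False when decidable, None when the compiler cannot tell.
    Assumes a base register keeps its value between the two references.
    """
    if base_a == base_b:
        return off_a == off_b  # same base: only the offsets matter
    if base_a in known and base_b in known:
        return known[base_a] + off_a == known[base_b] + off_b
    return None  # unknown register contents: must assume a possible conflict


if __name__ == "__main__":
    print(same_address(100, "R4", 20, "R6", {}))                    # undecidable
    print(same_address(100, "R4", 20, "R6", {"R4": 0, "R6": 80}))   # decidable
    print(same_address(20, "R6", 20, "R6", {}))                     # same base
```

When the result is None, a conservative compiler keeps the original order of the two memory operations.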

Where are the data dependencies?

1 Loop: LD    F0,0(R1)
2       ADDD  F4,F0,F2
3       SUBI  R1,R1,8
4       BNEZ  R1,Loop     ;delayed branch
5       SD    8(R1),F     ;altered when move past SUBI