CIS371: Instruction Scheduling and Pipelining - Prof. A. Roth, Study notes of Computer Science

Study notes on instruction scheduling and pipelining in the context of computer organization. Topics include compiler scheduling, utilization, SAXPY performance, pipeline scheduling, loop unrolling, and dynamic scheduling. The notes also discuss the importance of independent instructions, the role of the compiler in scheduling, and the impact of unrolling on performance and utilization.


Uploaded on 03/28/2010


CIS 371 (Roth/Martin): Scheduling 1

CIS 371

Computer Organization and Design

Unit 8: Static and Dynamic Scheduling

This Unit: Code Scheduling

• Pipelining and superscalar review
• Code scheduling
  - To reduce pipeline stalls
  - To increase ILP (insn-level parallelism)
• Two approaches
  - Static scheduling by the compiler
  - Dynamic scheduling by the hardware

[Figure: system stack: apps, system software, CPU, Mem, I/O]

Readings

• P+H
  - Chapter 6.9 (again)

Pipelining Review

• Increases clock frequency by staging instruction execution
• “Scalar” pipelines have a best-case CPI of 1
• Challenges:
  - Data and control dependences push CPI above 1
  - Data: even with full bypassing, load-to-use stalls remain
  - Control: use branch prediction to mitigate penalty
• Big win, done by all processors today
• How many stages (depth)?
  - Five stages is a pretty good minimum
  - Intel Pentium II/III: 12 stages
  - Intel Pentium 4: 22+ stages
  - Intel Core 2: 14 stages

Pipeline Diagram

• Use compiler scheduling to reduce load-use stall frequency
  - Like software interlocks, but for performance, not correctness

Original order (addi stalls on the load):

                  1  2  3  4  5  6  7  8  9
  add $3,$2,$1    F  D  X  M  W
  lw $4,4($3)        F  D  X  M  W
  addi $6,$4,1          F  D  d* X  M  W
  sub $8,$3,$1             F  d* D  X  M  W

Scheduled order (sub moved between the load and its use):

                  1  2  3  4  5  6  7  8  9
  add $3,$2,$1    F  D  X  M  W
  lw $4,4($3)        F  D  X  M  W
  sub $8,$3,$1          F  D  X  M  W
  addi $6,$4,1             F  D  X  M  W
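A rough way to see the effect of the reordering is to count load-use stalls for each instruction order. The Python sketch below is illustrative only: the `Insn` tuple and the `load_use_stalls` helper are hypothetical names, and it assumes the 5-stage pipeline with full bypassing from the diagram, where an insn stalls (for one cycle) only if it immediately follows a load and reads the loaded register.

```python
from typing import List, NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    op: str
    dst: Optional[str]       # destination register, if any
    srcs: Tuple[str, ...]    # source registers


def load_use_stalls(code: List[Insn]) -> int:
    """Count one stall per insn that reads the result of the load right before it."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev.op == "lw" and prev.dst in cur.srcs:
            stalls += 1
    return stalls


# The four insns from the diagram (register names as written there).
add = Insn("add", "$3", ("$2", "$1"))
lw = Insn("lw", "$4", ("$3",))
addi = Insn("addi", "$6", ("$4",))
sub = Insn("sub", "$8", ("$3", "$1"))

original = [add, lw, addi, sub]
scheduled = [add, lw, sub, addi]

print(load_use_stalls(original))   # 1
print(load_use_stalls(scheduled))  # 0
```

The swap is legal here because sub and addi do not depend on each other; a real scheduler would check that before reordering.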

Superscalar Pipeline Review

• Execute two or more instructions per cycle
• Challenges:
  - Wide fetch (branch prediction harder, misprediction more costly)
  - Wide decode (stall logic)
  - Wide execute (more ALUs)
  - Wide bypassing (more possible bypassing paths)
  - Finding enough independent instructions (and filling delay slots)
• How many instructions per cycle max (width)?
  - Really simple, low-power cores are still scalar (single issue)
  - Even some low-power cores are dual-issue (Intel Atom, aka Silverthorne)
  - Most desktop/laptop chips are three-issue or four-issue
  - A few 5- or 6-issue chips have been built (IBM Power4, Itanium II)

Superscalar Pipeline Diagrams - Ideal

scalar              1  2  3  4  5  6  7  8  9  10 11 12
  lw 0(r1)→r2       F  D  X  M  W
  lw 4(r1)→r3          F  D  X  M  W
  lw 8(r1)→r4             F  D  X  M  W
  add r14,r15→r6             F  D  X  M  W
  add r12,r13→r7                F  D  X  M  W
  add r17,r16→r8                   F  D  X  M  W
  lw 0(r18)→r9                        F  D  X  M  W

2-way superscalar   1  2  3  4  5  6  7  8  9  10 11
  lw 0(r1)→r2       F  D  X  M  W
  lw 4(r1)→r3       F  D  X  M  W
  lw 8(r1)→r4          F  D  X  M  W
  add r14,r15→r6       F  D  X  M  W
  add r12,r13→r7          F  D  X  M  W
  add r17,r16→r8          F  D  X  M  W
  lw 0(r18)→r9               F  D  X  M  W

Superscalar Pipeline Diagrams - Realistic

scalar              1  2  3  4  5  6  7  8  9  10 11 12
  lw 0(r1)→r2       F  D  X  M  W
  lw 4(r1)→r3          F  D  X  M  W
  lw 8(r1)→r4             F  D  X  M  W
  add r4,r5→r6               F  d* D  X  M  W
  add r2,r3→r7                     F  D  X  M  W
  add r7,r6→r8                        F  D  X  M  W
  lw 0(r8)→r9                            F  D  X  M  W

2-way superscalar   1  2  3  4  5  6  7  8  9  10 11
  lw 0(r1)→r2       F  D  X  M  W
  lw 4(r1)→r3       F  D  X  M  W
  lw 8(r1)→r4          F  D  X  M  W
  add r4,r5→r6         F  d* d* D  X  M  W
  add r2,r3→r7            F  d* D  X  M  W
  add r7,r6→r8                  F  D  X  M  W
  lw 0(r8)→r9                   F  d* D  X  M  W

Compiler Scheduling Requires

• Alias analysis
  - Ability to tell whether a load and a store reference the same memory location
  - Effectively, whether the load and store can be rearranged
  - Example code: easy, all loads/stores use the same base register (sp)
  - New example: can the compiler tell that r8 != sp?
  - Must be conservative

  Before                      Wrong(?)
  ld r2,4(sp)                 ld r2,4(sp)
  ld r3,8(sp)                 ld r3,8(sp)
  add r3,r2,r1 //stall        ld r5,0(r8) //does r8==sp?
  st r1,0(sp)                 add r3,r2,r1
  ld r5,0(r8)                 ld r6,4(r8) //does r8+4==sp?
  ld r6,4(r8)                 st r1,0(sp)
  sub r5,r6,r4 //stall        sub r5,r6,r4
  st r4,8(r8)                 st r4,8(r8)
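The "must be conservative" rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular compiler does it: `MemRef`, `may_alias`, and `can_hoist_load_above_store` are hypothetical names, accesses are assumed to be 4 bytes wide, and the base register is assumed not to change between the two instructions being compared.

```python
from typing import NamedTuple


class MemRef(NamedTuple):
    base: str        # base register name, e.g. "sp" or "r8"
    offset: int      # constant byte offset
    size: int = 4    # access width in bytes (assumed word-sized)


def may_alias(a: MemRef, b: MemRef) -> bool:
    """Conservative check: answer False only when the accesses provably differ."""
    if a.base != b.base:
        # Nothing is known about how the two registers relate, so assume overlap.
        return True
    # Same base register: byte ranges overlap iff each starts before the other ends.
    return a.offset < b.offset + b.size and b.offset < a.offset + a.size


def can_hoist_load_above_store(load: MemRef, store: MemRef) -> bool:
    return not may_alias(load, store)


# ld r3,8(sp) vs. st r1,0(sp): same base, disjoint offsets, safe to reorder.
print(can_hoist_load_above_store(MemRef("sp", 8), MemRef("sp", 0)))  # True
# ld r5,0(r8) vs. st r1,0(sp): different bases, so the answer must be "no".
print(can_hoist_load_above_store(MemRef("r8", 0), MemRef("sp", 0)))  # False
```

A stronger analysis could prove r8 != sp in some programs; this sketch simply gives up whenever the base registers differ.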

Code Example: SAXPY

• SAXPY (Single-precision A X Plus Y)
  - Linear algebra routine (used in solving systems of equations)
  - Part of early “Livermore Loops” benchmark suite

  for (i=0;i<N;i++)
    Z[i]=A*X[i]+Y[i];

  0: ldf X(r1)→f1     // loop
  1: mulf f0,f1→f2    // A in f0
  2: ldf Y(r1)→f3     // X,Y,Z are constant addresses
  3: addf f2,f3→f4
  4: stf f4→Z(r1)
  5: addi r1,4→r1     // i in r1
  6: blt r1,r2,0      // N*4 in r2

New Metric: Utilization

• Utilization: actual performance / peak performance
  - Important metric for performance/cost
  - No point in paying for hardware you will rarely use
  - Adding hardware usually improves performance & reduces utilization
    - Additional hardware can only be exploited some of the time
    - Diminishing marginal returns
  - Compiler can help make better use of existing hardware
  - Important for superscalar

SAXPY Performance and Utilization

• Scalar pipeline
  - Full bypassing, 5-cycle E*, 2-cycle E+, branches predicted taken
  - Single iteration (7 insns) latency: 16 - 5 = 11 cycles
  - Performance: 7 insns / 11 cycles = 0.64 IPC
  - Utilization: 0.64 actual IPC / 1 peak IPC = 64%

                  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
  ldf X(r1)→f1    F  D  X  M  W
  mulf f0,f1→f2      F  D  d* E* E* E* E* E* W
  ldf Y(r1)→f3          F  p* D  X  M  W
  addf f2,f3→f4               F  D  d* d* d* E+ E+ W
  stf f4→Z(r1)                   F  p* p* p* D  X  M  W
  addi r1,4→r1                               F  D  X  M  W
  blt r1,r2,0                                   F  D  X  M  W
  ldf X(r1)→f1                                     F  D  X  M  W

SAXPY Performance and Utilization

• 2-way superscalar pipeline
  - Any two insns per cycle + split integer and floating point pipelines
  + Performance: 7 insns / 10 cycles = 0.70 IPC
  - Utilization: 0.70 actual IPC / 2 peak IPC = 35%
  - More hazards → more stalls
  - Each stall is more expensive

                  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
  ldf X(r1)→f1    F  D  X  M  W
  mulf f0,f1→f2   F  D  d* d* E* E* E* E* E* W
  ldf Y(r1)→f3       F  D  p* X  M  W
  addf f2,f3→f4      F  p* p* D  d* d* d* d* E+ E+ W
  stf f4→Z(r1)          F  p* D  p* p* p* p* d* X  M  W
  addi r1,4→r1             F  p* p* p* p* p* D  X  M  W
  blt r1,r2,0              F  p* p* p* p* p* D  d* X  M  W
  ldf X(r1)→f1                                  F  D  X  M  W
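The IPC and utilization figures on the two SAXPY slides follow directly from the definition of utilization. A minimal Python sketch of the arithmetic, using the instruction and cycle counts quoted on the slides (the function names `ipc` and `utilization` are just illustrative):

```python
def ipc(insns: int, cycles: int) -> float:
    """Instructions per cycle actually achieved."""
    return insns / cycles


def utilization(actual_ipc: float, peak_ipc: float) -> float:
    """Fraction of peak throughput actually used."""
    return actual_ipc / peak_ipc


scalar_ipc = ipc(7, 11)   # scalar pipeline: 7 insns per 11-cycle iteration
super_ipc = ipc(7, 10)    # 2-way superscalar: 7 insns per 10-cycle iteration

print(f"scalar: {scalar_ipc:.2f} IPC, {utilization(scalar_ipc, 1):.0%} utilization")
print(f"2-way:  {super_ipc:.2f} IPC, {utilization(super_ipc, 2):.0%} utilization")
```

Doubling the peak width bought one cycle per iteration, so performance rose slightly while utilization fell by almost half.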

(Compiler) Instruction Scheduling

• Idea: place independent insns between slow ops and uses
  - Otherwise, pipeline stalls while waiting for RAW hazards to resolve
  - Have already seen pipeline scheduling
• To schedule well you need … independent insns
• Scheduling scope: code region we are scheduling
  - The bigger the better (more independent insns to choose from)
  - Once scope is defined, schedule is pretty obvious
  - Trick is creating a large scope (must schedule across branches)
• Compiler scheduling (really scope enlarging) techniques
  - Loop unrolling (for loops)

Loop Unrolling SAXPY

• Goal: separate dependent insns from one another
• SAXPY problem: not enough flexibility within one iteration
  - Longest chain of insns is 9 cycles
    - Load (1)
    - Forward to multiply (5)
    - Forward to add (2)
    - Forward to store (1)
  - Can’t hide a 9-cycle chain using only 7 insns
  - But how about two 9-cycle chains using 14 insns?
• Loop unrolling: schedule two or more iterations together
  - Fuse iterations
  - Schedule to reduce stalls
  - Schedule introduces ordering problems, rename registers to fix
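The 9-cycle figure can be checked as the longest latency-weighted chain of true (RAW) dependences within one SAXPY iteration. The Python sketch below assumes the latencies listed on the slide (1 for the load and store, 5 for mulf, 2 for addf) plus 1 cycle for the integer insns; `LAT`, `DEPS`, and `chain` are names made up for this illustration.

```python
from functools import lru_cache

# One SAXPY iteration: index -> (insn, latency in cycles)
LAT = {
    0: ("ldf X", 1),
    1: ("mulf", 5),
    2: ("ldf Y", 1),
    3: ("addf", 2),
    4: ("stf", 1),
    5: ("addi", 1),
    6: ("blt", 1),
}

# True (RAW) dependences only: insn -> insns whose results it reads
DEPS = {0: [], 1: [0], 2: [], 3: [1, 2], 4: [3], 5: [], 6: [5]}


@lru_cache(maxsize=None)
def chain(i: int) -> int:
    """Length in cycles of the longest dependence chain ending at insn i."""
    return LAT[i][1] + max((chain(d) for d in DEPS[i]), default=0)


longest = max(chain(i) for i in LAT)
print(longest)  # 9: ldf X (1) + mulf (5) + addf (2) + stf (1)
```

With only 7 insns in the iteration, there are not enough independent insns to cover a 9-cycle chain, which is the motivation for scheduling two iterations together.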

Unrolling SAXPY I: Fuse Iterations

• Combine two (in general K) iterations of loop
  - Fuse loop control: induction variable (i) increment + branch
  - Adjust (implicit) induction uses: constants → constants + 4

  Two iterations          Fused
  ldf X(r1),f1            ldf X(r1),f1
  mulf f0,f1,f2           mulf f0,f1,f2
  ldf Y(r1),f3            ldf Y(r1),f3
  addf f2,f3,f4           addf f2,f3,f4
  stf f4,Z(r1)            stf f4,Z(r1)
  addi r1,4,r1            ldf X+4(r1),f1
  blt r1,r2,0             mulf f0,f1,f2
  ldf X(r1),f1            ldf Y+4(r1),f3
  mulf f0,f1,f2           addf f2,f3,f4
  ldf Y(r1),f3            stf f4,Z+4(r1)
  addf f2,f3,f4           addi r1,8,r1
  stf f4,Z(r1)            blt r1,r2,0
  addi r1,4,r1
  blt r1,r2,0
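The same fusion step can be written at source level. The Python sketch below is only an illustration: `saxpy` mirrors the original loop, and `saxpy_unrolled2` fuses two iterations so that the loop control (increment and test) runs once per pair. The cleanup step for an odd element count is an addition of this sketch; the slide's assembly implicitly assumes an even trip count.

```python
from typing import List


def saxpy(a: float, x: List[float], y: List[float]) -> List[float]:
    """Reference loop: one element per iteration."""
    z = [0.0] * len(x)
    for i in range(len(x)):
        z[i] = a * x[i] + y[i]
    return z


def saxpy_unrolled2(a: float, x: List[float], y: List[float]) -> List[float]:
    """Two iterations fused: one increment and one branch per two elements."""
    n = len(x)
    z = [0.0] * n
    i = 0
    while i + 1 < n:
        z[i] = a * x[i] + y[i]               # first copy of the body
        z[i + 1] = a * x[i + 1] + y[i + 1]   # second copy, offsets adjusted by one element
        i += 2                               # fused induction-variable update
    if i < n:                                # leftover element when n is odd (not on the slide)
        z[i] = a * x[i] + y[i]
    return z


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 20.0, 30.0, 40.0, 50.0]
print(saxpy_unrolled2(2.0, xs, ys) == saxpy(2.0, xs, ys))  # True
```

Fusing alone does not remove stalls; it only enlarges the scheduling scope so the two bodies can then be interleaved.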

Anything The Compiler Can Do…

• Dynamically-scheduled processors
  - Hardware re-schedules insns…
  - …within a sliding window of von Neumann insns
  - Does loop unrolling transparently
  - Does equivalent of loop unrolling on non-loop code
  - Uses branch prediction to “unroll” branches
  - Pentium Pro/II/III (3-wide), Core/2 (4-wide), Alpha 21264 (4-wide), MIPS R10000 (4-wide), Power5 (5-wide)
• Quick overview of approach
  - Lots more information in CIS501 (graduate level architecture)

The Problem With In-Order Pipelines

                  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
  addf f0,f1→f2   F  D  E+ E+ E+ W
  mulf f2,f3→f2      F  D  d* d* E* E* E* E* E* W
  subf f0,f1→f4         F  p* p* D  E+ E+ E+ W

• What’s happening in cycle 4?
  - mulf stalls due to data dependence
  - OK, this is a fundamental problem
  - subf stalls due to pipeline hazard
  - Why? subf can’t proceed into D because mulf is there
  - That is the only reason, and it isn’t a fundamental one
  - Maintaining in-order writes to register file
• Why can’t subf go into D in cycle 4 and E+ in cycle 5?

Code Example

• Code (raw insns):
    add r2,r3→r1
    sub r2,r1→r3
    mul r2,r3→r3
    div r1,4→r1
• Divide insn independent of subtract and multiply insns
  - Can execute in parallel with subtract
• Many registers re-used
  - Just as in static scheduling, the register names get in the way
  - How does the hardware get around this?
• Approach: (step #1) rename registers, (step #2) schedule

Step #1: Register Renaming

• To eliminate register conflicts/hazards
• “Architected” vs “Physical” registers
  - Names: r1,r2,r3
  - Locations: p1,p2,p3,p4,p5,p6,p7
  - Original mapping: r1→p1, r2→p2, r3→p3; p4-p7 are “available”
  - Renaming: conceptually write each register once
    + Removes false dependences
    + Leaves true dependences intact!
  - When to reuse a physical register? After overwriting insn done

  MapTable        FreeList       Raw insns       Renamed insns
  r1  r2  r3
  p1  p2  p3      p4,p5,p6,p7    add r2,r3,r1    add p2,p3,p4
  p4  p2  p3      p5,p6,p7       sub r2,r1,r3    sub p2,p4,p5
  p4  p2  p5      p6,p7          mul r2,r3,r3    mul p2,p5,p6
  p4  p2  p6      p7             div r1,4,r1     div p4,4,p7

Register Renaming Algorithm

• Data structures:
  - maptable[architectural_reg] → physical_reg
  - Free list: get/put free register
• Algorithm: at decode, for each instruction:
    insn.phys_input1 = maptable[insn.arch_input1]
    insn.phys_input2 = maptable[insn.arch_input2]
    insn.phys_to_free = maptable[insn.arch_output]
    new_reg = get_free_phys_reg()
    insn.phys_output = new_reg
    maptable[insn.arch_output] = new_reg
• At “commit”
  - Once all older instructions have committed, free register:
    put_free_phys_reg(insn.phys_to_free)
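The decode-time renaming steps above can be sketched as a short Python function. This is an illustration, not the hardware design: the `rename` function, the `(op, src1, src2, dst)` tuple layout, and the treatment of immediates (any operand not in the map table passes through unchanged) are assumptions of this sketch. The input is the four-insn example with the initial mapping r1→p1, r2→p2, r3→p3 and free list p4-p7.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

RawInsn = Tuple[str, str, str, str]           # (op, src1, src2, dst), architectural names
Renamed = Tuple[str, str, str, str, str]      # (op, src1, src2, dst, phys_to_free)


def rename(insns: List[RawInsn], maptable: Dict[str, str],
           free_list: List[str]) -> Tuple[List[Renamed], Dict[str, str]]:
    """Rename in program order; returns renamed insns and the final map table."""
    maptable = dict(maptable)                 # work on copies, leave inputs untouched
    free: Deque[str] = deque(free_list)
    out: List[Renamed] = []
    for op, src1, src2, dst in insns:
        p1 = maptable.get(src1, src1)         # immediates are not in the table
        p2 = maptable.get(src2, src2)
        to_free = maptable[dst]               # old mapping, freed when this insn commits
        new_reg = free.popleft()              # get_free_phys_reg()
        maptable[dst] = new_reg
        out.append((op, p1, p2, new_reg, to_free))
    return out, maptable


raw = [("add", "r2", "r3", "r1"),
       ("sub", "r2", "r1", "r3"),
       ("mul", "r2", "r3", "r3"),
       ("div", "r1", "4", "r1")]
renamed, final_map = rename(raw, {"r1": "p1", "r2": "p2", "r3": "p3"},
                            ["p4", "p5", "p6", "p7"])
for op, a, b, dst, to_free in renamed:
    print(f"{op} {a},{b},{dst}   (free {to_free} at commit)")
```

Each insn carries the physical register it displaced, so commit only needs to return that one register to the free list; the sketch does not model commit itself or running out of free registers.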

Step #2: Dynamic Scheduling

• Instructions are fetched/decoded/renamed into the instruction buffer
  - Also called “instruction window” or “instruction scheduler”
• Instructions (conceptually) check ready bits every cycle
  - Execute when ready

Dynamic Scheduling Algorithm

• Data structures:
  - Ready table[phys_reg] → yes/no
• Algorithm at “schedule” stage (prior to read registers):
    foreach instruction:
      if table[insn.phys_input1] == ready &&
         table[insn.phys_input2] == ready then
        mark insn as “ready”
    select the oldest “ready” instruction
    table[insn.phys_output] = ready
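The select loop above can be simulated on the renamed four-insn example. The Python sketch below makes several simplifying assumptions that are not on the slide: every insn has a 1-cycle latency, up to `width` insns issue per cycle (2 here, so that out-of-order issue becomes visible), a result becomes ready for the following cycle, and operands that are not physical registers (the immediate 4) are always ready. The `schedule` function is a hypothetical name.

```python
from typing import Iterable, List, Tuple

Insn = Tuple[str, str, str, str]   # (op, phys_input1, phys_input2, phys_output)


def schedule(insns: List[Insn], initially_ready: Iterable[str],
             width: int = 2) -> List[List[str]]:
    """Return the ops issued in each cycle, oldest-ready-first."""
    ready = set(initially_ready)
    waiting = list(insns)          # kept in program order, oldest first
    issued_per_cycle: List[List[str]] = []

    def is_ready(insn: Insn) -> bool:
        # Immediates (anything not named like a physical register) count as ready.
        return all(src in ready or not src.startswith("p") for src in insn[1:3])

    while waiting:
        picks = [i for i in waiting if is_ready(i)][:width]   # oldest ready insns
        if not picks:
            raise RuntimeError("no insn can ever become ready")
        for insn in picks:
            waiting.remove(insn)
        ready.update(insn[3] for insn in picks)   # outputs usable from next cycle
        issued_per_cycle.append([insn[0] for insn in picks])
    return issued_per_cycle


renamed = [("add", "p2", "p3", "p4"),
           ("sub", "p2", "p4", "p5"),
           ("mul", "p2", "p5", "p6"),
           ("div", "p4", "4", "p7")]
print(schedule(renamed, ["p1", "p2", "p3"]))
# [['add'], ['sub', 'div'], ['mul']]
```

The div issues alongside the sub and ahead of the older mul, which is the out-of-order behavior the Code Example slide anticipates.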

Dynamic Scheduling - OoO Execution

• Dynamic scheduling
  - Totally in the hardware
  - Also called “out-of-order execution” (OoO)
• Fetch many instructions into instruction window
  - Use branch prediction to speculate past (multiple) branches
  - Flush pipeline on branch misprediction
• Rename to avoid false dependencies
• Execute instructions as soon as possible
  - Register dependencies are known
  - Handling memory dependencies is trickier
• “Commit” instructions in order
  - If anything strange happens before commit, just flush the pipeline
• Current machines: 100+ instruction scheduling window