





These notes cover concepts related to instruction scheduling and pipelining in computer organization. Topics include compiler scheduling, utilization, SAXPY performance, pipeline scheduling, loop unrolling, and dynamic scheduling. They also discuss the importance of independent instructions, the role of the compiler in scheduling, and the impact of unrolling on performance and utilization.






CIS 371 (Roth/Martin): Scheduling 1
- To reduce pipeline stalls
- To increase ILP (insn-level parallelism)

- Static scheduling by the compiler
- Dynamic scheduling by the hardware

- Chapter 6.9 (again)

- Data and control dependences further worsen CPI
- Data: even with full bypassing, load-to-use stalls remain
- Control: use branch prediction to mitigate the penalty

- Five stages is a pretty good minimum
- Intel Pentium II/III: 12 stages
- Intel Pentium 4: 22+ stages
- Intel Core 2: 14 stages
- Like software interlocks, but for performance, not correctness

                1  2  3  4  5  6  7  8  9
add $3,$2,$1    F  D  X  M  W
lw $4,4($3)        F  D  X  M  W
addi $6,$4,1          F  D  d* X  M  W
sub $8,$3,$1             F  d* D  X  M  W

                1  2  3  4  5  6  7  8  9
add $3,$2,$1    F  D  X  M  W
lw $4,4($3)        F  D  X  M  W
sub $8,$3,$1          F  D  X  M  W
addi $6,$4,1             F  D  X  M  W
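The effect of the reordering shown in the two diagrams can be checked with a small sketch. It assumes a 5-stage pipeline with full bypassing, so the only stall it models is a consumer placed immediately after the load that feeds it. The tuple encoding `(opcode, dest, sources)` and the function name `load_use_stalls` are illustrative choices, not part of the course material.

```python
def load_use_stalls(insns):
    """Count one-cycle load-to-use stalls for insns in program order.

    Each insn is (opcode, dest, sources). Assumption: full bypassing, so the
    only stall is a consumer placed directly after the load that produces
    one of its sources.
    """
    stalls = 0
    for prev, cur in zip(insns, insns[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls


# Original order: addi reads $4 right after the lw that writes it.
before = [
    ("add", "$3", ("$2", "$1")),
    ("lw", "$4", ("$3",)),
    ("addi", "$6", ("$4",)),
    ("sub", "$8", ("$3", "$1")),
]
# Scheduled order: the independent sub is moved between lw and addi.
after = [before[0], before[1], before[3], before[2]]

print(load_use_stalls(before), load_use_stalls(after))
```

Moving the independent `sub` into the slot after the load removes the stall without changing any register dependence.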
- wide fetch (branch prediction harder, misprediction more costly)
- wide decode (stall logic)
- wide execute (more ALUs)
- wide bypassing (more possible bypassing paths)
- Finding enough independent instructions (and filling delay slots)

- Really simple, low-power cores are still scalar (single issue)
- Even some low-power cores are dual-issue (Intel Atom, aka Silverthorne)
- Most desktop/laptop chips are three-issue or four-issue
- A few 5- or 6-issue chips have been built (IBM Power4, Itanium II)
scalar            1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1)→r2       F  D  X  M  W
lw 4(r1)→r3          F  D  X  M  W
lw 8(r1)→r4             F  D  X  M  W
add r14,r15→r6             F  D  X  M  W
add r12,r13→r7                F  D  X  M  W
add r17,r16→r8                   F  D  X  M  W
lw 0(r18)→r9                        F  D  X  M  W

2-way superscalar 1  2  3  4  5  6  7  8  9  10 11
lw 0(r1)→r2       F  D  X  M  W
lw 4(r1)→r3       F  D  X  M  W
lw 8(r1)→r4          F  D  X  M  W
add r14,r15→r6       F  D  X  M  W
add r12,r13→r7          F  D  X  M  W
add r17,r16→r8          F  D  X  M  W
lw 0(r18)→r9               F  D  X  M  W
scalar            1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1)→r2       F  D  X  M  W
lw 4(r1)→r3          F  D  X  M  W
lw 8(r1)→r4             F  D  X  M  W
add r4,r5→r6               F  d* D  X  M  W
add r2,r3→r7                     F  D  X  M  W
add r7,r6→r8                        F  D  X  M  W
lw 0(r8)→r9                            F  D  X  M  W

2-way superscalar 1  2  3  4  5  6  7  8  9  10 11
lw 0(r1)→r2       F  D  X  M  W
lw 4(r1)→r3       F  D  X  M  W
lw 8(r1)→r4          F  D  X  M  W
add r4,r5→r6         F  d* d* D  X  M  W
add r2,r3→r7            F  d* D  X  M  W
add r7,r6→r8                  F  D  X  M  W
lw 0(r8)→r9                   F  d* D  X  M  W
- Ability to tell whether loads/stores reference the same memory locations
- Effectively, whether loads/stores can be rearranged
- Example code: easy, all loads/stores use the same base register (sp)
- New example: can the compiler tell that r8 != sp?
- Must be conservative

Before                    Wrong(?)
ld r2,4(sp)               ld r2,4(sp)
ld r3,8(sp)               ld r3,8(sp)
add r3,r2,r1 //stall      ld r5,0(r8) //does r8==sp?
st r1,0(sp)               add r3,r2,r1
ld r5,0(r8)               ld r6,4(r8) //does r8+4==sp?
ld r6,4(r8)               st r1,0(sp)
sub r5,r6,r4 //stall      sub r5,r6,r4
st r4,8(r8)               st r4,8(r8)
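A minimal sketch of the conservative test described above, under two stated assumptions: every access is an aligned word, and a base register is not modified between the two references being compared. The `(base_register, offset)` encoding and the name `may_alias` are hypothetical.

```python
def may_alias(ref_a, ref_b):
    """Conservatively decide whether two memory references may overlap.

    Each reference is (base_register, offset). Assumptions: aligned
    word-sized accesses, and the base register is unchanged between the
    two references.
    """
    base_a, off_a = ref_a
    base_b, off_b = ref_b
    if base_a == base_b:
        # Same base: the offsets alone decide.
        return off_a == off_b
    # Different bases: the values are unknown at compile time, so assume
    # the references may overlap.
    return True


print(may_alias(("sp", 4), ("sp", 0)))   # same base, different offsets
print(may_alias(("r8", 0), ("sp", 0)))   # different bases
```

A scheduler built on this test would move a load above a store only when `may_alias` returns `False`, which is why the `r8` loads in the second example cannot safely be hoisted above `st r1,0(sp)`.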
- SAXPY: linear algebra routine (used in solving systems of equations)
- Part of the early "Livermore Loops" benchmark suite

for (i=0;i<N;i++) Z[i]=A*X[i]+Y[i];

0: ldf X(r1)→f1     // loop
1: mulf f0,f1→f2    // A in f0
2: ldf Y(r1)→f3     // X,Y,Z are constant addresses
3: addf f2,f3→f4
4: stf f4→Z(r1)
5: addi r1,4→r1     // i in r1
6: blt r1,r2,0      // N*4 in r2
- Utilization: an important metric for performance/cost
- No point in paying for hardware you will rarely use
- Adding hardware usually improves performance and reduces utilization
- Additional hardware can only be exploited some of the time
- Diminishing marginal returns
- Compiler can help make better use of existing hardware
- Important for superscalar
- Full bypassing, 5-cycle E*, 2-cycle E+, branches predicted taken
- Single iteration (7 insns) latency: 16 - 5 = 11 cycles
- Performance: 7 insns / 11 cycles = 0.64 IPC
- Utilization: 0.64 actual IPC / 1 peak IPC = 64%

                1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
ldf X(r1)→f1    F  D  X  M  W
mulf f0,f1→f2      F  D  d* E* E* E* E* E* W
ldf Y(r1)→f3          F  p* D  X  M  W
addf f2,f3→f4               F  D  d* d* d* E+ E+ W
stf f4→Z(r1)                   F  p* p* p* D  X  M  W
addi r1,4→r1                               F  D  X  M  W
blt r1,r2,0                                   F  D  X  M  W
ldf X(r1)→f1                                     F  D  X  M  W
- Any two insns per cycle + split integer and floating point pipelines
+ Performance: 7 insns / 10 cycles = 0.70 IPC
- Utilization: 0.70 actual IPC / 2 peak IPC = 35%
- More hazards → more stalls
- Each stall is more expensive

                1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
ldf X(r1)→f1    F  D  X  M  W
mulf f0,f1→f2   F  D  d* d* E* E* E* E* E* W
ldf Y(r1)→f3       F  D  p* X  M  W
addf f2,f3→f4      F  p* p* D  d* d* d* d* E+ E+ W
stf f4→Z(r1)          F  p* D  p* p* p* p* d* X  M  W
addi r1,4→r1          F  p* p* p* p* p* D  X  M  W
blt r1,r2,0              F  p* p* p* p* p* D  d* X  M  W
ldf X(r1)→f1                                  F  D  X  M  W
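The performance and utilization figures for the scalar and 2-way superscalar pipelines follow from two divisions. The sketch below just reproduces that arithmetic; the function names are illustrative.

```python
def ipc(insns, cycles):
    """Instructions per cycle for one loop iteration."""
    return insns / cycles


def utilization(actual_ipc, peak_ipc):
    """Fraction of the machine's peak issue rate that is actually used."""
    return actual_ipc / peak_ipc


# Scalar pipeline: 7 insns in 11 cycles, peak of 1 insn per cycle.
scalar_ipc = ipc(7, 11)
scalar_util = utilization(scalar_ipc, 1)

# 2-way superscalar: 7 insns in 10 cycles, peak of 2 insns per cycle.
super_ipc = ipc(7, 10)
super_util = utilization(super_ipc, 2)

print(f"scalar:      {scalar_ipc:.2f} IPC, {scalar_util:.0%} utilization")
print(f"superscalar: {super_ipc:.2f} IPC, {super_util:.0%} utilization")
```

Doubling the peak issue width buys roughly a 10% gain in IPC here while utilization falls from 64% to 35%, which is the diminishing-returns point made earlier.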
- Otherwise, pipeline stalls while waiting for RAW hazards to resolve
- Have already seen pipeline scheduling

- The bigger the better (more independent insns to choose from)
- Once scope is defined, schedule is pretty obvious
- Trick is creating a large scope (must schedule across branches)

- Loop unrolling (for loops)
- Longest chain of insns is 9 cycles
  - Load (1)
  - Forward to multiply (5)
  - Forward to add (2)
  - Forward to store (1)
- Can't hide a 9-cycle chain using only 7 insns
- But how about two 9-cycle chains using 14 insns?

- Fuse iterations
- Schedule to reduce stalls
- Schedule introduces ordering problems, rename registers to fix
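The argument about the 9-cycle chain can be written as a rough bound. This is an idealized sketch, assuming a single-issue pipeline, a perfect schedule inside one loop body, and no overlap between consecutive bodies; it also uses the plain count of 7 insns per iteration and ignores the insns saved by fusing loop control. The function names are hypothetical.

```python
def chain_latency(latencies):
    """Length in cycles of a dependence chain, given per-insn latencies."""
    return sum(latencies)


def cycles_per_iteration(unroll, insns_per_iter=7, chain=9):
    """Idealized lower bound on cycles per original iteration.

    A loop body needs at least one cycle per insn on a single-issue
    pipeline, and at least as many cycles as its longest dependence chain.
    """
    body_insns = unroll * insns_per_iter
    return max(body_insns, chain) / unroll


# load (1) -> multiply (5) -> add (2) -> store (1)
saxpy_chain = chain_latency([1, 5, 2, 1])

print(saxpy_chain)
print(cycles_per_iteration(1, chain=saxpy_chain))
print(cycles_per_iteration(2, chain=saxpy_chain))
```

With one iteration per body the chain is the bottleneck; with two iterations per body there are enough independent insns to cover it, so the bound drops to the insn count.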
- Fuse loop control: induction variable (i) increment + branch
- Adjust (implicit) induction uses: constants → constants + 4

Two iterations        Fused
ldf X(r1),f1          ldf X(r1),f1
mulf f0,f1,f2         mulf f0,f1,f2
ldf Y(r1),f3          ldf Y(r1),f3
addf f2,f3,f4         addf f2,f3,f4
stf f4,Z(r1)          stf f4,Z(r1)
addi r1,4,r1
blt r1,r2,0
ldf X(r1),f1          ldf X+4(r1),f1
mulf f0,f1,f2         mulf f0,f1,f2
ldf Y(r1),f3          ldf Y+4(r1),f3
addf f2,f3,f4         addf f2,f3,f4
stf f4,Z(r1)          stf f4,Z+4(r1)
addi r1,4,r1          addi r1,8,r1
blt r1,r2,0           blt r1,r2,0
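The same fusing step can be sketched at the source level. The example below assumes an unroll factor of 2 and an even element count (a real compiler would add a cleanup loop for the remainder); the function names are illustrative.

```python
def saxpy_rolled(a, x, y):
    """One element per iteration: one increment and one branch per element."""
    z = [0.0] * len(x)
    i = 0
    while i < len(x):
        z[i] = a * x[i] + y[i]
        i += 1
    return z


def saxpy_unrolled2(a, x, y):
    """Two elements per iteration with fused loop control.

    Assumption: len(x) is even. The second copy of the body uses index
    i + 1, mirroring the X+4/Y+4/Z+4 address adjustment, and the induction
    variable advances by 2 instead of 1.
    """
    n = len(x)
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even element count")
    z = [0.0] * n
    i = 0
    while i < n:
        z[i] = a * x[i] + y[i]
        z[i + 1] = a * x[i + 1] + y[i + 1]
        i += 2
    return z


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(saxpy_rolled(2.0, x, y))
print(saxpy_unrolled2(2.0, x, y))
```

Both versions compute the same result; the unrolled one executes half as many increments and branches and exposes two independent bodies to the scheduler.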
- Hardware re-schedules insns…
- …within a sliding window of von Neumann insns
- Does loop unrolling transparently
- Does the equivalent of loop unrolling on non-loop code
- Uses branch prediction to "unroll" branches
- Pentium Pro/II/III (3-wide), Core/2 (4-wide), Alpha 21264 (4-wide), MIPS R10000 (4-wide), Power5 (5-wide)

- Lots more information in CIS501 (graduate-level architecture)
                1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
addf f0,f1→f2   F  D  E+ E+ E+ W
mulf f2,f3→f2      F  D  d* d* E* E* E* E* E* W
subf f0,f1→f4         F  p* p* D  E+ E+ E+ W
- mulf stalls due to a data dependence
  - OK, this is a fundamental problem
- subf stalls due to a pipeline hazard
  - Why? subf can't proceed into D because mulf is there
  - That is the only reason, and it isn't a fundamental one
- Maintaining in-order writes to register file

- Can execute in parallel with subtract

- Just as in static scheduling, the register names get in the way
- How does the hardware get around this?
Raw insns
add r2,r3→r1
sub r2,r1→r3
mul r2,r3→r3
div r1,4→r1
- Names: r1,r2,r3
- Locations: p1,p2,p3,p4,p5,p6,p7
- Original mapping: r1→p1, r2→p2, r3→p3; p4–p7 are "available"
- Renaming: conceptually write each register once
+ Removes false dependences
+ Leaves true dependences intact
- When to reuse a physical register? After the overwriting insn is done

MapTable      FreeList       Raw insns       Renamed insns
r1  r2  r3
p1  p2  p3    p4,p5,p6,p7    add r2,r3,r1    add p2,p3,p4
p4  p2  p3    p5,p6,p7       sub r2,r1,r3    sub p2,p4,p5
p4  p2  p5    p6,p7          mul r2,r3,r3    mul p2,p5,p6
p4  p2  p6    p7             div r1,4,r1     div p4,4,p7
- maptable[architectural_reg] → physical_reg
- Free list: get/put free register

insn.phys_input1 = maptable[insn.arch_input1]
insn.phys_input2 = maptable[insn.arch_input2]
insn.phys_to_free = maptable[insn.arch_output]
new_reg = get_free_phys_reg()
insn.phys_output = new_reg
maptable[insn.arch_output] = new_reg
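The renaming steps can be run on the four-insn example to reproduce the map table walk-through. This is a sketch under stated assumptions: insns are encoded as `(op, src1, src2, dest)` tuples, immediates such as `4` pass through unchanged because they are not in the map table, and the free list never runs dry. The function name `rename` and the tuple layout are illustrative.

```python
from collections import deque


def rename(insns, maptable, free_list):
    """Rename (op, src1, src2, dest) tuples to physical registers.

    Returns (op, phys_in1, phys_in2, phys_out, phys_to_free) per insn.
    Operands missing from the map table (immediates) are left as-is.
    """
    renamed = []
    for op, src1, src2, dest in insns:
        in1 = maptable.get(src1, src1)
        in2 = maptable.get(src2, src2)
        to_free = maptable[dest]        # old mapping, freed at commit
        new_reg = free_list.popleft()   # get_free_phys_reg()
        maptable[dest] = new_reg
        renamed.append((op, in1, in2, new_reg, to_free))
    return renamed


maptable = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_list = deque(["p4", "p5", "p6", "p7"])
raw = [
    ("add", "r2", "r3", "r1"),
    ("sub", "r2", "r1", "r3"),
    ("mul", "r2", "r3", "r3"),
    ("div", "r1", "4", "r1"),
]
renamed = rename(raw, maptable, free_list)
for insn in renamed:
    print(insn)
```

Each destination gets a fresh physical register, so the write-after-write and write-after-read conflicts on `r1` and `r3` disappear, while `sub` still reads `p4`, the value `add` produces.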
- Once all older instructions have committed, free the register:

put_free_phys_reg(insn.phys_to_free)

[Figure: pipeline with I$, branch predictor (BP), insn buffer, decode (D) and schedule (S) stages, regfile, and D$. The insn buffer holds the renamed add, sub, mul, and div insns alongside a Ready Table with one yes/no entry per physical register.]
- Also called "instruction window" or "instruction scheduler"

- Execute when ready

- Ready table[phys_reg] → yes/no
foreach instruction:
    if table[insn.phys_input1] == ready &&
       table[insn.phys_input2] == ready then
        mark insn as "ready"
select the oldest "ready" instruction
table[insn.phys_output] = ready
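A runnable sketch of the wake-up and select loop, under simplifying assumptions: every insn has a one-cycle latency (so its output is marked ready in the cycle it issues), up to `width` insns issue per cycle, and immediates are left out of the input lists. The `(name, inputs, output)` encoding and the function name `schedule` are hypothetical.

```python
def schedule(insns, ready, width=1):
    """Return the insn names issued in each cycle.

    insns: list of (name, inputs, output) in program order, oldest first.
    ready: physical registers whose values are available at the start.
    Assumption: one-cycle latency, so outputs become ready at issue.
    """
    waiting = list(insns)
    ready = set(ready)
    cycles = []
    while waiting:
        # Wake-up: an insn is ready when all of its inputs are ready.
        woken = [i for i in waiting if all(r in ready for r in i[1])]
        if not woken:
            raise ValueError("remaining insns can never become ready")
        # Select: oldest ready insns first, up to the issue width.
        issued = woken[:width]
        for insn in issued:
            waiting.remove(insn)
            ready.add(insn[2])
        cycles.append([insn[0] for insn in issued])
    return cycles


renamed = [
    ("add", ("p2", "p3"), "p4"),
    ("sub", ("p2", "p4"), "p5"),
    ("mul", ("p2", "p5"), "p6"),
    ("div", ("p4",), "p7"),
]
print(schedule(renamed, {"p1", "p2", "p3"}, width=2))
```

With two issue slots, `div` issues alongside `sub` and ahead of the older `mul`, because its only register input `p4` is ready while `mul` still waits for `p5`.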
- Totally in the hardware
- Also called "out-of-order execution" (OoO)

- Use branch prediction to speculate past (multiple) branches
- Flush pipeline on branch misprediction

- Register dependences are known
- Handling memory dependences is more tricky

- If anything strange happens before commit, just flush the pipeline