CIS371: Instruction Scheduling and Pipelining - Prof. A. Roth, Study notes of Computer Science

University of Pennsylvania (UPenn)Computer Science

Prof. A. Roth

Various concepts related to instruction scheduling and pipelining in the context of computer organization. Topics include compiler scheduling, utilization, saxpy performance, pipeline scheduling, loop unrolling, and dynamic scheduling. The document also discusses the importance of independent instructions, the role of the compiler in scheduling, and the impact of unrolling on performance and utilization.

Typology: Study notes

Pre 2010

Uploaded on 03/28/2010


CIS 371 (Roth/Martin): Scheduling 1

CIS 371

Computer Organization and Design

Unit 8: Static and Dynamic Scheduling

This Unit: Code Scheduling

• Pipelining and superscalar review
• Code scheduling
  - To reduce pipeline stalls
  - To increase ILP (insn level parallelism)
• Two approaches
  - Static scheduling by the compiler
  - Dynamic scheduling by the hardware

[Figure: system layers, top to bottom: App / System software / Mem, CPU, I/O]

Readings

• P+H
  - Chapter 6.9 (again)

Pipelining Review

• Increases clock frequency by staging instruction execution
• “Scalar” pipelines have a best-case CPI of 1
• Challenges:
  - Data and control dependencies further worsen CPI
  - Data: even with full bypassing, load-to-use stalls remain
  - Control: use branch prediction to mitigate penalty
• Big win, done by all processors today
• How many stages (depth)?
  - Five stages is a pretty good minimum
  - Intel Pentium II/III: 12 stages
  - Intel Pentium 4: 22+ stages
  - Intel Core 2: 14 stages


Pipeline Diagram

• Use compiler scheduling to reduce load-use stall frequency
  - Like software interlocks, but for performance not correctness

Original order:

                  1  2  3  4  5  6  7  8  9
add $3,$2,$1      F  D  X  M  W
lw $4,4($3)          F  D  X  M  W
addi $6,$4,1            F  D  d* X  M  W
sub $8,$3,$1               F  d* D  X  M  W

Scheduled order:

                  1  2  3  4  5  6  7  8  9
add $3,$2,$1      F  D  X  M  W
lw $4,4($3)          F  D  X  M  W
sub $8,$3,$1            F  D  X  M  W
addi $6,$4,1               F  D  X  M  W
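The effect of the reordering above can be checked with a small counter. This is only a sketch: it assumes a 5-stage pipeline with full bypassing, where the only data stall is an instruction reading a load's destination in the very next slot. The helper names `parse` and `load_use_stalls` are made up for illustration.

```python
import re

def parse(insn):
    """Split 'op dst,src,...' into (op, dest, sources); also handles 'off($reg)' operands."""
    op, operands = insn.split(None, 1)
    regs = re.findall(r"\$\d+", operands)
    return op, regs[0], regs[1:]

def load_use_stalls(program):
    """Count insns that immediately follow a load and read the loaded register."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = parse(prev)
        _, _, cur_srcs = parse(cur)
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

original = ["add $3,$2,$1", "lw $4,4($3)", "addi $6,$4,1", "sub $8,$3,$1"]
scheduled = ["add $3,$2,$1", "lw $4,4($3)", "sub $8,$3,$1", "addi $6,$4,1"]

print(load_use_stalls(original), load_use_stalls(scheduled))
```

Moving the independent `sub` between the load and its use is what removes the stall; the instruction count is unchanged.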

Superscalar Pipeline Review

• Execute two or more instructions per cycle
• Challenges:
  - Wide fetch (branch prediction harder, misprediction more costly)
  - Wide decode (stall logic)
  - Wide execute (more ALUs)
  - Wide bypassing (more possible bypassing paths)
  - Finding enough independent instructions (and filling delay slots)
• How many instructions per cycle max (width)?
  - Really simple, low-power cores are still scalar (single issue)
  - Even low-power cores are dual-issue (Intel Atom, aka Silverthorne)
  - Most desktop/laptop chips are three-issue or four-issue
  - A few 5- or 6-issue chips have been built (IBM Power4, Itanium II)

Superscalar Pipeline Diagrams - Ideal

scalar              1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1) → r2       F  D  X  M  W
lw 4(r1) → r3          F  D  X  M  W
lw 8(r1) → r4             F  D  X  M  W
add r14,r15 → r6             F  D  X  M  W
add r12,r13 → r7                F  D  X  M  W
add r17,r16 → r8                   F  D  X  M  W
lw 0(r18) → r9                        F  D  X  M  W

2-way superscalar   1  2  3  4  5  6  7  8  9  10 11
lw 0(r1) → r2       F  D  X  M  W
lw 4(r1) → r3       F  D  X  M  W
lw 8(r1) → r4          F  D  X  M  W
add r14,r15 → r6       F  D  X  M  W
add r12,r13 → r7          F  D  X  M  W
add r17,r16 → r8          F  D  X  M  W
lw 0(r18) → r9               F  D  X  M  W

Superscalar Pipeline Diagrams - Realistic

scalar              1  2  3  4  5  6  7  8  9  10 11 12
lw 0(r1) → r2       F  D  X  M  W
lw 4(r1) → r3          F  D  X  M  W
lw 8(r1) → r4             F  D  X  M  W
add r4,r5 → r6               F  d* D  X  M  W
add r2,r3 → r7                     F  D  X  M  W
add r7,r6 → r8                        F  D  X  M  W
lw 0(r8) → r9                            F  D  X  M  W

2-way superscalar   1  2  3  4  5  6  7  8  9  10 11
lw 0(r1) → r2       F  D  X  M  W
lw 4(r1) → r3       F  D  X  M  W
lw 8(r1) → r4          F  D  X  M  W
add r4,r5 → r6         F  d* d* D  X  M  W
add r2,r3 → r7            F  d* D  X  M  W
add r7,r6 → r8                  F  D  X  M  W
lw 0(r8) → r9                   F  d* D  X  M  W

Compiler Scheduling Requires

• Alias analysis
  - Ability to tell whether a load and a store reference the same memory location
  - Effectively, whether the load/store can be rearranged
  - Example code: easy, all loads/stores use the same base register (sp)
  - New example: can the compiler tell that r8 != sp?
  - Must be conservative

Before:

  ld r2,4(sp)
  ld r3,8(sp)
  add r3,r2,r1    //stall
  st r1,0(sp)
  ld r5,0(r8)
  ld r6,4(r8)
  sub r5,r6,r4    //stall
  st r4,8(r8)

Wrong(?):

  ld r2,4(sp)
  ld r3,8(sp)
  ld r5,0(r8)     //does r8==sp?
  add r3,r2,r1
  ld r6,4(r8)     //does r8+4==sp?
  st r1,0(sp)
  sub r5,r6,r4
  st r4,8(r8)
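The "must be conservative" rule can be sketched as a tiny check. The code is illustrative only: it assumes each memory reference is described by a (base register, offset) pair with equal-sized, aligned accesses, and the function names are hypothetical rather than taken from any real compiler.

```python
def may_alias(ref_a, ref_b):
    """Return True unless the two references are provably different locations.

    Each ref is a (base_register, offset) pair.
    """
    base_a, off_a = ref_a
    base_b, off_b = ref_b
    if base_a == base_b:
        # Same base register: the offsets alone decide.
        return off_a == off_b
    # Different base registers: nothing is known about their values,
    # so the only safe answer is "they might alias".
    return True

def can_hoist_load_above_store(load_ref, store_ref):
    """A load may move above an earlier store only if they cannot alias."""
    return not may_alias(load_ref, store_ref)

# ld r3,8(sp) vs. st r1,0(sp): same base, different offsets, safe to reorder
print(can_hoist_load_above_store(("sp", 8), ("sp", 0)))
# ld r5,0(r8) vs. st r1,0(sp): does r8 == sp? Unknown, so do not reorder
print(can_hoist_load_above_store(("r8", 0), ("sp", 0)))
```

Under this rule the "Wrong(?)" schedule is rejected, because it moves loads based on r8 above the store to 0(sp) without proof that the addresses differ.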

Code Example: SAXPY

• SAXPY (Single-precision A X Plus Y)
  - Linear algebra routine (used in solving systems of equations)
  - Part of early “Livermore Loops” benchmark suite

  for (i=0;i<N;i++)
    Z[i]=A*X[i]+Y[i];

  0: ldf X(r1) → f1     // loop
  1: mulf f0,f1 → f2    // A in f0
  2: ldf Y(r1) → f3     // X,Y,Z are constant addresses
  3: addf f2,f3 → f4
  4: stf f4 → Z(r1)
  5: addi r1,4 → r1     // i in r1
  6: blt r1,r2,0        // N*4 in r2

New Metric: Utilization

• Utilization: actual performance / peak performance
  - Important metric for performance/cost
  - No point in paying for hardware you will rarely use
  - Adding hardware usually improves performance & reduces utilization
  - Additional hardware can only be exploited some of the time
  - Diminishing marginal returns
  - Compiler can help make better use of existing hardware
  - Important for superscalar

SAXPY Performance and Utilization

• Scalar pipeline
  - Full bypassing, 5-cycle E*, 2-cycle E+, branches predicted taken
  - Single iteration (7 insns) latency: 16–5 = 11 cycles
  - Performance: 7 insns / 11 cycles = 0.64 IPC
  - Utilization: 0.64 actual IPC / 1 peak IPC = 64%

                  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
ldf X(r1) → f1    F  D  X  M  W
mulf f0,f1 → f2      F  D  d* E* E* E* E* E* W
ldf Y(r1) → f3          F  p* D  X  M  W
addf f2,f3 → f4               F  D  d* d* d* E+ E+ W
stf f4 → Z(r1)                   F  p* p* p* D  X  M  W
addi r1,4 → r1                               F  D  X  M  W
blt r1,r2,0                                     F  D  X  M  W
ldf X(r1) → f1                                     F  D  X  M  W

SAXPY Performance and Utilization

• 2-way superscalar pipeline
  - Any two insns per cycle + split integer and floating point pipelines
  + Performance: 7 insns / 10 cycles = 0.70 IPC
  - Utilization: 0.70 actual IPC / 2 peak IPC = 35%
  - More hazards → more stalls
  - Each stall is more expensive

Stage sequence per insn (the original diagram spans cycles 1–20; cycle alignment omitted):

ldf X(r1) → f1    F D X M W
mulf f0,f1 → f2   F D d* d* E* E* E* E* E* W
ldf Y(r1) → f3    F D p* X M W
addf f2,f3 → f4   F p* p* D d* d* d* d* E+ E+ W
stf f4 → Z(r1)    F p* D p* p* p* p* d* X M W
addi r1,4 → r1    F p* p* p* p* p* D X M W
blt r1,r2,0       F p* p* p* p* p* D d* X M W
ldf X(r1) → f1    F D X M W
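The IPC and utilization figures for the two SAXPY runs follow directly from the definition of utilization. The sketch below just redoes that arithmetic; the function names are chosen here for illustration.

```python
def ipc(insns, cycles):
    """Instructions per cycle actually achieved."""
    return insns / cycles

def utilization(actual_ipc, peak_ipc):
    """Fraction of the machine's peak throughput that is actually used."""
    return actual_ipc / peak_ipc

scalar_ipc = ipc(7, 11)        # 7 insns per iteration, 11 cycles, peak IPC 1
superscalar_ipc = ipc(7, 10)   # same 7 insns, 10 cycles, peak IPC 2

print(f"scalar:      IPC {scalar_ipc:.2f}, utilization {utilization(scalar_ipc, 1):.0%}")
print(f"2-way super: IPC {superscalar_ipc:.2f}, utilization {utilization(superscalar_ipc, 2):.0%}")
```

Doubling the width saves a single cycle per iteration, so performance rises only slightly while utilization falls by almost half.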

(Compiler) Instruction Scheduling

• Idea: place independent insns between slow ops and uses
  - Otherwise, pipeline stalls while waiting for RAW hazards to resolve
  - Have already seen pipeline scheduling
• To schedule well you need … independent insns
• Scheduling scope: code region we are scheduling
  - The bigger the better (more independent insns to choose from)
  - Once scope is defined, schedule is pretty obvious
  - Trick is creating a large scope (must schedule across branches)
• Compiler scheduling (really scope enlarging) techniques
  - Loop unrolling (for loops)

Loop Unrolling SAXPY

• Goal: separate dependent insns from one another
• SAXPY problem: not enough flexibility within one iteration
  - Longest chain of insns is 9 cycles
    - Load (1)
    - Forward to multiply (5)
    - Forward to add (2)
    - Forward to store (1)
  - Can’t hide a 9-cycle chain using only 7 insns
  - But how about two 9-cycle chains using 14 insns?
• Loop unrolling: schedule two or more iterations together
  - Fuse iterations
  - Schedule to reduce stalls
  - Schedule introduces ordering problems, rename registers to fix

Unrolling SAXPY I: Fuse Iterations

• Combine two (in general K) iterations of loop
  - Fuse loop control: induction variable (i) increment + branch
  - Adjust (implicit) induction uses: constants → constants + 4

Before (two separate iterations):

  ldf X(r1),f1
  mulf f0,f1,f2
  ldf Y(r1),f3
  addf f2,f3,f4
  stf f4,Z(r1)
  addi r1,4,r1
  blt r1,r2,0
  ldf X(r1),f1
  mulf f0,f1,f2
  ldf Y(r1),f3
  addf f2,f3,f4
  stf f4,Z(r1)
  addi r1,4,r1
  blt r1,r2,0

After (fused):

  ldf X(r1),f1
  mulf f0,f1,f2
  ldf Y(r1),f3
  addf f2,f3,f4
  stf f4,Z(r1)
  ldf X+4(r1),f1
  mulf f0,f1,f2
  ldf Y+4(r1),f3
  addf f2,f3,f4
  stf f4,Z+4(r1)
  addi r1,8,r1
  blt r1,r2,0
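At source level, fusing two iterations looks roughly like the sketch below. It is an illustration, not the compiler's actual output: the function names are made up, the separate temporaries `t0` and `t1` stand in for renamed registers, and the odd-length cleanup step is an added assumption that the slide's assembly does not show.

```python
def saxpy(a, x, y):
    """Reference loop: one element per iteration."""
    z = [0.0] * len(x)
    for i in range(len(x)):
        z[i] = a * x[i] + y[i]
    return z

def saxpy_unrolled2(a, x, y):
    """Same computation with two iterations fused and one loop-control update."""
    n = len(x)
    z = [0.0] * n
    i = 0
    while i + 1 < n:
        # Both multiplies come first so each add sits farther from the
        # multiply it depends on (the point of scheduling the fused body).
        t0 = a * x[i]
        t1 = a * x[i + 1]
        z[i] = t0 + y[i]
        z[i + 1] = t1 + y[i + 1]
        i += 2          # fused induction update: step by 2 elements
    if i < n:           # leftover element when n is odd
        z[i] = a * x[i] + y[i]
    return z

print(saxpy_unrolled2(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

The fused body has twice as many independent operations for the scheduler to interleave and executes the increment and branch half as often.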

Anything The Compiler Can Do…

• Dynamically-scheduled processors
  - Hardware re-schedules insns…
  - …within a sliding window of von Neumann insns
  - Does loop unrolling transparently
  - Does equivalent of loop unrolling on non-loop code
  - Uses branch prediction to “unroll” branches
  - Pentium Pro/II/III (3-wide), Core/2 (4-wide), Alpha 21264 (4-wide), MIPS R10000 (4-wide), Power5 (5-wide)
• Quick overview of approach
  - Lots more information in CIS501 (graduate level architecture)

The Problem With In-Order Pipelines

                  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
addf f0,f1 → f2   F  D  E+ E+ E+ W
mulf f2,f3 → f2      F  D  d* d* E* E* E* E* E* W
subf f0,f1 → f4         F  p* p* D  E+ E+ E+ W

• What’s happening in cycle 4?
  - mulf stalls due to data dependence
  - OK, this is a fundamental problem
  - subf stalls due to pipeline hazard
  - Why? subf can’t proceed into D because mulf is there
  - That is the only reason, and it isn’t a fundamental one
  - Maintaining in-order writes to register file
• Why can’t subf go into D in cycle 4 and E+ in cycle 5?

Code Example

• Code (raw insns):

  add r2,r3 → r1
  sub r2,r1 → r3
  mul r2,r3 → r3
  div r1,4 → r1

• Divide insn independent of subtract and multiply insns
  - Can execute in parallel with subtract
• Many registers re-used
  - Just as in static scheduling, the register names get in the way
  - How does the hardware get around this?
• Approach: (step #1) rename registers, (step #2) schedule

Step #1: Register Renaming

• To eliminate register conflicts/hazards
• “Architected” vs “Physical” registers
  - Names: r1,r2,r3
  - Locations: p1,p2,p3,p4,p5,p6,p7
  - Original mapping: r1 → p1, r2 → p2, r3 → p3; p4–p7 are “available”
  - Renaming: conceptually write each register once
  + Removes false dependences
  + Leaves true dependences intact!
  - When to reuse a physical register? After overwriting insn done

MapTable        FreeList       Raw insns       Renamed insns
r1  r2  r3
p1  p2  p3      p4,p5,p6,p7    add r2,r3,r1    add p2,p3,p4
p4  p2  p3      p5,p6,p7       sub r2,r1,r3    sub p2,p4,p5
p4  p2  p5      p6,p7          mul r2,r3,r3    mul p2,p5,p6
p4  p2  p6      p7             div r1,4,r1     div p4,4,p7

Register Renaming Algorithm

• Data structures:
  - maptable[architectural_reg] → physical_reg
  - Free list: get/put free register
• Algorithm: at decode for each instruction:

  insn.phys_input1 = maptable[insn.arch_input1]
  insn.phys_input2 = maptable[insn.arch_input2]
  insn.phys_to_free = maptable[insn.arch_output]
  new_reg = get_free_phys_reg()
  insn.phys_output = new_reg
  maptable[insn.arch_output] = new_reg

• At “commit”
  - Once all older instructions have committed, free register:

  put_free_phys_reg(insn.phys_to_free)
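The decode-time renaming steps can be sketched in a few lines of Python. This is an illustrative model, not the hardware: the tuple layout `(op, src1, src2, dest)` and the function name `rename` are assumptions made here, and immediates such as `4` are simply passed through unrenamed.

```python
from collections import deque

def rename(insns, maptable, free_list):
    """Rename architected registers to physical ones, in program order.

    insns: list of (op, src1, src2, dest) tuples.
    Returns (op, phys_src1, phys_src2, phys_dest, phys_to_free) tuples.
    """
    maptable = dict(maptable)     # work on copies so the caller's state is untouched
    free = deque(free_list)
    renamed = []
    for op, src1, src2, dest in insns:
        p1 = maptable.get(src1, src1)   # operands not in the map are immediates
        p2 = maptable.get(src2, src2)
        to_free = maptable[dest]        # old mapping, returned to the free list at commit
        new_reg = free.popleft()        # get_free_phys_reg()
        maptable[dest] = new_reg
        renamed.append((op, p1, p2, new_reg, to_free))
    return renamed

raw = [("add", "r2", "r3", "r1"),
       ("sub", "r2", "r1", "r3"),
       ("mul", "r2", "r3", "r3"),
       ("div", "r1", "4", "r1")]
renamed = rename(raw, {"r1": "p1", "r2": "p2", "r3": "p3"}, ["p4", "p5", "p6", "p7"])
for op, p1, p2, pd, to_free in renamed:
    print(f"{op} {p1},{p2},{pd}   (free {to_free} at commit)")
```

Run on the four raw insns from the example, it produces the same renamed sequence as the table: add p2,p3,p4; sub p2,p4,p5; mul p2,p5,p6; div p4,4,p7.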

Step #2: Dynamic Scheduling

• Instructions are fetched/decoded/renamed into the instruction buffer
  - Also called “instruction window” or “instruction scheduler”
• Instructions (conceptually) check ready bits every cycle
  - Execute when ready

Dynamic Scheduling Algorithm

• Data structures:
  - Ready table[phys_reg] → yes/no
• Algorithm at “schedule” stage (prior to read registers):

  foreach instruction:
    if table[insn.phys_input1] == ready &&
       table[insn.phys_input2] == ready then
      mark insn as “ready”
  select the oldest “ready” instruction
  table[insn.phys_output] = ready
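A small simulation shows how the ready table lets a younger instruction run ahead of an older one. It is a sketch under stated assumptions, not the slide's exact algorithm: it selects up to `width` oldest ready instructions per cycle (the pseudocode above selects one), and a result becomes visible to consumers in the following cycle. The names `run`, `window` and `initial_ready` are illustrative.

```python
def run(window, ready, width=2):
    """Return the ops issued in each cycle, oldest-ready-first.

    window: renamed insns as (op, phys_in1, phys_in2, phys_out), oldest first.
    ready:  dict phys_reg -> bool; operands missing from it are immediates.
    """
    window = list(window)
    ready = dict(ready)
    schedule = []
    while window:
        # An insn is ready when both of its physical inputs are marked ready.
        candidates = [i for i in window
                      if all(ready.get(src, True) for src in i[1:3])]
        issued = candidates[:width]          # oldest ready insns win
        if not issued:
            raise RuntimeError("no instruction can ever become ready")
        for insn in issued:
            window.remove(insn)
            ready[insn[3]] = True            # wake up consumers for the next cycle
        schedule.append([insn[0] for insn in issued])
    return schedule

window = [("add", "p2", "p3", "p4"),
          ("sub", "p2", "p4", "p5"),
          ("mul", "p2", "p5", "p6"),
          ("div", "p4", "4", "p7")]
initial_ready = {"p2": True, "p3": True, "p4": False,
                 "p5": False, "p6": False, "p7": False}
order = run(window, initial_ready)
print(order)
```

With the renamed insns from the earlier example, `div` issues in the same cycle as `sub` and ahead of the older `mul`, which is the parallelism the code example pointed out.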

Dynamic Scheduling - OoO Execution

• Dynamic scheduling
  - Totally in the hardware
  - Also called “out-of-order execution” (OoO)
• Fetch many instructions into instruction window
  - Use branch prediction to speculate past (multiple) branches
  - Flush pipeline on branch misprediction


• Rename to avoid false dependencies
• Execute instructions as soon as possible
• “Commit” instructions in order
• Current machines: 100+ instruction scheduling window