Instruction Scheduling and Parallelism in Computer Architecture, Slides of Compilers

An overview of instruction scheduling techniques, focusing on increasing parallelism through basic block scheduling and data-dependence graphs. It covers the register/parallelism tradeoff, rules for instruction scheduling, kinds of data dependence, and eliminating data dependences. The document also introduces the concept of global code motion and discusses upwards and downwards code motion. Lastly, it touches upon software pipelining and its limitations.

Typology: Slides

2012/2013

Uploaded on 04/29/2013

aalok
Instruction Scheduling
Increasing Parallelism
Basic-Block Scheduling
Data-Dependency Graphs

The Model

• A very-long-instruction-word (VLIW) machine allows several operations to be performed at once.

  • Given: a list of “resources” (e.g., ALU) and the delay required for each instruction.

• Schedule the intermediate-code instructions of a basic block to minimize the number of machine instructions.
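The resource half of this model can be made concrete with a small check. This is a minimal sketch, assuming each operation occupies exactly one unit of one resource kind (an ALU for arithmetic, a memory unit for loads and stores); the names `resource_of`, `struct machine` and `fits_in_one_instruction` are hypothetical. It answers a question a scheduler asks over and over: can these operations share one long instruction?

```c
#include <assert.h>

/* Hypothetical resource kinds and operation kinds for a tiny VLIW model. */
enum resource { RES_ALU, RES_MEM, NUM_RES };
enum opkind { OP_ADD, OP_SUB, OP_LD, OP_ST };

/* Number of units of each resource available in one instruction word. */
struct machine { int units[NUM_RES]; };

/* Assumption: arithmetic uses an ALU; loads and stores use the memory unit. */
static enum resource resource_of(enum opkind k) {
    return (k == OP_LD || k == OP_ST) ? RES_MEM : RES_ALU;
}

/* Returns 1 if the n operations in ops[] can be issued together, else 0. */
static int fits_in_one_instruction(const struct machine *m,
                                   const enum opkind *ops, int n) {
    int used[NUM_RES] = {0};
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        enum resource r = resource_of(ops[i]);
        if (++used[r] > m->units[r])
            return 0;   /* this resource is oversubscribed */
    }
    return 1;
}
```

With two ALUs and one memory unit, an ADD, a SUB and a LD fit in one instruction, but two memory operations do not.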

Example

a = b+c; e = a+d; a = b-c; f = a+d

Assume 2 arithmetic operations per instruction.

Don’t reuse a: renaming it to a1 and a2 removes the anti- and output dependences on a.

a1 = b+c; e = a1+d; a2 = b-c; f = a2+d

Only true dependences remain, so the block fits in two instructions:

ALU1        ALU2
a1 = b+c    a2 = b-c
e = a1+d    f = a2+d
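A quick way to check that renaming preserves the meaning is to run both versions side by side. In this sketch (function names hypothetical), `renamed_schedule` mimics the two instructions: the two operations inside each instruction depend only on values computed in an earlier instruction.

```c
#include <assert.h>

struct result { int e, f; };

/* The block as written: a is reused, so e = a+d must read a
   before a = b-c overwrites it. */
static struct result original_block(int b, int c, int d) {
    int a, e, f;
    a = b + c;
    e = a + d;
    a = b - c;
    f = a + d;
    return (struct result){ e, f };
}

/* The renamed block packed into two instructions of two operations each. */
static struct result renamed_schedule(int b, int c, int d) {
    /* Instruction 1 (ALU1, ALU2): neither operation reads the other's result. */
    int a1 = b + c, a2 = b - c;
    /* Instruction 2 (ALU1, ALU2): both operands come from instruction 1. */
    int e = a1 + d, f = a2 + d;
    return (struct result){ e, f };
}
```

For b = 5, c = 3, d = 2, both functions return e = 10 and f = 4.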

More Extreme Example

for (i=0; i<N; i++) {
    t = a[i]+1;
    b[i] = t*t;
} /* t is shared by all iterations: no parallelism */

for (i=0; i<N; i++) {
    t[i] = a[i]+1;
    b[i] = t[i]*t[i];
} /* All iterations can be executed in parallel */
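The difference between the two loops shows up once iterations are run together. The sketch below simulates one parallel interleaving, in which every iteration executes its first statement before any iteration executes its second; `N` is fixed at 4 and the function names are illustrative only.

```c
#include <assert.h>

#define N 4

/* Ordinary sequential execution of the first loop. */
static void run_sequential(const int a[N], int b[N]) {
    int t;
    for (int i = 0; i < N; i++) {
        t = a[i] + 1;
        b[i] = t * t;
    }
}

/* First loop run in lockstep: all iterations write the same scalar t,
   so every b[i] sees only the last value written. */
static void run_lockstep_scalar(const int a[N], int b[N]) {
    int t = 0;
    for (int i = 0; i < N; i++) t = a[i] + 1;
    for (int i = 0; i < N; i++) b[i] = t * t;
}

/* Second loop run in lockstep: iteration i touches only t[i],
   so no iteration disturbs another. */
static void run_lockstep_expanded(const int a[N], int b[N]) {
    int t[N];
    for (int i = 0; i < N; i++) t[i] = a[i] + 1;
    for (int i = 0; i < N; i++) b[i] = t[i] * t[i];
}
```

With a = {0, 1, 2, 3}, the expanded loop still yields b = {1, 4, 9, 16}, while the scalar version leaves every b[i] equal to 16, the square of the last t written.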

Kinds of Data Dependence

1. Write-read (true dependence):

  • A read of x must continue to follow the previous write of x.

2. Read-write (antidependence):

  • A write of x must continue to follow previous reads of x.

3. Write-write (output dependence):

  • Writes of x must stay in order.
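The three kinds can be told apart from just the order and kind of two accesses to the same variable. A minimal sketch, with a hypothetical `struct ref` standing for one read or write of a named variable:

```c
#include <assert.h>
#include <string.h>

enum access { READ, WRITE };
enum dep { NO_DEP, TRUE_DEP, ANTI_DEP, OUTPUT_DEP };

/* One access to a named variable (hypothetical representation). */
struct ref { const char *var; enum access kind; };

/* Classifies the dependence of a later access on an earlier one. */
static enum dep classify(struct ref earlier, struct ref later) {
    assert(earlier.var != NULL && later.var != NULL);
    if (strcmp(earlier.var, later.var) != 0)
        return NO_DEP;                       /* different variables */
    if (earlier.kind == WRITE && later.kind == READ)
        return TRUE_DEP;                     /* write-read */
    if (earlier.kind == READ && later.kind == WRITE)
        return ANTI_DEP;                     /* read-write */
    if (earlier.kind == WRITE && later.kind == WRITE)
        return OUTPUT_DEP;                   /* write-write */
    return NO_DEP;                           /* read-read: order does not matter */
}
```

In the earlier example, e = a+d reads the a written by a = b+c (true dependence), a = b-c writes the a that e = a+d read (antidependence), and the two writes of a form an output dependence.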

Eliminating Data Dependences

• True dependences are the only kind that cannot be eliminated.

• Output dependences and antidependences can be eliminated by writing into different variables.
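One simple way to write into different variables is to give every write a fresh version of its variable, much as static single-assignment form does, and to point each read at the latest version. This sketch handles straight-line code over single-letter variables; `struct stmt`, `struct vstmt` and `rename_block` are hypothetical, and the operator is omitted because renaming does not depend on it.

```c
#include <assert.h>

/* A statement dst = src1 op src2; the operator is irrelevant here. */
struct stmt { char dst, src1, src2; };
/* The same statement after renaming: each name carries a version number. */
struct vstmt { char dst, src1, src2; int vdst, vsrc1, vsrc2; };

static int index_of(char v) {
    assert(v >= 'a' && v <= 'z');
    return v - 'a';
}

/* Version 0 is a variable's value on entry to the block. Sources are
   renamed before the destination, so a = a+1 reads the old version. */
static void rename_block(const struct stmt *in, struct vstmt *out, int n) {
    int version[26] = {0};
    for (int i = 0; i < n; i++) {
        out[i].src1 = in[i].src1;
        out[i].vsrc1 = version[index_of(in[i].src1)];
        out[i].src2 = in[i].src2;
        out[i].vsrc2 = version[index_of(in[i].src2)];
        out[i].dst = in[i].dst;
        out[i].vdst = ++version[index_of(in[i].dst)];  /* fresh version per write */
    }
}
```

Applied to a = b+c; e = a+d; a = b-c; f = a+d, it yields a1 = b0+c0; e1 = a1+d0; a2 = b0-c0; f1 = a2+d0. The true dependences survive; the anti- and output dependences on a are gone.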

Timing in Our Machine Model

• Arithmetic requires one clock cycle (“clock”).

• Store requires 1 clock.

• Load requires 2 clocks to complete.

  • But we can store into the same memory location at the next clock.

  • And one LD can be issued at each clock.
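These timing rules reduce to a latency per operation kind plus the fact that the memory unit accepts a new operation every clock. A sketch with hypothetical names, encoding only the numbers on this slide:

```c
#include <assert.h>

/* Operation kinds named on the timing slide. */
enum opkind { OP_ARITH, OP_LD, OP_ST };

/* Clocks the operation needs to complete: a load takes 2,
   arithmetic and stores take 1. */
static int latency(enum opkind k) {
    return k == OP_LD ? 2 : 1;
}

/* Earliest clock at which a consumer may use the result of an
   operation issued at clock `issue` (clocks start at 1). */
static int first_use_clock(enum opkind producer, int issue) {
    assert(issue >= 1);
    return issue + latency(producer);
}

/* Loads are pipelined: the memory unit can accept another LD, or a
   store into the location just loaded, at the very next clock. */
static int next_memory_issue(int issue) {
    assert(issue >= 1);
    return issue + 1;
}
```

So LD r1,a issued at clock 1 makes r1 usable at clock 3, although another LD, or a store into a, may issue at clock 2.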

Data-Dependence Graphs

• Nodes = machine instructions.

• Edge i -> j if instruction (j) has a data dependence on instruction (i).

• Label an edge with the minimum delay interval between when (i) may initiate and when (j) may initiate.

  • Delay measured in clock cycles.
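A data-dependence graph needs little more than a matrix of edge delays. Ignoring resource limits, the earliest clock at which each instruction may start is then a longest-path computation over those delays. This sketch assumes at most 16 nodes, numbered in program order so that every edge runs from a lower to a higher number, and uses 0 to mean "no edge" (every delay in this model is at least 1 clock); `struct ddg` and `earliest_starts` are hypothetical names.

```c
#include <assert.h>

#define MAX_NODES 16

/* delay[i][j] > 0 labels edge i -> j with the minimum number of clocks
   between the start of i and the start of j; 0 means no edge. */
struct ddg {
    int n;
    int delay[MAX_NODES][MAX_NODES];
};

/* Earliest start clock of every node with unlimited resources. Because
   edges only go from lower to higher numbers, one forward pass suffices. */
static void earliest_starts(const struct ddg *g, int start[]) {
    assert(g->n <= MAX_NODES);
    for (int j = 0; j < g->n; j++) {
        start[j] = 1;                                /* clocks are numbered from 1 */
        for (int i = 0; i < j; i++)
            if (g->delay[i][j] > 0 && start[i] + g->delay[i][j] > start[j])
                start[j] = start[i] + g->delay[i][j];
    }
}
```

For a chain LD -> ADD -> ST with delays 2 and 1, the earliest starts are clocks 1, 3 and 4.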

Example: Data-Dependence Graph


LD r1,a

LD r2,b

ADD r3,r1,r

ST a,r

ST b,r

ST c,r

True dependence regarding r

Antidependence regarding b

True dependence regarding r
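The edges of this graph can be derived mechanically from what each instruction reads and writes. The extracted slide lost the last digit of several register operands, so this sketch assumes they read ADD r3,r1,r2, ST a,r2, ST b,r1 and ST c,r3 (one reading consistent with the edge labels). Delays follow the timing slide for true dependences, and 1 clock is assumed for anti- and output dependences. All names are hypothetical.

```c
#include <assert.h>

#define NI 6

/* Locations: registers r1..r3 and memory words a, b, c. */
enum loc { NONE, R1, R2, R3, MA, MB, MC };
enum opkind { LD, ADD, ST };

/* One instruction: the location it writes and up to two it reads. */
struct insn { enum opkind op; enum loc def; enum loc use[2]; };

/* The example block; the truncated register operands are ASSUMED. */
static const struct insn block[NI] = {
    { LD,  R1, { MA, NONE } },   /* LD r1,a      */
    { LD,  R2, { MB, NONE } },   /* LD r2,b      */
    { ADD, R3, { R1, R2 } },     /* ADD r3,r1,r2 */
    { ST,  MA, { R2, NONE } },   /* ST a,r2      */
    { ST,  MB, { R1, NONE } },   /* ST b,r1      */
    { ST,  MC, { R3, NONE } },   /* ST c,r3      */
};

static int reads(const struct insn *x, enum loc l) {
    return l != NONE && (x->use[0] == l || x->use[1] == l);
}

/* Fills delay[i][j] (0 = no edge). A true dependence waits for the
   producer's latency (LD 2, others 1); anti and output take 1 clock. */
static void build_ddg(int delay[NI][NI]) {
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NI; j++) {
            int d = 0;
            if (j > i) {
                const struct insn *p = &block[i], *q = &block[j];
                int true_dep = reads(q, p->def);
                int anti_dep = reads(p, q->def);
                int output_dep = p->def != NONE && p->def == q->def;
                if (true_dep)
                    d = (p->op == LD) ? 2 : 1;
                if ((anti_dep || output_dep) && d < 1)
                    d = 1;
            }
            delay[i][j] = d;
        }
}
```

Under these assumptions it finds seven edges: both loads feed the ADD with delay 2; LD r2,b feeds ST a,r2 and LD r1,a feeds ST b,r1 with delay 2; each load precedes the store into the location it read by 1 clock (the antidependences on a and b); and the ADD feeds ST c,r3 with delay 1.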

Scheduling a Basic Block

• List scheduling is a simple heuristic.

• Choose a prioritized topological order that:

1. Respects the edges in the data-dependence graph (“topological”).

2. Chooses heuristically among the options, e.g., picking first the node with the longest path extending from that node (“prioritized”).
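For the example graph, the priority of a node can be taken as the total delay along the longest path leaving it. The sketch below computes those priorities, then builds a prioritized topological order by repeatedly taking the highest-priority node whose predecessors are all placed. The edge matrix assumes the truncated operands read ADD r3,r1,r2, ST a,r2, ST b,r1 and ST c,r3; breaking ties in favor of the earlier instruction is also an assumption, not something the slides specify.

```c
#include <assert.h>

#define NI 6

/* Edge delays of the example graph (0 = no edge); node k is the
   k-th instruction. Register operands are ASSUMED as stated above. */
static const int delay[NI][NI] = {
    /* LD r1,a */ { 0, 0, 2, 1, 2, 0 },
    /* LD r2,b */ { 0, 0, 2, 2, 1, 0 },
    /* ADD     */ { 0, 0, 0, 0, 0, 1 },
    /* ST a    */ { 0 },
    /* ST b    */ { 0 },
    /* ST c    */ { 0 },
};

/* Priority of node i: longest path (sum of delays) leaving i. Edges go
   from lower to higher numbers, so a backward sweep suffices. */
static void priorities(int prio[NI]) {
    for (int i = NI - 1; i >= 0; i--) {
        prio[i] = 0;
        for (int j = i + 1; j < NI; j++)
            if (delay[i][j] && delay[i][j] + prio[j] > prio[i])
                prio[i] = delay[i][j] + prio[j];
    }
}

/* Repeatedly pick the ready node (all predecessors placed) with the
   highest priority; ties go to the earlier instruction. */
static void prioritized_order(int order[NI]) {
    int prio[NI], placed[NI] = {0};
    priorities(prio);
    for (int k = 0; k < NI; k++) {
        int best = -1;
        for (int j = 0; j < NI; j++) {
            if (placed[j]) continue;
            int ready = 1;
            for (int i = 0; i < NI; i++)
                if (delay[i][j] && !placed[i]) ready = 0;
            if (ready && (best < 0 || prio[j] > prio[best])) best = j;
        }
        assert(best >= 0);   /* the graph is acyclic, so some node is ready */
        placed[best] = 1;
        order[k] = best;
    }
}
```

Both loads get priority 3 and the ADD gets 1, so the order comes out as the two loads, the ADD, and then the three stores, which matches the choices on the following slides.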

Example: Data-Dependence Graph


Once both loads are placed, three instructions are enabled: the ADD and the stores into a and b. Pick the ADD, since it has the longest path extending from it.


After the ADD, the three stores can occur in any order. Pick them in the order listed.

Example: Making the Schedule

LD r1,a: clock 1 earliest. MEM available.

Schedule so far: clock 1 (MEM): LD r1,a

LD r2,b: clock 1 earliest. MEM not available. Delay to clock 2.

Schedule so far: clock 1 (MEM): LD r1,a; clock 2 (MEM): LD r2,b
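Making the schedule can be sketched as a loop over the prioritized order: each instruction goes to the first clock that respects every incoming edge delay and has its unit free. This sketch assumes one ALU and one MEM unit per clock, assumes the truncated operands read ADD r3,r1,r2, ST a,r2, ST b,r1 and ST c,r3, and uses hypothetical names; only the first two placements are confirmed by the slides.

```c
#include <assert.h>

#define NI 6
#define MAX_CLOCK 32

enum unit { ALU, MEM };

/* Example graph, nodes listed in the prioritized topological order;
   0 = no edge. Register operands are ASSUMED as stated above. */
static const int delay[NI][NI] = {
    { 0, 0, 2, 1, 2, 0 },   /* LD r1,a          */
    { 0, 0, 2, 2, 1, 0 },   /* LD r2,b          */
    { 0, 0, 0, 0, 0, 1 },   /* ADD r3,r1,r2     */
    { 0 }, { 0 }, { 0 },    /* ST a, ST b, ST c */
};
static const enum unit needs[NI] = { MEM, MEM, ALU, MEM, MEM, MEM };

/* Assumed machine: one ALU and one MEM unit per clock. */
static void list_schedule(int clock_of[NI]) {
    int busy[MAX_CLOCK + 1][2] = {{0}};
    for (int j = 0; j < NI; j++) {
        /* Earliest clock allowed by the edges from already-placed predecessors. */
        int earliest = 1;
        for (int i = 0; i < j; i++)
            if (delay[i][j] && clock_of[i] + delay[i][j] > earliest)
                earliest = clock_of[i] + delay[i][j];
        /* Unit taken at that clock: delay one clock at a time. */
        int c = earliest;
        while (c <= MAX_CLOCK && busy[c][needs[j]]) c++;
        assert(c <= MAX_CLOCK);
        busy[c][needs[j]] = 1;
        clock_of[j] = c;
    }
}
```

The first two steps reproduce the slides: LD r1,a at clock 1, and LD r2,b delayed to clock 2 because MEM is busy. Under the stated assumptions the sketch then places ST b,r1 at clock 3, the ADD and ST a,r2 together at clock 4, and ST c,r3 at clock 5.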