Modulo Scheduling - Midterm Exam - Advanced Compilers | EECS 583, Exams of Electrical and Electronics Engineering

Material Type: Exam; Professor: Mahlke; Class: Advanced Compilers; Subject: Electrical Engineering And Computer Science; University: University of Michigan - Ann Arbor; Term: Fall 2007;

EECS 583 – Class 15
Modulo Scheduling
University of Michigan
October 29, 2007

Schedule for the Rest of the Semester

• This week – 10/29, 10/31
    » Modulo scheduling
• Next week – 11/5, 11/7
    » Multicluster partitioning, Register allocation
• 11/12 – 12/10: Research presentations by you guys!
    » Schedule by groups, 3 talks per class
    » Multicore group goes first (i.e., on 11/12), so get moving!
    » Rest will be scheduled in SIG meetings
• Midterm exam – Exact date still up in the air
    » Likely at end of November


Reading Material

• Today’s class
    » “Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops”, B. Rau, MICRO-27, 1994, pp. 63-74.
• Next class
    » “Code Generation Schemas for Modulo Scheduled DO-Loops and WHILE-Loops”, B. Rau, M. Schlansker, and P. Tirumalai, MICRO-25, Dec. 1992.


From Last Time: A Software Pipeline

Loop body with 4 ops: A, B, C, D

    time
     |    Prologue (fill the pipe)     A
     |                                 B A
     |                                 C B A
     |    Kernel (steady state)        D C B A
     |                                 ...
     |    Epilogue (drain the pipe)    D C B
     |                                 D C
     v                                 D

• Initiation Interval (II) = fixed delay between the start of successive iterations
• Each iteration can be divided into stages consisting of II cycles each
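The prologue/kernel/epilogue shape can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: `pipeline_rows` is a hypothetical helper (not from the lecture), and it assumes each op occupies its own stage, so op k of an iteration issues k*II cycles after that iteration starts.

```python
def pipeline_rows(ops, num_iters, ii=1):
    """Return, for each cycle, the ops issued in that cycle.

    Assumption (illustrative only): op k of an iteration issues
    k*ii cycles after the iteration starts, and iteration i
    starts at cycle i*ii.
    """
    last_start = (num_iters - 1) * ii          # start of the last iteration
    total_cycles = last_start + (len(ops) - 1) * ii + 1
    rows = []
    for cycle in range(total_cycles):
        row = []
        for it in range(num_iters):
            offset = cycle - it * ii           # cycles since iteration `it` began
            if offset >= 0 and offset % ii == 0 and offset // ii < len(ops):
                row.append(f"{ops[offset // ii]}{it}")
        rows.append(row)
    return rows


rows = pipeline_rows(["A", "B", "C", "D"], num_iters=6)
for cycle, row in enumerate(rows):
    print(cycle, " ".join(row))
```

Rows with fewer than four ops at the top are the prologue, the full rows are the kernel, and the shrinking rows at the bottom are the epilogue.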


Dependences in a Loop

• Need to worry about 2 kinds
    » Intra-iteration
    » Inter-iteration
• Delay
    » Minimum time interval between the start of operations
    » Operation read/write times
• Distance
    » Number of iterations separating the 2 operations involved
    » Distance of 0 means intra-iteration
• Recurrence manifests itself as a circuit in the dependence graph

Edges are annotated with the tuple <delay, distance>.

[Figure: dependence graph with edges labeled <1,2>, <1,0>, <1,2>]
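One common way to read a <delay, distance> edge from X to Y: Y in iteration i+distance must start at least `delay` cycles after X in iteration i. When iterations start II cycles apart, that becomes start(Y) + II*distance >= start(X) + delay. The sketch below checks this for a made-up two-op recurrence; the `Dep` type, the `violated` helper, and the example edges are assumptions for illustration, not taken from the slides.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    """A dependence edge annotated with <delay, distance>."""
    src: int
    dst: int
    delay: int
    distance: int


def violated(deps, start, ii):
    """Return the edges whose constraint is broken by the given start times.

    start maps each op to its issue cycle within its own iteration.
    """
    return [d for d in deps
            if start[d.dst] + ii * d.distance < start[d.src] + d.delay]


# Hypothetical recurrence: op 1 feeds op 2 in the same iteration,
# and op 2 feeds op 1 of the next iteration.
deps = [Dep(1, 2, delay=2, distance=0), Dep(2, 1, delay=1, distance=1)]
start = {1: 0, 2: 2}

print(violated(deps, start, ii=3))  # no violations
print(violated(deps, start, ii=2))  # the inter-iteration edge is violated
```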


Dynamic Single Assignment (DSA) Form

    1: r3 = load(r1)
    2: r4 = r3 * 26
    3: store (r2, r4)
    4: r1 = r1 + 4
    5: r2 = r2 + 4
    6: p1 = cmpp (r1 < r9)
    7: brct p1 Loop

It is impossible to overlap iterations of this loop because each iteration writes to the same registers. So, we'll have to remove the anti and output dependences.

Recall the notion of a rotating register (virtual for now):

• Each register is an infinite push-down array (expanded virtual register, or EVR)
• Write to the top element, but can reference any element
• Remap operation slides everything down -> r[n] changes to r[n+1]

A program is in DSA form if the same virtual register (EVR element) is never assigned more than once on any dynamic execution path.

After DSA conversion:

    1: r3[-1] = load(r1[0])
    2: r4[-1] = r3[-1] * 26
    3: store (r2[0], r4[-1])
    4: r1[-1] = r1[0] + 4
    5: r2[-1] = r2[0] + 4
    6: p1[-1] = cmpp (r1[-1] < r9)
    remap r1, r2, r3, r4, p1
    7: brct p1[-1] Loop
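To make the remap behavior concrete, here is a small Python model of an EVR. It is a sketch, not the lecture's definition: the `EVR` class and its dictionary-backed storage are assumptions chosen for illustration. A write to an element that already holds a value raises an error, which mimics the DSA rule that no element is assigned more than once.

```python
class EVR:
    """Toy expanded virtual register: an unbounded array indexed relative
    to the current position. remap() renames r[n] to r[n+1]."""

    def __init__(self, name):
        self.name = name
        self._vals = {}   # absolute slot -> value
        self._remaps = 0  # number of remaps performed so far

    def _slot(self, n):
        # After k remaps, r[n] names the element originally called r[n-k].
        return n - self._remaps

    def __getitem__(self, n):
        return self._vals[self._slot(n)]

    def __setitem__(self, n, value):
        slot = self._slot(n)
        if slot in self._vals:
            raise ValueError(f"{self.name}[{n}] assigned twice (not DSA)")
        self._vals[slot] = value

    def remap(self):
        self._remaps += 1


# Model op 4 of the loop (r1[-1] = r1[0] + 4) for three iterations.
r1 = EVR("r1")
r1[0] = 100          # assumed initial value, for illustration only
for _ in range(3):
    r1[-1] = r1[0] + 4
    r1.remap()

print(r1[0], r1[1], r1[2], r1[3])
```

Because every iteration writes a fresh element, values from earlier iterations stay readable as r1[1], r1[2], and so on, which is what lets iterations overlap.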


Loop Dependence Example

    1: r3[-1] = load(r1[0])
    2: r4[-1] = r3[-1] * 26
    3: store (r2[0], r4[-1])
    4: r1[-1] = r1[0] + 4
    5: r2[-1] = r2[0] + 4
    6: p1[-1] = cmpp (r1[-1] < r9)
    remap r1, r2, r3, r4, p1
    7: brct p1[-1] Loop

In DSA form, there are no inter-iteration anti or output dependences!

[Figure: dependence graph for the loop above, edges annotated with <delay, distance>; legible labels include <3,0>, <1,1>, and <0,0>]


Class Problem

    1: r1[-1] = load(r2[0])
    2: r3[-1] = r1[1] - r1[2]
    3: store (r3[-1], r2[0])
    4: r2[-1] = r2[0] + 4
    5: p1[-1] = cmpp (r2[-1] < 100)
    remap r1, r2, r3
    6: brct p1[-1] Loop

Draw the dependence graph showing both intra- and inter-iteration dependences.

Latencies: ld = 2, st = 1, add = 1, cmpp = 1, br = 1


ResMII

Concept: If there were no dependences between the operations, what is the shortest possible schedule?

Simple resource model: a processor has a set of resources R. For each resource r in R there is count(r), specifying the number of identical copies.

    ResMII = MAX (uses(r) / count(r)), for all r in R
    uses(r) = number of times the resource is used in 1 iteration

In reality it's more complex than this because operations can have multiple alternatives (different choices for the resources they could be assigned to), but we will ignore this for now.

ResMII Example

resources: 4 issue, 2 alu, 1 mem, 1 br
latencies: add = 1, mpy = 3, ld = 2, st = 1, br = 1

    1: r3 = load(r1)
    2: r4 = r3 * 26
    3: store (r2, r4)
    4: r1 = r1 + 4
    5: r2 = r2 + 4
    6: p1 = cmpp (r1 < r9)
    7: brct p1 Loop

• ALU: used by 2, 4, 5, 6 -> 4 ops / 2 units = 2
• Mem: used by 1, 3 -> 2 ops / 1 unit = 2
• Br: used by 7 -> 1 op / 1 unit = 1
• ResMII = MAX(2,2,1) = 2
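The ResMII calculation is simple enough to check mechanically. The sketch below uses the op counts and unit counts from the example; the function name `res_mii` and the rounding up with `math.ceil` are assumptions (the slide's formula has no ceiling, but II has to be a whole number of cycles). The issue-width resource is left out, as in the slide's MAX.

```python
import math


def res_mii(uses, count):
    """Resource-constrained lower bound on II under the simple model:
    the busiest resource class determines the bound."""
    return max(math.ceil(uses[r] / count[r]) for r in uses)


# Uses per iteration and unit counts from the example above.
uses = {"alu": 4, "mem": 2, "br": 1}
count = {"alu": 2, "mem": 1, "br": 1}

print(res_mii(uses, count))
```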

RecMII Example

    1: r3 = load(r1)
    2: r4 = r3 * 26
    3: store (r2, r4)
    4: r1 = r1 + 4
    5: r2 = r2 + 4
    6: p1 = cmpp (r1 < r9)
    7: brct p1 Loop

[Figure: dependence graph for the loop, edges annotated with <delay, distance>; the recurrence circuits pass through ops 4, 5, 1, and 3]

RecMII = MAX(1,1,1,1) = 1

Then, MII = MAX(ResMII, RecMII)
MII = MAX(2,1) = 2
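RecMII is conventionally the maximum, over all circuits in the dependence graph, of the circuit's total delay divided by its total distance, rounded up. The sketch below applies that rule to four circuits that each evaluate to 1, matching the MAX above. The two self-loops use the <1,1> label visible in the graph; the edge labels of the two-edge circuits are assumed for illustration, and `rec_mii` is a hypothetical helper.

```python
import math


def rec_mii(circuits):
    """circuits: list of circuits, each a list of (delay, distance) labels.
    Returns the recurrence-constrained lower bound on II."""
    bound = 0
    for edges in circuits:
        total_delay = sum(delay for delay, _ in edges)
        total_distance = sum(dist for _, dist in edges)
        if total_distance == 0:
            raise ValueError("circuit with zero distance cannot be scheduled")
        bound = max(bound, math.ceil(total_delay / total_distance))
    return bound


circuits = [
    [(1, 1)],          # self-loop on an address increment
    [(1, 1)],          # self-loop on the other address increment
    [(0, 0), (1, 1)],  # assumed labels for a two-edge circuit
    [(0, 0), (1, 1)],  # assumed labels for a two-edge circuit
]

res = 2                     # ResMII from the earlier example
rec = rec_mii(circuits)
mii = max(res, rec)
print(rec, mii)
```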


Class Problem

    1: r1[-1] = load(r2[0])
    2: r3[-1] = r1[1] - r1[2]
    3: store (r3[-1], r2[0])
    4: r2[-1] = r2[0] + 4
    5: p1[-1] = cmpp (r2[-1] < 100)
    remap r1, r2, r3
    6: brct p1[-1] Loop

Calculate RecMII, ResMII, and MII.

Latencies: ld = 2, st = 1, add = 1, cmpp = 1, br = 1
Resources: 1 ALU, 1 MEM, 1 BR


Priority Function

Height-based priority worked well for acyclic scheduling, so it makes sense that it will work for loops as well.

Acyclic:

    Height(X) = 0, if X has no successors
    Height(X) = MAX (Height(Y) + Delay(X,Y)) for all Y = succ(X), otherwise

Cyclic:

    HeightR(X) = 0, if X has no successors
    HeightR(X) = MAX (HeightR(Y) + EffDelay(X,Y)) for all Y = succ(X), otherwise

    EffDelay(X,Y) = Delay(X,Y) - II*Distance(X,Y)
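The acyclic definition translates directly into a memoized recursion, and EffDelay is a one-line function. The following is a sketch on a made-up three-op chain; the graph, the node names, and the helper names `heights` and `eff_delay` are assumptions for illustration.

```python
from functools import lru_cache


def eff_delay(delay, distance, ii):
    """Effective delay of an edge once iterations are II cycles apart."""
    return delay - ii * distance


def heights(succs):
    """Acyclic height for every node.

    succs maps each node to a list of (successor, delay) pairs and
    must describe a DAG, otherwise the recursion would not terminate.
    """
    @lru_cache(maxsize=None)
    def height(x):
        out = succs[x]
        if not out:
            return 0
        return max(height(y) + delay for y, delay in out)

    return {x: height(x) for x in succs}


# Hypothetical chain: load (2 cycles) -> multiply (3 cycles) -> store.
succs = {"ld": [("mul", 2)], "mul": [("st", 3)], "st": []}
h = heights(succs)
print(h)
print(eff_delay(delay=1, distance=1, ii=2))
```

An inter-iteration edge can have a negative effective delay, which is why the cyclic case cannot simply reuse this recursion.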


Calculating Height

[Figure: dependence graph for the example; legible edge labels include <2,2> and <0,0>]

1. Insert pseudo edges from all nodes to the branch with latency = 0, distance = 0 (dotted edges)
2. Compute II. For this example assume II = 2
3. HeightR(4) =
4. HeightR(3) =
5. HeightR(2) =
6. HeightR(1) =
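The steps above can be sketched in Python. Because the graph may contain circuits, this version repeatedly relaxes edges until the heights stop changing (a longest-path fixpoint) instead of recursing. It is a sketch under stated assumptions: the function `height_r` is hypothetical, the branch is assumed to have no outgoing edges, and the four-node graph is invented for illustration, since the slide's own edge labels are not fully legible here.

```python
import math


def height_r(nodes, edges, branch, ii):
    """Cyclic heights by iterating the HeightR equations to a fixpoint.

    edges: list of (src, dst, delay, distance).
    Assumes the branch has no outgoing edges, so HeightR(branch) = 0.
    """
    # Step 1: pseudo edges <0,0> from every other node to the branch.
    all_edges = list(edges) + [(n, branch, 0, 0) for n in nodes if n != branch]

    h = {n: -math.inf for n in nodes}
    h[branch] = 0

    # Relax edges; without positive-weight circuits this settles
    # within len(nodes) passes.
    for _ in range(len(nodes)):
        changed = False
        for src, dst, delay, dist in all_edges:
            candidate = h[dst] + delay - ii * dist  # HeightR(Y) + EffDelay(X,Y)
            if candidate > h[src]:
                h[src] = candidate
                changed = True
        if not changed:
            return h
    raise ValueError("heights do not converge: II is below RecMII")


# Hypothetical graph: chain 1 -> 2 -> 3 -> 4 (branch) plus a recurrence 3 -> 1.
nodes = [1, 2, 3, 4]
edges = [(1, 2, 2, 0), (2, 3, 1, 0), (3, 1, 1, 2), (3, 4, 1, 0)]

h = height_r(nodes, edges, branch=4, ii=2)
print(h)
```

With II = 2 the recurrence edge has effective delay 1 - 2*2 = -3, so it does not raise any height. With II = 1 the circuit has positive total effective delay and the function reports that the heights do not converge.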