


















Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
This document from university of michigan's eecs 583 lecture explores various techniques for achieving compact schedules for loops in code generation. The focus is on list scheduling, unrolling, and overlap iterations using pipelining. The document also discusses problems with unrolling and the concept of dynamic single assignment (dsa) form.
Typology: Study notes
1 / 26
This page cannot be seen from the preview
Don't miss anything!



















Today’s focus is loops Most of program execution time is spent in loops Problem:
How do we achieve compact schedules for loops
r1 = _ar2 = _br9 = r1 * 41: r3 = load(r1)2: r4 = r3 * 263: store (r2, r4)4: r1 = r1 + 45: r2 = r2 + 46: p1 = cmpp (r1 < r9)7: brct p1 Loop Loop:
for (j=0; j<100; j++)^ b[j] = a[j] * 26
Unroll then schedule larger body
time
Iteration
n-1,n
Schedule each iteration resources: 4 issue, 2 alu, 1 mem, 1 br latencies: add=1, cmpp = 1, mpy=3, ld = 2, st = 1, br = 1
1: r3 = load(r1)2: r4 = r3 * 263: store (r2, r4)4: r1 = r1 + 45: r2 = r2 + 46: p1 = cmpp (r1 < r9)7: brct p1 Loop
time
ops 0
Total time = 7 * n/
Problems with unrolling^ Y^
y^ Low trip count y^ Lots of branches in the loop body » But, in other settings, there is room for improvement
A software pipeline
time
Prologue - fill the pipe Kernel – steady state
Loop body with 4 ops
Epilogue - drain the pipe
Steady state: 4 iterations executed simultaneously, 1 operation from each iteration. Every cycle, an iteration starts and finishes when the pipe is full.
Creating software pipelines^ Y^
y^ All iteration bodies have identical schedules y^ Each iteration is scheduled to start some fixed number of cycleslater than the previous iteration » Initiation Interval
(II) = fixed delay between the start of
successive iterations » Given the 2 constraints, the unrolled schedule is repetitive(kernel) except the portion at the beginning (prologue) andend (epilogue)^ y^ Kernel can be re-rolled to yield a new loop
Resource usage legality^ Y^
, where the name modulo
scheduling comes from » Modulo reservation table
solves this problem
y^ To schedule an op at time T needing resource R^ X
The entry for R at T mod II must be free y^ Mark busy at T mod II if schedule
br
alu^ alu^ mem b
us0 b
us
Dependences in a loop^ Y^
Need worry about 2 kinds»^ Intra-iteration»^ Inter-iteration Y Delay»^ Minimum time interval betweenthe start of operations»^ Operation read/write times Y Distance»^ Number of iterations separatingthe 2 operations involved»^ Distance of 0 means intra-iteration Y Recurrence manifests itself as acircuit in the dependence graph
<1,2>
<1,2> <1,0> <delay, distance> Edges annotated with tuple
Physical realization of EVRs^ Y^
Loop dependence example
3,0 1, 2,
1, 1, 0,0 1,1 1,
1: r3[-1] = load(r1[0])2: r4[-1] = r3[-1] * 263: store (r2[0], r4[-1])4: r1[-1] = r1[0] + 45: r2[-1] = r2[0] + 46: p1[-1] = cmpp (r1[-1] < r9)remap r1, r2, r3, r4, p17: brct p1[-1] Loop
0,
In DSA form, there are no inter-iteration anti or output dependences!
<delay, distance>
Minimum initiation interval (MII)^ Y^
y^ Resource usage requirements of 1 iteration » RecMII = recurrence constrained MII y^ Latency of the circuits in the dependence graph
ResMII^ Concept: If there were no dependences between the operations, what^ is the the shortest possible schedule?^ Simple resource model^ A processor has a set of resources R. For each resource r in R^ there is count(r) specifying the number of identical copies
ResMII = MAX
(uses(r) / count(r)) for all r in R uses(r) = number of times the resource is used in 1 iteration In reality its more complex than this because operations can have multiple alternatives (different choices for resources it could be assigned to), but we will ignore this for now
RecMII^ Approach: Enumerate all irredundant elementary circuits in the^ dependence graph^ RecMII = MAX
(delay(c) / distance(c)) for all c in C delay(c) = total latency in dependence cycle c (sum of delays) distance(c) = total iteration distance of cycle c (sum of distances)
cycle k^
k+^
k+2k+3k+^
k+^
4 cycles,RecMII = 4
3,
delay(c) = 1 + 3 = 4 distance(c) = 0 + 1 = 1 RecMII = 4/1 = 4
RecMII example^ 1: r3 = load(r1)2: r4 = r3 * 263: store (r2, r4)4: r1 = r1 + 45: r2 = r2 + 46: p1 = cmpp (r1 < r9)7: brct p1 Loop
3,0 1, 2,
1,1 1,1 1, 0,0 1,
RecMII = MAX(1,1,1,1) = 1 Then, MII = MAX(ResMII, RecMII) MII = MAX(2,1) = 2
0,
<delay, distance>