Memory-Hierarchy Tradeoffs, Exercises of Design

2021/2022

Uploaded on 09/27/2022

C H A P T E R
Memory-Hierarchy Tradeoffs
Although serial programming languages assume that programs are written for the RAM model,
this model is rarely implemented in practice. Instead, the random-access memory is replaced
with a hierarchy of memory units of increasing size, decreasing cost per bit, and increasing
access time. In this chapter we study the conditions on the size and speed of these units when
a CPU and a memory hierarchy simulate the RAM model. The design of memory hierarchies
is a topic in operating systems.
A memory hierarchy typically contains the local registers of the CPU at the lowest level and
may contain at succeeding levels a small, very fast, local random-access memory called a cache,
a slower but still fast random-access memory, and a large but slow disk. The time to move data
between levels in a memory hierarchy is typically a few CPU cycles at the cache level, tens of
cycles at the level of a random-access memory, and hundreds of thousands of cycles at the disk
level! A CPU that accesses a random-access memory on every CPU cycle may run at about
a tenth of its maximum speed, and the situation can be dramatically worse if the CPU must
access the disk frequently. Thus it is highly desirable to understand for a given problem how
the number of data movements between levels in a hierarchy depends on the storage capacity
of each memory unit in that hierarchy.
In this chapter we study tradeoffs between the number of storage locations (space) at each
memory-hierarchy level and the number of data movements (I/O time) between levels. Two
closely related models of memory hierarchies are used, the memory-hierarchy pebble game and
the hierarchical memory model, which are extensions of those introduced in Chapter 10.
In most of this chapter it is assumed not only that the user has control over the I/O algorithm
used for a problem but that the operating system does not interfere with the I/O operations
requested by the user. However, we also examine I/O performance when the operating
system, not the user, controls the sequence of memory accesses (Section 11.10). Competitive
analysis is used in this case to evaluate two-level LRU and FIFO memory-management
algorithms.

530 Chapter 11 Memory-Hierarchy Tradeoffs Models of Computation

11.1 The Red-Blue Pebble Game

The red-blue pebble game models data movement between adjacent levels of a two-level memory hierarchy. We begin with this model to fix ideas and then introduce the more general memory-hierarchy game. Both games are played on a directed acyclic graph, the graph of a straight-line program. We describe the game and then give its rules.

In the red-blue game, (hot) red pebbles identify values held in a fast primary memory whereas (cold) blue pebbles identify values held in a secondary memory. The values identified with the pebbles can be words or blocks of words, such as the pages used by an operating system. Since the red-blue pebble game is used to study the number of I/O operations necessary for a problem, the number of red pebbles is assumed limited and the number of blue pebbles is assumed unlimited. Before the game starts, blue pebbles reside on all input vertices. The goal is to place a blue pebble on each output vertex, that is, to compute the values associated with these vertices and place them in long-term storage. These assumptions capture the idea that data resides initially in the most remote memory unit and the results must be deposited there.

RED-BLUE PEBBLE GAME

  • (Initialization) A blue pebble can be placed on an input vertex at any time.
  • (Computation Step) A red pebble can be placed on (or moved to) a vertex if all its immediate predecessors carry red pebbles.
  • (Pebble Deletion) A pebble can be deleted from any vertex at any time.
  • (Goal) A blue pebble must reside on each output vertex at the end of the game.
  • (Input from Blue Level) A red pebble can be placed on any vertex carrying a blue pebble.
  • (Output to Blue Level) A blue pebble can be placed on any vertex carrying a red pebble.

The first rule (initialization) models the retrieval of input data from the secondary memory. The second rule (a computation step) is equivalent to requiring that all the arguments on which a function depends reside in primary memory before the function can be computed. This rule also allows a pebble to move (or slide) to a vertex from one of its predecessors, modeling the use of a register as both the source and target of an operation. The third rule allows pebble deletion: if a red pebble is removed from a vertex that later needs a red pebble, it must be repebbled. The fourth rule (the goal) models the placement of output data in the secondary memory at the end of a computation. The fifth rule allows data held in the secondary memory to be moved back to the primary memory (an input operation). The sixth rule allows a result to be copied to a secondary memory of unlimited capacity (an output operation). Note that a result may be in both memories at the same time.

The red-blue pebble game is a direct generalization of the pebble game of Section 10. (which we call the red pebble game), as can be seen by restricting the sixth rule to allow the placement of blue pebbles only on vertices that are output vertices of the DAG. Under this restriction the blue level cannot be used for intermediate results and the goal of the game becomes to minimize the number of times vertices are pebbled with red pebbles, since the optimal strategy pebbles each output vertex once.
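The six rules can be checked mechanically. The following Python sketch replays a sequence of moves on a DAG and counts computation steps and I/O operations. The move encoding (`"in"`, `"out"`, `"comp"`, `"del"`), the function names, and the dictionary-of-predecessors representation are illustrative assumptions, not notation from the text; only deletion of red pebbles is modeled.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Vertex = Hashable
# Hypothetical move encoding (an assumption of this sketch):
#   ("in", v)          rule 5: red pebble onto a blue-pebbled vertex
#   ("out", v)         rule 6: blue pebble onto a red-pebbled vertex
#   ("comp", v, u)     rule 2: pebble v; u is a predecessor to slide from, or None
#   ("del", v)         rule 3: delete the red pebble on v
Move = Tuple


def require(condition: bool, message: str) -> None:
    """Raise ValueError when a move breaks one of the rules."""
    if not condition:
        raise ValueError(message)


def play_red_blue(preds: Dict[Vertex, List[Vertex]],
                  outputs: Iterable[Vertex],
                  S: int,
                  moves: Iterable[Move]) -> Tuple[int, int]:
    """Replay moves on the DAG `preds`; return (computation steps, I/O operations)."""
    red: Set[Vertex] = set()
    # Rule 1: blue pebbles start on every input vertex (no predecessors).
    blue: Set[Vertex] = {v for v, ps in preds.items() if not ps}
    comp = io = 0
    for move in moves:
        kind, v = move[0], move[1]
        if kind == "in":
            require(v in blue, f"{v!r} carries no blue pebble")
            red.add(v)
            io += 1
        elif kind == "out":
            require(v in red, f"{v!r} carries no red pebble")
            blue.add(v)
            io += 1
        elif kind == "comp":
            require(bool(preds[v]), f"{v!r} is an input vertex")
            require(all(u in red for u in preds[v]),
                    f"a predecessor of {v!r} carries no red pebble")
            slide_from = move[2]
            if slide_from is not None:
                require(slide_from in preds[v], "can only slide from a predecessor")
                red.discard(slide_from)
            red.add(v)
            comp += 1
        elif kind == "del":
            red.discard(v)
        else:
            raise ValueError(f"unknown move {move!r}")
        require(len(red) <= S, f"more than {S} red pebbles in use")
    # Rule 4: every output vertex must end with a blue pebble.
    require(all(v in blue for v in outputs), "an output vertex lacks a blue pebble")
    return comp, io


# Tiny example: c = a + b pebbled with two red pebbles.
preds = {"a": [], "b": [], "c": ["a", "b"]}
moves = [("in", "a"), ("in", "b"), ("comp", "c", "a"), ("out", "c")]
print(play_red_blue(preds, ["c"], S=2, moves=moves))  # (1, 3)
```

The example uses one computation step and three I/O operations: two inputs and one output. The same move sequence is rejected with `S=1`, because the second input would put two red pebbles on the graph.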


11.1.1 Playing the Red-Blue Pebble Game

The rules for the red-blue pebble game are illustrated by the eight-input FFT graph shown in Fig. 11.1. If S = 3 red pebbles are available to pebble this graph (at least S = 4 pebbles are needed in the one-pebble game), a pebbling strategy that keeps the number of I/O operations small is based on the pebbling of sub-FFT graphs on two inputs. Three such sub-FFT subgraphs are shown by heavy lines in Fig. 11.1, one at each level of the FFT graph. This pebbling strategy uses three red pebbles to place blue pebbles on the outputs of each of the four lowest-level sub-FFT graphs on two inputs, those whose outputs are second-level vertices of the full FFT graph. (Thus, eight blue pebbles are used.) Shown on a second-level sub-FFT graph are three red pebbles at the time when a pebble has just been placed on the first of the two outputs of this sub-FFT graph. This strategy performs two I/O operations for each vertex except for input and output vertices. A small savings is possible if, after pebbling the last sub-FFT graph at one level, we immediately pebble the last sub-FFT graph at the next level.
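The I/O count of this strategy can be checked with a short simulation. The sketch below is a minimal illustration, assuming a particular vertex numbering for the FFT graph (vertex `(row, i)` with partner `i XOR 2^row`); the function names are hypothetical, and the small savings mentioned above is not implemented.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[int, int]  # (row, column); row 0 holds the inputs


def fft_preds(d: int) -> Dict[Vertex, List[Vertex]]:
    """Predecessor lists of an FFT graph on n = 2**d inputs (assumed numbering)."""
    n = 1 << d
    preds: Dict[Vertex, List[Vertex]] = {(0, i): [] for i in range(n)}
    for row in range(d):
        for i in range(n):
            preds[(row + 1, i)] = [(row, i), (row, i ^ (1 << row))]
    return preds


def pebble_fft_by_butterflies(d: int) -> Tuple[int, int, int]:
    """Pebble each two-input sub-FFT graph in turn with three red pebbles.

    Returns (computation steps, I/O operations, maximum red pebbles in use).
    """
    n = 1 << d
    preds = fft_preds(d)
    red = set()
    blue = {(0, i) for i in range(n)}  # blue pebbles start on the inputs
    comp = io = max_red = 0
    for row in range(d):
        for i in range(n):
            j = i ^ (1 << row)
            if i > j:
                continue  # each butterfly is handled once, from its smaller index
            a, b = (row, i), (row, j)
            for v in (a, b):  # two input operations
                assert v in blue
                red.add(v)
                io += 1
            for k, out in enumerate(((row + 1, i), (row + 1, j))):
                assert all(u in red for u in preds[out])
                if k == 1:
                    red.discard(a)  # second output: slide the pebble from a
                red.add(out)
                comp += 1
                max_red = max(max_red, len(red))
                blue.add(out)  # one output operation
                io += 1
                red.discard(out)
            red.clear()
    assert all((d, i) in blue for i in range(n))
    return comp, io, max_red


print(pebble_fft_by_butterflies(3))  # (24, 48, 3)
```

For the eight-input graph this gives 48 I/O operations: one for each of the 8 inputs, one for each of the 8 outputs, and two for each of the 16 remaining vertices, in agreement with the count stated in the text.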

11.1.2 Balanced Computer Systems

A balanced computer system is one in which no computational unit or data channel becomes saturated before any other. The results in this chapter can be used to analyze balance. To illustrate this point, we examine a serial computer system consisting of a CPU with a random-access memory and a disk storage unit. Such a system is balanced for a particular problem if the time used for I/O is comparable to the time used for computation.

As shown in Section 11.5.2, multiplying two n × n matrices with a variant of the classical matrix multiplication algorithm requires a number of computations proportional to $n^3$ and a number of I/O operations proportional to $n^3/\sqrt{S}$, where S is the number of red pebbles or the capacity of the random-access memory. Let $t_0$ and $t_1$ be the times for one computation and I/O operation, respectively. Then the system is balanced when $t_0 n^3 \approx t_1 n^3/\sqrt{S}$. Let the computational and I/O capacities, $C_{\mathrm{comp}}$ and $C_{\mathrm{I/O}}$, be the rates at which the CPU and disk can compute and exchange data, respectively; that is, $C_{\mathrm{comp}} = 1/t_0$ and $C_{\mathrm{I/O}} = 1/t_1$. Thus, balance is achieved when the following condition holds:

$$\frac{C_{\mathrm{comp}}}{C_{\mathrm{I/O}}} \approx \sqrt{S}$$

From this condition we see that if through technological advance the ratio $C_{\mathrm{comp}}/C_{\mathrm{I/O}}$ increases by a factor β, then for the system to be balanced the storage capacity of the system, S, must increase by a factor $\beta^2$.

Hennessy and Patterson [131, p. 427] observe that CPU speed is increasing between 50% and 100% per year while that of disks is increasing at a steady 7% per year. Thus, if the ratio $C_{\mathrm{comp}}/C_{\mathrm{I/O}}$ for our simple computer system grows by a factor of 50/7 ≈ 7 per year, then S must grow by about a factor of 49 per year to maintain balance. To the extent that matrix multiplication is typical of the type of computing to be done and that computers have two-level memories, a crisis is looming in the computer industry! Fortunately, multi-level memory hierarchies are being introduced to help avoid this crisis.

As bad as the situation is for matrix multiplication, it is much worse for the Fourier transform and sorting. For each of these problems the number of computation and I/O operations is proportional to $n \log_2 n$ and $n \log_2 n / \log_2 S$, respectively (see Section 11.5.3). Thus, balance is achieved when

$$\frac{C_{\mathrm{comp}}}{C_{\mathrm{I/O}}} \approx \log_2 S$$

Consequently, if $C_{\mathrm{comp}}/C_{\mathrm{I/O}}$ increases by a factor β, S must increase to $S^{\beta}$. Under the conditions given above, namely, β ≈ 7, a balanced two-level memory-hierarchy system for these problems must have a storage capacity that grows from S to about $S^7$ every year.

© John E Savage 11.2 The Memory-Hierarchy Pebble Game 533
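The two balance conditions can be turned into a small calculation. The sketch below treats the proportionality constants as 1, which is an assumption made only for illustration; the function names are hypothetical.

```python
def balanced_storage_matmul(ratio: float) -> float:
    """Storage S satisfying sqrt(S) = ratio, where ratio = C_comp / C_I/O."""
    return ratio ** 2


def balanced_storage_fft(ratio: float) -> float:
    """Storage S satisfying log2(S) = ratio, where ratio = C_comp / C_I/O."""
    return 2.0 ** ratio


beta = 7.0    # growth factor of the ratio, as in the example in the text
ratio = 10.0  # an arbitrary starting ratio (assumption)

# Matrix multiplication: S grows by a factor beta**2.
print(balanced_storage_matmul(beta * ratio) / balanced_storage_matmul(ratio))  # 49.0

# FFT and sorting: S grows from S to S**beta.
print(balanced_storage_fft(beta * ratio) == balanced_storage_fft(ratio) ** beta)  # True
```

The starting ratio of 10 corresponds to S = 100 for matrix multiplication and S = 1024 for the FFT; after the ratio grows by a factor of 7, the first becomes 4900 while the second becomes $1024^7 = 2^{70}$.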

11.2 The Memory-Hierarchy Pebble Game

The standard memory-hierarchy game (MHG) defined below generalizes the two-level red-blue game to multiple levels. The L-level MHG is played on directed acyclic graphs with $p_l$ pebbles at level l, 1 ≤ l ≤ L − 1, and an unlimited number of pebbles at level L. When L = 2, the lower level is the red level and the higher is the blue level. The number of pebbles used at the L − 1 lowest levels is recorded in the resource vector $p = (p_1, p_2, \ldots, p_{L-1})$, where $p_j \ge 1$ for 1 ≤ j ≤ L − 1. The rules of the game are given below.

STANDARD MEMORY-HIERARCHY GAME

R1. (Initialization) A level-L pebble can be placed on an input vertex at any time.

R2. (Computation Step) A first-level pebble can be placed on (or moved to) a vertex if all its immediate predecessors carry first-level pebbles.

R3. (Pebble Deletion) A pebble of any level can be deleted from any vertex.

R4. (Goal) A level-L pebble must reside on each output vertex at the end of the game.

R5. (Input from Level l) For 2 ≤ l ≤ L, a level-(l − 1) pebble can be placed on any vertex carrying a level-l pebble.

R6. (Output to Level l) For 2 ≤ l ≤ L, a level-l pebble can be placed on any vertex carrying a level-(l − 1) pebble.

The first four rules are exactly as in the red-blue pebble game. The fifth and sixth rules generalize the fifth and sixth rules of the red-blue pebble game by identifying inputs from and outputs to level-l memory. These last two rules allow a level-l memory to serve as temporary storage for lower-level memories.

In the standard MHG, the highest-level memory can be used for storing intermediate results. An important variant of the MHG is the I/O-limited memory-hierarchy game, in which the highest-level memory cannot be used for intermediate storage. The rules of this game are the same as in the MHG except that rule R6 is replaced by the following two rules:

I/O-LIMITED MEMORY-HIERARCHY GAME

R6. (Output to Level l) For 2 ≤ l ≤ L − 1, a level-l pebble can be placed on any vertex carrying a level-(l − 1) pebble.

R7. (I/O Limitation) Level-L pebbles can only be placed on output vertices carrying level-(L − 1) pebbles.


11.2.1 Playing the MHG

Figure 11.2 shows the FFT graph on eight inputs being pebbled in a three-level MHG with resource vector p = (2, 4). Here black circles denote first-level pebbles, shaded circles denote second-level pebbles and striped circles denote third-level pebbles. Four striped, three shaded and two black pebbles reside on vertices in the second row of the FFT. One of these shaded second-level pebbles shares a vertex with a black first-level pebble, so that this black pebble can be moved to the vertex covered by the open circle without deleting all pebbles on the doubly covered vertex. To pebble the vertex under the open square with a black pebble, we reuse the black pebble on the open circle by swapping it with a fourth shaded pebble, after which we place the black pebble on the vertex that was doubly covered and then slide it to the vertex covered by the open box. This graph can be completely pebbled with the resource vector p = (2, 4) using only four third-level pebbles, as the reader is asked to show. (See Problem 11.3.) Thus, it can also be pebbled in the four-level I/O-limited MHG using resource vector p = (2, 4, 4).

11.3 I/O-Time Relationships

The following simple relationships follow from two observations. First, each input and output vertex must receive a pebble at each level, since every input must be read from level L and every output must be written to level L. Second, at least one computation step is needed for each non-input vertex of the graph. Here we assume that every vertex in V must be pebbled to pebble the output vertices.

LEMMA 11.3.1 Let α be the maximum in-degree of any vertex in G = (V, E) and let In(G) and Out(G) be the sets of input and output vertices of G, respectively. Then any pebbling P of G with the MHG, whether standard or I/O-limited, satisfies the following conditions for 2 ≤ l ≤ L:

$$T_l^{(L)}(p, G, P) \ge |\mathrm{In}(G)| + |\mathrm{Out}(G)|$$

$$T_1^{(L)}(p, G, P) \ge |V| - |\mathrm{In}(G)|$$
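As a quick illustration, the two bounds of the lemma can be evaluated for any DAG given as predecessor lists. The sketch below applies them to an eight-input FFT graph; the representation and function names are assumptions of this example, and it relies on the stated assumption that every vertex must be pebbled.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def trivial_lower_bounds(preds: Dict[Hashable, List[Hashable]],
                         outputs: Sequence[Hashable]) -> Tuple[int, int]:
    """Return (bound on level-l I/O time for each l >= 2, bound on computation time)."""
    inputs = [v for v, ps in preds.items() if not ps]
    io_bound = len(inputs) + len(outputs)   # |In(G)| + |Out(G)|
    comp_bound = len(preds) - len(inputs)   # |V| - |In(G)|
    return io_bound, comp_bound


def fft_preds(d: int) -> Dict[Tuple[int, int], List[Tuple[int, int]]]:
    """FFT graph on 2**d inputs; the vertex numbering is an assumption."""
    n = 1 << d
    preds = {(0, i): [] for i in range(n)}
    for row in range(d):
        for i in range(n):
            preds[(row + 1, i)] = [(row, i), (row, i ^ (1 << row))]
    return preds


graph = fft_preds(3)
outputs = [(3, i) for i in range(8)]
print(trivial_lower_bounds(graph, outputs))  # (16, 24)
```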

The following theorem relates the number of moves in an L-level game to the number in a two-level game and allows us to use prior results. The lower bound on the level-l I/O time is stated in terms of $s_{l-1}$ because pebbles at levels 1, 2, ..., l − 1 are treated collectively as red pebbles to derive a lower bound; pebbles at level l and above are treated as blue pebbles.

THEOREM 11.3.1 Let $s_l = \sum_{j=1}^{l} p_j$. Then the following inequalities hold for every L-level standard MHG pebbling strategy P for G, where p is the resource vector used by P and $T_1^{(2)}(S, G)$ and $T_2^{(2)}(S, G)$ are the number of computation and I/O operations used by a minimal pebbling in the red-blue pebble game played on G with S red pebbles:

$$T_l^{(L)}(p, G, P) \ge T_2^{(2)}(s_{l-1}, G) \quad \text{for } 2 \le l \le L$$

Also, the following lower bound on computation time holds for all pebbling strategies P in the standard MHG:

$$T_1^{(L)}(p, G, P) \ge T_1^{(2)}(s_1, G)$$

In the I/O-limited case the following lower bounds apply, where α is the maximum fan-in of any vertex of G:

$$T_l^{(L)}(p, G, P) \ge T_2^{(2)}(s_{l-1}, G) \quad \text{for } 2 \le l \le L$$

$$T_1^{(L)}(p, G, P) \ge T_2^{(2)}(s_{L-1}, G)/\alpha$$

Proof The first set of inequalities is shown by considering the red-blue game played with $S = s_{l-1}$ red pebbles and an unlimited number of blue pebbles. The S red pebbles and $s_{L-1} - S$ blue pebbles can be classified into L − 1 groups with $p_j$ pebbles in the jth group, so that we can simulate the steps of an L-level MHG pebbling strategy P. Because there are constraints on the use of pebbles in P, this strategy uses a number of level-l I/O operations that cannot be smaller than the minimum number of such I/O operations when pebbles at level l − 1 or less are treated as red pebbles and those at higher levels are treated as blue pebbles. Thus, $T_l^{(L)}(p, G, P) \ge T_2^{(2)}(s_{l-1}, G)$. By similar reasoning it follows that $T_1^{(L)}(p, G, P) \ge T_1^{(2)}(s_1, G)$.

In the above simulation, blue pebbles simulating levels l and above cannot be used arbitrarily when the I/O-limitation is imposed. To derive lower bounds under this limitation, we classify $S = s_{L-1}$ pebbles into L − 1 groups with $p_j$ pebbles in the jth group and simulate in the red-blue pebble game the steps of an L-level I/O-limited MHG pebbling strategy P. The I/O time at level l is no less than the I/O time in the two-level I/O-limited red-blue pebble game in which all S red pebbles are used at level l − 1 or less.

Since the number of blue pebbles is unlimited, in a minimal pebbling all I/O operations consist of placing red pebbles on blue-pebbled vertices. It follows that if T I/O operations are performed on the input vertices, then at least T placements of red pebbles on blue-pebbled vertices occur. Since at least one internal vertex must be pebbled with a red pebble in a minimal pebbling for every α input vertices that are red-pebbled, the computation time is at least T/α. Specializing this to $T = T_2^{(2)}(s_{L-1}, G)$ for the I/O-limited MHG, we have the last result.
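To make the role of the partial sums concrete, the sketch below computes, for a resource vector p, the number of red pebbles in the two-level game whose I/O time bounds the level-l I/O time. It takes $s_{l-1}$ to be the total number of pebbles at levels 1 through l − 1, as described in the paragraph introducing the theorem; the function names are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def partial_sums(p: Sequence[int]) -> List[int]:
    """Return [s_1, ..., s_{L-1}], where s_l = p_1 + ... + p_l."""
    return list(accumulate(p))


def red_pebbles_for_level(p: Sequence[int], l: int) -> int:
    """Red pebbles s_{l-1} of the two-level game used to bound level-l I/O time."""
    L = len(p) + 1  # p lists the pebbles at the L - 1 lowest levels
    if not 2 <= l <= L:
        raise ValueError("level l must satisfy 2 <= l <= L")
    return sum(p[: l - 1])


p = (2, 4)  # the three-level resource vector used in Section 11.2.1
print(partial_sums(p))              # [2, 6]
print(red_pebbles_for_level(p, 2))  # 2
print(red_pebbles_for_level(p, 3))  # 6
```

With p = (2, 4), level-2 I/O time is bounded below by the I/O time of a red-blue game with 2 red pebbles, and level-3 I/O time by that of a red-blue game with 6 red pebbles.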

It is important to note that the lower bound to $T_1^{(2)}(S, G, P)$ for the I/O-limited case is not stated in terms of |V|, because |V| may not be the same for each value of S. Consider the multiplication of two n × n matrices. Every graph of the standard algorithm can be pebbled with three red pebbles, but such graphs have about $2n^3$ vertices, a number that cannot be reduced by more than a constant factor when a constant number of red pebbles is used. (See Section 11.5.2.) On the other hand, using the graph of Strassen's algorithm for this problem requires at least $\Omega(n^{.38529})$ pebbles, since it has $O(n^{2.807})$ vertices.

We close this section by giving conditions under which lower bounds for one graph can be used for another. Let a reduction of DAG $G_1 = (V_1, E_1)$ be a DAG $G_0 = (V_0, E_0)$, $V_0 \subseteq V_1$ and $E_0 \subseteq E_1$, obtained by deleting edges from $E_1$ and coalescing the non-terminal vertices on a "chain" of vertices in $V_1$ into the first vertex on the chain. A chain is a sequence $v_1, v_2, \ldots, v_r$ of vertices such that, for 2 ≤ i ≤ r − 1, $v_i$ is adjacent to $v_{i-1}$ and $v_{i+1}$ and no other vertices.

LEMMA 11.3.2 Let $G_0$ be a reduction of $G_1$. Then for any minimal pebbling $P_{\min}$ and 1 ≤ l ≤ L, the following inequalities hold:

$$T_l^{(L)}(p, G_1, P_{\min}) \ge T_l^{(L)}(p, G_0, P_{\min})$$


the beginning or the end of the sub-pebbling without changing the number of computation steps or I/O operations. Thus, without changing them, we move all computation steps to a middle interval of $P_t$, between the higher-level I/O operations. We now show how this may be done.

Consider a vertex v carrying a red pebble at some time during $P_t$ that is pebbled for the first time with a blue pebble during $P_t$ (vertex 7 at step 11 in Fig. 11.3). Instead of pebbling v with a blue pebble, use a new red pebble to keep a red pebble on v. (This is equivalent to swapping the new and old red pebbles on v.) This frees up the original red pebble to be used later in the sub-pebbling. Because we attach a red pebble to v for the entire pebbling $P_t$, all later output operations from v in $P_t$ can be deleted except for the last such operation, if any, which can be moved to the end of the interval. Note that if after v is given a blue pebble in P, it is later given a red pebble, this red pebbling step and all subsequent blue pebbling steps except the last, if any, can be deleted. These changes do not affect any computation step in $P_t$.

Consider a vertex v carrying a blue pebble at the start of $P_t$ that later in $P_t$ is given a red pebble (see vertex 4 at step 12 in Fig. 11.3). Consider the first pebbling of this kind. The red pebble assigned to v may have been in use prior to its placement on v. If a new red pebble is used for v, the first pebbling of v with a red pebble can be moved toward the beginning of $P_t$ so that, without violating the precedence conditions of G, it precedes all placements of red pebbles on vertices without pebbles. Attach this new red pebble to v during $P_t$. Subsequent placements of red pebbles on v when it carries a blue pebble during $P_t$, if any, are thereby eliminated.

Step     1    2    3    4    5    6    7    8    9    10   11   12   13
Pebble   R1   R2   R2   B    R2   R2   R1   B    R2   R2   B    R2   R
Vertex   ↓1   ↓2   5    ↑5   ↓2   6    ↓3   ↑6   ↓4   7    ↑7   ↓4   8

Step     14   15   16   17   18   19   20   21   22   23
Pebble   R1   R2   R2   R2   R2   R1   R2   R2   R2   R
Vertex   ↓5   ↓7   9    ↓7   11   ↓6   ↓8   10   ↓8   12

Figure 11.3 The vertices of an FFT graph are numbered and a pebbling schedule is given in which the two numbered red pebbles are used. Up (down) arrows identify steps in which an output (input) occurs; other steps are computation steps. Steps 10 through 13 of the schedule $P_t$ contain two I/O operations. With two new red pebbles, the input at step 12 can be moved to the beginning of the interval and the output at step 11 can be moved after step 13.


We now derive an upper bound to Q. At the start of the pebbling of the middle interval of $P_t$ there are at most 2S red pebbles on G, at most S original red pebbles plus S new red pebbles. Clearly, the number of vertices that can be pebbled in the middle interval with first-level pebbles is largest when all 2S red pebbles on G are allowed to move freely. It follows that at most $\rho(2S, G)$ vertices can be pebbled with red pebbles in any interval. Since all vertices must be pebbled with red pebbles, this completes the proof.

Combining Theorems 11.3.1 and 11.4.1 and a weak lower limit on the size of $T_l^{(L)}(p, G)$, we have the following explicit lower bounds to $T_l^{(L)}(p, G)$.

COROLLARY 11.4.1 In the standard MHG when $T_l^{(L)}(p, G) \ge \beta(s_{l-1} - 1)$ for β > 1, the following inequality holds for 2 ≤ l ≤ L:

$$T_l^{(L)}(p, G) \ge \frac{\beta}{\beta + 1} \, \frac{s_{l-1}}{\rho(2 s_{l-1}, G)} \, \bigl(|V| - |\mathrm{In}(G)|\bigr)$$

In the I/O-limited MHG when $T_l^{(L)}(p, G) \ge \beta(s_{l-1} - 1)$ for β > 1, the following inequality holds for 2 ≤ l ≤ L:

$$T_l^{(L)}(p, G) \ge \frac{\beta}{\beta + 1} \, \frac{s_{L-1}}{\rho(2 s_{L-1}, G)} \, \bigl(|V| - |\mathrm{In}(G)|\bigr)$$

11.5 Tradeoffs Between Space and I/O Time

We now apply the Hong-Kung method to a variety of important problems including matrix-vector multiplication, matrix-matrix multiplication, the fast Fourier transform, convolution, and merging and permutation networks.

11.5.1 Matrix-Vector Product

We examine here the matrix-vector product function $f_{Ax}^{(n)} : R^{n^2+n} \to R^n$ over a commutative ring R described in Section 6.2.1 primarily to illustrate the development of efficient multi-level pebbling strategies. The lower bounds on I/O and computation time for this problem are trivial to obtain. For the matrix-vector product, we assume that the graphs used are those associated with inner products. The inner product u · v of n-vectors u and v over a ring R is defined by:

$$u \cdot v = \sum_{i=1}^{n} u_i \cdot v_i$$

The graph of a straight-line program to compute this inner product is given in Fig. 11.4, where the additions of products are formed from left to right. The matrix-vector product is defined here as the pebbling of a collection of inner product graphs. As suggested in Fig. 11.4, each inner product graph can be pebbled with three red pebbles.
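A left-to-right evaluation makes the three-pebble claim easy to check. The sketch below is an illustration only, with hypothetical vertex names (`u0`, `v0`, `m0`, `acc`): it keeps the running sum under one red pebble, reads $u_i$ and $v_i$ under two more, and slides pebbles to form each product and sum.

```python
from typing import Dict, Sequence, Tuple


def pebble_inner_product(u: Sequence[float],
                         v: Sequence[float]) -> Tuple[float, int, int, int]:
    """Evaluate u . v left to right in the red-blue game.

    Returns (value, computation steps, I/O operations, maximum red pebbles).
    Assumes len(u) == len(v) >= 1.
    """
    red: Dict[str, float] = {}  # vertex name -> value held under a red pebble
    comp = io = max_red = 0
    for i in range(len(u)):
        red[f"u{i}"] = u[i]  # input operation
        red[f"v{i}"] = v[i]  # input operation
        io += 2
        max_red = max(max_red, len(red))
        # Product vertex: slide the pebble on u_i, then delete the one on v_i.
        red[f"m{i}"] = red.pop(f"u{i}") * red.pop(f"v{i}")
        comp += 1
        if i == 0:
            red["acc"] = red.pop("m0")  # the first product starts the sum
        else:
            # Addition vertex: slide the running-sum pebble, delete the product's.
            red["acc"] = red.pop("acc") + red.pop(f"m{i}")
            comp += 1
    io += 1  # output operation for the result
    return red["acc"], comp, io, max_red


print(pebble_inner_product([1, 2, 3], [4, 5, 6]))  # (32, 5, 7, 3)
```

For n-vectors this uses 2n − 1 computation steps and 2n + 1 I/O operations, and never more than three red pebbles.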

THEOREM 11.5.1 Let G be the graph of a straight-line program for the product of the matrix A with the vector x. Let G be pebbled in the standard MHG with the resource vector p. There is a


11.5.2 Matrix-Matrix Multiplication

In this section we derive upper and lower bounds on exchanges between I/O time and space for the n × n matrix multiplication problem in the standard and I/O-limited MHG. We show that the lower bounds on computation and I/O time can be matched by efficient pebbling strategies.

Lower bounds for the standard MHG are derived for the family $F_n$ of inner product graphs for n × n matrix multiplication, namely, the set of graphs to multiply two n × n matrices using just inner products to compute entries in the product matrix. (See Section 6.2.2.) We allow the additions in these inner products to be performed in any order.

The lower bounds on I/O time derived below for the I/O-limited MHG apply to all DAGs for matrix multiplication. Since these DAGs include graphs other than the inner product trees in $F_n$, one might expect the lower bounds for the I/O-limited case to be smaller than those derived for graphs in $F_n$. However, this is not the case, apparently because efficient pebbling strategies for matrix multiplication perform I/O operations only on input and output vertices, not on internal vertices. The situation is very different for the discrete Fourier transform, as seen in the next section.

We derive results first for the red-blue pebble game, that is, the two-level MHG, and then generalize them to the multi-level MHG. We begin by deriving an upper bound on the S-span for the family of inner product matrix multiplication graphs.

LEMMA 11.5.1 For every graph $G \in F_n$ the S-span ρ(S, G) satisfies the bound $\rho(S, G) \le 2S^{3/2}$ for $S \le n^2$.

Proof ρ(S, G) is the maximum number of vertices of $G \in F_n$ that can be pebbled with S red pebbles from an initial placement of these pebbles, maximized over all such initial placements. Let A, B, and C be n × n matrices with entries $\{a_{i,j}\}$, $\{b_{i,j}\}$, and $\{c_{i,j}\}$, respectively, where 1 ≤ i, j ≤ n. Let C = A × B. The term $c_{i,j} = \sum_k a_{i,k} b_{k,j}$ is associated with the root vertex of a unique inner product tree. Vertices in this tree are either addition vertices, product vertices associated with terms of the form $a_{i,k} b_{k,j}$, or input vertices associated with entries in the matrices A and B. Each product term $a_{i,k} b_{k,j}$ is associated with a unique term $c_{i,j}$ and tree, as is each addition operator.

Consider an initial placement of $S \le n^2$ pebbles of which r are in addition trees (they are on addition or product vertices). Let the remaining S − r pebbles reside on input vertices. Let p be the number of product vertices that can be pebbled from these pebbled inputs. We show that at most p + r − 1 additional pebble placements are possible from the initial placement, giving a total of at most π = 2p + r − 1 pebble placements. (Figure 11.5 shows a graph G for a 2 × 2 matrix multiplication algorithm in which the product vertices are those just below the output vertices. The black vertices carry pebbles. In this example r = 2 and p = 1. While p + r − 1 = 2, only one pebble placement is possible on addition trees in this example.)

[Figure 11.5: four inner product trees, panels (a) through (d), one for each entry of the 2 × 2 product, with inputs $a_{i,k}$ and $b_{k,j}$.]

Figure 11.5 Graph of the inner products used to form the product of two 2 × 2 matrices. (Common input vertices are repeated for clarity.)

Given the dependencies of graphs in $F_n$, there is no loss in generality in assuming that product vertices are pebbled before pebbles are advanced in addition trees. It follows that at most p + r addition-tree vertices carry pebbles before pebbles are advanced in addition trees. These pebbled vertices define subtrees of vertices that can be pebbled from the p + r initial pebble placements. Since a binary tree with n leaves has n − 1 non-leaf nodes, it follows that if there are t such trees, at most p + r − t pebble placements will be made, not counting the original placement of pebbles. This number is maximized at t = 1. (See Problem 11.9.)

We now complete the proof by deriving an upper bound on p. Let $\mathcal{A}$ be the 0-1 n × n matrix whose (i, j) entry is 1 if the variable in the (i, j) position of the matrix A carries a pebble initially and 0 otherwise. Let $\mathcal{B}$ be similarly defined for B. It follows that the (i, j) entry, $\delta_{i,j}$, of the matrix product $\mathcal{C} = \mathcal{A} \times \mathcal{B}$, where addition and multiplication are over the integers, is equal to the number of products that can be formed that contribute to the (i, j) entry of the result matrix C. Thus $p = \sum_{i,j} \delta_{i,j}$. We now show that $p \le \sqrt{S}\,(S - r)$.

Let $\mathcal{A}$ and $\mathcal{B}$ have a and b 1's, respectively, where a + b = S − r. There are at most a/α rows of $\mathcal{A}$ containing at least α 1's. The maximum number of products that can be formed from such rows is ab/α because each 1 in $\mathcal{B}$ can combine with a 1 in each of these rows. Now consider the product of other rows of $\mathcal{A}$ with columns of $\mathcal{B}$. At most S such row-column inner products are formed since at most S outputs can be pebbled. Since each of them involves a row with at most α 1's, at most αS products of pairs of variables can be formed. Thus, a total of at most ab/α + αS products can be formed, that is, $p \le ab/\alpha + \alpha S$. We are free to choose α to minimize this sum ($\alpha = \sqrt{ab/S}$ does this) but must choose a and b to maximize it (a = b = (S − r)/2 satisfies this requirement). The result is that $p \le \sqrt{S}\,(S - r)$. We complete the proof by observing that $\pi = 2p + r - 1 \le 2S\sqrt{S}$ for r ≥ 0.
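As a worked illustration of how this span bound is used, the following sketch substitutes it into the two-level case (L = 2, l = 2, $s_1 = S$) of Corollary 11.4.1. It assumes that the hypothesis of the corollary holds for some β > 1 and that $2S \le n^2$, so that the lemma applies to ρ(2S, G). It also uses the fact that each of the $n^2$ inner product trees has n product vertices and n − 1 addition vertices.

```latex
\begin{align*}
|V| - |\mathrm{In}(G)| &= n^2\,(2n - 1), \\
\rho(2S, G) &\le 2\,(2S)^{3/2} = 4\sqrt{2}\,S^{3/2}, \\
T_2^{(2)}(S, G) &\ge \frac{\beta}{\beta + 1} \cdot \frac{S}{\rho(2S, G)}\,
    \bigl(|V| - |\mathrm{In}(G)|\bigr)
  \ge \frac{\beta}{\beta + 1} \cdot \frac{n^2\,(2n - 1)}{4\sqrt{2}\,\sqrt{S}} .
\end{align*}
```

Under these assumptions the I/O time grows as $n^3/\sqrt{S}$, up to a constant factor.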

Theorem 11.5.2 states bounds that apply to the computation and I/O time in the red-blue pebble game for matrix multiplication.

THEOREM 11.5.2 For every graph G in the family $F_n$ of inner product graphs for multiplying two n × n matrices and for every pebbling strategy P for G in the red-blue pebble game that uses S ≥ 3 red pebbles, the computation and I/O-time satisfy the following lower bounds:

$$T_1^{(2)}(S, G, P) = \Omega(n^3)$$

$$T_2^{(2)}(S, G, P) = \Omega\!\left(\frac{n^3}{\sqrt{S}}\right)$$

Furthermore, there is a pebbling strategy P for G with S ≥ 3 red pebbles such that the following upper bounds hold simultaneously:

$$T_1^{(2)}(S, G, P) = O(n^3)$$

$$T_2^{(2)}(S, G, P) = O\!\left(\frac{n^3}{\sqrt{S}}\right)$$

The lower bound on I/O time stated above applies for every graph of a straight-line program for matrix multiplication in the I/O-limited red-blue pebble game. The upper bound on I/O time

544 Chapter 11 Memory-Hierarchy Tradeoffs Models of Computation

This algorithm performs one input operation on each entry of $a_{i,q}$ and $b_{q,j}$ to compute $c_{i,j}$. It also performs one output operation per entry to compute $c_{i,j}$ itself. Summing over all values of $i$ and $j$, we find that $n^2$ output operations are performed on entries in $C$. Since there are $(n/r)^2$ submatrices $a_{i,q}$ and $b_{q,j}$ and each is used to compute $n/r$ terms $c_{u,v}$, the number of input operations on entries in $A$ and $B$ is $2(n/r)^2 r^2 (n/r) = 2n^3/r$. Because $r = \lfloor \sqrt{S/3} \rfloor$, we have $r \geq \sqrt{S/3} - 1$, from which the upper bound on the number of I/O operations follows. Since each product and addition vertex in each inner product graph is pebbled once, $O(n^3)$ computation steps are performed.

The bound on $T_2^{(2)}(S, G, P)$ for the I/O-limited game follows from two observations. First, the computational inequality of Theorem 10.4.1 provides a lower bound to $T_I$, the number of times that input vertices are pebbled in the red-pebble game when only red pebbles are used on vertices. This is the I/O-limited model. Second, the lower bound of Theorem 10.5.4 on $T$ (actually, $T_I$) is of the form desired.
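The I/O counts above can be checked on a small instance. The following Python sketch is an ordinary blocked matrix multiplication that counts reads and writes; it is not the pebbling strategy itself. The function name `blocked_matmul_io`, the counters, and the reading that three $r \times r$ blocks (one each from $A$, $B$, and $C$) occupy fast memory at once are assumptions made for illustration.

```python
import math


def blocked_matmul_io(A, B, S):
    """Multiply square matrices A and B block by block, counting I/O.

    Hypothetical helper: S plays the role of the number of red pebbles,
    and r = floor(sqrt(S/3)) is the block side used in the text.
    """
    n = len(A)
    r = math.isqrt(S // 3)  # equals floor(sqrt(S/3)) for integer S >= 0
    if r < 1 or n % r != 0:
        raise ValueError("need S >= 3 and r dividing n")
    C = [[0] * n for _ in range(n)]
    inputs = outputs = 0
    for bi in range(0, n, r):
        for bj in range(0, n, r):
            acc = [[0] * r for _ in range(r)]  # running block c_{i,j}
            for bq in range(0, n, r):
                # Read blocks a_{i,q} and b_{q,j}: r*r entries each.
                a = [row[bq:bq + r] for row in A[bi:bi + r]]
                b = [row[bj:bj + r] for row in B[bq:bq + r]]
                inputs += 2 * r * r
                for u in range(r):
                    for v in range(r):
                        acc[u][v] += sum(a[u][w] * b[w][v] for w in range(r))
            # Write the finished block c_{i,j}: r*r entries.
            for u in range(r):
                C[bi + u][bj:bj + r] = acc[u]
            outputs += r * r
    return C, inputs, outputs
```

For $n = 6$ and $S = 12$ the block side is $r = 2$, and the counters agree with the formulas $2n^3/r$ and $n^2$ from the text.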

These results and the strategy given for the two-level case carry over to the multi-level case, although considerable care is needed to ensure that the pebbling strategy does not fragment memory and lead to inefficient upper bounds. Even though the pebbling strategy given below is an I/O-limited strategy, it provides bounds on time in terms of space that match the lower bounds for the standard MHG.

THEOREM 11.5.3 For every graph $G$ in the family $F_n$ of inner product graphs for multiplying two $n \times n$ matrices and for every pebbling strategy $P$ for $G$ in the standard MHG with resource vector $p$ that uses $p_1 \geq 3$ first-level pebbles, the computation and I/O time satisfy the following lower bounds, where $s_l = \sum_{j=1}^{l} p_j$ and $k$ is the largest integer such that $s_k \leq 3n^2$:

$$T_1^{(L)}(p, G, P) = \Omega(n^3)$$

$$
T_l^{(L)}(p, G, P) =
\begin{cases}
\Omega\!\left(n^3 / \sqrt{s_{l-1}}\right) & \text{for } 2 \leq l \leq k \\
\Omega\!\left(n^2\right) & \text{for } k + 1 \leq l \leq L
\end{cases}
$$

Furthermore, there is a pebbling strategy $P$ for $G$ with $p_1 \geq 3$ such that the following upper bounds hold simultaneously:

$$T_1^{(L)}(p, G, P) = O(n^3)$$

$$
T_l^{(L)}(p, G, P) =
\begin{cases}
O\!\left(n^3 / \sqrt{s_{l-1}}\right) & \text{for } 2 \leq l \leq k \\
O\!\left(n^2\right) & \text{for } k + 1 \leq l \leq L
\end{cases}
$$

In the I/O-limited MHG the upper bounds given above apply. The following lower bound on the I/O time applies to every graph $G$ for $n \times n$ matrix multiplication and every pebbling strategy $P$, where $S = s_{L-1}$:

$$T_l^{(L)}(p, G, P) = \Omega\!\left(n^3 / \sqrt{S}\right) \quad \text{for } 1 \leq l \leq L$$
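To see how the threshold $k$ splits the levels, the bound shapes can be evaluated for a concrete resource vector. The Python sketch below ignores all hidden constants, so its values show only the growth terms inside $\Omega(\cdot)$ and $O(\cdot)$ for the standard MHG, not actual step counts. The function name `mhg_bound_shapes` and the sample resource vector are hypothetical.

```python
import math
from itertools import accumulate


def mhg_bound_shapes(n, p):
    """Return k and the growth term of the level-l bound for each level.

    p is the resource vector (p_1, ..., p_L); constants are ignored.
    """
    s = list(accumulate(p))  # s[l-1] holds s_l = p_1 + ... + p_l
    # k is the largest level with s_k <= 3 n^2 (0 if there is none).
    k = max((l for l in range(1, len(p) + 1) if s[l - 1] <= 3 * n * n),
            default=0)
    shapes = {1: float(n ** 3)}  # computation time at level 1
    for l in range(2, len(p) + 1):
        if l <= k:
            shapes[l] = n ** 3 / math.sqrt(s[l - 2])  # n^3 / sqrt(s_{l-1})
        else:
            shapes[l] = float(n * n)  # n^2 once lower levels hold the data
    return k, shapes
```

With $n = 100$ and the hypothetical vector $p = (300, 3000, 10^6)$, the prefix sums are $300$, $3300$, and $1003300$, so $k = 2$: level 2 has the shape $n^3/\sqrt{s_1}$ and level 3 has the shape $n^2$.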

Proof The lower bounds on $T_l^{(L)}(p, G, P)$, $2 \leq l \leq L$, follow from Theorems 11.3.1 and 11.5.2. The lower bound on $T_1^{(L)}(p, G, P)$ follows from the fact that every graph in $F_n$ has $\Theta(n^3)$ vertices to be pebbled.

© John E Savage 11.5 Tradeoffs Between Space and I/O Time 545

[Figure: nested square blocks with side lengths $r_1$, $r_2$, and $r_3$]

Figure 11.7 A three-level decomposition of a matrix.

We now describe a multi-level recursive pebbling strategy satisfying the upper bounds given above. It is based on the two-level strategy given in the proof of Theorem 11.5.2. We compute $C$ from $A$ and $B$ using inner products. Our approach is to successively block $A$, $B$, and $C$ into $r_i \times r_i$ submatrices for $i = k, k - 1, \ldots, 1$, where the $r_i$ are chosen, as suggested in Fig. 11.7, so that they divide one another and avoid memory fragmentation. They are also chosen relative to $s_i$ so that enough pebbles are available to pebble $r_i \times r_i$ submatrices, as explained below.

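$$
r_i =
\begin{cases}
\left\lfloor \sqrt{s_1 / 3} \right\rfloor & i = 1 \\
r_{i-1} \left\lfloor \sqrt{s_i - i + 1} \,/\, \left(\sqrt{3}\, r_{i-1}\right) \right\rfloor & i \geq 2
\end{cases}
$$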
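The recurrence for the block sizes is easy to evaluate. The Python sketch below computes $r_1, r_2, \ldots$ from a resource vector using exact integer arithmetic; the function name `block_sizes` and the sample vector are hypothetical, and the sketch computes a value for every level given rather than stopping at level $k$.

```python
import math
from itertools import accumulate


def block_sizes(p):
    """Compute r_1, r_2, ... from the resource vector p = (p_1, p_2, ...).

    Uses floor(sqrt(x)) == isqrt(floor(x)), so that
    floor(sqrt(s_i - i + 1) / (sqrt(3) * r_prev)) is evaluated as
    isqrt((s_i - i + 1) // (3 * r_prev**2)).
    """
    if p[0] < 3:
        raise ValueError("need p_1 >= 3 so that r_1 >= 1")
    s = list(accumulate(p))  # s[i-1] holds s_i
    r = [math.isqrt(s[0] // 3)]  # r_1 = floor(sqrt(s_1 / 3))
    for i in range(2, len(s) + 1):
        prev = r[-1]
        available = s[i - 1] - i + 1  # s_i - i + 1
        r.append(prev * math.isqrt(available // (3 * prev * prev)))
    return r
```

For the hypothetical vector $p = (12, 100, 1000)$ the prefix sums are $12$, $112$, and $1112$, and the block sizes come out as $2$, $6$, and $18$. Each size divides the next, and $3r_i^2 + i - 1 \leq s_i$ holds at every level.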

Using the fact that $b/2 \leq a \lfloor b/a \rfloor \leq b$ for integers $a$ and $b$ satisfying $1 \leq a \leq b$ (see Problem 11.1), we see that $\sqrt{(s_i - i + 1)/12} \leq r_i \leq \sqrt{(s_i - i + 1)/3}$. Thus, $s_i \geq 3r_i^2 + i - 1$. Also, $r_k^2 \leq n^2$ because $s_k \leq 3n^2$.

By definition, $s_l$ pebbles are available at level $l$ and below. As stated earlier, there is at least one pebble at each level above the first. From the $s_l$ pebbles at level $l$ and below we create a reserve set containing one pebble at each level except the first. This reserve set is used to perform I/O operations as needed.

Without loss of generality, assume that $r_k$ divides $n$. (If not, $n$ must be at most doubled for this to be true. Embed $A$, $B$, and $C$ in such larger matrices.) $A$, $B$, and $C$ are then blocked into $r_k \times r_k$ submatrices (call them $a_{i,j}$, $b_{i,j}$, and $c_{i,j}$), and these in turn are blocked into $r_{k-1} \times r_{k-1}$ submatrices, continuing until $1 \times 1$ submatrices are reached. The submatrix $c_{i,j}$ is defined as


[Figure: inputs $u_1$ and $u_2$, carrying pebbles $p_1$ and $p_2$, each connected to both outputs $v_1$ and $v_2$]

Figure 11.8 A two-input butterfly graph with pebbles $p_1$ and $p_2$ resident on inputs.

obtain an upper bound on the number of pebbled vertices if we assume that both of them are pebbled. In this proof we let $\{p_i \mid 1 \leq i \leq S\}$ denote the $S$ pebbles available to pebble $G$. We assign an integer cost $\mathrm{num}(p_i)$ (initialized to zero) to the $i$th pebble $p_i$ in order to derive an upper bound to the total number of pebble placements made on $G$.

Consider a matching pair of output vertices $v_1$ and $v_2$ of a two-input butterfly graph and their common predecessors $u_1$ and $u_2$, as suggested in Fig. 11.8. Suppose that on the next step we can place a pebble on $v_1$. Then pebbles (call them $p_1$ and $p_2$) must reside on $u_1$ and $u_2$. Advance $p_1$ and $p_2$ to both $v_1$ and $v_2$. (Although the rules stipulate that an additional pebble is needed to advance the two pebbles, violating this restriction by allowing their movement to $v_1$ and $v_2$ can only increase the number of possible moves, a useful effect since we are deriving an upper bound on the number of pebble placements.) After advancing $p_1$ and $p_2$, if $\mathrm{num}(p_1) = \mathrm{num}(p_2)$, augment both by 1; otherwise, augment the smaller by 1. Since the predecessors of two vertices in an FFT graph are in disjoint trees, there is no loss in assuming that all $S$ pebbles remain on the graph in a pebbling that maximizes the number of pebbled vertices. Because two pebble placements are possible each time $\mathrm{num}(p_i)$ increases by 1 for some $i$, $\rho(S, G) \leq 2 \sum_{1 \leq i \leq S} \mathrm{num}(p_i)$.

We now show that the number of vertices that contained pebbles initially and are connected via paths to the vertex covered by $p_i$ is at least $2^{\mathrm{num}(p_i)}$. That is, $2^{\mathrm{num}(p_i)} \leq S$ or $\mathrm{num}(p_i) \leq \log_2 S$, from which the upper bound on $\rho(S, G)$ follows. Our proof is by induction. For the base case of $\mathrm{num}(p_i) = 1$, two pebbles must reside on the two immediate predecessors of a vertex containing the pebble $p_i$. Assume that the hypothesis holds for $\mathrm{num}(p_i) \leq e - 1$. We show that it holds for $\mathrm{num}(p_i) = e$. Consider the first point in time that $\mathrm{num}(p_i) = e$. At this time $p_i$ and a second pebble $p_j$ reside on a matching pair of vertices, $v_1$ and $v_2$. Before these pebbles are advanced to these two vertices from $u_1$ and $u_2$, the immediate predecessors of $v_1$ and $v_2$, the smaller of $\mathrm{num}(p_i)$ and $\mathrm{num}(p_j)$ has a value of $e - 1$. This must be $p_i$ because its value has increased. Thus, each of $u_1$ and $u_2$ has at least $2^{e-1}$ predecessors that contained pebbles initially. Because the predecessors of $u_1$ and $u_2$ are disjoint, each of $v_1$ and $v_2$ has at least $2^e = 2^{\mathrm{num}(p_i)}$ predecessors that carried pebbles initially.
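Combining the two inequalities established in this proof gives the bound in closed form. The chain below is our spelled-out reading of how the upper bound on $\rho(S, G)$ follows; it uses only the statements $\rho(S, G) \leq 2 \sum_i \mathrm{num}(p_i)$ and $\mathrm{num}(p_i) \leq \log_2 S$.

$$
\rho(S, G) \;\leq\; 2 \sum_{1 \leq i \leq S} \mathrm{num}(p_i) \;\leq\; 2 \sum_{1 \leq i \leq S} \log_2 S \;=\; 2 S \log_2 S .
$$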

This upper bound on the S-span is combined with Theorem 11.4.1 to derive a lower bound on the I/O time at level l to pebble the FFT graph. We derive upper bounds that match to within a multiplicative constant when the FFT graph is pebbled in the standard MHG. We develop bounds for the red-blue pebble game and then generalize them to the MHG.


THEOREM 11.5.4 Let the FFT graph on $n = 2^d$ inputs, $F^{(d)}$, be pebbled in the red-blue pebble game with $S$ red pebbles. When $S \geq 3$ there is a pebbling of $F^{(d)}$ such that the following bounds hold simultaneously, where $T_1^{(2)}(p_1, F^{(d)})$ and $T_2^{(2)}(p_1, F^{(d)})$, with $p_1 = S$, are the computation and I/O time in a minimal pebbling of $F^{(d)}$:

$$T_1^{(2)}(S, F^{(d)}) = \Theta(n \log n)$$

$$T_2^{(2)}(S, F^{(d)}) = \Theta\!\left(\frac{n \log n}{\log S}\right)$$

Proof The lower bound on $T_1^{(2)}(S, F^{(d)})$ is obvious; every vertex in $F^{(d)}$ must be pebbled a first time. The lower bound on $T_2^{(2)}(S, F^{(d)})$ follows from Corollary 11.4.1, Theorem 11.3.1, Lemma 11.5.2, and the obvious lower bound on $|V|$. We now exhibit a pebbling strategy giving upper bounds that match the lower bounds up to a multiplicative factor. As shown in Corollary 6.7.1, $F^{(d)}$ can be decomposed into $\lceil d/e \rceil$ stages, $\lfloor d/e \rfloor$ stages containing $2^{d-e}$ copies of $F^{(e)}$ and one stage containing $2^{d-k}$ copies of $F^{(k)}$, $k = d - \lfloor d/e \rfloor e$. (See Fig. 11.9.) The output vertices of one stage are the input vertices to the next. For example, $F^{(12)}$ can be decomposed into three stages with $2^{12-4} = 256$ copies of $F^{(4)}$ on each stage and one stage with $2^{12}$ copies of $F^{(0)}$, a single vertex. (See Fig. 11.10.) We use this decomposition and the observation that $F^{(e)}$ can be pebbled level by level with $2^e + 1$ level-1 pebbles without repebbling any vertex to develop our pebbling strategy for $F^{(d)}$. Given $S$ red pebbles, our pebbling strategy is based on this decomposition with $e = d_0 = \lfloor \log_2 (S - 1) \rfloor$. Since $S \geq 3$, $d_0 \geq 1$. Of the $S$ red pebbles, we actually use only $S_0 = 2^{d_0} + 1$. Since $S_0 \leq S$, the number of I/O operations with $S_0$ red pebbles is no

[Figure: bottom row of sub-FFT graphs $F_{b,1}^{(d-e)}, F_{b,2}^{(d-e)}, \ldots, F_{b,\beta}^{(d-e)}$; top row of sub-FFT graphs $F_{t,1}^{(e)}, F_{t,2}^{(e)}, \ldots, F_{t,\tau}^{(e)}$]

Figure 11.9 Decomposition of the FFT graph $F^{(d)}$ into $\beta = 2^e$ bottom FFT graphs $F^{(d-e)}$ and $\tau = 2^{d-e}$ top $F^{(e)}$. Edges between bottom and top sub-FFT graphs identify common vertices between the two.
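The parameters of the stage decomposition used in the proof of Theorem 11.5.4 can be tabulated for given $d$ and $S$. The Python sketch below is illustrative only: the function name `fft_stage_plan` and the dictionary keys are hypothetical, and it assumes $e = d_0 \leq d$ so that every count is a whole number.

```python
def fft_stage_plan(d, S):
    """Tabulate the decomposition of F^(d) for S red pebbles.

    Uses e = d_0 = floor(log2(S - 1)) and S_0 = 2^{d_0} + 1 as in the proof.
    """
    if S < 3:
        raise ValueError("need S >= 3 so that d_0 >= 1")
    e = (S - 1).bit_length() - 1  # floor(log2(S - 1)) for S - 1 >= 1
    if e > d:
        raise ValueError("this sketch assumes e <= d")
    full_stages = d // e  # floor(d / e) stages of F^(e) copies
    k = d - full_stages * e  # size parameter of the remaining stage
    return {
        "e": e,
        "S0": 2 ** e + 1,  # red pebbles actually used
        "full_stages": full_stages,
        "copies_per_full_stage": 2 ** (d - e),
        "k": k,
        "copies_in_last_stage": 2 ** (d - k),
    }
```

With $d = 12$ and $S = 17$ this gives $e = 4$, $S_0 = 17$, three stages of $256$ copies of $F^{(4)}$, and $k = 0$ with $2^{12}$ single-vertex copies of $F^{(0)}$, matching the $F^{(12)}$ example in the proof.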