Cache-Oblivious Algorithms for Matrix Transpose, FFT, and Sorting, Papers of Computer Science

Cache-oblivious algorithms for matrix transpose, fft, and sorting on computers with multiple levels of caching. The algorithms aim to minimize both work complexity and cache complexity. Detailed analysis of cache complexity for matrix multiplication using the rec-mult algorithm and discusses the cache complexity of other algorithms such as strassen's algorithm and rec-transpose.

Typology: Papers

Pre 2010

Uploaded on 08/19/2009

koofers-user-l0z
koofers-user-l0z 🇺🇸

10 documents

1 / 13

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Cache-Oblivious Algorithms
EXTENDED ABSTRACT
Matteo Frigo Charles E. Leiserson Harald Prokop Sridhar Ramachandran
MIT Laboratory for Computer Science,545 Technology Square,Cambridge, MA 02139
 !"##$%'&(#&*)+%,&-"
Abstract This paper presents asymptotically optimal algo-
rithms for rectangular matrix transpose, FFT, and sorting on
computers with multiple levels of caching. Unlike previous
optimal algorithms, these algorithms are cache oblivious: no
variables dependent on hardware parameters, such as cache
size and cache-line length, need to be tuned to achieve opti-
mality. Nevertheless, these algorithms use an optimal amount
of work and move data optimally among multiple levels of
cache. For a cache with size Zand cache-line length Lwhere
Z
.
/
L2
0
the number of cache misses for an m
1
nma-
trix transpose is Θ
/
1
2
mn
3
L
0
. The number of cache misses
for either an n-point FFT or the sorting of nnumbers is
Θ
/
1
24/
n
3
L
0
/
1
2
logZn
050
. We also give an Θ
/
mnp
0
-work al-
gorithm to multiply an m
1
nmatrix by an n
1
pmatrix that
incurs Θ
/
1
26/
mn
2
np
2
mp
0
3
L
2
mnp
3
L
7
Z
0
cache faults.
We introduce an “ideal-cache” model to analyze our algo-
rithms. We prove that an optimal cache-oblivious algorithm
designed for two levels of memory is also optimal for multi-
ple levels and that the assumption of optimal replacement in
the ideal-cache model can be simulated efficiently by LRUre-
placement. We also provide preliminary empirical results on
the effectiveness of cache-oblivious algorithms in practice.
1. Introduction
Resource-oblivious algorithms that nevertheless use re-
sources efficiently offer advantages of simplicity and
portability over resource-aware algorithms whose re-
source usage must be programmed explicitly. In this
paper, we study cache resources, specifically, the hier-
archy of memories in modern computers. We exhibit
several “cache-oblivious” algorithms that use cache as
effectively as “cache-aware” algorithms.
Before discussing the notion of cache obliviousness,
we first introduce the
8
Z
9
L
:
ideal-cache model to study
the cache complexity of algorithms. This model, which
is illustrated in Figure 1, consists of a computer with a
two-level memory hierarchy consisting of an ideal (data)
cache of Zwords and an arbitrarily large main mem-
ory. Because the actual size of words in a computer is
typically a small, fixed size (4 bytes, 8 bytes, etc.), we
This research was supported in part by the Defense Advanced
Research Projects Agency (DARPA) under Grant F30602-97-1-0270.
Matteo Frigo was supported in part by a Digital Equipment Corpora-
tion fellowship.
Q
cache
misses
organized by
optimal replacement
strategy
Main
Memory
Cache
Z
3
LCache lines
Lines
of length L
CPU
W
work
Figure 1: The ideal-cache model
shall assume that word size is constant; the particular
constant does not affect our asymptotic analyses. The
cache is partitioned into cache lines, each consisting of
Lconsecutive words which are always moved together
between cache and main memory. Cache designers typ-
ically use L
;
1, banking on spatial locality to amortize
the overhead of moving the cache line. We shall gener-
ally assume in this paper that the cache is tall:
Z
<
8
L2
:9
(1)
which is usually true in practice.
The processor can only reference words that reside
in the cache. If the referenced word belongs to a line
already in cache, a cache hit occurs, and the word is
delivered to the processor. Otherwise, a cache miss oc-
curs, and the line is fetched into the cache. The ideal
cache is fully associative [20, Ch. 5]: cache lines can be
stored anywhere in the cache. If the cache is full, a cache
line must be evicted. The ideal cache uses the optimal
off-line strategy of replacing the cache line whose next
access is furthest in the future [7], and thus it exploits
temporal locality perfectly.
Unlike various other hierarchical-memory models
[1, 2, 5, 8] in which algorithms are analyzed in terms of
a single measure, the ideal-cache model uses two mea-
sures. An algorithm with an input of size nis measured
by its work complexity W
8
n
:
—its conventional running
time in a RAM model [4]—and its cache complexity
Q
8
n;Z
9
L
:
—the number of cache misses it incurs as a
pf3
pf4
pf5
pf8
pf9
pfa
pfd

Partial preview of the text

Download Cache-Oblivious Algorithms for Matrix Transpose, FFT, and Sorting and more Papers Computer Science in PDF only on Docsity!

Cache-Oblivious Algorithms

EXTENDED ABSTRACT

Matteo Frigo Charles E. Leiserson Harald Prokop Sridhar Ramachandran

MIT Laboratory for Computer Science       !, 545 Technology Square "##$ %'&( #, Cambridge, MA 02139 &*)+%,&-"

Abstract This paper presents asymptotically optimal algo- rithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious : no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve opti- mality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size Z and cache-line length L where Z. Ω / L^2 0 the number of cache misses for an m 1 n ma- trix transpose is Θ / 1 2 mn 3 L^0. The number of cache misses for either an n -point FFT or the sorting of n numbers is Θ/ 1 2 4/ n 3 L^0 / 1 2 log Z n^0 50. We also give an Θ / mnp^0 -work al- gorithm to multiply an m 1 n matrix by an n 1 p matrix that incurs Θ / 1 2 6/ mn 2 np 2 mp 03 L 2 mnp 3 L 7 Z 0 cache faults. We introduce an “ideal-cache” model to analyze our algo- rithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multi- ple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU re- placement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.

1. Introduction

Resource-oblivious algorithms that nevertheless use re- sources efficiently offer advantages of simplicity and portability over resource-aware algorithms whose re- source usage must be programmed explicitly. In this paper, we study cache resources, specifically, the hier- archy of memories in modern computers. We exhibit several “cache-oblivious” algorithms that use cache as effectively as “cache-aware” algorithms. Before discussing the notion of cache obliviousness, we first introduce the 8 Z 9 L : ideal-cache model to study the cache complexity of algorithms. This model, which is illustrated in Figure 1, consists of a computer with a two-level memory hierarchy consisting of an ideal (data) cache of Z words and an arbitrarily large main mem- ory. Because the actual size of words in a computer is typically a small, fixed size (4 bytes, 8 bytes, etc.), we

This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270. Matteo Frigo was supported in part by a Digital Equipment Corpora- tion fellowship.

Q

cache misses

organized by optimal replacement strategy

Main Memory

Cache

Z 3 L Cache lines

Lines of length L

CPU W work

Figure 1: The ideal-cache model

shall assume that word size is constant; the particular constant does not affect our asymptotic analyses. The cache is partitioned into cache lines , each consisting of L consecutive words which are always moved together between cache and main memory. Cache designers typ- ically use L ; 1, banking on spatial locality to amortize the overhead of moving the cache line. We shall gener- ally assume in this paper that the cache is tall : Z < Ω 8 L^2 :9 (1)

which is usually true in practice. The processor can only reference words that reside in the cache. If the referenced word belongs to a line already in cache, a cache hit occurs, and the word is delivered to the processor. Otherwise, a cache miss oc- curs, and the line is fetched into the cache. The ideal cache is fully associative [20, Ch. 5]: cache lines can be stored anywhere in the cache. If the cache is full, a cache line must be evicted. The ideal cache uses the optimal off-line strategy of replacing the cache line whose next access is furthest in the future [7], and thus it exploits temporal locality perfectly. Unlike various other hierarchical-memory models [1, 2, 5, 8] in which algorithms are analyzed in terms of a single measure, the ideal-cache model uses two mea- sures. An algorithm with an input of size n is measured by its work complexity W 8 n :—its conventional running time in a RAM model [4]—and its cache complexity Q 8 n ; Z 9 L : —the number of cache misses it incurs as a

function of the size Z and line length L of the ideal cache. When Z and L are clear from context, we denote the cache complexity simply as Q 8 n : to ease notation. We define an algorithm to be cache aware if it con- tains parameters (set at either compile-time or runtime) that can be tuned to optimize the cache complexity for the particular cache size and line length. Otherwise, the algorithm is cache oblivious. Historically, good perfor- mance has been obtained using cache-aware algorithms, but we shall exhibit several optimal^1 cache-oblivious al- gorithms. To illustrate the notion of cache awareness, consider the problem of multiplying two n n matrices A and B to produce their n n product C. We assume that the three matrices are stored in row-major order, as shown in Figure 2(a). We further assume that n is “big,” i.e., n ; L , in order to simplify the analysis. The conventional way to multiply matrices on a computer with caches is to use a blocked algorithm [19, p. 45]. The idea is to view each matrix M as consisting of 8 n

 s : 8 n

 s : submatrices Mi j (the blocks), each of which has size s s , where s is a tuning parame- ter. The following algorithm implements this strategy:

BLOCK-MULT / A  B  C  n^0 1 for i  1 to n 3 s 2 do for j  1 to n 3 s 3 do for k  1 to n 3 s 4 do ORD-MULT / Aik  Bk j  Ci j  s^0

The ORD-MULT 8 A 9 B 9 C 9 s : subroutine computes C  C  AB on s s matrices using the ordinary O 8 s^3 : algorithm. (This algorithm assumes for simplicity that s evenly divides n , but in practice s and n need have no special relationship, yielding more complicated code in the same spirit.) Depending on the cache size of the machine on which BLOCK-MULT is run, the parameter s can be tuned to make the algorithm run fast, and thus BLOCK-MULT is a cache-aware algorithm. To minimize the cache com- plexity, we choose s to be the largest value such that the three s s submatrices simultaneously fit in cache. An s s submatrix is stored on Θ 8 s  s^2  L : cache lines. From the tall-cache assumption (1), we can see that s < Θ 8  Z :. Thus, each of the calls to ORD-MULT runs with at most Z

 L < Θ 8 s^2  L : cache misses needed to bring the three matrices into the cache. Consequently, the cache complexity of the entire algorithm is Θ 8 1 

n^2  L  8 n

  Z :^3 8 Z

 L : :< Θ 81  n^2  L  n^3  L  Z :, since the algorithm has to read n^2 elements, which reside on n^2  L cache lines. The same bound can be achieved using a simple (^1) For simplicity in this paper, we use the term “optimal” as a syn-

onym for “asymptotically optimal,” since all our analyses are asymp- totic.

cache-oblivious algorithm that requires no tuning pa- rameters such as the s in BLOCK-MULT. We present such an algorithm, which works on general rectangular matrices, in Section 2. The problems of computing a matrix transpose and of performing an FFT also suc- cumb to remarkably simple algorithms, which are de- scribed in Section 3. Cache-oblivious sorting poses a more formidable challenge. In Sections 4 and 5, we present two sorting algorithms, one based on mergesort and the other on distribution sort, both of which are op- timal in both work and cache misses. The ideal-cache model makes the perhaps- questionable assumptions that there are only two levels in the memory hierarchy, that memory is man- aged automatically by an optimal cache-replacement strategy, and that the cache is fully associative. We address these assumptions in Section 6, showing that to a certain extent, these assumptions entail no loss of generality. Section 7 discusses related work, and Section 8 offers some concluding remarks, including some preliminary empirical results.

2. Matrix multiplication

This section describes and analyzes a cache-oblivious al- gorithm for multiplying an m n matrix by an n p ma- trix cache-obliviously using Θ 8 mnp :work and incurring Θ 8 m  n  p  8 mn  np  mp :

 L  mnp

 L  Z : cache misses. These results require the tall-cache assumption (1) for matrices stored in row-major layout format, but the assumption can be relaxed for certain other layouts. We also show that Strassen’s algorithm [31] for multi- plying n n matrices, which uses Θ 8 n lg7^ : work^2 , incurs Θ 81  n^2

 L  n lg^  L  Z : cache misses. In [9] with others, two of the present authors analyzed an optimal divide-and-conquer algorithm for n n ma- trix multiplication that contained no tuning parameters, but we did not study cache-obliviousness per se. That algorithm can be extended to multiply rectangular matri- ces. To multiply a m n matrix A and a n p matrix B , the REC-MULT algorithm halves the largest of the three dimensions and recurs according to one of the following three cases: A 1 A 2

B < A^1 B A 2 B

9 (2)

A 1 A 2

B 1

B 2

< A 1 B 1  A 2 B 2 9 (3)

A B 1 B 2 < AB 1 AB 2  (4) In case (2), we have m  max  n 9 p . Matrix A is split horizontally, and both halves are multiplied by matrix B. In case (3), we have n  max  m 9 p . Both matrices are (^2) We use the notation lg to denote log 2.

in column-major order (Figure 2(b)), but the assumption that Z < Ω 8 L^2 : can be relaxed for certain other matrix layouts. The s s -blocked layout (Figure 2(c)), for some tuning parameter s , can be used to achieve the same bounds with the weaker assumption that the cache holds at least some sufficiently large constant number of lines. The cache-oblivious bit-interleaved layout (Figure 2(d)) has the same advantage as the blocked layout, but no tuning parameter need be set, since submatrices of size O 8  L : O 8  L : are cache-obliviously stored on O 81 : cache lines. The advantages of bit-interleaved and re- lated layouts have been studied in [11, 12, 16]. One of the practical disadvantages of bit-interleaved layouts is that index calculations on conventional microprocessors can be costly, a deficiency we hope that processor archi- tects will remedy. For square matrices, the cache complexity Q 8 n : < Θ 8 n  n^2  L  n^3  L  Z : of the REC-MULT algorithm is the same as the cache complexity of the cache-aware BLOCK-MULT algorithm and also matches the lower bound by Hong and Kung [21]. This lower bound holds for all algorithms that execute the Θ 8 n^3 : opera- tions given by the definition of matrix multiplication

ci j <

n

k 1

aikbk j 

No tight lower bounds for the general problem of matrix multiplication are known. By using an asymptotically faster algorithm, such as Strassen’s algorithm [31] or one of its variants [37], both the work and cache complexity can be reduced. When multiplying n n matrices, Strassen’s algorithm, which is cache oblivious, requires only 7 recursive multiplica- tions of n

 2 n

 2 matrices and a constant number of matrix additions, yielding the recurrence

Q 8 n :



 Θ 81  n  n^2

 L : if n^2  α Z 9 7 Q 8 n

 2 : O 8 n^2

 L : otherwise ;

where α is a sufficiently small constant. The solution to this recurrence is Θ 8 n  n^2  L  n lg^  L  Z :.

3. Matrix transposition and FFT

This section describes a recursive cache-oblivious al- gorithm for transposing an m n matrix which uses O 8 mn : work and incurs O 81  mn

 L : cache misses, which is optimal. Using matrix transposition as a sub- routine, we convert a variant [36] of the “six-step” fast Fourier transform (FFT) algorithm [6] into an optimal cache-oblivious algorithm. This FFT algorithm uses O 8 n lg n : work and incurs O 81 ^8 n

 L :8^1  log Z n : : cache misses. The problem of matrix transposition is defined as fol- lows. Given an m n matrix stored in a row-major lay- out, compute and store AT^ into an n m matrix B also

stored in a row-major layout. The straightforward algo- rithm for transposition that employs doubly nested loops incurs Θ 8 mn : cache misses on one of the matrices when m  Z

 L and n  Z

 L , which is suboptimal. Optimal work and cache complexities can be ob- tained with a divide-and-conquer strategy, however. If n  m , the REC-TRANSPOSE algorithm partitions

A < 8 A 1 A 2 : 9 B <

B 1

B 2

and recursively executes REC-TRANSPOSE 8 A 1 9 B 1 : and REC-TRANSPOSE 8 A 2 9 B 2 :. Otherwise, it divides matrix A horizontally and matrix B vertically and likewise per- forms two transpositions recursively. The next two lem- mas provide upper and lower bounds on the performance of this algorithm.

Lemma 2 The REC-TRANSPOSE algorithm involves O 8 mn : work and incurs O 81  mn

 L : cache misses for an m n matrix.

Proof. That the algorithm does O 8 mn : work is straight- forward. For the cache analysis, let Q 8 m 9 n : be the cache complexity of transposing an m n matrix. We as- sume that the matrices are stored in row-major order, the column-major layout having a similar analysis. Let α be a constant sufficiently small such that two submatrices of size m n and n m , where max  m 9 n 



α L , fit completely in the cache even if each row is stored in a different cache line. We distinguish the three cases.

Case I: max  m 9 n 

 α L. Both the matrices fit in O 81 :  2 mn

 L lines. From the choice of α, the number of lines required is at most Z

 L. Therefore Q 8 m 9 n :< O 81  mn

 L :.

Case II: m

 α L  n or n

 α L  m. Suppose first that m

 α L  n. The REC-TRANSPOSE algorithm divides the greater dimension n by 2 and performs divide and conquer. At some point in the recursion, n falls into the range α L

 2

 n

 α L , and the whole problem fits in cache. Because the layout is row-major, at this point the input array has n rows and m columns, and it is laid out in contiguous locations, requiring at most O 81  nm

 L : cache misses to be read. The output array consists of nm elements in m rows, where in the worst case every row lies on a different cache line. Consequently, we incur at most O 8 m  nm

 L : for writing the output array. Since n  α L

 2, the total cache complexity for this base case is O 81  m :. These observations yield the recurrence

Q 8 m 9 n :



 O 81  m : if n  α L

 2 9 α L   2 Q 8 m 9 n

 2 :  O 81 : otherwise ;

whose solution is Q 8 m 9 n : < O 81  mn

 L :. The case n

 α L  m is analogous.

Case III: m 9 n ; α L. As in Case II, at some point in the

recursion both n and m fall into the range α L

2 9 α L .

The whole problem fits into cache and can be solved

with at most O 8 m  n  mn

L :cache misses. The cache

complexity thus satisfies the recurrence

Q 8 m 9 n :

O 8 m  n  mn

L : if m 9 n  α L

2 9 α L 

2 Q 8 m

29 n : O 81 : if m  n 9

2 Q 8 m 9 n

2 : O 81 : otherwise;

whose solution is Q 8 m 9 n : < O 81  mn

L :.

Theorem 3 The REC-TRANSPOSE algorithm has opti- mal cache complexity.

Proof. For an m n matrix, the algorithm must write to

mn distinct elements, which occupy at least  mn

L  <

Ω 81  mn

L : cache lines.

As an example of an application of this cache- oblivious transposition algorithm, in the rest of this sec- tion we describe and analyze a cache-oblivious algo- rithm for computing the discrete Fourier transform of a complex array of n elements, where n is an exact power of 2. The basic algorithm is the well-known “six-step” variant [6, 36] of the Cooley-Tukey FFT algorithm [13]. Using the cache-oblivious transposition algorithm, how- ever, the FFT becomes cache-oblivious, and its perfor- mance matches the lower bound by Hong and Kung [21]. Recall that the discrete Fourier transform (DFT) of an array X of n complex numbers is the array Y given by

Y  i  <

n  1

j 0

X  j  ω n i j 9 (9)

where ω n < e^2 π^  ^1 ^ n^ is a primitive n th root of unity,

and 0

i  n. Many algorithms evaluate Equation (9) in

O 8 n lg n : time for all integers n [15]. In this paper, how-

ever, we assume that n is an exact power of 2, and we compute Equation (9) according to the Cooley-Tukey al- gorithm, which works recursively as follows. In the base

case where n < O 81 :, we compute Equation (9) directly.

Otherwise, for any factorization n < n 1 n 2 of n , we have

Y  i 1  i 2 n 1 < (10)

n 2  1

j 2 0

n 1  1

j 1 0

X  j 1 n 2  j 2 ω n  1 i 1 j^1 ω n  i^1 j^2 ω n 2 i 2 j^2 

Observe that both the inner and outer summations in Equation (10) are DFT’s. Operationally, the computa- tion specified by Equation (10) can be performed by computing n 2 transforms of size n 1 (the inner sum), mul-

tiplying the result by the factors ω n  i^1 j^2 (called the twid-

dle factors [15]), and finally computing n 1 transforms of size n 2 (the outer sum).

We choose n 1 to be 2 lg n ^2 ^ and n 2 to be 2 lg n ^2 ^. The

recursive step then operates as follows:

1. Pretend that input is a row-major n 1 n 2 matrix A.

Transpose A in place, i.e., use the cache-oblivious REC-TRANSPOSE algorithm to transpose A onto an auxiliary array B , and copy B back onto A. Notice

that if n 1 < 2 n 2 , we can consider the matrix to be

made up of records containing two elements.

  1. At this stage, the inner sum corresponds to a DFT of the n 2 rows of the transposed matrix. Compute these n 2 DFT’s of size n 1 recursively. Observe that, because of the previous transposition, we are trans- forming a contiguous array of elements.
  2. Multiply A by the twiddle factors, which can be computed on the fly with no extra cache misses.
  3. Transpose A in place, so that the inputs to the next stage are arranged in contiguous locations.
  4. Compute n 1 DFT’s of the rows of the matrix recur- sively.
  5. Transpose A in place so as to produce the correct output order. It can be proven by induction that the work com-

plexity of this FFT algorithm is O 8 n lg n :. We now an-

alyze its cache complexity. The algorithm always op- erates on contiguous data, by construction. Thus, by the tall-cache assumption (1), the transposition oper- ations and the twiddle-factor multiplication require at

most O 81  n

L : cache misses. Thus, the cache com-

plexity satisfies the recurrence

Q 8 n :

O 81  n

L : 9 if n

α Z 9

n 1 Q 8 n 2 :  n 2 Q 8 n 1 : otherwise ;

 O 81  n

L :

where α ; 0 is a constant sufficiently small that a sub-

problem of size α Z fits in cache. This recurrence has solution

Q 8 n : #< O 81  8 n

L :%8 1  log Z n : *: 9

which is optimal for a Cooley-Tukey algorithm, match- ing the lower bound by Hong and Kung [21] when n is an exact power of 2. As with matrix multiplication, no tight lower bounds for cache complexity are known for the general DFT problem.

4. Funnelsort

Cache-oblivious algorithms, like the familiar two-way merge sort, are not optimal with respect to cache misses. The Z -way mergesort suggested by Aggarwal and Vit- ter [3] has optimal cache complexity, but although it ap- parently works well in practice [23], it is cache aware. This section describes a cache-oblivious sorting algo- rithm called “funnelsort.” This algorithm has optimal

Lemma 6 If Z < Ω 8 L^2 : , then a k-merger operates with

at most

Q M 8 k : < O 81  k  k^3

L  k^3 log Z k

L :

cache misses.

Proof. There are two cases: either k  α  Z or k ;

α  Z , where α is a sufficiently small constant.

Case I: k  α  Z. By Lemma 4, the data structure

associated with the k -merger requires at most O 8 k^2 :<

O 8 Z : contiguous memory locations, and therefore it fits

into cache. The k -merger has k input queues from

which it loads O 8 k^3 : elements. Let ri be the number

of elements extracted from the i th input queue. Since

k  α  Z and the tall-cache assumption (1) implies that

L < O 8  Z :, there are at least Z

L < Ω 8 k : cache lines

available for the input buffers. Lemma 5 applies, whence the total number of cache misses for accessing the input queues is

k

i 1

O 81  ri

L^ : < O^8 k^^  k

3  L :

Similarly, Lemma 4 implies that the cache complexity

of writing the output queue is O 81  k^3

L :. Finally, the

algorithm incurs O 81  k^2

L : cache misses for touching

its internal data structures. The total cache complexity is

therefore Q M 8 k : < O 81  k  k^3

L :.

Case I: k ; α  Z. We prove by induction on k that

whenever k ; α Z , we have

Q M 8 k :

ck^3 log Z k

L  A 8 k :9 (12)

where A 8 k : < k 81  2 c log Z k

L :< o 8 k^3 :. This particular

value of A 8 k : will be justified at the end of the analysis.

The base case of the induction consists of values of

k such that α Z^1 ^4  k  α Z. (It is not sufficient only

to consider k < Θ 8  Z :, since k can become as small as

Θ 8 Z^1 ^4 : in the recursive calls.) The analysis of the first

case applies, yielding Q M 8 k : < O 81  k  k^3

L :. Be-

cause k^2 ; α Z < Ω 8 L : and k < Ω 81 : , the last term

dominates, which implies Q M 8 k :+< O 8 k^3

L :. Conse-

quently, a big enough value of c can be found that satis- fies Inequality (12).

For the inductive case, suppose that k ; α  Z. The

k -merger invokes the  k -mergers recursively. Since

α Z^1 ^4   k  k , the inductive hypothesis can be used to

bound the number Q M 8  k :of cache misses incurred by

the submergers. The “right” merger R is invoked exactly

k^3 ^2 times. The total number l of invocations of “left”

mergers is bounded by l  k^3 ^2  2  k. To see why, con-

sider that every invocation of a left merger puts k^3 ^2 el-

ements into some buffer. Since k^3 elements are output

and the buffer space is 2 k^2 , the bound l  k^3 ^2  2  k

follows.

Before invoking R , the algorithm must check every buffer to see whether it is empty. One such check re-

quires at most  k cache misses, since there are  k

buffers. This check is repeated exactly k^3 ^2 times, lead-

ing to at most k^2 cache misses for all checks. These considerations lead to the recurrence

Q M 8 k :

2 k^3 ^2  2  k  Q M 8  k : k^2 

Application of the inductive hypothesis and the choice

A 8 k :< k 81  2 c log Z k

L : yields Inequality (12) as fol-

lows:

Q M 8 k :

2 k^3 ^2  2  k  Q M 8  k : k^2

k^3 ^2   k 

ck^3 ^2 log Z k

2 L

 A 8  k :  k^2

ck^3 log Z k

L  k^2 81  c log Z k

L :



2 k^3 ^2  2  k  A 8  k :

ck^3 log Z k

L  A 8 k : 

Theorem 7 To sort n elements, funnelsort incurs O 81 

8 n

L :-8 1  log Z n :: cache misses.

Proof. If n  α Z for a small enough constant α, then the algorithm fits into cache. To see why, observe that only one k -merger is active at any time. The biggest

k -merger is the top-level n^1 ^3 -merger, which requires

O 8 n^2 ^3 :  O 8 n : space. The algorithm thus can operate

in O 81  n

L :cache misses.

If N ; α Z , we have the recurrence

Q 8 n : < n^1 ^3 Q 8 n^2 ^3 : Q M 8 n^1 ^3 : 

By Lemma 6, we have Q M 8 n^1 ^3 :^ < O 81  n^1 ^3  n

L 

n log Z n

L :.

By the tall-cache assumption (1), we have n

L <

Ω 8 n^1 ^3 :^. Moreover, we also have n^1 ^3 < Ω 81 :and lg n <

Ω 8 lg Z :. Consequently, Q M 8 n^1 ^3 :^ < O 8 n log Z n

L :

holds, and the recurrence simplifies to

Q 8 n : #< n^1 ^3 Q 8 n^2 ^3 :^  O 8 n log Z n

L : 

The result follows by induction on n. This upper bound matches the lower bound stated by the next theorem, proving that funnelsort is cache- optimal. Theorem 8 The cache complexity of any sorting algo-

rithm is Q 8 n : #< Ω 81  8 n

L : -8 1  log Z n : *:.

Proof. Aggarwal and Vitter [3] show that there is an

Ω 88 n

L : log Z  L 8 n

Z :*: bound on the number of cache

misses made by any sorting algorithm on their “out-of- core” memory model, a bound that extends to the ideal- cache model. The theorem can be proved by apply-

ing the tall-cache assumption Z < Ω 8 L^2 : and the trivial

lower bounds of Q 8 n : < Ω 81 : and Q 8 n : #< Ω 8 n

L :.

5. Distribution sort

In this section, we describe another cache-oblivious op- timal sorting algorithm based on distribution sort. Like the funnelsort algorithm from Section 4, the distribution-

sorting algorithm uses O 8 n lg n : work to sort n elements,

and it incurs O 81  8 n

L :%8 1  log Z n :: cache misses.

Unlike previous cache-efficient distribution-sorting al- gorithms [1, 3, 25, 34, 36], which use sampling or other techniques to find the partitioning elements before the distribution step, our algorithm uses a “bucket splitting” technique to select pivots incrementally during the dis- tribution step. Given an array A (stored in contiguous locations) of length n , the cache-oblivious distribution sort operates as follows:

1. Partition A into  n contiguous subarrays of size

 n. Recursively sort each subarray.

  1. Distribute the sorted subarrays into q buckets

B 1 9  9 Bq of size n 1 9  9 nq , respectively, such that

1. max  x

x  Bi 

min^  x^

x  Bi  1  for i <

1 92 9  9 q  1.

  1. ni

2 ^ n^ for^ i^^ <^1 92 9 ^9 q.

(See below for details.)

  1. Recursively sort each bucket.
  2. Copy the sorted buckets to array A.

A stack-based memory allocator is used to exploit spatial locality.

The goal of Step 2 is to distribute the sorted subarrays

of A into q buckets B 1 9 B 2 9  9 Bq. The algorithm main-

tains two invariants. First, at any time each bucket holds

at most 2 n elements, and any element in bucket Bi is

smaller than any element in bucket Bi  1. Second, every bucket has an associated pivot. Initially, only one empty bucket exists with pivot ∞.

The idea is to copy all elements from the subarrays into the buckets while maintaining the invariants. We keep state information for each subarray and bucket. The state of a subarray consists of the index next of the next element to be read from the subarray and the bucket number bnum where this element should be copied. By

convention, bnum < ∞ if all elements in a subarray have

been copied. The state of a bucket consists of the pivot and the number of elements currently in the bucket.

We would like to copy the element at position next of a subarray to bucket bnum. If this element is greater than the pivot of bucket bnum , we would increment bnum un- til we find a bucket for which the element is smaller than the pivot. Unfortunately, this basic strategy has poor caching behavior, which calls for a more complicated procedure.

The distribution step is accomplished by the recur-

sive procedure DISTRIBUTE 8 i 9 j 9 m : which distributes

elements from the i th through 8 i  m  1 :th subarrays

into buckets starting from B (^) j. Given the precondition

that each subarray i 9 i  1 9  9 i  m  1 has its bnum  j ,

the execution of DISTRIBUTE 8 i 9 j 9 m : enforces the post-

condition that subarrays i 9 i  1 9  9 i  m  1 have their

bnum  j  m. Step 2 of the distribution sort invokes

DISTRIBUTE 81 91 9  n :. The following is a recursive im-

plementation of DISTRIBUTE:

DISTRIBUTE 8 i 9 j 9 m :

1 if m < 1

2 then COPYELEMS 8 i 9 j :

3 else DISTRIBUTE 8 i 9 j 9 m

4 DISTRIBUTE 8 i  m

2 9 j 9 m

5 DISTRIBUTE 8 i 9 j  m

2 9 m

6 DISTRIBUTE 8 i  m

2 9 j  m

2 9 m

In the base case, the procedure COPYELEMS 8 i 9 j :

copies all elements from subarray i that belong to

bucket j. If bucket j has more than 2  n elements af-

ter the insertion, it can be split into two buckets of size

at least  n. For the splitting operation, we use the deter-

ministic median-finding algorithm [14, p. 189] followed by a partition.

Lemma 9 The median of n elements can be found

cache-obliviously using O 8 n : work and incurring O 81 

n

L : cache misses.

Proof. See [14, p. 189] for the linear-time median find- ing algorithm and the work analysis. The cache com- plexity is given by the same recurrence as the work com- plexity with a different base case.

Q 8 m :#<

O 81  m

L : if m

α Z 9

Q 8  m

5  :  Q 87 m

10  6 : otherwise ;

 O 81  m

L :

where α is a sufficiently small constant. The result fol- lows.

In our case, we have buckets of size 2  n  1. In ad-

dition, when a bucket splits, all subarrays whose bnum is greater than the bnum of the split bucket must have their bnum ’s incremented. The analysis of DISTRIBUTE is given by the following lemma.

Lemma 10 The distribution step involves O 8 n : work,

incurs O 81  n

L : cache misses, and uses O 8 n : stack

space to distribute n elements.

Proof. In order to simplify the analysis of the work used by DISTRIBUTE, assume that COPYELEMS uses

O 81 : work for procedural overhead. We will account for

the work due to copying elements and splitting of buck- ets separately. The work of DISTRIBUTE is described by

lemma that bounds the effectiveness of the LRU simu- lation. We then show that algorithms whose complex- ity bounds satisfy a simple regularity condition (includ- ing all algorithms heretofore presented) can be ported to caches incorporating an LRU replacement policy.

Lemma 12 Consider an algorithm that causes Q

8 n ; Z 9 L : cache misses on a problem of size n using a 8 Z 9 L : ideal cache. Then, the same algorithm incurs Q 8 n ; Z 9 L :

 2 Q

8 n ; Z

 2 9 L : cache misses on a 8 Z 9 L : cache that uses LRU replacement.

Proof. Sleator and Tarjan [30] have shown that the cache misses on a 8 Z 9 L : cache using LRU replacement are 8 Z

 L :

 88 Z  Z

:

 L  1 : -competitive with optimal replacement on a 8 Z

9 L : ideal cache if both caches start empty. It follows that the number of misses on a 8 Z 9 L : LRU-cache is at most twice the number of misses on a 8 Z

 2 9 L : ideal-cache.

Corollary 13 For any algorithm whose cache- complexity bound Q 8 n ; Z 9 L : in the ideal-cache model satisfies the regularity condition

Q 8 n ; Z 9 L : < O 8 Q 8 n ; 2 Z 9 L :: 9 (14)

the number of cache misses with LRU replacement is Θ 8 Q 8 n ; Z 9 L : :.

Proof. Follows directly from (14) and Lemma 12.

The second assumption we shall eliminate is the as- sumption of only two levels of memory. Although mod- els incorporating multiple levels of caches may be nec- essary to analyze some algorithms, for cache-oblivious algorithms, analysis in the two-level ideal-cache model suffices. Specifically, optimal cache-oblivious algo- rithms also perform optimally in computers with mul- tiple levels of LRU caches. We assume that the caches satisfy the inclusion property [20, p. 723], which says that the values stored in cache i are also stored in cache i  1 (where cache 1 is the cache closest to the proces- sor). We also assume that if two elements belong to the same cache line at level i , then they belong to the same line at level i  1. Moreover, we assume that cache i  1 has strictly more cache lines than cache i. These as- sumptions ensure that cache i  1 includes the contents of cache i plus at least one more cache line. The multilevel LRU cache operates as follows. A hit on an element in cache i is served by cache i and is not seen by higher-level caches. We consider a line in cache i  1 to be marked if any element stored on the line be- longs to cache i. When cache i misses on an access, it recursively fetches the needed line from cache i  1, re- placing the least-recently accessed unmarked cache line. The replaced cache line is then brought to the front of cache 8 i  1 :’s LRU list. Because marked cache lines are

never replaced, the multilevel cache maintains the inclu- sion property. The next lemma, whose proof is omitted, asserts that even though a cache in a multilevel model does not see accesses that hit at lower levels, it neverthe- less behaves like the first-level cache of a simple two- level model, which sees all the memory accesses.

Lemma 14 A 8 Zi 9 Li : -cache at a given level i of a mul- tilevel LRU model always contains the same cache lines as a simple 8 Zi 9 Li : -cache managed by LRU that serves the same sequence of memory accesses.

Lemma 15 An optimal cache-oblivious algorithm whose cache complexity satisifies the regularity condi- tion (14) incurs an optimal number of cache misses on each level^3 of a multilevel cache with LRU replacement.

Proof. Let cache i in the multilevel LRU model be a 8 Zi 9 Li : cache. Lemma 14 says that the cache holds ex- actly the same elements as a 8 Zi 9 Li : cache in a two-level LRU model. From Corollary 13, the cache complex- ity of a cache-oblivious algorithm working on a 8 Zi 9 Li : LRU cache lower-bounds that of any cache-aware algo- rithm for a 8 Zi 9 Li : ideal cache. A 8 Zi 9 Li : level in a mul- tilevel cache incurs at least as many cache misses as a 8 Zi 9 Li : ideal cache when the same algorithm is executed.

Finally, we remove the two assumptions of automatic replacement and full associativity. Specifically, we shall show that a fully associative LRU cache can be main- tained in ordinary memory with no asymptotic loss in expected performance.

Lemma 16 A 8 Z 9 L : LRU-cache can be maintained us- ing O 8 Z : memory locations such that every access to a cache line in memory takes O 81 : expected time. Proof. Given the address of the memory location to be accessed, we use a 2-universal hash function [24, p. 216] to maintain a hash table of cache lines present in the memory. The Z

 L entries in the hash table point to linked lists in a heap of memory that contains Z

 L records corresponding to the cache lines. The 2- universal hash function guarantees that the expected size of a chain is O 81 :. All records in the heap are organized as a doubly linked list in the LRU order. Thus, the LRU policy can be implemented in O 81 : expected time using O 8 Z

 L : records of O 8 L : words each.

(^3) Alpern, Carter and Feig [5] show that optimality on each level of memory in the UMH model does not necessarily imply global optimal- ity. The UMH model incorporates a single cost measure that combines the costs of work and cache faults at each of the levels of memory. By analyzing the levels independently, our multilevel ideal-cache model remains agnostic about the various schemes by which work and cache faults might be combined.

Theorem 17 An optimal cache-oblivious algorithm whose cache-complexity bound satisfies the regularity condition (14) can be implemented optimally in expec- tation in multilevel models with explicit memory man- agement.

Proof. Combine Lemma 15 and Lemma 16.

Corollary 18 The recursive cache-oblivious algorithms for matrix multiplication, matrix transpose, FFT, and sorting are optimal in multilevel models with explicit memory management.

Proof. Their complexity bounds satisfy the regularity condition (14).

It can also be shown [26] that cache-oblivous algo- rithms satisfying (14) are also optimal (in expectation) in the previously studied SUMH [5, 34] and HMM [1] models. Thus, all the algorithmic results in this paper apply to these models, matching the best bounds previ- ously achieved. Other simulation results can be shown. For example, by using the copying technique of [22], cache-oblivious algorithms for matrix multiplication and other problems can be designed that are provably optimal on direct- mapped caches.

7. Related work

In this section, we discuss the origin of the notion of cache-obliviousness. We also give an overview of other hierarchical memory models. Our research group at MIT noticed as far back as 1994 that divide-and-conquer matrix multiplication was a cache-optimal algorithm that required no tuning, but we did not adopt the term “cache-oblivious” until 1997. This matrix-multiplication algorithm, as well as a cache- oblivious algorithm for LU-decomposition without piv- oting, eventually appeared in [9]. Shortly after leaving our research group, Toledo [32] independently proposed a cache-oblivious algorithm for LU-decomposition with pivoting. For n n matrices, Toledo’s algorithm uses Θ 8 n^3 : work and incurs Θ 81  n^2  L  n^3  L  Z : cache misses. More recently, our group has produced an FFT library called FFTW [18], which in its most recent incar- nation [17], employs a register-allocation and schedul- ing algorithm inspired by our cache-oblivious FFT al- gorithm. The general idea that divide-and-conquer en- hances memory locality has been known for a long time [29]. Previous theoretical work on understanding hierar- chical memories and the I/O-complexity of algorithms has been studied in cache-aware models lacking an auto- matic replacement strategy, although [10, 28] are recent

0

0 200 400 600 800 1000 1200

Time (microseconds)

N

iterative recursive

Figure 4: Average time to transpose an N N matrix, divided by N^2.

exceptions. Hong and Kung [21] use the red-blue peb- ble game to prove lower bounds on the I/O-complexity of matrix multiplication, FFT, and other problems. The red-blue pebble game models temporal locality using two levels of memory. The model was extended by Savage [27] for deeper memory hierarchies. Aggarwal and Vitter [3] introduced spatial locality and investigated a two-level memory in which a block of P contiguous items can be transferred in one step. They obtained tight bounds for matrix multiplication, FFT, sorting, and other problems. The hierarchical memory model (HMM) by Aggarwal et al. [1] treats memory as a linear array, where the cost of an access to element at location x is given by a cost function f 8 x :. The BT model [2] extends HMM to support block transfers. The UMH model by Alpern et al. [5] is a multilevel model that allows I/O at different levels to proceed in parallel. Vitter and Shriver introduce parallelism, and they give algorithms for ma- trix multiplication, FFT, sorting, and other problems in both a two-level model [35] and several parallel hierar- chical memory models [36]. Vitter [33] provides a com- prehensive survey of external-memory algorithms.

8. Conclusion

The theoretical work presented in this paper opens two important avenues for future research. The first is to determine the range of practicality of cache-oblivious algorithms, or indeed, of any algorithms developed in the ideal-cache model. The second is to resolve, from a complexity-theoretic point of view, the relative strengths of cache-oblivious and cache-aware algorithms. To con- clude, we discuss each of these avenues in turn. Figure 4 compares per-element time to transpose a matrix using the naive iterative algorithm employing a doubly nested loop with the recursive cache-oblivious REC-TRANSPOSE algorithm from Section 3. The two algorithms were evaluated on a 450 megahertz AMD K6III processor with a 32-kilobyte 2-way set-associative L1 cache, a 64-kilobyte 4-way set-associative L2 cache, and a 1-megabyte L3 cache of unknown associativ- ity, all with 32-byte cache lines. The code for REC- TRANSPOSE was the same as presented in Section 3, ex-

[3] A. Aggarwal and J. S. Vitter. The input/output complex- ity of sorting and related problems. Communications of the ACM , 31(9):1116–1127, Sept. 1988. [4] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley Publishing Company, 1974. [5] B. Alpern, L. Carter, and E. Feig. Uniform memory hier- archies. In Proceedings of the 31st Annual IEEE Sym- posium on Foundations of Computer Science (FOCS) , pages 600–608, Oct. 1990. [6] D. H. Bailey. FFTs in external or hierarchical memory. Journal of Supercomputing , 4(1):23–35, May 1990. [7] L. A. Belady. A study of replacement algorithms for vir- tual storage computers. IBM Systems Journal , 5(2):78– 101, 1966. [8] G. Bilardi and E. Peserico. Efficient portability across memory hierarchies. Unpublished manuscript, 1999. [9] R. D. Blumofe, M. Frigo, C. F. Joerg, C. E. Leiserson, and K. H. Randall. An analysis of dag-consistent dis- tributed shared-memory algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algo- rithms and Architectures (SPAA) , pages 297–308, Padua, Italy, June 1996. [10] L. Carter and K. S. Gatlin. Towards an optimal bit- reversal permutation program. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science , pages 544–555. IEEE Computer Society Press, 1998. [11] S. Chatterjee, V. V. Jain, A. R. Lebeck, and S. Mundhra. Nonlinear array layouts for hierarchical memory sys- tems. In Proceedings of the ACM International Confer- ence on Supercomputing , Rhodes, Greece, June 1999. [12] S. Chatterjee, A. R. Lebeck, P. K. Patnala, and M. Thot- tethodi. Recursive array layouts and fast parallel matrix multiplication. In Proceedings of the Eleventh ACM Sym- posium on Parallel Algorithms and Architectures (SPAA) , June 1999. [13] J. W. Cooley and J. W. Tukey. An algorithm for the ma- chine computation of the complex Fourier series. Mathe- matics of Computation , 19:297–301, Apr. 1965. [14] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. In- troduction to Algorithms. MIT Press and McGraw Hill,

[15] P. Duhamel and M. Vetterli. Fast Fourier transforms: a tutorial review and a state of the art. Signal Processing , 19:259–299, Apr. 1990. [16] J. D. Frens and D. S. Wise. Auto-blocking matrix- multiplication or tracking BLAS3 performance from source code. In Proceedings of the Sixth ACM SIG- PLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) , pages 206–216, Las Vegas, NV, June 1997. [17] M. Frigo. A fast Fourier transform compiler. In Proceed- ings of the ACM SIGPLAN’99 Conference on Program- ming Language Design and Implementation (PLDI) , At- lanta, Georgia, May 1999. [18] M. Frigo and S. G. Johnson. FFTW: An adaptive soft- ware architecture for the FFT. In Proceedings of the In- ternational Conference on Acoustics, Speech, and Signal Processing , Seattle, Washington, May 1998. [19] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 1989. [20] J. L. Hennessy and D. A. Patterson. Computer Architec- ture: A Quantitative Approach. Morgan Kaufmann, 2nd edition, 1996. [21] J.-W. Hong and H. T. Kung. I/O complexity: the red-blue pebbling game. In Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC) , pages 326– 333, Milwaukee, 1981.

[22] M. S. Lam, E. Rothberg, and M. E. Wolf. The cache per- formance and optimizations of blocked algortihms. In Fourth International Conference on Architectural Sup- port for Programming Languages and Operating Systems (ASPLOS) , pages 63–74, Santa Clara, CA, Apr. 1991. ACM SIGPLAN Notices 26:4. [23] A. LaMarca and R. E. Ladner. The influence of caches on the performance of sorting. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages 370–377, 1997. [24] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995. [25] M. H. Nodine and J. S. Vitter. Deterministic distribution sort in shared and distributed memory multiprocessors. In Proceedings of the Fifth ACM Symposium on Paral- lel Algorithms and Architectures (SPAA) , pages 120–129, Velen, Germany, 1993. [26] H. Prokop. Cache-oblivious algorithms. Master’s thesis, Massachusetts Institute of Technology, June 1999. [27] J. E. Savage. Extending the Hong-Kung model to mem- ory hierarchies. In D.-Z. Du and M. Li, editors, Com- puting and Combinatorics , volume 959 of Lecture Notes in Computer Science , pages 270–281. Springer Verlag,

[28] S. Sen and S. Chatterjee. Towards a theory of cache- efficient algorithms. Unpublished manuscript, 1999. [29] R. C. Singleton. An algorithm for computing the mixed radix fast Fourier transform. IEEE Transactions on Audio and Electroacoustics , AU-17(2):93–103, June 1969. [30] D. D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM , 28(2):202–208, Feb. 1985. [31] V. Strassen. Gaussian elimination is not optimal. Nu- merische Mathematik , 13:354–356, 1969. [32] S. Toledo. Locality of reference in LU decomposition with partial pivoting. SIAM Journal on Matrix Analysis and Applications , 18(4):1065–1081, Oct. 1997. [33] J. S. Vitter. External memory algorithms and data struc- tures. In J. Abello and J. S. Vitter, editors, External Memory Algorithms and Visualization , DIMACS Series in Discrete Mathematics and Theoretical Computer Sci- ence. American Mathematical Society Press, Providence, RI, 1999. [34] J. S. Vitter and M. H. Nodine. Large-scale sorting in uniform memory hierarchies. Journal of Parallel and Distributed Computing , 17(1–2):107–114, January and February 1993. [35] J. S. Vitter and E. A. M. Shriver. Algorithms for par- allel memory I: Two-level memories. Algorithmica , 12(2/3):110–147, August and September 1994. [36] J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory II: Hierarchical multilevel memories. Algorith- mica , 12(2/3):148–169, August and September 1994. [37] S. Winograd. On the algebraic complexity of functions. Actes du Congr`es International des Math´ematiciens , 3:283–288, 1970.