
Cache-Oblivious Algorithms

EXTENDED ABSTRACT

Matteo Frigo, Charles E. Leiserson, Harald Prokop, Sridhar Ramachandran

MIT Laboratory for Computer Science, 545 Technology Square, Cambridge, MA 02139

 !"##$%'&(#&*)+%,&-"

Abstract. This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache with size $Z$ and cache-line length $L$ where $Z = \Omega(L^2)$, the number of cache misses for an $m \times n$ matrix transpose is $\Theta(1 + mn/L)$. The number of cache misses for either an $n$-point FFT or the sorting of $n$ numbers is $\Theta(1 + (n/L)(1 + \log_Z n))$. We also give an $\Theta(mnp)$-work algorithm to multiply an $m \times n$ matrix by an $n \times p$ matrix that incurs $\Theta(1 + (mn + np + mp)/L + mnp/(L\sqrt{Z}))$ cache faults.

We introduce an “ideal-cache” model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.

1. Introduction

Resource-oblivious algorithms that nevertheless use resources efficiently offer advantages of simplicity and portability over resource-aware algorithms whose resource usage must be programmed explicitly. In this paper, we study cache resources, specifically, the hierarchy of memories in modern computers. We exhibit several “cache-oblivious” algorithms that use cache as effectively as “cache-aware” algorithms.

This research was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270. Matteo Frigo was supported in part by a Digital Equipment Corporation fellowship.

Before discussing the notion of cache obliviousness, we first introduce the $(Z, L)$ ideal-cache model to study the cache complexity of algorithms. This model, which is illustrated in Figure 1, consists of a computer with a two-level memory hierarchy consisting of an ideal (data) cache of $Z$ words and an arbitrarily large main memory. Because the actual size of words in a computer is typically a small, fixed size (4 bytes, 8 bytes, etc.), we shall assume that word size is constant; the particular constant does not affect our asymptotic analyses. The cache is partitioned into cache lines, each consisting of $L$ consecutive words which are always moved together between cache and main memory. Cache designers typically use $L > 1$, banking on spatial locality to amortize the overhead of moving the cache line. We shall generally assume in this paper that the cache is tall:

$$Z = \Omega(L^2), \tag{1}$$

which is usually true in practice.

[Figure 1: The ideal-cache model. Diagram labels: CPU (work); cache, organized by optimal replacement strategy and holding cache lines of length L; main memory; cache misses between cache and main memory.]
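As a rough illustration of the tall-cache assumption (1), the sketch below converts a cache's capacity and line size from bytes to words and compares $Z$ with $L^2$. The names `cache_params` and `is_tall`, the constant `c`, the 8-byte word size, and the example cache geometries are all assumptions made here for illustration, not figures from the paper. Because (1) is an asymptotic statement, a check at fixed sizes is only suggestive.

```python
def cache_params(capacity_bytes, line_bytes, word_bytes=8):
    """Return (Z, L): cache size and line length, both measured in words."""
    return capacity_bytes // word_bytes, line_bytes // word_bytes


def is_tall(Z, L, c=1.0):
    """Check Z >= c * L**2 for a hypothetical constant c (an assumption)."""
    return Z >= c * L * L


# Hypothetical cache geometries, chosen for illustration only.
examples = {
    "32 KiB cache, 64-byte lines": (32 * 1024, 64),
    "1 MiB cache, 64-byte lines": (1024 * 1024, 64),
    "256-byte cache, 64-byte lines": (256, 64),
}

for name, (capacity, line) in examples.items():
    Z, L = cache_params(capacity, line)
    print(f"{name}: Z={Z}, L={L}, L^2={L * L}, tall={is_tall(Z, L)}")
```

With 8-byte words, the first geometry gives $Z = 4096$ and $L = 8$, so $Z$ comfortably exceeds $L^2 = 64$. Only the deliberately tiny third example fails the check.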

The processor can only reference words that reside in the cache. If the referenced word belongs to a line already in cache, a cache hit occurs, and the word is delivered to the processor. Otherwise, a cache miss occurs, and the line is fetched into the cache. The ideal cache is fully associative [20, Ch. 5]: cache lines can be stored anywhere in the cache. If the cache is full, a cache line must be evicted. The ideal cache uses the optimal off-line strategy of replacing the cache line whose next access is furthest in the future [7], and thus it exploits temporal locality perfectly.
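The replacement rule just described (evict the line whose next access is furthest in the future) can be sketched as a small trace-driven simulator. This is a minimal illustration rather than code from the paper: the function name `count_misses_ideal`, the representation of a trace as a list of word addresses, and the assumption that $Z$ is a positive multiple of $L$ are all choices made here.

```python
def count_misses_ideal(trace, Z, L):
    """Count misses of a fully associative cache of Z words with lines of
    L words, evicting the line whose next access is furthest in the future.

    Assumes (for this sketch) that Z is a positive multiple of L.
    """
    capacity = Z // L                      # number of lines the cache holds
    lines = [addr // L for addr in trace]  # line touched by each access

    # next_use[i] = position of the next access to the same line after i,
    # or infinity if the line is never touched again.
    INF = float("inf")
    next_use = [INF] * len(lines)
    last_seen = {}
    for i in range(len(lines) - 1, -1, -1):
        next_use[i] = last_seen.get(lines[i], INF)
        last_seen[lines[i]] = i

    cache = {}   # resident line -> position of its next access
    misses = 0
    for i, line in enumerate(lines):
        if line not in cache:
            misses += 1                    # line must be fetched
            if len(cache) >= capacity:
                # Off-line optimal choice: furthest next access.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[line] = next_use[i]          # refresh on hit or insert on miss
    return misses


# One sequential scan of 64 words, then the same scan repeated twice,
# on a hypothetical cache of Z = 32 words with L = 8 words per line.
scan = list(range(64))
print(count_misses_ideal(scan, Z=32, L=8))
print(count_misses_ideal(scan * 2, Z=32, L=8))
```

The simulator needs the whole trace in advance, which is what makes the strategy off-line. A single sequential scan touches each line once, so every line costs exactly one miss regardless of the replacement policy.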

Unlike various other hierarchical-memory models [1, 2, 5, 8] in which algorithms are analyzed in terms of a single measure, the ideal-cache model uses two measures. An algorithm with an input of size $n$ is measured by its work complexity $W(n)$, which is its conventional running time in a RAM model [4], and by its cache complexity $Q(n; Z, L)$, which is the number of cache misses it incurs as a
