Understanding Cache Locality in Matrix Multiplication, Study notes of Computer Science

These notes cover cache locality in matrix multiplication algorithms and discuss how spatial and temporal locality affect cache misses and hits. They give examples of different matrix multiplication functions, analyze the locality of each, and introduce blocking as a way to improve temporal locality.

Typology: Study notes

Pre 2010

Uploaded on 08/16/2009

CS 201
Writing Cache-Friendly Code
Gerson Robboy
Portland State University


An Example Memory Hierarchy

L0: registers. CPU registers hold words retrieved from L1 cache.
L1: on-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache memory.
L2: off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory.
L3: main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: remote secondary storage (distributed file systems, Web servers).

Toward L0: smaller, faster, and costlier (per byte) storage devices.
Toward L5: larger, slower, and cheaper (per byte) storage devices.

Just what does a cache do?

The cache stores memory in units called cache lines
- Fixed length chunks, hardware dependent
- For our example, let's say cache lines are 32 bytes
- Aligned on a cache-line (32 byte) boundary

When the CPU accesses a memory address (store or load), the cache line containing that address is pulled into the cache
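The alignment rule can be sketched with a little bit arithmetic. This is only an illustration under the slide's assumption of 32-byte lines; the names `LINE_SIZE`, `line_base`, and `line_offset` are made up for this sketch and do not come from the slides.

```c
#include <stdint.h>

#define LINE_SIZE 32u   /* bytes per cache line, the value assumed on the slide */

/* First address of the cache line containing addr.
   Relies on LINE_SIZE being a power of two. */
uintptr_t line_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(LINE_SIZE - 1);
}

/* Byte position of addr within its cache line. */
unsigned line_offset(uintptr_t addr)
{
    return (unsigned)(addr & (LINE_SIZE - 1));
}
```

For instance, `line_base(0x1234)` is 0x1220 and `line_offset(0x1234)` is 20, so under this model an access to 0x1234 would pull bytes 0x1220 through 0x123f into the cache.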

Examples

Suppose a certain processor has a 32-byte cache line size.

You access address 0x3a40. What addresses are pulled into the cache?

You access address 0x3a94. What addresses are pulled into the cache?

Next you access 0x3a48. What happens?

You access 32-bit (4-byte) words sequentially, from 0x8000 to 0x801c
- How many cache misses and how many cache hits?
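One way to check answers to these questions is a toy model that only remembers which lines have already been loaded. The sketch below assumes a cold cache that never evicts anything; `toy_cache`, `toy_access`, `count_misses`, and `MAX_LINES` are hypothetical names chosen for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 32u   /* bytes per cache line, as in the examples */
#define MAX_LINES 64    /* illustrative capacity of this toy model */

/* Toy cache: records the base address of every line loaded so far.
   It never evicts; once full, further lines are simply not remembered. */
struct toy_cache {
    uintptr_t lines[MAX_LINES];
    size_t count;
};

/* Returns 1 on a hit, 0 on a miss (the missing line is then loaded). */
int toy_access(struct toy_cache *c, uintptr_t addr)
{
    uintptr_t base = addr & ~(uintptr_t)(LINE_SIZE - 1);
    size_t k;

    for (k = 0; k < c->count; k++)
        if (c->lines[k] == base)
            return 1;
    if (c->count < MAX_LINES)
        c->lines[c->count++] = base;
    return 0;
}

/* Number of misses when reading nwords 4-byte words starting at start. */
int count_misses(struct toy_cache *c, uintptr_t start, int nwords)
{
    int w, misses = 0;

    for (w = 0; w < nwords; w++)
        if (!toy_access(c, start + 4u * (uintptr_t)w))
            misses++;
    return misses;
}
```

Starting from an empty `struct toy_cache`, accessing 0x3a40 misses and loads 0x3a40 through 0x3a5f, accessing 0x3a94 misses and loads 0x3a80 through 0x3a9f, and the later access to 0x3a48 hits. The eight 4-byte words at 0x8000 through 0x801c all fall in one line, so the model reports one miss and seven hits.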

Locality Example

Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.

Question: Does this function have good locality?
- Spatial, temporal, both, or neither?

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Locality Example

Question: Does this function have good locality?
- Spatial, temporal, both, or neither?

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Why does traversing a matrix with stride 1 give you good spatial locality?

Why do strides other than 1 give you bad spatial locality?


Writing Cache Friendly Code

Repeated references to variables are good (temporal locality)

Stride-1 reference patterns are good (spatial locality)

Examples:
- cold cache, 4-byte words, 4-word cache blocks

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Miss rate =

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate =
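The two miss rates can be estimated with a small simulation. The sketch below is not from the slides: it assumes a direct-mapped cache of 64 blocks (a hypothetical size), 16-byte blocks to match the 4-byte words and 4-word blocks above, an array that starts on a block boundary, and that only the `a[i][j]` references touch memory (`i`, `j`, and `sum` stay in registers).

```c
#define BLOCK_BYTES 16   /* 4-word blocks of 4-byte words, as on the slide */
#define NUM_BLOCKS  64   /* hypothetical cache capacity: 64 blocks = 1 KB */

/* Miss rate of one pass over an M x N int array in a cold direct-mapped cache.
   by_rows != 0 models sumarrayrows; by_rows == 0 models sumarraycols. */
double miss_rate(int M, int N, int by_rows)
{
    long tags[NUM_BLOCKS];
    long misses = 0, total = (long)M * N;
    int outer, inner, s;

    for (s = 0; s < NUM_BLOCKS; s++)
        tags[s] = -1;                                /* cold cache: all slots empty */

    for (outer = 0; outer < (by_rows ? M : N); outer++) {
        for (inner = 0; inner < (by_rows ? N : M); inner++) {
            int i = by_rows ? outer : inner;
            int j = by_rows ? inner : outer;
            long byte_off = ((long)i * N + j) * 4;   /* row-major offset of a[i][j] */
            long block = byte_off / BLOCK_BYTES;
            int slot = (int)(block % NUM_BLOCKS);

            if (tags[slot] != block) {               /* miss: load the block */
                tags[slot] = block;
                misses++;
            }
        }
    }
    return (double)misses / (double)total;
}
```

With M = N = 256, which is far larger than this 1 KB model cache, the function returns 0.25 for the row-wise traversal and 1.0 for the column-wise one. An array small enough to fit entirely in the model cache would show a lower column-wise rate, so the result depends on the array being large relative to the cache.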

Memory Mountain Test Function

/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                     /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}

Memory Mountain Main Routine

/* mountain.c - Generate the memory mountain. */
#include <stdio.h>
#include <stdlib.h>

/* init_data(), mhz(), and fcyc2() are helper routines that are not shown on these slides. */

#define MINBYTES (1 << 10)  /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)  /* ... up to 8 MB */
#define MAXSTRIDE 16        /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS];         /* The array we'll be traversing */

int main()
{
    int size;        /* Working set size (in bytes) */
    int stride;      /* Stride (in array elements) */
    double Mhz;      /* Clock frequency */

    init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
    Mhz = mhz(0);              /* Estimate the clock frequency */

    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}

Ridges of Temporal Locality

Slice through the memory mountain with stride=1
- illuminates read throughputs of different caches and memory

[Figure: read throughput (MB/s, 0 to 1200) versus working set size (bytes, 8m down to 1k), with the L1 cache region, L2 cache region, and main memory region marked]

A Slope of Spatial Locality

Slice through memory mountain with size=256KB
- shows cache block size.

[Figure: read throughput (MB/s, 0 to 800) versus stride (words, s1 to s16), annotated "one access per cache line"]
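The shape of this slope can be approximated with a simple model: while the stride in bytes is smaller than a cache line, several consecutive accesses share each line, and once the stride reaches the line size every access lands on a new line. The function below is a rough sketch under idealized assumptions (cold cache, array much larger than the cache, no prefetching); `miss_fraction` and its parameters are hypothetical names.

```c
/* Idealized fraction of accesses that miss when reading every
   stride_words-th word of a large array from a cold cache. */
double miss_fraction(int stride_words, int word_bytes, int line_bytes)
{
    double step = (double)stride_words * word_bytes;  /* bytes between accesses */

    return step >= line_bytes ? 1.0 : step / line_bytes;
}
```

Assuming 4-byte words and 32-byte lines, the model gives 0.125 at stride 1, 0.5 at stride 4, and 1.0 from stride 8 onward, which corresponds to the point where each access touches its own cache line.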

Miss Rate Analysis for Matrix Multiply

Assume:
- Line size = 32B (big enough for 4 64-bit words)
- Matrix dimension (N) is very large
  - Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows

Analysis Method:
- Look at access pattern of inner loop

[Figure: matrices A (indexed by i, k), B (indexed by k, j), and C (indexed by i, j)]
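To make the analysis method concrete, here is a minimal sketch of one common loop ordering (ijk) with the inner-loop access pattern marked in comments. The function name `matmul_ijk` and the flat row-major layout are choices made for this illustration, not something taken from the slides.

```c
#include <stddef.h>

/* c = a * b for n x n matrices stored row-major in flat arrays (ijk order). */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    size_t i, j, k;

    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            double sum = 0.0;              /* C[i][j] accumulates in a local */
            for (k = 0; k < n; k++)
                sum += a[i * n + k]        /* A: row-wise, stride 1 */
                     * b[k * n + j];       /* B: column-wise, stride n */
            c[i * n + j] = sum;            /* C: written once per inner loop */
        }
    }
}
```

Under the assumptions above (32-byte lines holding four 8-byte elements, very large N), each inner-loop iteration would miss about 0.25 times on A, about once on B, and not at all on C, since the running sum can stay in a register.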

Layout of C Arrays in Memory (review)

C arrays allocated in row-major order
- each row in contiguous memory locations

Stepping through columns in one row:
- for (i = 0; i < N; i++) sum += a[0][i];
- accesses successive elements
- if block size (B) > 4 bytes, exploit spatial locality
  - compulsory miss rate = 4 bytes / B

Stepping through rows in one column:
- for (i = 0; i < N; i++) sum += a[i][0];
- accesses distant elements
- no spatial locality!
  - compulsory miss rate = 1 (i.e. 100%)
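Row-major layout can be checked directly by measuring how far apart elements sit in memory. The sketch below uses small illustrative dimensions (`ROWS` and `COLS` are assumptions, as is the helper name `byte_offset`).

```c
#include <stddef.h>

#define ROWS 3   /* illustrative dimensions, not from the slides */
#define COLS 5

/* Byte offset of a[i][j] from the start of the array. */
size_t byte_offset(int a[ROWS][COLS], int i, int j)
{
    return (size_t)((const char *)&a[i][j] - (const char *)a);
}
```

Moving one column to the right changes the offset by `sizeof(int)`, while moving one row down changes it by `COLS * sizeof(int)`. That gap is why stepping down a column jumps over a whole row of data on every access.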