CS 201
Writing Cache-Friendly Code
Gerson Robboy
Portland State University
An Example Memory Hierarchy
L0: registers. CPU registers hold words retrieved from L1 cache.
L1: on-chip L1 cache (SRAM). L1 cache holds cache lines retrieved from the L2 cache memory.
L2: off-chip L2 cache (SRAM). L2 cache holds cache lines retrieved from main memory.
L3: main memory (DRAM). Main memory holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: remote secondary storage (distributed file systems, Web servers).
Toward L0: smaller, faster, and costlier (per byte) storage devices.
Toward L5: larger, slower, and cheaper (per byte) storage devices.
Just what does a cache do?
The cache stores memory in units of cache lines
- Fixed length chunks, hardware dependent
- For our example, let's say cache lines are 32 bytes
- Aligned on a cache-line (32 byte) boundary
When the CPU accesses a memory address (store or load), the cache line containing that address is pulled into the cache
Examples
Suppose a certain processor has a 32-byte cache line size.
You access address 0x3a40. What addresses are pulled into the cache?
You access address 0x3a94. What addresses are pulled into the cache?
Next you access 0x3a48. What happens?
You access 4 32-bit words sequentially, from 0x8000 to 0x801c
- How many cache misses and how many cache hits?
Locality Example
Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.
Question: Does this function have good locality?
- Spatial, temporal, both, or neither?
    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }
Locality Example
Question: Does this function have good locality?
- Spatial, temporal, both, or neither?
    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }
Why does traversing a matrix with stride 1 give you good spatial locality?
Why do strides other than 1 give you bad spatial locality?
Writing Cache Friendly Code
Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality)
Examples:
- cold cache, 4-byte words, 4-word cache blocks

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Miss rate =

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

Miss rate =
Memory Mountain Test Function

    /* The test function */
    void test(int elems, int stride)
    {
        int i, result = 0;
        volatile int sink;

        for (i = 0; i < elems; i += stride)
            result += data[i];
        sink = result; /* So compiler doesn't optimize away the loop */
    }

    /* Run test(elems, stride) and return read throughput (MB/s) */
    double run(int size, int stride, double Mhz)
    {
        double cycles;
        int elems = size / sizeof(int);

        test(elems, stride);                     /* warm up the cache */
        cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
        return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
    }
Memory Mountain Main Routine

    /* mountain.c - Generate the memory mountain. */
    #define MINBYTES (1 << 10)  /* Working set size ranges from 1 KB */
    #define MAXBYTES (1 << 23)  /* ... up to 8 MB */
    #define MAXSTRIDE 16        /* Strides range from 1 to 16 */
    #define MAXELEMS MAXBYTES/sizeof(int)

    int data[MAXELEMS];         /* The array we'll be traversing */

    int main()
    {
        int size;        /* Working set size (in bytes) */
        int stride;      /* Stride (in array elements) */
        double Mhz;      /* Clock frequency */

        init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
        Mhz = mhz(0);              /* Estimate the clock frequency */
        for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
            for (stride = 1; stride <= MAXSTRIDE; stride++)
                printf("%.1f\t", run(size, stride, Mhz));
            printf("\n");
        }
        exit(0);
    }
Ridges of Temporal Locality
Slice through the memory mountain with stride=1
- illuminates read throughputs of different caches and memory
[Figure: read throughput (MB/s), 0 to 1200, versus working set size (bytes), 8m down to 1k, with the L1 cache region, L2 cache region, and main memory region marked]
A Slope of Spatial Locality
Slice through memory mountain with size=256KB
- shows cache block size.
[Figure: read throughput (MB/s), 0 to 800, versus stride (words), s1 to s16; annotation: one access per cache line]
Miss Rate Analysis for Matrix Multiply
Assume:
- Line size = 32B (big enough for 4 64-bit words)
- Matrix dimension (N) is very large
  - Approximate 1/N as 0.0
- Cache is not even big enough to hold multiple rows
Analysis Method:
- Look at access pattern of inner loop
[Figure: A indexed by row i and column k, B indexed by row k and column j, C indexed by row i and column j]
Layout of C Arrays in Memory (review)
C arrays allocated in row-major order
- each row in contiguous memory locations
Stepping through columns in one row:
- for (i = 0; i < N; i++)
      sum += a[0][i];
- accesses successive elements
- if block size (B) > 4 bytes, exploit spatial locality
  - compulsory miss rate = 4 bytes / B
Stepping through rows in one column:
- for (i = 0; i < N; i++)
      sum += a[i][0];
- accesses distant elements
- no spatial locality!
  - compulsory miss rate = 1 (i.e. 100%)