Matrix Multiplication Optimization: Blocking and Cache Alignment for Better Performance, Assignments of Computer Architecture and Organization

This model solution covers various methods to optimize matrix multiplication code, including recognizing redundancies, changing the order of operations, and using blocking. The best results are obtained through blocking with a block size of 50, which significantly reduces L1 cache misses and improves the cache hit ratio and running time.

Typology: Assignments

Pre 2010

Uploaded on 03/10/2009

koofers-user-gmh

CS232 Section 12: Model Solution
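There are some quick ways to optimize the code somewhat. One is to recognize that the assignment
to C[i][j] does not need to be done on every iteration of the k loop: the sum can be accumulated in a
local variable and stored once after the loop finishes. But VTune shows that this produces only
marginal improvement.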
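As a minimal sketch of that first optimization: the function name multiply_accumulate, the flat row-major layout, the size parameter n, and the int element type are all assumptions made for illustration (the original code uses global arrays A, B, C and a constant SIZE, and only implies a 4-byte element).

```c
#include <assert.h>
#include <stddef.h>

/* Naive i-j-k multiply of n x n row-major matrices. The running sum
   lives in a local variable, so c is written once per (i, j) instead
   of once per iteration of the k loop. */
void multiply_accumulate(size_t n, const int *a, const int *b, int *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            int sum = 0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;   /* single store per output element */
        }
    }
}
```

The access pattern on a and b is unchanged, which is consistent with the marginal improvement reported: the column walk through b is still there.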
Another way is to change the ordering of the additions and the multiplications. You can do this
either by transposing the matrix B or by changing the order in which the loop variables are iterated.
Switching the j and k loops produces good results. The improvement is very good, but it is still
limited by the size of the matrix: if the matrix size is increased enough that a single row will not fit
in the L1 cache, reordering the operations is unlikely to help.
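The loop interchange can be sketched as follows, under the same illustrative assumptions (a hypothetical function multiply_ikj, flat row-major int arrays, an explicit size n). With k as the middle loop and j innermost, the inner loop walks one row of b and one row of c with stride 1 rather than walking down a column of b.

```c
#include <assert.h>
#include <stddef.h>

/* i-k-j multiply of n x n row-major matrices: the j and k loops of the
   naive version are swapped so the innermost loop is stride-1. */
void multiply_ikj(size_t n, const int *a, const int *b, int *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    /* c is accumulated into, so it has to start at zero. */
    for (size_t idx = 0; idx < n * n; idx++)
        c[idx] = 0;

    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            int aik = a[i * n + k];          /* reused across the whole row */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Each pass of the inner loop touches a full row of b and a full row of c, which is why the benefit depends on a row fitting in the L1 cache.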
The best way, of course, is to use blocking. Some experimentation shows that a block size of 50
produces the best time. (Actually a block size of 51 does better, but a block size of 50 allows us to avoid
edge cases.)
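/* Blocked multiply: C += A * B, one block x block tile at a time.
   Assumes SIZE is a multiple of block, and yields C = A * B only if
   C starts out zeroed. */
void multiplyFaster(int block) {
    int i0, j0, k0, i, j, k;
    /* The outer three loops select a tile of each matrix. */
    for (i0 = 0; i0 < SIZE; i0 += block)
        for (j0 = 0; j0 < SIZE; j0 += block)
            for (k0 = 0; k0 < SIZE; k0 += block)
                /* The inner three loops multiply within the tiles. */
                for (i = i0; i < i0 + block; i++)
                    for (j = j0; j < j0 + block; j++)
                        for (k = k0; k < k0 + block; k++)
                            C[i][j] += A[i][k] * B[k][j];
}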
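The edge cases mentioned above arise when the block size does not divide the matrix size, as with 51 and 1000. One possible way to handle them, sketched here rather than taken from the model solution, is to clamp each tile to the matrix edge. The names multiply_blocked and min_size, the flat row-major layout, and the int element type are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Smaller of two sizes; used to clamp a tile to the matrix edge. */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Blocked multiply c = a * b for n x n row-major matrices.
   Unlike multiplyFaster, block does not have to divide n. */
void multiply_blocked(size_t n, size_t block,
                      const int *a, const int *b, int *c)
{
    assert(block > 0);

    for (size_t idx = 0; idx < n * n; idx++)
        c[idx] = 0;

    for (size_t i0 = 0; i0 < n; i0 += block) {
        size_t i_end = min_size(i0 + block, n);
        for (size_t j0 = 0; j0 < n; j0 += block) {
            size_t j_end = min_size(j0 + block, n);
            for (size_t k0 = 0; k0 < n; k0 += block) {
                size_t k_end = min_size(k0 + block, n);
                /* Multiply the (possibly partial) tiles. */
                for (size_t i = i0; i < i_end; i++) {
                    for (size_t j = j0; j < j_end; j++) {
                        int sum = c[i * n + j];
                        for (size_t k = k0; k < k_end; k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
                }
            }
        }
    }
}
```

The clamping costs three extra comparisons per tile, not per element, so it should be cheap relative to the work inside a tile. Whether it changes the timings reported here has not been measured.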
Block size                               1000      50     250
1st Level Cache Load Misses Retired    1183 M    28 M  1026 M
Loads Retired                          3014 M  3206 M  3045 M
Cache Hit Ratio                         60.7%   99.1%   66.3%
Running time (in s)                     10.15    2.06    9.29
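The Cache Hit Ratio row follows from the two counter rows as 1 - misses / loads. A small sketch of that arithmetic, with hit_ratio as a hypothetical helper name:

```c
#include <assert.h>

/* Fraction of retired loads that hit in the L1 cache.
   Both arguments must use the same unit (here, millions of events). */
double hit_ratio(double load_misses, double loads)
{
    assert(loads > 0.0);
    return 1.0 - load_misses / loads;
}
```

For example, hit_ratio(28.0, 3206.0) corresponds to the block size 50 column and hit_ratio(1183.0, 3014.0) to the block size 1000 column.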
How large is the L1 cache? Each matrix has 4 * 1000 * 1000 bytes, but we are only keeping a 50 * 50
block of each matrix in the working set. This gives us a working set of 3 * 50 * 50 * 4 = 30,000
bytes, or about 29kB. The L1 cache is almost certainly larger than that, but we know that doubling the
block size produces fairly horrible times, so it is likely that the L1 cache is smaller than the 117kB
working set of a block size of 100. The two powers of 2 that lie between those two numbers are 32kB
and 64kB. Both are fairly reasonable, although they do seem to be larger than the L1 cache numbers I
could find for a Pentium 4.
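The working-set estimate can be reproduced with a short sketch. The helper names are hypothetical, and the conversion assumes 1kB = 1024 bytes, which is the convention that matches the 29kB and 117kB figures.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by one tile each of A, B and C. */
size_t working_set_bytes(size_t block, size_t elem_size)
{
    assert(block > 0 && elem_size > 0);
    return 3 * block * block * elem_size;
}

/* Same quantity in kB, taking 1 kB as 1024 bytes (an assumption). */
double working_set_kb(size_t block, size_t elem_size)
{
    return (double)working_set_bytes(block, elem_size) / 1024.0;
}
```

With 4-byte elements, working_set_bytes(50, 4) is 30000 and working_set_bytes(100, 4) is 120000, which bracket the 32kB and 64kB candidates.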
If we assume the time differences above are entirely due to L1 cache misses, we get an L1 miss
time between 7.0ns and 7.2ns. Given that we are dealing with a 3.2GHz processor, where that is
roughly 22 to 23 clock cycles per miss, this is not a small amount of time.
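The miss-time estimate divides the extra running time by the extra misses, in each case relative to the block size 50 run. A sketch of that calculation, with miss_penalty_ns as a hypothetical helper name:

```c
#include <assert.h>

/* Estimated cost of one L1 miss in nanoseconds, assuming the whole
   difference in running time comes from the difference in misses.
   Times are in seconds; miss counts are in millions. */
double miss_penalty_ns(double slow_time_s, double fast_time_s,
                       double slow_misses_m, double fast_misses_m)
{
    double extra_misses = (slow_misses_m - fast_misses_m) * 1e6;
    assert(extra_misses > 0.0);
    return (slow_time_s - fast_time_s) / extra_misses * 1e9;
}
```

Calling miss_penalty_ns(10.15, 2.06, 1183.0, 28.0) compares block size 1000 against 50, and miss_penalty_ns(9.29, 2.06, 1026.0, 28.0) compares 250 against 50. Multiplying either result by 3.2 converts it to clock cycles at 3.2GHz.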