
Various methods to optimize matrix multiplication code, including removing redundant work, changing the order of operations, and blocking. The best results come from blocking with a block size of 50, which significantly reduces L1 cache misses and improves the cache hit ratio and running time.

CS232 Section 12: Model Solution
There are a few quick ways to optimize the code. One is to recognize that the assignment to C[i][j] does not need to be done k times; the sum can be accumulated in a local variable and stored once. VTune shows that this produces only marginal improvement. Another is to change the order of the additions and multiplications, either by transposing the matrix B or by changing the order in which the loop variables are iterated. Switching the j and k loops gives a large improvement, but it is still limited by the size of the matrix: if the matrix grows enough that a single row no longer fits in the L1 cache, reordering the operations is unlikely to help. The best way is to use blocking. Some experimentation shows that a block size of 50 produces the best time. (Actually, a block size of 51 does better, but a block size of 50 divides the matrix size evenly, which lets us avoid edge cases.)
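The two quick optimizations described above can be sketched as follows. This is only an illustration, not the code that was profiled: the size N, the float element type, the global arrays, and the function names multiply_ijk and multiply_ikj are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define N 64  /* illustrative size only; hypothetical, not the assignment's SIZE */

/* Inputs shared by both variants (hypothetical globals for this sketch). */
static float A[N][N], B[N][N];

/* i-j-k order: the sum for one output element stays in a local variable,
 * so out[i][j] is stored once rather than N times. The inner loop still
 * walks B down a column, i.e. with a stride of N elements. */
void multiply_ijk(float out[N][N]) {
    assert(out != NULL);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            out[i][j] = sum;
        }
    }
}

/* i-k-j order: j and k are swapped, so the inner loop walks one row of B
 * and one row of out with unit stride. */
void multiply_ikj(float out[N][N]) {
    assert(out != NULL);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            out[i][j] = 0.0f;
        for (int k = 0; k < N; k++) {
            float a = A[i][k];  /* invariant across the inner loop */
            for (int j = 0; j < N; j++)
                out[i][j] += a * B[k][j];
        }
    }
}
```

Both functions compute the same product; only the memory access pattern differs. In the second version a row of B and a row of the output are reused across the inner loop, which is why the benefit depends on a row fitting in the L1 cache.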
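/* Blocked multiply: computes C += A * B one block x block tile at a time.
   block must divide SIZE evenly. */
void multiplyFaster(int block)
{
    int i0, j0, k0, i, j, k;

    /* Outer three loops step from tile to tile. */
    for (i0 = 0; i0 < SIZE; i0 += block)
        for (j0 = 0; j0 < SIZE; j0 += block)
            for (k0 = 0; k0 < SIZE; k0 += block)
                /* Inner three loops multiply within the current tiles. */
                for (i = i0; i < i0 + block; i++)
                    for (j = j0; j < j0 + block; j++)
                        for (k = k0; k < k0 + block; k++)
                            C[i][j] += A[i][k] * B[k][j];
}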
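The routine above relies on the block size dividing SIZE evenly. One possible way to handle a block size such as 51 is to clip the last tile in each dimension at the matrix edge. The sketch below is an assumption about how that could be done, not part of the model solution: the SIZE of 100, the float element type, and the names min_int and multiplyBlockedClipped are hypothetical.

```c
#include <assert.h>

#define SIZE 100  /* illustrative; hypothetical stand-in for the assignment's SIZE */

static float A[SIZE][SIZE], B[SIZE][SIZE], C[SIZE][SIZE];

/* Smaller of two ints; used to clip a tile at the matrix edge. */
static int min_int(int a, int b) { return a < b ? a : b; }

/* Blocked C += A * B for any block >= 1. The last tile in each dimension
 * is clipped to SIZE, so block need not divide SIZE evenly. */
void multiplyBlockedClipped(int block) {
    assert(block >= 1);
    for (int i0 = 0; i0 < SIZE; i0 += block) {
        int iEnd = min_int(i0 + block, SIZE);
        for (int j0 = 0; j0 < SIZE; j0 += block) {
            int jEnd = min_int(j0 + block, SIZE);
            for (int k0 = 0; k0 < SIZE; k0 += block) {
                int kEnd = min_int(k0 + block, SIZE);
                for (int i = i0; i < iEnd; i++)
                    for (int j = j0; j < jEnd; j++)
                        for (int k = k0; k < kEnd; k++)
                            C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```

The clipping is computed once per tile rather than in the innermost loop, so it should add little overhead, though that has not been measured here.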
Block size                             1000      50        250
1st Level Cache Load Misses Retired    1183 M    28 M      1026 M
Loads Retired                          3014 M    3206 M    3045 M
Cache Hit Ratio                        60.7%     99.1%     66.3%
Running time (in s)                    10.15     2.06      9.
How large is the L1 cache? Each matrix takes 4 * 1000 * 1000 bytes, but we keep only a 50 * 50 block of each of the three matrices in the working set, which gives a working set of about 29kB. The L1 cache is almost certainly larger than that, but we know that doubling the block size produces fairly horrible times, so the L1 cache is likely smaller than 117kB. The two powers of 2 between those numbers are 32kB and 64kB. Both are fairly reasonable, although they do seem to be larger than the L1 cache sizes I could find for a Pentium 4. If we assume the time differences above are entirely due to L1 cache misses, we get an L1 miss time between 7.0ns and 7.2ns, which, given that we are dealing with a 3.2GHz processor, is not a small amount of time.
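The arithmetic behind these estimates can be written out as a short sketch. The function names working_set_bytes and miss_penalty_ns are hypothetical, and the second function bakes in the same assumption as the text: that the whole difference in running time comes from L1 misses.

```c
#include <assert.h>

/* Bytes in the working set when one block x block tile of each of the
 * three matrices (A, B and C) is live at once. */
double working_set_bytes(int block, int elem_bytes) {
    assert(block > 0 && elem_bytes > 0);
    return 3.0 * block * block * elem_bytes;
}

/* Average cost of one L1 miss in nanoseconds, ASSUMING the entire gap in
 * running time between two runs is caused by the gap in L1 misses. */
double miss_penalty_ns(double t_slow_s, double t_fast_s,
                       double misses_slow, double misses_fast) {
    assert(misses_slow > misses_fast);
    return (t_slow_s - t_fast_s) / (misses_slow - misses_fast) * 1e9;
}
```

With 4-byte elements, working_set_bytes(50, 4) is 30000 bytes (about 29kB) and working_set_bytes(100, 4) is 120000 bytes (about 117kB). Using the measurements for block sizes 1000 and 50, miss_penalty_ns(10.15, 2.06, 1183e6, 28e6) comes to about 7.0ns, which is roughly 22 cycles at 3.2GHz.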