Matrix Multiplication Optimization: Blocking and Cache Alignment for Better Performance, Assignments of Computer Architecture and Organization

University of Illinois - Urbana-Champaign Computer Architecture and Organization

This document covers several methods for optimizing matrix multiplication code, including removing redundant work, changing the order of operations, and blocking. The best results are obtained through blocking with a block size of 50, which significantly reduces L1 cache misses and improves the cache hit ratio and running time.

Typology: Assignments

Pre 2010

Uploaded on 03/10/2009

koofers-user-gmh 🇺🇸

8 documents


CS232 Section 12: Model Solution

There are some quick ways to optimize the code somewhat. One is to recognize that the assignment to C[i][j] does not need to be done on every iteration of the k loop: the sum can be accumulated in a local variable and stored once. VTune shows, however, that this produces only marginal improvement.
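The first idea above can be sketched as follows. The original routine is not reproduced in this solution, so `multiply_naive` is an assumed baseline, and `N`, the `int` element type, and both function names are hypothetical choices made for illustration.

```c
#include <assert.h>

#define N 4  /* hypothetical small size for illustration; the assignment uses 1000 */

/* Assumed baseline: C[i][j] is read and written on every k iteration. */
void multiply_naive(int A[N][N], int B[N][N], int C[N][N]) {
    assert(C != A && C != B);  /* the output must not alias an input */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* Accumulate in a local scalar and store to C[i][j] once per (i, j) pair. */
void multiply_hoisted(int A[N][N], int B[N][N], int C[N][N]) {
    assert(C != A && C != B);  /* the output must not alias an input */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}
```

Both functions compute the same product; the second only changes how often C[i][j] is written, not the access pattern on A and B.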

Another way is to change the order of the additions and the multiplications. You can do this either by transposing the matrix B or by changing the order in which the loop variables are iterated. Switching j and k produces good results. The improvement is substantial, but it is still limited by the size of the matrix: if the matrix grows until a single row no longer fits in the L1 cache, reordering the operations is unlikely to help.
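Both reordering options can be sketched as below, assuming the baseline iterates in i, j, k order. `N`, the `int` element type, and the function names are illustrative and are not taken from the assignment code.

```c
#include <assert.h>

#define N 4  /* hypothetical small size for illustration; the assignment uses 1000 */

/* i-k-j order: the innermost loop walks row k of B and row i of C.
   C must be zeroed by the caller. */
void multiply_ikj(int A[N][N], int B[N][N], int C[N][N]) {
    assert(C != A && C != B);  /* the output must not alias an input */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            int a = A[i][k];  /* constant throughout the j loop */
            for (int j = 0; j < N; j++)
                C[i][j] += a * B[k][j];
        }
}

/* Alternative: transpose B once, then keep the i-j-k order so the
   innermost loop reads a row of A and a row of Bt. */
void multiply_transposed(int A[N][N], int B[N][N], int C[N][N]) {
    static int Bt[N][N];  /* scratch copy; static so a large N does not overflow the stack */
    assert(C != A && C != B);
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            Bt[j][k] = B[k][j];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * Bt[j][k];
            C[i][j] = sum;
        }
}
```

Because C stores arrays in row-major order, both variants make the innermost loop touch adjacent memory, whereas the i, j, k order walks down a column of B.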

The best way, of course, is to use blocking. Some experimentation shows that a block size of 50 produces the best time. (Actually, a block size of 51 does better, but a block size of 50 divides the matrix dimension of 1000 evenly, which allows us to avoid edge cases.)

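/* Blocked multiply. Assumes block divides SIZE evenly and C has been zeroed. */
void multiplyFaster(int block) {
    int i0, j0, k0, i, j, k;

    /* The outer three loops select a block x block tile of each matrix. */
    for (i0 = 0; i0 < SIZE; i0 += block)
        for (j0 = 0; j0 < SIZE; j0 += block)
            for (k0 = 0; k0 < SIZE; k0 += block)
                /* The inner three loops multiply within the selected tiles. */
                for (i = i0; i < i0 + block; i++)
                    for (j = j0; j < j0 + block; j++)
                        for (k = k0; k < k0 + block; k++)
                            C[i][j] += A[i][k] * B[k][j];
}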
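The edge cases mentioned above arise because `multiplyFaster` assumes that `block` divides `SIZE`: with a block size of 51 and a 1000-wide matrix, the last tile would index past the end of the arrays. One common way to handle this, shown here as a sketch and not as the code that was actually timed, is to clamp each tile at the matrix edge. `multiplyBlockedClamped`, `min_int`, and the small `SIZE` are hypothetical.

```c
#include <assert.h>

#define SIZE 10  /* hypothetical small size for illustration; the assignment uses 1000 */

int A[SIZE][SIZE], B[SIZE][SIZE], C[SIZE][SIZE];

static int min_int(int a, int b) { return a < b ? a : b; }

/* Blocked multiply whose tiles are clamped at the matrix edge, so block
   does not have to divide SIZE. C must be zeroed by the caller. */
void multiplyBlockedClamped(int block) {
    assert(block > 0);
    for (int i0 = 0; i0 < SIZE; i0 += block) {
        int iEnd = min_int(i0 + block, SIZE);
        for (int j0 = 0; j0 < SIZE; j0 += block) {
            int jEnd = min_int(j0 + block, SIZE);
            for (int k0 = 0; k0 < SIZE; k0 += block) {
                int kEnd = min_int(k0 + block, SIZE);
                for (int i = i0; i < iEnd; i++)
                    for (int j = j0; j < jEnd; j++)
                        for (int k = k0; k < kEnd; k++)
                            C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```

The clamping costs three extra comparisons per tile, outside the innermost loops, so it should not change the inner-loop access pattern that blocking relies on.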

Block size                             1000      50        250
1st Level Cache Load Misses Retired    1183 M    28 M      1026 M
Loads Retired                          3014 M    3206 M    3045 M
Cache Hit Ratio                        60.7%     99.1%     66.3%
Running time (in s)                    10.15     2.06      9.29

How large is the L1 cache? Each matrix has 4 * 1000 * 1000 bytes, but we are only keeping a 50 * 50 block of each matrix in the working set. This gives us a working set of 3 * 50 * 50 * 4 = 30,000 bytes, or about 29kB. The L1 cache is almost certainly larger than that, but we know that doubling the block size, which raises the working set to about 117kB, produces fairly horrible times, so it is likely that the L1 cache is smaller than 117kB. The two powers of 2 in between those two numbers are 32kB and 64kB. Both are fairly reasonable, although they do seem to be larger than the L1 cache numbers I could find for a Pentium 4.
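The working-set arithmetic above can be checked with a short helper. It assumes, as the text does, that one block each of A, B, and C is live at a time and that elements are 4 bytes; `working_set_bytes` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched while multiplying one set of tiles, assuming one
   block x block tile each of A, B, and C is live at the same time. */
size_t working_set_bytes(size_t block, size_t elem_size) {
    assert(block > 0 && elem_size > 0);
    return 3 * block * block * elem_size;
}
```

With 4-byte elements, `working_set_bytes(50, 4)` is 30,000 bytes (about 29.3 kB when 1 kB is taken as 1024 bytes) and `working_set_bytes(100, 4)` is 120,000 bytes (about 117.2 kB), which are the two bounds used in the estimate.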

If we assume the time differences above are entirely due to L1 cache misses, we get an L1 miss time between 7.0ns and 7.2ns. Given that we are dealing with a 3.2GHz processor, that is roughly 22 to 23 clock cycles, which is not a small amount of time.
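The miss-time estimate can be reproduced from the table by dividing the extra running time by the extra misses, both measured relative to the block-size-50 run. This is a sketch of that arithmetic under the same assumption the text makes (the whole time difference is due to L1 misses); the function names are hypothetical.

```c
#include <assert.h>

/* Estimated cost of one L1 miss in nanoseconds, comparing a slow run
   against a fast run. Miss counts are given in millions. */
double miss_penalty_ns(double slow_s, double fast_s,
                       double slow_misses_m, double fast_misses_m) {
    assert(slow_misses_m > fast_misses_m);
    double extra_seconds = slow_s - fast_s;
    double extra_misses = (slow_misses_m - fast_misses_m) * 1e6;
    return extra_seconds / extra_misses * 1e9;
}

/* Convert a time in nanoseconds to clock cycles at the given frequency. */
double ns_to_cycles(double ns, double clock_ghz) {
    return ns * clock_ghz;
}
```

Calling `miss_penalty_ns(10.15, 2.06, 1183, 28)` uses the block-size-1000 column and gives roughly 7.0 ns, while `miss_penalty_ns(9.29, 2.06, 1026, 28)` uses the block-size-250 column and gives roughly 7.2 ns.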
