Memory System Performance, Lecture Slide - Computer Science, Slides of Introduction to Computers

Impact of cache parameters, impact of memory reference patterns, matrix multiply, transpose, memory mountain range

Memory System Performance

October 29, 1998

Topics

  • Impact of cache parameters
  • Impact of memory reference patterns
    – matrix multiply
    – transpose
    – memory mountain range

15-213

class20.ppt


CS 213 F’


Basic Cache Organization

Address space: N = 2^n bytes

Address (n = t + s + b bits)

  • t tag bits
  • s set index bits
  • b block offset bits

Cache (C = S x E x B bytes)

  • S = 2^s sets
  • E blocks/set
  • B = 2^b bytes (block size)

Cache block (cache line)

  • Valid bit: 1 bit
  • Tag: t bits
  • Data: B = 2^b bytes
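The address split n = t + s + b can be sketched in C as follows. This is a minimal illustration, not code from the lecture: the names `addr_fields`, `split_address`, and `cache_bytes` are hypothetical, and the caller is assumed to supply s and b for whatever cache is being modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an address under the n = t + s + b split. */
typedef struct {
    uint64_t tag;     /* upper t bits */
    uint64_t set;     /* middle s bits: selects one of S = 2^s sets */
    uint64_t offset;  /* low b bits: byte within a B = 2^b byte block */
} addr_fields;

/* Split addr for a cache with 2^s sets and 2^b-byte blocks.
   s and b are hypothetical parameters chosen by the caller. */
addr_fields split_address(uint64_t addr, unsigned s, unsigned b)
{
    assert(s + b < 64);  /* keep every shift below the word width */
    addr_fields f;
    f.offset = addr & ((UINT64_C(1) << b) - 1);
    f.set    = (addr >> b) & ((UINT64_C(1) << s) - 1);
    f.tag    = addr >> (s + b);
    return f;
}

/* Total capacity C = S x E x B bytes, with E blocks per set. */
uint64_t cache_bytes(unsigned s, unsigned blocks_per_set, unsigned b)
{
    assert(s < 64 && b < 64);
    return (UINT64_C(1) << s) * blocks_per_set * (UINT64_C(1) << b);
}
```

For example, with s = 8 and b = 5 (256 sets of 32-byte blocks), address 0x12345 splits into tag 9, set 26, offset 5, and a 2-way cache of that shape holds 16384 bytes.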


Cache Performance Metrics

Miss Rate

  • fraction of memory references not found in cache (misses/references)
  • Typical numbers: 5-10% for L1, 1-2% for L2

Hit Time

  • time to deliver a block in the cache to the processor (includes time to determine whether the block is in the cache)
  • Typical numbers: 1 clock cycle for L1, 3-8 clock cycles for L2

Miss Penalty

  • additional time required because of a miss
  • Typically 10-30 cycles for main memory
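The three metrics are commonly combined into an average access time: hit time plus miss rate times miss penalty. The slide does not state this formula, so the sketch below is an assumption about how the numbers fit together, and `avg_access_cycles` is a hypothetical name.

```c
#include <assert.h>

/* Average cycles per memory reference for one cache level:
   hit_time + miss_rate * miss_penalty.
   This standard combination is an assumption; the slide lists the
   three metrics separately without combining them. */
double avg_access_cycles(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With a 1-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty, the average is 2 cycles per reference, so even a small miss rate can double the effective access time. For a two-level hierarchy, the L1 miss penalty can itself be taken as the average access time of L2.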


Impact of Cache and Block Size

Cache Size

  • Effect on miss rate
    – Larger is better
  • Effect on hit time
    – Smaller is faster

Block Size

  • Effect on miss rate
    – Big blocks help exploit spatial locality
    – For given cache size, can hold fewer big blocks than little ones, though
  • Effect on miss penalty
    – Longer transfer time
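The spatial-locality effect of block size can be made concrete with a rough model of a cold scan through a large array. The sketch assumes the array is far larger than the cache and is block-aligned, so each block is fetched at most once per pass; `scan_miss_rate` is a hypothetical helper, not something from the lecture.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate misses per access when stepping through an array with the
   given stride (measured in elements). Assumes a cold cache, a
   block-aligned array, and an array much larger than the cache. */
double scan_miss_rate(size_t elem_bytes, size_t stride_elems, size_t block_bytes)
{
    assert(elem_bytes > 0 && stride_elems > 0 && block_bytes >= elem_bytes);
    double r = (double)(elem_bytes * stride_elems) / (double)block_bytes;
    return r < 1.0 ? r : 1.0;  /* at most one miss per access */
}
```

Under these assumptions, a stride-1 scan over 8-byte doubles misses on 1 access in 4 with 32-byte blocks and 1 in 8 with 64-byte blocks, while a stride that skips past a whole block misses on every access regardless of block size.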


Impact of Write Strategy

Write-through or write-back?

Advantages of Write Through

  • Read misses are cheaper. Why?
  • Simpler to implement.
  • Requires a write buffer to pipeline writes

Advantages of Write Back

  • Reduced traffic to memory
    – Especially if bus used to connect multiple processors or I/O devices
  • Individual writes performed at the processor rate
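The traffic difference can be illustrated with a deliberately tiny model: a run of stores that all hit the same cached block, followed by one eviction of that block. The names `write_policy` and `memory_writes` are hypothetical, and the model ignores write buffers and allocation policy.

```c
#include <assert.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy;

/* Count memory writes caused by `stores` consecutive stores that all hit
   the same cached block, followed by one eviction of that block.
   A simplified model: no write buffer, no allocation policy. */
unsigned long memory_writes(write_policy p, unsigned long stores)
{
    assert(p == WRITE_THROUGH || p == WRITE_BACK);
    if (p == WRITE_THROUGH)
        return stores;            /* every store is sent to memory */
    return stores > 0 ? 1 : 0;    /* dirty block written once, at eviction */
}
```

In this model 1000 stores to one block cost 1000 memory writes under write-through and a single block write under write-back, which is the reduced traffic the slide refers to.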


Qualitative Cache Performance Model

Compulsory Misses

  • First access to line not in cache
  • Also called “Cold start” misses

Capacity Misses

  • Active portion of memory exceeds cache size

Conflict Misses

  • Active portion of address space fits in cache, but too many lines map to same cache entry
  • Direct mapped and set associative placement only
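Conflict misses can be reproduced with a small direct-mapped cache simulation. The geometry (64 sets of 32-byte blocks) and the names `dm_cache`, `dm_access`, and `ping_pong_misses` are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_SETS   64u   /* S: hypothetical cache geometry */
#define BLOCK_SIZE 32u   /* B, in bytes */

typedef struct {
    int           valid[NUM_SETS];
    uint64_t      tag[NUM_SETS];
    unsigned long misses;
} dm_cache;

void dm_init(dm_cache *c) { memset(c, 0, sizeof *c); }

/* Access one byte address; on a miss, count it and install the block. */
void dm_access(dm_cache *c, uint64_t addr)
{
    assert(c != NULL);
    uint64_t block = addr / BLOCK_SIZE;
    unsigned set   = (unsigned)(block % NUM_SETS);
    uint64_t tag   = block / NUM_SETS;
    if (!c->valid[set] || c->tag[set] != tag) {
        c->misses++;
        c->valid[set] = 1;
        c->tag[set]   = tag;
    }
}

/* Alternate between address 0 and address `gap`, `rounds` times each,
   and return the total number of misses. */
unsigned long ping_pong_misses(uint64_t gap, unsigned rounds)
{
    dm_cache c;
    dm_init(&c);
    for (unsigned r = 0; r < rounds; r++) {
        dm_access(&c, 0);
        dm_access(&c, gap);
    }
    return c.misses;
}
```

Two addresses exactly NUM_SETS * BLOCK_SIZE = 2048 bytes apart map to the same set and evict each other on every access, even though only two blocks are active. Two addresses 32 bytes apart land in different sets and incur only their two compulsory misses.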


Interactions Between Program & Cache

Major Cache Effects to Consider

  • Total cache size
    – Try to keep heavily used data in highest level cache
  • Block size (sometimes referred to as “line size”)
    – Exploit spatial locality

Example Application

  • Multiply n x n matrices
  • O(n^3) total operations
  • Accesses
    – n reads per source element
    – n values summed per destination
      » But may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Variable sum held in register


Matmult Performance (Sparc20)

[Figure: mflops (d.p.), 0 to 20, versus matrix size (n), 50 to 200; one curve per loop ordering: ikj, kij, ijk, jik, jki, kji]

  • As matrices grow in size, exceed cache capacity
  • Different loop orderings give different performance
    – Cache effects
    – Whether or not can accumulate in register
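A measurement of this kind could be gathered with a small timing harness along the following lines. This is a sketch under stated assumptions, not the harness used for the plot: the flop count of 2 * n^3 (one multiply and one add per inner iteration), the `matmul_fn` signature on row-major flat arrays, and the names `mflops` and `time_matmul` are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Millions of floating-point operations per second, counting one multiply
   and one add per inner-loop iteration (2 * n^3 in total; an assumption). */
double mflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    return 2.0 * n * n * n / (seconds * 1e6);
}

/* Hypothetical signature: multiply two n x n row-major matrices into c. */
typedef void (*matmul_fn)(int n, const double *a, const double *b, double *c);

/* Run f once on freshly allocated n x n inputs; return CPU seconds used. */
double time_matmul(matmul_fn f, int n)
{
    size_t count = (size_t)n * (size_t)n;
    double *a = malloc(count * sizeof *a);
    double *b = malloc(count * sizeof *b);
    double *c = calloc(count, sizeof *c);   /* zeroed for += variants */
    assert(a && b && c);
    for (size_t x = 0; x < count; x++) { a[x] = 1.0; b[x] = 2.0; }

    clock_t start = clock();
    f(n, a, b, c);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    free(a); free(b); free(c);
    return seconds;
}
```

For small n the elapsed time may round to zero at the resolution of `clock()`, so in practice the multiply would be repeated enough times to get a measurable interval before calling `mflops`.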


Matrix multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop:

  • A (i,*): Row-wise
  • B (*,j): Column-wise
  • C (i,j): Fixed

Approx. Miss Rates

  • a = 0.25
  • b = 1.0
  • c = 0.0
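The per-array access patterns (row-wise, column-wise, fixed) can be turned into an approximate misses-per-iteration figure. The sketch below assumes that rows are too long for a column walk to reuse a block before it is evicted, and that `elems_per_block` matrix elements fit in one cache block; both the parameter and the function names are hypothetical.

```c
#include <assert.h>

typedef enum { FIXED, ROW_WISE, COLUMN_WISE } access_pattern;

/* Approximate misses per inner-loop iteration for one array.
   elems_per_block is an assumed parameter: how many matrix elements
   fit in one cache block. */
double misses_per_iter(access_pattern p, int elems_per_block)
{
    assert(elems_per_block > 0);
    switch (p) {
    case ROW_WISE:    return 1.0 / elems_per_block; /* one miss per block */
    case COLUMN_WISE: return 1.0;                   /* new block each step */
    default:          return 0.0;                   /* FIXED: reused in place */
    }
}

/* Sum the contributions of a, b, and c for one loop ordering. */
double loop_miss_rate(access_pattern a, access_pattern b, access_pattern c,
                      int elems_per_block)
{
    return misses_per_iter(a, elems_per_block)
         + misses_per_iter(b, elems_per_block)
         + misses_per_iter(c, elems_per_block);
}
```

With 4 elements per block, the ijk pattern (a row-wise, b column-wise, c fixed) gives 0.25 + 1.0 + 0.0 = 1.25 misses per iteration, which agrees with the MR=1.25 figure quoted for ijk in the summary.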


Matrix multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop:

  • A (i,*): Row-wise
  • B (*,j): Column-wise
  • C (i,j): Fixed

Approx. Miss Rates

  • a = 0.25
  • b = 1.0
  • c = 0.0


Matrix multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop:

  • A (i,k): Fixed
  • B (k,*): Row-wise
  • C (i,*): Row-wise

Approx. Miss Rates

  • a = 0.0
  • b = 0.25
  • c = 0.25


Matrix multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop:

  • A (*,k): Column-wise
  • B (k,j): Fixed
  • C (*,j): Column-wise

Approx. Miss Rates

  • a = 1.0
  • b = 0.0
  • c = 1.0


Summary of Matrix Multiplication

(L = loads, S = stores, MR = misses per inner-loop iteration)

ijk (L=2, S=0, MR=1.25)

for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

jik (L=2, S=0, MR=1.25)

for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

kij (L=2, S=1, MR=0.5)

for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

ikj (L=2, S=1, MR=0.5)

for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

jki (L=2, S=1, MR=2.0)

for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

kji (L=2, S=1, MR=2.0)

for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
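One representative from each miss-rate class can be written as a self-contained C function and checked against the others. This is a sketch, not the lecture's code: it stores matrices as row-major flat arrays, and the `AT` macro and function names are hypothetical. The kij and jki variants accumulate into c, so they assume c starts at zero.

```c
#include <assert.h>
#include <stddef.h>

/* Element (r, col) of an n x n row-major matrix m. */
#define AT(m, n, r, col) ((m)[(size_t)(r) * (size_t)(n) + (size_t)(col)])

/* ijk: sum accumulates in a local; inner loop walks a row of a
   and a column of b. */
void matmul_ijk(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += AT(a, n, i, k) * AT(b, n, k, j);
            AT(c, n, i, j) = sum;
        }
}

/* kij: inner loop walks rows of b and c; c must be zero on entry. */
void matmul_kij(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double r = AT(a, n, i, k);
            for (int j = 0; j < n; j++)
                AT(c, n, i, j) += r * AT(b, n, k, j);
        }
}

/* jki: inner loop walks columns of a and c; c must be zero on entry. */
void matmul_jki(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++) {
            double r = AT(b, n, k, j);
            for (int i = 0; i < n; i++)
                AT(c, n, i, j) += AT(a, n, i, k) * r;
        }
}
```

All three compute the same product; only the order in which memory is touched differs, which is what separates the three miss-rate classes.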


Matmult performance (DEC5000)

[Figure: mflops (d.p.), 0 to 3, versus matrix size (n), 50 to 200; one curve per loop ordering: ikj, kij, ijk, jik, jki, kji]

Curve groups:

  • ijk, jik: (L=2, S=0, MR=1.25)
  • kij, ikj: (L=2, S=1, MR=0.5)
  • jki, kji: (L=2, S=1, MR=2.0)