Understanding Cache Memory and Memory Hierarchy Progression in High Performance Computing, Slides of Computer Science

A lecture from docsity.com on high performance computing, focusing on memory hierarchy progression and cache. It covers the concept of cache, its hierarchy levels, and its relationship with programming. The lecture includes examples and analyses of cache misses and hits and their impact on performance. It also discusses why cache-related performance issues should be assessed for the important parts of a program.

Typology: Slides

2012/2013

Uploaded on 04/28/2013

dewaan

High Performance Computing
Lecture 29

Memory Hierarchy Progression

Levels shown in the diagram, from fastest to slowest:

• Cache: Level 1 Cache, Level 2 Cache, L3, L4… Cache
• Main (Primary) Memory
• Secondary Memory

Example 1: Vector Sum Reduction

double A[2048], sum = 0.0;

for (int i = 0; i < 2048; i++) sum = sum + A[i];

• To analyze cache behavior, we must view the program close to its machine code form (to see the loads and stores)

• Recall from the static instruction scheduling examples how the loop index i was kept in a register and not loaded/stored inside the loop

• We will assume that both the loop index i and the variable sum are kept in registers

• We will consider only the accesses to array elements

Example 1: Reference Sequence

• load A[0], load A[1], load A[2], …, load A[2047]

• Assume the base address of A (i.e., the address of A[0]) is 0xA000, which is 1010 0000 0000 0000 in binary

• Cache index bits: 100000000 (value = 256)

• Size of an array element (double) = 8B

• So 4 consecutive array elements fit into each cache block (block size is 32B)

• A[0] to A[3] have an index of 256

• A[4] to A[7] have an index of 257, and so on

Example 1: Cache Misses and Hits

Element   Address  Index  Result  Miss type
A[0]      0xA000   256    Miss    Cold start
A[1]      0xA008   256    Hit
A[2]      0xA010   256    Hit
A[3]      0xA018   256    Hit
A[4]      0xA020   257    Miss    Cold start
A[5]      0xA028   257    Hit
A[6]      0xA030   257    Hit
A[7]      0xA038   257    Hit
…
A[2044]   0xDFE0   255    Miss    Cold start
A[2045]   0xDFE8   255    Hit
A[2046]   0xDFF0   255    Hit
A[2047]   0xDFF8   255    Hit

Cold start miss: we assume that the cache is initially empty. Also called a Compulsory Miss.

The hit ratio of our loop is 75%: there are 1536 hits out of 2048 memory accesses. This is entirely due to spatial locality of reference.

If the loop were preceded by a loop that accessed all array elements, the hit ratio of our loop would be 100%: 25% due to temporal locality and 75% due to spatial locality.

Example 1 with double A[4096]

Why should it make a difference?

• Consider the case where the loop is preceded by another loop that accesses all array elements in order

• The entire array no longer fits into the cache (cache size: 16KB, array size: 32KB)

• After execution of the previous loop, only the second half of the array will be in the cache

• Analysis: our loop misses on the first access to every block, the same pattern we just saw for a cold start, because each block it needs has been evicted before it is reused

• These are called Capacity Misses, as they would not be misses if the cache had been big enough

Example 2: Vector Dot Product

Address breakdown: Tag: 18b, Index: 9b, Offset: 5b

Example 2: Cache Hits and Misses

Element   Address  Index  Result  Miss type
A[0]      0xA000   256    Miss    Cold start
B[0]      0xE000   256    Miss    Cold start
A[1]      0xA008   256    Miss    Conflict
B[1]      0xE008   256    Miss    Conflict
A[2]      0xA010   256    Miss    Conflict
B[2]      0xE010   256    Miss    Conflict
A[3]      0xA018   256    Miss    Conflict
B[3]      0xE018   256    Miss    Conflict
…
B[1023]   0xFFF8   511    Miss    Conflict

Conflict miss: a miss due to conflicts in cache block requirements from memory accesses of the same program.

Hit ratio for our program: 0%

Source of the problem: the elements of arrays A and B accessed in order have the same cache index.

The hit ratio would be better if the base address of B were such that these cache indices differ.