Computer Systems Architecture
CMSC 411
Unit 5 – Memory Hierarchy
Alan Sussman
October 7, 2004
CMSC 411 - Alan Sussman 2
Administrivia
• HW #3 due today
– questions
• Quiz 2 Tuesday, Oct. 12
– on Unit 3, basic pipelining
– practice quiz posted, answers posted later today
– questions?
• Read Chapter 5
– except 5.11-5.
Last time
• Long instructions
- can cause structural hazards, and WAW hazards – why?
- detect hazards early, to allow precise exceptions
- in ID pipeline stage, and delay EX cycle if problem detected
- can delay WB, use history or future file, let OS deal with it, to enable precise exceptions
• MIPS R4000 pipeline design
- 8 stage pipeline – superpipelining
- extra stages come from multi-cycle cache accesses
- 2 cycle load delay and 3 cycle branch delay (1 delay slot, 2 cycle stall for taken branches)
- complex FP pipeline – 8 stages used in different combinations for different operations
Cache Memory
Issues to consider
• How big should the fastest memory (cache
memory) be?
• How do we decide what to put in cache
memory?
• If the cache is full, how do we decide what
to remove?
• How do we find something in cache?
• How do we handle writes?
First, there is main memory
• Jargon:
– frame address – which page?
– block number – which cache block?
– contents – the data
Then add a cache
• Jargon: Each address of a memory location
is partitioned into
– block address
– block offset
Fig. 5.
How does cache memory work?
• The following slides discuss:
– what cache memory is
– three organizations for cache memory
- direct mapped
- set associative
- fully associative
– how the bookkeeping is done
• Important note: All addresses shown are in
octal. Addresses in the book are usually decimal.
What is cache memory?
Main memory first
Main memory is divided into (cache) blocks. Each block contains many words (16-64 common now).
Main memory
Blocks are grouped into frames (pages), 3 frames in this picture.
Main memory (cont.)
Blocks are addressed by their frame number, and their block number within the frame.
Cache memory
Cache has many, MANY fewer blocks than main memory, each with
– a block number
– a memory address
– data
– a valid bit
– a dirty bit
Set associative cache (cont.)
Note that the last two bits of the memory block’s address always match the set number, so they do not need to be stored. This part of the address is called the index. The higher order bits are stored, and are called the tag. In these pictures, both index and tag are shown.
[Figure: cache blocks grouped into Set 0, Set 1, Set 2, Set 3]
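The index/tag split can be sketched in C. This is only an illustration, assuming a cache with 4 sets (so the index is the low two bits of the block address); the names `NUM_SETS`, `block_index` and `block_tag` are made up for the example.

```c
#define NUM_SETS 4u   /* assumed: 4 sets, so the index is the low 2 bits */

/* Index: the set a memory block maps to (low-order bits of its address). */
unsigned block_index(unsigned block_addr) {
    return block_addr % NUM_SETS;
}

/* Tag: the remaining high-order bits, stored in the cache for matching. */
unsigned block_tag(unsigned block_addr) {
    return block_addr / NUM_SETS;
}
```

For example, memory block 14 (octal, written `014` in C) has index 0 and tag 3 under these assumptions.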
Set associative cache replacement
• Which entry in the set to replace?
• Three common choices:
– Replace an eligible random block
– Replace the least recently used (LRU) block
- can be hard to keep track of, so often only approximated
– Replace the oldest eligible block (First In, First
Out, or FIFO)
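A rough sketch of how LRU and FIFO victim selection might look for one set, assuming the hardware keeps exact timestamps (real caches often only approximate LRU, as noted above). The struct layout, the field names and the 4-way associativity are assumptions for the example.

```c
#define WAYS 4   /* assumed associativity for this example */

/* Per-block bookkeeping within one set (field names are illustrative). */
struct way_info {
    int valid;               /* 0 = empty slot */
    unsigned long loaded_at; /* time the block was brought in (for FIFO) */
    unsigned long used_at;   /* time of the most recent access (for LRU) */
};

/* Return the way to replace: an invalid way if one is found first,
 * otherwise the block with the smallest timestamp
 * (used_at for LRU, loaded_at for FIFO). */
int pick_victim(const struct way_info set[WAYS], int use_lru) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        unsigned long t = use_lru ? set[w].used_at : set[w].loaded_at;
        unsigned long best = use_lru ? set[victim].used_at
                                     : set[victim].loaded_at;
        if (t < best)
            victim = w;
    }
    return victim;
}
```

Random replacement would instead pick any valid way without consulting the timestamps, which is why it needs no bookkeeping at all.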
Data cache replacement – Fig. 5.
SPEC2000, in misses per 1000 instructions, by set associativity
         Two-way                 Four-way                Eight-way
Size     LRU    Random  FIFO     LRU    Random  FIFO     LRU    Random  FIFO
16KB     114.1  117.3   115.5    111.7  115.1   113.3    109.0  111.8   110.
64KB     103.4  104.3   103.9    102.4  102.3   103.1    99.7   100.5   100.
256KB    92.2   92.1    92.5     92.1   92.1    92.5     92.1   92.1    92.
Computer Systems Architecture
CMSC 411
Unit 5 – Memory Hierarchy
Alan Sussman
October 12, 2004
Administrivia
• Quiz 2 today
– questions?
• HW for Unit 5 out soon
Last time
• Main memory
- frame address – page number
- block number – cache block within page
- contents – the data
• Cache memory
- block address
- tag – high order bits for matching
- index – which set, for set associative caches
- block offset – which byte within the block
- contains way fewer blocks than main memory
- for each cache block – block number, memory address of block it contains, data, valid bit, dirty bit
Last time (cont.)
• Direct mapped cache
- each memory block can only go into 1 cache block – use low order bits of block address
• Set associative cache
- multiple places for a memory block to go – the degree of set associativity is how many
- don’t need to store the index (the set number), since it’s known from the cache block number – rest of block address is the tag
- replacement policy determines which block to replace when new one is loaded (e.g., random, LRU, FIFO)
Fully associative cache
In fully associative cache, memory blocks may be stored anywhere.
So block 14 might be put in the first available block -- one with valid = 0.
Fully associative cache (cont.)
With this result.
Managing cache
Use direct mapped cache as an example.
After first read operation, cache memory looked like this.
[Figure: cache contents, with Valid and Dirty columns]
Managing cache (cont.)
If all other memory references involved block 14, no other blocks would need to be fetched from memory.
But suppose eventually need to fetch blocks 10, 31 and 66.
Need to fetch all three, because don’t have valid versions of them.
[Figure: cache contents, with Valid and Dirty columns]
Managing cache (cont.)
The result looks like this.
Now suppose write to block 66.
[Figure: cache contents, with Valid and Dirty columns]
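The bookkeeping in this example can be imitated with a small C sketch. It assumes a direct mapped cache with 8 blocks (one octal digit of index) and a write-back policy, which is discussed later in these notes; the struct, the counters and `access_block` are illustrative names, not from the slides.

```c
#define NUM_BLOCKS 8u   /* assumed: 8 cache blocks, index = low octal digit */

struct cache_block {
    int valid;       /* does this entry hold a real memory block? */
    int dirty;       /* has it been written since it was loaded? */
    unsigned tag;    /* high-order bits of the block address */
};

struct cache_block cache[NUM_BLOCKS];  /* zero-initialized: all invalid */
int memory_fetches = 0;                /* blocks loaded from main memory */
int write_backs = 0;                   /* dirty blocks written to memory */

/* Access one memory block; is_write marks the cached copy dirty.
 * Returns 1 on a hit, 0 on a miss. */
int access_block(unsigned block_addr, int is_write) {
    struct cache_block *b = &cache[block_addr % NUM_BLOCKS];
    unsigned tag = block_addr / NUM_BLOCKS;
    int hit = b->valid && b->tag == tag;
    if (!hit) {
        if (b->valid && b->dirty)
            write_backs++;        /* old contents go back to memory first */
        memory_fetches++;         /* load the requested block */
        b->valid = 1;
        b->dirty = 0;
        b->tag = tag;
    }
    if (is_write)
        b->dirty = 1;             /* write-back: only the cache is updated */
    return hit;
}
```

Reading blocks 14, 10, 31 and 66 (octal) gives four misses in this model, and a later write to block 66 hits and sets its dirty bit.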
Write through vs. write back
• Which is better?
– Write back gives faster writes, since don't have
to wait for main memory
– Write back is very efficient if want to modify
many bytes in a given block
– But write back can slow down some reads,
since a cache miss might cause a write back
– In multiprocessors, write through might be the
only correct solution. Why?
Cache summary
• Cache memory can be organized as direct
mapped, set associative, or fully associative
• Can be write-through or write-back
• Extra bits such as valid and dirty bits help
keep track of the status of the cache
Computer Systems Architecture
CMSC 411
Unit 5 – Memory Hierarchy
Alan Sussman
October 14, 2004
Administrivia
• HW for Unit 5 posted
– due date TBD
• Quizzes returned Tuesday
– answers already posted
• Grad school workshop Tuesday, Oct. 19, 5-7PM, CSIC 2117
– come ask questions of both faculty and current grad students!
Last time
• Fully associative cache
- any memory block can go into any cache block
• Write through cache
- memory gets updated immediately on write
- reads only cause block to get loaded on miss
• Write back cache
- writes only to cache
- cache and main memory can be inconsistent
- reads can cause updates from cache to memory, if block replaced is dirty
• Write through vs. write back
- name one good feature of each
How much do memory stalls slow
down a machine?
• Suppose that on pipelined MIPS, each instruction
takes, on average, 2 clock cycles, not counting
cache faults/misses
• Suppose, on average, there are 1.33 memory
references per instruction, memory access time is
50 cycles, and the miss rate is 2%
• Then each instruction takes, on average:
2 + (0 × .98) + (1.33 × .02 × 50) = 3.33 clock cycles
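The same arithmetic as a small C function, as a sketch only; the function and parameter names are chosen for this example.

```c
/* Average cycles per instruction, including memory stalls:
 * base CPI plus (memory references per instruction)
 * times (miss rate) times (miss penalty in cycles). */
double cycles_per_instruction(double base_cpi, double refs_per_instr,
                              double miss_rate, double miss_penalty) {
    return base_cpi + refs_per_instr * miss_rate * miss_penalty;
}
```

Calling `cycles_per_instruction(2.0, 1.33, 0.02, 50.0)` reproduces the 3.33 cycles computed above (hits contribute 0 extra cycles, so they drop out).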
Memory stalls (cont.)
• To reduce the impact of cache misses, can
reduce any of three parameters:
– main memory access time (miss penalty)
– miss rate
– cache access (hit) time
Reducing cache miss penalty
• 5 strategies:
– Give priority to read misses over write misses
– Don't wait for the whole block
– Use a nonblocking cache
– Multi-level cache
– Victim caches
• First 4 used in most desktop and server
machines
Give priority to read misses over
write misses
• But need to be careful
• Example:
– Suppose have a direct mapped cache, with
room for 8 blocks of 16 bytes each
– Then M[512] and M[1024] both get stored in
block 0, so can't be in cache at the same time
• Consider the following instructions:
SD R3, 512(R0)
LD R1, 1024(R0)
LD R2, 512(R0)
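To see why both addresses land in block 0, here is a minimal C sketch of the mapping for this cache (8 blocks of 16 bytes); the function name is made up for the example.

```c
#define BLOCK_BYTES 16u   /* bytes per cache block, from the example */
#define NUM_BLOCKS  8u    /* blocks in the direct mapped cache */

/* Cache block that a byte address maps to:
 * drop the block offset, then keep the low-order bits. */
unsigned cache_block_for(unsigned byte_addr) {
    return (byte_addr / BLOCK_BYTES) % NUM_BLOCKS;
}
```

Here 512/16 = 32 and 1024/16 = 64, and both are multiples of 8, so both map to block 0.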
Example (cont.)
- If the cache is write-through, the SD will cause memory location 512 to be changed
- The first LD will cause block 0 to be replaced, so that the contents of M[512] are no longer available in cache
- If the system is write-back, this is when memory location 512 will be changed
- Physically, the contents of block 0 will be put into temporary storage (a write buffer) while the new block is loaded, then the write back proceeds
- The second LD again replaces block 0, but this time no write-back is necessary
- But get a RAW hazard if don’t ensure that the write-through or write-back completes before the second LD reads memory
Example (cont.)
• To avoid such RAW hazards:
– Can force the read miss to always wait until the
write buffer is empty
– Or can force the hardware to check the write
buffer before read and only wait if there is a
potential hazard
Another write buffer optimization
• Write buffer mechanics, with merging
- An entry may contain multiple words (maybe even a whole cache block)
- If there’s an empty entry, the data and address are written to the buffer, and the CPU is done with the write
- If buffer contains other modified blocks, check to see if new address matches one already in the buffer – if so, combine the new data with that entry
- If buffer full and no address match, cache and CPU wait for an empty entry to appear (meaning some entry has been written to main memory)
- Merging improves memory efficiency, since multi-word writes are usually faster than writing one word at a time
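The merging mechanics might be sketched as follows. This is a simplified model, assuming a 4-entry buffer in which each entry covers 4 consecutive words; the sizes, the struct and the function names are all assumptions for the example, and draining entries to main memory is left out.

```c
#define ENTRIES 4          /* assumed buffer depth */
#define WORDS_PER_ENTRY 4  /* assumed: one entry covers 4 consecutive words */

struct wb_entry {
    int in_use;
    unsigned base;                    /* word address of the entry's first word */
    unsigned data[WORDS_PER_ENTRY];
    int word_valid[WORDS_PER_ENTRY];  /* which words hold pending data */
};

struct wb_entry buffer[ENTRIES];      /* zero-initialized: all entries empty */

/* Try to buffer a one-word write. Returns 1 if accepted (merged into a
 * matching entry or placed in an empty one), 0 if the buffer is full
 * and the CPU would have to wait. */
int buffer_write(unsigned word_addr, unsigned value) {
    unsigned slot = word_addr % WORDS_PER_ENTRY;
    unsigned base = word_addr - slot;
    int free_entry = -1;
    for (int e = 0; e < ENTRIES; e++) {
        if (buffer[e].in_use && buffer[e].base == base) {
            buffer[e].data[slot] = value;      /* address match: merge */
            buffer[e].word_valid[slot] = 1;
            return 1;
        }
        if (!buffer[e].in_use && free_entry < 0)
            free_entry = e;                    /* remember first empty entry */
    }
    if (free_entry < 0)
        return 0;                              /* full and no match: stall */
    struct wb_entry *n = &buffer[free_entry];
    n->in_use = 1;
    n->base = base;
    for (int w = 0; w < WORDS_PER_ENTRY; w++)
        n->word_valid[w] = 0;
    n->data[slot] = value;
    n->word_valid[slot] = 1;
    return 1;
}

/* Number of entries currently holding pending writes. */
int entries_in_use(void) {
    int count = 0;
    for (int e = 0; e < ENTRIES; e++)
        count += buffer[e].in_use;
    return count;
}
```

In this model, writes to word addresses 100 and 101 share one entry, so two writes cost one buffer slot and can later go to memory together.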
Miss rate – Fig. 5.
[Figure: miss rate plot; SPEC2000, LRU replacement]
How to reduce the miss rate?
• Use larger blocks
• Use more associativity, to reduce conflict misses
• Victim cache
• Pseudo-associative caches (won’t talk about this)
• Prefetch (hardware controlled)
• Prefetch (compiler controlled)
• Compiler optimizations
Increasing block size
• Want the block size large so don’t have to stop so
often to load blocks
• Want the block size small so that blocks load
quickly
Fig. 5.16 – SPEC
Increasing block size (cont.)
• So large block size reduces miss rates, but...
• Example:
– Suppose that loading a block takes 80 cycles
(overhead) plus 2 clock cycles for each 16 bytes
– A block of size 64 bytes can be loaded in
80 + 2*64/16 cycles = 88 cycles (miss penalty)
– If the miss rate is 7%, then the average memory
access time is
1 + .07 * 88 = 7.16 cycles
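The example can be written as two small C functions, which makes it easy to try other block sizes. This is a sketch; the function names are made up, and the hit time of 1 cycle is read off the formula above.

```c
/* Miss penalty in cycles: 80 cycles of overhead plus 2 cycles
 * for each 16 bytes transferred. */
int miss_penalty(int block_bytes) {
    return 80 + 2 * block_bytes / 16;
}

/* Average memory access time = hit time + miss rate * miss penalty. */
double avg_access_time(double hit_time, double miss_rate, int block_bytes) {
    return hit_time + miss_rate * miss_penalty(block_bytes);
}
```

With a 64-byte block, `miss_penalty(64)` is 88 and `avg_access_time(1.0, 0.07, 64)` gives the 7.16 cycles above. A larger block raises the penalty, so it only pays off if the miss rate drops enough to compensate.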
Memory Access Times – Fig. 5.
[Table: memory access time by block size and miss penalty, for cache sizes 4K, 16K, 64K and 256K; SPEC92 benchmarks on DEC workstation]
Computer Systems Architecture
CMSC 411
Unit 5 – Memory Hierarchy
Alan Sussman
October 19, 2004
Administrivia
• HW for Unit 5 posted
• Quizzes returned today
- Average: 62
- Median: 65, 25th percentile: 51, 75th percentile: 73
- questions
• Grad school workshop today, 5-7PM, CSIC 2117
Last time
• Reducing cache miss penalty
- priority to read misses over write misses
- be careful to use contents of write buffer
- can merge entries into write buffer
- don’t wait for whole block
- early restart or critical word first
- use a non-blocking cache
- works best for more complex pipelines than we’ve seen so far
- multi-level cache
- to capture misses in lower level caches
- lowers effective miss penalty
- victim cache
- to reduce conflict misses
Last time (cont.)
• Reducing miss rate - compulsory, capacity,
conflict misses
– use larger blocks
- what’s the cost of larger blocks?
– use higher associativity
– victim cache
– prefetch – hardware or software/compiler
– compiler optimizations
Higher associativity
- A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
- this is the 2:1 cache rule of thumb (seems to work up to 128KB caches)
- But associative cache is slower than direct-mapped, so the clock may need to run slower
- Example:
- Suppose that the clock cycle time for a 2-way cache needs to be 1.10 times the clock cycle time for a 1-way (direct-mapped) cache
- the hit time increases with higher associativity
- Then the average memory access time for 2-way is 1.10 + miss rate × 50 (assuming that the miss penalty is 50 cycles and the 1-way hit time is 1 cycle)
Memory access time – Fig. 5.
[Table: memory access time by cache size (KB) and associativity: one-way, two-way, four-way, eight-way]
Pseudo-associative cache
• Uses the technique of chaining, with a series of
cache locations to check if the block is not found
in the first location
- e.g., invert most significant bit of index part of address (as if it were a set associative cache)
• The idea:
- Check the direct mapped address
- Until the block is found or the chain of addresses ends, check the next alternate address
- If the block has not been found, bring it in from memory
• Three different delays generated, depending on
which step succeeds
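A minimal sketch of the lookup, assuming an 8-block cache and a chain of length two (the direct mapped location, then the location with the most significant index bit inverted). Storing the whole block address instead of a tag is a simplification for the example, and all names are made up.

```c
#define NUM_BLOCKS 8u
#define INDEX_MSB  4u   /* most significant bit of a 3-bit index */

struct entry {
    int valid;
    unsigned block_addr;  /* full block address kept, for simplicity */
};

struct entry cache[NUM_BLOCKS];  /* zero-initialized: all invalid */

/* Returns 1 for a fast hit (first location), 2 for a slow hit
 * (alternate location), 0 for a miss (block must come from memory). */
int lookup(unsigned block_addr) {
    unsigned first = block_addr % NUM_BLOCKS;
    unsigned alternate = first ^ INDEX_MSB;   /* invert the index MSB */
    if (cache[first].valid && cache[first].block_addr == block_addr)
        return 1;
    if (cache[alternate].valid && cache[alternate].block_addr == block_addr)
        return 2;
    return 0;
}
```

The three return values correspond to the three different delays: fast hit, slow hit, and miss.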
Merging arrays (cont.)
Means that at least 2 blocks must be in cache to begin using the arrays.
val[0] val[1] val[2] val[3] . . .
val[64] val[65] val[66] val[67] . . .
val[size-1] key[0] key[1] key[2] key[3] . . .
Merging arrays (cont.)
It is more efficient to store them together, especially if more than two arrays are coupled this way.
val[0] key[0] val[1] key[1] . . .
val[32] key[32] val[33] key[33] . . .
Merging arrays (cont.)
Can do this by making the two arrays part of a structure.
val[0] key[0] val[1] key[1] . . .
val[32] key[32] val[33] key[33] . . .
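In C, the two layouts might be declared like this. It is a sketch only: the array length `SIZE` and the names `merge` and `merged_array` are assumptions for the example.

```c
#define SIZE 1024   /* assumed array length */

/* Before: two separate arrays; val[i] and key[i] lie far apart in memory,
 * so using both touches at least two cache blocks. */
int val[SIZE];
int key[SIZE];

/* After: one array of structures; val and key for the same i are adjacent,
 * so they usually arrive in the same cache block. */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];
```

Code that used `val[i]` and `key[i]` would then use `merged_array[i].val` and `merged_array[i].key`.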
Technique 2: interchanging loops
Example:
For j=0, 1, …, 99
    For i=0, 1, …, 4999
        x[i][j] = 2 * x[i][j];
    End for;
End for;
Interchanging loops (cont.)
Notice that accesses are by columns, so the elements are spaced 100 words apart.
Blocks are bouncing in and out of cache.
For j=0, 1, …, 99
    For i=0, 1, …, 4999
        x[i][j] = 2 * x[i][j];
    End for;
End for;
Interchanging loops (cont.)
First color the loops:
For j=0, 1, …, 99
    For i=0, 1, …, 4999
        x[i][j] = 2 * x[i][j];
    End for;
End for;
Interchanging loops (cont.)
Notice that the program has the same effect if the two loops are interchanged:
For i=0, 1, …, 4999
    For j=0, 1, …, 99
        x[i][j] = 2 * x[i][j];
    End for;
End for;
Interchanging loops (cont.)
But with this ordering, use every element in a cache block before needing another block!
For i=0, 1, …, 4999
    For j=0, 1, …, 99
        x[i][j] = 2 * x[i][j];
    End for;
End for;
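The two loop orders, written out in C as a sketch. The array dimensions come from the example; the function names are made up, and `int` elements are an assumption.

```c
#define ROWS 5000
#define COLS 100

static int x[ROWS][COLS];   /* C stores this row by row */

/* Before: j outer, i inner. Consecutive accesses are COLS words apart,
 * so nearly every access touches a different cache block. */
void scale_by_columns(void) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After: i outer, j inner. Consecutive accesses are adjacent in memory,
 * so every word of a cache block is used before moving on. */
void scale_by_rows(void) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

Both functions compute the same result; only the order of the memory accesses differs.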
Technique 3: loop fusion
Example:
For i=0, 1, …, 4999
    For j=0, 1, …, 99
        x[i][j] = 2 * x[i][j];
    End for;
End for;
For i=0, 1, …, 4999
    For j=0, 1, …, 99
        y[i][j] = x[i][j] * a[i][j];
    End for;
End for;
Loop fusion (cont.)
Note that the loop control is the same for both sets of loops.
For i=0, 1, …, 4999
    For j=0, 1, …, 99
        x[i][j] = 2 * x[i][j];
    End for;
End for;
For i=0, 1, …, 4999
    For j=0, 1, …, 99
        y[i][j] = x[i][j] * a[i][j];
    End for;
End for;
Loop fusion (cont.)
And note that the array x is used in each, so probably needs to be loaded into cache twice, which wastes cycles.
For i=0, 1, …, 4999
    For j=0, 1, …, 99
        x[i][j] = 2 * x[i][j];
    End for;
End for;
For i=0, 1, …, 4999
    For j=0, 1, …, 99
        y[i][j] = x[i][j] * a[i][j];
    End for;
End for;
Loop fusion (cont.)
So combine, or fuse, the loops to improve efficiency.
For i=0, 1, …, 4999
    For j=0, 1, …, 99
        x[i][j] = 2 * x[i][j];
        y[i][j] = x[i][j] * a[i][j];
    End for;
End for;
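A C version of the fused loop, as a sketch; the function name and the `int` element type are assumptions for the example.

```c
#define ROWS 5000
#define COLS 100

static int x[ROWS][COLS], a[ROWS][COLS], y[ROWS][COLS];

/* Fused loop: each x[i][j] is updated and then reused immediately,
 * while it is still in cache, instead of being reloaded by a second loop. */
void scale_and_multiply(void) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            x[i][j] = 2 * x[i][j];
            y[i][j] = x[i][j] * a[i][j];
        }
}
```

Fusion is safe here because the second statement only needs the x[i][j] that the first statement has just produced in the same iteration.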
Blocking access to arrays (cont.)
A B = C
Blocking access to arrays (cont.)
Instead, order the computation using rectangular blocks of A and B.
A B = C
Partial answer!
Blocking access to arrays (cont.)
If the block of A has k rows, then only need to load B m/k times.
A B = C
Partial answer!
Blocking access to arrays (cont.)
Improves temporal locality
/* Before */
for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
        r=0;
        for (k=0; k<N; k++)
            r=r+y[i][k]*z[k][j];
        x[i][j]=r;
    }
/* After (B is the blocking factor; x must start at zero) */
for (jj=0; jj<N; jj=jj+B)
    for (kk=0; kk<N; kk=kk+B)
        for (i=0; i<N; i++)
            for (j=jj; j<min(jj+B,N); j++) {
                r=0;
                for (k=kk; k<min(kk+B,N); k++)
                    r=r+y[i][k]*z[k][j];
                x[i][j]=x[i][j]+r;
            }
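A self-contained version of both loops, as a sketch that can be compiled and checked. The matrix size, the blocking factor, the `int` element type and the helper `min_int` are assumptions for the example; real code would pick B so that the blocks fit in cache.

```c
#define N 6   /* small matrix size, chosen so results are easy to check */
#define B 4   /* blocking factor; deliberately does not divide N evenly */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Before: straightforward x = y * z. */
void matmul_plain(int x[N][N], int y[N][N], int z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* After: work on B-by-B blocks of z; x accumulates partial sums. */
void matmul_blocked(int x[N][N], int y[N][N], int z[N][N]) {
    for (int i = 0; i < N; i++)          /* x must start at zero */
        for (int j = 0; j < N; j++)
            x[i][j] = 0;
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + B, N); j++) {
                    int r = 0;
                    for (int k = kk; k < min_int(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;        /* add this block's partial sum */
                }
}
```

Both functions produce the same matrix; the blocked version just reorders the work so that a block of z is reused many times before it leaves the cache.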
Computer Systems Architecture
CMSC 411
Unit 5 – Memory Hierarchy
Alan Sussman
October 21, 2004
Administrivia
• Quiz 2 questions?
• HW for Unit 5
– questions?
– due date posted by tomorrow
• Midterm
– will be rescheduled to a later date, to be announced by tomorrow
Last time
• Reducing cache miss rate
– larger blocks
– higher associativity
- but can make cache hits slower
– hardware prefetch
- works well for sequential accesses
- cost?
– software/compiler prefetch
- instruction that moves data into cache, w/o causing exceptions or pipeline bubbles
– compiler optimizations
Last time (cont.)
• Compiler optimizations
– merging arrays
- change separate arrays into an array of structs, to improve spatial locality
– loop interchange
- to access data in the order it is stored, improve spatial locality
– loop fusion
- to improve temporal locality
– blocking
- improves both temporal and spatial locality
Reducing the time for cache hits
• K.I.S.S.
• Use virtual addresses rather than physical
addresses in the cache.
• Pipeline cache accesses
• Trace caches (won’t talk about these)
K.I.S.S.
• Cache should be small enough to fit on the
processor chip
• Direct mapped is faster than associative,
especially on read
– overlap tag check with transmitting data
• For current processors, small L1 caches to
keep fast clock cycle time, hide L1 misses
with dynamic scheduling, and use L
caches to avoid main memory accesses
Main memory management
• Questions:
– How big should main memory be?
– How to handle reads and writes?
– How to find something in main memory?
– How to decide what to put in main memory?
– If main memory is full, how to decide what to
replace?
The scale of things
• Typically (as of 2000):
- Registers: < 1 KB, access time .25 - .5 ns
- Cache: < 8 MB, access time .5 - 25 ns
- Main Memory: < 4 GB, access time 150 - 250 ns
- Disk Storage: > 30 GB, access time 5,000,000 ns (5 ms)
• Memory Technology: CMOS (Complementary
Metal Oxide Semiconductor)
- uses a combination of n- and p-doped semiconductor material to achieve low power dissipation.
Memory hardware
• DRAM: dynamic random access memory,
typically used for main memory
– one transistor per data bit
– each bit must be refreshed periodically (e.g.,
every 8 milliseconds), so maybe 5% of time is
spent in refresh
– access time < cycle time
– address sent in two halves so that fewer pins are
needed on chip (row and column access)
Memory hardware (cont.)
• SRAM: static random access memory, typically used
for cache memory
– 4-6 transistors per data bit
– no need for refresh
– access time = cycle time
– address sent all at once, for speed
Bottleneck
• Main memory access will slow down the CPU
unless the hardware designer is careful
• Some techniques can improve memory bandwidth,
the amount of data that can be delivered from
memory in a given amount of time:
- wider main memory
- interleaved memory
- independent memory banks
- avoiding memory bank conflicts
Wider main memory
• Cache miss: If a cache block contains k words,
then each cache miss involves these steps repeated
k times:
- Send the address to main memory
- Access the word (i.e., locate it)
- Send the word to cache, with the bits transmitted in parallel
• Idea behind wider memory: the user thinks about
32 bit words, but physical memory can have
longer words
• Then the operations above are done only k/n
times, where n is the number of 32 bit words in a
physical word
Wider main memory (cont.)
• Extra costs:
– a wider memory bus: hardware to deliver 32n
bits in parallel, instead of 32 bits
– a multiplexor to choose the correct 32 bits to
transmit from the cache to the CPU
Interleaved memory
• Partition memory into banks, with each bank able
to access a word and send it to cache in parallel
• Organize address space so that adjacent words live
in different banks - called interleaving
• For example, 4 banks might have words with the
following octal addresses:
Bank 0 Bank 1 Bank 2 Bank 3
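A minimal sketch of word interleaving across 4 banks, assuming word addresses and a simple modulo mapping; the function names are made up for the example. `print_bank_table` prints the first sixteen word addresses in octal, one column per bank.

```c
#include <stdio.h>

#define NUM_BANKS 4u

/* Bank holding a given word address when adjacent words are interleaved. */
unsigned bank_of(unsigned word_addr) {
    return word_addr % NUM_BANKS;
}

/* Position of the word within its bank. */
unsigned row_in_bank(unsigned word_addr) {
    return word_addr / NUM_BANKS;
}

/* Print word addresses 0 through 17 (octal), one column per bank. */
void print_bank_table(void) {
    printf("Bank 0  Bank 1  Bank 2  Bank 3\n");
    for (unsigned row = 0; row < 4; row++) {
        for (unsigned bank = 0; bank < NUM_BANKS; bank++)
            printf("%-8o", row * NUM_BANKS + bank);
        printf("\n");
    }
}
```

Under this mapping, bank 0 holds octal addresses 0, 4, 10, 14, and so on, so four consecutive words can be accessed in four different banks at once.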
Interleaved memory (cont.)
• Note how nice interleaving is for write-
through
• Also helps speed read and write-back
• Note: Interleaved memory acts like wide
memory, except that words are transmitted
through the bus sequentially, not in parallel
Independent memory banks
• Each bank of memory has its own address
lines and (usually) a bus
• Can have several independent banks:
perhaps
– one for instructions
– one for data
• Banks can operate independently without
slowing others
Avoid memory bank conflicts
• By having a prime number of memory
banks
• Since arrays frequently have even
dimension sizes - and often dimension sizes
that are a power of 2 - strides that match the
number of banks (or a multiple) give very
slow access
Example
• First access the first column
of x:
- x[0][0], x[1][0], x[2][0], ... x[255][0]
• with addresses
– K, K+512*4, K+512*8, ...
K+512*something
• With 4 memory banks, all of
the elements live in the same
memory bank, so the CPU
will stall in the worst
possible way
int x[256][512];
for (j=0; j<512; j=j+1)
    for (i=0; i<256; i=i+1)
        x[i][j] = 2 * x[i][j];
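The conflict can be checked with a short C sketch that counts how many banks a strided access pattern touches. It assumes word-interleaved banks (bank = word address mod number of banks); the function name and the 64-bank limit are made up for the example, and 7 banks is just one possible prime.

```c
#include <assert.h>

/* Count how many different banks are touched by `count` accesses that
 * start at word address `start` and are `stride` words apart. */
int banks_touched(unsigned start, unsigned stride, int count,
                  unsigned num_banks) {
    int seen[64] = {0};   /* this sketch supports up to 64 banks */
    int distinct = 0;
    assert(num_banks > 0 && num_banks <= 64);
    for (int n = 0; n < count; n++) {
        unsigned bank = (start + (unsigned)n * stride) % num_banks;
        if (!seen[bank]) {
            seen[bank] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Walking down a column of x is a stride of 512 words: `banks_touched(0, 512, 256, 4)` is 1, so every access waits on the same bank, while with 7 banks the same walk touches all 7.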