Memory & Cache
Lec 14
Memory Hierarchy
Memory
- Just an “ocean of bits”
- Many technologies are available
Key issues
- Technology (how bits are stored)
- Placement (where bits are stored)
- Identification (finding the right bits)
- Replacement (finding space for new bits)
- Write policy (propagating changes to bits)
Must answer these regardless of memory type
Types of Memory
Type            Size          Speed      Cost/bit
Register        < 1KB         < 1ns      $$$$
On-chip SRAM    8KB-6MB       < 10ns     $$$
Off-chip SRAM   1Mb – 16Mb    < 20ns     $$
DRAM            64MB – 1TB    < 100ns    $
Disk            40GB – 1PB    < 20ms     ~
Memory Hierarchy
[Figure: Registers, On-Chip SRAM, Off-Chip SRAM, DRAM, Disk, ordered from top to bottom. CAPACITY increases toward the bottom; SPEED and COST increase toward the top.]
Caches: Automatic Management of Fast Storage
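[Figure: a CPU with a single cache in front of main memory, compared with a CPU with several cache levels in front of main memory. L1 cache: 16~32KB, 1~2 pclk latency. Next level: ~256KB, ~10 pclk latency. Last level: ~4MB, ~50 pclk latency.]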
Why Memory Hierarchy?
Fast and small memories
- Enable quick access (fast cycle time)
- Enable lots of bandwidth (1+ L/S/I-fetch/cycle)
Slower larger memories
- Capture larger share of memory
- Still relatively fast
Slow huge memories
- Hold rarely-needed state
- Needed for correctness
Empirically observed
- Significant!
- Even small local storage (8KB) often satisfies >90% of references to multi-MB data set
All together:
- provide appearance of large, fast memory with cost of cheap, slow memory
Four Burning Questions
These are:
- Placement
  - Where can a block of memory go?
- Identification
  - How do I find a block of memory?
- Replacement
  - How do I make space for new blocks?
- Write Policy
  - How do I propagate changes?
Consider these for caches
Placement
Memory Type     Placement                 Comments
Registers       Anywhere; Int, FP, SPR    Compiler/programmer manages
Cache (SRAM)    Fixed in H/W              Direct-mapped, set-associative, fully-associative
DRAM            Anywhere                  O/S manages
Disk            Anywhere                  O/S manages
HUH?
N-way Set-Associative
Set-associative
- Block can be in N locations
- Hash collisions:
Identification
- Still perform tag check
- However, only N in parallel
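[Figure: N-way set-associative SRAM cache. The 32-bit address is split into Tag, Index, and Offset. The index (hash of the address) selects N tags and N data blocks; N comparators (?=) check the stored tags against the address tag in parallel; the offset selects Data Out from the matching block.]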
Cache Memory Structures
[Figure: three array organizations. An index feeds a decoder; a key is compared against stored tags; each entry holds a tag and data.]
- “Indexed Memory” (“Direct Mapped”): i-bit index, 2^i blocks
- “Associative Memory” (“Fully Associative”, “CAM”): no index, unlimited blocks
- “N-Way Set-Associative”: i-bit index, 2^i • N blocks
Placement and Identification
Consider a cache ($$) of: <BS = block size, S = # of sets, B = # of blocks>
- <64,64,64>: o=6, i=6, t=20: direct-mapped (S=B)
- <64,16,64>: o=6, i=4, t=22: 4-way S-A (S = B / 4)
- <64,1,64>: o=6, i=0, t=26: fully associative (S=1)
Total size = BS x B = BS x S x (B/S)
32-bit Address: Tag | Index | Offset
Portion   Length                       Purpose
Offset    o = log2(block size)         Select word within block
Index     i = log2(number of sets)     Select set of blocks
Tag       t = 32 - o - i               ID block within set
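The field widths in the table can be checked with a short sketch. It assumes 32-bit addresses and power-of-two block sizes and set counts, as in the examples above; the function names `address_fields` and `split_address` are illustrative, not from the slides.

```python
ADDRESS_BITS = 32  # assumption taken from the "32-bit Address" figure


def address_fields(block_size, num_sets, address_bits=ADDRESS_BITS):
    """Return (o, i, t): offset, index, and tag widths in bits."""
    # Both parameters must be powers of two for the bit slicing to work.
    assert block_size & (block_size - 1) == 0 and num_sets & (num_sets - 1) == 0
    o = block_size.bit_length() - 1   # o = log2(block size)
    i = num_sets.bit_length() - 1     # i = log2(number of sets)
    t = address_bits - o - i          # t = 32 - o - i
    return o, i, t


def split_address(addr, block_size, num_sets):
    """Split a byte address into (tag, index, offset)."""
    o, i, _ = address_fields(block_size, num_sets)
    offset = addr & (block_size - 1)        # select word within block
    index = (addr >> o) & (num_sets - 1)    # select set of blocks
    tag = addr >> (o + i)                   # ID block within set
    return tag, index, offset


# The three <BS, S, B> configurations from the slide.
for bs, s, b in [(64, 64, 64), (64, 16, 64), (64, 1, 64)]:
    o, i, t = address_fields(bs, s)
    print(f"<{bs},{s},{b}>: o={o}, i={i}, t={t}, ways={b // s}, size={bs * b} bytes")
```

All three configurations have the same total size (BS x B = 4096 bytes); only the split between index and tag bits changes.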
Each block (or cache line) has only one tag but can hold multiple “chunks” of data – benefit?
- The entire cache block is transferred to and from memory all at once
  - Good for spatial locality: if you access address i, you will probably want i+1 as well (prefetching effect)
- Reduced tag storage overhead
  - In 32-bit addressing, a 1-MB direct-mapped cache: how many bits of tags?
  - 4-byte cache block ⇒ 256K blocks x 12 tag bits ⇒ ~384KB of tag
  - 128-byte cache block ⇒ 8K blocks x 12 tag bits ⇒ ~12KB of tag
Block size = 2^o; Direct Mapped Cache Size = 2^(B+o) (B = block-index bits, o = block-offset bits)
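The tag-overhead figures above can be reproduced with a few lines. This is a minimal sketch that counts only address tag bits (no valid or dirty bits) for a direct-mapped cache with 32-bit addresses; `tag_storage_bytes` is an illustrative name.

```python
KB = 1 << 10
MB = 1 << 20


def tag_storage_bytes(cache_bytes, block_bytes, address_bits=32):
    """Tag storage of a direct-mapped cache, counting address tag bits only."""
    num_blocks = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1   # o = log2(block size)
    index_bits = num_blocks.bit_length() - 1     # one set per block when direct-mapped
    tag_bits = address_bits - offset_bits - index_bits
    return num_blocks * tag_bits // 8


for block in (4, 128):
    tags = tag_storage_bytes(1 * MB, block)
    print(f"{block}-byte blocks: {tags // KB}KB of tag")
```

In both cases each block carries a 12-bit tag; the larger block size simply means there are 32 times fewer tags to store.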
Cache Block Size
[Figure: address fields from MSB to LSB: tag, block index (B bits), block offset (o bits)]
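Fully Associative Cache
[Figure: address split into tag and blk. offset; an associative search on the tag drives a multiplexor that selects the data]
N-Way Set Associative Cache
[Figure: address split into tag, index, and BO (block offset); a decoder selects the set, an associative search checks the tags within it, and a multiplexor selects the data]
Cache Size = N x 2^(B+o)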
Here N=
2-Way Set Associative Cache
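[Figure: address split into tag, idx, and b.o. (block offset); each way (bank) has its own decoder and a comparator (=) that signals a tag match; a multiplexor selects the data from the matching way. Labels mark a way (bank) and a set.]
Cache Size = N x 2^(B+o)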
Here N=
Replacement
Cache has finite size
- What do we do when it is full?
Analogy: desktop full?
- Move books to bookshelf to make room
Same idea:
- Move blocks to next level of cache
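A minimal sketch of replacement in an N-way set-associative cache. The slides do not name a victim-selection policy, so least-recently-used (LRU) is an assumption here; the class tracks tags only (no data), and `SetAssociativeCache` is an illustrative name.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy N-way set-associative cache that tracks tags only (no data)."""

    def __init__(self, block_size, num_sets, ways):
        self.block_size = block_size
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set; key order runs from least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return (hit, evicted_block): evicted_block is None unless a line was replaced."""
        block = addr // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            return True, None
        victim = None
        if len(lines) == self.ways:       # set is full: make room (LRU is an assumption)
            victim_tag, _ = lines.popitem(last=False)
            victim = victim_tag * self.num_sets + index
        lines[tag] = True
        return False, victim


cache = SetAssociativeCache(block_size=64, num_sets=1, ways=2)
for addr in (0, 64, 0, 128):
    print(addr, cache.access(addr))
```

In the usage above, the access to address 128 finds the single set full and evicts block 1 (address 64), because block 0 was touched more recently. In a real hierarchy the evicted block is what moves to the next level.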
Write Policy – write through
Easiest policy: write-through
Every write propagates directly through hierarchy
- Write in L1, L2, memory, disk (?!?)
Why is this a bad idea?
- Very high bandwidth requirement
- Remember, large memories are slow
Popular in real systems only to the L2
- Every write updates L1 and L2
- Beyond L2, use write-back policy
Write Policy – write back
Most widely used: write-back
Maintain state of each line in a cache
- Invalid – not present in the cache
- Clean – present, but not written (unmodified)
- Dirty – present and written (modified)
Store state in tag array, next to address tag
- Mark dirty bit on a write
On eviction, check dirty bit
- If set, write back dirty line to next level
- Called a writeback or castout
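The line states and the eviction check can be sketched as follows. This is a toy direct-mapped cache that tracks state but not data; it assumes write-allocate (a write miss first fills the line), which the slides do not specify, and the names `LineState` and `WriteBackCache` are illustrative.

```python
from enum import Enum


class LineState(Enum):
    INVALID = "invalid"   # not present in the cache
    CLEAN = "clean"       # present, but not written
    DIRTY = "dirty"       # present and written


class WriteBackCache:
    """Toy direct-mapped write-back cache; tracks line state, not data."""

    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines
        self.state = [LineState.INVALID] * num_lines
        self.writebacks = []  # block numbers written back to the next level

    def access(self, addr, is_write):
        """Perform a read or write; return True on a hit."""
        block = addr // self.block_size
        index = block % self.num_lines
        tag = block // self.num_lines
        hit = self.state[index] is not LineState.INVALID and self.tags[index] == tag
        if not hit:
            # On eviction, check the dirty bit; if set, write the line back.
            if self.state[index] is LineState.DIRTY:
                self.writebacks.append(self.tags[index] * self.num_lines + index)
            self.tags[index] = tag
            self.state[index] = LineState.CLEAN
        if is_write:
            self.state[index] = LineState.DIRTY   # mark dirty bit on a write
        return hit


cache = WriteBackCache(num_lines=4, block_size=64)
cache.access(0, is_write=True)      # miss, fill, mark dirty
cache.access(256, is_write=False)   # same line, different tag: evicts dirty block 0
cache.access(512, is_write=False)   # evicts a clean line: no writeback
print(cache.writebacks)
```

Only the dirty eviction generates traffic to the next level; evicting a clean line simply drops it.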
Write Policy
Complications of write-back policy
- Stale copies lower in the hierarchy
- Must always check higher level for dirty copies before accessing copy in a lower level
Not a big problem in uniprocessors
- In multiprocessors: the cache coherence problem
I/O devices that use DMA (direct memory access)
can cause problems even in uniprocessors
- Called coherent I/O
- Must check caches for dirty copies before reading main memory
Caches and Performance
Caches
- Enable design for common case: cache hit
  - Cycle time, pipeline organization
  - Recovery policy
- Uncommon case: cache miss
  - Fetch from next level
    - Apply recursively if multiple levels
  - What to do in the meantime?
What is performance impact?
Various optimizations are possible
Cache Hits and Performance
Cache hit latency determined by:
- Cache organization
  - Associativity
    - Parallel tag checks expensive, slow
    - Way select slow (fan-in, wires)
  - Block size
    - Word select may be slow (fan-in, wires)
  - Number of blocks (sets x associativity)
    - Wire delay across array
    - “Manhattan distance” = width + height
    - Word line delay: width
    - Bit line delay: height
Array design is an art form
- Detailed analog circuit/wire delay modeling
[Figure: array with a word line and a bit line labeled]
Cache Misses and Performance
Miss penalty: what to do on a miss?
- Detect miss: 1 or more cycles
- Find victim (replace line): 1 or more cycles
- Request line from next level: several cycles
- Transfer line from next level: several cycles
  - (block size) / (bus width)
- Fill line into data array, update tag array: 1+ cycles
- Resume execution
In practice: 6 cycles to 100s of cycles
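The steps above can be added up in a rough sketch. The per-step cycle counts used as defaults are placeholders chosen for illustration, not measurements, and the steps are assumed not to overlap. The average-access-time formula (hit time + miss rate x miss penalty) is the usual textbook estimate rather than something stated on the slides.

```python
def transfer_cycles(block_size_bytes, bus_width_bytes):
    """Cycles to move one block over the bus: (block size) / (bus width), rounded up."""
    return -(-block_size_bytes // bus_width_bytes)


def miss_penalty(block_size_bytes, bus_width_bytes,
                 detect=1, find_victim=1, request=4, fill=1):
    """Sum the per-step costs, assuming the steps do not overlap.

    The default cycle counts are illustrative placeholders.
    """
    return (detect + find_victim + request
            + transfer_cycles(block_size_bytes, bus_width_bytes) + fill)


def average_access_time(hit_cycles, miss_rate, penalty_cycles):
    """Textbook estimate: hit time + miss rate x miss penalty."""
    return hit_cycles + miss_rate * penalty_cycles


penalty = miss_penalty(block_size_bytes=64, bus_width_bytes=8)
print("miss penalty (cycles):", penalty)
print("average access time (cycles):", average_access_time(1, 0.10, penalty))
```

Doubling the block size doubles the transfer term, which is one reason larger blocks raise the miss penalty even when they lower the miss rate.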
Cache Miss Rate
Determined by:
- Program characteristics
  - Temporal locality
  - Spatial locality
- Cache organization
  - Block size, associativity, number of sets
Improving Locality
Instruction text placement
- Profile program, place unreferenced or rarely referenced paths “elsewhere”
  - Maximize temporal locality
- Eliminate taken branches
  - Fall-through path has spatial locality
Cache Miss Rate Effects
Number of blocks (sets x associativity)
- More is better: fewer conflicts, greater capacity
Associativity
- Higher associativity reduces conflicts
- Very little benefit beyond 8-way set-associative
Block size
- Larger blocks reduce compulsory misses
- Exploit spatial locality
  - Usually: miss rates improve until 64B-256B
  - 512B or more: miss rates get worse
- Larger blocks less efficient: more capacity misses
- Fewer placement choices: more conflict misses
Cache Miss Rate
Subtle tradeoffs between cache organization parameters
- Large blocks reduce compulsory misses but increase miss penalty
  - #compulsory = (working set) / (block size)
  - #transfers = (block size) / (bus width)
- Large blocks increase conflict misses
  - #blocks = (cache size) / (block size)
- Associativity reduces conflict misses but increases access time
Can an associative cache ever have a higher miss rate than a direct-mapped cache of the same size?
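One way to explore this question is a small experiment. The sketch below assumes LRU replacement (the slides do not name a policy) and a trace that cycles through five blocks in a four-block cache; `count_misses` is an illustrative helper.

```python
from collections import OrderedDict


def count_misses(block_trace, num_blocks, ways):
    """Count misses for a trace of block numbers in an LRU cache of the given geometry."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_trace:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)      # hit: mark as most recently used
            continue
        misses += 1
        if len(lines) == ways:
            lines.popitem(last=False)     # evict the least recently used block
        lines[block] = True
    return misses


# Cycle through 5 blocks, one more than the cache can hold.
trace = [0, 1, 2, 3, 4] * 10
print("direct-mapped:    ", count_misses(trace, num_blocks=4, ways=1))
print("fully associative:", count_misses(trace, num_blocks=4, ways=4))
```

Under these assumptions the fully associative cache misses on all 50 accesses, because LRU always evicts the block that is needed next. The direct-mapped cache misses 23 times: blocks 1, 2, and 3 stay resident after their first miss, and only blocks 0 and 4 keep evicting each other. So the answer can be yes for specific access patterns and replacement policies.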
Cache Miss Rates: 3 C’s
[Figure: misses per instruction (%) for 8K1W, 8K4W, 16K1W, and 16K4W caches, broken down into conflict, capacity, and compulsory misses]
Vary size and associativity
- Compulsory misses are constant
- Capacity and conflict misses are reduced
Cache Miss Rates: 3 C’s
[Figure: misses per instruction (%) for 8K32B, 8K64B, 16K32B, and 16K64B caches, broken down into conflict, capacity, and compulsory misses]
Vary size and block size
- Compulsory misses drop with increased block size
- Capacity and conflict can increase with larger blocks