Cache Memory Structures & Organization: SRAM, Mapping, Policies, Performance, Slides of Electrical and Electronics Engineering

An in-depth analysis of various cache memory structures, including SRAM, direct-mapped, set-associative, and fully-associative caches. It covers placement and identification techniques, replacement policies, and write policies. The document also discusses the performance implications of cache organization parameters.

Typology: Slides

2019/2020

Uploaded on 06/15/2020

janeka
Memory & Cache

Lec 14

Memory Hierarchy

■ Memory

  • Just an “ocean of bits”
  • Many technologies are available

■ Key issues

  • Technology (how bits are stored)
  • Placement (where bits are stored)
  • Identification (finding the right bits)
  • Replacement (finding space for new bits)
  • Write policy (propagating changes to bits)

■ Must answer these regardless of memory type

Types of Memory

Type            Size          Speed     Cost/bit
Register        < 1KB         < 1ns     $$$$
On-chip SRAM    8KB – 6MB     < 10ns    $$$
Off-chip SRAM   1Mb – 16Mb    < 20ns    $$
DRAM            64MB – 1TB    < 100ns   $
Disk            40GB – 1PB    < 20ms    ~

Memory Hierarchy

[Figure: hierarchy ordered Registers, On-Chip SRAM, Off-Chip SRAM, DRAM, Disk. Capacity increases toward Disk; speed and cost increase toward Registers.]

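Caches: Automatic Management of Fast Storage

[Figure: a CPU backed by a single cache and main memory, compared with a CPU backed by several cache levels and main memory. L1 cache: 16~32KB, 1~2 pclk latency. Next level: ~256KB, ~10 pclk latency. Next level: ~4MB, ~50 pclk latency.]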

Why Memory Hierarchy?

■ Fast and small memories

  • Enable quick access (fast cycle time)
  • Enable lots of bandwidth (1+ L/S/I-fetch/cycle)

■ Slower larger memories

  • Capture larger share of memory
  • Still relatively fast

■ Slow huge memories

  • Hold rarely-needed state
  • Needed for correctness

■ Empirically observed

  • Significant!
  • Even small local storage (8KB) often satisfies >90% of references to a multi-MB data set

■ All together:

  • Provide the appearance of a large, fast memory at the cost of cheap, slow memory

Four Burning Questions

■ These are:

  • Placement
    • Where can a block of memory go?
  • Identification
    • How do I find a block of memory?
  • Replacement
    • How do I make space for new blocks?
  • Write Policy
    • How do I propagate changes?

■ Consider these for caches

  • Usually SRAM

Placement

Memory Type     Placement                Comments
Registers       Anywhere; Int, FP, SPR   Compiler/programmer manages
Cache (SRAM)    Fixed in H/W             Direct-mapped, set-associative, fully-associative
DRAM            Anywhere                 O/S manages
Disk            Anywhere                 O/S manages

HUH?

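N-way Set-Associative

■ Set-associative

  • A block can be in any of N locations (the N ways of one set)
  • Hash collisions:
    • Up to N blocks that map to the same set are still OK

■ Identification

  • Still perform a tag check
  • However, only N tag comparisons, done in parallel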

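[Figure: N-way set-associative SRAM cache. The 32-bit address is split into Tag, Index, and Offset. A hash of the Index selects one entry in each of N tag arrays and N data-block arrays, the N stored tags are compared (?=) with the address Tag, and the Offset selects the Data Out.]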
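The identification step for a set-associative cache can be sketched in a few lines. This is only an illustration under assumptions not stated on the slides: the "hash" is plain modulo indexing on the block address, each set is a Python list of stored tags (with `None` for an invalid way), and the name `lookup` is hypothetical. Hardware performs the N comparisons in parallel; the loop below does them one after another.

```python
def lookup(sets, addr, block_size, num_sets):
    """Return the way holding addr, or None on a miss."""
    block_addr = addr // block_size
    index = block_addr % num_sets      # assumed hash: simple modulo indexing
    tag = block_addr // num_sets       # remaining high-order bits
    # Only the N tags of the selected set are checked.
    for way, stored_tag in enumerate(sets[index]):
        if stored_tag == tag:
            return way
    return None


# 4 sets x 2 ways, 64-byte blocks; all ways start out invalid.
sets = [[None, None] for _ in range(4)]
sets[1][0] = 32    # block at 0x2040 (block address 129 -> set 1, tag 32)
sets[1][1] = 16    # block at 0x1040 (block address 65  -> set 1, tag 16)

print(lookup(sets, 0x2040, 64, 4))   # way holding the first block
print(lookup(sets, 0x1040, 64, 4))   # way holding the second block
print(lookup(sets, 0x3040, 64, 4))   # a third block in the same set: miss
```

The two addresses collide on the same set yet coexist, which is the "N still OK" case with N = 2.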

Cache Memory Structures

  • “Indexed Memory” (“Direct Mapped”): i-bit index through a decoder, 2^i blocks
  • “Associative Memory” (“Fully Associative”, “CAM”): no index, lookup by key, unlimited blocks
  • “N-Way Set-Associative”: i-bit index, 2^i • N blocks

Placement and Identification

■ Consider a cache ($$) of: <BS = block size, S = # of sets, B = # of blocks>

  • <64,64,64>: o=6, i=6, t=20: direct-mapped (S = B)
  • <64,16,64>: o=6, i=4, t=22: 4-way S-A (S = B / 4)
  • <64,1,64>: o=6, i=0, t=26: fully associative (S = 1)

■ Total size = BS x B = BS x S x (B/S)

32-bit Address: | Tag | Index | Offset |

Portion   Length                      Purpose
Offset    o = log2(block size)        Select word within block
Index     i = log2(number of sets)    Select set of blocks
Tag       t = 32 - o - i              ID block within set
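The field lengths for the three example configurations can be checked with a short sketch. It assumes power-of-two block sizes and set counts and a 32-bit address. The function name `split_address` and the sample address 0x12345678 are illustrative and do not come from the slides.

```python
def split_address(addr, block_size, num_sets, addr_bits=32):
    """Split an address into tag, index, and offset fields."""
    o = block_size.bit_length() - 1    # log2(block size)
    i = num_sets.bit_length() - 1      # log2(number of sets)
    t = addr_bits - o - i              # whatever bits are left
    return {
        "bits": (t, i, o),
        "tag": addr >> (o + i),
        "index": (addr >> o) & (num_sets - 1),
        "offset": addr & (block_size - 1),
    }


# <BS, S, B> triples from the slide: direct-mapped, 4-way, fully associative.
for block_size, num_sets, num_blocks in [(64, 64, 64), (64, 16, 64), (64, 1, 64)]:
    fields = split_address(0x12345678, block_size, num_sets)
    print((block_size, num_sets, num_blocks), fields)
```

With a single set the index field has zero bits, so every address maps to index 0 and the tag alone identifies the block.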

Cache Block Size

■ Each block (or cache line) has only one tag but can hold multiple “chunks” of data. Benefit?

  • The entire cache block is transferred to and from memory all at once
    • Good for spatial locality: if you access address i, you will probably want i+1 as well (prefetching effect)
  • Reduced tag storage overhead
    • With 32-bit addressing and a 1-MB direct-mapped cache, how many bits of tags?
    • 4-byte cache block ⇒ 256K blocks ⇒ ~384KB of tag
    • 128-byte cache block ⇒ 8K blocks ⇒ ~12KB of tag

■ Block size = 2^o; direct-mapped cache size = 2^(B+o), where B here is the number of block index bits

[Figure: address fields from MSB to LSB: tag, block index (B bits), block offset (o bits)]
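The tag-overhead figures for the 1-MB direct-mapped cache can be reproduced with a small calculation. The sketch assumes power-of-two sizes and counts only the address tag, not valid or dirty bits. The name `tag_storage_bytes` is illustrative.

```python
def tag_storage_bytes(cache_bytes, block_bytes, addr_bits=32):
    """Total tag storage, in bytes, for a direct-mapped cache."""
    num_blocks = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1
    index_bits = num_blocks.bit_length() - 1
    tag_bits = addr_bits - offset_bits - index_bits
    return num_blocks * tag_bits // 8


MB = 1 << 20
for block_bytes in (4, 128):
    kb = tag_storage_bytes(MB, block_bytes) // 1024
    print(f"{block_bytes}-byte blocks: {MB // block_bytes} blocks, {kb} KB of tag")
```

In both cases each tag is 12 bits wide, because offset plus index always cover the 20 bits that address 1 MB. Only the number of tags changes with the block size.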

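Fully Associative Cache

[Figure: fully associative cache. The address is split into tag and block offset; the tag drives an associative search over all stored tags, and a multiplexor selects the data.]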

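N-Way Set Associative Cache

[Figure: N-way set-associative cache. The address is split into tag, index, and block offset (BO); a decoder uses the index to select a set, an associative search compares the tags within that set, and a multiplexor selects the data.]

Cache Size = N x 2^(B+o)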

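2-Way Set Associative Cache

[Figure: 2-way set-associative cache. The address is split into tag, index (idx), and block offset (b.o.); each way (bank) has its own decoder and tag comparator (=) that produces a tag match, and a multiplexor selects the data from the matching way. One row across both ways forms a set.]

Cache Size = N x 2^(B+o), here with N = 2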

Replacement

■ Cache has finite size

  • What do we do when it is full?

■ Analogy: desktop full?

  • Move books to bookshelf to make room

■ Same idea:

  • Move blocks to next level of cache

Write Policy – write through

■ Easiest policy: write-through

■ Every write propagates directly through hierarchy

  • Write in L1, L2, memory, disk (?!?)

■ Why is this a bad idea?

  • Very high bandwidth requirement
  • Remember, large memories are slow

■ Popular in real systems only to the L2

  • Every write updates L1 and L2
  • Beyond L2, use write-back policy

Write Policy – write back

■ Most widely used: write-back

■ Maintain state of each line in a cache

  • Invalid – not present in the cache
  • Clean – present, but not written (unmodified)
  • Dirty – present and written (modified)

■ Store state in tag array, next to address tag

  • Mark dirty bit on a write

■ On eviction, check dirty bit

  • If set, write back dirty line to next level
  • Called a writeback or castout
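The dirty-bit bookkeeping can be illustrated with a toy model. This is a sketch under assumptions that are not on the slides: the cache is direct-mapped, it allocates a line on a write miss, and it tracks only tags and state, not data. The class name `WriteBackCache` is hypothetical.

```python
class WriteBackCache:
    """Toy direct-mapped cache that tracks tags and dirty bits only."""

    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.tags = [None] * num_blocks     # None means Invalid
        self.dirty = [False] * num_blocks   # False means Clean
        self.writebacks = 0                 # castouts sent to the next level

    def access(self, addr, is_write):
        block = addr // self.block_size
        index = block % self.num_blocks
        tag = block // self.num_blocks
        if self.tags[index] != tag:
            # Miss: the current occupant (if any) is the victim.
            if self.tags[index] is not None and self.dirty[index]:
                self.writebacks += 1        # dirty victim is written back
            self.tags[index] = tag
            self.dirty[index] = False       # a freshly filled line is Clean
        if is_write:
            self.dirty[index] = True        # mark dirty bit on a write


cache = WriteBackCache(num_blocks=4, block_size=64)
for _ in range(100):
    cache.access(0, is_write=True)          # 100 stores to the same line
cache.access(4 * 64, is_write=True)         # conflicting line evicts it
print(cache.writebacks)
```

Under write-through, each of the 100 stores would have been sent to the next level. Here they are absorbed by the cache and cost a single writeback when the line is evicted.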

Write Policy

■ Complications of write-back policy

  • Stale copies lower in the hierarchy
  • Must always check higher level for dirty copies before accessing copy in a lower level

■ Not a big problem in uniprocessors

  • In multiprocessors: the cache coherence problem

■ I/O devices that use DMA (direct memory access) can cause problems even in uniprocessors

  • Called coherent I/O
  • Must check caches for dirty copies before reading main memory

Caches and Performance

■ Caches

  • Enable design for common case: cache hit
    • Cycle time, pipeline organization
    • Recovery policy
  • Uncommon case: cache miss
    • Fetch from next level
      − Apply recursively if multiple levels
    • What to do in the meantime?

■ What is performance impact?

■ Various optimizations are possible

Cache Hits and Performance

■ Cache hit latency determined by:

  • Cache organization
    • Associativity
      − Parallel tag checks expensive, slow
      − Way select slow (fan-in, wires)
    • Block size
      − Word select may be slow (fan-in, wires)
    • Number of blocks (sets x associativity)
      − Wire delay across array
      − “Manhattan distance” = width + height
      − Word line delay: width
      − Bit line delay: height

■ Array design is an art form

  • Detailed analog circuit/wire delay modeling

[Figure: SRAM array with word lines running across the width and bit lines running along the height]

Cache Misses and Performance

■ Miss penalty: what to do on a miss?

  • Detect miss: 1 or more cycles
  • Find victim (replace line): 1 or more cycles
    • Write back if dirty
  • Request line from next level: several cycles
  • Transfer line from next level: several cycles
    • (block size) / (bus width)
  • Fill line into data array, update tag array: 1+ cycles
  • Resume execution

■ In practice: 6 cycles to 100s of cycles
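The miss-penalty steps add up as in the sketch below. Only the transfer term, (block size) / (bus width), comes from the slide. The default cycle counts for detect, victim selection, request, and fill are placeholder assumptions chosen for illustration, and the name `miss_penalty` is hypothetical.

```python
def miss_penalty(block_size, bus_width, *, detect=1, victim=1,
                 writeback=0, request=4, fill=1):
    """Sum the per-step cycle counts of a cache miss (all values assumed)."""
    # Transfer takes ceil(block size / bus width) bus cycles.
    transfer = -(-block_size // bus_width)
    return detect + victim + writeback + request + transfer + fill


# 8-byte bus, clean victim: doubling the block doubles the transfer time.
print(miss_penalty(64, 8))
print(miss_penalty(128, 8))
# Same miss, but with a dirty victim that costs 8 extra cycles to write back.
print(miss_penalty(64, 8, writeback=8))
```

Even with these small placeholder latencies, the transfer term dominates for large blocks, which is one reason larger blocks raise the miss penalty.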

Cache Miss Rate

■ Determined by:

  • Program characteristics
    • Temporal locality
    • Spatial locality
  • Cache organization
    • Block size, associativity, number of sets

Improving Locality

■ Instruction text placement

  • Profile program, place unreferenced or rarely referenced paths “elsewhere”
  • Maximize temporal locality
  • Eliminate taken branches
  • Fall-through path has spatial locality

Cache Miss Rate Effects

■ Number of blocks (sets x associativity)

  • More is better: fewer conflicts, greater capacity

■ Associativity

  • Higher associativity reduces conflicts
  • Very little benefit beyond 8-way set-associative

■ Block size

  • Larger blocks reduce compulsory misses
    • Exploit spatial locality
      − Usually: miss rates improve until 64B-256B
      − At 512B or more, miss rates get worse
  • Larger blocks are less efficient: more capacity misses
  • Fewer placement choices: more conflict misses

Cache Miss Rate

■ Subtle tradeoffs between cache organization parameters

  • Large blocks reduce compulsory misses but increase miss penalty
    • #compulsory = (working set) / (block size)
    • #transfers = (block size) / (bus width)
  • Large blocks increase conflict misses
    • #blocks = (cache size) / (block size)
  • Associativity reduces conflict misses but increases access time

■ Can an associative cache ever have a higher miss rate than a direct-mapped cache of the same size?
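One way to explore whether an associative cache can ever miss more often than a direct-mapped cache of the same size is a small trace-driven experiment. The sketch below assumes LRU replacement, which the slides do not specify, and modulo set indexing. The trace is a hypothetical loop over five blocks run against a four-block cache, and `count_misses` is an illustrative name.

```python
from collections import OrderedDict


def count_misses(trace, num_blocks, ways):
    """Count misses for a trace of block addresses, with LRU within each set."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in trace:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # hit: now most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used
            lines[block] = True
    return misses


trace = [0, 1, 2, 3, 4] * 10                # cyclic sweep over 5 blocks
direct_mapped = count_misses(trace, num_blocks=4, ways=1)
fully_assoc = count_misses(trace, num_blocks=4, ways=4)
print(direct_mapped, fully_assoc)           # 23 misses vs 50 misses
```

For this trace the answer is yes. Under LRU, the fully associative cache always evicts the block that will be needed next, so all 50 references miss. In the direct-mapped cache only blocks 0 and 4 fight over one set, and blocks 1, 2, and 3 hit after their first reference.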

Cache Miss Rates: 3 C’s

[Figure: misses per instruction (%), broken into Conflict, Capacity, and Compulsory, for the configurations 8K1W, 8K4W, 16K1W, and 16K4W]

■ Vary size and associativity

  • Compulsory misses are constant
  • Capacity and conflict misses are reduced

Cache Miss Rates: 3 C’s

[Figure: misses per instruction (%), broken into Conflict, Capacity, and Compulsory, for the configurations 8K32B, 8K64B, 16K32B, and 16K64B]

■ Vary size and block size

  • Compulsory misses drop with increased block size
  • Capacity and conflict can increase with larger blocks
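The block-size relations given earlier in the slides (#compulsory = working set / block size, #transfers = block size / bus width, #blocks = cache size / block size) can be tabulated to see the tradeoff in numbers. The 64KB working set and 8-byte bus are assumed values for illustration only, and `block_size_tradeoffs` is a hypothetical name.

```python
def block_size_tradeoffs(working_set, cache_size, bus_width, block_size):
    """Evaluate the three block-size relations from the slides."""
    return {
        "compulsory_misses": working_set // block_size,
        "transfers_per_miss": block_size // bus_width,
        "blocks_in_cache": cache_size // block_size,
    }


KB = 1024
for block_size in (32, 64):
    row = block_size_tradeoffs(working_set=64 * KB, cache_size=8 * KB,
                               bus_width=8, block_size=block_size)
    print(f"8K{block_size}B", row)
```

Doubling the block size halves the compulsory misses, but it also doubles the transfers per miss and halves the number of blocks available to hold distinct lines. This matches the slide's note that capacity and conflict misses can rise with larger blocks.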