Cache Memory Structures & Organization: SRAM, Mapping, Policies, Performance, Slides of Electrical and Electronics Engineering

Kent State University (KSU) - Ashtabula Campus Electrical and Electronics Engineering

An in-depth analysis of various cache memory structures, including SRAM, direct-mapped, set-associative, and fully-associative caches. It covers placement and identification techniques, replacement policies, and write policies. The document also discusses the performance implications of cache organization parameters.

Typology: Slides

2019/2020

Uploaded on 06/15/2020

janeka 🇺🇸


Memory & Cache

Lec 14

Memory Hierarchy

Memory

- Just an “ocean of bits”

- Many technologies are available

Key issues

- Technology (how bits are stored)

- Placement (where bits are stored)

- Identification (finding the right bits)

- Replacement (finding space for new bits)

- Write policy (propagating changes to bits)

Must answer these regardless of memory type


Types of Memory

Type            Size           Speed      Cost/bit
On-chip SRAM    8KB – 6MB      < 10ns     $$$
Off-chip SRAM   1Mb – 16Mb     < 20ns     $$
DRAM            64MB – 1TB     < 100ns    $
Disk            40GB – 1PB     < 20ms     ~

Memory Hierarchy

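[Figure: hierarchy pyramid, from top to bottom: Registers, On-Chip SRAM, Off-Chip SRAM, DRAM, Disk. CAPACITY grows toward the bottom; SPEED and COST grow toward the top.]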

Caches: Automatic Management of Fast Storage

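[Figure: a single cache between the CPU and main memory, compared with a multi-level cache hierarchy between the CPU and main memory.]

- L1 cache: 16~32KB, 1~2 pclk latency
- Larger cache levels: ~256KB at ~10 pclk latency, ~4MB at ~50 pclk latency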

Why Memory Hierarchy?

Fast and small memories

- Enable quick access (fast cycle time)
- Enable lots of bandwidth (1+ L/S/I-fetch/cycle)

Slower larger memories

- Capture larger share of memory
- Still relatively fast

Slow huge memories

- Hold rarely-needed state
- Needed for correctness

Empirically observed

- Significant!
- Even small local storage (8KB) often satisfies >90% of references to multi-MB data set

All together:

- Provide appearance of large, fast memory with cost of cheap, slow memory

Four Burning Questions

These are:

- Placement
  - Where can a block of memory go?
- Identification
  - How do I find a block of memory?
- Replacement
  - How do I make space for new blocks?
- Write Policy
  - How do I propagate changes?

Consider these for caches

Usually SRAM

Placement

Memory Type     Placement                 Comments
Registers       Anywhere; Int, FP, SPR    Compiler/programmer manages
Cache (SRAM)    Fixed in H/W              Direct-mapped, set-associative, fully-associative
DRAM            Anywhere                  O/S manages
Disk            Anywhere                  O/S manages

HUH?

N-way Set-Associative

Set-associative

- Block can be in N locations
- Hash collisions: up to N blocks that map to the same set are still OK

Identification

- Still perform tag check
- However, only N in parallel

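[Figure: N-way set-associative SRAM cache. The 32-bit address is split into Tag | Index | Offset. The index (hash) selects one entry from each of the N tag arrays and N data block arrays, the N tags are compared (?=) against the address tag in parallel, and the offset selects the data out.]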

Cache Memory Structures

- "Indexed Memory" ("Direct Mapped"): a decoder selects one entry by index; i-bit index, 2^i blocks
- "Associative Memory" ("Fully Associative", "CAM"): entries are matched by key; no index, unlimited blocks
- "N-Way Set-Associative": a decoder selects a set by index, then tags are matched within the set; i-bit index, 2^i • N blocks

Placement and Identification

Consider a cache of: <BS = block size, S = # of sets, B = # of blocks>

- <64,64,64>: o=6, i=6, t=20: direct-mapped (S = B)
- <64,16,64>: o=6, i=4, t=22: 4-way set-associative (S = B / 4)
- <64,1,64>: o=6, i=0, t=26: fully associative (S = 1)

Total size = BS x B = BS x S x (B/S)

32-bit address, MSB to LSB: Tag | Index | Offset

Portion   Length                        Purpose
Offset    o = log2(block size)          Select word within block
Index     i = log2(number of sets)      Select set of blocks
Tag       t = 32 - o - i                ID block within set
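The field widths above can be checked with a small sketch. This is only an illustration, assuming power-of-two block sizes and set counts; the helper names `field_widths` and `split_address` are made up for this example and do not come from the slides.

```python
ADDRESS_BITS = 32  # the slides assume 32-bit addresses


def field_widths(block_size, num_sets):
    """Return (offset_bits, index_bits, tag_bits); inputs must be powers of two."""
    o = block_size.bit_length() - 1   # log2(block size)
    i = num_sets.bit_length() - 1     # log2(number of sets)
    t = ADDRESS_BITS - o - i          # whatever is left over is the tag
    return o, i, t


def split_address(addr, block_size, num_sets):
    """Split an address into its (tag, index, offset) fields."""
    o, i, _ = field_widths(block_size, num_sets)
    offset = addr & (block_size - 1)       # select word within block
    index = (addr >> o) & (num_sets - 1)   # select set of blocks
    tag = addr >> (o + i)                  # ID block within set
    return tag, index, offset


# The three <BS, S, B> configurations from the slide.
for bs, s, b in [(64, 64, 64), (64, 16, 64), (64, 1, 64)]:
    print((bs, s, b), field_widths(bs, s), "ways =", b // s)
```

With one set the index is zero bits wide, so every address maps to set 0 and the tag does all the identification work, which is the fully associative case.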

Cache Block Size

Each block (or cache line) has only one tag but can hold multiple "chunks" of data. Benefit?

- The entire cache block is transferred to and from memory all at once
  - Good for spatial locality: if you access address i, you will probably want i+1 as well (prefetching effect)
- Reduced tag storage overhead
  - In 32-bit addressing, a 1-MB direct-mapped cache: how many bits of tags?
  - 4-byte cache block ⇒ 256K blocks ⇒ ~384KB of tag
  - 128-byte cache block ⇒ 8K blocks ⇒ ~12KB of tag

Address fields, MSB to LSB: tag | block index (B bits) | block offset (o bits)

Block size = 2^o; direct-mapped cache size = 2^(B+o)
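The tag overhead figures quoted for the 1-MB cache can be reproduced with a short calculation. This is a sketch under stated assumptions: the cache is direct-mapped, sizes are powers of two, and only tag bits are counted (valid and dirty bits are ignored). The function name `tag_storage_bytes` is hypothetical.

```python
def tag_storage_bytes(cache_bytes, block_bytes, address_bits=32):
    """Total tag storage, in bytes, for a direct-mapped cache."""
    num_blocks = cache_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1   # o = log2(block size)
    index_bits = num_blocks.bit_length() - 1     # direct-mapped: one block per set
    tag_bits = address_bits - index_bits - offset_bits
    return num_blocks * tag_bits // 8


ONE_MB = 1 << 20
for block_bytes in (4, 128):
    kb = tag_storage_bytes(ONE_MB, block_bytes) / 1024
    print(f"{block_bytes}-byte blocks: {kb:.0f} KB of tag")
```

In both cases the tag is 12 bits wide, because the index and offset together always cover the 20 bits that address 1 MB; only the number of tags changes with the block size.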

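Fully Associative Cache

[Figure: the address is split into tag | blk.offset, with no index. An associative search compares the address tag against every stored tag, and a multiplexor selects the data from the matching block.]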

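N-Way Set Associative Cache

[Figure: the address is split into tag | index | BO (block offset). A decoder selects the set from the index, an associative search compares tags within the set, and a multiplexor selects the data from the matching block.]

Cache size = N x 2^(B+o)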

Here N=

2-Way Set Associative Cache

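[Figure: the address is split into tag | idx | b.o. Each way (bank) has its own decoder and tag comparator (=, producing a tag match signal), and a multiplexor selects the data from the matching way. The entries at one index across all ways form a set.]

Cache size = N x 2^(B+o)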

Here N=
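The lookup described for set-associative caches can be sketched in a few lines. This is a minimal model, not a hardware description: it stores tags only, checks the ways sequentially where hardware would compare them in parallel, and assumes LRU replacement, which the slides do not specify. The class name `SetAssociativeCache` is made up for this example.

```python
class SetAssociativeCache:
    """Tag-only model of an N-way set-associative cache (LRU is an assumption)."""

    def __init__(self, block_size, num_sets, ways):
        self.block_size = block_size
        self.num_sets = num_sets
        self.ways = ways
        # Each set holds up to `ways` tags, ordered least to most recently used.
        self.sets = [[] for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the block is then filled)."""
        block_number = addr // self.block_size   # drop the offset bits
        index = block_number % self.num_sets     # select the set
        tag = block_number // self.num_sets      # identify the block within the set
        tags = self.sets[index]
        if tag in tags:                          # tag check across the N ways
            tags.remove(tag)
            tags.append(tag)                     # mark as most recently used
            return True
        if len(tags) == self.ways:
            tags.pop(0)                          # evict the least recently used tag
        tags.append(tag)
        return False


# The <64,16,64> cache from the earlier slide: 64-byte blocks, 16 sets, 4 ways.
cache = SetAssociativeCache(block_size=64, num_sets=16, ways=4)
stride = 64 * 16                                 # addresses this far apart share a set
colliding = [k * stride for k in range(4)]
print([cache.access(a) for a in colliding])      # first touch: all misses
print([cache.access(a) for a in colliding])      # N collisions are still OK: all hits
```

A fifth block mapping to the same set would force an eviction, which is where the replacement question starts.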

Replacement

Cache has finite size

What do we do when it is full?

Analogy: desktop full?

Move books to bookshelf to make room

Same idea:

Move blocks to next level of cache

Write Policy – write through

Easiest policy: write-through

Every write propagates directly through hierarchy

- Write in L1, L2, memory, disk (?!?)

Why is this a bad idea?

- Very high bandwidth requirement
- Remember, large memories are slow

Popular in real systems only to the L2

- Every write updates L1 and L2
- Beyond L2, use write-back policy

Write Policy – write back

Most widely used: write-back

Maintain state of each line in a cache

Invalid – not present in the cache
Clean – present, but not written (unmodified)
Dirty – present and written (modified)

Store state in tag array, next to address tag

Mark dirty bit on a write

On eviction, check dirty bit

If set, write back dirty line to next level
Called a writeback or castout
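The line states and the dirty-bit check on eviction can be sketched as follows. This is a simplified model under several assumptions not stated in the slides: the cache is direct-mapped, writes allocate a line on a miss, and only tags and states are tracked (no data). The class name `WriteBackCache` is hypothetical.

```python
INVALID, CLEAN, DIRTY = "invalid", "clean", "dirty"


class WriteBackCache:
    """Direct-mapped, write-allocate cache that tracks tags and line state only."""

    def __init__(self, num_lines, block_size):
        self.num_lines = num_lines
        self.block_size = block_size
        self.tags = [None] * num_lines
        self.state = [INVALID] * num_lines   # state is kept next to the address tag
        self.writebacks = 0                  # lines written back to the next level

    def access(self, addr, is_write):
        """Return True on a hit; on a miss, evict the victim and fill the line."""
        block = addr // self.block_size
        index = block % self.num_lines
        tag = block // self.num_lines
        hit = self.state[index] != INVALID and self.tags[index] == tag
        if not hit:
            if self.state[index] == DIRTY:
                self.writebacks += 1         # writeback (castout) of the dirty victim
            self.tags[index] = tag
            self.state[index] = CLEAN        # freshly filled from the next level
        if is_write:
            self.state[index] = DIRTY        # mark dirty bit on a write
        return hit


cache = WriteBackCache(num_lines=4, block_size=64)
for _ in range(10):
    cache.access(0, is_write=True)           # ten writes to the same line
cache.access(4 * 64, is_write=False)         # conflicting read evicts the dirty line
print("writes to next level:", cache.writebacks)
```

Under write-through, the same ten writes would each propagate to the next level; here they collapse into a single writeback at eviction time.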

Write Policy

Complications of write-back policy

Stale copies lower in the hierarchy
Must always check higher level for dirty copies before accessing copy in a lower level

Not a big problem in uniprocessors

In multiprocessors: the cache coherence problem

I/O devices that use DMA (direct memory access) can cause problems even in uniprocessors

- Called coherent I/O
- Must check caches for dirty copies before reading main memory

Caches and Performance

Caches

- Enable design for common case: cache hit
  - Cycle time, pipeline organization
  - Recovery policy
- Uncommon case: cache miss
  - Fetch from next level
    - Apply recursively if multiple levels
  - What to do in the meantime?

What is performance impact?

Various optimizations are possible

Cache Hits and Performance

Cache hit latency determined by:

Cache organization

- Associativity
  - Parallel tag checks expensive, slow
  - Way select slow (fan-in, wires)
- Block size
  - Word select may be slow (fan-in, wires)
- Number of blocks (sets x associativity)
  - Wire delay across array
  - "Manhattan distance" = width + height
  - Word line delay: width
  - Bit line delay: height

Array design is an art form

- Detailed analog circuit/wire delay modeling

[Figure: SRAM array with word lines running across its width and bit lines running along its height.]

Cache Misses and Performance

Miss penalty: what to do on a miss?

- Detect miss: 1 or more cycles
- Find victim (replace line): 1 or more cycles
  - Write back if dirty
- Request line from next level: several cycles
- Transfer line from next level: several cycles
  - (block size) / (bus width)
- Fill line into data array, update tag array: 1+ cycles
- Resume execution

In practice: 6 cycles to 100s of cycles
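The steps above add up to a rough miss penalty estimate. The sketch below only illustrates the arithmetic: every default cycle count is a made-up placeholder rather than a figure from the slides, and the function name `miss_penalty_cycles` is hypothetical. The only relationship taken from the slides is that the transfer takes (block size) / (bus width) bus transfers.

```python
def miss_penalty_cycles(block_size, bus_width, detect=1, find_victim=1,
                        dirty_writeback=0, request=4,
                        cycles_per_transfer=1, fill=1):
    """Sum the per-step costs of a miss; all defaults are illustrative guesses."""
    transfers = -(-block_size // bus_width)   # ceiling of block size / bus width
    return (detect + find_victim + dirty_writeback + request
            + transfers * cycles_per_transfer + fill)


# Same 8-byte bus, two block sizes: the transfer term scales with the block.
for block_size in (64, 128):
    print(block_size, "byte block:", miss_penalty_cycles(block_size, bus_width=8), "cycles")
```

Doubling the block size doubles the transfer term while the other steps stay fixed, which is one reason larger blocks raise the miss penalty.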

Cache Miss Rate

Determined by:

Program characteristics
- Temporal locality
- Spatial locality
Cache organization
- Block size, associativity, number of sets

Improving Locality

Instruction text placement

- Profile program, place unreferenced or rarely referenced paths "elsewhere"
  - Maximize temporal locality
- Eliminate taken branches
  - Fall-through path has spatial locality

Cache Miss Rate Effects

Number of blocks (sets x associativity)

More is better: fewer conflicts, greater capacity

Associativity

Higher associativity reduces conflicts
Very little benefit beyond 8-way set-associative

Block size

- Larger blocks reduce compulsory misses
  - Exploit spatial locality
  - Usually: miss rates improve until 64B-256B
  - At 512B or more, miss rates get worse
- Larger blocks are less efficient: more capacity misses
- Fewer placement choices: more conflict misses

Cache Miss Rate

Subtle tradeoffs between cache organization parameters

Large blocks reduce compulsory misses but increase miss penalty
- #compulsory = (working set) / (block size)
- #transfers = (block size)/(bus width)
Large blocks increase conflict misses
- #blocks = (cache size) / (block size)
Associativity reduces conflict misses but increases access time

Can an associative cache ever have a higher miss rate than a direct-mapped cache of the same size?
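One way to explore this question is a small simulation. The sketch below assumes LRU replacement (the slides do not name a policy) and uses a hypothetical trace that cycles through five blocks in a cache that holds four; the function name `count_misses` is made up for this example. Under those assumptions the fully associative cache does worse.

```python
def count_misses(block_trace, num_sets, ways):
    """Count misses for a trace of block numbers, assuming LRU within each set."""
    sets = [[] for _ in range(num_sets)]   # each set: blocks ordered LRU to MRU
    misses = 0
    for block in block_trace:
        lines = sets[block % num_sets]
        if block in lines:
            lines.remove(block)            # hit: will be re-appended as MRU
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)               # evict the least recently used block
        lines.append(block)
    return misses


# Cycle through 5 blocks, 10 times, with room for only 4 blocks in the cache.
trace = [0, 1, 2, 3, 4] * 10
direct_mapped = count_misses(trace, num_sets=4, ways=1)
fully_associative = count_misses(trace, num_sets=1, ways=4)
print("direct-mapped misses:    ", direct_mapped)
print("fully associative misses:", fully_associative)
```

With LRU, the fully associative cache always evicts the block that is needed next, so all 50 accesses miss. In the direct-mapped cache only blocks 0 and 4 conflict, giving 23 misses. The result depends on both the access pattern and the replacement policy.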

Cache Miss Rates: 3 C’s

[Figure: stacked bars of misses per instruction (%), split into conflict, capacity, and compulsory misses, for four caches: 8K1W, 8K4W, 16K1W, 16K4W.]

Vary size and associativity

Compulsory misses are constant
Capacity and conflict misses are reduced

Cache Miss Rates: 3 C’s

[Figure: stacked bars of misses per instruction (%), split into conflict, capacity, and compulsory misses, for four caches: 8K32B, 8K64B, 16K32B, 16K64B.]

Vary size and block size

Compulsory misses drop with increased block size
Capacity and conflict can increase with larger blocks
