Memory Hierarchy and Cache Systems: Understanding Memory Organization and Performance - Pr, Study notes of Electrical and Electronics Engineering

An in-depth exploration of memory hierarchy and cache systems, covering topics such as memory organization, cache associativity, placement and identification, cache performance, and prefetching. Learn about the different types of caches, their organization, and their role in improving system performance.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009


Memory & Cache

Lec 16

Memory Hierarchy

 Why?

 How?

 Issues

  • Cache organization
  • Write policy
  • Performance
    • Analysis on hit
    • Analysis on miss

Cache Organization

 Cache-line (block) size

  • How does it affect your cache organization?

 How is the position of a block in the cache calculated based on the address?

 Associativity

  • Direct mapped
  • Fully associative
  • Set associative

Cache Associativity

[Figure: an indexed memory (index → decoder → tag/data array) beside an associative memory (key compared against the stored tags)]

  • “Indexed Memory” / “Direct Mapped”: i-bit index, 2^i blocks
  • “Associative Memory” / “Fully Associative” / “CAM”: no index, unlimited blocks
  • “N-Way Set-Associative”: i-bit index, 2^i • N blocks
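As a rough illustration of how a block's position follows from the address, the sketch below splits an address into tag, index and offset fields. The function names and the example field widths are assumptions made for illustration; they are not taken from the slides.

```python
def split_address(addr: int, index_bits: int, offset_bits: int):
    """Split an address into (tag, index, offset) fields.

    offset_bits selects the byte within a cache line, index_bits selects
    the set, and the remaining high-order bits form the tag.
    With index_bits == 0 (fully associative) the index is always 0.
    """
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


def total_blocks(index_bits: int, ways: int) -> int:
    """Number of blocks in an N-way set-associative cache: 2^i * N."""
    return (1 << index_bits) * ways


# Hypothetical cache: 64-byte lines (6 offset bits), 128 sets (7 index bits).
tag, index, offset = split_address(0x12345678, index_bits=7, offset_bits=6)
print(tag, index, offset)
```

A direct-mapped cache is the case `ways == 1`; a fully associative cache is the case `index_bits == 0`, where the whole block address acts as the tag.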

Cache performance

 Average memory access time = hit time + miss rate × miss penalty

 To improve performance, reduce:

  • hit time
  • miss rate
  • miss penalty

 Primary cache parameters:

  • Total cache capacity
  • Cache line size
  • Associativity

 3 C’s of cache misses: compulsory, capacity, conflict
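The average memory access time formula above can be tried out with a few numbers. This is a minimal sketch; the cycle counts and miss rates are made-up values for illustration, not measurements from the lecture.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Single-level example (assumed numbers): 1-cycle hit, 5% misses, 100-cycle penalty.
single_level = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)

# Two-level example: the L1 miss penalty is itself the AMAT of the L2 cache.
l2 = amat(hit_time=10, miss_rate=0.2, miss_penalty=100)
two_level = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2)

print(single_level, l2, two_level)
```

With these assumed numbers, adding an L2 cuts the average access time from about 6 cycles to about 2.5 cycles, because the L1 miss penalty drops from 100 cycles to about 30.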

A Typical Memory Hierarchy

[Figure: CPU → RF → L1 instruction cache and L1 data cache → unified L2 cache → memory banks]

  • Multiported register file (part of CPU)
  • Split instruction & data primary caches (on-chip SRAM)
  • Large unified secondary cache (on-chip SRAM)
  • Multiple interleaved memory banks (off-chip DRAM)

Presence of L2 influences L1 design

 Use smaller L1 if there is also L2

  • Trade increased L1 miss rate for reduced L1 hit time and reduced L1 miss penalty
  • Reduces average access energy

 Use simpler write-through L1 cache with on-chip L2

  • Write-back L2 cache absorbs write traffic, doesn’t go off-chip
  • At most one L1 miss request per L1 access (no dirty victim write back) simplifies pipeline control
  • Simplifies coherence issues
  • Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when parity error detected on L1 read)

Inclusion Policy

 Inclusive multilevel cache:

  • Inner cache holds copies of data in outer cache
  • External access need only check outer cache
  • Most common case

 Exclusive multilevel caches:

  • Inner cache may hold data not in outer cache
  • Swap lines between inner/outer caches on miss
  • Used in AMD Athlon with 64KB primary and 256KB secondary cache

Victim Caches (Jouppi 1990)

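[Figure: CPU and RF → L1 data cache with a victim cache (VC) beside it → unified L2 cache. Arrows: evicted data from L1 into the VC; hit data from the VC (miss in L1) back to L1; evicted data from the VC to "where?"]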

Victim cache is a small associative back up cache, added to a direct mapped cache, which holds recently evicted lines

  • First look up in direct mapped cache
  • If miss, look in victim cache
  • If hit in victim cache, swap hit line with line now evicted from L1
  • If miss in victim cache, L1 victim -> VC, VC victim -> ?

Fast hit time of direct mapped but with reduced conflict misses (HP 7200)

Figure labels: L1 data cache, direct mapped; victim cache, fully associative, 4 blocks
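The lookup steps above can be sketched as a small simulation. The class name, the FIFO replacement in the victim cache, and the choice to simply drop the line evicted from the victim cache are all assumptions for illustration; the slides leave the fate of the VC victim as an open question.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped L1 plus a small fully associative victim cache (VC)."""

    def __init__(self, num_sets: int, victim_entries: int = 4):
        self.num_sets = num_sets
        self.l1 = [None] * num_sets      # one block address per set
        self.vc = OrderedDict()          # block address -> True, oldest first
        self.victim_entries = victim_entries

    def access(self, block_addr: int) -> str:
        idx = block_addr % self.num_sets
        if self.l1[idx] == block_addr:   # first look in the direct-mapped cache
            return "l1_hit"
        evicted = self.l1[idx]           # line displaced from L1 by this access
        if block_addr in self.vc:        # then look in the victim cache
            del self.vc[block_addr]      # the hit line moves back into L1 (swap)
            result = "vc_hit"
        else:
            result = "miss"              # block comes from the next level
        self.l1[idx] = block_addr
        if evicted is not None:
            self.vc[evicted] = True      # L1 victim -> VC
            if len(self.vc) > self.victim_entries:
                # Assumption: the oldest VC line is simply dropped here.
                self.vc.popitem(last=False)
        return result


cache = DirectMappedWithVictim(num_sets=4)
# Blocks 0 and 4 conflict in set 0 of the direct-mapped cache.
print([cache.access(b) for b in (0, 4, 0, 4)])
```

Blocks 0 and 4 map to the same set, so a plain direct-mapped cache would miss on every access in this pattern; with the victim cache, the third and fourth accesses are served from the VC instead.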

Way Predicting Caches

(MIPS R10000 off-chip L2 cache)

 Use processor address to index into way prediction table
 Look in predicted way at given index, then:

  • HIT: return copy of data from cache
  • MISS: look in other way
    • SLOW HIT: change entry in prediction table
    • MISS: read block of data from next level of cache
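The flow above might be modeled roughly as follows for a 2-way cache. Indexing the prediction table by set, and filling the non-predicted way on a miss, are simplifying assumptions of this sketch; the class and method names are hypothetical.

```python
class WayPredicted2Way:
    """2-way set-associative cache that probes one predicted way first."""

    def __init__(self, num_sets: int):
        self.num_sets = num_sets
        self.ways = [[None, None] for _ in range(num_sets)]
        self.predicted = [0] * num_sets  # way prediction table (assumed: one entry per set)

    def access(self, block_addr: int) -> str:
        idx = block_addr % self.num_sets
        first = self.predicted[idx]
        if self.ways[idx][first] == block_addr:
            return "hit"                 # fast hit in the predicted way
        other = 1 - first
        if self.ways[idx][other] == block_addr:
            self.predicted[idx] = other  # change entry in prediction table
            return "slow_hit"
        # Miss in both ways: fetch from the next level.
        # Assumption: fill the non-predicted way and predict it next time.
        self.ways[idx][other] = block_addr
        self.predicted[idx] = other
        return "miss"


cache = WayPredicted2Way(num_sets=2)
print([cache.access(b) for b in (0, 0, 2, 0, 0)])
```

A correct prediction costs about as much as a direct-mapped lookup, while a slow hit pays for the second probe but still avoids going to the next level.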

Prefetching

 Speculate on future instruction and data accesses and fetch them into cache(s)

  • Instruction accesses easier to predict than data accesses

 Varieties of prefetching

  • Hardware prefetching
  • Software prefetching
  • Mixed schemes

 What types of misses does prefetching affect?

Issues in Prefetching

 Usefulness – should produce hits

 Cache and bandwidth pollution

 Timeliness – not late and not too early

[Figure: CPU → RF → L1 instruction and L1 data caches → unified L2 cache, with an arrow showing prefetched data being brought into the caches]

“What” is Computer Architecture?

 Coordination of many levels of abstraction
 Under a rapidly changing set of forces
 Design, Measurement, and Evaluation

[Figure: layers of abstraction, top to bottom: Application; Operating System; Compiler / Firmware; Instruction Set Architecture (Instr. Set Proc., I/O system); Datapath & Control; Digital Design; Circuit Design; Layout]

Memory Management

 From early absolute addressing schemes, to modern virtual memory systems with support for virtual machine monitors

 Can separate into orthogonal functions:

  • Translation
    • mapping of virtual address to physical address
  • Protection
    • permission to access word in memory
  • Virtual memory
    • transparent extension of memory space using slower disk storage

 Support for above functions: a common page-based system

Absolute Addresses

 Only one program ran at a time

  • unrestricted access to entire machine (RAM + I/O devices)

 Addresses in a program depended upon where the program was to be loaded in memory

 But it was more convenient for programmers to write location-independent subroutines

EDSAC, early 50’s

How could location independence be achieved?

 Linker and/or loader:

  • modify addresses of subroutines and callers when building a program memory image

Dynamic Address Translation

Motivation

In the early machines, I/O operations were slow and each word transferred involved the CPU.

Higher throughput if CPU and I/O of 2 or more programs were overlapped. How? ⇒ multiprogramming

Location-independent programs

Programming and storage management ease ⇒ need for a base register

Protection

Independent programs should not affect each other inadvertently ⇒ need for a bound register

[Figure: two programs occupying separate regions of physical memory]
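The base and bound registers motivated above can be sketched as a single check-and-add on every access. The function name, the exception type and the register values are assumptions for illustration.

```python
class ProtectionFault(Exception):
    """Raised when a program addresses memory outside its own region."""


def translate_base_bound(logical_addr: int, base: int, bound: int) -> int:
    """Relocate a program-relative address using base and bound registers.

    bound is the size of the program's region; base is where it was loaded.
    """
    if not 0 <= logical_addr < bound:
        raise ProtectionFault(f"address {logical_addr:#x} outside bound {bound:#x}")
    return base + logical_addr


# Hypothetical program loaded at 0x4000 with a 0x1000-byte region.
print(hex(translate_base_bound(0x64, base=0x4000, bound=0x1000)))
```

Moving a program only requires changing its base register, which is what makes the program location independent; the bound check is what keeps independent programs from touching each other's memory.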

Memory Fragmentation

As users come and go, the storage is “fragmented”. Therefore, at some stage programs have to be moved around to compact the storage.

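[Figure: three snapshots of physical memory, each with OS space followed by regions of 16K, 24K, 24K (later split into 16K + 8K), 32K and 24K:
  1. Initially: users 1, 2 and 3 are resident and the remaining regions are free
  2. Users 4 & 5 arrive: user 4 takes 16K of a free 24K region, leaving an 8K hole
  3. Users 2 & 5 leave: the free space is scattered across separate holes]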

Paged Memory Systems

 Processor-generated address can be interpreted as a pair <page number, offset>

 A page table contains the physical address of the base of each page

 Page tables make it possible to store the pages of a program non-contiguously.

[Figure: pages 0 to 3 of a user's address space mapped through that user's page table to non-contiguous locations in physical memory]
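A paged lookup of the <page number, offset> pair can be sketched in a few lines. The 4-KB page size and the contents of the example page table are assumptions for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def translate_paged(vaddr: int, page_table: list) -> int:
    """Translate a virtual address using a table of physical page base addresses."""
    page_number, offset = divmod(vaddr, PAGE_SIZE)
    return page_table[page_number] + offset


# Hypothetical page table: pages 0..3 live at non-contiguous physical bases.
page_table = [0x8000, 0x3000, 0xA000, 0x1000]
print(hex(translate_paged(0x1234, page_table)))
```

Consecutive virtual pages 0 and 1 land at physical bases 0x8000 and 0x3000 here, showing that a program's pages need not be stored contiguously.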

Private Address Space per User

  • Each user has a page table
  • Page table contains an entry for each user page

[Figure: the virtual addresses (VA) of User 1, User 2 and User 3 each pass through that user's own page table into a shared physical memory, which also holds OS pages and free pages]

Where Should Page Tables Reside?

 Space required by the page tables (PT) is proportional to the address space, number of users, ...

⇒ Space requirement is large
⇒ Too expensive to keep in registers

 Idea: Keep PTs in the main memory

  • needs one reference to retrieve the page base address and another to access the data word

⇒ doubles the number of memory references!

Manual Overlays

Ferranti Mercury, 1956: 40k bits main (Central Store), 640k bits drum

 Assume an instruction can address all the storage on the drum

 Method 1: programmer keeps track of addresses in the main memory and initiates an I/O transfer when required

 Method 2: automatic initiation of I/O transfers by software address translation (Brooker's interpretive coding, 1960)

Method 1: Difficult, error prone
Method 2: Inefficient

Not just an ancient black art: e.g., the IBM Cell microprocessor's explicitly managed local store has the same issues

Demand Paging in Atlas (1962)

[Figure: Central Memory = Primary (32 pages, 512 words/page) backed by Secondary (Drum, 32x6 pages)]

User sees 32 x 6 x 512 words of storage

“A page from secondary storage is brought into the primary storage whenever it is (implicitly) demanded by the processor.” Tom Kilburn

Primary memory as a cache for secondary memory

Hardware Organization of Atlas

[Figure: effective address → initial address decode → one of:
  • 16 ROM pages, 0.4 ~1 μsec
  • 2 subsidiary pages, 1.4 μsec
  • Main, 32 pages, 1.4 μsec
  • Drum (4), 192 pages
  • 8 tape decks, 88 sec/word
Page frames 0 to 31 each have a PAR holding <effective PN, status>; system code and system data are not swapped]

48-bit words, 512-word pages

1 Page Address Register (PAR) per page frame

Compare the effective page address against all 32 PARs

  • match ⇒ normal access
  • no match ⇒ page fault; save the state of the partially executed instruction

Atlas Demand Paging Scheme

 On a page fault:

  • Input transfer into a free page is initiated
  • The Page Address Register (PAR) is updated
  • If no free page is left, a page is selected to be replaced (based on usage)
  • The replaced page is written on the drum
    • to minimize drum latency effect, the first empty page on the drum was selected
  • The page table is updated to point to the new location of the page on the drum

Linear Page Table

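[Figure: virtual address = VPN | Offset. The PT Base Register plus the VPN selects an entry in the page table; each entry holds either a PPN (page in memory) or a DPN (page on disk). The PPN plus the offset selects the data word within a data page.]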

 Page Table Entry (PTE) contains:

  • A bit to indicate if a page exists
  • PPN (physical page number) for a memory-resident page
  • DPN (disk page number) for a page on the disk
  • Status bits for protection and usage

 OS sets the PT Base Register whenever active user process changes

Size of Linear Page Table

With 32-bit addresses, 4-KB pages & 4-byte PTEs:

⇒ 2^20 PTEs, i.e., 4 MB page table per user

⇒ 4 GB of swap needed to back up full virtual address space

Larger pages?

  • Internal fragmentation (Not all memory in a page is used)
  • Larger page fault penalty (more time to read from disk)

What about 64-bit virtual address space???

  • Even 1MB pages would require 2^44 8-byte PTEs (2^47 bytes = 128 TiB!)
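The page table sizes on this slide follow from a short calculation, sketched here. The function name is an assumption, and page sizes are assumed to be powers of two.

```python
def linear_page_table_bytes(va_bits: int, page_bytes: int, pte_bytes: int) -> int:
    """Size of a linear page table: one PTE per virtual page."""
    offset_bits = page_bytes.bit_length() - 1   # log2 of a power-of-two page size
    num_ptes = 1 << (va_bits - offset_bits)
    return num_ptes * pte_bytes


# 32-bit addresses, 4-KB pages, 4-byte PTEs: 2^20 PTEs.
small = linear_page_table_bytes(32, 4 * 1024, 4)
# 64-bit addresses, 1-MB pages, 8-byte PTEs: 2^44 PTEs.
large = linear_page_table_bytes(64, 1024 * 1024, 8)

print(small // 2**20, "MiB")
print(large // 2**40, "TiB")
```

The 32-bit case gives 4 MiB per user, while the 64-bit case evaluates to 2^47 bytes (128 TiB), which is why a flat table is impractical for large address spaces.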

Hierarchical Page Table

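[Figure: a processor register holds the root of the current page table. p1 indexes the Level 1 page table, whose entry points to a Level 2 page table; p2 indexes that table, whose entry points to a data page; offset selects the word. An entry may refer to a page in primary memory, a page in secondary memory, or be the PTE of a nonexistent page.]

Virtual address fields:

  bits 31-22   p1       10-bit L1 index
  bits 21-12   p2       10-bit L2 index
  bits 11-0    offset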
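A two-level walk over the p1 / p2 / offset fields might look like the sketch below. Using Python dictionaries as sparse tables, mapping p2 directly to a physical page number, and the `PageFault` exception are all assumptions for illustration; pages held in secondary memory are not modeled.

```python
class PageFault(Exception):
    """Raised when the walk reaches the PTE of a nonexistent page."""


def split_va(va: int):
    """Split a 32-bit virtual address into (p1, p2, offset) = (10, 10, 12) bits."""
    p1 = (va >> 22) & 0x3FF
    p2 = (va >> 12) & 0x3FF
    offset = va & 0xFFF
    return p1, p2, offset


def walk(root: dict, va: int) -> int:
    """Translate va through a Level 1 table of Level 2 tables."""
    p1, p2, offset = split_va(va)
    level2 = root.get(p1)          # Level 1 entry: which Level 2 table
    if level2 is None:
        raise PageFault(hex(va))
    ppn = level2.get(p2)           # Level 2 entry: physical page number
    if ppn is None:
        raise PageFault(hex(va))
    return (ppn << 12) | offset


# Hypothetical mapping: only one Level 2 table with one valid entry exists.
root = {1: {3: 0x77}}
print(hex(walk(root, 0x00403004)))
```

Only Level 2 tables that contain at least one mapped page need to exist, which is how the hierarchy saves space compared with a single linear table.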

Address Translation & Protection

  • Every instruction and data access needs address translation and protection checks

A good VM design needs to be fast (~ one cycle) and space efficient

[Figure: virtual address = Virtual Page No. (VPN) | offset → address translation → physical address = Physical Page No. (PPN) | offset. In parallel, a protection check uses Kernel/User mode and Read/Write to decide whether to raise an exception.]