Memory Hierarchy and Cache Systems: Understanding Memory Organization and Performance - Pr, Study notes of Electrical and Electronics Engineering

University of Illinois - Chicago Electrical and Electronics Engineering

Prof. Wenjing Rao

An in-depth exploration of memory hierarchy and cache systems, covering topics such as memory organization, cache associativity, placement and identification, cache performance, and prefetching. Learn about the different types of caches, their organization, and their role in improving system performance.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-4bh-1 🇺🇸


Memory & Cache

Lec 16

Memory Hierarchy

Why?

How?

Issues

- Cache organization

- Write policy

- Performance

• Analysis on hit

• Analysis on miss

Cache Organization

Cache-line (block) size

- How does it affect your cache organization?
- How is the position of a block in the cache calculated based on the address?

Associativity

- Direct mapped
- Fully associative
- Set associative

Cache Associativity

[Figure: index and key inputs, a decoder, and tag/data arrays for each organization]

- “Indexed Memory” / “Direct Mapped”: i-bit index, 2^i blocks
- “Associative Memory” / “Fully Associative” / “CAM”: no index, unlimited blocks
- “N-Way Set-Associative”: i-bit index, 2^i • N blocks
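As a rough illustration of how line size and associativity determine where a block can be placed, the sketch below splits a byte address into tag, set index, and byte offset. The class name `CacheGeometry`, its fields, and the sample sizes are assumptions made for this example; they do not come from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical cache description used only for this illustration."""
    capacity_bytes: int  # total data capacity
    line_bytes: int      # cache-line (block) size
    ways: int            # N; 1 means direct mapped

    @property
    def num_sets(self) -> int:
        # 2^i sets of N blocks each: capacity = sets * ways * line size
        return self.capacity_bytes // (self.line_bytes * self.ways)

    def split(self, addr: int) -> tuple[int, int, int]:
        """Return (tag, set index, byte offset) for a byte address."""
        offset = addr % self.line_bytes
        block = addr // self.line_bytes
        return block // self.num_sets, block % self.num_sets, offset


# Same 32 KiB capacity and 64-byte lines, three organizations.
direct = CacheGeometry(32 * 1024, 64, ways=1)
four_way = CacheGeometry(32 * 1024, 64, ways=4)
fully = CacheGeometry(32 * 1024, 64, ways=512)  # one set: no index bits

for name, geom in [("direct", direct), ("4-way", four_way), ("fully", fully)]:
    print(name, geom.num_sets, geom.split(0x12345))
```

With a single set the index is always 0 and the whole block number becomes the tag, which corresponds to the "no index" fully associative case.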

Cache performance

Average memory access time = hit time + miss rate * miss penalty

To improve performance, reduce:

- hit time
- miss rate
- miss penalty

Primary cache parameters:

- Total cache capacity
- Cache line size
- Associativity

3 C’s of cache misses: compulsory, capacity, conflict
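The average memory access time formula can be evaluated directly. The sketch below is a minimal example; the cycle counts and miss rates are made-up numbers chosen for illustration, not measurements.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical cache: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
baseline = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)

# Reducing any one of the three terms lowers the average.
lower_miss_rate = amat(hit_time=1, miss_rate=0.02, miss_penalty=100)
lower_penalty = amat(hit_time=1, miss_rate=0.05, miss_penalty=40)

print(baseline, lower_miss_rate, lower_penalty)
```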

A Typical Memory Hierarchy

[Figure: CPU with register file (RF), L1 instruction cache and L1 data cache, unified L2 cache, memory]

- Multiported register file (part of CPU)
- Split instruction & data primary caches (on-chip SRAM)
- Large unified secondary cache (on-chip SRAM)
- Multiple interleaved memory banks (off-chip DRAM)

Presence of L2 influences L1 design

Use smaller L1 if there is also L2

- Trade increased L1 miss rate for reduced L1 hit time and reduced L1 miss penalty
- Reduces average access energy

Use simpler write-through L1 cache with on-chip L2

- Write-back L2 cache absorbs write traffic, doesn’t go off-chip
- At most one L1 miss request per L1 access (no dirty victim write back) simplifies pipeline control
- Simplifies coherence issues
- Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when parity error detected on L1 read)
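The trade of a higher L1 miss rate for a shorter L1 hit time and a smaller L1 miss penalty can be checked numerically by treating the L2 access time as the L1 miss penalty. This is a sketch under assumed numbers; every latency and miss rate below is hypothetical.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    return hit_time + miss_rate * miss_penalty


def two_level_amat(l1_hit: float, l1_miss_rate: float,
                   l2_hit: float, l2_local_miss_rate: float,
                   memory_penalty: float) -> float:
    """The L1 miss penalty is the average time to get the line from L2."""
    l1_miss_penalty = amat(l2_hit, l2_local_miss_rate, memory_penalty)
    return amat(l1_hit, l1_miss_rate, l1_miss_penalty)


# Hypothetical design A: larger, slower L1 and no L2.
large_l1_only = amat(hit_time=2, miss_rate=0.03, miss_penalty=100)

# Hypothetical design B: smaller, faster L1 (more misses) backed by an L2.
small_l1_with_l2 = two_level_amat(l1_hit=1, l1_miss_rate=0.05,
                                  l2_hit=10, l2_local_miss_rate=0.2,
                                  memory_penalty=100)

print(large_l1_only, small_l1_with_l2)
```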

Inclusion Policy

Inclusive multilevel cache:

Inner cache holds copies of data in outer cache
External access need only check outer cache
Most common case

Exclusive multilevel caches:

Inner cache may hold data not in outer cache
Swap lines between inner/outer caches on miss
Used in AMD Athlon with 64KB primary and 256KB secondary cache

Victim Caches (Jouppi 1990)

[Figure: CPU and RF, direct-mapped L1 data cache, fully associative 4-block victim cache, unified L2 cache. Arrows: evicted data from L1 goes to the victim cache; hit data from the victim cache (miss in L1) goes back to L1; evicted data from the victim cache goes "where?"]

Victim cache is a small associative backup cache, added to a direct-mapped cache, which holds recently evicted lines

- First look up in direct-mapped cache
- If miss, look in victim cache
- If hit in victim cache, swap hit line with line now evicted from L1
- If miss in victim cache, L1 victim -> VC, VC victim -> ?

Fast hit time of direct mapped but with reduced conflict misses

(HP 7200)
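The lookup steps for a victim cache can be sketched as a small simulation that tracks block numbers only. The class name, the LRU ordering inside the victim cache, and the choice to simply drop the victim cache's own victim are assumptions for illustration; the slide leaves that last destination open.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped L1 plus a small fully associative victim cache (VC)."""

    def __init__(self, num_lines: int, victim_entries: int = 4):
        self.num_lines = num_lines
        self.l1 = [None] * num_lines       # block number held by each line
        self.victim = OrderedDict()        # block number -> None, oldest first
        self.victim_entries = victim_entries

    def access(self, block: int) -> str:
        idx = block % self.num_lines
        if self.l1[idx] == block:
            return "l1_hit"
        evicted = self.l1[idx]
        if block in self.victim:
            del self.victim[block]         # hit line leaves VC, moves to L1
            outcome = "victim_hit"
        else:
            outcome = "miss"               # would be fetched from next level
        self.l1[idx] = block
        if evicted is not None:
            self.victim[evicted] = None    # L1 victim -> VC
            if len(self.victim) > self.victim_entries:
                self.victim.popitem(last=False)  # assumption: VC victim dropped
        return outcome


# Blocks 0 and 4 conflict in a 4-line direct-mapped cache.
cache = DirectMappedWithVictim(num_lines=4)
outcomes = [cache.access(b) for b in [0, 4, 0, 4, 0]]
print(outcomes)
```

In this trace the two conflicting blocks keep swapping between L1 and the victim cache instead of going to the next level each time.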

Way Predicting Caches

(MIPS R10000 off-chip L2 cache)

Use processor address to index into way prediction table

Look in predicted way at given index, then:

- HIT: return copy of data from cache
- MISS: look in other way
  - SLOW HIT: change entry in prediction table
  - MISS: read block of data from next level of cache
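A way-predicting lookup can be sketched for a 2-way set-associative cache as follows. Indexing the prediction table by set and filling the non-predicted way on a miss are assumptions made to keep the example short; they are not claimed to match the MIPS R10000.

```python
class WayPredictedCache:
    """2-way set-associative cache with one predicted way per set."""

    def __init__(self, num_sets: int):
        self.num_sets = num_sets
        self.tags = [[None, None] for _ in range(num_sets)]
        self.predicted_way = [0] * num_sets

    def access(self, block: int) -> str:
        idx, tag = block % self.num_sets, block // self.num_sets
        way = self.predicted_way[idx]
        if self.tags[idx][way] == tag:
            return "hit"                      # predicted way was right
        other = 1 - way
        if self.tags[idx][other] == tag:
            self.predicted_way[idx] = other   # change entry in prediction table
            return "slow_hit"
        # Miss in both ways: read block from the next level.
        # Assumption: fill the non-predicted way and predict it next time.
        self.tags[idx][other] = tag
        self.predicted_way[idx] = other
        return "miss"


wp = WayPredictedCache(num_sets=4)
results = [wp.access(b) for b in [0, 0, 4, 0, 0, 4]]
print(results)
```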

Prefetching

Speculate on future instruction and data accesses and fetch them into cache(s)

- Instruction accesses easier to predict than data accesses

Varieties of prefetching

Hardware prefetching
Software prefetching
Mixed schemes

What types of misses does prefetching affect?

Issues in Prefetching

Usefulness – should produce hits

Cache and bandwidth pollution

Timeliness – not late and not too early

[Figure: CPU and RF, L1 instruction and L1 data caches, unified L2 cache, with an arrow labeled "Prefetched data"]
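Usefulness and pollution can be made concrete with a toy next-block prefetcher. The sketch assumes an unbounded cache (so only first-touch misses occur) and a prefetch that completes instantly, which sidesteps timeliness; the function name and traces are hypothetical.

```python
def simulate(trace, prefetch_next: bool):
    """Return (misses, prefetched blocks that were never used)."""
    cache = set()
    prefetched_unused = set()
    misses = 0
    for block in trace:
        if block in cache:
            prefetched_unused.discard(block)  # a prefetch turned into a hit
            continue
        misses += 1
        cache.add(block)
        if prefetch_next and block + 1 not in cache:
            cache.add(block + 1)              # fetch the next block as well
            prefetched_unused.add(block + 1)
    return misses, len(prefetched_unused)


sequential = list(range(8))
strided = [0, 2, 4, 6]

print(simulate(sequential, prefetch_next=False))
print(simulate(sequential, prefetch_next=True))
print(simulate(strided, prefetch_next=True))
```

On the sequential trace every prefetch is used; on the strided trace none is, so the prefetches only consume cache space and bandwidth.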

“What” is Computer Architecture?

[Figure: levels of abstraction. Application, Operating System, Compiler, Firmware, Instruction Set Architecture (Instr. Set Proc., I/O system), Datapath & Control, Digital Design, Circuit Design, Layout]

- Coordination of many levels of abstraction
- Under a rapidly changing set of forces
- Design, Measurement, and Evaluation

Memory Management

From early absolute addressing schemes, to modern virtual memory systems with support for virtual machine monitors

Can separate into orthogonal functions:

Translation
- mapping of virtual address to physical address
Protection
- permission to access word in memory
Virtual memory
- transparent extension of memory space using slower disk storage

Support for above functions: a common page-based system

Absolute Addresses

Only one program ran at a time

- unrestricted access to entire machine (RAM + I/O devices)

Addresses in a program depended upon where the program was to be loaded in memory

But it was more convenient for programmers to write location-independent subroutines

EDSAC, early 50’s

How could location independence be achieved?

Linker and/or loader:

- modify addresses of subroutines and callers when building a program memory image

Dynamic Address Translation

Motivation

In the early machines, I/O operations were slow and each word transferred involved the CPU

Higher throughput if CPU and I/O of 2 or more programs were overlapped. How? ⇒ multiprogramming

Location-independent programs

Programming and storage management ease ⇒ need for a base register

Protection

Independent programs should not affect each other inadvertently ⇒ need for a bound register

[Figure: programs (prog) placed in Physical Memory]
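The roles of the base and bound registers can be sketched in a few lines. The function and exception names are hypothetical, and the bound is taken here to be the size of the program's region, which is one common convention rather than something the slide specifies.

```python
class BoundViolation(Exception):
    """Raised when a program addresses memory outside its own region."""


def translate(logical_addr: int, base: int, bound: int) -> int:
    # Protection: the address must fall inside the program's region.
    if not 0 <= logical_addr < bound:
        raise BoundViolation(f"address {logical_addr:#x} outside bound {bound:#x}")
    # Location independence: the same program runs at any base.
    return base + logical_addr


# The same logical address lands in different places for two programs.
print(hex(translate(0x10, base=0x4000, bound=0x100)))
print(hex(translate(0x10, base=0x9000, bound=0x100)))
```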

Memory Fragmentation

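As users come and go, the storage is “fragmented”. Therefore, at some stage programs have to be moved around to compact the storage.

Memory layout, top to bottom, at the three stages shown in the figure:

Initially:           OS Space | user 1 (16K) | user 2 (24K) | free (24K) | user 3 (32K) | free (24K)
Users 4 & 5 arrive:  OS Space | user 1 (16K) | user 2 (24K) | user 4 (16K) | free (8K) | user 3 (32K) | user 5 (24K)
Users 2 & 5 leave:   OS Space | user 1 (16K) | free (24K) | user 4 (16K) | free (8K) | user 3 (32K) | free (24K)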
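Why compaction eventually becomes necessary can be shown with a small calculation. The region list below is one reading of the figure's final stage (which regions are free is an assumption), and the helper names are hypothetical.

```python
# (owner, size in KB); owner None marks a free hole. Assumed final layout.
regions = [("user 1", 16), (None, 24), ("user 4", 16),
           (None, 8), ("user 3", 32), (None, 24)]


def total_free(layout) -> int:
    return sum(size for owner, size in layout if owner is None)


def largest_hole(layout) -> int:
    return max((size for owner, size in layout if owner is None), default=0)


def compact(layout):
    """Slide all programs together and merge the holes into one."""
    used = [(owner, size) for owner, size in layout if owner is not None]
    free = total_free(layout)
    return used + ([(None, free)] if free else [])


# 56K is free in total, yet no single hole can hold a 32K program.
print(total_free(regions), largest_hole(regions))
print(largest_hole(compact(regions)))
```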

Paged Memory Systems

Processor generated address can be interpreted as a pair <page number, offset>

A page table contains the physical address of the base of each page

Page tables make it possible to store the pages of a program non-contiguously.

[Figure: pages 0 to 3 of one user's address space mapped through that user's page table to non-contiguous physical pages; address fields: page number, offset]
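Interpreting an address as a <page number, offset> pair and looking up the page base can be sketched as below. The 4 KB page size and the table contents are assumptions chosen for the example.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def translate(vaddr: int, page_table: list[int]) -> int:
    """page_table[n] holds the physical base address of page n."""
    page_number, offset = divmod(vaddr, PAGE_SIZE)
    return page_table[page_number] + offset


# Four pages of one program stored non-contiguously in physical memory.
page_table = [0x8000, 0x2000, 0xC000, 0x5000]

print(hex(translate(0x1234, page_table)))
print(hex(translate(0x3FFF, page_table)))
```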

Private Address Space per User

Each user has a page table
Page table contains an entry for each user page

[Figure: the virtual addresses (VA) of User 1, User 2, and User 3 each pass through that user's page table into a shared physical memory that also holds OS pages and free pages]

Where Should Page Tables Reside?

Space required by the page tables (PT) is proportional to the address space, number of users, ...

⇒ Space requirement is large
⇒ Too expensive to keep in registers

Idea: Keep PTs in the main memory

- needs one reference to retrieve the page base address and another to access the data word

⇒ doubles the number of memory references!

Manual Overlays

Ferranti Mercury 1956

[Figure: 40k bits main, 640k bits drum; label: Central Store]

Assume an instruction can address all the storage on the drum

Method 1: programmer keeps track of addresses in the main memory and initiates an I/O transfer when required

Method 2: automatic initiation of I/O transfers by software address translation

- Brooker’s interpretive coding, 1960

Method 1: Difficult, error prone
Method 2: Inefficient

Not just an ancient black art, e.g., IBM Cell microprocessor explicitly managed local store has same issues

Demand Paging in Atlas (1962)

[Figure: Central Memory. Primary: 32 pages, 512 words/page. Secondary (Drum): 32 x 6 pages]

User sees 32 x 6 x 512 words of storage

“A page from secondary storage is brought into the primary storage whenever it is (implicitly) demanded by the processor.”
Tom Kilburn

Primary memory as a cache for secondary memory

Hardware Organization of Atlas

[Figure: Initial Address Decode selecting among the stores below]

- 16 ROM pages: 0.4 ~ 1 μsec
- 2 subsidiary pages: 1.4 μsec
- Main, 32 pages: 1.4 μsec
- Drum (4): 192 pages
- 8 tape decks: 88 sec/word
- 48-bit words, 512-word pages

1 Page Address Register (PAR) per page frame

Compare the effective page address against all 32 PARs

- match ⇒ normal access
- no match ⇒ page fault: save the state of the partially executed instruction

[Figure: Effective Address compared against the PARs; system code (not swapped), system data (not swapped)]

Atlas Demand Paging Scheme

On a page fault:

- Input transfer into a free page is initiated
- The Page Address Register (PAR) is updated
- If no free page is left, a page is selected to be replaced (based on usage)
- The replaced page is written on the drum
  - to minimize drum latency effect, the first empty page on the drum was selected
- The page table is updated to point to the new location of the page on the drum
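The page-fault steps can be sketched as a small simulation with one PAR per page frame. Least-recently-used replacement stands in for "based on usage" purely as an assumption, and drum transfers are not modeled; the class name is hypothetical.

```python
class DemandPager:
    """Page frames tracked by one Page Address Register (PAR) each."""

    def __init__(self, num_frames: int):
        self.pars = [None] * num_frames   # page number held by each frame
        self.last_use = [0] * num_frames
        self.clock = 0
        self.faults = 0

    def access(self, page: int) -> int:
        """Return the frame holding the page, faulting it in if needed."""
        self.clock += 1
        if page in self.pars:             # compare against all PARs
            frame = self.pars.index(page)
        else:
            self.faults += 1
            if None in self.pars:
                frame = self.pars.index(None)   # a free page frame exists
            else:
                # Assumption: replace the least recently used frame.
                frame = min(range(len(self.pars)),
                            key=self.last_use.__getitem__)
            self.pars[frame] = page       # PAR is updated
        self.last_use[frame] = self.clock
        return frame


pager = DemandPager(num_frames=2)
frames = [pager.access(p) for p in [1, 2, 1, 3, 2]]
print(frames, pager.faults, pager.pars)
```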

Linear Page Table

[Figure: the virtual address is split into VPN and offset; the PT Base Register and the VPN locate an entry in the page table, which holds either a PPN or a DPN; the PPN and the offset locate the data word in a data page]

Page Table Entry (PTE) contains:

- A bit to indicate if a page exists
- PPN (physical page number) for a memory-resident page
- DPN (disk page number) for a page on the disk
- Status bits for protection and usage

OS sets the PT Base Register whenever active user process changes
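A linear page table lookup that uses the PTE fields listed above might look like the following sketch. The `PTE` field names, the 12-bit offset, and the `PageFault` exception are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_BITS = 12  # assumed 4 KB pages


@dataclass
class PTE:
    exists: bool = False        # does the page exist at all?
    ppn: Optional[int] = None   # set when the page is memory-resident
    dpn: Optional[int] = None   # set when the page is on disk


class PageFault(Exception):
    pass


def translate(vaddr: int, page_table: list[PTE]) -> int:
    vpn = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    pte = page_table[vpn]       # entry located from the table base and the VPN
    if not pte.exists:
        raise PageFault(f"page {vpn} does not exist")
    if pte.ppn is None:
        raise PageFault(f"page {vpn} is on disk at DPN {pte.dpn}")
    return (pte.ppn << PAGE_BITS) | offset


table = [PTE(exists=True, ppn=5), PTE(exists=True, dpn=9), PTE()]
print(hex(translate(0x0ABC, table)))
```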

Size of Linear Page Table

With 32-bit addresses, 4-KB pages & 4-byte PTEs:

⇒ 2^20 PTEs, i.e., 4 MB page table per user
⇒ 4 GB of swap needed to back up full virtual address space

Larger pages?

- Internal fragmentation (Not all memory in a page is used)
- Larger page fault penalty (more time to read from disk)

What about 64-bit virtual address space???

- Even 1MB pages would require 2^44 8-byte PTEs (2^47 bytes = 128 TiB!)
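The page table sizes follow from dividing the address space by the page size and multiplying by the PTE size. The helper below is a hypothetical name used only to make that arithmetic checkable.

```python
def linear_page_table(va_bits: int, page_bytes: int, pte_bytes: int):
    """Return (number of PTEs, table size in bytes) for a flat page table."""
    num_ptes = 2 ** va_bits // page_bytes
    return num_ptes, num_ptes * pte_bytes


# 32-bit addresses, 4 KB pages, 4-byte PTEs.
ptes_32, bytes_32 = linear_page_table(32, 4 * 1024, 4)

# 64-bit addresses, 1 MB pages, 8-byte PTEs.
ptes_64, bytes_64 = linear_page_table(64, 1024 * 1024, 8)

print(ptes_32, bytes_32 // 2**20, "MiB")
print(ptes_64, bytes_64 // 2**40, "TiB")
```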

Hierarchical Page Table

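Virtual Address: p1 (bits 31-22, 10-bit L1 index) | p2 (bits 21-12, 10-bit L2 index) | offset (bits 11-0)

[Figure: the Root of the Current Page Table (Processor Register) points to the Level 1 Page Table; p1 selects a Level 1 entry that points to a Level 2 Page Table; p2 selects a PTE there; offset selects the word in a Data Page. Legend: page in primary memory, page in secondary memory, PTE of a nonexistent page]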
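A two-level walk with the 10/10/12-bit split can be sketched as follows. Using dictionaries for the tables (so that absent entries stand for nonexistent pages) and the name `NoMapping` are assumptions made for brevity.

```python
class NoMapping(LookupError):
    """The walk reached an entry for a nonexistent page."""


def split(vaddr: int) -> tuple[int, int, int]:
    p1 = (vaddr >> 22) & 0x3FF     # bits 31-22: Level 1 index
    p2 = (vaddr >> 12) & 0x3FF     # bits 21-12: Level 2 index
    offset = vaddr & 0xFFF         # bits 11-0
    return p1, p2, offset


def walk(root: dict, vaddr: int) -> int:
    """root maps p1 -> Level 2 table; a Level 2 table maps p2 -> PPN."""
    p1, p2, offset = split(vaddr)
    level2 = root.get(p1)
    if level2 is None or p2 not in level2:
        raise NoMapping(hex(vaddr))
    return (level2[p2] << 12) | offset


# Only one Level 2 table is allocated; the rest of the space costs nothing.
root = {1: {3: 0x7}}
print(split(0x00403004), hex(walk(root, 0x00403004)))
```

Only the Level 2 tables that are actually in use need to exist, which is how the hierarchy avoids the cost of a full linear table.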

Address Translation & Protection

Every instruction and data access needs address translation and protection checks

A good VM design needs to be fast (~ one cycle) and space efficient

[Figure: the Virtual Address (Virtual Page No. (VPN), offset) passes through Address Translation to give the Physical Address (Physical Page No. (PPN), offset); a Protection Check that takes Kernel/User Mode and Read/Write as inputs may raise an Exception]
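The protection check that accompanies translation can be sketched with a small set of permission flags. The flag names and the rule that kernel mode may touch any page are assumptions for this example, not a description of a particular architecture.

```python
from enum import Flag, auto


class Perm(Flag):
    READ = auto()
    WRITE = auto()
    USER = auto()      # page may be accessed in user mode


class ProtectionFault(Exception):
    pass


def check_access(perms: Perm, *, write: bool, user_mode: bool) -> None:
    """Raise ProtectionFault if the access is not permitted."""
    if user_mode and Perm.USER not in perms:
        raise ProtectionFault("user-mode access to a kernel-only page")
    needed = Perm.WRITE if write else Perm.READ
    if needed not in perms:
        raise ProtectionFault(f"{needed.name} not permitted")


read_only_user_page = Perm.READ | Perm.USER
check_access(read_only_user_page, write=False, user_mode=True)  # allowed
```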
