CIS 501 (Martin): Virtual Memory 1
CIS 501
Computer Architecture
Unit 6: Virtual Memory
Slides originally developed by Amir Roth with contributions by Milo Martin at University of Pennsylvania with sources that included University of Wisconsin slides by Mark Hill, Guri Sohi, Jim Smith, and David Wood.
This Unit: Virtual Memory
• The operating system (OS)
- A super-application
- Hardware support for an OS
• Virtual memory
- Page tables and address translation
- TLBs and memory hierarchy issues
[Figure: system stack with apps on top of system software, running on CPU, memory, and I/O]
Readings
• Textbook (MA:FSPTCM)
A Computer System: Hardware
• CPUs and memories
• I/O peripherals : storage, input, display, network, …
- With separate or built-in DMA
- Connected by system bus (which is connected to memory bus)
[Figure: CPUs with caches (CPU/$) and memory on the memory bus; a bridge connects it to the system (I/O) bus, which hosts disk, keyboard, display, and NIC, with DMA and an I/O controller]
A Computer System: + App Software
• Application software : computer must do something
[Figure: the same hardware system, with application software running on the CPUs]
A Computer System: + OS
• Operating System (OS): virtualizes hardware for apps
- Abstraction : provides services (e.g., threads, files, etc.)
- Simplifies app programming model; raw hardware is nasty
- Isolation : gives each app illusion of private CPU, memory, I/O
- Simplifies app programming model
- Increases hardware resource utilization
[Figure: the same hardware system, with the OS layered between the hardware and several applications]
Operating System (OS) and User Apps
• Sane system development requires a split
- Hardware itself facilitates/enforces this split
• Operating System (OS) : a super-privileged process
- Manages hardware resource allocation/revocation for all processes
- Has direct access to resource allocation features
- Aware of many nasty hardware details
- Aware of other processes
- Talks directly to input/output devices (device driver software)
• User-level apps : ignorance is bliss
- Unaware of most nasty hardware details
- Unaware of other apps (and OS)
- Explicitly denied access to resource allocation features
System Calls
• Controlled transfers to/from OS
• System Call : a user-level app “function call” to OS
- Leave description of what you want done in registers
- SYSCALL instruction (also called TRAP or INT)
- Can’t allow user-level apps to invoke arbitrary OS code
- Restricted set of legal OS addresses to jump to ( trap vector )
- Processor jumps to OS using trap vector
- OS performs operation
- OS does a “return from system call”
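The trap-vector idea above can be sketched as a table of legal entry points indexed by a service number. This is a user-space model only, not how any real OS is written; the syscall numbers, handler names, and the placeholder return value are all hypothetical.

```c
/* Hypothetical syscall numbers; real OSes define their own. */
enum { SYS_GET_TIME = 0, SYS_ADD_ONE = 1, NUM_SYSCALLS = 2 };

typedef long (*syscall_handler)(long arg);

static long sys_get_time(long arg) { (void)arg; return 12345; /* placeholder value */ }
static long sys_add_one(long arg)  { return arg + 1; }

/* The "trap vector": the only OS entry points a user app can reach. */
static const syscall_handler trap_vector[NUM_SYSCALLS] = {
    [SYS_GET_TIME] = sys_get_time,
    [SYS_ADD_ONE]  = sys_add_one,
};

/* Models the SYSCALL instruction: the app names a service by number
 * (as if left in a register); it never supplies an OS address. */
long do_syscall(long number, long arg) {
    if (number < 0 || number >= NUM_SYSCALLS)
        return -1;                      /* illegal request: rejected */
    return trap_vector[number](arg);    /* OS performs the operation */
}
```

Because the app passes only an index, it cannot make the processor jump to arbitrary OS code; an out-of-range number is simply refused.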
Virtual Memory (VM)
• Virtual Memory (VM) :
- Level of indirection
- Application generated addresses are virtual addresses (VAs)
- Each process thinks it has its own 2^N bytes of address space
- Memory accessed using physical addresses (PAs)
- VAs translated to PAs at some coarse granularity (page)
- OS controls VA to PA mapping for itself and all other processes
- Logically: translation performed before every insn fetch, load, store
- Physically: hardware acceleration removes translation overhead
[Figure: VAs of the OS and of each app pass through OS-controlled VA → PA mappings to PAs (physical memory) or to disk]
Virtual Memory (VM)
• Programs use virtual addresses (VA)
- VA size (N) aka machine size (e.g., Core 2 Duo: 48-bit)
• Memory uses physical addresses (PA)
- PA size (M) typically M<N, especially if N=
- 2^M is the most physical memory the machine supports
• VA → PA at page granularity (VP → PP)
- Mapping need not preserve contiguity
- VP need not be mapped to any PP
- Unmapped VPs live on disk (swap) or nowhere (if not yet touched)
VM is an Old Idea: Older than Caches
• Original motivation: single-program compatibility
- IBM System 370: a family of computers with one software suite
- Same program could run on machines with different memory sizes
- Previously, programmers explicitly accounted for memory size
• But also: full-associativity + software replacement
- Memory t_miss is high: extremely important to reduce %miss

Parameter      I$/D$      L2           Main Memory
t_hit          2ns        10ns         30ns
t_miss         10ns       30ns         10ms (10M ns)
Capacity       8–64KB     128KB–2MB    64MB–64GB
Block size     16–32B     32–256B      4+KB
Assoc./Repl.   1–4, LRU   4–16, LRU    Full, “working set”
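To see why %miss matters so much at this level, a minimal sketch of the usual average-access-time formula, t_avg = t_hit + %miss * t_miss, helps. The function name is hypothetical, and the miss rate used in the note below is an illustrative assumption; only the 30ns and 10M ns figures come from the table.

```c
/* Average access time in ns: t_avg = t_hit + miss_rate * t_miss. */
double avg_access_ns(double t_hit_ns, double miss_rate, double t_miss_ns) {
    return t_hit_ns + miss_rate * t_miss_ns;
}
```

With the main-memory column (t_hit = 30ns, t_miss = 10M ns), even an assumed miss rate of 0.001% (1e-5) adds about 100ns per access, more than three times the hit time.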
Uses of Virtual Memory
• More recently: isolation and multi-programming
- Each app thinks it has 2^N B of memory, its stack starts at 0xFFFFFFFF, …
- Apps prevented from reading/writing each other’s memory
- Can’t even address the other program’s memory!
• Protection
- Each page with a read/write/execute permission set by OS
- Enforced by hardware
• Inter-process communication.
- Map same physical pages into multiple virtual address spaces
- Or share files via the UNIX mmap() call
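A minimal sketch of sharing through mapped memory, assuming a Linux-like system where `MAP_ANONYMOUS` is available. It shares an anonymous page between a parent and a forked child rather than a file, which is a simplification of the file-sharing case; the function name is hypothetical.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the value the child wrote into the shared page, or -1 on error. */
int share_page_with_child(int value) {
    /* MAP_SHARED: after fork, parent and child VAs map to the same physical page. */
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    if (pid == 0) {            /* child: a separate virtual address space */
        *shared = value;
        _exit(0);
    }
    waitpid(pid, NULL, 0);     /* parent: wait, then read the child's write */
    int seen = *shared;
    munmap(shared, sizeof *shared);
    return seen;
}
```

With `MAP_PRIVATE` instead, the child's write would land in its own copy of the page and the parent would still read 0.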
Address Translation
• VA → PA mapping called address translation
- Split VA into virtual page number (VPN) & page offset (POFS)
- Translate VPN into physical page number (PPN)
- POFS is not translated
- VA → PA = [VPN, POFS] → [PPN, POFS]
• Example
- 64KB pages → 16-bit POFS
- 32-bit machine → 32-bit VA → 16-bit VPN
- Maximum 256MB memory → 28-bit PA → 12-bit PPN
[Figure: virtual address[31:0] = VPN[31:16] + POFS[15:0]; physical address[27:0] = PPN[27:16] + POFS[15:0]; the VPN is translated, the POFS is not touched]
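The field split in this example can be sketched with shifts and masks. The macro and function names are hypothetical; the bit widths are the ones from the 64KB-page example.

```c
#include <stdint.h>

#define PAGE_OFFSET_BITS 16u                              /* 64KB pages */
#define PAGE_OFFSET_MASK ((1u << PAGE_OFFSET_BITS) - 1u)

uint32_t vpn_of(uint32_t va)  { return va >> PAGE_OFFSET_BITS; }  /* VA[31:16] */
uint32_t pofs_of(uint32_t va) { return va & PAGE_OFFSET_MASK; }   /* VA[15:0]  */

/* Rebuild the PA from a translated PPN and the untouched page offset. */
uint32_t make_pa(uint32_t ppn, uint32_t pofs) {
    return (ppn << PAGE_OFFSET_BITS) | pofs;
}
```

For VA 0x1234ABCD the VPN is 0x1234 and the POFS is 0xABCD; if the page table maps that VPN to PPN 0x0FF, the PA is 0x00FFABCD.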
Address Translation Mechanics I
• How are addresses translated?
- In software (for now) but with hardware acceleration (a little later)
• Each process allocated a page table (PT)
- Software data structure constructed by OS
- Maps VPs to PPs or to disk (swap) addresses
- VP entries empty if page never referenced
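- Translation is table lookup

    /* Assumed for illustration: 32-bit VA with 4KB pages gives 1M virtual pages */
    #define NUM_VIRTUAL_PAGES (1 << 20)

    struct PTE { int ppn; int is_valid, is_dirty, is_swapped; };
    struct PTE page_table[NUM_VIRTUAL_PAGES];

    int translate(int vpn) {
        if (page_table[vpn].is_valid)
            return page_table[vpn].ppn;
        return -1;  /* no valid mapping: the OS must handle a page fault */
    }

[Figure: vpn indexes the page table (PT); entries point to physical pages or to disk (swap)]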
Page Table Size
• How big is a page table on the following machine?
- 32-bit machine
- 4B page table entries (PTEs)
- 4KB pages
- 32-bit machine → 32-bit VA → 4GB virtual memory
- 4GB virtual memory / 4KB page size → 1M VPs
- 1M VPs * 4B PTE → 4MB
• How big would the page table be with 64KB pages?
• How big would it be for a 64-bit machine?
• Page tables can get big
- There are ways of making them smaller
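The two open questions above can be worked with the same arithmetic. A minimal sketch, working in log2 so the 64-bit case does not overflow; the function name is hypothetical, and keeping 4B PTEs on a 64-bit machine is an assumption made only to reuse the slide's numbers.

```c
/* log2 of the flat page-table size in bytes:
 * (number of VPs) * (PTE size) = 2^(va_bits - page_bits) * 2^pte_bits. */
unsigned flat_pt_bytes_log2(unsigned va_bits, unsigned page_bits, unsigned pte_bits) {
    return va_bits - page_bits + pte_bits;
}
```

With 4B PTEs (pte_bits = 2): 32-bit VA and 4KB pages give 2^22 B = 4MB; 64KB pages give 2^18 B = 256KB; a 64-bit VA with 4KB pages gives 2^54 B, far more than any machine's memory.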
Multi-Level Page Table (PT)
• One way: multi-level page tables
- Tree of page tables (“trie”)
- Lowest-level tables hold PTEs
- Upper-level tables hold pointers to lower-level tables
- Different parts of VPN used to index different levels
• Example: two-level page table for machine on last slide
- Compute number of pages needed for lowest-level (PTEs)
- 4KB pages / 4B PTEs → 1K PTEs/page
- 1M PTEs / (1K PTEs/page) → 1K pages
- Compute number of pages needed for upper-level (pointers)
- 1K lowest-level pages → 1K pointers
- 1K pointers * 4B (32-bit) per pointer → 4KB → 1 upper-level page
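A minimal sketch of a two-level table for this machine, assuming the 20-bit VPN is split into two 10-bit indices. Type and function names are hypothetical, and lowest-level tables are allocated only when first used, which is where the space saving comes from.

```c
#include <stdint.h>
#include <stdlib.h>

#define ENTRIES 1024u   /* 1K entries per 4KB table page */

typedef struct { uint32_t ppn; int is_valid; } pte_t;
typedef struct { pte_t *tables[ENTRIES]; } root_t;   /* upper level: pointers */

/* Map vpn -> ppn, allocating the lowest-level table on first use.
 * Returns 0 on success, -1 if allocation fails. */
int map_page(root_t *root, uint32_t vpn, uint32_t ppn) {
    uint32_t hi = vpn >> 10, lo = vpn & (ENTRIES - 1);   /* VPN[19:10], VPN[9:0] */
    if (root->tables[hi] == NULL) {
        root->tables[hi] = calloc(ENTRIES, sizeof(pte_t));
        if (root->tables[hi] == NULL)
            return -1;
    }
    root->tables[hi][lo] = (pte_t){ .ppn = ppn, .is_valid = 1 };
    return 0;
}

/* Walk both levels. Returns 0 and sets *pa on success, -1 if unmapped. */
int translate_2level(const root_t *root, uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> 12, hi = vpn >> 10, lo = vpn & (ENTRIES - 1);
    const pte_t *table = root->tables[hi];
    if (table == NULL || !table[lo].is_valid)
        return -1;
    *pa = (table[lo].ppn << 12) | (va & 0xFFFu);   /* offset is not translated */
    return 0;
}
```

A process that touches only a few regions of its address space needs the one upper-level page plus a handful of lowest-level pages, instead of the full 4MB flat table.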
Translation Lookaside Buffer
• Translation lookaside buffer (TLB)
- Small cache: 16–64 entries
- Associative (4+ way or fully associative)
- Exploits temporal locality in page table
- What if an entry isn’t found in the TLB?
- Invoke TLB miss handler
[Figure: each TLB entry holds a VPN “tag” and a PPN “data”; TLBs sit between the CPU and I$/D$ (VA in, PA out), backed by L2 and main memory]
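A TLB can be modeled as a small fully associative cache whose tag is the VPN and whose data is the PPN. This is a software sketch of that idea only; the 16-entry size is the low end of the range above, the names are hypothetical, and round-robin replacement is an assumption, not something the slides specify.

```c
#include <stdint.h>

#define TLB_ENTRIES 16u

typedef struct { uint32_t vpn, ppn; int valid; } tlb_entry_t;
typedef struct { tlb_entry_t e[TLB_ENTRIES]; unsigned next; } tlb_t;

/* Fully associative lookup: compare vpn against every tag. Returns 1 on hit. */
int tlb_lookup(const tlb_t *tlb, uint32_t vpn, uint32_t *ppn) {
    for (unsigned i = 0; i < TLB_ENTRIES; i++) {
        if (tlb->e[i].valid && tlb->e[i].vpn == vpn) {
            *ppn = tlb->e[i].ppn;
            return 1;
        }
    }
    return 0;   /* miss: the TLB miss handler must fill the entry */
}

/* Fill after a miss. Round-robin replacement is an assumed policy. */
void tlb_fill(tlb_t *tlb, uint32_t vpn, uint32_t ppn) {
    tlb->e[tlb->next] = (tlb_entry_t){ .vpn = vpn, .ppn = ppn, .valid = 1 };
    tlb->next = (tlb->next + 1) % TLB_ENTRIES;
}
```

Hardware performs all the tag comparisons at once; the loop here only models the result.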
Serial TLB & Cache Access
• “Physical” caches
- Indexed and tagged by physical addresses
- Natural, “lazy” sharing of caches between apps/OS
• VM ensures isolation (via physical addresses)
• No need to do anything on context switches
• Multi-threading works too
- Cached inter-process communication works
- Single copy indexed by physical address
- Slow: adds at least one cycle to t_hit
• Note: TLBs are by definition “virtual”
- Indexed and tagged by virtual addresses
- Flush across context switches
- Or extend with process identifier tags (x86)
[Figure: serial access; the CPU's VA goes through the TLB to a PA before reaching I$/D$, then L2 and main memory]
Parallel TLB & Cache Access
• Two ways to look at VA
- Cache: tag+index+offset
- TLB: VPN +page offset
• Parallel cache/TLB…
- If address translation doesn’t change index
- That is, VPN/index must not overlap
[Figure: the cache splits the address into tag [31:12], index [11:5], offset [4:0]; the TLB splits it into VPN [31:16], page offset [15:0]; cache and TLB are accessed in parallel, producing cache hit/miss and TLB hit/miss]
Parallel TLB & Cache Access
• What about parallel access?
- Only if… (cache size) / (associativity) ≤ page size
- Index bits same in virt. and physical addresses!
• Access TLB in parallel with cache
- Cache access needs tag only at very end
- Fast: no additional t_hit cycles
- No context-switching/aliasing problems
- Dominant organization used today
• Example: Core 2, 4KB pages, 32KB, 8-way SA L1 data cache
- Implication: associativity allows bigger caches
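The condition for parallel access, (cache size) / (associativity) ≤ page size, is easy to check numerically. A minimal sketch with a hypothetical function name:

```c
#include <stdint.h>

/* Parallel TLB/cache access works when all index and offset bits fall inside
 * the page offset, i.e. (cache size) / (associativity) <= page size. */
int can_access_in_parallel(uint32_t cache_bytes, uint32_t ways, uint32_t page_bytes) {
    return cache_bytes / ways <= page_bytes;
}
```

For the Core 2 example, 32KB / 8 ways = 4KB, which just fits a 4KB page. The same 32KB cache at 4 ways would need 8KB per way and fail the check, which is why higher associativity allows a bigger cache.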
[Figure: parallel access; the cache is indexed with tag [31:12], index [11:5], offset [4:0] while the TLB translates VPN [31:16] to PPN [27:16]; page offset [15:0] is unchanged]
TLB Organization
• Like caches: TLBs also have ABCs
- Capacity
- Associativity (At least 4-way associative, fully-associative common)
- What does it mean for a TLB to have a block size of two?
- Two consecutive VPs share a single tag
- Like caches : there can be L2 TLBs
• Example: AMD Opteron
- 32-entry fully-assoc. TLBs, 512-entry 4-way L2 TLB (insn & data)
- 4KB pages, 48-bit virtual addresses, four-level page table
• Rule of thumb : TLB should “cover” L2 contents
- In other words: (#PTEs in TLB) * page size ≥ L2 size
- Why? Consider relative miss latency in each…
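The rule of thumb compares TLB "reach" against L2 capacity. A minimal sketch; the function names are hypothetical, and the 1MB L2 size used in the note below is an assumed value, not a figure from the slides.

```c
#include <stdint.h>

/* Memory covered by the TLB: (#PTEs in TLB) * page size. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes) {
    return entries * page_bytes;
}

/* Rule of thumb: reach should be at least the L2 size. */
int tlb_covers_l2(uint64_t entries, uint64_t page_bytes, uint64_t l2_bytes) {
    return tlb_reach_bytes(entries, page_bytes) >= l2_bytes;
}
```

A 512-entry L2 TLB with 4KB pages reaches 2MB, enough to cover an assumed 1MB L2; a 32-entry TLB alone reaches only 128KB and would not.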
TLB Misses
• TLB miss: translation not in TLB, but in page table
- Two ways to “fill” it, both relatively fast
• Software-managed TLB : e.g., Alpha, MIPS
- Short (~10 insn) OS routine walks page table, updates TLB
- Keeps page table format flexible
- Latency: one or two memory accesses + OS call (pipeline flush)
• Hardware-managed TLB : e.g., x86, recent SPARC, ARM
- Page table root in hardware register, hardware “walks” table
- Latency: saves cost of OS call (avoids pipeline flush)
- Page table format is hard-coded
• Trend is towards hardware TLB miss handler
Page Faults
• Page fault: PTE not in TLB or page table
- → page not in memory
- Or no valid mapping → segmentation fault
- Starts out as a TLB miss, detected by OS/hardware handler
• OS software routine :
- Choose a physical page to replace
- “Working set” : refined LRU, tracks active page usage
- If dirty, write to disk
- Read missing page from disk
- Takes so long (~10ms), OS schedules another task
- Requires yet another data structure: frame map
- Maps physical pages to <process, virtual page> pairs
- Treat like a normal TLB miss from here
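The OS routine above can be sketched as a small simulation. This is a sketch under simplifying assumptions: a single process (so the frame map stores only a virtual page, not a <process, virtual page> pair), FIFO victim selection in place of a "working set" policy, and counters standing in for actual disk I/O. All names are hypothetical.

```c
#define NUM_FRAMES 4u
#define NUM_VPS    16u

typedef struct { int ppn, is_valid, is_dirty; } pte_t;
typedef struct { int vpn, in_use; } frame_t;       /* frame map entry */

typedef struct {
    pte_t    pt[NUM_VPS];          /* page table of the (single) process */
    frame_t  frames[NUM_FRAMES];   /* frame map: physical page -> virtual page */
    unsigned next_victim;          /* FIFO pointer (assumed policy) */
    unsigned disk_writes, disk_reads;
} vm_t;

/* Handle a fault on vpn; returns the physical page that now holds it. */
int handle_page_fault(vm_t *vm, int vpn) {
    unsigned f = vm->next_victim;                  /* choose a physical page */
    vm->next_victim = (f + 1) % NUM_FRAMES;
    if (vm->frames[f].in_use) {
        pte_t *old = &vm->pt[vm->frames[f].vpn];   /* frame map finds the owner */
        if (old->is_dirty)
            vm->disk_writes++;                     /* if dirty, write to disk */
        old->is_valid = 0;
        old->is_dirty = 0;
    }
    vm->disk_reads++;                              /* read missing page from disk */
    vm->frames[f] = (frame_t){ .vpn = vpn, .in_use = 1 };
    vm->pt[vpn] = (pte_t){ .ppn = (int)f, .is_valid = 1, .is_dirty = 0 };
    return (int)f;                                 /* then retry as a TLB miss */
}
```

The frame map is what lets the handler go from the chosen physical page back to the PTE it must invalidate.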
Summary
• OS virtualizes memory and I/O devices
• Virtual memory
- “infinite” memory, isolation, protection, inter-process communication
- Page tables
- Translation buffers
- Parallel vs serial access, interaction with caching
- Page faults