Understanding Virtual Memory and Paging: A Deep Dive into Memory Management

These slides explain how virtual memory works, focusing on paging. They cover the division of memory into pages, the use of page tables, and the implementation of paging. They also discuss the impact of virtual memory on algorithms and the importance of translation lookaside buffers (TLBs).

Uploaded on 03/16/2009

Computer Architecture and Performance: Virtual Memory
William Gropp

Virtual Memory

  • So far, we’ve assumed that the process is addressing “memory”
  • In most systems, (user) processes use “virtual” addresses
    ♦ Gives the process the illusion that it directly addresses all real memory
    ♦ Gives the process the illusion that there is more real memory than is really available

Paging Example

[Figure: an address split into high bits and low bits, a page table, and memory containing a memory page]

  • All of memory is divided into pages
  • A page table entry is required for each memory page
  • Low bits in address give location within page

Implementing Paging

  • Virtual memory introduces some costs because the virtual address must be translated to a physical address
  • Consider this case:
    ♦ Let each page contain 4 KB (4096 bytes)
      - A common size
    ♦ Address uses lower 12 bits to represent location in the page
    ♦ Upper bits give page number
      - For a 32-bit address space (4 GB of memory), use the top 20 bits
  • For each page number, there is a corresponding location
    ♦ Either in physical (real) memory
    ♦ Or on “backing store” (in the swap file on disk)
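
The 12-bit / 20-bit split described above can be sketched in a few lines. This is only an illustration under the slide's assumptions (4 KB pages, 32-bit addresses); the name `split_address` is made up for this example.

```python
PAGE_SIZE = 4096              # 4 KB pages, as in the case above
OFFSET_BITS = 12              # log2(4096): low bits locate the byte within the page
OFFSET_MASK = PAGE_SIZE - 1   # 0xFFF

def split_address(vaddr):
    """Split a 32-bit virtual address into (page number, offset within page)."""
    page_number = vaddr >> OFFSET_BITS   # top 20 bits
    offset = vaddr & OFFSET_MASK         # low 12 bits
    return page_number, offset

page, offset = split_address(0x12345678)
print(hex(page), hex(offset))   # prints: 0x12345 0x678
```

The page number is what gets looked up in the page table; the offset is carried over unchanged into the physical address.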

Paging Example With Cache

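[Figure: the paging diagram with a cache added in front of the page table; labels: high bits in address, low bits, page table, memory, memory page]

  • Look for this index in cache
    ♦ If found, use the cached mapping
    ♦ If not found, look up and replace entry in cache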

Translation Lookaside Buffer (TLB)

  • The page mapping cache is called a Translation Lookaside Buffer (TLB)
    ♦ Lookup is not easy when it has to be very fast
    ♦ As a result, TLBs are often small but fast enough to return the physical address quickly
  • What happens on a TLB miss (the page's entry is not in the TLB)?
    ♦ Fetch entry from memory (the whole page table isn’t big relative to main (DRAM) memory)
      - Main memory latency cost

TLB Revisited

  • When a page location is not found in the TLB, first find the entry in the page table
    ♦ Requires a memory read
      - Latencies from 20 to hundreds of cycles
  • Determine if the page is stored in main memory (resident) or has been moved to slower disk storage
    ♦ If resident, replace a TLB entry with the location of this page and return the physical address
    ♦ If not resident, transfer control to the operating system to handle a page fault
      - A page fault has latencies in milliseconds (time to find and read data from disk)
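
The lookup sequence above (TLB, then page table, then page fault) can be sketched as a small software model. This is a hedged illustration, not how any particular hardware or operating system does it: the LRU replacement policy, the 4-entry TLB, the dictionary page table, and the names `translate` and `PageFault` are all assumptions made for the example.

```python
from collections import OrderedDict

PAGE_BITS = 12      # 4 KB pages (assumption, matching the earlier example)
TLB_ENTRIES = 4     # deliberately tiny; real TLBs are larger

class PageFault(Exception):
    """Page is not resident; a real OS would now read it from backing store."""

def translate(vaddr, tlb, page_table):
    """Return (physical address, 'tlb hit' or 'tlb miss') for a virtual address.

    tlb        -- OrderedDict mapping page number -> frame number (LRU order)
    page_table -- dict mapping page number -> frame number, or None if on disk
    """
    page = vaddr >> PAGE_BITS
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    if page in tlb:
        tlb.move_to_end(page)               # mark as most recently used
        return (tlb[page] << PAGE_BITS) | offset, "tlb hit"
    frame = page_table.get(page)            # page table lookup: a memory read
    if frame is None:
        raise PageFault(page)               # not resident: hand over to the OS
    if len(tlb) >= TLB_ENTRIES:
        tlb.popitem(last=False)             # replace the least recently used entry
    tlb[page] = frame
    return (frame << PAGE_BITS) | offset, "tlb miss"

tlb = OrderedDict()
page_table = {0: 7, 1: None}                # page 0 resident in frame 7, page 1 on disk
print(translate(0x0010, tlb, page_table))   # first touch of page 0: a TLB miss
print(translate(0x0020, tlb, page_table))   # same page again: a TLB hit
```

Touching an address in page 1 (for example `0x1000`) raises `PageFault` in this model, which corresponds to the millisecond-scale path in the bullets above.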

Impact on Algorithms

  • Large cost if data outside the set of pages currently mapped by the TLB is accessed frequently
  • Consider the transpose example with a 2048 x 2048 matrix and a TLB with 64 entries
  • Each matrix entry is an 8-byte double-precision value

Transpose with 4K pages:

  • Each column of the matrix requires 4 pages (2048 x 8 bytes = 16 KB)
    ♦ A page is mapped for stores every 512 rows
    ♦ A page is mapped for loads on every column:
      - Use only a single entry from a page before going to the next one
      - Process 2k-1 (2047) other pages before returning to a previous page
      - Every load incurs a TLB miss
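
The numbers in this slide follow from simple arithmetic, worked through here as a check. The sketch assumes the matrix is stored column by column (which is what "each column requires 4 pages" implies) and starts on a page boundary; the variable names are illustrative only.

```python
N = 2048              # matrix is N x N
ELEM_BYTES = 8        # double precision
PAGE_BYTES = 4096     # 4K pages
TLB_ENTRIES = 64

column_bytes = N * ELEM_BYTES                   # 16384 bytes per column
pages_per_column = column_bytes // PAGE_BYTES   # pages needed to hold one column
elems_per_page = PAGE_BYTES // ELEM_BYTES       # consecutive column entries per page

# Walking along a row jumps column_bytes between elements. With 4K pages that
# stride is larger than a page, so every element of the row is on its own page.
columns_per_page = max(1, PAGE_BYTES // column_bytes)
pages_per_row = N // columns_per_page

print(pages_per_column, elems_per_page, pages_per_row)   # prints: 4 512 2048
```

Since one row touches 2048 distinct pages and the TLB holds only 64 entries, no mapping survives until the next row needs it.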

Transpose with 64K pages

  • 4 columns per page (64 KB / 16 KB per column)
    ♦ It takes 512 pages to cover one row of the matrix
    ♦ But get 4 values out of each page
      - Every fourth load incurs a TLB miss
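
The miss rates claimed for the two page sizes can be reproduced with a small simulation of the load stream. This is a sketch under explicit assumptions that are not in the slides: a fully associative TLB with LRU replacement, column-by-column storage starting at a page boundary, and loads only (stores are ignored). The function name `load_miss_rate` is hypothetical, and only a few rows are simulated to keep it fast.

```python
from collections import OrderedDict

def load_miss_rate(page_bytes, n=2048, elem_bytes=8, tlb_entries=64, rows=4):
    """Fraction of loads that miss the TLB while reading `rows` rows of an
    n x n matrix whose element (i, j) sits at byte offset (i + j*n)*elem_bytes."""
    tlb = OrderedDict()              # page number -> True, kept in LRU order
    misses = 0
    for i in range(rows):
        for j in range(n):           # walk along row i: stride of one column
            page = ((i + j * n) * elem_bytes) // page_bytes
            if page in tlb:
                tlb.move_to_end(page)
            else:
                misses += 1
                if len(tlb) >= tlb_entries:
                    tlb.popitem(last=False)   # evict least recently used page
                tlb[page] = True
    return misses / (rows * n)

print(load_miss_rate(4 * 1024))     # 4K pages:  prints 1.0
print(load_miss_rate(64 * 1024))    # 64K pages: prints 0.25
```

With 4K pages every load misses; with 64K pages four consecutive loads share a page, so only the first of each group misses.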

Observations

  • Note that the TLB and the L1/L2/L3 cache have different behavior
    ♦ For example, consider 512 separate cache lines of 128 bytes each
    ♦ Only 64K bytes of storage
    ♦ But if they are in 512 different pages, each reference may incur a TLB miss, even though the data fits within cache!
  • If a page is located in secondary media, performance may be orders of magnitude lower
    ♦ Drop in performance is severe and sudden
  • Large pages can give modest (several loads satisfied from each page) or large (no extra TLB misses) improvements in performance
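
The cache-versus-TLB example above comes down to comparing bytes with page counts. The short calculation below makes that explicit; the 4 KB page size and 64-entry TLB are assumptions borrowed from the transpose example, not stated on this slide.

```python
LINE_BYTES = 128
N_LINES = 512
PAGE_BYTES = 4096     # assumed page size
TLB_ENTRIES = 64      # assumed TLB size

data_bytes = N_LINES * LINE_BYTES        # bytes the cache must hold
tlb_reach = TLB_ENTRIES * PAGE_BYTES     # bytes the TLB can map at one time
pages_touched = N_LINES                  # worst case: every line on its own page

# The data is smaller than the TLB's reach in bytes, but it is spread over
# more pages than the TLB has entries, so mappings keep getting replaced.
print(data_bytes, tlb_reach, pages_touched > TLB_ENTRIES)   # prints: 65536 262144 True
```

What matters for the TLB is the number of distinct pages touched, not the number of bytes used.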

Discussion Questions

  • Architecture Issues
    ♦ TLB is often very small
    ♦ Even regular accesses (as in the strided accesses in transpose) can cause problems
      - Can hardware effectively predict pages and preload a guess at the next TLB entry?
      - Can alternative approaches be used?
        − If there were more or different information from the program, would other architectural solutions be practical?
  • Programming Model Issues
    ♦ Optimizing the transpose code appears simple
      - Blocking for cache and TLB is straightforward
      - Why don’t compilers (usually) generate good code for this case?
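
For reference, a minimal sketch of what "blocking" means for the transpose: the matrix is processed one small tile at a time, so that each tile touches only a few pages and cache lines before moving on. The function name, the flat-list storage, and the tile size of 64 are assumptions for illustration; Python will not show the performance effect, only the loop structure.

```python
def blocked_transpose(a, n, block=64):
    """Transpose an n x n matrix stored as a flat list, one block x block tile at a time."""
    b = [0.0] * (n * n)
    for ii in range(0, n, block):                     # tile origin, first index
        for jj in range(0, n, block):                 # tile origin, second index
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    b[j * n + i] = a[i * n + j]       # both accesses stay inside one tile
    return b

print(blocked_transpose([0, 1, 2, 3], 2))   # prints: [0, 2, 1, 3]
```

The tile size would normally be chosen so that the pages touched by one tile of `a` and one tile of `b` fit within the TLB together.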

Double Buffering and Asynchronous I/O

  • Out-of-core algorithms rely on double buffering. Pseudocode looks like this:

        load A with data
        initiate nonblocking load of B with data to be used later
        while (not done) {
            work on data in A
            initiate nonblocking load of A with data to be used later
            wait for load of B to complete
            swap pointers to A and B
        }

  • These algorithms can address problems with TLB misses, even to secondary storage
  • But they are hard to implement in practice. Why?
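
One way to see the double-buffering pattern run is a small thread-based sketch. It is a simplification, not the slide's method: `load_chunk` is a hypothetical stand-in for a slow read from disk, the "work" is just a sum, and each load returns a fresh list instead of refilling one of two fixed buffers.

```python
from concurrent.futures import ThreadPoolExecutor

def load_chunk(k, chunk_size=4):
    """Hypothetical stand-in for a slow disk read: returns chunk k of the data."""
    return list(range(k * chunk_size, (k + 1) * chunk_size))

def process_all(n_chunks):
    """Sum every chunk, loading chunk k+1 in the background while working on chunk k."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        a = load_chunk(0)                                  # load A with data
        pending = pool.submit(load_chunk, 1) if n_chunks > 1 else None  # nonblocking load of B
        for k in range(n_chunks):
            total += sum(a)                                # work on data in A
            if pending is None:
                break                                      # nothing left to load
            b = pending.result()                           # wait for load of B to complete
            nxt = k + 2
            pending = pool.submit(load_chunk, nxt) if nxt < n_chunks else None
            a = b                                          # swap: B becomes the working buffer
    return total

print(process_all(3))   # prints: 66
```

Whether the background load actually overlaps with the work depends on the runtime and the I/O layer, which is the issue raised in the final question above.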

Challenges in Implementing Out of Core Algorithms

  • Most programming models provide no support for asynchronous operations
    ♦ It is nearly impossible to robustly use nonblocking operations in Fortran because of the language design
      - Compiler may “optimize” around calls to library routines that implement nonblocking or asynchronous operations
  • A key part of the algorithm is performing work while the “other” buffer is filled with data
    ♦ How much work?
    ♦ Does the work (computation) overlap (take place at the same time) with filling the buffer (communication)?
      - Programming models and hardware may support the operation without making it efficient