Memory Hierarchy - Prof. Mchter, Summaries of Computer Science

This summary covers the memory hierarchy in computer systems, which combines small amounts of expensive, fast memory with large amounts of inexpensive, slow memory to create the illusion of large amounts of fast memory. It covers caching, virtual memory, and the translation lookaside buffer (TLB). It explains how the memory hierarchy takes advantage of spatial and temporal locality to improve performance, and describes the interface between cache and main memory, including interleaved memory. It also discusses the motivation and design decisions behind virtual memory, which allows programs to address more memory than is physically available by storing frequently used data in main memory and other data on the hard disk. Finally, it gives a detailed overview of how the TLB is integrated with virtual memory and the cache to speed up memory access.

Typology: Summaries

2022/2023

Uploaded on 03/29/2024


Chapter 5

Memory Hierarchy

Reading: The corresponding chapter in the 2nd edition is Chapter 7, in the 3rd edition it is Chapter 7 and in the 4th edition it is Chapter 5.

5.1 Overview

While studying CPU design in the previous chapter, we considered memory at a high level of abstraction, assuming it was a hardware component that consists of millions of memory cells, which can be individually addressed, for reading or writing, in a reasonable time (i.e., one CPU clock cycle). In this chapter, we will learn that memory is, in fact, built hierarchically, in different layers. This is because the ultimate goals in memory design are to

  • have lots of it (gigabytes, terabytes, etc., enough to contain the entire address space),
  • make it fast (as fast as CPU registers),
  • make it affordable (not too expensive).

These goals are challenging to combine: fast memory, such as S-RAM, is very expensive; cheaper memory, such as D-RAM, is slower; and the cheapest memory, such as hard drive storage, is extremely slow compared to S-RAM or D-RAM. Building memory from S-RAM only would make it too expensive. Building memory from D-RAM only would lower the price but slow down the overall performance significantly.

To achieve all three design goals, hardware designers combine a small amount of expensive, fast memory with large amounts of inexpensive, slow memory in such a way that the combination behaves as if large amounts of fast memory were available, at an affordable price. To create this illusion of lots of fast memory, we create a hierarchical memory structure with multiple levels. An example of a structure with 4 levels is shown in Figure 5.1. Studying this hierarchical structure in more detail is the topic of this chapter.

Each level in the memory hierarchy contains a subset of the information that is stored in the level right below it:

CPU ⊂ Cache ⊂ Main Memory ⊂ Disk.

[Figure: the four levels Disk, Main Memory, Cache, and CPU Registers, ranging from cheap and slow (Disk) to expensive and fast (Registers).]

Figure 5.1: Memory hierarchy.

In the CPU, the registers can store 32 words, which can be accessed extremely fast. If information is not present in one of the 32 registers, the CPU requests it from memory by providing the address of the location where the required information is stored. First, the cache verifies whether it has the requested information available. The cache is located close to the CPU and is composed of a relatively small amount of fast and expensive memory (S-RAM), so if the requested information is available in the cache, it can be retrieved quickly. If not, main memory, which is significantly larger and composed of slower and cheaper D-RAM, is accessed. If the requested information is in main memory, it is provided to the cache, which then provides it to the CPU. If not, the hard drive, which contains all information stored in the machine, is accessed. The hard drive offers a vast amount of storage space at an affordable price; however, accessing it is slow. So, fundamentally, the closer to the CPU a level in the memory hierarchy is located, the faster, smaller, and more expensive it is.

In order to create the illusion of having lots of fast memory available, it is crucial that, with high probability, the cache contains the data the CPU is looking for, such that the main memory and especially the hard drive get accessed only sporadically. Fortunately, not all data in the entire address space is equally likely to be accessed: usually, only a small portion of the entire address space is accessed over any range of 10-100 lines of code. This is because of locality:

  • Temporal locality: recently accessed memory content tends to get accessed again soon. For example, most programs have simple loops which cause instructions and data to be referenced repeatedly. In

for(i=0; i<100; i++)

the memory location that stores the variable i will be referenced repeatedly, as well as the locations that contain the sequence of machine instructions that encode the loop.

  • Spatial locality: memory content that is located nearby recently accessed memory content tends to be accessed as well, within the next 10-100 clock cycles. For example, program instructions are usually accessed sequentially, if no branches or jumps occur. Also, reading or writing arrays usually results in accessing memory sequentially.

[Figure: two access sequences across Registers (CPU), Cache, Main Memory, and Disk. HIT: the CPU requests A, A is found in the cache, and the cache provides the CPU with the contents of A. MISS: the CPU requests A, A is not found in the cache, the cache sends a request to main memory, A is found in main memory, its contents are copied to the cache, and the cache passes the contents to the CPU.]

Figure 5.2: Memory access, resulting in a hit or a miss.

5.2. CACHE 5

  1. the content is available in the cache, and,
  2. if yes, where it can be found in the cache.

The latter will depend on how the cache is organized. In this subsection, we assume the simplest possible organization, called direct-mapped cache. In direct-mapped cache, the content of a location in main memory can be stored at one and only one, specific location in the cache, i.e., it is “mapped” to exactly one location in the cache. Figure 5.3 shows how 32 sequential memory locations are cyclically mapped into only 8 cache locations.

[Figure: the CPU's view of memory, with 32 locations (addresses labeled 00001 through 11111), mapped cyclically onto the 8 cache locations with indices 000 through 111.]

Figure 5.3: Direct-mapped cache.

Such cyclical mapping ensures that every block in main memory (a one-word block, in this subsection) could be stored at just one single location in the cache, indexed by:

(Block address in main memory) MOD (# blocks in the cache),

which uniquely depends on the address of the block in main memory. It is clear from Figure 5.3 that this is a many-to-one mapping. This answers the second question above: if the CPU requests the contents of a specific memory location, there is only one block in the cache, indexed by the result of the previous equation, that could possibly contain that


corresponding block in the cache is invalid (V = 0), so we have a cache miss. The block containing the requested word is copied into the cache from the next level below in the memory hierarchy (i.e., the main memory), the tag bits are set to 10 (the most significant bits of the word address 10110₂ that are not used to index the cache) and the valid bit is set (as the cache block is now valid), resulting in the following state of the cache.

Index  V  Tag  Data (block = 32 bits)
000    0
001    0
010    0
011    0
100    0
101    0
110    1  10   Mem[10110₂]
111    0

The next access is at word address 11010₂. The index bits are 010. The corresponding block in the cache is invalid again, so we have a cache miss, copy the appropriate block from main memory, set the tag bits to 11 and the valid bit to 1, resulting in the cache state below.

Index  V  Tag  Data (block = 32 bits)
000    0
001    0
010    1  11   Mem[11010₂]
011    0
100    0
101    0
110    1  10   Mem[10110₂]
111    0

10110₂ is accessed next. The index bits are 110. The corresponding block of the cache is valid (V = 1), with tag bits 10, which match the tag bits of the word address 10110₂. This implies a cache hit, so the cache can promptly provide the CPU with the requested data, Mem[10110₂]. 10000₂ is accessed next, with index 000, which corresponds to an invalid cache block and thus a miss. Copying the right block from main memory into the cache and adjusting tag and valid bit results in the following state of the cache.

Index  V  Tag  Data (block = 32 bits)
000    1  10   Mem[10000₂]
001    0
010    1  11   Mem[11010₂]
011    0
100    0
101    0
110    1  10   Mem[10110₂]
111    0

Lastly, 10010₂ is accessed. The block indexed by 010 is valid; however, the tag bits of the word address, 10, don’t match the tag of the corresponding cache block, which is 11. This implies that the block indexed by 010 in the cache is storing the memory word at 11010₂ and not the memory word at 10010₂. Therefore, we have a cache miss and replace this block in the cache by a new block, i.e., the contents of 10010₂ in main memory. After updating the tag, the cache has been updated as follows.

Index  V  Tag  Data (block = 32 bits)
000    1  10   Mem[10000₂]
001    0
010    1  10   Mem[10010₂]
011    0
100    0
101    0
110    1  10   Mem[10110₂]
111    0
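The access sequence traced above can be replayed with a small simulation. The following is only a minimal sketch, assuming an initially empty 8-block direct-mapped cache with one-word blocks; the class name `DirectMappedCache` and its method `access` are illustrative and not part of the text.

```python
# Minimal sketch of a direct-mapped cache with one-word blocks.
# The class and method names are illustrative assumptions.

class DirectMappedCache:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.valid = [False] * num_blocks   # valid bit per cache block
        self.tag = [None] * num_blocks      # tag bits per cache block

    def access(self, word_address):
        """Return 'hit' or 'miss'; on a miss, install the new block."""
        index = word_address % self.num_blocks   # low-order address bits
        tag = word_address // self.num_blocks    # remaining high-order bits
        if self.valid[index] and self.tag[index] == tag:
            return "hit"
        # Miss: the block is copied from main memory, tag and valid bit are set.
        self.valid[index] = True
        self.tag[index] = tag
        return "miss"


cache = DirectMappedCache(num_blocks=8)
trace = [0b10110, 0b11010, 0b10110, 0b10000, 0b10010]
results = [cache.access(address) for address in trace]
print(results)  # ['miss', 'miss', 'hit', 'miss', 'miss']
```

The final contents of `cache.valid` and `cache.tag` agree with the last table: blocks 000, 010 and 110 are valid, each with tag 10.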

Using 32-bit byte addressing in MIPS R

We can now use this direct-mapped cache organization for the MIPS R2000 architecture we designed before, which uses 32-bit byte addressing. Assume we want to build a cache that contains 64 Kbyte of data. That will require space for 2^16 bytes of data, i.e., 2^14 words. Since each line in the cache contains a one-word block of data, this cache requires 2^14 lines, to be indexed by 14 index bits (the 14 LSB of the word address). Ignoring the 2 least significant bits of the 32-bit byte address (the so-called “byte offset”, to specify specific bytes in each word, is ignored because MIPS R2000 usually reads and writes words) leaves the 16 MSB of the address for the tag (32 - 14 index bits - 2 byte offset bits = 16 tag bits), as shown in Figure 5.4.

The total size of the cache is given by (1 + 16 + 32) × 2^14 = 49 × 16 × 2^10 = 784 Kbits. So, building a cache to contain 64 Kbyte = 512 Kbits of data requires 784 Kbits of memory. A hit occurs when the 16 MSB of the address match the tag and the valid bit is set.
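The size calculation can be checked with a few lines of code. This is a sketch under the assumptions stated in the text (32-bit byte addresses, 4-byte words, one valid bit per line, a power-of-two number of lines); the function name `direct_mapped_cache_bits` and its parameters are hypothetical.

```python
def direct_mapped_cache_bits(data_bytes, address_bits=32, block_words=1, word_bytes=4):
    """Return (tag_bits, index_bits, total_bits) for a direct-mapped cache.

    Assumes all sizes are powers of two and one valid bit per line.
    """
    num_lines = data_bytes // (block_words * word_bytes)
    index_bits = num_lines.bit_length() - 1          # log2(num_lines)
    block_offset_bits = block_words.bit_length() - 1  # log2(block_words)
    byte_offset_bits = word_bytes.bit_length() - 1    # log2(word_bytes)
    tag_bits = address_bits - index_bits - block_offset_bits - byte_offset_bits
    bits_per_line = 1 + tag_bits + block_words * word_bytes * 8  # valid + tag + data
    return tag_bits, index_bits, bits_per_line * num_lines


tag_bits, index_bits, total_bits = direct_mapped_cache_bits(64 * 1024)
print(tag_bits, index_bits, total_bits // 1024)  # 16 14 784
```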


until the main memory write is completed, the effective CPI is given by

0.9 × 1 + 0.1 × (1 + 10) = 0.9 + 1.1 = 2.

Although write-through is a simple write strategy, it can deteriorate the performance significantly. To avoid the latter, we can use a write buffer, which stores the write data while it’s waiting to be written into main memory. Instead of writing to the main memory directly, the cache writes to the write buffer and allows the CPU to continue execution while data is written from the write buffer to main memory. A write buffer is a simple FIFO which can contain a small number of words (e.g., 4). This is shown in Figure 5.5.

[Figure: writes flow from the CPU to the cache, from the cache to the write buffer, and from the write buffer to main memory.]

Figure 5.5: Cache writes data to the write buffer, which then writes it to the main memory.

When a write to main memory completes, the corresponding entry in the write buffer is freed up. If, however, the write buffer is full when the CPU reaches a new write instruction, the CPU stalls until there is an empty entry in the write buffer. The key assumption that makes a write buffer effective is that the rate at which the processor does writes is less than the rate at which the data can be written into the main memory. Because it can buffer multiple words, the write buffer can accommodate sporadic write bursts without loss of efficiency.

Alternatively, a write-back scheme could be used for writes. In a write-back scheme, a new value only gets written to the corresponding block in the cache and not to the main memory. The modified block is written back from the cache to the lower level in the memory hierarchy only when it is being replaced in the upper level. This scheme can improve the performance if the CPU issues writes at a higher rate than the rate at which the data can be written into the main memory. However, it is more complex to implement. To keep track of whether a cache block has been modified by a write (and, therefore, needs to be written back when replaced), every line in the cache is provided with an additional bit, the dirty bit. A block only gets written back into main memory, upon replacement, if the dirty bit is set.
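The effect of the dirty bit can be illustrated with a small simulation. This is a sketch only, assuming a direct-mapped cache with one-word blocks that loads a block on a write miss (write-allocate, an assumption not stated in the text); the class `WriteBackCache`, the dictionary standing in for main memory, and the counter `memory_writes` are all hypothetical.

```python
class WriteBackCache:
    """Direct-mapped, one-word blocks, write-back with a dirty bit (sketch)."""

    def __init__(self, num_blocks, memory):
        self.num_blocks = num_blocks
        self.memory = memory                 # dict: word address -> value
        self.lines = [None] * num_blocks     # [tag, data, dirty], or None if invalid
        self.memory_writes = 0               # number of write-backs to main memory

    def _load(self, address):
        index = address % self.num_blocks
        tag = address // self.num_blocks
        line = self.lines[index]
        if line is None or line[0] != tag:
            # Miss: write the old block back only if its dirty bit is set.
            if line is not None and line[2]:
                old_address = line[0] * self.num_blocks + index
                self.memory[old_address] = line[1]
                self.memory_writes += 1
            # Assumption: write-allocate, so the block is loaded on any miss.
            line = [tag, self.memory.get(address, 0), False]
            self.lines[index] = line
        return line

    def read(self, address):
        return self._load(address)[1]

    def write(self, address, value):
        line = self._load(address)
        line[1] = value
        line[2] = True                       # set the dirty bit


memory = {}
cache = WriteBackCache(num_blocks=8, memory=memory)
for value in range(5):
    cache.write(0b10110, value)              # five writes to the same word
writes_before_eviction = cache.memory_writes # 0: nothing reached main memory yet
cache.read(0b01110)                          # same index 110, different tag: evicts
print(writes_before_eviction, cache.memory_writes, memory[0b10110])  # 0 1 4
```

Five writes to the same word cost a single main memory write here, whereas a write-through cache would have written to main memory five times.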


5.2.2 Direct-mapped cache with 4-word blocks

In the previous subsection, we considered the simplest possible cache structure, where each block contains one word. In this subsection, we want the cache to take better advantage of spatial locality, by increasing the block size: by having a cache block larger than one word, multiple words, adjacent to the one the CPU requested, will be fetched from main memory when a cache miss occurs. Because of spatial locality, it is likely that the adjacent words will be requested soon as well, which would increase the cache’s hit rate. For a cache with a block size of 4 words (16 bytes) and the same total number of data bits as the cache in the previous subsection (64 KByte), we obtain the layout shown in Figure 5.6. A cache miss will fetch four words from the main memory (including the word requested by the CPU).

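[Figure: the 32-bit address is split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset or word index (bits 3-2), and a 2-bit byte offset (bits 1-0). Each cache line holds a valid bit (V), a 16-bit tag, and four 32-bit data words; the stored tag is compared with the address tag to generate the Hit signal, and a multiplexer selects one of the four 32-bit words.]

Figure 5.6: Direct-mapped cache with 4-word blocks, providing 64 KByte of data.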
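The address fields used by this cache can be extracted with shifts and masks. The sketch below assumes the field widths that follow from the text (64 KByte of data in 16-byte blocks gives 2^12 lines, hence a 12-bit index and a 16-bit tag); the function name `split_address` and the example address are hypothetical.

```python
def split_address(address, index_bits=12, block_offset_bits=2, byte_offset_bits=2):
    """Split a 32-bit byte address into (tag, index, block_offset, byte_offset)."""
    byte_offset = address & ((1 << byte_offset_bits) - 1)    # bits 1-0
    address >>= byte_offset_bits
    block_offset = address & ((1 << block_offset_bits) - 1)  # bits 3-2 (word index)
    address >>= block_offset_bits
    index = address & ((1 << index_bits) - 1)                # bits 15-4
    tag = address >> index_bits                              # bits 31-16
    return tag, index, block_offset, byte_offset


# Example address chosen arbitrarily for illustration.
tag, index, block_offset, byte_offset = split_address(0x1234ABCD)
print(hex(tag), hex(index), block_offset, byte_offset)  # 0x1234 0xabc 3 1
```

Two addresses that differ only in the block offset yield the same tag and index, so they refer to words in the same cache block.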

Comparing the direct-mapped cache with 4-word blocks, in Figure 5.6, to the direct-mapped cache with one-word blocks, in Figure 5.4, we notice the following differences:

  • The presence of block offset bits, also called word index bits. Changing these bits specifies a different word; however, since they are the least significant bits of the word address, they specify a different but adjacent word within one and the same block, corresponding to one single cache index. Thus, in case of a cache miss, the cache copies the same block from main memory into the cache if only the block offset differs. With a block size of 4 words, the block offset (or word index) consists of 2 bits. As these bits specify which word in a block is requested by the CPU, a multiplexer is provided to output the correct word, given the block offset;

[Figure: the CPU, the cache, and main memory are connected by 32-bit wide paths; cache and main memory communicate over a one-word wide bus.]

Figure 5.7: One-word wide memory.

Given the physical distance between cache and main memory, the bus is usually clocked at a slower rate than the CPU, to allow electric signals to propagate over the entire length of the bus, within one memory bus clock cycle (MBCC). Since the bus is clocked at a slower rate, 1 MBCC usually equals several CPU clock cycles. Assume that it takes

  • 1 MBCC to send an address from the cache to the main memory;
  • 15 MBCCs to access the main memory (D-RAM) once;
  • 1 MBCC to send one word of data from the main memory to the cache.

If a cache block consists of 4 words, each of the words in a block needs to be accessed and transferred sequentially, in case of a cache miss. To copy the 4-word block from the main memory to the cache, it takes

1 MBCC (to send the appropriate address to main memory)
+ 4 × 15 MBCCs (to access four words, sequentially, in main memory)
+ 4 × 1 MBCCs (to transfer the four words, sequentially, to the cache)
= 65 MBCCs

to accommodate a cache miss. During these 65 MBCCs, the CPU stalls, which makes for a very high miss penalty (also, remember that 1 MBCC is significantly longer than 1 CPU clock cycle). Using an “early restart” or “fetch requested word first” scheme is one way to reduce the miss penalty. In this subsection, we investigate direct adjustments to the interface between cache and main memory, to reduce the 65 MBCCs and improve the miss penalty. First of all, it is clear that the latency to fetch the first word from memory cannot be improved. To reduce the miss penalty, we therefore focus on increasing the memory bandwidth, to fetch the next, adjacent words from main memory faster and reduce the second (4 × 15 MBCCs) and third (4 × 1 MBCCs) terms in the previous equation.

Interleaved memory

In interleaved memory, the memory bandwidth is increased by widening the memory but not the interconnection bus, as shown in Figure 5.8.

[Figure: main memory is divided into four 32-bit wide banks, Bank 0 through Bank 3, which connect to the cache over a single 32-bit wide bus; the cache connects to the CPU.]

Figure 5.8: Interleaved memory.

Main memory is divided into four one-word wide banks, which can be accessed in parallel. The cache spends 1 MBCC to send out the address. Then, the 4 adjacent words of the appropriate block are read, in parallel, from the 4 one-word wide memory banks, which reduces the memory access time from 4 × 15 MBCCs to 1 × 15 MBCCs. Finally, the 4 words are sent to the cache, sequentially, over the bus, which takes 4 × 1 MBCCs. That reduces the total time to copy a block from main memory to 1 + 1 × 15 + 4 × 1 = 20 MBCCs.
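The two miss penalties computed so far follow from the same formula, with only the number of sequential memory accesses changing. A minimal sketch, using the cycle counts assumed in the text; the function name `miss_penalty_mbcc` and its parameter names are hypothetical.

```python
def miss_penalty_mbcc(block_words=4, address_cycles=1, access_cycles=15,
                      transfer_cycles=1, banks=1):
    """Miss penalty in memory bus clock cycles for a one-word wide bus.

    Words spread over `banks` are read in parallel, so only
    ceil(block_words / banks) memory accesses happen back to back.
    Every word still crosses the one-word wide bus individually.
    """
    sequential_accesses = -(-block_words // banks)   # ceiling division
    return (address_cycles
            + sequential_accesses * access_cycles
            + block_words * transfer_cycles)


one_word_wide = miss_penalty_mbcc(banks=1)   # 1 + 4 * 15 + 4 * 1
interleaved = miss_penalty_mbcc(banks=4)     # 1 + 1 * 15 + 4 * 1
print(one_word_wide, interleaved)  # 65 20
```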

4-word wide memory

Further improvement is possible by widening both the memory and the interconnection bus, as shown in Figure 5.9.


  • requires to compare against all tags, in parallel (since any line of the cache could contain the requested word);
  • allows a more optimal hit rate (by replacing the least frequently used blocks).

A hybrid form of both extremes of cache organization is the so-called set associative cache, where a block, copied from a lower level in the memory hierarchy, can be placed in a fixed number of locations in the cache. The group of all possible locations where a block could be placed is called a set. So, each block in the main memory maps to a unique set of locations in the cache, given by the index field (sometimes also called the set index), and a block can be placed in any element of that set (called a set entry). Therefore, all tags within a set must be checked, in parallel, to determine a cache hit (which is cheaper than checking all tags, as in a fully associative cache). It is clear that the direct-mapped cache and the fully associative cache are special cases of the set associative organization. For sets of size 2, i.e., a two-way set associative cache, the layout is shown in Figure 5.10, assuming blocks contain 4 words and the total amount of data that can be stored is, again, 64 KByte.

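[Figure: the 32-bit address is split into a 17-bit tag, an 11-bit index, a 2-bit block offset (word index), and a 2-bit byte offset. Two parallel tables each hold, per line, a valid bit (V), a 17-bit tag, and four 32-bit data words. Each table compares its stored tag with the address tag and produces its own hit signal (H); the two hit signals are combined into the overall Hit signal and select which table's 32-bit word is output.]

Figure 5.10: 2-way set associative cache with 4-word blocks.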

We represent the layout using one line for each set and as many parallel tables as there are blocks per set. Thus, a given block can only be stored in a specific line of the cache and, within each line, it can be stored in either the left or the right table. The CPU provides tag, index and block offset to both tables. Each table generates its own hit signal: if one of the hit signals is set, we have a hit; otherwise, we have a miss. The 32-bit multiplexer at the bottom determines which entry in the set is output, based on which table generated the hit (H0 or H1 set). Storing 64 KByte of data requires 2^11 sets (lines) in the cache (since there are two 4-word blocks per set), indexed by 11 bits. The remaining 17 bits are tag bits.

In case of a cache miss, a replacement policy needs to be followed to decide which block in the corresponding set will be replaced. This requires additional decision hardware. There are several possible replacement policies:

  • Random: randomly select which block to replace (cheap and easy in hardware; however, it is possible to replace a block that is often accessed).
  • LRU (Least Recently Used): replace the block which was least recently referenced (more complex and expensive hardware, but lower miss rate, assuming that the most recently referenced words are most likely to be referenced again in the near future).
  • FIFO: replace the block that has been in the cache the longest (so, blocks are replaced in the order in which they were copied into the cache, rather than the order in which they were accessed).
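To make the LRU policy concrete, the sketch below simulates a small set associative cache at the level of block addresses. It is an illustration only: the class name `SetAssociativeCache`, the choice of 4 sets, and the access trace are assumptions made for the example, and real hardware tracks recency with a few bits per set rather than an ordered dictionary.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Set associative cache with LRU replacement, tracking tags only (sketch)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set; keys are tags, ordered from least
        # recently used (first) to most recently used (last).
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block_address):
        index = block_address % self.num_sets    # set index
        tag = block_address // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)             # now the most recently used
            return "hit"
        if len(entries) == self.ways:
            entries.popitem(last=False)          # evict the least recently used
        entries[tag] = True
        return "miss"


cache = SetAssociativeCache(num_sets=4, ways=2)
trace = [0, 4, 0, 8, 0]                          # all map to set 0
results = [cache.access(block) for block in trace]
print(results)  # ['miss', 'miss', 'hit', 'miss', 'hit']
```

When block 8 arrives, LRU evicts block 4 rather than the recently reused block 0, so the final access hits. Under FIFO, block 0 would have been evicted (it was copied in first) and the final access would miss.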

5.2.5 Cache’s impact on performance

In the previous subsections, we described the effectiveness of a cache in terms of the average memory access time (AMAT). Eventually, what we are really interested in is the overall performance of the machine. We illustrate this with an example. Assume we want to compute the effective CPI of a machine with the following specifications:

  • instruction cache miss rate is 5%;
  • data cache miss rate is 10%;
  • frequency of loads and stores is 33%;
  • cache miss penalty is 12 clock cycles;
  • CPI without cache misses, CPI_NCM, is 4.

Moreover, we assume that the CPU must stall in case of cache misses. To compute the effective, total CPI, we need to know how many clock cycles are spent, averaged per instruction, on cache misses. Once we know that number (let’s call it CPI_CM), we can simply add it to CPI_NCM to obtain the total, effective CPI, since the CPU will stall on cache misses (so, the cache miss cycles can simply be added). To compute CPI_CM, let’s first compute the average number of cache misses over I instructions and then divide that by I.

cache misses over I instructions = 0.05 × I + 0.1 × 0.33 × I
                                 = 0.083 × I,
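Carrying the described procedure through (misses per instruction, multiplied by the miss penalty, added to the CPI without cache misses) can be sketched as follows. The function name `effective_cpi` and its parameter names are hypothetical; the numbers are the specifications listed above.

```python
def effective_cpi(cpi_no_misses, instr_miss_rate, data_miss_rate,
                  load_store_freq, miss_penalty):
    """Effective CPI when the CPU stalls on every cache miss."""
    # Every instruction is fetched; only loads and stores access the data cache.
    misses_per_instruction = instr_miss_rate + data_miss_rate * load_store_freq
    cpi_cache_misses = misses_per_instruction * miss_penalty
    return cpi_no_misses + cpi_cache_misses


cpi = effective_cpi(cpi_no_misses=4, instr_miss_rate=0.05, data_miss_rate=0.10,
                    load_store_freq=0.33, miss_penalty=12)
print(round(cpi, 3))  # 4.996
```

Under these specifications, cache misses add 0.083 × 12 = 0.996 clock cycles per instruction, so the effective CPI is roughly 5 instead of 4.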


5.3 Virtual Memory

5.3.1 Motivation

Just as the cache between main memory and the CPU helps speed up main memory access (and creates the illusion that lots of fast memory is available), the main memory can act as a “cache” for the next level of memory in the hierarchy, i.e., the disk, using a technique called virtual memory. Historically, virtual memory was motivated by two factors:

  1. To remove the programming burdens that arise from only having a small, limited amount of main memory available.

Indeed, programmers often desire a lot more memory than what is available. Virtual memory addresses this by allowing a program to address a large range of virtual memory locations, covering a lot more memory than there is main memory available. The physical main memory will contain a subset of the data in that large, virtual memory and what’s not in main memory will be stored on the hard disk. The CPU directly addresses the virtual memory and therefore always issues a virtual memory address.

  2. To allow efficient and safe sharing of memory amongst multiple programs, when lots of programs are running concurrently.

Indeed, at compile time, it is not known which programs will share the main memory with other programs. In fact, even when a program is running, the programs that are sharing the main memory with it usually change dynamically throughout its execution. To prevent one program from altering another program’s address space, one compiles each program in its own “virtual” address space, i.e., a separate range of memory locations that is accessible to that program only, as shown in Figure 5.11. A subset of all programs’ virtual memories’ content will be available in the physical main memory (shared by all programs, including the O.S.), while the remainder is stored on the hard drive. Virtual memory implements the translation of a program’s virtual address space to physical memory addresses and enforces protection of a program’s address space from other programs.

Based on the same principles of temporal and spatial locality, virtual memory accesses can be made efficient by storing frequently used data in the main memory and other data on the hard disk.

[Figure: each program (Program 1, Program 2, Program 9, and Program 10 are shown) has its own virtual memory, laid out as Stack, Heap, Static, Text, and Reserved segments. All programs share the same physical memory.]

Figure 5.11: Programs share the same physical memory, which contains subsets of each program’s virtual memory.

5.3.2 Virtual memory design

To understand virtual memory, we can focus on one single program. The extension to multiple programs is obtained by simply keeping a page table (as explained below) for each individual program. Virtual memory concepts are very similar to the concepts explained for cache between CPU and main memory. However, the terminology is different since both were historically motivated in a different way: while virtual memory was designed to allow programmers to work with more memory than what was physically available and share the same physical