Improving Cache Performance: Techniques to Reduce Miss Rate, Penalty, and Hit Time

These notes describe various techniques for improving cache performance beyond manipulating cache size, block size and set size. They cover methods for reducing miss rate, miss penalty and hit time, and the benefits of using a second level cache. They also explore the impact of cache organization and the use of write buffers, victim caches and prefetch buffers.

ECEN 5253 Digital Computer Design
Cache Performance Improvement, December 8, 2005

Cache Performance Improvement

There are many hardware techniques for improving cache performance beyond manipulating the three basic design parameters of cache size, block size and set size. These techniques can be broken down into categories for minimizing miss rate, miss penalty and hit time.

Reducing Miss Penalty

Processor speed has been increasing faster than memory speed (fig. 7.37, p. 554). This means that a given miss penalty time corresponds to more clock cycles while the processor is stalled. To get the full benefit of the high clock rates (short clock periods) of modern processors, it is very important to reduce the miss penalty in modern high performance memory system designs.

Multilevel Cache. As the first level cache is made faster by putting it inside the processor chip, main memory becomes much slower relative to the processor and first level cache. It takes many more processor cycles for the cache fill from main memory to complete, which makes the miss penalty larger relative to the hit time. The miss penalty can be reduced by inserting a second level cache (usually external to the processor chip) that can respond faster than main memory.

To see clearly how the second level cache increases performance, let's compare two memory designs, one with a second level cache and one without.

[Figure: two memory designs. 2-level cache: CPU, d-cache and i-cache, 2nd-cache, main memory. 1-level cache: CPU, d-cache and i-cache, main memory.]

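Recall that

$$\text{SPI}_{\text{miss}} = P_{\text{instruction-miss}} \cdot T_{\text{read-miss-penalty}} + \frac{IC_{\text{load}}}{IC} \cdot P_{\text{data-miss}} \cdot T_{\text{read-miss-penalty}} + \frac{IC_{\text{store}}}{IC} \cdot P_{\text{data-miss}} \cdot T_{\text{write-miss-penalty}}$$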

Let us assume for simplicity that the read and write miss penalties, $T_{\text{read-miss-penalty}}$ and $T_{\text{write-miss-penalty}}$, are the same (a write allocate cache with no write buffer, for example). The stalls per instruction due to misses in the top level cache can then be expressed as

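$$\text{SPI}_{\text{miss}} = P_{\text{miss-L1}} \cdot T_{\text{miss-penalty-L1}}$$

where

$$P_{\text{miss-L1}} = P_{\text{instruction-miss}} + \left( \frac{IC_{\text{load}}}{IC} + \frac{IC_{\text{store}}}{IC} \right) \cdot P_{\text{data-miss}}$$

$$T_{\text{miss-penalty-L1}} = T_{\text{read-miss-penalty}} = T_{\text{write-miss-penalty}}$$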

$P_{\text{miss-L1}}$ is the same in both systems, but $T_{\text{miss-penalty-L1}}$ depends on whether or not the second level cache is present.

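$$T_{\text{miss-penalty-L1}} = \begin{cases} T_{\text{hit-L2}} + P_{\text{miss-L2}} \cdot T_{\text{miss-penalty-L2}}, & \text{2-level cache} \\ T_{\text{main}}, & \text{1-level cache} \end{cases}$$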

$T_{\text{main}}$ is the response time of main memory for reading a block (assume for now that the data or instruction is always in main memory and not on disk). Also, in the 2-level cache,

$$T_{\text{miss-penalty-L2}} = T_{\text{main}}$$

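It is now clear that the miss penalty ($T_{\text{miss-penalty-L1}}$) is reduced with the 2-level cache if

$$T_{\text{hit-L2}} + P_{\text{miss-L2}} \cdot T_{\text{main}} < T_{\text{main}}$$

that is, if

$$T_{\text{hit-L2}} < (1 - P_{\text{miss-L2}}) \cdot T_{\text{main}}$$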

which is easily satisfied for reasonably low miss rates. Unfortunately, $P_{\text{miss-L2}}$ may not be as low as one might at first think. This is because the misses in the first and second level caches are not independent. If the word the processor wants is not in the first level cache, it has a much higher probability than usual of not being in the second level cache either. Even so, a second level cache that is sufficiently larger than the first level cache has miss rates that are low enough to reduce the miss penalty.

Just as for first level cache, increasing the set size and block size of second level cache is going to reduce the second level miss rate. As we have just shown, this also reduces the average miss penalty.
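The comparison of the two designs can be checked numerically. The following is a minimal sketch in Python, not part of the original notes; the function names and the timing and miss-rate numbers are hypothetical and chosen only for illustration.

```python
def l1_miss_penalty(t_main, t_hit_l2=None, p_miss_l2=None):
    """Return T_miss-penalty-L1 in the same time units as the inputs.

    With only t_main given, this models the 1-level design (penalty = T_main).
    With the L2 parameters given, it models the 2-level design:
    T_hit-L2 + P_miss-L2 * T_main.
    """
    if t_hit_l2 is None or p_miss_l2 is None:
        return t_main
    return t_hit_l2 + p_miss_l2 * t_main


def l2_reduces_penalty(t_main, t_hit_l2, p_miss_l2):
    """Condition from the text: T_hit-L2 < (1 - P_miss-L2) * T_main."""
    return t_hit_l2 < (1.0 - p_miss_l2) * t_main


# Hypothetical values (in ns), assumed for illustration only.
T_MAIN = 100.0
T_HIT_L2 = 10.0
P_MISS_L2 = 0.25

one_level = l1_miss_penalty(T_MAIN)
two_level = l1_miss_penalty(T_MAIN, T_HIT_L2, P_MISS_L2)
print(one_level, two_level)  # 100.0 35.0
print(l2_reduces_penalty(T_MAIN, T_HIT_L2, P_MISS_L2))  # True
```

With these assumed numbers, the second level cache cuts the first level miss penalty from 100 ns to 35 ns even though one in four L2 accesses still goes to main memory.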

Handling Read Miss After Write Miss. We have already seen how a write buffer can be used to eliminate the miss penalty when a write miss occurs. The write data are stored in a write buffer while the data is being written into lower level memory. Meanwhile, the processor continues operation.

Suppose that a subsequent read miss occurs before the write buffer has a chance to empty. This is very similar to the RAW data hazard, but this time it is a memory location instead of a processor register that has the hazard.

start executing other instructions while waiting for the cache fill, then a stall would be unnecessary. This would require executing instructions out of order. The simple pipelined processors that we have studied do not allow this to happen. Superscalar processors do allow execution out of order, and a nonblocking cache would be very useful for these processors. A nonblocking cache allows the processor to continue to read and write to the cache while the cache waits for lower level memory to provide data from a previous miss. The average miss penalty for a nonblocking cache can be substantially lower than for a regular cache.

Reducing Miss Rate

Victim Cache. We have already seen the utility of a write buffer to save replaced dirty blocks that must be written back to lower level memory (write back cache). This write buffer for a write back cache must really be a small cache because we must check it to see if it has a block in it that we want to read.

A victim cache is very similar except that each replaced block is put in the victim cache regardless of whether it is dirty or not. Locality of reference implies that the blocks in the victim cache are the most likely to be used again soon.

Victim caches are usually fully associative but very small (1 to 4 blocks) so that the victim cache hit time is short. When a cache miss and a victim cache hit occur together, the block in the victim cache is swapped with the block in the cache. This swapping requires an extra cache write cycle, which is a much smaller penalty than having to fill the cache from lower down in the memory hierarchy.
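The swap behavior can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a description of any particular hardware: the main cache is taken to be direct mapped, the victim cache uses LRU replacement (the notes do not specify a policy), and the class name and access trace are hypothetical.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct mapped cache backed by a small fully associative victim cache."""

    def __init__(self, num_lines, victim_blocks):
        self.num_lines = num_lines
        self.lines = [None] * num_lines   # block address held by each line
        self.victim = OrderedDict()       # oldest entry first (assumed LRU)
        self.victim_blocks = victim_blocks
        self.hits = self.victim_hits = self.misses = 0

    def access(self, block_addr):
        index = block_addr % self.num_lines
        resident = self.lines[index]
        if resident == block_addr:
            self.hits += 1
            return "hit"
        if block_addr in self.victim:
            # Victim hit: the requested block leaves the victim cache and
            # the displaced block takes its place (the swap in the text).
            del self.victim[block_addr]
            self.victim_hits += 1
            outcome = "victim hit"
        else:
            # Full miss: the block must come from lower level memory.
            self.misses += 1
            outcome = "miss"
        self.lines[index] = block_addr
        if resident is not None:
            self._save_victim(resident)
        return outcome

    def _save_victim(self, block_addr):
        # Every replaced block is saved, dirty or not.
        self.victim[block_addr] = True
        self.victim.move_to_end(block_addr)
        while len(self.victim) > self.victim_blocks:
            self.victim.popitem(last=False)  # evict the oldest entry


def run(trace, victim_blocks):
    cache = DirectMappedWithVictim(num_lines=4, victim_blocks=victim_blocks)
    for addr in trace:
        cache.access(addr)
    return cache.hits, cache.victim_hits, cache.misses


# Blocks 0 and 4 map to the same line and keep evicting each other.
trace = [0, 4, 0, 4, 0, 4]
print(run(trace, victim_blocks=2))  # (0, 4, 2)
print(run(trace, victim_blocks=0))  # (0, 0, 6)
```

In this conflict-heavy trace, the victim cache turns four of the six full misses into victim hits, each of which costs only the extra swap cycle.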

Prefetching. While the processor is running normally with cache hits, the memory system is free for other activities. Whenever a cache miss occurs and a new block is fetched, it is highly likely that more than one block will eventually be needed. Additional blocks can be fetched after the processor starts running again. The additional blocks are fetched before they are needed (prefetched) and stored temporarily in a prefetch buffer. When a cache miss occurs, there is a good chance that the desired block can be obtained from the prefetch buffer.

The prefetch buffer is really another small cache, similar to the victim cache, in parallel with the regular cache. The difference is that the prefetch buffer is loaded from lower in the memory hierarchy as part of the cache fill sequence, whereas the victim cache is loaded with replaced blocks discarded from the cache.

When a cache miss and a prefetch buffer hit occur together, an extra cache write cycle is required to put the new block in the cache, but this is a smaller penalty than filling the cache from lower in the memory hierarchy. On a cache miss, the miss penalty is just one cache cycle if the prefetch buffer hits. Thus, the average access time is

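$$T_{\text{cache+pre}} = P_{\text{hit}} \cdot T_{\text{hit}} + P_{\text{miss}} \cdot \left( P_{\text{hit-pre}} \cdot T_{\text{hit}} + P_{\text{miss-pre}} \cdot T_{\text{miss-penalty}} \right)$$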

so that both the cache and the prefetch have to miss before we pay the miss penalty.
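The average access time formula can be evaluated directly. The sketch below is illustrative only; the function name and the numbers are hypothetical, and it assumes $P_{\text{hit}} = 1 - P_{\text{miss}}$ and $P_{\text{hit-pre}} = 1 - P_{\text{miss-pre}}$.

```python
def access_time_with_prefetch(p_miss, p_miss_pre, t_hit, t_miss_penalty):
    """Average access time T_cache+pre from the formula in the text.

    p_miss      : miss rate of the regular cache
    p_miss_pre  : miss rate of the prefetch buffer, given a cache miss
    t_hit       : cache hit time (also the cost of a prefetch buffer hit)
    t_miss_penalty : cost when both the cache and the prefetch buffer miss
    """
    p_hit = 1.0 - p_miss
    p_hit_pre = 1.0 - p_miss_pre
    return p_hit * t_hit + p_miss * (
        p_hit_pre * t_hit + p_miss_pre * t_miss_penalty
    )


# Hypothetical values in clock cycles, assumed for illustration.
with_prefetch = access_time_with_prefetch(0.1, 0.5, 1.0, 50.0)
# Setting p_miss_pre = 1.0 makes every cache miss pay the full penalty,
# which reduces the formula to P_hit * T_hit + P_miss * T_miss-penalty.
no_prefetch = access_time_with_prefetch(0.1, 1.0, 1.0, 50.0)
print(round(with_prefetch, 2), round(no_prefetch, 2))  # 3.45 5.9
```

Under these assumed numbers, a prefetch buffer that catches half of the cache misses lowers the average access time from 5.9 to 3.45 cycles.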

Prefetching only makes sense if there is extra time on the memory bus to fetch more blocks, that is, the cache miss rate must be sufficiently low that it is unlikely that another cache miss occurs while the prefetching is going on. It is not certain that the prefetched block will ever be used. We certainly do not want prefetching to delay the cache fill sequence for a real cache miss block unless we can be more certain that the prefetch block will actually be needed.

The optimizing compiler can often predict when prefetching would be advantageous. The compiler can control prefetching by using special prefetch instructions that are provided as part of the instruction set.

Reducing Hit Time

The hit time of top level cache is important in determining the processor clock rate. The hit time of lower level cache determines the miss penalty of higher level cache.

Small and Simple First Level Cache. In general, a smaller memory is faster so that smaller caches have shorter hit times. Also, most first level caches are on chip and must be small enough to fit in the chip along with the processor.

Direct mapped cache has the simplest hardware and is slightly faster than set associative cache. As the set size increases, there is extra delay from the additional hardware needed for the comparators and for multiplexing the data from the different sets (see fig. 7.17, p. 503).

With a small on chip cache, a high miss rate is expected. Does it really do any good to have on chip cache or should we slow the processor down to work with off chip cache that can be big enough to have a low miss rate? To answer this question, let us compare memory systems with and without on chip cache.

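[Figure: two memory systems. On chip cache: a chip containing the CPU with d-cache and i-cache, followed by 2nd-cache and main memory. Off chip cache: a chip containing the CPU, followed by 2nd-cache and main memory.]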

Assume that the same off chip cache (2nd-cache) is used in both systems and that read and write miss penalties are the same. The clock period for the on chip cache system, $T_{\text{on-chip}}$, is shorter than that for the off chip cache system, $T_{\text{off-chip}}$.

of correlations between misses in L1 and L2 cache. Then, on-chip cache is best when

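$$\text{CPI} \cdot T_{\text{hit-L1}} + P_{\text{miss-L1-on-chip}} \cdot T_{\text{hit-L2}} + P_{\text{miss-L2-off-chip}} \cdot T_{\text{main}} < \text{CPI} \cdot T_{\text{hit-L2}} + P_{\text{miss-L2-off-chip}} \cdot T_{\text{main}}$$

or

$$\text{CPI} \cdot T_{\text{hit-L1}} + P_{\text{miss-L1-on-chip}} \cdot T_{\text{hit-L2}} < \text{CPI} \cdot T_{\text{hit-L2}}$$

which is satisfied when

$$T_{\text{hit-L1}} < \left( 1 - \frac{P_{\text{miss-L1-on-chip}}}{\text{CPI}} \right) \cdot T_{\text{hit-L2}}$$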

Even a small cache with a large miss rate, for example 0.5 (50%), would give a performance improvement provided that $T_{\text{hit-L1}}$ is significantly smaller than $T_{\text{hit-L2}}$.
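The final condition, that the L1 hit time must be less than (1 - miss rate / CPI) times the L2 hit time, gives an upper bound on the L1 hit time for which the on chip cache pays off. The following sketch evaluates that bound; the function names and the values of CPI and hit time are hypothetical.

```python
def max_l1_hit_time(p_miss_l1, cpi, t_hit_l2):
    """Largest T_hit-L1 for which the on chip cache still wins:
    (1 - P_miss-L1-on-chip / CPI) * T_hit-L2."""
    return (1.0 - p_miss_l1 / cpi) * t_hit_l2


def on_chip_is_better(t_hit_l1, p_miss_l1, cpi, t_hit_l2):
    return t_hit_l1 < max_l1_hit_time(p_miss_l1, cpi, t_hit_l2)


# Hypothetical values: 50% L1 miss rate, CPI of 2, 8 ns L2 hit time.
bound = max_l1_hit_time(p_miss_l1=0.5, cpi=2.0, t_hit_l2=8.0)
print(bound)                                   # 6.0
print(on_chip_is_better(2.0, 0.5, 2.0, 8.0))   # True
print(on_chip_is_better(7.0, 0.5, 2.0, 8.0))   # False
```

With these assumed numbers, the on chip cache helps whenever its hit time is below 6 ns, despite missing on half of all accesses.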

Delayed Write Buffer. In a typical cache, for example fig. 7.17, p. 503, we are able to read a data block and read the tag in parallel since reading an incorrect block does no harm if there is a cache miss. Unfortunately, we cannot write a data block and read the tag in parallel since writing an incorrect block would be a disaster if there is a cache miss. A clever way around this problem is to use a delayed write buffer.

[Figure: block diagram of the delayed write buffer, showing the CPU, data cache tag, delayed write buffer and data cache data, with address, data and hit signals.]

Instead of writing directly into cache, all writes (stores) go to the delayed write buffer. Since the cache is not altered, the cache data is not corrupted if a miss occurs. Thus, writing to the delayed write buffer and reading the tag can be done in parallel. Then, the delayed write buffer contents are invalidated if the tag comparison indicates a miss. If writes are to take the same amount of time as reads, there is no time left, at this point, to write the data to cache from the delayed write buffer. Instead the write data is left in the delayed write buffer (that is why it is a “delayed write” buffer).

The next time a write occurs, the old contents of the delayed write buffer are written to cache (provided they are valid) while the tag for the new data is read. Meanwhile, new write data is written into the delayed write buffer and the cycle repeats.

At any given time, the data from the last write are in the delayed write buffer, not the cache. If a subsequent read is from the same location, the data must be provided from the delayed write buffer. For this reason, the delayed write buffer is implemented as an associative cache. A hit on the delayed write buffer overrides the cache hit (incorrect data is in the cache).
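The write, retire and read-override sequence described above can be modeled in a few lines. This is a toy behavioral sketch, not the hardware design from the notes: it assumes a single-entry buffer, treats each address as its own block, leaves miss handling out of scope, and uses a hypothetical class name.

```python
class DelayedWriteCache:
    """Toy model of a data cache with a one-entry delayed write buffer."""

    def __init__(self, resident):
        self.data = dict(resident)  # cache data array: address -> value
        self.buffer = None          # (address, value), or None if invalid

    def write(self, addr, value):
        """Store to addr; returns True on a tag hit, False on a miss."""
        # The previous write (if valid) is retired to the data array
        # while the tag for the new write is being checked.
        if self.buffer is not None:
            old_addr, old_value = self.buffer
            self.data[old_addr] = old_value
        hit = addr in self.data
        # The new data stays in the buffer on a hit and is invalidated
        # on a miss, so the data array is never corrupted.
        self.buffer = (addr, value) if hit else None
        return hit

    def read(self, addr):
        """Load from addr; the buffer overrides stale cache data."""
        if self.buffer is not None and self.buffer[0] == addr:
            return self.buffer[1]
        return self.data.get(addr)  # None models a read miss


cache = DelayedWriteCache({0x10: 1, 0x20: 2})
cache.write(0x10, 99)
print(cache.data[0x10], cache.read(0x10))  # 1 99
cache.write(0x20, 7)
print(cache.data[0x10], cache.read(0x20))  # 99 7
```

After the first store, the data array still holds the old value 1, yet the read returns 99 from the buffer; the second store is what finally retires 99 into the data array.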

Pipelined Cache Memory. As processor clocks get faster and faster, it becomes increasingly difficult to design cache memory that is fast enough to keep up ($T_{\text{hit}} < T$) even when it is on chip. One way to handle this is to pipeline the cache memory just as we did the processor hardware. This is called “superpipelining.”

We must divide the cache hardware (for example the direct mapped cache in fig. 7.7, p. 478) into stages. The memory array cannot be divided into stages, but the address decoding (for the cache index and block offset) can be done in a separate stage (just like the register file decoders). Also, the comparator that does the tag check can be in a separate stage. Here is an example of how that might be done for instruction memory with the MIPS.

[Figure: pipelined instruction cache. Labels: PC, index, offset, tag, index decoder, IF1/IF pipeline register, cache memory array, offset decoder, block, word, tag, hit, IF/ID pipeline register.]

The IF stage has been divided into two with the cache index decoder in the first stage. The block offset could be decoded in this stage or put in parallel with the memory array in the next stage. The index decoder provides the select lines for the cache lines, and the offset decoder provides the select lines for the block to word multiplexer, which is implemented with a single level of transmission gates to minimize delay.

The comparator provides a hit (or not) indication during the decode stage. This means that the word is used to fetch operands out of the register file as usual during the ID stage even though it is not certain that there has been a hit. This avoids an extra stage in the pipeline. When a miss occurs, the (incorrect) instruction that was started gets canceled in the EX stage and the processor stalls until the instruction cache fill is complete. A similar scheme is used in the MIPS R4000 processor.

Pipelining data memory is more complicated because data writes (stores) must be accommodated as well as data reads (loads). Without a delayed write buffer, an extra stage in the data cache pipeline is needed to write the data after the hit signal is known. This is what is done in the MIPS R4000 processor. With a delayed write buffer, the data cache pipeline can be the same length as the instruction cache pipeline.