Cache Memory Management: How to Decide What to Remove when Full - Prof. Alan L. Sussman, Study notes of Computer Science

These study notes cover cache memory management, focusing on how to decide what to remove when the cache is full. They describe the main cache organizations (fully associative, set associative, and direct mapped) and their replacement policies, and they touch on cache hits and misses and the impact of cache misses on machine performance. The notes come from a computer science course, CMSC 411, taught by Alan Sussman.

CMSC 411 - A. Sussman (from D. O'Leary) 1

Computer Systems Architecture

CMSC 411

Unit 5 – Memory Hierarchy

Alan Sussman

October 7, 2004

CMSC 411 - Alan Sussman 2

Administrivia

• HW #3 due today

– questions

• Quiz 2 Tuesday, Oct. 12

– on Unit 3, basic pipelining

– practice quiz posted, answers posted later today

– questions?

• Read Chapter 5

– except 5.11-5.15

CMSC 411 - Alan Sussman 3

Last time

• Long instructions

  • can cause structural hazards, and WAW hazards – why?
  • detect hazards early, to allow precise exceptions
    • in ID pipeline stage, and delay EX cycle if problem detected
    • can delay WB, use history or future file, let OS deal with it, to enable precise exceptions

• MIPS R4000 pipeline design

  • 8 stage pipeline – superpipelining
  • extra stages come from multi-cycle cache accesses
  • 2 cycle load delay and 3 cycle branch delay (1 delay slot, 2 cycle stall for taken branches)
  • complex FP pipeline – 8 stages used in different combinations for different operations

Cache Memory

CMSC 411 - Alan Sussman 5

Issues to consider

• How big should the fastest memory (cache memory) be?

• How do we decide what to put in cache memory?

• If the cache is full, how do we decide what to remove?

• How do we find something in cache?

• How do we handle writes?

CMSC 411 - Alan Sussman 6

First, there is main memory

• Jargon:

– frame address – which page?

– block number – which cache block?

– contents – the data

CMSC 411 - Alan Sussman 7

Then add a cache

• Jargon: Each address of a memory location is partitioned into

– block address

  • tag
  • index

– block offset
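The partition above can be sketched in C. This is only an illustration: the field widths (16-byte blocks, 8 sets) and the names `addr_fields` and `split_address` are assumptions, not values taken from the slides.

```c
#include <stdint.h>

/* Hypothetical geometry, chosen only for illustration:
   16-byte blocks (4 offset bits) and 8 sets (3 index bits). */
enum { OFFSET_BITS = 4, INDEX_BITS = 3 };

typedef struct {
    uint32_t tag;     /* high-order bits, stored and compared on lookup */
    uint32_t index;   /* selects the cache block (or set) */
    uint32_t offset;  /* selects the byte within the block */
} addr_fields;

/* Split a byte address into tag, index, and block offset. */
addr_fields split_address(uint32_t addr)
{
    addr_fields f;
    f.offset = addr & ((1u << OFFSET_BITS) - 1u);
    f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
    f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return f;
}
```

Under these assumed widths, `split_address(0x1234)` gives offset 4, index 3, and tag 36. The tag and index together form the block address.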

Fig. 5.

CMSC 411 - Alan Sussman 8

How does cache memory work?

• The following slides discuss:

– what cache memory is

– three organizations for cache memory

  • direct mapped
  • set associative
  • fully associative

– how the bookkeeping is done

• Important note: All addresses shown are in octal. Addresses in the book are usually decimal.

CMSC 411 - Alan Sussman 9

What is cache memory?

Main memory first

Main memory is divided into (cache) blocks. Each block contains many words (16-64 common now).

CMSC 411 - Alan Sussman 10

Main memory

Blocks are grouped into frames (pages), 3 frames in this picture.

CMSC 411 - Alan Sussman 11

Main memory (cont.)

Blocks are addressed by their frame number, and their block number within the frame.

CMSC 411 - Alan Sussman 12

Cache memory

Cache has many, MANY fewer blocks than main memory, each with a block number, a memory address, data, a valid bit, and a dirty bit.

CMSC 411 - Alan Sussman 19

Set associative cache (cont.)

Note that the last two bits of the memory block’s address always match the set number, so they do not need to be stored. This part of the address is called the index. The higher order bits are stored, and are called the tag. In these pictures, both the index and the tag are shown.

[Figure: cache with four sets, Set 0 through Set 3]

CMSC 411 - Alan Sussman 20

Set associative cache replacement

• Which entry in the set to replace?

• Three common choices:

– Replace an eligible random block

– Replace the least recently used (LRU) block

  • can be hard to keep track of, so often only approximated

– Replace the oldest eligible block (First In, First Out, or FIFO)
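A minimal sketch of how LRU and FIFO victim selection could look for one set, assuming each way carries two timestamps. The associativity `WAYS`, the `way_t` layout, and both function names are hypothetical. Real hardware usually approximates LRU rather than keeping full timestamps.

```c
#include <stddef.h>

#define WAYS 4  /* hypothetical associativity */

typedef struct {
    int valid;
    unsigned long tag;
    unsigned long last_used;  /* time of most recent access (LRU) */
    unsigned long loaded_at;  /* time the block was loaded (FIFO) */
} way_t;

/* LRU: use an invalid way if there is one, otherwise pick the way
   whose most recent access is oldest. */
size_t lru_victim(const way_t set[WAYS])
{
    size_t victim = 0;
    for (size_t w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    }
    return victim;
}

/* FIFO: use an invalid way if there is one, otherwise pick the way
   that was loaded earliest, regardless of later accesses. */
size_t fifo_victim(const way_t set[WAYS])
{
    size_t victim = 0;
    for (size_t w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].loaded_at < set[victim].loaded_at)
            victim = w;
    }
    return victim;
}
```

The two policies differ only in which timestamp they compare, which is why a block that is old but recently reused survives under LRU and not under FIFO.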

CMSC 411 - Alan Sussman 21

Data cache replacement – Fig. 5.

SPEC2000, in misses per 1000 instructions, by set associativity

         Two-way                 Four-way                Eight-way
Size     LRU    Random  FIFO     LRU    Random  FIFO     LRU    Random  FIFO
16KB     114.1  117.3   115.5    111.7  115.1   113.3    109.0  111.8   110.
64KB     103.4  104.3   103.9    102.4  102.3   103.1    99.7   100.5   100.
256KB    92.2   92.1    92.5     92.1   92.1    92.5     92.1   92.1    92.

Computer Systems Architecture

CMSC 411

Unit 5 – Memory Hierarchy

Alan Sussman

October 12, 2004

CMSC 411 - Alan Sussman 23

Administrivia

• Quiz 2 today

– questions?

• HW for Unit 5 out soon

CMSC 411 - Alan Sussman 24

Last time

• Main memory

  • frame address – page number
  • block number – cache block within page
  • contents – the data

• Cache memory

  • block address
    • tag – high order bits for matching
    • index – which set, for set associative caches
  • block offset – which byte within the block
  • contains way fewer blocks than main memory
  • for each cache block – block number, memory address of block it contains, data, valid bit, dirty bit

CMSC 411 - Alan Sussman 25

Last time (cont.)

• Direct mapped cache

  • each memory block can only go into 1 cache block – use low order bits of block address

• Set associative cache

  • multiple places for a memory block to go – the degree of set associativity is how many
  • don’t need to store the index (the set number), since it’s known from the cache block number – rest of block address is the tag
  • replacement policy determines which block to replace when new one is loaded (e.g., random, LRU, FIFO)

CMSC 411 - Alan Sussman 26

Fully associative cache

In fully associative cache, memory blocks may be stored anywhere.

So block 14 might be put in the first available block -- one with valid = 0.

CMSC 411 - Alan Sussman 27

Fully associative cache (cont.)

With this result.

CMSC 411 - Alan Sussman 28

Managing cache

Use direct mapped cache as an example.

After first read operation, cache memory looked like this.

[Figure: cache contents, with Valid and Dirty columns]

CMSC 411 - Alan Sussman 29

Managing cache (cont.)

If all other memory references involved block 14, no other blocks would need to be fetched from memory.

But suppose eventually need to fetch blocks 10, 31 and 66.

Need to fetch all three, because don’t have valid versions of them.

[Figure: cache contents, with Valid and Dirty columns]

CMSC 411 - Alan Sussman 30

Managing cache (cont.)

The result looks like this.

Now suppose write to block 66.

[Figure: cache contents, with Valid and Dirty columns]

CMSC 411 - Alan Sussman 37

Write through vs. write back

• Which is better?

– Write back gives faster writes, since don't have to wait for main memory

– Write back is very efficient if want to modify many bytes in a given block

– But write back can slow down some reads, since a cache miss might cause a write back

– In multiprocessors, write through might be the only correct solution. Why?
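To make the trade-off concrete, here is a small sketch that counts main-memory writes for a one-line cache under each policy. The `one_line_cache` type and `store` function are hypothetical, and the sketch assumes the block is allocated in the cache on a write miss.

```c
typedef struct {
    int valid;
    int dirty;
    unsigned block;       /* memory block currently held */
    unsigned mem_writes;  /* writes sent to main memory so far */
} one_line_cache;

/* Perform a store to `block`. write_back = 1 selects write-back,
   write_back = 0 selects write-through. Assumes write-allocate. */
void store(one_line_cache *c, unsigned block, int write_back)
{
    if (!c->valid || c->block != block) {   /* miss: replace the line */
        if (c->valid && c->dirty)
            c->mem_writes++;                /* write the dirty victim back */
        c->valid = 1;
        c->block = block;
        c->dirty = 0;
    }
    if (write_back)
        c->dirty = 1;                       /* memory is updated later */
    else
        c->mem_writes++;                    /* memory is updated now */
}
```

With eight stores to block 14 followed by one store to block 66, the write-through cache sends nine writes to memory. The write-back cache sends one, and only when block 14 is replaced.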

CMSC 411 - Alan Sussman 38

Cache summary

• Cache memory can be organized as direct mapped, set associative, or fully associative

• Can be write-through or write-back

• Extra bits such as valid and dirty bits help keep track of the status of the cache

Computer Systems Architecture

CMSC 411

Unit 5 – Memory Hierarchy

Alan Sussman

October 14, 2004

CMSC 411 - Alan Sussman 40

Administrivia

• HW for Unit 5 posted

– due date TBD

• Quizzes returned Tuesday

– answers already posted

• Grad school workshop Tuesday, Oct. 19, 5-7PM, CSIC 2117

– come ask questions to both faculty and current grad students!

CMSC 411 - Alan Sussman 41

Last time

• Fully associative cache

  • any memory block can go into any cache block

• Write through cache

  • memory gets updated immediately on write
  • reads only cause block to get loaded on miss

• Write back cache

  • writes only to cache
  • cache and main memory can be inconsistent
  • reads can cause updates from cache to memory, if block replaced is dirty

• Write through vs. write back

  • name one good feature of each

CMSC 411 - Alan Sussman 42

How much do memory stalls slow down a machine?

• Suppose that on pipelined MIPS, each instruction takes, on average, 2 clock cycles, not counting cache faults/misses

• Suppose, on average, there are 1.33 memory references per instruction, memory access time is 50 cycles, and the miss rate is 2%

• Then each instruction takes, on average:

2 + (0 × .98) + (1.33 × .02 × 50) = 3.33 clock cycles
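The same arithmetic as a small C function, so the parameters can be varied. The function and parameter names are illustrative, not from the slides.

```c
/* Average cycles per instruction including memory stalls:
   base CPI plus (memory references per instruction) x (miss rate)
   x (miss penalty in cycles). Hits add no extra cycles. */
double cycles_per_instruction(double base_cpi, double refs_per_instr,
                              double miss_rate, double miss_penalty)
{
    return base_cpi + refs_per_instr * miss_rate * miss_penalty;
}
```

Calling `cycles_per_instruction(2.0, 1.33, 0.02, 50.0)` reproduces the 3.33 cycles above. Stalls account for 1.33 of those cycles, so a 2% miss rate makes this machine about two thirds slower than its base CPI suggests.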

CMSC 411 - Alan Sussman 43

Memory stalls (cont.)

• To reduce the impact of cache misses, can reduce any of three parameters:

– main memory access time (miss penalty)

– miss rate

– cache access (hit) time

CMSC 411 - Alan Sussman 44

Reducing cache miss penalty

• 5 strategies:

– Give priority to read misses over write misses

– Don't wait for the whole block

– Use a nonblocking cache

– Multi-level cache

– Victim caches

• First 4 used in most desktop and server machines

CMSC 411 - Alan Sussman 45

Give priority to read misses over write misses

• But need to be careful

• Example:

– Suppose have a direct mapped cache, with room for 8 blocks of 16 bytes each

– Then M[512] and M[1024] both get stored in block 0, so can't be in cache at the same time

• Consider the following instructions:

SD R3, 512(R0)

LD R1, 1024(R0)

LD R2, 512(R0)
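A quick way to check that both addresses land in block 0 of this cache (8 blocks of 16 bytes, direct mapped). The helper names `cache_block` and `cache_tag` are made up for this sketch.

```c
enum { BLOCK_BYTES = 16, NUM_BLOCKS = 8 };  /* geometry from the example */

/* Direct mapped: the cache block is the low-order part of the
   memory block number. */
unsigned cache_block(unsigned addr)
{
    return (addr / BLOCK_BYTES) % NUM_BLOCKS;
}

/* The remaining high-order bits form the tag. */
unsigned cache_tag(unsigned addr)
{
    return addr / (BLOCK_BYTES * NUM_BLOCKS);
}
```

`cache_block(512)` and `cache_block(1024)` are both 0, while the tags differ (4 and 8), so each access to one of them evicts the other.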

CMSC 411 - Alan Sussman 46

Example (cont.)

  • If the cache is write-through, the SD will cause memory location 512 to be changed
  • The first LD will cause block 0 to be replaced, so that the contents of M[512] are no longer available in cache
    • If the system is write-back, this is when memory location 512 will be changed
  • Physically, the contents of block 0 will be put into temporary storage (a write buffer) while the new block is loaded, then the write back proceeds
  • The second LD again replaces block 0, but this time no write-back is necessary
  • But get a RAW hazard if don’t ensure that the write-through or write-back completes before the second LD reads memory

CMSC 411 - Alan Sussman 47

Example (cont.)

• To avoid such RAW hazards:

– Can force the read miss to always wait until the write buffer is empty

– Or can force the hardware to check the write buffer before read and only wait if there is a potential hazard

CMSC 411 - Alan Sussman 48

Another write buffer optimization

• Write buffer mechanics, with merging

  • An entry may contain multiple words (maybe even a whole cache block)
  • If there’s an empty entry, the data and address are written to the buffer, and the CPU is done with the write
  • If buffer contains other modified blocks, check to see if new address matches one already in the buffer – if so, combine the new data with that entry
  • If buffer full and no address match, cache and CPU wait for an empty entry to appear (meaning some entry has been written to main memory)
  • Merging improves memory efficiency, since multi-word writes are usually faster than writing one word at a time
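The mechanics above can be sketched as a small software model. Everything here is an assumption for illustration: the buffer has 4 entries of 4 words each, and the names `write_buffer`, `wb_write`, and `wb_used` are hypothetical. Draining entries to main memory is not modeled.

```c
#include <stdint.h>
#include <string.h>

enum { WB_ENTRIES = 4, WB_WORDS = 4 };  /* hypothetical sizes */

typedef struct {
    int      used;
    uint32_t group;               /* word address / WB_WORDS */
    uint32_t data[WB_WORDS];
    uint8_t  valid[WB_WORDS];     /* which words hold buffered data */
} wb_entry;

typedef struct {
    wb_entry entry[WB_ENTRIES];
} write_buffer;

void wb_init(write_buffer *wb) { memset(wb, 0, sizeof *wb); }

/* Buffer a one-word write. Returns 1 if accepted (merged or placed in
   an empty entry), 0 if the buffer is full and nothing matches, in
   which case the CPU would wait for an entry to drain. */
int wb_write(write_buffer *wb, uint32_t word_addr, uint32_t value)
{
    uint32_t group = word_addr / WB_WORDS;
    uint32_t slot  = word_addr % WB_WORDS;
    wb_entry *empty = NULL;

    for (int i = 0; i < WB_ENTRIES; i++) {
        wb_entry *e = &wb->entry[i];
        if (e->used && e->group == group) {   /* address match: merge */
            e->data[slot]  = value;
            e->valid[slot] = 1;
            return 1;
        }
        if (!e->used && empty == NULL)
            empty = e;
    }
    if (empty == NULL)
        return 0;                             /* full, no match: stall */

    memset(empty, 0, sizeof *empty);
    empty->used        = 1;
    empty->group       = group;
    empty->data[slot]  = value;
    empty->valid[slot] = 1;
    return 1;
}

/* Number of entries currently in use. */
int wb_used(const write_buffer *wb)
{
    int n = 0;
    for (int i = 0; i < WB_ENTRIES; i++)
        n += wb->entry[i].used;
    return n;
}
```

Writes to word addresses 100 and 101 share one entry in this model, so four entries can hold up to sixteen words when the writes are sequential.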

CMSC 411 - Alan Sussman 55

Miss rate – Fig. 5.

[Figure: miss rate; SPEC2000, LRU replacement]

CMSC 411 - Alan Sussman 56

How to reduce the miss rate?

• Use larger blocks

• Use more associativity, to reduce conflict misses

• Victim cache

• Pseudo-associative caches (won’t talk about this)

• Prefetch (hardware controlled)

• Prefetch (compiler controlled)

• Compiler optimizations

CMSC 411 - Alan Sussman 57

Increasing block size

• Want the block size large so don’t have to stop so often to load blocks

• Want the block size small so that blocks load quickly

Fig. 5.16 – SPEC

CMSC 411 - Alan Sussman 58

Increasing block size (cont.)

• So large block size reduces miss rates, but...

• Example:

– Suppose that loading a block takes 80 cycles (overhead) plus 2 clock cycles for each 16 bytes

– A block of size 64 bytes can be loaded in 80 + 2*64/16 cycles = 88 cycles (miss penalty)

– If the miss rate is 7%, then the average memory access time is 1 + .07 * 88 = 7.16 cycles
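The two formulas in this example, written as C functions so other block sizes can be tried. The names are illustrative, and the hit time of 1 cycle is the value the slide's arithmetic uses.

```c
/* Miss penalty: fixed overhead plus a per-chunk transfer cost. */
double miss_penalty(double overhead_cycles, double cycles_per_chunk,
                    double block_bytes, double chunk_bytes)
{
    return overhead_cycles + cycles_per_chunk * (block_bytes / chunk_bytes);
}

/* Average memory access time = hit time + miss rate x miss penalty. */
double avg_access_time(double hit_time, double miss_rate, double penalty)
{
    return hit_time + miss_rate * penalty;
}
```

`miss_penalty(80, 2, 64, 16)` gives 88 cycles, and `avg_access_time(1, 0.07, 88)` gives 7.16 cycles. Doubling the block to 128 bytes raises the penalty to only 96 cycles, so the larger block pays off only if the miss rate drops enough to offset that.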

CMSC 411 - Alan Sussman 59

Memory Access Times – Fig. 5.

[Table: memory access time by block size and miss penalty, for cache sizes 4K, 16K, 64K, and 256K; SPEC92 benchmarks on DEC workstation]

Computer Systems Architecture

CMSC 411

Unit 5 – Memory Hierarchy

Alan Sussman

October 19, 2004

CMSC 411 - Alan Sussman 61

Administrivia

• HW for Unit 5 posted

  • due date TBD
  • turn it in!

• Quizzes returned today

  • Average: 62
  • Median: 65 25%: 51 75%: 73
  • questions

• Grad school workshop today, 5-7PM, CSIC 2117

CMSC 411 - Alan Sussman 62

Last time

• Reducing cache miss penalty

  • priority to read misses over write misses
    • be careful to use contents of write buffer
    • can merge entries into write buffer
  • don’t wait for whole block
    • early restart or critical word first
  • use a non-blocking cache
    • works best for more complex pipelines than we’ve seen so far
  • multi-level cache
    • to capture misses in lower level caches
    • lowers effective miss penalty
  • victim cache
    • to reduce conflict misses

CMSC 411 - Alan Sussman 63

Last time (cont.)

• Reducing miss rate - compulsory, capacity, conflict misses

– use larger blocks

  • what’s the cost of larger blocks?

– use higher associativity

– victim cache

– prefetch – hardware or software/compiler

– compiler optimizations

CMSC 411 - Alan Sussman 64

Higher associativity

  • A direct-mapped cache of size N has about the same miss rate as a 2-way set-associative cache of size N/2
    • 2:1 cache rule of thumb (seems to work up to 128KB caches)
  • But associative cache is slower than direct-mapped, so the clock may need to run slower
  • Example:
    • Suppose that the clock for 2-way memory needs to run at a factor of 1.1 times the clock for 1-way memory
      • the hit time increases with higher associativity
    • Then the average memory access time for 2-way is 1.10 + miss rate × 50 (assuming that the miss penalty is 50)
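As a worked comparison, assume the 1-way hit time is 1.00 clock and the miss penalty is 50 for both organizations. The symbols m_1 and m_2 for the 1-way and 2-way miss rates are introduced here and are not from the slides. The break-even point is then:

```latex
\begin{align*}
\text{AMAT}_{1\text{-way}} &= 1.00 + m_1 \times 50 \\
\text{AMAT}_{2\text{-way}} &= 1.10 + m_2 \times 50 \\
\text{AMAT}_{2\text{-way}} < \text{AMAT}_{1\text{-way}}
  &\iff 50\,(m_1 - m_2) > 0.10
   \iff m_1 - m_2 > 0.002
\end{align*}
```

Under these assumptions, 2-way associativity wins only if it lowers the miss rate by more than 0.2 percentage points.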

CMSC 411 - Alan Sussman 65

Memory access time – Fig. 5.

[Table: memory access time by cache size (KB) and associativity: one-way, two-way, four-way, eight-way]

CMSC 411 - Alan Sussman 66

Pseudo-associative cache

• Uses the technique of chaining, with a series of cache locations to check if the block is not found in the first location

  • e.g., invert most significant bit of index part of address (as if it were a set associative cache)

• The idea:

  • Check the direct mapped address
  • Until the block is found or the chain of addresses ends, check the next alternate address
  • If the block has not been found, bring it in from memory

• Three different delays generated, depending on which step succeeds
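A minimal sketch of the lookup, assuming a chain of length two where the alternate location is found by inverting the most significant index bit. The 8-line size and the names `pa_line`, `alternate_index`, and `pseudo_lookup` are hypothetical. Each line stores the full block address here, which keeps the comparison simple.

```c
#include <stdint.h>

enum { PA_INDEX_BITS = 3, PA_LINES = 1 << PA_INDEX_BITS };  /* assumed size */

typedef struct {
    int      valid;
    uint32_t block_addr;  /* full block address held by this line */
} pa_line;

/* Alternate location: invert the most significant bit of the index. */
unsigned alternate_index(unsigned index)
{
    return index ^ (1u << (PA_INDEX_BITS - 1));
}

/* Returns 1 for a hit in the direct mapped location (fast),
   2 for a hit in the alternate location (slower),
   0 for a miss (block must come from memory, slowest). */
int pseudo_lookup(const pa_line cache[PA_LINES], uint32_t block_addr)
{
    unsigned idx = block_addr % PA_LINES;
    if (cache[idx].valid && cache[idx].block_addr == block_addr)
        return 1;

    unsigned alt = alternate_index(idx);
    if (cache[alt].valid && cache[alt].block_addr == block_addr)
        return 2;

    return 0;
}
```

The three return values correspond to the three different delays: a fast hit, a slower hit after the second check, and a full miss.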

CMSC 411 - Alan Sussman 73

Merging arrays (cont.)

Means that at least 2 blocks must be in cache to begin using the arrays.

val[0] val[1] val[2] val[3] . . .

val[64] val[65] val[66] val[67] . . .

val[size-1] key[0] key[1] key[2] key[3] . . .

CMSC 411 - Alan Sussman 74

Merging arrays (cont.)

More efficient, especially if more than two arrays are coupled this way, to store them together.

val[0] key[0] val[1] key[1] . . .

val[32] key[32] val[33] key[33] . . .

CMSC 411 - Alan Sussman 75

Merging arrays (cont.)

Can do this by making the two arrays part of a structure.

val[0] key[0] val[1] key[1] . . .

val[32] key[32] val[33] key[33] . . .
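In C, the merged layout could look like the sketch below. The names `record`, `dot_split`, and `dot_merged` are illustrative, and the loop body is an arbitrary computation that uses `val[i]` and `key[i]` together.

```c
#include <stddef.h>

/* Before: val[i] and key[i] live in separate arrays, so each pair
   comes from two different cache blocks. */
long dot_split(const int *val, const int *key, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)val[i] * key[i];
    return sum;
}

/* After: each val sits next to its key, so a pair usually arrives
   in the same cache block. */
struct record {
    int val;
    int key;
};

long dot_merged(const struct record *r, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (long)r[i].val * r[i].key;
    return sum;
}
```

Both functions compute the same result. The difference is only in how many distinct blocks must be resident while the loop runs.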

CMSC 411 - Alan Sussman 76

Technique 2: interchanging loops

Example:

For j=0, 1, …, 99
  For i=0, 1, …, 4999
    x[i][j] = 2 * x[i][j];
  End for;
End for;

CMSC 411 - Alan Sussman 77

Interchanging loops (cont.)

Notice that accesses are by columns, so the elements are spaced 100 words apart.

Blocks are bouncing in and out of cache.

For j=0, 1, …, 99
  For i=0, 1, …, 4999
    x[i][j] = 2 * x[i][j];
  End for;
End for;

CMSC 411 - Alan Sussman 78

Interchanging loops (cont.)

First color the loops:

For j=0, 1, …, 99
  For i=0, 1, …, 4999
    x[i][j] = 2 * x[i][j];
  End for;
End for;

CMSC 411 - Alan Sussman 79

Interchanging loops (cont.)

Notice that the program has the same effect if the two loops are interchanged:

For i=0, 1, …, 4999
  For j=0, 1, …, 99
    x[i][j] = 2 * x[i][j];
  End for;
End for;

CMSC 411 - Alan Sussman 80

Interchanging loops (cont.)

But with this ordering, use every element in a cache block before needing another block!

For i=0, 1, …, 4999
  For j=0, 1, …, 99
    x[i][j] = 2 * x[i][j];
  End for;
End for;
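The two orderings in C, for a 5000 × 100 array stored in row-major order as C does. The function names are made up for this sketch.

```c
enum { ROWS = 5000, COLS = 100 };

/* Column order: consecutive accesses are COLS words apart, so each
   access is likely to touch a different cache block. */
void scale_column_order(int x[ROWS][COLS])
{
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* Row order (loops interchanged): consecutive accesses are adjacent
   in memory, so every word of a block is used before moving on. */
void scale_row_order(int x[ROWS][COLS])
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

The interchange is legal here because each iteration touches only its own element, so the order of iterations does not affect the result.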

CMSC 411 - Alan Sussman 81

Technique 3: loop fusion

Example:

For i=0, 1, …, 4999
  For j=0, 1, …, 99
    x[i][j] = 2 * x[i][j];
  End for;
End for;

For i=0, 1, …, 4999
  For j=0, 1, …, 99
    y[i][j] = x[i][j] * a[i][j];
  End for;
End for;

CMSC 411 - Alan Sussman 82

Loop fusion (cont.)

Note that the loop control is the same for both sets of loops.

For i=0, 1, …, 4999
  For j=0, 1, …, 99
    x[i][j] = 2 * x[i][j];
  End for;
End for;

For i=0, 1, …, 4999
  For j=0, 1, …, 99
    y[i][j] = x[i][j] * a[i][j];
  End for;
End for;

CMSC 411 - Alan Sussman 83

Loop fusion (cont.)

And note that the array x is used in each, so probably needs to be loaded into cache twice, which wastes cycles.

For i=0, 1, …, 4999
  For j=0, 1, …, 99
    x[i][j] = 2 * x[i][j];
  End for;
End for;

For i=0, 1, …, 4999
  For j=0, 1, …, 99
    y[i][j] = x[i][j] * a[i][j];
  End for;
End for;

CMSC 411 - Alan Sussman 84

Loop fusion (cont.)

So combine, or fuse, the loops to improve efficiency.

For i=0, 1, …, 4999
  For j=0, 1, …, 99
    x[i][j] = 2 * x[i][j];
    y[i][j] = x[i][j] * a[i][j];
  End for;
End for;
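A C sketch of the same transformation. The function names and the `rows` parameter are assumptions added so the functions work on any number of 100-column rows.

```c
enum { FCOLS = 100 };

/* Two separate loop nests: x is traversed twice, so for a large
   array it is probably loaded into cache twice. */
void unfused(int rows, int x[][FCOLS], int y[][FCOLS], int a[][FCOLS])
{
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < FCOLS; j++)
            x[i][j] = 2 * x[i][j];

    for (int i = 0; i < rows; i++)
        for (int j = 0; j < FCOLS; j++)
            y[i][j] = x[i][j] * a[i][j];
}

/* Fused: each x[i][j] is reused while it is still in cache. */
void fused(int rows, int x[][FCOLS], int y[][FCOLS], int a[][FCOLS])
{
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < FCOLS; j++) {
            x[i][j] = 2 * x[i][j];
            y[i][j] = x[i][j] * a[i][j];
        }
}
```

Fusion is safe here because the second statement reads only the element the first statement has just written in the same iteration.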

CMSC 411 - Alan Sussman 91

Blocking access to arrays (cont.)

A B = C

CMSC 411 - Alan Sussman 92

Blocking access to arrays (cont.)

A B = C

CMSC 411 - Alan Sussman 93

Blocking access to arrays (cont.)

A B = C

CMSC 411 - Alan Sussman 94

Blocking access to arrays (cont.)

Instead, order the computation using rectangular blocks of A and B.

A B = C

Partial answer!

CMSC 411 - Alan Sussman 95

Blocking access to arrays (cont.)

If the block of A has k rows, then only need to load B m/k times.

A B = C

Partial answer!

CMSC 411 - Alan Sussman 96

Blocking access to arrays (cont.)

Improves temporal locality

/* Before */
for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    r=0;
    for (k=0; k<N; k++)
      r=r+y[i][k]*z[k][j];
    x[i][j]=r;
  }

/* After: B is the blocking factor; x must start out as all zeros */
for (jj=0; jj<N; jj=jj+B)
  for (kk=0; kk<N; kk=kk+B)
    for (i=0; i<N; i++)
      for (j=jj; j<min(jj+B,N); j++) {
        r=0;
        for (k=kk; k<min(kk+B,N); k++)
          r=r+y[i][k]*z[k][j];
        x[i][j]=x[i][j]+r;
      }

Computer Systems Architecture

CMSC 411

Unit 5 – Memory Hierarchy

Alan Sussman

October 21, 2004

CMSC 411 - Alan Sussman 98

Administrivia

• Quiz 2 questions?

• HW for Unit 5

– questions?

– due date posted by tomorrow

• Midterm

– will be rescheduled to later by tomorrow

CMSC 411 - Alan Sussman 99

Last time

• Reducing cache miss rate

– larger blocks

– higher associativity

  • but can make cache hits slower

– hardware prefetch

  • works well for sequential accesses
  • cost?

– software/compiler prefetch

  • instruction that moves data into cache, w/o causing exceptions or pipeline bubbles

– compiler optimizations

CMSC 411 - Alan Sussman 100

Last time (cont.)

• Compiler optimizations

– merging arrays

  • separate arrays to array of structs ordering, improve spatial locality

– loop interchange

  • to access data in the order it is stored, improve spatial locality

– loop fusion

  • to improve temporal locality

– blocking

  • improves both temporal and spatial locality

CMSC 411 - Alan Sussman 101

Reducing the time for cache hits

• K.I.S.S.

• Use virtual addresses rather than physical addresses in the cache.

• Pipeline cache accesses

• Trace caches (won’t talk about these)

CMSC 411 - Alan Sussman 102

K.I.S.S.

• Cache should be small enough to fit on the processor chip

• Direct mapped is faster than associative, especially on read

– overlap tag check with transmitting data

• For current processors, small L1 caches to keep fast clock cycle time, hide L1 misses with dynamic scheduling, and use L caches to avoid main memory accesses

CMSC 411 - Alan Sussman 109

Main memory management

• Questions:

– How big should main memory be?

– How to handle reads and writes?

– How to find something in main memory?

– How to decide what to put in main memory?

– If main memory is full, how to decide what to replace?

CMSC 411 - Alan Sussman 110

The scale of things

• Typically (as of 2000):

  • Registers : < 1 KB, access time .25 - .5 ns
  • Cache : < 8 MB, access time .5 - 25 ns
  • Main Memory : < 4 GB, access time 150 - 250 ns
  • Disk Storage : > 30 GB, access time 5,000,000 ns (5ms)

• Memory Technology: CMOS (Complementary Metal Oxide Semiconductor)

  • uses a combination of n- and p-doped semiconductor material to achieve low power dissipation.

CMSC 411 - Alan Sussman 111

Memory hardware

• DRAM: dynamic random access memory, typically used for main memory

– one transistor per data bit

– each bit must be refreshed periodically (e.g., every 8 milliseconds), so maybe 5% of time is spent in refresh

– access time < cycle time

– address sent in two halves so that fewer pins are needed on chip (row and column access)

CMSC 411 - Alan Sussman 112

Memory hardware (cont.)

• SRAM: static random access memory, typically used for cache memory

– 4-6 transistors per data bit

– no need for refresh

– access time = cycle time

– address sent all at once, for speed

CMSC 411 - Alan Sussman 113

Bottleneck

• Main memory access will slow down the CPU unless the hardware designer is careful

• Some techniques can improve memory bandwidth, the amount of data that can be delivered from memory in a given amount of time:

  • wider main memory
  • interleaved memory
  • independent memory banks
  • avoiding memory bank conflicts

CMSC 411 - Alan Sussman 114

Wider main memory

• Cache miss: If a cache block contains k words, then each cache miss involves these steps repeated k times:

  • Send the address to main memory
  • Access the word (i.e., locate it)
  • Send the word to cache, with the bits transmitted in parallel

• Idea behind wider memory: the user thinks about 32 bit words, but physical memory can have longer words

• Then the operations above are done only k/n times, where n is the number of 32 bit words in a physical word

CMSC 411 - Alan Sussman 115

Wider main memory (cont.)

• Extra costs:

– a wider memory bus: hardware to deliver 32n bits in parallel, instead of 32 bits

– a multiplexor to choose the correct 32 bits to transmit from the cache to the CPU

CMSC 411 - Alan Sussman 116

Interleaved memory

• Partition memory into banks, with each bank able to access a word and send it to cache in parallel

• Organize address space so that adjacent words live in different banks - called interleaving

• For example, 4 banks might have words with the following octal addresses:

Bank 0 Bank 1 Bank 2 Bank
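One way to picture the layout is to compute the bank for each word address. This sketch assumes simple low-order interleaving across 4 banks. The function names are hypothetical.

```c
enum { NUM_BANKS = 4 };

/* With low-order interleaving, adjacent word addresses fall in
   adjacent banks. */
unsigned bank_of(unsigned word_addr)
{
    return word_addr % NUM_BANKS;
}

/* Position of the word inside its bank. */
unsigned row_in_bank(unsigned word_addr)
{
    return word_addr / NUM_BANKS;
}
```

Using C octal literals to match the slides' notation, addresses 00, 04, 010, 014 all map to bank 0, while 01, 05, 011, 015 map to bank 1, and so on. A run of four consecutive words therefore touches each bank once.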

CMSC 411 - Alan Sussman 117

Interleaved memory (cont.)

• Note how nice interleaving is for write-through

• Also helps speed read and write-back

• Note: Interleaved memory acts like wide memory, except that words are transmitted through the bus sequentially, not in parallel

CMSC 411 - Alan Sussman 118

Independent memory banks

• Each bank of memory has its own address lines and (usually) a bus

• Can have several independent banks: perhaps

– one for instructions

– one for data

• Banks can operate independently without slowing others

CMSC 411 - Alan Sussman 119

Avoid memory bank conflicts

• By having a prime number of memory banks

• Since arrays frequently have even dimension sizes - and often dimension sizes that are a power of 2 - strides that match the number of banks (or a multiple) give very slow access

CMSC 411 - Alan Sussman 120

Example

• First access the first column of x:

  • x[0][0], x[1][0], x[2][0], ... x[255][0]

• with addresses

– K, K+512*4, K+512*8, ... K+512*something

• With 4 memory banks, all of the elements live in the same memory bank, so the CPU will stall in the worst possible way

int x[256][512];

for (j=0; j<512; j=j+1)
  for (i=0; i<256; i=i+1)
    x[i][j] = 2 * x[i][j];
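To see why the bank count matters, the sketch below counts how many distinct banks a strided walk touches. It assumes word-interleaved banks and an array that starts at word 0. The function name and the 64-bank limit are choices made for this illustration.

```c
/* Count the distinct banks touched by `accesses` references that are
   `stride_words` words apart. Supports 1 to 64 banks; returns 0 for
   any other bank count. */
unsigned distinct_banks(unsigned num_banks, unsigned stride_words,
                        unsigned accesses)
{
    unsigned char seen[64] = {0};
    unsigned count = 0;

    if (num_banks == 0 || num_banks > 64)
        return 0;

    for (unsigned i = 0; i < accesses; i++) {
        unsigned bank = (i * stride_words) % num_banks;
        if (!seen[bank]) {
            seen[bank] = 1;
            count++;
        }
    }
    return count;
}
```

Walking down a column of `x` uses a stride of 512 words. `distinct_banks(4, 512, 256)` is 1, so every access waits on the same bank, while `distinct_banks(7, 512, 256)` is 7, which is why a prime number of banks spreads the accesses out.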