Understanding Memory Bandwidth and Data Size in Computer Architecture - Prof. William D. G, Exams of Computer Science

These slides cover the importance of memory in computer architecture performance, focusing on memory impact and instruction execution. Topics include the role of memory in performance bounds, refining those bounds, memory bandwidth versus data size, and the impact of the memory hierarchy. They also discuss spatial and temporal locality, the effects of virtual memory, and traps for the unwary.

Uploaded on 03/16/2009

Computer Architecture and Performance: Memory Impact; Instruction Execution
William Gropp
Importance of Memory in Performance Bounds

• We have seen:
  ♦ Loads and stores can be as important as floating point operations
  ♦ Simple models that look at just sustained memory bandwidth (and ignore details of cache effects) can provide useful bounds on performance
    • Recall the sparse matrix-multiply example
    • True for problems where the majority of data accesses are consecutive
  ♦ Note that this is a bound, a guaranteed-not-to-exceed value for the performance
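
As a rough illustration of such a bound, the sketch below turns a byte count and a sustained memory bandwidth into a minimum run time and a maximum floating point rate. The function names and the numbers in the usage note are assumptions made for this example; they do not come from the slides.

```c
/* Lower bound on run time (seconds) when `bytes` of data must move through
 * the memory system at a sustained rate of `bandwidth_bytes_per_s`.
 * Cache effects are deliberately ignored, as in the simple model above. */
double min_time_seconds(double bytes, double bandwidth_bytes_per_s)
{
    return bytes / bandwidth_bytes_per_s;
}

/* Upper bound on the floating point rate (flop/s) of a loop that performs
 * `flops` operations while moving `bytes` of data. */
double max_flop_rate(double flops, double bytes, double bandwidth_bytes_per_s)
{
    return flops / min_time_seconds(bytes, bandwidth_bytes_per_s);
}
```

For example, a loop y[i] = y[i] + alpha * x[i] over one million doubles loads x[i] and y[i] and stores y[i], so it moves 24 bytes and performs 2 floating point operations per iteration. With an assumed sustained bandwidth of 10^9 bytes/s, min_time_seconds(24e6, 1e9) gives 0.024 s, so the loop cannot run faster than about 83 Mflop/s under this model.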

Memory Bandwidth vs Data Size

[Figure: memory bandwidth vs. data size, showing two cache regions and main memory]

Impact of Memory Hierarchy

[Figure: STREAM performance in MB/s versus data size. Horizontal axis: Data Size (Bytes), from 10^3 to 10^8. Vertical axis: 1000 to 15000 MB/s.]
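
A curve of this kind is typically produced by timing a simple kernel over arrays of increasing size. The sketch below shows a STREAM-style triad kernel and the byte accounting used to convert a measured time into MB/s. It is not the STREAM benchmark itself: the function names are hypothetical, and the timing (for example, with a wall-clock timer around repeated calls) is left to the caller.

```c
#include <stddef.h>

/* STREAM-style triad: a[i] = b[i] + scalar * c[i]. */
void triad(double *a, const double *b, const double *c, double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Bytes counted for one triad pass over n elements:
 * two loads (b[i], c[i]) and one store (a[i]) per element. */
double triad_bytes(size_t n)
{
    return 3.0 * sizeof(double) * (double)n;
}

/* Convert bytes moved and elapsed seconds into MB/s (1 MB = 10^6 bytes). */
double mb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1.0e6;
}
```

Running the kernel for sizes that fit in each cache level, and then for sizes that spill to main memory, would be expected to show the drops in bandwidth that the figure illustrates.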

Breaking the Model: TLB

• Adding virtual memory requires an extremely fast way to convert virtual addresses to physical addresses
  ♦ The Translation Lookaside Buffer is a cache that performs this translation
  ♦ However, with typical page sizes, the TLB does not provide fast translation for all memory in cache
    • Cost of occasional TLB miss in consecutive accesses (for data in memory and not on disk) is relatively small
    • Cost for non-consecutive addresses can be very large
  ♦ Partial fix: Specify larger pages
    • No standard way to do this in language (or among flavors of Unix)
  ♦ Algorithmic fix: Change order of accesses
    • No standard way to control in language
    • Depends on page and cache line size
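
To see why consecutive and non-consecutive accesses differ so much, the sketch below counts how often a strided access sequence moves to a different page than the previous access. It is only a counting model under stated assumptions: the array starts on a page boundary, the function name is hypothetical, and a page change is only a potential TLB miss (a real TLB holds many entries, so actual misses depend on its size and replacement policy).

```c
#include <stddef.h>

/* Count how many of n accesses land on a different page than the access
 * before them. Access i touches byte offset i * stride_bytes from the start
 * of the array, which is assumed to begin on a page boundary. */
size_t page_changes(size_t n, size_t stride_bytes, size_t page_bytes)
{
    size_t changes = 0;
    size_t prev_page = 0;   /* the first access (offset 0) is on page 0 */

    for (size_t i = 1; i < n; i++) {
        size_t page = (i * stride_bytes) / page_bytes;
        if (page != prev_page)
            changes++;
        prev_page = page;
    }
    return changes;
}
```

With an assumed 4096-byte page, 1024 consecutive doubles (stride 8) change page only once, while 1024 accesses with a 4096-byte stride change page on every access (1023 times). Larger pages or a reordered access pattern both reduce this count.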

What’s Next?

• There are many details that we’ve ignored
  ♦ Can more than one operation take place at a time?
  ♦ Does each assignment require a store into memory?
  ♦ What about the other operations (loop counts and tests, array indexing, etc.)?
• Before answering these, let’s revisit the CPU

More Details

• Can more than one operation take place at a time?
  ♦ Yes, if they involve different functional units
  ♦ Or if there are multiple units of the same type, as long as enough units are available
    • Architecture Feature: Quickest way to add to peak floating point performance is to add floating point units
    • Algorithm and Programming language must make use of these
      − Discussion Question: Are there natural ways to use and express this?
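
One way a program can expose independent operations to multiple floating point units is to split a reduction across separate accumulators. The following is a minimal sketch of that idea, not a technique taken from the slides; the function name is hypothetical, and whether it actually runs faster depends on the hardware and on what the compiler already does.

```c
#include <stddef.h>

/* Sum an array using two independent accumulators. The two additions in the
 * loop body do not depend on each other, so a CPU with more than one
 * floating point add unit may execute them at the same time. */
double sum_two_accumulators(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)          /* leftover element when n is odd */
        s0 += a[i];

    return s0 + s1;
}
```

Because floating point addition is not associative, the result can differ in the last bits from a single-accumulator sum, which is one reason a compiler may not make this transformation on its own.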

More Details (2)

• Does each assignment require a store into memory?
• Consider this code in C:

    double sum = 0;
    for (int i = 0; i < n; i++) {
        sum = sum + a[i];
    }

• The value “sum” may be stored in a register, requiring no load or store.
  ♦ Making use of registers can be crucial in achieving high performance
  ♦ Recall the CPU diagram: most operations take place between operands in registers

Perils of Aliasing

• They do not compute the same value!
• Consider this usage of the routines
  ♦ Sum( &a[2], a, 3 );
  ♦ In the first case, the routine computes
    - a[2] + a[0] + a[1] + a[2] + a[0] + a[1]
    - Why?
  ♦ In the second case, the routine computes
    - a[0] + a[1] + a[2]
• When two variables may describe overlapping memory regions, they are said to alias one another
  ♦ Programming languages with pointers often permit aliasing (how can they prevent it?)
  ♦ The potential for aliasing can force the compiler to store a value (or in a different example, load it) even though the programmer does not intend to use aliased data
  ♦ Discussion Question: Is this a flaw in the programming model? If so, how would you fix it?
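
The two routines this slide compares are not included in the preview. The sketch below is a hypothetical reconstruction that reproduces the two results quoted above; the names sum_in_place and sum_via_local are invented for this example and the originals may differ.

```c
/* Hypothetical first variant: accumulates directly through the pointer, so
 * every iteration reads and writes *sum in memory. If sum points into a[],
 * later iterations see the partially updated value. */
void sum_in_place(double *sum, const double *a, int n)
{
    for (int i = 0; i < n; i++)
        *sum += a[i];
}

/* Hypothetical second variant: accumulates in a local variable (which can
 * live in a register) and stores the result once at the end. */
void sum_via_local(double *sum, const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    *sum = s;
}
```

With a = {1, 2, 3}, calling sum_in_place(&a[2], a, 3) leaves a[2] equal to 12, because the last iteration adds the already-updated a[2] to itself, while sum_via_local(&a[2], a, 3) leaves a[2] equal to 6. Since the compiler cannot rule out the first call pattern, it has to keep the store inside the loop of the first variant.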

More Details (4)

• What about the other operations (loop counts and tests, array indexing, etc.)?
  ♦ Operations on integers are relatively fast in modern CPUs
    • Exceptions include integer divide and modulus
  ♦ Branches (conditional jumps to other parts of the code, such as at a loop test) are also relatively expensive
  ♦ However, most are still faster than an L cache miss

Some Rules for Bounding Performance

• Most importantly, remember: the goal is to create an effective (but possibly approximate) bound on performance - not an estimate!
  ♦ Discussion Question: What’s the difference?
• Count the number of operations in each functional unit category:
  ♦ Loads/Stores
  ♦ Floating Point (add, subtract, multiply - divides are a special subcase)
  ♦ Other operations (integer arithmetic, branches, comparisons, etc.)
• For each of these, compute the time they will take
• The bound on the time is the max of these three
  ♦ Note: not really a bound because we’ve ignored any dependencies between the different operations
  ♦ You can refine each of these by including more detail
    - Refine load/store by considering cache
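
The counting rule above can be written down directly. The following is a minimal sketch; the struct and function names are hypothetical, and the per-category rates are assumptions that the caller must supply for a particular machine.

```c
/* Operation counts for a kernel, split by functional unit category. */
struct op_counts {
    double load_store;
    double floating_point;
    double other;           /* integer arithmetic, branches, comparisons */
};

/* Assumed sustained rates for each category, in operations per second. */
struct op_rates {
    double load_store;
    double floating_point;
    double other;
};

static double max2(double x, double y)
{
    return x > y ? x : y;
}

/* Time bound: the largest of the three per-category times. Dependencies
 * between categories are ignored, as the note above points out. */
double time_bound_seconds(struct op_counts c, struct op_rates r)
{
    double t_ls    = c.load_store / r.load_store;
    double t_fp    = c.floating_point / r.floating_point;
    double t_other = c.other / r.other;
    return max2(t_ls, max2(t_fp, t_other));
}
```

For example, with 3e6 loads/stores at 1e8 per second, 2e6 floating point operations at 2e8 per second, and 1e6 other operations at 4e8 per second, the three times are 0.03 s, 0.01 s, and 0.0025 s, so the bound is set by the loads and stores.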

Another Example: Matrix-Matrix Multiply (ddot form)

• do i=1,n
    do j=1,n
      do k=1,n
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      enddo
    enddo
  enddo
• Like transpose, but two new features:
  • Perform a calculation (we’ll see why this is important later)
  • Reuse of data: n^2 data used for n^3 operations
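
For readers more comfortable with C, the same loop nest can be sketched as follows. This is an illustrative translation, not code from the slides: the function name is hypothetical, and the matrices are assumed to be stored as flat row-major arrays (Fortran stores them column-major).

```c
#include <stddef.h>

/* c = c + a * b for n x n matrices stored row-major in flat arrays.
 * The innermost loop is a dot product of row i of a with column j of b. */
void matmul_ddot(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = c[i * n + j];   /* can stay in a register across k */
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Each of the three matrices holds n^2 values, while the loop nest performs n^3 multiply-add steps, which is the reuse the slide refers to.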

Reusing Data

• Load data into register
• Use several times (each load, even from cache, is at least a cycle)
• Use loop unrolling to expose register use
  ♦ …
    c(i,j) += a(i,k) * b(k,j)
    c(i+1,j) += a(i+1,k) * b(k,j)
    c(i,j+1) += a(i,k) * b(k,j+1)
    c(i+1,j+1) += a(i+1,k) * b(k,j+1)
• Each a(i,k) etc. used twice
  ♦ Cuts the number of loads in half
  ♦ But requires enough registers to hold all items
    - 4 registers for a(i,k), a(i+1,k), b(k,j), b(k,j+1) plus 2 registers for i, j, and 4 registers for address of a(i,k), address of b(k,j), address of c(i,j), and address of c(i,j+1).
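
The unrolled update can be sketched in C as below. This is an illustration under stated assumptions rather than the slides' own code: the function name is hypothetical, the matrices are flat row-major arrays, and n is assumed to be even so that no cleanup loop is needed.

```c
#include <stddef.h>

/* c = c + a * b with the i and j loops unrolled by 2 (n must be even).
 * Each value loaded from a and b inside the k loop is used twice. */
void matmul_unroll2x2(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i += 2) {
        for (size_t j = 0; j < n; j += 2) {
            /* Four partial results held in local variables (registers). */
            double c00 = c[i * n + j];
            double c01 = c[i * n + j + 1];
            double c10 = c[(i + 1) * n + j];
            double c11 = c[(i + 1) * n + j + 1];

            for (size_t k = 0; k < n; k++) {
                double a0 = a[i * n + k];
                double a1 = a[(i + 1) * n + k];
                double b0 = b[k * n + j];
                double b1 = b[k * n + j + 1];
                c00 += a0 * b0;
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }

            c[i * n + j] = c00;
            c[i * n + j + 1] = c01;
            c[(i + 1) * n + j] = c10;
            c[(i + 1) * n + j + 1] = c11;
        }
    }
}
```

Inside the k loop there are four loads for four multiply-adds, compared with two loads per multiply-add in the plain loop nest.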

Blocking for Cache

• Reuse data in cache by blocking
• Block for each level of memory hierarchy
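
A single level of blocking might look like the sketch below. It is a minimal illustration, not the slides' implementation: the function name and the block-size parameter bs are hypothetical, the matrices are flat row-major arrays, and a good value of bs depends on the cache size of the machine. bs must be greater than zero.

```c
#include <stddef.h>

static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* c = c + a * b, computed one bs x bs block at a time so that the blocks of
 * a, b, and c being worked on can stay in cache while they are reused. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t jj = 0; jj < n; jj += bs)
            for (size_t kk = 0; kk < n; kk += bs)
                /* Multiply one block of a by one block of b. */
                for (size_t i = ii; i < min_size(ii + bs, n); i++)
                    for (size_t j = jj; j < min_size(jj + bs, n); j++) {
                        double sum = c[i * n + j];
                        for (size_t k = kk; k < min_size(kk + bs, n); k++)
                            sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                    }
}
```

Blocking for a second level of the hierarchy would add another set of outer loops with a larger block size around these.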