






An introduction to cache coherence in shared memory multiprocessors. It covers the four organizations of shared memory systems, the role of caches in reducing latency and bandwidth requirements, and the importance of maintaining cache coherence to preserve the intuitive memory model for concurrent programs. The document also discusses the differences between bus-based SMP, distributed shared memory, and hierarchical shared memory systems, and the challenges of ordering memory operations in the presence of independent caches.
Typology: Slides







Shared cache
The switch is a simple controller for granting access to cache banks.
The interconnect is between the processors and the shared cache.
Which level of the cache hierarchy is shared depends on the design: chip multiprocessors today normally share the outermost level (L2 or L3 cache).
The cache and memory are interleaved to improve bandwidth by allowing multiple concurrent accesses.
Normally small scale due to heavy bandwidth demand on the switch and the shared cache.
Bus-based SMP
Scalability is limited by the shared bus bandwidth.
The interconnect is a shared bus located between the private cache hierarchies and memory.
requirement
If an access is satisfied in cache, the transaction does not appear on the interconnect, so the bandwidth requirement of the interconnect is lower (a shared L1 cache does not have this advantage).
In distributed shared memory (DSM), cache and local memory should be used cleverly.
Bus-based SMP and DSM are the two designs supported today by industry vendors.
In bus-based SMP every cache miss is launched on the shared bus so that all processors can see all transactions.
In DSM this is not the case.
It is possible to combine bus-based SMP and DSM to build hierarchical shared memory.
Sun Wildfire connects four large SMPs (28 processors each) over a scalable interconnect to form a 112-processor multiprocessor.
IBM POWER4 has two processors on-chip with private L1 caches, but shared L2 and L3 caches (this is called a chip multiprocessor); these chips are connected over a network to form scalable multiprocessors.
The next few lectures will focus on bus-based SMPs only.
Intuitive memory model
For sequential programs we expect a memory location to return the latest value written to that location.
For concurrent programs running as multiple threads or processes on a single processor we expect the same model to hold, because all threads see the same cache hierarchy (same as shared L1 cache).
For multiprocessors there remains a danger of using a stale value: in SMP or DSM the caches are not shared and processors are allowed to replicate data independently in each cache; hardware must ensure that cached values are coherent across the system and that they satisfy programmers’ intuitive memory model.
Assume a write-through cache, i.e. every store updates the value in cache as well as in memory.
P0: reads x from memory, puts it in its cache, and gets the value 5
P1: reads x from memory, puts it in its cache, and gets the value 5
P1: writes x=7, updates its cached value and memory value
P0: reads x from its cache and gets the value 5
P2: reads x from memory, puts it in its cache, and gets the value 7 (now the system is completely incoherent)
P2: writes x=10, updates its cached value and memory value
Consider the same example with a writeback cache, i.e. values are written back to memory only when the cache line is evicted from the cache.
P0 has a cached value 5, P1 has 7, P2 has 10, memory has 5 (since the caches are not write-through).
The state of the line in P1 and P2 is M (modified) while the line in P0 is clean.
Eviction of the line from P1 and P2 will issue writebacks while eviction of the line from P0 will not issue a writeback (clean lines do not need writeback).
Suppose P2 evicts the line first, and then P1.
The final memory value is 7: we lost the store x=10 from P2.
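The example above can be replayed with a small simulation. This is only a toy sketch: the `Cache` class and the `run` helper are hypothetical names introduced here, the caches hold whole values rather than lines, and there is deliberately no coherence protocol, so the stale reads and the lost store show up as in the text.

```python
# Toy model of private caches with NO coherence protocol (illustrative only).

class Cache:
    def __init__(self, memory, write_through):
        self.memory = memory              # shared dict: address -> value
        self.write_through = write_through
        self.lines = {}                   # address -> [value, dirty flag]

    def read(self, addr):
        if addr not in self.lines:        # miss: fetch the value from memory
            self.lines[addr] = [self.memory[addr], False]
        return self.lines[addr][0]        # hit: served locally, possibly stale

    def write(self, addr, value):
        if self.write_through:
            self.lines[addr] = [value, False]
            self.memory[addr] = value     # store goes to memory immediately
        else:
            self.lines[addr] = [value, True]   # writeback: line becomes dirty (M)

    def evict(self, addr):
        value, dirty = self.lines.pop(addr)
        if dirty:                         # only dirty lines are written back
            self.memory[addr] = value


def run(write_through):
    """Replay the P0/P1/P2 sequence from the text and return what was observed."""
    memory = {"x": 5}
    p0, p1, p2 = (Cache(memory, write_through) for _ in range(3))
    reads = [p0.read("x"), p1.read("x")]
    p1.write("x", 7)
    reads.append(p0.read("x"))            # P0 hits in its own cache
    reads.append(p2.read("x"))            # P2 misses and reads memory
    p2.write("x", 10)
    return memory, (p0, p1, p2), reads


memory, caches, reads = run(write_through=False)
caches[2].evict("x")                      # P2 writes back 10 first
caches[1].evict("x")                      # then P1 writes back 7
print(reads, memory["x"])
```

With `write_through=True` the same replay leaves memory at 10 while P0 still reads 5 from its cache, which is the stale-value case described above.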
For the write-through cache:
The memory value may be correct if the writes are correctly ordered.
But the system allowed a store to proceed while another cache already held a copy of the line.
Memory operation: a read (load), a write (store), or a read-modify-write.
Memory operations are assumed to take place atomically.
A memory operation is said to issue when it leaves the issue queue and looks up the cache.
A memory operation is said to perform with respect to a processor when that processor can tell, from the other memory operations it issues, that the operation has taken place.
A read is said to perform with respect to a processor when subsequent writes issued by that processor cannot affect the returned read value.
A write is said to perform with respect to a processor when a subsequent read from that processor to the same address returns the new value.
A memory operation is said to complete when it has performed with respect to all processors in the system.
Assume that there is a single shared memory and no caches.
Memory operations complete in shared memory when they access the corresponding memory locations.
Operations from the same processor complete in program order: this imposes a partial order among the memory operations.
Operations from different processors are interleaved in such a way that the program order is maintained for each processor: memory imposes some total order (many are possible).
P0: x = 8; u = y; v = 9;
P1: r = 5; y = 4; t = v;
Legal total order:
x = 8; u = y; r = 5; y = 4; t = v; v = 9;
Another legal total order:
x = 8; r = 5; y = 4; u = y; v = 9; t = v;
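The two orders above can be checked mechanically. The sketch below enumerates every interleaving of the two programs that preserves each processor's program order, and executes an order on a single shared memory with no caches. The helper names are hypothetical, and the initial value 0 for every location is an assumption; the text does not give initial values.

```python
from itertools import combinations

P0 = ["x = 8", "u = y", "v = 9"]
P1 = ["r = 5", "y = 4", "t = v"]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        chosen = set(slots)               # positions taken by operations of a
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]

def respects_program_order(order, program):
    positions = [order.index(op) for op in program]
    return positions == sorted(positions)

def is_legal(order):
    """A total order is legal if it contains every operation exactly once
    and keeps the program order of both processors."""
    return (sorted(order) == sorted(P0 + P1)
            and respects_program_order(order, P0)
            and respects_program_order(order, P1))

def execute(order):
    """Run the operations one at a time on a single shared memory.
    Assumption: every location starts at 0."""
    mem = {name: 0 for name in "xyuvrt"}
    for op in order:
        dst, src = [s.strip() for s in op.split("=")]
        mem[dst] = int(src) if src.isdigit() else mem[src]
    return mem

first = ["x = 8", "u = y", "r = 5", "y = 4", "t = v", "v = 9"]
second = ["x = 8", "r = 5", "y = 4", "u = y", "v = 9", "t = v"]
for order in (first, second):
    result = execute(order)
    print(is_legal(order), "u =", result["u"], "t =", result["t"])
```

Both orders are legal but produce different values for u and t, which is why "the latest value" only has meaning relative to one chosen total order.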
“Last” means the most recent in some legal total order.
A system is coherent if:
Reads get the last written value in the total order.
All processors see writes to a location in the same order.
Formal definition
A memory system is coherent if the values returned by reads to a memory location during an execution of a program are such that all operations to that location can form a hypothetical total order that is consistent with the serial order and has the following two properties:
Extend the philosophy of uniprocessor bus transactions.
Three phases: arbitrate for the bus, launch the command (often called the request) and address, transfer the data.
Every device connected to the bus can observe the transaction.
The appropriate device responds to the request.
In SMP, processors also observe the transactions and may take appropriate actions to guarantee coherence.
The other device on the bus that will be of interest to us is the memory controller (the north bridge in standard motherboards).
Depending on the bus transaction, a cache block executes a finite state machine implementing the coherence protocol.
What about read hits?
They take place transparently in the cache.
But they are correct as long as they are correctly ordered with respect to writes.
All writes appear on the bus and hence are visible immediately in the presence of an atomic bus.
In general, reads that fall between two writes can happen in any order without violating coherence.
Writes establish a partial order.
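To make the snooping idea concrete, here is a minimal sketch of a two-state (valid/invalid) protocol for write-through caches on an atomic bus. The text above does not spell out a particular protocol, so the invalidation-based, write-no-allocate design and the `Bus` and `SnoopyCache` names are assumptions made for illustration.

```python
# Sketch: two-state (Valid/Invalid) snooping for write-through caches.
# Assumptions: invalidation-based, write-no-allocate, atomic bus.

class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []
        self.log = []                     # transactions in the order the bus saw them

    def attach(self, cache):
        self.caches.append(cache)

    def bus_read(self, requester, addr):
        self.log.append(("BusRd", requester.name, addr))
        return self.memory[addr]          # memory is always up to date here

    def bus_write(self, requester, addr, value):
        self.log.append(("BusWr", requester.name, addr, value))
        self.memory[addr] = value
        for cache in self.caches:         # every other cache snoops the write
            if cache is not requester:
                cache.snoop_write(addr)


class SnoopyCache:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.lines = {}                   # addr -> value; presence means Valid
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:        # read miss: Invalid -> Valid
            self.lines[addr] = self.bus.bus_read(self, addr)
        return self.lines[addr]           # read hit: no bus transaction

    def write(self, addr, value):
        if addr in self.lines:            # write hit: update the local copy
            self.lines[addr] = value
        self.bus.bus_write(self, addr, value)   # every write appears on the bus

    def snoop_write(self, addr):
        self.lines.pop(addr, None)        # Valid -> Invalid on a remote write


memory = {"x": 5}
bus = Bus(memory)
p0, p1, p2 = (SnoopyCache(name, bus) for name in ("P0", "P1", "P2"))
p0.read("x")
p1.read("x")
p1.write("x", 7)                          # invalidates P0's copy
after_first_write = p0.read("x")          # misses, so P0 refetches from memory
p2.read("x")
p2.write("x", 10)                         # invalidates the copies in P0 and P1
print(after_first_write, p0.read("x"), p1.read("x"), memory["x"])
```

Replaying the earlier P0/P1/P2 sequence, no processor can read a stale value: each write invalidates the other copies, and the bus log gives the single order in which all processors see the writes.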
High bandwidth requirement
Every write appears on the bus.
Assume a 3 GHz processor running an application with 10% store instructions, and assume a CPI of 1.
If the application runs for 100 cycles it generates 10 stores; assuming each store is 4 bytes, 40 bytes are generated per 100/3 ns, i.e. a bandwidth of 1.2 GB/s.
A 1 GB/s bus cannot even support one processor.
There are multiple processors, and there are also read misses.
Writeback caches absorb most of the write traffic.
Writes that hit in cache do not go on the bus (they are not visible to others).
This leads to a complicated coherence protocol with many choices.
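The bandwidth estimate for write-through caches can be reproduced in a few lines. The function name and parameters are hypothetical; the numbers are the ones assumed in the text (3 GHz clock, CPI of 1, 10% stores, 4 bytes per store).

```python
def write_through_bandwidth(clock_hz, cpi, store_fraction, bytes_per_store):
    """Bytes per second of store traffic that one processor puts on the bus
    when every store is written through."""
    instructions_per_second = clock_hz / cpi
    stores_per_second = instructions_per_second * store_fraction
    return stores_per_second * bytes_per_store

bw = write_through_bandwidth(clock_hz=3e9, cpi=1.0, store_fraction=0.10,
                             bytes_per_store=4)
print(f"{bw / 1e9:.1f} GB/s")             # prints 1.2 GB/s
```

The figure covers stores from a single processor only; read misses and additional processors add to it, which is the point made above about a 1 GB/s bus.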