






This document covers sequential consistency (SC) and cache coherence in the context of shared memory multiprocessors: the total order achieved by interleaving accesses from different processors, Lamport's definition of SC, and the implementation of SC with writeback caches. It also discusses the differences between invalidation-based and update-based protocols, focusing on the MSI and MESI protocols.
Typology: Slides







Need a more formal description of memory ordering.
How to establish the order between reads and writes from different processors to different variables?
The clearest way is to use synchronization:
P0: A=1; flag=1;
P1: while (!flag); print A;
Another example (assume A=0, B=0 initially):
P0: A=1; print B;
P1: B=1; print A;
What do you expect?
A memory consistency model is a contract between programmer and hardware regarding memory ordering.
A multiprocessor normally advertises the memory consistency model it supports.
This essentially tells the programmer what the possible correct outcomes of a program could be when run on that machine.
Cache coherence deals with memory operations to the same location, but not to different locations.
Without a formally defined order across all memory operations, it often becomes impossible to argue about what is correct and what is wrong in shared memory.
Various memory consistency models exist.
Sequential consistency (SC) is the most intuitive one, and we will focus on it now (more consistency models later).
Total order achieved by interleaving accesses from different processors.
The accesses from the same processor are presented to the memory system in program order.
Essentially, the memory system behaves like a randomly moving switch connecting the processors to memory: it picks the next access from a randomly chosen processor.
Lamport's definition of SC:
A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
Any legal re-ordering is allowed.
The program order is the order of instructions from a sequential piece of code where the programmer's intuition is preserved.
The order must produce the result a programmer expects.
Can out-of-order execution violate program order?
No. All microprocessors commit instructions in-order and that is where the state
Consider a simple example (all are zero initially)
P0: x=w+1; r=y+1;
P1: y=2; w=y+1;
Suppose the load that reads w takes a miss, so w is not ready for a long time; therefore, x=w+1 cannot complete immediately. Eventually w returns with value 3.
Inside the microprocessor, r=y+1 completes (but does not commit) before x=w+1 and gets the old value of y (possibly from cache). Eventually the instructions commit in order with x=4, r=1, y=2, w=3.
So we have the following partial orders:
P0: x=w+1 < r=y+1 and P1: y=2 < w=y+1
Cross-thread: w=y+1 < x=w+1 and r=y+1 < y=2
Combine these to get a contradictory total order: x=w+1 < r=y+1 < y=2 < w=y+1 < x=w+1.
What went wrong?
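The contradiction can be checked mechanically. The following is a minimal sketch, not part of the original slides: it treats each "a < b" constraint as a directed edge and looks for a cycle with a depth-first search. The function name find_cycle and the string labels for the operations are illustrative assumptions.

```python
from collections import defaultdict

def find_cycle(edges):
    """Return one cycle as a list of nodes, or None if the ordering is acyclic."""
    graph = defaultdict(list)
    for before, after in edges:
        graph[before].append(after)

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = defaultdict(int)
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:            # back edge: found a cycle
                return path[path.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None

# Program order within each processor.
program_order = [("x=w+1", "r=y+1"), ("y=2", "w=y+1")]
# Cross-thread orders implied by the values that were observed.
observed = [("w=y+1", "x=w+1"), ("r=y+1", "y=2")]

cycle = find_cycle(program_order + observed)
print(" < ".join(cycle))
```

Program order alone is acyclic; the cycle appears only once the observed cross-thread orders are added, which is why no single total order can explain this execution.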
Consider the following example
P0: A=1; print B;
P1: B=1; print A;
Possible outcomes for an SC machine:
(A, B) = (0,1); interleaving: B=1; print A; A=1; print B
(A, B) = (1,0); interleaving: A=1; print B; B=1; print A
(A, B) = (1,1); interleavings: A=1; B=1; print A; print B and A=1; B=1; print B; print A
(A, B) = (0,0) is impossible: the read of A must occur before the write of A, and the read of B must occur before the write of B, i.e. print A < A=1 and print B < B=1. But program order gives A=1 < print B and B=1 < print A. Thus print B < B=1 < print A < A=1 < print B, which implies print B < print B, a contradiction.
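The outcome table can be reproduced by brute force. The sketch below is an illustration under stated assumptions (the tuple encoding of operations and the helper names interleavings and run are not from the slides): it enumerates every interleaving that preserves each processor's program order, executes it against a single shared memory, and collects the printed (A, B) pairs.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence in its own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

# P0: A=1; print B;      P1: B=1; print A;
P0 = [("write", "A", 1), ("print", "B")]
P1 = [("write", "B", 1), ("print", "A")]

def run(schedule):
    """Execute one interleaving on a single memory; return printed (A, B)."""
    memory = {"A": 0, "B": 0}
    printed = {}
    for op in schedule:
        if op[0] == "write":
            memory[op[1]] = op[2]
        else:
            printed[op[1]] = memory[op[1]]
    return (printed["A"], printed["B"])

outcomes = {run(s) for s in interleavings(P0, P1)}
print(sorted(outcomes))
print((0, 0) in outcomes)
```

There are six legal interleavings in total; four of them yield (1,1), one yields (0,1), one yields (1,0), and none yields (0,0).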
Two basic requirements:
Memory operations issued by a processor must become visible to others in program order.
All processors must see the same total order of memory operations: in the previous example, for the (0,1) case both P0 and P1 should see the same interleaving: B=1; print A; A=1; print B.
The tricky part is to make sure that writes become visible in the same order to all processors.
Write atomicity: as if each write is an atomic operation.
Otherwise, two processors may end up using different values (which may still be correct from the viewpoint of cache coherence, but will violate SC).
Example (A=0, B=0 initially)
P0: A=1;
P1: while (!A); B=1;
P2: while (!B); print A;
A correct execution on an SC machine should print A=1.
A=0 will be printed only if the write to A is not visible to P2, but clearly it is visible to P1 since P1 came out of the loop.
Thus A=0 is possible if P1 sees the order A=1 < B=1 and P2 sees the order B=1 < A=1, i.e. from the viewpoint of the whole system the write A=1 was not "atomic".
Without write atomicity, P2 may proceed to print 0 with a stale value from its cache.
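A toy model makes the role of write atomicity concrete. This is a sketch under simplifying assumptions, not a description of real hardware: each processor gets its own private view of memory, and a non-atomic write may reach some processors later than others. The function simulate and the slow_targets parameter are hypothetical names introduced for illustration.

```python
def simulate(atomic_writes):
    """Replay the three-processor example; return the value P2 prints for A."""
    # Each processor has its own view of memory (think: its cached copies).
    views = {p: {"A": 0, "B": 0} for p in ("P0", "P1", "P2")}
    delayed = []  # updates that have not reached their target yet

    def write(var, value, slow_targets=()):
        for proc in views:
            if proc in slow_targets and not atomic_writes:
                delayed.append((proc, var, value))   # arrives later
            else:
                views[proc][var] = value             # visible immediately

    # P0: A=1;  the update reaches P1 quickly but is slow to reach P2.
    write("A", 1, slow_targets=("P2",))

    # P1: while (!A); B=1;  P1 already sees A==1, so it leaves the loop.
    assert views["P1"]["A"] == 1
    write("B", 1)

    # P2: while (!B); print A;  P2 sees B==1 and reads its own copy of A.
    assert views["P2"]["B"] == 1
    printed = views["P2"]["A"]

    # The delayed update finally lands, too late to affect the print.
    for proc, var, value in delayed:
        views[proc][var] = value
    return printed

print(simulate(atomic_writes=True))
print(simulate(atomic_writes=False))
```

With atomic writes every processor observes A=1 at the same point, so P2 prints 1. With the delayed delivery P2 still holds A=0 when it leaves its loop, which is exactly the SC violation described above.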
Program order from each processor creates a partial order among memory operations.
Interleaving of these partial orders defines a total order.
Sequential consistency: one of many total orders.
A multiprocessor is said to be SC if any execution on this machine is SC compliant.
Sufficient but not necessary conditions for SC:
Issue memory operations in program order.
Every processor waits for a write to complete before issuing the next operation.
Every processor waits for a read to complete, and for the write that affects the returned value to complete, before issuing the next operation (important for write atomicity).
write transactions (carrying just the modified bytes) on the bus even on write hits (not very attractive with writeback caches).
Advantage of update-based protocols: sharers continue to hit in the cache, while in invalidation-based protocols sharers will miss the next time they try to access the line.
Advantage of invalidation-based protocols: only write misses go on the bus (suited for writeback caches), and subsequent stores to the same line are cache hits.
Difficult to answer.
Depends on program behavior and hardware cost.
When is an update-based protocol good? What sharing pattern? (large-scale producer/consumer)
Otherwise it would just waste bus bandwidth doing useless updates.
When is an invalidation-based protocol good? A sequence of multiple writes to a cache line: it saves the intermediate write transactions.
Also think about the overhead of initiating small updates for every write in update protocols.
Invalidation-based protocols are much more popular.
Some systems support both, or maybe some hybrid based on the dynamic sharing pattern of a cache line.
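The trade-off can be made concrete by counting bus transactions for the two sharing patterns mentioned above. The following is a deliberately crude sketch with labeled assumptions: a single cache line, two processors, infinite caches, and one bus transaction per miss, upgrade, or update. The functions invalidate_traffic and update_traffic are hypothetical helpers, not a model of any specific machine.

```python
def invalidate_traffic(trace, n_procs=2):
    """Count bus transactions under a simple MSI-style invalidation protocol."""
    state = ["I"] * n_procs
    bus = 0
    for proc, op in trace:
        if op == "r":
            if state[proc] == "I":
                bus += 1                                  # BusRd on a read miss
                state = ["S" if s == "M" else s for s in state]
                state[proc] = "S"
        else:
            if state[proc] != "M":
                bus += 1                                  # BusRdX or BusUpgr
                state = ["I"] * n_procs                   # invalidate other copies
                state[proc] = "M"
    return bus

def update_traffic(trace, n_procs=2):
    """Count bus transactions under a simple update protocol."""
    valid = [False] * n_procs
    bus = 0
    for proc, op in trace:
        if not valid[proc]:
            bus += 1                                      # fetch the line on first touch
            valid[proc] = True
        if op == "w" and sum(valid) > 1:
            bus += 1                                      # broadcast the update to sharers
    return bus

# Producer/consumer: P0 writes, P1 reads, repeated ten times.
producer_consumer = [(0, "w"), (1, "r")] * 10
# Write run: P1 reads once, P0 writes ten times, P1 reads again.
write_run = [(1, "r")] + [(0, "w")] * 10 + [(1, "r")]

for name, trace in [("producer/consumer", producer_consumer), ("write run", write_run)]:
    print(name, invalidate_traffic(trace), update_traffic(trace))
```

Under these assumptions the producer/consumer trace costs 20 transactions with invalidation and 11 with updates, while the write run costs 3 with invalidation and 12 with updates, which matches the qualitative argument above.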
The MSI protocol forms the foundation of invalidation-based writeback protocols.
Assumes only three supported cache line states: I, S, and M.
There may be multiple processors caching a line in S state.
If a line is in M state, exactly one processor caches it, and that processor is the owner of the line.
If none of the caches have the line, memory must have the most up-to-date copy of the line.
Processor requests to cache: PrRd, PrWr
Bus transactions: BusRd, BusRdX, BusUpgr, BusWB
A few things to note:
The flush operation essentially launches the line on the bus.
The processor with the cache line in M state is responsible for flushing the line on the bus whenever there is a BusRd or BusRdX transaction generated by some other processor.
On BusRd the line transitions from M to S, but not from M to I. Why?
Also, at this point both the requester and memory pick up the line from the bus; the requester puts the line in its cache in S state while memory writes the line back. Why does memory need to write back?
On BusRdX the line transitions from M to I, and this time memory does not need to pick up the line from the bus. Only the requester picks up the line and puts it in M state in its cache. Why?
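The MSI transitions described above fit in two small lookup tables. This is a sketch of the textbook state machine for a single cache line, assuming an atomic bus and ignoring BusWB on evictions; the table names PROCESSOR and SNOOP and the step function are illustrative, not taken from the slides.

```python
# (current state, processor request) -> (next state, bus transaction or None)
PROCESSOR = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusUpgr"),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}

# (current state, snooped bus transaction) -> (next state, action or None)
# Any pair not listed leaves the state unchanged with no action.
SNOOP = {
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
    ("S", "BusRdX"): ("I", None),
    ("S", "BusUpgr"): ("I", None),
}

def step(states, proc, request, log):
    """Apply one processor request and let every other cache snoop the bus."""
    next_state, bus = PROCESSOR[(states[proc], request)]
    if bus is not None:
        log.append(f"P{proc}:{bus}")
        for other in range(len(states)):
            if other == proc:
                continue
            snooped, action = SNOOP.get((states[other], bus), (states[other], None))
            if action:
                log.append(f"P{other}:{action}")
            states[other] = snooped
    states[proc] = next_state

states = ["I", "I"]
log = []
for proc, request in [(0, "PrRd"), (0, "PrWr"), (1, "PrRd"), (1, "PrWr")]:
    step(states, proc, request, log)

print(states)
print(log)
```

For the trace P0 reads, P0 writes, P1 reads, P1 writes, the log shows four bus requests (BusRd, BusUpgr, BusRd, BusUpgr) plus one Flush by P0 when P1's BusRd finds the line in M state.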
BusRd takes a cache line in M state to S state.
The assumption here is that the processor will read it soon, so save a cache miss by going to S.
May not be good if the sharing pattern is migratory: P0 reads and writes cache line A,
MESI is the most popular invalidation-based protocol; for example, it appears in the Intel Xeon MP.
Why is the E state needed?
The MSI protocol requires two transactions to go from I to M even if there are no intervening requests for the line: BusRd followed by BusUpgr.
We can save one transaction by having the memory controller respond to the first BusRd with E state if there is no other sharer in the system.
How to know if there is no other sharer? This needs a dedicated control wire that gets asserted by a sharer (wired OR).
The processor can write to a line in E state silently and take it to M state.
If a cache line is in M state, the processor holding the line is definitely responsible for flushing it on the next BusRd or BusRdX transaction.
If a line is not in M state, who is responsible? Memory, or other caches in S or E state?
The original Illinois MESI protocol assumed cache-to-cache transfer, i.e. any processor in E or S state is responsible for flushing the line.
However, it requires some expensive hardware: if multiple processors are caching the line in S state, who flushes it? Also, memory needs to wait to know if it should source the line.
Without cache-to-cache sharing, memory always sources the line unless it is in M state.
Take the following example: P0 reads x, P0 writes x, P1 reads x, P1 writes x, …
P0 generates BusRd, memory provides line, P0 puts line in cache in E state
P0 does write silently, goes to M state
P1 generates BusRd, P0 provides line, P1 puts line in cache in S state, P0 transitions to S state
Rest is identical to MSI
Consider this example: P0 reads x, P1 reads x, …
P0 generates BusRd, memory provides line, P0 puts line in cache in E state
P1 generates BusRd, memory provides line, P1 puts line in cache in S state, P0 transitions to S state (no cache-to-cache sharing)
Rest is same as MSI
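Both MESI examples can be replayed with a small state tracker. This is a minimal sketch under stated assumptions: a single cache line, an atomic bus, and a local variable shared standing in for the wired-OR control wire; flushes and the choice of data supplier are not modeled. The functions access and run are hypothetical names.

```python
def access(states, proc, request, bus_log):
    """Apply one PrRd or PrWr to the line and update every cache's MESI state."""
    state = states[proc]
    if request == "PrRd" and state == "I":
        bus = "BusRd"
    elif request == "PrWr" and state == "I":
        bus = "BusRdX"
    elif request == "PrWr" and state == "S":
        bus = "BusUpgr"
    else:
        bus = None                    # cache hit, including the silent E -> M write

    shared = False                    # models the wired-OR "shared" control wire
    if bus is not None:
        bus_log.append((proc, bus))
        for other, s in enumerate(states):
            if other == proc or s == "I":
                continue
            shared = True             # some other cache holds the line
            states[other] = "S" if bus == "BusRd" else "I"

    if request == "PrWr":
        states[proc] = "M"
    elif state == "I":                # read miss: E if nobody else has the line
        states[proc] = "S" if shared else "E"

def run(trace, n_procs=2):
    states, bus_log = ["I"] * n_procs, []
    for proc, request in trace:
        access(states, proc, request, bus_log)
    return states, bus_log

# Example 1: P0 reads x, P0 writes x, P1 reads x, P1 writes x
states1, bus1 = run([(0, "PrRd"), (0, "PrWr"), (1, "PrRd"), (1, "PrWr")])
print(states1, bus1)

# Example 2: P0 reads x, P1 reads x
states2, bus2 = run([(0, "PrRd"), (1, "PrRd")])
print(states2, bus2)
```

In the first trace only three bus requests appear because P0's write in E state is silent, one fewer than the BusRd, BusUpgr, BusRd, BusUpgr sequence that MSI needs for the same accesses. In the second trace P0 starts in E and drops to S when P1's BusRd is snooped.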