Memory Consistency Models - Parallel Computer Architecture - Lecture Slides, Slides for Computer Science. All India Institute of Medical Sciences


Description: These are the Lecture Slides of Parallel Computer Architecture, which include Conflict Resolution, Cache Miss, Write Serialization, In-Order Response, Multi-Level Caches, Dependence Graph, etc. Key points are: Memory Consistency Models, Sequential Consistency, Relaxed Models, Total Store Ordering, Weak Ordering, Coherence Protocol, Foundation
Module 15: "Memory Consistency Models"
Lecture 34: "Sequential Consistency and Relaxed Models"
Memory Consistency Models
Memory consistency
SC
SC in MIPS R10000
Relaxed models
Total store ordering
PC and PSO
TSO, PC, PSO
Weak ordering (WO)
[From Chapters 9 and 11 of Culler, Singh, Gupta]
[Additional reading: Adve and Gharachorloo, WRL Tech Report, 1995]
Memory consistency
A coherence protocol is not enough to completely specify the output(s) of a parallel program
The coherence protocol only provides the foundation to reason about legal outcomes of
accesses to the same memory location
The consistency model tells us the possible outcomes arising from legal orderings of
accesses to all memory locations
A shared memory machine advertises the supported consistency model; it is a
“contract” with the writers of parallel software and the writers of parallelizing compilers
Implementing a memory consistency model is really a hardware-software tradeoff: a strict
model such as sequential consistency (SC) offers execution that is intuitive, but may suffer
in terms of performance; relaxed models (e.g., release consistency, RC) make program
reasoning difficult, but may offer better performance
SC
Recall that an execution is SC if the memory operations form a valid total order i.e. it is an
interleaving of the partial program orders
Sufficient conditions require that a new memory operation cannot issue until the
previous one is completed
This is too restrictive and essentially disallows any re-ordering of instructions by the
compiler or the hardware
No microprocessor that supports SC implements the sufficient conditions
Instead, all out-of-order execution is allowed, and a proper recovery mechanism is
implemented in case of a memory order violation
Let’s discuss the MIPS R10000 implementation
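The definition of SC as an interleaving of program orders can be checked mechanically for a small program. The following is a minimal sketch, not taken from the slides: it enumerates every interleaving of two short programs against a single shared memory and collects the values read. The helper names (interleavings, run) and the tuple encoding of operations are assumptions made for illustration.

```python
from itertools import combinations

def interleavings(p0, p1):
    """Yield every merge of p0 and p1 that keeps each program's own order."""
    n, m = len(p0), len(p1)
    for slots in combinations(range(n + m), n):   # positions taken by p0's ops
        slots = set(slots)
        i0, i1 = iter(p0), iter(p1)
        yield [next(i0) if k in slots else next(i1) for k in range(n + m)]

def run(order):
    """Execute one total order against a single shared memory; return the reads."""
    mem = {"A": 0, "B": 0}
    seen = {}
    for proc, kind, var, val in order:
        if kind == "W":
            mem[var] = val
        else:                                     # "R": record the value observed
            seen[(proc, var)] = mem[var]
    return seen

# P0: A=1; print B        P1: B=1; print A
P0 = [("P0", "W", "A", 1), ("P0", "R", "B", None)]
P1 = [("P1", "W", "B", 1), ("P1", "R", "A", None)]

# Each outcome is (value P0 prints for B, value P1 prints for A).
sc_outcomes = {(s[("P0", "B")], s[("P1", "A")])
               for s in map(run, interleavings(P0, P1))}
print(sorted(sc_outcomes))   # [(0, 1), (1, 0), (1, 1)]
```

No interleaving lets both processors print 0, so an execution that prints 0 twice is not SC.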
SC in MIPS R10000
Issues instructions out of program order, but commits in order
The problem is with speculatively executed loads: a load may execute and use a value
long before it finally commits
In the meantime, some other processor may modify that value through a store and the
store may commit (i.e. become globally visible) before the load commits; this may violate
SC (why?)
How do you detect such a violation?
How do you recover and guarantee an SC execution?
Any special consideration for prefetches?
Binding and non-binding prefetches
In MIPS R10000 a store remains at the head of the active list until it is completed in cache
Can we just remove it as soon as it issues and let the other instructions commit (the
store can complete from store buffer at a later point)? How far can we go and still
guarantee SC?
The Stanford DASH multiprocessor, on receiving a read reply that is already invalidated,
forces the processor to retry that load
Why can’t it use the value in the cache line and then discard the line?
Does the cache controller need to take any special action when a line is replaced from the
cache?
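One way to think about the detection and recovery questions above is a toy model in which incoming invalidations are matched against loads that have executed but not yet committed, and any match is squashed and re-executed. This is only a sketch under stated assumptions, not a description of the actual R10000 logic: the LoadQueue class, its methods, and the scenario function are hypothetical names, and a plain dictionary stands in for the coherent cache.

```python
class LoadQueue:
    """Toy load queue: one entry per load, kept in program order."""

    def __init__(self, memory, program_loads):
        self.memory = memory                      # stands in for the coherent cache
        # value None means "not executed yet" (or squashed and awaiting replay)
        self.entries = [{"addr": a, "value": None} for a in program_loads]
        self.replays = 0

    def execute(self, index):
        entry = self.entries[index]
        entry["value"] = self.memory[entry["addr"]]

    def snoop_invalidate(self, addr):
        """An invalidation for addr arrived: squash matching speculative loads."""
        for i, entry in enumerate(self.entries):
            if entry["addr"] == addr and entry["value"] is not None:
                for younger in self.entries[i:]:  # the match and everything after it
                    if younger["value"] is not None:
                        younger["value"] = None
                        self.replays += 1
                break

    def commit_all(self):
        """Commit in program order, re-executing any squashed load first."""
        results = []
        for i, entry in enumerate(self.entries):
            if entry["value"] is None:
                self.execute(i)
            results.append(entry["value"])
        return results

def scenario(snooping):
    memory = {"A": 0, "flag": 0}
    lq = LoadQueue(memory, ["flag", "A"])         # P1: read flag; read A
    lq.execute(1)                                 # load of A runs early and sees 0
    for addr in ("A", "flag"):                    # P0: A=1; flag=1 become visible
        memory[addr] = 1
        if snooping:
            lq.snoop_invalidate(addr)
    lq.execute(0)                                 # load of flag runs late and sees 1
    return tuple(lq.commit_all())                 # (flag, A) as committed by P1

print(scenario(False), scenario(True))
```

Without the check, P1 commits flag = 1 together with the stale A = 0, which no interleaving of the two programs can produce. With the check, the early load of A is replayed and P1 commits (1, 1). In the same model, a line that is replaced can no longer be snooped, so a conservative design could treat a replacement like an invalidation.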
Relaxed models
Implementing SC requires complex hardware
Is there an example that clearly shows the disaster of not implementing all of these?
Observe that the cache coherence protocol is orthogonal
But such violations are rare
Does it make sense to invest so much time (for verification) and hardware (associative
lookup logic in load queue)?
Many processors today relax the consistency model to get rid of complex hardware and
achieve some extra performance at the cost of making program reasoning complex
P0: A=1; B=1; flag=1; P1: while (!flag); print A; print B;
SC is too restrictive; relaxing it does not always violate programmers’ intuition
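The flag example above behaves as expected only if P0's writes become visible in program order. A minimal sketch of that dependence follows, under assumptions that are not from the slides: P0's three stores become visible one at a time in some order, and P1 reads A and B together at some point after it has seen flag == 1. The function name outcomes is hypothetical.

```python
from itertools import permutations

STORES = ["A", "B", "flag"]          # P0's program order: A=1; B=1; flag=1

def outcomes(visibility_orders):
    """Values P1 can print for (A, B) once it has seen flag == 1."""
    results = set()
    for order in visibility_orders:
        mem = {"A": 0, "B": 0, "flag": 0}
        for var in order:
            mem[var] = 1             # this store becomes globally visible
            if mem["flag"]:
                # P1 may leave its spin loop at any point from here on
                results.add((mem["A"], mem["B"]))
    return results

in_order = outcomes([tuple(STORES)])          # writes visible in program order
any_order = outcomes(permutations(STORES))    # writes may be re-ordered

print(sorted(in_order))    # [(1, 1)]
print(sorted(any_order))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

A model that keeps write-to-write order therefore still matches the programmer's intuition for this program, while one that re-orders writes needs explicit support to enforce the order.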
Three attributes
System specification: which orders are preserved and which are not; if not all program
orders are preserved, what support (software and hardware) is provided to enforce
a particular order that the programmer wants
Programmer’s interface: a set of rules that, if followed, lead to an execution as expected
by the programmer; normally specified in terms of high-level language annotations and
labels
Translation mechanism: how to translate the programmer’s annotations to hardware actions
Let’s take a look at a few relaxed models: TSO, PSO, PC, WO/WC, RC, DC
Total store ordering
Allows a read to bypass (i.e. commit before) an earlier incomplete write
This essentially means a blocked store at the head of the ROB can be removed (but
remains in write buffer) and subsequent instructions are allowed to commit bypassing
the blocked store
Can hide latency of write operations
Note that this is the only allowed re-ordering
Programmer’s intuition is preserved in most cases, but not always
P0: A=1; flag=1; P1: while (!flag); print A; [same as SC]
P0: A=1; B=1; P1: print B; print A; [same as SC]
P0: A=1; print B; P1: B=1; print A; [violates SC]
Implemented in many Sun UltraSPARC microprocessors
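The second and third examples above can be explored with a small store-buffer model. This is a sketch under assumptions, not a model of any UltraSPARC implementation: each processor has a FIFO write buffer, a load first checks its own buffer and otherwise reads memory, and buffered stores drain to memory at arbitrary times. The names put, tso_outcomes, and observe are hypothetical.

```python
def put(tup, i, item):
    """Return a copy of tuple tup with position i replaced by item."""
    return tup[:i] + (item,) + tup[i + 1:]

def tso_outcomes(programs, observe):
    """Enumerate the read results reachable with one FIFO write buffer per processor."""
    results = set()
    variables = {var for prog in programs for _, var, _ in prog}

    def explore(pcs, bufs, mem, reads):
        if all(pc == len(prog) for pc, prog in zip(pcs, programs)):
            results.add(tuple(reads[key] for key in observe))
            return
        for i, prog in enumerate(programs):
            if bufs[i]:                           # drain the oldest buffered store
                var, val = bufs[i][0]
                explore(pcs, put(bufs, i, bufs[i][1:]), {**mem, var: val}, reads)
            if pcs[i] < len(prog):                # execute the next instruction
                kind, var, val = prog[pcs[i]]
                pcs2 = put(pcs, i, pcs[i] + 1)
                if kind == "W":                   # store goes into the write buffer
                    explore(pcs2, put(bufs, i, bufs[i] + ((var, val),)), mem, reads)
                else:                             # load: own buffer first, then memory
                    own = [v for x, v in bufs[i] if x == var]
                    seen = own[-1] if own else mem[var]
                    explore(pcs2, bufs, mem, {**reads, (i, var): seen})

    start = tuple(() for _ in programs)
    explore(tuple(0 for _ in programs), start, dict.fromkeys(variables, 0), {})
    return results

# P0: A=1; B=1;        P1: print B; print A;
two_writes = [[("W", "A", 1), ("W", "B", 1)], [("R", "B", None), ("R", "A", None)]]
# P0: A=1; print B;    P1: B=1; print A;
write_then_read = [[("W", "A", 1), ("R", "B", None)], [("W", "B", 1), ("R", "A", None)]]

same_as_sc = tso_outcomes(two_writes, [(1, "B"), (1, "A")])        # (B, A) seen by P1
not_sc = tso_outcomes(write_then_read, [(0, "B"), (1, "A")])       # (B by P0, A by P1)
print(sorted(same_as_sc))
print(sorted(not_sc))
```

In this model the first program never yields B = 1 with A = 0, because the buffer drains in FIFO order. The second program can yield 0 for both reads, since each load may be performed while the processor's own store is still buffered.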
How do I enforce SC in the last example if I really care?
May be needed when porting this program from R10000 to UltraSPARC
Must ensure that a read cannot bypass earlier writes
Microprocessors provide “fence” instructions for this purpose
SPARC v9 specification provides MEMBAR (memory barrier) instruction of different
flavors
Here we only need to use one of these flavors, namely, a write-to-read fence just before
the load instruction
This fence will not allow the load to graduate until all stores before it have graduated
If fence instruction is not available, substituting the read by a read-modify-write (e.g.,
ldstub in SPARC) also works
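The effect of a write-to-read fence on the last example can be sketched by ordering three events per processor. The assumptions here are illustrative and not from the slides: S is the store entering the write buffer, D is the store draining to memory, L is the load reading memory, and the fence is modelled as "D must come before L". The function name outcomes and the event letters are hypothetical.

```python
from itertools import permutations

def outcomes(fenced):
    """(value P0 prints for B, value P1 prints for A) for
    P0: A=1; [fence;] print B     P1: B=1; [fence;] print A"""
    events = [(p, e) for p in (0, 1) for e in "SDL"]
    results = set()
    for order in permutations(events):
        pos = {ev: k for k, ev in enumerate(order)}
        ok = True
        for p in (0, 1):
            ok &= pos[(p, "S")] < pos[(p, "D")]      # a store drains after it is buffered
            ok &= pos[(p, "S")] < pos[(p, "L")]      # the load follows the store in program order
            if fenced:
                ok &= pos[(p, "D")] < pos[(p, "L")]  # fence: the store must drain first
        if not ok:
            continue
        mem = {"A": 0, "B": 0}
        printed = {}
        for p, e in order:
            mine, other = ("A", "B") if p == 0 else ("B", "A")
            if e == "D":
                mem[mine] = 1                        # the store becomes globally visible
            elif e == "L":
                printed[p] = mem[other]              # the load reads the other variable
        results.add((printed[0], printed[1]))
    return results

print(sorted(outcomes(fenced=False)))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(sorted(outcomes(fenced=True)))    # [(0, 1), (1, 0), (1, 1)]
```

With the fence in place, the (0, 0) result disappears and the remaining outcomes are exactly those allowed by SC for this program.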
Document information
Uploaded by: ekana
University: All India Institute of Medical Sciences
Upload date: 28/03/2013