Multiprocessor Systems: Case Studies on Cache Coherence and Bus Architectures, Slides of Computer Science

All India Institute of Medical Sciences Computer Science

Lecture materials for module 12, lecture 26 of a parallel computer architecture course. It covers case studies on multiprocessors connected via a snoopy bus, focusing on conflict resolution, the path of a cache miss, write serialization, in-order response, multi-level caches, and the dependence graph. Topics include the SGI Challenge, the Sun Enterprise, and the Sun Gigaplane bus.

Typology: Slides

2012/2013

Uploaded on 03/28/2013

ekana 🇮🇳



Module 12: "Multiprocessors on a Snoopy Bus"

Lecture 26: "Case Studies"

Conflict resolution

Path of a cache miss

Write serialization

Write atomicity and SC

Another example

In-order response

Multi-level caches

Dependence graph

Multiple outstanding requests

SGI Challenge

Sun Enterprise

Sun Gigaplane bus

[From Chapter 6 of Culler, Singh, Gupta]

Conflict resolution

Use the pending request table to resolve conflicts. Every processor has a copy of the table. Before arbitrating for the address bus, every processor looks up the table to see if there is a match. In case of a match, the request is not issued and is held in a pending buffer.

Flow control is needed at different levels; essentially, the system needs to detect if any buffer is full. SGI Challenge uses a separate NACK line for each of the address and data phases. Before the phases reach the “ack” cycle, any cache controller can assert the NACK line if it runs out of some critical buffer; this invalidates the transaction and the requester must retry (it may use back-off and/or priority). Sun Enterprise requires the receiver to generate the retry when it has buffer space (thus only one retry).
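The table lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual hardware logic of any machine: the class name `RequestTableSketch`, the `capacity` parameter, and the string results are all hypothetical, and a full table is modelled locally as a "nack" even though on the SGI Challenge the NACK line is asserted by whichever cache controller runs out of buffer space.

```python
from collections import deque


class RequestTableSketch:
    """Hypothetical model of one processor's copy of the pending request table."""

    def __init__(self, capacity=8):
        self.capacity = capacity        # assumed table size, not from the lecture
        self.outstanding = {}           # line address -> command, e.g. "BusRd"
        self.pending_buffer = deque()   # requests held back because of a conflict

    def try_issue(self, addr, cmd):
        """Return "issued", "held" (conflict) or "nack" (no buffer space)."""
        if addr in self.outstanding:
            # Match in the table: do not arbitrate, park the request instead.
            self.pending_buffer.append((addr, cmd))
            return "held"
        if len(self.outstanding) >= self.capacity:
            # Simplified flow control: the requester has to retry later.
            return "nack"
        self.outstanding[addr] = cmd
        return "issued"

    def complete(self, addr):
        """Retire a transaction and retry the requests held for that line."""
        self.outstanding.pop(addr, None)
        retry = [r for r in self.pending_buffer if r[0] == addr]
        self.pending_buffer = deque(r for r in self.pending_buffer if r[0] != addr)
        return [(a, c, self.try_issue(a, c)) for a, c in retry]
```

With a two-entry table, a second request to an outstanding line is held, a request that finds the table full is refused, and the held request is issued once the conflicting transaction completes.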

Path of a cache miss

Assume a read miss. Look up the request table; in case of a match with a BusRd, just mark the entry to indicate that this processor will snoop the response from the bus and that it will also assert the shared line. In case of a request table hit with a BusRdX, the cache controller must hold on to the request until the conflict resolves. In case of a request table miss, the requester arbitrates for the address bus; if a conflicting request arrives while it is arbitrating, the controller must put a NOP transaction in the slot it is granted and hold on to the request until the conflict resolves.

Suppose the requester succeeds in putting the request on the address/command bus. Other cache controllers snoop the request, register it in the request table (the requester also does this), and take the appropriate coherence action within their own cache hierarchy; main memory also starts fetching the cache line.

If a cache holds the line in M state, it should source it on the bus during the response phase. It keeps the inhibit line asserted until it gets the data bus; then it lowers the inhibit line and asserts the modified line. At this point the memory controller aborts the data fetch/response and instead fields the line from the data bus for writing back. If the memory fetches the line even before the snoop is complete, the inhibit line will not allow the memory controller to launch the data on the bus. After the inhibit line is lowered, memory cancels the data response depending on the state of the modified line. If no one has the line in M state, the requester grabs the response from memory.

A store miss is similar. The only difference is that even if a cache has the line in M state, the memory controller does not write the response back. Also, any pending BusUpgr to the same cache line must be converted to BusRdX.
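The two decisions in this walk-through (what the requester does after the table lookup, and who supplies the data in the response phase) can be summarised as a pair of small functions. This is only a sketch of the rules as stated in the text; the function names and the returned strings are hypothetical, and cases the text does not cover raise an error rather than guess.

```python
def on_read_miss(matching_cmd):
    """Decide what the requester does after looking up the request table.

    matching_cmd is the command of an outstanding transaction to the same
    line, or None on a request table miss.
    """
    if matching_cmd is None:
        return "arbitrate"          # no conflict: compete for the address bus
    if matching_cmd == "BusRd":
        return "snoop_and_share"    # reuse the earlier read's response
    if matching_cmd == "BusRdX":
        return "hold"               # wait until the conflicting write finishes
    raise ValueError(f"case not covered by the lecture text: {matching_cmd}")


def response_phase(cmd, some_cache_has_M):
    """Return (who supplies the data, whether memory captures a write-back)."""
    if some_cache_has_M:
        # The owner sources the line and memory cancels its own response.
        # Only for a read (BusRd) does memory pick the line up for write-back.
        return ("owner_cache", cmd == "BusRd")
    return ("memory", False)
```

For example, `response_phase("BusRd", True)` gives `("owner_cache", True)`, while the same situation for a store miss, `response_phase("BusRdX", True)`, gives `("owner_cache", False)`.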

Write serialization

In a split-transaction bus setting, the request table provides sufficient support for write

Write atomicity and SC

Sequential consistency (SC) requires write atomicity, i.e. the total order of all writes seen by all processors should be identical. Since a BusRdX or BusUpgr does not wait until the invalidations are actually applied to the caches, you have to be careful.

P0: A=1; B=1;

P1: print B; print A;

Under SC, (A, B) = (0, 1) is not allowed. Suppose that, to start with, P1 has the line containing A in its cache, but not the line containing B. The stores of P0 queue the invalidation of A in P1’s cache controller. P1 takes a read miss for B, but the response for B is re-ordered by P1’s cache controller so that it overtakes the invalidation (it thought it may be better to prioritize reads). P1 then reads the new value of B but still hits on its stale copy of A, so it prints (A, B) = (0, 1).

Another example

P0: A=1; print B;

P1: B=1; print A;

Under SC, (A, B) = (0, 0) is not allowed. The same problem arises if P0 executes both instructions first, then P1 executes the write of B (which, let’s assume, generates an upgrade so that it is marked complete as soon as the address arbitration phase finishes), and then the upgrade completion is re-ordered with the pending invalidation of A.

So, the reason these two cases fail is that the new values are made visible before older invalidations are applied. One solution is to have a strict FIFO queue between the bus controller and the cache hierarchy. But it is sufficient to ensure that replies do not overtake invalidations; otherwise the bus responses can be re-ordered without violating write atomicity and hence SC (e.g., if there are only read and write responses in the queue, it sometimes may make sense to prioritize read responses).
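The first example (P0: A=1; B=1 and P1: print B; print A) can be replayed with a toy model of P1's incoming queue. This is a sketch under simplifying assumptions, not a model of a real cache controller: P0's stores are taken to have completed on the bus already, P1 continues as soon as its miss to B is satisfied, and the names `p1_observes` and `reply_overtakes_invalidation` are hypothetical.

```python
def p1_observes(delivery_order):
    """Return the (A, B) pair P1 prints for a given message delivery order."""
    cache = {"A": 0}             # P1 starts with the old value of A cached
    memory = {"A": 1, "B": 1}    # P0's stores have completed on the bus
    for kind, line in delivery_order:
        if kind == "inval":
            cache.pop(line, None)        # invalidation applied to P1's cache
        else:                            # "reply": fill the missing line
            cache[line] = memory[line]
            break                        # P1 resumes once B has arrived
    b = cache["B"]                       # print B
    a = cache.get("A", memory["A"])      # print A: stale hit, or a re-fetch
    return (a, b)


def reply_overtakes_invalidation(arrival, delivery):
    """True if some reply is delivered before an invalidation that arrived earlier."""
    pos = {msg: i for i, msg in enumerate(delivery)}
    for i, earlier in enumerate(arrival):
        for later in arrival[i + 1:]:
            if (earlier[0] == "inval" and later[0] == "reply"
                    and pos[later] < pos[earlier]):
                return True
    return False
```

With `arrival = [("inval", "A"), ("reply", "B")]`, delivering in arrival order makes P1 print (1, 1), while delivering the reply first makes it print the forbidden (0, 1). The second function is the weaker check that the text says is sufficient: any re-ordering is acceptable as long as it returns False.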

In-order response

In-order response can simplify quite a few things in the design. The fully associative request table can be replaced by a FIFO queue. Conflicting requests where one is a write can now be allowed (multiple reads were allowed even before, although only the first one actually appears on the bus).

Consider a BusRdX followed by a BusRd from two different processors. With in-order response it is guaranteed that the BusRdX response will be granted the data bus before the BusRd response (which may not be true for out-of-order (ooo) response, and hence such a conflict is disallowed there). So when the cache controller generating the BusRdX sees the BusRd, it only notes that it should source the line for this request after its own write is completed.

The performance penalty may be huge, essentially because of the memory. Consider a situation where three requests are pending to cache lines A, B, C in that order. A and B map to the same memory bank while C is in a different bank. Although the response for C may be ready long before that of B, it cannot get the bus
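The A, B, C situation can be put into numbers with a small timing sketch. All figures here are assumptions chosen for illustration (a bank latency of 10 cycles, a 1-cycle data transfer, all three requests issued at time 0), and the function names are hypothetical; the out-of-order case simply serves responses in the order they become ready.

```python
def ready_times(requests, bank_latency):
    """Time at which each line is ready, with one access at a time per bank."""
    bank_free, ready = {}, {}
    for line, bank in requests:
        start = bank_free.get(bank, 0)        # wait for earlier access to this bank
        ready[line] = start + bank_latency
        bank_free[bank] = ready[line]
    return ready


def delivery_times(requests, bank_latency=10, transfer=1, in_order=True):
    """Time at which each response has finished crossing the data bus."""
    ready = ready_times(requests, bank_latency)
    order = [line for line, _ in requests]
    if not in_order:
        order.sort(key=lambda line: ready[line])   # stable: ties keep request order
    bus_free, done = 0, {}
    for line in order:
        start = max(bus_free, ready[line])    # need both the data and the bus
        done[line] = bus_free = start + transfer
    return done


# A and B share bank 0, C is alone in bank 1.
REQUESTS = [("A", 0), ("B", 0), ("C", 1)]
```

Under these assumptions C is ready at cycle 10 in both cases, but with in-order response it is delivered at cycle 22, behind B, whereas with out-of-order response it is delivered at cycle 12.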

Multi-level caches

reply; but after popping the head of the L1-to-L2 queue it is impossible to backtrack if the message does need space in the L2-to-L1 queue. Similarly, the L1 cache controller refuses to drain the L2-to-L1 queue if there is no space in the L1-to-L2 queue. How do we break this cycle? Observe that responses for processor requests are guaranteed not to generate any more messages, and intervention requests do not generate new requests, but can only generate replies.

Solving the queue deadlock

Introduce one more queue in each direction, i.e. have a pair of queues in each direction: an L1-to-L2 processor request queue and an L1-to-L2 intervention response queue; similarly, an L2-to-L1 intervention request queue and an L2-to-L1 processor response queue. Now the L2 cache controller can serve the L1-to-L2 processor request queue as long as there is space in the L2-to-L1 processor response queue, but there is no constraint on the L1 cache controller for draining the L2-to-L1 processor response queue. Similarly, the L1 cache controller can serve the L2-to-L1 intervention request queue as long as there is space in the L1-to-L2 intervention response queue, but the L1-to-L2 intervention response queue will drain as soon as the bus is granted.

Dependence graph

Now we have four queues. Processor request (PR) and intervention reply (IY) are L1 to L2. Processor reply (PY) and intervention request (IR) are L2 to L1.

It is possible to combine PR and IY into a supernode of the graph and still be cycle-free; this leads to one L1-to-L2 queue. Similarly, it is possible to combine IR and PY into a supernode; this leads to one L2-to-L1 queue. We cannot do both, since that leads to a cycle, as already discussed. Bottom line: we need at least three queues for a two-level cache hierarchy.
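The supernode argument can be checked mechanically with a depth-first cycle search. The sketch below assumes the dependence graph has just the two edges that the text spells out (serving PR may need space in PY, and serving IR may need space in IY); the original slide may draw the graph differently. The helper names `has_cycle` and `merged` and the supernode labels are hypothetical.

```python
def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as (src, dst) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    color = {n: WHITE for n in graph}

    def visit(n):
        color[n] = GREY
        for m in graph[n]:
            if color[m] == GREY or (color[m] == WHITE and visit(m)):
                return True               # reached a node still on the stack
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


def merged(edges, groups):
    """Rename queues according to groups (queue name -> supernode name)."""
    return {(groups.get(s, s), groups.get(d, d)) for s, d in edges}


# "X -> Y" means: serving queue X may need free space in queue Y.
EDGES = {("PR", "PY"), ("IR", "IY")}
L1_TO_L2 = {"PR": "L1toL2", "IY": "L1toL2"}
L2_TO_L1 = {"IR": "L2toL1", "PY": "L2toL1"}
```

Merging only one direction, for example `merged(EDGES, L1_TO_L2)`, leaves the graph acyclic, while merging both directions with `merged(EDGES, {**L1_TO_L2, **L2_TO_L1})` produces the two-node cycle that corresponds to the deadlock, which is why three queues is the minimum.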

Multiple outstanding requests

Today all processors allow multiple outstanding cache misses. We have already discussed issues related to ooo execution. Not much needs to be added on top of that to support multiple outstanding misses. For a multi-level cache hierarchy, the queue depths may be made bigger for performance reasons. Various other buffers, such as the writeback buffer, also need to be made bigger.

Snoop result is available 5 cycles after the request phase. Memory fetches data speculatively. MOESI protocol.
