Directory Overhead - Parallel Computer Architecture - Lecture Slides, Slides of Computer Science

All India Institute of Medical Sciences Computer Science

2012/2013

Uploaded on 03/28/2013

ekana 🇮🇳

Module 14: "Directory-based Cache Coherence"

Lecture 33: "SCI Protocol"

Directory-based Cache Coherence: Special Topics

Sequent NUMA-Q

SCI protocol

Directory overhead

Cache overhead

Handling read miss

Handling write miss

Handling writebacks

Roll-out protocol

Snoop interaction

Protocol processor

[From Chapter 8 of Culler, Singh, Gupta]

[SGI Origin 2000 material taken from Laudon and Lenoski, ISCA 1997]

[GS320 material taken from Gharachorloo et al., ASPLOS 2000]

Sequent NUMA-Q

- Implements the IEEE SCI directory protocol.
- One node is an Intel Pentium Pro quad SMP.
- The IQ-Link board connects to the system bus and implements the directory protocol. It also contains a 32 MB, 4-way set-associative remote access cache (RAC).
- Processors within a node are kept coherent by the MESI snoop-based protocol already implemented in the Pentium Pro quad.
- The SCI protocol keeps the RACs coherent across nodes.
- The RAC maintains inclusion with the processor caches.
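To make the RAC figures concrete, the arithmetic below derives the set count and the address bit split of a 32 MB, 4-way set-associative cache. It is a minimal sketch: the 64-byte line size is an assumption that the slide does not state, and `rac_geometry` is a hypothetical helper name.

```python
# Geometry of a 32 MB, 4-way set-associative RAC (capacity and ways from the slide).
# ASSUMPTION: 64-byte cache lines; the slide does not give the line size.
RAC_CAPACITY_BYTES = 32 * 1024 * 1024
RAC_WAYS = 4
LINE_BYTES = 64  # assumed


def rac_geometry(capacity: int, ways: int, line: int) -> dict:
    """Number of lines and sets, plus offset/index bit counts (power-of-two sizes)."""
    lines = capacity // line
    sets = lines // ways
    offset_bits = line.bit_length() - 1  # log2(line)
    index_bits = sets.bit_length() - 1   # log2(sets)
    return {"lines": lines, "sets": sets, "offset_bits": offset_bits, "index_bits": index_bits}


if __name__ == "__main__":
    print(rac_geometry(RAC_CAPACITY_BYTES, RAC_WAYS, LINE_BYTES))
```

With these assumptions the RAC has 131072 sets, so 17 address bits select the set and 6 bits select the byte within a line; a different line size changes both numbers.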

SCI protocol

Directory structure
- Home contains the id of the most recently queued sharer or of the owner (6 bits).
- Sharing list: each sharer contains the ids of the next sharer and the previous sharer; the last sharer contains the id of the home node and of its previous sharer, forming a circular doubly linked list.
- Three major states in the directory:
  - HOME: remotely unowned, but may be cached in the local quad
  - FRESH: same as shared
  - GONE: some node has exclusive ownership; memory is stale

Cache states
- Processor cache: MESI
- RAC: 29 stable states and many transient states; 7 bits represent the RAC state.
- RAC states have two-part names: the first part gives the location of the block in the list (ONLY, HEAD, TAIL, MID), and the second part gives the actual state (modified, exclusive, fresh, copy, …), e.g., HEAD_DIRTY, TAIL_CLEAN. We will use some of these to understand the basics of SCI (the full description is available in the IEEE standard).

Three major operations on the list
- List construction: add a new sharer to the list.
- Roll-out: remove a sharer from the list; this must synchronize with the immediate neighbors.
- Purge/invalidate: the head node always has write permission, so it can purge the entire list before writing; naturally, only the head node has this privilege.

Three classes of protocol
- Minimal SCI: sharing not allowed
- Typical SCI (discussed here): supports every feature one would normally expect
- Full SCI: many optimizations, including hardware support for synchronization
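The directory structure and the list operations can be sketched as a small Python model. This is a simplified illustration, not the IEEE SCI state machine: `SharingList`, its methods, and the `HOME` sentinel are hypothetical names, transient states and messages are omitted, and the head's backward pointer is modeled as empty, with the home treated as its upstream neighbor. Purge is left to the write-miss discussion.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

HOME = "home"  # hypothetical id standing in for the home node


@dataclass
class SharingList:
    """Simplified SCI-style sharing list for one memory block."""
    head: Optional[str] = None  # pointer kept in the home's directory entry
    nxt: Dict[str, str] = field(default_factory=dict)             # sharer -> next sharer or HOME
    prev: Dict[str, Optional[str]] = field(default_factory=dict)  # sharer -> previous sharer

    def construct(self, node: str) -> None:
        """List construction: the new sharer is queued at the head of the list."""
        old_head = self.head
        self.nxt[node] = HOME if old_head is None else old_head
        self.prev[node] = None
        if old_head is not None:
            self.prev[old_head] = node
        self.head = node

    def rollout(self, node: str) -> None:
        """Roll-out: unlink a sharer by fixing up its immediate neighbors."""
        before = self.prev.pop(node)
        after = self.nxt.pop(node)
        if before is None:  # the head leaves, so the home's pointer moves
            self.head = None if after == HOME else after
        else:
            self.nxt[before] = after
        if after != HOME:
            self.prev[after] = before

    def position(self, node: str) -> str:
        """First half of a two-part RAC state name."""
        is_head = self.head == node
        is_tail = self.nxt[node] == HOME
        if is_head and is_tail:
            return "ONLY"
        if is_head:
            return "HEAD"
        return "TAIL" if is_tail else "MID"

    def members(self) -> List[str]:
        """Walk from the head towards the tail."""
        out: List[str] = []
        cur = self.head
        while cur is not None and cur != HOME:
            out.append(cur)
            cur = self.nxt[cur]
        return out


if __name__ == "__main__":
    sl = SharingList()
    for n in ["A", "B", "C"]:
        sl.construct(n)
    print(sl.members(), [sl.position(n) for n in sl.members()])  # ['C', 'B', 'A'] ['HEAD', 'MID', 'TAIL']
    sl.rollout("B")
    print(sl.members(), [sl.position(n) for n in sl.members()])  # ['C', 'A'] ['HEAD', 'TAIL']
```

Because each node stores both neighbors, a roll-out touches only the two adjacent entries, which is why it has to synchronize with exactly those neighbors.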

Note that the directory remains in the GONE state and memory is not updated (similar to an M-to-O transition).

Handling races
- Suppose that when the requester's (say A) message reaches the old head (say B), B's RAC line is in the PENDING state.
- SCI has no pending state in the directory and does not use NACKs (it actually does, but only in a small number of cases).
- A does become the new head (it has to, because the home has already updated the directory), but it inherits the PENDING state from B.
- Any subsequent request will come to A and will become the new pending head.
- Ultimately, the PENDING states are resolved along the chain, starting from B and moving upstream.
- The FIFO nature of the pending list guarantees fairness.
- Also, there is no problem of sizing the buffers that hold pending requests (no extra space is needed).
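The fairness argument can be illustrated with a toy model of the distributed pending chain. In this sketch (hypothetical names, no real messages or timing), every new requester becomes the head immediately, just as the home's directory pointer does, and records the old head it is queued behind; the PENDING states then clear from the oldest node upstream, which is first-come, first-served order.

```python
from typing import Dict, List, Optional


class PendingChain:
    """Toy model of SCI's distributed queue of pending requests for one block."""

    def __init__(self) -> None:
        self.head: Optional[str] = None               # node the home directory points to
        self.waits_on: Dict[str, Optional[str]] = {}  # requester -> old head it queued behind

    def request(self, node: str) -> None:
        """The home swings its head pointer at once; the requester inherits PENDING
        and waits behind the previous head instead of being NACKed."""
        self.waits_on[node] = self.head
        self.head = node

    def resolve_order(self) -> List[str]:
        """PENDING states clear starting at the oldest node and moving upstream."""
        chain: List[str] = []
        cur = self.head
        while cur is not None:
            chain.append(cur)
            cur = self.waits_on[cur]
        return list(reversed(chain))


if __name__ == "__main__":
    pc = PendingChain()
    for n in ["X", "Y", "Z"]:  # X was already pending when Y's request arrived
        pc.request(n)
    print(pc.resolve_order())  # ['X', 'Y', 'Z']
```

Each waiting node stores only one id, so the queue lives in the nodes themselves and the home needs no buffer for pending requests.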

Handling write miss

CASE A: requester is already in HEAD_DIRTY state
- The directory must be in the GONE state; only the sharers need to be invalidated.
- The requester sends an invalidation to the next sharer.
- A sharer receiving an invalidation sends a roll-out request to its next sharer (unless it is the TAIL); the receiving node updates its upstream pointer and sends a roll-out acknowledgment.
- Once the roll-out request is acknowledged, the sharer invalidates its RAC line and replies to the head with the id of its next sharer.
- The head then moves on to purge the sharer with the received id.
- During the entire process, the requester's RAC line remains in the PENDING state.
- Note that the home is not involved at all.

CASE B: requester is in ONLY_DIRTY state
- No transaction is needed.

CASE C: requester is in HEAD_FRESH state
- Send a state-change request to the home (FRESH to GONE).
- Once the acknowledgment from the home is received, list purging can start.
- What if the home is in a state other than FRESH, with a different head node? This is the only case in SCI in which a NACK is generated. On receiving the NACK, the requester changes its state to PENDING and initiates a new write request to the home to transition to ONLY_DIRTY.

CASE D: requester is in MID_FRESH or TAIL_FRESH state
- It must first roll out from the list and re-attach itself as the head, in HEAD_FRESH state (recall that only the head node can write).
- This roll-out may require acknowledgments from both the upstream and downstream neighbors (if MID) or only from the upstream neighbor (if TAIL).
- Then follow CASE C.

CASE E: requester is not a sharer
- First obtain the block in HEAD_DIRTY state, then follow CASE A.
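The case analysis can be summarized as a dispatch on the requester's RAC state, together with the serial purge that CASE A performs. This is a minimal sketch under simplifying assumptions: the step strings are descriptive labels rather than real SCI messages, any state not listed is treated as "not a sharer" (CASE E), the NACK path of CASE C is not modeled, and `write_miss_steps`, `purge`, and `HOME` are hypothetical names.

```python
from typing import Dict, List

HOME = "home"  # hypothetical id for the home node (terminates the list)


def write_miss_steps(state: str) -> List[str]:
    """Steps a writer takes in typical SCI, keyed by its current RAC state."""
    if state == "ONLY_DIRTY":                       # CASE B
        return []
    if state == "HEAD_DIRTY":                       # CASE A
        return ["purge sharers"]
    if state == "HEAD_FRESH":                       # CASE C (NACK path not modeled)
        return ["ask home FRESH->GONE", "purge sharers"]
    if state in ("MID_FRESH", "TAIL_FRESH"):        # CASE D
        return ["roll out", "re-attach as HEAD_FRESH"] + write_miss_steps("HEAD_FRESH")
    # CASE E: not a sharer (the purge may find no other sharers)
    return ["obtain block as HEAD_DIRTY"] + write_miss_steps("HEAD_DIRTY")


def purge(head: str, nxt: Dict[str, str], home: str = HOME) -> List[str]:
    """CASE A purge: the head invalidates one sharer at a time, and each
    invalidated sharer replies with the id of its successor."""
    invalidated: List[str] = []
    victim = nxt[head]
    while victim != home:
        successor = nxt.pop(victim)  # sharer rolls out and reports its next sharer
        invalidated.append(victim)
        victim = successor
    nxt[head] = home                 # the writer is now the only member of the list
    return invalidated


if __name__ == "__main__":
    print(write_miss_steps("TAIL_FRESH"))
    print(purge("W", {"W": "S1", "S1": "S2", "S2": HOME}))  # ['S1', 'S2']
```

The purge loop makes the serialization visible: the head learns each next victim only from the previous victim's reply, so its cost grows with the length of the sharing list.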

Snoop interaction

- The biggest problem is that the MESI protocol is designed for in-order response (so what?). NUMA-Q had to use the deferred-response signal for remote requests.
- Lesson learned: for hierarchical protocols, the bus must be split-transaction with out-of-order response (what happens otherwise?).
- The snoop response is available after four cycles at the earliest.
- The stall wire may be asserted by any processor unable to meet this four-cycle limit; the bus controller samples the stall wire every two cycles.
- The RAC and the directory (for local requests) are also looked up in parallel.
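A rough timing model of the snoop-response rule: the response can be taken no earlier than cycle 4, and an agent that is not ready holds the stall wire, which the bus controller re-samples every two cycles. The sketch assumes all cycles are counted from the same reference point as the four-cycle limit and that the response is taken at the first sample where nobody stalls; the function names are hypothetical.

```python
import math
from typing import Iterable

EARLIEST_SNOOP_CYCLE = 4  # earliest snoop response, per the slide
STALL_SAMPLE_PERIOD = 2   # the stall wire is sampled every two cycles


def snoop_response_cycle(ready_cycle: int) -> int:
    """First sampling point at or after `ready_cycle`, never earlier than cycle 4."""
    if ready_cycle <= EARLIEST_SNOOP_CYCLE:
        return EARLIEST_SNOOP_CYCLE
    late_by = ready_cycle - EARLIEST_SNOOP_CYCLE
    return EARLIEST_SNOOP_CYCLE + STALL_SAMPLE_PERIOD * math.ceil(late_by / STALL_SAMPLE_PERIOD)


def bus_snoop_cycle(ready_cycles: Iterable[int]) -> int:
    """All agents (processors, plus the RAC/directory lookup done in parallel)
    must be ready; the slowest one keeps the stall wire asserted."""
    return snoop_response_cycle(max(ready_cycles))


if __name__ == "__main__":
    print(bus_snoop_cycle([2, 4, 7]))  # 8: the agent ready at cycle 7 stalls until the next sample
```

Under this model a lookup that misses the four-cycle limit by even one cycle costs two cycles, which is why the RAC and directory lookups are started in parallel with the processor snoops.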

Protocol processor

- NUMA-Q runs its protocols in microcode.
- The protocol processor is customized with bit-field operations and is a three-stage, dual-issue pipeline.
- It has a dedicated cache for holding recently accessed directory entries and RAC tags.
- It also contains three performance-monitoring counters, which protocol code can program (i.e., read and write).
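Bit-field operations matter because directory entries and RAC tags pack several small fields into one word. The sketch below shows the shift-and-mask work a general-purpose processor would need for such fields. The 7-bit RAC state and 6-bit node ids follow the slides, but treating the forward and backward pointers as 6-bit fields, and the bit positions themselves, are illustrative assumptions rather than the NUMA-Q format.

```python
FIELDS = {  # name: (least-significant bit, width); hypothetical layout
    "state": (0, 7),   # RAC state (7 bits, per the slide)
    "fwd": (7, 6),     # next-sharer id (assumed 6 bits, like the home pointer)
    "back": (13, 6),   # previous-sharer id (assumed 6 bits)
}


def extract(word: int, name: str) -> int:
    """Read one field: shift it down and mask off its width."""
    lsb, width = FIELDS[name]
    return (word >> lsb) & ((1 << width) - 1)


def insert(word: int, name: str, value: int) -> int:
    """Overwrite one field, leaving the other bits of the word untouched."""
    lsb, width = FIELDS[name]
    if not 0 <= value < (1 << width):
        raise ValueError(f"{name} must fit in {width} bits")
    mask = ((1 << width) - 1) << lsb
    return (word & ~mask) | (value << lsb)


if __name__ == "__main__":
    tag = 0
    tag = insert(tag, "state", 0x2A)
    tag = insert(tag, "fwd", 17)
    tag = insert(tag, "back", 5)
    print(extract(tag, "state"), extract(tag, "fwd"), extract(tag, "back"))  # 42 17 5
```

Each `extract` or `insert` here costs several shift, mask, and logical instructions; dedicated bit-field instructions collapse that sequence, which helps keep protocol handlers short.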
