Managing Directory Overhead - Parallel Computer Architecture - Lecture Slides, Slides of Computer Science

These are the lecture slides of Parallel Computer Architecture, which include Conflict Resolution, Cache Miss, Write Serialization, In-Order Response, Multi-Level Caches, Dependence Graph, etc. Key points are: Managing Directory Overhead, Replacement, Serialization, Deadlock, Starvation, Schemes, Sparse Directory.

Typology: Slides

2012/2013

Uploaded on 03/28/2013

ekana

Module 14: "Directory-based Cache Coherence"
Lecture 31: "Managing Directory Overhead"
Directory-based Cache Coherence:
Replacement of S blocks
Serialization
VN deadlock
Starvation
Overflow schemes
Sparse directory
Remote access cache
COMA
Latency tolerance
Page migration
Queue lock in hardware
[From Chapter 8 of Culler, Singh, Gupta]
[SGI Origin 2000 material taken from Laudon and Lenoski, ISCA 1997]
[GS320 material taken from Gharachorloo et al., ASPLOS 2000]

Replacement of S blocks

Send notification to directory?
  Can save a future invalidation
  Does it reduce overall traffic?
Origin 2000 does not use replacement hints for S blocks
  No notification to directory
  Why?
Replacements of E blocks are hinted and also require acknowledgments (why?)
Summary of transaction types
  Coherence: 9 request transaction types, 6 invalidation/intervention types, 39 reply types
  Non-coherent (I/O, synch, special): 19 requests, 14 replies
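The traffic question above can be explored with a back-of-the-envelope count. This is a minimal sketch under assumptions that are not in the slides: a hint for an S block costs one unacknowledged message, a stale sharer costs one invalidation plus one acknowledgment, and `p_written_later` (a hypothetical parameter) is the fraction of replaced blocks that are written by another node before the replacing node shares them again.

```python
def messages_with_hints(replacements):
    """One hint message to home per replaced S block (assumed unacknowledged)."""
    return replacements


def messages_without_hints(replacements, p_written_later):
    """Extra messages caused by stale sharer bits when no hint is sent.

    A stale sharer costs one invalidation plus one acknowledgment, but only
    if the block is written before that node shares it again.
    """
    return 2 * replacements * p_written_later


# Under this model, hints pay off only when more than half of the replaced
# blocks are written later; below that they add traffic.
with_hints = messages_with_hints(1000)
without_hints = messages_without_hints(1000, p_written_later=0.25)
```

With these example numbers the hint-free scheme sends fewer messages, which is consistent with a design that skips hints for S blocks, although the slides leave the actual reason as a question.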

Serialization

Home is used to serialize requests
  The order determined by the home is final
  No node should violate this order
Example: read-invalidate races
  P0, P1, and P2 are trying to access a cache block
  P0 and P2 want to read while P1 wants to write
  The requests from P0 and P2 reach home first; home replies and marks both in the sharer vector, but the reply message to P0 gets delayed in the network
  P1’s write causes home to send out invalidations to P0 and P2; P0’s invalidation reaches P0 before the read reply
  P0’s hub sends an acknowledgment to P1 and also forwards the invalidation to P0’s processor cache
  What happens when P0’s reply arrives? Can the data be used?
Requester’s viewpoint
  When a read reply arrives, it finds that the outstanding transaction table (OTT) entry has the “inv” bit set
  Under what conditions can this happen? The read-invalidate race above is one case
  Can replacement hints help?
  What about upgrade-invalidation races?
  What about readX-invalidation races?
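The read-invalidate race can be traced with a small model of the requester. This is a hypothetical sketch, not the Origin 2000 hub logic: the class and method names are invented, and the policy applied when the reply finds the “inv” bit set (let the value satisfy the one pending load, since home ordered the read before the write, but do not install the block in the cache) is one possible resolution assumed for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OTTEntry:
    """Entry for one pending read (hypothetical layout)."""
    addr: int
    inv: bool = False  # set when an invalidation overtakes the read reply


@dataclass
class Requester:
    ott: dict = field(default_factory=dict)    # addr -> OTTEntry
    cache: dict = field(default_factory=dict)  # addr -> data

    def issue_read(self, addr):
        self.ott[addr] = OTTEntry(addr)

    def on_invalidation(self, addr):
        # Drop any cached copy; if a read is still pending, remember the
        # invalidation in the OTT entry.
        self.cache.pop(addr, None)
        if addr in self.ott:
            self.ott[addr].inv = True
        return "ack"  # the acknowledgment goes to the writer

    def on_read_reply(self, addr, data):
        entry = self.ott.pop(addr)
        if not entry.inv:
            self.cache[addr] = data  # normal case: install the block
        # Assumed policy: the pending load still receives the value, but a
        # block that was invalidated in flight is never cached.
        return data


# P0's read reply is delayed and P1's invalidation arrives first.
p0 = Requester()
p0.issue_read(0x40)
ack = p0.on_invalidation(0x40)
value = p0.on_read_reply(0x40, "old value")
```

After the run, `p0.cache` is empty: the load completed with the value that home serialized before P1’s write, and no stale copy remains.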

VN deadlock

Origin 2000 has only two virtual networks, but has three-hop transactions
  Resorts to back-off invalidation or intervention to fall back to strict request-reply
  Does it really solve the problem or just move the problem elsewhere?
Stanford DASH has the same problem
  Uses NACKs after a time-out period if the outgoing network doesn’t free up
  Worse than Origin because NACKs inflate total traffic and may lead to livelock
  DASH avoids livelock by sizing the queues according to the machine size (not a scalable solution)

Sparse directory

Observation: the total number of cache blocks in all processors is far less than the total number of memory blocks
  Assume a 32 MB L3 cache and 4 GB memory: less than 1% of directory entries are active at any point in time
Idea is to organize the directory as a highly associative cache
On a directory entry “eviction”, send invalidations to all sharers or retrieve the line if dirty
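The observation and the eviction rule above can be checked with a short sketch. The arithmetic uses the 32 MB and 4 GB figures from the slide; the `SparseDirectory` class, its LRU replacement policy, and its entry layout are assumptions made for illustration, and the usual coherence actions on the accessed block itself are left out.

```python
from collections import OrderedDict

# Fraction of memory blocks that can be cached at once (block size cancels).
cache_bytes = 32 * 2**20    # 32 MB L3 cache
memory_bytes = 4 * 2**30    # 4 GB memory
active_fraction = cache_bytes / memory_bytes   # 2**-7, about 0.78%


class SparseDirectory:
    """Directory organized as a set-associative cache (LRU is assumed)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, block, requester, write=False):
        """Record an access; return actions caused by a directory eviction."""
        s = self.sets[block % self.num_sets]
        actions = []
        if block not in s and len(s) == self.ways:
            victim, entry = s.popitem(last=False)  # least recently used
            kind = "retrieve" if entry["dirty"] else "invalidate"
            actions.append((kind, victim, sorted(entry["sharers"])))
        entry = s.setdefault(block, {"sharers": set(), "dirty": False})
        if write:
            entry["sharers"] = {requester}
            entry["dirty"] = True
        else:
            entry["sharers"].add(requester)
        s.move_to_end(block)  # mark as most recently used
        return actions


d = SparseDirectory(num_sets=1, ways=2)
d.access(0, "P0")                      # block 0 shared by P0
d.access(1, "P1", write=True)          # block 1 dirty at P1
first = d.access(2, "P2")              # evicts the entry for block 0
second = d.access(3, "P0")             # evicts the entry for block 1
```

Here `first` is an invalidation sent to P0 for block 0, and `second` retrieves dirty block 1 from P1, matching the two eviction cases on the slide.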

Remote access cache

Essentially a large tertiary cache
  Captures remote cache blocks evicted from the local cache hierarchy
  Also visible to the coherence protocol, so inclusion must be maintained with processor caches
  Must be highly associative and larger than the outermost level of cache
  Usually part of DRAM is reserved for the RAC
For multiprocessor nodes, requests from different processors to the same cache block can be merged together; there is also a prefetching effect
Used in Stanford DASH
Disadvantage: latency and space

COMA

Cache-only memory architecture
  Solves the space problem of RAC
  Home node only maintains the directory entries, but may not have the cache block in memory
  A node requesting a cache block brings it to its local memory and local cache as usual
  Entire memory is treated as a large tertiary cache
  Known as the attraction memory (AM)
Home as well as any node having a cache block maintains a directory entry for the cache block
A request first looks up the AM directory state and, if unowned, gets forwarded to home which, in turn, forwards it to one of the sharers
To start with, home has the cache blocks
  It retains a cache block until it is replaced by some other migration
There is always a master copy of each cache block
  The last valid copy
What happens on a replacement of the master copy?
  Swap with the source of the migrating cache block
Latency problem remains at the requester
Inclusion problems between AM and processor cache hierarchy
  Complicates the protocol

Page migration

Page migration changes the existing VA to PA mapping of the migrated page
  Requires notifying all TLBs caching the old mapping
  Introduces a TLB coherence problem
Origin 2000 uses a smart page migration algorithm: it allows the page copy and the TLB shootdown to proceed in parallel
Array of 64 page reference counters per directory entry to decide whether to migrate a page or not: compare requester’s counter against home’s and send an interrupt to home if migration is required
What does the interrupt handler do?
  Access all directory entries of the lines belonging to the to-be-migrated page
  Send invalidations to sharers or interventions to owners; at the end all cache lines of that page must be in memory
  Set the poison bits in the directory entries of all the cache lines of the page
  Start a block transfer of the page from home to requester at this point (30 μs to copy 16 KB)
An access to a poisoned cache line from a node results in a bus error, which invalidates the TLB entry for that page in the requesting node (avoids broadcast shootdown)
Until the page is completely migrated and is assigned a physical page frame on the target node, all nodes accessing a poisoned line wait in a pending queue
After the page copy is completed the waiting nodes are served one by one; however, the directory entries and the page itself are moved to a “poisoned list” and are not yet freed at the home (i.e., that physical page frame still cannot be reused)
On every scheduler tick the kernel invalidates one TLB entry per processor
After a time equal to the number of TLB entries per processor multiplied by the scheduling quantum, the page frame is marked free and is removed from the poisoned list
Major advantage: requesting nodes see only the page copy latency, including invalidations and interventions, in the critical path, but not the TLB shootdown latency
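The migration decision and the poisoned-list hold time can be sketched in a few lines. The function names are hypothetical, `threshold` is an assumed tuning knob (the slides only say the requester’s counter is compared against home’s), and the 64-entry TLB and 10 ms scheduling quantum are example values, not figures from the slides.

```python
def should_migrate(counters, home, requester, threshold):
    """Compare the requester's reference counter for a page against home's.

    `threshold` is a hypothetical parameter; a real policy may differ.
    """
    return counters[requester] - counters[home] > threshold


def poisoned_hold_time(tlb_entries, quantum_s):
    """Time before the old page frame at home can be marked free.

    The kernel invalidates one TLB entry per processor on every scheduler
    tick, so every stale mapping is gone after `tlb_entries` ticks.
    """
    return tlb_entries * quantum_s


counters = [0] * 64   # one reference counter per node for this page
counters[0] = 10      # accesses from the home node (node 0)
counters[5] = 200     # node 5 touches the page far more often
migrate = should_migrate(counters, home=0, requester=5, threshold=100)
hold = poisoned_hold_time(tlb_entries=64, quantum_s=0.010)
```

With these example values the page would migrate to node 5, and the old frame would stay on the poisoned list for roughly 0.64 s. That delay is off the critical path, which is the advantage the last line of the slide points out.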

Queue lock in hardware

Stanford DASH
  Memory controller recognizes lock accesses
  Requires changes in compiler and instruction set
  Marks the directory entry with contenders
  On unlock a contender is chosen and the lock is granted to that node
  Unlock is forced to generate a notification message to home
  Possibly requires a special cache state for lock variables, or special uncached instructions for unlock if lock variables are not allowed to be cached
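The behaviour at the home node can be modelled as follows. This is a hypothetical sketch, not the DASH hardware: the class name and its methods are invented, and choosing the next holder in FIFO order is an assumption, since the slides say only that “a contender is chosen”.

```python
class LockDirectoryEntry:
    """Directory entry for a lock variable, tracking waiting nodes."""

    def __init__(self):
        self.holder = None
        self.contenders = []  # nodes whose lock request found the lock taken

    def lock(self, node):
        """Lock access recognized by the memory controller at home."""
        if self.holder is None:
            self.holder = node
            return True  # granted immediately
        if node not in self.contenders:
            self.contenders.append(node)  # mark the entry; the node waits
        return False

    def unlock(self, node):
        """Unlock notification arriving at home; returns the next holder."""
        assert node == self.holder, "only the holder may unlock"
        # Assumed FIFO choice among the marked contenders.
        self.holder = self.contenders.pop(0) if self.contenders else None
        return self.holder


entry = LockDirectoryEntry()
got_p0 = entry.lock("P0")        # lock is free, so P0 acquires it
got_p1 = entry.lock("P1")        # P1 is recorded as a contender
got_p2 = entry.lock("P2")        # P2 is recorded as a contender
next_holder = entry.unlock("P0") # home hands the lock to a contender
```

Because waiting nodes are recorded at home, they do not need to retry over the network; the grant on unlock is what wakes the chosen contender.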