



These are the lecture slides of Parallel Computer Architecture, which cover Conflict Resolution, Cache Miss, Write Serialization, In-Order Response, Multi-Level Caches, Dependence Graph, etc. Key points are: Managing Directory Overhead, Replacement, Serialization, Deadlock, Starvation, Schemes, Sparse Directory
Typology: Slides




Send notification to directory?
  Can save a future invalidation.
  Does it reduce overall traffic?
Origin 2000 does not use replacement hints.
  No notification to directory. Why?
Replacements of E blocks are hinted and also require acknowledgments (why?).
Summary of transaction types
  Coherence: 9 request transaction types, 6 invalidation/intervention types, 39 reply types.
  Non-coherent (I/O, synch, special): 19 requests, 14 replies.
Home is used to serialize requests.
  The order determined by the home is final.
  No node should violate this order.
Example: read-invalidate races
  P0, P1, and P2 are trying to access a cache block.
  P0 and P2 want to read while P1 wants to write.
  The requests from P0 and P2 reach home first; home replies and marks both in the sharer vector, but the reply message to P0 gets delayed in the network.
  P1’s write causes home to send out invalidations to P0 and P2; P0’s invalidation reaches P0 before the read reply.
  P0’s hub sends an acknowledgment to P1 and also forwards the invalidation to P0’s processor cache.
  What happens when P0’s reply arrives? Can the data be used?
Requester’s viewpoint
  When a read reply arrives, it finds that the OTT entry has the “inv” bit set.
  Under what conditions can this happen? One case was seen on the last slide.
  Can replacement hints help?
  What about upgrade-invalidation races?
  What about readX-invalidation races?
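The requester-side view of the read-invalidate race can be sketched as a small model. This is only an illustration, not the Origin 2000 hub logic: the names `OTTEntry`, `RequesterHub`, and `inv_seen` are hypothetical, and the policy applied when the late reply arrives (use the value for the one pending load, but do not install it in the cache) is an assumption chosen because the home ordered the read before the write.

```python
from dataclasses import dataclass


@dataclass
class OTTEntry:
    """Hypothetical outstanding-transaction-table entry for one pending read."""
    addr: int
    inv_seen: bool = False  # models the "inv" bit mentioned on the slide


class RequesterHub:
    """Toy model of a requesting node; not the real Origin 2000 hub."""

    def __init__(self):
        self.ott = {}    # addr -> OTTEntry for reads still in flight
        self.cache = {}  # addr -> data for blocks currently cached

    def issue_read(self, addr):
        self.ott[addr] = OTTEntry(addr)

    def on_invalidation(self, addr):
        # If a read to this block is still pending, remember that the
        # invalidation overtook the read reply in the network.
        if addr in self.ott:
            self.ott[addr].inv_seen = True
        self.cache.pop(addr, None)  # drop any cached copy
        return "ack"                # acknowledgment goes to the writer

    def on_read_reply(self, addr, data):
        entry = self.ott.pop(addr)
        if entry.inv_seen:
            # ASSUMED policy: the value is legal for the pending load
            # (the home serialized the read first) but must not be cached.
            return data, False
        self.cache[addr] = data
        return data, True
```

With this model, a reply that arrives after its invalidation returns `(data, False)` and leaves the cache empty, while an undisturbed read returns `(data, True)` and installs the block.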
Origin 2000 has only two virtual networks, but has three-hop transactions.
  Resorts to back-off invalidate or intervention to fall back to strict request-reply.
  Does it really solve the problem or just move the problem elsewhere?
Stanford DASH has the same problems.
  Uses NACKs after a time-out period if the outgoing network doesn’t free up.
  Worse than Origin because NACKs inflate total traffic and may lead to livelock.
  DASH avoids livelocks by sizing the queues according to the machine size (not a scalable solution).
Observation: the total number of cache blocks in all processors is far less than the total number of memory blocks.
  Assume a 32 MB L3 cache and 4 GB memory: less than 1% of directory entries are active at any point in time.
Idea is to organize the directory as a highly associative cache.
  On a directory entry “eviction”, send invalidations to all sharers or retrieve the line if dirty.
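The sparse-directory idea can be sketched as a set-associative table of directory entries. This is a minimal illustration under stated assumptions: the class name `SparseDirectory`, the LRU replacement policy, and the modulo set index are all hypothetical choices, and the dirty-line case (retrieving the line from its owner) is left out. The first line reproduces the slide's arithmetic: 32 MB of cache against 4 GB of memory is 1/128, or about 0.78%.

```python
from collections import OrderedDict

# Slide's example: fraction of memory blocks that can be cached at once.
active_fraction = (32 * 2**20) / (4 * 2**30)  # 1/128, i.e. under 1%


class SparseDirectory:
    """Toy directory organized as a set-associative cache of entries."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set: block -> set of sharers, oldest first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def add_sharer(self, block, requester):
        """Record a sharer; return (victim_block, sharers) pairs whose
        sharers must be invalidated because the entry was evicted."""
        entries = self.sets[block % self.num_sets]  # assumed index function
        invalidations = []
        if block in entries:
            entries.move_to_end(block)              # mark most recently used
        else:
            if len(entries) >= self.ways:
                # ASSUMED policy: evict the least recently used entry.
                victim, sharers = entries.popitem(last=False)
                invalidations.append((victim, sorted(sharers)))
            entries[block] = set()
        entries[block].add(requester)
        return invalidations
```

A caller would send one invalidation per sharer in each returned pair before reusing the entry, which is the cost that high associativity is meant to keep rare.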
Remote access cache (RAC): essentially a large tertiary cache.
  Captures remote cache blocks evicted from the local cache hierarchy.
  Also visible to the coherence protocol, so inclusion must be maintained with processor caches.
  Must be highly associative and larger than the outermost level of cache.
  Usually part of DRAM is reserved for the RAC.
For multiprocessor nodes, requests from different processors to the same cache block can be merged together; there is also a prefetching effect.
Used in Stanford DASH.
Disadvantage: latency and space.
Cache-only memory architecture
  Solves the space problem of RAC.
  Home node only maintains the directory entries, but may not have the cache block in memory.
  A node requesting a cache block brings it to its local memory and local cache as usual.
  Entire memory is treated as a large tertiary cache, known as the attraction memory (AM).
  Home as well as any node having a cache block maintains a directory entry for the cache block.
  A request first looks up the AM directory state and, if unowned, gets forwarded to home, which in turn forwards it to one of the sharers.

Cache-only memory architecture
  To start with, home has the cache blocks.
  It retains a cache block until it is replaced by some other migration.
  There is always a master copy of each cache block: the last valid copy.
  What happens on a replacement of the master copy? Swap with the source of the migrating cache block.
  Latency problem remains at the requester.
  Inclusion problems between AM and processor cache hierarchy.
  Complicates the protocol.
Page migration changes the existing VA to PA mapping of the migrated page.
  Requires notifying all TLBs caching the old mapping.
  Introduces a TLB coherence problem.
Origin 2000 uses a smart page migration algorithm: it allows the page copy and TLB shootdown to proceed in parallel.
  Array of 64 page reference counters per directory entry to decide whether to migrate a page or not: compare requester’s counter against home’s and send an interrupt to home if migration is required.
What does the interrupt handler do?
  Access all directory entries of the lines belonging to the to-be-migrated page.
  Send invalidations to sharers or interventions to owners; at the end all cache lines of that page must be in memory.
  Set the poison bits in the directory entries of all the cache lines of the page.
  Start a block transfer of the page from home to requester at this point (30 μs to copy 16 KB).
  An access to a poisoned cache line from a node results in a bus error, which invalidates the TLB entry for that page in the requesting node (avoids broadcast shootdown).
  Until the page is completely migrated and is assigned a physical page frame on the target node, all nodes accessing a poisoned line wait in a pending queue.
  After the page copy is completed, the waiting nodes are served one by one; however, the directory entries and the page itself are moved to a “poisoned list” and are not yet freed at the home (i.e. you still cannot use that physical page frame).
  On every scheduler tick the kernel invalidates one TLB entry per processor.
  After a time equal to the number of TLB entries per processor multiplied by the scheduling quantum, the page frame is marked free and is removed from the poisoned list.
Major advantage: requesting nodes only see the page copy latency, including invalidations and interventions, in the critical path, but not the TLB shootdown latency.
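The migration decision, the first part of the interrupt handler, and the poisoned-list delay can be sketched as follows. This is a rough model, not the Origin 2000 kernel code: `DirEntry`, `should_migrate`, `prepare_page_for_migration`, and `poisoned_list_residency_ms` are hypothetical names, the `threshold` parameter is an assumption (the slides only say the requester's counter is compared against the home's), and the numbers in the usage note are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DirEntry:
    """Toy directory entry for one cache line of the page."""
    sharers: set = field(default_factory=set)
    owner: Optional[int] = None  # node holding the line exclusively, if any
    poisoned: bool = False


def should_migrate(counters, home, requester, threshold):
    """Compare the requester's reference counter against the home's.
    The fixed threshold is an ASSUMPTION, not taken from the slides."""
    return counters[requester] - counters[home] > threshold


def prepare_page_for_migration(entries):
    """Walk every directory entry of the page, list the messages that
    bring all lines back to memory, and set the poison bits."""
    messages = []
    for line, entry in enumerate(entries):
        if entry.owner is not None:
            messages.append(("intervention", entry.owner, line))
            entry.owner = None
        for node in sorted(entry.sharers):
            messages.append(("invalidate", node, line))
        entry.sharers.clear()
        entry.poisoned = True
    return messages


def poisoned_list_residency_ms(tlb_entries, quantum_ms):
    """One TLB entry per processor is invalidated per scheduler tick, so
    every stale mapping is gone after tlb_entries ticks."""
    return tlb_entries * quantum_ms
```

For example, with an assumed 64 TLB entries per processor and a 10 ms scheduling quantum, `poisoned_list_residency_ms(64, 10)` gives 640 ms before the old page frame can be reused, none of which is on the requester's critical path.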
Stanford DASH
  Memory controller recognizes lock accesses.
  Requires changes in compiler and instruction set.
  Marks the directory entry with contenders.
  On unlock a contender is chosen and the lock is granted to that node.
  Unlock is forced to generate a notification message to home.
  Possibly requires a special cache state for lock variables, or special uncached instructions for unlock if lock variables are not allowed to be cached.
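The directory-side bookkeeping for such a lock can be sketched as below. This is a toy model, not the DASH hardware: `LockDirectoryEntry` is a hypothetical name, and granting the lock in FIFO order is an assumption made for illustration, since the slides only say that "a contender is chosen".

```python
from collections import deque


class LockDirectoryEntry:
    """Toy model of a home directory entry that tracks lock contenders."""

    def __init__(self):
        self.holder = None         # node currently holding the lock
        self.contenders = deque()  # nodes marked as waiting in the entry

    def lock(self, node):
        """Return True if the lock is granted now, else mark a contender."""
        if self.holder is None:
            self.holder = node
            return True
        if node not in self.contenders:
            self.contenders.append(node)
        return False

    def unlock(self, node):
        """Handle the unlock notification that reaches the home node.
        Returns the node granted the lock next, or None if nobody waits."""
        if node != self.holder:
            raise ValueError("unlock from a node that does not hold the lock")
        # ASSUMED policy: grant in arrival order (FIFO).
        self.holder = self.contenders.popleft() if self.contenders else None
        return self.holder
```

Because the grant goes directly to one chosen contender, the other waiters do not all retry at once when the lock is released, which is why the unlock must be visible to the home rather than complete silently in the releasing node's cache.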