





These are the lecture slides of Parallel Computer Architecture, which include Conflict Resolution, Cache Miss, Write Serialization, In-Order Response, Multi-Level Caches, Dependence Graph, etc. Key points are: Scalable Multiprocessors, Basics of Scalability, Bandwidth Scaling, Agenda, Latency Scaling, Cost Scaling, Physical Scaling.






In some machines (e.g., SGI Origin 2000) uncached fetch & op is supported: every such instruction generates a transaction (this may be good or bad depending on the support in the memory controller; will discuss later).
Let us assume that the lock location is cacheable and is kept coherent.
Every invocation of test & set must generate a bus transaction. Why? What is the transaction? What are the possible states of the cache line holding lock_addr?
Therefore all lock contenders repeatedly generate bus transactions even if someone is still in the critical section and is holding the lock.
Can we improve this?

Test & set with backoff
Instead of retrying immediately, wait for a while.
How long to wait? Waiting too long may lead to long latency and lost opportunity.
Constant and variable backoff.
Special kind of variable backoff: exponential backoff (after the i-th attempt the delay is k*c^i, where k and c are constants).
Test & set with exponential backoff works pretty well.
      delay = k
Lock: ts    register, lock_addr
      bez   register, Enter_CS
      pause (delay)            /* Can be simulated as a timed loop */
      delay = delay*c
      j     Lock
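The backoff loop above can be sketched in C11 atomics. This is a minimal sketch, not the slides' implementation: the names `ts_lock_t`, `pause_for`, `BACKOFF_K`, `BACKOFF_C` and `BACKOFF_MAX` are hypothetical, `atomic_exchange` stands in for the `ts` instruction, and the cap on the delay is an added assumption to keep the wait bounded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical constants: k, c, and an assumed upper bound on the delay. */
enum { BACKOFF_K = 4, BACKOFF_C = 2, BACKOFF_MAX = 1024 };

typedef struct { atomic_bool held; } ts_lock_t;

/* Timed loop standing in for pause(delay). */
static void pause_for(unsigned delay) {
    for (volatile unsigned i = 0; i < delay; i++) { }
}

/* Acquire the lock; returns how many test & set attempts failed first. */
static unsigned ts_backoff_lock(ts_lock_t *l) {
    unsigned delay = BACKOFF_K;                 /* delay = k */
    unsigned failures = 0;
    /* atomic_exchange returns the old value: true means the lock was held. */
    while (atomic_exchange_explicit(&l->held, true, memory_order_acquire)) {
        failures++;
        pause_for(delay);
        if (delay < BACKOFF_MAX)
            delay *= BACKOFF_C;                 /* delay = delay*c, capped */
    }
    return failures;
}

/* Unlock is a simple store. */
static void ts_unlock(ts_lock_t *l) {
    assert(atomic_load_explicit(&l->held, memory_order_relaxed)); /* must be held */
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

Suitable values of k, c and the cap depend on the machine and on the critical-section length; the ones here are placeholders.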
Reduce traffic further: before trying test & set, make sure that the lock is free (test & test & set, TTS).
Lock: ts    register, lock_addr
      bez   register, Enter_CS
Test: lw    register, lock_addr
      bnez  register, Test
      j     Lock
How good is it?
In a cacheable lock environment the Test loop will execute from cache until it receives an invalidation (due to the store in unlock); at this point the load may return a zero value after fetching the cache line.
Only if the location is zero will everyone try test & set.
Recall that unlock is always a simple store.
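The test & test & set loop can be written with C11 atomics roughly as follows. This is a sketch under assumptions: `tts_lock_t`, `tts_lock` and `tts_unlock` are hypothetical names, `atomic_exchange` plays the role of `ts`, and the relaxed load plays the role of the `lw` in the Test loop.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_int locked; } tts_lock_t;   /* 0 = free, 1 = held */

static void tts_lock(tts_lock_t *l) {
    for (;;) {
        /* Lock: test & set; an old value of 0 means we now hold the lock. */
        if (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0)
            return;
        /* Test: spin with plain loads, which hit in the local cache until
           the unlocking store invalidates the line. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0) { }
    }
}

/* Unlock is a simple store. */
static void tts_unlock(tts_lock_t *l) {
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

As in the assembly, the first attempt goes straight to test & set, so an uncontended acquire costs a single atomic operation.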
Low latency: if there is no contender the lock should be acquired fast.
Low traffic: worst-case lock acquire traffic should be low; otherwise it may affect unrelated transactions.
Scalability: traffic and latency should scale slowly with the number of processors.
Low storage cost: maintaining lock states should not impose unrealistic memory overhead.
Fairness: ideally processors should enter CS according to the order of lock request (TS or TTS does not guarantee this).
Ticket lock: similar to the Bakery algorithm but simpler.
A nice application of fetch & inc.
The basic idea is to come and take a unique ticket and wait until your turn comes.
The Bakery algorithm failed to offer this uniqueness, thereby increasing complexity.
Shared: ticket = 0, release_count = 0;

Lock:   fetch & inc reg1, ticket_addr
Wait:   lw    reg2, release_count_addr   /* while (release_count != ticket); */
        sub   reg3, reg2, reg1
        bnez  reg3, Wait

Unlock: addi  reg2, reg2, 0x1            /* release_count++ */
        sw    reg2, release_count_addr
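The same ticket scheme can be sketched in C11, with `atomic_fetch_add` standing in for fetch & inc. The type and function names (`ticket_lock_t`, `ticket_lock`, `ticket_unlock`) are hypothetical, and the chosen memory orders are an assumption of this sketch rather than something the slides specify.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint ticket;          /* next ticket to hand out */
    atomic_uint release_count;   /* ticket currently allowed into the CS */
} ticket_lock_t;

/* Take a ticket and wait for our turn; returns the ticket drawn. */
static unsigned ticket_lock(ticket_lock_t *l) {
    unsigned my = atomic_fetch_add_explicit(&l->ticket, 1, memory_order_relaxed);
    /* while (release_count != ticket); */
    while (atomic_load_explicit(&l->release_count, memory_order_acquire) != my) { }
    return my;
}

/* release_count++ ; only the lock holder writes it, so a plain
   load followed by a store is enough. */
static void ticket_unlock(ticket_lock_t *l) {
    unsigned cur = atomic_load_explicit(&l->release_count, memory_order_relaxed);
    /* If the lock is held, at least one ticket is outstanding. */
    assert(cur != atomic_load_explicit(&l->ticket, memory_order_relaxed));
    atomic_store_explicit(&l->release_count, cur + 1, memory_order_release);
}
```

Because tickets are handed out in arrival order and served in increasing order, waiters enter the critical section FIFO.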
Initial fetch & inc generates O(P) traffic on bus-based machines (may be worse in DSM depending on the implementation of fetch & inc).
But the waiting algorithm still suffers from 0.5P^2 messages asymptotically.
Researchers have proposed proportional backoff, i.e., in the wait loop put a delay proportional to the difference between the ticket value and the last read release_count.
Latency- and storage-wise better than Bakery.
Traffic-wise better than TTS and Bakery (I leave it to you to analyze the traffic of Bakery).
Guaranteed fairness: the ticket value induces a FIFO queue.
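One way to see where the 0.5P^2 figure comes from, under stated assumptions: P processors request the lock together, the protocol is invalidation-based, and each store to release_count costs every remaining waiter one read miss (the releaser's own transaction is ignored). After the first holder releases, P-1 waiters re-fetch the line; after the second, P-2; and so on:

```latex
\[
\underbrace{(P-1) + (P-2) + \cdots + 1}_{\text{read misses after each release}}
  \;=\; \sum_{j=1}^{P-1} (P - j)
  \;=\; \frac{P(P-1)}{2}
  \;\approx\; 0.5\,P^{2}
\]
```

Proportional backoff attacks exactly this term: a waiter that is far from the head of the queue re-reads release_count less often.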
Array-based lock: solves the O(P^2) traffic problem.
The idea is to have a bit vector (essentially a character array if a boolean type is not supported).
Each processor comes and takes the next free index into the array via fetch & inc.
Then each processor loops on its index location until it becomes set.
On unlock a processor is responsible for setting the next index location if someone is waiting.
Initial fetch & inc still needs O(P) traffic, but the wait loop now needs O(1) traffic.
Disadvantage: storage overhead is O(P).
Performance concerns:
  Avoid false sharing: allocate each array location on a different cache line.
  Assume a cache line size of 128 bytes and a character array: allocate an array of size 128P bytes and use every 128th position in the array.
  For distributed shared memory the location a processor loops on may not be in its local memory: on acquire it must take a remote miss. Allocate P pages and let each processor loop on one bit in a page? Too much wastage; better solution: MCS lock (Mellor-Crummey & Scott).
Correctness concerns:
  Make sure to handle corner cases such as determining, while unlocking, if someone is waiting on the next location (this must be an atomic operation).
  Remember to reset your index location to zero while unlocking.
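A minimal C11 sketch of an array-based lock follows. Note that it swaps in a common simplification: on unlock it always sets the next slot instead of first checking whether someone is waiting, so the next arrival simply finds its slot already set. That avoids the atomic "is anyone waiting" check described above, but it is a variant, not necessarily the scheme the slides have in mind. `NPROC`, `LINE` and all function names are hypothetical; the sketch assumes at most `NPROC` simultaneous contenders and that `NPROC` is a power of two, so the index stays consistent when the counter wraps.

```c
#include <assert.h>
#include <stdatomic.h>

enum { NPROC = 4, LINE = 128 };  /* assumed: P = 4, 128-byte cache lines */

typedef struct {
    /* 128*P bytes; only slots[i][0] is used, so each flag has its own line. */
    _Alignas(LINE) atomic_char slots[NPROC][LINE];
    atomic_uint next;            /* next free index, taken via fetch & inc */
} array_lock_t;

static void array_lock_init(array_lock_t *l) {
    for (int i = 0; i < NPROC; i++)
        atomic_init(&l->slots[i][0], 0);
    atomic_init(&l->slots[0][0], 1);   /* the first arrival may enter */
    atomic_init(&l->next, 0);
}

/* Returns the slot index, which the caller passes back on release. */
static unsigned array_lock_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add_explicit(&l->next, 1,
                                            memory_order_relaxed) % NPROC;
    /* Loop on our own location until it becomes set. */
    while (!atomic_load_explicit(&l->slots[me][0], memory_order_acquire)) { }
    return me;
}

static void array_lock_release(array_lock_t *l, unsigned me) {
    assert(me < NPROC);
    /* Reset our own location to zero, then hand over to the next index. */
    atomic_store_explicit(&l->slots[me][0], 0, memory_order_relaxed);
    atomic_store_explicit(&l->slots[(me + 1) % NPROC][0], 1,
                          memory_order_release);
}
```

Each waiter spins on a different cache line, so a release invalidates only the line of the next waiter, which is where the O(1) wait-loop traffic comes from.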
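Compare & swap: compare with r1, swap r2 and memory location (here we keep on trying until the comparison passes).

Try: LL    r3, addr
     sub   r4, r3, r1      /* compare loaded value with r1 */
     bnez  r4, Try
     add   r4, r2, r0      /* r4 = r2 (value to store) */
     SC    r4, addr
     beqz  r4, Try         /* SC failed: retry */
     add   r2, r3, r0      /* r2 = old memory value */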
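In portable C11 the same retry-until-it-passes behaviour can be sketched with `atomic_compare_exchange_weak`, which is allowed to fail spuriously much like an SC that loses its link. The wrapper name `cas_until_match` is hypothetical; whether the compiler lowers it to an LL/SC loop depends on the target.

```c
#include <assert.h>
#include <stdatomic.h>

/* Keep trying until *addr == expected, then store desired.
   Returns the old memory value (the counterpart of r2 after the swap). */
static int cas_until_match(atomic_int *addr, int expected, int desired) {
    int seen = expected;
    while (!atomic_compare_exchange_weak(addr, &seen, desired)) {
        /* On failure, seen was overwritten with the current value
           (or the failure was spurious); reset it and retry. */
        seen = expected;
    }
    assert(seen == expected);   /* success implies the comparison passed */
    return seen;
}
```

For example, with `atomic_int x = 5;`, the call `cas_until_match(&x, 5, 9)` returns 5 and leaves 9 in `x`. Like the assembly, it spins for as long as the location holds some other value.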
Execution of SC in an OOO pipeline
Rather subtle.
For now assume that SC issues only when it comes to the head of the ROB, i.e., non-speculative execution of SC.
It first checks the load_linked bit; if the bit is reset, SC doesn't even access the cache (saves cache bandwidth and unnecessary bus transactions) and returns zero in the register.
If the load_linked bit is set, it accesses the cache and issues a bus transaction if needed (BusReadX if the cache line is in I state and BusUpgr if in S state).
It checks the load_linked bit again before writing to the cache (note that the cache line goes to M state in any case).
It can wake up dependents only when SC graduates (a case where a store initiates a dependence chain).
What happens if SC is issued speculatively?
The actual store happens only when it graduates; issuing a store early only starts the write permission process.
Suppose two processors are contending for a lock.
Both do LL and succeed because nobody is in CS.
Both issue SC speculatively and, for some reason, the graduation of SC in both of them gets delayed.
So although initially both may get the line one after another in M state in their caches, the load_linked bit will get reset in both by the time SC tries to graduate.
They go back and start over with LL and may issue SC again speculatively, leading to a livelock (the probability of this type of livelock increases with more processors).
Speculative issue of SC with hardwired backoff may help.
Better to turn off speculation for SC.

What about the branch following SC? Can we speculate past that branch?
Assume that the branch predictor tells you that the branch is not taken, i.e., fall through: we speculatively venture into the critical section.
We speculatively execute the critical section.
This may be good and bad.
If the branch prediction was correct we did great.
If the predictor went wrong, we might have interfered with the execution of the processor that is actually in CS: this may cause unnecessary invalidations and extra traffic.
Any correctness issues?
Normally done in software with flags
P0: A = 1; flag = 1;
P1: while (!flag); print A;
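In C11 the flag in this pattern has to be an atomic object, otherwise the spin loop is a data race and the compiler may hoist the load out of the loop. The sketch below uses release/acquire ordering so that P1 is guaranteed to see `A = 1` once it sees the flag; the function names `producer` and `consumer` are hypothetical, and the choice of memory orders is an assumption of this sketch, not something stated in the slides.

```c
#include <assert.h>
#include <stdatomic.h>

static int A;              /* ordinary shared data */
static atomic_int flag;    /* 0 = not ready, 1 = ready */

/* P0: write the data, then raise the flag. */
static void producer(void) {
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* P1: wait for the flag, then read the data; returns what it would print. */
static int consumer(void) {
    while (!atomic_load_explicit(&flag, memory_order_acquire)) { }
    int seen = A;
    assert(seen == 1);     /* guaranteed by the release/acquire pairing */
    return seen;
}
```

In a real program `producer` and `consumer` would run on different threads; calling `consumer` before `producer` from a single thread would spin forever.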
Some old machines supported full/empty bits in memory