Scalable Locking Primitives - Parallel Computer Architecture - Lecture Slides, Slides of Computer Science

These are the lecture slides of Parallel Computer Architecture, which includes Conflict Resolution, Cache Miss, Write Serialization, In-Order Response, Multi-Level Caches, Dependence Graph, etc. Key points are: Scalable Multiprocessors, Basics of Scalability, Bandwidth Scaling, Agenda, Latency Scaling, Cost Scaling, Physical Scaling.

Typology: Slides

2012/2013

Uploaded on 03/28/2013

ekana
Module 11: "Synchronization"
Lecture 22: "Scalable Locking Primitives"
Traffic of test & set
Backoff test & set
Test & test & set
TTS traffic analysis
Goals of a lock algorithm
Ticket lock
Array-based lock
RISC processors
LL/SC
Locks with LL/SC
Fetch & op with LL/SC
Store conditional & OOO
Speculative SC?
Point-to-point synch.

Traffic of test & set

Some machines (e.g., SGI Origin 2000) support uncached fetch & op. Every such instruction generates a transaction, which may be good or bad depending on the support in the memory controller (discussed later). Let us assume instead that the lock location is cacheable and is kept coherent. Every invocation of test & set must generate a bus transaction. Why? What is the transaction? What are the possible states of the cache line holding lock_addr? As a result, all lock contenders repeatedly generate bus transactions even while someone is still in the critical section holding the lock. Can we improve this? One answer is test & set with backoff.

Backoff test & set

Instead of retrying immediately, wait for a while. How long to wait? Waiting too long may lead to long latency and lost opportunity. The backoff can be constant or variable. A special kind of variable backoff is exponential backoff: after the i-th attempt the delay is k*c^i, where k and c are constants. Test & set with exponential backoff works pretty well.

      delay = k
Lock: ts register, lock_addr
      bez register, Enter_CS
      pause (delay)            /* Can be simulated as a timed loop */
      delay = delay*c
      j Lock
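The loop above can be sketched in portable C11, assuming `atomic_exchange` stands in for the `ts` instruction and a timed loop stands in for `pause`. The type `backoff_lock`, the constants `BACKOFF_K`, `BACKOFF_C` and `BACKOFF_MAX`, and the cap on the delay are illustrative assumptions, not part of the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical tuning constants: k, c, and an upper bound on the delay. */
enum { BACKOFF_K = 4, BACKOFF_C = 2, BACKOFF_MAX = 1024 };

typedef struct { atomic_bool held; } backoff_lock;   /* false means free */

void backoff_lock_init(backoff_lock *l) {
    atomic_init(&l->held, false);
}

/* Timed loop standing in for pause(delay). */
static void spin_delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) { }
}

/* Acquires the lock and returns how many test & set attempts were needed. */
unsigned backoff_lock_acquire(backoff_lock *l) {
    unsigned delay = BACKOFF_K;
    unsigned attempts = 1;
    /* atomic_exchange writes true and returns the old value, like ts. */
    while (atomic_exchange_explicit(&l->held, true, memory_order_acquire)) {
        spin_delay(delay);
        if (delay < BACKOFF_MAX)
            delay *= BACKOFF_C;          /* delay = delay*c */
        attempts++;
    }
    return attempts;
}

/* Unlock is a simple store. */
void backoff_lock_release(backoff_lock *l) {
    assert(atomic_load_explicit(&l->held, memory_order_relaxed));
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

With no contention the first exchange succeeds and no delay is paid, so the uncontended latency stays that of a plain test & set.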

Test & test & set

Reduce traffic further: before trying test & set, make sure that the lock is free.

Lock: ts register, lock_addr
      bez register, Enter_CS
Test: lw register, lock_addr
      bnez register, Test
      j Lock

How good is it? In a cacheable lock environment the Test loop executes from the cache until it receives an invalidation (due to the store in unlock). At that point the load may return a zero value after fetching the cache line. Only if the location is zero does everyone try test & set.
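A minimal C11 sketch of the same Lock/Test structure follows, assuming `atomic_exchange` plays the role of `ts` and a relaxed atomic load plays the role of `lw`. The names `tts_lock`, `tts_lock_try_acquire` and friends are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool held; } tts_lock;   /* false means free */

void tts_lock_init(tts_lock *l) {
    atomic_init(&l->held, false);
}

/* One non-blocking attempt: test first, and only then test & set. */
bool tts_lock_try_acquire(tts_lock *l) {
    if (atomic_load_explicit(&l->held, memory_order_relaxed))
        return false;                    /* lock looks busy: no write issued */
    return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

void tts_lock_acquire(tts_lock *l) {
    for (;;) {
        /* Lock: test & set; an old value of false means we got the lock. */
        if (!atomic_exchange_explicit(&l->held, true, memory_order_acquire))
            return;
        /* Test: spin on plain loads, which hit in the local cache until
           the unlocking store invalidates the line. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed)) { }
    }
}

/* Unlock is a simple store. */
void tts_lock_release(tts_lock *l) {
    assert(atomic_load_explicit(&l->held, memory_order_relaxed));
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

The inner loop only reads, so waiting processors generate no coherence traffic while the lock is held; the writes happen only after a waiter has seen the lock free.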

TTS traffic analysis

Recall that unlock is always a simple store

Goals of a lock algorithm

Low latency: if there is no contender, the lock should be acquired fast.
Low traffic: worst-case lock acquire traffic should be low; otherwise it may affect unrelated transactions.
Scalability: traffic and latency should scale slowly with the number of processors.
Low storage cost: maintaining lock states should not impose unrealistic memory overhead.
Fairness: ideally processors should enter the CS in the order of their lock requests (TS and TTS do not guarantee this).

Ticket lock

Similar to the Bakery algorithm, but simpler. A nice application of fetch & inc. The basic idea is to take a unique ticket on arrival and wait until your turn comes. The Bakery algorithm failed to offer this uniqueness, thereby increasing complexity.

Shared: ticket = 0, release_count = 0;

Lock:   fetch & inc reg1, ticket_addr
Wait:   lw reg2, release_count_addr    /* while (release_count != ticket); */
        sub reg3, reg2, reg1
        bnez reg3, Wait

Unlock: addi reg2, reg2, 0x1           /* release_count++ */
        sw reg2, release_count_addr

The initial fetch & inc generates O(P) traffic on bus-based machines (it may be worse in DSM, depending on the implementation of fetch & inc). But the waiting algorithm still suffers from about 0.5P^2 messages asymptotically: every release invalidates release_count in all waiters' caches, so each remaining waiter misses again, giving roughly (P-1) + (P-2) + ... + 1 = P(P-1)/2 misses over P releases. Researchers have proposed proportional backoff, i.e., in the wait loop put a delay proportional to the difference between the ticket value and the last read release_count. Latency-wise and storage-wise the ticket lock is better than Bakery. Traffic-wise it is better than TTS and Bakery (I leave it to you to analyze the traffic of Bakery). Fairness is guaranteed: the ticket value induces a FIFO queue.
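The ticket lock with proportional backoff might look as follows in C11, assuming `atomic_fetch_add` provides fetch & inc. The type `ticket_lock`, the helper `spin_delay` and the constant `BACKOFF_UNIT` are hypothetical names chosen for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical delay per waiter ahead of us, in timed-loop iterations. */
enum { BACKOFF_UNIT = 16 };

typedef struct {
    atomic_uint ticket;         /* next ticket to hand out */
    atomic_uint release_count;  /* ticket currently allowed into the CS */
} ticket_lock;

void ticket_lock_init(ticket_lock *l) {
    atomic_init(&l->ticket, 0);
    atomic_init(&l->release_count, 0);
}

static void spin_delay(unsigned iters) {
    for (volatile unsigned i = 0; i < iters; i++) { }
}

/* Takes a ticket, waits for its turn, and returns the ticket number. */
unsigned ticket_lock_acquire(ticket_lock *l) {
    unsigned my = atomic_fetch_add_explicit(&l->ticket, 1,
                                            memory_order_relaxed);
    for (;;) {
        unsigned now = atomic_load_explicit(&l->release_count,
                                            memory_order_acquire);
        if (now == my)
            return my;
        /* Proportional backoff: wait longer the further back we are.
           Unsigned subtraction stays correct across counter wraparound. */
        spin_delay((my - now) * BACKOFF_UNIT);
    }
}

/* Only the lock holder writes release_count, so load + store suffices. */
void ticket_lock_release(ticket_lock *l) {
    unsigned now = atomic_load_explicit(&l->release_count,
                                        memory_order_relaxed);
    assert(atomic_load_explicit(&l->ticket, memory_order_relaxed) != now);
    atomic_store_explicit(&l->release_count, now + 1, memory_order_release);
}
```

Tickets are served strictly in the order the fetch & inc operations completed, which is the FIFO property noted above.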

Array-based lock

Solves the O(P^2) traffic problem. The idea is to have a bit vector (essentially a character array if a boolean type is not supported). Each processor comes and takes the next free index into the array via fetch & inc. Then each processor loops on its index location until it becomes set. On unlock a processor is responsible for setting the next index location if someone is waiting. The initial fetch & inc still needs O(P) traffic, but the wait loop now needs O(1) traffic. Disadvantage: the storage overhead is O(P).

Performance concerns: Avoid false sharing by allocating each array location on a different cache line. Assuming a cache line size of 128 bytes and a character array, allocate an array of size 128P bytes and use every 128th position in the array. For distributed shared memory the location a processor loops on may not be in its local memory, so on acquire it must take a remote miss. Allocate P pages and let each processor loop on one bit in a page? That wastes too much memory; a better solution is the MCS lock (Mellor-Crummey & Scott).

Correctness concerns: Make sure to handle corner cases such as determining, while unlocking, whether someone is waiting on the next location (this must be an atomic operation). Remember to reset your index location to zero while unlocking.
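A sketch of an array-based lock in C11 follows, under stated assumptions: `MAX_PROCS` (a power of two bounding the number of simultaneous contenders) and `LINE_BYTES` are illustrative constants, and all names are hypothetical. It uses the common simplification of setting the next slot unconditionally on unlock, which sidesteps the "is someone waiting" check rather than implementing it.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed limits: at most MAX_PROCS contenders, 128-byte cache lines. */
enum { MAX_PROCS = 8, LINE_BYTES = 128 };

/* One flag per cache line, so waiters never share a line. */
typedef struct {
    _Alignas(LINE_BYTES) atomic_uchar can_enter;
} padded_flag;

typedef struct {
    padded_flag slot[MAX_PROCS];
    atomic_uint next_index;     /* handed out with fetch & inc */
} array_lock;

void array_lock_init(array_lock *l) {
    for (unsigned i = 0; i < MAX_PROCS; i++)
        atomic_init(&l->slot[i].can_enter, i == 0);  /* slot 0 starts open */
    atomic_init(&l->next_index, 0);
}

/* Takes the next index, spins on that slot only, and returns the index. */
unsigned array_lock_acquire(array_lock *l) {
    unsigned my = atomic_fetch_add_explicit(&l->next_index, 1,
                                            memory_order_relaxed) % MAX_PROCS;
    while (!atomic_load_explicit(&l->slot[my].can_enter,
                                 memory_order_acquire)) { }
    return my;
}

/* Resets our own slot, then opens the next one. */
void array_lock_release(array_lock *l, unsigned my) {
    assert(my < MAX_PROCS);
    atomic_store_explicit(&l->slot[my].can_enter, 0, memory_order_relaxed);
    atomic_store_explicit(&l->slot[(my + 1) % MAX_PROCS].can_enter, 1,
                          memory_order_release);
}
```

Each release writes a line that at most one waiter is spinning on, which is where the O(1) wait-loop traffic comes from; the price is `MAX_PROCS * LINE_BYTES` bytes of storage per lock.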

Compare & swap: compare the memory location with r1 and, if they are equal, swap r2 with the memory location (here we keep retrying until the comparison passes).

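Try: LL r3, addr        /* load-linked the current value */
     sub r4, r3, r1     /* compare with r1 */
     bnez r4, Try       /* retry until the comparison passes */
     add r4, r2, r0     /* copy r2, since SC overwrites its register */
     SC r4, addr        /* try to store the new value */
     beqz r4, Try       /* SC failed: start over */
     add r2, r3, r0     /* r2 receives the old memory value */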
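In C11 the same retry pattern can be written with `atomic_compare_exchange_weak`, which is allowed to fail spuriously much like SC and is typically compiled to an LL/SC pair on machines that have one. The function names below, including the sample operation `double_it`, are hypothetical; the generic `fetch_and_op` is a sketch of how an arbitrary fetch & op can be built from the same loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Spins until *addr == expected, then atomically stores new_val.
   Returns the old memory value (equal to expected on return). */
int compare_and_swap_until(atomic_int *addr, int expected, int new_val) {
    int seen = expected;
    while (!atomic_compare_exchange_weak(addr, &seen, new_val))
        seen = expected;   /* a failed CAS overwrote seen: restore it */
    return seen;
}

/* Atomically replaces *addr with op(*addr) and returns the old value. */
int fetch_and_op(atomic_int *addr, int (*op)(int)) {
    assert(op != NULL);
    int old = atomic_load(addr);
    /* On failure old is refreshed with the current value and op(old)
       is recomputed, like going back to the LL. */
    while (!atomic_compare_exchange_weak(addr, &old, op(old))) { }
    return old;
}

/* Sample operation for illustration. */
int double_it(int v) { return 2 * v; }
```

For example, `fetch_and_op(&x, double_it)` on a location holding 5 leaves 10 in memory and returns 5.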

Store conditional & OOO

Execution of SC in an OOO pipeline is rather subtle. For now assume that SC issues only when it comes to the head of the ROB, i.e., non-speculative execution of SC. It first checks the load_linked bit; if the bit is reset, it does not even access the cache (saving cache bandwidth and unnecessary bus transactions) and returns zero in the register. If the load_linked bit is set, it accesses the cache and issues a bus transaction if needed (BusReadX if the cache line is in I state and BusUpgr if in S state). It checks the load_linked bit again before writing to the cache (note that the cache line goes to M state in any case). Dependents can be woken up only when SC graduates (a case where a store initiates a dependence chain).

Speculative SC?

What happens if SC is issued speculatively? The actual store happens only when it graduates; issuing a store early only starts the write permission process. Suppose two processors are contending for a lock. Both do LL and succeed because nobody is in the CS. Both issue SC speculatively, and for some reason the graduation of SC gets delayed in both. So although initially both may get the line, one after another, in M state in their caches, the load_linked bit will have been reset in both by the time SC tries to graduate. They go back and start over with LL and may issue SC again speculatively, leading to a livelock (the probability of this type of livelock increases with more processors). Speculative issue of SC with hardwired backoff may help, but it is better to turn off speculation for SC.

What about the branch following SC? Can we speculate past that branch? Assume that the branch predictor says the branch is not taken, i.e., fall through: we speculatively venture into the critical section and execute it speculatively. This may be good or bad. If the branch prediction was correct, we did great. If the predictor was wrong, we might have interfered with the execution of the processor that is actually in the CS, which may cause unnecessary invalidations and extra traffic. Any correctness issues?

Point-to-point synch.

Normally done in software with flags

P0: A = 1; flag = 1;
P1: while (!flag); print A;

Some old machines supported full/empty bits in memory
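In C11 the flag handshake above needs explicit ordering so that the write to A is visible before flag is observed as set. A minimal sketch, assuming a release store on the P0 side and an acquire load on the P1 side; the type `mailbox` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    int A;              /* ordinary data, published through flag */
    atomic_int flag;    /* 0 = not ready, 1 = ready */
} mailbox;

void mailbox_init(mailbox *m) {
    m->A = 0;
    atomic_init(&m->flag, 0);
}

/* P0: write the data, then set the flag with release ordering so the
   write to A cannot be reordered after the flag store. */
void producer(mailbox *m, int value) {
    assert(atomic_load_explicit(&m->flag, memory_order_relaxed) == 0);
    m->A = value;
    atomic_store_explicit(&m->flag, 1, memory_order_release);
}

/* P1: spin until the flag is set, then read the data. The acquire load
   guarantees the read of A sees the value written before the flag. */
int consumer(mailbox *m) {
    while (!atomic_load_explicit(&m->flag, memory_order_acquire)) { }
    return m->A;
}
```

In a real program `producer` and `consumer` would run on different threads; this sketch is one-shot, so the flag is never reset.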