Required reading: Mellor-Crummey and Scott, Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991.
The Sequent Symmetry has up to 32 CPUs, all on the same bus, and uses a cache-coherence protocol to keep caches consistent. It's similar to what today's multi-core processors implement on a single chip; the modern cache-coherence protocol is like the Symmetry's, but has an additional state (exclusive).
why are we reading this paper?
multi-processor systems becoming quite common; hard to buy a single-core laptop
want to use multiple CPUs to improve performance
synchronization between CPUs often expensive (performance-wise) and tricky
scalability limitations of kernel-provided services passed onto all applications
let's briefly look back at the x86 memory system to understand what's slow..
back in the locking lecture, we had a fairly simple model of multiple CPUs
shared bus
memory connected to bus
to implement acquire, x86's xchg instruction locked the bus
provided atomicity for xchg
in reality things are a bit different
bus, memory quite slow compared to CPU speed
each CPU has a few levels of cache to compensate
each CPU directly attached to a part of the system's memory (AMD, recent Intels)
recent AMD chip:
cache latency ~10's of cycles
going over bus to DRAM or another CPU is ~500 cycles
need to make sure xchg still atomic with all these caches, and performs well
x86 uses caching system to provide fast atomic ops
what's in a cache line?
state, address, 64 bytes of data
states: Modified, Exclusive, Shared, Invalid [MESI]
compatibility of states between 2 CPUs:
              CPU1
            M  E  S  I
         M  -  -  -  +
  CPU2   E  -  -  -  +
         S  -  -  +  +
         I  +  +  +  +
how do the CPUs coordinate with each other?
each CPU listens on bus for address requests, looks up in cache, responds
"read address"
"read and invalidate address"
"invalidate address"
atomicity: hold cacheline while executing xchg, don't give it to other CPUs
Our basic estimator of performance for locking algorithms will be the number of bus transactions, since they are so much slower than local cache reads/writes.
This paper is about cost and scalability of locking; what if you have 10 CPUs waiting for the same lock? For example, what would happen if xv6 runs on an SMP with many processors?
What's the cost of a simple spinning acquire/release? Algorithm 1 *without* the delays, which is like xv6's implementation of acquire and release (xv6 uses XCHG instead of test_and_set):
each of the 10 CPUs gets the lock in turn
meanwhile, the remaining CPUs spin in XCHG on the lock
the lock's cache line must be held exclusively (M or E) to run XCHG
otherwise all might read, then all might write
so bus is busy all the time with XCHGs!
gets in the way of the lock-holder doing real work
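As a concrete sketch, here is roughly what that simple lock looks like in C, using GCC's __sync builtins as the XCHG-style atomic; names and details are illustrative, not xv6's exact code (xv6's real acquire also disables interrupts and adds barriers):

  // simple xchg-based spinlock, similar in spirit to xv6's acquire/release;
  // __sync_lock_test_and_set(&x, 1) compiles to an atomic XCHG on x86 and
  // returns the old value
  struct spinlock {
    volatile unsigned int locked;   // 0 = free, 1 = held
  };

  void spin_acquire(struct spinlock *l) {
    // every failed attempt is an atomic read-modify-write, so each waiting
    // CPU keeps generating bus traffic while the lock is held
    while (__sync_lock_test_and_set(&l->locked, 1) != 0)
      ;
  }

  void spin_release(struct spinlock *l) {
    __sync_lock_release(&l->locked);   // store 0 with release semantics
  }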
can we avoid constant XCHGs while lock is held?
test-and-test-and-set
only run expensive TSL if not locked
spin on ordinary load, so cache line is S, so no bus xaction
acquire(l)
{
  while(1){
    while(l->locked != 0) { }
    if(TSL(&l->locked) == 0)
      return;
  }
}
suppose 10 CPUs are waiting, let's count cost in total bus transactions
CPU1 gets lock with one bus transaction
sets lock's cache line to I in other CPUs
9 CPUs each use bus once in XCHG
then everyone has the line S, so they spin locally
CPU1 releases the lock
CPU2 gets the lock with one bus transaction
8 CPUs each use bus once...
So 10 + 9 + 8 + ... + 1 = 55 transactions, O(n^2) in # of CPUs!
Look at "test-and-test-and-set" in Figure 6
What's the minimum number of bus xactions we can reasonably hope for in order to have n CPUs each acquire a lock?
What is the point of the exponential backoff in Algorithm 1?
What does Figure 6 say about its scalability? Why does it work well?
  How many bus transactions do we expect?
  Is there anything wrong with it?
    may not be fair
    exponential backoff may increase delay after release
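A rough C sketch of the backoff idea in Algorithm 1; the starting delay, the cap, and the use of the x86 pause instruction are my choices for illustration, not the paper's constants:

  // test-and-set with exponential backoff: after each failed attempt,
  // wait roughly twice as long before trying again, so waiters stop
  // hammering the bus
  void acquire_backoff(volatile unsigned int *locked) {
    unsigned int delay = 1;                      // illustrative starting delay
    while (__sync_lock_test_and_set(locked, 1) != 0) {
      for (unsigned int i = 0; i < delay; i++)
        __asm__ __volatile__("pause");           // idle without bus traffic
      if (delay < 1024)                          // illustrative cap
        delay *= 2;
    }
  }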
What's the point of the ticket locks, Algorithm 2?
  one interlocked instruction to get my ticket number
  then I spin on now_serving with ordinary load
  release() just increments now_serving
why is that good?
  + fair
  + no exponential backoff overshoot
  + no spinning on XCHG
but what's the cost, in bus transactions?
  while lock is held, now_serving is S in all caches
  release makes it I in all caches
  then each waiter uses a bus transaction to get the new value
  so still O(n^2)
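A minimal ticket-lock sketch in C (the paper's Algorithm 2 also adds proportional backoff inside the spin loop, omitted here; names are mine):

  struct ticketlock {
    volatile unsigned int next_ticket;   // next ticket to hand out
    volatile unsigned int now_serving;   // ticket currently allowed in
  };

  void ticket_acquire(struct ticketlock *l) {
    // one interlocked instruction to take a ticket...
    unsigned int me = __sync_fetch_and_add(&l->next_ticket, 1);
    // ...then spin with ordinary loads; the line stays Shared until release
    while (l->now_serving != me)
      ;
  }

  void ticket_release(struct ticketlock *l) {
    l->now_serving++;    // invalidates the line in every waiter's cache
  }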
So why does Figure 5 show that the ticket algorithm has basically O(n) cost?
Do we believe it's reasonable to predict lock hold times?
What's the point of the array-based queuing locks, Algorithm 3?
a lock has an array of "slots"
waiter allocates a slot, spins on that slot
release wakes up just next slot
three bus transactions per locker:
allocate slot, read "has_lock" from slot, write "has_lock" to next slot
so O(n) bus transactions to get through n waiters: good!
Anderson lines in Figures 4 and 6 are flat-ish
they only go up because the lock data structures are protected by a simpler lock
no delay(), so no requirement to predict lock hold times
BUT O(n) space *per lock*!
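A sketch of an array-based queueing lock in the spirit of Algorithm 3; the slot count and names are assumptions, and a real version would pad each slot out to its own cache line:

  #include <string.h>

  #define NSLOT 32   // assumption: at least as many slots as contending CPUs

  // each waiter takes its own slot and spins only on that slot
  struct alock {
    volatile unsigned int has_lock[NSLOT];
    volatile unsigned int next_slot;
  };

  void alock_init(struct alock *l) {
    memset(l, 0, sizeof(*l));
    l->has_lock[0] = 1;            // slot 0 starts out owning the lock
  }

  unsigned int alock_acquire(struct alock *l) {
    unsigned int me = __sync_fetch_and_add(&l->next_slot, 1) % NSLOT;
    while (l->has_lock[me] == 0)   // spin on my own slot with ordinary loads
      ;
    l->has_lock[me] = 0;           // reset the slot for its next user
    return me;                     // caller hands this back to release
  }

  void alock_release(struct alock *l, unsigned int me) {
    l->has_lock[(me + 1) % NSLOT] = 1;   // wake exactly the next waiter
  }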
Algorithm 5 (MCS), the new algorithm of the paper, uses compare_and_swap (CMPXCHG on x86):
int compare_and_swap(addr, v1, v2) {
  int ret = 0;
  // stop all memory activity and ignore interrupts
  if (*addr == v1) {
    *addr = v2;
    ret = 1;
  }
  // resume other memory activity and take interrupts
  return ret;
}
How does the MCS lock (Algorithm 5) work?
one "qnode" per thread, used for whatever lock it's waiting for
lock holder's qnode is at the head of the list
lock variable points to end of list
acquire() adds your qnode to end of list
then you spin on your own qnode
release() wakes up next qnode
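A sketch of that list manipulation in C with GCC builtins; the names are mine, and real MCS code also needs memory barriers and the paper's handling of qnode ownership:

  // one qnode per thread; the lock word points to the tail of the queue
  struct qnode {
    struct qnode *volatile next;
    volatile int locked;
  };
  typedef struct qnode *mcs_lock;   // NULL when the lock is free

  void mcs_acquire(mcs_lock *lock, struct qnode *me) {
    me->next = 0;
    me->locked = 1;
    // atomically make myself the new tail; pred is the old tail (if any)
    struct qnode *pred = __sync_lock_test_and_set(lock, me);
    if (pred != 0) {
      pred->next = me;     // link in behind my predecessor
      while (me->locked)   // spin only on my own qnode
        ;
    }
  }

  void mcs_release(mcs_lock *lock, struct qnode *me) {
    if (me->next == 0) {
      // no visible successor: if I'm still the tail, mark the lock free
      if (__sync_bool_compare_and_swap(lock, me, (struct qnode *)0))
        return;
      while (me->next == 0)   // a successor is mid-enqueue; wait for the link
        ;
    }
    me->next->locked = 0;     // hand the lock to the next waiter
  }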
How much space per MCS lock?
How many bus transactions for n CPUs to acquire an MCS lock?
Does Figure 6 show that MCS is worthwhile?
Suppose there is no contention for a lock. Which of these algorithms will be fastest?
Remember the point of locks was to synchronize updates to shared data structures. Since the data are shared, and mutable, we have to worry about the same scalability issues all over again for the data that locks protect! This is critical in real life and leads to careful attention to how data is split over cache lines.
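For example, a hypothetical way to keep per-CPU data from false sharing is to pad it out to a full 64-byte cache line:

  // give each CPU's hot data its own cache line so one CPU's writes
  // don't invalidate the others' cached copies
  struct percpu_counter {
    volatile unsigned long count;
    char pad[64 - sizeof(unsigned long)];
  } __attribute__((aligned(64)));

  struct percpu_counter counters[32];   // e.g. one counter per CPU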