Scalable coordination

Required reading: Mellor-Crummey and Scott, Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, TOCS, Feb 1991.

Overview

High-end computers (and now many inexpensive ones) are often built with multiple CPUs sharing memory. Why? If you have more than one CPU's worth of work, it may be easiest to split that work among multiple CPUs if they share memory. Or perhaps it's cheaper to buy one computer with two CPUs than two separate computers.

Shared-memory multi-processors only make sense if they are scalable: the additional work an extra CPU gets done must outweigh the amount by which it slows down the rest of the system. To see why adding a CPU might slow the rest of the system down, let's look at multi-processor architecture.

A multi-processor typically consists of a set of CPUs, an interconnect, and a shared memory system. CPUs are cheap and fast and memory is cheap, so you might think you could easily build a fast multi-processor with a lot of CPUs and independent memory banks. However, interconnects are expensive, slow, and a shared resource. If N CPUs keep the interconnect busy, then perhaps N+1 CPUs will be no faster because there are no spare interconnect cycles. You can alleviate this by putting a separate cache next to each CPU, but then you have to worry about cache consistency -- that is, if CPU#1 reads x, then CPU#2 increments x, will CPU#1 see the new value or a stale cached value?

Thus multi-processors either don't cache or must implement a "cache-consistency" protocol to ensure that each CPU sees the latest data. The BBN Butterfly doesn't cache, and the Sequent Symmetry has cache consistency.

The BBN Butterfly has 256 CPUs. Each CPU has some local memory next to it (e.g. one megabyte); there is no separate memory system. Thus the CPUs have to talk to each other's memories. A switched interconnect lets any CPU exchange read and write messages with any memory, and the interconnect allows lots of parallelism. Thus if you start using a new CPU, maybe its memory use won't interfere with that of any other CPU. A local memory reference takes somewhat under a microsecond, a remote reference about four microseconds; this arrangement is called NUMA, for non-uniform memory access. Programmers have to carefully decide where to put data. Perhaps the high bits of the memory address determine which CPU's memory is used. There's no caching, so no need for cache consistency. Today we'd probably have caching, with a "directory" at each memory recording which CPUs have cached copies; a write would then trigger invalidate messages to those CPUs telling them to discard the cached data.

The Sequent Symmetry has up to 32 CPUs and six memory systems, all on the same bus. It's similar to common small-scale SMP machines today, so it's worth studying in a bit more detail.

  diagram: CPUs, caches, bus, memories
  each CPU's cache controller "snoops" every bus transaction
  each cache line:
    address
    value
    S shared (clean)
    X exclusive (dirty)
    I invalid
  invariant: addr X in exactly one cache
    or no X, same value in all S caches and mem
  CPU can read from its own cache if S or X
  CPU can write to its own cache (no bus) if X
  otherwise, CPU must broadcast request on bus
    if writing, our cache X, other caches to I
    if reading, our cache S, any cached X to S
      we fetch from X cache or memory
  cached reads/writes 100x faster than bus

If different CPUs use different data, each CPU's cache will automatically accumulate that CPU's data in the X state, and accesses will be fast. If CPUs share data but only read it, the data will be S in all caches, and also fast. If the caches are too small, the CPUs will have to use the bus to get to memory, and the bus may become a bottleneck. Finally, if different CPUs modify the same data, each write leads to a bus transaction to invalidate the other cached copies, and each subsequent read leads to a bus transaction because the local copy is I.
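
To make those rules concrete, here is a minimal sketch (in C, nothing like real hardware) of what one CPU's controller does for its own loads and stores. bus_fetch() and bus_invalidate() are made-up stand-ins for real bus transactions, and the snooping side -- demoting a line when another CPU's request appears on the bus -- is left out.

#include <stdio.h>

enum cstate { INVALID, SHARED, EXCLUSIVE };   /* I, S (clean), X (dirty) */

/* stand-ins for real bus transactions -- the expensive part */
static void bus_fetch(void)      { printf("bus: fetch line\n"); }
static void bus_invalidate(void) { printf("bus: invalidate other copies\n"); }

/* what this CPU's controller does for its own loads and stores */
enum cstate on_load(enum cstate st) {
  if (st == INVALID) {
    bus_fetch();        /* miss: one bus transaction, line becomes S */
    return SHARED;
  }
  return st;            /* S or X: hit in our own cache, no bus traffic */
}

enum cstate on_store(enum cstate st) {
  if (st != EXCLUSIVE)
    bus_invalidate();   /* one bus transaction: other caches' copies go to I */
  return EXCLUSIVE;     /* we now hold the only (dirty) copy */
}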

Our basic estimator of performance for locking algorithms will be the number of bus transactions, since they are so much slower than local cache reads/writes.

Scalable locks

This paper is about cost and scalability of locking; what if you have 10 CPUs waiting for the same lock? For example, what would happen if xv6 runs on an SMP with many processors?

What's the cost of a simple spinning acquire/release? Consider Algorithm 1 *without* the delays, which is like xv6's implementation of acquire and release (xv6 uses XCHG instead of test_and_set):

  each of the 10 CPUs gets the lock in turn
  meanwhile, remaining CPUs in XCHG on lock
  lock must be X in cache to run XCHG
    otherwise all might read, then all might write
  so bus is busy all the time with XCHGs!
    gets in the way of the lock-holder doing real work
  can we avoid constant XCHGs while lock is held?
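
Before moving on, here is a minimal sketch of that simple spinning acquire/release in C11 atomics; atomic_exchange plays the role of XCHG/test_and_set. This is the idea, not xv6's actual code.

#include <stdatomic.h>

struct spinlock {
  atomic_int locked;    /* 0 = free, 1 = held */
};

void acquire(struct spinlock *l) {
  /* every iteration is an atomic read-modify-write, so the lock's cache
     line must be X in our cache: one bus transaction per iteration */
  while (atomic_exchange(&l->locked, 1) != 0)
    ;
}

void release(struct spinlock *l) {
  atomic_store(&l->locked, 0);
}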

test-and-test-and-set

  only run expensive TSL if not locked
  spin on ordinary load, so cache line is S, so no bus xaction
  acquire(l)
    while(1){
      while(l->locked != 0) { }
      if(TSL(&l->locked) == 0)
        return;
    }

Suppose 10 CPUs are waiting; let's count the cost in total bus transactions:

  CPU1 gets the lock with one bus transaction
    sets the lock's cache line to I in the other CPUs
  the other 9 CPUs each use the bus once for a failed TSL
    then everyone has the line S, so they spin locally
  CPU1 releases the lock
  CPU2 gets the lock with one bus transaction
  8 CPUs each use the bus once...
  So 10 + 9 + 8 + ... + 1 = 55 transactions, O(n^2) in # of CPUs!
  Look at "test-and-test-and-set" in Figure 6

What's the minimum number of bus xactions we can reasonably hope for in order to have n CPUs each acquire a lock?

What is the point of the exponential backoff in Algorithm 1?

  What does Figure 6 say about its scalability?
  Why does it work well?
  How many bus transactions do we expect?
  Is there anything wrong with it?
  may not be fair
  exponential backoff may increase delay after release
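
Here is a minimal sketch of Algorithm 1's test_and_set-with-backoff, reusing the spinlock struct from the sketch above; the delay cap is an arbitrary constant of mine, not the paper's.

#include <stdatomic.h>

void acquire_backoff(struct spinlock *l) {
  unsigned delay = 1;
  while (atomic_exchange(&l->locked, 1) != 0) {
    /* stay off the bus for a while before retrying the test_and_set */
    for (volatile unsigned i = 0; i < delay; i++)
      ;                      /* local spinning only, no bus traffic */
    if (delay < (1u << 16))
      delay *= 2;            /* exponential growth, with a cap */
  }
}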

What's the point of the ticket locks, Algorithm 2?

  one interlocked instruction to get my ticket number
  then I spin on now_serving with ordinary load
  release() just increments now_serving
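
Here is a minimal sketch of a ticket lock in C11 atomics, using the paper's next_ticket/now_serving names but omitting Algorithm 2's proportional backoff.

#include <stdatomic.h>

struct ticketlock {
  atomic_uint next_ticket;   /* next ticket number to hand out */
  atomic_uint now_serving;   /* ticket that may hold the lock right now */
};

void ticket_acquire(struct ticketlock *l) {
  /* one interlocked instruction (fetch-and-increment) to take a ticket */
  unsigned me = atomic_fetch_add(&l->next_ticket, 1);
  /* then spin with ordinary loads until it's our turn */
  while (atomic_load(&l->now_serving) != me)
    ;
}

void ticket_release(struct ticketlock *l) {
  atomic_fetch_add(&l->now_serving, 1);   /* hand the lock to the next ticket */
}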

why is that good?

  + fair
  + no exponential backoff overshoot
  + no spinning on XCHG

but what's the cost, in bus transactions?

  while lock is held, now_serving is S in all caches
  release makes it I in all caches
  then each waiter uses a bus transaction to get the new value
  so still O(n^2)

So why does Figure 5 show that the ticket algorithm has basically O(n) cost?

Do we believe it's reasonable to predict lock hold times?

What's the point of the array-based queuing locks, Algorithm 3?

  a lock has an array of "slots"
  waiter allocates a slot, spins on that slot
  release wakes up just the next slot
  three bus transactions per locker:
    allocate slot, read "has_lock" from slot, write "has_lock" to next slot
  so O(n) bus transactions to get through n waiters: good!
  Anderson lines in Figures 4 and 6 are flat-ish
    they only go up because the lock data structures are protected by a simpler lock
  no delay(), so no requirement to predict lock hold times
  BUT O(n) space *per lock*!
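
Here is a minimal sketch of the array-based lock just described, in C11 atomics. NCPU and the field names are illustrative; a real implementation would pad each slot to its own cache line so waiters don't share lines.

#include <stdatomic.h>

#define NCPU 32   /* maximum number of simultaneous waiters */

struct arraylock {
  atomic_int has_lock[NCPU];   /* has_lock[i] != 0: slot i may enter */
  atomic_uint next_slot;       /* next slot to hand out */
};

void array_init(struct arraylock *l) {
  for (int i = 0; i < NCPU; i++)
    atomic_store(&l->has_lock[i], 0);
  atomic_store(&l->has_lock[0], 1);   /* the lock starts out free */
  atomic_store(&l->next_slot, 0);
}

unsigned array_acquire(struct arraylock *l) {
  /* one atomic increment to allocate a slot... */
  unsigned me = atomic_fetch_add(&l->next_slot, 1) % NCPU;
  /* ...then spin on that slot and nothing else */
  while (!atomic_load(&l->has_lock[me]))
    ;
  return me;                          /* caller hands this back to release */
}

void array_release(struct arraylock *l, unsigned me) {
  atomic_store(&l->has_lock[me], 0);               /* reset our slot for reuse */
  atomic_store(&l->has_lock[(me + 1) % NCPU], 1);  /* wake up just the next slot */
}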

Algorithm 5 (MCS), the new algorithm of the paper, uses compare_and_swap (CMPXCHG on x86):

int compare_and_swap(int *addr, int v1, int v2) {
  int ret = 0;
  // stop all memory activity and ignore interrupts
  if (*addr == v1) {
    *addr = v2;
    ret = 1;
  }
  // resume other memory activity and take interrupts
  return ret;
}

How does the MCS lock (Algorithm 5) work?

  one "qnode" per thread, used for whatever lock it's waiting for
  lock holder's qnode points to start of list
  lock variable points to end of list
  acquire() adds your qnode to end of list
    then you spin on your own qnode
  release() wakes up next qnode
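
Here is a minimal sketch of those steps in C11 atomics. As in the paper, acquire appends itself with an atomic swap (fetch_and_store) and release uses compare_and_swap to handle the no-successor case; the names are mine, not the paper's pseudocode.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct qnode {
  struct qnode *_Atomic next;   /* next waiter in line, if any */
  atomic_bool locked;           /* true while we still have to wait */
};

struct mcslock {
  struct qnode *_Atomic tail;   /* the lock variable: end of the waiter list */
};

void mcs_acquire(struct mcslock *l, struct qnode *me) {
  atomic_store(&me->next, NULL);
  atomic_store(&me->locked, true);
  /* atomically append our qnode to the end of the list */
  struct qnode *prev = atomic_exchange(&l->tail, me);
  if (prev == NULL)
    return;                          /* the lock was free: we now hold it */
  atomic_store(&prev->next, me);     /* link in behind our predecessor */
  while (atomic_load(&me->locked))   /* spin on our own qnode only */
    ;
}

void mcs_release(struct mcslock *l, struct qnode *me) {
  struct qnode *successor = atomic_load(&me->next);
  if (successor == NULL) {
    /* no visible successor: if the tail is still us, the lock becomes free */
    struct qnode *expected = me;
    if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
      return;
    /* a new waiter swapped the tail but hasn't linked in yet; wait for it */
    while ((successor = atomic_load(&me->next)) == NULL)
      ;
  }
  atomic_store(&successor->locked, false);   /* hand the lock to the next waiter */
}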

How much space per MCS lock?

How many bus transactions for n CPUs to acquire an MCS lock?

Does Figure 6 show that MCS is worthwhile?

Suppose there is no contention for a lock. Which of these algorithms will be fastest?

Remember the point of locks was to synchronize updates to shared data structures. Since the data are shared, and mutable, we have to worry about the same scalability issues all over again for the data that locks protect! This is critical in real life and leads to careful attention to how data is split over cache lines. Remember the L3 IPC paper.

Lock-free data structures

Perhaps you can avoid worrying about the scalability of locks by not using them, and instead using atomic hardware memory operations directly to implement operations on shared data structures. The easiest case is atomic add: it's faster to use an atomic add instruction directly than to acquire(), add, and release(), assuming your hardware has an atomic add instruction.
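
For example, a shared counter can be bumped with a single atomic read-modify-write; a minimal C11 sketch:

#include <stdatomic.h>

atomic_long counter;

void bump(void) {
  /* one atomic instruction; no acquire()/add/release() sequence */
  atomic_fetch_add(&counter, 1);
}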

MCS locks use atomic instructions to implement a linked list. We could instead directly implement a "lock-free" linked list insert.

A linked list with locks is as follows:

Lock list_lock;
element *list;

insert(int x) {
  element *n = new element;
  n->x = x;

  acquire(&list_lock);
  n->next = list;
  list = n;
  release(&list_lock);
}

A lock-free implementation is as follows:

insert(int x) {
  element *n = new element;
  n->x = x;
  do {
    n->next = list;
  } while (compare_and_swap(&list, n->next, n) == 0);
}

On the x86 you can implement compare_and_swap() with CMPXCHG.
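
For instance, with GCC or Clang one way is to let the compiler emit LOCK CMPXCHG through a builtin; the wrapper below just matches the pseudocode's name.

/* returns 1 if *addr was v1 and has been replaced by v2, 0 otherwise;
   on x86 the builtin compiles down to a LOCK CMPXCHG instruction */
int compare_and_swap(int *addr, int v1, int v2) {
  return __sync_bool_compare_and_swap(addr, v1, v2);
}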

How many bus transactions with 10 CPUs inserting one element in the list? Could you do better?

This paper by Fraser and Harris compares lock-based implementations versus corresponding non-blocking implementations of a number of data structures.

It is not possible to make every operation lock-free, and there are times when we will need an implementation of acquire and release. Research on lock-free data structures is active; the last word hasn't been said on this topic yet.