Coordination

Required reading: Algorithms for scalable synchronization on shared-memory multiprocessors.

Overview

We revisit the topic of mutual-exclusion coordination: the techniques to protect variables that are shared among multiple threads. These techniques allow programmers to implement atomic sections so that one thread can safely update the shared variables without having to worry about another thread intervening.

The technique for mutual-exclusion coordination used in v6 is disabling/enabling interrupts in the kernel, through spl calls (e.g., spl0(), spl6()). In v6, the only program with multiple threads is the kernel, so the kernel is the only program that needs a mutual-exclusion mechanism. User-level processes don't share data directly.

We focus today on algorithms for implementing coordination on shared-memory multiprocessors. In this case, the kernel may share data structures among threads running on multiple processors, and disabling/enabling interrupts is insufficient for implementing atomic sections.

List and insert example:


struct List {
  int data;
  struct List *next;
};

List *list = 0;

void insert(int data) {
  List *l = new List;
  l->data = data;
  l->next = list;  // A
  list = l;        // B
}

void fn() {
  insert(100);
}

int main() {
  thread_create(..., fn);
  thread_create(..., fn);
  thread_schedule();
}

What needs to be atomic? The two statements labeled A and B should always be executed together, as an indivisible fragment of code. If two threads execute A and B interleaved, then we end up with an incorrect list. To see that this is the case, draw out the list after the sequence A1 (statement A executed by thread 1), A2 (statement A executed by thread 2), B2, and B1.
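
For example, starting with list == 0, and writing l1 and l2 for the elements allocated by threads 1 and 2:

  A1: l1->next = list;   // thread 1 reads list == 0
  A2: l2->next = list;   // thread 2 also reads list == 0
  B2: list = l2;         // the head now points at l2
  B1: list = l1;         // overwrites B2: the head points at l1, and l1->next == 0

Thread 2's element is no longer reachable from list; its insert has been lost.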

How could this erroneous sequence happen? If threads 1 and 2 are running on a multiprocessor, the variable list lives in physical memory shared among multiple processors, connected by a bus. The accesses to the shared memory will be ordered in some total order by the bus/memory system. If the programmer doesn't coordinate the execution of the statements A and B, any order can happen, including the erroneous one.

Atomic instructions and wait-free operations

The programmer must be able to express that A and B should be executed as a single atomic unit. We generally use a concept like locks to mark an atomic region, acquiring the lock at the beginning of the section and releasing it at the end:

void acquire(int *lock) {
  while (TSL(lock) != 0)
    ;  // spin until TSL returns 0: we observed the lock free and set it
}

void release(int *lock) {
  *lock = 0;
}

Acquire and release, of course, need to be atomic too, which can, for example, be done with a hardware atomic TSL (test-and-set-lock) instruction:

The semantics of TSL are, executed atomically:

   R <- [mem]   // load the old contents of mem into register R
   [mem] <- 1   // store 1 into mem
   return R     // 0 means the lock was free and the caller now holds it

In a hardware implementation, the bus arbiter guarantees that both the load and the store are executed without any other loads/stores coming in between.
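
As a concrete illustration (an assumption beyond these notes), TSL can be written in C with the GCC/Clang __sync_lock_test_and_set builtin, which on x86 compiles to an atomic exchange:

int TSL(int *mem) {
  // atomically: R = *mem; *mem = 1; return R
  return __sync_lock_test_and_set(mem, 1);
}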

We can use locks to implement an atomic insert, or we can use TSL directly:

int insert_lock = 0;

void insert(int data) {

  /* acquire the lock: */
  while(TSL(&insert_lock) != 0)
    ;

  /* critical section: */
  List *l = new List;
  l->data = data;
  l->next = list;
  list = l;

  /* release the lock: */
  insert_lock = 0;
}

The TSL insert is blocking: if T1 is pre-empted while executing insert and control passes to T2, then T2 cannot complete an insert until T1 runs again and releases the lock. Other atomic hardware operations allow one to build implementations of insert that don't block T2. Such versions are called wait-free. For example, using cmpxchg (atomic compare-and-exchange) we can write:

int cmpxchg(addr, v1, v2) {
  int ret = 0;
  // stop all memory activity and ignore interrupts
  if (*addr == v1) {
    *addr = v2;
    ret = 1;
  }
  // resume other memory activity and take interrupts
  return ret;
}

void insert(int data) {
  List *n = new List;
  n->data = data;
  do {
    n->next = list;                           // read the current head
  } while (cmpxchg(&list, n->next, n) == 0);  // retry if list changed under us
}
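
On hardware with a compare-and-swap instruction, cmpxchg need not be pseudocode; here is a sketch using the GCC/Clang builtin (the argument types are our assumption for this example):

int cmpxchg(List **addr, List *v1, List *v2) {
  // atomically: if (*addr == v1) { *addr = v2; return 1; } return 0
  return __sync_bool_compare_and_swap(addr, v1, v2);
}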

It is not possible to make every operation wait-free, and often we will need an implementation of acquire and release. The paper discusses how to implement acquire and release well.

The paper (thanks to mazieres for the notes)

Shared-memory machines are a bunch of CPUs sharing physical memory. Typically each processor also maintains a cache (for performance), which introduces the problem of keeping the caches coherent. If processor 1 writes a memory location whose value processor 2 has cached, then processor 2's cache must be updated in some way. How? Typically the hardware runs a cache-coherence protocol that invalidates or updates the cached copies on each write; that coherence traffic is itself a cost the paper's algorithms try to minimize.

With this minimal background on shared-memory multiprocessors, we can understand the algorithms for implementing acquire and release:

  • Algorithm 1: test_and_set spin lock
      + Uses single memory location for arbitrarily many threads
      - Lots of write traffic over memory bus, but less than our naive
        implementation of acquire above.
      - Not necessarily fair
    
  • Algorithm 2: ticket lock (sketched in C after this list)
      + Just two memory locations for arbitrarily many threads
      + Fair (FIFO)
      + Probes with read operations only (unlike t&s, which issues writes)
      - Still requires a lot of bandwidth:
          everyone reads now_serving all the time, and it is updated on each handoff
      + Can maybe estimate how long the wait is, to reduce reads, but that's hard:
          high penalty if you guess wrong, because of the strict FIFO order
      - Requires fetch_and_add
          if emulated with a test_and_set spin lock, it adds more memory traffic
    
  • Algorithm 3: array-based locks
      Look at Graunke and Thakkar's version (because Anderson's needs fetch_and_add)
        Each proc has a bool (call it mybool[vpid]) on its own cache line
        Global:  bool *who; bool locked_value;
        To lock:
          bool *prev = &mybool[vpid];
          bool val = mybool[vpid];
          atomic_swap: (prev, val) <-> (who, locked_value)
          while (*prev == val);   // spin until the previous holder flips its bool
        To unlock:
          mybool[vpid] = !mybool[vpid];
      + Fair (FIFO)
      - Requires space per lock proportional to the number of threads
          can't easily recycle mybool[vpid]; it must stay unchanged until the
          next holder releases
      + Less memory traffic than the ticket lock if you have a coherent cache
      - More memory traffic than the ticket lock if you don't have a cache,
          because you can't back off as accurately; there is no estimate of the wait
      + In NUMA, could arrange to have mybool[vpid] local to its proc
    
  • Algorithm 5: MCS list-based queuing lock (also sketched after this list)
      + Fair (almost FIFO; strictly FIFO if you have compare_and_swap)
      + Only spin on your own qnode, which can be in local memory
      + Small amount of network traffic
      + Requires constant space per lock (plus each thread has a qnode)
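
Here is the ticket lock sketched in C with the GCC/Clang __atomic builtins; the field and function names are ours, not the paper's:

typedef struct {
  unsigned next_ticket;   // each arriving thread takes the next number
  unsigned now_serving;   // number of the thread currently holding the lock
} ticket_lock;

void ticket_acquire(ticket_lock *lk) {
  // take a ticket with an atomic fetch_and_add
  unsigned me = __atomic_fetch_add(&lk->next_ticket, 1, __ATOMIC_RELAXED);
  // probe with reads only until our number comes up
  while (__atomic_load_n(&lk->now_serving, __ATOMIC_ACQUIRE) != me)
    ;
}

void ticket_release(ticket_lock *lk) {
  // only the holder writes now_serving, so no atomic read-modify-write is needed
  __atomic_store_n(&lk->now_serving, lk->now_serving + 1, __ATOMIC_RELEASE);
}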
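
And a sketch of the MCS lock, again with __atomic builtins; the structure follows the paper, but the names and memory orders here are our assumptions:

typedef struct qnode {
  struct qnode *next;   // our successor in the queue, if any
  int locked;           // the flag we spin on, in our own (local) memory
} qnode;

typedef struct {
  qnode *tail;          // last waiter in line; 0 means the lock is free
} mcs_lock;

void mcs_acquire(mcs_lock *lk, qnode *me) {
  me->next = 0;
  me->locked = 1;
  // fetch_and_store: atomically make ourselves the new tail
  qnode *prev = __atomic_exchange_n(&lk->tail, me, __ATOMIC_ACQ_REL);
  if (prev != 0) {
    prev->next = me;    // link in behind our predecessor
    while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
      ;                 // spin on our own qnode only
  }
}

void mcs_release(mcs_lock *lk, qnode *me) {
  if (me->next == 0) {
    // no visible successor: if tail is still us, swing it back to 0
    qnode *expected = me;
    if (__atomic_compare_exchange_n(&lk->tail, &expected, 0, 0,
                                    __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
      return;
    // otherwise a successor is between its exchange and its link; wait for it
    while (__atomic_load_n(&me->next, __ATOMIC_ACQUIRE) == 0)
      ;
  }
  __atomic_store_n(&me->next->locked, 0, __ATOMIC_RELEASE);  // hand off the lock
}

The compare_and_swap in release is what makes this strictly FIFO; the paper also gives a release without it, at the cost of strict FIFO order.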