6.828 2005 Lecture 18: Coordination

Introduction
  JOS and v6 are meant to run on single-CPU machines
  most high-end machines, and soon most desktops, will have multiple CPUs
  diagram: CPU1, CPU2, bus, memory
  what would have to change?

Easy plan: both CPUs can run user programs
  only one CPU can run in the kernel
  if CPU1 is in the kernel (syscall or interrupt) and a user program on CPU2
    makes a system call, CPU2 must wait until CPU1 returns to user space
  also may need to force a re-schedule or TLB flush on the other CPU
    inter-processor interrupts

Hard plan: let both CPUs run in the kernel, to get higher performance
  many workloads are kernel-intensive
  JOS would need one kernel stack per CPU
  would JOS otherwise be correct?

Look at Example 1: would it work correctly in multi-CPU JOS?
  what does "correct" mean?
    as if the insert()s occurred one at a time, in some order
    e.g. two insert()s yield a list that has two more elements
  what could go wrong with two CPUs?
    two CPUs call insert() at about the same time, for the same head
    both read the old head
    both set their nl->next to the old head
    both write the head
    one insert()'s write goes second
      the other insert() is effectively ignored
    so two insert()s could yield a list with only one more element

How can we ensure only one CPU is in lines A/B?
  borrow the B_BUSY flag idea from v6?  does Example 2 work?
  no: there are still two operations that we want to be indivisible:
    MOV busy, %eax
    MOV $1, busy
  but we're in luck: x86 gives us the indivisible instruction we need
    XCHG %eax, (busy)
  "exchange":
    1. freeze other CPUs' memory activity
    2. temp := busy
    3. busy := %eax
    4. %eax := temp
    5. un-freeze other CPUs
  steps 1 and 5 make XCHG special: it is "locked"
    special signal lines on the inter-CPU bus, bus arbitration

We really want "test-and-set-locked" (TSL) rather than raw XCHG
  always set the lock variable
  tell us its old value
    if it was zero, we now hold the lock
    if it was one, it's still one, but someone else holds it
  Example 3 wraps TSL around XCHG,
    and wraps acquire()/release() around TSL

Our acquire/release is ideal for short critical sections
  increment a counter, search the i-node cache, allocate a free buffer
  often called "spin locks"

What are spin locks not so great for?
  long critical sections may waste the waiters' CPU time
  it's bad to sleep while holding locks
    so locks can't replace the v6 ILOCK and B_BUSY mechanisms
    though they can ensure that those flags are set correctly
  acquire doesn't suppress interrupts
    what if the interrupt code itself tries to take a held lock? deadlock!
    could combine spl with acquire (see the sketch after the lock-granularity
      discussion below)
    or could never take interrupts in the kernel, like JOS
      another CPU may still take the interrupt; that's OK: no deadlock
      even if the interrupt handler needs a lock

Lock granularity
  my list_lock protects all lists
    inserts to different lists block each other
    a lock per list would waste less time spinning
  so you might want "fine-grained" locks, one for every object
  BUT acquire/release are expensive (500 cycles on my 3 GHz machine?)
    because they need to talk off-chip
  AND "correctness" is not that simple
    what if the invariant is "every buffer must be on exactly one of the
      free list and a device list"?
    then per-list locks are irrelevant!
  so you might want "large-grained" locks
    a lock around the whole file system?
    a lock around each block device and its buffers?
  tension: concurrency vs lock overhead
  this is hard to get right; Linux fiddles with it all the time
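A rough sketch of the spl+acquire idea, to make it concrete: mask interrupts
on the local CPU before spinning, so an interrupt handler on this CPU can
never deadlock against a lock this CPU already holds.  It assumes the TSL()
from Example 3 below, plus hypothetical splhi()/splx() helpers that raise
the interrupt priority (returning the old level) and restore a saved level;
those names are stand-ins, not v6's exact calls.

    /*
     * Sketch: spl+acquire.  Assumes TSL() from Example 3 and
     * hypothetical splhi()/splx() interrupt-priority helpers.
     */
    struct SpinLock {
      int locked;         // set/cleared with TSL, as in Example 3
      int saved_pl;       // interrupt priority saved by acquire_irq()
    };

    acquire_irq(SpinLock *lck)
    {
      int pl = splhi();   // mask interrupts on this CPU, remember old level
      while(TSL(&(lck->locked)) != 0)
        ;                 // spin; no interrupt can preempt us here
      lck->saved_pl = pl; // stash the old level until release_irq()
    }

    release_irq(SpinLock *lck)
    {
      int pl = lck->saved_pl;
      lck->locked = 0;    // give up the lock first...
      splx(pl);           // ...then restore the old interrupt level
    }

Raising the priority before the TSL matters: an interrupt that slipped in
between winning the lock and masking could spin forever on the same lock
on this CPU.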
Lock-free data structures
  XCHG made two critical instructions atomic
  could we use atomic instructions directly, without locks?
    then we would not need to spin
    and wouldn't need to worry about interrupts in insert()
  can we implement insert() directly with XCHG?
      nl->next = nl;
      XCHG(&head, &nl->next);
    too bad XCHG won't take two memory operands!
  but CMPXCHG will: it checks that head hasn't changed before exchanging
      CMPXCHG %old, %new, addr
    (pseudo-syntax: the real instruction compares %eax against addr and
     stores the new value only if they match)
    a lock-free insert() built on it is sketched at the very end of these notes
  also LOCK ADD %eax, addr, and friends

****

Mellor-Crummey and Scott, "Algorithms for Scalable Synchronization on
Shared-Memory Multiprocessors", TOCS, February 1991
  all about the cost and scalability of locking
  what if you have 10 CPUs waiting for the same lock?

We need to know more about how the memory system works
  mostly interested in how per-CPU caching works, since that's where we get speed
  two architectures: bus-based and directory-based
  we'll only look at bus-based

Bus-based shared memory with caching (e.g. the Sequent Symmetry)
  CPUs, caches, bus, memory
  each CPU's cache controller "snoops" every bus transaction
  each cache line holds: address, value, and a state:
    S   shared / clean
    X   exclusive / dirty
    -   not valid
  invariant: an address is X in exactly one cache,
    or there is no X and all S caches and memory hold the same value
  a CPU can read from its own cache if the line is S or X
  a CPU can write to its own cache (no bus traffic) only if the line is X
  otherwise the CPU must broadcast a request on the bus
    if writing: our cache goes to X, other caches go to -
    if reading: our cache goes to S, any cache holding it X goes to S
      we fetch from the X cache or from memory
  cached reads and writes are ~100x faster than going over the bus

What's the cost of our simple spinning acquire/release?
  (their Algorithm 1, *without* the delays)
  each of the 10 CPUs gets the lock in turn
  meanwhile, the remaining CPUs sit in XCHG on the lock
    the lock's line must be X in a CPU's cache to run XCHG
    otherwise all might read, then all might write
  so the bus is busy all the time with XCHGs!

Can we avoid constant XCHGs while the lock is held?
  test-and-test-and-set: only run the expensive TSL if the lock looks free
  spin on an ordinary load instruction, so the cache line stays S

    acquire(l)
    {
      while(1){
        while(l->locked != 0)
          ;
        if(TSL(&l->locked) == 0)
          return;
      }
    }

  suppose 10 CPUs are waiting; count the cost in total bus transactions
    CPU1 gets the lock with one transaction
      which sets the lock's cache line to - in the other CPUs
    the other 9 CPUs each use the bus once in XCHG
    then everyone has the line S, so they spin locally
    CPU1 releases the lock
    CPU2 gets the lock with one transaction
    8 CPUs each use the bus once...
  so 10 + 9 + 8 + ... ≈ 50 transactions: O(n^2) in the number of CPUs!
  look at the "test-and-test-and-set" line in Figure 6

Can we have n CPUs acquire a lock in O(n) time?

What is the point of the exponential backoff in Algorithm 1?
  does it buy us O(n) time for n acquires?
  is there anything wrong with it?
    it may not be fair
    exponential backoff may stretch the delay after a release

What's the point of the ticket locks, Algorithm 2?  (sketch below)
  one interlocked instruction to get my ticket number
  then I spin on now_serving with ordinary loads
  release() just increments now_serving
  why is that good?
    fair
    no exponential-backoff overshoot
    no spinning on XCHG
  but what's the cost, in bus transactions?
    while the lock is held, now_serving is S in all caches
    release makes it - in all caches
    then each waiter uses a bus transaction to fetch the new value
    so still O(n^2)

What's the point of the array-based queuing locks, Algorithm 3?  (sketch below)
  a lock has an array of "slots"
  a waiter allocates a slot and spins on that slot
  release wakes up just the next slot
  so O(n) bus transactions to get through n waiters: good!
  the anderson lines in Figures 4 and 6 are flat-ish
    they only go up because the lock's own data structures are protected
    by a simpler lock
  but O(n) space *per lock*!
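A rough sketch of a ticket lock in the spirit of Algorithm 2.  The paper's
fetch_and_increment is built here from the x86 LOCK XADD instruction (32-bit
x86, as in Example 3); the names (TicketLock, fetch_and_increment, and so on)
are this sketch's, not the paper's code.

    /*
     * Sketch: ticket lock (in the spirit of Algorithm 2).
     */
    struct TicketLock {
      volatile unsigned int next_ticket;   // next ticket to hand out
      volatile unsigned int now_serving;   // ticket currently allowed in
    };

    // atomically add 1 to *addr and return the old value
    unsigned int
    fetch_and_increment(volatile unsigned int *addr)
    {
      unsigned int old = 1;
      // lock xaddl: the register gets the old *addr,
      // *addr gets old *addr + register, all atomically
      asm volatile ("lock; xaddl %0,%1"
                    : "+r" (old), "+m" (*addr)
                    :
                    : "memory");
      return old;
    }

    ticket_acquire(TicketLock *l)
    {
      unsigned int my_ticket = fetch_and_increment(&l->next_ticket);
      while(l->now_serving != my_ticket)
        ;                        // spin on an ordinary load (line stays S)
    }

    ticket_release(TicketLock *l)
    {
      l->now_serving = l->now_serving + 1;  // wake exactly the next ticket holder
    }

Only the holder writes now_serving, so release needs no locked instruction;
but, as counted above, that one write still invalidates now_serving in every
waiter's cache.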
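A rough sketch of an array-based queuing lock in the spirit of Algorithm 3,
reusing the fetch_and_increment() helper from the ticket-lock sketch.  NSLOT,
the field names, and the layout are this sketch's assumptions; the real
algorithm also pads each slot to its own cache line so waiters don't share
lines.

    /*
     * Sketch: array-based queuing lock (in the spirit of Algorithm 3).
     * Each waiter spins on its own slot; release touches only the next slot.
     * NSLOT bounds the number of simultaneous waiters.
     */
    #define NSLOT 16

    struct ArrayLock {
      volatile int has_lock[NSLOT];     // 1 in exactly one slot
      volatile unsigned int next_slot;  // bumped atomically by each arriving waiter
    };

    array_init(ArrayLock *l)
    {
      int i;
      for(i = 0; i < NSLOT; i++)
        l->has_lock[i] = 0;
      l->has_lock[0] = 1;               // slot 0 starts out holding the lock
      l->next_slot = 0;
    }

    // returns my slot number; caller passes it back to array_release()
    unsigned int
    array_acquire(ArrayLock *l)
    {
      unsigned int me = fetch_and_increment(&l->next_slot) % NSLOT;
      while(l->has_lock[me] == 0)
        ;                               // spin only on my own slot
      l->has_lock[me] = 0;              // reset my slot for its next reuse
      return me;
    }

    array_release(ArrayLock *l, unsigned int me)
    {
      l->has_lock[(me + 1) % NSLOT] = 1;  // hand the lock to the next slot
    }

The slot array is where the O(n) space per lock comes from: there must be a
slot for every CPU that might wait.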
What's the point of the MCS lock, Algorithm 5?
  constant space per lock, rather than O(n)
  one "qnode" per thread, used for whatever lock it's currently waiting for
  the lock holder's qnode points to the start of the waiter list
  the lock variable points to the end of the list
  acquire adds your qnode to the end of the list
    then you spin on your own qnode
  release wakes up just the next qnode
  (a rough sketch appears after Example 3 below)

/*
 * Example 1: linked list insert.
 */

struct List {
  int data;
  struct List *next;
};

insert(List **head, int data)
{
  List *nl = new List;
  nl->data = data;
  nl->next = *head;   /* line A */
  *head = nl;         /* line B */
}

/*
 * Example 2: insert with busy-waiting.
 */

int busy = 0;

insert(List **head, int data)
{
  List *nl = new List;
  nl->data = data;

  while(busy == 1) { }
  busy = 1;

  /* critical section: */
  nl->next = *head;
  *head = nl;

  busy = 0;
}

/*
 * Example 3: acquire() and release() with test-and-set-locked.
 */

struct Lock {
  unsigned int locked;
};

acquire(Lock *lck)
{
  while(TSL(&(lck->locked)) != 0)
    ;
}

release(Lock *lck)
{
  lck->locked = 0;
}

#ifdef i386
int
TSL(int *addr)
{
  register int content = 1;
  // xchgl content, *addr
  // xchgl exchanges the values of its two operands, while
  // locking the memory bus to exclude other operations.
  asm volatile ("xchgl %0,%1"
                : "=r" (content), "=m" (*addr)
                : "0" (content), "m" (*addr));
  return(content);
}
#endif

Lock list_lock;

insert(List **head, int data)
{
  List *nl = new List;
  nl->data = data;

  acquire(&list_lock);
  nl->next = *head;
  *head = nl;
  release(&list_lock);
}
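A rough sketch of the MCS list-based queuing lock (Algorithm 5), to show where
the constant per-lock space comes from: the lock itself is a single tail
pointer, and every waiter spins on a qnode it supplies itself.
fetch_and_store() and compare_and_swap() are built here from x86 XCHG and
LOCK CMPXCHG (32-bit x86, as in Example 3); all the names are this sketch's,
not the paper's code.

/*
 * Sketch: MCS list-based queuing lock (in the spirit of Algorithm 5).
 * The lock is just a pointer to the tail of the waiter list; each
 * thread passes in its own qnode.
 */
struct QNode {
  struct QNode *volatile next;
  volatile int locked;
};

struct MCSLock {
  struct QNode *tail;     // last qnode in line, or 0 if the lock is free
};

// atomically store val into *addr and return the old contents (x86 XCHG)
struct QNode *
fetch_and_store(struct QNode **addr, struct QNode *val)
{
  asm volatile ("xchgl %0,%1"
                : "+r" (val), "+m" (*addr)
                :
                : "memory");
  return val;
}

// atomically: if *addr == old, set *addr = newval and return 1; else return 0
int
compare_and_swap(struct QNode **addr, struct QNode *old, struct QNode *newval)
{
  struct QNode *prev;
  asm volatile ("lock; cmpxchgl %2,%1"
                : "=a" (prev), "+m" (*addr)
                : "r" (newval), "0" (old)
                : "memory");
  return prev == old;
}

mcs_acquire(MCSLock *l, QNode *me)
{
  QNode *pred;

  me->next = 0;
  me->locked = 1;
  pred = fetch_and_store(&l->tail, me);   // put myself at the end of the line
  if(pred != 0){
    pred->next = me;                      // let my predecessor find me
    while(me->locked)
      ;                                   // spin only on my own qnode
  }
}

mcs_release(MCSLock *l, QNode *me)
{
  if(me->next == 0){
    // no visible successor: if the tail is still me, the line is empty
    if(compare_and_swap(&l->tail, me, 0))
      return;
    while(me->next == 0)
      ;                                   // a successor is mid-acquire; wait for its link
  }
  me->next->locked = 0;                   // hand the lock to the next waiter
}

Compared with the array lock, the queue is made of the waiters' own qnodes,
so the lock stays one word no matter how many CPUs contend.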
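Finally, a rough sketch of the lock-free insert() hinted at in the CMPXCHG
discussion: point the new node at the head we read, and retry if the head
changed before we could swap.  It reuses struct List from Example 1; the
CMPXCHG() helper is this sketch's wrapper around x86 LOCK CMPXCHG, not part
of the original examples.

/*
 * Sketch: lock-free insert() using CMPXCHG.
 * CMPXCHG() stores newval into *addr only if *addr still equals old.
 */
int
CMPXCHG(List **addr, List *old, List *newval)
{
  List *prev;
  asm volatile ("lock; cmpxchgl %2,%1"
                : "=a" (prev), "+m" (*addr)
                : "r" (newval), "0" (old)
                : "memory");
  return prev == old;      // 1 if we swapped, 0 if *addr had changed
}

insert(List **head, int data)
{
  List *nl = new List;
  nl->data = data;
  do {
    nl->next = *head;                        // read the current head
  } while(CMPXCHG(head, nl->next, nl) == 0); // install nl only if head unchanged
}

With no lock there is nothing to hold across an interrupt; insert-only is the
easy case, though: deleting nodes lock-free runs into the classic ABA problem.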