6.828 2011 Lecture 16: Scalable Locks

Plan:
  unscalable spinlocks (xv6/JOS style, ticket)
  scalable locks (Anderson, MCS, ...)

Why are we reading this paper?
  multi-processor systems are common; it's hard to buy a single-core laptop
  we want to use multiple CPUs to improve performance
  synchronization between CPUs is often expensive (performance-wise) and tricky
  scalability limitations of kernel-provided services are passed on to all applications

Let's look at the x86 memory system to understand performance.

Back in the locking lecture, we had a fairly simple model of multiple CPUs:
  shared bus
  memory connected to the bus
  to implement acquire, x86's xchg instruction locked the bus
    that provided atomicity for xchg

In reality things are a bit different:
  bus and memory are quite slow compared to CPU speed
  each CPU has a few levels of cache to compensate
    L1 hit: a few cycles
    RAM: 100s of cycles
  how to ensure caches aren't stale?
    CPU 1 reads x=10, CPU 2 writes x=11, CPU 2 reads x=?
    locks don't fix this problem; it occurs even if the accesses aren't concurrent
    answer: a "cache coherence protocol"

How does cache coherence work?
  many schemes; here's a simple one
  each cache line: state, address, 64 bytes of data
  states: Modified, Shared, Invalid [MSI]
  "snooping" idea: CPUs watch each others' bus reads/writes
    CPU 1 has x cached, sees a write to x on the bus -> invalidate x
    CPU 1 reads x on the bus, CPU 2 has x dirty in its cache -> CPU 2 supplies the data
  how do the CPUs coordinate with each other?
    I + local read   -> S
    I + local write  -> M
    S + local read   -> S
    S + local write  -> M
    S + remote write -> I
    M + remote write -> I (+ give them our content)
    M + remote read  -> S (+ give them our content)
  can read without bus traffic if already S
  can write without bus traffic if already M
  "write-back"
  this is a "write-invalidate" scheme
    alternative: a "distributed-write" scheme
  compatibility of states between 2 CPUs:
                 CPU1
                 M   S   I
             M   -   -   +
      CPU2   S   -   +   +
             I   +   +   +

Cache coherence without snooping:
  the bus is a bottleneck
  use a network to connect the processors
  who knows about the state of a cache line?
    each block of memory has an owner; the owner keeps that information
    see rows 5 and 6 of the table in the paper
  "directory-based" cache coherence

The paper builds locks out of atomic operations:
  CompareAndExchange(addr, val1, val2)
  TestAndSet(addr)
  ReadAndIncrement(addr)
  how does the hardware make these operations atomic?
    on a bus: lock the bus
    on a mesh: get the line in exclusive mode,
      and keep it in exclusive mode for the duration of the instruction

What is the performance of locks?
  assume N CPUs want the lock at the same time
  how long until each CPU gets the lock once?
  measure time in protocol messages, since the interconnect is (relatively) slow

test&set spinlock (xv6/JOS):
  constant bus traffic while spinning
  we don't care if waiting CPUs waste their own time
  we do care if waiting CPUs slow down the lock holder!
  single holding time: O(N), because of the N waiters' traffic
  time for all N to get the lock: O(N^2)
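A minimal sketch of such a test-and-set spinlock, assuming C11 atomics in place of xv6's inline xchg assembly; the struct and function names here are illustrative, and the interrupt handling that xv6's real acquire does is omitted:

  #include <stdatomic.h>

  struct ts_lock {
      atomic_int locked;              /* 0 = free, 1 = held */
  };

  static void ts_acquire(struct ts_lock *l)
  {
      /* atomic_exchange is the portable equivalent of x86 xchg;
         every retry needs the line in M state, so N spinning CPUs
         keep the interconnect busy and slow the holder down */
      while (atomic_exchange(&l->locked, 1) != 0)
          ;                           /* spin */
  }

  static void ts_release(struct ts_lock *l)
  {
      atomic_store(&l->locked, 0);    /* invalidates every waiter's copy */
  }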
How can we use less bus traffic?

test-and-set with exponential backoff (t_s_exp_acquire):
  goal: avoid everyone jumping in at once
    space out attempts to acquire the lock
    simultaneous attempts were the reason for N^2 with test&set
    if the total rate of tries is low, only one CPU will attempt per release
  why not a constant delay?
    each CPU re-tries after a random delay with a constant average
    hard to choose the delay time
      too large: wasted time
      too small: all N CPUs probe multiple times per critical section, so N^2 again
  why exponential backoff? i.e. why start with a small delay and double it?
    try to get lucky at first (maybe only a few CPUs are attempting)
    doubling means it takes only a few attempts until delay >= N * critical section,
      i.e. just one attempt per release
  illustration: 8 CPUs, critical section lasts one time unit
    watch as the density of attempts decreases
    the goal is just one attempt between each pair of releases
  can we analyze the number of probes? not that easy
    suppose it takes time O(N) for all CPUs to succeed
    how many probes does each CPU make in time N? log N
    so total probes: N log N
    cache-coherence traffic: N log N
  problem: unlikely to be fair!
    some CPUs will have much lower delays than others
      they will win, and come back, and win again
    some CPUs will have huge delays and will sit idle
      doing no harm, but doing no work

ticket locks (ticket_acquire):
  (Linux uses ticket locks)
  goal: fairer than exponential backoff, cheaper than t-t-s
  idea: assign numbers, wake up one CPU at a time
    avoids repeated test&set after every release
  why is it fair?
  time analysis:
    what happens in acquire?
      N ReadAndIncrements, though spread out
      in steady state each is just one bus transaction (no re-loads after invalidate)
    what happens after release?
      an invalidate, then N re-loads
    so N^2 cache-coherence traffic
  problem: N^2 still blows up rapidly
  paper: how rapidly can it blow up?
    see Figure 2; oops. why so dramatic and sudden?
    Markov chain analysis:
      time to get the lock increases with N
      probability of contention goes up quickly
      move to the left end
    Figure 9:
      long serial sections don't blow up as badly
      but short ones do! counter-intuitive?

Anderson:
  goal: avoid N^2 and be fair
  what if each core spins on a *different* cache line?
  acquire cost?
    N ReadAndIncrements, then a local spin on a per-CPU cache line
  release cost?
    invalidate someone else's slots[] element
    only they have to re-load
    no other CPUs involved
  so O(N)
  problem: high space cost
    often much more than the size of the protected object

MCS (a sketch appears at the end of these notes):
  one "qnode" per thread, used for whatever lock it's currently waiting for
  the lock holder's qnode is at the start of the list
  the lock variable points to the end of the list
  acquire() adds your qnode to the end of the list
    then you spin on your own qnode
  release() wakes up the next qnode
  space: O(1)
  change in API: callers must pass a qnode to acquire and release
    (because callers allocate the qnodes)

Results on a 48-CPU modern cache-coherent AMD machine, slowest to fastest:
  see Figure 10 and Figure 11
  notice they are all super fast for one CPU
    atomic instructions generate no interconnect traffic if the cache line is already in M state

Conclusion:
  use scalable locks
  even better: fix the underlying problem!
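For reference, a minimal sketch of the MCS acquire/release described above, assuming C11 atomics; the struct and function names are illustrative, not the paper's exact code. Each waiter spins only on the locked flag in its own qnode, so a release invalidates one cache line instead of N, and the caller supplies its own qnode, which is the API change noted above.

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct qnode {
      _Atomic(struct qnode *) next;
      atomic_bool locked;                 /* true while this waiter must spin */
  };

  struct mcs_lock {
      _Atomic(struct qnode *) tail;       /* lock variable points to end of list */
  };

  static void mcs_acquire(struct mcs_lock *l, struct qnode *me)
  {
      atomic_store(&me->next, (struct qnode *)NULL);
      atomic_store(&me->locked, true);
      /* atomically add our qnode to the end of the list */
      struct qnode *prev = atomic_exchange(&l->tail, me);
      if (prev == NULL)
          return;                         /* list was empty: we hold the lock */
      atomic_store(&prev->next, me);      /* link ourselves behind our predecessor */
      while (atomic_load(&me->locked))
          ;                               /* spin on our own cache line only */
  }

  static void mcs_release(struct mcs_lock *l, struct qnode *me)
  {
      struct qnode *next = atomic_load(&me->next);
      if (next == NULL) {
          /* no known successor: try to swing the tail back to NULL */
          struct qnode *expected = me;
          if (atomic_compare_exchange_strong(&l->tail, &expected, (struct qnode *)NULL))
              return;                     /* nobody was waiting */
          /* a successor is mid-acquire; wait for it to link itself in */
          while ((next = atomic_load(&me->next)) == NULL)
              ;
      }
      atomic_store(&next->locked, false); /* wake exactly one waiter */
  }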