6.828 2010 Lecture 17: Scalable Locks

Thomas Anderson, The Performance of Spin Lock Alternatives for
Shared-Memory Multiprocessors, IEEE TPDS, Jan 1990.

why are we reading this paper?
  20 years old; still useful?
  multi-processor systems are common; it's hard to buy a single-core laptop
  want to use multiple CPUs to improve performance
  synchronization between CPUs is often expensive (performance-wise) and tricky
  scalability limitations of kernel-provided services are passed on to all applications

let's look at the x86 memory system to understand performance
  back in the locking lecture, we had a fairly simple model of multiple CPUs
    shared bus
    memory connected to the bus
    to implement acquire, x86's xchg instruction locked the bus
      provided atomicity for xchg
  in reality things are a bit different
    bus and memory are quite slow compared to CPU speed
    each CPU has a few levels of cache to compensate
      L1 hit: a few cycles
      RAM: 100s of cycles
  how to ensure caches aren't stale?
    CPU 1 reads x=10, CPU 2 writes x=11, CPU 2 reads x=?
    locks don't fix this problem; it occurs even if the accesses aren't concurrent
    "cache coherence protocol"

how does cache coherence work?
  many schemes; here's a simple one
  each cache line: state, address, 64 bytes of data
  states: Modified, Shared, Invalid [MSI]
  "snooping" idea: CPUs watch each other's bus reads/writes
    CPU 1 has x cached, sees a write to x on the bus -> invalidate x
    CPU 1 reads x on the bus, CPU 2 has it dirty in its cache -> CPU 2 supplies the data
  how do the CPUs coordinate with each other?
    I + local read   -> S
    I + local write  -> M
    S + local read   -> S
    S + local write  -> M
    S + remote write -> I
    M + remote write -> I (+ supply our copy)
    M + remote read  -> S (+ supply our copy)
  can read w/o bus traffic if already S
  can write w/o bus traffic if already M
    "write-back"
  this is a "write-invalidate" scheme
    alternative: "distributed-write"
  compatibility of states between 2 CPUs:
              CPU2
              M  S  I
    CPU1  M   -  -  +
          S   -  +  +
          I   +  +  +
  modern cache coherence is more complex than this!

paper builds locks out of atomic operations:
  TestAndSet(addr)
  ReadAndIncrement(addr)
  how does h/w make these operations atomic on a single bus?
    reserve the bus
    read mem (or snoop it from a cache)
    [add]
    write mem (+ invalidate caches)
    release the bus

so how do we implement locks using atomic operations? how well do they perform?

what is performance?
  assume N CPUs want the lock at the same time
  how long until each CPU gets the lock once?
  measure time in bus operations, since the bus is slow and a shared bottleneck

test-and-set (t_s_acquire):
  constant bus traffic while spinning (see the C sketch below)
  we don't care if waiting CPUs waste their own time
  we do care if waiting CPUs slow the lock holder!
  if the bus is fair, a single holding time = O(N) with N waiters
    since the holder only gets 1/Nth of the bus capacity
  time for all N to get the lock: O(N^2)
  why is this bad?
  how to have less bus traffic from the waiters' TAS instructions?

test-and-test-and-set (t_t_s_acquire):
  idea: spin locally in the cache using an ordinary read, not TestAndSet
    to avoid hogging the bus and slowing the holder
  while the holder holds the lock: waiters spin with the cache line in S state
  on release: waiters invalidate, re-read, one runs TAS and wins
  should be much better than t-s
  what happens during release?
    invalidate
    everyone sees "unlocked"
    everyone is going to run t-s
    first core to run t-s wins
    2nd core runs t-s, loses, goes to top
    3rd core runs t-s, invalidates, goes to top; 2nd core re-loads
    4th core runs t-s, invalidates, goes to top; 2nd and 3rd re-load
    ...
    Nth core runs t-s, invalidates, goes to top; N-2 cores re-load
    there were about N^2 re-loads for a single release!
    so maybe N*N*N for all N CPUs to acquire
    best case: the winner runs t-s before any other core sees the unlock
      so only N-1 cores re-load and see locked=1
      so N^2 in total
  problem: N^3 is bad!
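
a minimal C sketch of the t_s_acquire and t_t_s_acquire loops described above,
using GCC's __sync builtins to stand in for the paper's TestAndSet; the names
and the use of compiler builtins are assumptions for illustration, not the
paper's code:

  struct spinlock {
      volatile int locked;              /* 0 = free, 1 = held */
  };

  /* test-and-set: every spin iteration is an atomic bus operation,
   * so N waiters keep the bus busy and slow the holder */
  void
  t_s_acquire(struct spinlock *l)
  {
      while (__sync_lock_test_and_set(&l->locked, 1))
          ;                             /* spin, generating bus traffic */
  }

  /* test-and-test-and-set: spin with an ordinary read, which hits in the
   * local cache (line in S state) while the lock is held; only run the
   * atomic TAS when the lock looks free */
  void
  t_t_s_acquire(struct spinlock *l)
  {
      for (;;) {
          while (l->locked)
              ;                         /* local spin, no bus traffic */
          if (!__sync_lock_test_and_set(&l->locked, 1))
              return;                   /* won the race after a release */
          /* lost to another core's TAS; go back to local spinning */
      }
  }

  void
  t_s_release(struct spinlock *l)
  {
      __sync_lock_release(&l->locked);  /* store 0; invalidates waiters' copies */
  }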
test-and-set with exponential backoff (t_s_exp_acquire):
  goal: avoid everyone jumping in at once
    space out attempts to acquire the lock
    simultaneous attempts were the reason for N^3 (or N^2 w/ t-t-s)
    if the total rate of tries is low, only one CPU will attempt per release
  why not a constant delay?
    each CPU re-tries after a random delay with a constant average
    hard to choose the delay time
      too large: waste
      too small: all N CPUs probe multiple times per critical section, so N^2
  why exponential backoff? i.e. why start with a small delay and double it?
    try to get lucky at first (maybe only a few CPUs are attempting)
    doubling means it takes only a few attempts until delay >= N * critical section
      i.e. just one attempt per release
  illustration: 8 CPUs, critical section lasts one time unit
    watch as the density of attempts decreases
    goal is just one attempt between each pair of releases
  can we analyze the # of probes? not that easy
    suppose it takes time O(N) for all CPUs to succeed
    how many probes does each CPU make in time N? logN
    so total probes: N*logN
  bus cc (cache-coherence traffic): N*logN
  problem: unlikely to be fair!
    some CPUs will have much lower delays than others
      they will win, and come back, and win again
    some CPUs will have huge delays, and will sit idle
      doing no harm, but doing no work

ticket locks (ticket_acquire):
  (Linux uses ticket locks)
  goal: fairer than exponential backoff, cheaper than t-t-s
  idea: assign numbers, wake up one CPU at a time
    avoid the repeated t-s after every release
  why is it fair?
  time analysis:
    what happens in acquire?
      N ReadAndIncrements, though spread out in steady state
      each is just one bus transaction (no re-loads after invalidate)
    what happens after release?
      invalidate
      N re-loads
      no atomic instructions, so no repeated re-loads as in t-t-s
    so N^2
  bus cc: N^2 (but NOT N^3)
  problem: N^2 still blows up rapidly

anderson: can we do better than N^2?
  can we get rid of the N re-loads after release?
  what if each core spins on a *different* cache line?
  acquire cost?
    N ReadAndIncrements, then a local spin on a per-CPU cache line
  release cost?
    invalidate someone else's slots[] element
    only they have to re-load
    no other CPUs involved
  so O(N)
  problem: high space cost
    often much more than the size of the protected object
  (C sketches of the backoff, ticket, and anderson locks are at the end of these notes)

what results does the paper show? are they what we expect?
  Figure 1: time to do 1,000,000 acquires
    why does it go down? why does it then go up?
    we're happy that spin-on-read has a lower cost
    per-acquire time is linear in N
      so total time is O(N^2), as we predicted
    note it takes 10 to 20 microseconds to acquire/release
  Figure 3: not much to say: anderson wins
    he's expensive with small N b/c he's faking ReadAndIncrement

my results on a 48-core modern cache-coherent AMD machine, slowest to fastest:
  ticket (N^2 w/ a high slope: fetch-and-add expensive???, same at n=2 as anderson)
  t-t-s (only looks good b/c grossly unfair)
  t-s (drifts up, unfair but not as bad as t-t-s)
  anderson (flat, starts higher than t-s b/c fetch-and-add expensive??)
  t-s-exp (flat; wins b/c unfair? some CPUs are shut out, so effective N is lower)
  notice they are all super fast for one CPU
    atomic instructions generate no interconnect traffic if the cache line is in M state

if you think you won't have contention, use t-s
  but what if you later get contention?
    anderson? maybe not -- space costs?
    ticket? slow but fair; must be desirable, since Linux uses them
  best: fix the underlying problem!
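
minimal C sketches of the other three variants (exponential backoff, ticket,
and anderson's array-based lock), again with GCC's __sync builtins standing in
for TestAndSet / ReadAndIncrement; the names, the NCPU bound, and the backoff
constants are illustrative assumptions, not code from the paper or from Linux:

  #define NCPU      48                  /* assumed max number of waiters */
  #define CACHELINE 64

  struct spinlock { volatile int locked; };   /* as in the earlier sketch */

  /* test-and-set with exponential backoff: after each failed TAS, spin
   * locally for a delay that doubles (up to a cap), so that soon roughly
   * one CPU probes per release */
  void
  t_s_exp_acquire(struct spinlock *l)
  {
      unsigned delay = 1;
      while (__sync_lock_test_and_set(&l->locked, 1)) {
          for (volatile unsigned i = 0; i < delay; i++)
              ;                         /* back off without bus traffic */
          if (delay < (1u << 16))
              delay *= 2;
      }
  }

  /* ticket lock: ReadAndIncrement hands out numbers; a release advances
   * now_serving, so exactly one waiter proceeds, but every waiter
   * re-loads the same cache line -- the N^2 term above */
  struct ticketlock {
      volatile unsigned next_ticket;
      volatile unsigned now_serving;
  };

  void
  ticket_acquire(struct ticketlock *l)
  {
      unsigned t = __sync_fetch_and_add(&l->next_ticket, 1);
      while (l->now_serving != t)
          ;                             /* re-load after every release */
  }

  void
  ticket_release(struct ticketlock *l)
  {
      l->now_serving++;                 /* only the holder writes this */
  }

  /* anderson's lock: each waiter spins on its own cache line, so a release
   * invalidates and wakes exactly one CPU; slots[0].has_lock must start
   * out as 1, and counter wraparound is ignored for simplicity */
  struct anderson {
      struct {
          volatile int has_lock;
          char pad[CACHELINE - sizeof(int)];  /* one slot per cache line */
      } slots[NCPU];
      volatile unsigned next_slot;
  };

  unsigned
  anderson_acquire(struct anderson *l)
  {
      unsigned me = __sync_fetch_and_add(&l->next_slot, 1) % NCPU;
      while (!l->slots[me].has_lock)
          ;                             /* local spin on a private line */
      l->slots[me].has_lock = 0;        /* reset the slot for later reuse */
      return me;                        /* caller passes this to release */
  }

  void
  anderson_release(struct anderson *l, unsigned me)
  {
      l->slots[(me + 1) % NCPU].has_lock = 1;  /* wake exactly one waiter */
  }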