6.828 2010 Lecture 20: Determinator

Efficient System-Enforced Deterministic Parallelism, Aviram, Weng, Hu,
Ford, OSDI 2010.

the paper is all about coping w/ concurrency bugs
  so let's talk about bugs first
  not just any bugs -- races
  but keep in mind concurrency is about performance!

what kind of parallel programs is the paper about?
  parallel computation w/ simple input and output
  e.g. matrix inversion, sorting, brute-force password guessing
    but not parallel web server
  you have some work, split it up, wait for partial results
  maybe iterate, as in merge sort
    diagram of array, assignment to cpus, phases

what can go wrong?
  assume that you started w/ correct serial code
  performance: load imbalance, serial code / locks
  correctness: races 

what is a race?
  concurrent code whose result is timing-dependent
  and as a result can sometimes produce incorrect output
  sometimes you see incorrect output, sometimes you don't

write/write race
  example: inserting result into a list
  w/o lock
  n = new()
  n->next = head
  head = n
  why timing-dependent?
    a second insert might slip between read and write of head
    or it might not, depending on timing
    incorrect result might be rare, since timing window is small
  why can that lead to incorrectness?
    the spec says that after both inserts, head should point to a list
      w/ both new elements -- the race can lose one

note timing-dependent != incorrect
  e.g. if you add locks to our insert()
  result is still timing dependent (order in list will change)
  but both outcomes are correct -- no lost insert
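a minimal sketch of the locked insert (Python here for a self-contained runnable
example; the notes' pseudocode is C-like). two threads insert concurrently; the
final order is still timing-dependent, but no insert is lost:

```python
import threading

head = None
lock = threading.Lock()

class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

def insert(val):
    global head
    n = Node(val)
    with lock:          # read and write of head are now atomic
        n.next = head
        head = n

t1 = threading.Thread(target=insert, args=(1,))
t2 = threading.Thread(target=insert, args=(2,))
t1.start(); t2.start()
t1.join(); t2.join()

# walk the list: both elements present, in either order
vals = []
p = head
while p:
    vals.append(p.val)
    p = p.next
assert sorted(vals) == [1, 2]
```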

read/write race
  read result before it is ready
  cpu(i)
    while(1)
      a[i] = f(a[0..n])
  answer depends on relative speeds of CPUs
  probably a bug, though may not be
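a sketch of why the answer is timing-dependent (f here is a hypothetical
function that reads all of a; we simulate one round under two different
CPU orderings):

```python
def f(a):
    return sum(a) + 1      # hypothetical; any f that reads all of a

def run(order, a):
    a = list(a)
    for i in order:        # "order" models which CPU's write lands first
        a[i] = f(a)
    return a

a0 = [0, 0]
r1 = run([0, 1], a0)   # cpu 0's write visible to cpu 1
r2 = run([1, 0], a0)   # cpu 1's write visible to cpu 0
assert r1 != r2        # different interleaving, different answer
```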

how to cope with races?
  1. avoid 
  2. detect (for debugging)
  3. make repeatable (for debugging)

let's start with #3 -- achieving determinism
  definition: two runs w/ same input yield same output
  why might you want determinism?
    customer reports bug, sends you input, can you reproduce?
  remember: inputs must be identical
  remember: determinism doesn't eliminate races
    just causes them to consistently appear, or not appear

does concurrency cause non-determinism?
  is serial code more deterministic than concurrent code?

where does non-determinism come from?
  things that change interleaving of instructions
    differing start times
    differing CPU clock speeds
    interrupts
    O/S scheduler (due to other activities)
  external non-deterministic inputs
    fetch http://...
    print getpid()
    print time()

Simple Scheme 1 for determinism
  shared memory multiprocessor
  put code, inputs in memory
  give CPUs repeatable initial state (cache content &c)
  start CPUs at exactly the same time
  run all CPUs from same clock
  make sure cache coherence, bus arbitrator are deterministic
  forbid interrupts, I/O

SS1 is deterministic for the same reason that simple serial code is
  CPUs in lock-step always execute the same joint sequence of instructions

what problems does SS1 have?
  app must avoid I/O &c
  hard to sync start times
  hard to sync clock rates in large multiprocessors
  hard to avoid perturbations in instruction rate
    interrupts, O/S scheduler, cache misses/replacement, &c

Simple Scheme 2
  goal: avoid precisely synced CPU clocks
    and need for precisely aligned start times
  idea: sync only every 1000 instructions
  compiler or CPU inserts a "sync point" every 1000 instructions
    all CPUs must wait for all other CPUs to reach each sync point
  now CPUs need not be same speed, can tolerate e.g. interrupts
    faster CPUs wait for slower at each sync point
  problem:
    not deterministic within each 1000 instructions
    interleave may change from run to run
    so cannot allow *any* interaction except at sync points
  solution:
    memory updates are private between sync points
    updates exchanged only at sync points
    maybe use copy-on-write between sync points
    merge memory changes at sync points
      can detect changes by comparing old copy of page with new one
  problem:
    ordinary spin-locks will not work!
    every CPU will be able to acquire the lock, since the lock word is on
      its own private copied page
  solution:
    acquire() must delay until next sync point
    merge all mem changes
    pick lock winner (if any -- old holder may still have it)
    start next 1000 instructions

SS2 can detect write/write races!
  if memory merge sees two CPUs modified same variable
  print an error message
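a sketch of the sync-point merge and its write/write detection (assumption:
word-granularity compare; the paper's page-diff works the same way at page
granularity). each CPU ran on a private copy; the merge compares each copy
against the pre-sync snapshot:

```python
def merge(snapshot, copies):
    merged = list(snapshot)
    for addr in range(len(snapshot)):
        # which CPUs changed this word since the snapshot?
        writers = [c for c in copies if c[addr] != snapshot[addr]]
        if len(writers) > 1:
            raise RuntimeError("write/write race at word %d" % addr)
        if writers:
            merged[addr] = writers[0][addr]
    return merged

snap = [0, 0, 0]
cpu0 = [7, 0, 0]          # cpu 0 wrote word 0
cpu1 = [0, 0, 9]          # cpu 1 wrote word 2
assert merge(snap, [cpu0, cpu1]) == [7, 0, 9]

# both CPUs write word 1 -> detected as a write/write race
try:
    merge(snap, [[0, 1, 0], [0, 2, 0]])
    assert False
except RuntimeError:
    pass
```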

why isn't SS2 perfect?
  sync point merging &c may be serial code: slow
  read/write races more likely to produce incorrect results
    since longer window in which one CPU doesn't see another's changes
  global sync points are bad if computation isn't uniform
    if some CPUs lock often, some rarely
    or some need to see other CPU's results often, some rarely
  
Determinator fixes SS2's problems
  1. s/w on each CPU declares sync points (not every 1000 instructions)
     with "ret()" system call
     can delay call to ret() until done with an intermediate computation
       avoiding some read/write races
       ret() is a little like transaction commit
     can ret() more often or less often
  2. sync points are not global
     "spaces" arranged in a parent/child hierarchy
     ret() only syncs with parent
     parent can sync more often with some children than with others
     different parents can sync w/ children at different rates
     parents can run sync point code in parallel

diagram of spaces
  parent, children

system calls
  Tables 1, 2
  put(child#, ...)
    may create the child
    or waits for child ret()
    copies memory to the child
    makes a snapshot of memory (for future merge)
    may start the child
  get(child#, ...)
    waits for child to ret()
    copies memory from child, perhaps merging
    fault if more than one child modifies same mem
  ret()
    stops the child

simple use
  parent put()s to create each child, w/ *copy* of memory
  starts each child
  each child computes a result
    they do not interact at all! no shared memory!
  each child calls ret()
  parent calls get() for each child
    merges result into parent's memory
  parent calls put() for each child
    copies intermediate results from parent to child
    and starts child executing
  ret()s and get()s again, ...
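the pattern above, simulated in one process (put/get/ret are Determinator
calls; here "put" is modeled by handing each child an explicit *copy* of the
parent's memory, and "get" collects children in a fixed order, never
first-to-finish -- the child_work split is a made-up example):

```python
def child_work(mem, i, n):
    # each child squares its own slice; no shared memory
    lo, hi = i * len(mem) // n, (i + 1) * len(mem) // n
    return {j: mem[j] ** 2 for j in range(lo, hi)}

def parent(mem, nchildren):
    # put(): each child gets a private copy of parent memory
    results = [child_work(list(mem), i, nchildren)
               for i in range(nchildren)]
    # get(): merge in fixed order 0..n-1, so output is deterministic
    for r in results:
        for j, v in r.items():
            mem[j] = v
    return mem

assert parent([1, 2, 3, 4], 2) == [1, 4, 9, 16]
```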

can we have get(any child) / pthread_join_any() / wait() ?
  i.e. wait for first child that calls ret()?
  no: will expose non-determinism
  parent must get() for just one specific child at a time

what if parent is not sure which child will ret() first?
  it must somehow make child ret order predictable
  give each child the same amt of work, then wait for one of them
  use special execute-for-N-instructions feature
  this is OK for e.g. matrix multiply
    OK for some sort algorithms but not all
    not OK for e.g. web server -- diff CGI scripts take diff time
  parent can create more children than there are CPUs
    if one child ret()s early, CPU can run some other child
    this can compensate for imperfect division of work among children

Determinator also eliminates a host of O/S API non-determinisms
  it is a derivative of JOS, not e.g. Linux
  so they started from a pretty clean slate
  but they need to emulate various standard O/S APIs
    these APIs often involve non-determinism

fork() returns a non-deterministic PID
  emulation chooses from a space-local PID space
  breaks programs that send PIDs between spaces

wait() chooses a child to collect non-deterministically
  emulation waits only for parent's first-forked child

locks
  every lock is owned by some child -- the last holder
  suppose child 1 was last lock holder
    and it has modified protected data
  child 2 wants lock
    it must get the protected data as well
  acquire() calls ret()
  parent waits for old lock holder to ret()
    collects its memory modifications
  parent gives lock to acquire()er
    updates its memory
    re-starts children

will these locks be deterministic?
  ordinary spin-locks are not
  assume computation deterministic up to this point
  so that same set of children call acquire()
    same child is last holder
  parent must make deterministic choice of next holder
    lowest CPU #, round robin, &c
    so that it makes the same choice from run to run
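a sketch of one such deterministic choice (lowest child number, an
assumption -- any rule that is a pure function of the set of waiters works):

```python
def pick_winner(waiters):
    # same set of waiters -> same winner, on every run
    return min(waiters) if waiters else None

assert pick_winner({3, 1, 2}) == 1
assert pick_winner(set()) is None   # old holder may still have the lock
```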

FS
  they don't have a disk file system
  each space has a private in-memory file system
  get() and put() merge, as with memory
  so only useful for memory-like temporary storage
  but you could imagine an on-disk file system that worked the same way
    each child works in a snapshot

how would they do network I/O -- sockets?
  the paper doesn't describe a plan
  may not make sense
    determinism only useful for reproducibility
    arbitrary I/O may not be the same twice
  but it might be OK: e.g. fetching input from AFS server

let's look back at Determinator design vs goals
  is it deterministic?
  will its determinism make our lives better?
  does it eliminate races? w/w? r/w?
  do they give up programmer convenience?
  do they give up performance?

performance
  figure 7
  seven different benchmark applications
  1 to 12 CPUs
  y-axis is speed relative to Linux
  why is md5 much better than lu_*?
    md5 tasks are independent
    lu_ splits up matrix but requires communication

...

here's a read/write race that Determinator does not fix:
  despite claim in Section 2.2
  bool x_owns = false, y_owns = false;
  x(){
    if(y_owns == false)
      x_owns = true;
  }
  y(){
    if(x_owns == false)
      y_owns = true;
  }