6.828 2011 Lecture 18: Race Detection

Read: Effective Data-Race Detection for the Kernel, Erickson et al, OSDI 2010

Topic: bugs in concurrent programs
  Concurrency bugs are hard due to non-determinism
    Results depend on the interleaving of instructions
    Run the program twice on the same input -- different output!
    A concurrency bug may be very hard to reproduce
  Why do we care about detecting concurrency problems?
    Might be useful to you
    Helpful to think about the nature of concurrency problems
    It's a good research area

Outline
  Data races
  Lock-sets
  Happens-before
  DataCollider

What is a race?
  Paper's definition (there are others):
    Two memory operations "conflict" if
      they touch overlapping memory, and
      at least one is a write, and
      not both are sync accesses (e.g. inside a lock implementation)
    "Race" = conflicting ops could execute simultaneously
  Example:
    Thr 1:       Thr 2:
    x = x + 1    x = 0
  Suspicious: why doesn't the programmer care whether x=0 runs before or after?
  Suspicious: x=0 might be lost between Thr 1's LD and ST
  This code is *probably* incorrect
    Maybe the programmer meant to use locks
    Maybe the programmer meant:
      Thr 1:       Thr 2:
      x = x + 1
      ...
      exit()       wait()
                   x = 0
  Why does the definition declare a race even if one operation is a read?
    Our x=x+1 example had both cores writing
    Example of a problem even though one core only reads:
      f(x):
        n = n + 1
        log[n] = x
      g():
        print log[n]
      (g() can read the new n before f() has stored log[n])

Why do people focus on races?
  They reflect the hard part: non-deterministic interleaving
  Fixing a race = adding a lock = restricting interleaving

Are there concurrency problems that are not races?
  Deadlock, for example
  Any non-deadlock problems? Yes:
    int x;
    f():
      acq(a)
      reg = x
      rel(a)
      reg = reg + 1
      acq(a)
      x = reg
      rel(a)
  The code is correct if there's no concurrency
  No race, since the lock prevents simultaneous accesses to x
  Yet the code is not correct if f() executes concurrently
    (two concurrent calls can both read the old x and lose an increment)
  Concurrency problems are often failures to achieve atomicity
    Not just about simultaneous memory accesses

Is a race always a bug?
  Examples of benign races:
    Statistics counters: ok to miss an increment
    Tricks to avoid expensive locks (double-checked locking):
      if buf == 0:
        acq
        if buf == 0:
          buf = alloc()
        rel
    Lock-free data structures:
      int rd, wr;
      send(m):
        fifo[wr++] = m
      recv():
        if rd < wr:
          return fifo[rd++]

Summary of races
  Race = potential simultaneous use of a memory location
  Often a bug, e.g. from forgetting a lock
  But some races are benign
  Races are a big source of concurrency problems, but not the only source

How to find races? Eraser
  Many of you saw Eraser in 6.033
  Checks that every shared variable is protected by some lock
  The basic idea (sketched in code at the end of this section):
    Execute the program
    Track a candidate "lock set" C(v) for each memory location v
      initially, all locks
    Keep track of what locks are held at each point in time
    On each memory reference instruction:
      location's lock set &= current lock set
    Empty location lock set = race
  Eraser example:
    v = 0         C(v) = {a,b,...}
    acq(a)
    v = v + 1     C(v) = {a}
    rel(a)
    acq(b)
    v = v + 1     C(v) = {}
    rel(b)
  Cool: Eraser flags this problem even though the bug didn't occur
  Also:
    ignore variables that only one thread has used
    ignore variables that are shared but always read-only
  Does Eraser generate false positives? Yes!
    Non-lock synchronization:
      sleep/wakeup
      while(!done);
    Benign races (see above)
  Does Eraser miss any races?
    Yes: it only sees code that was executed
    So you need big test suites
  Is Eraser suitable for production code? Can you leave it on always,
  to increase coverage?
    No: they report a 10x to 30x slowdown
      code inserted at every load and store
      "shadow" memory to track each variable's C(v)
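A minimal C sketch of the lock-set refinement above, assuming a bitmask
lock set for a single variable v.  The names (lockset_t, on_access_v,
cur_held) are invented for illustration -- real Eraser instruments every
load and store in the binary and keeps C(v) in shadow memory:

    /* Illustrative lock-set refinement; not Eraser's actual code. */
    #include <stdio.h>

    typedef unsigned lockset_t;          /* bitmask: one bit per lock */
    #define ALL_LOCKS ((lockset_t)~0u)   /* C(v) starts as "all locks" */

    lockset_t C_v = ALL_LOCKS;           /* candidate lock set for v */
    lockset_t cur_held = 0;              /* locks held right now */

    void acq(lockset_t l) { cur_held |= l;  }
    void rel(lockset_t l) { cur_held &= ~l; }

    /* called (conceptually) on every load/store of v */
    void on_access_v(void) {
        C_v &= cur_held;                 /* refine the candidate set */
        if (C_v == 0)
            printf("race: no lock consistently protects v\n");
    }

    int main(void) {
        lockset_t a = 1, b = 2;
        acq(a); on_access_v(); rel(a);   /* C(v) = {a} */
        acq(b); on_access_v(); rel(b);   /* C(v) = {}: race reported */
        return 0;
    }

Note that the race is reported even though this run never actually
interleaves anything -- that is the lock-set method's strength.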
Step back: Eraser detects violations of a locking discipline
  Pro: a discipline allows automated checking
  Con: the discipline is pretty restrictive
    and following the discipline != correct
  Can we relax the discipline?

Happens-before race detection
  Allows more styles of synchronization than just locking
  The discipline: accesses must be separated by a sync operation
    lock, sleep/wakeup, fork, &c
  Happens-before graph
    Build the graph as the program executes
    A graph node for every instruction
    A graph edge for each successive instruction of a given thread
    A graph edge between sync operations on the same sync object,
    if the operation forbids reordering:
      release -> acquire
      wakeup -> sleep returns
      fork -> child starts
  Example:
    Thr 1:       Thr 2:
    acq(a)
    v = v + 1
    rel(a)
                 acq(a)
                 v = v + 1
                 rel(a)
    (the rel(a) -> acq(a) edge orders the two accesses to v)
  How a happens-before system works
    Run the program
    Accumulate the happens-before graph:
      log every memory access and sync operation to a file
    Analyze the graph afterwards
    Race if:
      two threads access the same variable, and
      the accesses are not ordered by happens-before
    Example: if one of the example threads didn't lock
  Does happens-before reflect our definition of "race"?
    Yes: not ordered -> could execute simultaneously
    Assuming that we understand all forms of synchronization
  When is happens-before better than lock-set (Eraser)?
    If the program uses non-lock sync
  When is lock-set better than happens-before?
    Thr 1:       Thr 2:
    y = y + 1
    acq/rel
                 acq/rel
                 y = y + 1
    Happens-before misses this race!
      whether it sees the race depends on the particular execution order
      (this run has a rel -> acq edge between the two accesses to y;
      a run with the opposite lock order would not)
  Will happens-before complain about benign races?
  Happens-before is even slower than lock-set
    requires a trace of every memory access
  Happens-before summary
    Depends on a discipline (sync), but allows more kinds of sync than lock-set
    Can identify some races that did not actually happen
    Misses some potential races
    Slow

Today's paper -- DataCollider
  Fast (no traces, doesn't instrument each load/store)
  Doesn't need to know the details of locking/sync
  Good for the kernel...
  How does it work? (Figure 2; sketched in code at the end of this section)
    Set a code breakpoint at a random memory instruction
    On the breakpoint interrupt:
      set a *data* breakpoint on the memory address -- on all cores
      pause for 1 or 15 ms
      the data breakpoint tells us whether anyone else used that address!
  Work through the example:
    Thr 1:       Thr 2:
    x = x + 1    x = 0
  And:
    Thr 1:            Thr 2:
    x = x + 1
    done = 1
                      while(done==0);
                      x = 0
  Why the "temp = read() ... if temp != temp'" re-read check?
    Catches writers the breakpoint can't see: DMA and multiple mappings
  Do they have false positives?
    They only flag races that actually happened, so no
    But 90% of the races they flag are benign!
      statistics counters, read of one bit vs write of another, time variable
    But they do *not* report seeing lock-free data structures...
      suggests the Windows kernel follows some discipline very thoroughly
  Do they miss races?
    Yes: the other access has to happen during the pause
  Did they find anything useful?
    Yes: 25 real bugs in the Windows kernel
    Of the 12 that are understood:
      9 missing locks (or a missing atomic add)
      2 false sharing
      1 serious design problem
  In what ways is DataCollider better than happens-before or lock-set?
    Fast! breakpoint monitoring is very cheap in h/w
    Doesn't need to know the details of the sync discipline
      if you use while(!done), DataCollider won't flag a race
      or: init data; enable interrupt; intr code reads data
  In what ways is DataCollider worse?
    I.e. what does it give up by not maintaining lock-sets or happens-before?
    The test execution has to encounter the simultaneous accesses,
    whereas Eraser and happens-before can find potential races
    It samples, so it may take a while to observe a race
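A minimal sketch of DataCollider's per-sample check (the paper's Figure 2).
The helpers -- set_data_breakpoint(), breakpoint_fired(), delay_ms(),
report_race() -- are hypothetical stand-ins for the x86 debug-register and
timer machinery, not the paper's actual API:

    typedef unsigned long word_t;

    /* hypothetical stand-ins for the debug-register machinery */
    extern void set_data_breakpoint(void *addr);   /* arm watchpoint, all cores */
    extern void clear_data_breakpoint(void *addr);
    extern int  breakpoint_fired(void);            /* did another core hit it? */
    extern void delay_ms(int ms);
    extern void report_race(void *addr);

    /* called from the code-breakpoint handler, just before the
       sampled instruction accesses *addr */
    void sample_access(word_t *addr)
    {
        word_t before = *addr;           /* value before the pause */
        set_data_breakpoint(addr);
        delay_ms(15);                    /* window for a conflicting access */
        clear_data_breakpoint(addr);

        /* the re-read catches writers the breakpoint cannot see,
           e.g. DMA or a second virtual mapping of the same page */
        if (breakpoint_fired() || *addr != before)
            report_race(addr);
    }

In the while(!done) example above, pausing Thr 1 before done = 1 also
stalls Thr 2's spin, so no conflicting access hits x during the window
and no race is flagged -- without DataCollider knowing anything about
that synchronization style.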
All three techniques depend on a sync discipline
  Eraser: every access protected by a lock
  Happens-before: sync between accesses in diff threads
  DataCollider: sync between accesses in diff threads
  Discipline is good:
    makes it easier to write / reason about programs
    often checkable (easier to check than correctness)
  Discipline is bad:
    a tight discipline tempts the programmer to escape it
    yields false positives from the checker, and difficult code
  Worth thinking about other possible disciplines
    for the programmer, for the checker
    every object method must be atomic?

Do races have any relation to bugs in lock-free code?
  What would you need for lock-free code? (one answer is sketched below)
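To make the earlier fifo example concrete: lock-free code still needs
ordering, i.e. explicit happens-before edges.  A sketch of a
single-producer/single-consumer version of that fifo using C11 atomics;
the power-of-two capacity N and the function shapes are my choices for
illustration, not from the paper:

    #include <stdatomic.h>

    #define N 64                    /* capacity; must be a power of two */

    int fifo[N];
    _Atomic unsigned rd, wr;        /* ever-increasing indices, used mod N */

    int send(int m) {               /* one sender thread; returns 0 if full */
        unsigned w = atomic_load_explicit(&wr, memory_order_relaxed);
        if (w - atomic_load_explicit(&rd, memory_order_acquire) == N)
            return 0;
        fifo[w % N] = m;
        /* release: the store to fifo[] is ordered before the receiver's
           acquire load of wr, so recv() sees a fully written slot */
        atomic_store_explicit(&wr, w + 1, memory_order_release);
        return 1;
    }

    int recv(int *m) {              /* one receiver thread; returns 0 if empty */
        unsigned r = atomic_load_explicit(&rd, memory_order_relaxed);
        if (atomic_load_explicit(&wr, memory_order_acquire) == r)
            return 0;
        *m = fifo[r % N];
        atomic_store_explicit(&rd, r + 1, memory_order_release);
        return 1;
    }

The raw "int rd, wr" version earlier silently relies on the compiler and
hardware not reordering the fifo[] store past the wr increment; the
atomics make that happens-before edge explicit, which is also what a
race detector would need in order to see the code as synchronized.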