6.828 2011 Lecture 18: Race Detection

Read: Effective Data-Race Detection for the Kernel, Erickson et al, OSDI 2010

Topic: bugs in concurrent programs
  Concurrency bugs are hard due to non-determinism
    Results depend on the interleaving of instructions
    Run the program twice on the same input -- different output!
    A concurrency bug may be very hard to reproduce
  Why do we care about detecting concurrency problems?
    Might be useful to you
    Helpful to think about the nature of concurrency problems
    It's a good research area

Outline
  Data races
  Lock-sets
  Happens-before
  DataCollider

What is a race?
  Paper's definition (there are others):
    Two memory operations "conflict" if
      they touch overlapping memory, and
      at least one is a write, and
      not both are sync accesses (e.g. inside a lock implementation)
    "Race" = conflicting ops could execute simultaneously
  Example:
    Thr 1:       Thr 2:
    x = x + 1    x = 0
  Suspicious: why doesn't the programmer care whether x=0 runs before or after?
  Suspicious: x=0 might be lost between Thr 1's LD and ST
  This code is *probably* incorrect
    Maybe the programmer meant to use locks
    Maybe the programmer meant:
      Thr 1:       Thr 2:
      x = x + 1
      ...
      exit()       wait()
                   x = 0
  Why does the definition declare a race even if one operation is a read?
    Our x=x+1 example had both cores writing
    Example of a problem even though one core only reads:
      f(x):
        n = n + 1
        log[n] = x
      g():
        print log[n]
      (g() can read the new n before f() has stored log[n])

Why do people focus on races?
  They reflect the hard part: non-deterministic interleaving
  Fixing a race = adding a lock = restricting interleaving

Are there concurrency problems that are not races?
  Deadlock, for example
  Any non-deadlock problems? Yes:
    int x;
    f():
      acq(a)
      reg = x
      rel(a)
      reg = reg + 1
      acq(a)
      x = reg
      rel(a)
  The code is correct if there's no concurrency
  No race, since the lock prevents simultaneous accesses to x
  Yet the code is not correct if f() executes concurrently
    (two concurrent calls can both read the old x and lose an increment)
  Concurrency problems are often failures to achieve atomicity
    Not just about simultaneous memory accesses

Is a race always a bug?
  Examples of benign races:
    Statistics counters: ok to miss an increment
    Tricks to avoid expensive locks (double-checked locking):
      if buf == 0:
        acq
        if buf == 0:
          buf = alloc()
        rel
    Lock-free data structures:
      int rd, wr;
      send(m):
        fifo[wr++] = m
      recv():
        if rd < wr:
          return fifo[rd++]

Summary of races
  Race = potential simultaneous use of a memory location
  Often a bug, e.g. from forgetting a lock
  But some races are benign
  Races are a big source of concurrency problems, but not the only source

How to find races? Eraser
  Many of you saw Eraser in 6.033
  Checks that every shared variable is protected by some lock
  The basic idea (sketched in code at the end of this section):
    Execute the program
    Track a candidate "lock set" C(v) for each memory location v
      initially, all locks
    Keep track of what locks are held at each point in time
    On each memory reference instruction:
      location's lock set &= current lock set
    Empty location lock set = race
  Eraser example:
    v = 0         C(v) = {a,b,...}
    acq(a)
    v = v + 1     C(v) = {a}
    rel(a)
    acq(b)
    v = v + 1     C(v) = {}
    rel(b)
  Cool: Eraser flags this problem even though the bug didn't occur
  Also:
    ignore variables that only one thread has used
    ignore variables that are shared but always read-only
  Does Eraser generate false positives? Yes!
    Non-lock synchronization:
      sleep/wakeup
      while(!done);
    Benign races (see above)
  Does Eraser miss any races?
    Yes: it only sees code that was executed
    So you need big test suites
  Is Eraser suitable for production code? Can you leave it on always,
  to increase coverage?
    No: they report a 10x to 30x slowdown
      code inserted at every load and store
      "shadow" memory to track each variable's C(v)
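A minimal C sketch of the lock-set refinement above, assuming a bitmask
lock set for a single variable v.  The names (lockset_t, on_access_v,
cur_held) are invented for illustration -- real Eraser instruments every
load and store in the binary and keeps C(v) in shadow memory:

    /* Illustrative lock-set refinement; not Eraser's actual code. */
    #include <stdio.h>

    typedef unsigned lockset_t;          /* bitmask: one bit per lock */
    #define ALL_LOCKS ((lockset_t)~0u)   /* C(v) starts as "all locks" */

    lockset_t C_v = ALL_LOCKS;           /* candidate lock set for v */
    lockset_t cur_held = 0;              /* locks held right now */

    void acq(lockset_t l) { cur_held |= l;  }
    void rel(lockset_t l) { cur_held &= ~l; }

    /* called (conceptually) on every load/store of v */
    void on_access_v(void) {
        C_v &= cur_held;                 /* refine the candidate set */
        if (C_v == 0)
            printf("race: no lock consistently protects v\n");
    }

    int main(void) {
        lockset_t a = 1, b = 2;
        acq(a); on_access_v(); rel(a);   /* C(v) = {a} */
        acq(b); on_access_v(); rel(b);   /* C(v) = {}: race reported */
        return 0;
    }

Note that the race is reported even though this run never actually
interleaves anything -- that is the lock-set method's strength.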
Step back: Eraser detects violations of a locking discipline
  Pro: a discipline allows automated checking
  Con: the discipline is pretty restrictive
    and following the discipline != correct
  Can we relax the discipline?

Happens-before race detection
  Allows more styles of synchronization than just locking
  The discipline: accesses must be separated by a sync operation
    lock, sleep/wakeup, fork, &c
  Happens-before graph
    Build the graph as the program executes
    A graph node for every instruction
    A graph edge for each successive instruction of a given thread
    A graph edge between sync operations on the same sync object,
    if the operation forbids reordering:
      release -> acquire
      wakeup -> sleep returns
      fork -> child starts
  Example:
    Thr 1:       Thr 2:
    acq(a)
    v = v + 1
    rel(a)
                 acq(a)
                 v = v + 1
                 rel(a)
    (the rel(a) -> acq(a) edge orders the two accesses to v)
  How a happens-before system works
    Run the program
    Accumulate the happens-before graph:
      log every memory access and sync operation to a file
    Analyze the graph afterwards
    Race if:
      two threads access the same variable, and
      the accesses are not ordered by happens-before
    Example: if one of the example threads didn't lock
  Does happens-before reflect our definition of "race"?
    Yes: not ordered -> could execute simultaneously
    Assuming that we understand all forms of synchronization
  When is happens-before better than lock-set (Eraser)?
    If the program uses non-lock sync
  When is lock-set better than happens-before?
    Thr 1:       Thr 2:
    y = y + 1
    acq/rel
                 acq/rel
                 y = y + 1
    Happens-before misses this race!
      whether it sees the race depends on the particular execution order
      (this run has a rel -> acq edge between the two accesses to y;
      a run with the opposite lock order would not)
  Will happens-before complain about benign races?
  Happens-before is even slower than lock-set
    requires a trace of every memory access
  Happens-before summary
    Depends on a discipline (sync), but allows more kinds of sync than lock-set
    Can identify some races that did not actually happen
    Misses some potential races
    Slow

Today's paper -- DataCollider
  Fast (no traces, doesn't instrument each load/store)
  Doesn't need to know the details of locking/sync
  Good for the kernel...
  How does it work? (Figure 2; sketched in code at the end of this section)
    Set a code breakpoint at a random memory instruction
    On the breakpoint interrupt:
      set a *data* breakpoint on the memory address -- on all cores
      pause for 1 or 15 ms
      the data breakpoint tells us whether anyone else used that address!
  Work through the example:
    Thr 1:       Thr 2:
    x = x + 1    x = 0
  And:
    Thr 1:            Thr 2:
    x = x + 1
    done = 1
                      while(done==0);
                      x = 0
  Why the "temp = read() ... if temp != temp'" re-read check?
    Catches writers the breakpoint can't see: DMA and multiple mappings
  Do they have false positives?
    They only flag races that actually happened, so no
    But 90% of the races they flag are benign!
      statistics counters, read of one bit vs write of another, time variable
    But they do *not* report seeing lock-free data structures...
      suggests the Windows kernel follows some discipline very thoroughly
  Do they miss races?
    Yes: the other access has to happen during the pause
  Did they find anything useful?
    Yes: 25 real bugs in the Windows kernel
    Of the 12 that are understood:
      9 missing locks (or a missing atomic add)
      2 false sharing
      1 serious design problem
  In what ways is DataCollider better than happens-before or lock-set?
    Fast! breakpoint monitoring is very cheap in h/w
    Doesn't need to know the details of the sync discipline
      if you use while(!done), DataCollider won't flag a race
      or: init data; enable interrupt; intr code reads data
  In what ways is DataCollider worse?
    I.e. what does it give up by not maintaining lock-sets or happens-before?
    The test execution has to encounter the simultaneous accesses,
    whereas Eraser and happens-before can find potential races
    It samples, so it may take a while to observe a race
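A minimal sketch of DataCollider's per-sample check (the paper's Figure 2).
The helpers -- set_data_breakpoint(), breakpoint_fired(), delay_ms(),
report_race() -- are hypothetical stand-ins for the x86 debug-register and
timer machinery, not the paper's actual API:

    typedef unsigned long word_t;

    /* hypothetical stand-ins for the debug-register machinery */
    extern void set_data_breakpoint(void *addr);   /* arm watchpoint, all cores */
    extern void clear_data_breakpoint(void *addr);
    extern int  breakpoint_fired(void);            /* did another core hit it? */
    extern void delay_ms(int ms);
    extern void report_race(void *addr);

    /* called from the code-breakpoint handler, just before the
       sampled instruction accesses *addr */
    void sample_access(word_t *addr)
    {
        word_t before = *addr;           /* value before the pause */
        set_data_breakpoint(addr);
        delay_ms(15);                    /* window for a conflicting access */
        clear_data_breakpoint(addr);

        /* the re-read catches writers the breakpoint cannot see,
           e.g. DMA or a second virtual mapping of the same page */
        if (breakpoint_fired() || *addr != before)
            report_race(addr);
    }

In the while(!done) example above, pausing Thr 1 before done = 1 also
stalls Thr 2's spin, so no conflicting access hits x during the window
and no race is flagged -- without DataCollider knowing anything about
that synchronization style.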
All three techniques depend on a sync discipline
  Eraser: every access protected by a lock
  Happens-before: sync between accesses in diff threads
  DataCollider: sync between accesses in diff threads
  Discipline is good:
    makes it easier to write / reason about programs
    often checkable (easier to check than correctness)
  Discipline is bad:
    a tight discipline tempts the programmer to escape it
    yields false positives from the checker, and difficult code
  Worth thinking about other possible disciplines
    for the programmer, for the checker
    every object method must be atomic?

Do races have any relation to bugs in lock-free code?
  What would you need for lock-free code? (one answer is sketched below)
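To make the earlier fifo example concrete: lock-free code still needs
ordering, i.e. explicit happens-before edges.  A sketch of a
single-producer/single-consumer version of that fifo using C11 atomics;
the power-of-two capacity N and the function shapes are my choices for
illustration, not from the paper:

    #include <stdatomic.h>

    #define N 64                    /* capacity; must be a power of two */

    int fifo[N];
    _Atomic unsigned rd, wr;        /* ever-increasing indices, used mod N */

    int send(int m) {               /* one sender thread; returns 0 if full */
        unsigned w = atomic_load_explicit(&wr, memory_order_relaxed);
        if (w - atomic_load_explicit(&rd, memory_order_acquire) == N)
            return 0;
        fifo[w % N] = m;
        /* release: the store to fifo[] is ordered before the receiver's
           acquire load of wr, so recv() sees a fully written slot */
        atomic_store_explicit(&wr, w + 1, memory_order_release);
        return 1;
    }

    int recv(int *m) {              /* one receiver thread; returns 0 if empty */
        unsigned r = atomic_load_explicit(&rd, memory_order_relaxed);
        if (atomic_load_explicit(&wr, memory_order_acquire) == r)
            return 0;
        *m = fifo[r % N];
        atomic_store_explicit(&rd, r + 1, memory_order_release);
        return 1;
    }

The raw "int rd, wr" version earlier silently relies on the compiler and
hardware not reordering the fifo[] store past the wr increment; the
atomics make that happens-before edge explicit, which is also what a
race detector would need in order to see the code as synchronized.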