6.828 2011 L17: Lock-free coordination

Required reading: Linux scalability

Plan:
  programming without locks
    example: lists
  RCU: a plan widely used in Linux
  paper discussion

Problem: locks limit scalability (they serialize lock holders)
  transferring a lock from one holder to another is expensive
  true even when the locks themselves are scalable
  can we do better?  lock-free concurrent data structures

Example: a stack
  sequential program:

    struct element {
      int key;
      int value;
      struct element *next;
    };

    struct element *top;

    void push(struct element *e) {
      e->next = top;
      top = e;
    }

    struct element *pop(void) {
      struct element *e = top;
      top = e->next;
      return e;
    }

    int search(int key) {
      struct element *e = top;
      while (e) {
        if (e->key == key)
          return e->value;
        e = e->next;
      }
      return -1;
    }

  this is clearly not going to work on a concurrent system
    what is a race?

  global spinlock: correct, but what about performance?
    how many concurrent ops?  just one CPU at a time
    bus interactions?  the lock's cache line bounces for both reads and writes

  global read-write lock: correct, but what about performance?
    how many concurrent ops?  theoretically, one per CPU
    bus interactions?  the lock's cache line still bounces for both reads
      and writes (draw the timing diagram of CPU interactions)
    we might be slower on two CPUs than on a single CPU!

  is it going to get better if we allocate a read-write lock for each item?
    concurrency is still possible
    but the penalty is even worse: N_elem cache-line bounces per search traversal

  other possible solutions:
    partition the data into n lists (e.g., n free lists, one per core)
    if one core runs out of memory, steal memory from another core's free list

Why do we want to avoid locks?
  performance
  complexity
  deadlock
  priority inversion

What's the plan?
  reduce the operation we want to perform to some atomic x86 instruction
  the x86 LOCK prefix makes many read-modify-write instructions atomic
  simple example: implement atomic counters by adding the LOCK prefix
    (a compilable sketch appears at the end of this section)
  the most general primitive is cmpxchg (which we have seen many times now):

    int cmpxchg(int *addr, int old, int new) {
      int was = *addr;
      if (was == old)
        *addr = new;
      return was;
    }

  cmpxchg can be used to implement locks, but we can also use it directly
  for concurrent, correct access to the linked list

Example: concurrent stack without locks
  (a compilable sketch using GCC builtins appears at the end of this section)

    void push(struct element *e) {
    again:
      e->next = top;
      if (cmpxchg(&top, e->next, e) != e->next)
        goto again;
    }

    struct element *pop(void) {
    again:
      struct element *e = top;
      if (cmpxchg(&top, e, e->next) != e)
        goto again;
      return e;
    }

  search can be the same as in the non-concurrent case (almost..)

Why is this better than using locks?
  readers no longer generate spurious updates, which incurred the performance hit
  and maybe not having to think about deadlock is great

Problem 1: lock-free data structures require careful design and hardware support
  suppose we want to remove arbitrary elements, not just the first one
  what could go wrong if we try to use the same cmpxchg?
    race condition when two processors delete adjacent nodes:
    one processor deletes a node whose neighbor is being deleted concurrently
  we would need DCAS (double compare-and-swap) to implement remove properly
    must make sure neither the previous nor the next element changed
    x86 hardware doesn't have DCAS
  one approach: on delete, mark the node's next pointer to signal that the
    node is deleted; now the CAS that removes a subsequent node fails if
    the previous node is deleted concurrently too
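Aside (a sketch, not from the original notes): the LOCK-prefix counter idea,
made compilable.  atomic_inc uses inline assembly to emit "lock incl";
GCC's __sync_fetch_and_add(&counter, 1) would generate equivalent code.
Build with gcc -pthread.

    #include <stdio.h>
    #include <pthread.h>

    volatile int counter = 0;

    /* atomically increment *addr; the LOCK prefix makes the
       read-modify-write of the incl instruction atomic */
    static inline void atomic_inc(volatile int *addr) {
      __asm__ __volatile__("lock; incl %0" : "+m" (*addr));
    }

    void *worker(void *arg) {
      for (int i = 0; i < 1000000; i++)
        atomic_inc(&counter);     /* without LOCK, increments would be lost */
      return 0;
    }

    int main(void) {
      pthread_t t[4];
      for (int i = 0; i < 4; i++)
        pthread_create(&t[i], 0, worker, 0);
      for (int i = 0; i < 4; i++)
        pthread_join(t[i], 0);
      printf("%d\n", counter);    /* prints 4000000 */
      return 0;
    }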
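Aside (a sketch, with assumptions): the lock-free stack above, written against
GCC's real __sync_val_compare_and_swap builtin (which compiles to lock cmpxchg
on x86).  The NULL check in pop is an addition of mine; the pseudocode above
would read e->next even on an empty stack.  The memory-reuse issue discussed
next (problem 2) still applies to this code as written.

    #include <stddef.h>

    struct element {
      int key;
      int value;
      struct element *next;
    };

    struct element *top;

    void push(struct element *e) {
    again:
      e->next = top;
      /* swing top from e->next to e, unless top changed underneath us */
      if (__sync_val_compare_and_swap(&top, e->next, e) != e->next)
        goto again;
    }

    struct element *pop(void) {
    again:
      struct element *e = top;
      if (e == NULL)
        return NULL;
      /* reading e->next here is exactly what makes memory reuse dangerous */
      if (__sync_val_compare_and_swap(&top, e, e->next) != e)
        goto again;
      return e;
    }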
Problem 2: memory reuse
  when can we free a memory block, if other CPUs could be accessing it?
    another CPU's search might be traversing any of the elements
  reusing a memory block can corrupt the list:
    the stack contains three elements
      top -> A -> B -> C
    CPU 1 is about to pop off the top of the stack,
      and is preempted just before cmpxchg(&top, A, B)
    CPU 2 pops off A and B, and frees both of them
      top -> C
    CPU 2 allocates another item (malloc reuses A's memory) and pushes it
      top -> A -> C
    CPU 1's cmpxchg succeeds, and the stack now looks like
      top -> B -> C
    this is called the "ABA problem"
      (the memory switches from A-state to B-state and back to A-state,
       and the CAS cannot tell)
  strawman solution for this specific problem (actually used by some systems):
    type-stable memory (stack elements never become non-stack-elements)
    give each stack element a reuse counter, and
    include the generation # along with each pointer, so that cmpxchg
      notices reuse (a sketch of this tagged-pointer trick appears at the
      end of this section)
  but for more complex data structures this becomes hard to solve
    for example, readers must check that the data structure itself hasn't
      been freed
    we need a general-purpose garbage-collection plan; see below

RCU: read-copy update
  a concurrent data structure approach used in the Linux kernel
  for read-intensive concurrent data structures
  many usages in the kernel: hash tables, lists, etc.
  3 components:
  - lock-free readers, but writers are serialized using locks
  - writers don't update in place: on a write, make a copy, update the copy,
    and switch one pointer atomically
    => readers see an atomic switch from the old version to the new one
  - garbage collection, to allow readers to keep reading old versions and
    to avoid the ABA problem
  can be applied to many data structures

Example: RCU stack

    void rcu_push(int k, int v) {
      acquire(l);
      push(k, v);
      release(l);
    }

    elem_t *rcu_pop() {
      elem_t *e;
      acquire(l);
      e = pop();
      release(l);
      return e;
    }

    int rcu_search(int k) {
      int v;
      v = search(k);
      return v;
    }

  problem: garbage collection
    after pop, can we just free e?
    no!  a search may still be looking at it!
    in a garbage-collected language, the garbage collector makes this safe
    for C, many different gc schemes are possible:
      on free, put the element on a list for delayed freeing
      actually free an element only once we are sure no reader is looking at it
    one scheme: epochs
      a global epoch counter
      writers increment the epoch number (they hold the lock anyway)
      on free of an element, record the current epoch number in the element
      at the beginning of a read, the thread stores the global epoch in
        thread-local state
      gc: free all elements whose epoch # < the minimum over all readers
      (a self-contained sketch appears at the end of this section)

  example stack with epochs:

    int rcu_search(int k) {
      int v;
      rcu_begin_read(tid);
      v = search(k);
      rcu_end_read(tid);
      return v;
    }

    void rcu_begin_read(int tid) {
      epochs[tid].epoch = global_epoch;
      __sync_synchronize();
    }

    void rcu_end_read(int tid) {
      epochs[tid].epoch = INF;
    }

    void free(e) {
      // record the global epoch into e
    }

    void gc() {
      unsigned long min = global_epoch;
      for (i = 0; i < nthread; i++) {
        if (min > epochs[i].epoch)
          min = epochs[i].epoch;
      }
      // free all e whose epoch < min
    }

  with RCU, the concurrent code is very similar to the sequential code
    a general approach for read-intensive data structures
  but RCU serializes all writers
    ok for the stack case
    but what if we insert at the beginning and at the end of a list concurrently?
    it may also make unnecessary copies

Future: lock-free data structures for both readers and writers?
  these exist for lists, hash tables, skip lists, etc.
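Aside (a sketch, with assumptions of mine): a self-contained version of the
epoch scheme above.  It assumes writers are serialized (as in RCU), uses a
"limbo" list for delayed frees, and renames free() to rcu_free() to avoid
colliding with the C library.  NTHREAD, the limbo field, and rcu_init are
illustrative names, not from the notes.

    #include <stdlib.h>
    #include <limits.h>

    #define NTHREAD 16
    #define INF ULONG_MAX

    struct element {
      int key, value;
      struct element *next;    /* stack link */
      struct element *limbo;   /* link in the delayed-free list */
      unsigned long epoch;     /* epoch at which it was logically freed */
    };

    unsigned long global_epoch = 1;
    struct { unsigned long epoch; } epochs[NTHREAD];  /* per-reader epochs */
    struct element *limbo_head;  /* touched only with the writer lock held */

    void rcu_init(void) {
      for (int i = 0; i < NTHREAD; i++)
        epochs[i].epoch = INF;   /* no reader pins any epoch yet */
    }

    void rcu_begin_read(int tid) {
      epochs[tid].epoch = global_epoch;
      __sync_synchronize();      /* publish before touching the structure */
    }

    void rcu_end_read(int tid) {
      epochs[tid].epoch = INF;
    }

    /* called by the writer with the lock held: delay the free */
    void rcu_free(struct element *e) {
      e->epoch = global_epoch;
      e->limbo = limbo_head;
      limbo_head = e;
      global_epoch++;            /* writers advance the epoch */
    }

    /* free everything that no current reader can still be traversing */
    void rcu_gc(void) {
      unsigned long min = global_epoch;
      for (int i = 0; i < NTHREAD; i++)
        if (epochs[i].epoch < min)
          min = epochs[i].epoch;
      struct element **pp = &limbo_head;
      while (*pp) {
        struct element *e = *pp;
        if (e->epoch < min) {    /* unlinked before any active read began */
          *pp = e->limbo;
          free(e);
        } else {
          pp = &e->limbo;
        }
      }
    }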
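Aside (a sketch, with assumptions): one concrete encoding of the generation-#
strawman from the ABA discussion.  Elements live in a type-stable pool and
links are pool indices, so an index plus a 32-bit generation count fit in a
single 64-bit word that lock cmpxchg can swap.  The pool, the packing, and
NIL are illustrative choices, not from the notes; slot allocation is left out.

    #include <stdint.h>

    #define NELEM 1024
    #define NIL   0xffffffffu

    struct element { int key, value; uint32_t next; };  /* links are indices */
    struct element pool[NELEM];   /* type-stable: slots are always elements */

    /* top packs {generation, index} into one CAS-able 64-bit word */
    volatile uint64_t top = NIL;  /* generation 0, empty stack */

    static uint64_t pack(uint32_t gen, uint32_t idx) {
      return ((uint64_t)gen << 32) | idx;
    }
    static uint32_t gen_of(uint64_t t) { return t >> 32; }
    static uint32_t idx_of(uint64_t t) { return (uint32_t)t; }

    void push(uint32_t i) {       /* i: a pool slot owned by the caller */
      for (;;) {
        uint64_t old = top;
        pool[i].next = idx_of(old);
        /* bump the generation on every update, so a reused index
           makes the packed word compare unequal */
        if (__sync_bool_compare_and_swap(&top, old, pack(gen_of(old) + 1, i)))
          return;
      }
    }

    uint32_t pop(void) {
      for (;;) {
        uint64_t old = top;
        uint32_t i = idx_of(old);
        if (i == NIL)
          return NIL;
        uint32_t next = pool[i].next;  /* safe: pool memory never unmapped */
        /* if slot i was popped and re-pushed meanwhile, the generation
           differs, the CAS fails, and we retry: no ABA corruption */
        if (__sync_bool_compare_and_swap(&top, old, pack(gen_of(old) + 1, next)))
          return i;
      }
    }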
Paper

What is the current state of scalability in current kernels?
  many short sections
  partitioning and replication of data structures for scalability
  RCU for lock-free sections

Example: exim
  many processes do pathname lookups concurrently
  Linux has a directory entry cache for the mapping
    [i#, name] -> file struct
  RCU has been used for the directory entry cache:
    lookups for different pathnames run in parallel, without locks
    lookups for the same dentry are serialized by a per-dentry lock
  problem: lookups for "/", however, all hit the same dentry
    contention on its spinlock, and performance collapse
    and there is no simple fix: reads increment the refcount, and writes
      (e.g., rename) must update the [i#, name] -> file struct mapping too
  strawman solution: scalable locks
    avoids the collapse, but gives no more scalability
  solution: lock-free dentry lookups?  add a generation #
    write: lock the dentry; g = gen; gen = 0; change the fields; gen = g+1
    read:  g <- gen #
           if g == 0, use locks (some core is modifying the dentry)
           copy the relevant fields
           g1 <- gen #
           if g1 != g, fall back to the locking protocol
           if they match, refcnt++ (but only if refcnt != 0)
    the generation # is a well-known trick (cf. Linux's seqlocks)
    (a sketch of this read protocol appears below)
  result: all lookups run lock-free

Future: ?
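Aside (a sketch, with assumptions): the generation-# read protocol above,
written as a generic seqlock-style reader/writer pair.  The dentry layout,
the function names, and the mutex fallback are illustrative; this is not
Linux's actual dcache code.

    #include <string.h>
    #include <pthread.h>

    struct dentry {
      pthread_mutex_t lock;   /* taken by writers and fallback readers */
      volatile unsigned gen;  /* 0 means "a writer is mid-update" */
      volatile int refcnt;    /* 0 means the dentry is being freed */
      char name[64];
      void *file;             /* stand-in for the [i#, name] -> file mapping */
    };

    /* write: lock dentry, g = gen, gen = 0, change, gen = g+1 */
    void dentry_update(struct dentry *d, const char *name, void *file) {
      pthread_mutex_lock(&d->lock);
      unsigned g = d->gen;
      d->gen = 0;                /* tell lock-free readers to back off */
      __sync_synchronize();
      strncpy(d->name, name, sizeof d->name - 1);
      d->name[sizeof d->name - 1] = 0;
      d->file = file;
      __sync_synchronize();
      d->gen = g + 1;            /* new generation: racing reads will retry */
      pthread_mutex_unlock(&d->lock);
    }

    /* read: returns the file on a clean lock-free snapshot;
       returns NULL to tell the caller to fall back to the locking path */
    void *dentry_lookup_lockfree(struct dentry *d, char *name_out) {
      unsigned g = d->gen;
      if (g == 0)
        return NULL;            /* some core is modifying the dentry */
      __sync_synchronize();
      strncpy(name_out, d->name, 64);  /* copy the relevant fields */
      void *file = d->file;
      __sync_synchronize();
      if (d->gen != g)
        return NULL;            /* raced with a writer */
      /* generations match: take a reference, but only if the dentry
         is still live (refcnt != 0) */
      for (;;) {
        int r = d->refcnt;
        if (r == 0)
          return NULL;          /* being freed: fall back to locks */
        if (__sync_bool_compare_and_swap(&d->refcnt, r, r + 1))
          return file;
      }
    }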