6.828 2016 : Locking

Why talk about locking?
  apps want to use multi-core processors for parallel speed-up
  so the kernel must deal with parallel system calls
  and thus parallel access to kernel data (buffer cache, processes, &c)
  locks help with correct sharing of data
  locks can limit parallel speedup

Locking homework
  recall: ph.c, multi-threaded hash table, put(), get()
  vi ph0.c
  Q: why run ph.c on multiple cores?
    diagram: CPUs, bus, RAM
    assumption: CPU is the bottleneck, so we can divide work between CPUs
    Q: can we beat single-core put() time? single-core get() time?
  ./ph0 1
  Q: how to measure parallel speedup?
  ./ph0 2
    twice as much work per unit time!
  Q: where are the missing keys?
  Q: specific scenario?
    diagram...
    table[0] = 15
    concurrent put(5), put(10)
    both insert()s allocate a new entry, point next to 15
    both set table[0] to their new entry
    last inserter wins, the other one is lost!
    called a "lost update"; an example of a "race"
    race = concurrent accesses; at least one is a write
  Q: where to put the lock/unlock?
  Q: one lock covering the whole hash table?
    why? why not?
    called a "big" or "coarse-grained" lock
  ./ph1 2
    faster? slower? why?
  Q: one lock per table[] entry?
    this lock is "finer grained"
    why might this be good? (see the sketch after this section)
  ./ph2 2
    faster? slower? why?
    what might be harder with per-bucket locks?
    will we get a good speedup with 10 cores? NBUCKET=5...
  Q: one lock per struct entry, protecting the next pointer?
    why? why not?
  Q: does get() need to lock?
  Q: does get() need to lock if there are concurrent put()s?
    it's a race; but is it incorrect?
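A minimal sketch of per-bucket locking in the style of ph.c
(NBUCKET, struct entry, and the insert() signature here are
assumptions shaped like ph.c, not its exact code):

  #include <pthread.h>
  #include <stdlib.h>

  #define NBUCKET 5

  struct entry {
    int key;
    int value;
    struct entry *next;
  };

  static struct entry *table[NBUCKET];
  static pthread_mutex_t lock[NBUCKET];   // one lock per bucket: "fine grained"

  void table_init(void) {
    for (int i = 0; i < NBUCKET; i++)
      pthread_mutex_init(&lock[i], NULL);
  }

  // push a new entry on the bucket's list; caller must hold that bucket's lock
  static void insert(int key, int value, struct entry **p, struct entry *n) {
    struct entry *e = malloc(sizeof(*e));
    e->key = key;
    e->value = value;
    e->next = n;                          // point next at the old head
    *p = e;                               // then swing the bucket head
  }

  void put(int key, int value) {
    int i = key % NBUCKET;
    pthread_mutex_lock(&lock[i]);         // serializes puts to this bucket only
    insert(key, value, &table[i], table[i]);
    pthread_mutex_unlock(&lock[i]);
  }

Two concurrent put()s to the same bucket now run one at a time, so
neither insert is lost; put()s to different buckets still overlap.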
The lock abstraction:
  lock l
  acquire(l)
    x = x + 1   -- "critical section"
  release(l)
  a lock is itself an object
  if multiple threads call acquire(l)
    only one will return right away
    the others will wait for release() -- "block"
  a program typically has lots of data, lots of locks
    if different threads use different data,
    then they likely hold different locks,
    so they can execute in parallel -- get more work done
  note that lock l is not specifically tied to data x
    the programmer has a plan for the correspondence

A conservative rule to decide when you need to lock:
  any time two threads use a memory location, and at least one is a write
  don't touch shared data unless you hold the right lock!
  (too strict: program logic may sometimes rule out sharing; lock-free)
  (too loose: printf(); not always a simple lock/data correspondence)

Could locking be automatic?
  perhaps the language could associate a lock with every data object
    compiler adds acquire/release around every use
    less room for the programmer to forget!
  that idea is often too rigid:
    rename("d1/x", "d2/y"):
      lock d1, erase x, unlock d1
      lock d2, add y, unlock d2
    problem: the file didn't exist for a while!
    rename() should be atomic
      other system calls should see before, or after, not in between
      otherwise it's too hard to write programs
    we need:
      lock d1 ; lock d2
      erase x, add y
      unlock d2 ; unlock d1
  that is, the programmer often needs explicit control over
    the region of code during which a lock is held,
    in order to hide awkward intermediate states

Ways to think about what locks achieve:
  locks help avoid lost updates
  locks help you create atomic multi-step operations -- hide intermediate states
  locks help operations maintain invariants on a data structure
    assume the invariants are true at the start of the operation
    the operation uses locks to hide temporary violation of invariants
    the operation restores the invariants before releasing the locks

Problem: deadlock
  notice rename() held two locks
  what if:
    core A                 core B
    rename(d1/x, d2/y)     rename(d2/a, d1/b)
    lock d1                lock d2
    lock d2 ...            lock d1 ...
  solution:
    programmer works out an order for all locks
    all code must acquire locks in that order
    i.e. predict locks, sort, acquire -- complex!

Locks versus modularity
  locks make it hard to hide details inside modules
  to avoid deadlock, I need to know the locks acquired by the functions I call
  and I may need to acquire them before calling, even if I don't use them
  i.e. locks are often not the private business of individual modules

How to think about where to place locks?
  here's a simple plan for new code:
  1. write the module to be correct under serial execution
     i.e. assuming one CPU, one thread
       insert(){ e->next = l; l = e; }
     but not correct if executed in parallel
  2. add locks to FORCE serial execution
     since acquire/release allows execution by only one CPU at a time
  the point: it's easier for programmers to reason about serial code
    locks can cause your serial code to be correct despite parallelism
  what about performance?
    after all, we're probably locking as part of a plan to get parallel speedup

Locks and parallelism
  locks *prevent* parallel execution
  to get parallelism, you often need to split up data and locks
    in a way that lets each core use different data and different locks
    "fine grained locks"
  choosing the best split of data/locks is a design challenge:
    the whole ph.c table; each table[] row; each entry
    the whole FS; a directory/file; a disk block
    the whole kernel; each subsystem; each object
  you may need to re-design code to make it work well in parallel
    example: break a single free memory list into per-core free lists
      helps if threads were waiting a lot on the lock for the single free list
    such re-writes can require a lot of work!

Lock granularity advice
  start with big locks, e.g. one lock protecting an entire module
    less deadlock, since less opportunity to hold two locks
    less reasoning about invariants/atomicity required
  measure to see if there's a problem
    big locks are often enough -- maybe little time is spent in that module
  re-design for fine-grained locking only if you have to

Let's look at locking in xv6.

A typical use of locks: ide.c
  typical of many O/S's device driver arrangements
  diagram:
    user processes, kernel, FS, iderw, append to disk queue
    IDE disk hardware
    ideintr
  sources of concurrency: processes, interrupt
  only one lock in ide.c: idelock -- fairly coarse-grained
  iderw() -- what does idelock protect?
    1. no races in idequeue operations
    2. if the queue is not empty, the IDE h/w is executing the head of the queue
    3. no concurrent access to IDE registers
  ideintr() -- the interrupt handler
    acquires the lock -- might have to wait at interrupt level!
    uses idequeue (1)
    hands the next queued request to the IDE h/w (2)
    touches IDE h/w registers (3)

How to implement locks?
  why not:
    struct lock { int locked; };
    acquire(l) {
      while(1){
        if(l->locked == 0){ // A
          l->locked = 1;    // B
          return;
        }
      }
    }
  oops: race between lines A and B
  how can we do A and B atomically?

Atomic exchange instruction:
  mov $1, %eax
  xchg %eax, addr
  does this in hardware:
    lock addr globally (other cores cannot use it)
    temp = *addr
    *addr = %eax
    %eax = temp
    unlock addr
  x86 h/w provides a notion of locking a memory location
    different CPUs have had different implementations
    diagram: cores, bus, RAM, lock thing
  so we are really pushing the problem down to the hardware
    h/w implements the lock at the granularity of a cache line or the entire bus
    the memory lock forces concurrent xchg's to run one at a time, not interleaved

Now:
  acquire(l){
    while(1){
      if(xchg(&l->locked, 1) == 0){
        break;
      }
    }
  }
  if l->locked was already 1, xchg sets it to 1 (again), returns 1,
    and the loop continues to spin
  if l->locked was 0, at most one xchg will see the 0;
    it will set it to 1 and return 0; the other xchg's will return 1
  this is a "spin lock", since waiting cores "spin" in the acquire loop
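A minimal sketch of the same idea in portable C, using GCC/Clang
atomic builtins in place of a hand-written xchg (an assumption for
illustration; xv6 uses its own inline-assembly xchg() in x86.h):

  struct spinlock { volatile int locked; };

  void spin_acquire(struct spinlock *l) {
    // __sync_lock_test_and_set is an atomic exchange (xchg on x86):
    // it stores 1 and returns the old value as one indivisible step
    while (__sync_lock_test_and_set(&l->locked, 1) != 0)
      ;  // spin: another core holds the lock
    __sync_synchronize();  // barrier: keep critical-section accesses after acquire
  }

  void spin_release(struct spinlock *l) {
    __sync_synchronize();  // barrier: keep critical-section accesses before release
    l->locked = 0;
  }

Unlike xv6's acquire(), this sketch does not disable interrupts;
see the pushcli() discussion below.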
Look at the xv6 spinlock implementation
  spinlock.h -- you can see the "locked" member of struct lock
  spinlock.c / acquire():
    see the while-loop and xchg() call
    what is the pushcli() about? why disable interrupts?
  release():
    sets lk->locked = 0
    and re-enables interrupts

Detail: memory read/write ordering
  suppose two cores use a lock to guard a counter, x,
  and we have a naive lock implementation:
    Core A:               Core B:
    locked = 1
    x = x + 1             while(locked == 1)
    locked = 0              ...
                          locked = 1
                          x = x + 1
                          locked = 0
  the compiler AND the CPU re-order memory accesses
    i.e. they do not obey the source program's order of memory references
    e.g. the compiler might generate this code for core A:
      locked = 1
      locked = 0
      x = x + 1
    i.e. move the increment outside the critical section!
  the legal behaviors are called the "memory model"
  release()'s call to __sync_synchronize() prevents re-ordering:
    the compiler won't move a memory reference past a __sync_synchronize()
    and it (may) issue a "memory barrier" instruction to tell the CPU
  acquire()'s call to xchg() has a similar effect:
    Intel promises not to re-order past the xchg instruction
    some junk in x86.h's xchg() tells the C compiler not to delete or re-order
      (volatile asm says don't delete, "m" says no re-order)
  if you use locks, you don't need to understand the memory ordering rules
    you need them if you want to write exotic "lock-free" code

Why spin locks?
  don't they waste CPU while waiting?
  why not give up the CPU and switch to another process, let it run?
  what if the holding thread needs to run; shouldn't the waiting thread yield the CPU?
  spin lock guidelines:
    hold spin locks for very short times
    don't yield the CPU while holding a spin lock
  systems often provide "blocking" locks for longer critical sections
    waiting threads yield the CPU
    but the overheads are typically higher
    you'll see some xv6 blocking schemes later

Advice:
  don't share if you don't have to
  start with a few coarse-grained locks
  instrument your code -- which locks are preventing parallelism?
  use fine-grained locks only as needed for parallel performance
  use an automated race detector
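  for example, GCC and Clang ship ThreadSanitizer, which reports races
  like ph.c's lost update at run time (a hypothetical invocation;
  exact flags vary by compiler):
    cc -g -fsanitize=thread ph0.c -o ph0 -lpthread
    ./ph0 2   # prints a "data race" report naming the racing
              # reads/writes and each thread's stack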