6.S081 2020 Lecture 10: Locking

Why talk about locking?
  apps want to use multi-core processors for parallel speed-up
  so kernel must deal with parallel system calls
    and thus parallel access to kernel data (buffer cache, processes, &c)
  locks help with correct sharing of data
  locks can limit parallel speedup

What goes wrong if we don't have locks?
  Case study: delete acquire/release in kalloc.c
  Boot kernel
    works!
  Run usertests
    all tests pass!
    except we lose some pages
  Why do we lose pages?
    picture of shared-memory multiprocessor
    race between two cores calling kfree()
  BUMMER: we need locks for correctness
    but lose performance (kfree is serialized)

The lock abstraction:
  lock l
  acquire(l)
    x = x + 1   -- "critical section"
  release(l)
  a lock is itself an object
  if multiple threads call acquire(l)
    only one will return right away
    the others will wait for release() -- "block"
  a program typically has lots of data, lots of locks
    if different threads use different data,
    then they likely hold different locks,
    so they can execute in parallel -- get more work done.
  note that lock l is not specifically tied to data x
    the programmer has a plan for the correspondence

A conservative rule to decide when you need to lock:
  any time two threads use a memory location, and at least one is a write
  don't touch shared data unless you hold the right lock!
  (too strict: program logic may sometimes rule out sharing; lock-free code)
  (too loose: printf(); not always a simple lock/data correspondence)

Could locking be automatic?
  perhaps the language could associate a lock with every data object
    compiler adds acquire/release around every use
    less room for the programmer to forget!
  that idea is often too rigid:
    rename("d1/x", "d2/y"):
      lock d1, erase x, unlock d1
      lock d2, add y, unlock d2
    problem: the file didn't exist for a while!
    rename() should be atomic
      other system calls should see before, or after, not in between
      otherwise too hard to write programs
    we need:
      lock d1 ; lock d2
      erase x, add y
      unlock d2 ; unlock d1
  that is, the programmer often needs explicit control over
    the region of code during which a lock is held
    in order to hide awkward intermediate states

Ways to think about what locks achieve
  locks help avoid lost updates
  locks help you create atomic multi-step operations -- hide intermediate states
  locks help operations maintain invariants on a data structure
    assume the invariants are true at start of operation
    operation uses locks to hide temporary violation of invariants
    operation restores invariants before releasing locks

Problem: deadlock
  notice rename() held two locks
  what if:
    core A                  core B
    rename(d1/x, d2/y)      rename(d2/a, d1/b)
    lock d1                 lock d2
    lock d2 ...             lock d1 ...
  solution:
    programmer works out an order for all locks
    all code must acquire locks in that order
    i.e. predict locks, sort, acquire -- complex! (see the sketch below)

Locks versus modularity
  locks make it hard to hide details inside modules
  to avoid deadlock, I need to know locks acquired by functions I call
  and I may need to acquire them before calling, even if I don't use them
  i.e. locks are often not the private business of individual modules
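  A minimal sketch of "sort, then acquire" (not xv6 code; struct dir,
  lock_pair(), and do_rename() are made-up names, and pthread mutexes
  stand in for kernel locks):

    #include <pthread.h>
    #include <stdint.h>

    struct dir {
      pthread_mutex_t lock;
      // ... directory contents ...
    };

    // acquire both directory locks in a fixed global order (by address),
    // so two concurrent rename()s can never each hold one lock while
    // waiting forever for the other
    static void
    lock_pair(struct dir *a, struct dir *b)
    {
      if(a == b){
        pthread_mutex_lock(&a->lock);           // same directory: one lock
      } else if((uintptr_t)a < (uintptr_t)b){
        pthread_mutex_lock(&a->lock);           // lower address first
        pthread_mutex_lock(&b->lock);
      } else {
        pthread_mutex_lock(&b->lock);
        pthread_mutex_lock(&a->lock);
      }
    }

    static void
    unlock_pair(struct dir *a, struct dir *b)
    {
      pthread_mutex_unlock(&a->lock);
      if(a != b)
        pthread_mutex_unlock(&b->lock);
    }

    // rename-like operation: hold both locks across erase+add, so other
    // threads see the file either before or after, never missing
    void
    do_rename(struct dir *d1, struct dir *d2)
    {
      lock_pair(d1, d2);
      // erase x from d1, add y to d2
      unlock_pair(d1, d2);
    }

  every piece of code that needs both locks must follow the same ordering
  rule -- which is exactly why lock order is not private to one module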
Locks and parallelism
  locks *prevent* parallel execution
  to get parallelism, you often need to split up data and locks
    in a way that lets each core use different data and different locks
    "fine-grained locks"
  choosing the best split of data/locks is a design challenge
    whole FS; directory/file; disk block
    whole kernel; each subsystem; each object
  you may need to re-design code to make it work well in parallel
    example: break single free memory list into per-core free lists
      helps if threads were waiting a lot on the lock for the single free list
    such re-writes can require a lot of work!

Lock granularity advice
  start with big locks, e.g. one lock protecting an entire module
    less deadlock since less opportunity to hold two locks
    less reasoning about invariants/atomicity required
  measure to see if there's a problem
    big locks are often enough -- maybe little time is spent in that module
  re-design for fine-grained locking only if you have to

Let's look at locking in xv6.

A typical use of locks: uart.c
  typical of many O/S's device driver arrangements
  diagram: user processes, kernel, UART hardware;
    uartputc() adds to uart_tx_buf, uartintr() removes from uart_tx_buf
  sources of concurrency: processes, interrupt
  only one lock in uart.c: uart_tx_lock -- fairly coarse-grained
  uartputc() -- what does uart_tx_lock protect?
    1. no races in uart_tx_buf operations
    2. if queue not empty, UART h/w is executing head of queue
    3. no concurrent access to UART write registers
  uartintr() -- interrupt handler
    acquires lock -- might have to wait at interrupt level!
    removes character from uart_tx_buf
    hands next queued char to UART h/w (2)
    touches UART h/w registers (3)

How to implement locks?
  why not:
    struct lock { int locked; }
    acquire(l) {
      while(1){
        if(l->locked == 0){ // A
          l->locked = 1;    // B
          return;
        }
      }
    }
  oops: race between lines A and B
  how can we do A and B atomically?

Atomic swap instruction:
    a5 = 1
    s1 = &lk->locked
    amoswap.w.aq a5, a5, (s1)
  does this in hardware:
    lock addr globally (other cores cannot use it)
    temp = *addr
    *addr = a5
    a5 = temp
    unlock addr
  RISC-V h/w provides a notion of locking a memory location
    different CPUs have had different implementations
    diagram: cores, bus, RAM, lock thing
  so we are really pushing the problem down to the hardware
    h/w implements at granularity of cache line or entire bus
  the memory lock forces concurrent swaps to run one at a time, not interleaved

Look at xv6 spinlock implementation
  acquire(lk){
    while(__sync_lock_test_and_set(&lk->locked, 1) != 0)
      ;
  }
  if lk->locked was already 1, __sync_lock_test_and_set
    sets it to 1 (again), returns 1, and the loop continues to spin
  if lk->locked was 0, at most one test_and_set will see the 0;
    it will set it to 1 and return 0; the other test_and_set calls will return 1
  this is a "spin lock", since waiting cores "spin" in the acquire loop
  what is the push_off() about? why disable interrupts?
  release(): sets lk->locked = 0 and re-enables interrupts

Detail: memory read/write ordering
  suppose two cores use a lock to guard a counter, x,
    and we have a naive lock implementation
  Core A:               Core B:
    locked = 1
    x = x + 1             while(locked == 1)
    locked = 0              ...
                          locked = 1
                          x = x + 1
                          locked = 0
  the compiler AND the CPU re-order memory accesses
    i.e. they do not obey the source program's order of memory references
    e.g. the compiler might generate this code for core A:
      locked = 1
      locked = 0
      x = x + 1
    i.e. move the increment outside the critical section!
  the legal behaviors are called the "memory model"
  release()'s call to __sync_synchronize() prevents re-ordering
    compiler won't move a memory reference past a __sync_synchronize()
    and (may) issue a "memory barrier" instruction to tell the CPU
  acquire()'s call to __sync_synchronize() has a similar effect
  if you use locks, you don't need to understand the memory ordering rules
    you need them if you want to write exotic "lock-free" code
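  Putting the pieces together, a stripped-down user-space sketch of a spin
  lock in the style of xv6's spinlock.c (same GCC builtins; it omits xv6's
  push_off()/pop_off() interrupt disabling and its debugging fields):

    struct spinlock {
      unsigned int locked;    // 0 = free, 1 = held
    };

    void
    acquire(struct spinlock *lk)
    {
      // atomic test-and-set (amoswap on RISC-V): spin until this core is
      // the one that changes locked from 0 to 1
      while(__sync_lock_test_and_set(&lk->locked, 1) != 0)
        ;
      // barrier: don't let the critical section's loads/stores be
      // re-ordered to before the point where the lock is actually held
      __sync_synchronize();
    }

    void
    release(struct spinlock *lk)
    {
      // barrier: make the critical section's writes visible before the
      // lock is seen as free
      __sync_synchronize();
      __sync_lock_release(&lk->locked);   // sets lk->locked = 0
    }

  real xv6 acquire()/release() also disable/re-enable interrupts
  (push_off()/pop_off()), so an interrupt handler on the same core can't
  deadlock by spinning on a lock the interrupted code already holds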
Why spin locks?
  don't they waste CPU while waiting?
  why not give up the CPU and switch to another process, let it run?
  what if the holding thread needs to run; shouldn't the waiting thread yield the CPU?
  spin lock guidelines:
    hold spin locks for very short times
    don't yield the CPU while holding a spin lock
  systems provide "blocking" locks for longer critical sections
    waiting threads yield the CPU
    but the overheads are typically higher
    you'll see some xv6 blocking schemes later

Advice:
  don't share if you don't have to
  start with a few coarse-grained locks
  instrument your code -- which locks are preventing parallelism?
  use fine-grained locks only as needed for parallel performance
  use an automated race detector
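  To tie this back to the kalloc.c case study at the top: a user-space
  sketch of a free list protected by a single coarse-grained lock (names
  follow xv6's kalloc.c, but pthread mutexes stand in for xv6 spinlocks):

    #include <pthread.h>

    struct run {
      struct run *next;
    };

    static struct run *freelist;
    static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;

    void
    kfree(void *pa)
    {
      struct run *r = (struct run *) pa;
      pthread_mutex_lock(&freelist_lock);
      r->next = freelist;    // without the lock, two concurrent kfree()s can
      freelist = r;          // both read the old head, and one page is lost
      pthread_mutex_unlock(&freelist_lock);
    }

    void *
    kalloc(void)
    {
      pthread_mutex_lock(&freelist_lock);
      struct run *r = freelist;
      if(r)
        freelist = r->next;
      pthread_mutex_unlock(&freelist_lock);
      return (void *) r;
    }

  this is the "start with a few coarse-grained locks" starting point; the
  fine-grained redesign mentioned under "Locks and parallelism" would split
  freelist and freelist_lock per core, so each core mostly uses its own
  list and lock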