6.828 2014 : Locking

Alarm homework
  recall: call a user process "alarm handler" every X timer interrupts
  timer interrupts arrive in trap()
  Q: how to transfer execution to the handler function?
  Q: where is the user EIP stored while in trap()?
  Q: how did the user EIP get there?
  Q: when will the handler start executing?
  Q: how to have the handler return to the interrupted user instructions?
  Q: on what stack should the handler execute (user or kernel)?
  Q: why not call the handler directly from trap()?
     e.g. (*proc->alarmhandler)()
     ("works", but write() system call arguments get pushed on the wrong stack)
  Q: (challenge) what security attack is possible if the kernel doesn't check tf->esp?
  Q: (challenge) how to save/restore caller-saved registers?

Why talk about locking?
  kernels are full of shared data (buffer cache, process list, &c)
  how to access shared data from multiple cores?
  locks help sort out concurrent access
  locks are a big source of bugs
  locks are critical to getting parallel speedup => high performance

Locking homework
  recall: ph.c, a multi-threaded hash table with put() and get()
  Q: why run ph.c on multiple cores?
     diagram: CPUs, bus, RAM
     assumption: the CPU is the bottleneck, so we can divide work between CPUs
  Q: how to measure parallel speedup?
  Q: where are the missing keys?
  Q: specific scenario?
     diagram...
     table[0] = 10, 25
     concurrent put(15), put(20)
     both insert()s allocate a new entry, point its next at 25
     both set 10->next to their new entry
     the last inserter wins; the other's entry is lost!
     this is called a "race" or a "lost update"
  Q: where to put the lock/unlock?
  Q: one lock covering the whole hash table? why? why not?
     called a "big" or "coarse-grained" lock
  Q: one lock per table[] entry?
     this lock is "finer grained"
     what might be harder with this per-row lock?
  Q: one lock per struct entry, protecting the next pointer? why? why not?
     this lock is even more fine-grained
     tricky: 20 minutes wasn't enough to get my version of this right
  Q: should get() also lock? if so, how?
  Q: will per-row locks show parallel speedup with 10 cores?
     NBUCKETS=5...

The lock abstraction:
  lock l
  acquire(l)
    x = x + 1   -- the "critical section"
  release(l)
  a lock is itself an object
  only one core can hold a given lock
    acquire() forces other cores to "block" waiting for a held lock
  a program can have lots of locks
    different locks can be held at the same time,
    so cores can operate on different data in parallel
  note that lock l is not specifically tied to data x
    the program keeps track of the correspondence
    this kind of locking is popular in kernels, but it's not the only possibility!

One way to think about locking
  locks are part of a method for writing correct parallel code
  1. the programmer writes "serial" code,
     i.e. code that is correct assuming only one CPU, such as:
       int x;
       void inc() { x = x + 1; }
     but perhaps not correct if executed in parallel
  2. the programmer adds locks to FORCE serial execution,
     since acquire/release allows execution by only one CPU at a time
  the point: it's easier for programmers to reason about serial code
  locks can make your serial code correct despite parallelism

When do you need locks?
  any time two cores use a memory location and at least one is a write
  rule: don't touch data unless you hold the right lock!

Could locking be automatic?
  perhaps the language could associate a lock with every data object
    the compiler adds acquire/release around every use
    less room for the programmer to forget!
  that idea is often too rigid:
    rename("d1/x", "d2/y"):
      lock d1, erase x, unlock d1
      lock d2, add y, unlock d2
    problem: the file didn't exist for a while!
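    to make that concrete, a sketch of what per-object automatic locking
    would produce (dir_lock/dir_unlock/dir_erase/dir_add are made-up
    helpers, just for illustration -- not xv6 code):

      // hypothetical helpers, not xv6 code
      void rename_auto(struct dir *d1, char *x, struct dir *d2, char *y)
      {
        dir_lock(d1);
        dir_erase(d1, x);    // remove d1/x
        dir_unlock(d1);
        // <-- window: the file exists in neither directory;
        //     a concurrent lookup sees it missing
        dir_lock(d2);
        dir_add(d2, y);      // add d2/y
        dir_unlock(d2);
      }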
  rename() should be atomic
    other system calls should see the state before, or after, not in between
    otherwise it's too hard to write programs
  we need:
    lock d1 ; lock d2
    erase x, add y
    unlock d2 ; unlock d1
  that is, the programmer often needs explicit control over
    the region of code during which a lock is held,
    in order to hide awkward intermediate states

More ways to think about locks
  locks help you create atomic multi-step operations
  locks help operations maintain invariants on a data structure
    assume the invariants are true at the start of the operation
    the operation uses locks to hide temporary violations of the invariants
    the operation restores the invariants before releasing the locks

Problem: deadlock
  notice that rename() held two locks
  what if:
    core A                core B
    rename(d1/x, d2/y)    rename(d2/a, d1/b)
    lock d1               lock d2
    lock d2 ...           lock d1 ...
  solution: the programmer works out an order for all locks,
    and all code must acquire locks in that order
    i.e. predict locks, sort, acquire -- complex!

Lock granularity
  you often have many choices about lock granularity:
    the whole ph.c table; each table[] row; each entry
    the whole FS; a directory/file; a disk block
    the whole kernel; each subsystem; each object
  big locks get you simplicity (= fewer bugs)
    less deadlock, since less opportunity to hold two locks
    less reasoning about invariants/atomicity required
  big locks are sometimes enough
    if the chance of parallelism in that code is low
    e.g. a Big Kernel Lock is OK if most CPU time is spent in user code
  fine-grained locks can increase parallel performance
    e.g. processes can then write different files in parallel, on different cores

Modularity
  locks make it hard to hide details inside modules
    to avoid deadlock, I need to know which locks the functions I call acquire
    for fine-grained locks, I may need to acquire locks before calling functions
  finer grain -> more ugly

Advice:
  start with big locks
  instrument your code -- which locks are preventing parallelism?
  use fine-grained locks only as needed for parallel performance

Let's switch to locking in xv6.

A use of locks: ide.c / iderw()
  typical of many O/S's device driver arrangements
  diagram:
    user processes, kernel, FS, iderw(), append to disk queue
    IDE disk hardware
    ideintr()
  sources of concurrency: processes, interrupts
  there is only one lock in ide.c: idelock
  what does idelock protect?
    1. no races in idequeue operations
    2. the IDE h/w is always executing the head of idequeue
    3. the IDE h/w is executing only one operation at a time
    4. only one process touches the IDE registers at a time
    5. a process is waiting for each buffer in the queue
  now ideintr(), the interrupt handler:
    acquires the lock -- might have to wait, at interrupt level!
    uses idequeue (1)
    touches the IDE h/w registers (3, 4)
    wakes the waiting process (5)
    hands the next queued request to the IDE h/w (2, 4)

How to implement locks?
  what is wrong with this?
    struct lock { int locked; };
    acquire(l){
      while(1){
        if(l->locked == 0){   // A
          l->locked = 1;      // B
          return;
        }
      }
    }
  oops: there's a race between lines A and B
  how can we do A and B atomically?
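  to see the race happen, here's a user-space demo of the broken lock
  (standard C + pthreads, not xv6 code; compile with cc -pthread):

    #include <pthread.h>
    #include <stdio.h>

    struct lock { volatile int locked; };  // volatile: don't cache the flag;
                                           // it does NOT make A+B atomic
    struct lock l;
    volatile int counter;

    void broken_acquire(struct lock *lk)
    {
      while(1){
        if(lk->locked == 0){   // A: both threads can see 0 here...
          lk->locked = 1;      // B: ...before either reaches this line
          return;
        }
      }
    }

    void broken_release(struct lock *lk)
    {
      lk->locked = 0;
    }

    void *worker(void *arg)
    {
      for(int i = 0; i < 1000000; i++){
        broken_acquire(&l);
        counter = counter + 1;   // the "critical section"
        broken_release(&l);
      }
      return 0;
    }

    int main(void)
    {
      pthread_t t1, t2;
      pthread_create(&t1, 0, worker, 0);
      pthread_create(&t2, 0, worker, 0);
      pthread_join(t1, 0);
      pthread_join(t2, 0);
      printf("counter = %d (expect 2000000)\n", counter);
      return 0;
    }

  on a multi-core machine the final count usually falls short of
  2000000: lost updates, just like ph.c's missing keys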
Atomic exchange instruction:
  mov $1, %eax
  xchg %eax, addr
  does this in hardware:
    lock addr globally (other cores cannot use it)
    temp = *addr
    *addr = %eax
    %eax = temp
    unlock addr
  so x86 h/w provides a notion of locking a memory location
    diagram: cores, bus, RAM, lock thing
    we are really pushing the problem down to the hardware
    the h/w implements the lock at the granularity of a cache line or the entire bus
  the memory lock forces concurrent xchg's to run one at a time, not interleaved

Now:
  acquire(l){
    while(1){
      if(xchg(&l->locked, 1) == 0)
        break;
    }
  }
  if l->locked was already 1: xchg sets it to 1 (again) and returns 1,
    so the loop continues to spin
  if l->locked was 0: at most one xchg will see the 0; it will set the
    flag to 1 and return 0; the other xchg's will return 1
  this is a "spin lock", since waiting cores "spin" in the acquire loop
  (a user-space analogue using C11 atomics is sketched at the end of these notes)

Look at the xv6 spinlock implementation
  spinlock.h -- you can see the "locked" member of struct spinlock
  spinlock.c / acquire():
    see the while-loop and the xchg() call
    what is the pushcli() about? why disable interrupts?
  release():
    uses xchg() to set lk->locked = 0
    re-enables interrupts

Detail: memory ordering
  suppose two cores use a lock to guard a counter, x
    Core A:                Core B:
      locked = 1
      x = x + 1            while(locked == 1)
      locked = 0             ...
                           locked = 1
                           x = x + 1
                           locked = 0
  the compiler AND the CPU re-order memory accesses
    e.g. the compiler might generate this code for core A:
      locked = 1
      locked = 0
      x = x + 1
    i.e. move the increment outside the critical section!
  the call to xchg() tells both the compiler and the x86 not to re-order:
    Intel promises not to re-order accesses past the xchg instruction
    some junk in x86.h's xchg() tells the C compiler not to delete or re-order it
      (volatile asm says don't delete, "m" says don't re-order)
  thus the xchg() in release()
  learn the memory model rules if you like pushing the envelope,
    to write tricky and efficient parallel code
    but the rules are too complex for me to remember

Why spin locks? don't they waste CPU while waiting?
  why not give up the CPU and switch to another process, let it run?
  what if the holding thread needs to run -- shouldn't you yield the CPU?
  spin lock guidelines:
    hold spin locks for very short times
    don't yield the CPU while holding a spin lock
  systems often provide "blocking" locks for longer critical sections
    waiting threads yield the CPU, but the overheads are typically higher
    you'll see some xv6 blocking schemes later

Remember:
  don't share if you don't have to
  use locks when multiple cores might read/write the same data at the same time
  the choice:
    coarse-grained: simple, slow
    fine-grained: complex, fast
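the C11 analogue promised above: a minimal user-space spin lock sketch
using <stdatomic.h> (an illustration of the xchg idea, not xv6's code):

  #include <stdatomic.h>

  struct spinlock {
    atomic_int locked;          // 0 = free, 1 = held
  };

  void spin_acquire(struct spinlock *lk)
  {
    // atomic_exchange is the portable counterpart of x86 xchg:
    // exactly one spinning thread sees the old value 0
    while(atomic_exchange(&lk->locked, 1) != 0)
      ;  // spin
  }

  void spin_release(struct spinlock *lk)
  {
    // a sequentially-consistent store: like xv6's xchg in release(),
    // it keeps the critical section's accesses from moving past it
    atomic_store(&lk->locked, 0);
  }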