6.828 2016 : Locking

Why talk about locking?
  apps want to use multi-core processors for parallel speed-up
  so the kernel must deal with parallel system calls
  and thus parallel access to kernel data (buffer cache, processes, &c)
  locks help with correct sharing of data
  locks can limit parallel speedup

Locking homework
  recall: ph.c, multi-threaded hash table, put(), get()
  vi ph0.c
  Q: why run ph.c on multiple cores?
    diagram: CPUs, bus, RAM
    assumption: CPU is the bottleneck, so we can divide work between CPUs
    Q: can we beat single-core put() time? single-core get() time?
  ./ph0 1
  Q: how to measure parallel speedup?
  ./ph0 2
    twice as much work per unit time!
  Q: where are the missing keys?
  Q: specific scenario?
    diagram...
    table[0] = 15
    concurrent put(5), put(10)
    both insert()s allocate a new entry, point next to 15
    both set table[0] to their new entry
    last inserter wins, the other one is lost!
    called a "lost update"; an example of a "race"
    race = concurrent accesses; at least one is a write
  Q: where to put the lock/unlock?
  Q: one lock covering the whole hash table?
    why? why not?
    called a "big" or "coarse-grained" lock
  ./ph1 2
    faster? slower? why?
  Q: one lock per table[] entry?
    this lock is "finer grained"
    why might this be good? (see the sketch after this section)
  ./ph2 2
    faster? slower? why?
    what might be harder with per-bucket locks?
    will we get a good speedup with 10 cores? NBUCKET=5...
  Q: one lock per struct entry, protecting the next pointer?
    why? why not?
  Q: does get() need to lock?
  Q: does get() need to lock if there are concurrent put()s?
    it's a race; but is it incorrect?
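A minimal sketch of per-bucket locking in the style of ph.c
(NBUCKET, struct entry, and the insert() signature here are
assumptions shaped like ph.c, not its exact code):

  #include <pthread.h>
  #include <stdlib.h>

  #define NBUCKET 5

  struct entry {
    int key;
    int value;
    struct entry *next;
  };

  static struct entry *table[NBUCKET];
  static pthread_mutex_t lock[NBUCKET];   // one lock per bucket: "fine grained"

  void table_init(void) {
    for (int i = 0; i < NBUCKET; i++)
      pthread_mutex_init(&lock[i], NULL);
  }

  // push a new entry on the bucket's list; caller must hold that bucket's lock
  static void insert(int key, int value, struct entry **p, struct entry *n) {
    struct entry *e = malloc(sizeof(*e));
    e->key = key;
    e->value = value;
    e->next = n;                          // point next at the old head
    *p = e;                               // then swing the bucket head
  }

  void put(int key, int value) {
    int i = key % NBUCKET;
    pthread_mutex_lock(&lock[i]);         // serializes puts to this bucket only
    insert(key, value, &table[i], table[i]);
    pthread_mutex_unlock(&lock[i]);
  }

Two concurrent put()s to the same bucket now run one at a time, so
neither insert is lost; put()s to different buckets still overlap.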
The lock abstraction:
  lock l
  acquire(l)
    x = x + 1   -- "critical section"
  release(l)
  a lock is itself an object
  if multiple threads call acquire(l)
    only one will return right away
    the others will wait for release() -- "block"
  a program typically has lots of data, lots of locks
    if different threads use different data,
    then they likely hold different locks,
    so they can execute in parallel -- get more work done
  note that lock l is not specifically tied to data x
    the programmer has a plan for the correspondence

A conservative rule to decide when you need to lock:
  any time two threads use a memory location, and at least one is a write
  don't touch shared data unless you hold the right lock!
  (too strict: program logic may sometimes rule out sharing; lock-free)
  (too loose: printf(); not always a simple lock/data correspondence)

Could locking be automatic?
  perhaps the language could associate a lock with every data object
    compiler adds acquire/release around every use
    less room for the programmer to forget!
  that idea is often too rigid:
    rename("d1/x", "d2/y"):
      lock d1, erase x, unlock d1
      lock d2, add y, unlock d2
    problem: the file didn't exist for a while!
    rename() should be atomic
      other system calls should see before, or after, not in between
      otherwise it's too hard to write programs
    we need:
      lock d1 ; lock d2
      erase x, add y
      unlock d2 ; unlock d1
  that is, the programmer often needs explicit control over
    the region of code during which a lock is held,
    in order to hide awkward intermediate states

Ways to think about what locks achieve:
  locks help avoid lost updates
  locks help you create atomic multi-step operations -- hide intermediate states
  locks help operations maintain invariants on a data structure
    assume the invariants are true at the start of the operation
    the operation uses locks to hide temporary violation of invariants
    the operation restores the invariants before releasing the locks

Problem: deadlock
  notice rename() held two locks
  what if:
    core A                 core B
    rename(d1/x, d2/y)     rename(d2/a, d1/b)
    lock d1                lock d2
    lock d2 ...            lock d1 ...
  solution:
    programmer works out an order for all locks
    all code must acquire locks in that order
    i.e. predict locks, sort, acquire -- complex!

Locks versus modularity
  locks make it hard to hide details inside modules
  to avoid deadlock, I need to know the locks acquired by the functions I call
  and I may need to acquire them before calling, even if I don't use them
  i.e. locks are often not the private business of individual modules

How to think about where to place locks?
  here's a simple plan for new code:
  1. write the module to be correct under serial execution
     i.e. assuming one CPU, one thread
       insert(){ e->next = l; l = e; }
     but not correct if executed in parallel
  2. add locks to FORCE serial execution
     since acquire/release allows execution by only one CPU at a time
  the point: it's easier for programmers to reason about serial code
    locks can cause your serial code to be correct despite parallelism
  what about performance?
    after all, we're probably locking as part of a plan to get parallel speedup

Locks and parallelism
  locks *prevent* parallel execution
  to get parallelism, you often need to split up data and locks
    in a way that lets each core use different data and different locks
    "fine grained locks"
  choosing the best split of data/locks is a design challenge:
    the whole ph.c table; each table[] row; each entry
    the whole FS; a directory/file; a disk block
    the whole kernel; each subsystem; each object
  you may need to re-design code to make it work well in parallel
    example: break a single free memory list into per-core free lists
      helps if threads were waiting a lot on the lock for the single free list
    such re-writes can require a lot of work!

Lock granularity advice
  start with big locks, e.g. one lock protecting an entire module
    less deadlock, since less opportunity to hold two locks
    less reasoning about invariants/atomicity required
  measure to see if there's a problem
    big locks are often enough -- maybe little time is spent in that module
  re-design for fine-grained locking only if you have to

Let's look at locking in xv6.

A typical use of locks: ide.c
  typical of many O/S's device driver arrangements
  diagram:
    user processes, kernel, FS, iderw, append to disk queue
    IDE disk hardware
    ideintr
  sources of concurrency: processes, interrupt
  only one lock in ide.c: idelock -- fairly coarse-grained
  iderw() -- what does idelock protect?
    1. no races in idequeue operations
    2. if the queue is not empty, the IDE h/w is executing the head of the queue
    3. no concurrent access to IDE registers
  ideintr() -- the interrupt handler
    acquires the lock -- might have to wait at interrupt level!
    uses idequeue (1)
    hands the next queued request to the IDE h/w (2)
    touches IDE h/w registers (3)

How to implement locks?
  why not:
    struct lock { int locked; };
    acquire(l) {
      while(1){
        if(l->locked == 0){ // A
          l->locked = 1;    // B
          return;
        }
      }
    }
  oops: race between lines A and B
  how can we do A and B atomically?

Atomic exchange instruction:
  mov $1, %eax
  xchg %eax, addr
  does this in hardware:
    lock addr globally (other cores cannot use it)
    temp = *addr
    *addr = %eax
    %eax = temp
    unlock addr
  x86 h/w provides a notion of locking a memory location
    different CPUs have had different implementations
    diagram: cores, bus, RAM, lock thing
  so we are really pushing the problem down to the hardware
    h/w implements the lock at the granularity of a cache line or the entire bus
    the memory lock forces concurrent xchg's to run one at a time, not interleaved

Now:
  acquire(l){
    while(1){
      if(xchg(&l->locked, 1) == 0){
        break;
      }
    }
  }
  if l->locked was already 1, xchg sets it to 1 (again), returns 1,
    and the loop continues to spin
  if l->locked was 0, at most one xchg will see the 0;
    it will set it to 1 and return 0; the other xchg's will return 1
  this is a "spin lock", since waiting cores "spin" in the acquire loop
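A minimal sketch of the same idea in portable C, using GCC/Clang
atomic builtins in place of a hand-written xchg (an assumption for
illustration; xv6 uses its own inline-assembly xchg() in x86.h):

  struct spinlock { volatile int locked; };

  void spin_acquire(struct spinlock *l) {
    // __sync_lock_test_and_set is an atomic exchange (xchg on x86):
    // it stores 1 and returns the old value as one indivisible step
    while (__sync_lock_test_and_set(&l->locked, 1) != 0)
      ;  // spin: another core holds the lock
    __sync_synchronize();  // barrier: keep critical-section accesses after acquire
  }

  void spin_release(struct spinlock *l) {
    __sync_synchronize();  // barrier: keep critical-section accesses before release
    l->locked = 0;
  }

Unlike xv6's acquire(), this sketch does not disable interrupts;
see the pushcli() discussion below.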
Look at the xv6 spinlock implementation
  spinlock.h -- you can see the "locked" member of struct lock
  spinlock.c / acquire():
    see the while-loop and xchg() call
    what is the pushcli() about? why disable interrupts?
  release():
    sets lk->locked = 0
    and re-enables interrupts

Detail: memory read/write ordering
  suppose two cores use a lock to guard a counter, x,
  and we have a naive lock implementation:
    Core A:               Core B:
    locked = 1
    x = x + 1             while(locked == 1)
    locked = 0              ...
                          locked = 1
                          x = x + 1
                          locked = 0
  the compiler AND the CPU re-order memory accesses
    i.e. they do not obey the source program's order of memory references
    e.g. the compiler might generate this code for core A:
      locked = 1
      locked = 0
      x = x + 1
    i.e. move the increment outside the critical section!
  the legal behaviors are called the "memory model"
  release()'s call to __sync_synchronize() prevents re-ordering:
    the compiler won't move a memory reference past a __sync_synchronize()
    and it (may) issue a "memory barrier" instruction to tell the CPU
  acquire()'s call to xchg() has a similar effect:
    Intel promises not to re-order past the xchg instruction
    some junk in x86.h's xchg() tells the C compiler not to delete or re-order
      (volatile asm says don't delete, "m" says no re-order)
  if you use locks, you don't need to understand the memory ordering rules
    you need them if you want to write exotic "lock-free" code

Why spin locks?
  don't they waste CPU while waiting?
  why not give up the CPU and switch to another process, let it run?
  what if the holding thread needs to run; shouldn't the waiting thread yield the CPU?
  spin lock guidelines:
    hold spin locks for very short times
    don't yield the CPU while holding a spin lock
  systems often provide "blocking" locks for longer critical sections
    waiting threads yield the CPU
    but the overheads are typically higher
    you'll see some xv6 blocking schemes later

Advice:
  don't share if you don't have to
  start with a few coarse-grained locks
  instrument your code -- which locks are preventing parallelism?
  use fine-grained locks only as needed for parallel performance
  use an automated race detector
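  for example, GCC and Clang ship ThreadSanitizer, which reports races
  like ph.c's lost update at run time (a hypothetical invocation;
  exact flags vary by compiler):
    cc -g -fsanitize=thread ph0.c -o ph0 -lpthread
    ./ph0 2   # prints a "data race" report naming the racing
              # reads/writes and each thread's stack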