6.1810 2025 Lecture 12: Locking

Why talk about locking?
  apps want to use multi-core processors for parallel speed-up
  so kernel must deal with parallel system calls
  and thus parallel access to kernel data (buffer cache, processes, &c)
  locks help with correct sharing of data
  locks can limit parallel speedup

What goes wrong if we don't have locks?
  Case study: delete acquire/release in kfree() in kalloc.c
  What will go wrong?
    example: simultaneous kfree() -> lost page
  [diagram: core1, core2, bus, RAM]
  Boot, usertests
    Gets pretty far!
  Why crash? an interleaving like this:
    core A (kfree, no lock)      core B
      r->next = freelist
                                 kalloc() removes a page from freelist
      freelist = r
    r->next still points at the page kalloc() just handed out,
    so the allocated page remains on the free list!
    it will soon be double-allocated, overwritten while in use
    despite kalloc()'s locks
  BUMMER: we need locks for correctness
    but lose performance (kfree is serialized)

The lock abstraction:
  lock l
  acquire(l)
    x = x + 1   -- "critical section"
  release(l)
  a lock is itself an object
  if multiple threads call acquire(l)
    only one will return right away
    the others will wait for release() -- "block"
  a program typically has lots of data, lots of locks
    if different threads use different data,
    then they likely hold different locks,
    so they can execute in parallel on multi-core -- get more work done.
  note that lock l is not tied to data x
    the programmer must have a plan for the correspondence

A conservative rule to decide when you need to lock:
  any time two threads could use a memory location, and at least one is a write
  thus: don't touch shared read/write data unless you hold the right lock!
  Why "conservative"?
    sometimes too strict: program logic may sometimes rule out sharing
    sometimes too loose: printf(); not always simple lock/data correspondence

Could locking be automatic?
  perhaps the language could associate a lock with every data object
    compiler adds acquire/release around every use
    less room for programmer to forget!
  but hard for compiler to decide if data is shared between cores
  and "protect individual object uses" is not a sufficient view
    suppose kfree() said
      acquire; r->next = freelist; release
      acquire; freelist = r; release
    every memory access is protected by a lock! but the code is still broken.
  thus the programmer often needs explicit control over
    the region of code during which a lock is held
    in order to hide incorrect intermediate states

Ways to think about what locks achieve
  atomic multi-step operations -- hide intermediate states
  read-modify-write -- avoid lost updates (x = x + 1, or list prepend)
  maintain invariants
    assume the invariants are true at start of operation
    operation uses locks to hide temporary violation of invariants
    operation restores invariants before releasing locks
  if some function is correct w/o parallelism,
    can often make it correct under parallelism by surrounding with locks.

Problem: deadlock
  rename(f1, f2) must hold two locks, on the two directories
  what if:
    core A                    core B
    rename(d1/x, d2/y)        rename(d2/a, d1/b)
      lock d1                   lock d2
      lock d2 ...               lock d1 ...
  solution: programmer works out an order for all locks
    all code must acquire locks in that order
    i.e. predict locks, sort, acquire -- complex! (see the sketch below)
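  a minimal sketch of "predict, sort, acquire" in C, assuming xv6's
  struct spinlock and acquire(); acquire_pair() is a hypothetical helper,
  not in xv6 -- sorting the two locks by address gives one global order:

    // hypothetical helper (not xv6 code): take two directory locks in
    // a fixed global order -- lower address first -- so that two cores
    // acquiring the same pair can never wait on each other in a cycle.
    void
    acquire_pair(struct spinlock *a, struct spinlock *b)
    {
      if(a == b){                  // rename within a single directory:
        acquire(a);                // only one lock to take
        return;
      }
      if((uint64)a > (uint64)b){   // sort the two locks by address
        struct spinlock *t = a;
        a = b;
        b = t;
      }
      acquire(a);   // everyone takes the lower address first,
      acquire(b);   // so no deadlock cycle can form
    }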
Locks versus modularity
  locks make it hard to hide details inside modules
  to avoid deadlock, I need to know locks acquired by functions I call
  and I may need to acquire them before calling, even if I don't use them
  i.e. locks are often not the private business of individual modules

Locks and parallelism
  locks *prevent* parallel execution
  to get parallelism, you often need to split up data and locks
    so that each core usually uses different data and locks
    "fine grained locks"
  choosing best split of data/locks is a design challenge
    whole FS; directory/file; disk block
    whole kernel; each subsystem; each object
  you may need to re-design code to make it work well in parallel
    example: break single free memory list into per-core free lists
      helps if threads were waiting a lot on lock for single free list
    can require a lot of analysis and programming!

Lock granularity advice
  start with big locks, e.g. one lock protecting entire module
    less deadlock since less opportunity to hold two locks
    less reasoning about invariants/atomicity required
  measure to see if there's a problem
    big locks are often enough -- maybe little time spent in that module
  re-design for fine-grained locking only if you have to

Let's look at locking in xv6.

A typical use of locks: uart.c output
  typical of many O/S's device driver arrangements
  diagram: user processes, kernel, UART, uartwrite(), uartintr()
  sources of concurrency: processes, interrupt
  note interrupt handler may block in acquire()!
    waiting for uartwrite on another core
  what does tx_lock protect?
    * process/process: don't intermingle output
    * process/process: atomic write(THR) and tx_busy=1
    * process/interrupt: invariant tx_busy=1 => interrupt will come
    * process/interrupt: atomic read(LSR) and tx_busy=0

How to implement locks?
  how about:
    struct lock { int locked; }
    acquire(l) {
      while(1){
        if(l->locked == 0){  // A
          l->locked = 1;     // B
          return;
        }
      }
    }
  Let's try it (this is The Question):
    replace acquire() loop's atomic test-and-set with non-atomic
      while(*(volatile int *) &lk->locked != 0)
        ;
      lk->locked = 1;
    result: "panic: release" while booting
    two acquire()s both saw locked == 0, both set locked = 1 (oops)
      (one core's wakeup() vs another core's scheduler())
    one releases, sets locked = 0
    2nd release sees already released, panics

How can we make the test (locked == 0) atomic with the set (locked = 1)?
  RISC-V atomic swap instruction:
    if you look at acquire() in kernel/kernel.asm:
      a5 = 1
      s1 = &lk->locked
      amoswap.w.aq a5, a5, (s1)
    does this in hardware:
      lock *s1 in memory system
      temp = *s1
      *s1 = a5
      a5 = temp
      unlock *s1
  RISC-V h/w internally has a notion of locking a memory location
    different CPUs have had different implementations
    diagram: cores, bus, RAM, bus arbitration gadget
    or via cache coherence protocol
  so we are really pushing the problem down to the hardware
    memory lock forces concurrent amoswap to run one at a time, not interleaved

Look at xv6 spinlock implementation
  acquire(lk){
    while(__sync_lock_test_and_set(&lk->locked, 1) != 0)
      ;
  }
  if lk->locked was already 1,
    __sync_lock_test_and_set sets it to 1 (again), yields 1,
    and the loop continues to spin
  if lk->locked was 0, at most one __sync_lock_test_and_set will see the 0;
    it will set it to 1 and return 0; the other test_and_sets will return 1
  this is a "spin lock", since waiting cores "spin" in acquire loop
  what is the push_off() about? why disable interrupts?
    what if device interrupt, and interrupt handler needs the same lock?
    as in uart code...
    the handler would spin forever waiting for a lock held by the
      interrupted code on the same core -- deadlock on a single core
  release(): sets lk->locked = 0 and re-enables interrupts

Detail: memory read/write ordering
  an additional reason code like this is broken:
    locked = 1   (acquire)
    x = x + 1
    locked = 0   (release)
  the compiler AND the CPU re-order memory accesses!
    i.e. they do not obey the source program's order of memory references
    e.g. the compiler might generate this code for core A:
      locked = 1
      locked = 0
      x = x + 1
    i.e. move the increment outside the critical section!
    the legal behaviors are called the "memory model"
  release()'s call to __sync_synchronize() prevents re-order
    compiler won't move a memory reference past a __sync_synchronize()
    and (may) issue "memory barrier" instruction to tell the CPU
  acquire()'s call to __sync_synchronize() has a similar effect
  if you use locks, you don't need to understand the memory ordering rules
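  putting the pieces together: a stripped-down sketch of xv6's
  acquire()/release() from kernel/spinlock.c, keeping the atomic swap
  and both memory barriers but omitting push_off()/pop_off() and the
  holding() debug checks:

    struct spinlock {
      uint locked;             // 0 = free, 1 = held (uint: kernel/types.h)
    };

    void
    acquire(struct spinlock *lk)
    {
      // amoswap loop: spin until we are the one to change 0 -> 1
      while(__sync_lock_test_and_set(&lk->locked, 1) != 0)
        ;
      // barrier: keep the critical section's loads/stores from being
      // re-ordered to before the lock is held
      __sync_synchronize();
    }

    void
    release(struct spinlock *lk)
    {
      // barrier: keep the critical section's loads/stores from being
      // re-ordered to after the lock is dropped
      __sync_synchronize();
      // atomically set lk->locked = 0
      __sync_lock_release(&lk->locked);
    }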
Why *spin* locks?
  doesn't acquire() waste CPU while waiting?
  why not give up the CPU and switch to another process, let it run?
  spin lock guidelines:
    hold spin locks for very short times
    don't yield CPU while holding a spin lock
  systems provide "blocking" locks for longer critical sections
    waiting threads yield the CPU
    but overheads are typically higher
    you'll see some xv6 blocking schemes later

Advice:
  avoid sharing, if possible
  start with a few coarse-grained locks
  measure -- which locks are preventing parallelism?
  use fine-grained locks only as needed for parallel performance
  use an automated race detector (see the example at the end of these notes)

Next week: midterm exam!
  in Walker during class time
  review last year's midterm
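Appendix: trying a race detector
  a tiny user-level program (not xv6; race.c and the loop counts are
  made up for illustration) with exactly the x = x + 1 race from these
  notes; compiling with -fsanitize=thread asks gcc/clang's
  ThreadSanitizer to report the race at run time:

    // race.c -- two threads do unlocked x = x + 1.
    // build & run:  cc -fsanitize=thread -pthread race.c && ./a.out
    // ThreadSanitizer should report a data race on x, and the final
    // count will usually be less than 2000000 (lost updates).
    #include <pthread.h>
    #include <stdio.h>

    int x;                      // shared, no lock

    void *
    inc(void *arg)
    {
      for(int i = 0; i < 1000000; i++)
        x = x + 1;              // racy read-modify-write
      return 0;
    }

    int
    main()
    {
      pthread_t t1, t2;
      pthread_create(&t1, 0, inc, 0);
      pthread_create(&t2, 0, inc, 0);
      pthread_join(t1, 0);
      pthread_join(t2, 0);
      printf("x = %d (expected 2000000)\n", x);
      return 0;
    }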