6.S081 2020 L23: Multi-Core scalability and RCU

Today's topic: 
  Multi-core kernel performance for read-heavy shared data.
  RCU (Read-Copy Update): very successful in Linux.

Puzzle:
  Parallel execution is crucial to performance on multi-core.
  Kernels have lots of natural parallelism.
    Different process's system calls are often independent at a high level.
    Fork, read/write different files in cache, pipes, sockets, &c.
  But kernel's shared resources often suppress parallelism.
    E.g. memory allocator, scheduler, disk cache, i-node cache, &c.
    The usual symptom is CPU time wasted spinning in acquire().
  Huge effort in production kernels to eliminate parallel bottlenecks.
  Many design patterns, tailored to different kinds of situations.

Today's focus: read-heavy data structures.

Example: singly-linked list.
  E.g. list of processes; of VMAs; of cached blocks; &c.
  Global variable with pointer to first element.
  Each element has data, e.g. a short string, and a next pointer.
  Assume most uses are reads that scan the list.
  Occasional writers add element; delete element; change string in element.

xv6 would require list readers and writers to acquire a spin-lock.
  Safe, but no parallelism even when all threads are read-only.
  Can we have a lock that allows parallel reads?
  Read/write lock seems like a perfect solution.

Read/write lock API.
  r_lock(l) / r_unlock(l)
  w_lock(l) / w_unlock(l)
  Semantics: either
    One writer, no readers or other writers; or
    Lots of readers, no writers.

Here's a simplified version of Linux's read/write lock.
struct rwlock { int n; };
  n=0  -> not locked
  n=-1 -> locked by one writer
  n>0  -> locked by n readers
r_lock(l):
  while 1:
    x = l->n
    if x < 0
      continue
    if CAS(&l->n, x, x + 1)
      return
CAS(p, a, b) is atomic compare-and-swap instruction
  if *p == a, set *p = b, return true
  else return false
w_lock(l):
  while 1:
    if CAS(&l->n, 0, -1)
      return

Surprise: r_lock() is very slow if called a lot on many cores:
  [diagram: cores, caches]
  Assume N readers, no writers.
  All N calls to r_lock() will read and cache l->n.
  Only one of the N CAS will succeed, and increment l->n.
    That CAS does O(N) work, since must invalidate N-1 cached copies of l->n.
  Other cores all retry read and CAS.
  It takes N loop iterations for all N cores to finish r_lock().
    At O(N) work each.
  Total: O(N^2) work.
Disappointing:
  The list read by itself is probably a few dozen cycles, if cached.
    Even if lots of cores are reading.
  But r_lock() can take 100s or 1000s of cycles, if it's popular.

The underlying problem with r_lock(): it does expensive *writes*.

Can we have pure read-only reads of shared read-write data?
  I.e. avoid even the writes that would be required to lock?
  (Though we'll assume that writers still have to lock.)

What goes wrong if readers read the list without locking?
  [list diagram, with string in each entry]
  Nothing goes wrong if there's no writer.
  If there is a concurrent writer?
    1. Modifying the string in a list element?
    2. Inserting a new list element?
    3. Deleting an element?

So it doesn't work for readers to simply not lock.
  But the specific problems can be fixed!
  That's what RCU does.

Read-Copy Update (RCU).
  Fast reads: no locks (and no writes).
  Slower writes: writers need to lock, and do more work.
  Helps many situations with read-heavy shared data, but not all.
  Used extensively in Linux, a big part of its multi-core performance.

RCU idea 1: don't modify data in place; prepare a new copy.
  Head  ->  E1  ->  E2  ->  E3  ->  nil
  Suppose writer wants to change the string in E2->x.
  1. Lock
  2. e = alloc()
  3. e->next = E2->next
  4. e->x = "..."
  5. E1->next = e
  6. Unlock
  What about a reader on another core traversing the list?
    At some point reader will read E1->next.
    Either before or after step 5 (this is a simplification, see below).
    If before, reader sees unmodified E2 and original string.
    If after, reader sees new element and new string.
    Either way, reader will see ...->next pointing to E3.
  The point: the reader won't see a partially updated string.

A good way to think about this idea:
  Update of E1->next is a "committing write".
  Before, it's as if nothing changed.
  After, all changes are made visible.
  RCU is best suited to data structures where a single pointer write commits.
    E.g. lists and trees.

RCU idea 2: readers and writers must use memory barriers to enforce order.
  We don't want compiler or machine to move step 3 or 4 after step 5.
  Writer must have a barrier just before step 5.
  Reader needs a barrier between Rx = E1->next,
    and dereference of Rx to read element contents.

What happens when a writer frees a list element?
  The problem: a reader might be looking at the element to be deleted.
  Writer needs to give readers time to finish.
    After it removes the visible reference to the element.
    But how long should the writer wait?
  Could use GC or reference counts, but they are expensive.

RCU idea 3:
  1. Reader isn't allowed to hold pointer to RCU data across context switch.
     Similar to rule that spin-lock can't be held across context switch.
  2. Writer defers free until all CPUs have context-switched.
  Writer may have to wait a few milliseconds,
    but the assumption is that writes are rare.
  How to do this?
    Simple plan: writer forces itself to be scheduled on each CPU.
    (There are more efficient implementations.)

Linux RCU interface for readers:
  rcu_read_lock()
  rcu_read_unlock()
  rcu_dereference(p)
For writers:
  synchronize_rcu()
  call_rcu(fn,x)
  rcu_assign_pointer(pa, p)

rcu_read_lock/unlock do almost nothing.
  Disable timer interrupts to prevent pre-emption.
  Since context switch implies done with RCU objects.

synchronize_rcu() can take a long time, and can yield the CPU.
call_rcu(fn,x) returns immediately; adds <fn,c> to a list.

List reader using Linux's RCU interface:
  rcu_read_lock()
  e = head
  while(p){
    e = rcu_dereference(e)
    look at e->x ...
    e = e->next
  }
  rcu_read_unlock()

Note: code can only use e inside rcu_read_lock/unlock!
  It cannot e.g. return(e)

Code to replace the first list element:

  acquire(lock)
  old = head
  e = alloc()
  e->x = ...
  e->next = head->next
  rcu_assign_pointer(&head, e)
  release(lock)

  synchronize_rcu()
  free(old)
   
RCU performance versus r/w locks?
  For readers:
    RCU imposes nearly zero cost on reads.
    r/w locks might take 100s or 1000s of cycles (Figure 5, upper line).
    Also, RCU can read while a write is in progress!
  For writers:
    RCU can take much longer due to synchronize_rcu().
    So RCU makes sense when writes are rare or non-time-sensitive.

Example: removing NMI handler; paper's Section 4.1; Figure 4.
  NMIs are interrupts that cannot be disabled.
    Used for CPU profiling, critical hardware errors, watchdog timer.
  Kernel code can register handlers to be called during NMI.
  Can also un-register handler, and free memory containing code.
  NMIs are frequent; register/unregister is rare; thus reads >> writes.
  What if some core is in NMI code when un-register is called?
  Figure 4's solution, in unregister_nmi_handler():
    1. Delete NMI handler list entry (so future interrupts won't see it).
    2. synchronize_rcu() to wait for all current interrupts to complete.
       Since a core won't context-switch until it has left interrupt.
    3. (not shown) free the memory holding NMI handler code.
  This example uses RCU to defer free until all uses have finished.
  An interesting wrinkle:
    NMI cannot be disabled.
    So spinlocks can't prevent NMIs.
    So NMIs cannot generally be allowed to acquire locks (deadlock...).
    So here RCU is not just faster than locking for reader,
      but actually works where locking would not.

Example: IP options; Section 4.2; Figure 6.
  IP packet header can contain options (e.g. record route).
  Kernel socket has (optional) options, copied into every outgoing packet.
  Send code (udp_sendmsg()) copies socket options w/o lock.
  Options are read much more often than written -> good RCU candidate.
  setsockopt() defers free of superseded options with call_rcu(kfree,old).
  Why is the reading code safe?
    1. Writer entirely prepares new option before publishing pointer.
    2. Writer and reader use memory barriers.
    3. Writer defers free until any reader of old must have finished.
  The paper calls this "reference counting" but there's no reference count.
    "Garbage collection" might be a better term,
    since the goal is to free after last reference disppears.
  
RCU is not universally applicable.
  Doesn't help writes.
  Only improves performance when reads >> writes.
  Doesn't work for code that must hold references across yield/sleep.
    Much like spinlocks.
  Doesn't work if data structure not amenable to single committing write.
    E.g. if readers scan both directions of doubly-linked list.
    Or if in-place updates are required.
  Readers can see stale data.
    E.g. udp_sendmsg() in Figure 6 might send with old IP options.
    But only if concurrent with setsockopt().
    Not a problem here; may be in a few cases (paper mentions Sys V IPC).
  RCU adds complexity to writers.
    Search the web for "review checklist for RCU patches"
  RCU needs context switches to be frequent, and it needs to know about them.
    Easy in the kernel, harder for user-space threads.

Can synchronize_rcu() be more efficient than "schedule on each CPU"?
  Here's a design from
    http://www2.rdrop.com/users/paulmck/RCU/rclockpdcsproof.pdf
    Section 3.5.4
  Per-core data:
    Count of this core's context switches.
    List of waiting synchronize_rcu() and call_rcu().
  Periodically, each core processes a whole batch (list) of RCU writers:
    Save the list, empty it.
    Record all other cores' counts.
    Every context switch, look again at other cores' counts.
    If all have changed, wakeup / run everything on saved list.

What if you have shared write-heavy data?
  Best: re-design for no sharing.
  Second best: semi-private per-core partitions.
  Example: kalloc in locking lab.
  Example: Linux per-core scheduling lists.
  Example: statistics counters.
    Each core keeps its own count, under a separate spin lock.
    To write, acquire local lock, increment local count.
      A spin lock is cheap if it's only used from one core.
    To read, acquire *all* locks, read and sum all counts.
      Expensive but we're assuming write-heavy.
  
RCU summary
  Very successful at speeding up read access to shared kernel data.
    Used all over modern Linux.
  Totally eliminates locking cost for readers.
  Allows readers to execute in parallel with a writer.
  Key idea is a GC-like grace period to defer freeing objects.

---

Lockless patterns: https://lwn.net/Articles/850202/