6.1810 2022 L23: Multi-Core scalability and RCU

Today's topic: multi-core kernel performance for read-heavy shared data.
  RCU (Read-Copy Update): very successful in Linux.

Puzzle:
  Parallel execution is crucial to performance on multi-core.
  Kernels have lots of natural parallelism.
    Different processes' system calls are often independent at a high level.
    Fork, read/write different files in cache, pipes, sockets, &c.
  But the kernel's shared resources often obstruct parallelism.
    E.g. memory allocator, scheduler, disk cache, i-node cache, &c.
    The usual symptom is CPU time wasted spinning in acquire().
  Huge effort in production kernels to eliminate parallel bottlenecks.
  Many design patterns, tailored to different kinds of situations.
  Today's focus: read-heavy data structures.

Example: singly-linked list.
  E.g. list of processes; of VMAs; of cached blocks; &c.
  Global variable with a pointer to the first element.
  Each element has data, e.g. an embedded string, and a next pointer.
  Assume most uses are reads that scan the list.
  Occasional writers add an element, delete an element, or change the string
    in an element.

xv6 would require list readers and writers to acquire a spin-lock.
  Safe, but no parallelism even when all threads are read-only.

Can we have a lock that allows parallel reads?
  A read/write lock seems like a perfect solution.
  Read/write lock API:
    r_lock(l) / r_unlock(l)
    w_lock(l) / w_unlock(l)
  Semantics: either
    one writer, no readers or other writers; or
    lots of readers, no writers.

Here's a simplified version of Linux's read/write lock.
  struct rwlock { int n; };
    n=0  -> not locked
    n=-1 -> locked by one writer
    n>0  -> locked by n readers
  r_lock(l):
    while 1:
      x = l->n
      if x < 0
        continue
      if CAS(&l->n, x, x + 1)
        return
  CAS(p, a, b) is the atomic compare-and-swap instruction:
    if *p == a, set *p = b, return true
    else return false
  w_lock(l):
    while 1:
      if CAS(&l->n, 0, -1)
        return
  (A compilable C version of this lock appears as an aside, below.)

Surprise: r_lock() is very slow if called a lot on many cores.
  [diagram: cores, caches, bus, RAM]
  Threads on N cores want to read; no writers.
  All N cores fetch l->n's cache line, see l->n == 0, execute CAS.
  CAS needs exclusive access to the cache line holding l->n.
    It fetches the latest copy and invalidates the other copies.
  So there are N serial cache line transfers for the first successful locker.
  Thus O(N) work for one CPU to acquire the read-lock.
  The other N-1 cores all retry the read and CAS; only one wins.
  Thus O(N^2) total work for all cores to acquire a shared read lock.

Disappointing:
  The list read by itself is probably a few dozen cycles, if cached.
    Even if lots of cores are reading.
  But r_lock() can take 100s or 1000s of cycles, if it's popular.

The underlying problem with r_lock(): it does expensive *writes*.
  Can we have pure read-only reads of shared read-write data?
  I.e. avoid even the writes that would be required to lock?
  (Though we'll assume that writers still have to lock.)

What goes wrong if readers read the list without locking?
  [list diagram, with a string in each entry]
  Nothing goes wrong if there's no writer.
  If there is a concurrent writer?
    1. Modifying the string in a list element?
    2. Inserting a new list element?
    3. Deleting an element?
  So it doesn't work for readers to simply not lock.
  But the specific problems can be fixed! That's what RCU does.

Read-Copy Update (RCU):
  Fast reads: readers neither lock nor write.
  Slower writes: writers must lock, and do extra work to help readers.
  Helps in many situations with read-heavy shared data, but not all.
  Used extensively in Linux, a big part of its multi-core performance.
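
Aside, back to r_lock()/w_lock(): here's the pseudocode above written as
compilable C, using GCC/Clang atomic builtins for the CAS, and with unlock
routines (not shown above) filled in. This is a sketch for illustration, not
Linux's real rwlock, which also handles sleeping, fairness, &c.

  struct rwlock { int n; };     // 0 = free, -1 = one writer, >0 = n readers

  void r_lock(struct rwlock *l) {
    while (1) {
      int x = __atomic_load_n(&l->n, __ATOMIC_RELAXED);
      if (x < 0)
        continue;               // a writer holds the lock; retry
      // The CAS needs exclusive ownership of l->n's cache line; this is
      // what makes read-locking expensive when many cores do it at once.
      if (__sync_bool_compare_and_swap(&l->n, x, x + 1))
        return;
    }
  }

  void r_unlock(struct rwlock *l) {
    __sync_fetch_and_add(&l->n, -1);    // one fewer reader
  }

  void w_lock(struct rwlock *l) {
    while (1) {
      if (__sync_bool_compare_and_swap(&l->n, 0, -1))
        return;
    }
  }

  void w_unlock(struct rwlock *l) {
    __atomic_store_n(&l->n, 0, __ATOMIC_RELEASE);
  }

Note that even the read path does an atomic read-modify-write of l->n, so N
read-lockers still ping-pong a single cache line: that's the O(N) per-acquire,
O(N^2) total cost described above.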
RCU is a set of rules and patterns for readers and writers,
  plus some mechanisms.

RCU idea 1: don't modify data in place; prepare a new copy.
  Head -> E1 -> E2 -> E3 -> nil
  Suppose a writer wants to change the string in E2->x:
    1. Lock
    2. e = alloc()
    3. e->next = E2->next
    4. strcpy(e->x, "xxx")
    5. E1->next = e
    6. Unlock
  What about a reader on another core traversing the list?
    At some point the reader will read E1->next,
      either before or after step 5 (this is a simplification, see below).
    If before, the reader sees the unmodified E2 and the original string.
    If after, the reader sees the new element and the new string.
    Either way, the reader will see ...->next pointing to E3.
    The point: the reader won't see a partially updated string.
  A good way to think about this idea:
    The update of E1->next is a "committing write".
    Before it, it's as if nothing had changed.
    After it, all the changes are visible.
  RCU is best suited to data structures where a single pointer write commits,
    e.g. lists and trees.

RCU idea 2: readers and writers must use memory barriers to enforce order.
  We don't want the compiler or the machine to move step 3 or 4 after step 5,
    so the writer must have a barrier just before step 5.
  The reader may need a barrier between Rx = E1->next
    and the dereference of Rx to read the element's contents.

Problem: what happens when a writer calls free() on a deleted list element?
  Removing the visible reference prevents *new* readers,
    but a concurrent reader might still be looking at the deleted element!
  The writer needs to give readers time to finish -- a "grace period" --
    after it removes the visible reference to the element.
  How long should the writer wait?
    Could use GC or reference counts, but they are expensive.

RCU idea 3 (grace period):
  1. Rule: a reader can't hold a pointer to RCU data across a context switch.
     The programmer is in charge of following this rule.
  2. The writer defers the free until all CPUs have context-switched.
  The writer may have to wait a few milliseconds,
    but the assumption is that writes are rare.
  How to do this?
    Simple plan: the writer forces itself to be scheduled on each CPU.
    Better: each core counts context switches; the writer watches the counts.
    (There are lots of ways to implement the grace period.)

Linux RCU interface for readers:
  rcu_read_lock()
  rcu_read_unlock()
  rcu_dereference(p)
For writers:
  synchronize_rcu()
  call_rcu(fn, x)
  rcu_assign_pointer(pa, p)

List reader using Linux's RCU interface:
  rcu_read_lock()
  e = head
  while(e){
    e = rcu_dereference(e)
    look at e->x ...
    e = e->next
  }
  rcu_read_unlock()

rcu_read_lock/unlock do almost nothing.
  Despite their names, they don't lock.
  They disable pre-emption (e.g. timer-driven context switches),
    since a context switch implies the reader is done with RCU objects.
  They act as documentation.
  Note: code is only allowed to use e inside rcu_read_lock/unlock!
    It cannot e.g. return(e).

What does rcu_dereference(p) do?
  It prevents the compiler from reading head more than once:
    roughly *((volatile typeof(p) *)&(p))
  Otherwise the compiler might transform the first loop iteration into:
    look at head->x ...
    e = head->next
    which would be incorrect if the writer modified head in between.
  And, on some CPUs, it ensures that a reading CPU sees the modifications
    to e->x if it sees e.

Code to replace the first list element:
  acquire(lock)
  old = head
  e = alloc()
  e->x = ...
  e->next = head->next
  rcu_assign_pointer(&head, e)
  release(lock)
  synchronize_rcu()
  free(old)

synchronize_rcu() can take a long time, and can yield the CPU.
  call_rcu(free, old) returns immediately; it adds the work to a list.

What does rcu_assign_pointer(p, val) do?
  memory fence instruction
  *p = val
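
Putting the reader and writer pieces together: a schematic C sketch of the same
singly-linked list, with an RCU reader and a writer that deletes an element and
defers its free with call_rcu(). This follows the simplified interface above
(the call_rcu(fn, x) form; real Linux's call_rcu() instead takes a struct
rcu_head embedded in the object) and uses xv6-style acquire/release plus a
made-up use() placeholder, so it illustrates the pattern rather than being
compilable kernel code.

  struct element {
    char x[64];                  // payload
    struct element *next;
  };

  struct element *head;          // RCU-protected list
  struct spinlock lock;          // serializes writers only; readers never take it

  // Reader: no locks, no stores to shared memory.
  void reader(void) {
    rcu_read_lock();
    for (struct element *e = rcu_dereference(head); e != 0;
         e = rcu_dereference(e->next))
      use(e->x);                 // read-only; e may be an old or a new element
    rcu_read_unlock();           // e must not be used or returned after this
  }

  // Writer: unlink the element after prev, then defer its free.
  void delete_after(struct element *prev) {
    acquire(&lock);
    struct element *old = prev->next;
    rcu_assign_pointer(&prev->next, old->next);  // the single committing write
    release(&lock);
    call_rcu(free, old);         // free(old) runs only after a grace period
  }

The only store a reader can observe is the rcu_assign_pointer() of prev->next;
everything else the writer does is invisible until that committing write, and
the old element stays valid for readers until the grace period ends.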
RCU performance versus r/w locks?
  For readers: RCU imposes nearly zero cost on reads.
    r/w locks might take 100s or 1000s of cycles (Figure 5, upper line).
    Also, RCU can read while a write is in progress!
  For writers: RCU can take much longer, due to synchronize_rcu().
  So RCU makes sense when writes are rare or not time-sensitive.

Example: removing an NMI handler; the paper's Section 4.1; Figure 4.
  NMIs are interrupts that cannot be disabled.
    Used for CPU profiling, critical hardware errors, watchdog timers.
  Kernel code can register handlers to be called during an NMI.
  It can also un-register a handler, and free the memory containing its code.
  NMIs are frequent; register/unregister is rare; thus reads >> writes.
  What if some core is in NMI code when un-register is called?
  Figure 4's solution, in unregister_nmi_handler():
    1. Delete the NMI handler's list entry (so future interrupts won't see it).
    2. synchronize_rcu() to wait for all current interrupts to complete,
       since a core won't context-switch until it has left the interrupt.
    3. (not shown) Free the memory holding the NMI handler code.
  This example uses RCU to defer a free until all uses have finished.
  An interesting wrinkle: NMIs cannot be disabled,
    so spinlocks can't prevent NMIs,
    so NMIs cannot generally be allowed to acquire locks (deadlock...).
    So here RCU is not just faster than locking for readers;
      it actually works where locking would not.

Example: IP options; Section 4.2; Figure 6.
  An IP packet header can contain options (e.g. record route).
  A kernel socket has (optional) options, copied into every outgoing packet.
  The send code (udp_sendmsg()) copies the socket options w/o a lock.
  Options are read much more often than written -> a good RCU candidate.
  setsockopt() defers the free of superseded options with call_rcu(kfree, old).
  Why is the reading code safe?
    1. The writer entirely prepares the new options before publishing the pointer.
    2. The writer and readers use memory barriers.
    3. The writer defers the free until any reader of the old options must have finished.
  The paper calls this "reference counting", but there's no reference count.
    "Garbage collection" might be a better term,
      since the goal is to free after the last reference disappears.

RCU is not universally applicable.
  Doesn't help writes; only improves performance when reads >> writes.
  Doesn't work for code that must hold references across yield/sleep.
  Hard to apply if the data structure is not amenable to a single committing write.
    E.g. if readers scan both directions of a doubly-linked list.
    Or if in-place updates are required.
  Readers can see stale data.
    E.g. udp_sendmsg() in Figure 6 might send with old IP options,
      but only if concurrent with setsockopt().
    Not a problem here; may be in a few cases (e.g. database transactions).
  RCU adds complexity to writers.
    Search the web for "review checklist for RCU patches".
  RCU needs context switches to be frequent, and it needs to know about them.
    Easy in the kernel, harder for user-space threads.

Can synchronize_rcu() be more efficient than "schedule on each CPU"?
  Here's a design from
    http://www2.rdrop.com/users/paulmck/RCU/rclockpdcsproof.pdf, Section 3.5.4.
  Per-core data:
    A count of this core's context switches.
    A list of waiting synchronize_rcu() and call_rcu() requests.
  Periodically, each core processes a whole batch (list) of RCU writers:
    Save the list, and empty it.
    Record all the other cores' counts.
    At every context switch, look again at the other cores' counts.
    If all have changed, wake up / run everything on the saved list.
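
A sketch of that batching scheme, with made-up names (NCPU, struct cpu_rcu,
call_rcu_on(), rcu_on_context_switch(), and the notes' alloc() placeholder) and
with per-list locking and memory-ordering details omitted for brevity.
synchronize_rcu() can be modelled as queueing a callback that wakes up the
waiting thread.

  #define NCPU 8

  struct rcu_callback {
    void (*fn)(void *);
    void *arg;
    struct rcu_callback *next;
  };

  struct cpu_rcu {
    unsigned long ctxswitches;      // incremented at every context switch
    struct rcu_callback *waiting;   // new requests from call_rcu()
    struct rcu_callback *batch;     // requests waiting out a grace period
    unsigned long snapshot[NCPU];   // other cores' counts when the batch started
    int batch_active;
  } cpus[NCPU];

  // The per-core enqueue behind call_rcu(fn, x): run fn(x) after a grace period.
  void call_rcu_on(int mycpu, void (*fn)(void *), void *arg) {
    struct rcu_callback *cb = alloc();
    cb->fn = fn;
    cb->arg = arg;
    cb->next = cpus[mycpu].waiting;
    cpus[mycpu].waiting = cb;
  }

  // Called by the scheduler on core me at every context switch.
  void rcu_on_context_switch(int me) {
    struct cpu_rcu *c = &cpus[me];
    c->ctxswitches++;

    if (!c->batch_active && c->waiting) {
      // Start a new batch: grab pending callbacks, snapshot the counters.
      c->batch = c->waiting;
      c->waiting = 0;
      for (int i = 0; i < NCPU; i++)
        c->snapshot[i] = cpus[i].ctxswitches;
      c->batch_active = 1;
      return;
    }

    if (c->batch_active) {
      // The grace period is over once every other core has context-switched
      // since the snapshot (this core just did): then no core can still hold
      // an RCU pointer obtained before the batch started.
      for (int i = 0; i < NCPU; i++)
        if (i != me && cpus[i].ctxswitches == c->snapshot[i])
          return;                   // some core hasn't switched yet; keep waiting
      for (struct rcu_callback *cb = c->batch; cb; ) {
        struct rcu_callback *next = cb->next;
        cb->fn(cb->arg);            // e.g. free the deferred object
        cb = next;
      }
      c->batch = 0;
      c->batch_active = 0;
    }
  }

The key invariant: once every core has context-switched since the snapshot, no
core can still be inside a read-side section that began before the batch, so
running the deferred frees is safe.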
What if you have shared write-heavy data?
  Best: re-design for no sharing.
  Second best: semi-private per-core partitions.
    Example: kalloc in the locking lab.
    Example: Linux's per-core scheduling lists.
  Example: statistics counters.
    Each core keeps its own count, under a separate spin lock.
    To write, acquire the local lock and increment the local count.
      A spin lock is cheap if it's only used from one core.
    To read, acquire *all* the locks, and read and sum all the counts.
      Expensive, but we're assuming write-heavy.
    (A C sketch of this appears at the end of these notes.)

RCU summary:
  Very successful at speeding up read access to shared kernel data.
    Used all over modern Linux.
  Totally eliminates locking cost for readers.
  Allows readers to execute in parallel with a writer.
  The key idea is a GC-like grace period to defer freeing objects.

---

Lockless patterns: https://lwn.net/Articles/850202/
Detecting missing memory barriers with KCSAN: https://lwn.net/Articles/877200/
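
The per-core statistics counter mentioned above, as a C sketch (hypothetical
names, xv6-style acquire/release, a fixed NCPU; a real implementation would
also pad each per-core slot to its own cache line):

  #define NCPU 8

  struct percpu_counter {
    struct {
      struct spinlock lock;        // only contended when a reader sums everything
      long count;
    } cpu[NCPU];
  };

  // Writer: touches only this core's slot, so its lock's cache line stays local.
  void counter_inc(struct percpu_counter *c, int mycpu) {
    acquire(&c->cpu[mycpu].lock);
    c->cpu[mycpu].count++;
    release(&c->cpu[mycpu].lock);
  }

  // Reader: expensive (takes every lock), but reads are assumed to be rare.
  long counter_read(struct percpu_counter *c) {
    long sum = 0;
    for (int i = 0; i < NCPU; i++) {
      acquire(&c->cpu[i].lock);
      sum += c->cpu[i].count;
      release(&c->cpu[i].lock);
    }
    return sum;
  }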