6.1810 2023 L22: Multi-Core scalability and RCU

Today's topic: multi-core kernel performance for read-heavy shared data.
  RCU (Read-Copy Update): very successful in Linux.

Puzzle:
  Parallel execution is crucial to performance on multi-core.
  Kernels have lots of natural parallelism:
    different processes' system calls are often independent at a high level.
    fork, read/write different files in cache, pipes, sockets, &c.
  But the kernel's shared resources often obstruct parallelism.
    E.g. memory allocator, scheduler, disk cache, i-node cache, &c.
    The usual symptom is CPU time wasted spinning in acquire().
  Huge effort in production kernels to eliminate parallel bottlenecks.
    Many design patterns, tailored to different kinds of situations.

Today's focus: read-heavy kernel data structures.

Example: singly-linked list.
  E.g. list of processes; of cached blocks; &c.
  A global variable holds a pointer to the first element.
  Each element has data, e.g. an embedded string, and a next pointer.
  Assume most uses are reads that scan the list.
  Occasional writers add an element, delete an element, or change a string.
  xv6 would require list readers and writers to acquire a spin-lock.
    Safe, but no parallelism even when all threads are read-only.

Can we have a lock that allows parallel reads?

A read/write lock seems like a perfect solution.
  Read/write lock API:
    r_lock(l) / r_unlock(l)
    w_lock(l) / w_unlock(l)
  Semantics: either
    one writer, but no readers or other writers; or
    lots of readers, but no writers.

Here's a simplified version of Linux's read/write lock
(a compilable C sketch appears below, after the RCU intro):
  struct rwlock { int n; };
    n=0  -> not locked
    n=-1 -> locked by one writer
    n>0  -> locked by n readers
  r_lock(l):
    while 1:
      x = l->n
      if x < 0
        continue
      if CAS(&l->n, x, x + 1)
        return
  CAS(p, a, b) is the atomic compare-and-swap instruction:
    if *p == a, set *p = b, return true
    else return false
  w_lock(l):
    while 1:
      if CAS(&l->n, 0, -1)
        return

Surprise: r_lock() is very slow if called a lot on many cores,
even if there are no writers!
  [diagram: cores, bus, RAM]
  Threads on N cores want to read; no writers.
  All N cores fetch l->n, see l->n == 0, execute CAS(&n, 0, 1).
  Only one succeeds (i.e. sees that n is still 0).
    The rest fail, and must re-execute.
  But atomic CASes execute one at a time -- not in parallel.
  Thus O(N) time for one CPU to acquire the read-lock.
  The other N-1 cores all retry the read and CAS; again, only one wins.
  Thus O(N^2) total time for all N cores to acquire a shared read lock.

Disappointing:
  The list read by itself is probably a few dozen cycles, if cached,
    even if lots of cores are reading.
  But r_lock() can take 100s or 1000s of cycles, if it's popular.
  The underlying problem with r_lock(): it does expensive *writes*.

Can we have pure read-only reads of shared read-write data?
  I.e. avoid even the writes that would be required to lock?
  (Though we'll assume that writers still lock.)

What goes wrong if readers read the list without locking?
  [list diagram, with string in each entry]
  Nothing goes wrong if there's no writer.
  If there is a concurrent writer?
    1. Modifying the string in a list element?
    2. Inserting a new list element?
    3. Deleting an element?
  So it doesn't work for readers to simply not lock.
  But the specific problems can be fixed!
    That's what RCU does.

Read-Copy Update (RCU):
  Fast reads: readers do not lock (and thus do not write).
  Slower writes: writers must lock, and do extra work to help readers.
  Helps many situations with read-heavy shared data, but not all.
  Used extensively in Linux, a big part of its multi-core performance.
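Aside (referenced from the read/write lock discussion above): a minimal
user-space sketch of the simplified rwlock in C11, to make the cost concrete.
This is an illustration of the pseudocode, not Linux's actual rwlock; the
unlock functions are the obvious complements, added here for completeness.

  #include <stdatomic.h>
  #include <stdbool.h>

  struct rwlock { atomic_int n; };

  // cas() plays the role of the atomic compare-and-swap instruction:
  // if *p == a, set *p = b and return true; else return false.
  static bool cas(atomic_int *p, int a, int b) {
    return atomic_compare_exchange_strong(p, &a, b);
  }

  void r_lock(struct rwlock *l) {
    while (1) {
      int x = atomic_load(&l->n);
      if (x < 0)
        continue;               // a writer holds the lock; retry
      if (cas(&l->n, x, x + 1)) // this *write* is why r_lock is slow:
        return;                 // it bounces the cache line between cores
    }
  }

  void r_unlock(struct rwlock *l) { atomic_fetch_sub(&l->n, 1); }

  void w_lock(struct rwlock *l) {
    while (1)
      if (cas(&l->n, 0, -1))
        return;
  }

  void w_unlock(struct rwlock *l) { atomic_store(&l->n, 0); }

Note that even a successful r_lock() must gain exclusive ownership of the
cache line holding n, so concurrent readers serialize on that line.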
RCU is a set of rules and patterns for readers and writers,
plus some mechanisms.

RCU idea 1: writers don't modify data in place; instead, prepare a new copy.
  Head -> E1 -> E2 -> E3 -> nil
  Suppose a writer wants to change the string in E2:
    1. Lock
    2. e = alloc()
    3. e->next = E2->next
    4. strcpy(e->x, "xxx")
    5. E1->next = e
    6. Unlock
  What about a reader on another core traversing the list?
    At some point the reader will read E1->next,
      either before or after step 5 (this is a simplification; see below).
    If before, the reader sees the unmodified E2 and the original string.
    If after, the reader sees the new element and the new string.
    Either way, the reader will see ...->next pointing to E3.
    The point: the reader won't see a partially updated string,
      even though the reader didn't lock.
  A good way to think about this idea:
    the update of E1->next is a "committing write".
    Before it, no change is visible to readers.
    After it, *all* changes are visible.
    This avoids the problem of readers seeing partially-complete updates.
  RCU is best suited to data structures where a single pointer write commits,
    e.g. lists and trees.

RCU idea 2: readers and writers must use memory barriers to enforce order.
  We don't want the compiler or the machine to move step 3 or 4 after step 5.
    The writer must have a barrier just before step 5.
  The reader may need a barrier between Rx = E1->next
    and the dereference of Rx to read the element's contents.

Problem: what happens when a writer free()s a deleted list element?
  Removing the visible reference prevents *new* readers, but
    a concurrent reader might still be looking at the deleted element!
  Use-after-free is a potential disaster: the memory may be reallocated
    for some other use.
  The writer needs to give readers time to finish -- a "grace period" --
    after it removes the visible reference to the element.
  How long should the writer wait?
    Could use GC or reference counts, but they are expensive.

RCU idea 3 (grace period):
  1. Rule: a reader can't hold a pointer to RCU data across a context switch.
     The programmer is in charge of following this rule.
  2. The writer delays the free until all CPUs have context-switched.
  The writer may have to wait a few milliseconds,
    but the assumption is that writes are rare.
  How to implement this delay?
    Simple plan: the writing thread forces itself to be scheduled on each CPU.
    Better: each core counts context switches; the writer watches the counts.
    (There are lots of ways to implement the grace period.)

Linux RCU interface for readers:
  rcu_read_lock()
  rcu_read_unlock()
  rcu_dereference(p)
For writers:
  synchronize_rcu()
  call_rcu(fn, x)
  rcu_assign_pointer(pa, p)

List reader using Linux's RCU interface:
  rcu_read_lock()
  e = rcu_dereference(head)
  while(e){
    look at e->x ...
    e = rcu_dereference(e->next)
  }
  rcu_read_unlock()

rcu_read_lock/unlock() do almost nothing.
  Despite their names, they don't lock.
  They may disable preemption, so a timer interrupt can't force a
    context switch; a context switch implies done with RCU'd objects.
  They act as documentation.
  Note: code is only allowed to use e inside rcu_read_lock/unlock!
    It cannot e.g. return(e),
    and it is not allowed to context-switch while holding onto e.

What does rcu_dereference(p) do?
  Tells the compiler to fetch the pointer exactly once.
  Tells the CPU and compiler to ensure e->x is as up-to-date as e.

A writer's code to replace the first list element:
  acquire(lock)
  old = head
  e = alloc()
  e->x = ...
  e->next = head->next
  rcu_assign_pointer(&head, e)
  release(lock)
  synchronize_rcu()
  free(old)

What does synchronize_rcu() do?
  It implements the grace period:
    delays until all CPUs have context-switched.
  It can take a long time, and it can yield the CPU.
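To make ideas 1 and 2 concrete, here is a minimal user-space sketch of the
replace-the-first-element writer and the lock-free reader. It uses C11
atomics in place of rcu_assign_pointer()/rcu_dereference(); the names elem,
replace_first, and scan are made up, and the writer lock and grace period
(idea 3) are deliberately omitted.

  #include <stdatomic.h>
  #include <stdlib.h>
  #include <string.h>

  struct elem {
    char x[16];
    _Atomic(struct elem *) next;
  };

  _Atomic(struct elem *) head;

  // Writer: fully prepare the new element, then publish it with a
  // release store -- the "committing write" (idea 1). The release
  // barrier keeps the initializing stores from being reordered after
  // the publication (idea 2). The caller must hold the writer lock,
  // and must not free old until a grace period has passed (idea 3).
  void replace_first(void) {
    struct elem *old = atomic_load(&head);
    struct elem *e = malloc(sizeof(*e));
    strcpy(e->x, "xxx");
    atomic_init(&e->next, atomic_load(&old->next));
    atomic_store_explicit(&head, e, memory_order_release);  // commit
  }

  // Reader: no locks, no stores. The consume loads pair with the
  // writer's release store, so each element's contents are at least
  // as up to date as the pointer to it -- the guarantee that
  // rcu_dereference() provides. (Compilers treat consume as acquire.)
  void scan(void) {
    struct elem *e = atomic_load_explicit(&head, memory_order_consume);
    while (e) {
      // look at e->x ...
      e = atomic_load_explicit(&e->next, memory_order_consume);
    }
  }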
An alternative to synchronize_rcu(): call_rcu(free, old)
  adds the callback to a list and returns immediately.

What does rcu_assign_pointer(p, val) do?
  memory fence
  *p = val

RCU performance versus r/w locks?
  For readers:
    RCU imposes nearly zero cost on reads.
    R/w locks might take 100s or 1000s of cycles (Figure 8, upper line).
    Also, RCU can read while a write is in progress!
  For writers:
    RCU can take much longer, due to synchronize_rcu().
  So RCU makes sense when writes are rare or not time-sensitive.

Example: removing an NMI handler; the paper's Section 4.1; Figure 4.
  NMIs are interrupts that cannot be disabled.
    Used for CPU profiling, critical hardware errors, watchdog timers.
  Kernel code can register handlers to be called during an NMI.
  It can also un-register a handler, and free the memory containing the
    handler code.
  NMIs are frequent; register/unregister is rare; thus reads >> writes.
  What if some core is in NMI code when un-register is called?
  Figure 4's solution, in unregister_nmi_handler():
    1. Delete the NMI handler list entry (so future interrupts won't see it).
    2. synchronize_rcu() to wait for all current interrupts to complete,
       since a core won't context-switch until it has left the interrupt.
    3. (not shown) Free the memory holding the NMI handler code.
  This example uses RCU to defer a free until all uses have finished.
  An interesting wrinkle: NMIs cannot be disabled,
    so spinlocks can't prevent NMIs,
    so NMIs cannot generally be allowed to acquire locks (deadlock...).
    So here RCU is not just faster than locking for readers:
      it works where locking would not.

Example: IP options; Section 4.2; Figure 6.
  An IP packet header can contain options (e.g. record route).
  A kernel socket has (optional) options, copied into every outgoing packet.
  The send code (udp_sendmsg()) copies the socket options w/o a lock.
  Options are read much more often than written -> a good RCU candidate.
  setsockopt() defers the free of superseded options with
    call_rcu(kfree, old).
  Why is the reading code safe?
    1. The writer entirely prepares the new options before publishing
       the pointer.
    2. The writer and reader use memory barriers.
    3. The writer defers the free until any reader of the old options
       must have finished.
  (A simplified kernel-style sketch of this pattern appears below.)
  The paper calls this "reference counting", but there's no reference count.
    "Garbage collection" might be a better term, since the goal is to free
    after the last reference disappears.

RCU is not universally applicable.
  It doesn't help writes; it only improves performance when reads >> writes.
  It doesn't work for code that must hold references across a yield/sleep.
  It is hard to apply if the data structure is not amenable to a single
    committing write,
    e.g. if readers scan both directions of a doubly-linked list,
    or if in-place updates are required.
  Readers can see stale data.
    E.g. udp_sendmsg() in Figure 6 might send with old IP options,
      but only if concurrent with setsockopt().
    Not a problem here; it might be in a few cases
      (e.g. database transactions).
  RCU adds complexity to writers.
    Search the web for "review checklist for RCU patches".
  RCU needs context switches to be frequent, and it needs to know about them.
    Easy in the kernel; harder for user-space threads.
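Here is the simplified kernel-style sketch of the Section 4.2 IP-options
pattern referenced above. It uses the real Linux RCU API, but the types and
functions (opts, sock_state, set_opts, send_with_opts) are made up, and the
real udp_sendmsg()/setsockopt() code differs in detail; this only compiles
in a kernel context.

  #include <linux/kernel.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>
  #include <linux/spinlock.h>

  struct opts {
    struct rcu_head rcu;       /* lets call_rcu() queue this object */
    int len;
    char data[40];
  };

  struct sock_state {
    spinlock_t lock;           /* serializes writers only */
    struct opts __rcu *opts;   /* readers follow this without locking */
  };

  static void opts_free(struct rcu_head *h)
  {
    kfree(container_of(h, struct opts, rcu));
  }

  /* Writer (the setsockopt()-like path): publish a fully built
   * replacement with a committing pointer write, then defer the free
   * of the old version until a grace period has elapsed. */
  void set_opts(struct sock_state *s, struct opts *newo)
  {
    struct opts *old;

    spin_lock(&s->lock);
    old = rcu_dereference_protected(s->opts, lockdep_is_held(&s->lock));
    rcu_assign_pointer(s->opts, newo);  /* barrier + committing write */
    spin_unlock(&s->lock);
    if (old)
      call_rcu(&old->rcu, opts_free);   /* free after readers are done */
  }

  /* Reader (the udp_sendmsg()-like path): no locks, no stores. */
  void send_with_opts(struct sock_state *s)
  {
    struct opts *o;

    rcu_read_lock();
    o = rcu_dereference(s->opts);
    if (o) {
      /* copy o->len bytes of o->data into the outgoing packet ... */
    }
    rcu_read_unlock();
  }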
How to implement synchronize_rcu() efficiently?
  Here's a design from a paper by McKenney and Slingwine:
    http://www2.rdrop.com/users/paulmck/RCU/rclockpdcsproof.pdf
    Section 3.5.4
  Per-core data:
    a count of this core's context switches;
    a list of waiting synchronize_rcu() and call_rcu() calls.
  Periodically, each core processes a whole batch (list) of RCU writers:
    Save the list; empty it.
    Record all other cores' counts.
    On every context switch, look again at the other cores' counts.
    If all have changed, wake up / run everything on the saved list.
  (A much-simplified polling sketch appears at the end of these notes.)

What if you have shared write-heavy data?
  Best: re-design for no sharing.
  Second best: partition the data in some way, and lock per partition.
    Example: kalloc in the locking lab -- partition free memory by core.
    Example: statistics counters (sketched at the end of these notes).
      Each core keeps its own count, under a separate spin lock.
      To write: acquire the local lock, increment the local count.
        A spin lock is cheap if it's only used from one core.
      To read: acquire *all* the locks, read and sum all the counts.
        Expensive, but we're assuming the data is write-heavy.

RCU summary:
  Very successful at speeding up read access to shared kernel data.
  Used all over modern Linux.
  Totally eliminates locking cost for readers.
  Allows readers to execute in parallel with a writer.
  The key idea is a GC-like grace period to defer freeing objects.

---
RCU rules:
  https://www.kernel.org/doc/Documentation/RCU/checklist.txt
RCU implementation:
  http://www2.rdrop.com/users/paulmck/RCU/rclockpdcsproof.pdf
Lockless patterns:
  https://lwn.net/Articles/850202/
Detecting missing memory barriers with KCSAN:
  https://lwn.net/Articles/877200/
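Sketch referenced from the synchronize_rcu() implementation discussion
above: a user-space analogue using per-"CPU" context-switch counters. It is
a much-simplified polling variant of the McKenney/Slingwine batch design,
and it ignores idle CPUs and many other real-world details; NCPU,
note_context_switch, and synchronize_rcu_sketch are made-up names.

  #include <sched.h>
  #include <stdatomic.h>

  #define NCPU 4

  // Each "CPU" bumps its counter at every context switch; a
  // hypothetical scheduler would call note_context_switch().
  atomic_long ctxt[NCPU];

  void note_context_switch(int cpu) {
    atomic_fetch_add(&ctxt[cpu], 1);
  }

  // Record every CPU's count, then wait until every count has changed,
  // i.e. every CPU has context-switched at least once since we started.
  // By rule 1 of idea 3, any reader that existed when we were called
  // must be gone by then, so it is safe to free anything that was
  // unlinked before the call.
  void synchronize_rcu_sketch(void) {
    long start[NCPU];
    for (int i = 0; i < NCPU; i++)
      start[i] = atomic_load(&ctxt[i]);
    for (int i = 0; i < NCPU; i++)
      while (atomic_load(&ctxt[i]) == start[i])
        sched_yield();   // a kernel would sleep here instead of spinning
  }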
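Sketch referenced from the write-heavy data discussion above: per-core
statistics counters in user-space C, one spinlock-protected counter per
slot. NCORE, inc, and total are made-up names; a real kernel would use
per-CPU variables rather than an array indexed by core number.

  #include <pthread.h>

  #define NCORE 8

  // One counter per core, each under its own lock. The crude padding
  // keeps different slots on different cache lines (no false sharing).
  struct slot {
    pthread_spinlock_t lock;
    long count;
    char pad[64];
  };

  struct slot slots[NCORE];

  void counters_init(void) {
    for (int i = 0; i < NCORE; i++)
      pthread_spin_init(&slots[i].lock, PTHREAD_PROCESS_PRIVATE);
  }

  // Write path (frequent): touch only this core's slot.
  // A spin lock is cheap if only one core ever takes it.
  void inc(int core) {
    pthread_spin_lock(&slots[core].lock);
    slots[core].count++;
    pthread_spin_unlock(&slots[core].lock);
  }

  // Read path (rare): take all the locks and sum all the counts.
  long total(void) {
    long sum = 0;
    for (int i = 0; i < NCORE; i++) {
      pthread_spin_lock(&slots[i].lock);
      sum += slots[i].count;
      pthread_spin_unlock(&slots[i].lock);
    }
    return sum;
  }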