Multi-processor coordination

Required reading: Read-Copy Update.

x86 memory model

why are we reading this paper?
    multi-processor systems becoming quite common; hard to buy a single-core laptop
    want to use multiple CPUs to improve performance
    synchronization between CPUs often expensive (performance-wise) and tricky
    scalability limitations of kernel-provided services passed onto all applications
    RCU is a cool technique for cheap, simple synchronization in some cases
    let's briefly look back at the x86 memory system to understand what's slow..

back in the locking lecture, we had a fairly simple model of multiple CPUs
    shared bus
    memory connected to bus
    to implement acquire, x86's xchg instruction locked the bus
    provided atomicity for xchg

in reality things are a bit different
    bus, memory quite slow compared to CPU speed
    each CPU has a few levels of cache to compensate
    each CPU directly attached to a part of the system's memory (AMD, recent Intels)
    recent AMD chip:
	cache latency ~10's of cycles
	going over bus to DRAM or another CPU is ~500 cycles
    need to make sure xchg still atomic with all these caches, and performs well
    x86 uses caching system to provide fast atomic ops

what's in a cache line?
    state, address, 64 bytes of data
    states: Modified, Exclusive, Shared, Invalid [MESI]
    compatibility of states between 2 CPUs:

		   CPU1
		  M E S I
		M - - - +
	CPU2	E - - - +
		S - - + +
		I + + + +

how do the CPUs coordinate with each other?
    each CPU listens on bus for address requests, looks up in cache, responds
	"read address"
	"read and invalidate address"
	"invalidate address"
    atomicity: hold cacheline while executing xchg, don't give it to other CPUs
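
    to connect this to software: a minimal sketch of acquire/release built on an
    atomic exchange (using GCC's __sync_lock_test_and_set builtin, which emits an
    xchg on x86; the builtin is my choice here, not something from the paper)

	volatile int locked;		/* 0 = free, 1 = held */

	void acquire(void) {
	    /* xchg atomically swaps in 1 and returns the old value;
	       the CPU holds the cache line exclusive for the duration */
	    while (__sync_lock_test_and_set(&locked, 1))
		;	/* spin; each retry can bounce the cache line */
	}

	void release(void) {
	    __sync_lock_release(&locked);	/* store 0 */
	}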

Searchable stack using locks

	struct element {
	    int key;
	    int value;
	    struct element *next;
	};

	struct element *top;

	void push(struct element *e) {
	    e->next = top;
	    top = e;
	}

	struct element *pop(void) {
	    struct element *e = top;
	    top = e->next;
	    return e;
	}

	int search(int key) {
	    struct element *e = top;
	    while (e) {
		if (e->key == key)
		    return e->value;
		e = e->next;
	    }
	    return -1;
	}

this is clearly not going to work on a concurrent system
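
the obvious fix: protect every operation with one global spinlock -- a sketch
reusing the acquire/release from above (pop is analogous to push)

	void push(struct element *e) {
	    acquire();
	    e->next = top;
	    top = e;
	    release();
	}

	int search(int key) {
	    int value = -1;
	    acquire();		/* even read-only searches take the lock */
	    for (struct element *e = top; e; e = e->next)
		if (e->key == key) {
		    value = e->value;
		    break;
		}
	    release();
	    return value;
	}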

global spinlock: performance
    how many concurrent ops?  just one CPU at a time
    bus interactions?  bounce cache line for both reads, writes

global read-write lock: performance
    how many concurrent ops?  theoretically, could do one per CPU
    bus interactions?  bounce cache line for both reads, writes
    draw timing diagram of CPU interactions
    we might be slower on two CPUs than on a single CPU!

is it going to get better if we allocate a read-write lock for each item?
    concurrency still possible
    but penalty even worse: one cache-line bounce per element for each search traversal

more scalable locking schemes
    paper alludes to a fairly simple one: brlock
    N cpus would have N locks, one per CPU
    readers lock their CPU's lock, writers must lock each CPU's lock
    must use up N 64-byte cache lines, even though each lock is just a bit..
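
    a sketch of the brlock idea (spin_acquire/spin_release, cpu_id, and NCPU are
    assumed helpers here, not the paper's code); note the one-cache-line-per-CPU padding

	struct brlock_cpu {
	    int lock;			/* per-CPU reader lock */
	    char pad[60];		/* pad to a full 64-byte cache line */
	} brlock[NCPU];

	void read_lock(void)   { spin_acquire(&brlock[cpu_id()].lock); }
	void read_unlock(void) { spin_release(&brlock[cpu_id()].lock); }

	/* a writer excludes all readers by taking every CPU's lock, in order */
	void write_lock(void) {
	    for (int i = 0; i < NCPU; i++)
		spin_acquire(&brlock[i].lock);
	}

	void write_unlock(void) {
	    for (int i = 0; i < NCPU; i++)
		spin_release(&brlock[i].lock);
	}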

Avoiding locks

why do we want to avoid locks?
    performance
    complexity
    deadlock
    priority inversion

what's the plan?
    reduce the operation we want to perform to some atomic x86 instruction
    x86 LOCK prefix makes many read-modify-write instructions atomic
    simple example: implement atomic counters by adding LOCK prefix
    most general thing is cmpxchg, whose pseudo-code is:

	int cmpxchg(int *addr, int old, int new) {
	    int was = *addr;
	    if (was == old)
		*addr = new;
	    return was;
	}

    how to avoid race conditions in our linked list?

	void push(struct element *e) {
	again:
	    e->next = top;
	    if (cmpxchg(&top, e->next, e) != e->next)
		goto again;
	}

	struct element *pop(void) {
	again:
	    struct element *e = top;
	    if (cmpxchg(&top, e, e->next) != e)
		goto again;
	    return e;
	}

    search can be the same as in the non-concurrent case (almost..)
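
    on real hardware the cmpxchg above is the LOCK CMPXCHG instruction; with GCC,
    push might look like this sketch (using the __sync_val_compare_and_swap builtin;
    the atomic counter mentioned earlier is just __sync_fetch_and_add(&cnt, 1))

	void push(struct element *e) {
	    struct element *old;
	    do {
		old = top;
		e->next = old;
		/* emits LOCK CMPXCHG; returns the value top held at that instant */
	    } while (__sync_val_compare_and_swap(&top, old, e) != old);
	}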

why is this better than locks?
    readers no longer generate spurious updates, which incurred a performance hit
    not having to think about deadlock is great

problem 1: lock-free data structures require careful design, hw support
    suppose we want to remove arbitrary elements, not just the first one
    what could go wrong if we try to use the same cmpxchg?  (sketch below)
    need DCAS (double-compare-and-swap) to implement remove properly
	must make sure neither previous nor next element changed
	x86 hardware doesn't have DCAS
    can do some tricks because x86 provides 8-byte cmpxchg
	only for sequential memory addresses
	can potentially play tricks by staggering the 8 bytes across two pages
	unlikely to win in performance once we start invalidating TLB entries
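
    to see the problem, here is a strawman remove built only on the single-word
    cmpxchg -- shown to illustrate why it is unsafe, not as working code

	void remove(struct element *prev, struct element *e) {
	    /* try to unlink e by swinging prev->next past it */
	    cmpxchg(&prev->next, e, e->next);
	    /* bug: a concurrent remove(e, e->next) can succeed at the same time;
	       neither cmpxchg observes the other's change, so prev->next can end
	       up pointing at an element that was just unlinked (and may be freed).
	       fixing this needs prev->next and e->next checked atomically
	       together -- i.e. DCAS. */
	}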

problem 2: memory reuse
    when can we free a memory block, if other CPUs could be accessing it?
	other CPU's search might be traversing any of the elements
	we could try bumping a refcount, but that introduces cacheline bounces
	and what about CPUs that are just about to bump the refcount?

    worse, reusing a memory block can corrupt list
	stack contains three elements
	    top -> A -> B -> C
	CPU 1 about to pop off the top of the stack,
	    preempted just before cmpxchg(&top, A, B)
	CPU 2 pops off A, B, frees both of them
	    top -> C
	CPU 2 allocates another item (malloc reuses A) and pushes onto stack
	    top -> A -> C
	CPU 1: cmpxchg succeeds, stack now looks like
	    top -> B -> C

    strawman solution for specific problem (actually used by some systems):
	type-stable memory (stack-elements never become non-stack-elements)
	each stack element has a reuse counter
	include generation# along with each pointer (so that cmpxchg notices) -- sketch below

    but for more complex data structures this becomes hard to solve
	for example, readers must check that data structure hasn't been freed
	we'll see shortly how RCU can help us reuse memory
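
    a sketch of the generation-counter strawman, assuming 32-bit pointers so a
    pointer plus counter fits in x86's 8-byte cmpxchg (cmpxchg8b() here stands in
    for an 8-byte compare-and-swap that returns whether it succeeded)

	struct tagged_top {
	    struct element *ptr;	/* top of stack; 32-bit pointer assumed */
	    uint32_t gen;		/* bumped on every successful pop */
	} top;

	struct element *pop(void) {
	    struct tagged_top old, upd;
	again:
	    old = top;
	    if (!old.ptr)
		return 0;
	    upd.ptr = old.ptr->next;	/* safe to dereference even if old.ptr was
					   popped meanwhile: type-stable memory keeps
					   stack elements looking like stack elements */
	    upd.gen = old.gen + 1;
	    /* even if the same address is pushed back (the A/B reuse above),
	       the generation number differs and the 8-byte cmpxchg fails */
	    if (!cmpxchg8b(&top, old, upd))
		goto again;
	    return old.ptr;
	}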

RCU

lock-free data structures are good for readers, but fairly tricky for updaters
    lock-free updates were serving two purposes:
	atomically change state seen by readers
	    relatively easy -- swap pointers
	detect and prevent conflicts with other updaters
	    requires a lot more thought, makes spinlocks seem good again?
    RCU separates these two purposes
	make updates appear atomic to readers
	    allows readers to run without any special synchronization
	use any other scheme to coordinate between updaters
	    locks are easy, but can use lock-free updates, or a single writer
	    performance-wise, doesn't matter -- cacheline bounce going to happen either way

how will this work?
    key idea: leave old data for readers, don't update in-place
	no need for reader locks if they don't see changes in-place
	fundamentally removes need for feedback from readers to updaters!
    instead, update by copying data items, atomically change pointers
	makes need for memory reuse mechanism even more acute, more on that in a bit
    benefits
	reader code simpler, avoids locking issues and bus ops to notify updaters
	writer code simpler than lock-free data structures
	good performance for read-heavy workloads
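
    a minimal sketch of the idea for a linked list (names and the explicit barrier
    are illustrative, not the Linux RCU API; updaters are assumed to serialize among
    themselves, e.g. with a spinlock)

	struct element *head;

	/* reader: plain traversal, no locks, no shared-memory writes */
	int search(int key) {
	    for (struct element *e = head; e; e = e->next)
		if (e->key == key)
		    return e->value;	/* might be the old copy -- that's OK */
	    return -1;
	}

	/* updater: copy the element, modify the copy, publish with one pointer store */
	void update(struct element *prev, struct element *old, int new_value) {
	    struct element *copy = malloc(sizeof(*copy));
	    *copy = *old;		/* copy, including old->next */
	    copy->value = new_value;	/* modify the copy, not the original */
	    __sync_synchronize();	/* copy must be visible before the pointer is */
	    prev->next = copy;		/* readers atomically switch to the new copy */
	    /* cannot free(old) yet -- a reader may still be looking at it;
	       that is exactly the memory-reuse question below */
	}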

example: linked list
    section 2 in paper
    what's the performance like for the locking+refcounted case (fig 4,5,6)?
	search: 3 shared memory writes, even though not modifying anything
	delete: 5 memory writes
    what if we make the list lock more optimized for readers?
	search: still 1 shared memory write, due to refcount for reclaiming memory
	delete: 3+NCPU memory writes
    what's the RCU plan?
	figures 8, 9, 10
	illustrate what happens with diagram
	unusual semantics -- reader sees old data, as we don't update in-place
	paper argues this is OK in some cases
	    when does this occur?  reader concurrent with update
	    since they are concurrent, order often not strictly defined
	    so OK to make reader execute on old data as if before update

	search: no shared memory writes, going to be really fast
	delete: 4 memory writes (2 maybe local to deleting CPU?), plus RCU stuff
    paper makes RCU seem very natural here; what wouldn't have worked?
	in-place list modifications
	would need to translate into insert of a new element, removal of older one
	hold mutex for the list during update

what's the problem, again?
    when can we reuse memory from an old element?
    .. so that we don't crash readers accessing it
    .. so that we don't see weird cmpxchg behavior

cool observation: CPUs go through cyclical activity
    get a system call, interrupt, etc
    process that event, perhaps issuing other operations, and go to sleep/run user code
    once we're done, kernel code likely not holding any pointers of interest
    reuse memory once every possible reader has reached such a quiescent state
    kernels are a good workload for this: kernel code never runs for very long
    diagram: quiescent states, grace period

how to make this work?
    need to figure out what's a quiescent state
    need to detect when every reader has been through a quiescent state

quiescent states
    on Linux, idling or context-switching indicates a quiescent state
    paper claims Linux is "non-preemptible": what does this mean?
	kernel code is never preempted; it runs until it voluntarily yields the CPU
    kernel not allowed to hold pointer to an RCU-managed element when blocking
    what could happen on xv6?
	kernel could be pre-empted, context-switched elsewhere while holding ptr
	what would be a quiescent state and grace period in xv6?

detection algorithm
    simple alg: use scheduler to wait for a quiescent period on each CPU
    figure 17 illustrates how the detection algorithm works
    code in figure 18 (a sketch follows below)

    better alg: figure 29
	basically, put code from figure 18 into scheduler, and parallelize
	global bitmask tracks CPUs that need to reach a quiescent state
	when CPU notices its bit set, waits for a quiescent state and clears it
	when the bitmask reaches 0, every CPU has quiesced: the grace period is over
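
    a sketch of the simple scheduler-based detection (in the spirit of figure 18;
    bind_to_cpu and yield are assumed helpers): force a context switch on every CPU
    in turn, since each switch is a quiescent state

	void wait_for_rcu(void) {
	    for (int cpu = 0; cpu < NCPU; cpu++) {
		bind_to_cpu(cpu);	/* migrate this thread onto that CPU */
		yield();		/* context-switch there: a quiescent state */
	    }
	    /* every CPU has context-switched since we started, so every reader
	       that existed at the start has passed through a quiescent state */
	}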

tricks to making RCU work well
    batch deferred RCU work to only run one instance of detection alg at a time
    pre-allocate a list entry in each element, so frees can be deferred even from interrupt context
    avoiding global data structures in RCU itself
	keep callbacks (e.g. memory freeing) local to each CPU
	bump a global generation counter for each time we quiesce
	for each new generation, CPU can run callbacks from previous generation
	    must keep track of when a callback was entered
	    cannot run a callback too soon
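
    a sketch of per-CPU deferred callbacks (names made up, not the Linux API); each
    element embeds its callback record, so deferring a free needs no allocation

	struct rcu_callback {
	    void (*func)(void *);	/* e.g. free */
	    void *arg;
	    struct rcu_callback *next;
	};

	/* per-CPU pending list: no cache-line bouncing to enqueue */
	struct rcu_callback *pending[NCPU];

	void defer(struct rcu_callback *cb, void (*func)(void *), void *arg) {
	    cb->func = func;
	    cb->arg = arg;
	    cb->next = pending[cpu_id()];	/* assumes interrupts are off */
	    pending[cpu_id()] = cb;
	}

	/* run once the detection algorithm says a full grace period has passed;
	   a real implementation keeps two lists per CPU so that callbacks queued
	   during the current grace period wait for the next one */
	void run_callbacks(int cpu) {
	    struct rcu_callback *cb, *next;
	    for (cb = pending[cpu]; cb; cb = next) {
		next = cb->next;
		cb->func(cb->arg);
	    }
	    pending[cpu] = 0;
	}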

so what did RCU buy us?
    good scalability for read-heavy workloads -- reading incurs no bus writes
    updates: locking overhead is similar, but perhaps more data copying
    cost: memory isn't freed immediately, grace period detection running in background

    paper doesn't talk a whole lot about performance: 30% better on 4 CPUs?
	other works: 4-CPU system RCU microbenchmark wins by factor of 2-4
	    assuming <20% updates
	unclear if the paper's suggested 1/NCPU fraction is relevant or not
	seems to depend more on the cost of update-in-place vs copy-update

RCU applications

widely used in Linux: about 200 files using RCU

network routing table
    OK to have stale data because external state won't wait for our spinlock anyway
    entries not updated in-place -- instead, insert and delete operations
    often many more packets than route updates

module unloading
    kernel module:
	loads code into kernel's memory (e.g. /dev/random)
	registers function pointers (e.g. open for /dev/random's major/minor numbers)
    problem: if we're unloading, when can we free module's memory?
	other CPUs might be holding references to module code, or be executing it
    strawman: refcount on module tracks saved pointers or active function calls
	very expensive for short calls, e.g. stat(/dev/random)
    RCU allows us to avoid refcounts for short-lived use that doesn't span quiescence
    still use refcounts when we go to sleep or save a pointer for later use..

    two-phase unloading:
	wait for refcount to reach zero (no long-term holders)
	remove function pointers from major/minor number table: prevent new threads
	wait for short-term holders to go away (grace period)
	unload module for real
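
    in code the two phases might look like this sketch (all helper names hypothetical)

	void unload(struct module *m) {
	    while (m->refcount > 0)
		sleep_on(&m->refcount);	/* no long-term holders left */
	    unregister_devsw(m);	/* pull function pointers: no new callers */
	    wait_for_rcu();		/* grace period: in-flight short calls finish */
	    free_module_memory(m);	/* now nothing can still be executing m */
	}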

    similar design comes up in many other cases, e.g. event processing
	want to register for input events (keyboard, mouse)
	when is it safe to unregister and free memory?
	not a performance problem
	instead, RCU helps avoid read-locking, potential deadlock, etc

    NMI processing
	a particular case of the event processing workload
	but NMIs, by definition, cannot be masked!
	could not have done this with locks even if we wanted to

FD flags
    effectively an existence lock
    why is it safe to let other CPUs access an old FD flags array?
	presumably updates grab a lock, so therefore ordered w.r.t. updates?
	order for reads not well-defined anyway, so doesn't matter

CPU array
    what's cool about this case?
    without RCU, pervasive changes: locks or other changes for each access to CPU array
    with RCU, almost no work outside of the update code

RCU on a single CPU in Linux
    slight performance win from removing now-pointless locking
    simplifies locking complexity, esp. in interrupt context
    do we need quiescent state detection on a single CPU?  can we free memory right away?
	still need quiescent state detection due to interrupts

RCU in xv6
    due to preemption, must reach quiescent state in each kernel thread
    how would we find a grace period?
	look at threads running something other than wait() or user code
	wait for them to go through a quiescent state
    where might RCU come in useful?
	read-only access to inodes, buffer cache entries?
	is it ok to use stale copy?  order undefined for concurrent read+update anyway.
	single array of structs would not work so well -- requires update-in-place
	would need a separate linked list of "active" inodes, bufs
	so we could avoid acquire/release in fstat, for example

would we use RCU in JOS?
    no SMP support -- not going to get any SMP benefits.
    no preemption or interrupts while in kernel -- not needed even for single CPU.

so when is RCU a good technique?
    perhaps if we don't need immediate feedback from readers to updaters
    provides performance, simplicity for these readers