The Multikernel: a new OS architecture for scalable multicore systems,
Baumann et al., SOSP '09

Plan:
  Multicore processors
  IPC through shared memory
  Barrelfish

Multicore processors (AMD slides)
  The slides describe a more modern version of the machine used in the paper
  The cache-coherence protocol and interconnect topology (see figure 2 in the
    paper) are important for understanding the paper

IPC through shared memory
  code + performance handout (results from running ipc.c on tom)
  4 cache-line transfers per increment in the IPC case, no matter how many cores
  note: the results don't match figure 3 in the paper (most likely the authors
    ran without probe filters)
  (a sketch in the spirit of ipc.c appears at the end of these notes)

Motivation for Barrelfish
  Parallelizing a single-address-space kernel with shared data structures
    (e.g., xv6) may be difficult
    Locks/atomic instructions limit scalability
    Every shared data structure must be modified for scalability:
      partition it, apply RCU to it, etc.
    A tremendous amount of engineering, even for xv6!
  Underlying problem: sharing is expensive
    Moving cache lines is expensive
    Congestion of the interconnect
  Observation: make sharing explicit using messages
    Treat the multicore chip as a distributed system
    No shared data structures
    Send messages to a core to access its data
    Good match for chips with heterogeneous cores
    Good match if future chips don't provide cache-coherent shared memory
  Challenge: global state
    Replicate it
    Read locally (low latency)
    Update locally
    Propagate changes asynchronously (weak consistency)

Barrelfish (figure 5)
  CPU driver (the kernel-mode part, one per core)
    exokernel-like
    IPC --- what is split-phase?
  Monitors
    implement the OS abstractions
    single-core (one monitor per core)
    replicate allocation tables and address-space mappings (see below)
  IPC: like the homework, plus pipelining
  Memory management
    shared objects are named by capabilities
    the capability lists on each core must be kept coherent
    one-phase commit between monitors handles address-space changes
      e.g., removing a page
    some operations require two-phase commit
      e.g., changing memory ownership and usage via capability retyping

Case study: TLB shootdown
  When is it necessary?  (one core changes a page table, but the page table is
    shared between several cores)
    Can this happen in xv6?
  Windows & Linux send IPIs
  Barrelfish sends messages to the involved monitors
    Broadcast: 1 message (the shared cache line costs N-1 invalidates and
      N-1 fetches)
    Unicast: N messages, one per core
    Multicast: send one message to each processor (package); each processor
      forwards to its own cores
    NUMA-aware multicast: send to the highest-latency node first
      (multiple nodes on the 8x4 machine)
      (see the latency model at the end of these notes)
  Beats Linux's IPI-based method
    Could Linux adopt the protocol?
    Why does figure 9 stop at 16?
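
Sketch 1: shared-memory IPC cost.  This is not the ipc.c handout, just a
minimal sketch of the same kind of experiment (all names, core numbers, and
the round count are made up): two threads pinned to different cores take
turns incrementing a shared, cache-line-aligned counter, so every increment
drags that cache line across the interconnect.  A real measurement would also
read the timestamp counter around the loop and would use C11 atomics rather
than volatile.

  /* Minimal sketch, not the actual handout code. */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  #define ROUNDS 1000000

  /* Counter in its own cache line so nothing else shares it. */
  static volatile unsigned long counter __attribute__((aligned(64)));

  struct arg { int cpu; int parity; };

  static void *worker(void *p)
  {
      struct arg *a = p;

      /* Pin this thread to its core so the line really moves between caches. */
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(a->cpu, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

      for (int i = 0; i < ROUNDS; i++) {
          while ((counter & 1) != (unsigned long)a->parity)
              ;                   /* spin until it is our turn */
          counter++;              /* pulls the cache line to this core */
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t t0, t1;
      struct arg a0 = { 0, 0 }, a1 = { 1, 1 };

      pthread_create(&t0, NULL, worker, &a0);
      pthread_create(&t1, NULL, worker, &a1);
      pthread_join(t0, NULL);
      pthread_join(t1, NULL);

      printf("final counter = %lu\n", counter);
      return 0;
  }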
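
Sketch 2: why the NUMA-aware multicast sends to the highest-latency node
first.  This is a toy cost model, not Barrelfish code, and the send cost and
per-node latencies are made-up numbers: the sender issues one message per node
serially, and a node's acknowledgement comes back one link latency after its
message is issued, so the shootdown completes at the maximum over nodes of
(issue time + latency).  Issuing the slowest link first overlaps its latency
with the remaining sends.

  #include <stdio.h>

  #define NNODES 4
  #define SEND_COST 1     /* assumed cost to issue one message from the sender */

  /* Assumed per-node interconnect latencies, farthest node last. */
  static const int latency[NNODES] = { 1, 2, 3, 4 };

  static int completion_time(const int order[NNODES])
  {
      int finish = 0;
      for (int i = 0; i < NNODES; i++) {
          int issued = (i + 1) * SEND_COST;       /* sends go out serially */
          int done = issued + latency[order[i]];  /* ack after link latency */
          if (done > finish)
              finish = done;
      }
      return finish;
  }

  int main(void)
  {
      int nearest_first[NNODES]  = { 0, 1, 2, 3 };
      int farthest_first[NNODES] = { 3, 2, 1, 0 };

      printf("nearest-first : %d\n", completion_time(nearest_first));
      printf("farthest-first: %d\n", completion_time(farthest_first));
      return 0;
  }

With these made-up numbers, nearest-first finishes at time 8 and
farthest-first at time 5, which is the intuition behind the ordering rule.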