The Multikernel: a new OS architecture for scalable multicore systems,
Baumann et al., SOSP '09

Plan:
  Multicore processors
  IPC through shared memory
  Barrelfish

Multicore processors (AMD slides)
  The slides describe a more modern version of the machine used in the paper
  The cache-coherence protocol and interconnect topology (see figure 2 in the
    paper) are important for understanding the paper

IPC through shared memory
  code + performance handout (results from running ipc.c on tom)
  4 cache-line transfers per increment in the IPC case, no matter how many cores
  note: the results don't match figure 3 in the paper (most likely the authors
    ran without probe filters)
  (a sketch in the spirit of ipc.c appears at the end of these notes)

Motivation for Barrelfish
  Parallelizing a single-address-space kernel with shared data structures
    (e.g., xv6) may be difficult
    Locks/atomic instructions limit scalability
    Every shared data structure must be modified for scalability:
      partition it, apply RCU to it, etc.
    A tremendous amount of engineering, even for xv6!
  Underlying problem: sharing is expensive
    Moving cache lines is expensive
    Congestion of the interconnect
  Observation: make sharing explicit using messages
    Treat the multicore chip as a distributed system
    No shared data structures
    Send messages to a core to access its data
    Good match for chips with heterogeneous cores
    Good match if future chips don't provide cache-coherent shared memory
  Challenge: global state
    Replicate it
    Read locally (low latency)
    Update locally
    Propagate changes asynchronously (weak consistency)

Barrelfish (figure 5)
  CPU driver (the kernel-mode part, one per core)
    exokernel-like
    IPC --- what is split-phase?
  Monitors
    implement the OS abstractions
    single-core (one monitor per core)
    replicate allocation tables and address-space mappings (see below)
  IPC: like the homework, plus pipelining
  Memory management
    shared objects are named by capabilities
    the capability lists on each core must be kept coherent
    one-phase commit between monitors handles address-space changes
      e.g., removing a page
    some operations require two-phase commit
      e.g., changing memory ownership and usage via capability retyping

Case study: TLB shootdown
  When is it necessary?  (one core changes a page table, but the page table is
    shared between several cores)
    Can this happen in xv6?
  Windows & Linux send IPIs
  Barrelfish sends messages to the involved monitors
    Broadcast: 1 message (the shared cache line costs N-1 invalidates and
      N-1 fetches)
    Unicast: N messages, one per core
    Multicast: send one message to each processor (package); each processor
      forwards to its own cores
    NUMA-aware multicast: send to the highest-latency node first
      (multiple nodes on the 8x4 machine)
      (see the latency model at the end of these notes)
  Beats Linux's IPI-based method
    Could Linux adopt the protocol?
    Why does figure 9 stop at 16?
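
Sketch 1: shared-memory IPC cost.  This is not the ipc.c handout, just a
minimal sketch of the same kind of experiment (all names, core numbers, and
the round count are made up): two threads pinned to different cores take
turns incrementing a shared, cache-line-aligned counter, so every increment
drags that cache line across the interconnect.  A real measurement would also
read the timestamp counter around the loop and would use C11 atomics rather
than volatile.

  /* Minimal sketch, not the actual handout code. */
  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  #define ROUNDS 1000000

  /* Counter in its own cache line so nothing else shares it. */
  static volatile unsigned long counter __attribute__((aligned(64)));

  struct arg { int cpu; int parity; };

  static void *worker(void *p)
  {
      struct arg *a = p;

      /* Pin this thread to its core so the line really moves between caches. */
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(a->cpu, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

      for (int i = 0; i < ROUNDS; i++) {
          while ((counter & 1) != (unsigned long)a->parity)
              ;                   /* spin until it is our turn */
          counter++;              /* pulls the cache line to this core */
      }
      return NULL;
  }

  int main(void)
  {
      pthread_t t0, t1;
      struct arg a0 = { 0, 0 }, a1 = { 1, 1 };

      pthread_create(&t0, NULL, worker, &a0);
      pthread_create(&t1, NULL, worker, &a1);
      pthread_join(t0, NULL);
      pthread_join(t1, NULL);

      printf("final counter = %lu\n", counter);
      return 0;
  }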
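
Sketch 2: why the NUMA-aware multicast sends to the highest-latency node
first.  This is a toy cost model, not Barrelfish code, and the send cost and
per-node latencies are made-up numbers: the sender issues one message per node
serially, and a node's acknowledgement comes back one link latency after its
message is issued, so the shootdown completes at the maximum over nodes of
(issue time + latency).  Issuing the slowest link first overlaps its latency
with the remaining sends.

  #include <stdio.h>

  #define NNODES 4
  #define SEND_COST 1     /* assumed cost to issue one message from the sender */

  /* Assumed per-node interconnect latencies, farthest node last. */
  static const int latency[NNODES] = { 1, 2, 3, 4 };

  static int completion_time(const int order[NNODES])
  {
      int finish = 0;
      for (int i = 0; i < NNODES; i++) {
          int issued = (i + 1) * SEND_COST;       /* sends go out serially */
          int done = issued + latency[order[i]];  /* ack after link latency */
          if (done > finish)
              finish = done;
      }
      return finish;
  }

  int main(void)
  {
      int nearest_first[NNODES]  = { 0, 1, 2, 3 };
      int farthest_first[NNODES] = { 3, 2, 1, 0 };

      printf("nearest-first : %d\n", completion_time(nearest_first));
      printf("farthest-first: %d\n", completion_time(farthest_first));
      return 0;
  }

With these made-up numbers, nearest-first finishes at time 8 and
farthest-first at time 5, which is the intuition behind the ordering rule.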