Hive: Fault Containment for Shared-Memory Multiprocessors
Chapin, Rosenblum, Devine, Lahiri, Teodosiu, Gupta
SOSP 1995

why are we reading this paper?
  FLASH is a big distributed computation system
    in some ways just a bunch of machines on a network
  trying to handle failures in big systems
  relaxed failure model
    strict enough to be useful
    relaxed enough to be reasonable to implement
    i.e. more practical than the Hypervisor scheme
  happens to involve a sequentially consistent memory system

background
  what are the apps like? shared mem, threaded, scientific
  FLASH hardware: grid, per-node memory, cache coherence, directory(?)
  stress that every node's h/w can write all memory (modulo firewall)
  ordinary SMP kernel vs. one kernel per node vs. one kernel per cell
  single system image across cells
    sharing seems to be file-system based
    every node has a disk?
    when I read, the node w/ the disk reads into its local mem,
      and I use that remote buffer
    anonymous memory is copy-on-write, not shared r/w? (5.3)

what are the top-level goals of the entire system?
  a huge shared-memory multiprocessor
  only justified if there's good support for
    shared memory
    and flexible allocation of CPUs
  so they're going to spread computations out over nodes
    a computation's memory
    a computation's threads

what are the key problems?
  nodes fail, making their memory inaccessible
  nodes return bad values for memory reads
  s/w issues wild writes that corrupt other nodes' memory

what properties are they looking for? what is "fault containment"?
  not true fault tolerance / masking
  if a node fails, they are willing to lose the programs/data on that node
  they don't want the problem to spread
  and they'd like policies that make a 1% failure affect only 1% of apps

what are they willing to give up?
  the SMP-style single kernel
  though they hack the cell kernels to present a single system image?

what mechanisms do they propose?
  careful reads (see the careful-reference sketch below)
    point: detect kernel data mangled by a node's failure
    not really protecting against arbitrary failures
    is the point really a crash while updating some kernel data structure?
  firewall hardware (see the firewall sketch below)
    helps protect against wild writes
    where does the firewall hardware sit?
      it guards a memory module against remote writes
    what's in the firewall hardware?
      64 bits per phys mem page, one bit per node
    when does the system set the firewall to allow writes?
      when any CPU has mapped that page
      so it really just protects against app wild writes
  OK, the firewall protects against some wild writes,
    but what about pages a failed node was allowed to write?
    they might have been corrupted before the crash!
  what does Hive do after a cell has crashed to deal w/ wild writes
      to allowed pages? (see the recovery sketch below)
    it detects damaged files
      all files with user-level pages writeable by the failed node?
    and gives I/O errors to processes that had those files open
      and try to use them
      presumably including LD/ST as well as read()/write()
    looks like shared memory only occurs via shared mmap()ed files
  why is the firewall better than VM protection?
    VM is enforced by the writer's potentially faulty h/w and o/s
    the firewall is enforced by the memory's owner
  how do they detect failed cells?

policies in 5.6 (see the placement sketch below):
  try to place a process's pages on few cells
    to minimize the number of nodes whose crash could kill the process
  try to place a file's pages on few cells
    since an entire file is marked bad if a bad cell could write
      even one of its pages

they talk about a memory fault model
  what is the model?

how could you test a system like this? can it contain faults?
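
A minimal sketch of a careful reference, assuming hypothetical
primitives remote_read() (nonzero return on bus error) and
now_usec(); none of these names come from the paper. The idea is
that a kernel access to another cell's memory must bound its wait
(so a dead cell can't hang the reader) and sanity-check what it
gets (so a half-updated structure isn't trusted blindly):

    #include <stdint.h>

    enum cr_status { CR_OK, CR_TIMEOUT, CR_BAD_VALUE };

    /* assumed primitives, not a real API */
    extern int remote_read(uint64_t paddr, uint64_t *out);
    extern uint64_t now_usec(void);

    #define CR_TIMEOUT_USEC 1000

    /* Read one word from a remote cell, giving up after a timeout
     * and rejecting values outside the expected range [lo, hi]. */
    static enum cr_status
    careful_read(uint64_t paddr, uint64_t lo, uint64_t hi, uint64_t *out)
    {
        uint64_t start = now_usec();
        uint64_t v;

        do {
            if (remote_read(paddr, &v) == 0) {
                if (v < lo || v > hi)
                    return CR_BAD_VALUE;   /* mangled kernel data */
                *out = v;
                return CR_OK;
            }
        } while (now_usec() - start < CR_TIMEOUT_USEC);

        return CR_TIMEOUT;                 /* treat the cell as failed */
    }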
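
A minimal sketch of the firewall bookkeeping, assuming a 64-node
machine so each page's state fits in one 64-bit word (the paper's
"64 bits per phys mem page, one bit per node"). In FLASH this check
sits in the memory owner's controller hardware; the C here only
shows the logic, and all names are invented:

    #include <stdint.h>
    #include <stdbool.h>

    #define NPAGES (1 << 20)          /* example physical page count */

    static uint64_t fw_allow[NPAGES]; /* one write-permission bit per node */

    /* The owning cell grants write access when a CPU on `node`
     * maps the page. */
    static void fw_grant(uint32_t page, uint32_t node)
    {
        fw_allow[page] |= (uint64_t)1 << node;
    }

    /* Check applied to every incoming remote write; a rejected write
     * never reaches memory, however broken the writer's VM or o/s. */
    static bool fw_write_ok(uint32_t page, uint32_t writer_node)
    {
        return (fw_allow[page] >> writer_node) & 1;
    }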
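
A minimal sketch of the post-crash cleanup, with invented data
structures: any page the dead cell's firewall bit allowed it to
write is suspect, and the whole backing file is poisoned (file
granularity, not page granularity), so later uses get I/O errors:

    #include <stdint.h>
    #include <stdbool.h>

    struct file {
        bool damaged;          /* set => subsequent access returns EIO */
    };

    struct page {
        uint64_t fw_allow;     /* firewall bits: which nodes may write */
        struct file *backing;  /* file this page caches, if any */
    };

    /* After cell `dead` crashes, any page it could have written may
     * have been corrupted before the crash; mark the containing file. */
    static void recover_from_cell_crash(struct page *pages,
                                        uint32_t npages, uint32_t dead)
    {
        for (uint32_t i = 0; i < npages; i++)
            if (((pages[i].fw_allow >> dead) & 1) && pages[i].backing)
                pages[i].backing->damaged = true;
    }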
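
A minimal sketch of one plausible reading of the 5.6 placement
policy (the greedy heuristic is my invention, not the paper's
algorithm): when a process needs another page, prefer a cell it
already uses heavily, so the set of cells whose failure can kill
it stays small:

    #include <stdint.h>

    #define NCELLS 16

    struct proc {
        uint32_t pages_on_cell[NCELLS]; /* this proc's pages per cell */
    };

    extern int cell_has_free_page(uint32_t cell); /* assumed allocator query */

    /* Pick the cell this process already uses most that still has
     * room; returns -1 if no cell has a free page. */
    static int place_page(struct proc *p)
    {
        int best = -1;
        for (uint32_t c = 0; c < NCELLS; c++) {
            if (!cell_has_free_page(c))
                continue;
            if (best < 0 ||
                p->pages_on_cell[c] > p->pages_on_cell[(uint32_t)best])
                best = (int)c;
        }
        if (best >= 0)
            p->pages_on_cell[best]++;
        return best;
    }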