Execution replay of multiprocessor virtual machines (VEE'08)
Dunlap, Lucchetti, Fetterman, and Chen

Why read this paper?
  an example of what virtual machines can be used for
  faithful replay of races is a challenging problem

Uses of virtual machines:
  develop an OS
  run Windows on Linux
  manage server farms
  backup
  this lecture, a more exotic use: record & replay

Record & replay (single-processor case):
  record phase: log all inputs to the virtual machine
    interrupts, network packets, the time-stamp counter, etc.
    e.g., record at which instruction an interrupt was delivered
  replay phase: deliver every recorded input at the same point
    deliver interrupts at the same instruction
    if the virtual machine is deterministic, replay reaches the same state
    don't have to log outputs!
  the virtual machine monitor can get its hands on all inputs:
    trap reads of the time-stamp counter
    trap driver interactions
    etc.
  are all uses recordable/replayable?  SSL connections?
  useful for debugging
  can record more information during replay (e.g., for intrusion analysis)
  (a log sketch appears below, after the replay section)

Problem: replaying races (e.g., a write/write race)
  virtual machine monitor with N virtual CPUs
  the write/write race from last lecture: two concurrent inserts
    all updates to shared memory form some total order
    most orderings are fine
    one ordering is a problem: an insert slips between the read and the
      update of head
  the ordering is sensitive to external factors:
    interrupts arriving
    cache misses
  a slight change in the environment and the race may not reappear
    e.g., the cache-miss pattern differs during replay
    e.g., a debug statement was added to the program

Goal: detect and replay races
  detect races: two truly concurrent writes (insert/insert)
  record enough information to reproduce the execution faithfully
  challenge 1: sources of non-determinism (see the Determinator lecture)
  challenge 2: performance (slowdown due to recording, plus space overhead)
  no modifications to the OS or applications (unlike Determinator)

Deterministic replay of races
  races happen only when physical memory is concurrently accessed by
    several virtual CPUs
  the virtual machine monitor virtualizes physical memory
    a virtual CPU traps to the monitor on *each* reference to shared
      physical memory
  during recording, log the loads and stores to shared physical memory
    if we trap all loads and stores, we have a total order
  during replay, deliver the loads and stores to shared memory in the same
    total order
  bonus: works for races in the guest kernel as well as in applications

What about performance?
  trapping every load/store to shared memory is expensive!
  do we need a total order on loads?
    two loads don't affect each other
    a partial order is sufficient: if a load observes a store during
      recording, it should observe the same store during replay

Recording the partial order: the CREW protocol
  every shared memory object is in one of two states:
    1. concurrent read: all virtual CPUs can load
    2. exclusive write: only the owning virtual CPU can load and store
  the monitor maps shared objects into the guest with the corresponding
    permissions
    shared objects are mapped at the granularity of pages
  on a trap:
    record the program counter (PC) of the trapping virtual CPU
    send an IPI to the other CPUs, which causes their virtual CPUs to trap
    record their PCs too
      these PCs may not correspond to the last access of the shared object,
      but the last access happened before the recorded PC
  many executions can match the partial order (see Figure 1):
    e.g., the store a can be replayed anytime before the load d, as long as
      it comes after P2's previous load/store to that location
  (a state-machine sketch appears after the replay section below)

Replay
  replay each virtual processor to the same point as its recorded trap
    force a trap after executing the same number of instructions:
      log the PC plus the number of branch instructions executed since the
        last interrupt
      during replay, generate an interrupt after the recorded number of
        branches (sketch below)
  remap the shared objects according to the recorded states
  resume all virtual processors
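To make the single-processor record phase concrete, here is a minimal C
sketch of input logging and re-delivery.  All the vcpu_* and log_* helpers
are invented names for illustration, not ReVirt's actual interface:

    /* Minimal sketch of uniprocessor record/replay logging.  The vcpu_*
     * and log_* helpers are hypothetical, not ReVirt's actual API. */
    #include <stdint.h>

    enum input_kind { INPUT_INTERRUPT, INPUT_NET_PACKET, INPUT_RDTSC };

    struct log_entry {
        enum input_kind kind;
        uint64_t instr_count;  /* instruction count when the input arrived */
        uint64_t data;         /* e.g., interrupt vector or TSC value */
    };

    /* hypothetical VMM hooks */
    extern uint64_t vcpu_instruction_count(void);
    extern void log_append(const struct log_entry *e);
    extern void vcpu_run_until(uint64_t instr_count);
    extern void vcpu_inject_input(enum input_kind kind, uint64_t data);

    /* Record phase: tag every non-deterministic input with the exact
     * instruction count, so it can be re-delivered at the same point. */
    void record_input(enum input_kind kind, uint64_t data)
    {
        struct log_entry e = {
            .kind = kind,
            .instr_count = vcpu_instruction_count(),
            .data = data,
        };
        log_append(&e);
    }

    /* Replay phase: run the vCPU up to the logged instruction count, then
     * inject the logged input.  Outputs need not be logged: a deterministic
     * machine fed identical inputs produces identical outputs. */
    void replay_input(const struct log_entry *e)
    {
        vcpu_run_until(e->instr_count);
        vcpu_inject_input(e->kind, e->data);
    }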
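A sketch of the CREW state transitions at page granularity.  The map_* and
IPI helpers are hypothetical; SMP-ReVirt's real implementation lives in
Xen's shadow page-table code:

    /* Sketch of the CREW protocol's page-state transitions. */
    #include <stdbool.h>

    enum crew_state { CONCURRENT_READ, EXCLUSIVE_WRITE };

    struct crew_page {
        enum crew_state state;
        int owner;             /* owning vCPU while in EXCLUSIVE_WRITE */
    };

    /* hypothetical helpers */
    extern void log_pc(int cpu);
    extern void ipi_and_log_pcs(int except_cpu);
    extern void map_read_only_everywhere(struct crew_page *p);
    extern void map_read_write_only_on(struct crew_page *p, int cpu);

    /* Called when vCPU `cpu` faults on page `p` for lack of permission. */
    void crew_fault(struct crew_page *p, int cpu, bool is_write)
    {
        log_pc(cpu);           /* record the trapping vCPU's PC */
        ipi_and_log_pcs(cpu);  /* make the other vCPUs trap and record
                                * their PCs: their last access to this page
                                * happened somewhere before that PC */
        if (is_write) {
            /* unmap the page everywhere else; this vCPU gets exclusive
             * read/write access */
            map_read_write_only_on(p, cpu);
            p->state = EXCLUSIVE_WRITE;
            p->owner = cpu;
        } else {
            /* downgrade to concurrent-read: every vCPU may load, none
             * may store */
            map_read_only_everywhere(p);
            p->state = CONCURRENT_READ;
        }
    }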
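And a sketch of how replay lands a virtual CPU at an exact recorded point
using a hardware branch counter.  This follows the branch-counting idea
above, but the pmc_* and vcpu_* helpers are invented names:

    /* Sketch of replaying a vCPU to an exact recorded point. */
    #include <stdint.h>

    struct stop_point {
        uint64_t branches;  /* branches executed since the last interrupt */
        uint64_t pc;        /* program counter at the recorded trap */
    };

    /* hypothetical helpers */
    extern void pmc_interrupt_after_branches(uint64_t n);
    extern void vcpu_run(void);            /* runs until the PMC interrupt */
    extern uint64_t vcpu_branch_count(void);
    extern uint64_t vcpu_pc(void);
    extern void vcpu_single_step(void);

    void replay_to(const struct stop_point *sp)
    {
        /* Program the performance counter to interrupt at (or just
         * before) the recorded branch count, then let the vCPU run. */
        pmc_interrupt_after_branches(sp->branches);
        vcpu_run();

        /* Between branches execution is straight-line, so the pair
         * (branch count, PC) identifies one dynamic instruction:
         * single-step until both match. */
        while (vcpu_branch_count() < sp->branches || vcpu_pc() != sp->pc)
            vcpu_single_step();
    }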
Loose end: DMA
  devices are in effect another "virtual processor"
  with an I/O MMU, SMP-ReVirt could implement CREW for devices as well
  the paper uses a more complicated scheme that doesn't require an I/O MMU

Implementation using Xen
  log non-deterministic hypervisor calls (e.g., reading the TSC)
  the hypervisor provides the guest with shadow page tables
    the real page table may grant fewer permissions than the guest expects
      e.g., a shared page may be mapped read-only for the CREW protocol
      (a permission sketch appears after the evaluation below)
    updates require manipulating several page tables
      acquire the per-domain shadow lock
  challenge: hypervisor upcalls into the guest that modify guest-visible
    memory
    e.g., setting dirty and accessed bits in shadow page tables
    such upcalls are non-deterministic
    solution: a global hypervisor lock (i.e., no concurrent hypercalls)

Evaluation
  a page may be the wrong unit for the CREW protocol (e.g., false sharing)
  ReVirt without SMP, kernel workload: 5% overhead, 0.6 GB/day of log
  2-processor SMP-ReVirt, kernel workload:
    slower than the single-CPU case; 43.3 GB/day of log
    ~100,000 CREW faults/sec
  4-processor SMP-ReVirt, kernel workload:
    even slower (~425,000 faults/sec)
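To illustrate the "fewer permissions than the guest expects" point in the
Xen section above: the effective shadow-PTE permission is the intersection
of what the guest page table grants and what the CREW state currently
allows.  A sketch with invented flag and type names (not Xen's):

    #include <stdint.h>

    #define PTE_READ  0x1u
    #define PTE_WRITE 0x2u

    enum crew_state { CONCURRENT_READ, EXCLUSIVE_WRITE };

    struct crew_page { enum crew_state state; int owner; };

    /* What the CREW protocol currently lets vCPU `cpu` do to the page. */
    static uint32_t crew_perms(const struct crew_page *p, int cpu)
    {
        if (p->state == CONCURRENT_READ)
            return PTE_READ;                   /* anyone may load */
        return p->owner == cpu
            ? PTE_READ | PTE_WRITE             /* owner: load and store */
            : 0;                               /* others: no access */
    }

    /* The shadow PTE gets the intersection of guest and CREW permissions,
     * so a page the guest maps read-write may appear read-only (or not
     * present at all) until a CREW fault upgrades it. */
    uint32_t shadow_perms(uint32_t guest_perms,
                          const struct crew_page *p, int cpu)
    {
        return guest_perms & crew_perms(p, cpu);
    }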