Execution replay of multiprocessor virtual machines (VEE'08)
Dunlap, Lucchetti, Fetterman, and Chen

Why read this paper?
  an example of what virtual machines can be used for
  faithful replay of races is a challenging problem

Uses of virtual machines:
  develop an OS
  run Windows on Linux
  manage server farms
  backup
  this lecture, a more exotic use: record & replay

Record & replay (single-processor case):
  record phase: log all inputs to the virtual machine
    interrupts, network packets, the time-stamp counter, etc.
    e.g., record at which instruction an interrupt was delivered
  replay phase: deliver every recorded input at the same point
    deliver interrupts at the same instruction
    if the virtual machine is deterministic, replay reaches the same state
    don't have to log outputs!
  the virtual machine monitor can get its hands on all inputs:
    trap reads of the time-stamp counter
    trap driver interactions
    etc.
  are all uses recordable/replayable?  SSL connections?
  useful for debugging
  can record more information during replay (e.g., for intrusion analysis)
  (a log sketch appears below, after the replay section)

Problem: replaying races (e.g., a write/write race)
  virtual machine monitor with N virtual CPUs
  the write/write race from last lecture: two concurrent inserts
    all updates to shared memory form some total order
    most orderings are fine
    one ordering is a problem: an insert slips between the read and the
      update of head
  the ordering is sensitive to external factors:
    interrupts arriving
    cache misses
  a slight change in the environment and the race may not reappear
    e.g., the cache-miss pattern differs during replay
    e.g., a debug statement was added to the program

Goal: detect and replay races
  detect races: two truly concurrent writes (insert/insert)
  record enough information to reproduce the execution faithfully
  challenge 1: sources of non-determinism (see the Determinator lecture)
  challenge 2: performance (slowdown due to recording, plus space overhead)
  no modifications to the OS or applications (unlike Determinator)

Deterministic replay of races
  races happen only when physical memory is concurrently accessed by
    several virtual CPUs
  the virtual machine monitor virtualizes physical memory
    a virtual CPU traps to the monitor on *each* reference to shared
      physical memory
  during recording, log the loads and stores to shared physical memory
    if we trap all loads and stores, we have a total order
  during replay, deliver the loads and stores to shared memory in the same
    total order
  bonus: works for races in the guest kernel as well as in applications

What about performance?
  trapping every load/store to shared memory is expensive!
  do we need a total order on loads?
    two loads don't affect each other
    a partial order is sufficient: if a load observes a store during
      recording, it should observe the same store during replay

Recording the partial order: the CREW protocol
  every shared memory object is in one of two states:
    1. concurrent read: all virtual CPUs can load
    2. exclusive write: only the owning virtual CPU can load and store
  the monitor maps shared objects into the guest with the corresponding
    permissions
    shared objects are mapped at the granularity of pages
  on a trap:
    record the program counter (PC) of the trapping virtual CPU
    send an IPI to the other CPUs, which causes their virtual CPUs to trap
    record their PCs too
      these PCs may not correspond to the last access of the shared object,
      but the last access happened before the recorded PC
  many executions can match the partial order (see Figure 1):
    e.g., the store a can be replayed anytime before the load d, as long as
      it comes after P2's previous load/store to that location
  (a state-machine sketch appears after the replay section below)

Replay
  replay each virtual processor to the same point as its recorded trap
    force a trap after executing the same number of instructions:
      log the PC plus the number of branch instructions executed since the
        last interrupt
      during replay, generate an interrupt after the recorded number of
        branches (sketch below)
  remap the shared objects according to the recorded states
  resume all virtual processors
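To make the single-processor record phase concrete, here is a minimal C
sketch of input logging and re-delivery.  All the vcpu_* and log_* helpers
are invented names for illustration, not ReVirt's actual interface:

    /* Minimal sketch of uniprocessor record/replay logging.  The vcpu_*
     * and log_* helpers are hypothetical, not ReVirt's actual API. */
    #include <stdint.h>

    enum input_kind { INPUT_INTERRUPT, INPUT_NET_PACKET, INPUT_RDTSC };

    struct log_entry {
        enum input_kind kind;
        uint64_t instr_count;  /* instruction count when the input arrived */
        uint64_t data;         /* e.g., interrupt vector or TSC value */
    };

    /* hypothetical VMM hooks */
    extern uint64_t vcpu_instruction_count(void);
    extern void log_append(const struct log_entry *e);
    extern void vcpu_run_until(uint64_t instr_count);
    extern void vcpu_inject_input(enum input_kind kind, uint64_t data);

    /* Record phase: tag every non-deterministic input with the exact
     * instruction count, so it can be re-delivered at the same point. */
    void record_input(enum input_kind kind, uint64_t data)
    {
        struct log_entry e = {
            .kind = kind,
            .instr_count = vcpu_instruction_count(),
            .data = data,
        };
        log_append(&e);
    }

    /* Replay phase: run the vCPU up to the logged instruction count, then
     * inject the logged input.  Outputs need not be logged: a deterministic
     * machine fed identical inputs produces identical outputs. */
    void replay_input(const struct log_entry *e)
    {
        vcpu_run_until(e->instr_count);
        vcpu_inject_input(e->kind, e->data);
    }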
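A sketch of the CREW state transitions at page granularity.  The map_* and
IPI helpers are hypothetical; SMP-ReVirt's real implementation lives in
Xen's shadow page-table code:

    /* Sketch of the CREW protocol's page-state transitions. */
    #include <stdbool.h>

    enum crew_state { CONCURRENT_READ, EXCLUSIVE_WRITE };

    struct crew_page {
        enum crew_state state;
        int owner;             /* owning vCPU while in EXCLUSIVE_WRITE */
    };

    /* hypothetical helpers */
    extern void log_pc(int cpu);
    extern void ipi_and_log_pcs(int except_cpu);
    extern void map_read_only_everywhere(struct crew_page *p);
    extern void map_read_write_only_on(struct crew_page *p, int cpu);

    /* Called when vCPU `cpu` faults on page `p` for lack of permission. */
    void crew_fault(struct crew_page *p, int cpu, bool is_write)
    {
        log_pc(cpu);           /* record the trapping vCPU's PC */
        ipi_and_log_pcs(cpu);  /* make the other vCPUs trap and record
                                * their PCs: their last access to this page
                                * happened somewhere before that PC */
        if (is_write) {
            /* unmap the page everywhere else; this vCPU gets exclusive
             * read/write access */
            map_read_write_only_on(p, cpu);
            p->state = EXCLUSIVE_WRITE;
            p->owner = cpu;
        } else {
            /* downgrade to concurrent-read: every vCPU may load, none
             * may store */
            map_read_only_everywhere(p);
            p->state = CONCURRENT_READ;
        }
    }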
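And a sketch of how replay lands a virtual CPU at an exact recorded point
using a hardware branch counter.  This follows the branch-counting idea
above, but the pmc_* and vcpu_* helpers are invented names:

    /* Sketch of replaying a vCPU to an exact recorded point. */
    #include <stdint.h>

    struct stop_point {
        uint64_t branches;  /* branches executed since the last interrupt */
        uint64_t pc;        /* program counter at the recorded trap */
    };

    /* hypothetical helpers */
    extern void pmc_interrupt_after_branches(uint64_t n);
    extern void vcpu_run(void);            /* runs until the PMC interrupt */
    extern uint64_t vcpu_branch_count(void);
    extern uint64_t vcpu_pc(void);
    extern void vcpu_single_step(void);

    void replay_to(const struct stop_point *sp)
    {
        /* Program the performance counter to interrupt at (or just
         * before) the recorded branch count, then let the vCPU run. */
        pmc_interrupt_after_branches(sp->branches);
        vcpu_run();

        /* Between branches execution is straight-line, so the pair
         * (branch count, PC) identifies one dynamic instruction:
         * single-step until both match. */
        while (vcpu_branch_count() < sp->branches || vcpu_pc() != sp->pc)
            vcpu_single_step();
    }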
Loose end: DMA
  devices are in effect another "virtual processor"
  with an I/O MMU, SMP-ReVirt could implement CREW for devices as well
  the paper uses a more complicated scheme that doesn't require an I/O MMU

Implementation using Xen
  log non-deterministic hypervisor calls (e.g., reading the TSC)
  the hypervisor provides the guest with shadow page tables
    the real page table may grant fewer permissions than the guest expects
      e.g., a shared page may be mapped read-only for the CREW protocol
      (a permission sketch appears after the evaluation below)
    updates require manipulating several page tables
      acquire the per-domain shadow lock
  challenge: hypervisor upcalls into the guest that modify guest-visible
    memory
    e.g., setting dirty and accessed bits in shadow page tables
    such upcalls are non-deterministic
    solution: a global hypervisor lock (i.e., no concurrent hypercalls)

Evaluation
  a page may be the wrong unit for the CREW protocol (e.g., false sharing)
  ReVirt without SMP, kernel workload: 5% overhead, 0.6 GB/day of log
  2-processor SMP-ReVirt, kernel workload:
    slower than the single-CPU case; 43.3 GB/day of log
    ~100,000 CREW faults/sec
  4-processor SMP-ReVirt, kernel workload:
    even slower (~425,000 faults/sec)
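To illustrate the "fewer permissions than the guest expects" point in the
Xen section above: the effective shadow-PTE permission is the intersection
of what the guest page table grants and what the CREW state currently
allows.  A sketch with invented flag and type names (not Xen's):

    #include <stdint.h>

    #define PTE_READ  0x1u
    #define PTE_WRITE 0x2u

    enum crew_state { CONCURRENT_READ, EXCLUSIVE_WRITE };

    struct crew_page { enum crew_state state; int owner; };

    /* What the CREW protocol currently lets vCPU `cpu` do to the page. */
    static uint32_t crew_perms(const struct crew_page *p, int cpu)
    {
        if (p->state == CONCURRENT_READ)
            return PTE_READ;                   /* anyone may load */
        return p->owner == cpu
            ? PTE_READ | PTE_WRITE             /* owner: load and store */
            : 0;                               /* others: no access */
    }

    /* The shadow PTE gets the intersection of guest and CREW permissions,
     * so a page the guest maps read-write may appear read-only (or not
     * present at all) until a CREW fault upgrades it. */
    uint32_t shadow_perms(uint32_t guest_perms,
                          const struct crew_page *p, int cpu)
    {
        return guest_perms & crew_perms(p, cpu);
    }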