Virtual Machines

Required reading: A comparison of software and hardware techniques for x86 virtualization, Keith Adams and Ole Agesen, ASPLOS 2006

what's a virtual machine?
  simulation of a computer
  running as an application on a host computer
  accurate
  isolated
  fast

why use a VM?
  one computer, multiple operating systems (OSX and Windows)
  manage big machines (allocate CPUs/memory at o/s granularity)
  kernel development environment (like qemu)
  better fault isolation: contain break-ins

how accurate do we need?
  handle weird quirks of operating system kernels
  reproduce bugs exactly
  handle malicious software
    cannot let guest break out of virtual machine!
  usual goal:
    impossible for guest to distinguish VM from real computer
    impossible for guest to escape its VM
  some VMs compromise, require guest kernel modifications

VMs are an old idea
  1960s: IBM used VMs to share big machines
  1990s: VMware re-popularized VMs, for x86 hardware

terminology
  [diagram: h/w, VMM, VMs..]
  VMM ("host")
  guest: kernel, user programs
  VMM might run in a host O/S, e.g. OSX
    or VMM might be stand-alone

VMM responsibilities
  divide memory among guests
  time-share CPU among guests
  simulate per-guest virtual disk, network
    really e.g. slice of real disk

why not simulation?
  VMM interprets each guest instruction in software
  maintain virtual machine state for each guest
    eflags, %cr3, &c
  much too slow!
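
A minimal sketch of pure simulation, assuming a made-up toy machine: the instruction names and the `GuestState` fields are hypothetical, not real x86 encodings. It shows why interpretation is slow: every guest instruction costs a fetch, a decode, and a dispatch in the VMM.

```python
from dataclasses import dataclass, field

@dataclass
class GuestState:
    """Per-guest virtual machine state kept by the VMM (toy subset)."""
    regs: dict = field(default_factory=lambda: {"eax": 0, "ebx": 0})
    cr3: int = 0
    eflags_if: bool = True
    pc: int = 0

def interpret(state, program):
    """Fetch, decode, and execute each guest instruction in software."""
    while state.pc < len(program):
        op, *args = program[state.pc]
        if op == "mov_imm":        # mov $imm, dst
            imm, dst = args
            state.regs[dst] = imm & 0xFFFFFFFF
        elif op == "add":          # add src, dst
            src, dst = args
            state.regs[dst] = (state.regs[dst] + state.regs[src]) & 0xFFFFFFFF
        elif op == "mov_cr3":      # only the guest's *virtual* %cr3 changes
            (src,) = args
            state.cr3 = state.regs[src]
        elif op == "cli":          # only the guest's *virtual* IF changes
            state.eflags_if = False
        else:
            raise ValueError(f"unknown instruction {op!r}")
        state.pc += 1
    return state

guest = interpret(GuestState(), [
    ("mov_imm", 0x5000, "eax"),
    ("mov_imm", 1, "ebx"),
    ("add", "eax", "ebx"),
    ("mov_cr3", "eax"),
    ("cli",),
])
```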

idea: execute guest instructions on real CPU when possible
  works fine for most instructions
  e.g. add %eax, %ebx
  how to prevent guest from modifying e.g. %cr3 and wrecking the VMM?

idea: run each guest kernel at CPL=3
  ordinary instructions work fine
  writing %cr3 will trap to VMM
    VMM can examine guest's page table
    detect any attempt to get at non-guest physical memory
    perhaps modify page table before installing in h/w %cr3
  "trap-and-emulate"

VMM hides real machine from guests
  virtual vs real
  hardware state:
    "virtual" %cr3: set by guest
    "real" %cr3: managed by VMM
  also machine-defined data structures:
    virtual page table
    real page table (often called "shadow")
  VMM must cause guest to see only virtual CPU state
    and completely hide/protect real state

trap-and-emulate is tricky on an x86
  not all privileged instructions trap at CPL=3
  all those traps can be slow
  VMM must see PTE writes, which don't use privileged instructions

what real x86 state do we have to hide (i.e. != virtual state)?
  physical memory 
  CPL (low bits of CS) since it is 3, guest expecting 0
  gdt descriptors (DPL 3, not 0)
  gdtr (pointing to shadow gdt)
  idt descriptors (traps go to VMM, not guest kernel)
  idtr
  pagetable (doesn't map to expected physical addresses)
  %cr3 (points to shadow pagetable)
  IF in EFLAGS
  %cr0 &c

how shall we give guest illusion of physical memory?
  guest wants to start at PA=0, use all "installed" DRAM
  VMM must support many guests, they can't all really use PA=0
  VMM must protect one guest's memory from other guests
  idea:
    claim DRAM size is smaller than real DRAM
    ensure paging is enabled
    rewrite guest's pagetable PTEs
    remap the PA in each PTE into the guest's slice of real memory
  example:
    VMM allocates a guest phys mem 0x1000000 to 0x2000000
    VMM gets trap if guest changes %cr3 (since guest kernel at CPL=3)
    VMM copies guest's pagetable to "shadow" pagetable
    VMM adds 0x1000000 to each PA in shadow table
    VMM checks that each PA is < 0x2000000
  VMM must copy the guest's pagetable
    so guest doesn't see VMM's modifications to PAs
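
The example above can be sketched as follows. This is a simplification under stated assumptions: the page table is modeled as a flat dict from virtual page address to physical page address (real x86 tables are multi-level), and `build_shadow` and `GuestFault` are hypothetical names.

```python
GUEST_BASE = 0x1000000    # start of the real memory slice given to this guest
GUEST_LIMIT = 0x2000000   # end (exclusive) of that slice

class GuestFault(Exception):
    """Guest tried to map memory outside its allocation."""

def build_shadow(guest_pt):
    """Copy the guest's table, relocating each PA; the guest's copy is untouched."""
    shadow = {}
    for va, guest_pa in guest_pt.items():
        real_pa = guest_pa + GUEST_BASE
        if not (GUEST_BASE <= real_pa < GUEST_LIMIT):
            raise GuestFault(f"guest PA {guest_pa:#x} is outside its memory")
        shadow[va] = real_pa
    return shadow

guest_pt = {0x0: 0x0, 0x400000: 0x5000}
shadow = build_shadow(guest_pt)   # this is what the real %cr3 would point at
```

The VMM would run `build_shadow` from its trap handler for the guest's `%cr3` write, then load the hardware `%cr3` with the shadow table rather than the guest's.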

also shadow the GDT, IDT
  real IDT refers to VMM's trap entry points
    VMM can forward to guest kernel if needed
    VMM may also fake interrupts from virtual disk
  real GDT allows execution of guest kernel by CPL=3

note we rely on h/w trapping to VMM if guest writes %cr3, gdtr, &c
  do we also need a trap if guest *read*s?

do all instructions that read/write sensitive state cause traps at CPL=3?
  push %cs will show CPL=3, not 0
  sgdt reveals real GDTR
  pushf pushes real IF
    suppose guest turned IF off
    VMM will leave real IF on, just postpone interrupts to guest
  popf silently ignores IF if CPL=3 (CPL > IOPL), no trap
    so VMM won't know if guest kernel wants interrupts
  IRET: no ring change so won't restore SS/ESP

how can we cope with non-trapping instructions that reveal real state?
  rewrite guest code, change them to INT 3, which traps
  keep track of original instruction, emulate in VMM
  INT 3 is one byte, so doesn't change code size/layout
  this is a simplified version of the paper's Binary Translation
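
A minimal sketch of what "emulate in VMM" might look like for the IF-related instructions, once the VMM has control. `VirtualCPU` and its method names are hypothetical. The one hardware fact used is that IF is bit 9 of EFLAGS.

```python
IF_BIT = 1 << 9  # interrupt-enable flag in EFLAGS

class VirtualCPU:
    """Toy per-guest state for emulating pushf/popf."""

    def __init__(self):
        self.virtual_if = True   # what the guest kernel believes IF is
        self.pending = []        # interrupts postponed while virtual IF is off
        self.delivered = []      # interrupts forwarded to the guest kernel

    def emulate_pushf(self, real_eflags):
        """Value the guest should see: real flags with IF replaced by virtual IF."""
        if self.virtual_if:
            return real_eflags | IF_BIT
        return real_eflags & ~IF_BIT

    def emulate_popf(self, guest_eflags):
        """Record the IF the guest asked for; the real IF stays on."""
        self.virtual_if = bool(guest_eflags & IF_BIT)
        if self.virtual_if:
            self.delivered.extend(self.pending)
            self.pending.clear()

    def on_real_interrupt(self, vector):
        """Real interrupts always reach the VMM, which forwards or postpones."""
        if self.virtual_if:
            self.delivered.append(vector)
        else:
            self.pending.append(vector)

cpu = VirtualCPU()
cpu.emulate_popf(0x002)            # guest turns IF off
cpu.on_real_interrupt(32)          # arrives while guest has interrupts "off"
seen = cpu.emulate_pushf(0x202)    # real IF is on, guest must see it off
cpu.emulate_popf(0x202)            # guest turns IF back on
```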

how does rewriter know where instruction boundaries are?
  or whether bytes are code or data?
  can VMM look at symbol table for function entry points?

idea: scan only as executed, since execution reveals instr boundaries
  original start of kernel (making up these instructions):
  entry:
    pushl %ebp
    ...
    popf
    ...
    jnz x
    ...
    jxx y
  x:
    ...
    jxx z
  when VMM first loads guest kernel, rewrite from entry to first jump
    replace bad instrs (popf) with int3
    replace jump with int3
    then start the guest kernel
  on int3 trap to VMM
    look where the jump could go (now we know the boundaries)
    for each branch, xlate until first jump again
    replace int3 w/ original branch
    re-start
  keep track of what we've rewritten, so we don't do it again
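
The scan-as-executed scheme above can be sketched on a toy code model. Assumptions: each instruction occupies one address slot and is a `(mnemonic, target)` pair, so the sketch sidesteps variable-length x86 decoding. The function names and instruction sets are hypothetical.

```python
SENSITIVE = {"popf", "pushf", "sgdt"}   # non-trapping instrs that reveal real state
JUMPS = {"jmp", "jnz"}                  # direct jumps only
INT3 = ("int3", None)

def rewrite_block(original, patched, done, start):
    """Rewrite from start up to and including the first jump."""
    addr = start
    while addr in original and addr not in done:
        op, _ = original[addr]
        done.add(addr)                   # remember, so we never rescan
        if op in SENSITIVE or op in JUMPS:
            patched[addr] = INT3
        if op in JUMPS:
            return
        addr += 1

def on_int3(original, patched, done, addr):
    """VMM trap handler for an int3 that replaced a direct jump."""
    op, target = original[addr]
    if op in JUMPS:
        successors = [target] + ([addr + 1] if op == "jnz" else [])
        for s in successors:
            rewrite_block(original, patched, done, s)
        patched[addr] = original[addr]   # restore: this jump never traps again
    # else: a sensitive instruction; emulate it and leave the int3 in place

original = {
    0: ("pushl", None), 1: ("popf", None), 2: ("jnz", 5),
    3: ("nop", None),   4: ("jmp", 7),
    5: ("nop", None),   6: ("jmp", 0),
    7: ("nop", None),
}
patched, done = dict(original), set()
rewrite_block(original, patched, done, 0)   # when the VMM first loads the kernel
first_pass = dict(patched)
on_int3(original, patched, done, 2)         # guest reaches the jnz
```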

indirect calls/jumps?
  same, but can't replace int3 with the original jump
  since we're not sure address will be the same next time
  so must take a trap every time

ret (function return)?
  == indirect jump via ptr on stack
  can't assume that ret PC on stack is from a call
  so must take a trap every time. slow!

what if guest reads or writes its own code?
  can't let guest see int3
  must re-rewrite any code the guest modifies
  can we use page protections to trap and emulate reads/writes?
    no: can't set up PTE for X but no R
  perhaps make CS != DS
    put rewritten code in CS
    put original code in DS
    write-protect original code pages
  on write trap
    emulate write
    re-rewrite if already rewritten
    tricky: must find first instruction boundary in overwritten code

do we need to rewrite guest user-level code?
  technically yes: SGDT, IF
  but probably not in practice
  user code only does INT, which traps to VMM

how to handle pagetable?
  remember VMM keeps shadow pagetable w/ different PAs in PTEs

what if guest kernel writes a PTE?
  it's an ordinary store, so no trap (unlike a %cr3 write)...
  idea: VMM can write-protect guest's PTE pages
  trap on PTE write, emulate, also in shadow pagetable

what if guest writes %cr3 often, during context switches?
  does VMM have to scan the new page table, modify all PTEs?
  idea: lazy population of shadow page table
  start w/ empty real page table (just VMM mappings)
  so guest will generate many page faults after it loads %cr3
  VMM page fault handler just copies needed PTE to shadow pagetable
    restarts guest, no guest-visible page fault
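
Lazy population might look like the sketch below, again assuming a flat dict page-table model and hypothetical names. The fault handler's return value says whether the VMM hid the fault or must forward a genuine page fault to the guest kernel.

```python
GUEST_BASE = 0x1000000   # start of this guest's slice of real memory

class LazyShadow:
    def __init__(self):
        self.guest_pt = {}
        self.shadow = {}
        self.hidden_faults = 0

    def on_cr3_write(self, guest_pt):
        """Guest switched page tables: start over with an empty shadow."""
        self.guest_pt = guest_pt
        self.shadow = {}

    def on_page_fault(self, va):
        """Return True if handled invisibly, False if the guest must see it."""
        if va not in self.guest_pt:
            return False     # guest really has no mapping: forward the fault
        self.shadow[va] = self.guest_pt[va] + GUEST_BASE
        self.hidden_faults += 1
        return True          # restart the guest instruction

vmm = LazyShadow()
vmm.on_cr3_write({0x400000: 0x5000, 0x401000: 0x6000})
hidden = vmm.on_page_fault(0x400000)      # first touch of a mapped page
forwarded = vmm.on_page_fault(0x800000)   # guest has no PTE for this address
```

Only the pages the guest actually touches get copied, so a `%cr3` write costs nothing up front.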

guest probably switches among same set of page tables over and over
  as it context-switches among running processes
  idea: VMM could cache multiple shadow page tables
    cache indexed by address of guest pagetable
  start with pre-populated page table on guest %cr3 write
  would make context switch much faster
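
A minimal sketch of the shadow cache, keyed by the guest's `%cr3` value. `ShadowCache` is a hypothetical name. The sketch assumes the VMM keeps the guest's PTE pages write-protected so that cached shadows stay in sync with the guest's tables.

```python
class ShadowCache:
    def __init__(self):
        self.cache = {}    # guest %cr3 value -> shadow page table
        self.hits = 0
        self.misses = 0

    def on_cr3_write(self, guest_cr3):
        """Return the shadow table to install in the real %cr3."""
        if guest_cr3 in self.cache:
            self.hits += 1              # already populated: few page faults
        else:
            self.misses += 1
            self.cache[guest_cr3] = {}  # empty: filled lazily by page faults
        return self.cache[guest_cr3]

sc = ShadowCache()
a = sc.on_cr3_write(0x3000)       # process A, first time
a[0x400000] = 0x1005000           # lazily filled by a page fault
sc.on_cr3_write(0x7000)           # switch to process B
a_again = sc.on_cr3_write(0x3000) # back to A: shadow is still populated
```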

how to guard guest kernel against writes by guest programs?
  both are at CPL=3
  delete kernel PTEs on IRET, re-install on INT?

how to handle devices?
  trap INB and OUTB
  DMA addresses are physical, VMM must translate and check
  rarely makes sense for guest to use real device
    want to share w/ other guests
    each guest gets a part of the disk
    each guest looks like a distinct Internet host
    each guest gets an X window
  VMM might mimic some standard ethernet or disk controller
    regardless of actual h/w on host computer
  or guest might run special drivers that jump to VMM
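
The "translate and check" step for DMA addresses can be sketched as below. It assumes the guest owns one contiguous slice of real memory, as in the earlier 0x1000000 example. `translate_dma` is a hypothetical helper, not part of any real VMM.

```python
GUEST_BASE = 0x1000000   # start of this guest's slice of real memory
GUEST_SIZE = 0x1000000   # DRAM size the guest was told it has

def translate_dma(guest_pa, length):
    """Map a guest-physical DMA buffer to a real address, or refuse it."""
    if length <= 0 or guest_pa < 0 or guest_pa + length > GUEST_SIZE:
        raise ValueError("DMA buffer is outside the guest's memory")
    return guest_pa + GUEST_BASE

real = translate_dma(0x8000, 512)   # one disk sector at guest PA 0x8000
```

The check covers the whole buffer, not just its start, since a device would otherwise write past the end of the guest's memory.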

VMware avoids many faults
  by re-writing w/ VMM code, rather than int3
  often faster than non-VM kernel, e.g. cli vs setting a flag in virt state
  but then code size and fn addresses change
  how does VMware hide e.g. return EIPs?

VMware supports Binary Translation (see paper)
  int3 can be expensive: every fn return?
  VMware allows translations that increase code size
  so actually executes translated code at different address
  indirect and function return pointers are different
    variables/stack hold virtual pointers, that guest expects
    translated code maps indirect pointers before call/ret
  examples of clever BT translations?
    don't trap: directly r/w VMM data structures
    for e.g. instrs that read/write EFLAGS
    via %gs segment register, which points to high address
    BT detects/rewrites guest use of %gs
    and %ds bound prevents non-%gs access to VMM memory
  adaptive PTE update handling
    can detect instructions that often write PTEs
    have them directly modify shadow PTE also
    avoid page-fault trap
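
A toy sketch of the indirect-pointer mapping mentioned above: the stack holds the guest addresses the guest expects, and translated code maps them to translated addresses on return. The class and the addresses are hypothetical and are not taken from the paper. A real translator would translate a missing target on demand; here a miss is just a `KeyError`.

```python
class TranslatedCode:
    def __init__(self):
        self.guest_to_translated = {}   # guest code address -> translated address
        self.stack = []                 # guest-visible stack contents

    def add_translation(self, guest_addr, translated_addr):
        self.guest_to_translated[guest_addr] = translated_addr

    def emulate_call(self, guest_return_addr):
        """Push the return address the guest expects, not the translated one."""
        self.stack.append(guest_return_addr)

    def emulate_ret(self):
        """Pop a guest address and map it to where the translated code lives."""
        guest_addr = self.stack.pop()
        return self.guest_to_translated[guest_addr]

tc = TranslatedCode()
tc.add_translation(0x1005, 0x90000040)
tc.emulate_call(0x1005)            # guest sees 0x1005 if it inspects its stack
stack_view = list(tc.stack)
next_pc = tc.emulate_ret()         # execution continues in translated code
```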

Intel/AMD hardware support for virtual machines
  has made it much easier to implement a VMM w/ reasonable performance
  h/w itself directly maintains per-guest virtual state
    CS (w/ CPL), EFLAGS, idtr, &c
  h/w knows it is in "guest mode"
    instructions directly modify virtual state
    avoids lots of traps to VMM
  h/w basically adds a new priv level
    VMM mode, CPL=0, ..., CPL=3
    guest-mode CPL=0 is not fully privileged
  no traps to VMM on system calls
    h/w handles CPL transition
  what about memory, pagetables?
    h/w supports *two* page tables
    guest page table
    VMM's page table
    guest memory refs go through double lookup
      each phys addr in guest pagetable translated through VMM's pagetable
    thus guest can directly modify its page table w/o VMM having to shadow it
      no need for VMM to write-protect guest pagetables
      no need for VMM to track %cr3 changes
    and VMM can ensure guest uses only its own memory
      only map guest's memory in VMM page table
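
The double lookup can be sketched as below, using flat dicts keyed by page address. This is a simplification: the hardware walks multi-level tables, and the function and exception names are hypothetical.

```python
PAGE_MASK = 0xFFF

class GuestPageFault(Exception):
    """No mapping in the guest's page table: delivered to the guest kernel."""

class VMMFault(Exception):
    """Guest-physical address not mapped by the VMM: handled by the VMM."""

def translate(guest_pt, vmm_pt, va):
    """Guest virtual -> guest physical -> real physical."""
    page, offset = va & ~PAGE_MASK, va & PAGE_MASK
    if page not in guest_pt:
        raise GuestPageFault(hex(va))
    guest_pa = guest_pt[page]          # first lookup: guest's own table
    if guest_pa not in vmm_pt:
        raise VMMFault(hex(guest_pa))  # guest named memory it doesn't own
    return vmm_pt[guest_pa] | offset   # second lookup: VMM's table

guest_pt = {0x400000: 0x5000}          # written freely by the guest
vmm_pt = {0x5000: 0x1005000}           # maps only this guest's memory
real = translate(guest_pt, vmm_pt, 0x400123)
```

Because the bounds check lives in the VMM's table, the guest can edit `guest_pt` without trapping and still cannot reach memory the VMM has not mapped for it.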