Required reading: A comparison of software and hardware techniques for x86 virtualization, Keith Adams and Ole Agesen, ASPLOS 2006
what's a virtual machine?
  simulation of a computer
  running as an application on a host computer
  accurate
  isolated
  fast

why use a VM?
  one computer, multiple operating systems (OSX and Windows)
  manage big machines (allocate CPUs/memory at o/s granularity)
  kernel development environment (like qemu)
  better fault isolation: contain break-ins

how accurate do we need?
  handle weird quirks of operating system kernels
  reproduce bugs exactly
  handle malicious software
    cannot let guest break out of virtual machine!
  usual goal:
    impossible for guest to distinguish VM from real computer
    impossible for guest to escape its VM
  some VMs compromise, require guest kernel modifications

VMs are an old idea
  1960s: IBM used VMs to share big machines
  1990s: VMware re-popularized VMs, for x86 hardware

terminology
  [diagram: h/w, VMM, VMs..]
  VMM ("host")
  guest: kernel, user programs
  VMM might run in a host O/S, e.g. OSX
    or VMM might be stand-alone

VMM responsibilities
  divide memory among guests
  time-share CPU among guests
  simulate per-guest virtual disk, network
    really e.g. slice of real disk

why not simulation?
  VMM interpret each guest instruction
  maintain virtual machine state for each guest
    eflags, %cr3, &c
  much too slow!

idea: execute guest instructions on real CPU when possible
  works fine for most instructions
  e.g. add %eax, %ebx
  how to prevent guest from modifying e.g. %cr3 and wrecking the VMM?
idea: run each guest kernel at CPL=3
  ordinary instructions work fine
  writing %cr3 will trap to VMM
  VMM can examine guest's page table
    detect any attempt to get at non-guest physical memory
    perhaps modify page table before installing in h/w %cr3
  "trap-and-emulate"

VMM hides real machine from guests
  virtual vs real hardware state:
    "virtual" %cr3: set by guest
    "real" %cr3: managed by VMM
  also machine-defined data structures:
    virtual page table
    real page table (often called "shadow")
  VMM must cause guest to see only virtual CPU state
    and completely hide/protect real state

trap-and-emulate is tricky on an x86
  not all privileged instructions trap at CPL=3
  all those traps can be slow
  VMM must see PTE writes, which don't use privileged instructions

what real x86 state do we have to hide (i.e. != virtual state)?
  physical memory
  CPL (low bits of CS) since it is 3, guest expecting 0
  gdt descriptors (DPL 3, not 0)
  gdtr (pointing to shadow gdt)
  idt descriptors (traps go to VMM, not guest kernel)
  idtr
  pagetable (doesn't map to expected physical addresses)
  %cr3 (points to shadow pagetable)
  IF in EFLAGS
  %cr0 &c

how shall we give guest illusion of physical memory?
  guest wants to start at PA=0, use all "installed" DRAM
  VMM must support many guests, they can't all really use PA=0
  VMM must protect one guest's memory from other guests
  idea:
    claim DRAM size is smaller than real DRAM
    ensure paging is enabled
    rewrite guest's pagetable PTEs
      map PA in each PTE
  example:
    VMM allocates a guest phys mem 0x1000000 to 0x2000000
    VMM gets trap if guest changes %cr3 (since guest kernel at CPL=3)
    VMM copies guest's pagetable to "shadow" pagetable
    VMM adds 0x1000000 to each PA in shadow table
    VMM checks that each PA is < 0x2000000
  VMM must copy the guest's pagetable
    so guest doesn't see VMM's modifications to PAs

also shadow the GDT, IDT
  real IDT refers to VMM's trap entry points
    VMM can forward to guest kernel if needed
    VMM may also fake interrupts from virtual disk
  real GDT allows execution of guest kernel by CPL=3

note we rely on h/w trapping to VMM if guest writes %cr3, gdtr, &c
  do we also need a trap if guest *read*s?

do all instructions that read/write sensitive state cause traps at CPL=3?
  push %cs will show CPL=3, not 0
  sgdt reveals real GDTR
  pushf pushes real IF
    suppose guest turned IF off
    VMM will leave real IF on, just postpone interrupts to guest
  popf ignores IF if CPL=3, no trap
    so VMM won't know if guest kernel wants interrupts
  IRET: no ring change so won't restore SS/ESP

how can we cope with non-trapping instructions that reveal real state?
  rewrite guest code, change them to INT 3, which traps
  keep track of original instruction, emulate in VMM
  INT 3 is one byte, so doesn't change code size/layout
  this is a simplified version of the paper's Binary Translation

how does rewriter know where instruction boundaries are?
  or whether bytes are code or data?
  can VMM look at symbol table for function entry points?
  idea: scan only as executed, since execution reveals instr boundaries
  original start of kernel (making up these instructions):
    entry:
      pushl %ebp
      ...
      popf
      ...
      jnz x
      ...
      jxx y
    x:
      ...
      jxx z
  when VMM first loads guest kernel, rewrite from entry to first jump
    replace bad instrs (popf) with int3
    replace jump with int3
    then start the guest kernel
  on int3 trap to VMM
    look where the jump could go (now we know the boundaries)
    for each branch, xlate until first jump again
    replace int3 w/ original branch
    re-start
  keep track of what we've rewritten, so we don't do it again

indirect calls/jumps?
  same, but can't replace int3 with the original jump
  since we're not sure address will be the same next time
  so must take a trap every time

ret (function return)?
  == indirect jump via ptr on stack
  can't assume that ret PC on stack is from a call
  so must take a trap every time. slow!

what if guest reads or writes its own code?
  can't let guest see int3
  must re-rewrite any code the guest modifies
  can we use page protections to trap and emulate reads/writes?
    no: can't set up PTE for X but no R
  perhaps make CS != DS
    put rewritten code in CS
    put original code in DS
    write-protect original code pages
  on write trap
    emulate write
    re-rewrite if already rewritten
    tricky: must find first instruction boundary in overwritten code

do we need to rewrite guest user-level code?
  technically yes: SGDT, IF
  but probably not in practice
  user code only does INT, which traps to VMM

how to handle pagetable?
  remember VMM keeps shadow pagetable w/ different PAs in PTEs
  what if guest kernel writes a PTE?
    no trap from %cr3 write...
  idea: VMM can write-protect guest's PTE pages
    trap on PTE write, emulate, also in shadow pagetable
  what if guest writes %cr3 often, during context switches?
    does VMM have to scan the new page table, modify all PTEs?
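To make that last question concrete, here is a minimal sketch of the eager approach from the earlier 0x1000000 to 0x2000000 example: on each %cr3 write the VMM copies every guest PTE into a shadow table, relocating and bounds-checking each PA. The flat dict page table and the names build_shadow, GUEST_BASE and GUEST_LIMIT are assumptions made for illustration, not the paper's or any real VMM's data structures.

```python
# Hypothetical constants taken from the example range in these notes.
GUEST_BASE = 0x1000000   # host PA where this guest's memory starts
GUEST_LIMIT = 0x2000000  # host PA just past the end of this guest's memory


def build_shadow(guest_pt):
    """Eagerly build a shadow page table from a guest page table.

    guest_pt maps virtual page number -> guest physical address.
    A real x86 page table is a multi-level tree; a flat dict is an
    assumption that keeps the sketch short.
    """
    shadow = {}
    for vpn, guest_pa in guest_pt.items():
        host_pa = guest_pa + GUEST_BASE          # relocate into the guest's slice
        if GUEST_BASE <= host_pa < GUEST_LIMIT:  # refuse PAs outside the slice
            shadow[vpn] = host_pa
        # out-of-range entries are simply left unmapped in the shadow table
    return shadow


guest_pt = {0: 0x0000, 1: 0x5000, 2: 0x1000000}  # last entry is out of range
shadow = build_shadow(guest_pt)
print({vpn: hex(pa) for vpn, pa in shadow.items()})
```

The loop touches every PTE, so in this form the whole cost is paid again on every guest context switch.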
idea: lazy population of shadow page table
  start w/ empty real page table (just VMM mappings)
  so guest will generate many page faults after it loads %cr3
  VMM page fault handler just copies needed PTE to shadow pagetable
  restarts guest, no guest-visible page fault

guest probably switches among same set of page tables over and over
  as it context-switches among running processes
  idea: VMM could cache multiple shadow page tables
    cache indexed by address of guest pagetable
    start with pre-populated page table on guest %cr3 write
    would make context switch much faster

how to guard guest kernel against writes by guest programs?
  both are at CPL=3
  delete kernel PTEs on IRET, re-install on INT?

how to handle devices?
  trap INB and OUTB
  DMA addresses are physical, VMM must translate and check
  rarely makes sense for guest to use real device
    want to share w/ other guests
    each guest gets a part of the disk
    each guest looks like a distinct Internet host
    each guest gets an X window
  VMM might mimic some standard ethernet or disk controller
    regardless of actual h/w on host computer
  or guest might run special drivers that jump to VMM

VMware avoids many faults by re-writing w/ VMM code, rather than int3
  often faster than non-VM kernel, e.g. cli vs setting a flag in virt state
  but then code size and fn addresses change
  how does VMware hide e.g. return EIPs?

VMware supports Binary Translation (see paper)
  int3 can be expensive: every fn return?
  VMware allows translations that increase code size
    so actually executes translated code at different address
    indirect and function return pointers are different
    variables/stack hold virtual pointers, that guest expects
    translated code maps indirect pointers before call/ret

examples of clever BT translations?
  don't trap: directly r/w VMM data structures for e.g.
    instrs that read/write EFLAGS
    via %gs segment register, which points to high address
    BT detects/rewrites guest use of %gs
    and %ds bound prevents non-%gs access to VMM memory
  adaptive PTE update handling
    can detect instructions that often write PTEs
    have them directly modify shadow PTE
    also avoid page-fault trap

Intel/AMD hardware support for virtual machines
  has made it much easier to implement a VMM w/ reasonable performance
  h/w itself directly maintains per-guest virtual state
    CS (w/ CPL), EFLAGS, idtr, &c
  h/w knows it is in "guest mode"
    instructions directly modify virtual state
    avoids lots of traps to VMM
  h/w basically adds a new priv level
    VMM mode, CPL=0, ..., CPL=3
    guest-mode CPL=0 is not fully privileged
  no traps to VMM on system calls
    h/w handles CPL transition
  what about memory, pagetables?
    h/w supports *two* page tables
    guest page table
    VMM's page table
    guest memory refs go through double lookup
      each phys addr in guest pagetable translated through VMM's pagetable
    thus guest can directly modify its page table w/o VMM having to shadow it
      no need for VMM to write-protect guest pagetables
      no need for VMM to track %cr3 changes
    and VMM can ensure guest uses only its own memory
      only map guest's memory in VMM page table
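The double lookup can be sketched as follows. This is a minimal illustration, assuming flat dict page tables indexed by page number and made-up names (translate, GuestPageFault, VMMFault). Real hardware walks multi-level tables and also sends each guest page-table pointer through the VMM's table, which this sketch leaves out.

```python
PAGE_SIZE = 4096


class GuestPageFault(Exception):
    """Guest page table has no mapping: a fault the guest kernel handles."""


class VMMFault(Exception):
    """VMM page table has no mapping: the guest touched memory it doesn't own."""


def translate(va, guest_pt, vmm_pt):
    """Translate a guest virtual address to a host physical address.

    guest_pt: guest virtual page number  -> guest physical page number
    vmm_pt:   guest physical page number -> host physical page number
    (both are hypothetical flat tables, not the real hardware format)
    """
    vpn, offset = divmod(va, PAGE_SIZE)
    if vpn not in guest_pt:
        raise GuestPageFault(hex(va))
    guest_ppn = guest_pt[vpn]        # first lookup: guest's own table
    if guest_ppn not in vmm_pt:
        raise VMMFault(hex(va))
    host_ppn = vmm_pt[guest_ppn]     # second lookup: VMM's table
    return host_ppn * PAGE_SIZE + offset


# The VMM maps only this guest's memory: guest physical pages 0..3.
vmm_pt = {gppn: 0x1000 + gppn for gppn in range(4)}
guest_pt = {0: 3}

host_pa = translate(0x10, guest_pt, vmm_pt)

# The guest edits its own table with no trap; the VMM's table still confines it.
guest_pt[0] = 9
try:
    translate(0x10, guest_pt, vmm_pt)
    escaped = True
except VMMFault:
    escaped = False
```

Because confinement lives entirely in vmm_pt, the guest's write to guest_pt needs no trap and no shadow copy, which is the point of the last few lines of the notes.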