Required reading: A comparison of software and hardware techniques for x86 virtualization, Keith Adams and Ole Agesen, ASPLOS 2006
what's a virtual machine?
accurate simulation of a computer, e.g. entirely in software (like Bochs)
or run the guest directly on the real computer's CPU + memory (faster than Bochs)
why use a VM?
one computer, multiple operating systems (OSX and Windows)
development environment (like Bochs)
better fault isolation: contain break-ins
manage big machines (allocate CPUs/memory at o/s granularity)
simplify o/s development (Disco)
Disco led to VMware, which started modern popularity of VMs
terminology
guest o/s vs VMM
virtual state vs real state
want to execute guest instructions on real CPU when possible
works fine for most instructions
e.g. add %eax, %ebx
here there's no separate virtual %eax, just using real registers
fast!
why not load JOS into a user-space process on Linux and run it?
_start executes lgdt
would modify *real* state
and guest descriptors have DPL=0 -- would crash since CPL=3
luckily lgdt forbidden when cpl=3 => trap to VMM
VMM might copy GDT somewhere, fix DPLs, lgdt on shadow copy
this is the basic technique for making VMs
need to have virtual state != physical state
trap on instructions that write / read real state
VMM translates, executes real instruction
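The lgdt case above can be sketched as a toy model. This is only an illustration, not how any real VMM is written: the Descriptor and VMM classes and the on_lgdt_trap name are made up for this example, and a descriptor is reduced to base, limit, and DPL.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Descriptor:
    base: int
    limit: int
    dpl: int  # descriptor privilege level, 0..3


class VMM:
    """Toy monitor (hypothetical): virtual GDT state kept apart from real state."""

    def __init__(self):
        self.virtual_gdt = []  # what the guest believes it loaded
        self.real_gdt = []     # shadow copy the real CPU would use

    def on_lgdt_trap(self, guest_gdt):
        # Remember the guest's view so it can be shown back to the guest later.
        self.virtual_gdt = list(guest_gdt)
        # Shadow copy: same segments, but DPL forced to 3, since the guest
        # kernel really runs at CPL=3.
        self.real_gdt = [replace(d, dpl=3) for d in guest_gdt]


guest_gdt = [Descriptor(0, 0xFFFFF, 0), Descriptor(0, 0xFFFFF, 3)]
vmm = VMM()
vmm.on_lgdt_trap(guest_gdt)
```

The point is that virtual state (DPL=0, as the guest wrote it) and real state (DPL=3) differ, and only the VMM ever touches the real one.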
how could we virtualize the x86?
like VMware, Parallels, Microsoft Virtual PC
how shall we give guest illusion of physical memory?
can't allow direct access, but must look like mem from PA=0..size
use some range of real phys mem
real page table will map 0..size to that range
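A minimal sketch of the guest-physical to real-physical mapping, assuming the guest's memory is one contiguous range of real memory as described above. GuestMemory, host_base, and to_host are names invented for this example.

```python
class GuestMemory:
    """Guest sees PA 0..size; real memory backing it starts at host_base."""

    def __init__(self, host_base, size):
        self.host_base = host_base
        self.size = size

    def to_host(self, guest_pa):
        # Anything outside 0..size must never reach real memory.
        if not 0 <= guest_pa < self.size:
            raise ValueError("guest physical address out of range")
        return self.host_base + guest_pa


mem = GuestMemory(host_base=0x4000000, size=16 * 1024 * 1024)
first_page = mem.to_host(0x1000)
```

In a real VMM this translation is not done per access; it is baked into the real page table, which is why the page table has to be hidden from the guest.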
what CPL should we use for guest o/s?
can't use 0: we can't get traps for e.g. lgdt
can't use 1 or 2: paging treats CPL 0-2 alike as supervisor, so guest could then read any page (e.g. VMM's pages)
so guest o/s AND user programs at CPL=3
what real state do we have to hide (i.e. != virtual state)?
real physical memory
CPL (low bits of CS) since it is 3, guest expecting 0
gdt descriptors (DPL 3, not 0)
gdtr (pointing to shadow gdt)
idt descriptors (traps go to VMM, not guest o/s)
idtr
pagetable (doesn't map to expected physical addresses)
%cr3 (points to shadow pagetable)
control flags: IF &c in EFLAGS, %cr0, &c
can we hide the real state from the guest?
do all instructions that read/write sensitive state cause traps?
push %cs will show CPL=3, not 0
sgdt reveals real GDTR
pushf pushes real IF
suppose guest turned IF off
VMM will leave real IF on, just defer interrupts to guest
popf silently ignores IF if CPL=3 (CPL > IOPL), no trap
so VMM won't know if guest o/s wants interrupts
IRET: no ring change, so won't restore SS/ESP
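The popf problem and the deferred-interrupt idea can be modeled in a few lines. This is a simplified sketch: only the IF bit is treated specially (real popf also protects IOPL and other bits), and VirtualInterrupts with its methods is a hypothetical helper, not part of any real VMM.

```python
IF_BIT = 1 << 9  # interrupt-enable flag in EFLAGS


def popf(eflags, popped, cpl, iopl=0):
    """New EFLAGS after popf in protected mode; only IF is modeled specially."""
    if cpl <= iopl:
        return popped  # privileged enough: IF comes from the stack
    # Not privileged enough: old IF is silently kept, and there is no trap.
    return (popped & ~IF_BIT) | (eflags & IF_BIT)


class VirtualInterrupts:
    """VMM keeps the guest's IF as virtual state; real IF stays on."""

    def __init__(self):
        self.virtual_if = True
        self.pending = []    # interrupts held back while virtual IF is off
        self.delivered = []  # interrupts forwarded to the guest

    def set_if(self, enabled):
        # Called when the VMM emulates a guest cli / sti / popf.
        self.virtual_if = enabled
        if enabled:
            self.delivered += self.pending
            self.pending = []

    def on_real_interrupt(self, irq):
        (self.delivered if self.virtual_if else self.pending).append(irq)


# Guest kernel (really at CPL=3) tries to clear IF: nothing happens, no trap.
after = popf(IF_BIT, 0, cpl=3)

virq = VirtualInterrupts()
virq.set_if(False)           # works only if the VMM got to see the cli/popf
virq.on_real_interrupt(14)   # deferred, not delivered
virq.set_if(True)            # now delivered
```

The second half only works if the VMM learns about the guest's cli/popf, which is exactly what the non-trapping popf prevents.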
how can we cope with non-trapping instructions that reveal real state?
rewrite guest code, change them to INT 3, which traps
keep track of original instruction, emulate in VMM
INT 3 is one byte, so doesn't change code size/layout
how does rewriter know where instruction boundaries are?
or whether bytes are code or data?
scan only as executed, since execution reveals instr boundaries
original start of kernel (making up these instructions):
entry:
pushl %ebp
...
popf
...
jnz x
...
jxx y
x:
...
jxx z
when VMM first loads guest kernel, translate from entry to first jump
replace bad instrs (popf) with int3
replace jump with int3
then start the guest kernel
on int3 trap to VMM
look where the jump could go (now we know the boundaries)
for each branch, xlate until first jump again
replace int3 w/ original branch
re-start
keep track of what we've translated, so we don't do it again
indirect jumps?
same, but probably can't replace with a real jump
since we're not sure address will be the same next time
so must take a trap every time
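The scan-as-executed scheme above, as a toy. Assumptions for the sketch: every instruction occupies one address, instructions are (op, arg) tuples, "jcc" is a conditional jump, "jmp*" is an indirect jump, and the Rewriter class and its method names are made up. The set of sensitive ops is just the examples from these notes.

```python
SENSITIVE = {"popf", "pushf", "sgdt"}  # touch real state without trapping
JUMPS = {"jmp", "jcc", "jmp*"}         # direct, conditional, indirect


class Rewriter:
    def __init__(self, code):
        self.code = dict(code)   # addr -> (op, arg); patched in place
        self.saved = {}          # addr -> original instruction behind an int3
        self.translated = set()  # entry points already scanned
        self.restored = set()    # direct jumps already put back

    def translate(self, entry):
        """Scan from entry to the first jump, patching as we go."""
        if entry in self.translated:
            return
        self.translated.add(entry)
        addr = entry
        while addr in self.code and addr not in self.restored:
            op, arg = self.code[addr]
            if op == "int3":
                break  # an earlier scan already covered the rest
            if op in SENSITIVE or op in JUMPS:
                self.saved[addr] = (op, arg)
                self.code[addr] = ("int3", None)
            if op in JUMPS:
                break
            addr += 1

    def on_int3(self, addr, indirect_target=None):
        op, arg = self.saved[addr]
        if op in SENSITIVE:
            return "emulate"  # VMM emulates it against virtual state
        if op == "jmp*":
            # Target may differ next time: translate it, keep trapping.
            self.translate(indirect_target)
            return "emulate"
        targets = [arg, addr + 1] if op == "jcc" else [arg]
        for t in targets:
            self.translate(t)
        self.code[addr] = (op, arg)  # put the real jump back
        del self.saved[addr]
        self.restored.add(addr)
        return "restored"


guest = {0: ("pushl", "%ebp"), 1: ("popf", None), 2: ("jcc", 5),
         3: ("movl", None), 4: ("jmp*", "%eax"), 5: ("addl", None),
         6: ("jmp", 0)}
rw = Rewriter(guest)
rw.translate(0)          # entry up to the first jump
first = rw.on_int3(1)    # popf: emulate every time
second = rw.on_int3(2)   # jcc: translate both successors, restore the jump
```

After these steps the popf is still an int3, the conditional jump at 2 is back to a real jump, and the jumps ending the two newly scanned blocks (4 and 6) are int3 until they are first reached.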
what if guest reads or writes its own code?
can't set up PTE for X but no R
perhaps make CS != DS
put translated code in CS
put original code in DS
write-protect original code pages
on write trap
emulate write
find all jumps to modified code
re-translate starting at those entry points
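One possible bookkeeping scheme for the write-trap steps above, as a sketch only. It simplifies "find all jumps to modified code" to "find translated blocks that contain the modified byte"; TranslationCache and its methods are hypothetical names, and memory is just a dict.

```python
PAGE_SIZE = 4096


class TranslationCache:
    def __init__(self):
        self.blocks = {}        # entry address -> end address (exclusive)
        self.protected = set()  # page numbers of write-protected original code

    def add(self, entry, end):
        self.blocks[entry] = end
        for page in range(entry // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1):
            self.protected.add(page)

    def on_write_fault(self, memory, addr, value):
        memory[addr] = value  # emulate the guest's write
        # Blocks covering the written byte are stale.
        stale = sorted(e for e, end in self.blocks.items() if e <= addr < end)
        for e in stale:
            del self.blocks[e]
        return stale  # entry points to re-translate


tc = TranslationCache()
tc.add(0x1000, 0x1010)
tc.add(0x1010, 0x1020)
memory = {}
stale = tc.on_write_fault(memory, 0x1004, 0x90)
```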
VMware avoids many faults
re-writing w/ VMM code, rather than int3
often faster than non-VM o/s, e.g. cli vs setting a flag in virt state
but then code size and fn addresses change
how does VMware hide e.g. return EIPs?
do we need to rewrite guest user-level code?
technically yes: SGDT, IF
but probably not in practice
user code only does INT, which traps to VMM
how to handle pagetable?
VMM must modify phys addrs in PTEs
simple plan:
trap on writes to %cr3
copy entire page table to VMM memory: shadow pagetable
fix the phys addrs in PTEs
load %cr3 to point to shadow pagetable
what if guest o/s writes a PTE?
must immediately be reflected to real pagetable
VMM must write-protect guest's PTE pages
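The simple plan, sketched with a single-level page table for brevity (real x86 tables have two or more levels). Assumptions: a PTE is a page-aligned guest-physical address ORed with flag bits, guest memory is one contiguous real range starting at host_base, and the function names are invented for this example.

```python
PTE_P = 0x1       # present
PTE_W = 0x2       # writable
FLAG_MASK = 0xFFF  # low 12 bits of a PTE hold flags


def build_shadow(guest_pt, host_base, guest_size):
    """guest_pt: {virtual page number: PTE}. Returns the shadow table."""
    shadow = {}
    for vpn, pte in guest_pt.items():
        if not pte & PTE_P:
            continue
        guest_pa = pte & ~FLAG_MASK
        if guest_pa >= guest_size:
            continue  # outside guest memory: leave unmapped so it faults
        shadow[vpn] = (host_base + guest_pa) | (pte & FLAG_MASK)
    return shadow


def on_guest_pte_write(guest_pt, shadow, vpn, new_pte, host_base, guest_size):
    """Write fault on a protected guest PTE page: emulate, then resync."""
    guest_pt[vpn] = new_pte
    shadow.pop(vpn, None)
    shadow.update(build_shadow({vpn: new_pte}, host_base, guest_size))


HOST_BASE, GUEST_SIZE = 0x4000000, 0x8000
guest_pt = {0: 0x0000 | PTE_P | PTE_W,  # ordinary mapping
            1: 0x5000 | PTE_P,          # read-only mapping
            2: 0x9000 | PTE_P,          # beyond guest memory
            3: 0x2000}                  # not present
shadow = build_shadow(guest_pt, HOST_BASE, GUEST_SIZE)
on_guest_pte_write(guest_pt, shadow, 3, 0x2000 | PTE_P, HOST_BASE, GUEST_SIZE)
```

The guest's own table is never changed except by the guest; only the shadow carries real physical addresses.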
what if too slow to scan entire pagetable after every %cr3 load?
i.e. on every process switch
could start w/ empty real page table (just VMM mappings)
look at guest pagetable on demand, driven by page faults
could cache entire page tables
guest o/s probably switching among fairly static per-process tables
VMM could learn where they are in guest memory
nasty tradeoff
if you pre-compute/cache, fault for every guest table write
if lazy, lots of page faults to populate shadow pagetable
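The lazy side of the tradeoff, as a sketch under the same simplifications as before (single-level table, contiguous guest memory). LazyShadow and its fault counter are hypothetical; the counter is there only to make the cost visible.

```python
PTE_P = 0x1
FLAG_MASK = 0xFFF


class LazyShadow:
    def __init__(self, host_base):
        self.host_base = host_base
        self.guest_pt = {}
        self.shadow = {}
        self.faults = 0

    def load_cr3(self, guest_pt):
        # Trap on %cr3 write: start from an empty shadow table
        # (the VMM's own mappings are omitted in this toy).
        self.guest_pt = guest_pt
        self.shadow = {}

    def access(self, vpn):
        if vpn not in self.shadow:
            self.faults += 1  # page fault into the VMM
            pte = self.guest_pt.get(vpn, 0)
            if not pte & PTE_P:
                return None   # really unmapped: fault goes to the guest
            self.shadow[vpn] = (self.host_base + (pte & ~FLAG_MASK)) | (pte & FLAG_MASK)
        return self.shadow[vpn]


guest_pt = {0: 0x1000 | PTE_P, 1: 0x2000 | PTE_P}
ls = LazyShadow(host_base=0x4000000)
ls.load_cr3(guest_pt)
ls.access(0)
ls.access(0)             # second touch: no fault
ls.access(1)
faults_before_switch = ls.faults
ls.load_cr3(guest_pt)    # e.g. process switch
ls.access(0)             # faults again: shadow was thrown away
```

Caching shadow tables across %cr3 loads would avoid the repeated faults, at the price of write-protecting the guest's tables to keep the cache correct.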
need to reflect dirty/accessed bits back to guest pagetable
when? how lazy can we be?
could take a trap on first read and first write of each data page
could take a trap on guest o/s reads of its page table
how to guard guest kernel against writes by guest programs?
both are at CPL=3
delete kernel PTEs on IRET, re-install on INT?
what shall we do about devices?
trap INB and OUTB
DMA addresses are physical, VMM must translate and check
rarely makes sense for guest to use real device
want to share w/ other guests
each guest gets a part of the disk
each guest looks like a distinct Internet host
each guest gets an X window
VMM might mimic some standard ethernet or disk controller
regardless of actual h/w on host computer
or guest might run special drivers that jump to VMM
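The DMA translate-and-check step mentioned above might look like this when the VMM emulates a device. A sketch only: it assumes the guest's memory is one contiguous real range, and translate_dma and its parameters are made-up names.

```python
def translate_dma(guest_pa, length, host_base, guest_size):
    """Turn a guest-supplied DMA address into a real one, or refuse."""
    # The whole buffer must lie inside the guest's memory, otherwise the
    # device could read or write VMM or other guests' pages.
    if length <= 0 or guest_pa < 0 or guest_pa + length > guest_size:
        raise PermissionError("DMA outside guest memory")
    return host_base + guest_pa


HOST_BASE, GUEST_SIZE = 0x4000000, 0x800000
real_addr = translate_dma(0x1000, 512, HOST_BASE, GUEST_SIZE)
```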
John Scott Robin, Cynthia E. Irvine. Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor.
Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim. Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In Proceedings of the 2001 Usenix Technical Conference.
Kevin Lawton, Drew Northup. Plex86 Virtual Machine.
Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield. Xen and the Art of Virtualization. SOSP 2003.