Required reading: Disco

what's a virtual machine?
  accurate simulation of a computer (like Bochs)
  directly on the real computer's CPU + memory (faster than Bochs)

why use a VM?
  one computer, multiple operating systems (OSX and Windows)
  development environment (like Bochs)
  better fault isolation: contain break-ins
  manage big machines (allocate CPUs/memory at o/s granularity)
  simplify o/s development (Disco)
  Disco led to VMware, which started modern popularity of VMs

terminology
  guest o/s vs VMM
  virtual state vs real state

want to execute guest instructions on real CPU when possible
  works fine for most instructions
  e.g. add %eax, %ebx
  here there's no separate virtual %eax, just using real registers
  fast!

why not load JOS into a user-space process on Linux and run it?
  _start executes lgdt
  would modify *real* state
  but guest descriptors have DPL=0 -- would crash since CPL=3
  luckily lgdt forbidden when CPL=3 => trap to VMM
  VMM might copy GDT somewhere, fix DPLs, lgdt on shadow copy

this is the basic technique for making VMs
  need to have virtual state != physical state
  trap on instructions that write / read real state
  VMM translates, executes real instruction
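
The lgdt case above can be sketched as a small Python model. This is only an illustration of "virtual state != physical state", not how any real VMM is written: the names Descriptor, VMM, and on_lgdt_trap are hypothetical, and "fix DPLs" is assumed here to mean forcing every shadow descriptor to DPL=3.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Descriptor:
    """Simplified segment descriptor (hypothetical model, not the x86 layout)."""
    base: int
    limit: int
    dpl: int  # descriptor privilege level, 0..3


class VMM:
    def __init__(self):
        self.virtual_gdt = []  # virtual state: the GDT the guest asked for
        self.shadow_gdt = []   # real state: what the real GDTR would point to

    def on_lgdt_trap(self, guest_gdt):
        """Called when the guest's lgdt traps because CPL=3."""
        # Remember what the guest believes it loaded.
        self.virtual_gdt = list(guest_gdt)
        # Copy the table and fix DPLs so the guest can use it at CPL=3
        # (assumption: every descriptor is simply forced to DPL=3).
        self.shadow_gdt = [replace(d, dpl=3) for d in guest_gdt]
        # A real VMM would now execute lgdt on the shadow copy.
        return self.shadow_gdt


vmm = VMM()
guest_gdt = [Descriptor(0, 0xFFFFF, 0), Descriptor(0, 0xFFFFF, 3)]
shadow = vmm.on_lgdt_trap(guest_gdt)
```

The guest's own table is left untouched, so the VMM can still answer questions about the virtual state from virtual_gdt while the CPU uses shadow_gdt.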

how could we virtualize the x86?
  like VMware, Parallels, Microsoft Virtual PC

how shall we give guest illusion of physical memory?
  can't allow direct access, but must look like mem from PA=0..size
  use some range of real phys mem
  real page table will map 0..size to that range

what CPL should we use for guest o/s?
  can't use 0: we can't get traps for e.g. lgdt
  can't use 1 or 2: can then read any page (e.g. VMM's pages)
  so guest o/s AND user programs at CPL=3

what real state do we have to hide (i.e. != virtual state)?
  real physical memory
  CPL (low bits of CS) since it is 3, guest expecting 0
  gdt descriptors (DPL 3, not 0)
  gdtr (pointing to shadow gdt)
  idt descriptors (traps go to VMM, not guest o/s)
  idtr
  pagetable (doesn't map to expected physical addresses)
  %cr3 (points to shadow pagetable)
  control flags: IF etc. in EFLAGS, %cr0, etc.

can we hide the real state from the guest?
  do all instructions that read/write sensitive state cause traps?
  push %cs will show CPL=3, not 0
  sgdt reveals real GDTR
  pushf pushes real IF
    suppose guest turned IF off
    VMM will leave real IF on, just defer interrupts to guest
  popf ignores IF if CPL=3, no trap
    so VMM won't know if guest o/s wants interrupts
  IRET: no ring change (guest kernel and user both at CPL=3)
    so won't restore SS/ESP

how can we cope with non-trapping instructions that reveal real state?
  rewrite guest code, change them to INT 3, which traps
  keep track of original instruction, emulate in VMM
  INT 3 is one byte, so doesn't change code size/layout

how does rewriter know where instruction boundaries are?
  or whether bytes are code or data?
  scan only as executed, since execution reveals instr boundaries
  original start of kernel (making up these instructions):
    entry:
      pushl %ebp
      ...
      popf
      ...
      jnz x
      ...
      jxx y
    x:
      ...
      jxx z
  when VMM first loads guest kernel, translate from entry to first jump
    replace bad instrs (popf) with int3
    replace jump with int3
    then start the guest kernel
  on int3 trap to VMM
    look where the jump could go (now we know the boundaries)
    for each branch, xlate until first jump again
    replace int3 w/ original branch
    re-start
  keep track of what we've translated, so we don't do it again
  indirect jumps?
    same, but probably can't replace with a real jump
    since we're not sure address will be the same next time
    so must take a trap every time
  what if guest reads or writes its own code?
    can't set up PTE for X but no R
    perhaps make CS != DS
    put translated code in CS
    put original code in DS
    write-protect original code pages
    on write trap
      emulate write
      find all jumps to modified code
      re-translate starting at those entry points
  VMware avoids many faults by rewriting w/ VMM code, rather than int3
    often faster than non-VM o/s, e.g. cli vs setting a flag in virt state
    but then code size and fn addresses change
    how does VMware hide e.g. return EIPs?

do we need to rewrite guest user-level code?
  technically yes: SGDT, IF
  but probably not in practice
  user code only does INT, which traps to VMM

how to handle pagetable?
  VMM must modify phys addrs in PTEs
  simple plan:
    trap on writes to %cr3
    copy entire page table to VMM memory: shadow pagetable
    fix the phys addrs in PTEs
    load %cr3 to point to shadow pagetable
  what if guest o/s writes a PTE?
    must immediately be reflected to real pagetable
    VMM must write-protect guest's PTE pages
  what if too slow to scan entire pagetable after every %cr3 load?
    i.e. on every process switch
    could start w/ empty real page table (just VMM mappings)
    look at guest pagetable on demand, driven by page faults
  could cache entire page tables
    guest o/s probably switching among fairly static per-process tables
    VMM could learn where they are in guest memory
  nasty tradeoff
    if you pre-compute/cache, fault for every guest table write
    if lazy, lots of page faults to populate shadow pagetable
  need to reflect dirty/accessed bits back to guest pagetable
    when? how lazy can we be?
    could take a trap on first read and first write of each data page
    could take a trap on guest o/s reads of its page table
  how to guard guest kernel against writes by guest programs?
    both are at CPL=3
    delete kernel PTEs on IRET, re-install on INT?

what shall we do about devices?
  trap INB and OUTB
  DMA addresses are physical, VMM must translate and check
  rarely makes sense for guest to use real device
    want to share w/ other guests
    each guest gets a part of the disk
    each guest looks like a distinct Internet host
    each guest gets an X window
  VMM might mimic some standard ethernet or disk controller
    regardless of actual h/w on host computer
  or guest might run special drivers that jump to VMM
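
Returning to the pagetable discussion above, the lazy shadow-pagetable variant might look roughly like the Python sketch below. It is a toy model under stated assumptions: page tables are flat dicts from virtual page number to physical page number, the guest's "physical" memory is one contiguous machine range starting at guest_base, and dirty/accessed bits are ignored. ShadowPager and its method names are hypothetical.

```python
PAGE = 4096


class ShadowPager:
    def __init__(self, guest_base, guest_size):
        self.guest_base = guest_base  # machine address of guest "physical" 0
        self.guest_size = guest_size  # bytes of memory given to the guest
        self.guest_pt = {}            # guest's table: VPN -> guest PPN
        self.shadow_pt = {}           # real table: VPN -> machine PPN

    def on_cr3_load(self, guest_pt):
        """Guest wrote %cr3 (trapped): switch tables, populate lazily."""
        self.guest_pt = guest_pt
        self.shadow_pt = {}  # start empty; page faults fill it in

    def on_page_fault(self, va):
        """Real page fault: consult the guest's table on demand."""
        vpn = va // PAGE
        if vpn not in self.guest_pt:
            return "forward to guest"  # a genuine guest-level fault
        gppn = self.guest_pt[vpn]
        if gppn * PAGE >= self.guest_size:
            raise ValueError("guest PTE points outside guest memory")
        # Fix the physical address: guest PPN -> machine PPN.
        self.shadow_pt[vpn] = self.guest_base // PAGE + gppn
        return "retry"

    def on_guest_pte_write(self, vpn, gppn):
        """Guest wrote a write-protected PTE page: emulate and invalidate."""
        self.guest_pt[vpn] = gppn
        self.shadow_pt.pop(vpn, None)  # re-derived on the next fault


pager = ShadowPager(guest_base=0x100000, guest_size=16 * PAGE)
pager.on_cr3_load({0: 2})
first = pager.on_page_fault(0x10)
```

This shows the lazy side of the tradeoff: loading %cr3 is cheap, but every first touch of a page costs a fault. A caching VMM would keep shadow_pt per guest table instead of discarding it in on_cr3_load.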

Disco overview
  mid-90s, sparked renewed interest in virtual machines
  designed for Stanford FLASH machine
    board w/ CPU, memory, and router
    MIPS R10000 CPU
    many boards in a 2-d grid
    200 ns local memory time
    900 ns remote memory time
  wanted to avoid huge time required to fix a UNIX for many CPUs
    lots of work to get good performance
    fix every data structure to avoid bouncing among CPUs
  observation: you just want your app to harness all the CPUs
    not directly beneficial to have O/S use them all
  Disco approach:
    run lots of single-CPU O/Ss, one per CPU
    IRIX, commercial O/S from SGI
    run them w/o modification, in virtual machines
    underlying (simple) VMM
    VMM manages memory and devices

Disco memory structure
  split up real RAM among guests
  guest virtual addresses
  guest physical addresses
  machine addresses

R10000 virtual address structure
  in user-mode, only low half of virtual address space
  in supervisor-mode, also some of top half
  in kernel-mode, all
  VMM in kernel, guest o/s in supervisor, guest user in user
  required small changes to IRIX

Disco virtual state
  TLB (to hide machine addresses)
    R10000 has just TLB for va translation, no h/w pagetables
    fault on miss, o/s loads translation
  user / supervisor / kernel state

VMM is driven by faults from guest o/s
  TLB write (VA -> PA)
    map PA to MA, install VA->MA in real TLB
    VMM keeps pmap[PA] -> MA
    also cache PA in l2tlb[VA]
  TLB read
    return PA from l2tlb[VA]
  TLB miss faults
    if VA in l2tlb, directly update TLB
    otherwise forward to guest o/s
  priv instructions that read/write other state (e.g. user/kernel flag)
  guest user system calls
    so it can change virtual user/kernel flag
    and switch to supervisor rather than kernel
  calls from special IRIX device drivers
    they wrote special ones that knew about Disco
  device interrupts
    decide which guest wants the interrupt
    look at its interrupt vectors

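
The TLB handling listed above can be modeled in a few lines of Python. This follows the outline in these notes (pmap[PA] -> MA, l2tlb[VA] -> PA) rather than Disco's actual data structures. The class DiscoVM, its method names, the evict helper, and the use of page numbers as dict keys are all assumptions made for illustration.

```python
class DiscoVM:
    def __init__(self, pmap):
        self.pmap = dict(pmap)  # guest physical page -> machine page
        self.l2tlb = {}         # virtual page -> guest physical page
        self.real_tlb = {}      # virtual page -> machine page (hardware TLB)

    def tlb_write(self, vpn, ppn):
        """Guest o/s tried to install VA -> PA; this traps to the VMM."""
        self.l2tlb[vpn] = ppn                # remember the guest's view
        self.real_tlb[vpn] = self.pmap[ppn]  # install VA -> MA for real

    def tlb_read(self, vpn):
        """Guest reads a TLB entry: show it PA, never MA."""
        return self.l2tlb[vpn]

    def tlb_miss(self, vpn):
        """Hardware TLB miss while the guest is running."""
        if vpn in self.l2tlb:
            # The guest already supplied this mapping; refill without
            # involving the guest o/s at all.
            self.real_tlb[vpn] = self.pmap[self.l2tlb[vpn]]
            return "refilled"
        return "forward to guest"

    def evict(self, vpn):
        """Hypothetical helper: the hardware TLB dropped an entry."""
        self.real_tlb.pop(vpn, None)


vm = DiscoVM({7: 42})
cold = vm.tlb_miss(1)  # nothing cached yet, so the guest must handle it
vm.tlb_write(1, 7)     # guest installs VA page 1 -> PA page 7
```

The point of l2tlb is visible in tlb_miss: after an eviction, the VMM can refill the real TLB itself, which is cheaper than running the guest's miss handler again.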
John Scott Robin, Cynthia E. Irvine. Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor.
Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim. Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In Proceedings of the 2001 Usenix Technical Conference.
Kevin Lawton, Drew Northup. Plex86 Virtual Machine.
Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield. Xen and the Art of Virtualization. SOSP 2003.
Keith Adams, Ole Agesen. A Comparison of Software and Hardware Techniques for x86 Virtualization. ASPLOS 2006.