Virtualizing Processors

Required reading: Disco

Overview

What is a virtual machine? IBM definition: a fully protected and isolated copy of the underlying machine's hardware.

What's the basic idea behind virtualization? Having the CPU do most of the work natively and making the programs running in the virtual machine think that they are running on a real machine.

Virtual machines can be useful for a number of reasons:

Fault isolation: like processes on a machine but more generic.
Run multiple operating systems on a single piece of hardware:

Run "older" programs on the same hardware (e.g. how win16/dos programs used to run).
Or run applications that require a different operating system.

Customizing the apparent hardware: a virtual machine may have a different view of the hardware than is physically present.
Simplify deployment/development of software for scalable processors. (e.g. Disco)

What are the alternatives? Processor emulation (e.g. bochs) or binary emulation (WINE). Processor emulation runs every instruction purely in software, whereas virtualization gets out of the way whenever possible; emulation therefore gives portability while virtualization focuses on performance. The cost of emulation is that you need to model the hardware very carefully in software. Binary emulation focuses on providing just the system call interface of a particular operating system. It can be hard because it is targeted at one operating system's interface (and even that can change between revisions).

What needs to be virtualized? What might that entail?

CPU: instructions -- trap all privileged instructions
Memory: address spaces -- map "virtual physical" pages to machine pages, handle translation, etc.
Devices: any I/O communication needs to be trapped and passed through/handled appropriately.

Types of virtualization

Run VMM directly on hardware: like Disco.
Run VMM as an application (though still running as root, with integration into the OS) in a host OS: like VMware. The host OS's drivers give the VMM support for additional hardware at low development cost. The VMM intercepts CPU-level I/O requests and translates them into system calls (e.g. read()).

Single vs multiprocessor

Multiprocessor has additional memory issues.
Most devices need to be exclusive to one VM.

Virtualization in detail

Memory virtualization

Understanding memory virtualization. Let's consider the MIPS example from the paper. Ideally, we'd be able to intercept and rewrite all memory address references (e.g. by intercepting virtual memory calls). Why can't we do this on the MIPS? (There are addresses that don't go through address translation --- but we don't want the virtual machine to directly access memory!) What does Disco do to get around this problem? (Relink the kernel so that it runs outside this unmapped address range, in a mapped region.)

Having gotten around that problem, how do we handle things in general?

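// Disco's TLB miss handler.
// Called when a memory reference for virtual address
// 'va' is made, but there is no va->ma (virtual -> machine)
// mapping in the CPU's TLB.
void tlb_miss_handler (va)
{
  // See if we have a mapping in our "shadow" TLB (which includes
  // the "main" TLB).
  tlb_entry *t = tlb_lookup (thiscpu->l2tlb, va);
  if (t) {
    // l2tlb entries hold the guest's "physical" address, so
    // translate through pmap before loading the hardware TLB.
    tlbwrite (va, thiscpu->pmap[t->pa], t->otherdata);
  } else {
    // Trap to the virtual CPU/OS's handler.
    // (deliver_tlb_miss_to_guest is a placeholder name.)
    deliver_tlb_miss_to_guest (thiscpu, va);
  }
}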

// Disco's procedure which emulates the MIPS
// instruction which writes to the TLB.
//
// va -- virtual address
// pa -- physical address (NOT ma, the machine address!)
// otherdata -- perms and stuff
void emulate_tlbwrite_instruction (va, pa, otherdata)
{
  tlb_insert (thiscpu->l2tlb, va, pa, otherdata); // cache
  if (!defined (thiscpu->pmap[pa])) { // fill in pmap dynamically
    ma = allocate_machine_page ();
    thiscpu->pmap[pa] = ma; // See 4.2.2
    thiscpu->pmapbackmap[ma] = pa;
    thiscpu->memmap[ma] = va; // See 4.2.3 (for TLB shootdowns)
  }
  // The hardware TLB gets the machine address, never the guest's pa.
  tlbwrite (va, thiscpu->pmap[pa], otherdata);
}

// Disco's procedure which emulates the MIPS
// instruction which reads the TLB.
tlb_entry *emulate_tlbread_instruction (va)
{
  // Must return a TLB entry that has a "physical" address;
  // this is recorded in our secondary TLB cache.
  // (We don't have to read from the hardware TLB since
  // all writes to the hardware TLB are mediated by Disco.
  // Thus we can always keep the l2tlb up to date.)
  return tlb_lookup (thiscpu->l2tlb, va);
}
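The three routines above can be tried out in a small simulation. The following is only a sketch under simplifying assumptions: addresses are page numbers, the hardware TLB is an ordinary array, and machine pages come from a bump allocator. The struct layout and the names vcpu_init, tlb_miss and hw_evict are illustrative and do not come from Disco.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NPAGES   16            /* pages per address space in this toy model */
#define UNMAPPED UINT32_MAX

/* Per-virtual-CPU state kept by the monitor (illustrative layout). */
struct vcpu {
    uint32_t l2tlb[NPAGES];    /* guest view:   virtual page  -> "physical" page */
    uint32_t pmap[NPAGES];     /* monitor view: physical page -> machine page    */
    uint32_t hwtlb[NPAGES];    /* stand-in for the real TLB: virtual -> machine  */
    uint32_t next_free;        /* bump allocator for machine pages               */
};

void vcpu_init(struct vcpu *c)
{
    /* 0xff in every byte makes every entry equal to UNMAPPED. */
    memset(c->l2tlb, 0xff, sizeof c->l2tlb);
    memset(c->pmap,  0xff, sizeof c->pmap);
    memset(c->hwtlb, 0xff, sizeof c->hwtlb);
    c->next_free = 100;        /* start at 100 so machine pages look distinct */
}

/* Guest executes a TLB write for vpn -> ppn. */
void emulate_tlbwrite(struct vcpu *c, uint32_t vpn, uint32_t ppn)
{
    assert(vpn < NPAGES && ppn < NPAGES);
    c->l2tlb[vpn] = ppn;                   /* remember what the guest asked for */
    if (c->pmap[ppn] == UNMAPPED)          /* fill in pmap lazily               */
        c->pmap[ppn] = c->next_free++;
    c->hwtlb[vpn] = c->pmap[ppn];          /* hardware sees the machine page    */
}

/* Guest reads a TLB entry back: it must see its own physical page number. */
uint32_t emulate_tlbread(const struct vcpu *c, uint32_t vpn)
{
    assert(vpn < NPAGES);
    return c->l2tlb[vpn];
}

/* Hardware TLB miss. Returns 1 if refilled from l2tlb,
   0 if the miss has to be forwarded to the guest OS. */
int tlb_miss(struct vcpu *c, uint32_t vpn)
{
    assert(vpn < NPAGES);
    uint32_t ppn = c->l2tlb[vpn];
    if (ppn == UNMAPPED)
        return 0;
    c->hwtlb[vpn] = c->pmap[ppn];
    return 1;
}

/* Simulate the hardware TLB dropping an entry. */
void hw_evict(struct vcpu *c, uint32_t vpn)
{
    assert(vpn < NPAGES);
    c->hwtlb[vpn] = UNMAPPED;
}
```

Mapping two virtual pages to the same physical page yields the same machine page in hwtlb, while emulate_tlbread keeps returning the physical page number the guest wrote.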

On the x86, where the hardware walks the page tables itself, the VMM must intercept any modifications to the page table and substitute appropriate responses. It must also keep things like the accessed/dirty bits up to date.

CPU virtualization

Requirements:

The method of executing non-privileged instructions in privileged and user mode must be roughly equivalent. (Why? Because the virtual "privileged" system will not be running in true "privileged" mode.)
There must be a way to protect the real machine (and the VMM) from the VM. (Some sort of memory protection/address translation. For fault isolation.)
There must be a way to detect and transfer control to the VMM when the VM tries to execute a sensitive instruction (e.g. a privileged instruction, or one that could expose the "virtualness" of the VM). It must be possible to emulate these instructions in software. Architectures can be classified as completely virtualizable (i.e. there are protection mechanisms that cause traps for all such instructions), partly virtualizable (insufficient or incomplete trap mechanisms), or not virtualizable at all (e.g. no MMU).

The MIPS didn't quite meet the second criterion, as discussed above. But it does have a supervisor mode, between user mode and kernel mode, in which any privileged instruction will trap.

What might the VMM trap handler look like?

void privilege_trap_handler (addr) {
  instruction, args = decode_instruction (addr)
  switch (instruction) {
  case foo:
    emulate_foo (thiscpu, args, ...);
    break;
  case bar:
    emulate_bar (thiscpu, args, ...);
    break;
  case ...:
    ...
  }
}

The emulate_foo routines will have to evaluate the state of the virtual CPU and compute the appropriate "fake" answer.

What sort of state is needed in order to appropriately emulate all of these things?

- all user registers
- CPU specific regs (e.g. on x86, %crN, debugging, FP...)
- page tables (or tlb)
- interrupt tables

This is needed for each virtual processor.
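To make the per-processor state and the trap handler concrete, here is a minimal sketch for a made-up instruction set. The four toy opcodes, the field names, and the choice to reflect traps from virtual user mode back to the guest kernel are all assumptions for illustration; a real monitor would decode actual machine instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Toy privileged instructions (hypothetical, not a real ISA). */
enum op { OP_CLI, OP_STI, OP_READ_STATUS, OP_LOAD_PTBASE };

struct insn { enum op op; uint32_t arg; };

/* State the monitor keeps for each virtual processor. */
struct vcpu {
    uint32_t regs[8];       /* user-visible registers              */
    uint32_t pc;            /* guest program counter               */
    uint32_t epc;           /* pc saved when a trap is reflected   */
    int      kernel_mode;   /* virtual privilege level             */
    int      intr_enabled;  /* virtual interrupt-enable flag       */
    uint32_t ptbase;        /* page table base as the guest set it */
    uint32_t trap_vector;   /* guest kernel's fault handler        */
};

enum result { EMULATED, REFLECTED };

/* Called when the real CPU traps on a privileged instruction. */
enum result privilege_trap_handler(struct vcpu *c, struct insn in)
{
    if (!c->kernel_mode) {
        /* Guest user code tried something privileged: the guest
           kernel, not the monitor, should see this fault. */
        c->epc = c->pc;
        c->kernel_mode = 1;
        c->pc = c->trap_vector;
        return REFLECTED;
    }
    switch (in.op) {
    case OP_CLI:
        c->intr_enabled = 0;            /* only the virtual flag changes */
        break;
    case OP_STI:
        c->intr_enabled = 1;
        break;
    case OP_READ_STATUS:
        assert(in.arg < 8);
        /* Compute the "fake" answer from virtual state. */
        c->regs[in.arg] = ((uint32_t)c->kernel_mode << 1)
                        | (uint32_t)c->intr_enabled;
        break;
    case OP_LOAD_PTBASE:
        c->ptbase = in.arg;             /* a real VMM would also switch shadow tables */
        break;
    }
    c->pc += 4;                         /* skip the emulated instruction */
    return EMULATED;
}
```

The point of the sketch is that none of the emulation routines touch the real interrupt flag or the real privilege level; they only read and update fields of the virtual CPU.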

What about the x86? We know that it meets the first two criteria above. If you run the guest in ring 3, most x86 instructions will be fine. Privileged instructions, such as loading %cr3, trap and can be emulated:

// addr is a physical address
void emulate_lcr3 (thiscpu, addr)
{
  thiscpu->cr3 = addr;
  Pte *fakepdir = lookup (oldcr3cache, addr);
  if (!fakepdir) {
    fakepdir = ppage_alloc ();
    store (oldcr3cache, addr, fakepdir);
    // May wish to scan through supplied page directory to see if
    // we have to fix up anything in particular.
    // Exact settings will depend on how we want to handle
    // problem cases below and our own MM.
  }
  // Load the real %cr3 with our page directory, not the guest's.
  asm volatile ("movl %0,%%cr3" : : "r" (fakepdir));
  // Must make sure our page fault handler is in sync with what we do here.
}

However, there are some instructions that are bad. Examples?

pushf/popf: FL_IF is handled differently, for example. In ring 3, popf silently leaves FL_IF unchanged instead of trapping.
Anything (push, pop, mov) that reads or writes %cs.
And some others... (17 instructions in total).

They are unprivileged instructions that read or write the processor state without trapping. These could reveal details of virtualization that should not be revealed. How can we get around this?

Basic idea is to decode the instruction stream that is provided by the user and look for bad instructions. When we find them, replace them with an interrupt (INT 3) that will allow the VMM to handle it correctly. This might look something like:

void initcode () {
  scan_for_nonvirtualizable (thiscpu, 0x7c00);
}

void scan_for_nonvirtualizable (thiscpu, startaddr) {
  addr  = startaddr;
  instr = disassemble (addr);
  while (instr is not branch or bad) {
    addr += len (instr);
    instr = disassemble (addr);
  }
  // Remember the instruction we are about to overwrite,
  // then plant the breakpoint in its place.
  record (thiscpu->rewrites, addr, instr);
  replace (addr, "int 3");
}

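void breakpoint_handler (tf) {
  // int 3 is a one-byte trap: the saved eip points just past it.
  addr = tf->eip - 1;
  oldinstr = lookup (thiscpu->rewrites, addr);
  if (oldinstr is branch) {
    newcs:neweip = evaluate branch
    scan_for_nonvirtualizable (thiscpu, newcs:neweip);
    tf->cs:tf->eip = newcs:neweip; // resume the guest at the branch target
    return;
  } else { // something non virtualizable
    // dispatch to appropriate emulation
  }
}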

All pages must be scanned in this way. Fortunately, most pages probably are okay and don't really need any special handling so after scanning them once, we can just remember that the page is okay and let it run natively.
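The scan-and-patch loop can be exercised on a byte array. The sketch below assumes a toy decoder that knows only four x86 opcodes (nop 0x90, pushf 0x9C, jmp rel8 0xEB, int 3 0xCC) and treats addresses as offsets into one code buffer. The rewrite table and the function signatures are illustrative, and the actual emulation of pushf is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_NOP = 0x90, OP_PUSHF = 0x9C, OP_JMP8 = 0xEB, OP_INT3 = 0xCC };
#define MAX_REWRITES 16

struct rewrite { size_t addr; uint8_t orig; };
struct vcpu { struct rewrite rewrites[MAX_REWRITES]; size_t nrewrites; };

/* Instruction length for the tiny subset this toy decoder understands. */
static size_t insn_len(uint8_t op) { return op == OP_JMP8 ? 2 : 1; }

/* Branches and non-virtualizable instructions both end a scan. */
static int stops_scan(uint8_t op) { return op == OP_PUSHF || op == OP_JMP8; }

/* Scan forward from start; patch the first branch or bad instruction
   with int 3. Returns the address where the scan stopped. */
size_t scan_for_nonvirtualizable(struct vcpu *c, uint8_t *code,
                                 size_t len, size_t start)
{
    size_t addr = start;
    while (addr < len && !stops_scan(code[addr]) && code[addr] != OP_INT3)
        addr += insn_len(code[addr]);
    if (addr >= len || code[addr] == OP_INT3)
        return addr;                       /* end of code, or already patched */
    assert(c->nrewrites < MAX_REWRITES);
    c->rewrites[c->nrewrites++] = (struct rewrite){ addr, code[addr] };
    code[addr] = OP_INT3;
    return addr;
}

/* Original opcode at a patched address, or -1 if none was recorded. */
int lookup_rewrite(const struct vcpu *c, size_t addr)
{
    for (size_t i = 0; i < c->nrewrites; i++)
        if (c->rewrites[i].addr == addr)
            return c->rewrites[i].orig;
    return -1;
}

/* Breakpoint hit at bp_addr (the address of the int 3 byte itself).
   Returns the address at which the guest should resume. */
size_t breakpoint_handler(struct vcpu *c, uint8_t *code,
                          size_t len, size_t bp_addr)
{
    int orig = lookup_rewrite(c, bp_addr);
    assert(orig >= 0);
    size_t next = bp_addr + insn_len((uint8_t)orig);
    if (orig == OP_JMP8) {
        assert(bp_addr + 1 < len);
        next += (size_t)(int8_t)code[bp_addr + 1];  /* signed displacement */
    }
    /* For pushf, a real monitor would emulate it against the
       virtual flags here before continuing. */
    scan_for_nonvirtualizable(c, code, len, next);
    return next;
}
```

Note that only the opcode byte is overwritten, so the branch displacement is still available in the code buffer when the breakpoint fires. Code that the guest never reaches (the nop skipped by the jump, in the test case for this sketch) is never scanned.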

What about writes? We must detect self-modifying code (e.g. we must simulate buffer overflow attacks correctly). When the guest writes to a physical page that contains scanned code, we must trap the write and then rescan the affected portions of the page.

What about self-examining code? It would otherwise see the int 3 bytes we patched in. Need to protect it somehow --- possibly by playing tricks with instruction/data TLB caches, or introducing a private segment for code (%cs) that is different from the segment used for reads/writes (%ds).

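The above can be slow! So sometimes you want the guest operating system to be aware that it is a guest and allow it to avoid the slow path. Two ways to do this are special device drivers, and changing instructions that would cause traps into memory read/write instructions. Disco, for example, patches IRIX so that frequently used privileged register accesses become ordinary loads and stores to a special page maintained by the monitor.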

Device I/O virtualization

We intercept all communication with the I/O devices: reads and writes to reserved memory addresses cause page faults into special handlers, which emulate or pass through the I/O as appropriate.

In a system like Disco, the sequence would look something like:

VM executes instruction to access I/O
Trap generated by CPU (based on memory or privilege protection) transfers control to VMM.
VMM emulates I/O instruction, saving information about where this came from (for demultiplexing the async reply from hardware later).
VMM reschedules a VM.

Interrupts will require some additional work:

Interrupt occurs on real machine, transferring control to VMM handler.
VMM determines the VM that ought to receive this interrupt.
VMM causes a simulated interrupt to occur in the VM, and reschedules a VM.
VM runs its interrupt handler, which may involve other I/O instructions that need to be trapped.
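The bookkeeping behind these two sequences (remember who issued a request, then route the completion interrupt back) can be sketched as a small table. This is a sketch only: the slot table, the pending_irq flag and the function names are assumptions, and the device itself is not modeled.

```c
#include <assert.h>

#define NVMS         4
#define MAX_INFLIGHT 8
#define NO_OWNER     (-1)

struct vm {
    int pending_irq;     /* virtual interrupt waiting to be delivered */
    int last_completed;  /* slot of the most recent completed request */
};

struct vmm {
    struct vm vms[NVMS];
    int owner[MAX_INFLIGHT];   /* request slot -> issuing VM, or NO_OWNER */
};

void vmm_init(struct vmm *m)
{
    for (int i = 0; i < NVMS; i++) {
        m->vms[i].pending_irq = 0;
        m->vms[i].last_completed = -1;
    }
    for (int s = 0; s < MAX_INFLIGHT; s++)
        m->owner[s] = NO_OWNER;
}

/* A VM's I/O access has trapped into the monitor. Record which VM
   issued it; the returned slot is the tag handed to the device.
   Returns -1 if every slot is in use. */
int vmm_start_io(struct vmm *m, int vm_id)
{
    assert(vm_id >= 0 && vm_id < NVMS);
    for (int s = 0; s < MAX_INFLIGHT; s++) {
        if (m->owner[s] == NO_OWNER) {
            m->owner[s] = vm_id;
            return s;
        }
    }
    return -1;
}

/* A real interrupt arrived for the request in slot. Post a simulated
   interrupt to the owning VM and return its id so the scheduler can
   run it; returns -1 for an interrupt nobody is waiting for. */
int vmm_handle_interrupt(struct vmm *m, int slot)
{
    assert(slot >= 0 && slot < MAX_INFLIGHT);
    int vm_id = m->owner[slot];
    if (vm_id == NO_OWNER)
        return -1;
    m->owner[slot] = NO_OWNER;
    m->vms[vm_id].pending_irq = 1;
    m->vms[vm_id].last_completed = slot;
    return vm_id;
}
```

Completions may arrive in any order; the slot number is what ties each interrupt back to the VM that started the request.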

This is more complex in a hosted VMM, since it involves more transitions between different modules. However, it may be easier to code: the VMM can emulate a device by calling write on the appropriate host device and select'ing for reads.

Some Disco paper notes

Disco has some I/O specific optimizations.

Disk reads only need to happen once and can be shared between virtual machines via copy-on-write virtual memory tricks.
Network cards do not need to be fully virtualized --- communication between VMs on the same machine doesn't need a real network card backing it.
Special handling for NFS so that all VMs "share" a buffer cache.
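The copy-on-write disk sharing in the first item can be illustrated with reference counts. The following is a rough sketch, not Disco's data structures: the block cache is a plain array, page contents are not modeled, the cache itself is assumed to hold one reference to each cached page, and each guest physical page is assumed to be filled at most once.

```c
#include <assert.h>

#define NBLOCKS 32   /* disk blocks                         */
#define NMPAGES 64   /* machine pages                       */
#define NPPAGES 16   /* "physical" pages per VM             */
#define NONE    (-1)

struct vmm {
    int block_page[NBLOCKS];  /* disk block -> machine page caching it */
    int refcount[NMPAGES];    /* references to each machine page       */
    int next_page;            /* bump allocator                        */
    int disk_reads;           /* number of real disk reads issued      */
};

struct vm {
    int pmap[NPPAGES];        /* physical page -> machine page */
    int writable[NPPAGES];    /* 0 while mapped read-only      */
};

void vmm_init(struct vmm *m)
{
    for (int b = 0; b < NBLOCKS; b++) m->block_page[b] = NONE;
    for (int p = 0; p < NMPAGES; p++) m->refcount[p] = 0;
    m->next_page = 0;
    m->disk_reads = 0;
}

void vm_init(struct vm *v)
{
    for (int p = 0; p < NPPAGES; p++) { v->pmap[p] = NONE; v->writable[p] = 0; }
}

static int alloc_page(struct vmm *m)
{
    assert(m->next_page < NMPAGES);
    return m->next_page++;
}

/* VM asks to read a disk block into physical page ppn. */
void vm_read_block(struct vmm *m, struct vm *v, int block, int ppn)
{
    assert(block >= 0 && block < NBLOCKS && ppn >= 0 && ppn < NPPAGES);
    assert(v->pmap[ppn] == NONE);          /* simplification: fresh page only */
    if (m->block_page[block] == NONE) {    /* first reader pays for the I/O   */
        m->block_page[block] = alloc_page(m);
        m->refcount[m->block_page[block]] = 1;   /* the cache's reference */
        m->disk_reads++;
    }
    int page = m->block_page[block];
    m->refcount[page]++;
    v->pmap[ppn] = page;                   /* shared, mapped read-only */
    v->writable[ppn] = 0;
}

/* VM wrote to a read-only page: give it a private copy if shared. */
void vm_write_fault(struct vmm *m, struct vm *v, int ppn)
{
    assert(ppn >= 0 && ppn < NPPAGES);
    int old = v->pmap[ppn];
    assert(old != NONE);
    if (!v->writable[ppn] && m->refcount[old] > 1) {
        int copy = alloc_page(m);          /* real code would copy the contents */
        m->refcount[old]--;
        m->refcount[copy] = 1;
        v->pmap[ppn] = copy;
    }
    v->writable[ppn] = 1;
}
```

Two VMs that read the same block end up mapping one machine page after a single disk read; a write by either VM moves only that VM to a private page.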

Disco developers clearly had access to IRIX source code.

Need to deal with the KSEG0 segment of MIPS memory by relinking the kernel at a different address range.
Ensuring page-alignment of network writes (for the purposes of doing memory map tricks).

Performance?

Evaluated in simulation.
Where are the overheads? Where do they come from?
Does it run better than NUMA IRIX?

Related papers

John Scott Robin, Cynthia E. Irvine. Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor.

Jeremy Sugerman, Ganesh Venkitachalam, Beng-Hong Lim. Virtualizing I/O Devices on VMware Workstation's Hosted Virtual Machine Monitor. In Proceedings of the 2001 Usenix Technical Conference.

Kevin Lawton, Drew Northup. Plex86 Virtual Machine.