6.1810 2024 Lecture 18: Virtual Machines, Dune

Read: Dune: Safe User-level Access to Privileged CPU features, Belay et al,
OSDI 2012.

Plan:
  virtual machines
  trap-and-emulate virtualization
  hardware-supported virtualization (Intel VT-x)
  Dune

*** Virtual Machines

what's a virtual machine?
  simulation of a computer, accurate enough to run an O/S

diagram: h/w, host/VMM, guest linux and apps, guest Windows and apps
  VMM might be stand-alone, or
  VMM might run in a host O/S, e.g. Linux

why VMs?
  cloud: many customer guest "instances" on each physical machine
    each customer can run whatever O/S &c they want in their VM
    cloud can share each machine among many customers
  isolation, stronger than e.g. between processes
  migration, replication
  s/w developers:
    virtual "crash" boxes for testing (sound familiar?)

VMs have a long history
  1960s: IBM used VMs to share big expensive machines
  1980s: (computers got small and cheap)
         (then machine rooms got full)
  1990s: VMware re-popularized VMs, for x86 hardware
  2000s: widely used in cloud, enterprise

why look at virtual machines in 6.1810?
  VMMs have much in common with O/S kernels
  some of the most interesting action in O/S design has shifted to VMs
  VMs have affected both O/S (above) and hardware (below)

how accurate must a VM be?
  usual goal is 100% accuracy
    boot any guest O/S without modification
    prevent a malicious guest from breaking out
    guest cannot even detect if in VM!
  in practice, VMM and O/S often cooperate for efficiency
    e.g. VMM offers special disk/net "devices" that guest knows about

we could build a VM by writing software to simulate machine instructions
  VMM interprets each guest instruction
  maintain virtual machine state for the guest
    32 registers, satp, mode, RAM, disk, net, &c
  pro: this works, e.g. qemu
  con: slow
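
A minimal sketch of the interpretation approach, in Python. The instruction
set here (li, add, halt) is a made-up toy, not RISC-V. The point is that each
guest instruction costs a fetch, a decode, and a dispatch in host software.

```python
# Toy interpreter: the "VMM" fetches, decodes, and executes each guest
# instruction in software. The instruction set (li/add/halt) is hypothetical.

def run(program):
    regs = [0] * 32          # virtual machine state: 32 registers
    pc = 0                   # virtual program counter
    steps = 0                # guest instructions interpreted so far
    while True:
        op, *args = program[pc]      # fetch + decode
        steps += 1
        if op == "li":               # li rd, imm
            rd, imm = args
            regs[rd] = imm
        elif op == "add":            # add rd, rs1, rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "halt":
            return regs, steps
        else:
            raise ValueError(f"illegal instruction {op!r}")
        pc += 1

guest = [("li", 1, 40), ("li", 2, 2), ("add", 3, 1, 2), ("halt",)]
regs, steps = run(guest)
print(regs[3], steps)        # prints "42 4"
```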

idea: execute guest instructions directly on the CPU -- fast!
  what if the guest kernel executes a privileged instruction?
    e.g. guest loads a new page table into satp

idea: run the guest kernel in user mode
  similar to running the guest kernel as an xv6 process
  of course the guest kernel assumes it is in supervisor mode
  ordinary instructions work fine
    adding two registers, function call, &c
  privileged RISC-V instructions are illegal in user mode
    will cause a trap, to the VMM
  VMM trap handler emulates privileged instruction
    maybe apply the privileged operation to the "virtual state"
      e.g. read/write sepc
    maybe transform and apply to real hardware
      e.g. assignment to satp
  "trap-and-emulate"
  nice b/c you can build such a virtual machine entirely in software
    perhaps one could turn xv6 into a trap-and-emulate VMM for RISC-V

which guest instructions will trap?
  csrr, csrw, ecall, sret, sfence.vma, ld/st to device memory

what virtual state does a RISC-V trap-and-emulate VMM need to keep?
  all "privileged CPU state"
    CPU state that the guest kernel assumes it can read/write
    but is forbidden by user mode
    and often must differ from "real" state
  mode
  all s* registers (sepc, stvec, scause, satp, &c)
  page table
  PLIC/CLINT
  device hardware

the RISC-V is nice w.r.t. trap-and-emulate virtualization
  all privileged instructions trap if you try to execute them in user mode
  not all CPUs are as nice -- 32-bit x86, for example
    some privileged instructions don't trap; x86 silently ignores them in user mode

for RISC-V trap-and-emulate, what has to happen when:

... guest user code executes ecall to make a system call?
    [diagram: guest user, guest kernel, VMM, virtual state, real sepc]
    CPU traps into the VMM (ecall always generates a trap)
      h/w saves guest's PC in (real) sepc
    VMM trap handler:
      examine the guest instruction
      virtual sepc <- real sepc
      virtual mode <- supervisor
      virtual scause <- 8 "system call"
      real sepc <- virtual stvec
      modify (real) page table -- set PTE_V for non-PTE_U entries
      sret: return from trap (sets real mode to user)

... the guest kernel reads scause, e.g. csrr a0, scause
    trap into VMM (since csrr is a privileged instruction)
    examine the guest instruction
    trapframe a0 <- virtual scause
    real sepc += 4
    return from trap

... the guest kernel executes sret (return to user)?
    CPU traps into the VMM
    VMM trap handler:
      virtual mode <- user
      real sepc <- virtual sepc
      modify (real) page table -- clear PTE_V for non-PTE_U entries
      return from trap
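
The three cases above (ecall, csrr, sret) can be sketched as VMM handlers
operating on per-guest virtual state. This is a simplified Python model, not
xv6 code: the Guest class, its field names, and the kernel_pages_mapped flag
(standing in for the PTE_V edits) are all assumptions made for illustration.

```python
# Sketch of a trap-and-emulate VMM's handlers for ecall, csrr, and sret.
# Class and field names are illustrative assumptions, not xv6 code.

SCAUSE_ECALL_FROM_U = 8

class Guest:
    def __init__(self):
        # virtual privileged state, kept by the VMM
        self.vmode = "user"
        self.vcsr = {"sepc": 0, "stvec": 0, "scause": 0}
        # real hardware state while the guest runs (always real user mode)
        self.real_sepc = 0
        self.kernel_pages_mapped = False   # stands in for the PTE_V edits
        self.trapframe = {"a0": 0}         # guest registers saved by the VMM

def handle_ecall(g):
    # guest user code made a system call; h/w saved its PC in real sepc
    assert g.vmode == "user"
    g.vcsr["sepc"] = g.real_sepc
    g.vcsr["scause"] = SCAUSE_ECALL_FROM_U
    g.vmode = "supervisor"
    g.real_sepc = g.vcsr["stvec"]      # resume at the guest's trap vector
    g.kernel_pages_mapped = True       # expose guest-kernel pages

def handle_csrr(g, rd, csr):
    # guest kernel executed: csrr rd, csr
    # (a real VMM would reflect an illegal-instruction trap to the guest
    #  if virtual mode were user; this sketch just asserts)
    assert g.vmode == "supervisor"
    g.trapframe[rd] = g.vcsr[csr]
    g.real_sepc += 4                   # skip the emulated instruction

def handle_sret(g):
    # guest kernel returns to guest user code
    assert g.vmode == "supervisor"
    g.vmode = "user"
    g.real_sepc = g.vcsr["sepc"]
    g.kernel_pages_mapped = False      # hide guest-kernel pages again

g = Guest()
g.vcsr["stvec"] = 0x80001000     # guest kernel set its trap vector earlier
g.real_sepc = 0x1000             # PC of the ecall in guest user code
handle_ecall(g)
handle_csrr(g, "a0", "scause")   # guest handler asks why it trapped
g.vcsr["sepc"] += 4              # guest kernel's own csrw to skip the ecall
handle_sret(g)
```

Note that the real CPU never leaves user mode while the guest runs; only the
virtual mode changes.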

... the guest kernel writes satp?
  VMM must ensure that guest only accesses its own memory
    and must remap guest physical addresses
  VMM sets up a "shadow" page table derived from guest's page table
  guest's page table:
    guest va -> guest pa
  vmm map for this guest
    guest pa -> host pa
  VMM's "shadow" page table
    guest va -> host pa
  VMM installs the shadow page table in the real satp

... the guest kernel modifies a PTE in the active page table?
  VMM doesn't have to do anything
  RISC-V spec says PTE modifications aren't guaranteed to take effect until sfence.vma
  sfence.vma causes trap to VMM
    VMM generates a new shadow page table

how to simulate devices?
  e.g. disk, NIC, display
  a big challenge!
  strategy #1: emulate a common existing real device
    needed in order to run oblivious guest O/S
    intercept memory-mapped control register read/write
      by marking those pages invalid, so VMM gets page faults
    VMM turns page faults into operations on simulated device state
    e.g. qemu simulates uart/console for xv6
      qemu turns uart r/w into characters to your display or ssh
  strategy #2: special virtual device tailored for efficiency
    requires guest O/S driver -- i.e. guest knows it's in a VM
    can be more streamlined than trapping on control register r/w
    e.g. xv6's virtio_disk.c; qemu turns into r/w on file fs.img
  strategy #3: pass-through access to a real hardware device
    guest O/S gets direct access to device h/w, no traps
    often requires specific support in device
      modern NICs have separate DMA ring per VM
    can be efficient

trap-and-emulate works well -- but it can be slow!
  lots of traps into the VMM

*** Hardware-supported x86 virtualization

VT-x/VMX/SVM: hardware supported virtualization
  modern Intel (and AMD) CPUs support virtualization in hardware
    guest can execute privileged instructions without trapping!
    can modify control registers, change page table, handle exceptions!
    can switch to user mode, and receive system call traps
    etc.
  faster than trap-and-emulate, and simpler VMM software
  widely used to implement virtual machines
    e.g. WSL 2 (Windows Subsystem for Linux), cloud

(How can this possibly be secure?)

Some terminology
  Each CPU is in either root mode -- running the VMM i.e. host
    or in non-root mode -- running the guest (kernel + user processes)
    execution switches back and forth
  VMCS (VM Control Structure) -- configuration, save/restore
  Special instructions switch VT-x mode
    VMLAUNCH/VMRESUME: host -> guest
    VMCALL: guest -> host
  Certain events also force guest->host "exit"

What must VT-x prevent the guest from doing, given access to privileged state?
  read/write outside its own memory
  talk to hardware devices, or grab interrupts
  interfere with VMM host's control register setup

EPT (extended page table) constrains guest memory access
  problem:
    we want to let the guest kernel control its own page table,
    but we also want to restrict the guest to just its allowed physical memory
  MMU has *two* layers of address translation in VT-x guest mode
    first, %cr3 page table maps guest va -> guest pa (as usual)
    second, EPT maps guest pa -> host pa
  VMM sets up EPT to have only mappings for guest's own memory
  guest cannot see or change the EPT
  so:
    guest can freely read/write %cr3, change PTEs, read D bits, &c
    VMM can still provide isolation via EPT
  CPU delivers page faults from ordinary (%cr3) page table to guest
  page faults from EPT force exit to host -- guest does not see them
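
A sketch of the two translation layers and where each kind of fault goes.
Tables are flat Python dicts of page numbers, and the exception class names
are chosen for illustration. Real hardware also sends the guest page table
walk itself through the EPT; this model omits that.

```python
# Two-layer translation sketch: guest page table, then EPT.
# Flat dicts stand in for the real multi-level tables.

class GuestPageFault(Exception):
    """Miss in the guest's own (%cr3) table: delivered to the guest kernel."""

class EptViolation(Exception):
    """Miss in the EPT: forces an exit to the host."""

def translate(guest_pt, ept, gva):
    if gva not in guest_pt:
        raise GuestPageFault(gva)     # guest handles this itself
    gpa = guest_pt[gva]
    if gpa not in ept:
        raise EptViolation(gpa)       # guest never sees this
    return ept[gpa]

ept = {0: 500, 1: 501}          # set up by the VMM; invisible to the guest
guest_pt = {10: 0, 11: 1}       # guest kernel edits this freely
guest_pt[12] = 99               # guest maps a guest pa it does not own

ok = translate(guest_pt, ept, 10)     # host page 500
```

The guest can put anything it likes in guest_pt, but an access through the
bogus entry for page 12 raises EptViolation rather than reaching host memory.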

Device and timer interrupts
  CPU forces exit from guest, delivers interrupts to host

the VMCS memory area holds saved host state
  VMLAUNCH and VMRESUME save all of host privileged state (registers &c)
  and restore all of guest's (previously saved) state
  exit from guest to host restores host's state
    so guest cannot disturb host's privileged state
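
A sketch of the state switching described above, following these notes'
simplified picture of VM entry and exit. cr3 and idtr are real x86 registers,
but the CPU and VMCS classes are a toy model, and the register values are
arbitrary.

```python
# Sketch of VM entry/exit state switching via a VMCS-like structure.
# The classes are a simplified model, not the real VMCS layout.

class VMCS:
    def __init__(self, guest_state):
        self.host_state = None
        self.guest_state = dict(guest_state)

class CPU:
    def __init__(self, state):
        self.state = dict(state)      # privileged registers currently loaded
        self.root_mode = True

    def vm_entry(self, vmcs):         # VMLAUNCH / VMRESUME
        vmcs.host_state = dict(self.state)
        self.state = dict(vmcs.guest_state)
        self.root_mode = False

    def vm_exit(self, vmcs):          # VMCALL, EPT fault, interrupt, ...
        vmcs.guest_state = dict(self.state)
        self.state = dict(vmcs.host_state)
        self.root_mode = True

cpu = CPU({"cr3": 0x1000, "idtr": 0x2000})     # host's registers
vmcs = VMCS({"cr3": 0x0, "idtr": 0x0})         # guest's initial registers
cpu.vm_entry(vmcs)
cpu.state["cr3"] = 0xdead000      # guest loads its own page table, no trap
cpu.vm_exit(vmcs)
```

After the exit the host's cr3 and idtr are back to their original values, and
the guest's new cr3 is saved in the VMCS for the next VMRESUME.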

Thus: if the host configures things properly, the guest cannot escape

Hardware virtualization is widely used, e.g. in the cloud.

*** Dune

the big idea:
  use VT-x to run a Linux process (rather than to run a guest kernel)
  then application code has fast direct access to page tables, page faults, &c
  to allow user code to efficiently:
    sandbox untrusted code
    modify page table and take page faults

the scheme
  [linux, dune module, process]
  Dune is a "loadable kernel module" for Linux
  an ordinary process can switch into "Dune mode"
  a Dune-mode process is still a process
    has memory, can make Linux system calls (via VMCALL)
  the isolation machinery is a little different
    VT-x guest supervisor mode
    memory protection via EPT page table
  Dune gives a process additional functionality
    read and write its own page table, including PTE D (dirty) bit
      faster than Linux mprotect() system call
    handle its own page faults
      faster than having Linux turn fault into upcall to signal handler
    switch into (guest) user mode, for sandboxing
      guest user mode can only use guest PTE_U addresses
      and cannot use privileged instructions/registers
    process can intercept (guest) user system calls, page faults

Example: sandboxed execution (paper section 5.1)
  suppose your web browser wants to run a 3rd-party plug-in
    e.g. a video decoder or ad blocker
    the plug-in might be malicious or buggy
  browser needs a "sandbox"
    execute the plug-in, but limit syscalls / memory accesses
  assume browser runs as a Dune process:
    [diagram: browser in guest supervisor mode, plug-in guest user mode]
    browser creates page table with PTE_U mappings for memory plug-in can use
      and non-PTE_U mappings for rest of browser's memory
    set %cr3
    iret into untrusted code, in guest user mode
    plug-in can read/write allowed PTE_U memory via page table
    plug-in can execute system call instruction
      but its system calls trap into the browser (not the underlying kernel)
      and the browser can decide whether to allow each one
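
A sketch of the browser's side of this: a handler that sees each plug-in
system call and decides whether to forward it. The allow-list, the function
names, and the fake_linux stand-in are all hypothetical. This is not the
libDune API, just the shape of the policy check.

```python
# Sketch of the browser's system call handler for sandboxed plug-in code.
# Policy and function names are hypothetical, not Dune's API.

ALLOWED = {"read", "write", "exit"}

def on_plugin_syscall(name, args, do_real_syscall):
    # Runs in the browser (guest supervisor mode) when the plug-in
    # (guest user mode) executes a system call instruction.
    if name not in ALLOWED:
        return -1                          # deny: plug-in sees an error
    if name == "write" and args[0] != 1:
        return -1                          # only allow writes to stdout
    return do_real_syscall(name, args)     # forward to Linux (via VMCALL)

log = []
def fake_linux(name, args):
    # stand-in for the real kernel, so the sketch runs anywhere
    log.append((name, args))
    return 0

r1 = on_plugin_syscall("write", (1, b"frame"), fake_linux)    # allowed
r2 = on_plugin_syscall("open", ("/etc/passwd",), fake_linux)  # denied
r3 = on_plugin_syscall("write", (3, b"x"), fake_linux)        # denied
```

Only the first call reaches the (fake) kernel; the other two are rejected
inside the browser without leaving the Dune process.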

Example: garbage collection (GC)
  (modified Boehm concurrent mark-and-sweep collector)
  GC follows pointers to find all live (reachable) objects
    starting at roots: registers, stacks, global variables
  But this GC is concurrent
    so program may modify an object after GC has traced it
  GC needs a way to know which objects were modified,
    so it can re-visit modified objects
  How does Dune help?
    Use PTE dirty bit (PTE_D) to detect written pages
    Dune allows direct access to PTEs
      much faster than making Linux system calls to get at PTEs
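
A sketch of why the dirty bit matters for a concurrent mark phase. The Heap
class, the four-objects-per-page layout, and the helper names are invented
for illustration. Here write() records the dirty page in software; with Dune
the collector would instead read PTE_D bits that the hardware sets.

```python
# Sketch of dirty-page tracking for a concurrent mark phase.
# Heap layout and names are made up; write() simulates the PTE_D bit.

OBJS_PER_PAGE = 4

class Heap:
    def __init__(self, nobjs):
        self.refs = {i: [] for i in range(nobjs)}   # object -> pointed-to objects
        self.dirty = set()                           # pages written since last clear

    def write(self, obj, targets):
        # the program stores pointers into obj; h/w would set PTE_D
        self.refs[obj] = list(targets)
        self.dirty.add(obj // OBJS_PER_PAGE)

def mark_from(heap, start, marked):
    stack = list(start)
    while stack:
        o = stack.pop()
        if o not in marked:
            marked.add(o)
            stack.extend(heap.refs[o])

heap = Heap(8)
heap.write(0, [1])            # root object 0 points to 1
heap.dirty.clear()            # GC clears dirty bits, then starts marking
marked = set()
mark_from(heap, [0], marked)  # concurrent mark: reaches 0 and 1
heap.write(1, [5])            # program runs meanwhile: 1 now points to 5
missed = 5 not in marked      # True: 5 is live but unmarked

# final phase: re-visit marked objects that live on dirty pages
for o in [m for m in marked if m // OBJS_PER_PAGE in heap.dirty]:
    mark_from(heap, heap.refs[o], marked)
```

Without the re-visit step, object 5 would be reclaimed while still reachable.
The dirty set tells the collector which pages to re-scan, so it need not
re-trace the whole heap.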

Fast user-level access to VM could help many programs
  Appel and Li paper

How might Dune hurt performance?
  Table 2
    sys call overhead higher due to VT-x entry/exit
    faults to kernel slower, for same reason
    TLB misses slower b/c of EPT
  But they claim most apps aren't much affected
    b/c they don't spend much time in short syscalls &c
    Figure 3 shows Dune within 5% for most apps in SPEC2000 benchmark
      slower ones suffer from EPT lookups

Of course it's not enough to merely not slow down apps much.

How much can clever use of Dune speed up real apps?
  Table 6 -- GC
  compare "Dune dirty" line to "Normal" line
  overall benefit depends on how fast the program allocates
  huge win on three allocation-intensive micro-benchmarks
  not a win for applications that don't allocate much -- XML parser
    EPT overhead does slow it down
    but many real apps allocate more than this

Next week:
  yet another different approach to kernel architecture!