Frequently Asked Questions for "Dune: Safe User-level Access to
Privileged CPU Features".

Q: Why is it useful for Dune to expose the virtual memory hardware
to user processes?

A: There are a bunch of neat things an application can do if it can
control how its own virtual memory is set up, and if it can handle its
own page faults. Last week's "Virtual Memory Primitives for User
Programs" paper talked about some, and the Dune paper's Application
section mentions sandboxing and a few others.

Q: If user mode processes in Dune mode run in ring 0 (supervisor
mode), how are they prevented from accessing privileged resources?

A: The CPU hardware acts slightly differently in "VMX non-root mode".

One way is that virtual address translation proceeds through two page
tables: the page table controlled by the process (in %cr3, the
equivalent of satp), and a separate page table called EPT that the
process cannot look at or modify. When the process uses a virtual
address, the MMU first translates it to a "guest physical address" using
the process's %cr3 page table, and then uses the EPT to translate that
guest physical address to a real physical address. The underlying kernel
configures the EPT to only allow access to the process's own physical
pages.

Another aspect is that the underlying kernel's privileged control
registers are saved when the kernel executes VMLAUNCH or VMRESUME to
enter the process; and restored whenever control exits the process back
into the kernel. This means that the process cannot tamper with the
kernel's control registers.

Another mechanism is that device interrupts (including the timer that
drives involuntary context switch) are delivered to the underlying
kernel, not to the process. The CPU only delivers page faults and
exceptions like divide-by-zero directly to the process.

The meanings of some of the control registers are modified in VMX
non-root mode. For example, the interrupt-enable flag is ignored;
interrupts are always delivered to the underlying kernel.

Q: How does the overhead to create a Dune child process compare to
that of the native implementation in Linux?

A: The paper does not discuss this. It must take significantly longer
to create a new Dune process than to just call ordinary fork(). One
has to create a VMCS and an EPT, and switching into the process (to
get it running, and every time there's a switch) with VMLAUNCH and
VMRESUME must be more expensive than ordinary kernel->user transition
because more state has to be saved and restored.

Q: The paper mentions that it takes special care about the ELF loader
when loading sandboxed processes. What is the issue here?

A: The danger is that a program's ELF headers might exploit a bug in
the ELF loader code, in particular in the code that parses the complex
ELF headers. If you run the ELF loader in privileged code, and load a
malicious ELF file, then a bug may allow an attacker to trick that
privileged code into doing something bad. In this situation, the user
process is privileged -- for example it might be your web browser,
which knows sensitive data like passwords you have typed into web
sites.

So the scheme seems to be to run the first ELF loader in privileged
code, but to have it load a known non-malicious second loader into the
sandbox. Then the second loader runs in the sandbox and reads the ELF
program that might be malicious. So only that second loader is exposed
to attack. Since the second loader is running in the sandbox, a
successful attack will hopefully not be a problem.

Q: Why does the paper find a high TLB miss rate in the EPT?

A: My guess: the MMU's traversal of each level of the main %cr3 page
table requires a complete EPT lookup (to translate the "guest
physical" address in that level's PTE to a "host physical" address).
So more addresses have to be looked up (in the EPT), and thus there's
more pressure on the part of the TLB that caches guest physical to
host physical translations.