Q: It seems like the majority of papers we have read have been about improving performance by allowing user programs to have greater access to the hardware while maintaining security properties. Has there been any research into what you can make and how it would work if you could assume only trusted software would run on the machine?

A: Have a look at Unikernels -- http://unikernel.org/

Q: This paper lists six authors, only two of whom are professors. Until now, the papers we have read typically have only one (or two) student author(s). How does research with more contributors generally differ from that with fewer? How is work divided? Is size a preference/style of the particular research groups?

A.1: I'm not sure that the number of contributors correlates very strongly with anything. You might imagine that having lots of contributors would help you implement big, complex systems. Or not: some small teams are as productive as big teams. And anyway "big and complex" is rarely a good thing in a research project.

A.2: It depends on what the people are good at, and have time for. Here are some things that some people are particularly good at; few people are good at all of them:

* choosing research ideas.
* writing papers and explaining ideas.
* proving formal properties.
* designing systems.
* implementing systems.
* analyzing performance.

A.3: Often. Some people like, or are more effective with, big groups, or small groups. There are also big groups that do projects with only a subset of the group, and there are big collaborations across groups.

Q: Dune is implemented as a standalone Linux module, with no . Does this mean it is easier or more likely to be picked up and used in the real world? Has there been an effort to turn Dune from a research project into a real-world tool (such as fully supporting signals)?

A: Embedding Dune in Linux, and implementing it as a kernel module, definitely makes it relatively easy for others to use (and for the authors to build).
As opposed to, say, having Dune be a completely new operating system. I'm not aware of anyone using Dune.

Q: Do any current production systems implement similar mechanisms to reduce the overhead of virtualization (e.g. VMware, cloud VMs, etc.)? If not, what reasons besides limited compatibility with existing systems (e.g. performance, security concerns not mentioned in the paper) can explain this?

A: Many virtual machine monitors (VMware, KVM, Bhyve, &c) use Intel VT-x to increase performance -- the guest operating system can directly modify its page table, get interrupts via its IDT, use both CPL=0 and CPL=3, &c. VT-x was originally intended to increase the performance (and decrease the complexity) of virtual machine monitors. The key new idea in the Dune paper is for a kernel to apply VT-x to individual processes (rather than a VMM applying VT-x to entire guest operating systems), and to thereby give processes direct access to privileged hardware. I do not know of anyone having picked up that idea. However, it's a good idea, and it's easy for me to imagine it being used in the future, particularly if Intel continues to increase the efficiency of VT-x.

Q: I don’t really understand what the paper means by providing a ‘process’ abstraction vs. a machine abstraction. I understand the benefits the applications mentioned get from using Dune, but I’m a little confused by the above terminology.

A: A process can make system calls to an operating system kernel to read files, allocate memory, &c. Code running on a machine abstraction can execute instructions and use machine registers, but can't make system calls. The JOS and xv6 kernels expect to run directly on computer hardware, and can also run as guests in a virtual machine that provides a machine abstraction. One of the key new ideas in the Dune paper is to take virtualization hardware originally intended to provide a machine abstraction to guests, and use it to instead provide a process abstraction.
It's a process abstraction because guests can make Linux system calls (using the VMCALL instruction).

Q: The 5th paragraph of Section 3.4 about Memory Management notes that an MMU notifier chain is used to handle various scenarios that may alter or require page mappings. I tried looking this up but I didn't find anything that explained it cleanly. What is it and how does this address the issues mentioned in that same paragraph?

A: Notifier chains are an internal mechanism in Linux that allows different parts of the kernel to ask for notification when various things change. If you search the web for "linux notifier chain" you'll find some explanations.

Q: What does the paper mean by "shadow copy of privileged state" (Section 2.1)?

A: The idea is that the guest software (running in "VMX non-root" mode) is able to read and write privileged registers such as %cr3, but it is not accessing the real registers. The hardware hides the real register values when the host VMM switches from root to non-root mode, and restores them when the guest "exits" back to root mode. Thus there are two sets of privileged registers; the paper refers to one of them as the "shadow copy" of the registers. I do not know whether the paper means "shadow" to refer to the registers that the guest sees, or the registers that the VMM sees.

Q: Both Dune and Exokernel aim to give applications more power in controlling memory for performance reasons. While Dune adds layers to an existing kernel, allowing for easier application development, Exokernel seems to strip many layers away. Does this mean that Dune makes the trade-off of faster application development for power compared to Exokernel, or are their performances still comparable?

A: It's probably easier in practice to develop or port applications for Dune than for Exokernel because Dune lets applications use all the system calls and kernel services in Linux.
Dune probably delivers better performance than an Exokernel for page-table manipulation and delivering page faults to processes, because Dune really does let the process have direct access to the hardware %cr3/page table and IDT. The Exokernel, in contrast, requires the process to perform system calls into the kernel to change its page table, and page fault delivery requires user/kernel/user transitions. Of course when the Exokernel was designed VT-x didn't exist, so Dune wasn't possible.

Q: This is a more tangentially related question. Would it be possible to implement a libOS with Dune in an exokernel-type setup?

A: Yes. This could make a lot of sense because a big goal of the Exokernel is to let ordinary applications use powerful hardware features, and Dune allows exactly that.

Q: What is meant by the libDune library being "completely untrusted by the kernel"? Is the implication that libDune does not enable any behaviors that are not already possible with a traditional user process?

A: That's correct; libDune can't do anything that a Dune process can't already do.

Q: What is the difference between Dune sandboxing and other sandboxing mechanisms (e.g. seccomp)? Which is better? Which gives better performance?

A: Dune isn't a full sandboxing system -- it just provides efficient mechanisms to restrict virtual address mappings, and to intercept system calls. It doesn't provide the logic for deciding what memory to reveal, or for deciding which system calls should be allowed. That logic can be pretty complex. It would make sense to combine Dune with a larger sandboxing policy system -- which the paper does by using Wedge.

Q: What are the downsides of running all user processes in Dune mode? It doesn't seem unsafe to do so, and, as I understand, there is no serious performance disadvantage.

A: There are some applications that run slower in Dune (e.g. mcf and ammp in Figure 3).
Whether that's a serious downside depends on whether you care about the performance of such applications. Maybe Intel will improve the performance of VT-x entry/exit and EPT lookup so that no applications are slowed down.

Q: Does running in VT-x slow down ordinary programs?

A: It might -- you can see in Table 2 that some operations are slower in Dune than in ordinary Linux, due to VT-x costs. I think the main cost is the time to switch between VT-x root and non-root modes (i.e. between kernel and Dune process), which affects system calls, interrupts, and page faults. The paper argues that, for most programs, the extra per-system-call cost is not very significant -- for example, many of the entries in Figure 2 show a Dune program executing in a time within a few percent of how long it takes on standard Linux.

Q: The overhead from Dune seems to come from VMX mode transitions and EPT translations. How do these overheads compare, and do more traditional systems have similar overhead?

A: You are right that system call overhead in Dune is much larger than in Linux; Table 2 suggests a factor of 5x. This might be a serious problem for a program that spends a lot of its time making simple system calls, but would not be a problem for most programs (e.g. most of the programs in Figure 3 run within 5% of the speed of Linux).

Q: Why does the pool of threads improve the performance of process switching?

A: The pool of threads technique (in Section 6.3.2) makes sthread creation faster -- re-using an existing sthread is faster than creating an sthread from scratch. They improve context switch time by using the Intel hardware TLB tagging feature, the PCID mentioned in Section 2.2. This tags each TLB entry with the identifier of the thread it belongs to, so that they can switch page tables without having to flush the TLB. The TLB uses only entries that are tagged with the current thread's identifier.
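To make the tagging idea concrete, here is a minimal sketch of a tagged TLB as a toy Python model. It is not Dune or Intel code; the class name `TaggedTLB`, its methods, and the page numbers are all made up for illustration. The point it shows is that switching tags flushes nothing, and a lookup can only hit entries that carry the current tag.

```python
class TaggedTLB:
    """Toy model of a TLB whose entries carry an address-space tag (like a PCID)."""

    def __init__(self):
        # (tag, virtual page number) -> physical page number
        self.entries = {}
        self.current_tag = 0

    def switch_to(self, tag):
        # Switching address spaces only changes the active tag; nothing is flushed.
        self.current_tag = tag

    def insert(self, vpn, ppn):
        # New entries are tagged with whichever tag is currently active.
        self.entries[(self.current_tag, vpn)] = ppn

    def lookup(self, vpn):
        # Only entries carrying the current tag can hit; None models a TLB miss.
        return self.entries.get((self.current_tag, vpn))


tlb = TaggedTLB()
tlb.switch_to(1)
tlb.insert(0x10, 0xA0)                     # thread 1 caches a translation
tlb.switch_to(2)
miss_in_other_space = tlb.lookup(0x10)     # tag 2 has no entry for this page
tlb.insert(0x10, 0xB0)                     # same virtual page, different mapping
tlb.switch_to(1)
hit_after_switch_back = tlb.lookup(0x10)   # tag 1's entry survived both switches
```

Without tags, the two `switch_to` calls would each have to discard every cached entry, so thread 1 would take a TLB miss when it resumed.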

Q: If I'm not mistaken, the speed increase from not needing to create sthreads comes from recycling, which avoids TLB flushes and context switches -- existing threads are recycled for new uses. This seems like a very strong idea for minimizing TLB flushes and thread creation, and for reducing processing time and memory as a result. Are there any examples of prominent features in Linux that already use this idea somehow?

A: It's pretty common to see systems that reduce thread-creation cost by initially creating a set of "worker" threads, and then re-using them for each request. For example, the Apache web server uses worker threads and processes.

Q: The paper compares the example of running a garbage collector in a VM vs. with a Dune module. How would you even run a garbage collector in a VM? Is the memory it cleans up the virtualized memory?

A: In ordinary situations -- for example if you run Java on Linux and Java's garbage collector runs -- the garbage collector uses virtual addresses. The MMU's page-table hardware translates these to physical addresses which refer to real storage in RAM, so ultimately the garbage collector is finding real free space in real RAM.

The situation with running a garbage collector in a virtual machine guest (or in a Dune process) is similar. Java and the garbage collector use virtual addresses, which the guest page table and the VT-x EPT translate to physical addresses that refer to real RAM. So again the garbage collector is ultimately referring to real memory.

Q: How do fast faults and memory protections relate to improving performance of garbage collection? It doesn't seem to be something that you can optimize by skipping copying some data like you can by knowing dirty bits or controlling page tables.

A: Virtual memory tricks turn out to be useful in garbage collectors that operate concurrently with the program.
It's often the case that the garbage collection algorithm needs to be aware of the program's reads and writes in order to preserve correctness despite concurrency, and page protection hardware can be an efficient way to do this.

In the Boehm collector that the paper benchmarks, the collector traces pointers (to find all live objects) in parallel with the program's execution. If the program changes some pointers in objects after the collector has traced those objects, the collector must re-trace them later. The collector uses the dirty bits maintained by the MMU (in each PTE) in order to detect which pages contain objects that the program modified after they were initially traced.

People have thought of many ways to use virtual memory hardware for garbage collection. This paper by Appel and Li mentions a few:

http://www.cs.cornell.edu/courses/cs614/2003sp/papers/AL91.pdf

Q: Is exposing the EPT safe because VT-x only modifies the page table as part of a virtual machine (so you're not actually managing data apart from what you're given as part of your VM)?

A: Only Dune (in the kernel) can read or write the EPT. A Dune process can modify its own page table, but it can't get at the EPT. Allowing the Dune process to modify its own page table is safe because Dune sets up the EPT so that it only provides mappings to physical memory that Dune has allocated to the process.

Q: I would be happy for some additional information regarding Intel's process-context identifier (PCID) feature.

A: You can read about PCID in Section 4.10 of the "Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3: System Programming Guide":

http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html

Q: The paper references multiple different "rings". What are these specifically?

A: When the paper says "ring 0", it means "executing with CPL=0", that is, executing with hardware privileges.
And the paper's "ring 3" means executing with CPL=3, i.e. in user mode without hardware privileges.

Q: How does Dune maintain a consistent TLB? By exposing the tagged PCID feature to Dune processes, would it be possible to collide with an ID used by the host kernel for another host process? Similarly, how are VPIDs related?

A: The Intel VT-x mechanism virtualizes TLB tags, so each Dune process has an independent set of tags. Each process can assign TLB tags however it likes, without interfering with other processes.

In the language of VT-x, each process has a separate VPID assigned by Dune; Dune tells the VT-x hardware which VPID to use with each process. Each VPID has an independent set of PCIDs (which the paper calls TLB tags); the process tells the processor the current PCID (using the low 12 bits of %cr3). Each TLB entry is tagged with both a VPID and a PCID, and the MMU only uses entries that have the currently-set values.

Q: I was most confused about exactly how Dune can provide all of its features completely safely without having to use a fully fledged virtual machine. Are managed system calls and special state really all that is needed for these privileged CPU features to be exposed to something running in user space?

A: I think one surprise about virtual machines is how little (at a conceptual level) needs to be protected -- mostly the address space (page tables). On the other hand, Dune uses VT-x, and VT-x is basically a virtual machine mechanism. So while Dune doesn't provide a fully fledged virtual machine, it uses hardware that is powerful enough to do so.

Q: Much of the implementation of Dune is highly dependent on the hardware architecture. Would the overall implementation (the interface to the user applications) be independent of architecture, or would this need to be adjusted on a per-architecture level?

A: Dune requires the hardware to provide VT-x or something similar.
Dune could be made architecture-independent by teaching it how to detect what processor type it is running on, and to use the instructions &c for the virtual machine support in that kind of processor. Of course the processor would have to provide something with capabilities similar enough to VT-x that Dune would make sense. I do not know how much of the Dune code could be made independent of the machine architecture.

Q: Although they take steps to prevent Dune from allowing processes to monopolize a CPU, how much inefficiency could a malicious program introduce with the additional privileges granted by Dune?

A: I'm not aware of any damage a Dune process could do that can't be done equally well with a traditional Linux process. A place to look might be the TLB -- perhaps there's a way to reduce TLB caching effectiveness by manipulating the PCID.

Q: What would happen if a Dune program used the int instruction? Would we end up in the kernel, or in the Dune module? I don't quite understand why VM exits are the only way we can access the kernel from a Dune program.

A: If a Dune process executed INT, the hardware would try to deliver the interrupt to the process itself through the process's IDT. If the IDT entry isn't valid, that's a double-fault, so the processor would try to deliver the double fault to the Dune process through the process's IDT. If that IDT entry isn't valid, then the processor will exit VT-x non-root mode and give control back to the kernel. It's cheaper for the process to just do the VM exit directly.

The Linux kernel (and the Dune extension) need to run in VT-x root mode so that they can control VT-x (e.g. set up the EPT and the VMCS). So a system call from a Dune process into the kernel needs to switch from VT-x non-root mode to VT-x root mode -- i.e. do a VM exit.

Q: In Section 4.2, one of the limitations listed is: "we have not fully integrated support for signals despite the fact that they are reported by the Dune module.
Applications are required to use dune_signal whereas a more compatible solution would override several libc symbols like signal and sigaction." What is the difference between using dune_signal and the libc symbols? I'm assuming it's more than just the naming to override the symbols. Also, what are the signals generally used for?

A: I suspect that dune_signal() takes different arguments and has different behavior than the standard Linux signal call. But I don't know what the differences are.

Here are some ways that Linux uses signals:

* A process can ask Linux to notify it when it suffers a page fault; Linux delivers the notification via a signal.
* When a process does something that the hardware views as illegal (e.g., divide by zero), Linux can notify the process using a signal.
* When you type control-C, Linux delivers a signal to the process you're running.
* A process can ask Linux to notify it when input arrives on a pipe or socket; Linux delivers the notification with a signal.
* I'm sure I'm forgetting other uses.

Q: Why is it not possible to leave the Dune mode?

A: Intel VT-x doesn't support this. VT-x is intended for virtual machines, and it wouldn't make sense for a virtual machine guest to switch to executing as the host. Perhaps Dune could implement exiting Dune mode by creating a new non-Dune process and giving the Dune process's memory to the new process, but I suspect they never needed this feature.

Q: Is there any way to have a program use Dune with restrictions on its access to privileged instructions? Or is it all or nothing? What's the extent of the process isolation that is preserved in Dune?

A: VT-x can be configured to restrict privileges, but I don't think Dune makes much use of that. Dune tries to let processes do as much as possible, consistent with isolation.

Dune processes are as isolated as ordinary Linux processes. It may seem that the ability of a Dune process to put anything it likes into its page table might break isolation.
However, Intel VT-x hardware maps the "physical addresses" in the process's page table a second time, using the EPT (extended page table). So there are three kinds of address and two mappings:

  ProcessVirtual --pagetable--> ProcessPhysical --EPT--> RealPhysical

The process can control how virtual addresses map to ProcessPhysical addresses, but the EPT controls how (and whether) ProcessPhysical addresses map to RealPhysical addresses (which refer to RAM). Dune controls the EPT; the process cannot see or modify the EPT. Dune sets up the EPT so that it contains mappings only to RealPhysical addresses that Dune allocates to the process. So a Dune process is isolated so that it can only use its own memory, regardless of what it puts into its page table.

Q: What are the extra complications of nested VT-x that make it not commonly supported?

A: The hardware doesn't directly support nested VT-x -- if code executing in non-root mode executes the VMLAUNCH instruction, it's an error. For the instruction to work correctly, the hardware would have to save and restore a stack of virtual machine state records, but it doesn't do that. Software can provide nested VT-x -- a nested VMLAUNCH will cause a VM exit to the surrounding root-mode software, which can create a VMCS and EPT mimicking the nested VMCS created in the non-root-mode code.

Q: According to the abstract, Dune uses the virtualization _hardware_ in modern processors to provide a process abstraction. Doesn't this imply some sort of limit on the number of concurrent processes that can be running? What implications does that have for the overall usability of Dune?

A: I don't think the VT-x hardware imposes a limit on the number of Dune processes that can exist. Dune can create lots of VMCS structures, one for each process. When Dune context-switches to a particular Dune process, it tells the processor hardware which VMCS it is switching to.
Of course, only as many processes can actually execute at a given time as there are cores (just as on non-Dune Linux). Each core has its own VT-x machinery, so each core can execute a different Dune process.

Q: Has the attempt to build Dune on top of VT-x revealed any limitations to the VT-x extension itself?

A: The paper's Section 7 mentions two: EPT performance could be improved, and the EPT guest-physical address space size should be increased.

Q: It seems like Dune relies extensively on many of the features provided by Intel/AMD (e.g. VMX, EPT). Does that imply that Dune is incompatible with non-Intel/AMD hardware?

A: The current Dune implementation probably only runs on modern Intel processors. If some non-Intel processor supported something similar to VT-x, Dune could probably be modified to run on it.

Q: I've seen and used Intel VT-d to give VMs access to physical devices (like a GPU). Could Dune similarly expand to take advantage of that tech to give processes direct access to PCIe cards and similar devices?

A: Yes, I'm sure it could. I know the authors of Dune were at one point thinking of adding VT-d support, though I don't know if they have.
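
As a closing illustration of the isolation argument given earlier (ProcessVirtual --pagetable--> ProcessPhysical --EPT--> RealPhysical), here is a minimal sketch of the two-stage translation as a toy Python model. It is not Dune code: the function names, the flat dictionaries standing in for page tables, and the page numbers are all assumptions made for illustration. It shows that a process can write anything into its own page table, yet can only reach real memory that the EPT maps.

```python
PAGE_SIZE = 4096


def translate(table, addr):
    """Map an address through a page-granular table; None means unmapped."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in table:
        return None
    return table[page] * PAGE_SIZE + offset


def process_access(page_table, ept, vaddr):
    """ProcessVirtual --pagetable--> ProcessPhysical --EPT--> RealPhysical."""
    process_phys = translate(page_table, vaddr)
    if process_phys is None:
        return None  # models a page fault, delivered to the process itself
    # The second stage is controlled by the kernel, not by the process.
    return translate(ept, process_phys)  # None models an EPT violation


# Hypothetical setup: the kernel allocated real pages 100 and 101 to the process.
ept = {0: 100, 1: 101}
# The process may put anything it likes into its own page table...
page_table = {0: 1, 5: 0, 7: 42}

# ...but only ProcessPhysical pages 0 and 1 lead anywhere.
ok = process_access(page_table, ept, 5 * PAGE_SIZE + 8)    # lands in real page 100
blocked = process_access(page_table, ept, 7 * PAGE_SIZE)   # page 42 is not in the EPT
```

In this model the entry `7: 42` is harmless: ProcessPhysical page 42 has no EPT mapping, so the access never reaches RAM.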