6.S081 2020 L18: Operating System Organization, Microkernels

Topic: What should a kernel do?
  What should its abstractions / system calls look like?
  Answers depend on the application, and on programmer taste!
  There is no single best answer.
  This topic is more about ideas and less about specific mechanisms.

The traditional approach
  1) powerful abstractions, and 2) a "monolithic" kernel implementation
  UNIX, Linux, xv6

The philosophy behind traditional kernels is powerful abstractions:
  portable interfaces
    files, not disk controller registers
    address spaces, not MMU access
  simple interfaces that hide complexity
    all I/O via FDs and read/write, not specialized for each device &c
    address spaces with transparent disk paging
  abstractions help the kernel manage and share resources
    process abstraction lets kernel be in charge of scheduling
    file/directory abstraction lets kernel be in charge of disk layout
  abstractions help the kernel enforce security
    file permissions
    processes with private address spaces
  lots of indirection
    e.g. FDs, virtual addresses, file names, PIDs
    helps kernel virtualize, hide, revoke, schedule, &c

Powerful abstractions have led to big "monolithic" kernels
  kernel is one big program, like xv6
  easy for kernel sub-systems to cooperate -- no irritating boundaries
    exec() and mmap() are part of both the FS and VM systems
    relatively easy to add sym links, COW fork, mmap, &c
  all kernel code runs with high privilege -- no internal security restrictions

What's wrong with traditional kernels?
  big => complex => buggy/insecure
  perhaps over-general and thus slow
    how much code executes to send one byte via a UNIX pipe?
    buffering, locks, sleep/wakeup, scheduler
  many design decisions are baked in, can't be changed, may be awkward
    maybe I want to wait for a process that's not my child
    maybe I want to change another process's address space
    maybe a DB is better at laying out B-Tree files on disk than the kernel FS
  hard to create kernel "extensions" that others can use
    new device drivers, file systems, &c

Microkernels -- a different approach
  big idea: move most O/S functionality to user-space service processes
  [diagram: h/w, kernel, services (FS disk VM TCP NIC display), apps]
  kernel can be small
    address spaces, threads, IPC (inter-process communication)
    IPC lets threads send each other messages
  1980s saw a big burst of research on microkernel designs
    CMU's Mach perhaps the most influential
  used today in embedded systems, phone chips, car entertainment
  ideas (esp. user-level servers and IPC) influential
    e.g. in Windows and MacOS

Why the interest in microkernels?
  focused, elegant, clean slate
  small -> more security -- less code means fewer bugs to exploit
  small -> verifiable (see seL4)
  small -> easier to optimize
    you don't have to pay for features you don't use
  small -> avoids forcing design decisions on applications
  user-level -> may encourage modularity of O/S services
  user-level -> easier to extend / customize / replace user-level services
  user-level -> more robust -- restart individual user-level services
  most bugs are in drivers, get them out of the kernel!
  can run/emulate multiple O/Ses, like a VMM

Microkernel challenges
  What's a minimum kernel API?
    Need simple primitives on which to build exec, fork, mmap, &c
    Need to build the rest of the O/S at user level (see the sketch below)
  How to get good performance, despite IPC and less integration?
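To make "build the rest of the O/S at user level" concrete, here is a minimal
sketch in C of a user-level file server loop built on a hypothetical two-call
IPC API. ipc_send()/ipc_recv(), the message format, and the fs_* helpers are
invented for illustration; they are not real microkernel or L4 calls.

  /* Sketch: a user-level file server on an assumed minimal IPC API. */
  #include <stdint.h>

  typedef uint64_t tid_t;          /* IPC addresses are thread IDs */

  struct msg {
    int op;                        /* OP_READ or OP_WRITE */
    int fd;
    char buf[128];                 /* small message; big data would use
                                      page grants instead of copying */
  };

  enum { OP_READ = 1, OP_WRITE = 2 };

  /* Assumed kernel primitives (hypothetical names). */
  extern int ipc_recv(tid_t *sender, struct msg *m);
  extern int ipc_send(tid_t dest, struct msg *m);

  /* The server's internal file system implementation, elided. */
  extern int fs_read(int fd, char *buf, int n);
  extern int fs_write(int fd, char *buf, int n);

  void
  fileserver(void)
  {
    tid_t client;
    struct msg m;

    for(;;){
      ipc_recv(&client, &m);       /* wait for a request from any client */
      switch(m.op){
      case OP_READ:
        fs_read(m.fd, m.buf, sizeof(m.buf));
        break;
      case OP_WRITE:
        fs_write(m.fd, m.buf, sizeof(m.buf));
        break;
      }
      ipc_send(client, &m);        /* reply; client's read()/write() returns */
    }
  }

A client's read() would then be a library stub that sends an IPC to this
server and waits for the reply, so applications still see a UNIX-like
interface even though no file system code lives in the kernel.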
A case study: the L4 microkernel
  L4 has evolved over time, many versions and re-implementations
  used commercially today, in phones and embedded controllers
  representative of the micro-kernel approach
  emphasis on minimality:
    7 system calls (Linux has 300+, xv6 has 21)
    13,000 lines of code

L4 basic abstractions [diagram]
  address space ("task")
  thread
  IPC

L4 system calls:
  create an address space
  create/destroy a thread in [another] address space
  send/recv a message via IPC (addresses are thread IDs)
  map pages of your memory into another address space
    it must agree; this happens via IPC -- one task can modify
      another task's page table
    used to create new tasks, share memory
  intercept another address space's page faults -- "pager"
    kernel delivers faults via IPC
  access device hardware (not a system call, happens directly)
  handle device interrupts
    kernel delivers them via IPC

Note the L4 kernel is missing almost everything that Linux or even xv6 has
  file system, fork(), exec(), pipes, device drivers, network stack, &c
  If you want these, they have to be user-level code
    library or server process

How does L4 thread switching work?
  the current user-level thread can yield for 3 reasons:
    an IPC system call waits
    timer interrupt
    yield() system call
  L4 kernel saves user thread registers, picks a RUNNABLE thread to run,
    restores user registers, switches page table, jumps to user space
  no surprises here

How do L4 external pagers work? (see the sketch below)
  every task has a pager task
  1. page fault
  2. kernel suspends the faulting thread
  3. kernel sends fault info in IPC to pager
  4. pager picks one of its own pages
  5. pager sends the virtual page address in an IPC reply to the faulting thread
  6. kernel intercepts the IPC, maps the page into the target, resumes target

What can you use an L4 pager for?
  allocating memory -- "sigma0" allocates on fault for early tasks
  copy-on-write fork
    coupled with a system call that revokes access
  mmap of a file
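Here is a hedged sketch in C of an external pager's main loop, following
steps 1-6 above. The function names and the fault message layout are
hypothetical stand-ins; real L4 encodes fault IPCs differently.

  /* Sketch: an L4-style external pager. */
  #include <stdint.h>

  typedef uint64_t tid_t;

  struct fault_msg {
    tid_t faulter;       /* thread that faulted; kernel has suspended it */
    uint64_t fault_va;   /* faulting virtual address */
    int is_write;
  };

  /* Assumed primitives (hypothetical names). */
  extern void recv_fault(struct fault_msg *f);     /* step 3: fault via IPC */
  extern void *alloc_page(void);                   /* pager's own memory */
  extern void reply_map(tid_t faulter, void *page);/* steps 5-6 */

  void
  pager(void)
  {
    struct fault_msg f;

    for(;;){
      recv_fault(&f);            /* kernel turned a page fault into an IPC */
      void *pg = alloc_page();   /* step 4: pick one of the pager's own pages;
                                    could instead fill it from a file (mmap) */
      reply_map(f.faulter, pg);  /* reply with the page's virtual address;
                                    the kernel intercepts this IPC, maps the
                                    page into the target, resumes it */
    }
  }

A copy-on-write fork pager would follow the same loop: map parent pages
read-only, and on a write fault allocate a fresh page and copy into it
before replying (using the revoke-access system call mentioned above).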
The problem: IPC performance
  Microkernel programs do lots of IPC!
  IPC was expensive in early systems
    multiple kernel crossings, TLB misses, context switches, &c
  The cost of IPC caused many to dismiss microkernels
  L4's designers put huge effort into IPC performance

Here's a slow IPC design, patterned on UNIX pipes
  [diagram: message queue in kernel]
  send(id, msg)
    append msg to queue in kernel, return
  recv(&id, &data)
    if a msg is waiting in the queue, remove it, return
    otherwise sleep()
  called "asynchronous" and "buffered"
  now the usual request-response pattern (RPC) involves:
    [diagram: 2nd message queue for replies]
    4 system calls (user->kernel->user)
      send() -> recv()
      recv() <- send()
    each may disturb the CPU's caches (TLB, data, instruction)
    four message copies (two for the request, two for the reply)
    two context switches, two trips through the general-purpose scheduler

L4's fast IPC
  "Improving IPC by Kernel Design," Jochen Liedtke, 1993
  * synchronous [diagram]
    send() waits for the target thread's recv()
    common case: target is already waiting in recv()
    send() jumps into the target's user space, as if returning from recv()
    no real context switch, no scheduler loop
  * unbuffered
    no queue in the kernel
    since synchronous, the kernel can copy directly between user buffers
  * small messages in registers
    the kernel's send() path does not disturb many of the registers
      e.g., no context switch
    no copying required for small messages
      since send() jumps into the target's user space, along with registers
  * huge messages as virtual memory grants
    again, no copy required, though the kernel send() code must change
      the page table
  * combined call() and sendrecv() system calls [diagram]
    IPC is almost always used as request-response RPC
    thus wasteful to use separate send() and recv() system calls
    client: call(): send a message, wait for the response
    server: sendrecv(): reply to one request, wait for the next one
    2x reduction in user/kernel crossings (see the sketch below)
  * careful layout of kernel code to minimize cache footprint
  result: a 20x reduction in IPC cost
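A sketch in C of the combined-system-call RPC pattern. call() and
sendrecv() follow the lecture's description, but the C signatures and the
message struct are assumptions for illustration, not L4's real API.

  /* Sketch: RPC with combined IPC system calls. */
  #include <stdint.h>

  typedef uint64_t tid_t;
  struct msg { uint64_t w[4]; };   /* small message: can live in registers */

  /* Assumed kernel primitives (hypothetical signatures). */
  extern void recv(tid_t *sender, struct msg *m);
  extern void call(tid_t server, struct msg *req, struct msg *reply);
  extern void sendrecv(tid_t *client, struct msg *reply, struct msg *next);

  /* Client: one crossing into the kernel and one out, per RPC. */
  uint64_t
  client_rpc(tid_t server)
  {
    struct msg req = { .w = { 42 } }, reply;
    call(server, &req, &reply);      /* send request + wait for reply */
    return reply.w[0];
  }

  /* Server: replies to one request and waits for the next in a single
     system call, instead of separate send() and recv(). */
  void
  server_loop(void)
  {
    tid_t client;                    /* reply destination / next sender */
    struct msg req, reply;

    recv(&client, &req);             /* get the first request */
    for(;;){
      reply.w[0] = req.w[0] + 1;     /* "handle" the request */
      sendrecv(&client, &reply, &req); /* reply, then wait for the next */
    }
  }

In the common synchronous case, call() transfers control directly from the
client to the waiting server with no scheduler pass, which is where the 2x
crossing reduction and register-resident messages pay off.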
How to build a full operating system on a microkernel?
  Remember the idea was to move most features into user-level servers.
    File system, device drivers, network stack, process control, &c
  For embedded systems this can be fairly simple.
  What about services for general-purpose use, e.g. workstations, web servers?
    Really need compatibility with existing applications.
    E.g. the system needs to mimic something like UNIX.
  Re-implement UNIX kernel services as lots of user-level services?
  Or: run an existing Linux kernel as a process on top of the microkernel.
    An "O/S server".
    Perhaps not elegant, but pragmatic.
    Part of a path to adoption:
      Users might start by just running Linux apps.
      Then gradually exploit the possibilities of the underlying microkernel.

Which brings us to today's paper:
  "The Performance of micro-Kernel-Based Systems," by Hartig et al., 1997
  basic picture [diagram]
    L4 kernel
    Linux kernel server
    one L4 task per Linux process
    IPC for system calls

What does it mean to run a Linux kernel at user level?
  The Linux kernel is just a program!
  The authors modified Linux in a number of ways, replacing hardware
    access with L4 system calls or IPC:
    process creation, configuring user page tables, memory allocation,
    system call handling, interrupt handling.

L4/Linux's use of threads
  Each Linux process has one or more L4 threads for its user code
  The Linux server has just one L4 thread
    (plus L4 threads waiting for interrupts)
    at rest it is waiting for IPCs carrying system calls
  The Linux server switches its own L4 thread among kernel threads
    for its processes
    when e.g. file system code sleep()s waiting for a disk read
    or a pipe read() sleep()s waiting for someone to write the pipe
    much as xv6 switches among kernel threads
  But an L4/Linux kernel thread switch has no relation to user process
    switching
    instead, L4 separately switches among the runnable L4 threads that
      implement the Linux processes
    so the Linux kernel server can be running a kernel thread for process
      P1, while L4 is running process P2 on another core
  Why not use L4 threads to implement the Linux server's kernel threads?
    Because that would cause pain without any benefit.
    It would introduce parallelism inside Linux.
      But Linux 2.0 did not have SMP support -- e.g. no spinlocks.
      And their hardware had only one core, so there could be no
        parallel speedup anyway.
  Drawback: L4 is in charge of scheduling user threads
    so L4/Linux couldn't enforce Linux's notions of priority &c

The L4/Linux server maps all user memory into its own address space
  (really, it allocates lots of memory, then gives its own memory
   to user processes)
  uses this for copyin()/copyout(), to dereference user pointers
    from system calls
  this keeps system-call IPCs small -- the data's address, not the data itself
  the Linux server also uses its memory access for fork() and exec()

Example: how does fork() work? (see the sketch at the end of this section)
  process P1 calls fork() (P1 is really an L4 task)
  P1's libc turns fork() into an IPC to the L4/Linux server
  L4/Linux asks L4 to create a new task and thread -- P2
  L4/Linux allocates memory pages (as many as P1 has)
  L4/Linux uses IPC to tell L4 to map the pages into P2
  L4/Linux copies data from P1's pages to P2's pages
  L4/Linux sends a special IPC to P2 with SP and PC to cause it to run
  L4/Linux sends the reply to P1 via IPC

The L4/Linux server acts as the pager for user processes
  so L4 turns process page faults into IPCs to the Linux server
  for e.g. copy-on-write fork, lazy allocation, memory-mapped files

Drawback: L4 doesn't allow direct control over page tables
  so the Linux server could not make its page table include user
    virtual addresses
  until recently, real Linux used this trick to gain performance (no page
    table switch on system calls), and for convenience in dereferencing
    syscall arguments

The L4/Linux server uses Linux device drivers unchanged!
  since L4 allows it direct access to device registers
  except that interrupts arrive via L4 IPC

How to evaluate?
  What are some questions that the paper might answer?
  It's not really about whether microkernels are a good idea.
  Its main goal is to show they have good performance.
  What kind of performance do we care about?
    Is IPC fast? -> microbenchmark
    Is there some other performance obstacle? -> whole-system benchmarks

IPC microbenchmarks
  Table 2
  getpid() is one system call on native Linux,
    and two L4 system calls (IPC send, IPC recv) on L4/Linux
  nice result: takes only somewhat more than 2x as long on L4/Linux
    and FAR faster than Mach+Linux server
  What do we think the impact of syscalls taking 2x as long might be?
    Disaster? Hardly noticeable?
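Returning to the fork() walkthrough above, here is a hedged sketch in C of
the server-side handler. All the helper names (l4_create_task, l4_map_page,
&c) and the proc struct are invented stand-ins for L4 calls and server
bookkeeping; the steps follow the lecture's walkthrough.

  /* Sketch: L4/Linux server handling a fork() IPC from process P1. */
  #include <stdint.h>

  typedef uint64_t tid_t;
  struct proc {                  /* server's record of one Linux process */
    tid_t task;
    int npages;
    void **pages;                /* server's mappings of the process memory */
    uint64_t sp, pc;             /* saved user registers */
  };

  /* Assumed L4 and server primitives (hypothetical names). */
  extern tid_t l4_create_task(void);                  /* new task + thread */
  extern void *server_alloc_page(void);
  extern void l4_map_page(tid_t task, int vpn, void *pg); /* map via IPC */
  extern void l4_start_thread(tid_t task, uint64_t sp, uint64_t pc);
  extern void ipc_reply(tid_t task, int retval);
  extern void memcpy_page(void *dst, void *src);
  extern int alloc_pid(void);

  void
  handle_fork(struct proc *p1)
  {
    tid_t child = l4_create_task();     /* ask L4 for a new task, P2 */

    for(int i = 0; i < p1->npages; i++){
      void *pg = server_alloc_page();   /* server hands out its own memory */
      memcpy_page(pg, p1->pages[i]);    /* copy P1's pages to P2's pages */
      l4_map_page(child, i, pg);        /* IPC asking L4 to map into P2 */
    }
    /* (server bookkeeping for P2 -- pid table, page list, &c -- elided) */

    l4_start_thread(child, p1->sp, p1->pc); /* special IPC with SP and PC;
                                               P2 resumes as if its fork()
                                               had returned 0 */
    ipc_reply(p1->task, alloc_pid());       /* reply to P1's fork() IPC
                                               with the child's pid */
  }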
Whole-system benchmark: AIM
  AIM forks a bunch of processes
    each randomly uses the disk, allocates memory, uses pipes, computes, &c
    to do a fixed amount of total work
  Figure 8
    x-axis shows [some function of] the number of concurrent AIM processes
    y-axis shows the time for all processes to complete
    only the slope really matters
      slope is time per unit of work, so lower is better
    native Linux is best, but L4Linux is only a little slower
    Mach+Linux is noticeably less efficient
  Conclusions:
    2x IPC time doesn't seem to make much overall difference
    L4+Linux is only somewhat slower than Linux
    L4+Linux is significantly faster than Mach+Linux
    These results are not by themselves an argument for using L4
      but they are an argument against rejecting L4 due to performance worries

What's the current situation?
  Microkernels are sometimes used for embedded computing
    microcontrollers, Apple's "enclave" processor
    running custom software
  Microkernels, as such, never caught on for general computing
    no compelling story for why one should switch from Linux &c
  Many ideas from microkernel research have been adopted into modern UNIXes
    Mach spurred adoption of sophisticated virtual memory support
    virtual machines are partially a response to the O/S-server idea
    loadable kernel modules are a response to the need for extensibility
    client/server designs, e.g. DNS server, window server
    MacOS has microkernel-style IPC

References:
  The Fiasco.OC Microkernel -- a current L4 descendant
    https://l4re.org/doc/
  Fast IPC in L4:
    https://cs.nyu.edu/~mwalfish/classes/15fa/ref/liedtke93improving.pdf
  Later evolution of L4:
    https://ts.data61.csiro.au/publications/nicta_full_text/8988.pdf