6.1810 2025 L4: Operating System Organization, Microkernels

Topic: What should a kernel do?
  What should its abstractions / system calls look like?
  Is more than one approach possible?
  This topic is more about ideas and less about specific mechanisms

The "traditional approach" -- UNIX, Linux, xv6
  1) big abstractions, and 2) a "monolithic" kernel implementation

Big abstractions
  example: file system with names, directories, permissions, &c
  for programmers: convenience, portability
    e.g. vs a raw disk
  for kernel: help sharing and managing resources
    e.g. kernel in charge of allocating disk space to different programs
  for kernel: help with security
    e.g. file permissions

Big abstractions have led to big "monolithic" kernels
  kernel is one big program, like xv6
  easy for kernel sub-systems to cooperate -- no irritating boundaries
    exec() integrated with kernel's process, memory, and file code
  all kernel code runs with high privilege -- no internal security restrictions

What's wrong with monolithic kernels / big abstractions?
  big => complex => perhaps buggy/insecure
  over-general => perhaps slow
    how much code executes to send one byte via a UNIX pipe?
    buffering, locks, sleep/wakeup, scheduler
  big abstractions hide/enforce lots of design decisions, perhaps awkwardly
    maybe I want to wait for a process that's not my child
    maybe I want to change another process's address space
    maybe DB is better at laying out B-Tree files on disk than kernel FS

Microkernels -- a different approach -- e.g. L4, Mach
  main goal: simplest possible kernel
  main technique: move most O/S functionality to user-space service processes
  [diagram: h/w, kernel, apps, services (FS disk VM TCP NIC display)]
  what does the kernel provide?
    address spaces, memory management, threads, IPC
    IPC = Inter-Process Communication
  1980s saw big burst of research on microkernel designs
    CMU's Mach perhaps the most influential

What did people hope to gain from microkernels?
  clean slate -> more elegance
  small -> fewer bugs -> more secure
  small -> verifiable (see seL4)
  small -> easier to optimize
  small -> fewer design decisions forced on applications
  user-level -> force more modular O/S services
  user-level -> easier to extend / customize / replace user-level services
  user-level -> more robust -- restart individual user-level services
    most bugs are in drivers, get them out of the kernel!

Design challenges?
  what's the minimum useful functionality? is there more than one minimum?
  will it know enough to enforce security? e.g. w/o knowing about users?
  will programs be able to share e.g. disk and net?
    without a kernel that provides file/directory/socket abstractions?
  will performance be good without monolithic kernel's integration?
  will kernel understand enough to allocate resources well?
    which processes should get CPU, memory, disk access, &c

Pragmatic concerns
  What's the target? user workstations? web servers? embedded?
  Existing apps require a full O/S, e.g. UNIX, Windows, &c
    might be a lot of work to re-implement UNIX interface on a microkernel
  How to persuade people to switch to your new microkernel?

L4
  has evolved over time, many versions and re-implementations
  used commercially today, in phones and embedded controllers
  representative of the micro-kernel approach
  emphasis on minimality: a few dozen system calls (Linux has 300+)
  13,000 lines of code

L4 basic abstractions
  [diagram]
  address space ("task")
  page mappings
  thread
  IPC (Inter-Process Communication)

What's not in the L4 kernel, compared to e.g. Linux?
  Almost everything!
  file system, fork(), exec(), pipes, device drivers, network stack, &c
  If you want these, they have to be user-level code
    library or server process
    and built out of what little L4 does provide

L4 system calls:
  create a task / address space
  create/destroy a thread
  send/recv message via IPC (addresses are thread IDs)
  intercept another address space's page faults -- "pager"
    and change target's mappings
  access device hardware
  receive device interrupts via IPC

Example: exec() on L4
  an existing task T1 wants to create a new task
    and have it execute instructions from a file
  assume an FS user-level service
  1. T1 creates a new task T2 -- it will have no memory!
  2. T1 tells L4 to start T2 executing at PC=0
  3. T2 will immediately access invalid memory and page-fault
  4. L4 kernel sends page fault info via IPC to T1
  5. T1 asks FS server to read file block, via IPC
  6. T1 now has file block in its memory
  7. T1 tells L4 to map file block in T2's memory, and resume T2
  (page-fault-driven file access often called "demand paging")

problem: IPC performance
  Microkernel programs do lots of IPC!
  Was expensive in early systems
    multiple kernel crossings, cache misses, context switches, &c
  Cost of IPC caused many to dismiss microkernels
  L4 designers put huge effort into IPC performance

Here's a slow IPC design patterned on UNIX pipes
  [diagram, message queue in kernel]
  send(id, msg)
    copy msg to a queue in the kernel, return
  recv(&id, &data)
    if msg waiting in queue, copy to user space, return
    otherwise wait for send()
  called "asynchronous" and "buffered"
  the usual request-response pattern (RPC) would be:
    [diagram: 2nd message queue for replies]
    4 system calls (user->kernel->user)
      send() recv() recv() send() (recv() returns)
    or eight user/kernel or kernel/user crossings
      each is slow!
    four message copies (two for request, two for reply)

L4's fast IPC
  "Improving IPC by Kernel Design," Jochen Liedtke, 1993
  * combined call() and sendrecv() system calls
    IPC almost always used as request-response RPC
    thus wasteful to use separate send() and recv() system calls
    [diagram]
    client: call(): send a message, obtain response
    server: sendrecv(): reply to one request, wait for the next one
    2x reduction in user/kernel crossings
  * synchronous
    [diagram]
    call() waits for target thread's sendrecv()
    common case: target is already waiting in sendrecv()
    call() returns into target, as if returning from sendrecv()
  * unbuffered
    synchronous => both src and dst user buffer addresses known => direct copy
    no need to copy to/from kernel buffer
  * small messages in registers, not memory
  * big messages as page mappings
    send() maps sender's pages into target's address space
    recv() specifies address in target
    again, no copy
  * careful layout of kernel code to minimize cache footprint
  result: 20x reduction in IPC cost
  In this case, simplicity did enable optimization -- nice.

How to build a full operating system on a microkernel?
  Need a set of user-level servers.
    File system, device drivers, network stack, &c
  For embedded systems this can be fairly simple.
  What about services for general-purpose use, e.g. workstations, web servers?
    Really need compatibility for existing applications.
    E.g. the system needs to mimic something like UNIX.
  Re-implement UNIX kernel services as lots of user-level service tasks?
  Idea: run existing Linux kernel as a process on top of the microkernel.
    An "O/S server". Perhaps not elegant, but pragmatic.
    Part of a path to adoption:
      Users might start by just running Linux apps.
      Then gradually exploit possibilities of underlying microkernel.
  Which brings us to today's paper on L4/Linux.
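The cost difference between the two IPC designs can be sketched with a toy
accounting model -- this is my illustration, not real kernel code; the
syscall, crossing, and copy counts simply follow the analysis above:

```python
# Toy accounting model of one request-response (RPC) exchange, comparing
# the pipe-style buffered design with L4's synchronous call()/sendrecv().
# Counts follow the lecture's analysis, not measurements.

def buffered_pipe_rpc():
    """UNIX-pipe-style IPC: send()/recv() through a kernel message queue."""
    syscalls = 4              # client send(), server recv(), server send(), client recv()
    crossings = 2 * syscalls  # each system call enters and leaves the kernel
    copies = 4                # request: user->kernel, kernel->user; reply: same
    return syscalls, crossings, copies

def l4_fast_rpc():
    """L4-style IPC: client call() rendezvous with server sendrecv()."""
    syscalls = 2              # call() + sendrecv()
    crossings = 2 * syscalls  # half as many user/kernel crossings
    copies = 2                # one direct user-to-user copy per message
                              # (registers / page mappings can make this 0)
    return syscalls, crossings, copies

if __name__ == "__main__":
    print("pipe-style: %d syscalls, %d crossings, %d copies" % buffered_pipe_rpc())
    print("L4-style:   %d syscalls, %d crossings, %d copies" % l4_fast_rpc())
```

The model shows where the 2x reduction in crossings and the elimination of
kernel buffering come from; the remaining 20x improvement comes from the
register/mapping tricks and cache-conscious kernel layout listed above.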
"The Performance of micro-Kernel-Based Systems", by Hartig et al, 1997

Why reading papers is useful but hard:
  new ideas often described mainly in research papers
  typically only the new material is explained
    hard to read if you're not familiar with paper's context
  strategies:
    extract main ideas without getting lost in details
    first few pages usually outline the main points
    evaluation (at end) often reveals the bottom line
  will get easier as you learn more, read more

basic picture
  [diagram]
  L4 kernel
  Linux kernel server
  one L4 task per Linux process
  IPC for Linux system calls

What does it mean to run a Linux kernel at user-level?
  The Linux kernel is just a program!
  The authors modified Linux in small ways,
    replacing hardware access with L4 system calls or IPC:
    process creation, configuring user page tables, user memory allocation,
    system call handling, interrupt handling.
  No changes to Linux file system, network protocols, device drivers, &c.

L4/Linux system calls via IPC
  Each Linux process is an L4 task
  Linux server is mostly a single L4 thread, waiting for IPC
  System call: IPC to Linux server
    (process blocks, waiting for IPC reply)
  Linux server executes system call implementation
  Send IPC reply (L4 delivers, process can resume)
  Linux server waits for next system call IPC

L4/Linux server allocates all memory, hands out to processes
  so all user memory also mapped into Linux server's address space
  uses this for copyin()/copyout(), to dereference user pointers from sys calls
  this keeps system call IPCs small -- data address, not the data itself
  Linux server also uses its memory access for fork() and exec()

Example: how does L4/Linux fork() work?
  process P1 calls fork() (P1 is really an L4 task)
  P1's libc library turns fork() into an IPC to L4/Linux server
  L4/Linux allocates some of its pages for P2, copies P1's mem.
  L4/Linux asks L4 to create a new task -- P2
  L4/Linux sends special IPC to P2 with SP and PC to cause it to run
  L4/Linux gets external pager faults from P2, maps pages one by one

How to evaluate?
  A paper must evaluate whether its ideas are good!
    Preferably with objective measurements.
  What does "good" mean for this paper?
    It's not really about whether microkernels are a good idea.
    Its main goal is to show they can have good performance.
  What kind of performance might readers care about?
    Is IPC fast? -> microbenchmark
    Is there some other performance obstacle? -> whole-system benchmarks
  What to compare against?
    Linux, to help persuade people to switch to L4
    Mach/Linux, to show L4 fixes bad performance of older microkernels

IPC microbenchmarks
  Table 2 -- cost of getpid() system call
    Linux:      1.68 us
    L4/Linux:   3.95 us
    Mach/Linux: 15.41 us
  getpid() is one system call on native Linux
    but two L4 system calls (call, sendrecv) to Linux server task
    for two IPC messages (request, reply)
  nice result: takes only somewhat more than 2x as long on L4/Linux
    FAR faster than Mach+LinuxServer
    why is Mach slow? the paper doesn't say, sadly.
  What do we think the impact of 2x syscall overhead might be?
    Disaster? Hardly noticeable?
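One way to think about that question is an Amdahl's-Law-style estimate --
my own back-of-the-envelope arithmetic, not the paper's: if system calls
account for a fraction f of a program's runtime, only that fraction pays
the measured getpid() slowdown from Table 2.

```python
# Back-of-the-envelope estimate (not from the paper): how much does the
# measured per-syscall slowdown (Table 2) cost a whole program, if
# system calls account for a fraction f of its runtime?

LINUX_US = 1.68      # native Linux getpid(), microseconds (Table 2)
L4LINUX_US = 3.95    # L4/Linux getpid(), microseconds (Table 2)

def overall_slowdown(f):
    """Amdahl-style model: only the syscall fraction f gets slower."""
    per_syscall = L4LINUX_US / LINUX_US   # about 2.35x per syscall
    return (1 - f) + f * per_syscall

if __name__ == "__main__":
    for f in (0.01, 0.10, 0.50):
        print("syscall fraction %4.0f%% -> %.2fx total runtime"
              % (100 * f, overall_slowdown(f)))
```

Under this model a program spending 10% of its time in system calls slows
by only about 14%, and even a 50% syscall-bound program less than 2x; the
whole-system AIM benchmark checks this intuition empirically.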
Whole-system benchmark: AIM
  Figure 8
  AIM forks a bunch of processes
    each randomly uses the disk, allocates memory, uses pipes, computes, &c
    to do a fixed amount of total work
  x-axis shows [some function of] number of concurrent AIM processes
    thus total amount of work
  y-axis shows time to complete all work
  Only the slope really matters
    slope is time per unit of work, so lower is better
  Native Linux is best, but L4Linux is only a little slower
  Mach+Linux is noticeably less efficient

Conclusions:
  2x IPC time doesn't seem to make much overall difference
  L4+Linux is only somewhat slower than Linux
  L4+Linux is significantly faster than Mach+Linux
  These results are not by themselves an argument for using L4
  But they are an argument against rejecting L4 due to performance worries

What's the current situation?
  Microkernels are sometimes used for embedded computing
    microcontrollers, Apple "enclave" processor
    running custom software
  Microkernels never caught on for general computing
    no compelling story for why one should switch from Linux &c
  Many ideas from microkernel research have been adopted into modern UNIXes
    sophisticated virtual memory support
    threads in user programs
    extensibility (but via loading code into kernel)
    IPC and user-level services

Next lecture: Page tables

References:
  L4 details
    http://www.cse.unsw.edu.au/~cs9242/02/lectures/01-l4.pdf
    http://www.cse.unsw.edu.au/~cs9242/02/lectures/01-l4/01-l4.html
  fast IPC in L4
    https://cs.nyu.edu/~mwalfish/classes/15fa/ref/liedtke93improving.pdf
  later evolution of L4
    https://trustworthy.systems/publications/nicta_full_text/8988.pdf
  an earlier paper on the ideas behind L4:
    https://www.cs.fsu.edu/~awang/courses/cop5611_s2004/microkernel.pdf
  The Fiasco.OC Microkernel -- a current L4 descendant
    https://l4re.org/doc/