6.1810 2025 L4: Operating System Organization, Microkernels

Topic: What should a kernel do?
  What should its abstractions / system calls look like?
  Is more than one approach possible?
  This topic is more about ideas and less about specific mechanisms

The "traditional approach" -- UNIX, Linux, xv6
  1) big abstractions, and 2) a "monolithic" kernel implementation

Big abstractions
  example: file system with names, directories, permissions, &c
  for programmers: convenience, portability
    e.g. vs a raw disk
  for kernel: help sharing and managing resources
    e.g. kernel in charge of allocating disk space to different programs
  for kernel: help with security
    e.g. file permissions

Big abstractions have led to big "monolithic" kernels
  kernel is one big program, like xv6
  easy for kernel sub-systems to cooperate -- no irritating boundaries
    exec() integrated with kernel's process, memory, and file code
  all kernel code runs with high privilege -- no internal security restrictions

What's wrong with monolithic kernels / big abstractions?
  big => complex => perhaps buggy/insecure
  over-general => perhaps slow
    how much code executes to send one byte via a UNIX pipe?
    buffering, locks, sleep/wakeup, scheduler
  big abstractions hide/enforce lots of design decisions, perhaps awkwardly
    maybe I want to wait for a process that's not my child
    maybe I want to change another process's address space
    maybe DB is better at laying out B-Tree files on disk than kernel FS

Microkernels -- a different approach -- e.g. L4, Mach
  main goal: simplest possible kernel
  main technique: move most O/S functionality to user-space service processes
  [diagram: h/w, kernel, apps, services (FS disk VM TCP NIC display)]
  what does the kernel provide?
    address spaces, memory management, threads, IPC
    IPC = Inter-Process Communication
  1980s saw big burst of research on microkernel designs
    CMU's Mach perhaps the most influential

What did people hope to gain from microkernels?
  clean slate -> more elegance
  small -> fewer bugs -> more secure
  small -> verifiable (see seL4)
  small -> easier to optimize
  small -> fewer design decisions forced on applications
  user-level -> force more modular O/S services
  user-level -> easier to extend / customize / replace user-level services
  user-level -> more robust -- restart individual user-level services
    most bugs are in drivers, get them out of the kernel!

Design challenges?
  what's the minimum useful functionality? is there more than one minimum?
  will it know enough to enforce security? e.g. w/o knowing about users?
  will programs be able to share e.g. disk and net?
    without a kernel that provides file/directory/socket abstractions?
  will performance be good without monolithic kernel's integration?
  will kernel understand enough to allocate resources well?
    which processes should get CPU, memory, disk access, &c

Pragmatic concerns
  What's the target? user workstations? web servers? embedded?
  Existing apps require a full O/S, e.g. UNIX, Windows, &c
    might be a lot of work to re-implement UNIX interface on a microkernel
  How to persuade people to switch to your new microkernel?

L4
  has evolved over time, many versions and re-implementations
  used commercially today, in phones and embedded controllers
  representative of the micro-kernel approach
  emphasis on minimality: a few dozen system calls (Linux has 300+)
  13,000 lines of code

L4 basic abstractions
  [diagram]
  address space ("task")
  page mappings
  thread
  IPC (Inter-Process Communication)

What's not in the L4 kernel, compared to e.g. Linux?
  Almost everything!
  file system, fork(), exec(), pipes, device drivers, network stack, &c
  If you want these, they have to be user-level code
    library or server process
    and built out of what little L4 does provide

L4 system calls:
  create a task / address space
  create/destroy a thread
  send/recv message via IPC (addresses are thread IDs)
  intercept another address space's page faults -- "pager"
    and change target's mappings
  access device hardware
  receive device interrupts via IPC

Example: exec() on L4
  an existing task T1 wants to create a new task
    and have it execute instructions from a file
  assume an FS user-level service
  1. T1 creates a new task T2 -- it will have no memory!
  2. T1 tells L4 to start T2 executing at PC=0
  3. T2 will immediately access invalid memory and page-fault
  4. L4 kernel sends page fault info via IPC to T1
  5. T1 asks FS server to read file block, via IPC
  6. T1 now has file block in its memory
  7. T1 tells L4 to map file block in T2's memory, and resume T2
  (page-fault-driven file access often called "demand paging")

problem: IPC performance
  Microkernel programs do lots of IPC!
  Was expensive in early systems
    multiple kernel crossings, cache misses, context switches, &c
  Cost of IPC caused many to dismiss microkernels
  L4 designers put huge effort into IPC performance

Here's a slow IPC design patterned on UNIX pipes
  [diagram, message queue in kernel]
  send(id, msg)
    copy msg to a queue in the kernel, return
  recv(&id, &data)
    if msg waiting in queue, copy to user space, return
    otherwise wait for send()
  called "asynchronous" and "buffered"
  the usual request-response pattern (RPC) would be:
    [diagram: 2nd message queue for replies]
    4 system calls (user->kernel->user)
      send() recv() recv() send() (recv() returns)
    or eight user/kernel or kernel/user crossings
      each is slow!
    four message copies (two for request, two for reply)

L4's fast IPC
  "Improving IPC by Kernel Design," Jochen Liedtke, 1993
  * combined call() and sendrecv() system calls
    IPC almost always used as request-response RPC
    thus wasteful to use separate send() and recv() system calls
    [diagram]
    client: call(): send a message, obtain response
    server: sendrecv(): reply to one request, wait for the next one
    2x reduction in user/kernel crossings
  * synchronous
    [diagram]
    call() waits for target thread's sendrecv()
    common case: target is already waiting in sendrecv()
    call() returns into target, as if returning from sendrecv()
  * unbuffered
    synchronous => both src and dst user buffer addresses known => direct copy
    no need to copy to/from kernel buffer
  * small messages in registers, not memory
  * big messages as page mappings
    send() maps sender's pages into target's address space
    recv() specifies address in target
    again, no copy
  * careful layout of kernel code to minimize cache footprint
  result: 20x reduction in IPC cost
  In this case, simplicity did enable optimization -- nice.

How to build a full operating system on a microkernel?
  Need a set of user-level servers.
    File system, device drivers, network stack, &c
  For embedded systems this can be fairly simple.
  What about services for general-purpose use, e.g. workstations, web servers?
    Really need compatibility for existing applications.
    E.g. the system needs to mimic something like UNIX.
  Re-implement UNIX kernel services as lots of user-level service tasks?
  Idea: run existing Linux kernel as a process on top of the microkernel.
    An "O/S server". Perhaps not elegant, but pragmatic.
    Part of a path to adoption:
      Users might start by just running Linux apps.
      Then gradually exploit possibilities of underlying microkernel.
  Which brings us to today's paper on L4/Linux.
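The cost difference between the two IPC designs can be sketched with a toy
accounting model -- this is my illustration, not real kernel code; the
syscall, crossing, and copy counts simply follow the analysis above:

```python
# Toy accounting model of one request-response (RPC) exchange, comparing
# the pipe-style buffered design with L4's synchronous call()/sendrecv().
# Counts follow the lecture's analysis, not measurements.

def buffered_pipe_rpc():
    """UNIX-pipe-style IPC: send()/recv() through a kernel message queue."""
    syscalls = 4              # client send(), server recv(), server send(), client recv()
    crossings = 2 * syscalls  # each system call enters and leaves the kernel
    copies = 4                # request: user->kernel, kernel->user; reply: same
    return syscalls, crossings, copies

def l4_fast_rpc():
    """L4-style IPC: client call() rendezvous with server sendrecv()."""
    syscalls = 2              # call() + sendrecv()
    crossings = 2 * syscalls  # half as many user/kernel crossings
    copies = 2                # one direct user-to-user copy per message
                              # (registers / page mappings can make this 0)
    return syscalls, crossings, copies

if __name__ == "__main__":
    print("pipe-style: %d syscalls, %d crossings, %d copies" % buffered_pipe_rpc())
    print("L4-style:   %d syscalls, %d crossings, %d copies" % l4_fast_rpc())
```

The model shows where the 2x reduction in crossings and the elimination of
kernel buffering come from; the remaining 20x improvement comes from the
register/mapping tricks and cache-conscious kernel layout listed above.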
"The Performance of micro-Kernel-Based Systems", by Hartig et al, 1997

Why reading papers is useful but hard:
  new ideas often described mainly in research papers
  typically only the new material is explained
    hard to read if you're not familiar with paper's context
  strategies:
    extract main ideas without getting lost in details
    first few pages usually outline the main points
    evaluation (at end) often reveals the bottom line
  will get easier as you learn more, read more

basic picture
  [diagram]
  L4 kernel
  Linux kernel server
  one L4 task per Linux process
  IPC for Linux system calls

What does it mean to run a Linux kernel at user-level?
  The Linux kernel is just a program!
  The authors modified Linux in small ways,
    replacing hardware access with L4 system calls or IPC:
    process creation, configuring user page tables, user memory allocation,
    system call handling, interrupt handling.
  No changes to Linux file system, network protocols, device drivers, &c.

L4/Linux system calls via IPC
  Each Linux process is an L4 task
  Linux server is mostly a single L4 thread, waiting for IPC
  System call: IPC to Linux server
    (process blocks, waiting for IPC reply)
  Linux server executes system call implementation
  Send IPC reply (L4 delivers, process can resume)
  Linux server waits for next system call IPC

L4/Linux server allocates all memory, hands out to processes
  so all user memory also mapped into Linux server's address space
  uses this for copyin()/copyout(), to dereference user pointers from sys calls
  this keeps system call IPCs small -- data address, not the data itself
  Linux server also uses its memory access for fork() and exec()

Example: how does L4/Linux fork() work?
  process P1 calls fork() (P1 is really an L4 task)
  P1's libc library turns fork() into an IPC to L4/Linux server
  L4/Linux allocates some of its pages for P2, copies P1's mem.
  L4/Linux asks L4 to create a new task -- P2
  L4/Linux sends special IPC to P2 with SP and PC to cause it to run
  L4/Linux gets external pager faults from P2, maps pages one by one

How to evaluate?
  A paper must evaluate whether its ideas are good!
    Preferably with objective measurements.
  What does "good" mean for this paper?
    It's not really about whether microkernels are a good idea.
    Its main goal is to show they can have good performance.
  What kind of performance might readers care about?
    Is IPC fast? -> microbenchmark
    Is there some other performance obstacle? -> whole-system benchmarks
  What to compare against?
    Linux, to help persuade people to switch to L4
    Mach/Linux, to show L4 fixes bad performance of older microkernels

IPC microbenchmarks
  Table 2 -- cost of getpid() system call
    Linux:      1.68 us
    L4/Linux:   3.95 us
    Mach/Linux: 15.41 us
  getpid() is one system call on native Linux
    but two L4 system calls (call, sendrecv) to Linux server task
    for two IPC messages (request, reply)
  nice result: takes only somewhat more than 2x as long on L4/Linux
    FAR faster than Mach+LinuxServer
    why is Mach slow? the paper doesn't say, sadly.
  What do we think the impact of 2x syscall overhead might be?
    Disaster? Hardly noticeable?
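One way to think about that question is an Amdahl's-Law-style estimate --
my own back-of-the-envelope arithmetic, not the paper's: if system calls
account for a fraction f of a program's runtime, only that fraction pays
the measured getpid() slowdown from Table 2.

```python
# Back-of-the-envelope estimate (not from the paper): how much does the
# measured per-syscall slowdown (Table 2) cost a whole program, if
# system calls account for a fraction f of its runtime?

LINUX_US = 1.68      # native Linux getpid(), microseconds (Table 2)
L4LINUX_US = 3.95    # L4/Linux getpid(), microseconds (Table 2)

def overall_slowdown(f):
    """Amdahl-style model: only the syscall fraction f gets slower."""
    per_syscall = L4LINUX_US / LINUX_US   # about 2.35x per syscall
    return (1 - f) + f * per_syscall

if __name__ == "__main__":
    for f in (0.01, 0.10, 0.50):
        print("syscall fraction %4.0f%% -> %.2fx total runtime"
              % (100 * f, overall_slowdown(f)))
```

Under this model a program spending 10% of its time in system calls slows
by only about 14%, and even a 50% syscall-bound program less than 2x; the
whole-system AIM benchmark checks this intuition empirically.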
Whole-system benchmark: AIM
  Figure 8
  AIM forks a bunch of processes
    each randomly uses the disk, allocates memory, uses pipes, computes, &c
    to do a fixed amount of total work
  x-axis shows [some function of] number of concurrent AIM processes
    thus total amount of work
  y-axis shows time to complete all work
  Only the slope really matters
    slope is time per unit of work, so lower is better
  Native Linux is best, but L4Linux is only a little slower
  Mach+Linux is noticeably less efficient

Conclusions:
  2x IPC time doesn't seem to make much overall difference
  L4+Linux is only somewhat slower than Linux
  L4+Linux is significantly faster than Mach+Linux
  These results are not by themselves an argument for using L4
  But they are an argument against rejecting L4 due to performance worries

What's the current situation?
  Microkernels are sometimes used for embedded computing
    microcontrollers, Apple "enclave" processor
    running custom software
  Microkernels never caught on for general computing
    no compelling story for why one should switch from Linux &c
  Many ideas from microkernel research have been adopted into modern UNIXes
    sophisticated virtual memory support
    threads in user programs
    extensibility (but via loading code into kernel)
    IPC and user-level services

Next lecture: Page tables

References:
  L4 details
    http://www.cse.unsw.edu.au/~cs9242/02/lectures/01-l4.pdf
    http://www.cse.unsw.edu.au/~cs9242/02/lectures/01-l4/01-l4.html
  fast IPC in L4
    https://cs.nyu.edu/~mwalfish/classes/15fa/ref/liedtke93improving.pdf
  later evolution of L4
    https://trustworthy.systems/publications/nicta_full_text/8988.pdf
  an earlier paper on the ideas behind L4:
    https://www.cs.fsu.edu/~awang/courses/cop5611_s2004/microkernel.pdf
  The Fiasco.OC Microkernel -- a current L4 descendant
    https://l4re.org/doc/