6.1810 2024 L17: Operating System Organization, Microkernels

Topic:
  What should a kernel do?
  What should its abstractions / system calls look like?

Or:
  xv6 / Unix / Linux are similar in overall architecture
  Are there alternatives?
  Can we learn from them?
  This topic is more about ideas and less about specific mechanisms

The "traditional approach" -- UNIX, Linux, xv6
  1) powerful abstractions, and
  2) a "monolithic" kernel implementation

Traditional philosophy: powerful abstractions
  for programmers: convenience, portability
    files, not disk controller registers
  for kernel: help sharing and managing resources
    file/directory abstraction lets kernel be in charge of disk layout
  for kernel: help with security
    file permissions

Powerful abstractions have led to big "monolithic" kernels
  kernel is one big program, like xv6
  easy for kernel sub-systems to cooperate -- no irritating boundaries
    exec() and mmap() enjoy close integration with both FS and VM system
  all kernel code runs with high privilege -- no internal security restrictions

What's wrong with monolithic kernels / big abstractions?
  big => complex => perhaps buggy/insecure
  over-general => perhaps slow
    how much code executes to send one byte via a UNIX pipe?
      buffering, locks, sleep/wakeup, scheduler
  big abstractions hide/enforce lots of design decisions, perhaps awkwardly
    maybe I want to wait for a process that's not my child
    maybe I want to change another process's address space
    maybe DB is better at laying out B-Tree files on disk than kernel FS

Microkernels -- a different approach
  main goal: simplest possible kernel
  main technique: move most O/S functionality to user-space service processes
  [diagram: h/w, kernel, services (FS disk VM TCP NIC display), apps]
  what does the kernel provide?
    address spaces, memory management, threads, IPC
    IPC = Inter-Process Communication
  1980s saw big burst of research on microkernel designs
    CMU's Mach perhaps the most influential
  outcome:
    lots of academic research
    many ideas adopted by traditional kernels
    microkernels used in some embedded systems, rarely visible

Why the interest in microkernels?
  elegant
  re-think from clean slate
  small -> more secure -- less privileged code -> fewer exploitable bugs
  small -> verifiable (see seL4)
  small -> easier to optimize
  small -> fewer design decisions forced on applications
  user-level -> may encourage modularity of O/S services
  user-level -> easier to extend / customize / replace user-level services
  user-level -> more robust -- restart individual user-level services
    most bugs are in drivers, get them out of the kernel!

Microkernel challenges
  What's the *minimum* possible kernel functionality?
  Want simple primitives -- but powerful enough to build exec, fork, mmap, &c
  Must still provide the rest of the O/S, at user level
  How to get good performance, despite IPC and less integration?
  How to encourage adoption?
  Compatibility with existing applications?

L4
  has evolved over time, many versions and re-implementations
  used commercially today, in phones and embedded controllers
  representative of the microkernel approach
  emphasis on minimality:
    a few dozen system calls (Linux has 300+)
    13,000 lines of code

L4 basic abstractions
  [diagram]
  address space ("task")
  page mappings
  thread
  IPC

L4 services -- some system calls, some special IPCs:
  create a task / address space 
  create/destroy a thread
  send/recv message via IPC (addresses are thread IDs)
  intercept another address space's page faults -- "pager"
    and change target's mappings
  access device hardware
  receive device interrupts via IPC

L4 kernel is missing almost everything that Linux or even xv6 has
  file system, fork(), exec(), pipes, device drivers, network stack, &c
  If you want these, they have to be user-level code
    library or server process

how do L4 external pagers work?
  every task has a pager task/thread
  1. page fault
  2. kernel suspends thread
  3. kernel sends fault info in IPC to pager
  4. pager picks one of its own pages
  5. pager sends virtual page address in IPC reply to faulting thread
  6. kernel intercepts IPC, maps in target, resumes target
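
Sketch of the six pager steps above, as a toy Python model (not L4 code):
  The Kernel and Pager classes, the 4096-byte page size, and the use of a plain
  method call in place of the two IPC messages are all assumptions made for
  illustration. Pages are just integers here.

```python
PAGE_SIZE = 4096  # assumed page size


class Pager:
    """User-level pager that hands out pages it owns (pages are just numbers)."""

    def __init__(self, own_pages):
        self.own_pages = list(own_pages)

    def handle_fault(self, task_id, fault_addr):
        # Step 4: pick one of the pager's own pages.
        page = self.own_pages.pop()
        # Step 5: the IPC reply names the page to map (modeled as a return).
        return page


class Kernel:
    def __init__(self):
        self.pagers = {}        # task id -> that task's Pager
        self.mappings = {}      # (task id, virtual page number) -> page
        self.suspended = set()  # tasks stopped on a page fault

    def page_fault(self, task_id, fault_addr):
        # Steps 1-2: a fault arrives and the kernel suspends the thread.
        self.suspended.add(task_id)
        # Step 3: send the fault info to the pager (IPC modeled as a call).
        page = self.pagers[task_id].handle_fault(task_id, fault_addr)
        # Step 6: kernel intercepts the reply, installs the mapping, resumes.
        self.mappings[(task_id, fault_addr // PAGE_SIZE)] = page
        self.suspended.discard(task_id)
        return page


kernel = Kernel()
kernel.pagers["t1"] = Pager([7, 8, 9])
page = kernel.page_fault("t1", 0x5123)
```

  Note that the kernel never decides which page to use; that policy lives
  entirely in the pager.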

what can you use an L4 pager for?
  allocating memory -- "sigma0" allocates on fault for early tasks
  on-demand creation of a child task's address space
  mmap of file

problem: IPC performance
  Microkernel programs do lots of IPC!
  Was expensive in early systems
    multiple kernel crossings, TLB misses, context switches, &c
  Cost of IPC caused many to dismiss microkernels
  L4 designers put huge effort into IPC performance

Here's a slow IPC design
  patterned on UNIX pipes
  [diagram, message queue in kernel]
  send(id, msg)
    append msg to queue in kernel, wakeup recv, return
  recv(&id, &data)
    if msg waiting in queue, remove, return
    otherwise sleep()
  called "asynchronous" and "buffered"
  now the usual request-response pattern (RPC) involves:
    [diagram: 2nd message queue for replies]
    4 system calls (user->kernel->user)
      send()
               recv()
               send()
      recv()
    or eight user/kernel or kernel/user crossings
      each disturbs CPU's caches (TLB, data, instruction)
    four message copies (two for request, two for reply)
    two context switches, two general-purpose schedulings
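
Sketch of the slow design above, to make the counting concrete:
  This is a toy single-threaded Python model, not kernel code. The BufferedIPC
  class, its counters, and the upper-casing "service" are invented for
  illustration; a real recv() would sleep rather than return None.

```python
from collections import deque


class BufferedIPC:
    """Toy model of asynchronous, buffered IPC (not real kernel code)."""

    def __init__(self):
        self.queues = {}    # receiver id -> deque of (sender id, message)
        self.syscalls = 0   # each send()/recv() is one system call
        self.copies = 0     # each message copy into or out of the kernel

    def send(self, src, dst, msg):
        self.syscalls += 1
        self.copies += 1    # copy: sender's user buffer -> kernel queue
        self.queues.setdefault(dst, deque()).append((src, bytes(msg)))

    def recv(self, me):
        self.syscalls += 1
        queue = self.queues.get(me)
        if not queue:
            return None     # a real kernel would sleep() here instead
        src, msg = queue.popleft()
        self.copies += 1    # copy: kernel queue -> receiver's user buffer
        return src, msg


def rpc_round_trip(ipc, client, server, request):
    """One request-response exchange; the "service" just upper-cases bytes."""
    ipc.send(client, server, request)       # client: send()
    src, req = ipc.recv(server)             # server: recv()
    ipc.send(server, src, req.upper())      # server: send() the reply
    _, reply = ipc.recv(client)             # client: recv()
    return reply


ipc = BufferedIPC()
reply = rpc_round_trip(ipc, "client", "server", b"ping")
crossings = 2 * ipc.syscalls   # each system call enters and leaves the kernel
```

  After one round trip the counters read 4 system calls, 8 crossings, and
  4 copies, matching the tally above.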

L4's fast IPC
  "Improving IPC by Kernel Design," Jochen Liedtke, 1993
  * synchronous
    [diagram]
    send() waits for target thread's recv()
    common case: target is already waiting in recv()
    send() returns into target, as if returning from recv()
      no real context switch, no scheduler loop
  * unbuffered
    synchronous => both src and dst user buffer address known => direct copy
    no need to copy to/from kernel buffer
  * small messages in registers
    send() returns as if from target's recv(), leaving the sender's register contents in place as the message
  * huge messages as page mappings
    send() maps sender's pages into target's address space
    recv() specifies address in target
    again, no copy
  * combined call() and sendrecv() system calls
    IPC almost always used as request-response RPC
    thus wasteful to use separate send() and recv() system calls
    [diagram]
    client: call(): send a message, obtain response
    server: sendrecv(): reply to one request, wait for the next one
    2x reduction in user/kernel crossings
  * careful layout of kernel code to minimize cache footprint
  result: 20x reduction in IPC cost
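
Sketch of the per-RPC accounting for synchronous IPC with call()/sendrecv():
  A toy Python cost model, not L4 code. SyncIPC, its counters, and the
  in_registers flag are invented for illustration, and it assumes the common
  case in which the server is already waiting.

```python
class SyncIPC:
    """Toy cost model of synchronous, unbuffered IPC (not real L4 code)."""

    def __init__(self, handler):
        self.handler = handler   # stands in for the server's user-level code
        self.syscalls = 0
        self.copies = 0

    def call(self, request, in_registers=False):
        # Client's one system call: call() sends and then waits for the reply.
        self.syscalls += 1
        # Assumption: the server is already blocked in sendrecv(), so the
        # kernel knows both user buffers and copies directly between them.
        # A message that fits in registers is not copied through memory.
        if not in_registers:
            self.copies += 1
        reply = self.handler(request)
        # Server's one system call: sendrecv() replies and waits for the next.
        self.syscalls += 1
        if not in_registers:
            self.copies += 1
        return reply


ipc = SyncIPC(lambda req: req.upper())
reply = ipc.call(b"ping")
crossings = 2 * ipc.syscalls   # each system call enters and leaves the kernel

reg_ipc = SyncIPC(lambda req: req.upper())
reg_reply = reg_ipc.call(b"hi", in_registers=True)
```

  Under these assumptions one RPC costs 2 system calls, 4 crossings, and at
  most 2 copies, versus 4, 8, and 4 for the buffered design.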

How to build a full operating system on a microkernel?
  Remember the idea was to move most features into user-level servers.
    File system, device drivers, network stack, process control, &c
  For embedded systems this can be fairly simple.
  What about services for general-purpose use, e.g. workstations, web servers?
  Really need compatibility for existing applications.
    E.g. the system needs to mimic something like UNIX.
  Re-implement UNIX kernel services as lots of user-level service tasks?

Idea: run existing Linux kernel as a process on top of the microkernel.
  An "O/S server".
  Perhaps not elegant, but pragmatic.
  Part of a path to adoption:
    Users might start by just running Linux apps.
    Then gradually exploit possibilities of underlying microkernel.

Which brings us to today's paper:
  "The Performance of micro-Kernel-Based Systems",
  by Härtig et al., 1997

basic picture
  [diagram]
  L4 kernel
  Linux kernel server
  one L4 task per Linux process
  IPC for Linux system calls

What does it mean to run a Linux kernel at user-level?
  The Linux kernel is just a program!
  The authors modified Linux in small ways
    replacing hardware access with L4 system calls or IPC.
  Process creation, configuring user page tables, user memory allocation,
    system call handling, interrupt handling.

L4/Linux system calls via IPC
  Each Linux process is an L4 task
  Linux server is mostly a single L4 thread, waiting for IPC
  System call:
    IPC to Linux server (process blocks, waiting for IPC reply)
    Linux server switches to that process's Linux kernel thread
    Execute system call implementation
    Send IPC reply (L4 delivers, process can resume)
    Linux server waits for next system call IPC
  Linux server has one internal thread per process
    Like xv6, many system calls may be blocked e.g. in wait()
  But an L4/Linux kernel thread switch has
    no relation to user process switching
  Instead, L4 does the switching
    among both the Linux kernel thread and the Linux process tasks
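
Sketch of the system-call loop above, as a toy Python model:
  Not L4/Linux code. The LinuxServer class, its inbox/replies structures, and
  the single getpid handler are invented for illustration; IPC is modeled as
  a queue, and the blocked process is implied rather than simulated.

```python
from collections import deque


class LinuxServer:
    """Toy model of the server thread that waits for system-call IPCs."""

    def __init__(self):
        self.inbox = deque()   # pending IPCs: (task id, syscall name, args)
        self.replies = {}      # task id -> result delivered by the reply IPC
        self.current = None    # process whose kernel context is active
        self.handlers = {"getpid": lambda task: task}

    def ipc_call(self, task, name, *args):
        # User-process side: send the IPC, then block awaiting the reply.
        self.inbox.append((task, name, args))

    def run_once(self):
        task, name, args = self.inbox.popleft()     # wait for next IPC
        self.current = task                         # switch to its kernel thread
        result = self.handlers[name](task, *args)   # execute the system call
        self.replies[task] = result                 # send IPC reply
        self.current = None                         # back to waiting


server = LinuxServer()
server.ipc_call(42, "getpid")
server.run_once()
```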

L4/Linux server allocates all memory, hands out to processes
  so all user memory also mapped into Linux server's address space
  uses this for copyin()/copyout(), to dereference user pointers from sys calls
  this keeps system call IPCs small -- data address, not the data itself
  Linux server also uses its memory access for fork() and exec()
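
Sketch of why sharing memory keeps system-call IPCs small:
  A toy Python model, not L4/Linux code. server_memory, console, and sys_write
  are invented names, and for simplicity the user address is used directly as
  an index into the server's memory (a real server would translate it).

```python
# All "physical memory" belongs to the server; user processes get parts of it.
server_memory = bytearray(4096)
console = bytearray()          # hypothetical output device


def copyin(addr, n):
    """Read n bytes of user data through the server's own mapping."""
    return bytes(server_memory[addr:addr + n])


def sys_write(addr, n):
    """Hypothetical write() handler: the IPC carried only addr and n."""
    data = copyin(addr, n)
    console.extend(data)
    return len(data)


# A user process stores a string in its memory (which the server also maps)...
server_memory[100:105] = b"hello"
# ...and its system-call IPC carries just a name and two small integers.
request = ("write", 100, 5)
result = sys_write(request[1], request[2])
```

  The message size is the same whether the process writes 5 bytes or 5 MB.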

Example: how does L4/Linux fork() work?
  process P1 calls fork() (P1 is really an L4 task)
  P1's libc library turns fork() into an IPC to L4/Linux server
  L4/Linux allocates some of its pages for P2,
    copies P1's memory via its mappings of physical memory
  L4/Linux asks L4 to create a new task -- P2
  L4/Linux sends special IPC to P2 with SP and PC to cause it to run
  L4/Linux gets external pager faults from P2, maps pages one by one
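
Sketch of the fork() steps above, as a toy Python model:
  Not L4 or Linux code. ForkModel and its fields are invented for
  illustration. It models the eager copy and the page-by-page mapping, and
  omits the IPC that starts P2 with its SP and PC.

```python
class ForkModel:
    """Toy model of the L4/Linux fork() steps (not real L4 or Linux code)."""

    def __init__(self):
        self.phys = []      # server-owned pages; index = physical page number
        self.image = {}     # pid -> {virtual page number: physical page number}
        self.mapped = {}    # pid -> pages L4 has actually mapped into the task

    def alloc(self, contents):
        self.phys.append(bytes(contents))
        return len(self.phys) - 1

    def fork(self, parent, child):
        # Allocate pages for the child and copy the parent's memory; the
        # server can do this because it has all of these pages mapped.
        self.image[child] = {
            vpn: self.alloc(self.phys[ppn])
            for vpn, ppn in self.image[parent].items()
        }
        # Ask L4 for a new task; it starts with nothing mapped.
        self.mapped[child] = {}

    def page_fault(self, pid, vpn):
        # The server is the child's pager: map one page per fault.
        ppn = self.image[pid][vpn]
        self.mapped[pid][vpn] = ppn
        return ppn


m = ForkModel()
m.image["P1"] = {0: m.alloc(b"code"), 1: m.alloc(b"stack")}
m.mapped["P1"] = dict(m.image["P1"])
m.fork("P1", "P2")
first = m.page_fault("P2", 0)
```

  After the first fault, P2 has exactly one page mapped: a private copy of
  P1's page 0.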

L4/Linux server acts as the pager for user processes
  so L4 turns process page faults into IPC to Linux server
  for e.g. copy-on-write fork, lazy allocation, memory mapped files

L4/Linux server uses Linux device drivers unchanged!
  since L4 allows it direct access to device registers
  except interrupts arrive via L4 IPC

How to evaluate?
  What are some questions that the paper might answer?
  It's not really about whether microkernels are a good idea.
  Its main goal is to show they can have good performance.

What kind of performance do we care about?
  Is IPC fast?
    -> microbenchmark
  Is there some other performance obstacle?
    -> whole-system benchmarks

IPC microbenchmarks
  Table 2
  getpid() is one system call on native Linux
    but two L4 system calls (call, sendrecv)
    for two IPC messages (request, reply)
  nice result: takes only somewhat more than 2x as long on L4/Linux
  FAR faster than Mach+LinuxServer
    why? the paper doesn't say, sadly.

What do we think the impact of syscalls taking 2x as long might be?
  Disaster?
  Hardly noticeable?

Whole-system benchmark: AIM
  AIM forks a bunch of processes
  Each randomly uses the disk, allocates memory, uses pipe, computes, &c
    To do a fixed amount of total work
  Figure 8 x-axis shows [some function of] number of concurrent AIM processes
    y-axis shows time for all processes to complete
  Only the slope really matters
    slope is time per unit of work, so lower is better
    Native Linux is best, but L4/Linux is only a little slower
    Mach+Linux is noticeably less efficient
  Conclusions:
    2x IPC time doesn't seem to make much overall difference
    L4/Linux is only somewhat slower than Linux
    L4/Linux is significantly faster than Mach+Linux
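
Sketch of why only the slope matters when reading a plot like Figure 8:
  The numbers below are made up for illustration and are NOT measurements
  from the paper; system_a and system_b are hypothetical systems. The slope()
  helper is an ordinary least-squares fit.

```python
def slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Made-up numbers, NOT measurements from the paper.
load = [1, 2, 3, 4]
system_a = [7, 9, 11, 13]      # 5 units of fixed overhead + 2 per unit of load
system_b = [3, 6, 9, 12]       # no fixed overhead, but 3 per unit of load
slope_a = slope(load, system_a)
slope_b = slope(load, system_b)
```

  system_a starts higher on the y-axis but has the lower slope, so it does
  each additional unit of work faster and wins as load grows.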

These results are not by themselves an argument for using L4
  But they are an argument against rejecting L4 due to performance worries

What's the current situation?
  Microkernels are sometimes used for embedded computing
    Microcontrollers, Apple "Secure Enclave" processor
    Running custom software
  Microkernels never caught on for general computing
    No compelling story for why one should switch from Linux &c
  Many ideas from microkernel research have been adopted into modern UNIXes
    Mach spurred adoption of sophisticated virtual memory support
    Loadable kernel modules are a response to need for extensibility
    IPC and user-level services are common

Next lecture:
  Another architectural direction: virtual machines

References:
L4 details
  http://www.cse.unsw.edu.au/~cs9242/02/lectures/01-l4.pdf
  http://www.cse.unsw.edu.au/~cs9242/02/lectures/01-l4/01-l4.html
fast IPC in L4
  https://cs.nyu.edu/~mwalfish/classes/15fa/ref/liedtke93improving.pdf
later evolution of L4
  https://trustworthy.systems/publications/nicta_full_text/8988.pdf
an earlier paper on the ideas behind L4:
  https://www.cs.fsu.edu/~awang/courses/cop5611_s2004/microkernel.pdf
The Fiasco.OC Microkernel -- a current L4 descendant
  https://l4re.org/doc/