6.S081 2020 L18: Operating System Organization, Microkernels

Topic: What should a kernel do?
  What should its abstractions / system calls look like?
  Answers depend on the application, and on programmer taste!
  There is no single best answer.
  This topic is more about ideas and less about specific mechanisms.

The traditional approach
  1) powerful abstractions, and 2) a "monolithic" kernel implementation
  UNIX, Linux, xv6

The philosophy behind traditional kernels is powerful abstractions:
  portable interfaces
    files, not disk controller registers
    address spaces, not MMU access
  simple interfaces that hide complexity
    all I/O via FDs and read/write, not specialized for each device &c
    address spaces with transparent disk paging
  abstractions help the kernel manage and share resources
    process abstraction lets kernel be in charge of scheduling
    file/directory abstraction lets kernel be in charge of disk layout
  abstractions help the kernel enforce security
    file permissions
    processes with private address spaces
  lots of indirection
    e.g. FDs, virtual addresses, file names, PIDs
    helps kernel virtualize, hide, revoke, schedule, &c

Powerful abstractions have led to big "monolithic" kernels
  kernel is one big program, like xv6
  easy for kernel sub-systems to cooperate -- no irritating boundaries
    exec() and mmap() are part of both the FS and VM systems
    relatively easy to add sym links, COW fork, mmap, &c
  all kernel code runs with high privilege -- no internal security restrictions

What's wrong with traditional kernels?
  big => complex => buggy/insecure
  perhaps over-general and thus slow
    how much code executes to send one byte via a UNIX pipe?
    buffering, locks, sleep/wakeup, scheduler
  many design decisions are baked in, can't be changed, may be awkward
    maybe I want to wait for a process that's not my child
    maybe I want to change another process's address space
    maybe a DB is better at laying out B-Tree files on disk than the kernel FS
  hard to create kernel "extensions" that others can use
    new device drivers, file systems, &c

Microkernels -- a different approach
  big idea: move most O/S functionality to user-space service processes
  [diagram: h/w, kernel, services (FS disk VM TCP NIC display), apps]
  kernel can be small
    address spaces, threads, IPC (inter-process communication)
    IPC lets threads send each other messages
  1980s saw a big burst of research on microkernel designs
    CMU's Mach perhaps the most influential
  used today in embedded systems, phone chips, car entertainment
  ideas (esp. user-level servers and IPC) influential
    e.g. in Windows and MacOS

Why the interest in microkernels?
  focused, elegant, clean slate
  small -> more security -- less code means fewer bugs to exploit
  small -> verifiable (see seL4)
  small -> easier to optimize
    you don't have to pay for features you don't use
  small -> avoids forcing design decisions on applications
  user-level -> may encourage modularity of O/S services
  user-level -> easier to extend / customize / replace user-level services
  user-level -> more robust -- restart individual user-level services
  most bugs are in drivers, get them out of the kernel!
  can run/emulate multiple O/Ses, like a VMM

Microkernel challenges
  What's a minimum kernel API?
    Need simple primitives on which to build exec, fork, mmap, &c
    Need to build the rest of the O/S at user level (see the sketch below)
  How to get good performance, despite IPC and less integration?
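To make "build the rest of the O/S at user level" concrete, here is a minimal
sketch in C of a user-level file server loop built on a hypothetical two-call
IPC API. ipc_send()/ipc_recv(), the message format, and the fs_* helpers are
invented for illustration; they are not real microkernel or L4 calls.

  /* Sketch: a user-level file server on an assumed minimal IPC API. */
  #include <stdint.h>

  typedef uint64_t tid_t;          /* IPC addresses are thread IDs */

  struct msg {
    int op;                        /* OP_READ or OP_WRITE */
    int fd;
    char buf[128];                 /* small message; big data would use
                                      page grants instead of copying */
  };

  enum { OP_READ = 1, OP_WRITE = 2 };

  /* Assumed kernel primitives (hypothetical names). */
  extern int ipc_recv(tid_t *sender, struct msg *m);
  extern int ipc_send(tid_t dest, struct msg *m);

  /* The server's internal file system implementation, elided. */
  extern int fs_read(int fd, char *buf, int n);
  extern int fs_write(int fd, char *buf, int n);

  void
  fileserver(void)
  {
    tid_t client;
    struct msg m;

    for(;;){
      ipc_recv(&client, &m);       /* wait for a request from any client */
      switch(m.op){
      case OP_READ:
        fs_read(m.fd, m.buf, sizeof(m.buf));
        break;
      case OP_WRITE:
        fs_write(m.fd, m.buf, sizeof(m.buf));
        break;
      }
      ipc_send(client, &m);        /* reply; client's read()/write() returns */
    }
  }

A client's read() would then be a library stub that sends an IPC to this
server and waits for the reply, so applications still see a UNIX-like
interface even though no file system code lives in the kernel.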
A case study: the L4 microkernel
  L4 has evolved over time, many versions and re-implementations
  used commercially today, in phones and embedded controllers
  representative of the micro-kernel approach
  emphasis on minimality:
    7 system calls (Linux has 300+, xv6 has 21)
    13,000 lines of code

L4 basic abstractions [diagram]
  address space ("task")
  thread
  IPC

L4 system calls:
  create an address space
  create/destroy a thread in [another] address space
  send/recv a message via IPC (addresses are thread IDs)
  map pages of your memory into another address space
    it must agree; this happens via IPC -- one task can modify
      another task's page table
    used to create new tasks, share memory
  intercept another address space's page faults -- "pager"
    kernel delivers faults via IPC
  access device hardware (not a system call, happens directly)
  handle device interrupts
    kernel delivers them via IPC

Note the L4 kernel is missing almost everything that Linux or even xv6 has
  file system, fork(), exec(), pipes, device drivers, network stack, &c
  If you want these, they have to be user-level code
    library or server process

How does L4 thread switching work?
  the current user-level thread can yield for 3 reasons:
    an IPC system call waits
    timer interrupt
    yield() system call
  L4 kernel saves user thread registers, picks a RUNNABLE thread to run,
    restores user registers, switches page table, jumps to user space
  no surprises here

How do L4 external pagers work? (see the sketch below)
  every task has a pager task
  1. page fault
  2. kernel suspends the faulting thread
  3. kernel sends fault info in IPC to pager
  4. pager picks one of its own pages
  5. pager sends the virtual page address in an IPC reply to the faulting thread
  6. kernel intercepts the IPC, maps the page into the target, resumes target

What can you use an L4 pager for?
  allocating memory -- "sigma0" allocates on fault for early tasks
  copy-on-write fork
    coupled with a system call that revokes access
  mmap of a file
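Here is a hedged sketch in C of an external pager's main loop, following
steps 1-6 above. The function names and the fault message layout are
hypothetical stand-ins; real L4 encodes fault IPCs differently.

  /* Sketch: an L4-style external pager. */
  #include <stdint.h>

  typedef uint64_t tid_t;

  struct fault_msg {
    tid_t faulter;       /* thread that faulted; kernel has suspended it */
    uint64_t fault_va;   /* faulting virtual address */
    int is_write;
  };

  /* Assumed primitives (hypothetical names). */
  extern void recv_fault(struct fault_msg *f);     /* step 3: fault via IPC */
  extern void *alloc_page(void);                   /* pager's own memory */
  extern void reply_map(tid_t faulter, void *page);/* steps 5-6 */

  void
  pager(void)
  {
    struct fault_msg f;

    for(;;){
      recv_fault(&f);            /* kernel turned a page fault into an IPC */
      void *pg = alloc_page();   /* step 4: pick one of the pager's own pages;
                                    could instead fill it from a file (mmap) */
      reply_map(f.faulter, pg);  /* reply with the page's virtual address;
                                    the kernel intercepts this IPC, maps the
                                    page into the target, resumes it */
    }
  }

A copy-on-write fork pager would follow the same loop: map parent pages
read-only, and on a write fault allocate a fresh page and copy into it
before replying (using the revoke-access system call mentioned above).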
The problem: IPC performance
  Microkernel programs do lots of IPC!
  IPC was expensive in early systems
    multiple kernel crossings, TLB misses, context switches, &c
  The cost of IPC caused many to dismiss microkernels
  L4's designers put huge effort into IPC performance

Here's a slow IPC design, patterned on UNIX pipes
  [diagram: message queue in kernel]
  send(id, msg)
    append msg to queue in kernel, return
  recv(&id, &data)
    if a msg is waiting in the queue, remove it, return
    otherwise sleep()
  called "asynchronous" and "buffered"
  now the usual request-response pattern (RPC) involves:
    [diagram: 2nd message queue for replies]
    4 system calls (user->kernel->user)
      send() -> recv()
      recv() <- send()
    each may disturb the CPU's caches (TLB, data, instruction)
    four message copies (two for the request, two for the reply)
    two context switches, two trips through the general-purpose scheduler

L4's fast IPC
  "Improving IPC by Kernel Design," Jochen Liedtke, 1993
  * synchronous [diagram]
    send() waits for the target thread's recv()
    common case: target is already waiting in recv()
    send() jumps into the target's user space, as if returning from recv()
    no real context switch, no scheduler loop
  * unbuffered
    no queue in the kernel
    since synchronous, the kernel can copy directly between user buffers
  * small messages in registers
    the kernel's send() path does not disturb many of the registers
      e.g., no context switch
    no copying required for small messages
      since send() jumps into the target's user space, along with registers
  * huge messages as virtual memory grants
    again, no copy required, though the kernel send() code must change
      the page table
  * combined call() and sendrecv() system calls [diagram]
    IPC is almost always used as request-response RPC
    thus wasteful to use separate send() and recv() system calls
    client: call(): send a message, wait for the response
    server: sendrecv(): reply to one request, wait for the next one
    2x reduction in user/kernel crossings (see the sketch below)
  * careful layout of kernel code to minimize cache footprint
  result: a 20x reduction in IPC cost
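A sketch in C of the combined-system-call RPC pattern. call() and
sendrecv() follow the lecture's description, but the C signatures and the
message struct are assumptions for illustration, not L4's real API.

  /* Sketch: RPC with combined IPC system calls. */
  #include <stdint.h>

  typedef uint64_t tid_t;
  struct msg { uint64_t w[4]; };   /* small message: can live in registers */

  /* Assumed kernel primitives (hypothetical signatures). */
  extern void recv(tid_t *sender, struct msg *m);
  extern void call(tid_t server, struct msg *req, struct msg *reply);
  extern void sendrecv(tid_t *client, struct msg *reply, struct msg *next);

  /* Client: one crossing into the kernel and one out, per RPC. */
  uint64_t
  client_rpc(tid_t server)
  {
    struct msg req = { .w = { 42 } }, reply;
    call(server, &req, &reply);      /* send request + wait for reply */
    return reply.w[0];
  }

  /* Server: replies to one request and waits for the next in a single
     system call, instead of separate send() and recv(). */
  void
  server_loop(void)
  {
    tid_t client;                    /* reply destination / next sender */
    struct msg req, reply;

    recv(&client, &req);             /* get the first request */
    for(;;){
      reply.w[0] = req.w[0] + 1;     /* "handle" the request */
      sendrecv(&client, &reply, &req); /* reply, then wait for the next */
    }
  }

In the common synchronous case, call() transfers control directly from the
client to the waiting server with no scheduler pass, which is where the 2x
crossing reduction and register-resident messages pay off.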
How to build a full operating system on a microkernel?
  Remember the idea was to move most features into user-level servers.
    File system, device drivers, network stack, process control, &c
  For embedded systems this can be fairly simple.
  What about services for general-purpose use, e.g. workstations, web servers?
    Really need compatibility with existing applications.
    E.g. the system needs to mimic something like UNIX.
  Re-implement UNIX kernel services as lots of user-level services?
  Or: run an existing Linux kernel as a process on top of the microkernel.
    An "O/S server".
    Perhaps not elegant, but pragmatic.
    Part of a path to adoption:
      Users might start by just running Linux apps.
      Then gradually exploit the possibilities of the underlying microkernel.

Which brings us to today's paper:
  "The Performance of micro-Kernel-Based Systems," by Hartig et al., 1997
  basic picture [diagram]
    L4 kernel
    Linux kernel server
    one L4 task per Linux process
    IPC for system calls

What does it mean to run a Linux kernel at user level?
  The Linux kernel is just a program!
  The authors modified Linux in a number of ways, replacing hardware
    access with L4 system calls or IPC:
    process creation, configuring user page tables, memory allocation,
    system call handling, interrupt handling.

L4/Linux's use of threads
  Each Linux process has one or more L4 threads for its user code
  The Linux server has just one L4 thread
    (plus L4 threads waiting for interrupts)
    at rest it is waiting for IPCs carrying system calls
  The Linux server switches its own L4 thread among kernel threads
    for its processes
    when e.g. file system code sleep()s waiting for a disk read
    or a pipe read() sleep()s waiting for someone to write the pipe
    much as xv6 switches among kernel threads
  But an L4/Linux kernel thread switch has no relation to user process
    switching
    instead, L4 separately switches among the runnable L4 threads that
      implement the Linux processes
    so the Linux kernel server can be running a kernel thread for process
      P1, while L4 is running process P2 on another core
  Why not use L4 threads to implement the Linux server's kernel threads?
    Because that would cause pain without any benefit.
    It would introduce parallelism inside Linux.
      But Linux 2.0 did not have SMP support -- e.g. no spinlocks.
      And their hardware had only one core, so there could be no
        parallel speedup anyway.
  Drawback: L4 is in charge of scheduling user threads
    so L4/Linux couldn't enforce Linux's notions of priority &c

The L4/Linux server maps all user memory into its own address space
  (really, it allocates lots of memory, then gives its own memory
   to user processes)
  uses this for copyin()/copyout(), to dereference user pointers
    from system calls
  this keeps system-call IPCs small -- the data's address, not the data itself
  the Linux server also uses its memory access for fork() and exec()

Example: how does fork() work? (see the sketch at the end of this section)
  process P1 calls fork() (P1 is really an L4 task)
  P1's libc turns fork() into an IPC to the L4/Linux server
  L4/Linux asks L4 to create a new task and thread -- P2
  L4/Linux allocates memory pages (as many as P1 has)
  L4/Linux uses IPC to tell L4 to map the pages into P2
  L4/Linux copies data from P1's pages to P2's pages
  L4/Linux sends a special IPC to P2 with SP and PC to cause it to run
  L4/Linux sends the reply to P1 via IPC

The L4/Linux server acts as the pager for user processes
  so L4 turns process page faults into IPCs to the Linux server
  for e.g. copy-on-write fork, lazy allocation, memory-mapped files

Drawback: L4 doesn't allow direct control over page tables
  so the Linux server could not make its page table include user
    virtual addresses
  until recently, real Linux used this trick to gain performance (no page
    table switch on system calls), and for convenience in dereferencing
    syscall arguments

The L4/Linux server uses Linux device drivers unchanged!
  since L4 allows it direct access to device registers
  except that interrupts arrive via L4 IPC

How to evaluate?
  What are some questions that the paper might answer?
  It's not really about whether microkernels are a good idea.
  Its main goal is to show they have good performance.
  What kind of performance do we care about?
    Is IPC fast? -> microbenchmark
    Is there some other performance obstacle? -> whole-system benchmarks

IPC microbenchmarks
  Table 2
  getpid() is one system call on native Linux,
    and two L4 system calls (IPC send, IPC recv) on L4/Linux
  nice result: takes only somewhat more than 2x as long on L4/Linux
    and FAR faster than Mach+Linux server
  What do we think the impact of syscalls taking 2x as long might be?
    Disaster? Hardly noticeable?
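Returning to the fork() walkthrough above, here is a hedged sketch in C of
the server-side handler. All the helper names (l4_create_task, l4_map_page,
&c) and the proc struct are invented stand-ins for L4 calls and server
bookkeeping; the steps follow the lecture's walkthrough.

  /* Sketch: L4/Linux server handling a fork() IPC from process P1. */
  #include <stdint.h>

  typedef uint64_t tid_t;
  struct proc {                  /* server's record of one Linux process */
    tid_t task;
    int npages;
    void **pages;                /* server's mappings of the process memory */
    uint64_t sp, pc;             /* saved user registers */
  };

  /* Assumed L4 and server primitives (hypothetical names). */
  extern tid_t l4_create_task(void);                  /* new task + thread */
  extern void *server_alloc_page(void);
  extern void l4_map_page(tid_t task, int vpn, void *pg); /* map via IPC */
  extern void l4_start_thread(tid_t task, uint64_t sp, uint64_t pc);
  extern void ipc_reply(tid_t task, int retval);
  extern void memcpy_page(void *dst, void *src);
  extern int alloc_pid(void);

  void
  handle_fork(struct proc *p1)
  {
    tid_t child = l4_create_task();     /* ask L4 for a new task, P2 */

    for(int i = 0; i < p1->npages; i++){
      void *pg = server_alloc_page();   /* server hands out its own memory */
      memcpy_page(pg, p1->pages[i]);    /* copy P1's pages to P2's pages */
      l4_map_page(child, i, pg);        /* IPC asking L4 to map into P2 */
    }
    /* (server bookkeeping for P2 -- pid table, page list, &c -- elided) */

    l4_start_thread(child, p1->sp, p1->pc); /* special IPC with SP and PC;
                                               P2 resumes as if its fork()
                                               had returned 0 */
    ipc_reply(p1->task, alloc_pid());       /* reply to P1's fork() IPC
                                               with the child's pid */
  }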
Whole-system benchmark: AIM
  AIM forks a bunch of processes
    each randomly uses the disk, allocates memory, uses pipes, computes, &c
    to do a fixed amount of total work
  Figure 8
    x-axis shows [some function of] the number of concurrent AIM processes
    y-axis shows the time for all processes to complete
    only the slope really matters
      slope is time per unit of work, so lower is better
    native Linux is best, but L4Linux is only a little slower
    Mach+Linux is noticeably less efficient
  Conclusions:
    2x IPC time doesn't seem to make much overall difference
    L4+Linux is only somewhat slower than Linux
    L4+Linux is significantly faster than Mach+Linux
    These results are not by themselves an argument for using L4
      but they are an argument against rejecting L4 due to performance worries

What's the current situation?
  Microkernels are sometimes used for embedded computing
    microcontrollers, Apple's "enclave" processor
    running custom software
  Microkernels, as such, never caught on for general computing
    no compelling story for why one should switch from Linux &c
  Many ideas from microkernel research have been adopted into modern UNIXes
    Mach spurred adoption of sophisticated virtual memory support
    virtual machines are partially a response to the O/S-server idea
    loadable kernel modules are a response to the need for extensibility
    client/server designs, e.g. DNS server, window server
    MacOS has microkernel-style IPC

References:
  The Fiasco.OC Microkernel -- a current L4 descendant
    https://l4re.org/doc/
  Fast IPC in L4:
    https://cs.nyu.edu/~mwalfish/classes/15fa/ref/liedtke93improving.pdf
  Later evolution of L4:
    https://ts.data61.csiro.au/publications/nicta_full_text/8988.pdf