Microkernels
Required reading: Improving IPC by kernel design
Overview
the microkernel vision
implement main o/s abstractions as user-space servers
files, tcp, process mgmt, paging
kernel provides minimum to support servers
address spaces
threads
inter-process communication (IPC)
why attractive?
(compared to monolithic UNIX-style design)
isolation (vs bugs)
fault-tolerant (restart individual services)
modular (understandable, replaceable services)
easier to distribute (IPC allows services on other hosts)
easier to run on multiprocessor (since many servers)
history
individual ideas around since the beginning
lots of research projects starting early 1980s
hit the big time w/ CMU's Mach in 1986
thought to be too slow in early 1990s
now slowly returning (OSX, QNX in Cisco, &c)
ideas very influential on non-microkernels
Mach overview
(today's L3 paper very much a reaction to Mach)
ambitious: multiprocessors, distrib systems
multiple threads in process / address space
as a style of programming
i/o concurrency; multi-processors
fetching a web page with many images?
browser creates a thread for each (fetch, decode jpeg)
Mach did a lot to popularize this style
IPC
"port" an independent kernel abstraction
send to port, recv from port
a port queues messages, waiting for someone to read
processes have rights to send/recv from ports
a port's send or recv rights can themselves be passed to another process in an IPC msg
so can't tell at send time who will recv
process control via ports
process creation gives parent "kernel port" to empty process
IPC messages to map mem, create threads, stop, start, wait()
parent can give away child's kernel port
thus wait() not tied only to parent
easy to write e.g. debugger
paging via ports
goal: move most of VM / memory mgmt out of the kernel
an address space consists of mapped "memory objects"
object is a file or anonymous disk-backed memory
each object implemented by a server
handles page faults that refer to that object
kernel gives it an offset, it returns page content
kernel keeps object / virtual address table for each process
that's the real address space
model: process LD/ST operates on mem objs; VM decides which obj
hardware doesn't let us directly turn LD into an IPC msg
but can mark pages invalid to detect LDs and STs
so use page faults and caching in physical memory
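  A minimal sketch of an external pager's main loop under this model; the message
  layout and the ipc_recv/ipc_reply/backing_read names are assumptions for
  illustration, not Mach's actual memory_object interface:

    #include <stdint.h>

    #define PAGE_SIZE 4096

    struct fault_msg {          /* request the kernel sends to the pager */
        int      object_id;     /* which memory object faulted */
        uint64_t offset;        /* page-aligned offset within the object */
    };

    struct page_msg {           /* reply carrying the page contents */
        char data[PAGE_SIZE];
    };

    /* assumed IPC and backing-store stubs, not real Mach calls */
    extern void ipc_recv(int port, struct fault_msg *m);
    extern void ipc_reply(int port, const struct page_msg *p);
    extern void backing_read(int object_id, uint64_t off, char *buf);

    void pager_loop(int service_port)
    {
        struct fault_msg req;
        struct page_msg  rep;

        for (;;) {
            ipc_recv(service_port, &req);                      /* kernel forwards a page fault */
            backing_read(req.object_id, req.offset, rep.data); /* fetch page from backing store */
            ipc_reply(service_port, &rep);                     /* kernel installs page, resumes thread */
        }
    }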
how to build full systems on Mach?
goal was general-purpose O/S for servers and workstations
huge issue: UNIX compatibility
critical to widespread adoption
split UNIX kernel code into many servers?
too hard, design is fairly non-modular
so they left UNIX as a single entity
mostly new services as servers
e.g. network file system, window system, pagers, LAN IPC, name servers
my Mach-based OSX Mac has dozens of servers, 100s of ports
Mach 2.5:
left UNIX in kernel, mostly unchanged
so micro- and UNIX side-by-side in kernel
Mach 3.0:
UNIX in a big server process
forward UNIX system calls via IPC
big problem: Mach 3.0 was slow
high cost of sending syscalls via IPC
result: many people decided microkernels were a bad idea
L3/L4 Overview
partially an attempt to see if Mach 3.0 slowness was inherent
or an accident of design / implementation
L3 used for education and embedded systems
L4 a more modern re-implementation
has associated user-space Linux port (not a VM...)
can fetch from sourceforge
just seven system calls!
IPC
can send page mappings
this is the only way to set up an address space
task creation
creates fixed # of threads
task/thread manipulation
can set/get PC, SP, state, &c
need capability, initially granted to task creator
L3/L4 paper discussion
- This paper is about performance. What is a microsecond? Is 100
usec bad? Is 5 usec so much better that we care? How many instructions
does a 50-MHz x86 execute in 100 usec? Do programs make a lot of
UNIX system calls, so that IPC to a UNIX server at 100 usec each would
be significant?
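  Rough arithmetic, assuming roughly one instruction per cycle on a 50-MHz x86:
    100 usec * 50e6 cycles/sec = 5,000 cycles, i.e. a few thousand instructions
      5 usec * 50e6 cycles/sec =   250 cycles
  so a 100 usec IPC costs as much as thousands of instructions of useful work,
  while 5 usec is within about 1.5x of the 170-cycle minimum in Table 3.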
- Is the 5 usec impressive? Today's machines have clock rates
60x faster: does IPC now cost < 5 usec w/o careful optimization?
- In performance calculations, what is the appropriate/better metric?
Microseconds or cycles?
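  For scale: at a clock 60x faster (~3 GHz), 5 usec is 3e9 * 5e-6 = 15,000 cycles,
  so a fixed microsecond budget buys far more cycles on newer hardware.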
- Table 3 shows the minimum work for a one-way IPC. About 20
instructions, 170 cycles, 3.4 us. L3 goal: achieve this minimum amount
of IPC overhead. Different from usual design approach, which starts
with functionality, worries about performance after the fact.
- What are the instructions in Table 3 doing?
- Is JOS anywhere near this streamlined? Where are the register
save/restores? Where is the system call dispatch? Where is the call
to the scheduler? Where is the target kernel thread woken up? Why is a
stack switch needed? Where is the message content copied?
- Why are the INT and IRET so expensive? (IDT, GDT for CS, TSS, GDT
for SS, parse seg descrs, permission checks, maybe cache misses, maybe
TLB misses.) Do you think CPU speed improvements like pipelining
naturally make INT and IRET take fewer cycles?
- Why so much focus on TLB misses? What is a TLB? How much does
a miss cost? Why isn't the key bottleneck data cache misses?
- What are the 5 TLB misses: 1) B's thread control block; loading %cr3
flushes the TLB, so 2) kernel text causes a miss; the iret accesses both 3) the
stack and 4+5) user text - two pages, since B's user code looks at the message
- Interface:
- call (threadID, send-message, receive-message, timeout);
- reply_and_receive (reply-message, receive-message, timeout);
- why better than separate send and recv()?
- why is target a threadID rather than e.g. fd or Mach port?
any drawback to that design?
- call() blocks until receiver ready. why better than queuing
and immediately returning?
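  A sketch of how a client and server might use these two primitives; the C
  signatures, message type, and handle() helper are guesses for illustration,
  not the real L3 ABI:

    typedef unsigned long tid_t;
    struct msg { unsigned long w[8]; };        /* a small, register-sized message */

    /* assumed C bindings for the two primitives above */
    extern int ipc_call(tid_t dst, const struct msg *send,
                        struct msg *recv, int timeout);
    extern int ipc_reply_and_receive(const struct msg *reply,
                                     struct msg *next_request,
                                     tid_t *sender, int timeout);
    extern void handle(const struct msg *req, struct msg *rep);  /* hypothetical */

    /* Client: one kernel entry sends the request and blocks for the reply. */
    int client_rpc(tid_t server, struct msg *req, struct msg *rep)
    {
        return ipc_call(server, req, rep, /*timeout=*/ -1);
    }

    /* Server: after handling a request, one kernel entry sends the reply and
       waits for the next request - half the kernel crossings of separate
       send and recv calls. Assume an initial plain receive filled req/client. */
    void server_loop(void)
    {
        struct msg req, rep;
        tid_t client;
        for (;;) {
            handle(&req, &rep);
            ipc_reply_and_receive(&rep, &req, &client, -1);
        }
    }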
- Multi-part messages: direct string, indirect strings, and memory
objects. Why not simpler+faster to have just one message part?
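  One way to picture a multi-part send descriptor; this layout is invented for
  illustration and is not L3's actual message format:

    #include <stddef.h>

    struct indirect_part {
        void   *addr;        /* payload stays in sender's memory until copy time */
        size_t  len;
    };

    struct send_descriptor {
        size_t               direct_len;     /* short header copied in-line */
        char                 direct[64];
        int                  n_indirect;     /* larger payloads referenced, copied once */
        struct indirect_part indirect[4];
        int                  n_map;          /* whole pages passed by mapping, no copy */
        struct { void *base; size_t len; } map[2];
    };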
- Direct transfer:
- How to move data from A's memory to B's memory?
- Why not have kernel copy from A to kernel buffer, from buffer to
B? Why does UNIX pipe write()/read() use such a kernel buffer?
- Why not map pages in both A and B's address space, have A copy
message to B at user level? A might change msg while B is reading it,
requires pre-arrangement, A might not know how B wants data laid out
(for multi-part messages: header vs payload).
- L3 kernel maps some of B's address space internally to help it do
the copy. Since send waits for recv, kernel knows both src and
dst addresses and layout.
- Why change VM mappings rather than copy through physical memory?
(The TLB may be faster than code that interprets page tables, especially
if that code has to check for paged-out pages.)
- But changing page table can be expensive: does it require a TLB
flush (a load of %cr3)? Resulting TLB misses would be expensive.
- Only have to flush TLB if kernel has mapped something else there
since the last TLB flush. Kernel flushes TLB anyway on thread switch,
so the TLB is often "clean". It might not be clean if the copy is
interrupted by a page fault, or by a time-slice to a different thread
in the same address space.
- Thread control blocks:
- Problem: a thread has lots of state: stack, presence on various
lists, RUNNING/WAITING, pgdir, &c. It would be bad if these were in
different pages of memory (as they are in xv6 and JOS), since we might
pay one TLB miss for each even on the send side of the IPC.
- L3 solution: group all thread info required to send on single
page. Including stack, queue pointers, state, page directory. No
indirection: low bits of thread ID directly index an array of
page-size structs.
- TCB per page also eliminates some checks; un-map that page if
the thread does not exist or for some other reason can't handle
IPC fast path (e.g. swapped out to disk).
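  A sketch of the one-TCB-per-page indexing; the constants and field layout
  here are assumptions for illustration, not L3's actual TCB:

    #include <stdint.h>

    #define PAGE_SIZE       4096
    #define TCB_AREA        0xE0000000UL   /* assumed fixed virtual region of TCB pages */
    #define TID_INDEX_BITS  12             /* low bits of a thread ID index the array */

    struct tcb {
        uint32_t    state;                 /* RUNNING / WAITING / ... */
        uint32_t   *kstack_top;            /* kernel stack lives in the same page */
        uint32_t    pgdir_phys;            /* value to load into %cr3 */
        struct tcb *ready_next;            /* queue links kept here too */
        /* ... rest of the page holds the kernel stack ... */
    };

    /* No table lookup, no pointer chase: mask the thread ID and scale by a page.
       If the page is unmapped (thread doesn't exist, TCB paged out), the access
       faults and the slow path takes over. */
    static inline struct tcb *tid_to_tcb(uint32_t tid)
    {
        return (struct tcb *)(TCB_AREA +
                (uintptr_t)(tid & ((1u << TID_INDEX_BITS) - 1)) * PAGE_SIZE);
    }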
- Lazy scheduling:
- What is the problem addressed by lazy scheduling?
Why does L3 have ready/waiting queues instead of just
RUNNABLE/WAITING per-thread state?
Conventional approach to scheduling:
A sends message to B:
Move A from ready queue to waiting queue
Move B from waiting queue to ready queue
This requires 58 cycles, including 4 TLB misses. Where are the 4 TLB misses?
One each for head of ready and waiting queues
One each for previous queue element during the remove
- Lazy scheduling:
If A sends IPC to B:
set tcb[a].state = WAITING
set tcb[b].state = RUNNING
switch stacks and %cr3
(leave A on ready queue, leave B on waiting queue)
Thus:
Real thread state in its TCB; queues mostly just a hint
Ready queue holds superset of runnable threads (except current thread)
Waiting queue holds superset of waiting threads
Scheduler cleans queues: can tell from tcb[].state
- Why does this help performance?
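  A sketch of the lazy cleanup the scheduler might do when it next runs; the
  names and state values are illustrative, not L3's real scheduler:

    enum { WAITING, RUNNABLE, RUNNING };          /* illustrative states */

    struct tcb {
        int         state;                        /* authoritative thread state */
        struct tcb *ready_next;                   /* ready-list link, may be stale */
    };

    extern struct tcb *idle_thread;               /* assumed global idle thread */

    /* The ready list may contain threads that are really WAITING; the
       authoritative state is tcb->state, so the scheduler prunes stale entries
       as it scans instead of updating queues on every IPC. */
    struct tcb *pick_next(struct tcb **ready_head)
    {
        struct tcb **pp = ready_head;
        while (*pp) {
            struct tcb *t = *pp;
            if (t->state == RUNNABLE)
                return t;                         /* genuinely runnable thread */
            *pp = t->ready_next;                  /* stale hint: unlink lazily */
        }
        return idle_thread;
    }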
- Short messages via registers. Why might this have been a bad idea?
- Segment register optimization. JOS and xv6 always save DS,
set DS to kernel data segment. It's expensive (load and parse GDT
entry). Why can L3 avoid this? Does L3 risk having a thread
mess up the kernel by putting something weird in DS?
- Minimizing misses:
- TLB misses. Try to cram as much as possible onto one page:
IPC kernel code, GDT, IDT, TSS all on the same page. The whole tables
may not fit, but the important parts of each table can share that page
(maybe just the beginnings of the TSS, IDT, or GDT).
- Cache misses. Each costs roughly as much time as ~10 instructions.
Short offsets, avoid jumps, avoid checks, pack
often-used data on same cache lines, lazily save/restore CPU state
like debug and FPU registers. Much of the kernel is written in
assembly!
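  For example, the handful of TCB fields touched on the IPC fast path might be
  packed onto one cache line; the fields and the 32-byte line size here are
  assumptions, not L3's actual layout:

    #include <stdint.h>

    /* Hot IPC-path fields together so a single (assumed 32-byte) cache line
       brings them all in; cold state like debug registers and FPU state lives
       elsewhere and is saved/restored lazily. */
    struct tcb_hot {
        uint32_t  state;
        uint32_t  partner_tid;       /* thread we are sending to / receiving from */
        uint32_t  pgdir_phys;
        uint32_t *kstack_top;
        uint32_t  timeout;
    } __attribute__((aligned(32)));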
- What are the results? Figures 7 and 8 look good.
- Is fast IPC enough to get good overall system performance? This
paper doesn't make a statement either way; we have to read their 1997
paper to find the answer to that question.
- Is the principle of optimizing for performance right? In general,
it is wrong to optimize for performance; other things matter more. Is
IPC the one exception? Maybe; maybe not. Was Liedtke fighting a
losing battle against CPU makers? Should fast IPC be a hardware
issue, or just an OS issue?