System call interface: microkernels

Required reading: Improving IPC by kernel design

Overview

This lecture looks at the microkernel organization. In a microkernel, services that a monolithic kernel implements in the kernel run as user-level programs. For example, the file system, UNIX process management, pager, and network protocols each run in a separate user-level address space. The microkernel itself supports only the services that are necessary to allow system services to run well in user space; a typical microkernel has at least support for creating address spaces, threads, and interprocess communication.

The potential advantages of a microkernel are simplicity of the kernel (small), isolation of operating system components (each runs in its own user-level address space), and flexibility (we can have a file server and a database server). One potential disadvantage is performance loss, because what requires a single system call in a monolithic kernel may require multiple system calls and context switches in a microkernel.

One way in which microkernels differ from each other is the exact kernel API they implement. For example, Mach (a system developed at CMU, which influenced a number of commercial operating systems) has system calls in the following groups: processes (create, terminate, suspend, resume, priority, assign, info, threads), threads (fork, exit, join, detach, yield, self), ports and messages (a port is a unidirectional communication channel with a message queue and supporting primitives to send, destroy, etc.), and regions/memory objects (allocate, deallocate, map, copy, inherit, read, write).

Some microkernels are more "microkernel" than others. For example, some microkernels implement the pager in user space but the basic virtual memory abstractions in the kernel (e.g., Mach); others are more extreme and implement most of the virtual memory in user space (L4). Yet others are less extreme: many servers run in their own address space, but in kernel mode (Chorus).

All microkernels support multiple threads per address space. V6 and UNIX until recently didn't; why? Because in UNIX, system services are typically implemented in the kernel, and those are the primary programs that need multiple threads to handle events concurrently (waiting for disk and processing new I/O requests). In microkernels, these services are implemented in user-level address spaces, so they need a mechanism for handling operations concurrently. (Of course, UNIX supporters will also argue that if you make fork efficient enough, there is no need to have threads.)

L3/L4

L3 is a predecessor to L4. L3 provides data persistence, DOS emulation, and an ELAN runtime system. L4 is a reimplementation of L3, but without the data persistence. L4KA is a project at sourceforge.net, and you can download the code for the latest incarnation of L4 from there.

L4 is a "second-generation" microkernel with 7 calls: IPC (of which there are several types), id_nearest, fpage_unmap, thread_switch, lthread_ex_regs, thread_schedule, and task_new. These calls provide address spaces, tasks, threads, interprocess communication, and unique identifiers. An address space is a set of mappings. Multiple threads may share mappings, and a thread may grant mappings to another thread. A task is the set of threads sharing an address space.

A thread is the execution abstraction; it belongs to an address space and has a UID, a register set, a page fault handler, and an exception handler. The UID of a thread is its task number plus the number of the thread within that task.
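As a minimal sketch of how such a UID could be built, the following packs a task number and a per-task thread number into one integer. Reading "plus" as bit concatenation is an assumption, and so are the field width `THREAD_BITS` and the helper names `make_uid` and `split_uid`; the notes do not give L4's actual layout.

```python
# Assumed width of the thread-number field; the notes do not specify it.
THREAD_BITS = 7


def make_uid(task_no: int, thread_no: int) -> int:
    """Pack a task number and a per-task thread number into one integer."""
    if task_no < 0:
        raise ValueError("task number must be non-negative")
    if not 0 <= thread_no < (1 << THREAD_BITS):
        raise ValueError("thread number out of range")
    # Task number in the high bits, thread number in the low bits.
    return (task_no << THREAD_BITS) | thread_no


def split_uid(uid: int) -> tuple[int, int]:
    """Recover (task number, thread number) from a packed UID."""
    return uid >> THREAD_BITS, uid & ((1 << THREAD_BITS) - 1)
```

With this layout, all threads of one task share the same high bits, so the task of a thread can be found with a single shift.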

IPC passes data by value or by reference to another address space. It also provides for sequence coordination. It is used for communication between clients and servers, to pass interrupts to a user-level exception handler, and to pass page faults to an external pager. In L4, device drivers are implemented as user-level processes with the device mapped into their address space. Linux runs as a user-level process.

L4 provides quite a range of message types: inline-by-value, strings, and virtual memory mappings. The send and receive descriptors specify how many of each, if any, a message carries.

In addition, there is a system call for timeouts and controlling thread scheduling.

L3/L4 paper discussion

This paper is about performance. What is a microsecond? Is 100 usec bad? Is 5 usec so much better that we care? How many instructions does a 50-MHz x86 execute in 100 usec? What can we compute with that number of instructions? How many disk operations fit in that time?
In performance calculations, what is the appropriate/better metric? Microseconds or cycles?
Goal: improve IPC performance by a factor of 10 by careful kernel design that is fully aware of the hardware it is running on. Performance rules! Optimize for the common case. Because in L3 interrupts are propagated to user level using IPC, the system may have to support thousands of IPCs per second.
IPC consists of transferring control and transferring data. The minimal cost for transferring control is 127 cycles, plus 45 cycles for TLB misses (see table 3). What are the x86 instructions to enter and leave the kernel? Why do they consume so much time? Do modern processors perform these operations more efficiently? What are the indirect costs (TLB flush and cache pollution)?
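To make the microseconds-versus-cycles questions concrete, here is a small worked conversion using only the numbers quoted above (50 MHz, 100 usec, 5 usec, 127 + 45 cycles). The helper names `cycles_in` and `usec_for` are made up for this sketch. Note that cycles are only an upper bound on instructions, under the assumption that the processor completes at most one instruction per cycle.

```python
CLOCK_MHZ = 50  # clock rate from the discussion above; 1 MHz = 1 cycle per usec


def cycles_in(usec: float, clock_mhz: float = CLOCK_MHZ) -> float:
    """Number of clock cycles that elapse in the given number of microseconds."""
    return usec * clock_mhz


def usec_for(cycles: float, clock_mhz: float = CLOCK_MHZ) -> float:
    """Time in microseconds taken by the given number of clock cycles."""
    return cycles / clock_mhz


# Cycle budgets for the two IPC times asked about.
print(cycles_in(100))  # 5000 cycles in 100 usec
print(cycles_in(5))    # 250 cycles in 5 usec

# Minimal control-transfer cost: 127 cycles plus 45 cycles of TLB misses.
min_transfer_cycles = 127 + 45
print(min_transfer_cycles, usec_for(min_transfer_cycles))  # 172 cycles, 3.44 usec
```

Cycles describe the work independently of the clock rate, while microseconds are what the user of the machine observes; the conversion shows how little room a 5 usec target leaves on top of the 172-cycle minimum.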
Interface:
- call (threadID, send-message, receive-message, timeout);
- reply_and_receive (reply-message, receive-message, timeout);
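The effect of these two combined primitives on the number of kernel entries per RPC can be sketched with a toy counter. This is not L3/L4 code: `ToyKernel` moves no real messages, and the baseline primitives `send` and `receive` are hypothetical names for a conventional interface with separate operations.

```python
class ToyKernel:
    """Counts kernel entries only; it moves no real messages."""

    def __init__(self) -> None:
        self.syscalls = 0

    # Hypothetical baseline primitives, not named in the notes.
    def send(self) -> None:
        self.syscalls += 1

    def receive(self) -> None:
        self.syscalls += 1

    # Combined primitives from the interface above.
    def call(self) -> None:
        # Client: send the request and wait for the reply in one kernel entry.
        self.syscalls += 1

    def reply_and_receive(self) -> None:
        # Server: send the reply and wait for the next request in one entry.
        self.syscalls += 1


def rpc_separate(kernel: ToyKernel) -> None:
    kernel.send()     # client sends the request
    kernel.receive()  # server receives the request
    kernel.send()     # server sends the reply
    kernel.receive()  # client receives the reply


def rpc_combined(kernel: ToyKernel) -> None:
    kernel.call()               # client: request out, reply in
    kernel.reply_and_receive()  # server: reply out, next request in


def syscalls_per_rpc(rpc, rounds: int = 1000) -> float:
    """Average number of kernel entries per RPC over many rounds."""
    kernel = ToyKernel()
    for _ in range(rounds):
        rpc(kernel)
    return kernel.syscalls / rounds


print(syscalls_per_rpc(rpc_separate))  # 4.0
print(syscalls_per_rpc(rpc_combined))  # 2.0
```

In steady state the server's wait for request N+1 is folded into the reply to request N, which is why each round trip costs the server only one kernel entry.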
Optimizations:
- New system call: reply_and_receive. Effect: 2 system calls per RPC.
- Complex messages: direct string, indirect strings, and memory objects.
- Direct transfer by temporary mapping through a communication window. The communication window is mapped in B's address space and in A's kernel address space; why is this better than just mapping a page shared between A's and B's address spaces? On the x86 this is implemented by copying B's pdir entry into A's address space. Why must the TLB be "window clean"?
- One kernel stack per thread. This means they must switch the value in the TSS when switching threads.
- Thread control block contains the stack. The lower part of the thread ID contains the TCB number. Can also deduce the TCB address from the stack by taking SP AND bitmask; the SP comes out of the TSS.
- Invariant on queues: queues always hold in-memory TCBs.
- Wakeup queue: set of n unordered wakeup lists, and smart representation of time so that 32-bit integers can be used in the common case.
- Lazy scheduling: don't change the wakeup and ready queues; just change the state variable in the TCB.
- Direct process switch. This section just says you should use kernel threads instead of continuations.
- Short messages via registers.
- Avoiding unnecessary copies.
- Registers for parameter passing wherever possible: system calls and IPC.
- Coding tricks: short offsets, IPC kernel code in one page, avoid segments, avoid jumps, minimize switch costs. Much of the kernel is written in assembly!
- Is fast IPC enough to get good overall system performance? This paper doesn't make a statement either way; we have to read their 1997 paper to find the answer to that question.
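The lazy scheduling item in the list above can be illustrated with a toy model. This is a sketch under assumptions, not the paper's data structures: the class names, the `in_ready_queue` flag, and the `queue_ops` counter are made up here, and the wakeup queue is left out. The idea shown is that blocking a thread only changes its state in the TCB, and stale queue entries are removed later, when the scheduler scans the queue.

```python
from collections import deque
from dataclasses import dataclass

READY, WAITING = "ready", "waiting"


@dataclass
class TCB:
    name: str
    state: str = WAITING
    in_ready_queue: bool = False


class LazyScheduler:
    def __init__(self) -> None:
        self.ready_queue = deque()
        self.queue_ops = 0  # counts queue insertions and removals

    def make_ready(self, tcb: TCB) -> None:
        tcb.state = READY
        # Insert only if the thread is not already linked into the queue.
        if not tcb.in_ready_queue:
            self.ready_queue.append(tcb)
            tcb.in_ready_queue = True
            self.queue_ops += 1

    def block(self, tcb: TCB) -> None:
        # Lazy: only the state variable changes; the queue entry stays.
        tcb.state = WAITING

    def pick_next(self):
        # Stale entries are cleaned up only when the queue is scanned.
        while self.ready_queue:
            tcb = self.ready_queue[0]
            if tcb.state == READY:
                return tcb
            self.ready_queue.popleft()
            tcb.in_ready_queue = False
            self.queue_ops += 1
        return None


sched = LazyScheduler()
client, server = TCB("client"), TCB("server")
sched.make_ready(client)
sched.make_ready(server)

# A thread that blocks and becomes ready again many times, as in an
# IPC ping-pong, never touches the queue in this model.
for _ in range(1000):
    sched.block(client)
    sched.make_ready(client)
print(sched.queue_ops)  # 2
```

The cost is that the ready queue may briefly hold threads that are not ready, so the scheduler has to check the state variable of each entry it considers.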