System call interface: microkernels
Required reading: Improving IPC by kernel design
Overview
This lecture looks at the microkernel organization. In a
microkernel, services that a monolithic kernel implements in the
kernel run as user-level programs. For example, the file
system, UNIX process management, the pager, and network protocols each
run in a separate user-level address space. The microkernel itself
supports only the services that are necessary to allow system services
to run well in user space; a typical microkernel has at least support
for creating address spaces, threads, and interprocess communication.
The potential advantages of a microkernel are simplicity of the
kernel (it is small), isolation of operating system components (each
runs in its own user-level address space), and flexibility (we can
have both a file server and a database server). One potential
disadvantage is performance loss: what requires a single system call
in a monolithic kernel may require multiple system calls and context
switches in a microkernel.
One way in which microkernels differ from each other is the exact
kernel API they implement. For example, Mach (a system developed at
CMU, which influenced a number of commercial operating systems) has
the following system calls: processes (create, terminate, suspend,
resume, priority, assign, info, threads), threads (fork, exit, join,
detach, yield, self), ports and messages (a port is a unidirectional
communication channel with a message queue and supporting primitives
to send, destroy, etc.), and regions/memory objects (allocate,
deallocate, map, copy, inherit, read, write).
Some microkernels are more "microkernel" than others. For example,
some microkernels implement the pager in user space but the basic
virtual memory abstractions in the kernel (e.g., Mach); others are
more extreme and implement most of the virtual memory in user space
(L4). Yet others are less extreme: many servers run in their own
address spaces, but in kernel mode (Chorus).
All microkernels support multiple threads per address space. V6 and
UNIX until recently didn't; why? Because in UNIX, system services are
typically implemented in the kernel, and those are the primary
programs that need multiple threads to handle events concurrently
(e.g., waiting for the disk while processing new I/O requests). In a
microkernel, these services are implemented in user-level address
spaces, so they need a mechanism for handling operations concurrently.
(Of course, UNIX supporters will also argue that if you make fork
efficient enough, there is no need to have threads.)
L3/L4
L3 is a predecessor of L4. L3 provides data persistence, DOS
emulation, and an ELAN runtime system. L4 is a reimplementation of
L3, but without the data persistence. L4KA is a project at
sourceforge.net, and you can download the code for the latest
incarnation of L4 from there.
L4 is a "second-generation" microkernel, with 7 calls: IPC (of
which there are several types), id_nearest (find a thread with an ID
close to the given ID), fpage_unmap (unmap pages; mapping is done as a
side effect of IPC), thread_switch (hand the processor to a specified
thread), lthread_ex_regs (manipulate thread registers),
thread_schedule (set scheduling policies), and task_new (create a new
address space with some default number of threads). These calls
provide address spaces, tasks, threads, interprocess communication,
and unique identifiers. An address space is a set of mappings.
Multiple threads may share mappings, and a thread may grant mappings
to another thread (through IPC). A task is the set of threads sharing
an address space.
A thread is the execution abstraction; it belongs to an address
space and has a UID, a register set, a page fault handler, and an
exception handler. The UID of a thread is its task number plus the
number of the thread within that task.
IPC passes data by value or by reference to another address space.
It also provides for sequence coordination. It is used for
communication between clients and servers, to pass interrupts to a
user-level exception handler, and to pass page faults to an external
pager. In L4, device drivers are implemented as user-level
processes with the device mapped into their address space.
Linux runs as a user-level process.
L4 provides quite a range of message types: by-value (inline) data,
strings, and virtual memory mappings. The send and receive
descriptors specify how many of each, if any, a message contains.
In addition, there is a system call for timeouts and for controlling
thread scheduling.
L3/L4 paper discussion
- This paper is about performance. What is a microsecond? Is 100
usec bad? Is 5 usec so much better that we care? How many
instructions does a 50-MHz x86 execute in 100 usec? What can we
compute with that number of instructions? How many disk operations
can complete in that time?
- In performance calculations, what is the appropriate/better metric?
Microseconds or cycles?
- Goal: improve IPC performance by a factor of 10 through careful
kernel design that is fully aware of the hardware it is running on.
Performance rules! Optimize for the common case. Because in L3
interrupts are propagated to user level using IPC, the system may have
to support many IPCs per second (as many as the devices can generate
interrupts).
- IPC consists of transferring control and transferring data. The
minimal cost for transferring control is 127 cycles, plus 45 cycles
for TLB misses (see table 3). What are the x86 instructions to enter
and leave the kernel? Why do they consume so much time? (They flush
the pipeline, because the address space may change.) Do modern
processors perform these operations more efficiently? No, it is worse
now: faster processors are optimized for straight-line code, traps
and exceptions flush a deeper pipeline, and cache misses cost more
cycles.
- What are the 5 TLB misses? 1) B's thread control block; loading
%cr3 flushes the TLB, so 2) the kernel text causes a miss; the iret
accesses both 3) the stack and 4+5) two pages of B's user text, as
B's user code looks at the message.
- Interface:
- call (threadID, send-message, receive-message, timeout);
- reply_and_receive (reply-message, receive-message, timeout);
- Optimizations:
  - New system call: reply_and_receive. Effect: only 2 system calls
per RPC instead of 4.
- Complex messages: direct string, indirect strings, and memory
objects.
- Direct transfer by temporary mapping through a communication
window. The communication window is mapped into B's address space and
into A's kernel address space; why is this better than just mapping a
page shared between A's and B's address spaces? 1) Multi-level
security: shared mappings make it hard to reason about information
flow. 2) The receiver can't check message legality (the contents
might change after the check). 3) When a server has many clients, it
could run out of virtual address space, and the shared memory region
has to be established ahead of time. 4) It is not application
friendly, since the data may already be at another address, i.e.,
applications would have to copy anyway--possibly more copies.
- Why not use the following approach: map the region copy-on-write
(or read-only) in A's address space after send and read-only in B's
address space? Now B may have to copy data or cannot receive data in
its final destination.
- On the x86 this is implemented by copying B's PDE into A's address
space. Why two PDEs? The maximum message size is 4 MB, and the
message may straddle a 4 MB boundary, so two PDEs are guaranteed to
cover it. Why not just copy PTEs? That would be much more expensive.
- What does it mean for the TLB to be "window clean"? Why do we
care? It means the TLB contains no mappings within the communication
window. We care because mapping is cheap (copy a PDE), but
invalidation is not: the x86 only lets you invalidate one page at a
time, or the whole TLB. Does TLB invalidation of the communication
window turn out to be a problem? Not usually, because we have to load
%cr3 during IPC anyway.
- The thread control block (TCB) contains the kernel stack. The
lower part of a thread ID contains the TCB number. One can also
deduce the TCB address from the stack pointer by ANDing the SP with a
bitmask; the SP comes out of the TSS.
- The kernel stack is on the same page as the TCB. Why? 1) It
minimizes TLB misses (accessing the kernel stack brings in the TCB);
2) it allows very efficient access to the TCB -- just mask off the
lower 12 bits of %esp; 3) with VM, the lower 32 bits of the thread id
can indicate which TCB; 4) using one page per TCB means there is no
need to check whether a thread is swapped out (simply don't map that
TCB if it shouldn't be accessed).
- Invariant on queues: queues always hold in-memory TCBs.
- Wakeup queue: set of n unordered wakeup lists, and smart
representation of time so that 32-bit integers can be used in the
common case.
- What is the problem addressed by lazy scheduling?
Conventional approach to scheduling:
A sends message to B:
Move A from ready queue to waiting queue
Move B from waiting queue to ready queue
This requires 58 cycles, including 4 TLB misses. What are the TLB
misses? One each for the heads of the ready and waiting queues, and
one each for the previous queue element during the remove.
- Lazy scheduling:
Ready queue must contain all ready threads except current one
Might contain other threads that aren't actually ready, though
Each wakeup queue contains all threads waiting in that queue
Again, might contain other threads, too
Scheduler removes inappropriate queue entries when scanning
queue
- Why does this help performance? There are only three situations in
which a thread gives up the CPU but stays ready: the send syscall (as
opposed to call), preemption, and hardware interrupts. So very often
we can IPC into a thread without ever putting it on the ready list.
- Direct process switch. This section just says you should use
kernel threads instead of continuations.
- Short messages via registers.
- Avoiding unnecessary copies.
  - Registers for parameter passing wherever possible: system calls
and IPC.
- Coding tricks: short offsets, IPC kernel code in one page, avoid
segments, avoid jumps, minimize switch costs. Much of the kernel is
written in assembly!
- Minimizing TLB misses. Try to cram as many things as possible onto
the same page: IPC kernel code, GDT, IDT, TSS, all on one page.
Actually, the whole tables may not fit, but the important parts of
the tables can go on the same page (maybe the beginning of the TSS,
IDT, or GDT only?).
- Is fast IPC enough to get good overall system performance? This
paper doesn't make a statement either way; we have to read their 1997
paper to find the answer to that question.
- Is the principle of optimizing for performance right? In general,
it is wrong to optimize only for performance; other things matter
more. Is IPC the one exception? Maybe; perhaps not.