Microkernels
Required reading: Improving IPC by kernel design
Overview
the microkernel vision
implement main o/s abstractions as user-space servers
files, tcp, process mgmt, paging
kernel provides minimum to support servers
address spaces
threads
inter-process communication (IPC)
why attractive?
(compared to monolithic UNIX-style design)
isolation (vs bugs)
fault-tolerant (restart individual services)
modular (understandable, replaceable services)
easier to distribute (IPC allows services on other hosts)
easier to run on multiprocessor (since many servers)
history
individual ideas around since the beginning
lots of research projects starting early 1980s
hit the big time w/ CMU's Mach in 1986
thought to be too slow in early 1990s
now slowly returning (OSX, QNX in Cisco, &c)
ideas very influential on non-microkernels
Mach overview
(today's L3 paper very much a reaction to Mach)
ambitious: multiprocessors, distrib systems
multiple threads in process / address space
as a style of programming
i/o concurrency; multi-processors
fetching a web page with many images?
browser creates a thread for each (fetch, decode jpeg)
Mach did a lot to popularize this style
IPC
"port" an independent kernel abstraction
send to port, recv from port
a port queues messages, waiting for someone to read
processes have rights to send/recv from ports
a port's send or recv rights can themselves be passed to another process in an IPC msg
so can't tell at send time who will recv
process control via ports
process creation gives parent "kernel port" to empty process
IPC messages to map mem, create threads, stop, start, wait()
parent can give away child's kernel port
thus wait() not tied only to parent
easy to write e.g. debugger
paging via ports
goal: move most of VM / memory mgmt out of the kernel
an address space consists of mapped "memory objects"
object is a file or anonymous disk-backed memory
each object implemented by a server
handles page faults that refer to that object
kernel gives it an offset, it returns page content
kernel keeps object / virtual address table for each process
that's the real address space
model: process LD/ST operates on mem objs; VM decides which obj
hardware doesn't let us directly turn LD into an IPC msg
but can mark pages invalid to detect LDs and STs
so use page faults and caching in physical memory
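  A minimal sketch of an external pager's main loop under this model; the message
  layout and the ipc_recv/ipc_reply/backing_read names are assumptions for
  illustration, not Mach's actual memory_object interface:

    #include <stdint.h>

    #define PAGE_SIZE 4096

    struct fault_msg {          /* request the kernel sends to the pager */
        int      object_id;     /* which memory object faulted */
        uint64_t offset;        /* page-aligned offset within the object */
    };

    struct page_msg {           /* reply carrying the page contents */
        char data[PAGE_SIZE];
    };

    /* assumed IPC and backing-store stubs, not real Mach calls */
    extern void ipc_recv(int port, struct fault_msg *m);
    extern void ipc_reply(int port, const struct page_msg *p);
    extern void backing_read(int object_id, uint64_t off, char *buf);

    void pager_loop(int service_port)
    {
        struct fault_msg req;
        struct page_msg  rep;

        for (;;) {
            ipc_recv(service_port, &req);                      /* kernel forwards a page fault */
            backing_read(req.object_id, req.offset, rep.data); /* fetch page from backing store */
            ipc_reply(service_port, &rep);                     /* kernel installs page, resumes thread */
        }
    }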
how to build full systems on Mach?
goal was general-purpose O/S for servers and workstations
huge issue: UNIX compatibility
critical to widespread adoption
split UNIX kernel code into many servers?
too hard, design is fairly non-modular
so they left UNIX as a single entity
mostly new services as servers
e.g. network file system, window system, pagers, LAN IPC, name servers
my Mach-based OSX Mac has dozens of servers, 100s of ports
Mach 2.5:
left UNIX in kernel, mostly unchanged
so micro- and UNIX side-by-side in kernel
Mach 3.0:
UNIX in a big server process
forward UNIX system calls via IPC
big problem: Mach 3.0 was slow
high cost of sending syscalls via IPC
result: many people decided microkernels were a bad idea
L3/L4 Overview
partially an attempt to see if Mach 3.0 slowness was inherent
or an accident of design / implementation
L3 used for education and embedded systems
L4 a more modern re-implementation
has associated user-space Linux port (not a VM...)
can fetch from sourceforge
just seven system calls!
IPC
can send page mappings
this is the only way to set up an address space
task creation
creates fixed # of threads
task/thread manipulation
can set/get PC, SP, state, &c
need capability, initially granted to task creator
L3/L4 paper discussion
- This paper is about performance. What is a microsecond? Is 100
usec bad? Is 5 usec so much better that we care? How many instructions
does a 50-MHz x86 execute in 100 usec? Do programs make a lot of
UNIX system calls, so that IPC to a UNIX server at 100 usec each would
be significant?
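  Rough arithmetic, assuming roughly one instruction per cycle on a 50-MHz x86:
    100 usec * 50e6 cycles/sec = 5,000 cycles, i.e. a few thousand instructions
      5 usec * 50e6 cycles/sec =   250 cycles
  so a 100 usec IPC costs as much as thousands of instructions of useful work,
  while 5 usec is within about 1.5x of the 170-cycle minimum in Table 3.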
- Is the 5 usec impressive? Today's machines have clock rates
60x faster: does IPC now cost < 5 usec w/o careful optimization?
- In performance calculations, what is the appropriate/better metric?
Microseconds or cycles?
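  For scale: at a clock 60x faster (~3 GHz), 5 usec is 3e9 * 5e-6 = 15,000 cycles,
  so a fixed microsecond budget buys far more cycles on newer hardware.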
- Table 3 shows the minimum work for a one-way IPC. About 20
instructions, 170 cycles, 3.4 us. L3 goal: achieve this minimum amount
of IPC overhead. Different from usual design approach, which starts
with functionality, worries about performance after the fact.
- What are the instructions in Table 3 doing?
- Is JOS anywhere near this streamlined? Where are the register
save/restores? Where is the system call dispatch? Where is the call
to the scheduler? Where is the target kernel thread woken up? Why is a
stack switch needed? Where is the message content copied?
- Why are the INT and IRET so expensive? (IDT, GDT for CS, TSS, GDT
for SS, parse seg descrs, permission checks, maybe cache misses, maybe
TLB misses.) Do you think CPU speed improvements like pipelining
naturally make INT and IRET take fewer cycles?
- Why so much focus on TLB misses? What is a TLB? How much does
a miss cost? Why isn't the key bottleneck data cache misses?
- What are the 5 TLB misses: 1) B's thread control block; loading %cr3
flushes the TLB, so 2) kernel text causes a miss; the iret accesses both 3) the
stack and 4+5) user text - two pages, since B's user code looks at the message
- Interface:
- call (threadID, send-message, receive-message, timeout);
- reply_and_receive (reply-message, receive-message, timeout);
- why better than separate send and recv()?
- why is target a threadID rather than e.g. fd or Mach port?
any drawback to that design?
- call() blocks until receiver ready. why better than queuing
and immediately returning?
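  A sketch of how a client and server might use these two primitives; the C
  signatures, message type, and handle() helper are guesses for illustration,
  not the real L3 ABI:

    typedef unsigned long tid_t;
    struct msg { unsigned long w[8]; };        /* a small, register-sized message */

    /* assumed C bindings for the two primitives above */
    extern int ipc_call(tid_t dst, const struct msg *send,
                        struct msg *recv, int timeout);
    extern int ipc_reply_and_receive(const struct msg *reply,
                                     struct msg *next_request,
                                     tid_t *sender, int timeout);
    extern void handle(const struct msg *req, struct msg *rep);  /* hypothetical */

    /* Client: one kernel entry sends the request and blocks for the reply. */
    int client_rpc(tid_t server, struct msg *req, struct msg *rep)
    {
        return ipc_call(server, req, rep, /*timeout=*/ -1);
    }

    /* Server: after handling a request, one kernel entry sends the reply and
       waits for the next request - half the kernel crossings of separate
       send and recv calls. Assume an initial plain receive filled req/client. */
    void server_loop(void)
    {
        struct msg req, rep;
        tid_t client;
        for (;;) {
            handle(&req, &rep);
            ipc_reply_and_receive(&rep, &req, &client, -1);
        }
    }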
- Multi-part messages: direct string, indirect strings, and memory
objects. Why not simpler+faster to have just one message part?
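  One way to picture a multi-part send descriptor; this layout is invented for
  illustration and is not L3's actual message format:

    #include <stddef.h>

    struct indirect_part {
        void   *addr;        /* payload stays in sender's memory until copy time */
        size_t  len;
    };

    struct send_descriptor {
        size_t               direct_len;     /* short header copied in-line */
        char                 direct[64];
        int                  n_indirect;     /* larger payloads referenced, copied once */
        struct indirect_part indirect[4];
        int                  n_map;          /* whole pages passed by mapping, no copy */
        struct { void *base; size_t len; } map[2];
    };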
- Direct transfer:
- How to move data from A's memory to B's memory?
- Why not have kernel copy from A to kernel buffer, from buffer to
B? Why does UNIX pipe write()/read() use such a kernel buffer?
- Why not map pages in both A and B's address space, have A copy
message to B at user level? A might change msg while B is reading it,
requires pre-arrangement, A might not know how B wants data laid out
(for multi-part messages: header vs payload).
- L3 kernel maps some of B's address space internally to help it do
the copy. Since send waits for recv, kernel knows both src and
dst addresses and layout.
- Why change VM mappings rather than copy through physical memory?
(The TLB may be faster than code that interprets page tables, especially
if that code has to check for paged-out pages.)
- But changing page table can be expensive: does it require a TLB
flush (a load of %cr3)? Resulting TLB misses would be expensive.
- Only have to flush TLB if kernel has mapped something else there
since the last TLB flush. Kernel flushes TLB anyway on thread switch,
so the TLB is often "clean". It might not be clean if the copy is
interrupted by a page fault, or by a time-slice to a different thread
in the same address space.
- Thread control blocks:
- Problem: a thread has lots of state: stack, presence on various
lists, RUNNING/WAITING, pgdir, &c. It would be bad if these were in
different pages of memory (as they are in xv6 and JOS), since we might
pay one TLB miss for each even on the send side of the IPC.
- L3 solution: group all thread info required to send on single
page. Including stack, queue pointers, state, page directory. No
indirection: low bits of thread ID directly index an array of
page-size structs.
- TCB per page also eliminates some checks; un-map that page if
the thread does not exist or for some other reason can't handle
IPC fast path (e.g. swapped out to disk).
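  A sketch of the one-TCB-per-page indexing; the constants and field layout
  here are assumptions for illustration, not L3's actual TCB:

    #include <stdint.h>

    #define PAGE_SIZE       4096
    #define TCB_AREA        0xE0000000UL   /* assumed fixed virtual region of TCB pages */
    #define TID_INDEX_BITS  12             /* low bits of a thread ID index the array */

    struct tcb {
        uint32_t    state;                 /* RUNNING / WAITING / ... */
        uint32_t   *kstack_top;            /* kernel stack lives in the same page */
        uint32_t    pgdir_phys;            /* value to load into %cr3 */
        struct tcb *ready_next;            /* queue links kept here too */
        /* ... rest of the page holds the kernel stack ... */
    };

    /* No table lookup, no pointer chase: mask the thread ID and scale by a page.
       If the page is unmapped (thread doesn't exist, TCB paged out), the access
       faults and the slow path takes over. */
    static inline struct tcb *tid_to_tcb(uint32_t tid)
    {
        return (struct tcb *)(TCB_AREA +
                (uintptr_t)(tid & ((1u << TID_INDEX_BITS) - 1)) * PAGE_SIZE);
    }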
- Lazy scheduling:
- What is the problem addressed by lazy scheduling?
Why does L3 have ready/waiting queues instead of just
RUNNABLE/WAITING per-thread state?
Conventional approach to scheduling:
A sends message to B:
Move A from ready queue to waiting queue
Move B from waiting queue to ready queue
This requires 58 cycles, including 4 TLB misses. Where are the 4 TLB misses?
One each for head of ready and waiting queues
One each for previous queue element during the remove
- Lazy scheduling:
If A sends IPC to B:
set tcb[a].state = WAITING
set tcb[b].state = RUNNING
switch stacks and %cr3
(leave A on ready queue, leave B on waiting queue)
Thus:
Real thread state in its TCB; queues mostly just a hint
Ready queue holds superset of runnable threads (except current thread)
Waiting queue holds superset of waiting threads
Scheduler cleans queues: can tell from tcb[].state
- Why does this help performance?
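  A sketch of the lazy cleanup the scheduler might do when it next runs; the
  names and state values are illustrative, not L3's real scheduler:

    enum { WAITING, RUNNABLE, RUNNING };          /* illustrative states */

    struct tcb {
        int         state;                        /* authoritative thread state */
        struct tcb *ready_next;                   /* ready-list link, may be stale */
    };

    extern struct tcb *idle_thread;               /* assumed global idle thread */

    /* The ready list may contain threads that are really WAITING; the
       authoritative state is tcb->state, so the scheduler prunes stale entries
       as it scans instead of updating queues on every IPC. */
    struct tcb *pick_next(struct tcb **ready_head)
    {
        struct tcb **pp = ready_head;
        while (*pp) {
            struct tcb *t = *pp;
            if (t->state == RUNNABLE)
                return t;                         /* genuinely runnable thread */
            *pp = t->ready_next;                  /* stale hint: unlink lazily */
        }
        return idle_thread;
    }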
- Short messages via registers. Why might this have been a bad idea?
- Segment register optimization. JOS and xv6 always save DS,
set DS to kernel data segment. It's expensive (load and parse GDT
entry). Why can L3 avoid this? Does L3 risk having a thread
mess up the kernel by putting something weird in DS?
- Minimizing misses:
- TLB misses. Try to cram as much as possible onto one page:
IPC kernel code, GDT, IDT, TSS all on the same page. The whole tables
may not fit, but the important parts of each table can share that page
(maybe just the beginnings of the TSS, IDT, or GDT).
- Cache misses. Each costs roughly as much time as ~10 instructions.
Short offsets, avoid jumps, avoid checks, pack
often-used data on same cache lines, lazily save/restore CPU state
like debug and FPU registers. Much of the kernel is written in
assembly!
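  For example, the handful of TCB fields touched on the IPC fast path might be
  packed onto one cache line; the fields and the 32-byte line size here are
  assumptions, not L3's actual layout:

    #include <stdint.h>

    /* Hot IPC-path fields together so a single (assumed 32-byte) cache line
       brings them all in; cold state like debug registers and FPU state lives
       elsewhere and is saved/restored lazily. */
    struct tcb_hot {
        uint32_t  state;
        uint32_t  partner_tid;       /* thread we are sending to / receiving from */
        uint32_t  pgdir_phys;
        uint32_t *kstack_top;
        uint32_t  timeout;
    } __attribute__((aligned(32)));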
- What are the results? Figures 7 and 8 look good.
- Is fast IPC enough to get good overall system performance? This
paper doesn't make a statement either way; we have to read their 1997
paper to find the answer to that question.
- Is the principle of optimizing for performance right? In general,
it is wrong to optimize for performance; other things matter more. Is
IPC the one exception? Maybe; maybe not. Was Liedtke fighting a
losing battle against CPU makers? Should fast IPC be a hardware
issue, or just an OS issue?