Microkernels and capabilities

Required reading: KeyKOS, Confused deputy.

Reference reading: KeyKOS details.

Microkernels

the microkernel vision
    implement main o/s abstractions as user-space servers
	files, tcp, process mgmt, paging
    kernel provides minimum to support servers
	address spaces
	threads
	inter-process communication (IPC)
	some IPC handle which can send/receive messages

why attractive?
    (compared to monolithic UNIX-style design)
    isolation (vs bugs)
    fault-tolerant (restart individual services)
    modular (understandable, replaceable services)
    easier to distribute (IPC allows services on other hosts)
    easier to run on multiprocessor (since many servers)
    easier to provide scheduling, priority (at least at kernel API)
    easier to provide security (at least at kernel API)

history
    individual ideas around since the beginning
    lots of research projects starting early 1980s
    hit the big time w/ CMU's Mach in 1986
    thought to be too slow in early 1990s
    now slowly returning (OSX, QNX in embedded systems/routers, etc)
    ideas very influential on non-microkernels

microkernels in practice
    huge issue: Unix compatibility
	critical to widespread adoption
	difficult: Unix was not designed in a modular fashion
    Mach, L4: one big Unix server
	not a huge practical difference from a single Linux kernel
	Mach in particular was quite slow, many decided microkernels a bad idea
    KeyKOS: more interesting structure
	split up Unix into many entities
	first, some background on what problems KeyKOS was trying to solve

Capabilities

traditional access control model:
    process has some privileges (in Unix model: UID)
    for each syscall, kernel checks that process's privileges allow it (e.g. opening file)

what problem was Norm Hardy facing on such a system?
    wanted to give fortran compiler access to a special file (/sysx/stat) to collect statistics
    gave fortran compiler (/sysx/fort) a "home files license"
	kind-of like setuid on Unix; allowed process to write to /sysx
    user executed "/sysx/fort code.f -o /sysx/bill"
    wiped out unrelated billing file

whose problem is this?
    paper argues all components were working fine in isolation
    problem was giving extra privileges to the compiler
     -> every file access made by the compiler was now more
	privileged, even though only one of them (opening
	/sysx/stat) should have been!  doing this correctly
	would have required looking at all places where we
	opened files in the compiler.
    proposal: explicitly specify privileges to use for every operation

why might this be a good idea?
    easier to write secure, privileged programs
	program will not grant its privileges to things done on user's behalf

what are the limitations?
    somewhat invasive design implications
    have to pass around capabilities instead of user-meaningful names
    notion of a user identity is at odds with a pure-capability design

capabilities used in many settings
    hardware: Cambridge CAP system, x86 segments (in a way)
    OS kernel: Hydra, KeyKOS (and its successors)
    distributed systems: Amoeba
	self-authenticating capabilities don't need to be tracked by OS kernel the same way we'd track FDs

capability-like things appear in many places
    Unix FDs
    Java object references (and type-safe languages in general)
    URLs (that include authority info, rather than using HTTP auth)

KeyKOS

KeyKOS: capability as the base access control mechanism
    more structure than Mach

very small kernel, provides a few kinds of objects to applications:
    devices - access to underlying hardware, given to device driver processes
    pages - 4KB of data, basically a page of memory
    nodes - a block of 16 keys (key is their term for a capability)

    segments - a virtual address space
	like a page table or page directory, implemented using nodes, mapping to page keys at the leaves
	can construct segments out of other segments
    meters - CPU time allocation (even CPU-time explicitly allocated!)
    domains - something like a Unix process

domains are the most interesting object:
    16 general-purpose key slots (kind-of like capability registers)
	effectively an implicit node object
    address slot: key to entire virtual memory of process
    meter slot: key to CPU meter for process
    keeper slot: key to another domain that will handle exceptions

objects named by keys
    key is actually a 12-byte blob, but cannot handle their bytes directly
    instead, manipulated through explicit operations, kind-of like FDs (open, close, dup)
    the KeyBits service returns the actual 12 bytes behind any key given to it
	unknown: can you supply a virtualized KeyBits to hide the fact that
		 other keys are being virtualized as well?

what API does the kernel provide?
    at a low level, just 3 system calls:
	void FORK(key k, msg m): send m to k's domain, continue running afterwards
	msg *CALL(key k, msg m): send m to k's domain (+ newly-formed resume key for sender) and suspend
	msg *RETURN(key k, msg m): send m to k (will be returned from its CALL) and wait for next message

	why does the kernel require the receiver to be available?  what's the alternative?
	why does RETURN wait for a new message?  what if we don't want to wait?
    process/domain has 3 states: available, running, waiting
	process transition diagram
	kernel design suggests an object-oriented structure for applications
    other things you'd expect to be system calls are messages to special kernel-managed objects
	CALL to a device to access it
	how would you map a page of memory?
	what are the benefits of doing it this way?  easy to override built-in kernel objects.

how could one give keys to others?
    can clone keys (like dup on a Unix FD)
    keys include restriction flags -- e.g. can make a read-only 
    IPC primitive allowed passing 4 keys (which could refer be nodes for more keys)
    can tack a "data byte" onto a key on creation (delivered to the key's domain on key invocation)
	useful for keeping track of where this key came from
	distinguish multiple callers of the same service domain
	we'll see how it might be used later on

object keepers (for meters, segments, domains) handle exceptions:
    segment keeper: domain to handle page faults
	conceptual model: want program loads and stores to translate into IPC messages
	hardware doesn't quite do this for us
	so treat address space as a cache, and insert/remove mappings as needed
	load/store on an invalid mapping now will generate an IPC message
    meter keeper: domain to call when CPU time expires, controls CPU time allocation
	really more about billing than fine-grained scheduling policy
    domain keeper: domain to call for other CPU exceptions
	didn't vector FORK/CALL/RETURN system calls to keeper; should it have?
    kernel makes faults look like a CALL from the faulting domain to keeper
    invoked keeper gets a "service key" to faulting object, to fix it up as needed

where do objects come from?
    a "bank", implemented as a domain of its own (synthesized the first time system boots up)
	top-level bank has all nodes and pages in a system
	other objects (domains, meters, segments) are really just special forms of a node
    how many nodes/pages did a bank have?
	total disk space in a system (has nothing to do with a machine's DRAM size)
	would that be a good idea today?
	    probably not in general-purpose systems: want over-allocation for performance
	    maybe OK for embedded or special-purpose systems
    invoke a bank to allocate a new object, returns a "service key" for that object
    for domain, can populate domain's slots and make it runnable
	very much like JOS exofork
    how to free up space?
	child space banks can be deallocated completely, releasing resources to parent

what was the idea of persistence?
    applications have no notion of a disk, just memory
    kernel periodically saves complete system state to disk
    no other kernel state of interest
	kernel stacks, scheduler queues, buffers -> ephemeral
	open files, processes, messages waiting to be sent -> stored in kernel objects

how did applications use the disk, then?
    just store data in memory
    kernel saves it to disk eventually
    to read from disk, just access memory; kernel will demand-page if necessary

why does KeyKOS need persistence?
    kernel provides no way of recovering lost capabilities
    would have needed some special mechanism to "synthesize" capabilities on bootup
    with persistence, can use capabilities for controlling access to data on disk

problems with persistence
    could be saving a lot of unnecessary stuff to disk
    could be hard to recover if "init" dies
    could be hard to kill runaway processes
	how do meters and space banks help?
    hard to deal with bad blocks

KeyKOS (and its predecessor, GNOSIS) was a very different system than Unix
    no file system root
    every user has their own "home directory", which just mapped names to keys
    keys included both files and processes that you could invoke (using CALL or FORK)
    one user's home directory is not nameable by other users (unless granted)
    every user has a persistent shell domain that keeps a key to user's home dir
    a login process ("receptionist") keeps a password and start key for each user's shell domain
    user types in password, login does an RPC call into user's shell, passing in a key for user's terminal
    when invoking a command in shell, must say if argument is a string or a capability for file of that name
    must specify all capabilities at invocation time

KeyNIX

how did the paper implement Unix compatibility?
    special Unix keeper to emulate Unix
    a separate Unix keeper for each Unix process
    shared state stored in explicitly-shared segment

how did the filesystem work?
    separate domain for each inode
    file inodes stored data in segment objects
	to read/write, insert the file's segment into your address-space segment
	because file inode is isolated from the rest of the FS code, easy to implement local optimizations
	    paper suggests storing data in the inode itself, for small files
    directory inodes supported lookup, insert, unlink messages
    who maintains consistency?  what happens on a crash?
    how does fsync work?  two kinds of fsync (in KeyKOS at least).

file system access control
    Unix model
	inodes have an owner user ID and group ID
	access permissions for the user, group, and everyone else
    unfortunately this is fundamentally at odds with capability model (users and groups are ambient authority)
    paper doesn't describe what happened
    simple version: kernel domain keeps track of an application's user/group IDs, checks against inode's fields
    better version: embed who's accessing the file in the inode key's data byte (user, group, or everyone else)
	inode returns appropriate segment key based on the file's perms for the entity specified in data byte
	    e.g. read-only segment for read-only access, or nothing if access prohibited
	directory inodes can propagate the data byte to any keys they return for files and sub-directories
	how would we get the user an appropriate key for his home directory in the first place?
	    maybe directory inodes keep a user or group password, and allow a "dirlookup-with-password" op
	    or KeyKOS-style: user home dir key kept by login service, shell keeps homedir key along w/ FS root

what benefits do we get from implementing Unix with capabilities (KeyNIX)?
    perhaps a more reliable kernel (fault isolation)
    what happened if you crashed your kernel?
	maybe locked entries in the shared table -- not so great
	could corrupt other process table entries or file descriptors
	can we design the system a little better?
	    what about shared file descriptors?
	    what about process table?  how to name entries?
    can run apps in different Unix universes (i.e. different process&FD table segment) on same machine
	could also have multiple file systems

performance evaluation
    overall, seems to perform OK for such a radical design
    hard to draw any fundamental conclusions about KeyKOS -- little said about why Mach performs poorly
    1st fundamental problem: namei slow
	what were their proposed alternatives?
	1) one namei domain stores all directories, but keep one domain per file inode
	2) one domain for entire filesystem, like Mach/L4
    2nd fundamental problem: implementing asynchronous mechanisms (KeyKOS IPC is synchronous)
	how did they implement pipes?
	    queue domain at each end to avoid blocking main kernel domain
	    wanted to to handle signals (^C) even if process blocked writing to pipe
	    figure 2 in paper
	    [perhaps pseudo-code for kernel and queue domains]

how did KeyKOS integrate the Unix model with capabilities?
    still susceptible to confused deputy problem
    capabilities have to be implemented at all levels, Unix lacks them
    avoiding confused deputies in Unix would require changing programming conventions

Discussion

compare to other microkernels?
    KeyKOS has somewhat more structure than Mach
    synchronous IPC in KeyKOS, async in Mach
	complicates Mach kernel, hard to say if performance win or not
    Mach has lots of system calls for operating on ports
    KeyKOS implements the same thing with messages
    superficial difference: KeyKOS has fewer system calls (but so what..)
    more interesting difference:
	KeyKOS allows programs to provide their own implementation of
	    objects that might traditionally be serviced by the kernel.
	Mach always handles system calls in the kernel, would need some
	    special system call redirection mechanism.

compare to JOS, exokernel?
    what ideas could we use from these papers?
	capability naming
	    env IDs: cannot pass rights to others right now
	full persistence
	    conflicts with exokernel philosophy
	    but could work
	per-file protection domains
	    might incur a high overhead in JOS, at least with naive impl
	    lab 5 has a single file server, though could easily run many of them
	allowing applications to handle faults
	    JOS kind-of does this with page fault upcalls
	    might be nice to vector these to another environment
	    or vector everything (including syscalls) to another environment

compare to xv6, Linux?
    can use file descriptor passing as capabilities
    would have been nice to pass other things as capabilities too
	child PIDs (so processes other than parent can wait)
    a lot of process structure exposed to applications
	easier to write a debugger
	easier to do some lower-level optimizations

    still have ambient authority -- process has a uid
    must use setuid to switch privileges before/after operations
    leads to programs that try to emulate security checks in userspace
    part of why TOCTTOU problems are abundant

why didn't pure-capability systems catch on?
    tightly couples naming and access control, limited notion of user identity
    requires structuring your data around your access control policy
	e.g. can give someone a capability for an entire directory, harder to specify permissions for each file
    biggest win was probably in distributed systems
    capabilities appear in various forms (Java objects)

what about microkernels?
    running Unix in a single server process not very interesting
    doing interesting things like KeyKOS requires more design effort
    harder to evolve system
	because low-level primitives can be specific to original requirements, changes could require re-design
    optimizations or features that affect multiple subsystems are hard
	sendfile affects FS + network stack
	fixing priority inversion might require re-design