Motivation
  Commodity kernels are written in C
    For good reason: C gives the programmer total control
  But C is hard to use correctly
    Memory management is left to the programmer
  Serious problems for kernel developers:
    Concurrent data structures are challenging (RCU, next week)
    Memory safety bugs:
      Use-after-free (notoriously difficult to debug)
      Buffer overflows (security vulnerabilities)
      Examples:
        https://source.codeaurora.org/quic/la//kernel/msm-3.14/commit/?id=72f67b29a9c5e6e8d3c34751600c749c5f5e13e1 (CVE-2017-0619)
        https://source.codeaurora.org/quic/la//kernel/msm-3.10/commit/?id=9656e2c2b3523af20502bf1e933e35a397f5e82f (CVE-2017-0621)
        https://source.codeaurora.org/quic/la//kernel/msm-4.4/commit/?id=01b2c9a5d728ff6f2f1f28a5d4e927aaeabf56ed (CVE-2017-0620)
  A high-level language (HLL) automatically eliminates memory safety bugs!
    HLLs have a garbage collector (GC)
    The GC automates memory deallocation
      Convenient for the programmer
      ...and provides memory safety
    But GC has costs:
      CPU cycles at runtime
      Delays execution (pauses)
      Extra memory
  Thus using an HLL is a trade-off: safety vs. performance
    Determining the performance cost is important to understanding this trade-off
    No in-depth performance evaluation of an HLL kernel had been done before,
      despite researchers having built many HLL kernels already
  Want: a better understanding of HLL kernel performance
  Goal: compare the performance of an HLL kernel against a fast kernel, i.e., against Linux

Biscuit
  Started in 2014
  ~30k lines of Go
  Architecture similar to Linux's, to make the comparison fairer:
    POSIX syscalls
    Can run complex apps (NGINX, Redis) without source modifications
  Similar to Linux in many ways; some differences:
    Kernel threads are light-weight goroutines
      A context switch doesn't save/restore the page table
      Kernel code cannot dereference user pointers; it must translate them manually
    (What are SSE registers/instructions?  x86's 128-bit vector registers and the SIMD instructions that use them)
      The Go compiler freely uses SSE registers,
      while Linux avoids using SSE in the kernel,
        explicitly saving/restoring SSE registers only where it must
      Result: Biscuit saves/restores SSE registers more often
    Go isn't designed to handle interrupts
      The runtime doesn't disable/enable interrupts around its critical sections,
        so entering a critical section (e.g., the allocator) from an interrupt handler could deadlock
      Thus interrupt handlers can't do much:
        instead they wake a handling goroutine and return (see the sketch below)
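A minimal sketch of that division of labor in plain Go, with hypothetical names (trapHandler, netDaemon). Stock Go channels are not actually safe to use from interrupt context, which is part of why Biscuit had to modify the runtime to mark handler goroutines runnable directly; the sketch only conveys the structure:

    package main

    import "time"

    var pending = make(chan struct{}, 1) // buffer of one lets a single wakeup be queued

    // trapHandler stands in for code run in interrupt context.  It must not
    // allocate or acquire locks that a preempted goroutine might hold, so it
    // only records that work is pending and returns.
    func trapHandler() {
        select {
        case pending <- struct{}{}:
        default: // a wakeup is already queued; nothing to do
        }
    }

    // netDaemon is the long-running goroutine that does the real work in
    // ordinary goroutine context, where allocating and blocking are safe.
    func netDaemon() {
        for range pending {
            // drain device queues, allocate packet buffers, wake sleepers, ...
        }
    }

    func main() {
        go netDaemon()
        trapHandler()                     // simulate one interrupt firing
        time.Sleep(10 * time.Millisecond) // demo only: let the daemon run
    }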
The biggest difference: handling out-of-memory (OOM)

OOM Problem
  Biscuit, Windows, Linux, FreeBSD, etc. all face it
    (Swap doesn't solve this problem, though it may make OOM less likely)
  Many kernel operations allocate memory:
    open(2) allocates a file object
    socket(2) allocates a socket object
  The user program decides when to release the resource and free the memory
    => the user program determines how much heap memory the kernel uses
  But the machine has limited memory
  Problem: what if user programs cause the kernel to allocate all of memory?
    Why would this happen?
      A buggy program
      A database server intentionally using most memory, plus an unlucky spike in allocations
    Result: almost no operation can succeed until memory is freed
      Hard or impossible for a user program to handle sensibly:
        printf() fails
        Even exit(2) fails!  (the memory it needs is in use by other processes)
      No user program can make progress
  How to recover?
    Need to free memory
    Recover memory by killing the memory-hogging process
      Can't simply kill it: exit(2) must allocate, to close files and keep the FS consistent
      Result: the kernel must reserve enough memory for exit(2) to complete
        Linux and Biscuit both do so
    How the kernel kills the hog is important

Linux's approach
  Blocking in the allocator is tempting:
    then the caller never has to handle failure
  But blocking can deadlock
    Ex: a well-behaved program takes a lock on a directory in the FS
      The memory hog waits for that lock
      The kernel can't kill the hog:
        the hog waits for the good program,
        and the good program waits (in the allocator) for the hog to exit
      Deadlock
  Linux avoids the deadlock by instead failing the good program's allocation
    Result: the kernel must handle failure of nearly all allocations
      Hard to get right, and filled with bugs (hence the "too small to fail" rule)

Biscuit's approach
  Can't use Linux's approach (among other reasons, a failed Go allocation aborts the program; kernel code can't check for it)
  Instead: before executing an operation, wait until enough memory is free for the whole operation
    No waiting in the middle of an operation,
      so no locks are held while waiting,
      thus no deadlock
  How to compute the maximum memory an operation might need?
    It turns out static analysis of Go is easy
    Wrote a tool: MAXLIVE
      High level: fancy escape analysis
      Not exact, but conservative: it over-estimates the needed memory, never under-estimates
      Inspects the call graph and finds all allocations of each syscall
      Two kinds of objects:
        1) may be written to a global
        2) only ever referenced by stack pointers
      Type-1 objects are always live
      Type-2 objects are freed when some stack frame is destroyed
      Mem bound = sum of type-1 allocations + max, over call-graph leaves, of live type-2 allocations
  Result: no deadlock, and almost no handling of allocation failures (see the sketch below)
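A sketch of the reservation discipline under assumed names: reserve, release, maxLive, and the budget are illustrative, not Biscuit's real API or numbers. The key point is that an operation waits up front, holding no locks, and then runs to completion without checking any allocation for failure:

    package main

    import "sync"

    var (
        memLock  sync.Mutex
        memCond  = sync.NewCond(&memLock)
        freeHeap = int64(512 << 20) // hypothetical kernel heap budget
    )

    // reserve blocks until n bytes can be set aside.  It runs at the *start*
    // of a syscall, before any locks are taken, so waiting here cannot deadlock.
    func reserve(n int64) {
        memLock.Lock()
        defer memLock.Unlock()
        for freeHeap < n {
            memCond.Wait()
        }
        freeHeap -= n
    }

    // release returns the reservation once the op's allocations are no longer live.
    func release(n int64) {
        memLock.Lock()
        freeHeap += n
        memLock.Unlock()
        memCond.Broadcast()
    }

    // sysOpen shows the pattern; the bound would come from MAXLIVE's analysis.
    func sysOpen(path string) int {
        const maxLive = 4096 // hypothetical MAXLIVE-computed worst case
        reserve(maxLive)
        defer release(maxLive)
        // ... allocate a file object, walk the FS, etc., never checking for OOM ...
        return -1 // stub: no real FS here
    }

    func main() { _ = sysOpen("/tmp/x") }

Because the wait happens before the operation takes any locks, the kernel is free to kill a memory hog and reclaim its memory without risking the deadlock described above.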
Experience
  90 uses of "unsafe"
  Hacked the runtime in a couple of ways:
    to schedule interrupt goroutines
    to count allocations
  Some code is non-portable because Go's memory model is unclear
  Go was helpful:
    slices vs. pointer+size
    defer vs. gotos
    closures
    maps
  GC vs. manual memory management:
    GC significantly simplifies concurrent data structures
      Entries are heap allocated
      With manual management: when to free an entry?  when all other threads are done with it?
      With GC, the question disappears
  Implemented many OS performance optimizations to compete with Linux:
    map kernel text with large pages
    per-CPU NIC TX queues
    an RCU-like directory cache
    Go didn't prevent their implementation

Performance
  Three demanding apps:
    NGINX: web server
    Redis: key-value store
    Cmailbench: fork/exec/VM benchmark
  They exercise the 10 Gb NIC, TCP stack, VM, and FS
    No idle CPU cycles
    At least 80% of CPU time is spent in the kernel
  Results below

Linux comparison
  Is Biscuit's performance in the same league?
  Disabled expensive Linux features
    Speeds Linux up, making the comparison fairer
  Biscuit is within 9% of Linux on our apps
  Linux is much faster in other situations:
    more than four CPUs (Linux is NUMA-aware)

GC cost: < 3%
  The cost of GC is determined by two factors:
    1) the number of live objects (cost increases with more objects)
       the GC must mark/read the pointers in each object
    2) the amount of free heap memory (cost decreases with more free memory)
       the GC runs each time free memory is exhausted
  The apps use up to 5 GB of memory
    and cause the kernel to allocate rapidly
  But the number of kernel heap objects is small
    The kernel heap contains only small metadata objects
    A separate allocator hands out whole pages for:
      user memory
      file content pages
      socket buffers
      page-table pages
    This reduces the number of kernel heap objects
      and increases free heap memory
  In general: trade memory for reduced CPU overhead on large heaps

Isolating the performance difference due to the HLL from OS feature differences
  Modified Linux and Biscuit to get two nearly identical code paths:
    pipe ping-pong
    page fault
    CPU-time profiles show both OSes doing the same thing
  Ping-pong: the C version is 15% faster
    The Go version pays for safety checks and write barriers
  Page fault: the C version is 5% faster
    Kernel entry/exit and copying dominate the other work

GC pauses
  During GC, each allocation must complete some GC work
    Pauses come from this GC work
  Max single pause: 115 us
  Pauses can accumulate
    Max accumulated in NGINX: 600 us
  We reduced pause times by tuning the GC

Conclusion
  The HLL worked well
  Performance is pretty good
  But C is faster

Questions
  What about other languages, like Python?
    Other HLLs would likely have much different performance results
    Multithreading in Python is harder than in Go
  Is it feasible to use an HLL when performance matters less, and C when it matters more?
    Interesting idea; it seems like it could work
    But it's difficult to share GC heap objects with C code
  How does Biscuit access physical addresses without pointer arithmetic?
    It uses "unsafe" pointer conversions (see the first sketch below)
  Why is nearly-identical Go code slower than C?
    The Go compiler inserts more instructions:
      safety checks
      write barriers
    (see the second sketch below)
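For the physical-address question: Go has no pointer arithmetic, so a kernel must go through the unsafe package to turn an integer address into a typed pointer. A minimal sketch; readPhys is a hypothetical name, and the assumption that physical memory is identity-mapped is ours, not necessarily Biscuit's:

    package main

    import (
        "fmt"
        "unsafe"
    )

    // readPhys converts a physical address (a plain integer) into a typed
    // pointer and loads through it.  Assumes the address is identity-mapped
    // in the kernel's address space.
    func readPhys(addr uintptr) uint32 {
        return *(*uint32)(unsafe.Pointer(addr))
    }

    func main() {
        // Demonstrate on an address this process actually owns:
        x := uint32(0x1234)
        fmt.Printf("%#x\n", readPhys(uintptr(unsafe.Pointer(&x))))
    }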
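And for the last question, a sketch of ordinary Go annotated with where the compiler adds those extra instructions; the comments describe standard Go compiler behavior, not anything Biscuit-specific:

    package main

    type node struct{ next *node }

    // index compiles to a load plus a bounds check: if i is out of range,
    // the inserted check panics instead of reading out of bounds.
    func index(xs []int, i int) int {
        return xs[i]
    }

    // link stores a pointer into a heap object; while the collector is
    // concurrently marking, an inserted GC write barrier records the store
    // so the GC cannot miss b.  C needs neither the check nor the barrier.
    func link(a, b *node) {
        a.next = b
    }

    func main() {
        _ = index([]int{1, 2, 3}, 1)
        link(new(node), new(node))
    }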