6.828 2018 Lecture 22: O/S Network Performance, IX

*** please fill out a course evaluation!
    http://web.mit.edu/subjectevaluation/

Reading: IX: A Protected Dataplane Operating System for High Throughput
  and Low Latency, OSDI 2014

this lecture
  O/S network stack performance
  IX as case study

Linux network software structure
  [diagram]
  queue: NIC DMA
  processing: driver interrupt
  queue: input queue
  processing: TCP (or UDP, ICMP, NFS, &c)
  queue: socket buffer (store until app wants to read)
  processing: application read()
  note a few things:
    in-kernel to get access to NIC hardware
    in-kernel to de-multiplex, e.g. ARP vs TCP
    in-kernel to prevent one app from messing with another app's connections
    lots of locks and inter-core sharing, e.g. queues, TCP connection state

how does network software structure affect performance?
  let's focus on high-performance network servers
    e.g. memcached, an in-memory key/value storage server
    high request rate
    short requests / responses
    lots of clients, lots of potential parallelism
  want high throughput under high load (requests per second)
  want low latency under low/modest load (seconds per request)
  want low tail of latency distribution

what are the relevant h/w limits? i.e. what can we hope for?
  throughput limits:
    10 gigabit ethernet: 15 million tiny packets/second
    40 gigabit ethernet: 60 million tiny packets/second
    RAM: a few gigabytes per second
    interrupts: a million per second
    system calls: a few million per second
    contended locks: a million per second
    inter-core data movement: a few million per second
  so:
    if limited by ethernet and RAM, ~10 million/sec short queries
    if limited by interrupts, locks, &c: ~1 million/sec (maybe per core)
  latency ingredients:
    latency important for e.g. web page with 100s of items
    low load: sum of a sequence of steps
      network speed-of-light and switch round-trip time
      interrupt
      queue operations
      sleep/wakeup
      system calls
      inter-core data movement
      RAM fetches
    high load: latency is largely determined by wait time -- queuing
      efficiency (high throughput) reduces queuing time
      bursty arrivals increase queuing time
      bursty service times increase queuing time
      structural problems can increase queuing time
        load imbalance, or no-one serving a queue
    latency is hard to reason about, hard to improve

IX: a design for a high-performance network stack
  OSDI 2014
  lives in Linux
  different syscall API (doesn't preserve Linux API)
  different stack architecture (doesn't use Linux stack code or design)

IX diagram -- Figure 1(a)
  Linux kernel
  IX kernel
    NIC DMA queues
    TCP/IP stack
  IX application
    multiple threads

IX notes
  IX runs in VMX non-root (guest) mode using Dune
    IX kernel at CPL=0
    IX application at CPL=3
  Linux kernel gives IX dedicated NIC queues and dedicated cores
    then Linux isn't involved much
  IX application makes system calls to IX kernel
    mostly to send and receive packets
  packet buffers in memory shared between IX kernel and IX application
    so packet data isn't copied (unlike Linux)

idea: batching syscall interface
  the problem:
    syscall overhead is big if messages are small
    want to send/recv more packets/second than one can do syscalls/second
  the solution: run_io() (Section 4.3)
    run_io() argument describes many writes to many TCP connections
    run_io() return describes incoming data on many TCP connections
    (really more general, e.g. returns new connection events too)
  so: each user/kernel crossing does lots of work
    while True:
      run_io(out, in)
      for msg in in:
        process msg
        out.append(reply)
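  (aside, not from the paper: a minimal Python sketch of this batching
   pattern; run_io()'s batch format and handle() are hypothetical stand-ins,
   not IX's real API -- the point is that one crossing carries many messages)

    def serve(run_io):
        out = []                                  # batch of (conn, bytes) to send
        while True:
            events = run_io(out)                  # ONE user/kernel crossing
            out = []                              #   for the whole batch
            for conn, data in events:             # drain every event before
                out.append((conn, handle(data)))  #   the next crossing

    def handle(data):
        return data                               # application work; echo here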
idea: run-to-completion -- Figure 1(b)
  the problem:
    Linux uses CPU time moving packets through stages and queues
    queues: good if application is doing something else
      bad for network performance (locks, core-to-core, cache eviction)
  what is "run-to-completion"?
    complete the processing of one batch of inputs before starting on next batch
    really complete: driver, TCP, application, enqueue reply
  how?
    run_io(), function call down to driver, return pkt all the way up to app
    app calls next run_io() with reply message
  why?
    single thread carries the batch of packets through all steps
    avoids queues, sleep/wakeup, context switch, inter-core xfers
    helps keep active packet batch in the CPU data cache (vs long queues)
    avoids worries about balancing processing rate in each stage

idea: polling rather than interrupts
  the problem:
    interrupts are expensive
    interrupts are redundant if always likely to be input waiting
  what is polling?
    periodically check NIC DMA queues for new input
    and disable NIC interrupts
  why hard?
    where to put the checks? i.e. in what loop?
    might check too often -- waste CPU
    might check too rarely -- high latency, queue overflow
  IX's solution:
    each core dedicated to one application thread
      while(1) { run_io(); ... }
    run_io() polls NIC DMA queues
    no waste: if no input, nothing for the core to do anyway
    if input, grabs a batch and returns it to application
    automatically polls more frequently at low load, less at high load
      very nice; paper calls this "adaptive polling"

what about multi-core parallelism?
  the problem:
    one core often can't deliver enough throughput
    will leave most of 10-gigabit ethernet idle
  the opportunity:
    lots of clients
    work for each client is often independent
    all modern machines have multiple cores
  the dangers:
    lock contention is expensive
    data movement (between cores) is expensive
  to avoid data movement and lock contention:
    all actions for a client, TCP, and packet should be on same core
    no data should be used on more than one core
  examples of potentially shared data:
    packet content
    NIC queues
    packet free lists
    TCP data structures
    application data (e.g. memcached's in-memory DB)

idea: multiple NIC queues for parallelism
  modern NICs support many independent DMA queues
    NIC uses filters and hashing to pick the queue
  Linux sets up a separate set of NIC DMA queues per IX application
    one queue per core for each IX application
  Linux tells NIC a filter for each IX application
  NIC hashes client IP addr / port to pick the queue for each incoming packet
    "flow-consistent hashing" or "receive-side scaling" (RSS)
    (simplified sketch after this section)
  NIC gives all packets for a given TCP connection to the same core!
    no need to share TCP connection state among cores
    no need to move packet data between cores
  run_io() looks at NIC DMA queue for just its own core
  a new connection is given to the core determined by the NIC's hash
    hopefully uniform and results in balanced load
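  (aside, not from the paper: a simplified Python sketch of flow-consistent
   hashing; real NICs use a keyed Toeplitz hash over the packet's 4-tuple,
   and the hash and queue count here are illustrative assumptions -- the
   property that matters is: same connection -> same queue -> same core)

    import hashlib

    NUM_QUEUES = 4   # e.g. one NIC DMA queue per core of this application

    def pick_queue(src_ip, src_port, dst_ip, dst_port):
        # any deterministic hash of the 4-tuple gives the key property:
        # every packet of one connection maps to one queue, hence one core
        key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}".encode()
        h = int.from_bytes(hashlib.sha256(key).digest()[:4], "little")
        return h % NUM_QUEUES

    # all packets of a given TCP connection land on the same queue:
    assert pick_queue("10.0.0.1", 12345, "10.0.0.2", 80) == \
           pick_queue("10.0.0.1", 12345, "10.0.0.2", 80)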
idea: zero copy
  how to avoid IX/user and user/IX copies of TCP data?
    across the CPL=0/CPL=3 boundary (like user/kernel)
    40 gigabits/sec may stress RAM throughput
  IX uses page table to map packet buffers into both IX and application
    NIC DMAs to/from this memory
    run_io() carries pointers into this memory
  app/IX cooperate to note when received/sent buffer is free
    freed buffers reported via run_io()

IX design limitations
  assumes many parallel clients making small requests
    you'd want something else for a single 40-gigabit/second transfer
  assumes good load balance across cores
    i.e. clients and requests evenly divided by NIC onto queues
    i.e. requests all take about the same amount of time
    could re-assign flows to NIC queues?
    could steal work from other cores?
  assumes non-blocking request handling
    i.e. service code computes and then replies
    does not read the disk, send an RPC and wait, &c
    b/c blocking would produce an idle core and a rapidly-growing queue
    could shift blocking requests to a dedicated thread/core?

Evaluation
  what should we look for?
    high throughput under high load -- especially for small messages
    low latency under light load
    throughput proportional to # of cores

Figure 2
  looks at latency under light load
  one client, ping-pong, one request outstanding at a time
  x-axis is message size
  y-axis is throughput (gigabits/second)
  why does the line rise with increasing message size?
    lots of fixed overheads amortized over increasing data
    increased opportunity for batching, since many packets/message
    rtt, TCP/IP header, interrupts (for Linux), syscalls, TCP processing
  what limits the rise?
    10-gigabit ethernet minus headers
  why does IX beat Linux?
    for small messages (e.g. 64 bytes): latency-limited
      IX polling sees the message sooner
      IX has no interrupt/queueing/sleep/wakeup
      fewer user/kernel crossings
      paper says 5.7 us for IX 64-byte; 24 us for Linux
      this is enough for people to care!
    for big messages: throughput-limited
      IX has less advantage here, since most wins are per-packet
      IX's zero-copy might be important

Figure 3(a)
  effect of adding cores on throughput
  ideally: throughput increases with core count
    until we hit the 10- or 40-gigabit/second ethernet limit
  note this does *not* show that the software is efficient
    only that it gets good parallel speedup
  18 clients, 64-byte messages, one per connection
  x-axis is number of cores
  y-axis is RPCs/second in millions
  why do the lines go up (at least at start)?
    work is split over more parallel cores
    18 clients, so perhaps up to 18 cores could be used
  is the throughput proportional to the number of cores?
    probably "yes" for all lines, at start
    so locking &c are not eliminating all parallelism
  note half a million / second for IX with one core
    that's 2 microseconds of CPU time per request/response
    or about 4000 CPU cycles (arithmetic sketch after this section)
  why does the IX 10-gigabit line level off?
  why does IX beat Linux?
    polling, batched system calls, run to completion
  it's impressive that IX still scales linearly at 4 million/sec
    that's a very high number for any system!
    it suggests that parallelization is nearly perfect
    RSS helps
    s/w must have absolutely no locks, no inter-core data movement
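  (aside: the arithmetic behind the ~4000-cycle figure, assuming a roughly
   2 GHz core clock -- an assumption; the paper's machines may differ)

    req_per_sec = 500_000                # ~0.5 million requests/sec, one core
    seconds_per_req = 1 / req_per_sec    # 2e-06 s = 2 microseconds
    clock_hz = 2_000_000_000             # assumed 2 GHz clock
    cycles_per_req = seconds_per_req * clock_hz
    print(cycles_per_req)                # ~4000 cycles per request/response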
IX makes some big architectural decisions differently
  per-application network stack
    rather than single stack shared by all applications
    allows packet buffers to be shared, so zero-copy
  dedicates cores to applications, and to application threads
    rather than shared cores, multiplexed by the kernel
    allows polling and run-to-completion
    helps make the software more efficient, simpler
    requires plentiful cores
    (rough pinning sketch at the end of these notes)
  dedicates NIC queues to application
    rather than shared queues, de-multiplexed by the kernel
    more direct access, for better efficiency
    requires plentiful NIC queues
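  (aside, not IX's mechanism: IX dedicates cores via Dune and Linux; as a
   rough analogy only, an ordinary Linux process can pin its polling loop to
   one core with Python's sched_setaffinity, which is Linux-only)

    import os

    def dedicated_core_loop(core_id, run_io):
        os.sched_setaffinity(0, {core_id})   # 0 = this process; never migrates
        while True:
            run_io([])                       # in IX, run_io() polls this
                                             #   core's own NIC DMA queue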