6.1810 2022 Lecture 21: Networking

topics
  packet formats and protocols
  software stack in kernel -- already familiar due to lab
  overload behavior -- today's paper

why do we care about software handling of network traffic?
  s/w stacks -- and linux -- used pervasively for packet processing
    low-end routers, NATs, firewalls, VPNs, load balancers, network monitors
  services too: DNS, web, NAS, IP route discovery, &c
  performance, design, overload performance are hot topics

physical network architecture
  [local: apps, host, NIC, LAN, NIC, host, apps/services]
  ethernet cable, switch, WiFi, &c
  LANs are limited in physical size, number of hosts
  [Internet: hosts, LAN, router, ..., LAN, hosts]

ethernet packet format
  [kernel/net.h, struct eth]
  (preamble bits for synchronization, e.g. 10101010)
  (start symbol, e.g. 10101011)
  destination ethernet address -- 48 bits
  source ethernet address -- 48 bits
  ether type -- 16 bits
  payload
  (CRC-32)
  (end symbol; e.g. 8b/10b or manchester)
  [tcpdump -t -xx -n]

ethernet addresses
  a simple ethernet LAN broadcasts: all host NICs see all packets
    a host uses the dest addr to decide if a packet is really for it
  today's ethernet LAN is a switch, with a cable to each host
    the switch uses the dest addr to decide which host to send a packet to
  every NIC has a unique ethernet address, built in
    assigned by the NIC manufacturer
    <24-bit manufacturer ID, 24-bit serial number>
  could we use ethernet addresses for everything?
    1. ethernet was not the only kind of bottom-layer address
    2. "flat" addresses make wide-area routing hard
  so: 32-bit IP addresses, with routing hints in the high bits
  but how to translate from an IP address to a local ethernet address?
ARP
  [kernel/net.h, struct arp]
  a request/response protocol to translate an IP address to an ethernet address
  "nested" inside an ethernet packet, with ether type 0x0806
  ARP header indicates request vs response
  request contains the desired IP address
  request packets are broadcast to every host on a switch
    all the hosts process the packet, but only the owner of that IP addr responds
  response contains the corresponding ethernet address
  [tcpdump -t -n -xx arp]

note: the habit is to "nest" higher-layer packets inside lower-layer packets
  e.g. ARP inside ethernet
  [diagram: eth | ip | udp | DNS]
  so you often see a sequence of headers (ether, then ARP)
  one layer's payload starts with the next layer's header

the ethernet header is enough to get a packet to a host on the local LAN
but more is needed to route the packet to a distant Internet host

IP header
  [kernel/net.h, struct ip]
  ether type 0x0800
  lots of stuff, but the addresses are the most critical
  a 32-bit IP address is enough to route to any Internet computer
    the high bits contain a "network number" that helps routers
    understand how to forward through the Internet
  the protocol number tells the destination what to do with the packet
    i.e. which higher-layer protocol to hand it to (usually UDP or TCP)
  [tcpdump -t -n -xx ip]

UDP header
  [kernel/net.h, struct udp]
  once a packet is at the right host, which app should it go to?
  UDP header sits inside the IP packet
  contains src and dst port numbers
  an application uses the "socket API" system calls to tell the kernel
    it would like to receive all packets sent to a particular port
  some ports are "well known", e.g. port 53 is reserved for DNS servers
  others are allocated as-needed for the client ends of connections
  after the UDP header: payload, e.g. a DNS request or response
  [tcpdump -t -n udp]

TCP
  like UDP, but with explicit connections and sequence number fields,
  so it can retransmit lost packets and match the send rate to
  network and destination capacity.
**********

what does the network code look like in a typical kernel?
  typical UNIX / Linux setup
  widely used in e.g. firewalls and servers
  design matters for performance, overload
  control flow
    multiple independent actors (interrupt handlers, threads)
    queues at boundaries

here's a typical setup (much variation; the lab setup is simpler)
  * NIC physical layer (often more than one NIC!)
  - NIC internal buffers
  * NIC DMA engine
  - DMA RX ring
  * rx interrupt handler, copies from NIC to s/w input queue
  - ip input queue
  * IP / ARP / TCP / UDP input processing; PCBs, ARP and route tables
  - socket queues
  * applications
  * TCP / UDP / IP output processing
  - IP output queue
  * tx interrupt handler, copies from s/w output queue to NIC
  - NIC DMA TX ring
  * NIC

notes:
  -- packet buffers w/ allocator (see struct mbuf in net.h)
  -- each layer parses, validates, and strips headers on the way in
     discards the packet if there's a problem
  -- each layer prepends a header on the way out

why all these queues of buffers?
  absorb temporary input bursts, to avoid forcing discard if s/w is busy
  keep the output NIC busy while computing
  allow independent control flow (NICs vs network thread vs apps)

other arrangements are possible and sometimes much better
  e.g. user-level stack
  e.g. direct user access to NIC (see Intel's DPDK)
  e.g. polling, as in today's paper

NIC packet buffering
  paper's NIC buffers packets internally
    driver s/w must copy to/from RAM
  lab assignment's NIC (the Intel e1000) DMAs to/from host RAM
    DMA is faster than a s/w copy loop
    DMA can go on concurrently with compute tasks
  interrupts are useful even when you have DMA
    alert s/w to a newly arrived packet
    tell s/w the NIC is done sending, so the buffer can be freed

**********

Today's paper: Eliminating Receive Livelock in an Interrupt-Driven
Kernel, by Mogul and Ramakrishnan

Why are we reading this paper?
  To illustrate some tradeoffs in kernel network stack structure
  It's a famous and influential paper
  Analogues to livelock arise in many systems

Explain Figure 6-1
  This is the original system, without the authors' fixes.
  The system is a router: [NIC, input queue, IP, output queue, NIC] (Figure 6-2)
  Why do the black dots go up?
  What determines how high the peak is?
    peak @ 5000 => 200 us/pkt.
  Why do the black dots go down?
  What determines how fast it goes down?
  What happens to the packets that aren't forwarded?

The problem:
  If overloaded, interrupts for packets n+1, n+2, &c
    prevent processing of packet n.
  Or, the interrupt handler wastes precious CPU time reading packets
    that will end up being discarded due to a full input queue.

A disk interrupts -- could a disk cause livelock?
How about a sound card?
How about a name server receiving DNS requests?

Why not completely process each packet in the interrupt handler?
  I.e. forward it? (this is what the network lab does)
  (starves tx intr. starves other devs' rx intrs. no story for user processing.)

Why not always poll, and never enable interrupts at all?
  while(1)
    if NIC packets waiting
      read a packet from NIC
      process the packet
  sleep? don't sleep?

Paper's approach to solving livelock:
  once the kernel has started to process a packet,
    don't spend time on subsequent packets until done with this one!
  i.e. completely finish this pkt before any CPU is spent on the next.
  problem: interrupts force the kernel to spend CPU time

What's the paper's solution?
  No IP input queue
  NIC receive interrupt just wakes up the net processing thread
    Then leaves interrupts *disabled* for that NIC
  Thread does all processing, re-checks the NIC for more input,
    only re-enables interrupts if no input is waiting.

  NIC intr:
    disable further NIC interrupts
    wake up net thread (but don't read any packets)

  net thread:
    while(1)
      if NIC packets waiting
        read a packet from NIC
        process the packet
      else
        enable interrupts
        sleep

What happens when packets arrive too fast?
Why does this help avoid livelock?
What happens when packets arrive slowly?

Modern Linux uses a scheme -- NAPI -- inspired by this paper.

Explain Figure 6-3
  This graph includes their system.
  Why do the empty squares level off?
  What happens to the excess packets?
  Why does "Polling (no quota)" work badly?
    Input starves xmit-complete processing
  Why does "no quota" rapidly fall, rather than gradually decreasing?
    Livelock is made worse by doing even more processing before discard
    I.e. each excess rx pkt consumes many tx pkts of CPU time

Explain Figure 6-4
  (this is with every packet going through a user-level program)
  Why does "Polling, no feedback" behave badly?
    There's a queue in front of screend
    If overloaded, 100% to input thread, 0% to screend
  Why does "Polling w/ feedback" behave well?
    Input thread sleeps when the queue to screend is near-full
    Wakes up when the queue is near-empty
  What would happen if screend hung?

Big picture: the polling loop is a place to exert scheduling control
  more controllable than the CPU's interrupt hardware (e.g. PLIC)

Can we formulate any general principles?
  Don't spend time on new work before completing existing work
  Design so that efficiency increases with load, rather than decreasing.
    E.g. the paper's switch from interrupts to polling under high load.

Other approaches to the same problem
  interrupt coalescing
  multi-core and multiple NIC queues
  DPDK, kernel bypass

Similar phenomena arise in other areas of systems
  Timeout + resend in networks, as the number of connections grows
  Spin-locks, as the number of cores grows

A general lesson: complex (multi-stage) systems need careful scheduling
of resources if they are to survive loads close to capacity.

Linux's NAPI polling/interrupting scheme:
  https://www.usenix.org/legacy/publications/library/proceedings/als01/full_papers/jamal/jamal.pdf