6.S081 2020 Lecture 21: Networking

topics
  packet formats and protocols
  software stack in kernel
  today's paper -- livelock

overall network architecture
  diagram: apps, host, NIC, LAN, NIC, host, apps
  diagram: hosts, LAN, router, ..., LAN, hosts

ethernet packet format
  [printout of kernel/net.h, struct eth]
  start "flag"
  destination ethernet address -- 48 bits
  source ethernet address -- 48 bits
  ether type -- 16 bits
  payload
  end "flag"
  [eth tcpdump output]

ethernet addresses
  a simple ethernet LAN broadcasts: all host NICs see all packets
    a host uses the dest addr to decide if a packet is really for it
  today's ethernet LAN is a switch, with a cable to each host
    the switch uses the dest addr to decide which host to send a packet to
  ethernet addresses are assigned by the NIC manufacturer
    <24-bit manufacturer ID, 24-bit serial number>

ARP
  [kernel/net.h, struct arp]
  a request/response protocol to translate an IP address to an ethernet address
  "nested" inside an ethernet packet, with ether type 0x0806
  ARP header indicates request vs response
  request: the desired IP address
    request packets are broadcast to every host on a switch
    all the hosts process the packet, but only the owner of that IP addr responds
  response: the corresponding ethernet address
  [arp tcpdump output]

note: the habit is to "nest" higher-level packets inside lower-level packets
  e.g. ARP inside ethernet
  so you often see a sequence of headers (ether, then ARP)
  one layer's payload starts with the next layer's header
  the ethernet header is enough to get a packet to a local host
  but more is needed to route the packet to a distant Internet host

IP header
  [kernel/net.h, struct ip]
  ether type 0x0800
  lots of stuff, but the addresses are the most critical
  a 32-bit IP address is enough to route to any Internet computer
    the high bits contain a "network number" that helps routers
    understand how to forward through the Internet
  the protocol number tells the destination what to do with the packet
    i.e. which higher-level protocol to hand it to (usually UDP or TCP)
  [ip tcpdump output]

UDP header
  [kernel/net.h, struct udp]
  once a packet is at the right host, which app should it go to?
  the UDP header sits inside the IP packet
  contains src and dst port numbers
  an application uses the "socket API" system calls to tell the kernel
    it would like to receive all packets sent to a particular port
  some ports are "well known", e.g. port 53 is reserved for DNS servers
  others are allocated as needed for the client ends of connections
  after the UDP header: the payload, e.g. a DNS request or response
  [udp tcpdump output]

TCP
  like UDP but has sequence number fields, so it can retransmit lost
  packets and match the send rate to network and destination capacity

layering view of a typical kernel network stack
  apps, e.g. web browser, DNS server
  socket layer, port->fd table
  UDP | TCP
  IP, routing table | ARP, table
  NIC drivers -- packet buffers w/ allocator (see struct mbuf in net.h)
-- each layer parses, validates, and strips headers on the way in
   (see the sketch below); it discards the packet if there's a problem
-- each layer prepends a header on the way out
-- the software layer structure partially follows the header nesting
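To make "parse, validate, and strip" concrete, here is a minimal C sketch of
the inbound path. The struct eth layout follows the lecture's description of
kernel/net.h; the function names, dispatch structure, and byte-swap helper
are illustrative assumptions, not the lab's exact code.

  #include <stdint.h>

  #define ETHTYPE_IP  0x0800   // IP nested inside ethernet
  #define ETHTYPE_ARP 0x0806   // ARP nested inside ethernet

  struct eth {
    uint8_t  dhost[6];    // destination ethernet address -- 48 bits
    uint8_t  shost[6];    // source ethernet address -- 48 bits
    uint16_t type;        // ether type -- 16 bits, big-endian on the wire
  } __attribute__((packed));

  // wire (big-endian) to host order, assuming a little-endian host
  static uint16_t my_ntohs(uint16_t x) { return (x >> 8) | (x << 8); }

  // the next layers up; bodies omitted in this sketch
  static void ip_rx(char *buf, int len)  { (void)buf; (void)len; }
  static void arp_rx(char *buf, int len) { (void)buf; (void)len; }

  void net_rx(char *buf, int len)
  {
    if (len < (int) sizeof(struct eth))
      return;                           // validate; discard if a problem
    struct eth *e = (struct eth *) buf;
    uint16_t type = my_ntohs(e->type);  // demux field for the next layer
    buf += sizeof(struct eth);          // strip the ethernet header;
    len -= sizeof(struct eth);          // payload starts with the next header
    if (type == ETHTYPE_IP)
      ip_rx(buf, len);                  // hand the payload to IP
    else if (type == ETHTYPE_ARP)
      arp_rx(buf, len);                 // or to ARP
    // else: unknown ether type, drop the packet
  }

The outbound path is the mirror image: each layer prepends its header before
handing the buffer down to the next layer.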
control flow view of a typical kernel network stack
  multiple independent actors
  each has input packets, processes them, produces output
  here's a typical setup (much variation; the lab setup is simpler)
    NICs, with internal or DMA buffering
    rx interrupt handler, copies from NIC to s/w input queue
    tx interrupt handler, copies from s/w output queue to NIC
    "network thread" reads packets from s/w input queue
      what to do with each packet? ARP, forward, or put on a socket queue
    applications -- read from per-socket queues

why all these queues of buffers?
  absorb temporary input bursts
  keep the output NIC busy while computing
  allow independent control flow (NICs vs network thread vs apps)

other arrangements are possible and sometimes much better
  e.g. a user-level stack
  e.g. direct user access to the NIC (see Intel's DPDK)
  e.g. polling, as in today's paper

NIC packet buffering
  the paper's NIC queues packets in its own memory
    driver s/w must copy them to RAM
  the lab assignment's NIC (the Intel e1000) DMAs into host RAM
    s/w prepares a "ring" of buffer pointers, one ring for rx and one for tx
    the NIC DMAs each packet to the memory pointed to by successive ring elements
  why DMA?
    DMA is faster than a s/w copy loop
    DMA can go on concurrently with compute tasks

Today's paper: Eliminating Receive Livelock in an Interrupt-Driven Kernel,
by Mogul and Ramakrishnan

Why are we reading this paper?
  To illustrate some tradeoffs in kernel network stack structure
  It's a famous and influential paper
  Livelock / congestion collapse comes up in many situations

Explain Figure 6-1
  This is the original system, without the authors' fixes.
  Why does the curve go up?
  What determines how high the peak is?
    peak @ 5000 pkts/sec => 200 us/pkt
  Why does it go down?
  What determines how fast it goes down?
  What happens to the packets that aren't forwarded?

A disk uses interrupts -- would a disk cause this kind of problem?
How about the UART?
How about a host receiving TCP traffic?

Why not completely process each packet in the interrupt handler?
  I.e. forward it? (this is what the network lab does)
  (you still need an output queue; it starves tx interrupts; it starves
   other devices' rx interrupts; and there's no story for user-level processing)

Why not always poll, and never use interrupts?

Overall goal: once we've started to spend CPU on a packet, make sure we
finish that packet!
  i.e. avoid partially processing a packet, then discarding it due to overload
  for the special case of forwarding: give output priority over input
  interrupts allow us no control

What's the paper's solution?
  No IP input queue
  The NIC receive interrupt just wakes up the network thread
    and then leaves interrupts *disabled* for that NIC
  The thread does all the processing, re-checks the NIC for more input,
    and only re-enables interrupts if no input is waiting

  NIC intr:
    wake up net thread (but don't read any packets)
    disable NIC interrupts
  net thread loop:
    if NIC packets waiting
      read a few packets from NIC
      process each packet (this is the polling part)
    else
      enable interrupts
      sleep

What happens when packets arrive too fast?
Why does this help avoid livelock?
What happens when packets arrive slowly?

Modern Linux uses a scheme -- NAPI -- inspired by this paper.
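In C, the pseudocode above might look like the following sketch. All the
driver hooks (nic_intr_disable, nic_rx_ready, and so on) and the QUOTA
constant are invented names left as externs, not the paper's or xv6's actual
code; locking is omitted.

  #define QUOTA 16                   // max packets per poll pass (made-up value)

  struct pkt;                        // opaque packet buffer

  extern void nic_intr_disable(void);
  extern void nic_intr_enable(void);
  extern int  nic_rx_ready(void);    // any packets waiting in the NIC?
  extern struct pkt *nic_rx_fetch(void);
  extern void process(struct pkt *); // ARP, forward, or put on a socket queue
  extern void wakeup(void *chan);
  extern void sleep_on(void *chan);

  static char chan;                  // sleep/wakeup channel for the net thread

  // rx interrupt: read no packets, just hand off to the thread
  void nic_rx_intr(void)
  {
    nic_intr_disable();              // stay in polling mode while input waits
    wakeup(&chan);
  }

  void net_thread(void)
  {
    for (;;) {
      if (nic_rx_ready()) {
        // polling: completely process each packet before taking more,
        // and take at most QUOTA so tx-complete work also gets CPU time
        for (int i = 0; i < QUOTA && nic_rx_ready(); i++)
          process(nic_rx_fetch());
      } else {
        nic_intr_enable();           // idle: fall back to interrupts
        if (!nic_rx_ready())         // re-check to close the race with arrival
          sleep_on(&chan);
      }
    }
  }

The quota is what separates this from "Polling (no quota)" in Figure 6-3: it
bounds how long input can monopolize the thread, so transmit-complete
processing still runs under continuous input.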
Explain Figure 6-3
  This graph includes their system.
  Why do the empty squares level off?
  What happens to the excess packets?
  Why does "Polling (no quota)" work badly?
    Input still starves xmit-complete processing
  Why does it immediately fall to zero, rather than gradually decreasing?
    Livelock is made worse by doing even more processing before discarding
    I.e. each excess rx pkt consumes many tx pkts' worth of CPU time

Explain Figure 6-4
  (this is with every packet going through a user-level program)
  Why does "Polling, no feedback" behave badly?
    There's a queue in front of screend
    We can still give 100% of the CPU to the input thread, 0% to screend
  Why does "Polling w/ feedback" behave well?
    The input thread sleeps when the queue to screend is near-full,
    and wakes up when the queue is near-empty
  What would happen if screend hung?

Big picture: the polling loop is a place to exert scheduling control

Why are the two solutions different?
  1. Polling thread *with quotas*
  2. Feedback from a full queue (see the sketch at the end of these notes)
  Perhaps they could have used #2 for both
    Feedback doesn't require magic numbers
    But it is hard to retro-fit into the existing UNIX structure

What if processing has a more complex structure?
  A chain of processing stages with queues? Does feedback work?
    What happens when a late stage is slow?
  Split at some point, with multiple parallel paths?
    Not so great; one slow path blocks all paths

Can we formulate any general principles?
  Don't spend time on new work before completing existing work
  Design so that efficiency increases with load, rather than decreasing
    E.g. the paper's switch from interrupts to polling under high load

Similar phenomena arise in other areas of systems
  Timeout + retransmission in networks, as the number of connections grows
  Spin-locks, as the number of cores grows

A general lesson: complex (multi-stage) systems may need careful scheduling
of resources if they are to survive loads close to capacity
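As a final illustration, here is what "feedback from a full queue" might look
like: a bounded queue with high and low watermarks, as in the screend
experiment above. The names and constants are invented, and locking plus the
actual sleep/wakeup calls are omitted.

  #define QSIZE      64
  #define HIGH_WATER 56   // input thread stops taking new packets here
  #define LOW_WATER   8   // ...and resumes once the consumer drains to here

  struct pkt;             // opaque packet buffer

  struct sockq {
    struct pkt *slots[QSIZE];
    int head, tail, count;
    int input_enabled;    // the feedback bit read by the polling loop
  };

  // producer side, called by the input (polling) thread; the thread must
  // not call this while input_enabled == 0, so the queue cannot overflow
  void q_put(struct sockq *q, struct pkt *p)
  {
    q->slots[q->head] = p;
    q->head = (q->head + 1) % QSIZE;
    if (++q->count >= HIGH_WATER)
      q->input_enabled = 0;          // near-full: input thread goes to sleep
  }

  // consumer side, called on behalf of screend; assumes count > 0
  struct pkt *q_get(struct sockq *q)
  {
    struct pkt *p = q->slots[q->tail];
    q->tail = (q->tail + 1) % QSIZE;
    if (--q->count <= LOW_WATER)
      q->input_enabled = 1;          // near-empty: wake the input thread
    return p;
  }

The gap between the two watermarks provides hysteresis: the input thread
isn't woken and re-slept on every packet, and no CPU is spent receiving
packets that would only be discarded at a full queue.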