6.1810 2022 Lecture 21: Networking

topics
  packet formats and protocols
  software stack in kernel -- already familiar due to lab
  overload behavior -- today's paper

why do we care about software handling of network traffic?
  s/w stacks -- and linux -- used pervasively for packet processing
    low-end routers, NATs, firewalls, VPNs, load balancers, network monitors
  services too: DNS, web, NAS, IP route discovery, &c
  performance, design, overload performance are hot topics

physical network architecture
  [local: apps, host, NIC, LAN, NIC, host, apps/services]
  ethernet cable, switch, WiFi, &c
  LANs are limited in physical size, number of hosts
  [Internet: hosts, LAN, router, ..., LAN, hosts]

ethernet packet format
  [kernel/net.h, struct eth]
  (preamble bits for synchronization, e.g. 10101010)
  (start symbol, e.g. 10101011)
  destination ethernet address -- 48 bits
  source ethernet address -- 48 bits
  ether type -- 16 bits
  payload
  (CRC-32)
  (end symbol; e.g. 8b/10b or manchester)
  [tcpdump -t -xx -n]

ethernet addresses
  a simple ethernet LAN broadcasts: all host NICs see all packets
    a host uses the dest addr to decide if a packet is really for it
  today's ethernet LAN is a switch, with a cable to each host
    the switch uses the dest addr to decide which host to send a packet to
  every NIC has a unique ethernet address, built in
    assigned by the NIC manufacturer
    <24-bit manufacturer ID, 24-bit serial number>
  could we use ethernet addresses for everything?
    1. ethernet was not the only kind of bottom-layer address
    2. "flat" addresses make wide-area routing hard
  so: 32-bit IP addresses, with routing hints in the high bits
  but how to translate from an IP address to a local ethernet address?
ARP
  [kernel/net.h, struct arp]
  a request/response protocol to translate an IP address to an ethernet address
  "nested" inside an ethernet packet, with ether type 0x0806
  ARP header indicates request vs response
  request contains the desired IP address
  request packets are broadcast to every host on a switch
    all the hosts process the packet, but only the owner of that IP addr responds
  response contains the corresponding ethernet address
  [tcpdump -t -n -xx arp]

note: the habit is to "nest" higher-layer packets inside lower-layer packets
  e.g. ARP inside ethernet
  [diagram: eth | ip | udp | DNS]
  so you often see a sequence of headers (ether, then ARP)
  one layer's payload starts with the next layer's header

the ethernet header is enough to get a packet to a host on the local LAN
but more is needed to route the packet to a distant Internet host

IP header
  [kernel/net.h, struct ip]
  ether type 0x0800
  lots of stuff, but the addresses are the most critical
  a 32-bit IP address is enough to route to any Internet computer
    the high bits contain a "network number" that helps routers
    understand how to forward through the Internet
  the protocol number tells the destination what to do with the packet
    i.e. which higher-layer protocol to hand it to (usually UDP or TCP)
  [tcpdump -t -n -xx ip]

UDP header
  [kernel/net.h, struct udp]
  once a packet is at the right host, which app should it go to?
  UDP header sits inside the IP packet
  contains src and dst port numbers
  an application uses the "socket API" system calls to tell the kernel
    it would like to receive all packets sent to a particular port
  some ports are "well known", e.g. port 53 is reserved for DNS servers
  others are allocated as-needed for the client ends of connections
  after the UDP header: payload, e.g. a DNS request or response
  [tcpdump -t -n udp]

TCP
  like UDP, but with explicit connections and sequence number fields,
  so it can retransmit lost packets and match the send rate to
  network and destination capacity.
**********

what does the network code look like in a typical kernel?
  typical UNIX / Linux setup
  widely used in e.g. firewalls and servers
  design matters for performance, overload
  control flow
    multiple independent actors (interrupt handlers, threads)
    queues at boundaries

here's a typical setup (much variation; the lab setup is simpler)
  * NIC physical layer (often more than one NIC!)
  - NIC internal buffers
  * NIC DMA engine
  - DMA RX ring
  * rx interrupt handler, copies from NIC to s/w input queue
  - ip input queue
  * IP / ARP / TCP / UDP input processing; PCBs, ARP and route tables
  - socket queues
  * applications
  * TCP / UDP / IP output processing
  - IP output queue
  * tx interrupt handler, copies from s/w output queue to NIC
  - NIC DMA TX ring
  * NIC

notes:
  -- packet buffers w/ allocator (see struct mbuf in net.h)
  -- each layer parses, validates, and strips headers on the way in
     discards the packet if there's a problem
  -- each layer prepends a header on the way out

why all these queues of buffers?
  absorb temporary input bursts, to avoid forcing discard if s/w is busy
  keep the output NIC busy while computing
  allow independent control flow (NICs vs network thread vs apps)

other arrangements are possible and sometimes much better
  e.g. user-level stack
  e.g. direct user access to NIC (see Intel's DPDK)
  e.g. polling, as in today's paper

NIC packet buffering
  paper's NIC buffers packets internally
    driver s/w must copy to/from RAM
  lab assignment's NIC (the Intel e1000) DMAs to/from host RAM
    DMA is faster than a s/w copy loop
    DMA can go on concurrently with compute tasks
  interrupts are useful even when you have DMA
    alert s/w to a newly arrived packet
    tell s/w the NIC is done sending, so the buffer can be freed

**********

Today's paper: Eliminating Receive Livelock in an Interrupt-Driven
Kernel, by Mogul and Ramakrishnan

Why are we reading this paper?
  To illustrate some tradeoffs in kernel network stack structure
  It's a famous and influential paper
  Analogues to livelock arise in many systems

Explain Figure 6-1
  This is the original system, without the authors' fixes.
  The system is a router: [NIC, input queue, IP, output queue, NIC] (Figure 6-2)
  Why do the black dots go up?
  What determines how high the peak is?
    peak @ 5000 => 200 us/pkt.
  Why do the black dots go down?
  What determines how fast it goes down?
  What happens to the packets that aren't forwarded?

The problem:
  If overloaded, interrupts for packets n+1, n+2, &c
    prevent processing of packet n.
  Or, the interrupt handler wastes precious CPU time reading packets
    that will end up being discarded due to a full input queue.

A disk interrupts -- could a disk cause livelock?
How about a sound card?
How about a name server receiving DNS requests?

Why not completely process each packet in the interrupt handler?
  I.e. forward it? (this is what the network lab does)
  (starves tx intr. starves other devs' rx intrs. no story for user processing.)

Why not always poll, and never enable interrupts at all?
  while(1)
    if NIC packets waiting
      read a packet from NIC
      process the packet
  sleep? don't sleep?

Paper's approach to solving livelock:
  once the kernel has started to process a packet,
    don't spend time on subsequent packets until done with this one!
  i.e. completely finish this pkt before any CPU is spent on the next.
  problem: interrupts force the kernel to spend CPU time

What's the paper's solution?
  No IP input queue
  NIC receive interrupt just wakes up the net processing thread
    Then leaves interrupts *disabled* for that NIC
  Thread does all processing, re-checks the NIC for more input,
    only re-enables interrupts if no input is waiting.

  NIC intr:
    disable further NIC interrupts
    wake up net thread (but don't read any packets)

  net thread:
    while(1)
      if NIC packets waiting
        read a packet from NIC
        process the packet
      else
        enable interrupts
        sleep

What happens when packets arrive too fast?
Why does this help avoid livelock?
What happens when packets arrive slowly?

Modern Linux uses a scheme -- NAPI -- inspired by this paper.

Explain Figure 6-3
  This graph includes their system.
  Why do the empty squares level off?
  What happens to the excess packets?
  Why does "Polling (no quota)" work badly?
    Input starves xmit-complete processing
  Why does "no quota" rapidly fall, rather than gradually decreasing?
    Livelock is made worse by doing even more processing before discard
    I.e. each excess rx pkt consumes many tx pkts of CPU time

Explain Figure 6-4
  (this is with every packet going through a user-level program)
  Why does "Polling, no feedback" behave badly?
    There's a queue in front of screend
    If overloaded, 100% to input thread, 0% to screend
  Why does "Polling w/ feedback" behave well?
    Input thread sleeps when the queue to screend is near-full
    Wakes up when the queue is near-empty
  What would happen if screend hung?

Big picture: the polling loop is a place to exert scheduling control
  more controllable than the CPU's interrupt hardware (e.g. PLIC)

Can we formulate any general principles?
  Don't spend time on new work before completing existing work
  Design so that efficiency increases with load, rather than decreasing.
    E.g. the paper's switch from interrupts to polling under high load.

Other approaches to the same problem
  interrupt coalescing
  multi-core and multiple NIC queues
  DPDK, kernel bypass

Similar phenomena arise in other areas of systems
  Timeout + resend in networks, as the number of connections grows
  Spin-locks, as the number of cores grows

A general lesson: complex (multi-stage) systems need careful scheduling
of resources if they are to survive loads close to capacity.

Linux's NAPI polling/interrupting scheme:
  https://www.usenix.org/legacy/publications/library/proceedings/als01/full_papers/jamal/jamal.pdf