6.1810 2024 Lecture 20: Networking

topics
  layering of packet formats
  layering of kernel software stack
  overload behavior -- today's paper

why do we care how software handles network traffic?
  s/w stacks -- and linux -- are used pervasively for packet processing
    low-end routers, firewalls, VPNs
    services: DNS, web
  performance, design, and overload behavior are hot topics

paper's scenario:
  [3-iface router, two LANs, hosts, Internet]

ethernet packet format
  [kernel/net.h, struct eth]
  (preamble bits for synchronization, e.g. 10101010)
  (start symbol, e.g. 10101011)
  destination ethernet address -- 48 bits
  source ethernet address -- 48 bits
  ether type -- 16 bits
  payload
  (CRC-32)
  (end symbol; e.g. 8b/10b or manchester)
  [tcpdump -t -xx -n | more]

ethernet addresses
  every Ethernet NIC has a unique ethernet address, built in
    assigned by the NIC manufacturer
    <24-bit manufacturer ID, 24-bit serial number>
  the original Ethernet was broadcast on a cable
    a host used the dest addr to decide if a packet was really for it
  today's ethernet LAN is a switch, with a cable to each host
    the switch uses the dest addr to decide which host to send a packet to

could we use ethernet addresses for everything?
  1. no: ethernet was not the only kind of bottom-layer address
  2. no: "flat" addresses make wide-area routing hard
  so: 32-bit IP addresses, with routing hints in the high bits

IP header
  [kernel/net.h, struct ip]
  ether type 0x0800
  lots of stuff, but the addresses are the most critical
  a 32-bit IP address is enough to route to any Internet computer
    the high bits contain a "network number" that helps routers
    understand how to forward through the Internet
  the protocol number tells the destination what to do with the packet
    i.e. which higher-layer protocol to hand it to (usually UDP or TCP)
  [tcpdump -t -n -xx ip | more]

note the pattern: "nest" higher-layer packets inside lower-layer packets
  [diagram: eth | ip | udp | DNS]
  one layer's payload starts with the next layer's header
  so you often see a sequence of headers

UDP header
  [kernel/net.h, struct udp]
  once a packet is at the right host, which app should it go to?
  the UDP header sits inside the IP packet
    contains src and dst port numbers
  an application uses the "socket API" system calls to tell the kernel
    it would like to receive all packets sent to a particular port
  some ports are "well known", e.g. port 53 is reserved for DNS servers
  others are allocated as needed for the client ends of connections
  after the UDP header: payload, e.g. a DNS request or response
  [tcpdump -t -n udp]

**********

what does the network code look like in a typical kernel?
  design matters for performance and overload behavior

control flow
  * NIC hardware (often more than one NIC!)
      NIC internal buffers
  * NIC DMA engine
      DMA RX ring
  * rx interrupt handler, copies from NIC to s/w input queue
      IP input queue
  * network thread
      IP &c processing
      maybe forward
      maybe socket queue
  * applications
  * output processing
      - IP output queue
  * tx interrupt handler
      - frees finished NIC TX DMA ring slots
      - copies from s/w output queue to TX ring
  * NIC
  the net lab is a subset of this arrangement (see the sketch below)

notes:
  -- each layer parses, validates, and strips headers on the way in
     discards the packet if it's damaged or a queue is too long
  -- each layer prepends a header on the way out

why all these packet queues?
  absorb temporary input bursts, to avoid forced discards while s/w is busy
  keep the output NIC busy while computing
  allow independent control flow (NICs vs network thread vs apps)
    queues at the boundaries
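to make the rx half of this concrete, here is a minimal C sketch of an
interrupt-driven receive path with a bounded IP input queue, in the
spirit of the net lab but not the actual lab or xv6 code; the names
(NQUEUE, rx_intr, netthread, nic_rx_ring_next, ip_rx) are invented for
illustration:

  #define NQUEUE 64               // max IP input queue length

  struct {
    struct spinlock lock;         // assume initlock() at boot
    char *pkt[NQUEUE];            // packet buffers, one page each
    int len[NQUEUE];
    int r, w;                     // read/write counters; w - r = queue length
  } ipq;

  // rx interrupt handler: drains the NIC's DMA RX ring into the
  // s/w input queue, discarding when the queue is full.
  void
  rx_intr(void)
  {
    char *buf;
    int len;

    acquire(&ipq.lock);
    while((buf = nic_rx_ring_next(&len)) != 0){  // hypothetical driver helper
      if(ipq.w - ipq.r >= NQUEUE){
        kfree(buf);               // queue full: discard -- but only after
      } else {                    // the interrupt already spent CPU on it
        ipq.pkt[ipq.w % NQUEUE] = buf;
        ipq.len[ipq.w % NQUEUE] = len;
        ipq.w += 1;
      }
    }
    wakeup(&ipq);
    release(&ipq.lock);
  }

  // network thread: the IP &c processing; under heavy input it may
  // never run, since interrupts have priority -- the root of the
  // livelock in today's paper.
  void
  netthread(void)
  {
    char *buf;
    int len;

    for(;;){
      acquire(&ipq.lock);
      while(ipq.r == ipq.w)
        sleep(&ipq, &ipq.lock);   // wait for rx_intr()'s wakeup
      buf = ipq.pkt[ipq.r % NQUEUE];
      len = ipq.len[ipq.r % NQUEUE];
      ipq.r += 1;
      release(&ipq.lock);
      ip_rx(buf, len);            // hypothetical: parse, maybe forward,
                                  // maybe append to a socket queue
    }
  }

note where the discard happens: in rx_intr(), after the NIC and the
interrupt handler have already done their work on the packet.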
other arrangements are possible and sometimes much better
  e.g. a user-level stack
  e.g. direct user access to the NIC (see Intel's DPDK)
  e.g. polling, as in today's paper

**********

Today's paper: Eliminating Receive Livelock in an Interrupt-Driven
Kernel, by Mogul and Ramakrishnan, 1996

Why are we reading this paper?
  To examine some tradeoffs in kernel network stack structure.
  It's a famous and influential paper.
  Analogues to livelock arise in many systems.

Context
  software routers are very common, often Linux
    e.g. firewalls, cable modems, wifi access points
  the max network speed is often > the router's speed
  thus overload is a possibility

Explain Figure 6-1
  This is the original system, without the authors' fixes.
  The system is a router:
    [NIC, RX intr, input q, IP, output q, TX intr, NIC]
    (Figure 6-2)
  Why do the black dots go up?
  What determines how high the peak is?
    peak @ 5000 => 200 us/pkt
  Why do the black dots go down?
  What determines how fast it goes down?
  What happens to the packets that aren't forwarded?

The problem:
  NIC interrupts have priority over all other processing.
  As the input rate grows, interrupts eventually use 100% of the CPU
    (there's only one core!).
  No CPU time is then available for the rest of IP processing.
  The IP input queue grows to its maximum allowed length.
  NIC interrupts then read each packet, see a full queue, and discard
    the packet.

A higher-level view:
  Software wastes time partially processing packets that it will
  eventually discard.

Could livelock happen in the net lab?

Suppose the router design was as good as possible.
  What would the graph then look like?
  That is, what is it reasonable to aim for?

What's the paper's solution?
  No IP input queue.
  The NIC receive interrupt just wakes up the net thread,
    then leaves interrupts *disabled* for that NIC.
  The thread does all the processing, re-checks the NIC for more input,
    and only re-enables interrupts if no input is waiting.

  polling net thread:
    while(1)
      if NIC packets waiting
        read a packet from NIC
        completely process the packet
      else
        enable interrupts
        sleep

What happens when packets arrive quickly?

Why interrupt at all?  Why not always poll?
  What happens when packets arrive slowly?

Modern Linux uses a scheme -- NAPI -- inspired by this paper.

Explain Figure 6-3
  This graph includes their system.
  Why do the empty squares level off?
  Why don't the empty squares go down like Figure 6-1's black dots?
  What happens to the excess packets?

A higher-level view:
  Discard happens as early as possible (in the NIC).
  The CPU does not waste time partially processing doomed packets.

Why does "Polling (no quota)" work badly?
  Input starves xmit-complete processing.
Why does "no quota" fall off rapidly, rather than decreasing gradually?
  Livelock is made worse by doing even more processing before discard,
  i.e. each excess rx packet consumes many tx packets' worth of CPU time.

Explain Figure 6-4
  (this is with every packet going through a user-level program)
  Why does "Polling, no feedback" behave badly?
    There's a queue in front of screend.
    Under overload, 100% of the CPU goes to the net thread, 0% to screend.
  Why does "Polling w/ feedback" behave well?
    The net thread sleeps when the queue to screend is near-full,
    and wakes up when the queue is near-empty.

Big picture: the polling loop is a place to exert scheduling control
  quotas, feedback

General principles?
  Don't start new work if that might prevent completion of existing work.
  When discard is needed, do it as early as possible.
  Design so that efficiency increases with load, rather than decreasing,
    e.g. the paper's switch from interrupts to polling under high load
    (sketched in C below).
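to make the paper's scheme concrete, here is a hedged C sketch of the
polling net thread with a quota (as in figure 6-3); the helpers
(nic_disable_rx_intr, nic_enable_rx_intr, nic_rx_avail, nic_rx_next,
process_completely, nic_tx_complete) are invented, and the paper's real
implementation lives in a different kernel:

  #define QUOTA 10                // max packets fully processed per poll
                                  // round, so that rx input can't starve
                                  // xmit-complete and other work

  struct spinlock poll_lock;

  // rx interrupt: just schedule the net thread, and leave rx
  // interrupts *disabled* for this NIC until it has caught up.
  void
  rx_intr(void)
  {
    nic_disable_rx_intr();
    acquire(&poll_lock);          // hold the lock so the wakeup can't be
    wakeup(&poll_lock);           // lost between the thread's check and
    release(&poll_lock);          // its sleep
  }

  void
  netthread(void)
  {
    int n;

    for(;;){
      // poll: take packets straight from the NIC's DMA ring,
      // processing each one completely before touching the next
      for(n = 0; n < QUOTA && nic_rx_avail(); n++)
        process_completely(nic_rx_next());
      nic_tx_complete();          // reclaim finished TX DMA ring slots
      if(nic_rx_avail())
        continue;                 // quota exhausted, input still waiting:
                                  // keep polling, leave interrupts off
      acquire(&poll_lock);
      nic_enable_rx_intr();
      if(!nic_rx_avail())         // re-check: a packet may have slipped in
        sleep(&poll_lock, &poll_lock);
      release(&poll_lock);
    }
  }

when input is heavy the loop never re-enables interrupts and the NIC
itself discards the excess; when input is light each packet still gets
the low latency of an interrupt.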
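and a sketch of figure 6-4's feedback idea: watermark hysteresis on the
queue to the user-level program; HIGH, LOW, screend_q_len, and the
surrounding plumbing are again invented for illustration:

  #define HIGH 28                 // stop taking NIC input at this depth
  #define LOW   4                 // resume once drained to this depth

  int screend_q_len;              // packets queued to user-level screend
  static int stopped;             // hysteresis state

  // called by the polling loop before each packet; while it returns 0
  // the net thread stops polling, so excess packets are dropped cheaply
  // in the NIC and screend gets the CPU instead.
  int
  input_allowed(void)
  {
    if(screend_q_len >= HIGH)
      stopped = 1;                // near-full: let screend catch up
    else if(screend_q_len <= LOW)
      stopped = 0;                // near-empty: resume taking input
    return !stopped;
  }

to use it, the polling loop's condition gains "&& input_allowed()", and
screend's dequeue path wakes the net thread when the queue drains to LOW.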
Other approaches to the same problem
  interrupt coalescing
  multi-core and multiple NIC queues
  expose NIC DMA rings to user code (DPDK)

Similar phenomena arise in other areas of systems
  timeout + resend in networks, as the number of connections grows
  spin-locks, as the number of cores grows

A general lesson: complex (multi-stage) systems need careful scheduling
of resources if they are to survive overload.

Linux's NAPI polling/interrupting scheme:
  https://www.usenix.org/legacy/publications/library/proceedings/als01/full_papers/jamal/jamal.pdf
  https://lwn.net/Articles/833840/

screend:
  https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=a822dc4058040a5a3866af897d9af9f618ba83ed
  http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/tech_reports/NSL-TN-2.pdf

AMD LANCE NIC interface:
  https://www.ardent-tool.com/datasheets/AMD_Am7990.pdf
  https://en.wikipedia.org/wiki/AMD_LANCE_Am7990