6.S081 2020 Lecture 21: Networking

topics
  packet formats and protocols
  software stack in kernel
  today's paper -- livelock

overall network architecture
  diagram: apps, host, NIC, LAN, NIC, host, apps
  diagram: hosts, LAN, router, ..., LAN, hosts

ethernet packet format
  [printout of kernel/net.h, struct eth]
  start "flag"
  destination ethernet address -- 48 bits
  source ethernet address -- 48 bits
  ether type -- 16 bits
  payload
  end "flag"
  [eth tcpdump output]

ethernet addresses
  a simple ethernet LAN broadcasts: all host NICs see all packets
    a host uses the dest addr to decide if a packet is really for it
  today's ethernet LAN is a switch, with a cable to each host
    the switch uses the dest addr to decide which host to send a packet to
  ethernet addresses are assigned by the NIC manufacturer
    <24-bit manufacturer ID, 24-bit serial number>

ARP
  [kernel/net.h, struct arp]
  a request/response protocol to translate an IP address to an ethernet address
  "nested" inside an ethernet packet, with ether type 0x0806
  ARP header indicates request vs response
  request: the desired IP address
    request packets are broadcast to every host on a switch
    all the hosts process the packet, but only the owner of that IP addr responds
  response: the corresponding ethernet address
  [arp tcpdump output]

note: the habit is to "nest" higher-level packets inside lower-level packets
  e.g. ARP inside ethernet
  so you often see a sequence of headers (ether, then ARP)
  one layer's payload starts with the next layer's header
  the ethernet header is enough to get a packet to a local host
  but more is needed to route the packet to a distant Internet host

IP header
  [kernel/net.h, struct ip]
  ether type 0x0800
  lots of stuff, but the addresses are the most critical
  a 32-bit IP address is enough to route to any Internet computer
    the high bits contain a "network number" that helps routers
    understand how to forward through the Internet
  the protocol number tells the destination what to do with the packet
    i.e. which higher-level protocol to hand it to (usually UDP or TCP)
  [ip tcpdump output]

UDP header
  [kernel/net.h, struct udp]
  once a packet is at the right host, which app should it go to?
  the UDP header sits inside the IP packet
  contains src and dst port numbers
  an application uses the "socket API" system calls to tell the kernel
    it would like to receive all packets sent to a particular port
  some ports are "well known", e.g. port 53 is reserved for DNS servers
  others are allocated as needed for the client ends of connections
  after the UDP header: the payload, e.g. a DNS request or response
  [udp tcpdump output]

TCP
  like UDP but has sequence number fields, so it can retransmit lost
  packets and match the send rate to network and destination capacity

layering view of a typical kernel network stack
  apps, e.g. web browser, DNS server
  socket layer, port->fd table
  UDP | TCP
  IP, routing table | ARP, table
  NIC drivers -- packet buffers w/ allocator (see struct mbuf in net.h)
-- each layer parses, validates, and strips headers on the way in
   (see the sketch below); it discards the packet if there's a problem
-- each layer prepends a header on the way out
-- the software layer structure partially follows the header nesting
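To make "parse, validate, and strip" concrete, here is a minimal C sketch of
the inbound path. The struct eth layout follows the lecture's description of
kernel/net.h; the function names, dispatch structure, and byte-swap helper
are illustrative assumptions, not the lab's exact code.

  #include <stdint.h>

  #define ETHTYPE_IP  0x0800   // IP nested inside ethernet
  #define ETHTYPE_ARP 0x0806   // ARP nested inside ethernet

  struct eth {
    uint8_t  dhost[6];    // destination ethernet address -- 48 bits
    uint8_t  shost[6];    // source ethernet address -- 48 bits
    uint16_t type;        // ether type -- 16 bits, big-endian on the wire
  } __attribute__((packed));

  // wire (big-endian) to host order, assuming a little-endian host
  static uint16_t my_ntohs(uint16_t x) { return (x >> 8) | (x << 8); }

  // the next layers up; bodies omitted in this sketch
  static void ip_rx(char *buf, int len)  { (void)buf; (void)len; }
  static void arp_rx(char *buf, int len) { (void)buf; (void)len; }

  void net_rx(char *buf, int len)
  {
    if (len < (int) sizeof(struct eth))
      return;                           // validate; discard if a problem
    struct eth *e = (struct eth *) buf;
    uint16_t type = my_ntohs(e->type);  // demux field for the next layer
    buf += sizeof(struct eth);          // strip the ethernet header;
    len -= sizeof(struct eth);          // payload starts with the next header
    if (type == ETHTYPE_IP)
      ip_rx(buf, len);                  // hand the payload to IP
    else if (type == ETHTYPE_ARP)
      arp_rx(buf, len);                 // or to ARP
    // else: unknown ether type, drop the packet
  }

The outbound path is the mirror image: each layer prepends its header before
handing the buffer down to the next layer.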
control flow view of a typical kernel network stack
  multiple independent actors
  each has input packets, processes them, produces output
  here's a typical setup (much variation; the lab setup is simpler)
    NICs, with internal or DMA buffering
    rx interrupt handler, copies from NIC to s/w input queue
    tx interrupt handler, copies from s/w output queue to NIC
    "network thread" reads packets from s/w input queue
      what to do with each packet? ARP, forward, or put on a socket queue
    applications -- read from per-socket queues

why all these queues of buffers?
  absorb temporary input bursts
  keep the output NIC busy while computing
  allow independent control flow (NICs vs network thread vs apps)

other arrangements are possible and sometimes much better
  e.g. a user-level stack
  e.g. direct user access to the NIC (see Intel's DPDK)
  e.g. polling, as in today's paper

NIC packet buffering
  the paper's NIC queues packets in its own memory
    driver s/w must copy them to RAM
  the lab assignment's NIC (the Intel e1000) DMAs into host RAM
    s/w prepares a "ring" of buffer pointers, one ring for rx and one for tx
    the NIC DMAs each packet to the memory pointed to by successive ring elements
  why DMA?
    DMA is faster than a s/w copy loop
    DMA can go on concurrently with compute tasks

Today's paper: Eliminating Receive Livelock in an Interrupt-Driven Kernel,
by Mogul and Ramakrishnan

Why are we reading this paper?
  To illustrate some tradeoffs in kernel network stack structure
  It's a famous and influential paper
  Livelock / congestion collapse comes up in many situations

Explain Figure 6-1
  This is the original system, without the authors' fixes.
  Why does the curve go up?
  What determines how high the peak is?
    peak @ 5000 pkts/sec => 200 us/pkt
  Why does it go down?
  What determines how fast it goes down?
  What happens to the packets that aren't forwarded?

A disk uses interrupts -- would a disk cause this kind of problem?
How about the UART?
How about a host receiving TCP traffic?

Why not completely process each packet in the interrupt handler?
  I.e. forward it? (this is what the network lab does)
  (you still need an output queue; it starves tx interrupts; it starves
   other devices' rx interrupts; and there's no story for user-level processing)

Why not always poll, and never use interrupts?

Overall goal: once we've started to spend CPU on a packet, make sure we
finish that packet!
  i.e. avoid partially processing a packet, then discarding it due to overload
  for the special case of forwarding: give output priority over input
  interrupts allow us no control

What's the paper's solution?
  No IP input queue
  The NIC receive interrupt just wakes up the network thread
    and then leaves interrupts *disabled* for that NIC
  The thread does all the processing, re-checks the NIC for more input,
    and only re-enables interrupts if no input is waiting

  NIC intr:
    wake up net thread (but don't read any packets)
    disable NIC interrupts
  net thread loop:
    if NIC packets waiting
      read a few packets from NIC
      process each packet (this is the polling part)
    else
      enable interrupts
      sleep

What happens when packets arrive too fast?
Why does this help avoid livelock?
What happens when packets arrive slowly?

Modern Linux uses a scheme -- NAPI -- inspired by this paper.
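In C, the pseudocode above might look like the following sketch. All the
driver hooks (nic_intr_disable, nic_rx_ready, and so on) and the QUOTA
constant are invented names left as externs, not the paper's or xv6's actual
code; locking is omitted.

  #define QUOTA 16                   // max packets per poll pass (made-up value)

  struct pkt;                        // opaque packet buffer

  extern void nic_intr_disable(void);
  extern void nic_intr_enable(void);
  extern int  nic_rx_ready(void);    // any packets waiting in the NIC?
  extern struct pkt *nic_rx_fetch(void);
  extern void process(struct pkt *); // ARP, forward, or put on a socket queue
  extern void wakeup(void *chan);
  extern void sleep_on(void *chan);

  static char chan;                  // sleep/wakeup channel for the net thread

  // rx interrupt: read no packets, just hand off to the thread
  void nic_rx_intr(void)
  {
    nic_intr_disable();              // stay in polling mode while input waits
    wakeup(&chan);
  }

  void net_thread(void)
  {
    for (;;) {
      if (nic_rx_ready()) {
        // polling: completely process each packet before taking more,
        // and take at most QUOTA so tx-complete work also gets CPU time
        for (int i = 0; i < QUOTA && nic_rx_ready(); i++)
          process(nic_rx_fetch());
      } else {
        nic_intr_enable();           // idle: fall back to interrupts
        if (!nic_rx_ready())         // re-check to close the race with arrival
          sleep_on(&chan);
      }
    }
  }

The quota is what separates this from "Polling (no quota)" in Figure 6-3: it
bounds how long input can monopolize the thread, so transmit-complete
processing still runs under continuous input.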
Explain Figure 6-3
  This graph includes their system.
  Why do the empty squares level off?
  What happens to the excess packets?
  Why does "Polling (no quota)" work badly?
    Input still starves xmit-complete processing
  Why does it immediately fall to zero, rather than gradually decreasing?
    Livelock is made worse by doing even more processing before discarding
    I.e. each excess rx pkt consumes many tx pkts' worth of CPU time

Explain Figure 6-4
  (this is with every packet going through a user-level program)
  Why does "Polling, no feedback" behave badly?
    There's a queue in front of screend
    We can still give 100% of the CPU to the input thread, 0% to screend
  Why does "Polling w/ feedback" behave well?
    The input thread sleeps when the queue to screend is near-full,
    and wakes up when the queue is near-empty
  What would happen if screend hung?

Big picture: the polling loop is a place to exert scheduling control

Why are the two solutions different?
  1. Polling thread *with quotas*
  2. Feedback from a full queue (see the sketch at the end of these notes)
  Perhaps they could have used #2 for both
    Feedback doesn't require magic numbers
    But it is hard to retro-fit into the existing UNIX structure

What if processing has a more complex structure?
  A chain of processing stages with queues? Does feedback work?
    What happens when a late stage is slow?
  Split at some point, with multiple parallel paths?
    Not so great; one slow path blocks all paths

Can we formulate any general principles?
  Don't spend time on new work before completing existing work
  Design so that efficiency increases with load, rather than decreasing
    E.g. the paper's switch from interrupts to polling under high load

Similar phenomena arise in other areas of systems
  Timeout + retransmission in networks, as the number of connections grows
  Spin-locks, as the number of cores grows

A general lesson: complex (multi-stage) systems may need careful scheduling
of resources if they are to survive loads close to capacity
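As a final illustration, here is what "feedback from a full queue" might look
like: a bounded queue with high and low watermarks, as in the screend
experiment above. The names and constants are invented, and locking plus the
actual sleep/wakeup calls are omitted.

  #define QSIZE      64
  #define HIGH_WATER 56   // input thread stops taking new packets here
  #define LOW_WATER   8   // ...and resumes once the consumer drains to here

  struct pkt;             // opaque packet buffer

  struct sockq {
    struct pkt *slots[QSIZE];
    int head, tail, count;
    int input_enabled;    // the feedback bit read by the polling loop
  };

  // producer side, called by the input (polling) thread; the thread must
  // not call this while input_enabled == 0, so the queue cannot overflow
  void q_put(struct sockq *q, struct pkt *p)
  {
    q->slots[q->head] = p;
    q->head = (q->head + 1) % QSIZE;
    if (++q->count >= HIGH_WATER)
      q->input_enabled = 0;          // near-full: input thread goes to sleep
  }

  // consumer side, called on behalf of screend; assumes count > 0
  struct pkt *q_get(struct sockq *q)
  {
    struct pkt *p = q->slots[q->tail];
    q->tail = (q->tail + 1) % QSIZE;
    if (--q->count <= LOW_WATER)
      q->input_enabled = 1;          // near-empty: wake the input thread
    return p;
  }

The gap between the two watermarks provides hysteresis: the input thread
isn't woken and re-slept on every packet, and no CPU is spent receiving
packets that would only be discarded at a full queue.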