6.828 2011 Lecture 13: Scheduling, Livelock

What is scheduling?
  Allocation of resources to entities
  Typically to achieve some goal
    Fairness, high throughput, low latency, fulfilling a contract
  Popular in the days of time sharing
    Scarce, shared resources
  Seemed irrelevant in the era of PCs and workstations
  But still important:
    1. Internet services -- must deal with sudden overload
    2. Careful design of high-performance systems (today's paper)
  Key problems:
    Gap between desired policy and available mechanism
    Interaction between different schedulers

Resources you might want to schedule:
  CPU time
  Physical memory
  Disk and network I/O
  I/O bus bandwidth

Entities that you might want to give resources to:
  Users
  Processes
  Threads
  Web requests
  MIT accounts
  Connections and packets

Scheduling policies:
  Strict priority
  Divide equally
  Shortest job first
  Minimum guarantee + admission control

General plan for scheduling mechanisms:
  1. Understand where scheduling is occurring
  2. Expose scheduling decisions, allow control
  3. Account for resource consumption, to allow intelligent control

Simple example:
  Policy: fair CPU time among processes
  Mechanism: round robin every 10 ms?
  Works if processes are compute-bound
  What if a process gives up some of its 10 ms to wait for input?
    Do we have to keep track of that and give it back?
  In practice simple round-robin doesn't produce equal shares
  Better: track "recent" CPU use, e.g. over the last second,
    and always run the process with the least recent CPU use
    (sketched in C after the pitfalls below)
  Still, if you sleep long enough you lose

Pitfall: Priority Inversion
  [timeline]
  Policy: strict priority
  Thread T1: low priority
  Thread T2: medium priority
  Thread T3: high priority
  T1: acquire(l)
  context switch to T3
  T3: acquire(l)... must wait for T1 to release(l)...
  context switch to T2
  T2 computes for a while
  T3 is indefinitely delayed despite its high priority
  Can solve by having T3 lend its priority to the holder of the lock
    it is waiting for, so T1 runs, not T2 (also sketched below)
  This is really a multiple-scheduler problem
    Since locks schedule access to the locked resource

Pitfall: Efficiency
  Time quantum in CPU scheduling (vs low delay)
  Shortest-seek-first vs FIFO disk scheduling
  Contiguous read-ahead vs data needed now
  Swap out my idle emacs to let gcc run faster w/ more phys mem
    What happens when I type a key?
  Efficiency often conflicts with fairness (or any other policy)

Pitfall: Interacting Schedulers
  Policy: emacs has absolute priority
  CPU: easy
  What about the disk driver's request queue?
  What about the disk cache replacement policy?
  What about the physical memory allocator / replacement policy?

Pitfall: Server Processes
  Suppose emacs uses X Windows to display
  The X server must serve requests from many clients
  Does it know that emacs' requests should be given priority?
  Does the O/S know to raise X's priority when it is serving emacs?

Scheduling is a whole-system problem
  Policies are typically global: good responsiveness for emacs
  Mechanisms are usually local: CPU scheduler, memory allocator, &c
    And they often do not provide scheduling knobs
  It's hard to get all the mechanisms aligned
  System structure is often the hardest part
    E.g. client/server
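To make two of the ideas above concrete, here are two hedged C sketches.
First, the "least recent CPU use" mechanism from the simple example. The
names (ptable, recent_ticks, etc.) are illustrative, not any real
kernel's code, though this is roughly how classic BSD schedulers behaved.

  #include <stddef.h>

  #define NPROC 64

  struct proc {
      int runnable;       /* can this process run right now? */
      int recent_ticks;   /* decayed count of recent CPU ticks */
  };

  struct proc ptable[NPROC];

  /* timer interrupt: charge the running process for one tick */
  void tick(struct proc *running)
  {
      running->recent_ticks++;
  }

  /* once per second: halve everyone's usage, so a process that
     has been waiting for input gradually regains its claim */
  void decay(void)
  {
      for (int i = 0; i < NPROC; i++)
          ptable[i].recent_ticks /= 2;
  }

  /* always pick the runnable process with the least recent CPU use */
  struct proc *pick(void)
  {
      struct proc *best = NULL;
      for (int i = 0; i < NPROC; i++)
          if (ptable[i].runnable &&
              (best == NULL || ptable[i].recent_ticks < best->recent_ticks))
              best = &ptable[i];
      return best;
  }

Note the residual unfairness mentioned above: recent_ticks bottoms out
at zero, so sleeping for ten seconds earns no more credit than sleeping
for one.

Second, priority lending for the inversion pitfall. Again a sketch under
assumptions: a uniprocessor running these functions with interrupts off
(as in the old UNIX described below), and assumed helpers self() and
yield(); a real kernel would also keep a queue of waiters per lock.

  #include <stddef.h>

  struct thread {
      int base_pri;   /* assigned priority */
      int eff_pri;    /* effective priority, possibly lent by a waiter */
  };

  struct lock {
      struct thread *holder;   /* NULL if free */
  };

  extern struct thread *self(void);   /* the current thread (assumed) */
  extern void yield(void);            /* run the highest-eff_pri thread */

  void acquire(struct lock *l)
  {
      while (l->holder != NULL) {
          /* lend our priority: now T1 outranks T2 and can finish */
          if (l->holder->eff_pri < self()->eff_pri)
              l->holder->eff_pri = self()->eff_pri;
          yield();
      }
      l->holder = self();
  }

  void release(struct lock *l)
  {
      l->holder = NULL;
      self()->eff_pri = self()->base_pri;   /* drop any lent priority */
  }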
What did UNIX scheduling look like at the time of the paper?
  Uni-processor
  No spin-locks in the kernel

UNIX execution environments
  [diagram]
  Process, user half
  Process, kernel half
  Soft interrupts: timer, network
  Device interrupts ("hard")
  Rules:
    A hard intr can occur at any time (except during another hard intr)
    A soft intr runs on return from a hard intr
    An interrupt must complete before other code continues
      Since it runs on the current stack
    Kernel halves are not pre-emptible; they only switch via sleep()
    The kernel half has absolute priority over the user half
    User halves are pre-emptible via the timer interrupt
  Effective priorities: hard intr > soft intr > kernel half > user
  Why no pre-emptible kernel halves?
    Few locks needed (uniprocessor...)
    Just turn off interrupts in critical sections
  Why soft interrupts?
    For time-consuming processing in response to hard intrs
    E.g. network input processing
    The hard intr does the time-critical h/w processing
      e.g. pulls the pkt off the NIC to avoid overrunning its buffer
    The soft intr can itself be interrupted

How does (did) packet forwarding work?
  We need this to understand the paper
  [diagram: NIC, rx intr, in Q, soft intr, sock Q, user, out Q, tx intr, NIC]
  NIC: buffers the packet in on-board memory
  Receive interrupt:
    allocate a pkt buffer
    copy the packet from the NIC to memory
    append the packet to the IP input queue
    schedule the IP soft interrupt
  IP soft interrupt, for each packet on the IP input queue:
    routing lookup
    if to be forwarded:
      append the pkt to the output queue
      start the output hardware if idle, i.e. copy the packet to the NIC
    if local:
      append the pkt to a socket queue
      wake up the process
  Transmit interrupt:
    dequeue a pkt from the output queue
    copy it to NIC memory
    poke the NIC
    free the pkt buf
  Why the input queue? Why the output queue?

What has changed since the 1990s?
  Multi-processor support changed many parts of the design
  Threads instead of soft interrupts, for e.g. network processing
  Locks instead of disabling interrupts (OK, locks disable interrupts too)
  Kernel halves/threads are pre-emptible

Today's paper: Eliminating Receive Livelock in an Interrupt-Driven
Kernel, by Mogul and Ramakrishnan

Why are we reading this paper?
  Most scheduling issues are boring, solvable with a faster CPU
  This one is much more fundamental
  The underlying problem comes up a lot, in many guises

Explain Figure 6-1
  Why does the curve go up?
  What determines how high the peak is?
  Why does it go down?
  What determines how fast it goes down?
  Peak at 5000 pkts/sec => about 200 us of CPU time per forwarded pkt
  Zero at 13000 pkts/sec => input processing alone takes about 77 us/pkt
  Each additional 4 inputs/sec eliminates one output/sec

Suppose I wanted to test a web server for livelock
  Run a client with this loop:
    while(1){ send a GET; wait for the response; }
  What would I see?
    (no livelock: a closed loop like this never offers more load
     than the server is already keeping up with)

Why not completely process each packet in the interrupt handler?
  I.e. forward it there?
  (still need the out Q; would starve tx intrs; would starve
   other devices' rx intrs)

Why not always poll, never use interrupts?

Big picture: we want to give output priority over input,
  but interrupts allow us no control

What's the paper's solution?
  No IP input queue
  A thread *polls* the NICs for input and output
  The NIC receive interrupt just wakes up the thread,
    then leaves interrupts *disabled* for that NIC
  The thread does all input processing, then re-enables interrupts

  thread(){
    while(1){
      for each NIC:
        while(input pkt available)
          process pkt                  // all the way to out Q or sock Q
        if(output done)
          move pkt from out Q to NIC
      if(nothing to do){
        enable interrupts
        sleep
      }
    }
  }

  intr(){
    disable interrupts                 // for this NIC
    wake up thread
  }

  Why does this work?
  What happens when packets arrive too fast?
  What happens when packets arrive slowly?
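Figure 6-3 (next) compares this loop with and without a quota on input
work, so here is a hedged C sketch of the quota variant. The helper
functions and the quota value are assumptions for illustration, not the
paper's code; the point is only that bounding input work per round keeps
one busy NIC's input from starving transmit-complete processing and the
other NICs.

  #define RX_QUOTA 10   /* max input pkts per NIC per round: a tuning knob */

  /* assumed helpers, not a real API: */
  extern int  nnic;                       /* number of NICs */
  extern int  rx_pkt_available(int nic);
  extern void process_rx_pkt(int nic);    /* protocol work, to out Q or sock Q */
  extern int  tx_done(int nic);
  extern int  out_queue_nonempty(int nic);
  extern void start_tx_pkt(int nic);      /* move pkt from out Q to NIC */
  extern void enable_nic_interrupts(void);
  extern void sleep_until_interrupt(void);

  void polling_thread(void)
  {
      for (;;) {
          int did_work = 0;
          for (int i = 0; i < nnic; i++) {
              /* input, bounded by the quota */
              for (int n = 0; n < RX_QUOTA && rx_pkt_available(i); n++) {
                  process_rx_pkt(i);
                  did_work = 1;
              }
              /* output: refill the NIC from the output queue */
              while (tx_done(i) && out_queue_nonempty(i)) {
                  start_tx_pkt(i);
                  did_work = 1;
              }
          }
          if (!did_work) {
              enable_nic_interrupts();
              sleep_until_interrupt();    /* rx intr wakes us, stays disabled */
          }
      }
  }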
Explain Figure 6-3
  Why does "Polling (no quota)" work badly?
    Input processing still starves xmit-complete processing
  Why does it immediately fall to zero, rather than gradually decreasing?
    Livelock is made worse by doing even more processing before discarding
    I.e. each excess rx pkt now consumes many tx pkts' worth of CPU time

Explain Figure 6-4
  Why does "Polling, no feedback" behave badly?
    There's a queue in front of screend
    We can still give 100% of the CPU to the input thread, 0% to screend
  Why does "Polling w/ feedback" behave well?
    The input thread yields when the queue to screend is near-full
    And wakes up when the queue is near-empty
    (sketched at the end of these notes)
  What would happen if screend hung?

Big picture: the polling loop is a place to exert scheduling control

Why are the two solutions different?
  1. Polling thread *with quotas*
  2. Feedback from a full queue
  Perhaps they could have used #2 for both
    Feedback doesn't require magic numbers
    But it's hard to retro-fit into the existing UNIX structure

What if processing has a more complex structure?
  A chain of processing stages with queues? Does feedback work?
    What happens when a late stage is slow?
  A split at some point, with multiple parallel paths?
    Not so great; one slow path blocks all paths

Can we formulate any general principles?
  Don't spend time on new work before completing existing work
  Or: give new work lower priority than partially-completed work
  Or: discard as early as possible
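As promised above, a hedged C sketch of the Figure 6-4 feedback scheme.
The queue, the watermark values, and wakeup_polling_thread() are
illustrative assumptions, not the paper's code: the polling thread stops
taking input when the queue to screend is nearly full, and resumes once
screend has drained it.

  #define QMAX       64            /* capacity of the queue to screend */
  #define HIGH_WATER (QMAX - 4)    /* stop taking input here */
  #define LOW_WATER  (QMAX / 4)    /* resume taking input here */

  int screend_qlen;                /* packets queued for screend */
  int input_enabled = 1;           /* checked by the polling thread */

  extern void wakeup_polling_thread(void);   /* assumed helper */

  /* polling thread: called for each packet destined for screend */
  void enqueue_for_screend(void)
  {
      screend_qlen++;
      if (screend_qlen >= HIGH_WATER)
          input_enabled = 0;       /* feedback: stop polling for input */
  }

  /* kernel, on behalf of screend: called as it consumes packets */
  void dequeue_by_screend(void)
  {
      screend_qlen--;
      if (!input_enabled && screend_qlen <= LOW_WATER) {
          input_enabled = 1;       /* queue has drained: resume input */
          wakeup_polling_thread();
      }
  }

If screend hangs, the queue never drains, input stays disabled, and
excess packets are dropped cheaply in the NIC -- which is exactly the
"discard as early as possible" principle above.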