Scheduling
Required reading: Eliminating receive livelock
Overview
- What is scheduling?
- OS policies and mechanisms to allocate resources to entities.
- Usually in the context of resources where re-assignment
can happen with some frequency: CPU cycles, network
bandwidth, disk accesses, memory when swapping to disk.
- Not so applicable to more static resources like disk space.
- Entities that you might want to give resources to: users,
processes, threads, web requests, or MIT accounts.
- Scheduling unavoidable with shared resources (whether implicit
or explicit), but particularly interesting when demand exceeds
available resources (e.g. demand for CPU is infinite with
compute-bound tasks).
- A good scheduling policy ensures that the most important
entity gets the resources it needs.
- Many policies for allocating resources to entities are possible:
strict priority, divide equally, shortest job first, minimum
guarantee combined with admission control.
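A tiny worked example of why the policy choice matters: a Python sketch (job lengths are invented) showing that shortest-job-first gives lower mean waiting time than a FIFO order:

```python
# Toy comparison of FIFO vs shortest-job-first (SJF) on one CPU.
# Job lengths (ms) are invented for illustration; all jobs arrive at time 0.

def mean_wait(jobs):
    """Mean time each job waits before starting, if run in list order."""
    wait, t = 0, 0
    for length in jobs:
        wait += t          # this job waited until time t to start
        t += length
    return wait / len(jobs)

jobs = [100, 10, 1]                    # FIFO order: long job first
assert mean_wait(jobs) == 70.0         # (0 + 100 + 110) / 3
assert mean_wait(sorted(jobs)) == 4.0  # SJF: (0 + 1 + 11) / 3
```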
- This topic was popular in the days of time sharing, when
there was a shortage of resources all around.
- Many scheduling problems become not very interesting when
you can just buy a faster CPU or a faster network.
- Exception: web sites and large-scale networks often cannot
be made fast enough to handle peak demand (flash crowds,
attacks) so scheduling becomes important again.
For example, one may want to prioritize paying customers, or
to counter denial-of-service attacks.
- Exception 2: some scheduling decisions have non-linear
effects on overall system behavior, not just different
performance for different users. For example today's paper.
- Key problems
- Gap between desired policy and available mechanism.
High-level desired policies often don't have a one-to-one
mapping onto available scheduling mechanisms. Scheduler
can approximate policy, or implement it under certain
assumptions, etc.
- Conflicting goals, e.g. low latency, high throughput,
and fairness. The scheduler must make a trade-off between
the goals.
- Interaction between different schedulers.
Just optimizing the CPU scheduler may do little for
the overall desired policy.
- General plan for scheduling mechanisms
- Understand where scheduling is occurring.
- Expose scheduling decisions, allow control.
- Account for resource consumption, to allow intelligent control.
- What policy does JOS enforce for scheduling environments?
- Reading between the lines, equal time for all environments:
JOS's mechanism is that the 10 msec timer time-slices in
round-robin among all runnable environments.
- But this only works if processes are compute-bound.
What if a process gives up some of its 10 ms to wait
for input? Do we have to keep track of that and give
it back?
- For a long time Linux had a similar problem, where a
process can use CPU time "undetected" by yielding just
before the 10ms timer interrupt.
- How long should the quantum be? Is 10 msec the right answer?
Shorter quantum will lead to better interactive performance,
but lowers overall system throughput because we will reschedule
more, which has overhead.
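Back-of-the-envelope arithmetic for the quantum trade-off (the 10 us switch cost below is an assumed number, not a measurement):

```python
# Fraction of CPU lost to context switching for various quantum lengths,
# assuming each switch costs 10 microseconds (an invented figure).

switch_us = 10.0
overheads = {}
for quantum_ms in (100, 10, 1):
    quantum_us = quantum_ms * 1000.0
    overheads[quantum_ms] = switch_us / (quantum_us + switch_us)

# 10 ms quantum -> ~0.1% overhead; 1 ms -> ~1%: shorter quanta buy
# responsiveness at a modest but growing throughput cost.
assert overheads[10] < 0.002
assert overheads[1] < 0.011
```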
- What if the environment computes for 1 msec and sends an IPC to
a server environment? Shouldn't the server get more CPU time
because it operates on behalf of all the other environments?
- Potential improvements for JOS: track "recent" CPU use
(e.g., over the last second) and always run environment to
which we "owe" the most CPU time. (But can't accumulate CPU
debt forever...)
- Other solution: directed yield; specify on the yield to which
environment you are donating the remainder of the quantum
(e.g., to an IPC server so that it can compute on the
environment's behalf).
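A minimal sketch of the first improvement, assuming a decayed per-environment usage counter (all names and the decay constant are invented):

```python
# "CPU debt" scheduling sketch: run the runnable environment with the
# least recent CPU use, and decay history so debt can't accumulate forever.

DECAY = 0.5   # forget half the history each tick (invented constant)

class Env:
    def __init__(self, name):
        self.name = name
        self.recent = 0.0     # recent CPU use, in (decayed) ticks
        self.runnable = True

def pick(envs):
    # The environment we "owe" the most is the one with least recent use.
    return min((e for e in envs if e.runnable), key=lambda e: e.recent)

def tick(envs):
    pick(envs).recent += 1.0
    for e in envs:
        e.recent *= DECAY     # bounded history: old use fades away

a, b = Env("a"), Env("b")
b.runnable = False            # b sleeps, e.g. waiting for input
envs = [a, b]
for _ in range(10):
    tick(envs)
b.runnable = True             # on wakeup, b has the least recent use,
assert pick(envs) is b        # so it runs before the compute-bound a
```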
- Pitfall: Priority Inversion
- Assume policy is strict priority.
Thread T1: low priority
(e.g. check to see if I have new mail periodically).
Thread T2: medium priority.
Thread T3: high priority (e.g. real-time game).
- T1: acquire(l)
context switch to T3 (maybe T3 wakes up and needs to run:
it has higher priority)
T3: acquire(l)... must wait for T1 to release(l)...
context switch to T2 (again, perhaps T2 needs to run &
has prio over T1)
T2 computes for a while
T3 is indefinitely delayed despite high priority
- Can solve if T3 lends its priority to holder of lock
it is waiting for, so T1 runs instead of T2.
- This is really a multiple scheduler problem,
since locks schedule access to locked resource.
- (Diagram.)
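The scenario above can be made concrete with a toy model (not kernel code; the thread structure and numbers are invented) of strict priority with and without priority inheritance:

```python
# Priority inversion demo: higher number = higher priority.
# T1 (low) holds lock l; T3 (high) blocks on l; T2 (medium) is runnable.

def next_to_run(threads):
    # Strict-priority scheduler using each thread's *effective* priority.
    return max((t for t in threads if t["runnable"]), key=lambda t: t["eff"])

t1 = {"name": "T1", "eff": 1, "runnable": True}   # holds the lock
t2 = {"name": "T2", "eff": 2, "runnable": True}
t3 = {"name": "T3", "eff": 3, "runnable": False}  # blocked on the lock

# Without inheritance T2 runs, T1 can't release, T3 starves indefinitely.
assert next_to_run([t1, t2, t3])["name"] == "T2"

# Priority inheritance: T3 lends its priority to the lock holder T1,
# so T1 runs ahead of T2, releases the lock, and T3 can proceed.
t1["eff"] = max(t1["eff"], t3["eff"])
assert next_to_run([t1, t2, t3])["name"] == "T1"
```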
- Pitfall: Efficiency and conflicting requirements.
- Efficiency often conflicts with fairness, or any other policy.
- Long time quantum for efficiency in CPU scheduling
versus low delay and context-switching overhead.
- Shortest seek versus FIFO disk scheduling.
- Contiguous read-ahead vs data needed now.
- For example, scheduler swaps out an idle emacs to let
gcc run faster with more phys mem.
What happens when user types a key?
- These issues don't fit well into a "who gets to go next"
scheduler framework.
- Inefficient scheduling may make everybody slower,
including high priority users.
- Pitfall: Multiple Interacting Schedulers.
- Suppose you want your emacs to have priority over
everything else. Give it high CPU priority.
Does that mean nothing else will run if emacs wants to run?
- No: Disk scheduler might not know to favor emacs's disk I/Os.
Typical UNIX disk scheduler favors disk efficiency, not process
prio.
- Suppose emacs needs more memory. Other processes have dirty
pages; emacs must wait. Disk scheduler doesn't know these other
processes' writes are high prio.
- Pitfall: Server Processes.
- Suppose emacs uses X windows to display.
The X server must serve requests from many clients.
- Does it know that emacs' requests should be given priority?
- Does the OS know to raise X's priority when it is serving emacs?
- Similarly for DNS, and NFS. Does the network know to give
emacs's NFS requests priority?
- In short, scheduling is a system problem.
There are many schedulers; they interact.
The CPU scheduler is usually the easy part.
The hardest part is system structure.
For example, the existence of interrupts is bad for scheduling.
Conflicting goals may limit effectiveness.
Case study: UNIX in 1990
- What did UNIX scheduling look like at time of Livelock paper?
- Uni-processor
- No spin-locks in kernel
- Different contexts:
- Process, user half
- Process, kernel half
- Soft interrupts
- Device ("hard") interrupts
- (Diagram.)
- User half, kernel half, and device interrupts are analogous to xv6.
Soft interrupts delay the work from device interrupts until later,
and are pre-emptable, unlike device interrupts themselves.
Typically soft intr runs when last hard intr finishes.
- Rules for context transitions:
- Hardware can force a hard interrupt at most times, except
during a hard intr (same as xv6)
- Soft interrupt can be invoked any time except during a soft
or hard intr
- Interrupt must complete before interrupted code can continue,
since interrupts share stack (same as xv6)
- Timer interrupt does *not* switch kernel halves, only voluntary
switch via sleep()
- Never runs a user half if any kernel half wants to run
- User halves are pre-emptable via timer interrupt (same as xv6)
- Thus, priority structure:
hard intr > soft intr > kernel half > user half
- Why no pre-emptable kernel halves?
- To avoid most locks (remember -- uniprocessor)
- Only have to worry about concurrency from interrupts
- Kernel halves disable interrupts (up to certain
priority level) in critical sections
- Why soft interrupts?
- For time-consuming processing in response to hard interrupts,
which should run relatively soon, and might not be associated
with any specific kernel half (e.g. TCP ACKs, ARP responses,
etc.)
- Hard interrupt does time-critical stuff, schedules soft interrupt,
more hard interrupts can occur during soft int.
- Short-running hard interrupts avoid input overruns.
- Paper's application is packet forwarding -- let's look at how that works.
(Diagram.)
- Receive interrupt (hard intr)
- Copy incoming packet to memory
- Append packet pointer to IP input queue (ipintrq)
- Schedule IP soft interrupt
- IP soft interrupt
- For each packet in IP input queue,
- Decide what to do with it
- Forward out a different network interface?
- Append packet pointer to output interface's if_snd queue
- Feed packet to output interface hardware, if it was idle
- Deliver to a user program?
- Append packet pointer to socket Q
- Wake up process
- Transmit interrupt (hard intr)
- Free last packet buffer
- Copy packet from if_snd queue to h/w
- Tell h/w to transmit
- Network interface hardware (e.g. ethernet)
- Executes concurrently with the main CPU
- Has small receive buffer on interface h/w (a few packets)
- Raises receive interrupt for each received packet
- Has small output buffer (a few packets)
- Raises transmit-complete interrupts when buffers open up
- Why the input queue (ipintrq)?
- Why the output queue (if_snd)?
- Is this a good arrangement?
Paper discussion
- Explain Figure 6-1. Why does it go up? What determines how high
the peak is? Why does it go down? What determines how fast it goes
down? (The peak at 5000 pkts/sec means total processing takes 200 us
per packet. The x-intercept at 13000 means input processing alone
takes 77 us per packet. So, between the peak and the x-intercept,
every eight additional inputs/second eliminate five outputs/second.)
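The shape of Figure 6-1 follows from a two-cost model; here is a sketch using the 77 us / 200 us per-packet estimates above (a model, not the paper's data):

```python
# Receive interrupts have absolute priority, so *every* arriving packet
# pays input processing even if it is later dropped; only leftover CPU
# time forwards packets. Per-packet costs estimated from Figure 6-1.

T_IN = 77e-6     # interrupt + IP input time per packet (seconds)
T_OUT = 123e-6   # remaining forwarding/transmit time (200us total - 77us)

def forwarded(input_rate):
    left = 1.0 - input_rate * T_IN   # CPU sec/sec left after input work
    if left <= 0:
        return 0.0                   # livelock: all time goes to input
    return min(input_rate, left / T_OUT)

peak = max(range(0, 20000, 100), key=forwarded)
assert peak == 5000                  # 1 / (77us + 123us) = 5000 pkts/sec
assert forwarded(15000) == 0.0       # past 1/77us ~ 13000: zero output
```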
- Would livelock show up for disk interrupts?
- Why not tell the O/S scheduler to give interrupts lower priority?
Non-preemptible. Could you fix this by making interrupts faster? (No)
- Why not completely process each packet in the interrupt handler?
(I.e. forward it?) Still need an output queue. Recv intr routine
might never return, thus no tx complete interrupts, thus no
transmissions. Also maybe interface h/w overruns during short input bursts.
Wouldn't be fair if multiple input interfaces.
- Big picture: we've surrendered all control over scheduling
various actions to the CPU's interrupt mechanism. And it doesn't
give us the control we need to enforce our desired policy:
output has precedence over input.
- What about using polling instead of interrupts? Solves overload
problem, but under low load it either wastes CPU or has high latency.
- What's the paper's solution?
- No IP input queue.
- Device input polling and IP input processing in kernel thread.
- Device receive interrupt just wakes up thread. And leaves
interrupts *disabled* for that device.
- Thread does all input and output processing, then re-enables interrupts.
- [show pseudo-code; where can we make scheduling decisions?]
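One way to render that pseudo-code (helper names are hypothetical; comments mark the points where scheduling decisions are available):

```
rx_intr(dev):
    disable_rx_intr(dev)        # interrupts stay *disabled* for this device
    wakeup(dev.poll_thread)     # hard intr does nothing else

poll_thread(dev):
    while true:
        n = 0
        while n < QUOTA and dev_has_pkt(dev):   # quota: don't starve others
            pkt = dev_poll(dev)                 # decision: which device next
            ip_input(pkt)                       # full input *and* output
            n = n + 1                           #   processing; no ipintrq
        tx_completions(dev)                     # decision: output vs input
        if not dev_has_pkt(dev):
            enable_rx_intr(dev)                 # re-enable only when caught up
            sleep()
```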
- Why does this work? What happens when packets arrive too fast?
What happens when packets arrive slowly?
- Explain Figure 6-3.
- Why does "Polling (no quota)" work badly? (Input still starves
xmit complete processing.)
- Why does it quickly fall to zero, rather than gradually decreasing?
Previously transmit had priority over IP, but not anymore.
Estimate: x-intercept at 6500 means 150 usec for RX+IP.
Made problem worse by doing more work before discarding packet.
- Explain Figure 6-4.
- Why does "Polling, no feedback" behave badly? There's a queue in
front of screend. We can still give 100% to input thread, 0% to
screend.
- Why does "Polling w/ feedback" behave well? Input thread yields
when queue to screend fills.
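A toy model of the feedback effect (queue size, rates, and the one-tick drain are invented; the real screend shares the CPU with the input thread):

```python
# Under overload, an input thread that never yields fills the queue to
# screend and then keeps burning CPU on packets that will be dropped.
# With feedback, it yields when the queue is full, so screend gets to run.

QMAX = 10    # capacity of the queue to screend (invented)

def run(ticks, input_rate, feedback):
    q, delivered = 0, 0
    for _ in range(ticks):
        if feedback and q >= QMAX:
            delivered += q     # input thread yields; screend drains queue
            q = 0              # (arrivals this tick dropped at the device)
        else:
            q = min(QMAX, q + input_rate)   # overflow dropped *after*
                                            # input processing: wasted work
    return delivered + q

# Without feedback screend never runs: only the final queue contents count.
assert run(100, input_rate=5, feedback=False) == QMAX
assert run(100, input_rate=5, feedback=True) > 300
```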
- What if screend hangs, what about other consumers of packets?
(e.g., can you ssh to machine to fix screend?) Fortunately screend
typically is only application. Also, re-enable input after timeout.
- Big picture: polling loop gives us more control -- but only knows
about device and IP packet processing, not about other activities
on host.
- Why are the two solutions different?
- Polling thread with quotas.
- Feedback from full queue.
- In theory, the two solutions appear to be mostly interchangeable.
Can use polling thread, with callback to schedule screend to
run on some quota of packets, or use feedback for ipintrq/if_snd.
- Feedback more general, no magic numbers.
- In practice, passing feedback from IF_ENQUEUE(if_snd) all the way
back to receive processing might require a lot of changes.
- If we apply the proposed fixes, does the phenomenon totally go
away? (e.g. for web server, waits for disk, &c.)
- Can the net device throw away packets without slowing down host?
- Problem: We want to drop packets only for applications with big queues.
But requires work to determine which application a packet belongs to.
Perhaps interface hardware should look at packets and decide...
- Solution worked well in paper for linear flow, clear feedback path.
- What if processing has more complex structure?
- Chain of processing stages with queues? Does feedback work?
What happens when a late stage is slow?
- Split at some point, multiple parallel paths? Not so great; one
slow path blocks all paths.
- Can we formulate any general principles from paper?
- Don't spend time on new work before completing existing work.
- Or give new work lower priority than partially-completed work.
- Corollary: If you might discard, do it as early as possible.
- How would you build packet forwarding in JOS?