Scheduling
Required reading: Eliminating receive livelock
Notes based on Prof. Morris's lecture on scheduling (6.824, fall '02).
Overview
- What is scheduling? The OS policies and mechanisms that allocate
resources to entities. A good scheduling policy ensures that the most
important entity gets the resources it needs. This topic was
popular in the days of time sharing, when there was a shortage of
resources. It seemed irrelevant in the era of PCs and workstations, when
resources were plentiful. Now the topic is back from the dead to handle
massive Internet servers with paying customers. The Internet exposes
web sites to international abuse and overload, which can lead to
resource shortages. Furthermore, some customers are more important
than others (e.g., the ones that buy a lot).
- Key problems:
- Gap between desired policy and available mechanism. The desired
policies often include elements that are not implementable with the
mechanisms available to the operating system. Furthermore, there are
often many conflicting goals (low latency, high throughput, and
fairness), and the scheduler must make trade-offs among them.
- Interaction between different schedulers. One has to take a
systems view. Just optimizing the CPU scheduler may do little for
the overall desired policy.
- Resources you might want to schedule: CPU time, physical memory,
disk and network I/O, and I/O bus bandwidth.
- Entities that you might want to give resources to: users,
processes, threads, web requests, or MIT accounts.
- Many policies for resource-to-entity allocation are possible:
strict priority, divide equally, shortest job first, minimum guarantee
combined with admission control.
- General plan for scheduling mechanisms
- Understand where scheduling is occurring.
- Expose scheduling decisions, allow control.
- Account for resource consumption, to allow intelligent control.
- What policy does JOS enforce for scheduling environments? Reading
between the lines, equal time for all environments: JOS's mechanism is
that the 10 msec timer time-slices in round-robin among all runnable
environments.
But this only works if processes are compute-bound. What if a
process gives up some of its 10 ms to wait for input? Do we have to
keep track of that and give it back?
How long should the quantum be? Is 10 msec the right answer?
A shorter quantum leads to better interactive performance, but it
lowers overall system throughput because we reschedule more often,
which has overhead.
What if the environment computes for 1 msec and sends an IPC to
the file server environment? Shouldn't the file server get more CPU
time because it operates on behalf of all the other environments?
Potential improvements for JOS: track "recent" CPU use
(e.g., over the last second) and always run the environment to
which we "owe" the most CPU time. (But we can't accumulate CPU
debt forever...)
Another solution: directed yield; specify on the yield to which
environment you are donating the remainder of the quantum (e.g., to
the file server so that it can compute on the environment's behalf).
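A minimal sketch of the CPU-accounting idea, in JOS style; the
env_recent_ticks field, the decay step, and the charging policy are
hypothetical additions, not part of stock JOS:

  // Sketch: pick the runnable environment with the least recent CPU
  // use, i.e. the one we "owe" the most.
  void
  sched_yield(void)
  {
      struct Env *best = NULL;
      int i;

      for (i = 0; i < NENV; i++) {
          struct Env *e = &envs[i];
          if (e->env_status != ENV_RUNNABLE)
              continue;
          if (best == NULL || e->env_recent_ticks < best->env_recent_ticks)
              best = e;
      }
      if (best) {
          best->env_recent_ticks++;     // charge it one timer tick
          env_run(best);                // does not return
      }
      // Nothing runnable: halt or spin waiting for an interrupt.
  }

To keep the debt bounded, the timer interrupt could periodically halve
every env_recent_ticks. A directed yield would instead take a target
envid, charge the rest of the quantum to the caller, and env_run() the
target directly.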
- Pitfall: Priority Inversion
Assume policy is strict priority.
Thread T1: low priority (e.g. check to see if I have new mail periodically).
Thread T2: medium priority.
Thread T3: high priority (e.g. real-time game).
T1: acquire(l)
context switch to T3 (maybe T3 wakes up and needs to run: it has priority)
T3: acquire(l)... must wait for T1 to release(l)...
context switch to T2 (again, perhaps T2 needs to run & has prio over T1)
T2 computes for a while
T3 is indefinitely delayed despite high priority.
Can solve this if T3 lends its priority to the holder of the lock it is
waiting for, so T1 runs instead of T2 (sketched below).
[this is really a multiple scheduler problem.]
[since locks schedule access to locked resource.]
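A sketch of the priority-inheritance fix; the thread and lock
structures, and the sleep_on/wakeup primitives, are hypothetical
rather than from any particular kernel:

  struct thread {
      int priority;               // effective (possibly inherited) priority
      int base_priority;          // priority before any inheritance
  };

  struct lock {
      struct thread *holder;      // NULL if the lock is free
      // ... wait queue ...
  };

  extern struct thread *current;  // assumed: the running thread
  void sleep_on(void *chan);      // assumed kernel sleep/wakeup primitives
  void wakeup(void *chan);

  void
  acquire(struct lock *l)
  {
      while (l->holder != NULL) {
          // Lend our priority to the holder so it runs and releases the
          // lock, instead of being preempted by medium-priority threads
          // (the T1/T2/T3 case above).
          if (l->holder->priority < current->priority)
              l->holder->priority = current->priority;
          sleep_on(l);            // block until release() wakes us
      }
      l->holder = current;
  }

  void
  release(struct lock *l)
  {
      l->holder = NULL;
      // Simplified: drop back to our base priority; a real kernel would
      // only do this once no held lock has a higher-priority waiter.
      current->priority = current->base_priority;
      wakeup(l);
  }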
- Pitfall: Efficiency. Efficiency often conflicts with fairness (or
any other policy). Long time quantum for efficiency in CPU scheduling
versus low delay. Shortest seek versus FIFO disk scheduling.
Contiguous read-ahead vs data needed now. For example, scheduler
swaps out my idle emacs to let gcc run faster with more phys mem.
What happens when I type a key? These don't fit well into a "who gets
to go next" scheduler framework. Inefficient scheduling may make
everybody slower, including high priority users.
- Pitfall: Multiple Interacting Schedulers. Suppose you want your
emacs to have priority over everything else. Give it high CPU
priority. Does that mean nothing else will run if emacs wants to run?
Disk scheduler might not know to favor emacs's disk I/Os. Typical
UNIX disk scheduler favors disk efficiency, not process prio. Suppose
emacs needs more memory. Other processes have dirty pages; emacs must
wait. Does disk scheduler know these other processes' writes are high
prio?
- Pitfall: Server Processes. Suppose emacs uses X windows to
display. The X server must serve requests from many clients. Does it
know that emacs' requests should be given priority? Does the OS know
to raise X's priority when it is serving emacs? Similarly for DNS,
and NFS. Does the network know to give emacs' NFS requests priority?
In short, scheduling is a system problem. There are many
schedulers; they interact. The CPU scheduler is usually the easy
part. The hardest part is system structure. For example, the
existence of interrupts is bad for scheduling. Conflicting
goals may limit effectiveness.
Case study: UNIX in 1990
What did UNIX scheduling look like at time of Livelock paper?
Uni-processor
No spin-locks in kernel
Different contexts:
Process, user half
Process, kernel half
Soft interrupts
Device ("hard") interrupts
Rules for context transitions:
Hardware can force a hard interrupt at almost any time, except during a hard intr
Hardware can force a soft interrupt, except during a soft or hard intr
Interrupt must complete before interrupted code can continue
Interrupt shares stack
Timer interrupt does *not* switch kernel halves
Only voluntary switch via sleep()
Never runs a user half if any kernel half wants to run
User halves are pre-emptable via the timer interrupt
Thus this priority structure:
hard intr > soft intr > kernel half > user half
Why no pre-emptable kernel halves?
To avoid most locks (remember -- uniprocessor)
Only have to worry about concurrency from interrupts
So kernel halves disable interrupts in critical sections
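For example, roughly the 4.3BSD idiom (sketch only):

  void
  example_kernel_half(void)
  {
      int s;

      s = splnet();    // raise priority level: block network interrupts
      // ... touch a queue shared with the network interrupt handlers ...
      splx(s);         // restore the previous level
  }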
Why soft interrupts?
For time-consuming processing in response to hard interrupts
Hard interrupt does time-critical stuff, schedules soft interrupt
More hard interrupts can occur during soft int
Short hard interrupts avoid input overruns
Example: generating TCP ACKs (don't want to wait for some kernel half...)
Paper's application is packet forwarding
How did that work?
Receive interrupt
Copy incoming packet to memory
Append packet pointer to IP Input queue
Schedule IP soft interrupt
IP soft interrupt
For each packet in IP input queue
Decide what to do with it
Forward out a different network interface?
Append packet pointer to output Q
Give packet to output i/f h/w if idle
Deliver to a user program?
Append packet pointer to socket Q
Wake up process
Transmit interrupt
Free last packet buffer
Copy packet from output Q to h/w
Tell h/w to transmit
Network interface hardware (e.g. ethernet)
Has small receive buffer on interface h/w (a few packets)
Raises receive interrupt for each received packet
Has small output buffer (a few packets)
Transmit-complete interrupts
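A sketch of this pre-fix structure, using 4.3BSD-style names (ipintrq,
schednetisr, the IF_* queue macros); copy_packet_from_hw is an
illustrative stand-in for the driver's copy loop:

  // Hard receive interrupt: copy the packet in, queue it, schedule the
  // IP soft interrupt.
  void
  rx_interrupt(struct ifnet *ifp)
  {
      struct mbuf *m = copy_packet_from_hw(ifp);
      int s = splimp();

      if (IF_QFULL(&ipintrq)) {
          IF_DROP(&ipintrq);      // input queue full: count and drop
          m_freem(m);
      } else
          IF_ENQUEUE(&ipintrq, m);
      splx(s);
      schednetisr(NETISR_IP);     // run ipintr() once no hard intr is active
  }

  // IP soft interrupt: drain the whole input queue.
  void
  ipintr(void)
  {
      struct mbuf *m;

      for (;;) {
          int s = splimp();
          IF_DEQUEUE(&ipintrq, m);
          splx(s);
          if (m == NULL)
              break;
          ip_input(m);    // forward out another interface (via the
                          // output queue), or wake up a process
      }
  }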
Why the input queue?
Why the output queue?
Is this a good arrangement?
Paper discussion
- Explain Figure 6-1. Why does it go up? What determines how high
the peak is? Why does it go down? What determines how fast it goes
down? (The peak at 4000 means total processing time is 250 us. The
x-intercept at 13000 means input processing alone takes 77 us.
So each additional four inputs/second eliminates one output/second.)
- Would livelock show up for disk interrupts?
- Why not tell the O/S scheduler to give interrupts lower priority?
Because interrupts are non-preemptible: the scheduler has no say over them.
Could you fix this by making interrupts faster? (No: at a high enough
input rate, even fast interrupt handling consumes the entire CPU.)
- Why not completely process each packet in the interrupt handler?
(I.e. forward it?) Still need an output queue. Recv intr routine
might never return, thus no tx complete interrupts, thus no
transmissions. Also maybe interface h/w overruns during short input bursts.
Wouldn't be fair if multiple input interfaces.
- Big picture: we've surrendered all control over scheduling
various actions to the CPU's interrupt mechanism. And it doesn't
give us the control we need to enforce our desired policy:
output has precedence over input.
- What about using polling instead of interrupts? Solves the overload
problem, but under low load it either wastes CPU or adds latency.
- What's the paper's solution?
- No IP input queue.
- Device input polling and IP input processing in kernel thread.
- Device receive interrupt just wakes up thread. And leaves
interrupts *disabled* for that device.
- Thread does all input and output processing, then re-enables interrupts.
- [show pseudo-code]
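A minimal sketch of that structure; the netdev hooks, helper
functions, POLL_QUOTA, and the sleep_on/wakeup primitives are all
illustrative, not the paper's actual code:

  #define POLL_QUOTA 10                 // max packets per device per pass

  struct netdev {
      struct netdev *next;
      void (*disable_rx_intr)(struct netdev *);
      void (*enable_rx_intr)(struct netdev *);
      // ... device state ...
  };

  extern struct netdev *netdevs;        // list of network devices
  int  tx_poll(struct netdev *, int quota);    // assumed helpers
  int  rx_poll(struct netdev *, int quota);
  int  rx_ring_empty(struct netdev *);
  void sleep_on(void *chan);            // assumed kernel primitives
  void wakeup(void *chan);

  static int net_thread_chan;           // sleep/wakeup channel

  // Receive interrupt: just wake the thread; receive interrupts stay
  // disabled for this device until the thread catches up.
  void
  rx_interrupt(struct netdev *dev)
  {
      dev->disable_rx_intr(dev);
      wakeup(&net_thread_chan);
  }

  // Kernel thread: does all input and output processing.
  void
  net_thread(void)
  {
      struct netdev *dev;
      int did_work;

      for (;;) {
          did_work = 0;
          for (dev = netdevs; dev != NULL; dev = dev->next) {
              // Transmit-complete processing first, so output is never
              // starved by input.
              did_work += tx_poll(dev, POLL_QUOTA);
              // Pull at most POLL_QUOTA packets, processing each one to
              // completion (forward or deliver); there is no separate
              // IP input queue to overflow.
              did_work += rx_poll(dev, POLL_QUOTA);
              if (rx_ring_empty(dev))
                  dev->enable_rx_intr(dev);    // back to interrupt mode
          }
          if (!did_work)
              sleep_on(&net_thread_chan);      // wait for the next interrupt
      }
  }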
- Why does this work? What happens when packets arrive too fast?
What happens when packets arrive slowly?
- Explain Figure 6-3.
- Why does "Polling (no quota)" work badly? (Input still starves
xmit complete processing.)
- Why does it immediately fall to zero, rather than gradually decreasing?
(xmit complete processing must be very cheap compared to input.)
- Explain Figure 6-4.
- Why does "Polling, no feedback" behave badly? There's a queue in
front of screend. We can still give 100% to input thread, 0% to
screend.
- Why does "Polling w/ feedback" behave well? Input thread yields
when queue to screend fills.
- What if screend hangs, what about other consumers of packets?
(e.g., can you ssh to the machine to fix screend?) Fortunately screend
is typically the only application. Also, re-enable input after a timeout.
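A sketch of the feedback idea, continuing the illustrative rx_poll
from the earlier sketch: stop pulling input while the queue toward
screend is nearly full, so screend actually gets CPU to drain it (the
threshold and helper names are made up):

  int
  rx_poll(struct netdev *dev, int quota)
  {
      int n = 0;

      while (n < quota && !rx_ring_empty(dev)) {
          if (screend_queue_len() >= SCREEND_QUEUE_HIGH)
              break;                    // feedback: yield to screend
          deliver_to_screend(rx_dequeue(dev));
          n++;
      }
      return n;                         // "did work" count for the caller
  }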
- Big picture: polling loop gives us more control -- but only knows
about device and IP packet processing, not about other activities
on host.
- Why are the two solutions different?
- Polling thread with quotas.
- Feedback from full queue.
(I believe they should have used #2 for both.)
- If we apply the proposed fixes, does the phenomenon totally go
away? (e.g. for web server, waits for disk, &c.)
- Can the net device throw away packets without slowing down host?
- Problem: We want to drop packets only for applications with big queues.
But requires work to determine which application a packet belongs to.
Perhaps interface hardware should look at packets and decide...
- What if processing has more complex structure?
- Chain of processing stages with queues? Does feedback work?
What happens when a late stage is slow?
- Split at some point, multiple parallel paths? Not so great; one
slow path blocks all paths.
- Can we formulate any general principles from paper?
- Don't spend time on new work before completing existing work.
- Or give new work lower priority than partially-completed work.
- How would you build packet forwarding in JOS?