6.828 2011 Lecture 13: Scheduling, Livelock

What is scheduling?
  Allocation of resources to entities
  Typically to achieve some goal
    Fairness, high throughput, low latency, fulfilling a contract
  Popular in the days of time sharing
    Scarce, shared resources
  Seemed irrelevant in the era of PCs and workstations
  But still important:
    1. Internet services -- must deal with sudden overload
    2. Careful design of high-performance systems (today's paper)
  Key problems:
    Gap between desired policy and available mechanism
    Interaction between different schedulers

Resources you might want to schedule:
  CPU time
  Physical memory
  Disk and network I/O
  I/O bus bandwidth

Entities that you might want to give resources to:
  Users
  Processes
  Threads
  Web requests
  MIT accounts
  Connections and packets

Scheduling policies:
  Strict priority
  Divide equally
  Shortest job first
  Minimum guarantee + admission control

General plan for scheduling mechanisms:
  1. Understand where scheduling is occurring
  2. Expose scheduling decisions, allow control
  3. Account for resource consumption, to allow intelligent control

Simple example:
  Policy: fair CPU time among processes
  Mechanism: round robin every 10 ms?
  Works if processes are compute-bound
  What if a process gives up some of its 10 ms to wait for input?
    Do we have to keep track of that and give it back?
  In practice simple round-robin doesn't produce equal shares
  Better: track "recent" CPU use, e.g. over the last second,
    and always run the process with the least recent CPU use
    (sketched in C after the pitfalls below)
  Still, if you sleep long enough you lose

Pitfall: Priority Inversion
  [timeline]
  Policy: strict priority
  Thread T1: low priority
  Thread T2: medium priority
  Thread T3: high priority
  T1: acquire(l)
  context switch to T3
  T3: acquire(l)... must wait for T1 to release(l)...
  context switch to T2
  T2 computes for a while
  T3 is indefinitely delayed despite its high priority
  Can solve by having T3 lend its priority to the holder of the lock
    it is waiting for, so T1 runs, not T2 (also sketched below)
  This is really a multiple-scheduler problem
    Since locks schedule access to the locked resource

Pitfall: Efficiency
  Time quantum in CPU scheduling (vs low delay)
  Shortest-seek-first vs FIFO disk scheduling
  Contiguous read-ahead vs data needed now
  Swap out my idle emacs to let gcc run faster w/ more phys mem
    What happens when I type a key?
  Efficiency often conflicts with fairness (or any other policy)

Pitfall: Interacting Schedulers
  Policy: emacs has absolute priority
  CPU: easy
  What about the disk driver's request queue?
  What about the disk cache replacement policy?
  What about the physical memory allocator / replacement policy?

Pitfall: Server Processes
  Suppose emacs uses X Windows to display
  The X server must serve requests from many clients
  Does it know that emacs' requests should be given priority?
  Does the O/S know to raise X's priority when it is serving emacs?

Scheduling is a whole-system problem
  Policies are typically global: good responsiveness for emacs
  Mechanisms are usually local: CPU scheduler, memory allocator, &c
    And they often do not provide scheduling knobs
  It's hard to get all the mechanisms aligned
  System structure is often the hardest part
    E.g. client/server
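To make two of the ideas above concrete, here are two hedged C sketches.
First, the "least recent CPU use" mechanism from the simple example. The
names (ptable, recent_ticks, etc.) are illustrative, not any real
kernel's code, though this is roughly how classic BSD schedulers behaved.

  #include <stddef.h>

  #define NPROC 64

  struct proc {
      int runnable;       /* can this process run right now? */
      int recent_ticks;   /* decayed count of recent CPU ticks */
  };

  struct proc ptable[NPROC];

  /* timer interrupt: charge the running process for one tick */
  void tick(struct proc *running)
  {
      running->recent_ticks++;
  }

  /* once per second: halve everyone's usage, so a process that
     has been waiting for input gradually regains its claim */
  void decay(void)
  {
      for (int i = 0; i < NPROC; i++)
          ptable[i].recent_ticks /= 2;
  }

  /* always pick the runnable process with the least recent CPU use */
  struct proc *pick(void)
  {
      struct proc *best = NULL;
      for (int i = 0; i < NPROC; i++)
          if (ptable[i].runnable &&
              (best == NULL || ptable[i].recent_ticks < best->recent_ticks))
              best = &ptable[i];
      return best;
  }

Note the residual unfairness mentioned above: recent_ticks bottoms out
at zero, so sleeping for ten seconds earns no more credit than sleeping
for one.

Second, priority lending for the inversion pitfall. Again a sketch under
assumptions: a uniprocessor running these functions with interrupts off
(as in the old UNIX described below), and assumed helpers self() and
yield(); a real kernel would also keep a queue of waiters per lock.

  #include <stddef.h>

  struct thread {
      int base_pri;   /* assigned priority */
      int eff_pri;    /* effective priority, possibly lent by a waiter */
  };

  struct lock {
      struct thread *holder;   /* NULL if free */
  };

  extern struct thread *self(void);   /* the current thread (assumed) */
  extern void yield(void);            /* run the highest-eff_pri thread */

  void acquire(struct lock *l)
  {
      while (l->holder != NULL) {
          /* lend our priority: now T1 outranks T2 and can finish */
          if (l->holder->eff_pri < self()->eff_pri)
              l->holder->eff_pri = self()->eff_pri;
          yield();
      }
      l->holder = self();
  }

  void release(struct lock *l)
  {
      l->holder = NULL;
      self()->eff_pri = self()->base_pri;   /* drop any lent priority */
  }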
What did UNIX scheduling look like at the time of the paper?
  Uni-processor
  No spin-locks in the kernel

UNIX execution environments
  [diagram]
  Process, user half
  Process, kernel half
  Soft interrupts: timer, network
  Device interrupts ("hard")
  Rules:
    A hard intr can occur at any time (except during another hard intr)
    A soft intr runs on return from a hard intr
    An interrupt must complete before other code continues
      Since it runs on the current stack
    Kernel halves are not pre-emptible; they only switch via sleep()
    The kernel half has absolute priority over the user half
    User halves are pre-emptible via the timer interrupt
  Effective priorities: hard intr > soft intr > kernel half > user
  Why no pre-emptible kernel halves?
    Few locks needed (uniprocessor...)
    Just turn off interrupts in critical sections
  Why soft interrupts?
    For time-consuming processing in response to hard intrs
    E.g. network input processing
    The hard intr does the time-critical h/w processing
      e.g. pulls the pkt off the NIC to avoid overrunning its buffer
    The soft intr can itself be interrupted

How does (did) packet forwarding work?
  We need this to understand the paper
  [diagram: NIC, rx intr, in Q, soft intr, sock Q, user, out Q, tx intr, NIC]
  NIC: buffers the packet in on-board memory
  Receive interrupt:
    allocate a pkt buffer
    copy the packet from the NIC to memory
    append the packet to the IP input queue
    schedule the IP soft interrupt
  IP soft interrupt, for each packet on the IP input queue:
    routing lookup
    if to be forwarded:
      append the pkt to the output queue
      start the output hardware if idle, i.e. copy the packet to the NIC
    if local:
      append the pkt to a socket queue
      wake up the process
  Transmit interrupt:
    dequeue a pkt from the output queue
    copy it to NIC memory
    poke the NIC
    free the pkt buf
  Why the input queue? Why the output queue?

What has changed since the 1990s?
  Multi-processor support changed many parts of the design
  Threads instead of soft interrupts, for e.g. network processing
  Locks instead of disabling interrupts (OK, locks disable interrupts too)
  Kernel halves/threads are pre-emptible

Today's paper: Eliminating Receive Livelock in an Interrupt-Driven
Kernel, by Mogul and Ramakrishnan

Why are we reading this paper?
  Most scheduling issues are boring, solvable with a faster CPU
  This one is much more fundamental
  The underlying problem comes up a lot, in many guises

Explain Figure 6-1
  Why does the curve go up?
  What determines how high the peak is?
  Why does it go down?
  What determines how fast it goes down?
  Peak at 5000 pkts/sec => about 200 us of CPU time per forwarded pkt
  Zero at 13000 pkts/sec => input processing alone takes about 77 us/pkt
  Each additional 4 inputs/sec eliminates one output/sec

Suppose I wanted to test a web server for livelock
  Run a client with this loop:
    while(1){ send a GET; wait for the response; }
  What would I see?
    (no livelock: a closed loop like this never offers more load
     than the server is already keeping up with)

Why not completely process each packet in the interrupt handler?
  I.e. forward it there?
  (still need the out Q; would starve tx intrs; would starve
   other devices' rx intrs)

Why not always poll, never use interrupts?

Big picture: we want to give output priority over input,
  but interrupts allow us no control

What's the paper's solution?
  No IP input queue
  A thread *polls* the NICs for input and output
  The NIC receive interrupt just wakes up the thread,
    then leaves interrupts *disabled* for that NIC
  The thread does all input processing, then re-enables interrupts

  thread(){
    while(1){
      for each NIC:
        while(input pkt available)
          process pkt                  // all the way to out Q or sock Q
        if(output done)
          move pkt from out Q to NIC
      if(nothing to do){
        enable interrupts
        sleep
      }
    }
  }

  intr(){
    disable interrupts                 // for this NIC
    wake up thread
  }

  Why does this work?
  What happens when packets arrive too fast?
  What happens when packets arrive slowly?
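Figure 6-3 (next) compares this loop with and without a quota on input
work, so here is a hedged C sketch of the quota variant. The helper
functions and the quota value are assumptions for illustration, not the
paper's code; the point is only that bounding input work per round keeps
one busy NIC's input from starving transmit-complete processing and the
other NICs.

  #define RX_QUOTA 10   /* max input pkts per NIC per round: a tuning knob */

  /* assumed helpers, not a real API: */
  extern int  nnic;                       /* number of NICs */
  extern int  rx_pkt_available(int nic);
  extern void process_rx_pkt(int nic);    /* protocol work, to out Q or sock Q */
  extern int  tx_done(int nic);
  extern int  out_queue_nonempty(int nic);
  extern void start_tx_pkt(int nic);      /* move pkt from out Q to NIC */
  extern void enable_nic_interrupts(void);
  extern void sleep_until_interrupt(void);

  void polling_thread(void)
  {
      for (;;) {
          int did_work = 0;
          for (int i = 0; i < nnic; i++) {
              /* input, bounded by the quota */
              for (int n = 0; n < RX_QUOTA && rx_pkt_available(i); n++) {
                  process_rx_pkt(i);
                  did_work = 1;
              }
              /* output: refill the NIC from the output queue */
              while (tx_done(i) && out_queue_nonempty(i)) {
                  start_tx_pkt(i);
                  did_work = 1;
              }
          }
          if (!did_work) {
              enable_nic_interrupts();
              sleep_until_interrupt();    /* rx intr wakes us, stays disabled */
          }
      }
  }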
Explain Figure 6-3
  Why does "Polling (no quota)" work badly?
    Input processing still starves xmit-complete processing
  Why does it immediately fall to zero, rather than gradually decreasing?
    Livelock is made worse by doing even more processing before discarding
    I.e. each excess rx pkt now consumes many tx pkts' worth of CPU time

Explain Figure 6-4
  Why does "Polling, no feedback" behave badly?
    There's a queue in front of screend
    We can still give 100% of the CPU to the input thread, 0% to screend
  Why does "Polling w/ feedback" behave well?
    The input thread yields when the queue to screend is near-full
    And wakes up when the queue is near-empty
    (sketched at the end of these notes)
  What would happen if screend hung?

Big picture: the polling loop is a place to exert scheduling control

Why are the two solutions different?
  1. Polling thread *with quotas*
  2. Feedback from a full queue
  Perhaps they could have used #2 for both
    Feedback doesn't require magic numbers
    But it's hard to retro-fit into the existing UNIX structure

What if processing has a more complex structure?
  A chain of processing stages with queues? Does feedback work?
    What happens when a late stage is slow?
  A split at some point, with multiple parallel paths?
    Not so great; one slow path blocks all paths

Can we formulate any general principles?
  Don't spend time on new work before completing existing work
  Or: give new work lower priority than partially-completed work
  Or: discard as early as possible
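As promised above, a hedged C sketch of the Figure 6-4 feedback scheme.
The queue, the watermark values, and wakeup_polling_thread() are
illustrative assumptions, not the paper's code: the polling thread stops
taking input when the queue to screend is nearly full, and resumes once
screend has drained it.

  #define QMAX       64            /* capacity of the queue to screend */
  #define HIGH_WATER (QMAX - 4)    /* stop taking input here */
  #define LOW_WATER  (QMAX / 4)    /* resume taking input here */

  int screend_qlen;                /* packets queued for screend */
  int input_enabled = 1;           /* checked by the polling thread */

  extern void wakeup_polling_thread(void);   /* assumed helper */

  /* polling thread: called for each packet destined for screend */
  void enqueue_for_screend(void)
  {
      screend_qlen++;
      if (screend_qlen >= HIGH_WATER)
          input_enabled = 0;       /* feedback: stop polling for input */
  }

  /* kernel, on behalf of screend: called as it consumes packets */
  void dequeue_by_screend(void)
  {
      screend_qlen--;
      if (!input_enabled && screend_qlen <= LOW_WATER) {
          input_enabled = 1;       /* queue has drained: resume input */
          wakeup_polling_thread();
      }
  }

If screend hangs, the queue never drains, input stays disabled, and
excess packets are dropped cheaply in the NIC -- which is exactly the
"discard as early as possible" principle above.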