6.824 2004 Lecture 12: Performance (Part 1: Livelock)

Announcements:
  Project draft one due today for comments
  Midterms handed back at end of class, mean = 86

Performance of a single system under overload:
  Why this is interesting: it's deeper than just throwing CPU at the problem
  This problem shows up any time a regularly scheduled job A is pre-empted by B

Specific problem: IP forwarding

What's going on here (diagram):
  Network hardware (rx) -> Q -> IP stack -> Q -> App
                  (xmit) <- Q <-
  Interrupt generated by networking hardware is handled immediately;
  packet placed on a queue; soft interrupt runs the IP stack; placed on
  a queue for the app; and so on.

Q: When can the xmit path starve? What about the application?
  In general, if a Q is dropping, the pusher has higher priority than the puller

Explain Figure 6-1:
  What determines how far the curve goes up?
  Abstractly, what determines the slope of the line going down?
  Why does screend's line go down earlier than the line w/o screend?
  Ideally we'd discard ASAP, and the curve would flatten out
  Mogul's point: we burn CPU cycles exactly when they are most precious
  (a toy model of this collapse is sketched at the end of Part 1)

Q: Would infinite-length queues fix this?
Q: What about reducing the interrupt priority level?
  Still have to schedule the work, and to avoid committing to work when
  there's no CPU time to finish it.
Q: Any other ideas? Maybe process each packet to completion?

Different scenario: today's question
  Can this livelock?
    while(1) { send NFS READ RPC; wait for response; }
  Q: What ends up happening?
  Q: What about with lots of clients?

The paper's solution: timeslice + round-robin (RR) scheduling
  when idle, use interrupts; when busy, poll and handle work in a kernel thread
  discard packets as early as possible; when busy, accept no new work
  another way to see it: change the push architecture into a pull architecture
  (the polling loop is sketched at the end of Part 1)
  Q: What happens if packets arrive too quickly? Too slowly?

Figure 6-3:
  Q: What's wrong with polling with no quota? xmit is still starved
  Q: Why does it go to zero? That's a lot worse!?

Figure 6-4:
  Q: Why does "polling, no feedback" do poorly? It can still starve screend
  Q: What's the basic idea behind feedback? The thread yields when the
     queue to screend is full
  Comment: feedback seems more elegant than quotas (no magic constants)

Is this the final answer?
  What if we run telnet and a web server on the same machine?
  How about a bunch of processing events -- will feedback still work?
  Works well for the net (we can drop packets), but what about disk I/O?

Q: What's the general principle?
  Don't waste time on new work that you can't finish
  (or make new work lower priority than finishing old work)
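To see why the Figure 6-1 curve rises and then collapses, here's a toy
closed-form model (my own sketch, not from the paper; the per-packet
costs are made-up numbers). Receive interrupts get the CPU first, so
forwarding only gets what's left over:

  # Toy model of receive livelock. Each packet costs RX_COST seconds of
  # receive-interrupt work and FWD_COST seconds of forwarding work;
  # interrupts always win the CPU.

  RX_COST  = 20e-6   # assumed: 20 us per receive interrupt
  FWD_COST = 80e-6   # assumed: 80 us to forward a packet

  def delivered(rate):
      """Packets/sec actually forwarded at a given input rate."""
      cpu_left = 1.0 - rate * RX_COST   # CPU/sec left after interrupts
      if cpu_left <= 0:                 # livelock: all time in rx interrupts
          return 0.0
      return min(rate, cpu_left / FWD_COST)

  for rate in (2000, 5000, 10000, 20000, 50000, 60000):
      print(f"{rate:6d} pkts/s in -> {delivered(rate):8.0f} pkts/s out")

With these (invented) costs, throughput peaks at 10,000 pkts/s and hits
zero at 50,000 pkts/s, when interrupt handling alone consumes the CPU --
the same shape as the paper's measured curves.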
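And a minimal sketch of the paper's fix, a kernel polling thread with a
quota plus queue-state feedback. All the names here (QUOTA, rx_ring,
screend_queue, the stub functions) are invented for illustration; the
real mechanism lives inside the kernel:

  # Stand-ins so the sketch runs; the real versions are kernel code.
  def ip_forward(pkt): return pkt       # stand-in for real forwarding
  def reschedule_poll(): pass           # stand-in: queue the thread again
  def enable_rx_interrupt(): pass       # stand-in: re-arm the device IRQ

  QUOTA = 10          # max packets handled per poll (assumed value)
  QUEUE_LIMIT = 32    # feedback: stop when the consumer's queue fills

  def poll(rx_ring, screend_queue):
      """Kernel-thread body: runs while the device has pending work;
      receive interrupts for this device stay disabled meanwhile."""
      handled = 0
      while rx_ring and handled < QUOTA:
          if len(screend_queue) >= QUEUE_LIMIT:
              # feedback: the puller (screend) is behind -- yield the
              # CPU rather than queueing work we can't finish
              break
          pkt = rx_ring.pop(0)
          screend_queue.append(ip_forward(pkt))
          handled += 1
      if rx_ring:
          reschedule_poll()       # more work: poll again next timeslice
      else:
          enable_rx_interrupt()   # idle: fall back to interrupts

  # toy demo: 100 packets waiting, empty downstream queue
  ring, q = list(range(100)), []
  poll(ring, q)
  print(len(ring), "left in ring,", len(q), "queued for screend")

Note how this is a pull architecture: work advances only when the
thread (and, via feedback, the downstream consumer) has capacity.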
6.824 2004 Lecture 12: Performance (Part 2: Black Boxes)

We've talked about performance for single-machine/process systems --
what happens with multiple components?

Motivation:
  People tend to build systems out of components (web server, DB, etc.),
  often closed or very complex components
  Performance problems are difficult to diagnose: we don't want to open
  up every black box
  Example: a multi-tier web service with a front end, auth server, and DB;
  all we might know is that requests take 400ms on average

Contribution:
  Figure out which black boxes to open, reducing the time/cost of
  debugging the system
  Instead of focusing on instrumentation, use the minimum amount of
  information to infer what's going on
  Goal: find "high-impact" causal paths -- ones that occur frequently
  and have high latency
  Think of a hierarchical profiler: gprof for distributed systems

Basic idea: passively monitor messages such as RPCs (e.g. with tcpdump)
and try to figure out which components are slow, and when.

Diagram: three components A, B, C; calls a->b, b->c, c->b, b->a, with a
slight delay at each component

One approach, which uses call-return/RPC ids: the nesting algorithm
  pair call and return messages
  figure out nesting relationships
  reconstruct call paths
  (a minimal sketch appears at the end of these notes)
  Complication: parallel calls A->B, A->B, B->C, B->C -- which belongs
  to which? multiple options
  define heuristics: scoreboard the potential parallel calls, make a
  guess based on a histogram of delays

Another approach, without call-return semantics: create time series
  s1(t) = a->b messages
  s2(t) = b->c messages
  diagram with impulses at times 1,2,5,6,7 for s1 and 3,4,7,8,9 for s2
  convolution: x(t) = s2(t) [x] s1(-t)
  shows a peak at t=2, the causal delay
  an FFT computes it quickly and detects the time shift directly
  (second sketch at the end of these notes)
  inherently robust to noise, but periodic traffic causes problems
  Cool, but slow -- see the experimental results: 2 hours vs. 2 seconds!
  If you have call-return semantics, nesting seems like the way to go.

Practical concerns:
  clock sync: OK if skew is less than the minimum instrumentation delay
  message loss: OK up to 5%
  delay variance: robust up to 30% of the minimum period

Q: other problems? What about cache hits?
Q: How useful is this to a programmer? gprof tells you which functions
   to look at; can this approach?
Q: Is this the right level? Why not instrument the Java VM? There's an
   invasiveness tradeoff.

Conclusion: passively collected traces have enough information to infer
what's going on at a high level. Also, this was intended as a tool, not
a system for automagic optimization, so false positives/errors are OK.
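A minimal sketch of the nesting idea on a toy trace. The trace format
and the LIFO pairing rule are my assumptions for illustration; for
parallel calls the paper instead scoreboards the candidates and guesses
from delay histograms:

  # Pair calls with returns, then attribute a child call to the
  # enclosing parent call whose interval contains it.

  trace = [  # (time, src, dst, kind)
      (1.0, "A", "B", "call"),
      (1.2, "B", "C", "call"),
      (1.8, "C", "B", "return"),
      (2.0, "B", "A", "return"),
  ]

  # 1. Pair each return with an unmatched call on the same edge.
  open_calls, pairs = {}, []
  for t, src, dst, kind in trace:
      if kind == "call":
          open_calls.setdefault((src, dst), []).append(t)
      else:  # a return dst->src pairs with a call src->dst
          t0 = open_calls[(dst, src)].pop()   # LIFO guess (a heuristic)
          pairs.append((dst, src, t0, t))     # (caller, callee, start, end)

  # 2. Nest: (x->y) is a child of (w->x) if its interval lies inside.
  for caller, callee, s, e in pairs:
      for p in pairs:
          if p[1] == caller and p[2] < s and e < p[3]:
              print(f"{p[0]}->{p[1]} causes {caller}->{callee} "
                    f"(delay {s - p[2]:.1f}s)")

On this trace it reports that A->B causes B->C with a 0.2s delay, i.e.
the reconstructed causal path A -> B -> C.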
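The convolution approach is easy to sketch with numpy, using the
impulse times from the diagram above (the bin width is an assumed
parameter). Multiplying one FFT by the conjugate of the other computes
the cross-correlation x(t) = s2(t) [x] s1(-t) in O(n log n), and the
peak falls at the causal delay:

  import numpy as np

  BIN = 0.5                       # assumed time-series bin width (seconds)
  T = 12.0                        # trace length

  def series(times):
      s = np.zeros(int(T / BIN))
      for t in times:
          s[int(t / BIN)] += 1.0
      return s

  s1 = series([1, 2, 5, 6, 7])    # a->b messages
  s2 = series([3, 4, 7, 8, 9])    # b->c messages

  # x(t) = s2(t) [x] s1(-t), via FFTs; zero-pad to avoid wraparound
  n = 2 * len(s1)
  x = np.fft.irfft(np.fft.rfft(s2, n) * np.conj(np.fft.rfft(s1, n)), n)
  lag = np.argmax(x) * BIN
  print(f"estimated a->b delay at B: {lag:.1f}s")   # peak at t=2

Every s2 impulse trails its s1 impulse by 2 seconds, so the correlation
peaks at lag 2 -- the time shift the algorithm is after, found without
any call-return ids.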