6.824 2004 Lecture 12: Performance (Part 1: Livelock)

Announcements:
  Project draft one due today for comments
  Midterms handed back at end of class, mean = 86

Performance of a single system under overload:
  Why this is interesting: it's deeper than just throwing CPU at the problem
  This problem shows up any time a regularly scheduled job A is pre-empted by B

Specific problem: IP forwarding

What's going on here (diagram):
  Network hardware (rx) -> Q -> IP stack -> Q -> App
                  (xmit) <- Q <-
  Interrupt generated by networking hardware is handled immediately;
  packet placed on a queue; soft interrupt runs the IP stack; placed on
  a queue for the app; and so on.

Q: When can the xmit path starve? What about the application?
  In general, if a Q is dropping, the pusher has higher priority than the puller

Explain Figure 6-1:
  What determines how far the curve goes up?
  Abstractly, what determines the slope of the line going down?
  Why does screend's line go down earlier than the line w/o screend?
  Ideally we'd discard ASAP, and the curve would flatten out
  Mogul's point: we burn CPU cycles exactly when they are most precious
  (a toy model of this collapse is sketched at the end of Part 1)

Q: Would infinite-length queues fix this?
Q: What about reducing the interrupt priority level?
  Still have to schedule the work, and to avoid committing to work when
  there's no CPU time to finish it.
Q: Any other ideas? Maybe process each packet to completion?

Different scenario: today's question
  Can this livelock?
    while(1) { send NFS READ RPC; wait for response; }
  Q: What ends up happening?
  Q: What about with lots of clients?

The paper's solution: timeslice + round-robin (RR) scheduling
  when idle, use interrupts; when busy, poll and handle work in a kernel thread
  discard packets as early as possible; when busy, accept no new work
  another way to see it: change the push architecture into a pull architecture
  (the polling loop is sketched at the end of Part 1)
  Q: What happens if packets arrive too quickly? Too slowly?

Figure 6-3:
  Q: What's wrong with polling with no quota? xmit is still starved
  Q: Why does it go to zero? That's a lot worse!?

Figure 6-4:
  Q: Why does "polling, no feedback" do poorly? It can still starve screend
  Q: What's the basic idea behind feedback? The thread yields when the
     queue to screend is full
  Comment: feedback seems more elegant than quotas (no magic constants)

Is this the final answer?
  What if we run telnet and a web server on the same machine?
  How about a bunch of processing events -- will feedback still work?
  Works well for the net (we can drop packets), but what about disk I/O?

Q: What's the general principle?
  Don't waste time on new work that you can't finish
  (or make new work lower priority than finishing old work)
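To see why the Figure 6-1 curve rises and then collapses, here's a toy
closed-form model (my own sketch, not from the paper; the per-packet
costs are made-up numbers). Receive interrupts get the CPU first, so
forwarding only gets what's left over:

  # Toy model of receive livelock. Each packet costs RX_COST seconds of
  # receive-interrupt work and FWD_COST seconds of forwarding work;
  # interrupts always win the CPU.

  RX_COST  = 20e-6   # assumed: 20 us per receive interrupt
  FWD_COST = 80e-6   # assumed: 80 us to forward a packet

  def delivered(rate):
      """Packets/sec actually forwarded at a given input rate."""
      cpu_left = 1.0 - rate * RX_COST   # CPU/sec left after interrupts
      if cpu_left <= 0:                 # livelock: all time in rx interrupts
          return 0.0
      return min(rate, cpu_left / FWD_COST)

  for rate in (2000, 5000, 10000, 20000, 50000, 60000):
      print(f"{rate:6d} pkts/s in -> {delivered(rate):8.0f} pkts/s out")

With these (invented) costs, throughput peaks at 10,000 pkts/s and hits
zero at 50,000 pkts/s, when interrupt handling alone consumes the CPU --
the same shape as the paper's measured curves.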
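And a minimal sketch of the paper's fix, a kernel polling thread with a
quota plus queue-state feedback. All the names here (QUOTA, rx_ring,
screend_queue, the stub functions) are invented for illustration; the
real mechanism lives inside the kernel:

  # Stand-ins so the sketch runs; the real versions are kernel code.
  def ip_forward(pkt): return pkt       # stand-in for real forwarding
  def reschedule_poll(): pass           # stand-in: queue the thread again
  def enable_rx_interrupt(): pass       # stand-in: re-arm the device IRQ

  QUOTA = 10          # max packets handled per poll (assumed value)
  QUEUE_LIMIT = 32    # feedback: stop when the consumer's queue fills

  def poll(rx_ring, screend_queue):
      """Kernel-thread body: runs while the device has pending work;
      receive interrupts for this device stay disabled meanwhile."""
      handled = 0
      while rx_ring and handled < QUOTA:
          if len(screend_queue) >= QUEUE_LIMIT:
              # feedback: the puller (screend) is behind -- yield the
              # CPU rather than queueing work we can't finish
              break
          pkt = rx_ring.pop(0)
          screend_queue.append(ip_forward(pkt))
          handled += 1
      if rx_ring:
          reschedule_poll()       # more work: poll again next timeslice
      else:
          enable_rx_interrupt()   # idle: fall back to interrupts

  # toy demo: 100 packets waiting, empty downstream queue
  ring, q = list(range(100)), []
  poll(ring, q)
  print(len(ring), "left in ring,", len(q), "queued for screend")

Note how this is a pull architecture: work advances only when the
thread (and, via feedback, the downstream consumer) has capacity.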
6.824 2004 Lecture 12: Performance (Part 2: Black Boxes)

We've talked about performance for single-machine/process systems --
what happens with multiple components?

Motivation:
  People tend to build systems out of components (web server, DB, etc.),
  often closed or very complex components
  Performance problems are difficult to diagnose: we don't want to open
  up every black box
  Example: a multi-tier web service with a front end, auth server, and DB;
  all we might know is that requests take 400ms on average

Contribution:
  Figure out which black boxes to open, reducing the time/cost of
  debugging the system
  Instead of focusing on instrumentation, use the minimum amount of
  information to infer what's going on
  Goal: find "high-impact" causal paths -- ones that occur frequently
  and have high latency
  Think of a hierarchical profiler: gprof for distributed systems

Basic idea: passively monitor messages such as RPCs (e.g. with tcpdump)
and try to figure out which components are slow, and when.

Diagram: three components A, B, C; calls a->b, b->c, c->b, b->a, with a
slight delay at each component

One approach, which uses call-return/RPC ids: the nesting algorithm
  pair call and return messages
  figure out nesting relationships
  reconstruct call paths
  (a minimal sketch appears at the end of these notes)
  Complication: parallel calls A->B, A->B, B->C, B->C -- which belongs
  to which? multiple options
  define heuristics: scoreboard the potential parallel calls, make a
  guess based on a histogram of delays

Another approach, without call-return semantics: create time series
  s1(t) = a->b messages
  s2(t) = b->c messages
  diagram with impulses at times 1,2,5,6,7 for s1 and 3,4,7,8,9 for s2
  convolution: x(t) = s2(t) [x] s1(-t)
  shows a peak at t=2, the causal delay
  an FFT computes it quickly and detects the time shift directly
  (second sketch at the end of these notes)
  inherently robust to noise, but periodic traffic causes problems
  Cool, but slow -- see the experimental results: 2 hours vs. 2 seconds!
  If you have call-return semantics, nesting seems like the way to go.

Practical concerns:
  clock sync: OK if skew is less than the minimum instrumentation delay
  message loss: OK up to 5%
  delay variance: robust up to 30% of the minimum period

Q: other problems? What about cache hits?
Q: How useful is this to a programmer? gprof tells you which functions
   to look at; can this approach?
Q: Is this the right level? Why not instrument the Java VM? There's an
   invasiveness tradeoff.

Conclusion: passively collected traces have enough information to infer
what's going on at a high level. Also, this was intended as a tool, not
a system for automagic optimization, so false positives/errors are OK.
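A minimal sketch of the nesting idea on a toy trace. The trace format
and the LIFO pairing rule are my assumptions for illustration; for
parallel calls the paper instead scoreboards the candidates and guesses
from delay histograms:

  # Pair calls with returns, then attribute a child call to the
  # enclosing parent call whose interval contains it.

  trace = [  # (time, src, dst, kind)
      (1.0, "A", "B", "call"),
      (1.2, "B", "C", "call"),
      (1.8, "C", "B", "return"),
      (2.0, "B", "A", "return"),
  ]

  # 1. Pair each return with an unmatched call on the same edge.
  open_calls, pairs = {}, []
  for t, src, dst, kind in trace:
      if kind == "call":
          open_calls.setdefault((src, dst), []).append(t)
      else:  # a return dst->src pairs with a call src->dst
          t0 = open_calls[(dst, src)].pop()   # LIFO guess (a heuristic)
          pairs.append((dst, src, t0, t))     # (caller, callee, start, end)

  # 2. Nest: (x->y) is a child of (w->x) if its interval lies inside.
  for caller, callee, s, e in pairs:
      for p in pairs:
          if p[1] == caller and p[2] < s and e < p[3]:
              print(f"{p[0]}->{p[1]} causes {caller}->{callee} "
                    f"(delay {s - p[2]:.1f}s)")

On this trace it reports that A->B causes B->C with a 0.2s delay, i.e.
the reconstructed causal path A -> B -> C.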
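The convolution approach is easy to sketch with numpy, using the
impulse times from the diagram above (the bin width is an assumed
parameter). Multiplying one FFT by the conjugate of the other computes
the cross-correlation x(t) = s2(t) [x] s1(-t) in O(n log n), and the
peak falls at the causal delay:

  import numpy as np

  BIN = 0.5                       # assumed time-series bin width (seconds)
  T = 12.0                        # trace length

  def series(times):
      s = np.zeros(int(T / BIN))
      for t in times:
          s[int(t / BIN)] += 1.0
      return s

  s1 = series([1, 2, 5, 6, 7])    # a->b messages
  s2 = series([3, 4, 7, 8, 9])    # b->c messages

  # x(t) = s2(t) [x] s1(-t), via FFTs; zero-pad to avoid wraparound
  n = 2 * len(s1)
  x = np.fft.irfft(np.fft.rfft(s2, n) * np.conj(np.fft.rfft(s1, n)), n)
  lag = np.argmax(x) * BIN
  print(f"estimated a->b delay at B: {lag:.1f}s")   # peak at t=2

Every s2 impulse trails its s1 impulse by 2 seconds, so the correlation
peaks at lag 2 -- the time shift the algorithm is after, found without
any call-return ids.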