"The Interaction of Architecture and Operating System Design" Anderson, Levy, Bershad, Lazowska ASPLOS 1991 Skip sections 3 (VM) and 4 (thread-switch) Why we're reading this paper: Digs into details of important kernel abstractions. Talks about how choice/design/impl of abstraction affects performance. It's a nice case study for *paper* design, we'll talk about paper structure When was the paper written? 1 year ago? 10? 20? 1991 ASPLOS Why does it matter when it was written? What kind of people wrote it -- CPU or O/S designers? O/S. They have much more to say about what CPU screws up than O/S. What do the authors want to convince us us? from abstract: 1. new CPUs are not getting the same speedup on O/S as on apps 2. new O/S are additionally structured in a way that makes them slower 3. the authors know why these trends are occuring So what should we be looking for as we read the paper? 1. evidence that newer CPUs are "slower" on OS than on apps 2. reasons, and evidence that trend might continue 3. evidence that newer OSes are slower than older ones 4. reasons, and evidence that trend might continue 5. evidence of larger significance (does anyone care?) 6. solutions You should figure out a list like this in your head as you read any abstract/intro to help guide your attention in the paper. Papers are long and hard to read, it pays to have a plan for what to skip and what to read carefully. And you don't want to let the abstract lull you into thinking that a paper proves a point that it doesn't prove. Let's make two tables, to fill in as we go along: CPU problems: large register sets (sparc windows) deep pipelines (88000 o/s must save 30 regs of pipeline state) no h/w vectoring (in MIPS) limited write buffers (R2000, fixed in R3000) cpu speed vs memory speed i860 page fault handler must interpret to find faulting addr caches that have to be flushed during address space switches O/S problems: None mentioned for Mach 2.5. Mach 3.0 does a lot of context switches. 
  Mach 3.0 can't make use of monster TLB entries.
  Mach 3.0 service processes must call the kernel for critical sections;
    the same code would have been in the Mach 2.5 kernel,
    which could just turn off interrupts.

Now let's look at the evidence they present: Table 1.
  Where do the "Time" numbers come from?
  Where do the "Relative Speed" numbers come from?
  What does Table 1 tell us?
    The O/S is not getting faster as fast as applications.
  In an ideal world, what would Table 1 look like?
  Do we believe their methodology?
    Did they run the same s/w on different h/w platforms?
    If they didn't, why do the results mean anything?
  Why do these numbers look this way?
    Why do RISC machines beat the CVAX for apps, at same-ish MHz?
    Why around 5x, not 2x or 20x?
    Are RISC machines literally slower than the CVAX for O/S primitives?
  So what's the problem?
    The high-level reason why RISC O/S performance is not keeping up:
    the CPU is optimized for executing a single application,
    not for switching between applications:
    big virtual cache, big register set, deep pipeline.

On my 1 GHz x86 laptop:
  getpid() takes half a microsecond
  a context switch (w/ a one-character pipe) takes about 4 microseconds
  it probably runs apps 500x as fast

Optional: What's a CISC machine? What's the CVAX?
  Not pipelined, like your first Beta implementation.
    How much performance win do you get w/ a pipeline?
    (It actually had a microcode-instruction pipeline?)
    (It's hard to get a balanced pipeline for VAX micro-instructions.)
  Slower clock speed (I don't know why).
  Only 16 registers (red herring?).
  Complex microcoded instructions (red herring?),
    i.e. many instructions take multiple cycles,
    but of course they do more work.

Have the authors proved their points yet? Can we stop reading after Table 1?

What's the function of Section 2? Why is it in the paper?
  A. Examine an O/S feature (RPC) whose performance might matter to us.
  B. Show that RPC performance suffers from adverse CPU trends (2.1, 2.2).
  C. Dig into the reasons: the CPU / O/S interaction (2.3, 2.4).

What are the steps required to send a local RPC from P1 to P2? (Table 4...)
  P1 makes a system call
  kernel copies data from P1?
  P1 sleeps in the kernel
  kernel switches to (waiting?) P2's kernel half
  kernel copies data to P2?
  return from P2's system call, into P2

We're looking for ways new CPUs support this less well than old ones:
  MHz: better.
  architecture: worse for the O/S.

What are the problems they mention? (2.3, Table 5, 2.4)
  large register sets (SPARC windows)
  deep pipelines (88000: O/S must save 30 registers of pipeline state)
  no h/w vectoring (in MIPS)
  limited write buffers (R2000, fixed in R3000)
  memory didn't get faster w/ new CPUs

How do they establish that the mentioned problems are actually responsible?
  For the most part they do not! Would this have been straightforward?

How about trends? Will O/S speed lag app speed even more w/ time?
  e.g. will the ratio of the O/S benchmark to the application benchmark
    continue to decrease?
  Let's look at the individual problems they cite and ask whether they
  will continue to get *worse* w/ time, or whether they were just
  one-time problems:
    are register sets getting larger? (no)
    are pipelines getting deeper? (yes)
    is the vectoring situation getting worse? (no)
    are write buffers getting more limited? (no; indeed, write-back caches &c)
    is memory b/w lagging CPU b/w? (probably, but not as much as latency)

What are they trying to show in Section 5?
  1. O/S primitives matter: a big fraction of real application run time.
  2. O/S trends are making performance worse.
  Do they show that O/S primitive performance matters?
    It doesn't matter if traps are slow if they are rare.
    15% to 20% of app run time on Mach 3.0 is due to O/S primitives.
    They don't show the % for Mach 2.5.
    Is 20% a lot? (13% for compile+link on my laptop.)
  Do they show that O/S trends are reducing performance?
    O/Ses are using traps &c more (microkernels).
    CPUs are making traps &c relatively more expensive.

What can we say about trends w/ hindsight?
  Has CPU evolution continued to slow down the O/S relative to apps?
  Has O/S evolution continued to increase O/S overhead?

What did we learn from this paper?
  Lots of performance details.
  Choice of O/S and CPU abstractions matters.
  A system-level view: combined CPU-O/S-application behavior.

Was it a good paper? Clearly written?
  Clear statement of goals/problem/method/ideas?

Does performance matter? When does performance matter?
  Google runs on 10,000 PCs...