"The Interaction of Architecture and Operating System Design" Anderson, Levy, Bershad, Lazowska ASPLOS 1991 Skip sections 3 (VM) and 4 (thread-switch) Why we're reading this paper: Digs into details of important kernel abstractions. Talks about how choice/design/impl of abstraction affects performance. It's a nice case study for *paper* design, we'll talk about paper structure When was the paper written? 1 year ago? 10? 20? 1991 ASPLOS Why does it matter when it was written? What kind of people wrote it -- CPU or O/S designers? O/S. They have much more to say about what CPU screws up than O/S. What do the authors want to convince us us? from abstract: 1. new CPUs are not getting the same speedup on O/S as on apps 2. new O/S are additionally structured in a way that makes them slower 3. the authors know why these trends are occuring So what should we be looking for as we read the paper? 1. evidence that newer CPUs are "slower" on OS than on apps 2. reasons, and evidence that trend might continue 3. evidence that newer OSes are slower than older ones 4. reasons, and evidence that trend might continue 5. evidence of larger significance (does anyone care?) 6. solutions You should figure out a list like this in your head as you read any abstract/intro to help guide your attention in the paper. Papers are long and hard to read, it pays to have a plan for what to skip and what to read carefully. And you don't want to let the abstract lull you into thinking that a paper proves a point that it doesn't prove. Let's make two tables, to fill in as we go along: CPU problems: large register sets (sparc windows) deep pipelines (88000 o/s must save 30 regs of pipeline state) no h/w vectoring (in MIPS) limited write buffers (R2000, fixed in R3000) cpu speed vs memory speed i860 page fault handler must interpret to find faulting addr caches that have to be flushed during address space switches O/S problems: None mentioned for Mach 2.5. Mach 3.0 does a lot of context switches. 
  Mach 3.0 can't make use of monster TLB entries.
  Mach 3.0 service processes must call the kernel for critical sections;
    the same code would have been in the Mach 2.5 kernel,
    which could just turn off interrupts.

Now let's look at the evidence they present: Table 1.
  Where do the "Time" numbers come from?
  Where do the "Relative Speed" numbers come from?
  What does Table 1 tell us?
    The O/S is not getting faster as fast as applications.
  In an ideal world, what would Table 1 look like?
  Do we believe their methodology?
    Did they run the same s/w on different h/w platforms?
    If they didn't, why do the results mean anything?
  Why do these numbers look this way?
    Why do RISC machines beat the CVAX for apps, at same-ish MHz?
    Why around 5x, not 2x or 20x?
    Are RISC machines literally slower than the CVAX for O/S primitives?
  So what's the problem?
    The high-level reason why RISC O/S performance is not keeping up:
    the CPU is optimized for executing a single application,
    not for switching between applications:
    big virtual cache, big register set, deep pipeline.

On my 1 GHz x86 laptop:
  getpid() takes half a microsecond
  a context switch (w/ a one-character pipe) takes about 4 microseconds
  it probably runs apps 500x as fast

Optional: What's a CISC machine? What's the CVAX?
  Not pipelined, like your first Beta implementation.
    How much performance win do you get w/ a pipeline?
    (It actually had a microcode-instruction pipeline?)
    (It's hard to get a balanced pipeline for VAX micro-instructions.)
  Slower clock speed (I don't know why).
  Only 16 registers (red herring?).
  Complex microcoded instructions (red herring?),
    i.e. many instructions take multiple cycles,
    but of course they do more work.

Have the authors proved their points yet? Can we stop reading after Table 1?

What's the function of Section 2? Why is it in the paper?
  A. Examine an O/S feature (RPC) whose performance might matter to us.
  B. Show that RPC performance suffers from adverse CPU trends (2.1, 2.2).
  C. Dig into the reasons: the CPU / O/S interaction (2.3, 2.4).

What are the steps required to send a local RPC from P1 to P2? (Table 4...)
  P1 makes a system call
  kernel copies data from P1?
  P1 sleeps in the kernel
  kernel switches to (waiting?) P2's kernel half
  kernel copies data to P2?
  return from P2's system call, into P2

We're looking for ways new CPUs support this less well than old ones:
  MHz: better.
  architecture: worse for the O/S.

What are the problems they mention? (2.3, Table 5, 2.4)
  large register sets (SPARC windows)
  deep pipelines (88000: O/S must save 30 registers of pipeline state)
  no h/w vectoring (in MIPS)
  limited write buffers (R2000, fixed in R3000)
  memory didn't get faster w/ new CPUs

How do they establish that the mentioned problems are actually responsible?
  For the most part they do not! Would this have been straightforward?

How about trends? Will O/S speed lag app speed even more w/ time?
  e.g. will the ratio of the O/S benchmark to the application benchmark
    continue to decrease?
  Let's look at the individual problems they cite and ask whether they
  will continue to get *worse* w/ time, or whether they were just
  one-time problems:
    are register sets getting larger? (no)
    are pipelines getting deeper? (yes)
    is the vectoring situation getting worse? (no)
    are write buffers getting more limited? (no; indeed, write-back caches &c)
    is memory b/w lagging CPU b/w? (probably, but not as much as latency)

What are they trying to show in Section 5?
  1. O/S primitives matter: a big fraction of real application run time.
  2. O/S trends are making performance worse.
  Do they show that O/S primitive performance matters?
    It doesn't matter if traps are slow if they are rare.
    15% to 20% of app run time on Mach 3.0 is due to O/S primitives.
    They don't show the % for Mach 2.5.
    Is 20% a lot? (13% for compile+link on my laptop.)
  Do they show that O/S trends are reducing performance?
    O/Ses are using traps &c more (microkernels).
    CPUs are making traps &c relatively more expensive.

What can we say about trends w/ hindsight?
  Has CPU evolution continued to slow down the O/S relative to apps?
  Has O/S evolution continued to increase O/S overhead?

What did we learn from this paper?
  Lots of performance details.
  Choice of O/S and CPU abstractions matters.
  A system-level view: combined CPU-O/S-application behavior.

Was it a good paper? Clearly written?
  Clear statement of goals/problem/method/ideas?

Does performance matter? When does performance matter?
  Google runs on 10,000 PCs...