"Flash...", by Pai, Druschel, and Zwaenepoel, USENIX 1999

What are the paper's main claims?
  A new web server architecture, AMPED:
    efficient for cached workloads
    efficient for disk-bound workloads

Does anyone care about web server efficiency?
  You do if you have thousands of servers.

What is the basic AMPED idea?
  [draw picture w/ helpers]

What operations does the helper perform?
  disk read()
  how about open()? stat()?

Why is cached vs. disk a big issue?

Where might the cache be?
  The O/S disk block cache, or
  one maintained by the web server[s].
  Flash seems to use both.
  But Flash-MP has a *mapped* file cache, so shared?

How does the main process ask a helper to read data?
  A pipe. Why? It's select()able.

How does the helper tell the main process the read is done?

How does the main process actually get the file data,
after the helper causes it to be read in?
  It's implicit -- the main process mmap()s the file and
  assumes the pages will be resident.

What is mmap()? mincore()?

What's a reasonable number of helpers?
  One or two per disk?
  Many per disk, for disk-arm scheduling?

Other techniques they discuss:
  MP
  MT (user threads? kernel threads?)
  SPED (event-driven, like the 2nd lab)
  Apache == MP
  Zeus == SPED

What performance do we expect?
  Disk-bound: AMPED > MT > MP/Apache >> SPED/Zeus
  Cacheable:  SPED/AMPED/Zeus > MT > MP/Apache

What's the test setup?
  A real server and lots of clients.
  How many clients? Is one enough?
  Clients run fake web browsers that issue concurrent requests.

Figure 6: why does bandwidth go up with file size?
  What's the limiting factor for small files?
    Disk? Net? RAM? I/O bus? CPU? The clients' ability to generate requests?
  What's the limiting factor for large files?
  Why does the curve have the shape it does?
    x = file size
    a = time to process a zero-length request
    b = big-file serving rate, in bytes per second
    y = bytes/time = x / (a + x/b)
  What are a and b?
    Figure 6(b) suggests a is about 1 millisecond.
    Figure 6(a) suggests b is about 100 mbits/second.

What new information does Figure 6(b) contain?
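The pipe/select/mmap plumbing described above can be sketched concretely. This is a minimal single-file, single-helper demo (POSIX only), not Flash's actual code: the path, the 'D' completion byte, and the serve_one helper are all invented for illustration, and the mincore() residency check is omitted because Python does not expose it.

```python
# Sketch of the AMPED split, assuming POSIX: the main process mmap()s
# the file and select()s on a pipe; a forked helper does the blocking
# read() that pulls the pages into the OS cache, then writes a
# one-byte completion message on the pipe.
import mmap
import os
import select

def serve_one(path):
    """Return file contents via the main-process/helper split."""
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:                      # helper process
        with open(path, "rb") as f:
            f.read()                  # blocking read warms the OS cache
        os.write(wfd, b"D")           # completion message on the pipe
        os._exit(0)

    # Main process: wait with select(), just as it would for sockets.
    ready, _, _ = select.select([rfd], [], [])
    assert rfd in ready and os.read(rfd, 1) == b"D"
    os.waitpid(pid, 0)
    os.close(rfd)
    os.close(wfd)

    # Pages are now (implicitly) resident; touch them via the mapping,
    # never via read() in the main process.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
            return bytes(m)

if __name__ == "__main__":
    path = "/tmp/amped_demo.txt"      # invented demo file
    with open(path, "wb") as f:
        f.write(b"hello from disk")
    print(serve_one(path))
    os.unlink(path)
```

Flash's real main process multiplexes many such helper pipes plus all the client sockets in one select() loop; forking a fresh helper per request here is only for brevity.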
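The curve model y = x / (a + x/b) can be sanity-checked numerically with the constants read off Figure 6 (a of about 1 ms, b of about 100 mbits/second); the particular file sizes below are my own picks, not the paper's data points.

```python
# Back-of-the-envelope check of the bandwidth-vs-file-size curve:
# a = per-request processing time, b = big-file serving rate.
def mbits_per_sec(file_bits, a_sec=0.001, b_bits_per_sec=100e6):
    """Delivered bandwidth y = x / (a + x/b), in mbits/second."""
    return file_bits / (a_sec + file_bits / b_bits_per_sec) / 1e6

for kbytes in (1, 10, 100, 1000):
    x = kbytes * 8192                 # file size in bits
    print(f"{kbytes:5} kbytes -> {mbits_per_sec(x):6.1f} mbits/s")
```

Small files are dominated by a (a 1-kbyte file comes out under 10 mbits/second), large files approach b. The 10-kbyte row lands around 45 mbits/second, the same ballpark as Figure 8's roughly 40 mbits with roughly 10-kbyte average files, consistent with the notes' first explanation of that figure.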
  Request rate = 1 / (a + x/b); for small x this is about 1/a,
  abstracting away b -- so less information, but it shows the
  small-file behavior more clearly.

Why is there no MT line in Figure 7?

Why is FreeBSD faster than Solaris?
  Same hardware...
  Solaris is a commercial O/S; you'd expect it to be faster.

Why does the paper present Figures 6 and 7?
  Is the workload realistic? No: only one file, no disk...
  What have we learned? Apache is slow.
  What would we still like to learn about?
    Disk-bound performance.
    "Realistic" performance with a typical mix of big/small, cached/disk.
    The effect of various parameters (memory size, # of processes, &c).

Why don't they show us a simple disk-only graph like Figure 6?
  How could we force *all* requests to use the disk?
  Would we want to force all path-name lookups to use the disk too?
  What would we learn from a true disk-bound experiment?
    Probably that all servers are the same; we'd learn # of disk reads per request.
  The best we can do is serve enough files that they don't fit in the cache.
  Thus a mixed cache/disk workload.

Why is performance only 40 mbits in Figure 8, when it was ca. 100 in Figure 6?
  The average file size is apparently 10 kbytes,
  or there are too many files to fit in the cache. They don't tell us.

What can we conclude from Figure 8?
  Realistic traces. Flash is a bit faster, but not radically.
  Presumably this is a mix of cached/disk requests.
  But the actual mix is not known, so we don't really know what we're testing.

How do Figures 9 and 10 shed light on cached/disk performance?
  By varying the data set size, they control how well the data fits
  in the ~100 MB disk cache.
  How do they vary the data set size?
  How does that affect cache vs. disk?

Why is there a discontinuity at around 100 mbytes in Figure 9?
  Why at 100 mbytes, not 50 mbytes or 200 mbytes?

Why is bandwidth around 50..100 mbits for large data set sizes?
  How many requests per second? 500 to 1000, assuming 40-kilobit files.
  Is this workload disk-bound?
    What would bandwidth be if disk-bound? 8 mbits/second...
  What's the cache hit rate?
    The hit rate must be around 90%.
    That is, 10 files served per 10-millisecond disk seek (i.e. per miss),
    or 1000 files per second.

Do they in fact ever evaluate disk-bound behavior?

Figures 9/10, Flash vs. MP:
  Why does Flash beat MP for small data sets? (MP has a partitioned cache.)
  Why does Flash beat MP for large data sets? (Event-driven is more efficient.)

Flash vs. SPED:
  Why are Flash and SPED close for small data sets?
  Why does Flash beat SPED for large data sets?

Flash vs. MT (Figure 10):
  Flash and MT have about the same behavior for all data set sizes. Why?
  What does this mean w.r.t. whether Flash is worthwhile?

At the right of Figure 9, why is MP < SPED?
  We expect MP to get more I/O concurrency with a disk-heavy workload,
  so maybe the user-level cache is small in MP?

Cynical view: just use MT, not Flash.
Practical view: Flash is far easier to implement than kernel-supported
  threads! A much better use of programmer time.
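The hit-rate arithmetic above (one disk seek per miss, roughly 90% hits) is easy to check numerically. The 10 ms seek time and 40-kilobit file size are the notes' assumptions, and requests_per_sec is an invented helper.

```python
# If the disk is the only bottleneck and each cache miss costs one
# seek, the sustainable request rate is 1 / (miss_rate * seek_time).
def requests_per_sec(miss_rate, seek_sec=0.010):
    """Disk-limited request rate, one seek per cache miss."""
    return 1.0 / (miss_rate * seek_sec)

file_bits = 40_000                    # ~40-kilobit average file (Figure 9)
for miss in (1.0, 0.10):              # fully disk-bound vs. 90% hit rate
    rps = requests_per_sec(miss)
    print(f"miss rate {miss:4.0%}: {rps:6.0f} req/s, "
          f"{rps * file_bits / 1e6:5.1f} mbits/s")
```

With these numbers a fully disk-bound server delivers only about 4 mbits/second (the notes' 8 mbits/second figure implies a somewhat faster assumed seek), while a 90% hit rate yields about 1000 requests/second and tens of mbits/second, which is what Figure 9 actually shows for large data sets.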