6.828 2011 Lecture 12: Rethink the Sync

Rethink the Sync, Nightingale, Veeraraghavan, Chen, and Flinn, OSDI 2006.

why are we reading Rethink the Sync?
  uses ext3, and sheds more light on FS consistency
  ext3 defends FS internal consistency (no dirent -> unallocated inode &c)
  but that is not enough! apps and external users also need consistency
  ext3 may hurt here: delayed commit

xsyncfs's promises to external user
  correctness of synchronous FS updates
  performance of asynchronous FS

what is a sync FS?
  a sync FS forces updates to disk before returning from syscall, like xv6
  if syscall returns, its effects will be visible after a crash
  write-through disk cache in kernel, so slow

why is sync FS a good standard for correct FS behavior?
  maybe since sync FS easy to reason about?
  can count on all ops completed before crash being visible after crash
    e.g. write data to file, print "OK"
  no-one uses a sync FS or writes apps that assume one!
  it's fine for the paper to use sync FS as standard for safety
    but odd that they compare against it for performance

what does paper mean by an async FS?
  system calls can return before data is on disk
  syscalls modify blocks in a write-back cache
  but then FS has to worry about internal consistency
  paper really means FS w/ async logging, in particular Linux's ext3

why is not-durable a problem for external users?
  mail server says "OK" but then crash erases its copy of msg
    the OK causes the sending server to delete, so now msg entirely gone
  or
    % cp * backupdir
    %
    is it ok now to edit the original files?

how do real systems currently work?
  async logging FS, app calls fsync()
  high-performance by default, but durable when needed
  example:
    mail server accepts msg, writes a few files, fsync()s each one, returns ack msg
  example:
    you want to replace a file w/o losing it
    can't just creat() and write() -- what if crash after creat()?
    write tmp file, fsync(), rename()
    fsync must come before rename!
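
The "write tmp file, fsync(), rename()" recipe above can be sketched in Python as follows. This is a minimal illustration, not code from the paper or lecture: the function name atomic_replace and the ".tmp-" prefix are made up for the example. It assumes a POSIX-like system where renaming within one directory is atomic.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the file at path with data (bytes) via tmp file + fsync + rename.

    After a crash, path should name either the complete old contents or the
    complete new contents, never a truncated or half-written file.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename never crosses
    # file systems. Note that mkstemp creates it with owner-only permissions.
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # move Python's user-space buffer into the kernel
            os.fsync(f.fileno())  # force the new data to disk BEFORE the rename
        os.replace(tmp_path, path)  # rename the tmp file over the old name
    except BaseException:
        # On failure, leave the original file alone and remove the temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the fsync() were moved after the rename, a crash could leave the new name pointing at a file whose data never reached the disk, which is the case the notes warn about.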

fsync() is often a limiting factor in performance
  fsync() takes dozens of ms
  fsync() prevents batching
    unnecessarily slow if you have many files
    one at a time! no easy way to do batch fsync()
  overkill if you just wanted order
    e.g. we wanted data on disk before rename()
    but it doesn't have to happen now!
  Microsoft Word on my Mac calls fsync() 50 times while saving a file
  adding an iCal entry: 75 fsync()s
  write; fsync; write; fsync; write; fsync; print "done"

what is xsyncfs's insight to get performance + durability?
  "external synchrony": updates only need to be durable by the time of external output
  if user sees no output, no reason to expect FS I/O has finished
  thus:
    most ops (incl fsync) purely in cache
    then commit entire batch -> good performance
  automatic detection of when disk write is really needed!
  no application changes!

which data has to be durable before external output?
  perhaps all FS updates of process producing the output
  also all updates "causally preceding" the external output
  example:
    X : write(file) -> exit()
    Y :                       -> wait() -> printf("$ ")
    process X's file write must be on disk before user sees Y's "$ "

why is durability of causally preceding updates useful to external users?
  if mail sender sees "OK" msg via net, guaranteed msg is on rcvr's disk
    no "OK" -> recovery by resending later
  % cp * backupdir
  %
    if you see the 2nd prompt, you know backup is complete
    if crash before 2nd prompt, user "recovers" by backing up again
  seems to be sufficient in these two cases

can we argue that no non-causally-preceding write is necessary?
  i.e. is the causal rule always sufficient?
  example:
    read mail msg from socket
    fork a child
    child writes msg to a file
    parent writes "OK" to socket
  so: sufficient only if application is written carefully

is it necessary for xsyncfs to force all causally preceding writes?
  i.e. is the causal rule necessary?
  examples when maybe app doesn't need a causally preceding write forced?

will xsyncfs have good performance?
  after all, xsyncfs will have to write fairly often
  can use write-back cache, coalesce writes
  external output forces batched disk update
  will be fast if many updates to same disk block

will xsyncfs have higher performance than async logging?
  i.e. ordinary ext3, commit xaction every 5 seconds
  if app doesn't fsync() much, xsyncfs may be slower:
    xsyncfs has causal tracking overhead
    xsyncfs might force to disk more often than every 5 seconds
  if app fsync()s a lot, xsyncfs may be faster:
    fewer commits since xsyncfs can ignore fsync()s
  example of xsyncfs beating ext3:
    mysql client does multiple SQL xactions before external output
    mysql server must fsync() after each xaction for D in ACID
    but xsyncfs knows: no external output -> can ignore fsync()

what are the hard parts of xsyncfs design / impl?
  tracking causality
    each process and kernel object (pipe) has list of not-yet-written blocks
    copy that list when a process reads/updates kernel object
  buffering external output, e.g. display writes and network packets
    why useful to buffer? why not just suspend process?
  forcing disk writes on external output
    track down all causally preceding blocks
    ask ext3 to force them to log on disk
    wait, then release output

what does that involve? can you ask ext3 to log just particular blocks?
  no: must commit entire ext3 transaction
  may include unrelated updates

given that ext3 won't let you force an individual block
  why does xsyncfs have to do all that causal tracking?
  why not just commit ext3 xaction when program does external output?

this eliminates a potential advantage of xsyncfs over async logging FS
  in principle, if there are multiple unrelated apps,
    xsyncfs only needs to force causally related updates
    not entire log, which is what ordinary ext3 fsync() does
  in practice, xsyncfs forces whole ext3 log

xsyncfs evaluation

what do they need to demonstrate?
  that the problem they're solving exists
  that they achieve the main goals stated at the start of the paper
  #0. ext sync is a good match for application needs
  #1. current practice is slow, or
  #2. current practice is not very durable
  #3. xsyncfs as durable as a synchronous file system
  #4. xsyncfs as fast as an asynchronous file system

they don't really investigate #0
  maybe true that ext sync's causality is sufficient
    though causality is lost if fork() and no wait()
  ext sync's causality not always necessary
    more restrictive than carefully fsync()ing app
    e.g. log file writes maybe don't need to be durable

figure 3: durability (show #2 and #3)
  very cool that they systematically test crashes
  method:
    do lots of write()s and send a network msg *after* each one
    crash at some point
    are all writes on disk for which net msg was received?
  "Asynchronous" is ordinary ext3, force every 5 seconds or on fsync
  "Synchronous" is force ext3 log after every system call
  "Sync w/ write barriers" fixes IDE write caching
  "External synchrony" is xsyncfs (w/ async ext3 and write barriers)
  questions:
    what does "not durable" mean in the table?
      (e.g. O_SYNC write but no data)
    why does async ext3 lose write() data?
    why does async ext3 lose fsync() data?
    why does sync ext3 lose any data?
  it's misleading for paper to imply xsyncfs has better durability than ext3
    their situations are identical: durable only if disk behaves

figure 4: single-threaded read/write/create/delete benchmark
  no external output
  lots of FS I/O
  would we expect xsyncfs to be faster than async ext3?
    why or why not?
  shows #4
  why is ext3-sync so slow?
    does this demonstrate #1?
  why ext3-barrier slower than ext3-sync?
  why ext3-sync slower than ext3-async?
    (after all, writing to on-disk cache)

figure 6: mysql, vs async ext3 with write barriers
  for this experiment, async ext3 w/ write barriers (finally)
  mysql fsync()s log after each SQL transaction
  so both systems are arguably equally durable
  why does xsyncfs win with few threads?
  why does ext3 catch up with many threads?
  would xsyncfs win if client on diff host than mysql server?
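
The mysql case (several SQL xactions, each fsync()ed, before any external output) can be illustrated with a toy count of disk commits. This is only a sketch under simplifying assumptions: the trace format and the function count_commits are hypothetical. The model ignores ext3's periodic 5-second commit and all timing, and it says nothing about the paper's measured numbers.

```python
def count_commits(ops, mode):
    """Count disk commits for a trace of operations under a toy model.

    ops:  list of "write", "fsync", or "output" strings (hypothetical format).
    mode: "fsync"   -> commit at every fsync() that has dirty data pending
                       (async logging FS plus an fsync()ing application);
          "extsync" -> treat fsync() as a no-op and commit only when
                       external output would expose uncommitted data.
    """
    commits = 0
    dirty = False
    for op in ops:
        if op == "write":
            dirty = True
        elif op == "fsync" and mode == "fsync" and dirty:
            commits += 1
            dirty = False
        elif op == "output" and mode == "extsync" and dirty:
            commits += 1
            dirty = False
    return commits


# Ten SQL transactions (one write + one fsync each), then one visible reply.
trace = ["write", "fsync"] * 10 + ["output"]
print(count_commits(trace, "fsync"))    # 10
print(count_commits(trace, "extsync"))  # 1
```

In this model the gap shrinks as external output becomes more frequent: with an "output" after every transaction, both modes commit ten times.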

figure 7: web server, not much difference, why?
  what does specweb99 do?

what is the bottom line?
  external synchrony is an interesting idea
  certainly fsync() is a flawed mechanism
  mysql example particularly compelling
  paper does a poor job of comparing against current practice
    so hard to tell if xsyncfs would do a lot of good
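
As a closing illustration, here is a toy model of the causal tracking described under the hard parts of the design: each process and kernel object carries a set of not-yet-written blocks, the set is copied on reads and updates, and external output forces a commit first. All names here (Node, ToyKernel, the block IDs) are invented for the sketch. It is not xsyncfs's implementation, and it commits synchronously instead of buffering the output and letting the process run on.

```python
class Node:
    """A process or a kernel object (e.g. a pipe) in the toy model."""

    def __init__(self, name):
        self.name = name
        self.deps = set()  # IDs of uncommitted blocks this node depends on


class ToyKernel:
    def __init__(self):
        self.dirty = set()   # blocks modified in cache, not yet on disk
        self.commits = 0     # number of forced commits so far
        self.released = []   # external output that has been let out

    def write_block(self, proc, block):
        # A file write stays in the cache; the writer now depends on it.
        self.dirty.add(block)
        proc.deps.add(block)

    def send(self, proc, obj):
        # Updating a kernel object copies the process's dependencies to it.
        obj.deps |= proc.deps

    def receive(self, proc, obj):
        # Reading a kernel object copies its dependencies to the reader.
        proc.deps |= obj.deps

    def external_output(self, proc, text):
        # Only output that causally follows uncommitted writes forces a commit.
        if proc.deps & self.dirty:
            # As with ext3, the whole transaction is committed, not just
            # the blocks this process depends on.
            self.dirty.clear()
            self.commits += 1
        proc.deps.clear()
        self.released.append(text)


k = ToyKernel()
x, y, z, pipe = Node("X"), Node("Y"), Node("Z"), Node("pipe")
k.write_block(x, "file-block-7")
k.external_output(z, "unrelated")  # Z depends on nothing: no commit forced
k.send(x, pipe)                    # stands in for X's exit()
k.receive(y, pipe)                 # stands in for Y's wait()
k.external_output(y, "$ ")         # X's write is committed before "$ " appears
```

This mirrors the X/Y example from the notes: Y never wrote the file itself, but it inherits the dependency through the kernel object, so its prompt is held until X's write is on disk.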