File System Performance and Durability
Required reading:
Rethink the Sync.
why are we reading Rethink the Sync?
uses ext3, and sheds more light on FS consistency
ext3 defends FS internal consistency (no dirent -> unallocated inode &c)
but that is not enough!
apps and external users also need consistency
ext3 may hurt here: delayed commit
xsyncfs's promises to external user
correctness of synchronous FS updates
performance of asynchronous FS
what is a sync FS?
a sync FS forces updates to disk before returning from syscall, like xv6
if syscall returns, its effects will be visible after a crash
write-through disk cache in kernel, so slow
why is sync FS a good standard for correct FS behavior?
maybe since sync FS easy to reason about?
can count on all ops completed before crash being visible after crash
e.g. write data to file, print "OK"
no-one uses a sync FS or writes apps that assume one...
in real life: write(), fsync(), print "OK"
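A minimal sketch of that real-life pattern, using Python's os wrappers around the same syscalls (the path and message are made up):

```python
import os, tempfile

# The careful-application pattern: write(), fsync(), then "OK".
# On an async FS, only after fsync() returns is the data known to be
# on disk, so printing "OK" any earlier could mislead an external
# observer into discarding their copy.
path = os.path.join(tempfile.mkdtemp(), "msg")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"mail message\n")
os.fsync(fd)        # force the data to disk before acknowledging
os.close(fd)
print("OK")         # only now safe to acknowledge externally
```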
what does paper mean by an async FS?
system calls can return before data is on disk
syscalls modify blocks in a write-back cache
but then FS has to worry about internal consistency
paper really means FS w/ async logging, in particular Linux's ext3
why is not-durable a problem for external users?
mail server says "OK" but then crash erases its copy of msg
the OK causes the sending server to delete, so now msg entirely gone
or
% cp * backupdir
%
is it ok now to edit the original files?
currently you have to choose between durability and performance
slow sync fs, or fast non-durable async logging
or remember to call fsync(), maybe call it too much
what's the paper's insight to get performance + durability?
sufficient definition of durability:
updates only need to be durable by the time of external output
display output, network packet
if user sees no output, no reason to expect FS I/O has finished
which data has to be durable before external output?
perhaps all FS updates of process producing the output
also all updates "causally preceding" the external output
example:
X : write(file) -> exit()
Y : -> wait() -> printf("$ ")
process X's file write must be on disk before user sees Y's "$ "
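The X/Y example can be sketched as follows (Python os wrappers; this only shows the causal chain -- under xsyncfs the kernel, not the app, would force X's write to disk before releasing the prompt):

```python
import os, sys, tempfile

# wait() observing the child's exit() creates the causal edge: the
# parent's "$ " output causally follows the child's file write.
path = os.path.join(tempfile.mkdtemp(), "f")
pid = os.fork()
if pid == 0:                        # X
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"child data\n")   # X: write(file)
    os.close(fd)
    os._exit(0)                     # X: exit()
os.wait()                           # Y: wait() -- observes X's exit
sys.stdout.write("$ ")              # Y: external output
```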
why is durability of causally preceding updates useful to external users?
if mail sender sees "OK" msg via net, guaranteed msg is on rcvr's disk
no "OK" -> recovery by resending later
% cp * backupdir
%
if you see the 2nd prompt, you know backup is complete
if crash before 2nd prompt, user "recovers" by backing up again
seems to be sufficient in these two cases
can we argue that no non-causally-preceding write is necessary?
i.e. is the causal rule always sufficient?
example:
read mail msg from socket
fork a child
child writes msg to a file
parent writes "OK" to socket
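A sketch of that carelessly written app (the parent never wait()s, so the child's write does not causally precede the "OK"):

```python
import os, tempfile

# No wait() before the ack: there is no causal edge from the child's
# file write to the "OK", so xsyncfs would not force the msg to disk
# before releasing the output, and a crash could still lose the msg.
path = os.path.join(tempfile.mkdtemp(), "msg")
pid = os.fork()
if pid == 0:                    # child: writes msg to a file
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"mail message\n")
    os.close(fd)
    os._exit(0)
print("OK")                     # parent: acks before write is durable
os.waitpid(pid, 0)              # reap the child (cleanup only)
```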
so: sufficient only if application is written carefully
is it necessary for xsyncfs to force all causally preceding writes?
    i.e. is the causal rule necessary?
examples when maybe app doesn't need a causally preceding write forced?
will xsyncfs have higher performance than sync updates?
after all, xsyncfs will have to write fairly often
can use write-back cache, coalesce writes
external output forces batched disk update
will be fast if many updates to same disk block
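A toy write-back cache model (my illustration, not xsyncfs code) of why coalescing helps: many updates to the same block cost one disk write at output time:

```python
cache = {}          # block number -> latest contents (dirty blocks)
disk_writes = 0     # count of blocks actually written to disk

def write_block(bno, data):
    cache[bno] = data           # update the cache only; no I/O yet

def external_output():
    global disk_writes
    disk_writes += len(cache)   # flush each dirty block exactly once
    cache.clear()

for i in range(1000):
    write_block(42, i)          # 1000 updates, all to block 42
external_output()               # external output forces batched update
# all 1000 updates coalesce into a single disk write
```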
will xsyncfs have higher performance than async logging?
i.e. ordinary ext3, commit xaction every 5 seconds
usually no:
xsyncfs has causal tracking overhead
xsyncfs might force to disk more often than every 5 seconds
when might xsyncfs have *higher* performance than async FS?
mysql example
client does multiple SQL xactions before external output
mysql server must fsync() after each xaction for D in ACID
but xsyncfs knows: no external output -> can ignore fsync()
what guarantees does xsyncfs give to applications?
paper implies apps written for sync file system will work fine
xsyncfs guarantees ordering: section 2.2, end of page
      so like a sync FS in which the crash happened slightly earlier
your app still needs to be able to recover from a crash
      at any point, can rely on all syscalls up to a certain point being visible after the crash
examples of why prefix property is useful in app recovery?
what are the hard parts of xsyncfs design / impl?
tracking causality
each process and kernel object (pipe) has list of not-yet-written blocks
copy that list when a process reads/updates kernel object
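A toy model of that tracking (my simplification -- real xsyncfs integrates with ext3's transaction state): each process or kernel object carries a set of not-yet-written blocks, and interacting through an object merges the sets:

```python
deps = {}   # process/object name -> set of not-yet-written blocks

def taint(obj, block):
    # obj performed an FS update that dirtied this block
    deps.setdefault(obj, set()).add(block)

def interact(a, b):
    # a reads/updates kernel object b (e.g. a pipe): conservatively,
    # afterwards each side depends on everything the other depended on
    merged = deps.get(a, set()) | deps.get(b, set())
    deps[a], deps[b] = set(merged), set(merged)

def must_flush(obj):
    # blocks that must be durable before obj may emit external output
    return deps.get(obj, set())

taint("procX", 7)           # X's write(file) dirtied block 7
interact("procX", "pipe")   # X writes to a pipe
interact("pipe", "procY")   # Y reads from the pipe
# now procY's external output requires block 7 on disk first
```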
buffering external output, e.g. display writes and network packets
why useful to buffer? why not just suspend process?
forcing disk writes on external output
track down all causally preceding blocks
ask ext3 to write them
wait, then release output
what does that involve?
can you ask ext3 to write just a particular block?
no: xsyncfs must force whole log, including unrelated updates
given that ext3 won't let you write an individual block
why do they have to do all that causal tracking?
why not just commit ext3 xaction when program does external output?
this eliminates a potential advantage of xsyncfs over async logging FS
in principle, if there are multiple unrelated apps,
xsyncfs only needs to force causally related updates
not entire log, which is what ordinary ext3 fsync() does
in practice, xsyncfs forces whole ext3 log
xsyncfs evaluation
what do they need to demonstrate?
that the problem they're solving exists
that they achieve the main goals stated at the start of the paper
    #0. apps want sync semantics, happy with ext sync (external synchrony) as well
#1. synchronous file systems are slow (i.e. problem exists)
#2. asynchronous file systems aren't very durable (ditto)
#3. xsyncfs as durable as a synchronous file system
#4. xsyncfs as fast as an asynchronous file system
they don't really investigate #0
safe bet that sync is sufficient
      maybe true that ext sync's causality is sufficient, though less clear
e.g. causality is lost if fork() and no wait()
    sync for sure not necessary
many apps written w/ fsync() &c to get speed and just the right durability
e.g. mail servers, databases
    ext sync's causality not always necessary
more restrictive than carefully fsync()ing app
e.g. log file writes maybe don't need to be durable
figure 3: durability
(show #2 and #3)
very cool that they systematically test crashes
async is ordinary ext3, force every 5 seconds or on fsync
sync is force ext3 log after every system call
sync + write barriers fixes ide write caching
xsyncfs uses ext3 and write barriers
questions:
what does "not durable" mean in the table? (e.g. O_SYNC write but no data)
why does async ext3 lose write() data? (duh)
why does async ext3 lose fsync() data?
why does sync ext3 lose any data?
comparison unfair: async ext3 should be using write barriers
that is how it was designed to be used
then it would be as durable (and as high performance) as xsyncfs
figure 4:
single-threaded read/write/create/delete benchmark
no external output
lots of FS I/O
would we expect xsyncfs to be faster than async ext3?
why or why not?
shows #4
why is sync ext3 so slow?
shows #1
why ext3-barrier slower than ext3-sync?
why ext3-sync slower than async? (after all, writing to on-disk cache)
figure 6: mysql, vs async ext3 with write barriers
mysql fsync()s log after each SQL transaction
so both systems are arguably equally durable
why does xsyncfs win with few threads?
why does ext3 catch up with many threads?
would xsyncfs win if client on diff host than mysql server?
figure 7: web server, not much difference, why?
what does specweb99 do?
what is the bottom line?
should linux switch from ext3 to xsyncfs?