High-performance File Systems
Required reading: The
Soft Updates paper.
intro: disk i/o a huge bottleneck in many real systems
i've been involved in a bunch of production web sites
none w/ more than a modest workload
all have had perf problems, #1 culprit is disk / file system
why is disk often a problem?
everything else has gotten faster!
25 years: CPU 1000x, net 100x, disk seek 3x
seek takes 5 or 10ms == 10 million CPU instructions!
sequential disk thruput 50x, but difficult to exploit
read workloads often not cacheable -- big data, random or long tail
writes: write-back caching vs durability/recoverability
today's topic
high level:
if each thing takes one random disk I/O,
your one-disk server can only do 100 things per second
watch bwrite printfs from xv6
$ echo > x
bwrite 4 ialloc inum 18
bwrite 4 iupdate inum 18
bwrite 29 writei inum 1 *
$ echo z > x
bwrite 28 balloc bn 337
bwrite 337 writei inum 18
bwrite 4 iupdate inum 18
bwrite 337 writei inum 18
bwrite 4 iupdate inum 18
$ rm x
bwrite 29 writei inum 1 **
bwrite 4 iupdate inum 18
bwrite 337 bzero
bwrite 28 bfree bn 337
bwrite 4 iupdate inum 18
bwrite 4 iupdate inum 18
$ echo > y
bwrite 4 ialloc inum 18
bwrite 4 iupdate inum 18
bwrite 29 writei inum 1 **
$
wow, that's slow; at 10ms per i/o can only write 16 small files/sec
"synchronous writes" or "synchronous meta-data updates"
esp for e.g. un-tar or rm *
idea for improving speed
use a write-back disk cache
write LRU block to disk when cache is full
would that improve performance? why, exactly?
would we regret this scheme?
what can go wrong w/ badly ordered writes?
an easy case: * happens, but crash before two other bwrites
fsck can fix this, since i-node is marked free
suppose only the ** writes occur, then a crash
can fsck fix this?
two rules for recoverable file system disk writes
1. initialize i-node before creating dirent
2. erase dirent before re-using i-node
revisit xv6 order
does it look correct?
does creat() initialize i-node before creating dirent?
does unlink() remove dirent before freeing i-node?
idea for speed *and* correctness:
don't re-use i-nodes while on-disk dirent points to them
write-back cache for speed
record block-block dependencies
write-back to disk in order
this works in easy cases
it is correct for e.g. creating lots of files in a directory
why doesn't this always work?
example cycle:
(file /a already exists w/ i-num 19 in block 4)
$ echo > b
bwrite 4 ialloc inum 20
bwrite 4 iupdate inum 20
bwrite 29 writei inum 1
$ rm a
bwrite 29 writei inum 1
bwrite 4 iupdate inum 19
bwrite 4 iupdate inum 19
bwrite 4 iupdate inum 19
$
both dirents in block 29
both i-nodes in block 4
dependencies:
block 4 before block 29 (b inode before b dirent)
block 29 before block 4 (a dirent before a inode)
is the cycle problem a fundamental show-stopper?
what if each i-node or dirent were 512 bytes long?
what's the core soft updates idea?
if you want to write block 29:
1. un-do b's dirent (leaving the delete of a's dirent)
2. write block 29 to disk
3. restore b's dirent
4. delete "29 before 4" dependency
what's in a dependency record?
attached to a block (e.g. block of i-nodes or block of dirents)
specifies which record within the block it refers to
e.g. an i-node or dirent
specifies what other block (?) that record depends on
i.e. that other block needs to be written first
specifies a dependency record in the other block
so we know what update to the other block we're waiting for
in case other block written, but with update rolled back
specifies a function to roll back/forward
state:
satisfied -- everything this update depends on is on disk
complete -- this update itself is on disk
what does e.g. creat() look like?
allocate inode i
dependency record for setting i.type to T_FILE
(not dependent on anything; but we will soon point to it)
satisfied but not complete
write name and i into dirent d
dependency record for dirent modification
points to inode dependency record
not satisfied, not complete
extra book-keeping to never re-use i-node or data block w/ on-disk ref
what does code to write a block look like?
lock the block
for each unsatisfied dependency, un-do it
write the block to disk
wait for the disk write to finish
for each unsatisfied dependency, re-do it
for each satisfied dependency
mark it complete
for other blocks that depend on this one
for our complete dependency records
mark other dependency record as satisfied
unlock the block
what are some problems?
when you want to re-use a cache buffer, which to flush?
what might go wrong?
how to do a good job?
can we always write each buffer just once?
is fsck still needed?
yes, but only to collect unreferenced blocks and i-nodes
i.e. safe to not run it
prob not true w/o soft updates
ordinary FFS lazy about writing bitmaps?
all updates delayed
xv6 creat() durable on disk when creat() returns
w/ soft updates write-back cache may be a while before on disk
fsync(fd) may need to write file's dirent &c
how to find it?
multiple updates to same record?
could we roll back one but not the other?
is it faster?
paper, Figures 3 and 4