File System Performance and Recovery
Required reading: The Soft Updates paper.
intro
topic: tension between file system performance and crash recovery
disk performance is often a #1 bottleneck
"how many seeks will that take?"
but many obvious fixes make crash / power failure hard to recover from
"what if a crash occurred at this point?"
crash recovery is much harder than performance
why is disk performance often a bottleneck?
everything else has gotten faster!
25 years: CPU 1000x, net 100x, disk seek 3x
seek+rotate takes 10ms, so 100 I/O per sec, 51 KB/s on xv6
example: storing incoming short email msgs: how many seeks?
good news: sequential throughput is 50 MB / sec
but quite hard to arrange for pure sequential I/O
example: storing huge output of scientific computation
memories are huge, won't a big disk cache provide high performance?
reads: often big data sets with random / long-tail access
writes: what if a crash occurs before you write cache to disk?
why is FS crash recovery hard?
crash = halt/restart CPU, let disk finish current sector write
assume no h/w damage, no wild writes to disk
goal: automatic recovery
can FS always make sense of on-disk metadata after restart?
given that the crash could have occurred at any point?
example:
many of you observed on exam that crash during mkdir()
might leave subdir without "." and ".."
you can imagine much worse outcomes
e.g. block used by a file AND on free list
recovery approaches
three main approaches:
synchronous meta-data update + fsck (as in xv6)
logging (as seen in 6.033)
soft updates (today's paper)
design tradeoffs -- you can move the work around
careful update vs recovery
work during ordinary operation vs work during recovery
FS guarantees vs how much app must do
are all syscalls durable? ordered? atomic?
terms for properties of FS operations
if app completes operations shortly before a crash,
what effects will app see after restart + recovery?
e.g. creat("a"); fd = creat("b"); write(fd, ...); crash;
durable: effects of operation are visible
atomic: all steps of operation visible, or none
ordered: exactly a prefix of operations is visible
scores for xv6? soft updates? logging?
typical set of tradeoffs
FS ensures it can recover its own metadata
"internal consistency"
no dangling references (e.g. all dirents refer to allocated inodes)
inode and block free lists contain only unused items
lengths, link counts, unique names in dir
this is the bare minimum for a useful file system
FS provides limited guarantees to apps
atomicity for create, delete, rename
mostly equivalent to internal consistency
no atomicity for write() (except maybe single-sector)
often no durability for anything (creat("a") then crash => no "a")
often no order guarantees
how do apps cope with these weak FS semantics?
example: editing file "grades" with emacs, crash while writing
do you end up with only half your file?
you would if you just opened file for writing, and wrote
crash between open() (or creat()) and write()
crash during write(), since it is not atomic
[see code in handout]
fd = open("tmp1234", O_CREAT|O_WRONLY|O_TRUNC, 0666);
write(fd, ...);
fsync(fd);
close(fd);
rename("tmp1234", "grades");
why it works:
fsync() forces durability, returns after data on disk
rename() is atomic: after a crash, old or new grades (not nothing)
(atomic rename() is quite difficult to get right...)
let's look at how xv6 updates the disk
handout w/ print for each bwrite()
think: "what if a crash here? how many seeks?"
$ echo > x
bwrite 4 ialloc inum 18
bwrite 4 iupdate inum 18
bwrite 29 writei inum 1 *
$ echo z > x
bwrite 28 balloc bn 337
bwrite 337 writei inum 18
bwrite 4 iupdate inum 18
bwrite 337 writei inum 18
bwrite 4 iupdate inum 18
$ rm x
bwrite 29 writei inum 1 **
bwrite 4 iupdate inum 18
bwrite 337 bzero
bwrite 28 bfree bn 337
bwrite 4 iupdate inum 18
bwrite 4 iupdate inum 18
$ echo > y
bwrite 4 ialloc inum 18
bwrite 4 iupdate inum 18
bwrite 29 writei inum 1 **
$
would an xv6 FS be internally consistent after a crash?
could look at all possible crashes
what if a crash during file creation? &c
remember all disk writes in xv6 are synchronous
better: argue on-disk FS consistent for any crash
this is mostly about avoiding dangling references to blocks and inodes
xv6 strategy: carefully order disk writes to avoid dangling refs
1. initialize a new inode before creating dirent
2. delete dirent before marking inode free
3. mark block in-use before adding it to inode addrs[]
4. remove block from addrs[] before marking free (oops)
5. zero block before marking free
has some visible bugs: . and .. during mkdir(), link counts, sizes
has some invisible loose ends: may lose freed blocks and inodes
what about application-visible syscall semantics?
durable? yes (write-through cache)
atomic? often, though e.g. mkdir is broken
ordered? yes (all writes are sync)
how about xv6 FS performance?
slow: at 10ms per i/o can only write 16 small files/sec
think about un-tar or rm *
can we get rid of some xv6 disk writes?
could combine ialloc and iupdate when creating
echo could write("z\n") rather than two writes
could let block free bitmap be out of date
rebuild during recovery, in fsck
hard to reduce create or unlink to fewer than two seeks
still pretty slow
can we implement system calls with no seeks at all?
use a write-back disk cache
bwrite() *only* modifies in-memory buf (no disk write)
bufs written to disk later
if cache is full, write LRU dirty block
or maybe every 30 seconds
would write-back cache improve performance? why, exactly?
what problems will write-back cause?
what can go wrong w/ badly ordered writes?
an easy case: * goes to disk, crash before other two blocks written
FS not immediately usable after crash (dirent -> free inode)
easy for fsck to detect and fix, since i-node is marked free
suppose only the ** disk writes occur, then a crash
what is the problem?
can fsck detect and fix this?
problem: write-back cache doesn't follow ordering rules
those X before Y rules listed earlier
idea to fix write-back cache to be crash-recoverable:
record block-block dependencies between cached bufs
write-back to disk must obey dependencies
if it is about to write dirent block, first write inode block
example:
for creat():
write inode to disk before dirent
dirent block depends on inode block
arrow from block 29 to block 4
for write():
write bitmap block to disk before inode block
inode block depends on bitmap block
arrow from 4 to 28
would this have good performance?
e.g. for lots of creat()s in a directory
why doesn't it always work?
example cycle:
(file /a already exists w/ i-num 19 in block 4)
$ echo > b
bwrite 4 ialloc inum 20
bwrite 4 iupdate inum 20
bwrite 29 writei inum 1
$ rm a
bwrite 29 writei inum 1
bwrite 4 iupdate inum 19
bwrite 4 iupdate inum 19
bwrite 4 iupdate inum 19
$
both dirents in block 29
both i-nodes in block 4
dependencies:
block 4 before block 29 (b inode before b dirent)
block 29 before block 4 (a dirent before a inode)
it seems like we can't write either block first!
is this cycle problem fundamental?
what if each i-node or dirent were 512 bytes long?
what's the core soft updates idea?
dependencies between metadata items, not blocks
i.e. particular dirent depends on particular inode
not from dirent's block to inode's block
this eliminates our cycle!
if you want to write block 29:
1. un-do b's dirent (leaving delete of a's dirent)
2. write block 29 to disk
3. restore b's dirent
4. delete "29 before 4" dependency
now you can write 4 to disk, and then the real 29
what's in a dependency record?
attached to a record within a block (e.g. inode or dirent)
points to a dependency record for another block
so we know what update to the other block we're waiting for
in case other block written, but with update rolled back
note: you depend on an *update*, not just a record
roll-back / roll-forward data and function
state:
satisfied -- all updates this one depends on have been written to disk
complete -- this update itself has been written to disk
what does e.g. creat() look like?
allocate inode i
dependency record for setting i.type to T_FILE
(not dependent on anything; but we will soon point to it)
satisfied but not complete
put name and i in dirent d
dependency record for dirent modification
points to inode dependency record
not satisfied, not complete
what does code to write a block to disk look like?
lock the block
for each unsatisfied dependency, un-do it
write the block to disk
wait for the disk write to finish
for each unsatisfied dependency, re-do it
for each satisfied dependency
mark it complete
for other blocks that depend on this one
for our complete dependency records
mark other dependency record as satisfied
unlock the block
various details they had to get right
when you want to re-use a cache buffer, which to flush?
what might go wrong?
how to do a good job?
how many times might we have to write a particular buffer?
that is, what is the ultimate cost of undo/write/redo?
is fsck still needed?
collect unreferenced blocks and i-nodes
fix too-high link counts (after link and rename)
safe to not run fsck
not true w/o soft updates
ordinary FFS lazy about writing bitmaps
all updates delayed
xv6 creat() durable on disk when creat() returns
w/ soft updates write-back cache may be a while before on disk
order among syscalls not preserved if a crash
creat("a"); creat("b");
after crash, app might see b but not a
fix with fsync()
fsync(fd) may need to write file's dirent &c
no point in writing file content to disk w/o dirent
could be a lot of stuff; how to identify?
is soft updates fast?
paper, Figures 3 and 4