File system performance and crash recovery
last week:
introduced crash recovery problem
e.g. crash in the middle of multi-write operation like file create
xv6 orders writes to avoid dangling references
so FS is usable after crash
fsck completes or undoes partial operations
block/inode marked used but no reference
. and .. during mkdir(), link counts, sizes
what are the drawbacks of the xv6 story?
fsck takes a long time on big FS -- tens of minutes
must descend the entire directory tree
must read every inode, bitmap, dirent
fsck must complete before we can use FS
sync writes are slow -- seek, spin
xv6 small-file create takes tens of milliseconds
dirent, inode init, balloc, inode update
how to get better performance than xv6?
RAM is cheap
disk sequential throughput is high, 50 MB/sec
(maybe someday solid state disks will change the landscape)
we'll talk about big memory, then sequential disk throughput
why not change xv6 to use a big write-back disk cache?
bwrite() *only* modifies in-memory buf (no disk write)
just sets "dirty" flag
so creat(), unlink(), write() &c return almost immediately
dirty bufs written to disk later
if cache is full, write LRU dirty block
write all dirty blocks every 30 seconds, to limit loss if crash
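a minimal sketch of what write-back bwrite() might look like; the struct and helpers are invented for illustration, not xv6's real bio.c:

  #include <stdbool.h>

  #define NBUF 1024

  struct buf {
    bool valid;
    bool dirty;              // modified in memory, not yet written to disk
    unsigned int blockno;
    unsigned char data[512];
  };

  struct buf cache[NBUF];

  // stand-in for the real driver write (ide/virtio in xv6)
  static void disk_write(struct buf *b) { (void)b; /* issue the I/O here */ }

  // bwrite() no longer touches the disk; it just marks the buffer dirty
  void bwrite(struct buf *b) {
    b->dirty = true;
  }

  // called every ~30 seconds (and on LRU eviction) to bound loss after a crash
  void flush_dirty(void) {
    for (int i = 0; i < NBUF; i++) {
      if (cache[i].valid && cache[i].dirty) {
        disk_write(&cache[i]);
        cache[i].dirty = false;
      }
    }
  }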
would write-back cache improve performance? why, exactly?
after all, we have to write the disk either way
what problems will write-back cause?
what can go wrong w/ write-back cache?
it may change the order of disk writes in a bad way
example: unlink() followed by create()
an existing file x with some content, all safely on disk
unlink(x)
1. delete x's dir entry **
2. put blocks in free bitmap
3. mark x's inode free
create(y)
4. allocate an inode
5. initialize the inode
6. create y's directory entry **
again, all writes initially just to disk buffer cache
suppose only ** writes forced to disk, then crash
what is the problem?
can fsck detect and fix this?
most popular solution: logging (== journaling)
goal: atomic system calls w.r.t. crashes
goal: fast recovery (no hour-long fsck)
goal: speed of write-back cache for normal operations
simple 6.033-style logging scheme:
[diagram: buffer cache, FS tree on disk, log on disk]
FS has a log on disk
syscall describes all the writes it wants to do in the log on disk
and only then modifies the real FS
if crash, recovery s/w can use log to complete the sys call
syscall:
append BEGIN record to disk log
for each modification (inode, dir block, free bitmap, &c)
make modification in buffer cache, but pin that buffer
append copy of modified block to log
append a COMMIT record to log on disk
un-pin the buffers
don't write buffers to disk (pin them) until after COMMIT is on disk
"write-ahead logging"
can defer FS disk writes as long as we want
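a sketch of the per-syscall protocol; log_append_*(), pin(), and unpin() are assumed helpers, not real xv6 code:

  struct buf;

  void log_append_begin(void);          // BEGIN record to on-disk log
  void log_append_block(struct buf *b); // copy of modified block to on-disk log
  void log_append_commit(void);         // COMMIT record; returns once it's on disk
  void pin(struct buf *b);              // forbid write-back of this buffer
  void unpin(struct buf *b);            // allow write-back again

  void logged_syscall(struct buf *mods[], int n) {
    log_append_begin();
    for (int i = 0; i < n; i++) {
      pin(mods[i]);               // keep it out of the FS until COMMIT is durable
      // ... modify mods[i] in the buffer cache ...
      log_append_block(mods[i]);  // write-ahead: the log copy goes to disk first
    }
    log_append_commit();          // the all-or-nothing point
    for (int i = 0; i < n; i++)
      unpin(mods[i]);             // cached blocks may now reach the FS lazily
  }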
after crash+reboot, recovery scans log
iff sees COMMIT, re-do all modifications in that xaction
if no COMMIT, ignore that syscall's modifications in log
so if crash during a system call
if no COMMIT, syscall has no effect at all
since we held write-back until after COMMIT on disk
if COMMIT, syscall has all effects
what's wrong with this simple logging scheme?
high log traffic -> maybe just as slow as xv6
some details missing: GC of log, find start/end of log after reboot
Linux's ext3 design
case study of the details required to add logging to a file system
Stephen Tweedie 2000 talk transcript "EXT3, Journaling Filesystem"
ext3 adds a log to ext2, a previous xv6-like log-less file system
has many modes, I'll start with "journaled data"
log contains both metadata and file content blocks
ext3 structures:
in-memory write-back block cache
in-memory list of blocks to be logged
on-disk FS
on-disk circular log file
what's in the ext3 log?
superblock: ptr to start of log
descriptor: magic, seq, block #s
data blocks (as described by descriptor)
commit: magic, seq
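roughly, these records might look like the structs below; the field names are illustrative, not ext3's exact JBD on-disk layout:

  #include <stdint.h>

  struct log_superblock {
    uint32_t start;       // log offset of the oldest live transaction
    uint32_t seq;         // sequence number expected at that offset
  };

  struct log_descriptor {
    uint32_t magic;       // marks this block as a descriptor
    uint32_t seq;         // transaction sequence number
    uint32_t nblocks;     // number of data blocks that follow
    uint32_t blocknos[];  // home FS block # for each following data block
  };

  struct log_commit {
    uint32_t magic;       // marks this block as a commit record
    uint32_t seq;         // must match the descriptor's seq
  };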
how does ext3 get good performance despite logging entire blocks?
batches many syscalls per commit
defers copying cache block to log until it commits log to disk
hopes multiple syscalls modified the same block
thus many syscalls, but only one copy of block in log
"write absorbtion"
sys call:
h = start()
get(h, block #)
warn logging system we'll modify cached block
added to list of blocks to be logged
prevent writing block to disk until after xaction commits
modify the blocks in the cache
stop(h)
guarantee: all or none
stop() does *not* cause a commit
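for example, a file create might use the handle interface roughly like this; the names follow the notes, while the real JBD calls are longer (e.g. journal_start()):

  struct buf;
  typedef struct handle handle_t;

  handle_t *start(void);                               // join the open transaction
  struct buf *get(handle_t *h, unsigned int blockno);  // declare intent to modify
  void stop(handle_t *h);                              // syscall done; does NOT commit

  void create_file(unsigned int bitmap_blockno, unsigned int inode_blockno,
                   unsigned int dir_blockno) {
    handle_t *h = start();
    struct buf *bitmap = get(h, bitmap_blockno);  // allocate the inode
    struct buf *inode  = get(h, inode_blockno);   // initialize the inode
    struct buf *dir    = get(h, dir_blockno);     // add the directory entry
    // ... modify bitmap, inode, dir in the buffer cache ...
    (void)bitmap; (void)inode; (void)dir;
    stop(h);  // all three blocks commit together, or not at all
  }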
ext3 transaction
while "open", adds new syscall handles, and remembers their block #s
only one open transaction at a time
ext3 commits current transaction every few seconds (or fsync())
committing a transaction to disk
open new transaction, for subsequent syscalls
mark transaction as done
wait for in-progress syscalls to stop()
(maybe it starts writing blocks, then waits, then writes again if needed)
write descriptor to log on disk w/ list of block #s
write each block from cache to log on disk
wait for all log writes to finish
append the commit record
now cached blocks go to disk
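putting the steps together, the commit path might look like this sketch (helper names invented):

  struct transaction;

  struct transaction *open_new_transaction(void);  // for subsequent syscalls
  void seal(struct transaction *t);                // no new handles may join t
  void wait_for_handles(struct transaction *t);    // every syscall in t has stop()ed
  void write_descriptor_to_log(struct transaction *t);
  void write_blocks_to_log(struct transaction *t);
  void wait_for_log_writes(struct transaction *t);
  void write_commit_record(struct transaction *t);
  void allow_writeback(struct transaction *t);     // unpin t's cached blocks

  void commit(struct transaction *t) {
    open_new_transaction();
    seal(t);
    wait_for_handles(t);
    write_descriptor_to_log(t);
    write_blocks_to_log(t);
    wait_for_log_writes(t);     // descriptor + data must be durable first
    write_commit_record(t);     // t is now committed
    allow_writeback(t);         // its cached blocks may go to the FS lazily
  }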
is log correct if concurrent syscalls?
e.g. create of "a" and "b" in same directory
ext3's log doesn't preserve per-syscall order, unlike our simple one-transaction-per-syscall scheme
updates for both are mixed into the same log transaction
T2 starts while T1 is committing to log on disk
what if syscall in T2 wants to write block in prev xaction?
can't be allowed to write buffer that T1 is writing to disk
then new syscall's write would be part of T1
crash after T1 commit, before T2, would expose update
T2 gets a separate copy of the block to modify
T1 holds onto old copy to write to log
are there now *two* versions of the block in the buffer cache?
no, only the new one is in the buffer cache, the old one isn't
does the old copy need to be written to the FS on disk?
no: T2 will write it
performance?
create 100 small files in a directory
xv6: >= 4 seeks / file at ~10 ms each, so about 4 seconds for 100 files
ext3: repeated mods to same direntry, inode, bitmap blocks
then one commit of a few metadata blocks plus 100 file blocks
how long to do a commit?
seq write of 100*4096 at 50 MB/sec: 10 ms
wait for disk to say writes are on disk
then write the commit record
that wastes one revolution, another 10 ms
modern disk interfaces can avoid wasted revolution
what if a crash?
crash may interrupt writing last xaction to log on disk
so disk may have a bunch of full xactions, then maybe one partial
may also have written some of block cache to disk
but only for fully committed xactions, not partial last one
how does recovery work
1. find the start and end of the log
log "superblock" at start of log file
log superblock has start offset and seq# of first transaction
scan until bad record or not the expected seq #
go back to last commit record
crash during commit -> last transaction ignored during recovery
2. replay all blocks through last complete xaction, in log order
what if block after last valid log block looks like a log descriptor?
perhaps left over from previous use of log? (seq...)
perhaps some file data happens to look like a descriptor? (magic #...)
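a sketch of the recovery scan, assuming the record checks verify magic and expected seq #; wrap-around of the circular log is omitted:

  #include <stdint.h>
  #include <stdbool.h>

  uint32_t log_sb_start(void);   // from the log superblock
  uint32_t log_sb_seq(void);
  bool read_descriptor(uint32_t off, uint32_t seq, uint32_t *nblocks);
  bool read_commit(uint32_t off, uint32_t seq);
  void replay(uint32_t off, uint32_t nblocks);  // copy logged blocks to home FS locations

  void recover(void) {
    uint32_t off = log_sb_start();
    uint32_t seq = log_sb_seq();
    uint32_t nblocks;
    // scan forward; stop at a bad record, an unexpected seq #,
    // or a transaction with no commit record (crash during commit)
    while (read_descriptor(off, seq, &nblocks) &&
           read_commit(off + 1 + nblocks, seq)) {
      replay(off, nblocks);       // only fully committed transactions
      off += 1 + nblocks + 1;     // descriptor + data blocks + commit
      seq++;
    }
  }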
when can ext3 free a transaction's log space?
after cached blocks have been written to FS on disk
what if block in T1 has been dirtied in cache by T2?
can't write that block to FS on disk
note ext3 only does copy-on-write while T1 is committing
after T1 commit, T2 dirties only block copy in cache
so can't free T1 until T2 commits, so block is in log
T2's logged block contains T1's changes
free == advance log superblock's start pointer
what if not enough free space in log for a syscall?
suppose we start adding syscall's blocks to T2
half way through, realize T2 won't fit on disk
we cannot commit T2, since syscall not done
can we free T1 to free up log space?
maybe not, due to the previous issue: T2 may have dirtied a block in T1
deadlock!
solution: reservations
start of syscall pre-declares how many blocks of log space it might need
block the syscall from starting until there's enough free log space
may need to commit open transaction, then free older transaction
OK since reservations mean all started sys calls can complete + commit
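a sketch of reservations, with invented helpers; here start() takes the syscall's worst-case block count:

  typedef struct handle handle_t;

  unsigned int free_log_blocks(void);        // log space not yet claimed by reservations
  void commit_open_transaction(void);
  void checkpoint_oldest_transaction(void);  // write its blocks to the FS, free its log space
  handle_t *join_open_transaction(unsigned int reserved);

  handle_t *start(unsigned int worst_case_blocks) {
    // don't let the syscall start until its worst case fits in the log;
    // since every started syscall fits, every open transaction can commit
    while (free_log_blocks() < worst_case_blocks) {
      commit_open_transaction();
      checkpoint_oldest_transaction();
    }
    return join_open_transaction(worst_case_blocks);
  }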
ext3 not as immediately durable as xv6
creat() returns -> maybe data is not on disk! crash will undo it.
need fsync(fd) to force commit of current transaction, and wait
would ext3 have good performance if commit after every sys call?
would log many more blocks, no absorption
10 ms per syscall, rather than 0 ms
(Rethink the Sync addresses this problem)
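so an application that needs durability has to ask for it, e.g.:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void) {
    int fd = open("important", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (write(fd, "hello\n", 6) != 6) { perror("write"); return 1; }
    if (fsync(fd) < 0) { perror("fsync"); return 1; }  // commit current transaction + wait
    close(fd);
    return 0;
  }

when fsync() returns, the create, the write, and the new directory entry are all in the committed transaction, so they survive a crash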
no checksum in ext3 commit record
disks usually have write caches and re-order writes, for performance
sometimes hard to turn off (the disk lies)
people often leave re-ordering enabled for speed, out of ignorance
bad news if the disk re-orders writes and puts the commit record on disk before the preceding log blocks
then recovery replays descriptors with random block #s!
and writes them with random content!
ordered vs journaled
journaling file content is slow, every data block written twice
in principle not needed to keep FS internally consistent
can we just lazily write file content blocks?
no:
if metadata updated first, crash may leave file pointing
to blocks with someone else's data
ext3 ordered mode:
write content block to disk before committing inode w/ new block #
thus won't see stale data if there's a crash
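a sketch of the ordering during an ordered-mode commit (invented helpers):

  struct transaction;

  void write_data_to_fs(struct transaction *t);       // file content, to its home location
  void wait_for_data_writes(struct transaction *t);
  void write_metadata_to_log(struct transaction *t);  // inodes, bitmaps, dirents
  void wait_for_log_writes(struct transaction *t);
  void write_commit_record(struct transaction *t);

  void ordered_commit(struct transaction *t) {
    write_data_to_fs(t);        // content written once, never logged
    wait_for_data_writes(t);    // must be durable before metadata points at it
    write_metadata_to_log(t);
    wait_for_log_writes(t);
    write_commit_record(t);     // recovery never exposes stale data via new inodes
  }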
correctness challenges w/ ordered mode:
A. rmdir, re-use block for file, ordered write of file,
crash before rmdir or write committed
now scribbled over the directory block
fix: defer free of block until freeing operation forced to log on disk
B. rmdir, commit, re-use block in file, ordered file write, commit,
crash, replay rmdir
file is left w/ directory content e.g. . and ..
fix: revoke records, prevent log replay of a given block
final tidbit
open a file, then unlink it
unlink commits
file is open, so unlink removes dir ent but doesn't free blocks
crash
nothing interesting in log to replay
inode and blocks not on free list, also not reachable by any name
will never be freed! oops
solution: add inode to linked list starting from FS superblock
commit that along with remove of dir ent
recovery looks at that list, completes deletions
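a sketch of the on-disk arrangement; field names are invented, but the idea is a singly linked list of to-be-deleted inodes rooted in the FS superblock:

  #include <stdint.h>

  struct fs_superblock {
    // ...
    uint32_t orphan_head;   // inode # of first open-but-unlinked inode, 0 if none
  };

  struct dinode {
    // ...
    uint32_t nlink;
    uint32_t next_orphan;   // next inode # on the orphan list, 0 if last
  };

  // unlink() of an open file: in the same transaction that removes the
  // directory entry, link the inode onto the orphan list.
  // recovery: walk the list from orphan_head, freeing each inode and its
  // blocks, thereby completing the interrupted deletions.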