File system performance and crash recovery
this lecture
tocttou
crash recovery for xv6
logging / ext3
time-of-check-to-time-of-use-bugs
when I was in college there was no mkdir() system call. here's
the source for /bin/mkdir (from v7 unix):
if ((mknod(d, 040777, 0)) < 0) {
fprintf(stderr,"mkdir: cannot make directory %s\n", d);
++Errors;
return;
}
chown(d, getuid(), getgid());
mkdir was setuid to root -- it ran as root even if started by
an ordinary user, since mknod() is a privileged call.
what's the problem?
how to exploit it?
cd /tmp
mkdir foo && ./exploit
exploit waits for foo to appear, then removes it,
substitutes a link to e.g. /etc/passwd
unix used to be full of such bugs. e.g. mail delivery command
appended to /usr/spool/mail/rtm, running as root:
stat(/usr/spool/mail/rtm)
if owned by rtm
open and write
how do people fix tocttou bugs?
atomic mkdir() call
f-calls:
fd = open(file)
fstat(fd)
if(ok)
fchown(fd)
don't run as root: structure s/w to drop privileges when possible
e.g. setuid(rtm) before delivering mail
what is crash recovery?
crash = kernel panic, power failure, fixable h/w failure
assume fail-stop: no wild writes to disk
assume disk content preserved, except maybe last write incomplete
goal: after reboot, recover to a correct state
crash recovery is hard because some syscalls involve multiple writes
example: mark block as not-free, add block # to inode
in that order, crash causes missing free blocks
in the other order, crash leaves a free block in a file
soon, two files will share a block!
example: mkdir allocates inode, make dirent in parent, then adds .
and ..
crash may leave new dir without its . and ..
we'll discuss two approaches:
synchronous meta-data update + fsck (xv6 + imaginary fsck)
logging (linux ext3)
what can we say about an xv6 FS after a crash?
xv6 strategy: carefully order disk writes to avoid dangling refs
1. initialize a new inode before creating dirent
2. delete dirent before marking inode free
3. mark block in-use before adding it to inode addrs[]
4. remove block from addrs[] before marking free (oops)
5. zero block before marking free
xv6 uses "synchronous meta-data update"
enforces these orders by waiting for each write to complete on disk
ordering does two things:
1. prevents dangling references from confusing FS after reboot
2. helps fsck reason about what was happening at time of crash
FS immediately after crash isn't quite consistent:
. and .. during mkdir(), link counts, sizes
may leak freed blocks and inodes
thus:
mostly internally consistent; FS unlikely to panic immediately
after reboot
no reachable dangling references
not quite "atomic", since you can see partial operations after
crash
but it is "durable": syscall returns => data will be on disk
after crash
xv6 needs an fsck
to restore internal consistency after a crash
complete or roll back partial system calls
how would you make it correct?
look at each syscall's sequence of synchronous writes
have a story for a crash at each point
how would fsck work?
scan inodes, build in-memory table of which are marked free
scan free block table, build in-memory table
descend directory hierarchy from root
i.e. look at all dirs and files a program could access
count dir references to each inode
count dir/file references to each block
what inconsistencies could exist?
inode not free, but not referenced by any dir
crash during create
crash during delete
ok to simply mark the inode free (rolls back create, or
completes delete)
block not free, but not referenced by any file/dir
crash during append
crash during unlink/truncate
in both cases, ok to mark the block free (rolls back append,
completes truncate)
reachable directory w/o . or ..
crash during mkdir
remove directory (since must be new+empty), or create . and ..
incorrect link count?
crash during link or unlink
fix the link count to match # of references in reachable dir
hierarchy
lots of "can't happen" inconsistencies:
block in two files? cycle in dir hierarchy?
fsck asks for human assistance
good news:
xv6 could be made to have good crash recovery, w/ fsck
bad news:
xv6 very slow during normal operation
xv6 very slow during fsck
how long would fsck take?
a read from a random place on disk takes about 10 milliseconds
descending the directory hierarchy might involve a random read per
inode
so maybe (n-inodes / 100) seconds?
faster if you read all inodes (and dir blocks) sequentially,
then descend hierarchy in memory
my server: fsck takes 10 minutes per 70GB disk w/ 2 million inodes
clearly reading many inodes sequentially, not seeking
still a long time, probably linear in disk size
how about xv6 normal FS performance?
creating a file and writing a few bytes takes 8 writes, probably 80
ms
(ialloc, init inode, write dirent, alloc data block, add to inode,
write data, set length in inode, one other mystery write to data)
so can create only about a dozen small files per second
think about un-tar or rm *
how to get better performance?
RAM is cheap
disk sequential throughput is high, 50 MB/sec
(maybe someday solid state disks will change the landscape)
we'll talk about big memory, then sequential disk throughput
why not change xv6 to use a big write-back disk cache?
bwrite() *only* modifies in-memory buf (no disk write)
so creat(), unlink(), write() &c return almost immediately
bufs written to disk later
if cache is full, write LRU dirty block
write all dirty blocks every 30 seconds, to limit loss if crash
would write-back cache improve performance? why, exactly?
what problems will write-back cause?
what can go wrong w/ write-back cache?
example: crash soon after xv6 file creation
1. ialloc/iupdate initialize new inode (in use, zero length, &c)
2. writei adds new directory entry with file's name and i-number
both just go into buffer cache, not disk
performance is good -- no disk seeks
suppose eventually cache manager decides to write block #2 to disk
then crash
now FS not useable after a crash (dirent -> free inode)
easy for fsck to detect and fix, since i-node is marked free
example: unlink() followed by create()
an existing file x with some content, all safely on disk
one user runs unlink(x)
1. delete x's dir entry **
2. put blocks in free bitmap
3. mark x's inode free
another user then runs create(y)
4. allocate a free inode
5. initialize the inode to be in-use and zero-length
6. create y's directory entry **
again, all writes initially just to disk buffer cache
suppose only ** writes forced to disk, then crash
what is the problem?
can fsck detect and fix this?
problem: no mechanism to make unlink(), create(), &c atomic
with respect to crashes
and (unlike sync metadata update) no ordering hints for fsck
most popular solution: logging (== journaling)
goal: atomic system calls w.r.t. crashes
goal: fast recovery (no hour-long fsck)
goal: speed of write-back cache for normal operations
simple 6.033-style logging scheme:
[diagram: buffer cache, FS tree on disk, log on disk]
FS has a log on disk
syscall describes all the writes it wants to do in the log on disk
and only then modifies the real FS
if crash during FS modification, recovery s/w can use log to
complete the sys call
syscall:
append BEGIN record to disk log
for each modification (inode, dir block, free bitmap, &c)
make modification in buffer cache, but pin that buffer
append copy of modified block to log
append a COMMIT record to log on disk
un-pin the buffers
don't write buffers to disk until after COMMIT goes to disk
"write-ahead logging"
after crash+reboot, recovery scans log
iff sees COMMIT, re-do all modifications in that xaction
if no COMMIT, ignore that syscall's modifications in log
so if crash during a system call
if no COMMIT, syscall has no effect at all
since we held write-back until after COMMIT on disk
if COMMIT, syscall has all effects
what's wrong with this simple logging scheme?
log traffic will be huge: every file create -> many blocks of records
sys calls still need to wait for disk i/o, so write-back cache not
much help
some details missing: GC of log, find start/end of log after reboot
Linux's ext3 design
case study of the details required to add logging to a file system
main source: Stephen Tweedie 2000 talk transcript "EXT3, Journaling
Filesystem"
ext3 adds a log to ext2, a previous xv6-like log-less file system
has many modes, I'll described "journaled data"
log contains both metadata and file content blocks
ext3 structures:
in-memory write-back block cache
in-memory list of blocks to be logged
on-disk FS
on-disk circular log file
what's in the ext3 log?
descriptor: magic, seq, block #s
data blocks (as described by descriptor)
commit: magic, seq
ext3 logs entire blocks
expensive: one little change -> 4096 bytes in log
how does ext3 get good performance despite logging entire blocks?
batches many syscalls per commit
defers copying cache block to log until it commits log to disk
hopes multiple sycalls modified same block
thus many syscalls, but only one copy of block in log
to be continued...