6.828 2017 Lecture 13: Crash Recovery, Logging Plan problem: crash recovery crash leads to inconsistent on-disk file system solution: logging This is the last xv6 lecture next week we'll switch to papers # Why crash recovery What is crash recovery? you're writing the file system then the power fails you reboot is your file system still useable? the problem: crash during multi-step operation may leave FS invariants violated after reboot: bad: crash again due to corrupt FS worse: no crash, but reads/writes incorrect data examples: create: new dirent allocate file inode crash: dirent points to free inode -- disaster! crash again, or worse if inode is allocated for something else crash: inode not free but not used -- not so bad write: inode addrs[] and len indirect block block content block free bitmap crash: inode refers to free block -- disaster! crash: block not free but not used -- not so bad unlink: block free bitmaps free inode erase dirent what can we hope for? after rebooting and running recovery code 1. FS internal invariants maintained e.g., no block is both in free list and in a file 2. all but last few operations preserved on disk e.g., data I wrote yesterday are preserved but perhaps not data I was writing at time of crash so user might have to check last few operations 3. no order anomalies echo 99 > result ; echo done > status correctness and performance often conflict disk writes are slow! safety => write to disk ASAP speed => don't write the disk (batch, write-back cache, sort by track, &c) crash recovery is a recurring problem arises in all storage systems, e.g. databases a lot of work has gone into solutions over the years many clever performance/correctness tradeoffs # Logging solution most popular solution: logging (== journaling) goal: atomic system calls w.r.t. crashes goal: fast recovery (no hour-long fsck) will introduce logging in two steps first xv6's log, which only provides safety and fast recovery then Linux EXT3, which is also fast in normal operation the basic idea behind logging you want atomicity: all of a system call's writes, or none let's call an atomic operation a "transaction" record all writes the sys call *will* do in the log on disk (log) then record "done" on disk (commit) then do the FS disk writes (install) on crash+recovery: if "done" in log, replay all writes in log if no "done", ignore log this is a WRITE-AHEAD LOG write-ahead log rule install *none* of a transaction's writes to disk until *all* writes are in the log on disk, and the logged writes are marked committed. why the rule? once we've installed one write to the on-disk FS, we have to do *all* of the transaction's other writes -- so the transaction is atomic. we have to be prepared for a crash after the first installation write, so the other writes must be still available after the crash -- in the log. logging is magic crash recovery of complex mutable data structures is generally hard logging can often be layered on existing storage systems and it's compatible with high performance (topic for next week) challenge: prevent write-back from cache a system call can safely update a *cached* block, but the block cannot be written to the FS until the transaction commits tricky because e.g. cache may run out of space, and be tempted to evict some entries in order to read and cache other data. consider create example: write dirty inode to log write dir block to log evict dirty inode commit solution: ensure buffer cache is big enough pin dirty blocks in buffer cache after commit, unpin block xv6 log representation [diagram: buffer cache, in-memory log block # array, FS tree on disk, log header and blocks on disk] on write add blockno to in-memory array keep the data itself in buffer cache (pinned) on commit: write buffers to the log on disk WAIT for disk to complete the writes ("synchronous") write the log header sector to disk block numbers non-zero "n" after commit: install (write) the blocks in the log to their home location in FS write zero to "n" in the log header on disk the "n" value in the log header on disk indicates commit non-zero == committed, log content valid and is a complete transaction zero == not committed, may not be complete, recovery should ignore log write of non-zero "n" is the "commit point" challenge: system's call data must fit in log compute an upper bound of number of blocks each system call writes set log size >= upper bound break up some system calls into several transactions large write()s thus: large write()s are not atomic but a crash will leave a correct prefix of the write challenge: allowing concurrent system calls must allow writes from several calls to be in log on commit must write them all BUT cannot write data from calls still in a transaction xv6 solution allow no new system calls to start if their data might not fit in log must wait for current calls to complete and commit when number of in-progress calls falls to zero commit free up log space wake up waiting calls challenge: a block may be written multiple times in a transaction writes affect only the cached block in memory so a cached block may reflect multiple uncommitted transactions but install only happens when there are no in-progress transactions so installed blocks reflect only committed transactions good for performance: "write absorbtion" xv6 disk layout with block numbers 2: log head 32: inodes 58: bitmap 63: content blocks Let's look at an example. I've modified bwrite() to print low-level disk writes, i.e. the disk writes that occur during transaction commit. $ echo a > x // create bwrite 3 inode, 35 bwrite 4 directory content, 63 bwrite 2 commit (block #s and n) bwrite 35 install inode bwrite 63 install directory content bwrite 2 mark log "empty" // write bwrite 3 bwrite 4 bwrite 5 bwrite 2 bwrite 58 bitmap bwrite 533 x bwrite 35 inode (file size) bwrite 2 // write bwrite 3 bwrite 4 bwrite 2 bwrite 533 \n bwrite 35 inode bwrite 2 let's look at the 2nd transaction, a write() first file.c:syswrite compute how many blocks we can write before log is full write that many blocks in a transaction combined with fs.c:writei begin_op() bmap() -- can write bitmap, indirect block bread() modify bp->data log_write() iupdate() -- writes inode end_op() begin_op() in log.c: need to indicate which group of writes must be atomic! need to check if log is being committed need to check if our writes will fit in remainder of log log_write(): add sector # to in-memory array set B_DIRTY, so that bio.c won't evict it end_op(): if no outstanding operations, commit commit(): copy updated blocks from cache to log on disk record sector #s and "done" in on-disk log header install writes -- copy from on-disk log to on-disk FS ide.c will clear B_DIRTY for block written --- now it can be evicted erase "done" from log What would have happened if we crashed during a transaction? memory is lost, leaving only the disk as of the crash kernel calls recover_from_log() during boot, before using FS if log header block says "done": copy blocks from log to real locations on disk what is in the on-disk log? crash before commit crash during commit -- commit point? crash during install_trans() crash just after reboot, while in recover_from_log() note: it is OK to replay the log more than once! as long no other activity intervenes note xv6 assumes the disk is fail-stop it either does the write correctly, or does not do the write i.e. perhaps it can't do the last write due to power failure thus: no partial writes (each sector write is atomic) no wild writes no decay of sectors (no read errors) no read of the wrong sector what is good about xv6's log design? correctness due to write-ahead log good disk throughput: log naturally batches writes but data disk blocks are written twice concurrency what's wrong with xv6's logging? not very efficient: every block is written twice (log and install) logs whole blocks even if only a few bytes modified writes each log block synchronously could write them as a batch and only write head synchronously log writes and install writes are eager both could be lazy, for more write absorbtion but must still write the log first trouble with operations that don't fit in the log unlink might dirty many blocks while truncating file