6.828 2018 Lecture 14: Linux ext3 crash recovery Project reminder Choose: Lab 6 (JOS networking) or JOS-related final project of your choice Some project ideas in the "Lab 7" writeup Projects can be done in groups Submit a project proposal this week We'll say yes or no -- level of difficulty? relevance to O/S? Dec 6: brief write-up due Last week: in-class demostrations, check-off with TAs Lecture plan logging for crash recovery xv6: slow and immediately durable ext3: fast but not immediately durable trade-off: performance vs.safety example problem: appending to a file two writes: mark block non-free in bitmap add block # to inode addrs[] array we want atomicity: both or neither so we cannot do the FS writes one at a time why logging? atomic system calls w.r.t. crashes fast recovery (no hour-long fsck) review of xv6 logging [ cache, on-disk log, on-disk FS ] each system call is a transaction system call updates cached blocks, in memory at end of system call: write modified blocks to log on disk write block #s and "done" to log on disk -- the commit point install modified blocks in FS on disk if we crash midway-through, recovery can replay all writes from log rule: don't start FS writes until all writes committed in log i.e. all or nothing -- atomic erase "done" from log on disk so logged blocks from next xaction don't appear committed homework echo hi > a commit() hacked to ignore one of the writes, crash after commit+install and recovery disabled Q: why does "cat a" (after crash) produce "panic: ilock: no type"? broken commit() updated directory entry but not i-node so dirent is on disk and contains the inode# but the i-node is marked as free (type=0) Q: after recovery, why does "cat a" produce an empty file? even though we ran "echo hi > a"? recovery wrote inode to the right place it is now allocated, and dirent is valid but create and write are separate system calls and transactions echo never called write() -- crash during file create what's wrong with xv6's logging? it is slow! immediate commit, after every syscall immediate write to FS after commit must do this in order to re-use on-disk log -- i.e. to run the next syscall all new syscalls block during any commit() so not much concurrent execution if multiple processes every block written twice to disk, once to log, once to FS not so bad for meta-data blocks painful for big files so: these writes are synchronous -- xv6 waits for each to complete before proceeding creating an empty file takes 6 synchronous disk writes -- 60 ms so only 10 or 20 disk update system calls per second Linux's ext3 design case study of the details required to add logging to a file system Stephen Tweedie 2000 talk transcript "EXT3, Journaling Filesystem" http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html ext3 adds a log to ext2, a previous log-less file system has many modes, I'll start with "journaled data" log contains both metadata and file content blocks ext3 structures: in memory: write-back block cache per-transaction info: set of block #s to be logged set of outstanding "handles" -- one per syscall on disk: FS circular log what's in the ext3 log? log superblock: log offset and starting seq # of earliest valid transaction (this is not the FS superblock; it's a block at start of log file) descriptor blocks: magic, seq, block #s data blocks (as described by descriptor) commit blocks: magic, seq |super: offset+seq #| ... |Descriptor 4|...blocks...|Commit 4| |Descriptor 5|... how does ext3 get good performance? batching commits every few seconds, not after every system call so each transaction includes many syscalls why does batching help performance? 1. amortize fixed transaction cost (descriptor and data blocks) over many transactions 2. "write absorbtion" many syscalls in the batch may modify the same block (i-node, bitmap, dirent) thus one disk write for many syscalls' updates 3. better concurrency -- less waiting for previous syscall to finish commit note: system calls return before they are safely on the disk this affects application-level crash recovery situation e.g. mail server that receives message, writes to disk, then replies "OK" ext3 allows concurrent transactions and syscalls there may be multiple transactions: some fully committed in the on-disk log some doing the log writes as part of commit *one* "open" transaction that's accepting new syscalls ext3 sys call code: sys_open() { h = start() get(h, block #) modify the block in the cache stop(h) } start(): tells logging system to make set of writes until stop() atomic logging system must know the set of outstanding system calls can't commit until they are all complete start() can block this sys call if needed get(): tells logging system we'll modify cached block added to list of blocks to be logged prevent writing block to disk until after xaction commits stop(): stop() does *not* cause a commit transaction can commit iff all included syscalls have called stop() committing a transaction to disk 1. block new syscalls 2. wait for in-progress syscalls to stop() 3. open a new transaction, unblock new syscalls 4. write descriptor to log on disk w/ list of block #s 5. write each block from cache to log on disk 6. wait for all log writes to finish 7. write the commit record 8. wait for the commit write to finish 9. now cached blocks allowed to go to homes on disk (but not forced) can syscall B read uncommitted results of syscall A? A: rm x B: echo > y -- re-using x's freed i-node could B commit first, so that crash would reveal anomaly? case 1: both in same xaction -- ok, both or neither case 2: A in T1, B in T2 -- ok, ext3 commits transactions in order case 3: B in T1, A in T2 in T1: |--B--| in T2: |--A--| could B see A's free of y's i-node? after all, A writes the same cache that B reads bad: crash after T1 could leave both x and y using the i-node no: ext3 waits for all syscalls in prev xaction to finish before letting any in next start thus B (in T1) completes before ext3 lets A (in T2) start so B won't see any of A's writes thus: T1: |-syscalls-| T2: |-syscalls-| T3: |-syscalls-| the larger point: the commit order must be consistent with the order in which the system calls read/wrote state. perhaps ext3 sacrifices a bit of performance here to gain correctness. is it safe for a syscall in T2 to write a block that was also written in T1? ext3 allows T2 to start before T1 finishes committing -- can take a while T1: |-syscalls-|-commitWrites-| T2: |-syscalls-|-commitWrites-| the danger: a T1 syscall writes block 17 T1 closes, starts writing cached blocks to log T2 starts, a T2 syscall also writes block 17 could T1 write T2's modified block 17 to the T1 transaction in the log? bad: not atomic, since then a crash would leave some but not all off T2's writes committed so: ext3 gives T1 a private copy of the block cache as it existed when T1 closed T1 commits from this snapshot of the cache it's efficient using copy-on-write the copies allow syscalls in T2 to proceed while T1 is committing the point: correctness requires a post-crash+recover state as if syscalls had executed atomically and sequentially ext3 uses various tricks to allow some concurrency when can ext3 re-use transaction T1's log space? (log is circular) once: all transactions prior to T1 have been freed in the log, and T1's cached blocks have all been written to FS on disk free == advance log superblock's start pointer/seq what if not enough free space in log for a syscall? suppose we start adding syscall's blocks to T2 half way through, realize T2 won't fit on disk we cannot commit T2, since syscall not done we cannot back out of this syscall, either there's no way to undo a syscall other syscalls in T2 may have read its modifications solution: reservations syscall pre-declares how many block of log space it might need ext3's start() blocks the syscall until enough free space may need to commit open transaction, then free older transaction OK since reservations mean all started sys calls can complete + commit performance? rm * in a directory with 100 files xv6: over 10 seconds -- six synchrounous disk writes per sys call ext3: about 20 ms total rm * repeatedly writes the same same direntry and inode blocks until commit, just updating the cached blocks, no disk writes then one commit of a few metadata blocks how long to do a commit? log a handful of blocks (inodes, dirents) wait for disk to say writes are on disk then write the commit record two rotations, or about 20ms total modern disk interfaces can avoid wasted revolution what if a crash? crash may interrupt writing last xaction to log on disk so disk may have a bunch of complete xactions, then maybe one partial may also have written some of block cache to disk but only for fully committed xactions, not partial last one how does recovery work 1. find the start of the log -- the first non-freed descriptor log "superblock" contains offset and seq# of first transaction (advanced when log space is freed) 2. find the end of the log scan until bad magic or not the expected seq # go back to last commit record crash during commit -> no commit record, recovery ignores 3. replay all blocks through last complete xaction, in log order what if block after last valid log block looks like a log descriptor? perhaps a descriptor block left over from previous use of log? seq # will be too low perhaps some file data happens to look like a descriptor? logged data block cannot contain the magic number! ext3 forbids magic number in logged data blocks -> flags in descriptor "ordered data" mode (so far we've been talking about "journaled data" mode) logging file content is slow, every data block written twice can we entirely omit file content from the log? if we did, when would we write file content to the FS? can we write file content blocks at any time at all? no: if metadata committed first, crash may leave file pointing to unwritten blocks with someone else's data ext3 "ordered data" mode: don't write file content to the log write content blocks to disk *before* commiting inode w/ new size and block # if no crash, there's no problem -- readers will see the written data if crash before commit: [diagram: i-node and bitmap and new data in memory, new data on disk too, but on-disk i-node and bitmap not updated] block has new data perhaps not visible, since i-node size and block list not updated no metadata inconsistencies i-node and free bitmap writes are still atomic most people use ext3 ordered mode correctness challenges w/ ordered mode: A. rmdir, re-use block for write() to some file, crash before rmdir or write committed after recovery, as if rmdir never happened, but directory block has been overwritten! fix: no re-use of freed block until freeing syscall committed B. rmdir, commit, re-use block in file, ordered file write, commit, crash+recover, replay rmdir file is left w/ directory content e.g. . and .. since file content write is not replayed fix: put "revoke" records into log, prevent log replay of a given block note: both problems due to changing the type of a block (content vs meta-data) so another solution might be to never do that Summary of rules The classic write-ahead logging rule: Don't write meta-data block to on-disk FS until committed in on-disk log Wait for all syscalls in T1 to finish before starting T2 Don't overwrite a block in buffer cache before it is in the log Don't free log space until all blocks have been written to FS Ordered mode: Write datablock to FS before commit Don't reuse free block until freeing syscall is committed Don't replay revoked syscalls another corner case: open fd and unlink open a file, then unlink it unlink commits file is open, so unlink removes dir entry but doesn't free blocks crash nothing interesting in log to replay inode and blocks not on free list, also not reachably by any name will never be freed! oops solution: add inode to linked list starting from FS superblock commit that along with remove of dir ent recovery looks at that list, completes deletions checksums recall: transaction's log blocks must be on disk before writing commit block ext3 waits for disk to say "done" before starting commit block write risk: disks usually have write caches and re-order writes, for performance sometimes hard to turn off (the disk lies) people often leave re-ordering enabled for speed, out of ignorance bad news if disk writes commit block before the rest of the transaction solution: commit block contains checksum of all data blocks on recovery: compute checksum of datablocks if matches checksum in commit block: install transaction if no match: don't install transaction ext4 has log checksumming does ext3 fix the xv6 log performance problems? synchronous write to on-disk log -- yes, but 5-second window tiny update -> whole block write -- maybe (if syscalls permit write absorbtion) synchronous writes to home locations after commit -- yes ext3/ext4 very successful!