6.S081/6.828 2019 Lecture 14: Linux ext3 crash recovery

Topic: high-performance file system logging via Linux ext3 case study

review: why logging?
  problem example: appending to a file
    FS writes multiple blocks on the disk, e.g.
      mark block non-free in bitmap
      add block # to inode addrs[] array
  it's dangerous for the FS to just do the disk writes
    a crash+reboot after one write would leave the FS incorrect
    very damaging if a block is both free and in use by a file
    such a file system lacks "crash recoverability"
  logging causes groups of writes to be atomic: all or none
    with respect to crashes
  logging is one way to get crash recoverability

review of xv6 logging
  [ on-disk FS, in-memory cache, on-disk log ]
  each system call is a transaction: begin, writes, end
  initially writes affect only cached blocks in memory
  at end of system call:
    1. write all modified blocks to log on disk
    2. write block #s and "done" to log on disk -- the commit point
    3. install modified blocks in home locations in FS on disk
    4. erase "done" from log on disk
       so logged blocks from next xaction don't appear committed
  if crash (e.g. computer loses power, but disk OK):
    reboot, then run log recovery code
    if commit "done" flag is set on disk
      replay all the logged writes to home locations
      clear the "done" flag
  write-ahead rule:
    don't start home FS writes until all writes committed in log on disk
    otherwise, if crash, recovery can neither un-do nor complete the xaction
  freeing rule:
    don't erase/overwrite log until all home FS writes are done

what's wrong with xv6's logging? it is slow!
  multiple "overhead" disk writes per syscall (2x commit writes)
  each syscall has to wait for all disk I/O -- "synchronous update"
    new syscalls have to wait for commit to finish
    prevents concurrent execution of multiple processes
  every block written twice to disk, once to log, once to FS
  so: creating an empty file takes 6 synchronous disk writes -- 60 ms
    thus only 10 or 20 creates per second; that's not much!

Linux's ext3 design
  case study of the details required to add logging to a file system
  Stephen Tweedie's 2000 talk transcript, "EXT3, Journaling Filesystem"
    http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html
  ext3 adds a log to ext2, a previous log-less file system
  has many modes; I'll start with "journaled data"
    log contains both metadata and file content blocks

ext3 structures:
  in memory:
    write-back block cache
    per-transaction info:
      sequence #
      set of block #s to be logged
      set of outstanding "handles" -- one per syscall
  on disk:
    FS
    circular log

what's in the ext3 log (== journal)?
  log superblock: log offset and seq # of earliest valid transaction
    (this is not the FS superblock; it's a block at the start of the log file)
  descriptor blocks: magic, seq, block #s
  data blocks (as described by descriptor)
  commit blocks: magic, seq
  the log can hold multiple transactions
    |super: offset+seq #| ... |Descriptor 4|...blocks...|Commit 4|Descriptor 5| ...

how does ext3 get good performance?
  asynchronous disk updates
  batching
  concurrency

asynchronous disk updates
  system calls do not immediately update the disk
    they only modify cached blocks
  why do async updates help performance?
    1. system calls return quickly
    2. I/O concurrency; app can do other things while disk writes
    3. batching...
  but: a crash may "forget" the last few seconds of completed system calls
  exception: fsync(fd) forces updates to log (and commit) before returning
    databases, text editors, &c use fsync()
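  for concreteness, a minimal sketch of the fsync() pattern in C
  (standard POSIX calls; error handling abbreviated):

    /* write() returns as soon as the cached blocks are modified; only
       fsync() waits until the FS has committed the data to the on-disk log. */
    #include <fcntl.h>
    #include <unistd.h>

    int save(const char *path, const char *buf, size_t n)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, n) != (ssize_t) n) /* fast: only dirties the cache */
            return -1;
        if (fsync(fd) < 0)                    /* slow: waits for the commit */
            return -1;
        return close(fd);
    }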
batching
  ext3 commits every few seconds
    so each transaction typically includes many syscalls
  why does batching help performance?
    1. amortize fixed transaction costs over many system calls
       fixed costs: descriptor and commit blocks, seek/rotate, syscall barrier
    2. "write absorption"
       many syscalls in the batch may modify the same block (bitmap, dirent)
       thus one disk write for many syscalls' updates
    3. disk scheduling: sort writes for fewer seeks, sequential runs

concurrency, in two forms
  1. multiple system calls may be adding to the current transaction
  2. multiple transactions may be at various stages:
     *one* "open" transaction that's accepting new syscalls
     committing transactions doing the log writes as part of commit
     committed transactions writing to home locations in FS on disk
     old transactions being freed
  why does concurrency help performance?
    many processes can be using the FS at the same time
    new system calls can proceed while old transaction(s) write the disk

ext3 sys call code:
  sys_unlink() {
    h = start()
    get(h, block #)
    ... modify the block(s) in the cache
    stop(h)
  }
  start():
    tells the logging system about a new system call
    can't commit until all start()ed system calls have called stop()
    start() can block this sys call if needed
  get():
    tells the logging system we'll modify the cached block
    added to list of blocks to be logged
  stop():
    stop() does *not* cause a commit
    transaction can commit iff all included syscalls have called stop()

committing a transaction to disk
  1. block new syscalls
  2. wait for in-progress syscalls to stop()
  3. open a new transaction, unblock new syscalls
  4. write descriptor to on-disk log w/ list of block #s
  5. write modified blocks from cache to on-disk log
  6. wait for all log disk writes to finish
  7. write the commit record to on-disk log
  8. wait for the commit disk write to finish
     this is the commit point
  9. now modified blocks are allowed to go to their homes on disk (but not forced)

when can ext3 re-use transaction T2's log space?
  log is circular: SB T4 T1 T2 T3
  T2's log space can be freed+reused if:
    all transactions prior to T2 have been freed in the log, and
    T2's blocks have all been written to home locations in FS on disk
  freeing writes the on-disk log superblock with the offset/seq of the
    resulting oldest transaction in the log

what if a crash?
  crash causes RAM (and thus cached disk blocks) to be lost
    but disk is assumed to be OK
  crash may interrupt writing the last xaction to the on-disk log
    so the on-disk log may have a bunch of complete xactions, then one partial
  may also have written some of block cache to disk
    but only for fully committed xactions, not the partial last one

how ext3 recovery works
  0. reboot with intact disk
  1. log "superblock" contains offset and seq # of oldest transaction in log
  2. find the end of the log
     scan until bad magic (missing commit) or unexpected seq # (old entry)
     go back to the last valid commit block
     crash during commit -> no commit block, so recovery ignores that xaction
  3. replay all blocks through the last valid commit, in log order

what if the block after the last valid commit block looks like a log descriptor?
  i.e. looks like the start of a new transaction?
  perhaps a descriptor block left over from a previous use of the log?
    no: its seq # will be too low
  perhaps some file data happens to look like a descriptor?
    no: a logged data block cannot contain the magic number!
        ext3 forbids the magic number in logged data blocks; if data contains
        it, the logged copy is escaped and a flag is set in the descriptor

what if another crash during log replay?
  after reboot, recovery will replay exactly the same log writes
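  a compilable sketch of that recovery scan, under assumed (hypothetical)
  on-disk record layouts and block-read helpers -- not ext3's real journal
  format (wrap-around of the circular log is omitted):

    #include <stdint.h>

    #define LOGMAGIC 0xdeadbeefu  /* hypothetical magic value */
    #define NENTRIES 64           /* hypothetical descriptor capacity */

    struct logsb   { uint32_t start, seq; };  /* log superblock */
    struct dscr    { uint32_t magic, seq, nblocks, homes[NENTRIES]; };
    struct commitb { uint32_t magic, seq; };

    /* assumed helpers that read blocks of the on-disk log: */
    struct logsb   read_log_superblock(void);
    struct dscr    read_descriptor(uint32_t off);
    struct commitb read_commit(uint32_t off);
    void           replay(uint32_t start, uint32_t end);  /* log -> homes */

    void
    recover(void)
    {
        struct logsb sb = read_log_superblock();
        uint32_t off = sb.start, seq = sb.seq, end = off;
        for (;;) {
            struct dscr d = read_descriptor(off);
            if (d.magic != LOGMAGIC || d.seq != seq)
                break;              /* bad magic or old seq #: end of log */
            struct commitb c = read_commit(off + 1 + d.nblocks);
            if (c.magic != LOGMAGIC || c.seq != seq)
                break;              /* missing commit: partial xaction, ignore */
            end = off + 1 + d.nblocks + 1;  /* committed through here */
            off = end;
            seq++;
        }
        replay(sb.start, end);      /* install logged blocks, in log order */
    }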
that was the straightforward part of ext3.
there are also a bunch of tricky details!

why does ext3 delay the start of T2's syscalls until all of T1's syscalls complete?
  i.e. why this:
    T1: |-syscalls-|
    T2:            |-syscalls-|
    T3:                       |-syscalls-|
  this barrier sacrifices some performance
  the bad scenario that the barrier prevents:
    file y exists, with i-node 17
    T1 opens
    syscall T1a starts, creating a file x
    T1 closes but is waiting for T1a
    T2 opens
    syscall T2a starts, removing y
    T2a marks i-node 17 free
    T1a allocates i-node 17
    T1a creates x with i-node 17
    T1a finishes
    T1 commits
    crash (before T2 commits)
    recovery sees T1 but not T2
    so y will still exist (since T2 didn't commit)
    but so will x
    and they both refer to the same i-node!
  ext3 forces T2a to wait until T1a is done to avoid this

what if a T1 syscall writes a block, T1 is committing,
    and a syscall in T2 writes the same block?
  e.g. create d/f in T1, create d/g in T2
    both of them write d's directory content block
  so:
    T1: |--d/f--|-logWrites-|
    T2:         |--d/g--
                            crash
  if crash after T1 finishes committing, but before T2 commits:
    directory entry d/g will exist (T2's change to d's content block was
    written to the log as part of T1's commit), but it points to an
    i-node that was never initialized (since T2 didn't commit)
  ext3's solution:
    give T1 a private copy of the block cache as it existed when T1 closed
    T1 commits from this snapshot of the cache
    it's efficient using copy-on-write
    the copies allow syscalls in T2 to proceed while T1 is committing

what if there's not enough free log space for a transaction's blocks?
  free the oldest transaction(s), i.e. install their writes in the on-disk FS

what if so many syscalls have started that the entire log isn't big enough
    to hold their writes, even if all older transactions have been freed?
  (unlikely, since each syscall generates few writes compared to log size)
  each syscall passes the max number of blocks it needs to start()
  start() waits if the total for the current transaction is too high,
    and works on freeing old transactions
  (a sketch of such a reservation scheme follows below)

another (complex) reason for reservations:
  T1 writes block 17
  T1 commits
  T1 finishes writing its log and commit record
    but has not yet written block 17 to its home location
  T2 writes block 17 in the cache
    ext3 does *not* make a copy of block 17
    ext3 only copies if needed to write T1's log
  a T2 syscall is executing, and does a write for which there's no log space
  can ext3 install T1's blocks to their home locations on disk,
      and free T1's log space?
    no: the cache no longer holds T1's version of block 17
    (ext3 could read the block from the log, but it doesn't)
  reservations detect the potential log space exhaustion in advance,
    and prevent that T2 syscall from starting
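  a sketch of reservations in start(), in the style of the sys_unlink()
  pseudocode above (the lock and helper names are hypothetical;
  the real ext3/JBD interface differs):

    /* Each syscall declares up front the max number of blocks it might
       log.  start() admits the syscall only if the open transaction's
       total reservation still fits in the log; otherwise it frees old
       transactions (and waits) before letting the syscall begin. */
    typedef struct handle handle_t;
    struct xaction {
        int reserved;   /* log blocks promised to admitted syscalls */
        int handles;    /* syscalls that have start()ed but not stop()ed */
    };

    /* assumed helpers: a lock, free-space accounting, handle allocation */
    void acquire(void); void release(void);
    int  log_free_space(void);
    void free_oldest_transaction(void);
    handle_t *new_handle(struct xaction *t, int nblocks);

    handle_t *
    start(struct xaction *t, int nblocks)
    {
        acquire();
        while (t->reserved + nblocks > log_free_space())
            free_oldest_transaction();  /* install its blocks in FS; may wait */
        t->reserved += nblocks;
        t->handles += 1;
        release();
        return new_handle(t, nblocks);
    }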
so far we've been talking about "journaled data" mode
  in which file content blocks are written to the log
    as well as meta-data (i-nodes, directory content, free bitmaps)
  so file content is included in the atomicity guarantee
  e.g. when appending data to a file, the new data in the newly allocated
    block, the updated block bitmap, and the updated i-node are all logged;
    all or none

"ordered data" mode
  logging file content is slow, every data block written twice
    and data is usually much bigger than meta-data
  can we entirely omit file content from the log?
  if we did, when would we write file content to the FS?
  can we write file content blocks at any time at all?
    no: if the metadata is committed first, a crash may leave the file
      pointing to blocks that were never written, perhaps blocks freed
      from someone else's deleted file
  ext3 "ordered data" mode: (don't write file content to the log)
    write content blocks to disk *before* committing the inode
      w/ new size and block #
    if no crash, there's no problem -- readers will see the written data
    if crash after the data write, but before the commit:
      block on disk has the new data
      but it's not visible, since i-node size and block list were not updated
      no metadata inconsistencies
        neither i-node nor free bitmap were updated, so the blocks are still free
  most people use ext3 ordered data mode

correctness challenges w/ ordered data mode:
  A. rmdir, re-use directory content block for write() to some file,
       crash before rmdir or write committed
     after recovery, it's as if rmdir never happened,
       but the directory content block has been overwritten!
     fix: no re-use of a freed block until the freeing syscall is committed
  B. rmdir, commit, re-use block in file, ordered file write,
       commit, crash+recover
     rmdir is replayed, but the file content write is not!
       the file is left w/ directory content, e.g. . and ..
     fix: put "revoke" records into the log,
       preventing log replay of a given block
  note: both problems are due to changing the type of a block
    (content vs meta-data)
    so another solution might be to never do that

Summary of ext3 rules
  The basic write-ahead logging rule:
    Don't write a meta-data block to the on-disk FS
      until it is committed in the on-disk log
  Wait for all syscalls in T1 to finish before starting T2
  Don't overwrite a block in the buffer cache before it is in the log
  Don't free an on-disk log transaction until all its blocks
    have been written to the FS
  Ordered mode:
    Write the data block to the FS before commit
    Don't reuse a freed block until the freeing syscall is committed
    Don't replay revoked blocks

another corner case: open fd and unlink
  open a file, then unlink it
    the file is open, so unlink removes the dir entry
      but doesn't free the inode or blocks
  the unlink commits (just the removal of the dir entry)
  crash
  the log doesn't contain the writes that free the i-node or blocks
  the inode and blocks are not on the free list,
    and also not reachable by any name
  they will never be freed! oops
  solution:
    add the inode to a linked list starting from the FS superblock
      commit that along with the removal of the dir entry
    recovery looks at that list, completes the deletions

checksums
  recall: a transaction's log blocks must be on disk
      before writing the commit block
    ext3 waits for the disk to say "done" before starting the commit block write
  risk: disks usually have write caches and re-order writes, for performance
    sometimes hard to turn off (the disk lies)
    people often leave re-ordering enabled for speed, out of ignorance
    bad news if the disk writes the commit block before the rest of the transaction
  solution: the commit block contains a checksum of all the data blocks
    on recovery: compute the checksum of the data blocks
      if it matches the checksum in the commit block: install the transaction
      if no match: don't install the transaction
    (a sketch of checksum-verified replay appears at the end of these notes)
  ext4 has log checksumming, but ext3 does not

does ext3 fix the xv6 log performance problems?
  synchronous disk updates -- fixed with async updates
  most sys calls generate lots of disk writes -- fixed with batching
  little concurrency -- fixed with batching, and concurrent commit
  every block written twice -- partially fixed with ordered data mode

ext3 very successful!
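as promised above, a sketch of checksum-verified replay, reusing the
hypothetical struct dscr and LOGMAGIC from the recovery sketch earlier,
with a checksum field added to the commit block (again, not the real
ext3/ext4 on-disk format):

    /* Recovery no longer trusts write ordering: it recomputes a checksum
       over the transaction's logged blocks and installs the transaction
       only if the result matches the checksum that was stored in the
       commit block at commit time. */
    struct commitb2 { uint32_t magic, seq, csum; };

    /* assumed helpers: */
    uint32_t block_checksum(uint32_t sum, uint32_t off);  /* fold in a block */
    struct commitb2 read_commit2(uint32_t off);

    int
    commit_is_valid(struct dscr *d, uint32_t off)
    {
        struct commitb2 c = read_commit2(off + 1 + d->nblocks);
        if (c.magic != LOGMAGIC || c.seq != d->seq)
            return 0;                      /* not a commit block at all */
        uint32_t sum = 0;
        for (uint32_t i = 0; i < d->nblocks; i++)
            sum = block_checksum(sum, off + 1 + i);  /* logged blocks */
        return sum == c.csum;              /* install iff checksums match */
    }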