6.1810 2022 Lecture 16: Linux ext3 crash recovery

This lecture: a case study of logging in Linux's ext3 file system
  with an eye to performance and real-world design details

Why do we care about logging?
  crash recovery for complex on-disk data structures e.g. FS or DB
  often possible to add to existing designs
  widely used, very successful, good performance
  lots to chew on for performance, correctness, flexibility

Review: the basic idea (both ext3 and xv6)
  [diagram: cached blocks, FS on disk, log on disk]
  system calls ("operations") initially only write the buffer cache, not disk
    often write multiple blocks, e.g. i-node and free-block bitmap
  the goal: atomic (all-or-nothing) with respect to crashes
  commit:
    1. write all updated blocks to log on disk
    2. mark on-disk log as committed (i.e. contains all of operation's writes)
    3. later, write blocks to home locations in FS
  recovery after crash:
    if log marked as committed, copy blocks from log to home locations
  write-ahead rule:
    don't start home FS writes until all writes committed in log on disk
    i.e. don't start modifying FS until *all* the operation's
         writes are in the log, so the operation can be completely
         finished if there's a crash.
  freeing rule:
    don't erase/overwrite log until all home FS writes are done

ext3 compared to xv6:
  less waiting (more concurrency) and fewer disk writes
  subsequent ops don't need to wait for prev xaction to finish committing
    (log holds multiple transactions)
  does not force write to home location immediately after log commit
    write only needed before log wraps
    thus better write absorbtion
  ordered data mode:
    does not write data to journal, so data only written once
  but: concurrency and ordered data add complexity

ext3 data structures:
  [cache, info, disk FS tree, disk log]
  in memory:
    write-back block cache
    current xaction info:
      sequence #
      block #s to be logged
      outstanding "handles" -- one per syscall
    prev xaction info...
  on disk:
    FS
    circular log

the ext3 log can hold multiple transactions
  |SB: offset+seq #| ... |D4|...blocks...|C4|D5|...
  log superblock: log offset and seq # of earliest valid transaction
    (this is not the FS superblock; it's a block at start of log file)
  descriptor blocks: magic, seq, block #s
  data blocks (as described by descriptor)
  commit blocks: magic, seq

an ext3 transaction is really a "compound" transaction
  ext3 "batches" updated blocks from many system calls into each transaction
  ext3 commits the current "open" transaction every few seconds

ext3's batching helps performance:
  1. spreads fixed transaction costs over many system calls
     fixed costs: descriptor and commit blocks, seek/rotate, syscall barrier
  2. "write absorbtion"
     many syscalls in the batch may modify the same block (bitmap, dirent)
     thus one disk write for many syscalls' updates
  3. disk scheduling: sort writes for fewer seeks, sequential runs

ext3 also gets performance from two kinds of concurrency:
  1. multiple system calls can add to the current transaction
  2. multiple transactions may be at various stages:
     * one "open" transaction that's accepting new syscalls 
     * committing transactions doing their log writes
     * committed transactions writing to home locations in FS on disk
     * old transactions being freed

why does concurrency help performance?
  ext3 tries to avoid forcing system calls to wait
  CPU concurrency: many processes can execute FS ops in parallel
  I/O concurrency: processes can operate on FS in cache while
    previous transactions are writing the disk

a crash may "forget" the last few seconds of operations
  even if the system calls returned with no error
  b/c system calls return before effects are committed to disk
  careful applications need to be aware (e.g. DB or mail server or editor)
  fsync(fd) forces updates to log (and commit) before returning
    databases, text editors, &c call fsync()

kernel implementation of ext3 system calls:
  sys_unlink() {
    h = start()
    get(h, block #) ...
    modify the block(s) in the cache
    stop(h)
  }
  start():
    tells the logging system about a new system call
      can't commit until all start()ed system calls have called stop()
    start() can block this sys call if needed
  get():
    tells logging system we'll modify cached block
      added to list of blocks to be logged in current transaction
  stop():
    stop() does *not* cause a commit
    transaction can commit iff all included syscalls have called stop()

committing a transaction to disk
  (this happens in the background)
  1. block new syscalls
  2. wait for in-progress syscalls to stop()
  3. open a new transaction, unblock new syscalls
  4. write descriptor to on-disk log w/ list of block #s
  5. write modified blocks from cache to on-disk log
  6. wait for descriptor and data log disk writes to finish
  7. write the commit record to on-disk log
  8. wait for the commit disk write to finish
     this is the "commit point"
  9. now modified blocks allowed to go to homes on disk (but not forced)

when can ext3 re-use log space?
  log is circular: SB T4 T1 T2 T3
  T2's log space can be reused if:
    T2 has committed
    and T2's blocks have all been written to home locations in FS on disk
      ==> no longer any need to replay from log after a crash
    and all transactions prior to T2 have been freed in the log
  freeing writes on-disk log superblock with
    offset/seq of resulting oldest transaction in log

what if a crash?
  crash causes RAM (and thus cached disk blocks) to be lost
    but disk h/w is assumed to be readable after restart
  crash may interrupt writing last xaction to on-disk log
  so on-disk log may have a bunch of complete xactions, then one partial
  may also have written some of block cache to disk
    i.e. FS *partially* updated, not safe to use
    e.g. file i-node has new block, but bit not cleared in free bitmap

how ext3 recovery works
  SB T8 xT5 T6 T7
    SB points to T6
    T5 is free, T8 has overwritten start of T5
  0. reboot with intact disk
  1. look in log "superblock" for offset and seq# of oldest transaction
  2. find the end of the log
     scan until bad magic (missing commit) or unexpected seq # (old entry)
     end of log is last valid commit block; ignore final partial transaction
  3. replay all blocks through last valid commit, in log order
     thus completing those partially-written operations

what if block after last valid commit block looks like a log descriptor?
  i.e. looks like the start of a new transaction?
  perhaps a descriptor block left over from previous use of log?
    seq # will be too low -> end of log
  perhaps some file data happens to look like a descriptor?
    ext3 prevents this from happening
    ext3 replaces magic number in data blocks in journal with zero
    and sets a flag for that block in the descriptor

what if another crash during log replay?
  after reboot, recovery will replay exactly the same log writes

that was the straightforward part of ext3.
  there are also a bunch of tricky details!

why does ext3 delay start of T2's syscalls until all of T1's syscalls finish?
  i.e. why this:
    T1: |-syscalls-|
    T2:             |-syscalls-|
              ---time--->
  this barrier sacrifices some performance
  example problem that this barrier prevents:
    T1: |-create(x)-|
    T2:        |-unlink(y)-|
                       X crash
              ---time--->
  if we allow unlink to start before create finishes
    unlink(y) free y's i-node -- i.e. marks it free in block cache
    create(x) may allocate that same i-node -- i.e. read unlink's write
  if T1 commits, but crash before T2 commits, then
    y will exist (the unlink did not commit)
    x will exist but use the same i-node as y!
  we can't let T1 see T2's updates!
    so: don't start T2's syscalls until T1's syscalls are done
  (note ext3's copy-on-write doesn't fix this: it gives T1
   a copy of all blocks as of the end of T1, but the unlink
   executed before T1 ended.)
  (note we can't give T2 its own copies of the blocks, separate from
   T1's copies, since generally we do want later system calls to
   see the effects of earlier system calls.)
    
why commit needs to write on-disk log from a snapshot
  consider this:
  T1: |--create(d/f)--|
  T2:                   |--create(d/g)--|
  both create()s write d's directory content block
  suppose T1 writes log to disk *after* T2's create()
  we can't let T1 write T2's update to the log!
    then a crash+recovery might replay some but not all of T2
    e.g. make the directory entry d/g but without initializing g's i-node
  ext3's solution:
    give T1 a private copy of the block cache as it existed when T1 closed
    T1 commits from this snapshot of the cache
    it's efficient using copy-on-write
    the copies allow syscalls in T2 to proceed while T1 is committing

so far we've been talking about EXT3's "journaled data" mode
  in which file content blocks are written to log
  as well as meta-data (i-nodes, directory content, free bitmaps)
  so file content is included in atomicity guarantee
  e.g. a write() that grows a file results in these log entries:
    [descriptor], i-node, block bitmap, content, [commit]

logging file content is slow, every data block written twice to disk
  data is usually much bigger than meta-data
  can we entirely omit file content from the log?
  if we did, when would we write file content to the FS?

can we write file content blocks at any time at all?
  no: if metadata is committed first, crash may leave i-node pointing
      to blocks from someone else's deleted file, with previous
      file's content!

ext3 "ordered data" mode:
  write file content blocks to disk *before* commiting log
  (and don't write file content to the log)
  thus, for a write():
    1. write file content blocks to FS on disk,
    2. wait until content writes finish,
    3. commit updated i-node (w/ block numbers and increased size) to the log
  if crash after data write, but before commit:
    block on disk has new data
    but not visible, since i-node size and block list not updated
    no metadata inconsistencies
      neither i-node nor free bitmap were updated, so blocks still free
  most people use ext3 ordered data mode
    it's much faster: avoids writing data twice

why does it make sense for the log to omit file content?
  a primary goal of the log is to protect FS internal invariants
    e.g. a block cannot be both free and used in an i-node
  from this point of view, file content is not important.
  in any case, the file system doesn't know which groups of application
    writes need to be atomically written.
  instead, applications call fsync() when they need to wait
    for data to be safe on the disk.

correctness challenges w/ ordered data mode:
  * 1. rmdir() frees directory content block, #27
    2. file write() (in same compound xaction) allocates block #27
    3. write() writes block #27 on disk
    4. crash before rmdir is committed
    after recovery, as if rmdir never happened,
      but directory content block (#27) has been overwritten!
    fix: no re-use of freed block until freeing syscall committed
  * rmdir, commit, re-use block in file, ordered file write, commit,
      crash+recover, replay rmdir
    1. rmdir
       erases "." and ".." in content block, which is logged
       then frees the content block
    2. rmdir commits
    3. write() allocates that block, and writes it (with ordered data mode)
    4. write commits
    5. crash, reboot, log is replayed, including the rmdir's write
       but write()'s data write was not logged, not replayed
    result: file is left w/ directory content
    fix: add "revoke" log record when freeing dir block,
         prevents prior log records from writing that block.
  note: both problems due to changing the type of a block (content vs meta-data)
    so another solution might be to never change a block's type
	 
Summary of ext3 rules
  The basic write-ahead logging rule:
    Don't write meta-data block to on-disk FS until committed in on-disk log
  Wait for all ops in T1 to finish before starting T2
  T1 writes log from a snapshot copy of cached blocks as of end of T1
  Don't free on-disk log transaction until all blocks have been written to FS
  Ordered data mode:
    Write datablock to FS before commit
    Don't reuse free block until freeing syscall is committed
    Don't replay revoked syscalls

another corner case: open fd and unlink
  open a file, then unlink it
  file is open, so unlink removes dir entry but doesn't free inode or blocks
  unlink commits (just removal of dir entry)
  crash
  log doesn't contain writes that free the i-node or blocks
  inode and blocks not on free list, also not reachably by any name
    will never be freed! oops
  solution: add inode to linked list starting from FS superblock
    commit that along with remove of dir ent
  recovery looks at that list, completes deletions

what if there's not enough free log space for a transaction's blocks?
  free oldest transaction(s)
  i.e. install its writes in the on-disk FS

what if so many syscalls have started that the entire log isn't
  big enough to hold their writes, even if all older transactions
  have been freed?
  (unlikely, since each syscall generates few writes compared to log size)
  syscall passes max number of blocks needed to start()
  start() waits if total for current transaction is too high
    and works on freeing old transactions

another (complex) reason for reservations:
  T1 updates block 17
  T1 commits
  T1 finishes writing its log and commit record
     but has not yet written block 17 to home location
  T2 updates block 17 in the cache
  a T2 syscall is executing, does a write for which no log space
  can ext3 install T1's blocks to home locations on disk,
    and free T1's log space?
  no: the cache no longer holds T1's version of block 17
    (ext3 could read the block from the log, but it doesn't)
  reservations detect the potential log space exhaustion,
    prevent that T2 syscall from starting

checksums
  recall: transaction's log blocks must be on disk before writing commit block
    ext3 waits for disk to say "done" before starting commit block write
  risk: disks usually have write caches and re-order writes, for performance
    sometimes hard to turn off (the disk lies)
    people often leave re-ordering enabled for speed, out of ignorance
    bad news if disk writes commit block before the rest of the transaction
  solution: commit block contains checksum of entire logged transaction
    on recovery: compute checksum of transaction
	if matches checksum in commit block: install transaction
    if no match: don't install transaction
  ext4 has log checksumming, but ext3 does not

ext3 wrap-up
  write-ahead logging provides crash recovery
  lots of concurrency -- and complexity
  but very successful

References:
Stephen Tweedie 2000 talk transcript "EXT3, Journaling Filesystem"
  http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html
log format details
  https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout
https://www.usenix.org/system/files/login/articles/04_tso_018-021_final.pdf