6.824 2012 Lecture 10: Logging and Recovery

Start of new big topic: fault tolerance
  Start with single computers
  Then distributed systems

Today's lecture: crash recovery for storage
  logging
  FSD case study

Overall topic: crash recovery
  Updating persistent (on-disk) storage in the face of crashes
  Challenge: multi-write operation, crash partway through
  Overall game:
    Write the disk carefully
    Recover after restart: complete or roll back partial operation
  Assuming fail-stop crashes; no malice, no bugs

Why storage? Why not crash recovery for computation?
  Crash recovery for computation has to involve persistent storage
    E.g. periodically save state of computation to disk
  You need today's techniques to save that state reliably

Big idea: atomic operations
  Suppose an operation has many steps
  It is atomic w.r.t. crashes if:
    After restart+recovery, you see the effect of all steps, or none
  We are *not* worried about atomicity in the face of concurrency today

Let's talk about atomic update of file system data structures.
  E.g. UNIX file system. Or YFS updating the extent server.

Can you update the on-disk FS in place?
  For example, a file create might involve these disk writes:
    1. Allocate an i-number from a free-bitmap.
    2. Append name/i-number to directory.
    3. Initialize i-node (size, owner, block list, &c).
  Why is the above order potentially bad?
    Crash between 1 and 2.
    Crash between 2 and 3.
  Would a different order be better?
    1. Allocate i-number.
    2. Initialize i-node.
    3. Append to directory.
  Idea: can view write #3 as a "committing write"
    Crash before: no change visible
    Crash after: complete change is visible

Synchronous Meta-Data Update
  Organize all system calls to have a committing write
  Write the disk in a careful order
  Wait for each write to finish before doing the next
  Scan the disk and fix things after a crash
    e.g. allocated but unreferenced i-nodes

Why aren't synchronous meta-data updates enough?
  They're slow
    Must write to disk, not cache
    10 ms per disk I/O
    Can't re-order disk writes to decrease seek time
  Recovery may require checking/scanning the whole FS (e.g. fsck)
  Some operations don't have an obvious single committing write
    Example: UNIX rename
    Programs use rename to reveal only a complete file:
      echo stuff > /tmp/t0
      mv /tmp/t0 file
    Need to update two directories, stored in two blocks on disk.
    Remove then add? Add then remove?
      Probably want add then remove.
    What if a crash? What does the post-crash checker do?
      It knows something is wrong: the link count is 1, but there are two links.
      Can't roll back -- which link should it delete?
      Has to leave both links and increase the link count.
    This is *not* a legal result of rename!
      But at least we haven't lost the file.
  Sync meta-data updates are slow *and* have the wrong semantics

You can push tree update one step farther.
  Prepare a new copy of the entire affected sub-tree.
  Replace the old sub-tree in one final write.
  Very expensive if done in the obvious way.
  But you can share structure between the old and new trees.
    Only need new storage between the change points and the sub-tree root.
  (NetApp WAFL does this and more.)
  This approach only works for tree data structures.
    And it doesn't support concurrent operations very well.

What's the basic idea of logging?
  The log lives on disk, separate from the file system
  Each operation pre-declares all its writes in the log
  Create would say in the log that it *will* write:
    dwrite(bn, data) -- free list
    dwrite(bn, data) -- dir
    dwrite(bn, data) -- inode
    END
  Start writing the FS only after the intentions are all in the log on disk
  If crash:
    Recover the entire op from the log
  What if crash while writing the log?
    No END -> recovery ignores the operation's log entries

What are the reasons to use logging?
  Atomic commit of compound operations, w.r.t. crashes.
  Fast recovery (unlike fsck).
  Well-defined post-recovery state: a serial prefix of the operations.
    In the echo/mv example, if you see the file, you see the new content too.
  Can be applied to almost any existing data structure
    e.g. database tables, free lists
  Helps performance by turning random writes into sequential writes
  Useful to coordinate updates to distributed data structures
    let's all do this operation
    oops, some nodes crashed
    look in the logs to help pick up the pieces

Software structure of a logging file system
  Apps (system calls)
  FS (block writes)
  Log + Disk cache
  Disk

Terminology
  in-memory data cache
    dirty vs clean
  on-disk FS
  [in-memory log]
  on-disk log

Transactions
  Most FS operations involve multiple disk writes
  Atomic == all writes or none
  So the logging system needs to know operation boundaries
  A "transaction" is a multi-write operation that should be atomic
    create()
      begin_transaction
      update free list
      update i-node
      update directory entry
      end_transaction

Let's firm up the details of our super-simple log ("Design 1"):
  Where is the log?
    On disk, at a known block address
    Holds just one transaction at a time
  Log format?
    Start block:
      # of data blocks the xaction will dirty
      array of block addresses
      COMMIT flag
    A sequence of data blocks
  What does BEGIN_TRANSACTION do?
    Lock the log (wait until the previous operation is totally done)
  What does an operation write(bn,data) do?
    Update the in-memory data cache (but do *not* write the on-disk FS)
    Write data to the end of the log, on disk
    Remember bn
  What does END_TRANSACTION do?
    Write the block addresses to the start block
    Set the COMMIT flag in the start block
    Then write the cached blocks to the on-disk FS
    Then clear the COMMIT flag
  What does recovery do?
    If the start block has the COMMIT flag set:
      Copy each block to its place in the on-disk FS
      Clear the COMMIT flag
    Else: do nothing
  This is a "write-ahead" "re-do" log
    Write-ahead = all of an xaction's writes are in the log before any are
      written to the on-disk FS
    Re-do = recovery consists of replaying written data from the log
      To finish any potentially interrupted END_TRANSACTIONs
  What if crash during the operation, before END_TRANSACTION?
  What if crash during END_TRANSACTION?
  What if crash during recovery?
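A toy Python sketch of Design 1, to make the steps concrete. This is my own
illustration, not code from the lecture or the paper: the dicts standing in
for the disk and cache, the block numbers, and names like twrite() are all
invented.

  LOG_START = 1000           # assumed address of the log's start block
  LOG_DATA = 1001            # first log data block

  disk = {}                  # block number -> data (stand-in for the disk)
  cache = {}                 # in-memory data cache
  pending = []               # block numbers this transaction will dirty

  def begin_transaction():
      # a real system would lock the log here (one transaction at a time)
      del pending[:]

  def twrite(bn, data):
      cache[bn] = data                      # update the cache, *not* the on-disk FS
      disk[LOG_DATA + len(pending)] = data  # append a copy to the on-disk log
      pending.append(bn)                    # remember bn for the start block

  def end_transaction():
      # committing write: block addresses plus COMMIT flag, in one start block
      disk[LOG_START] = ("COMMIT", list(pending))
      for i, bn in enumerate(pending):      # now install into the on-disk FS
          disk[bn] = disk[LOG_DATA + i]
      disk[LOG_START] = ("FREE", [])        # finally clear the COMMIT flag

  def recover():
      flag, addrs = disk.get(LOG_START, ("FREE", []))
      if flag == "COMMIT":                  # re-do: replay the logged blocks
          for i, bn in enumerate(addrs):
              disk[bn] = disk[LOG_DATA + i]
          disk[LOG_START] = ("FREE", [])
      # no COMMIT flag -> the partial transaction is simply ignored

Note that recover() is idempotent: if we crash during recovery and run it
again, the result is the same -- which is the answer to the "crash during
recovery" question above.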
What are the defects of Design 1?
  Slow: every block is written twice
  Slow: each log write takes one disk revolution

Idea: delay writes from the cache to the on-disk FS
  The point: hope for "write absorption"
    Multiple operations may modify the same block
    If we delay, maybe we write the block to disk once, not many times
    Example: create lots of files in a single directory
    Example: create a temporary file, then remove it
  But now we must store multiple transactions in the on-disk log
    Otherwise we cannot delay writes to the on-disk FS
    Can't erase a transaction from the log until we write the on-disk FS
  And we need a new story for when to re-use on-disk log space

Design 2: logging with delayed writes to the on-disk FS
  Log format:
    Again, the log sits in a known range of blocks on disk
    Header block: block # of the first valid log record
    Start block: TID, # blocks, block addresses, COMMIT flag
    Data blocks...
    (then maybe another start block and data blocks, &c)
  BEGIN_TRANSACTION and block writes same as in Design 1
  What does END_TRANSACTION do?
    Write the block addresses to this transaction's start block
    Set the COMMIT flag in the start block
    (but don't write data from the cache to the on-disk FS)
  When to write data from the cache to the on-disk FS?
    After the transaction's COMMIT flag is set
    Before re-using the transaction's space in the log
  When to re-use on-disk log area?
    When the log is full -- typically the log wraps around to the start
    So free log space in order, oldest transaction first
    Before freeing a transaction's space in the log:
      Must first force the transaction's cached data to the on-disk FS
        But only if the cached block has *not* been modified by a subsequent xaction
        Why is that?
      Recovery needs to know where the log starts -- the earliest valid xaction
        So update the pointer in the header block to point past the freed xaction
      Clear the freed xaction's COMMIT flag
    Now we can re-use the freed xaction's log space
  What does recovery do?
    The header block says where to start looking at the log
    Look at each xaction in turn, wrapping around
    Stop if the COMMIT flag is not set
    Replay the logged blocks
  Must recovery examine log records in commit order?
    E.g. why not replay in reverse order?
  Suppose a transaction in the middle of the log is damaged (disk block goes bad)
    E.g., TID 10 OK, TID 11 damaged, TID 12 OK
    Is it OK to replay 10 and 12 but skip 11?
  Problem: what if a data block looks just like a log Start block?
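A toy Python sketch of Design 2's recovery scan and log-space reuse, again my
own illustration: log records are Python tuples of the form
("COMMIT", tid, [(bn, data), ...]), the log area is a fixed-size list, and
disk/cache are the dicts from the previous sketch.

  LOG_LEN = 64               # number of record slots in the on-disk log area
  log = [None] * LOG_LEN     # stand-in for the on-disk log
  HEADER = 2000              # disk block holding the index of the oldest record

  def recover():
      i = disk.get(HEADER, 0)     # header block: where the valid log starts
      for _ in range(LOG_LEN):    # scan forward, wrapping around
          rec = log[i]
          if rec is None or rec[0] != "COMMIT":
              break               # stop at the first uncommitted xaction
          _, tid, writes = rec
          for bn, data in writes:
              disk[bn] = data     # replay; commit order means later xactions win
          i = (i + 1) % LOG_LEN

  def free_oldest():
      # called when the log is full, to reclaim the oldest xaction's space
      i = disk.get(HEADER, 0)
      _, tid, writes = log[i]
      for bn, data in writes:
          # force the data to the on-disk FS before the log entry goes away
          # (the notes add: write the cached copy only if no later xaction has
          # modified this block -- a check this toy omits)
          disk[bn] = cache.get(bn, data)
      disk[HEADER] = (i + 1) % LOG_LEN  # point the header past the freed xaction
      log[i] = None                     # clear its COMMIT flag / free the slot

Replaying in commit order matters when two transactions in the log wrote the
same block: the later write must win, which the forward scan gets for free.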
What are the defects of Design 2?
  Slow: each log write (or transaction) takes one disk revolution
    7200 RPM = 120 disk writes per second
    So < 30 3-write transactions/second -- not very fast!
  If we could somehow do huge sequential writes to the log:
    50 MB/sec = 5,000 10-kbyte transactions/second

Idea: delay log writes
  END_TRANSACTION appends to an in-memory log
  Force the in-memory log to disk every second or so
    One big sequential write, which includes many transactions
  Called "group commit"
  Note: now completed system calls can disappear after a crash!

Design 3: group commit
  Same log format as for Design 2
    Except the transaction's start block must hold a checksum of all its data
      In case the big disk write is interrupted
  BEGIN_TRANSACTION same as in Design 1/2
  When a transaction writes a block:
    Modify it in the cache, append a copy to the in-memory log
  What does END_TRANSACTION do?
    Compute the checksum of the transaction's blocks, put it in the in-memory start block
    Set the COMMIT flag in the start block in the in-memory log
  When to force data from the in-memory log to the on-disk log?
    At any time -- but only whole transactions
  When to write data from the cache to the on-disk FS?
    After the transaction commits
    After the transaction has been forced from the in-memory log to the on-disk log
  Note there are three log regions:
    1. data guaranteed written to the on-disk FS
    2. data that might be written to the on-disk FS
    3. data that can *not* have been written to the on-disk FS
    Part 1 can be freed
    Part 2 is on disk but cannot be freed
    Part 3 is either not COMMITted or only in the in-memory log
  Potential problem:
    We're getting performance by doing huge sequential log writes to disk
    A huge write may fail in the middle
      The disk may not write in the order we ask it to
    So, during recovery, any subset of the tail end of the log may be missing
      Blocks will be present on disk but contain old data/start blocks
    The checksum in the start blocks ensures each transaction is complete
    Recovery must stop at the first damaged/missing/uncommitted xaction

Are there defects in Design 3?
  Each FS modification turns into a whole block appended to the log
    Create many files in a directory -> the directory block is logged over and over
  Could group many operations into one transaction
    Just log the block once, after the full set of operations
  Could log a short description of the operation, rather than the whole block
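A toy Python sketch of Design 3's group commit and the checksum check at
recovery. Again invented for illustration: zlib.crc32 stands in for whatever
checksum a real log would use, data items are assumed to be bytes, and disk
is the dict from the first sketch.

  import zlib

  mem_log = []               # in-memory log: committed but not yet on disk
  disk_log = []              # stand-in for the on-disk log area

  def end_transaction(tid, writes):
      # writes = list of (bn, data) pairs
      csum = zlib.crc32(b"".join(d for _, d in writes))
      mem_log.append(("COMMIT", tid, csum, writes))
      # the system call can return now, but this transaction is still lost
      # if we crash before the next force()

  def force():
      # one big sequential write: move whole transactions to the on-disk log
      while mem_log:
          disk_log.append(mem_log.pop(0))

  def recover():
      for rec in disk_log:
          if rec[0] != "COMMIT":
              break                   # uncommitted record: stop
          _, tid, csum, writes = rec
          if zlib.crc32(b"".join(d for _, d in writes)) != csum:
              break                   # torn tail of a big log write: stop here
          for bn, data in writes:
              disk[bn] = data         # replay into the on-disk FS

The checksum is what lets recovery stop at the first damaged, missing, or
uncommitted transaction, so the post-recovery state is still a prefix of the
committed transactions even if the disk reordered parts of a big log write.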
Review: why does write-ahead logging provide atomic transactions?
  Don't write the on-disk FS until the log holds the transaction's full set of writes
  Then a crash midway through writing the on-disk FS can be completed
    by replay from the log during recovery

Review: why is logging fast?
  Can write just to the in-memory cache, and delay writes to the on-disk FS
    Since the data can be recovered from the log
    May benefit from write absorption
  Log appends are fast
    No seek if the log was recently written
    Big sequential writes, not random writes
    A single seek/rotate per log force

Paper question (last para in Section 4):
  Why doesn't FSD put file content writes in the log?
  The synchronous content write means the content goes to disk before the log
    Why is that OK?
    What if crash after the content write, but before forcing the log to disk?
  Would it be OK for the content to go to disk after the log is forced?
    I.e. to write the content just into the cache?
    What if a delete frees a block, a create allocates it, then a write, then a crash --
      but the delete hadn't been forced to the log on disk?
    Then the deleted file will exist -- but we wrote one of its blocks!
    See the end of 5.5...

-------

Example: Cedar
  "Reimplementing the Cedar File System Using Logging and Group Commit"
  Robert Hagmann
  SOSP 1987

What are FSD's main on-disk data structures?
  name table
    b-tree indexed by name
    entry: block list
  VAM -- bitmap indicating which blocks are free
  file content blocks
  leader block per file (ignore)

What were they worried about?
  Crash during a complex b-tree operation (e.g. splitting during insert)
  Recovery speed -- no whole-disk scans
  Maybe speed of ordinary operation
  Disk sector corruption

What happens when you create a file in FSD?
  1. Get two free pages for the leader and data from the VAM.
  2. Update the name table b-tree in memory.
  3. Write leader+data to disk synchronously.
  4. Append the modified name table pages to the log in memory.

When is the in-memory log forced to the disk?
  Group commit...
  Every few seconds.
  Why not log immediately, after every operation?
  Is there a down-side to group commit?

What does FSD's on-disk log look like?
  A fixed area of the disk, not a file; no free list for log blocks.
  Proper entry:
    hdr, blank, hdr copy, data pages..., end, copies of data..., end copy.

What if FSD runs out of log space during normal operation?
  How can we be sure it's safe to start re-using the log space on disk?
  What if crash while writing blocks during the switch to a new 1/3?
  Suppose a block is modified in one 1/3, and modified again in a later 1/3.
    Does it have to be written to disk when the first 1/3 is re-used?
    Is it even *allowed* to be written to disk when the first 1/3 is re-used?
    What if the 2nd transaction to write the block hasn't committed?

What happens during recovery after crash+reboot?
  How does the recovery s/w find the start of the log?
    On-disk ptr to the start of the oldest third.
    Written when the previous third was re-used.
  How does recovery find the last log record?
    The hdr/end blocks must have a time-stamp or something.
  What does recovery actually do with the log records?
  Can it just start replay at the beginning?
    I.e. what if it crashed during a previous replay -- is the 2nd time OK?
  Does recovery work if crash during a log write?
    Can recovery reliably decide if there's an "end" block?
    If the "end" block is present, are all data blocks guaranteed to be good?
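A guess, as Python, at how recovery might decide whether an FSD log entry is
complete. This is purely illustrative: the dict representation, the field
names, and the use of a matching time-stamp are my assumptions based on the
"hdr ... end copy" layout and the "time-stamp or something" comment above,
not the paper's actual format.

  def entry_complete(hdr, hdr_copy, end, end_copy):
      # each argument is a dict such as {"timestamp": ..., "nblocks": ...},
      # or None if that block is blank/unreadable
      if hdr is None or hdr_copy is None or hdr != hdr_copy:
          return False     # the header itself may be torn or corrupted
      if end is None or end_copy is None or end != end_copy:
          return False     # crash hit before the end blocks reached the disk
      # matching time-stamps tie the end blocks to *this* header,
      # not to a stale entry left over from an earlier use of these blocks
      return hdr["timestamp"] == end["timestamp"]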
What about recovering the free list (VAM)?
  What if the on-disk VAM was incorrect at the time of the crash?
  Does FSD log VAM modifications?
  Why is it OK not to log them?

Why don't they log data writes?
  When do they write the data?
  Does that break the prefix rule?
    May have writes from "the future", after the recovery point.
  I think FSD never updates a file in place.
    Overwriting a file always creates a new version and preserves the old one.
    What about delete, create, write, crash?
      W/ recovery point before the delete?

Performance Evaluation
  The big story is crash recovery: 25 sec instead of an hour.
  Why is FSD faster than CFS?
    Table 3: why does FSD do 1/6th as many I/Os for small creates?
    Why 149 I/Os for 100 files?
  Did they evaluate whether they achieved their robustness goals?