6.824 2007 Lecture 10: Logging and Recovery

Done with consistency segment of the class.
  Consistency helps multiple clients update the same shared data
  Haven't yet dealt with the durability of the data itself
    across failures, even on just a single node

Today's topics:
  logging
  FSD case study

What's the overall topic?
  Atomic updates of complex data w.r.t. failures.
  Today just a single system, we'll be seeing distributed versions later.
  Similar techniques will be useful in distributed systems
    like Petal, 2-phase commit, Paxos, etc.
  assume disk sector writes are atomic, and "work correctly"
    see FSD paper for better handling of disk failures

What does "atomic" mean?
  Suppose an operation has many steps
  It is atomic if observers see either none of the steps, or all
  We're worried about atomicity in the face of crashes

Let's talk about atomic update of file system data structures.
  E.g. UNIX file system. Or YFS updating the extent server.

Can you update the on-disk FS in place?
  For example, a file create might involve these disk writes:
    1. Allocate an i-number from a free-bitmap.
    2. Append name/i-number to directory.
    3. Initialize i-node (size, owner, block list, &c).
  Why is the above order potentially bad?
    Crash between 1 and 2.
    Crash between 2 and 3.
  Would a different order be better?
    1. Allocate i-number.
    2. Initialize i-node.
    3. Append to directory.
  Idea: can view write #3 as a "committing write"
    Crash before: no change visible
    Crash after: complete change is visible
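[Aside: a minimal sketch of the committing-write ordering in Go, using
 an in-memory map as a stand-in disk. Everything here (the disk type,
 writeSync, the block numbers) is invented for illustration, not from
 any real file system.]

  package main

  import "fmt"

  type disk map[int][]byte

  // writeSync stands in for a synchronous disk write: a real FS would
  // not return until the sector is on the platter.
  func (d disk) writeSync(blockno int, data []byte) {
      d[blockno] = append([]byte(nil), data...)
  }

  func create(d disk, name string) {
      // 1. Allocate an i-number (pretend block 0 is the free bitmap).
      d.writeSync(0, []byte("bitmap: inum 7 allocated"))
      // 2. Initialize the i-node (pretend block 7 holds it).
      d.writeSync(7, []byte("inode 7: size=0 owner=alice"))
      // 3. The committing write: only now does a directory reference
      // inum 7. A crash before this point leaves garbage (an allocated
      // but unreferenced i-node) that a scan can collect, but no
      // visible change; a crash after it leaves the complete create.
      d.writeSync(1, []byte("dir: "+name+" -> 7"))
  }

  func main() {
      d := disk{}
      create(d, "t0")
      fmt.Println(string(d[1])) // dir: t0 -> 7
  }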
Synchronous Meta-Data Update
  Write disk in a careful order
  Wait for each write to finish before doing the next
  Scan the disk and fix things after crash
    e.g. allocated but unreferenced i-nodes

Why aren't synchronous meta-data updates enough?
  They're slow
    Must write to disk, not cache
    10 ms per disk I/O
    Can't re-order disk writes to decrease seek time
  Recovery may require checking/scanning the whole FS (e.g. fsck)
  Some operations don't have an obvious single committing write
    Example: UNIX rename
      Programs use rename to reveal only a complete file:
        echo stuff > /tmp/t0
        mv /tmp/t0 file
      need to update two directories, stored in two blocks on disk
      remove then add? add then remove?
        probably want add then remove
      what if a crash? what does the post-crash checker do?
        it knows something is wrong: link count is 1, but there are two links
        can't roll back -- which one to delete?
        has to leave two links, and increase the link count
        this is *not* a legal result of rename!
        but at least we haven't lost the file
  sync meta-data updates are slow *and* have wrong semantics

You can push tree update one step farther.
  Prepare a new copy of the entire affected sub-tree.
  Replace old subtree in one final write.
  Very expensive if done in the obvious way.
  But you can share structure between old and new tree.
    Only need new storage between change points and sub-tree root.
  (NetApp WAFL does this and more.)
  This approach only works for tree data structures.
    and doesn't support concurrent operations very well

What are the reasons to use logging?
  communicates order -- this is the fundamental property of a log
    Bayou, so all nodes agree on order of operations
    pre-crash FS operations -> FS recovery
  atomic commit of compound operations, w.r.t. crashes
  fast recovery (unlike fsck)
  well-defined post-recovery state: serial prefix of operations
    as if synchronous, and the crash had occurred a bit earlier
    in echo/mv example, if you see the file, you see the new content too
  can be applied to almost any existing data structure
    e.g. database tables, free lists
  representation is compact on a disk, so very fast to append
  useful to coordinate updates to distributed data structures
    let's all do this operation
    oops, someone didn't say "yes"
    how to back out or complete?

How does one build a logging file system?

      Apps
        |
        FS
       /  \
     Log  Disk cache
       \  /
       Disk

  (the FS pushes each update through both the log and the disk
   cache, and both eventually reach the disk)

Transactions
  Most FS operations involve multiple disk writes
  We want all the writes of an operation to be atomic
  So the logging system needs to know what the operations are
  A "transaction" is a multi-write operation that should be atomic.
  create():
    begin_transaction
      update free list
      update i-node
      update directory entry
    end_transaction
  there may be multiple concurrent transactions
    e.g. two processes are making system calls

Terminology
  in-memory data cache
    dirty vs clean
  on-disk data
  in-memory log
  on-disk log

A naive re-do log

log format:
  B TID                [begin]
  W TID B# new-data    [write]
  E TID                [end == commit]

Example:
  B T1
  W T1 B1 25
  E T1
  B T2
  W T2 B1 30
  B T3
  W T3 B2 99
  W T3 B3 50
  E T3

for now, the log lives on its own infinite disk
note the log may include records from uncommitted xactions
  records from concurrent xactions may be inter-mingled
we can write dirty in-memory data blocks to disk any time we want

what does a write do?
  modify the block in the data cache
  append a record to the log

what does recovery do?
  1. discard all on-disk data
  2. scan the whole log and remember all committed TIDs
  3. scan the whole log, ignore non-committed TIDs, replay the writes

why does recovery omit writes of non-committed xactions?
  *code* for that xaction can't be resumed
  thus it can't complete
  thus atomicity requires that it look like it was never started

when can we write dirty data from cache to disk?
  any time, since recovery reconstructs everything

why can't we use any of the on-disk data's contents during recovery?
  don't know if a block is from an uncommitted xaction
    i.e. was written to disk before commit
  the *real* data is in the log!
  the on-disk data structure is just a cache for speed
    since it's hard to *find* things in a log
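[Aside: the two-pass recovery above is easy to express in code. A
 minimal sketch in Go, assuming the simple B/W/E record format; the
 record type and names are invented for illustration.]

  package main

  import "fmt"

  type record struct {
      kind  byte   // 'B', 'W', or 'E'
      tid   int
      block int    // valid for 'W'
      data  string // valid for 'W'
  }

  // recoverFS discards the old on-disk data entirely and rebuilds it
  // from the log (the on-disk structure is just a cache of the log).
  func recoverFS(log []record) map[int]string {
      // Pass 1: find committed transactions (those with an End record).
      committed := map[int]bool{}
      for _, r := range log {
          if r.kind == 'E' {
              committed[r.tid] = true
          }
      }
      // Pass 2: replay writes, ignoring uncommitted transactions, so a
      // half-finished operation looks like it never started.
      disk := map[int]string{}
      for _, r := range log {
          if r.kind == 'W' && committed[r.tid] {
              disk[r.block] = r.data
          }
      }
      return disk
  }

  func main() {
      log := []record{
          {'B', 1, 0, ""}, {'W', 1, 1, "25"}, {'E', 1, 0, ""},
          {'B', 2, 0, ""}, {'W', 2, 1, "30"}, // T2 never committed
      }
      fmt.Println(recoverFS(log)) // map[1:25] -- T2's write is ignored
  }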
so what have we achieved?
  atomic update of complex data structures: gets rename() right
  recoverable operations are fast (exploit disk locality)
  problems:
    we have to store the whole log forever
    recovery has to replay from the beginning of time

A better log scheme: write-ahead log + checkpoint + re-do
  most FS logs work like this, e.g. FSD
  allows much faster recovery: can use on-disk data

how can we avoid re-playing the whole log on recovery?
  recovery needs to know a point in the log at which it can start
  a "checkpoint": pointer into the log, stored on disk
  how to ensure recovery can ignore everything before the checkpoint?
    checkpoint rule: all data writes before the checkpoint must be
      stable on disk
    checkpoint may not advance beyond the first uncommitted Begin
  in background, flush a bunch of early writes, update the checkpoint ptr

recovery:
  scan log from the checkpoint to see what transactions are committed
  re-play committed updates from the checkpoint onward

what about transactions that wrote the disk but did not commit?
  will they leave partial operations visible after recovery?
  we can't allow this!
  write-ahead rule: do *not* write dirty blocks from the in-memory data
    cache to disk until the corresponding commit record is on disk
  thus the disk contains only committed data

three log regions:
  data guaranteed on disk (before the checkpoint)
  data might be on disk (before the log write point)
  data cannot be on disk (end of the in-memory log)

so a transaction does this:
  write disk cache in memory
  append log records
  append commit record
  wait for all its log records to reach disk
  write disk cache to disk
  update checkpoint
  (last three in background)

what if crash after writing blocks to disk, but before updating the
  checkpoint? recovery will write the same block contents to disk

bonus: we can free log space before the checkpoint
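[Aside: a minimal sketch of the write-ahead discipline in Go, with
 in-memory stand-ins for the data cache, the on-disk log, and the
 on-disk data blocks. All names are invented; the only point is the
 ordering: log records, including the commit record, must reach the
 disk before any of the transaction's data blocks do.]

  package main

  import "fmt"

  type system struct {
      cache    map[int]string // in-memory data cache (dirty blocks)
      memLog   []string       // in-memory log tail
      diskLog  []string       // on-disk log
      diskData map[int]string // on-disk data blocks
  }

  // flushLog stands in for forcing the in-memory log to disk (fsync).
  func (s *system) flushLog() {
      s.diskLog = append(s.diskLog, s.memLog...)
      s.memLog = nil
  }

  // commit runs one transaction under the write-ahead rule.
  func (s *system) commit(tid string, writes map[int]string) {
      for b, v := range writes {
          s.cache[b] = v // 1. update the in-memory data cache
          s.memLog = append(s.memLog, fmt.Sprintf("W %s B%d %s", tid, b, v))
      }
      s.memLog = append(s.memLog, "E "+tid) // 2. append the commit record
      s.flushLog()                          // 3. commit point: log on disk
      for b, v := range writes {
          s.diskData[b] = v // 4. safe now; may also happen in background
      }
      // 5. only after the data writes may the checkpoint ptr advance.
  }

  func main() {
      s := &system{cache: map[int]string{}, diskData: map[int]string{}}
      s.commit("T1", map[int]string{1: "new dir block"})
      fmt.Println(s.diskLog, s.diskData)
  }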
careful disk writing
  the log is usually stored in a dedicated known area of the disk
    so it's easy to find after a reboot
  where's the start? block #0 holds the checkpoint pointer
  where's the end? hard if a crash interrupted a log append
    append records in order
    include a unique ascending sequence # in each record
      also a checksum for multi-sector records (maybe in the End?)
    recovery must search forward for the highest sequential #
    does this always work?

why is logging fast?
  can delay writes to data blocks
    only have to write the log right away
  log appends are fast
    no seek if the log was recently written
    all writes go to the same place, not scattered around the disk
    single seek/rotate per transaction, rather than per write
  write-behind of data allows batched / scheduled writes
    one data block may reflect many transactions
      e.g. creating many files in one directory
    don't have to be so careful, since the log is the real information

Problems with WAL + checkpoint + re-do
  uncommitted transactions use space in the in-memory data cache
    a problem for long-running transactions
    (not a problem for file systems)
  log records are huge
    an entire block, even if only a few bytes were modified
    writing the log to disk is slow
  can't abort a transaction under s/w control
    you're midway through a rename
    unexpectedly run out of blocks, or i-nodes, or cache space
    would be nice to un-do the partial operation
    FS rarely has support for this, but it's common in DBs

More sophisticated log system:
  log just the bytes that change, not the whole block
    or a description of the operation
  un-do as well as re-do
    each log record has old data and new data
    un-do makes abort a lot easier
    un-do allows one to write uncommitted data to disk
      recovery can un-do it

*** un-do/re-do was briefly summarized, skip to FSD paper ...

problem: uncommitted transactions use space in the in-memory data cache
  a problem for long-running transactions
  (not a problem for file systems)

un-do/re-do with checkpoint
  suppose we want to write uncommitted data to disk?
  need to be able to un-do it in recovery
  so include the old value in each log record:
    W TID B# old-data new-data
  now we can write data from the in-memory data cache to disk
    after the log entry is on disk
    no need to wait for the End to be on disk
  so we can free in-memory data cache blocks of uncommitted transactions
  recovery: for each block mentioned in the log
    find the last xaction that wrote that block
    if committed: re-do
    if not committed: un-do
  two pointers stored on disk: checkpoint and tail
    checkpoint: all in-memory data cache entries flushed up to this point
      no need to re-do before this point
      but may need to un-do before this point
    tail: start of the first uncommitted transaction
      no need to un-do before this point
      so can free log space before this point
    it's ok if we crash just before updating the tail pointer itself
      we would have advanced it over committed transaction(s)
      so we will re-do them, no problem
  what if there's an un-do record for a block never written to disk?
    it's ok: un-do will re-write the same value that's already there
  what if:
    B T1
    W T1 B1 old=10 new=20
    B T2
    W T2 B1 old=20 new=30
    crash
  The right answer is B1 = 10, since neither committed
  But it looks like we'll un-do to 20
  What went wrong? How to fix it?
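[Aside: one standard fix is to apply un-dos in reverse log order,
 undoing *every* uncommitted write, not just the last writer of each
 block. A minimal sketch in Go, with invented types; it assumes the
 usual locking discipline, so a committed xaction never overwrites an
 uncommitted one's data, which makes re-do-then-un-do safe.]

  package main

  import "fmt"

  type rec struct {
      kind     byte // 'B', 'W', 'E'
      tid      int
      block    int
      old, new int // valid for 'W'
  }

  func recoverFS(log []rec, disk map[int]int) {
      committed := map[int]bool{}
      for _, r := range log {
          if r.kind == 'E' {
              committed[r.tid] = true
          }
      }
      // Re-do committed writes, oldest first.
      for _, r := range log {
          if r.kind == 'W' && committed[r.tid] {
              disk[r.block] = r.new
          }
      }
      // Un-do uncommitted writes, newest first, so in the example
      // above we un-do T2 (B1: 30 -> 20) and then T1 (B1: 20 -> 10).
      for i := len(log) - 1; i >= 0; i-- {
          r := log[i]
          if r.kind == 'W' && !committed[r.tid] {
              disk[r.block] = r.old
          }
      }
  }

  func main() {
      log := []rec{
          {'B', 1, 0, 0, 0}, {'W', 1, 1, 10, 20},
          {'B', 2, 0, 0, 0}, {'W', 2, 1, 20, 30},
      }
      disk := map[int]int{1: 30} // B1 may have reached disk pre-crash
      recoverFS(log, disk)
      fmt.Println(disk[1]) // 10 -- both un-dos applied, reverse order
  }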
-------

Example: Cedar
  Reimplementing the Cedar File System Using Logging and Group Commit
  Robert Hagmann, SOSP 1987

What are FSD's main on-disk data structures?
  name table
    b-tree indexed by name
    entry: block list
  VAM -- bitmap indicating which blocks are free
  file content blocks
  leader block per file (ignore)

What were they worried about?
  Crash during a complex b-tree operation (e.g. splitting during insert)
  Recovery speed -- no whole-disk scans
  Maybe speed of ordinary operation
  Disk sector corruption

What happens when you create a file in FSD?
  1. get two free pages for leader and data from the VAM
  2. update the name table b-tree in memory
  3. write leader+data to disk synchronously
  4. append modified name table pages to the log in memory

When is the in-memory log forced to the disk?
  Group commit... Every few seconds.
  Why not log immediately, after every operation?
  Is there a down-side to group commit?

What does FSD's on-disk log look like?
  Fixed area of disk, not a file, no free list for log blocks.
  Proper entry: hdr, blank, hdr copy, data pages..., end,
    copies of data..., end copy.
  What if FSD runs out of log space during normal operation?
  How can we be sure it's safe to start re-using the log space on disk?
  What if crash while writing blocks during switch to a new 1/3?
  Suppose a block is modified in one 1/3, modified again in a later 1/3.
    Does it have to be written to disk when the first 1/3 is re-used?
    Is it even *allowed* to be written to disk when the first 1/3 is
      re-used? What if the 2nd transaction to write the block hasn't
      committed?

What happens during recovery after crash+reboot?
  How does recovery s/w find the start of the log?
    On-disk ptr to start of the oldest third.
    Written when the previous third was re-used.
  How does recovery find the last log record?
    hdr/end blocks must have a time-stamp or something.
  What does recovery actually do with the log records?
  Can it just start re-play at the beginning?
    I.e. what if it crashed during a previous replay, is the 2nd time OK?
  Does recovery work if crash during a log write?
    Can recovery reliably decide if there's an "end" block?
    If the "end" block is present, are all data blocks guaranteed to be
      good?

What about recovering the free list (VAM)?
  What if the on-disk VAM was incorrect at the time of the crash?
  Does FSD log VAM modifications? Why is it OK not to log them?

Why don't they log data writes?
  When do they write the data?
  Does that break the prefix rule?
    May have writes from "the future", after the recovery point.
  I think FSD never updates a file in place.
    Overwriting a file always creates a new version, preserves the old one.
  What about delete, create, write, crash?
    W/ recovery point before the delete?

Performance Evaluation
  The big story is crash recovery: 25 sec instead of an hour
  Why is FSD faster than CFS?
  Table 3: why does FSD do 1/6th as many I/Os for small creates?
    Why 149 I/Os for 100 files?
  Did they evaluate whether they achieved their robustness goals?

****** unused old notes *******

Let's talk about the old system, CFS, first.
  In some ways it's the more interesting of the two.

What were some strengths of the old design?
  Labels provided robustness.

What were the weaknesses of the old design?
  Non-standard hardware.
  Slow recovery.
  Slow meta-data update.
  CFS file name table was a b-tree.
    *not* consistent after a crash.
    thus really required duplicating names &c in headers.

How did the labels work?
  What was in a label? When written? When checked?
  What errors can the labels detect?
  What invariants could a label help maintain?

Disk hardware failures: for each, detect bad data? Recover?
  If the disk scribbles onto the file name table?
  If the disk scribbles onto a header?
  If the disk scribbles onto file content?

Software errors:
  If a used page is on the free list, will we re-use it incorrectly?
  If someone writes junk into a header, will we detect? Can we recover?
  If someone writes junk into file content, will we detect?

Why was the CFS VAM just hints?
  VAM is the disk page free list.
  How could there be no invariants?
  What if an allocated page is on the free list?
  Does the FSD VAM have invariants?

Why is guarding against single sector errors reasonable?
  What's the logic behind this design focus?
  What are the alternatives?
  What if most disk errors occurred in the disk electronics?

Do they have hierarchical directories?
  Only one "directory", the file name table...