6.824 2002 Lecture 11: Logging

What's the overall topic?
  Atomic updates of complex data w.r.t. failures.
  Today just a single system; we'll see distributed versions later.

Terminology
  data buffer
  data storage
  log buffer
  log storage
  dirty vs clean
  sync write vs async

FFS sync create.
  1. update i-free list
  2. initialize i-node
  3. update directory block
  sync writes (in this order) are correct w.r.t. failure but very slow.
  and require slow recovery: fsck must restore the i-free list.
  but fsck recovery works correctly!

FFS rename
  editor could rename a temp file over the original for a careful update
  echo a > d1/f1
  echo b > d2/f2
  mv d2/f2 d1/f1
  need to update two directories, stored in two blocks on disk.
  remove then add? add then remove?
    probably want add then remove
  what if a crash?
  what does fsck do?
    it knows something is wrong: i-node's link count is 1, but two directory entries point to it.
    can't roll back -- which one to delete?
    has to just increase the link count.
    this is *not* a legal result of rename!
    but at least we haven't lost the file.
  so FFS is slow *and* it doesn't get semantics right.
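
  Sketch of the crash case above, in Python. This is a toy in-memory model,
  not FFS or fsck code: `dirs`, `link_count`, `rename_add_then_remove`, and
  `fsck_fix_link_counts` are names made up for illustration.

```python
def rename_add_then_remove(dirs, src, dst, crash_after_add=False):
    """Rename src -> dst as two separate directory-block writes."""
    (src_dir, src_name), (dst_dir, dst_name) = src, dst
    dirs[dst_dir][dst_name] = dirs[src_dir][src_name]  # write 1: add new entry
    if crash_after_add:
        return                                         # crash between the writes
    del dirs[src_dir][src_name]                        # write 2: remove old entry


def fsck_fix_link_counts(dirs, link_count):
    """Recount directory entries per i-node and overwrite the stored counts."""
    actual = {inum: 0 for inum in link_count}
    for entries in dirs.values():
        for inum in entries.values():
            actual[inum] += 1
    link_count.update(actual)


# State after: echo a > d1/f1; echo b > d2/f2
dirs = {"d1": {"f1": 1}, "d2": {"f2": 2}}
link_count = {1: 1, 2: 1}

rename_add_then_remove(dirs, ("d2", "f2"), ("d1", "f1"), crash_after_add=True)
fsck_fix_link_counts(dirs, link_count)
print(dirs)
print(link_count)
```

  After the simulated crash both d1/f1 and d2/f2 name i-node 2, and the
  fix-up raises its link count to 2. I-node 1 is left with no links; a real
  fsck would reclaim it, which this sketch does not model.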

XFS needs even more simultaneous updates
  create: dir block, new inode, i-free B+-tree, i-free counts
  can't just let fsck clean up mess (i.e. free lists)
  XFS wants no fsck at all!
    so can't even play FFS/fsck rename trick

What are the reasons to use logging?
  atomic commit of compound operations w.r.t. crashes.
  fast recovery (unlike fsck).
  correct recovery: serial prefix of operations.
  can be applied to almost any existing data structure
    not just file systems. mostly used in databases, e.g. for banks.
  very useful to coordinate updates to distributed data structures
    need to be very systematic about order of updates,
    which parts of which updates have completed.
    in case of partial failure.
  very efficient representation

Transactions
  re-organize code to mark start/end of group of atomic operations
  create()
    begin_transaction
      update free list
      update i-node
      update directory entry
    end_transaction
  writes go to buffer cache via logging system
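
  One way the begin/end bracketing might look, sketched in Python. The class
  `LoggingSystem` and its `transaction()` helper are hypothetical; the point
  is only that every write inside the bracket passes through the logging
  layer before it touches the buffer cache.

```python
from contextlib import contextmanager


class LoggingSystem:
    """Hypothetical in-memory stand-in for the logging layer."""

    def __init__(self):
        self.log = []            # log records, in append order
        self.buffer_cache = {}   # block name -> latest in-memory contents
        self.next_tid = 1

    @contextmanager
    def transaction(self):
        tid = self.next_tid
        self.next_tid += 1
        self.log.append(("Begin", tid))

        def write(block, data):
            # Log the update first, then apply it to the buffer cache.
            self.log.append(("Data", tid, block, data))
            self.buffer_cache[block] = data

        yield write
        # Reached only if the body finished without raising.
        self.log.append(("Commit", tid))


def create(fs, name, inum):
    with fs.transaction() as write:
        write("ifree", f"i-node {inum} allocated")
        write(f"inode-{inum}", "initialized")
        write("dir", f"{name} -> {inum}")


fs = LoggingSystem()
create(fs, "f1", 7)
for record in fs.log:
    print(record)
```

  The log ends up holding one Begin record, three Data records, and one
  Commit record for the create.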

simple log
    (*not* write-ahead! no rule about when data blocks reach the disk!)
    keep a "log" of updates
      <Data, T#, B#, new-data>
      <Begin, T#>
      <Commit, T#>
    for now, log lives on its own infinite disk
    note we include records from uncommitted xactions in the log
    recovery
      discard DB
      replay from start of log
      don't recover transactions that didn't commit
    why can't we use any of DB's contents during recovery?
      don't know if a block is from an uncommitted xaction
    so what have we achieved?
      atomic update of complex data structures
      slow recovery
      fast operation
      real persistent store is the log
      on-disk DB just a sort of cache
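
  A minimal Python sketch of this recovery procedure. Records are modeled as
  tuples in the <Data, T#, B#, new-data> layout above; the function name
  `recover` and the in-memory list standing in for the log disk are
  assumptions for illustration.

```python
def recover(log):
    """Rebuild the DB from scratch using only committed transactions."""
    committed = {rec[1] for rec in log if rec[0] == "Commit"}
    db = {}  # discard the old DB: start from nothing
    for rec in log:  # replay from the start of the log
        if rec[0] == "Data" and rec[1] in committed:
            _, tid, block, new_data = rec
            db[block] = new_data
    return db


log = [
    ("Begin", 1),
    ("Data", 1, "dir", "f1 -> 7"),
    ("Commit", 1),
    ("Begin", 2),
    ("Data", 2, "dir", "f1 -> 7, f2 -> 8"),
    # crash here: no Commit record for transaction 2
]
db = recover(log)
print(db)
```

  Transaction 2's update is in the log but is skipped, so the rebuilt DB
  holds only transaction 1's version of the directory block.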

careful disk writing
    tape? dedicated disk? special area on disk?
    how do we know where the end of the log is?
      "log anchor"
      two copies, ping-pong
    what if a multi-sector disk write to log interrupted by crash?
      can't guarantee sectors written in order
      so each update probably protected by checksum
      update anchor only after a full append has finished
    probably don't want partial-sector updates
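
  A sketch of checksum-protected log appends, in Python. The record layout
  (4-byte length, 4-byte CRC32, payload) is an assumption made for this
  example, not the format of any particular system; the log anchor is not
  modeled. `zlib.crc32` and `struct` are from the standard library.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def encode_record(payload: bytes) -> bytes:
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def scan_log(raw: bytes):
    """Return (valid payloads, offset just past the last valid record)."""
    records, off = [], 0
    while off + HEADER.size <= len(raw):
        length, crc = HEADER.unpack_from(raw, off)
        payload = raw[off + HEADER.size: off + HEADER.size + length]
        # Stop at zero-filled space, a short (torn) record, or a bad checksum.
        if length == 0 or len(payload) < length or zlib.crc32(payload) != crc:
            break
        records.append(payload)
        off += HEADER.size + length
    return records, off


raw = encode_record(b"create f1") + encode_record(b"create f2")
torn = raw[:-3]  # crash interrupted the second append
records, end = scan_log(torn)
print(records, end)
```

  The scan keeps the first record and stops at the torn one, so recovery
  sees a clean prefix of the log.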

why is logging fast?
    group commit -- batched log writes.
      could delay flushing log -- may lose committed transactions
      but at least you have a prefix.
    single write (or less, with group commit) to implement a transaction.
      no seek if dedicated log
    write-behind of data allows batched / scheduled.
      one buffer may reflect many transactions.
      e.g. create many files in a directory.
      don't have to be so careful since the log is the real information

re-do with checkpoint
    this is probably how SGI XFS (and Echo) log
    allows much faster recovery: can use on-disk DB
    write-ahead rule
      only write an update to the on-disk DB once its xaction's commit record is on disk
    so keep updates of uncommitted xactions in buffers (not disk)
    so no un-committed data on disk.
      but disk may be missing some committed data
      recovery may need to replay committed data from the log
    how does recovery know what to re-do out of the log?
      remember a checkpoint pointer as well as anchor. non-volatile.
      checkpoint = first log record whose data write isn't yet flushed to disk
      in background, flush earliest write, advance checkpoint pointer
      of course, cannot advance past an open xaction
    on recovery, re-play committed updates from checkpoint onward
    can free log space before checkpoint!
    problem: uncommitted transactions must sit in main-memory buffers until they commit.
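
  Re-do recovery with a checkpoint, sketched in Python. The log is a list of
  tuples and the checkpoint pointer is a list index; `redo_recover` is a
  made-up name, and this is not a description of how XFS or Echo actually
  implement it.

```python
def redo_recover(disk, log, checkpoint):
    """Replay committed updates from the checkpoint onward onto the disk DB."""
    committed = {rec[1] for rec in log if rec[0] == "Commit"}
    for rec in log[checkpoint:]:
        if rec[0] == "Data" and rec[1] in committed:
            _, tid, block, new_data = rec
            disk[block] = new_data
    return disk


log = [
    ("Begin", 1), ("Data", 1, "A", "a1"), ("Commit", 1),   # already flushed
    ("Begin", 2), ("Data", 2, "B", "b2"), ("Commit", 2),   # committed, not flushed
    ("Begin", 3), ("Data", 3, "A", "a3"),                  # never committed
]
checkpoint = 3                      # index of first record not yet flushed
disk = {"A": "a1", "B": "b0"}       # on-disk DB at the time of the crash
redo_recover(disk, log, checkpoint)
print(disk)
```

  Unlike the simple log, recovery keeps the on-disk DB and only re-applies
  transaction 2; transaction 3 never reached the disk (write-ahead rule) and
  is ignored.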

un-do/re-do with checkpoint
    what if you don't want to wait for commits?
      write uncommitted data to disk
      need to be able to un-do it in recovery
      so include old value in each log record
    now we can write a buffer whenever we want to
      as long as log entry already on disk
    checkpoint:
      all buffers flushed up to this point
      no need to re-do before this point
      but may need to un-do before this point
    tail:
      start of first uncommitted transaction
      no need to un-do before this point
      so can free before this point
    no main-memory buffers for data between tail and checkpoint.
      so uncommitted xactions don't consume memory
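
  Un-do/re-do recovery, sketched in Python. Data records carry both old and
  new values; `tail` and `checkpoint` are list indices. The sketch assumes
  that two in-flight transactions never write the same block (e.g. because
  of locking), which the notes above do not state. `undo_redo_recover` is a
  made-up name.

```python
def undo_redo_recover(disk, log, tail, checkpoint):
    committed = {rec[1] for rec in log if rec[0] == "Commit"}
    # Un-do pass: newest to oldest, no need to look before the tail.
    for rec in reversed(log[tail:]):
        if rec[0] == "Data" and rec[1] not in committed:
            _, tid, block, old, new = rec
            disk[block] = old
    # Re-do pass: oldest to newest, no need to look before the checkpoint.
    for rec in log[checkpoint:]:
        if rec[0] == "Data" and rec[1] in committed:
            _, tid, block, old, new = rec
            disk[block] = new
    return disk


log = [
    ("Begin", 1), ("Data", 1, "A", "a0", "a1"),
    ("Begin", 2), ("Data", 2, "B", "b0", "b2"),   # T2 never commits
    ("Commit", 1),
    ("Begin", 3), ("Data", 3, "C", "c0", "c3"), ("Commit", 3),
]
tail = 2          # start of first uncommitted transaction (T2)
checkpoint = 5    # all buffers for records before this index were flushed
disk = {"A": "a1", "B": "b2", "C": "c0"}   # on-disk DB at the crash
undo_redo_recover(disk, log, tail, checkpoint)
print(disk)
```

  T2's uncommitted write to B had reached the disk and is rolled back to b0;
  T3's committed write to C had not reached the disk and is re-applied.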