High-performance File Systems

Required reading: The Soft Updates paper.
intro: disk i/o a huge bottleneck in many real systems
  i've been involved in a bunch of production web sites
  none w/ more than more than modest workload
  all have had perf problems, #1 culprit is disk / file system

why is disk often a problem?
  everything else has gotten faster!
    25 years: CPU 1000x, net 100x, disk seek 3x
    seek takes 5 or 10ms == 10 million CPU instructions!
    sequential disk thruput 50x, but difficult to exploit
  read workloads often not cacheable -- big data, random or long tail
  writes: write-back caching vs durability/recoverability
    today's topic
  high level:
    if each thing takes one random disk I/O,
    your one-disk server can only do 100 things per second

watch bwrite printfs from xv6
  $ echo > x
  bwrite 4 ialloc inum 18
  bwrite 4 iupdate inum 18
  bwrite 29 writei inum 1 *
  $ echo z > x
  bwrite 28 balloc bn 337
  bwrite 337 writei inum 18
  bwrite 4 iupdate inum 18
  bwrite 337 writei inum 18
  bwrite 4 iupdate inum 18
  $ rm x
  bwrite 29 writei inum 1  **
  bwrite 4 iupdate inum 18
  bwrite 337 bzero
  bwrite 28 bfree bn 337
  bwrite 4 iupdate inum 18
  bwrite 4 iupdate inum 18
  $ echo > y
  bwrite 4 ialloc inum 18
  bwrite 4 iupdate inum 18
  bwrite 29 writei inum 1 **
  wow, that's slow; at 10ms per i/o can only write 16 small files/sec
  "synchronous writes" or "synchronous meta-data updates"
  esp for e.g. un-tar or rm *

idea for improving speed
  use a write-back disk cache
  write LRU block to disk when cache is full
  would that improve performance? why, exactly?
  would we regret this scheme?

what can go wrong w/ badly ordered writes?
  an easy case: * happens, but crash before two other bwrites
    fsck can fix this, since i-node is marked free
  suppose only the ** writes occur, then a crash
    can fsck fix this?

two rules for recoverable file system disk writes
  1. initialize i-node before creating dirent
  2. erase dirent before re-using i-node

revisit xv6 order
  does it look correct?
  does creat() initialize i-node before creating dirent?
  does unlink() remove dirent before freeing i-node?

idea for speed *and* correctness:
  don't re-use i-nodes while on-disk dirent points to them
  write-back cache for speed
  record block-block dependencies
  write-back to disk in order
  mark a few in easy cases
  it is correct for e.g. creating lots of files in a directory
  why doesn't this always work?

example cycle:
  (file /a already exists w/ i-num 19 in block 4)
  $ echo > b
  bwrite 4 ialloc inum 20
  bwrite 4 iupdate inum 20
  bwrite 29 writei inum 1
  $ rm a
  bwrite 29 writei inum 1
  bwrite 4 iupdate inum 19
  bwrite 4 iupdate inum 19
  bwrite 4 iupdate inum 19
  both dirents in block 29
  both i-nodes in block 4
    block 4 before block 29 (b inode before b dirent)
    block 29 before block 4 (a dirent before a inode)

is the cycle problem a fundamental show-stopper?
  what if each i-node or dirent were 512 bytes long?

what's the core soft updates idea?
  if you want to write block 29:
  1. un-do y's dirent (leaving delete of x's dirent)
  2. write block 29 to disk
  3. restore y's dirent
  4. delete "29 before 4" dependency

what's in an dependency record?
  attached to a block (e.g. block of i-nodes or block of dirents)
  specifies which record within the block it refers to
    e.g. an i-node or dirent
  specifies what other block (?) that record depends on
    i.e. that other block needs to be written first
  specifies a dependency record in the other block
    so we know what update to the other block we're waiting for
    in case other block written, but with update rolled back
  specifies a function to roll back/forward
    satisfied -- all dependent update written to disk
    complete -- written to disk

what does e.g. creat() look like?
  allocate inode i
    dependency record for setting i.type to T_FILE
    (not dependent on anything; but we will soon point to it)
    satisfied but not complete
  put name and i to dirent d
    dependency record for dirent modification
    points to inode dependency record
    not satisfied, not complete

extra book-keeping to never re-use i-node or data block w/ on-disk ref

what does code to write a block look like?
  lock the block
  for each unsatisfied dependency, un-do it
  write the block to disk
  wait for the disk write to finish
  for each unsatisfied dependency, re-do it
  for each satisfied dependency
    mark it complete
  for other blocks that depend on this one
    for our complete dependency records
      mark other dependency record as satisfied
  unlock the block

what are some problems?
  when you want to re-use a cache buffer, which to flush?
    what might go wrong?
    how to do a good job?
  can we always write each buffer just once?
  is fsck still needed?
    yes, but only to collect unreferenced blocks and i-nodes
    i.e. safe to not run it
    prob not true w/o soft updates
      ordinary FFS lazy about writing bitmaps?
  all updates delayed
    xv6 creat() durable on disk when creat() returns
    w/ soft updates write-back cache may be a while before on disk
  fsync(fd) may need to write file's dirent &c
    how to find it?
  multiple updates to same record?
    could we roll back one but not the other?

is it faster?
  paper, Figures 3 and 4