File System Performance and Recovery
Required reading: The Soft Updates paper.
intro
topic: tension between file system performance and crash recovery
disk performance is often a #1 bottleneck
"how many seeks will that take?"
but many obvious fixes make crash / power failure hard to recover from
"what if a crash occurred at this point?"
crash recovery is much harder than performance
why is disk performance often a bottleneck?
everything else has gotten faster!
25 years: CPU 1000x, net 100x, disk seek 3x
seek+rotate takes 10ms, so 100 I/O per sec, 51 KB/s on xv6
example: storing incoming short email msgs: how many seeks?
good news: sequential throughput is 50 MB / sec
but quite hard to arrange for pure sequential I/O
example: storing huge output of scientific computation
memories are huge, won't a big disk cache provide high performance?
reads: often big data sets with random / long-tail access
writes: what if a crash occurs before you write cache to disk?
why is FS crash recovery hard?
crash = halt/restart CPU, let disk finish current sector write
assume no h/w damage, no wild writes to disk
goal: automatic recovery
can FS always make sense of on-disk metadata after restart?
given that the crash could have occurred at any point?
example:
many of you observed on exam that crash during mkdir()
might leave subdir without "." and ".."
you can imagine much worse outcomes
e.g. block used by a file AND on free list
recovery approaches
three main approaches:
synchronous meta-data update + fsck (as in xv6)
logging (as seen in 6.033)
soft updates (today's paper)
design tradeoffs -- you can move the work around
careful update vs recovery
work during ordinary operation vs work during recovery
FS guarantees vs how much app must do
are all syscalls durable? ordered? atomic?
terms for properties of FS operations
if app completes operations shortly before a crash,
what effects will app see after restart + recovery?
e.g. creat("a"); fd = creat("b"); write(fd, ...); crash;
durable: effects of operation are visible
atomic: all steps of operation visible, or none
ordered: exactly a prefix of operations is visible
scores for xv6? soft updates? logging?
typical set of tradeoffs
FS ensures it can recover its own metadata
"internal consistency"
no dangling references (e.g. all dirents refer to allocated inodes)
inode and block free lists contain only unused items
lengths, link counts, unique names in dir
this is the bare minimum for a useful file system
FS provides limited guarantees to apps
atomicity for create, delete, rename
mostly equivalent to internal consistency
no atomicity for write() (except maybe single-sector)
often no durability for anything (creat("a") then crash => no "a")
often no order guarantees
how do apps cope with these weak FS semantics?
example: editing file "grades" with emacs, crash while writing
do you end up with only half your file?
you would if you just opened file for writing, and wrote
crash between open() (or creat()) and write()
crash during write(), since it is not atomic
[see code in handout]
fd = open("tmp1234", O_CREAT|O_WRONLY|O_TRUNC, 0666);
write(fd, ...);
fsync(fd);
close(fd);
rename("tmp1234", "grades");
why it works:
fsync() forces durability, returns after data on disk
rename() is atomic: after a crash, old or new grades (not nothing)
(atomic rename() is quite difficult to get right...)
let's look at how xv6 updates the disk
handout w/ print for each bwrite()
think: "what if a crash here? how many seeks?"
$ echo > x
bwrite 4 ialloc inum 18
bwrite 4 iupdate inum 18
bwrite 29 writei inum 1 *
$ echo z > x
bwrite 28 balloc bn 337
bwrite 337 writei inum 18
bwrite 4 iupdate inum 18
bwrite 337 writei inum 18
bwrite 4 iupdate inum 18
$ rm x
bwrite 29 writei inum 1 **
bwrite 4 iupdate inum 18
bwrite 337 bzero
bwrite 28 bfree bn 337
bwrite 4 iupdate inum 18
bwrite 4 iupdate inum 18
$ echo > y
bwrite 4 ialloc inum 18
bwrite 4 iupdate inum 18
bwrite 29 writei inum 1 **
$
would an xv6 FS be internally consistent after a crash?
could look at all possible crashes
what if a crash during file creation? &c
remember all disk writes in xv6 are synchronous
better: argue on-disk FS consistent for any crash
this is mostly about avoiding dangling references to blocks and inodes
xv6 strategy: carefully order disk writes to avoid dangling refs
1. initialize a new inode before creating dirent
2. delete dirent before marking inode free
3. mark block in-use before adding it to inode addrs[]
4. remove block from addrs[] before marking free (oops)
5. zero block before marking free
has some visible bugs: . and .. during mkdir(), link counts, sizes
has some invisible loose ends: may lose freed blocks and inodes
what about application-visible syscall semantics?
durable? yes (write-through cache)
atomic? often, though e.g. mkdir is broken
ordered? yes (all writes are sync)
how about xv6 FS performance?
slow: at 10ms per i/o can only write 16 small files/sec
think about un-tar or rm *
can we get rid of some xv6 disk writes?
could combine ialloc and iupdate when creating
echo could write("z\n") rather than two writes
could let block free bitmap be out of date
rebuild during recovery, in fsck
hard to reduce create or unlink to fewer than two seeks
still pretty slow
can we implement system calls with no seeks at all?
use a write-back disk cache
bwrite() *only* modifies in-memory buf (no disk write)
bufs written to disk later
if cache is full, write LRU dirty block
or maybe every 30 seconds
would write-back cache improve performance? why, exactly?
what problems will write-back cause?
what can go wrong w/ badly ordered writes?
an easy case: * goes to disk, crash before other two blocks written
FS not immediately usable after crash (dirent -> free inode)
easy for fsck to detect and fix, since i-node is marked free
suppose only the ** disk writes occur, then a crash
what is the problem?
can fsck detect and fix this?
problem: write-back cache doesn't follow ordering rules
those X before Y rules listed earlier
idea to fix write-back cache to be crash-recoverable:
record block-block dependencies between cached bufs
write-back to disk must obey dependencies
if it is about to write dirent block, first write inode block
example:
for creat():
write inode to disk before dirent
dirent block depends on inode block
arrow from block 29 to block 4
for write():
write bitmap block to disk before inode block
inode block depends on bitmap block
arrow from 4 to 28
would this have good performance?
e.g. for lots of creat()s in a directory
why doesn't it always work?
example cycle:
(file /a already exists w/ i-num 19 in block 4)
$ echo > b
bwrite 4 ialloc inum 20
bwrite 4 iupdate inum 20
bwrite 29 writei inum 1
$ rm a
bwrite 29 writei inum 1
bwrite 4 iupdate inum 19
bwrite 4 iupdate inum 19
bwrite 4 iupdate inum 19
$
both dirents in block 29
both i-nodes in block 4
dependencies:
block 4 before block 29 (b inode before b dirent)
block 29 before block 4 (a dirent before a inode)
it seems like we can't write either block first!
is this cycle problem fundamental?
what if each i-node or dirent were 512 bytes long?
what's the core soft updates idea?
dependencies between metadata items, not blocks
i.e. particular dirent depends on particular inode
not from dirent's block to inode's block
this eliminates our cycle!
if you want to write block 29:
1. un-do b's dirent (leaving delete of a's dirent)
2. write block 29 to disk
3. restore b's dirent
4. delete "29 before 4" dependency
now you can write 4 to disk, and then the real 29
what's in a dependency record?
attached to a record within a block (e.g. inode or dirent)
points to a dependency record for another block
so we know what update to the other block we're waiting for
in case other block written, but with update rolled back
note: you depend on an *update*, not just a record
roll-back / roll-forward data and function
state:
satisfied -- all updates this one depends on have been written to disk
complete -- this update itself has been written to disk
what does e.g. creat() look like?
allocate inode i
dependency record for setting i.type to T_FILE
(not dependent on anything; but we will soon point to it)
satisfied but not complete
put name and i in dirent d
dependency record for dirent modification
points to inode dependency record
not satisfied, not complete
what does code to write a block to disk look like?
lock the block
for each unsatisfied dependency, un-do it
write the block to disk
wait for the disk write to finish
for each unsatisfied dependency, re-do it
for each satisfied dependency
mark it complete
for other blocks that depend on this one
for our complete dependency records
mark other dependency record as satisfied
unlock the block
various details they had to get right
when you want to re-use a cache buffer, which to flush?
what might go wrong?
how to do a good job?
how many times might we have to write a particular buffer?
that is, what is the ultimate cost of undo/write/redo?
is fsck still needed?
collect unreferenced blocks and i-nodes
fix too-high link counts (after link and rename)
safe to not run fsck
not true w/o soft updates
ordinary FFS lazy about writing bitmaps
all updates delayed
xv6 creat() durable on disk when creat() returns
w/ soft updates write-back cache may be a while before on disk
order among syscalls not preserved if a crash
creat("a"); creat("b");
after crash, app might see b but not a
fix with fsync()
fsync(fd) may need to write file's dirent &c
no point in writing file content to disk w/o dirent
could be a lot of stuff; how to identify?
is soft updates fast?
paper, Figures 3 and 4