File System Performance and Durability
Required reading:
Rethink the Sync.
intro: why are we reading Rethink the Sync?
move up one level:
last time: performance vs FS consistency
today: performance vs durability
most FS don't care much about durability, just internal consistency
paper expands the scope of crash recovery to include external user
user (or outside world via network)
app
FS
disk
each layer:
has some kind of careful update plan while operating
has a recovery plan
depends on careful update plan
and post-crash guarantees of layer below
the paper exploits user-level view to improve system properties
xsyncfs's high-level promises
correctness of synchronous FS updates
performance of asynchronous FS
what is a sync FS?
a sync FS forces updates to disk before returning from syscall, like xv6
if syscall returns, its effects will be visible after a crash
write-through disk cache in kernel, so slow
why is sync FS a good standard for correct FS behavior?
maybe since sync FS easy to reason about?
can count on all ops completed before crash being visible after crash
paper claims apps that are correct on sync will be correct on xsyncfs
not directly useful, since no real file systems are sync
so few (no?) apps are written for sync FSs
they probably mean careful use of fsync()
e.g. mail server fsync()s incoming msg before saying "OK"
sender can safely delete if it sees "OK"
if we said OK, fsync() must have returned, so data must be on disk
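a minimal C sketch of that pattern (deliver(), msg_fd, sock, and "OK\r\n" are made-up names for illustration, not from the paper):
  #include <stddef.h>
  #include <unistd.h>

  /* write the message, force it to disk, and only then acknowledge */
  int deliver(int msg_fd, int sock, const char *msg, size_t len) {
      if (write(msg_fd, msg, len) != (ssize_t)len) return -1;
      if (fsync(msg_fd) < 0) return -1;   /* message is now on disk */
      /* safe to promise: if the sender sees "OK", the data survived */
      return write(sock, "OK\r\n", 4) == 4 ? 0 : -1;
  }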
what does paper mean by an async FS?
system calls can return before data is on disk
syscalls modify blocks in a write-back cache
but then FS has to worry about internal consistency
paper really means FS w/ async logging, in particular Linux's ext3
simplified review of logging
goal: have both write-back cache and internal FS consistency
make multi-step syscalls atomic despite crash: all or nothing
FS has a log on disk
syscall: append BEGIN, each of syscall's modifications, END
a "transaction"
also writes blocks in write-back cache
don't send write-back cache to disk until after log goes to disk
"write-ahead logging"
if crash, recovery scans log
iff sees END, re-do all modifications in that xaction
so if crash during a system call
if no END, syscall has no effect at all
since we held write-back until after END on disk
if END, syscall has all effects
logging usually asynchronous
syscall appends records to in-memory log
force log to disk periodically or on fsync
may lose recent updates on crash
thus not durable (but still atomic)
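a toy model of the redo rule, for intuition only (arrays stand in for the disk and the log, all names are made up, and real logs hold whole blocks, not ints):
  #include <stdio.h>

  #define NBLOCKS 8
  enum { BEGIN = 1, MOD, END };

  struct rec { int kind, blockno, value; };

  int disk[NBLOCKS];                    /* the on-disk FS blocks */
  struct rec logdisk[64]; int nlog;     /* the on-disk log */

  /* recovery: redo a transaction's MODs iff its END made it to the log */
  void recover(void) {
      for (int i = 0; i < nlog; i++) {
          if (logdisk[i].kind != BEGIN) continue;
          int j = i + 1;
          while (j < nlog && logdisk[j].kind == MOD) j++;
          if (j < nlog && logdisk[j].kind == END)     /* complete xaction */
              for (int k = i + 1; k < j; k++)
                  disk[logdisk[k].blockno] = logdisk[k].value;
          /* no END: ignore it -- the syscall has no effect after the crash */
      }
  }

  int main(void) {
      /* one syscall that committed: BEGIN, two updates, END reached the log */
      logdisk[nlog++] = (struct rec){BEGIN, 0, 0};
      logdisk[nlog++] = (struct rec){MOD, 2, 42};
      logdisk[nlog++] = (struct rec){MOD, 5, 7};
      logdisk[nlog++] = (struct rec){END, 0, 0};
      /* one interrupted by the crash: no END record */
      logdisk[nlog++] = (struct rec){BEGIN, 0, 0};
      logdisk[nlog++] = (struct rec){MOD, 2, 99};

      recover();
      printf("block 2 = %d, block 5 = %d\n", disk[2], disk[5]);  /* 42, 7 */
      return 0;
  }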
why is not-durable a problem for external users?
mail server says "OK" but crash erases its copy of msg
or
% cp * backupdir
%
back to xsyncfs:
currently you have to choose between durability and performance
slow sync fs, or fast non-durable async logging
what's the paper's insight to get performance + durability?
sufficient definition of durability:
updates only need to be durable by the time of external output
display output, network packet
which data has to be durable before external output?
all updates "causally preceding" the external output
"causally preceding" is a transitive relation
process X writes a file, then sends a network message
process X exit()s, process Y's wait() returns
thus:
X : write(file) -> exit()
Y : -> wait() -> printf("$ ")
process X's file write must be on disk before user sees Y's "$ "
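the same chain as a runnable C sketch (the file name "out" is arbitrary):
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main(void) {
      if (fork() == 0) {                /* X: write a file, then exit */
          int fd = open("out", O_CREAT | O_WRONLY | O_TRUNC, 0644);
          write(fd, "data\n", 5);
          close(fd);
          exit(0);
      }
      wait(NULL);                       /* Y: wait() returns only after X exits */
      printf("$ ");                     /* external output: under xsyncfs, X's
                                           file write must be on disk before
                                           the user can see this prompt */
      return 0;
  }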
why is durability of causally preceding updates sufficient?
in order to keep external users happy?
if mail sender sees "OK" msg via net, guaranteed msg is on rcvr's disk
no "OK" -> recovery by resending later
% cp * backupdir
%
if you see the 2nd prompt, you know backup is complete
if crash before 2nd prompt, user "recovers" by backing up again
seems to be sufficient in these two cases
can we argue that no non-causally-preceding write is necessary?
i.e. is that rule sufficient for user-visible correctness?
example:
read mail msg from socket
fork a child
child writes msg to a file
parent writes "OK" to socket
so: sufficient only if application is written carefully
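a C sketch of that careless pattern (handle(), sock, msg, len, and "mbox" are made up): the parent never wait()s, so its "OK" does not causally follow the child's write, and xsyncfs is not obliged to force the message to disk before the "OK" leaves the machine
  #include <fcntl.h>
  #include <unistd.h>

  void handle(int sock, const char *msg, size_t len) {
      if (fork() == 0) {                 /* child: store the message */
          int fd = open("mbox", O_WRONLY | O_APPEND | O_CREAT, 0644);
          write(fd, msg, len);
          close(fd);
          _exit(0);
      }
      /* parent: no wait(), so no causal edge from the child's write */
      write(sock, "OK\r\n", 4);          /* msg may vanish if we crash now */
  }
the fix is to make the dependency explicit, e.g. have the parent wait() for the child before writing "OK"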
is it necessary for xsyncfs to force all causally preceding writes?
are there examples where an app doesn't actually need a causally preceding write forced?
will xsyncfs have higher performance than sync updates?
yes: can use write-back cache, coalesce writes
until external output
will xsyncfs have higher performance than async logging?
usually no:
xsyncfs has causal tracking overhead
xsyncfs might force to disk more often than every 5 seconds
but perhaps yes for apps that fsync()
when might xsyncfs have *higher* performance than async FS?
mysql example
client does multiple xactions before external output
mysql server really must fsync() for each transaction
but xsyncfs knows: no external output -> can ignore fsync()
reminder
xsyncfs guarantees same *user*-visible behavior as a sync FS
what about app-visible behavior?
what guarantees given to applications?
they imply apps written for a sync file system will work fine
xsyncfs guarantees ordering: section 2.2, end of page
so like a sync FS, but one that crashed a little earlier
your app still needs to be able to recover from a crash
it can rely on all syscalls up to some point, in order,
being visible after the crash
what are the hard parts of xsyncfs design / impl?
tracking causality
each process and kernel object (pipe) has list of not-yet-written blocks
copy that list when a process reads/updates kernel object (see the sketch after this list)
buffering external output, e.g. display writes and network packets
why useful to buffer? why not just suspend process?
forcing disk writes on external output
track down all causally preceding blocks
ask ext3 to write them
wait, then release output
what does that involve?
can you ask ext3 to write just a particular block?
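a toy sketch of that causal bookkeeping and of "force before output" (a 64-bit mask stands in for xsyncfs's per-object lists of uncommitted blocks; all names are made up):
  #include <stdint.h>

  typedef uint64_t blockset;      /* bit i set => block i dirty, not yet on disk */

  struct obj { blockset deps; };  /* a process, pipe, file, ... */

  /* a process dirties a block: it now depends on that block being committed */
  void on_write(struct obj *proc, int blockno) {
      proc->deps |= (blockset)1 << blockno;
  }

  /* any interaction (pipe read/write, wait(), shared file) merges the sets,
     so dependencies flow along every causal edge */
  void on_interact(struct obj *a, struct obj *b) {
      blockset u = a->deps | b->deps;
      a->deps = u;
      b->deps = u;
  }

  /* before releasing external output from proc: force exactly these blocks
     (and the log records covering them) to disk, then release the output */
  blockset must_force_before_output(struct obj *proc) {
      return proc->deps;
  }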
ext3
paper involves ext3 in two ways
1. as the competition
2. as part of xsyncfs
outline of ext3 design
main source: Stephen Tweedie 2000 talk transcript "EXT3, Journaling Filesystem"
goals of ext3
performance via write-back cache
crash recovery w/o fsck (just replay log, quick)
atomic syscalls (to maintain FS internal consistency)
ext3 uses a write-ahead redo log (== "journal")
added to a previous log-less file system
has many modes; I'll describe "journaled data"
log contains both metadata and file content blocks
structures:
in-memory write-back block cache
in-memory list of blocks to be logged
on-disk FS
on-disk log file
what's in the ext3 log?
descriptor: magic, seq, block #s
data blocks
(a transaction can contain multiple descriptor + data-block groups)
commit: magic, seq
ext3 logs entire blocks ("value logging")
expensive: one little change -> 4096 bytes in log
other systems often log only operation descriptions, more compact
easy: don't need to invent operation descriptions
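roughly, the records look like this (a simplified sketch; struct and field names are not the exact JBD declarations):
  #include <stdint.h>

  struct log_header {          /* starts every descriptor and commit block */
      uint32_t magic;          /* marks a journal block */
      uint32_t type;           /* descriptor or commit */
      uint32_t sequence;       /* transaction sequence number */
  };

  struct desc_tag {            /* one per logged data block, after the header */
      uint32_t blocknr;        /* the block's home location in the FS */
      uint32_t flags;
  };
  /* descriptor block: header + tags, then the full 4 KB data blocks follow;
     commit block: just a header with the matching sequence number */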
how does ext3 get good performance despite value logging?
defers copying cache block to log until it commits log to disk
batches many syscalls per commit
hopes multiple syscalls modified same block
thus many syscalls, only one copy of block in log
syscall:
h = start(# of log blocks to reserve)
get(h, block #)
warn logging system we'll modify cached block
prevent going to disk until after log is forced
(and copy-on-write)
modify the blocks in the cache
stop(h)
guarantee: all or none
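a sketch of how one syscall might use this interface (start/get/stop are the names used above; the types, block numbers, and append_block() are made up):
  typedef struct { int id; } handle_t;
  struct buf { char data[4096]; };

  handle_t start(int nblocks);               /* reserve log space */
  struct buf *get(handle_t h, int blockno);  /* declare intent to modify block */
  void stop(handle_t h);                     /* end this syscall's part of xaction */

  /* e.g. appending a block to a file touches the inode block and the
     block-allocation bitmap */
  void append_block(int inode_blockno, int bitmap_blockno) {
      handle_t h = start(2);
      struct buf *inode_b = get(h, inode_blockno);   /* pinned until commit */
      struct buf *bitmap_b = get(h, bitmap_blockno);
      /* ... modify inode_b->data and bitmap_b->data in the cache ... */
      (void)inode_b; (void)bitmap_b;
      stop(h);
  }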
ext3 groups many system calls into a "transaction"
again, hopes to coalesce many updates to a block into one logged block
there is only one open transaction at a time
ext3 commits a transaction to disk every five seconds
or on an fsync()
committing a transaction to disk
mark transaction as done (new syscalls must start new xaction)
wait for in-progress syscalls to stop()
for all blocks in list to be logged
write descriptor w/ block #
write block content from cache
wait for all log writes to finish
write the commit record
(now blocks mentioned in committed log can go to disk)
a new xaction may start while prev is committing
what if syscall in new xaction wants to change a block in committing xaction?
copy-on-write, so committing xaction uses consistent snapshot of blocks
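the commit path as a structural sketch (all function names are made up; the point is the barrier before the commit record):
  struct xaction;
  void close_xaction(struct xaction *t);
  void wait_for_syscalls(struct xaction *t);
  void write_descriptors_and_blocks(struct xaction *t);
  void wait_for_log_writes(void);
  void write_commit_record(struct xaction *t);
  void allow_checkpoint(struct xaction *t);

  void commit(struct xaction *t) {
      close_xaction(t);                  /* new syscalls go to a new xaction */
      wait_for_syscalls(t);              /* in-progress stop()s finish */
      write_descriptors_and_blocks(t);   /* block copies from the cache (COW) */
      wait_for_log_writes();             /* barrier: log content durable first */
      write_commit_record(t);            /* only now does the xaction "exist" */
      allow_checkpoint(t);               /* cached blocks may go to home locations */
  }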
if no crash:
after commit, cached blocks written to real locations in FS on disk
then that part of log can be re-used
wraps around
what if a crash?
may not have finished writing cache to disk (i.e. FS state not consistent)
crash may also interrupt writing the log (partial commit)
how does recovery work
1. find the start and end of the log
sb->s_start, sb->s_sequence
scan until bad record or not the expected seq #
so crash during commit -> last transaction ignored during recovery
2. replay all blocks from whole transactions, in log order
other interesting tidbits about ext3
ordered vs journaled
correctness challenges w/ ordered:
A. rmdir, re-use block for file, ordered write of file, crash before log
now file data has scribbled over a block the (never-removed) directory still uses
defer free of block until freeing operation forced to log on disk
B. rmdir, re-use block in file, ordered file write, log force,
crash, recovery replays the logged (old) directory block
so the file's data block now holds directory content, e.g. . and ..
so revoke records, prevent log replay of a given block
lack of checksum in commit record
bad news if disk re-orders writes
probably weak correctness for detecting end of log
what if old data looks like end-of-log record
reservations?
rm a file while fd is open, then crash
how to cause inode and blocks to be freed?
keep a list of orphaned files, pointed to by superblock
remember: question was "can xsyncfs ask ext3 to write a single block?"
answer: no
xsyncfs must force whole log, including unrelated updates
given that ext3 won't let you write an individual block
why do they have to do all that causal tracking?
why not just force the log when program does external output?
this eliminates a potential advantage of xsyncfs over async logging FS
in principle, if there are multiple unrelated apps,
xsyncfs only needs to force causally related updates
not entire log, which is what ordinary ext3 fsync() does
in practice, xsyncfs forces whole ext3 log
evaluation
figure 3: durability
very cool that they systematically test crashes
async is ordinary ext3, force every 5 seconds or on fsync
sync is force ext3 log after every system call
sync + write barriers fixes IDE write caching
xsyncfs uses ext3 and write barriers
comparison unfair: async ext3 should be using write barriers
that is how it was designed to be used
then it would be as durable (and as high performance) as xsyncfs
figure 6: mysql, vs async ext3 with write barriers
so both systems are arguably equally durable
why does xsyncfs win with few threads?
why does ext3 catch up with many threads?
figure 7: web server, not much difference, why?
what does specweb99 do?