"The Zebra Striped Network File System", Hartman and Ousterhout What are the top-level goals? Increase throughput and availability By striping data across multiple servers Do they present evidence that these goals are valuable? I.e. that existing systems were lacking in these areas Performance: yes, evidence that NFS and Sprite are slow Availability: no Top-level architecture? Multiple clients Multiple data servers One file manager Why does this architecture increase reliability? Over what baseline? Presumably standard NFS, single server It probably does *not* with the FM described in the paper FS acts as a Sprite server, with a local non-redundant disk But it might; FM could store stuff in DS's, then could be re-started anywhere What kinds of performance bottlenecks might this architecture eliminate? [draw with just one client first] Client: only good if multiple concurrent clients Server disk: probably not the best plan (multiple disks/server) Server CPU or net i/f: good Network: not good Server's ability to handle meta-data: not good, still one FM Why do they need the FM? Why not fully symmetric -- clients talk only to DS's? Need to synchronize updates to meta-data Can you imagine a design that eliminates the FM? Is this a network disk? That is, did they preserve the standard file system / disk split I.e. each DS exports simple read/write sector interface Why not? In general, to avoid conflicting writes Need disk to manage free list Need to be able to ask disks "last stripe written by client X" How do they decide how to divide up data over servers? File contents striped over DS's Each client stripes separately Meta-data only on the FM Why this partition of data? Could split up directory tree: /home/to0, /home/to1, &c Each DS gets a different sub-tree We want load balance Performance depends on balanced load, to harness parallelism Hot spot -> low overall utilization When might directory tree partition work well? Many independent clients Why did they choose to stripe every client over every DS? Good load balance for even a single client Also helps w/ availability: RAID Though this could probably be done otherwise; mirrored per-server disks How should one choose a stripe size? How about one block (i.e. each fragment is block/N)? Doesn't this maximize throughput even for single-block operations? No: most of time is in the seek for short operations So higher performance if different disks can seek for different ops They use huge stripes (512 kbytes?); does this work well? Small read/writes: yes, if many concurrent ops. Long sequential r/w: yes. They use RAID; won't this wreck small-write performance? Looks like four disk ops per small write. It's OK: LFS and RAID interact well. LFS makes small writes sequential, and batches them. Avoids small writes to random places on disk. Is there any down-side to batching? What if a client dies before updating parity? In general, writes are multi-step operations, involve multiple DS's A client crash will leave parity/data inconsistent Again, LFS saves us We know client was writing stripe at tail of log. DS's can tell us how much of the last stripe was written If the data looks good, we can use it, and re-build parity. If it looks bad, LFS allows us to ignore the tail of the log Why does each client have its own log? To avoid expense of synchronizing client writes Could imagine FM telling you where to write next for every write Also to allow per-client recovery What happens during a read? 1. Client sends Sprite open RPC to FM 2. 
What happens during a read?
  1. Client sends a Sprite open RPC to the FM
  2. FM does cache consistency, in case some other client has dirty data cached
  3. FM replies with the file "contents": a list of block pointers
     These may point into other clients' stripes on the DS's
  4. Client reads from the DS's in parallel

What happens during a write?
  1. Client sends open-for-write to the FM
  2. Application issues writes. They are buffered locally.
  3. Client decides to flush, or the FM asks it to.
  4. Client gathers *all* dirty blocks and decides how to append them to its log.
  5. Client generates a delta for each write:
       file, version #, file offset, new block pointer, old block pointer
     Puts the deltas in the log as well
     (see the delta sketch at the end of these notes)
  6. Client computes RAID parity
  7. Client appends the new data and parity to its log
     Parallel write RPCs to the DS's
     Never over-writes existing data
     But may overwrite parity -- why?
  8. Client sends the deltas to the FM (and the stripe cleaner)
  9. FM applies the deltas to its meta-data -- just the file block lists
     FM stores this stuff on a normal local disk (a Sprite/LFS file system)

Why log the deltas -- why not just send them to the FM?
  If the FM crashes, it may miss deltas.
    Or get them, but not have finished applying them.
    It logs internally, so the tail of its log may be missing.
  FM recovery reads and replays the tails of the client logs.
  Why not just ignore the tails of the client logs after an FM crash?
    After all, we're allowed to ignore the tails of logs.
    Because the clients didn't crash!
    Client apps are still running; we don't want their data to disappear.
  What if step 9 happens before step 7 (the log append to the DS's) has finished?
    The client's log write may fail, so the FM would have updated its
    meta-data with a bad block pointer.

What's a file version? Who allocates them? What's the point?
  Multiple clients may have been writing the same file offset before an FM crash.
  Need to replay the two log tails in the correct relative order.

What's a block pointer?
  A fragment identifier: client #, client log sequence #, which fragment in the stripe.
  What about the location on disk?
    When you present a fragment id to a DS, how does it know where to read?
    I don't know. It must maintain a map.
    How does it recover that map?

Did they actually get higher performance?
  Figure 6 (large writes):
    For 1 client, why do more servers give more bandwidth?
      Limited by disk write performance,
      so maybe they should have just put multiple disks on a single server.
    Why do more clients give more bandwidth?
      W/ many servers, limited by client CPU or net i/f speed.
    Why is NFS/Sprite performance so low?
      They claim small blocks and no async RPC,
        so no disk/net overlap.
      But NFS has biod, so it should be able to overlap.
  Figure 8 (small writes):
    Why is Sprite/Zebra so much faster than NFS?
  Figure 9 (utilization):
    What are they trying to demonstrate?
      That the FM is not a serious performance bottleneck.
    What's the right number of DS's per FM for large reads/writes? Many.
    How about DS's per active client? About one to one.
    How many clients per FM for small writes? About two.

Hardware limits:
  disk write     1.1 MB/s
  net interface  3.8 MB/s (or 8?)
  scsi bus       1.6 MB/s
  disk surface   2 MB/s

Questions for Hartman:
  Do the DS's maintain a mapping from stripe fragment ID to disk location?
  Do the DS's maintain a fragment free list?
  How are these made recoverable? Is there a performance impact?
  How is the parity update made atomic?
  Why does the parity update need to be atomic?
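Aside: delta sketch. A Go sketch to make steps 5, 8, and 9 of the write path
concrete. The paper says a delta carries the file, version, file offset, and
old/new block pointers, and that a block pointer is a fragment identifier
(client, log sequence number, position within the stripe); the exact struct
layout, the per-file version map, and the replay rule below are my assumptions,
not Zebra's code (Zebra scopes versions more finely than per file).

  // zebra_delta.go -- illustrative sketch; field names and the replay
  // rule are assumptions, not Zebra's implementation.
  package main

  import "fmt"

  // BlockPtr is a fragment identifier: which client's log, which log
  // segment (sequence number), and where within that stripe.
  type BlockPtr struct {
  	ClientID int
  	LogSeq   int
  	FragOff  int
  }

  // Delta describes one block update; the client puts it in its log and
  // also sends it to the file manager (and the stripe cleaner).
  type Delta struct {
  	FileID  int
  	Version int
  	FileOff int      // block-aligned offset within the file
  	Old     BlockPtr // previous location of the block (if any)
  	New     BlockPtr // new location in the client's log
  }

  // FileManager holds only meta-data: per-file block lists mapping file
  // offsets to block pointers, stored on its local Sprite/LFS disk.
  type FileManager struct {
  	blocks   map[int]map[int]BlockPtr // fileID -> fileOff -> pointer
  	versions map[int]int              // fileID -> latest version applied
  }

  // applyDelta is step 9: update the file's block list. During recovery
  // the FM replays the tails of all the client logs; version numbers
  // order conflicting writes from different clients. (Per-file versions
  // are a simplification made for this sketch.)
  func (fm *FileManager) applyDelta(d Delta) {
  	if d.Version < fm.versions[d.FileID] {
  		return // stale delta; a newer write has already been applied
  	}
  	if fm.blocks[d.FileID] == nil {
  		fm.blocks[d.FileID] = make(map[int]BlockPtr)
  	}
  	fm.blocks[d.FileID][d.FileOff] = d.New
  	fm.versions[d.FileID] = d.Version
  }

  func main() {
  	fm := &FileManager{
  		blocks:   make(map[int]map[int]BlockPtr),
  		versions: make(map[int]int),
  	}
  	fm.applyDelta(Delta{FileID: 7, Version: 2, FileOff: 0,
  		New: BlockPtr{ClientID: 1, LogSeq: 14, FragOff: 3}})
  	fmt.Println(fm.blocks[7][0]) // {1 14 3}
  }

Note that nothing in this map records where a fragment lives on a DS's disk,
which is exactly the "frag id -> disk location" question raised for Hartman
above.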