"The Zebra Striped Network File System", Hartman and Ousterhout What are the top-level goals? Increase throughput and availability By striping data across multiple servers Do they present evidence that these goals are valuable? I.e. that existing systems were lacking in these areas Performance: yes, evidence that NFS and Sprite are slow Availability: no Top-level architecture? Multiple clients Multiple data servers One file manager Why does this architecture increase reliability? Over what baseline? Presumably standard NFS, single server It probably does *not* with the FM described in the paper FS acts as a Sprite server, with a local non-redundant disk But it might; FM could store stuff in DS's, then could be re-started anywhere What kinds of performance bottlenecks might this architecture eliminate? [draw with just one client first] Client: only good if multiple concurrent clients Server disk: probably not the best plan (multiple disks/server) Server CPU or net i/f: good Network: not good Server's ability to handle meta-data: not good, still one FM Why do they need the FM? Why not fully symmetric -- clients talk only to DS's? Need to synchronize updates to meta-data Can you imagine a design that eliminates the FM? Is this a network disk? That is, did they preserve the standard file system / disk split I.e. each DS exports simple read/write sector interface Why not? In general, to avoid conflicting writes Need disk to manage free list Need to be able to ask disks "last stripe written by client X" How do they decide how to divide up data over servers? File contents striped over DS's Each client stripes separately Meta-data only on the FM Why this partition of data? Could split up directory tree: /home/to0, /home/to1, &c Each DS gets a different sub-tree We want load balance Performance depends on balanced load, to harness parallelism Hot spot -> low overall utilization When might directory tree partition work well? Many independent clients Why did they choose to stripe every client over every DS? Good load balance for even a single client Also helps w/ availability: RAID Though this could probably be done otherwise; mirrored per-server disks How should one choose a stripe size? How about one block (i.e. each fragment is block/N)? Doesn't this maximize throughput even for single-block operations? No: most of time is in the seek for short operations So higher performance if different disks can seek for different ops They use huge stripes (512 kbytes?); does this work well? Small read/writes: yes, if many concurrent ops. Long sequential r/w: yes. They use RAID; won't this wreck small-write performance? Looks like four disk ops per small write. It's OK: LFS and RAID interact well. LFS makes small writes sequential, and batches them. Avoids small writes to random places on disk. Is there any down-side to batching? What if a client dies before updating parity? In general, writes are multi-step operations, involve multiple DS's A client crash will leave parity/data inconsistent Again, LFS saves us We know client was writing stripe at tail of log. DS's can tell us how much of the last stripe was written If the data looks good, we can use it, and re-build parity. If it looks bad, LFS allows us to ignore the tail of the log Why does each client have its own log? To avoid expense of synchronizing client writes Could imagine FM telling you where to write next for every write Also to allow per-client recovery What happens during a read? 1. Client sends Sprite open RPC to FM 2. 
What happens during a read?
  1. Client sends a Sprite open RPC to the FM
  2. FM does cache consistency, in case some other client has dirty data cached
  3. FM replies with the file "contents": a list of block pointers
     These may point into other clients' stripes on the DS's
  4. Client reads from the DS's in parallel

What happens during a write?
  1. Client sends open-for-write to the FM
  2. Application issues writes. They are buffered locally.
  3. Client decides to flush, or the FM asks it to.
  4. Client gathers *all* dirty blocks and decides how to append them to its log.
  5. Client generates a delta for each write:
       file, version #, file offset, new block pointer, old block pointer
     Puts the deltas in the log as well
     (see the delta sketch at the end of these notes)
  6. Client computes RAID parity
  7. Client appends the new data and parity to its log
     Parallel write RPCs to the DS's
     Never over-writes existing data
     But may overwrite parity -- why?
  8. Client sends the deltas to the FM (and the stripe cleaner)
  9. FM applies the deltas to its meta-data -- just the file block lists
     FM stores this stuff on a normal local disk (a Sprite/LFS file system)

Why log the deltas -- why not just send them to the FM?
  If the FM crashes, it may miss deltas.
    Or get them, but not have finished applying them.
    It logs internally, so the tail of its log may be missing.
  FM recovery reads and replays the tails of the client logs.
  Why not just ignore the tails of the client logs after an FM crash?
    After all, we're allowed to ignore the tails of logs.
    Because the clients didn't crash!
    Client apps are still running; we don't want their data to disappear.
  What if step 9 happens before step 7 (the log append to the DS's) has finished?
    The client's log write may fail, so the FM would have updated its
    meta-data with a bad block pointer.

What's a file version? Who allocates them? What's the point?
  Multiple clients may have been writing the same file offset before an FM crash.
  Need to replay the two log tails in the correct relative order.

What's a block pointer?
  A fragment identifier: client #, client log sequence #, which fragment in the stripe.
  What about the location on disk?
    When you present a fragment id to a DS, how does it know where to read?
    I don't know. It must maintain a map.
    How does it recover that map?

Did they actually get higher performance?
  Figure 6 (large writes):
    For 1 client, why do more servers give more bandwidth?
      Limited by disk write performance,
      so maybe they should have just put multiple disks on a single server.
    Why do more clients give more bandwidth?
      W/ many servers, limited by client CPU or net i/f speed.
    Why is NFS/Sprite performance so low?
      They claim small blocks and no async RPC,
        so no disk/net overlap.
      But NFS has biod, so it should be able to overlap.
  Figure 8 (small writes):
    Why is Sprite/Zebra so much faster than NFS?
  Figure 9 (utilization):
    What are they trying to demonstrate?
      That the FM is not a serious performance bottleneck.
    What's the right number of DS's per FM for large reads/writes? Many.
    How about DS's per active client? About one to one.
    How many clients per FM for small writes? About two.

Hardware limits:
  disk write     1.1 MB/s
  net interface  3.8 MB/s (or 8?)
  scsi bus       1.6 MB/s
  disk surface   2 MB/s

Questions for Hartman:
  Do the DS's maintain a mapping from stripe fragment ID to disk location?
  Do the DS's maintain a fragment free list?
  How are these made recoverable? Is there a performance impact?
  How is the parity update made atomic?
  Why does the parity update need to be atomic?
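Aside: delta sketch. A Go sketch to make steps 5, 8, and 9 of the write path
concrete. The paper says a delta carries the file, version, file offset, and
old/new block pointers, and that a block pointer is a fragment identifier
(client, log sequence number, position within the stripe); the exact struct
layout, the per-file version map, and the replay rule below are my assumptions,
not Zebra's code (Zebra scopes versions more finely than per file).

  // zebra_delta.go -- illustrative sketch; field names and the replay
  // rule are assumptions, not Zebra's implementation.
  package main

  import "fmt"

  // BlockPtr is a fragment identifier: which client's log, which log
  // segment (sequence number), and where within that stripe.
  type BlockPtr struct {
  	ClientID int
  	LogSeq   int
  	FragOff  int
  }

  // Delta describes one block update; the client puts it in its log and
  // also sends it to the file manager (and the stripe cleaner).
  type Delta struct {
  	FileID  int
  	Version int
  	FileOff int      // block-aligned offset within the file
  	Old     BlockPtr // previous location of the block (if any)
  	New     BlockPtr // new location in the client's log
  }

  // FileManager holds only meta-data: per-file block lists mapping file
  // offsets to block pointers, stored on its local Sprite/LFS disk.
  type FileManager struct {
  	blocks   map[int]map[int]BlockPtr // fileID -> fileOff -> pointer
  	versions map[int]int              // fileID -> latest version applied
  }

  // applyDelta is step 9: update the file's block list. During recovery
  // the FM replays the tails of all the client logs; version numbers
  // order conflicting writes from different clients. (Per-file versions
  // are a simplification made for this sketch.)
  func (fm *FileManager) applyDelta(d Delta) {
  	if d.Version < fm.versions[d.FileID] {
  		return // stale delta; a newer write has already been applied
  	}
  	if fm.blocks[d.FileID] == nil {
  		fm.blocks[d.FileID] = make(map[int]BlockPtr)
  	}
  	fm.blocks[d.FileID][d.FileOff] = d.New
  	fm.versions[d.FileID] = d.Version
  }

  func main() {
  	fm := &FileManager{
  		blocks:   make(map[int]map[int]BlockPtr),
  		versions: make(map[int]int),
  	}
  	fm.applyDelta(Delta{FileID: 7, Version: 2, FileOff: 0,
  		New: BlockPtr{ClientID: 1, LogSeq: 14, FragOff: 3}})
  	fmt.Println(fm.blocks[7][0]) // {1 14 3}
  }

Note that nothing in this map records where a fragment lives on a DS's disk,
which is exactly the "frag id -> disk location" question raised for Hartman
above.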