6.5840 2023 Lecture 13: Frangipani

Frangipani: A Scalable Distributed File System
Thekkath, Mann, Lee
SOSP 1997

why are we reading this paper?
  cache coherence
  distributed transactions
  distributed crash recovery
  and the interaction among the three

what's the context?
  the authors' research lab
    programming, text processing, e-mail, &c
    lots of collaboration
  UNIX workstations in offices
  most file access is to user's own files
  need to potentially use any file on any workstation
    user/user collaboration
    one user logging into multiple workstations
  solution: a network file system
    looks to users/apps just like an ordinary local-disk filesystem
    but workstations talk to server to get at files
  this was a common scenario when the paper was written

the state of the art in 1997: AFS and NFS
  [workstations; one AFS server]
  MIT Athena uses AFS: /afs/.../rtm
  RPC interface is system calls: create(), mkdir(), readdir(), read(), &c
  one central server, executes all filesystem logic
  problem: central server gets overloaded, hard to scale up

Frangipani overall design?
  [users; workstations + Frangipani; network; petal servers; lock server]
  Petal: block storage service; replicated; striped+sharded for performance
    Petal knows nothing about file systems
  the workstations run the filesystem code: Frangipani

What does Frangipani store in Petal?
  like an ordinary hard disk file system:
  i-nodes, file content blocks, directory content blocks, free bitmaps
  file content vs "meta-data"

what properties were wanted?
  strong consistency
  write-back caching in each workstation
    operate entirely from workstation cache if no sharing
    including creating files, creating directories, rename, &c
  scalability
    Petal already can be scaled up in storage and CPU power
    they wanted the file system processing also to scale

challenges
  cache coherence
    WS1 creates file "x" in local cache; WS2 should see it if it looks!
    i.e. they wanted linearizability despite write-back cache
  isolation
    for both write/write and write/read
    uses locks -- but distributed
  crash recovery
    WS1 crashes while creating in directory "d", but doesn't reboot
    WS1 was holding a lock on "d", so "d" may not be safe to look at!
    (like ACID transactions, but implementation is unusual)

first challenge: cache coherence
  it's what provides linearizability despite caching
    so reads see writes
  many systems use "cache coherence protocols"
    multi-core, file servers, distributed shared memory

Frangipani's coherence protocol (simplified):

  lock server (LS), with one lock per file/directory

    file  owner
    -----------
    x     WS1
    y     WS1

  workstation (WS) Frangipani cache:

    file/dir  lock  content
    -----------------------
    x         busy  ...
    y         idle  ...

  if WS holds lock,
    busy: using data right now
    idle: holds lock but not using the cached data right now

  workstation rules:
    don't cache unless you hold the lock
    acquire lock, then read from Petal
    write to Petal, then release lock

  coherence protocol messages:
    request (WS -> LS)
    grant   (LS -> WS)
    revoke  (LS -> WS)
    release (WS -> LS)

  the locks are named by files/directories (really i-numbers), though the
  lock server doesn't actually understand anything about file systems or Petal.
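
To make the rules concrete, here is a minimal sketch in Go of the workstation
side of the protocol. The LockServer and Petal interfaces, and every other
name here, are invented for illustration -- the paper does not define an API.

  package frangipani

  // LockServer stands in for the lock service's interface (in reality RPC).
  type LockServer interface {
    Request(inum uint64) // blocks until the LS sends grant(inum)
    Release(inum uint64)
  }

  // Petal stands in for Petal's block interface, keyed here by i-number
  // for simplicity.
  type Petal interface {
    Read(inum uint64) []byte
    Write(inum uint64, data []byte)
  }

  // One cached file/directory; present in the map only while the lock is
  // held. A real implementation would also track busy vs idle (the table
  // above); omitted here.
  type entry struct {
    data  []byte
    dirty bool
  }

  type Workstation struct {
    ls    LockServer
    petal Petal
    cache map[uint64]*entry
  }

  // read follows the rule "acquire lock, then read from Petal";
  // nothing is cached unless the lock is held.
  func (ws *Workstation) read(inum uint64) []byte {
    e, ok := ws.cache[inum]
    if !ok {
      ws.ls.Request(inum)
      e = &entry{data: ws.petal.Read(inum)}
      ws.cache[inum] = e
    }
    return e.data
  }

  // handleRevoke follows the rule "write to Petal, then release lock".
  // A real implementation would first wait for any in-progress ("busy")
  // operation on this entry to finish; that is omitted here.
  func (ws *Workstation) handleRevoke(inum uint64) {
    if e := ws.cache[inum]; e != nil && e.dirty {
      ws.petal.Write(inum, e.data)
    }
    delete(ws.cache, inum)
    ws.ls.Release(inum)
  }

  Note how the rules appear: read() never touches data whose lock it doesn't
  hold, and handleRevoke() writes dirty data back to Petal before releasing,
  which is what forces a later reader on another workstation to see the last write.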

example: WS1 changes file z, then WS2 reads z

  WS1                              LS                         WS2
  read z
    --request(z)-->
                               owner(z)=WS1
    <--grant(z)---
  (read+cache z data from Petal)
  (modify z locally)
  (when done, z's lock is
   cached in "idle" state)
                                                              read z
                                        <--request(z)--
    <--revoke(z)--
  (write modified z to Petal)
    --release(z)-->
                               owner(z)=WS2
                                        --grant(z)-->
                                                       (read z from Petal)

  the point: locks and rules force reads to see last write
    locks ensure that "last write" is well-defined

coherence optimizations
  the "idle" state allows lock caching if no conflicts
  Frangipani has shared read locks, as well as exclusive write locks

Frangipani's cache coherence is unusually sophisticated for a network FS
  but design is very close to previous multi-processor coherence schemes

next challenge: isolation
  what if two workstations try to create the same file at the same time?
  what if WS1 looks at a file/directory while WS2 is modifying it?
    e.g. file create initializes i-node, adds directory entry
    e.g. rename (both names visible? neither?)

Frangipani provides isolated file-system operations:
  operation corresponds to a system call (create file, remove file, rename, &c)
  WS acquires locks on all file system data that it will modify
  performs operation with all locks held
  only releases when finished
  only responds to a "revoke" after entire operation is complete
  the locks prevent any other WS from seeing partially-complete operations
    and prevent another WS from performing a conflicting update

there's not much surprising to this isolation design
  mostly that the lock server is factored out and shared

last challenge: crash recovery
  What if a Frangipani workstation dies while holding locks?
  other workstations will want to continue operating...
  can we just revoke dead WS's locks?
    what if dead WS had modified data in its cache?
    what if dead WS had started to write back modified data to Petal?
      e.g. WS wrote new directory entry to Petal, but not initialized i-node
      this is the troubling case
  Is it OK to just wait until a crashed workstation reboots?

Frangipani uses write-ahead logging for crash recovery
  Before writing any of op's cached blocks to Petal,
    writes log entry to Petal containing description of entire op
  So if a crashed workstation has done some Petal writes for an operation,
    but not all, the writes can be completed from the log in Petal
  Very traditional -- but...
  1) Frangipani has a separate log for each workstation
     this avoids a logging bottleneck, eases decentralization
     but may scatter updates to a given file over many logs
  2) Frangipani's logs are in shared Petal storage, not local disk
     WS2 can read WS1's log to recover from WS1 crashing
  Separate logs is an interesting and unusual arrangement

Frangipani's logging scheme involves four kinds of storage:
  * FS blocks in Petal (non-volatile)
  * WS's circular log in Petal (non-volatile)
  * cache of FS blocks in WS RAM (volatile)
  * tail of WS's log in WS RAM (volatile)
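
A rough sketch in Go of how those four kinds of storage might be represented.
All names here are made up for illustration; the paper does not describe
concrete data structures.

  package frangipani

  // Non-volatile, in Petal, shared by every workstation:
  //   - the file system's blocks: i-nodes, directories, bitmaps, file content
  //   - one circular log region per workstation
  type petalLayout struct {
    fsBlocks [][]byte // the shared file-system image
    wsLogs   [][]byte // one circular log region per workstation
  }

  // Volatile, in one workstation's RAM, lost if the workstation crashes:
  type wsState struct {
    cache   map[uint64][]byte // write-back cache of FS blocks, possibly dirty
    logTail []logEntry        // log entries not yet written to Petal
  }

  // logEntry describes one metadata operation; its contents are discussed next.
  type logEntry struct{}

  The point of the split: a file-system operation modifies only the cache and
  logTail; Petal sees the log entry first, and the modified blocks only later.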

What's in each workstation's log?
  log sequence number (LSN) per 512-byte log block
  log entry: (this is a guess, paper isn't explicit)
    array of updates: block #, new version #, offset, new bytes
  just contains meta-data updates, not file content updates

example -- create file d/f produces a log entry:
  a three-entry update array:
    add an "f" entry to d's content block, with new i-number
    initialize the i-node for f
    set the bit for f's i-node in the allocation bitmap

initially the log entry is in WS local memory (not yet Petal)

When WS gets lock revocation on modified directory from LS:
  1) write all as-yet-unwritten in-memory log entries to Petal, then
  2) send the cached updated FS blocks to Petal, then
  3) release the locks to the LS

Why must WS write the log to Petal before updating i-node and directory &c in Petal?

Why delay writing the log until LS revokes locks?

What happens when WS1 crashes while holding locks?
  Not much, until WS2 requests a lock that WS1 holds
  LS sends revoke to WS1, gets no response
  LS times out, tells WS2 to recover WS1 from its log in Petal

What does WS2 do to recover from WS1's log?
  Read WS1's log from Petal
  Perform Petal writes described by logged operations
  Tell LS it is done, so LS can release WS1's locks

Note it's crucial that each WS log is in Petal
  so that it can be read by any WS for recovery.

What if WS1 crashes before it writes its recent log to Petal?
  WS1's recent operations may be totally lost if WS1 crashes.
  But the file system in Petal will be internally consistent.

Why is it safe to replay just one log, despite interleaved
  operations on same files by other workstations?
  Example:
    WS1: delete(d/f)  crash
    WS2:                     create(d/f)
    WS3:                                  recover WS1
  WS3 is recovering WS1's log -- but it doesn't look at WS2's log
  Will recovery re-play the delete?
    This is The Question
  No -- prevented by "version number" mechanism (see the sketch below)
    Version number in each meta-data block (i-node) in Petal
    Version number(s) in each logged op is block's version plus one
    Recovery replays only if op's version > block version
      i.e. only if the block hasn't yet been updated by this op

Does WS3 need to acquire the d or d/f lock?
  No:
  if the version number is the same as before the operation,
    WS1 didn't do the write,
    so it couldn't have released the lock,
    so no-one else could have the lock,
    so it's safe for WS3 to update in Petal

The log holds meta-data but not file content.
  Workstations do write file content to Petal before releasing the lock.
    This might be after the WS writes its log to Petal.
  If a WS crashes, a file may be missing some or all content.

Why does this make sense? Why not log content too?
  1. Logging content would be much slower, since content would
     then be written twice to Petal.
  2. For many files it wouldn't matter, e.g. object files generated by cc.
  3. It wouldn't help applications that need to be correct, e.g. a DB
     or e-mail system, since they usually have notions of transaction
     that are different from single FS ops.
  4. Existing filesystems had (and have) similar behavior, so careful
     applications were already written to e.g. use fsync().

What if:
  WS1 holds a lock
  Network partition
  Lock server decides WS1 is dead, recovers, releases WS1's locks
  If WS1 is actually alive, could it then try to write data covered by the lock?
Locks have leases!
  Lock owner can't use a lock past its lease period
  LS doesn't start recovery until after the lease expires

Is Paxos or Raft hidden somewhere here?
  Yes -- Paxos-based configuration managers for lock servers, Petal servers
  ensures a single lock server, despite partition
  ensures a single primary for each Petal shard
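
Here is the replay sketch promised above: a minimal Go fragment showing the
version-number check used when replaying a crashed workstation's log. The
LogEntry/Update layout is a guess (as noted earlier), and storing a block's
version in its first 8 bytes is an assumption made only for this sketch.

  package frangipani

  import "encoding/binary"

  // Petal stands in for Petal's block interface; Petal itself knows nothing
  // about versions -- they live inside Frangipani's metadata blocks.
  type Petal interface {
    Read(block uint64) []byte
    Write(block uint64, data []byte)
  }

  // One log entry describes one complete metadata operation.
  type LogEntry struct {
    Updates []Update
  }

  // One update to one metadata block; Version is the version the block
  // will have after the update (old version + 1).
  type Update struct {
    Block   uint64
    Version uint64
    Offset  int
    Bytes   []byte
  }

  // blockVersion extracts the version stored in a metadata block
  // (assumed here to be the block's first 8 bytes).
  func blockVersion(b []byte) uint64 {
    return binary.LittleEndian.Uint64(b)
  }

  // recoverLog replays a crashed workstation's log, read from Petal.
  // An update is applied only if the block's current version is older than
  // the update's version, i.e. only if the crashed workstation never got
  // this write to Petal -- so replaying one log is safe even though other
  // workstations may have modified the block since.
  func recoverLog(p Petal, log []LogEntry) {
    for _, e := range log {
      for _, u := range e.Updates {
        blk := p.Read(u.Block)
        if blockVersion(blk) < u.Version {
          copy(blk[u.Offset:], u.Bytes)
          binary.LittleEndian.PutUint64(blk, u.Version) // record the new version
          p.Write(u.Block, blk)
        }
      }
    }
  }

  Note that recoverLog writes Petal without taking any locks; per the argument
  above, an unchanged version number means the crashed workstation never
  completed the write, so it never released the lock, and no other workstation
  can be touching the block.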

Performance?

what might we want to know?
  * basic single-user performance
    - small operations per second (e.g. file create)
    - throughput for big reads/writes
  * scaling with more users
    - again, both small and big operations
  * what limits the performance?

Tables 1 and 2 show single-user small-op performance
  against AdvFS, a conventional local-disk file system
  some differences, but similar
  hard to guess what the limiting factors are
    disk seek? network round-trip? CPU time?

Table 3 shows single-user large-file throughput
  around 10 MB/second
  why? seems low given Petal's 63 disks, each good for 6 MB/second

Figure 5 shows small-file scalability
  X axis is number of active workstations
    each workstation runs a file-intensive benchmark
    workstations use different files and directories
  Y axis is completion time for a single workstation
  flat implies good scaling == no significant shared bottleneck
  why?
    maybe each workstation is mostly operating out of its own cache
    maybe Petal's many disks yield parallel performance

Figure 6 shows large-file read scalability
  why 10 MB/second for one client? (it's a familiar number...)
  why does the line go up with more workstations?
  where might the line stop?

Figure 7 shows large-file write scalability
  why does the line stop? why at 40 MB/s?

Petal details
  Petal provides Frangipani w/ fault-tolerant high-performance storage
    so it's worth discussing
  block read/write interface
    compatible with existing file systems
  looks like single huge disk, but many servers and many many disks
    big, high performance
  striped, 64-KB blocks
  virtual: 64-bit sparse address space, allocate on write
    address translation map
  primary/backup (one backup server)
    primary sends each write to the backup
    uses Paxos to agree on primary for each virt addr range
  what about recovery after crash?
    suppose pair is S1+S2
    S1 fails, S2 is now sole server
    S1 restarts, but has missed lots of updates
    S2 remembers a list of every block it wrote!
      so S1 only has to read those blocks, not entire disk
      (a sketch of this appears at the end of these notes)
  logging: virt->phys map and missed-write info

Limitations
  Most useful for e.g. programmer workstations, not so much otherwise
    A file system is not a great API for many applications, e.g. web site
  Frangipani enforces permissions, so workstations must be trusted
    so Athena couldn't run Frangipani on Athena workstations
  Frangipani/Petal split is a little awkward
    both layers log
    Petal may accept updates from "down" Frangipani workstations
    more RPC messages than a simple file server

Ideas to remember
  cache coherence
  distributed crash recovery
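
A sketch in Go of the Petal recovery idea from "Petal details" above: while
its partner is down, the surviving server remembers which blocks it wrote, so
the partner only copies those blocks when it restarts, rather than the whole
virtual disk. Names and structure are invented and greatly simplified -- the
real Petal's virtual-to-physical map, striping, and Paxos-chosen primaries
are all omitted.

  package petal

  import "sync"

  type Server struct {
    mu          sync.Mutex
    blocks      map[uint64][]byte // this server's copy of the data
    backup      *Server           // nil on the backup itself
    backupAlive bool
    missed      map[uint64]bool // blocks written while the partner was down
  }

  // Write applies a client write locally and mirrors it to the backup
  // (in reality via RPC). If the backup is down, the block number is
  // remembered for later resynchronization.
  func (s *Server) Write(block uint64, data []byte) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.blocks[block] = append([]byte(nil), data...)
    if s.backup == nil {
      return
    }
    if s.backupAlive {
      s.backup.Write(block, data)
    } else {
      s.missed[block] = true
    }
  }

  // Resync runs when the partner comes back: only the missed blocks are
  // copied, not the entire (huge, sparse) virtual disk.
  func (s *Server) Resync() {
    s.mu.Lock()
    defer s.mu.Unlock()
    for b := range s.missed {
      s.backup.Write(b, s.blocks[b])
    }
    s.missed = map[uint64]bool{}
    s.backupAlive = true
  }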