6.824 Lecture 14: Case studies: Frangipani

Frangipani: A Scalable Distributed File System
Thekkath, Mann, Lee
SOSP 1997

Intro
  YFS is a simplified Frangipani
  Looking at Frangipani again now that we've studied the ingredients:
    replication, Paxos, consistency
  look for ideas for final projects!

Overall architectural diagram
  [Clients, Frangipani, Petal, Lock]
  Petal: replicated block storage (no knowledge of FS)
  Frangipani: knows about inodes, directories, &c

Why not a single primary/backup pair of combined file/storage servers?

Why the Frangipani/Petal split?

Why not partition files among a set of primary/backup pairs?

What does Petal do?
  [diagram]
  Looks like a huge disk (block read/write)
  Really striped across a cluster of Petal servers
  Two copies of every block
  Primary/backup replication, Paxos to agree on primaries

What happens if a client e.g. creates a file?
  What steps does the server go through?
  acquire lock, read from Petal, append to *local* log,
    update local meta-data, release lock locally, reply to client

What if a client on a different server reads that file?
  S1 gets the REVOKE
  writes log to Petal, writes meta-data to Petal, RELEASEs lock

What if two clients try to create the same file at the same time?

How does all this relate to replicated state machines and primary/backup?
  is there RSM and p/b hiding here somewhere?

The locks are doing two things:
  Atomic multi-write operations.
  Cache consistency: readers see most recent data.

What if a server dies while holding locks?
  Revoke its locks and continue?
  Ignore it until it comes back up and recovers itself?

What does Frangipani do to recover?

Suppose S1 deletes f1, flushes its block+log, releases lock.
  Then S2 acquires lock and creates a new f1.
  Then S1 crashes.
  Will recovery re-play the delete?

Does recovery need to get the lock from S2?
  No, and it can't rely on this, since S2 may also have crashed.
  What if S1's recovery runs before S2 has written the block? What if after?

Why not ask the lock server what locks the dead server held,
  rather than using version numbers?
  Maybe:
    S1: create d/f1
        create d/f2
        crash
    S2: del d/f1
    S1 holds the "d" lock but should not replay create d/f1!
  Or the lock server might have crashed.

What if two servers crash at about the same time,
  and they both recently modified the same file?
  Do we have to replay both ops?
  If we don't, do we risk missing an operation?
  Does it matter which order we replay their log records for that file?
  Can both recovery daemons read the version #, see that it is old,
    both write, and the lower # write second?
    Thus losing an update?

What's the exact version-number rule?
  Replay if log# > block#? Or if log# >= block#?
  (see the Go sketch after the performance questions below)

Suppose S1 creates f1, creates f2, then crashes.
  What combinations of f1 and f2 are allowed after recovery?
  First-order goal, or artifact of the design?

What if a server runs out of log space?
  What if it hasn't yet flushed the corresponding blocks to Petal?

How does Frangipani find the start/end of the log?
  Could there be ambiguity if the log has just the right content?

What if:
  S1 holds a lock
  Network problem, so S2 decides S1 is dead, recovers, releases S1's locks
  But S1 is alive and subsequently writes data covered by the lock,
    or reads cached data
  What if the partition heals just before the lease expires?
  Could the file server and lock server disagree about who holds the lock?

What if a lock server crashes?

Why does their lock service use Paxos?
  What do they need to agree about?

For what workloads is Frangipani likely to have poor performance?
  Could its logs be a bottleneck?
  Could the lock server be a bottleneck?
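Version-number replay sketch (referenced above). This is a minimal Go
sketch with hypothetical types (LogRecord, Petal), not the paper's code;
it assumes the strict rule that recovery applies a log record only when
log# > block#, so an update already superseded by another server (e.g.
S2's new f1 after S1's logged delete) is skipped.

  package frangipani

  // Minimal sketch of Frangipani-style log replay during recovery.
  // LogRecord and Petal are hypothetical stand-ins; the real system
  // keeps each server's log and per-block version numbers in Petal.

  // LogRecord describes one logged metadata update: the Petal block it
  // modifies, the new contents, and the version number the block will
  // carry after the update.
  type LogRecord struct {
      Block   uint64 // Petal block number holding the metadata
      Version uint64 // version assigned when the update was first performed
      Data    []byte // new contents for the block
  }

  // Petal is the small slice of the block store that recovery needs.
  type Petal interface {
      ReadVersion(block uint64) uint64 // version currently stored in the block
      Write(block uint64, data []byte) // overwrite the block
  }

  // Replay re-applies a crashed server's log records in log order.
  // The assumed rule: apply a record only if its version is strictly
  // greater than the version already in the block (log# > block#).
  // If the block's version is >= the record's, some server has already
  // performed this update or a later one, and replaying it would
  // clobber newer data.
  func Replay(p Petal, log []LogRecord) {
      for _, r := range log {
          if r.Version > p.ReadVersion(r.Block) {
              p.Write(r.Block, r.Data)
          }
      }
  }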
Table 2: why are creates relatively slow, but deletes fast?

Does NVRAM help?
  Why/when would we expect NVRAM to help?

Why is Figure 5 flat?
  Why doesn't more load lead to longer run times?

Petal details
  Petal provides Frangipani with fault-tolerant storage,
    so it's worth discussing;
    it's also a good source for Lab 8 project ideas
  block read/write interface
    compatible with existing file systems
  looks like a single huge disk, but many servers and many many disks
    big, high performance
  striped, 64-KB blocks
  virtual: 64-bit sparse address space, allocate on write
    why?
  virtual address space partitioned over the Petal servers
    each Petal server maintains a translation map, like virtual memory
  primary/backup (one backup server)
    primary sends each write to the backup
  uses Paxos to agree on the primary for each virtual address range
  what about recovery after a crash?
    suppose the pair is S1+S2
    S1 fails, S2 is now the sole server
    S1 restarts, but has missed lots of updates
    S2 remembers a list of every block it wrote!
      so S1 only has to read those blocks, not the entire disk
      (see the Go sketch after the project ideas below)
    logs the virt->phys map and the missed-write info

Lab 8 project ideas:
  Frangipani-style logs to tolerate yfs_client failure
  Fault-tolerant replicated extent server, like Petal
  Disk-based extent server, fast recovery after down-time
  High-performance extent server cluster
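Missed-write tracking sketch (referenced in the Petal recovery notes
above, and the core of the "fault-tolerant replicated extent server"
project idea). A minimal Go sketch with hypothetical names (BlockServer,
sendToBackup), assuming one primary and one backup; the real Petal also
logs its virtual-to-physical map and uses Paxos to choose the primary
for each address range.

  package petal

  import "sync"

  // BlockServer sketches the primary's state for one address range.
  type BlockServer struct {
      mu          sync.Mutex
      data        map[uint64][]byte // block number -> contents
      backupAlive bool              // is the backup currently accepting writes?
      missed      map[uint64]bool   // blocks written while the backup was down
  }

  // Write applies a client write locally and mirrors it to the backup.
  // If the backup is down, the primary remembers which blocks it wrote,
  // so a restarted backup only copies those blocks, not the whole disk.
  func (s *BlockServer) Write(block uint64, data []byte) {
      s.mu.Lock()
      defer s.mu.Unlock()
      s.data[block] = data
      if s.backupAlive {
          s.sendToBackup(block, data) // replication RPC (placeholder)
      } else {
          s.missed[block] = true
      }
  }

  // BackupRejoined runs when the failed backup restarts: push only the
  // blocks written during the outage, then resume normal mirroring.
  func (s *BlockServer) BackupRejoined() {
      s.mu.Lock()
      defer s.mu.Unlock()
      for block := range s.missed {
          s.sendToBackup(block, s.data[block])
      }
      s.missed = make(map[uint64]bool)
      s.backupAlive = true
  }

  // sendToBackup stands in for the RPC that ships a write to the backup.
  func (s *BlockServer) sendToBackup(block uint64, data []byte) {}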