6.824 2012 Lecture 15: Case studies: Frangipani

Frangipani: A Scalable Distributed File System
Thekkath, Mann, Lee
SOSP 1997

Intro
  YFS is a simplified Frangipani
  Frangipani vs YFS:
    recovery from server crashes
    fault-tolerant extent server
    high-performance extent server
  Looking at Frangipani again now that we've studied the ingredients:
    replication, Paxos, consistency
  Look for ideas for final projects!

Overall architectural diagram
  [Clients, Frangipani, Petal, Lock]
  Petal: replicated block storage (no knowledge of FS)
  Frangipani: knows about inodes, directories, &c

Why lots of Frangipani servers sharing a Petal back-end?
  Why not split files over lots of Harp (or AFS) clusters?

Why separate Petal from Frangipani?

What does Petal do?
  [diagram: net, 3 srvrs, 6 disks]
  Looks like a huge disk (block read/write)
  Really striped across a cluster of Petal servers
  Two copies of every block
  Primary/backup replication, Paxos to agree on primaries

What happens if Frangipani server S1 creates file d/f?
  What steps does the server go through?
  Lock d, read d's inode/content from Petal,
    append op to *local* log, update local meta-data,
    release lock locally, reply to client.

What if S2 looks up d/f?
  S1 gets the REVOKE for d:
    writes log to Petal, writes meta-data to Petal, RELEASEs lock

What if two clients try to create the same file at the same time?

The locks do two main things:
  Atomic multi-write operations.
  Cache consistency: readers see most recent data.

What if a server dies while holding locks?
  Revoke its locks and continue?
  Ignore it until it comes back up and recovers itself?
  What does Frangipani do?

Suppose:
  S1: delete d/f1
      crash
  S2: create d/f1
  Will recovery re-play the delete?

Does recovery need to get the lock from S2?
  No, and it can't rely on this, since S2 may also have crashed.

Why not ask the lock server what locks the dead server held,
  rather than using version numbers?
  Maybe:
    S1: create d/f1
        create d/f2
        crash
    S2: del d/f1
  S1 holds the "d" lock but recovery should not replay create d/f1!
  Or the lock server might have crashed.

What if two servers crash at about the same time?
  And they both recently modified the same inode:
    S1: create d/f1
        crash
    S2: create d/f2
        crash
  Do we have to replay both ops?
  If we don't, do we risk missing an operation?
  Does it matter which order we replay their log records for that file?
  Can both recovery daemons read the vers #, see that it is old,
    both write, and the lower # write second, thus losing an update?

What's the exact version number rule?
  Replay if log# > block#?
  Or if log# >= block#?
  (See the replay sketch at the end of this section.)

What about re-use of freed blocks?
  S1: delete d1/f
      delete d1
      crash
  S2: create d2/f
      write d2/f
  Could f re-use d1's content block?
  And will replay of S1's log overwrite f's content?
  Do file blocks have version #s?

What if a server runs out of log space?
  What if it hasn't yet flushed the corresponding blocks to Petal?

How does Frangipani find the start/end of the log?
  Could there be ambiguity if the log has just the right content?

What's in a Frangipani log entry?
  (The paper is not explicit, but says ca. 180 bytes, thus not block images.)
  Operation description, e.g. create "f":inum1 in directory inum2
    may imply other modifications, e.g. remove inum1 from free bitmap
  Version #s of each updated block

What if:
  S1 holds a lock
  Network problem, so S2 decides S1 is dead, recovers, releases S1's locks
  But S1 is alive and subsequently writes data covered by the lock
  Or reads cached data

What if the partition heals just before the lease expires?
  Could the file server and lock server disagree about who holds the lock?
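Aside: before moving on to the lock service, here is a minimal Go sketch of
the recovery replay discussed above.  All type and function names are
hypothetical (the paper gives no code; it only says log entries are about
180 bytes and record a new version number for each metadata block they
modify).  The rule itself is the paper's: re-apply a logged update only if
the logged version number is strictly greater than the block's on-disk
version (log# > block#).

  package main

  import "fmt"

  // Hypothetical log-entry and block types (not the paper's format).
  type LogEntry struct {
      Op       string            // e.g. `create "f":inum1 in dir inum2`
      Versions map[uint64]uint64 // block # -> version written by this op
  }

  type Block struct {
      Version uint64 // version number kept in the block's header
      Data    []byte
  }

  // replay re-applies a crashed server's log against Petal.  An update is
  // applied only if the logged version is strictly greater than the
  // version already on the block, so writes that reached Petal before the
  // crash, or that a later lock holder superseded, are skipped.
  func replay(log []LogEntry, petal map[uint64]*Block, apply func(LogEntry, uint64)) {
      for _, e := range log {
          for blk, logVers := range e.Versions {
              if b := petal[blk]; b != nil && logVers > b.Version {
                  apply(e, blk)
                  b.Version = logVers
              }
          }
      }
  }

  func main() {
      petal := map[uint64]*Block{7: {Version: 5}}
      log := []LogEntry{{
          Op:       `create "f1":3 in dir 2`,
          Versions: map[uint64]uint64{7: 5},
      }}
      replay(log, petal, func(e LogEntry, blk uint64) {
          fmt.Printf("replaying %q on block %d\n", e.Op, blk)
      })
      // Prints nothing: block 7 is already at version 5, i.e. this update
      // reached Petal before the crash (a later lock holder would have
      // bumped the version even higher), so replay correctly skips it.
  }

Whether >= would also have been safe is exactly the version-number
discussion question above.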
What if a lock server crashes?
  Why does their lock service use Paxos?
  What do they need to agree about?
  Why do they have multiple lock servers (Figure 2)?

Is there a replicated state machine hiding inside Frangipani?
  Or does it get fault tolerance from some other approach?

For what workloads is Frangipani likely to have poor performance?
  Could its logs be a bottleneck?
  Could the lock server be a bottleneck?

Table 2: why are creates relatively slow, but deletes fast?
  Does NVRAM help?  Why/when would we expect NVRAM to help?

Why is Figure 5 flat?  Why not more load -> longer run times?

Petal details
  Petal provides Frangipani w/ fault-tolerant storage
    so it's worth discussing
    also it's a good source for Lab 8 project ideas
  Block read/write interface
    compatible with existing file systems
    looks like a single huge disk, but many servers and many many disks
    big, high performance
  Striped, 64-KB blocks
  Virtual: 64-bit sparse address space, allocate on write
    why?
  Virtual address space partitioned over the Petal servers
    each Petal srvr maintains a translation map, like virtual memory
  Primary/backup (one backup server)
    primary sends each write to the backup
    uses Paxos to agree on the primary for each virt addr range
  What about recovery after a crash?
    suppose the pair is S1+S2
    S1 fails, S2 is now the sole server
    S1 restarts, but has missed lots of updates
    S2 remembers a list of every block it wrote!
      so S1 only has to read those blocks, not the entire disk
    logging covers the virt->phys map and the missed-write info

Lab 8 project ideas:
  Frangipani-style logs to tolerate yfs_client failure
  Fault-tolerant replicated extent server, like Petal (see sketch below)
  Disk-based extent server, fast recovery after down-time
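Aside: a minimal Go sketch of the missed-write recovery idea above, and a
possible starting point for the "fault-tolerant replicated extent server"
project.  All names are hypothetical (this is not Petal's real interface):
the surviving member of a primary/backup pair remembers which virtual
blocks it wrote while its partner was down, so the restarted partner
fetches only those blocks instead of re-copying the whole virtual disk.

  package main

  import (
      "fmt"
      "sync"
  )

  // replica is one member of a Petal-style primary/backup pair
  // (hypothetical; real Petal also involves striping, a virt->phys map,
  // and Paxos to agree on the primary for each address range).
  type replica struct {
      mu           sync.Mutex
      blocks       map[uint64][]byte // virtual block # -> contents
      partnerUp    bool
      missedByPeer map[uint64]bool // blocks written while partner was down
  }

  func newReplica() *replica {
      return &replica{
          blocks:       map[uint64][]byte{},
          partnerUp:    true,
          missedByPeer: map[uint64]bool{},
      }
  }

  // write: the primary applies a client write locally and forwards it to
  // the backup; if the backup is down, it remembers which block was missed.
  func (r *replica) write(blk uint64, data []byte, backup *replica) {
      r.mu.Lock()
      defer r.mu.Unlock()
      r.blocks[blk] = append([]byte(nil), data...)
      if r.partnerUp {
          backup.applyFromPrimary(blk, data)
      } else {
          r.missedByPeer[blk] = true
      }
  }

  func (r *replica) applyFromPrimary(blk uint64, data []byte) {
      r.mu.Lock()
      defer r.mu.Unlock()
      r.blocks[blk] = append([]byte(nil), data...)
  }

  // recoverPartner brings a restarted backup up to date by copying only
  // the blocks it missed, then resumes normal mirroring.
  func (r *replica) recoverPartner(backup *replica) {
      r.mu.Lock()
      defer r.mu.Unlock()
      for blk := range r.missedByPeer {
          backup.applyFromPrimary(blk, r.blocks[blk])
      }
      r.missedByPeer = map[uint64]bool{}
      r.partnerUp = true
  }

  func main() {
      primary, backup := newReplica(), newReplica()
      primary.write(1, []byte("a"), backup)
      primary.partnerUp = false // backup "crashes" (sketch only; really
                                // decided via Paxos/liveness checking)
      primary.write(2, []byte("b"), backup)
      primary.write(3, []byte("c"), backup)
      primary.recoverPartner(backup) // copies only blocks 2 and 3
      fmt.Println(len(backup.blocks)) // 3
  }

A real extent server would also need to persist the missed-block list and
handle writes that arrive during recovery; Petal's own logging covers the
virt->phys map and the missed-write info, as noted above.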