6.824 2009 Lecture 14: Case studies: Frangipani

Frangipani: A Scalable Distributed File System
Thekkath, Mann, Lee
SOSP 1997

intro
  YFS is a watered-down version of Frangipani
  we're looking at Frangipani again now that we've studied the
    ingredients: replication, Paxos, consistency
  this will be a discussion more than a lecture
  look for ideas for final projects!

Why not primary copy?
  A single primary is not scalable enough.
  Each Frangipani server is a primary for the data it has cached/locked.
  The lock server dynamically assigns who is the primary for each file.
    No coincidence that it's the lock server that runs Paxos.
  Petal uses primary-backup internally (and Paxos, again). Many pairs.

Overall architectural diagram

What does Petal do?

What happens if a client e.g. creates a file?
  What steps does the server go through?
  Acquire the lock, append to the *local* log, update local meta-data,
    release the lock locally, reply to the client. (First sketch below.)

What if a client on a different server reads that file?
  S1 gets the REVOKE:
    it writes its log to Petal, writes the meta-data to Petal,
    and RELEASEs the lock.
  Why must it write the log entry to Petal before writing the meta-data?
  Why must it write the meta-data to Petal before releasing the lock?
  Why must it write the entire prefix of the log before releasing the lock?
    Not just the relevant log entry.

What if two clients try to create the same file at the same time?

The locks are doing two things:
  Atomic multi-write transactions.
  Serializing updates to meta-data (cache consistency).

What if a server dies while it is not holding any locks?
  Can the other servers totally ignore the failure?

What if a server dies while holding locks?
  Can we just ignore it until it comes back up and recovers itself?
  Can we just revoke its locks and continue?
  What does Frangipani do to recover?

What's in a log record?
  Why not entire new block images?

S1 creates f2, then crashes while holding the lock. How does replay work
  if S1 crashed before flushing anything?
  mid-way through flushing the log?
  mid-way through flushing the data?
  just after all flushing, but before releasing the lock?
  just after releasing the lock?

Suppose S1 deletes f1, flushes its block+log, and releases the lock.
Then S2 acquires the lock and creates a new f1. Then S1 crashes.
  Will recovery re-play the delete?
  Does recovery need to get the lock from S2?
    It can't; S2 might also have crashed.
    And it doesn't need to.
  The details depend on whether S2 has written the block yet.

Why not ask the lock server what locks the dead server held,
rather than using version numbers? (Second sketch below.)
  Maybe:
    S1: create d/f1, create d/f2, crash
    S2: del d/f1
  S1 holds the "d" lock, but recovery should not replay create d/f1!
  Or the lock server might have crashed.

What if two servers crash at about the same time,
and they both recently modified the same file?
  Do we have to replay both ops?
  If we don't, do we risk missing an operation?
  Does it matter in which order we replay their log records for that file?
  What if both recovery demons read the old version #, and both write,
    so the lower # might win?

Suppose S1 creates f1, creates f2, then crashes.
  What combinations of f1 and f2 are allowed after recovery?
  Is that a first-order goal, or an artifact of the design?

What if a server runs out of log space?
  What if it hasn't yet flushed the corresponding blocks to Petal?

How does Frangipani find the start/end of the log?
  Could there be ambiguity?

What happens if the network partitions? (Third sketch below.)
  Could more than one file server perform updates?
  Could a file server use stale cached data?
  What if the partition heals just before the lease expires?
  Could the file server and lock server disagree about who holds the lock?
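First sketch: a minimal Go illustration of the create path and the
REVOKE handler described above. Every name here (Srv, LockSrv, Petal,
the method signatures) is made up for illustration; this shows the
ordering rules, not the paper's actual code.

  package frangipani

  // LogRec is one logged meta-data update. Note that it carries just
  // the new meta-data plus a version number, not a whole block image.
  type LogRec struct {
          Seq     uint64 // position in this server's private log
          Block   uint64 // Petal block the update applies to
          Version uint64 // version number this update installs
          Data    []byte // the new meta-data itself
  }

  // LockSrv and Petal stand in for the RPC clients.
  type LockSrv interface {
          Acquire(name string)      // may block while another server flushes
          ReleaseLocal(name string) // keep the lock cached, mark it idle
          Release(name string)      // actually return it to the lock server
  }

  type Petal interface {
          Write(block uint64, data []byte)
  }

  type Srv struct {
          locks   LockSrv
          petal   Petal
          logBase uint64            // Petal region holding this server's log
          log     []LogRec          // in-memory tail of the log
          dirty   map[uint64][]byte // dirty cached meta-data blocks
          version uint64            // monotonic version counter
  }

  // Create: everything is local except (possibly) fetching the lock;
  // nothing reaches Petal before the reply to the client.
  func (s *Srv) Create(dir string, dirBlock uint64, newMeta []byte) {
          s.locks.Acquire(dir) // serializes updates to dir's meta-data
          s.version++
          s.log = append(s.log, LogRec{ // append to the *local* log only
                  Seq:     uint64(len(s.log)),
                  Block:   dirBlock,
                  Version: s.version,
                  Data:    newMeta,
          })
          s.dirty[dirBlock] = newMeta // update cached meta-data
          s.locks.ReleaseLocal(dir)   // lock stays cached at this server
          // ... reply to the client here
  }

  // Revoke: the lock server wants a lock back for another server.
  // The ordering is the whole point: log, then data, then release.
  func (s *Srv) Revoke(lock string, blocks []uint64) {
          for _, r := range s.log { // 1. the entire log prefix (write-ahead)
                  s.petal.Write(s.logBase+r.Seq, encode(r))
          }
          for _, b := range blocks { // 2. then the meta-data it describes
                  s.petal.Write(b, s.dirty[b])
                  delete(s.dirty, b)
          }
          s.locks.Release(lock) // 3. only now give the lock back
  }

  func encode(r LogRec) []byte { return r.Data } // placeholder serializer

The ordering in Revoke is what makes the "why must it write X before Y"
questions hang together: any server that later sees the lock free can
trust that both the log and the blocks it describes are already on Petal.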
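Second sketch, continuing the made-up types above: the recovery demon's
replay rule. The version-number idea is from the paper; the interface
and names are hypothetical.

  // PetalR adds a version probe to the sketch's Petal interface.
  type PetalR interface {
          Petal
          ReadVersion(block uint64) uint64
  }

  // replayOne re-applies one record from a dead server's log, but only
  // if no newer update has already reached the block. This is how
  // recovery gets the right answer without asking the (possibly dead)
  // lock server who held what.
  func replayOne(p PetalR, r LogRec) {
          if p.ReadVersion(r.Block) >= r.Version {
                  return // superseded, e.g. by S2's later create of f1
          }
          p.Write(r.Block, r.Data) // the logged update is still newest
  }

In the d/f1 example above, S2's delete installed a higher version number
in d's meta-data block, so replaying S1's log skips create d/f1.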
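Third sketch: the lease rule behind the partition questions. Locks are
really leases, and the paper has a server check, before each Petal
write, that its lease will still be valid (with some margin) when the
write lands. The names and the margin constant here are assumptions.

  package lease

  import "time"

  type Lease struct {
          Expiry time.Time // when the lock server considers it expired
  }

  const margin = time.Second // assumed slack for the Petal write itself

  // SafeToWrite: only issue a Petal write if the lease will still be
  // valid when the write completes; otherwise, after a partition, the
  // lock server may already have given the lock to another server.
  func (l Lease) SafeToWrite(now time.Time) bool {
          return now.Add(margin).Before(l.Expiry)
  }

A write still in flight when the lease expires remains a hazard, which
is why the heal-just-before-expiry question is interesting.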
What if a lock server crashes?

What if there's a power failure that affects all servers?

Why does their lock service use Paxos?
  What do they need to agree about?

For what workloads is Frangipani likely to have poor performance?
  Could its logs be a bottleneck?
  Could the lock server be a bottleneck?

Table 2: why are creates relatively slow, but deletes fast?
  Does NVRAM help? Why/when would we expect NVRAM to help?

Why is figure 5 flat?
  Why doesn't more load -> longer run times?