FAQ for Frangipani, Thekkath, Mann, Lee, SOSP 1997

Q: Why are we reading this paper?

A: Primarily as an illustration of cache coherence. But there are other interesting aspects. It's impressive that recovery is possible if a Frangipani workstation crashes while in the middle of updating the shared file system; we usually think of a thread crashing while holding mutexes as being unrecoverable. The idea of each Frangipani workstation having its own log, stored in a public place so that anyone can recover from it, is clever. Further, the logs are intertwined in an unusual way: the updates to a given object may be spread over many logs. This makes replaying a single log tricky (hence Frangipani's version numbers). Building a system out of simple shared storage (Petal) and smart but decentralized participants is interesting.

Q: In what situations would Frangipani be useful?

A: It was designed for organizations in which a modest number of people sit in front of workstations at their desks and want to collaborate through shared files. The authors worked for such an organization. At the time, there were competing systems like NFS and AFS (MIT Athena workstations use AFS). Frangipani aimed at better scalability by moving complexity from the central server to the workstation client.

Frangipani and GFS are aimed at different kinds of uses. Frangipani appears as an ordinary file system on a workstation, and you can use it with commands like ls, cat, vi, cc, &c. Frangipani is intended to store human users' home directories and files that humans share, such as word-processing documents and program source files. GFS, in contrast, can only be used via special library calls (it doesn't look like an ordinary file system), and it makes the most sense if you are reading and writing vast files with special-purpose software such as MapReduce.

Q: Why not use an ordinary filesystem like Linux's ext4 on top of Petal, since Petal looks a lot like a disk? Wouldn't this work as well as Frangipani?

A: ext4 allows only a single computer to use a file system. If you store an ext4 file system in something like Petal (or modern equivalents like Amazon EBS) and mount it on multiple computers, it will not work correctly, because the different computers' ext4 instances won't know about each other. For example, if two computers create files at the same time, they will likely allocate the same i-node and write over each other's information.

You can think of Frangipani as solving the problems that would come up if multiple computers mounted the same file system. One problem has to do with the computers caching blocks: if computer C1 reads and caches a file, and then computer C2 modifies the file, how to ensure that C1 knows not to use its stale cached copy? Another challenge is to ensure that modifications by one computer are revealed atomically to other computers, even though operations such as creating and deleting files involve modifying multiple blocks in Petal; we don't want C1 seeing C2's half-completed update. A final problem is crash recovery: if one computer crashes while in the middle of updating a file or directory, leaving the data in Petal only partially updated, how to restore correctness to the file system in Petal?

Q: What are the clients and servers in this system?

A: Petal servers store block data. Each user's workstation runs an instance of Frangipani. Frangipani acts as a client of Petal, but also acts as a file server for applications running on the same workstation. So it's not clear whether one should use "client" or "server" to refer to a Frangipani instance.
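To make the layering concrete, here is a minimal sketch in Go of the two roles a Frangipani instance plays: it is a client of Petal's block-level interface, and it is a file-level server to applications on the same workstation. The interface names and method signatures are invented for illustration; neither paper specifies APIs at this level of detail.

    // Sketch of the layering: Frangipani consumes Petal's block-level
    // interface and exposes a file-level interface to local applications.
    // These interface definitions are invented, not taken from the papers.
    package main

    // Petal looks like a single huge, replicated virtual disk.
    type BlockStore interface {
        ReadBlock(blockNum uint64) ([]byte, error)
        WriteBlock(blockNum uint64, data []byte) error
    }

    // Frangipani presents an ordinary file system to applications on the
    // same workstation, implemented on top of a BlockStore.
    type FileSystem interface {
        Create(path string) error
        Read(path string, offset int64, n int) ([]byte, error)
        Write(path string, offset int64, data []byte) error
    }

    func main() {}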
Q: Section 5 says that a Frangipani workstation must invalidate cached data when it releases a read lock, and write modified cached data back to Petal when it releases a write lock. Doesn't that severely limit how much Frangipani benefits from caching?

A: After a Frangipani workstation acquires a lock, it will hold onto the lock until some other workstation tries to acquire it. That is, if a user creates a file in a directory (thus acquiring a write lock for that directory, as well as caching the directory's information), and then pauses a bit, and then creates a second file in the same directory, the user's workstation will likely *not* have to acquire the directory lock twice. And as a consequence, the workstation will read the directory information from Petal just once, and modify that information only in its local cached copy.

If some other workstation needs to use that directory, then that other workstation will send an acquire message to the lock server, which will send a revoke message to the first workstation, which will write the directory's modified information from its cache to Petal (and then delete the data from its cache), and then tell the lock service it is giving up the lock. So, as long as there isn't much read/write sharing, Frangipani does a pretty good job of caching locks and data.
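Here is a minimal sketch, in Go, of the lock-caching behavior just described: a workstation keeps the lock and the cached data until the lock server revokes the lock on behalf of another workstation. The type and function names (cachedLock, handleRevoke, the callback parameters) are invented; the real Frangipani lock protocol is richer (it also handles read locks, lock batching, and crashes).

    // Sketch (not Frangipani's real protocol) of a workstation caching a
    // write lock and the data it protects: both are kept until the lock
    // server revokes the lock because another workstation wants it.
    package main

    type cachedLock struct {
        held  bool
        dirty bool   // cached data modified but not yet written back to Petal
        data  []byte // cached copy of the protected meta-data or file data
    }

    // A local operation reuses the cached lock and data if it still has them.
    func (l *cachedLock) localWrite(newData []byte, acquire func(), readFromPetal func() []byte) {
        if !l.held {
            acquire()                // contact the lock server only on first use
            l.data = readFromPetal() // read the protected data from Petal once
            l.held = true
        }
        l.data = newData // modify only the local cached copy
        l.dirty = true
    }

    // handleRevoke runs when the lock server asks for the lock back.
    func (l *cachedLock) handleRevoke(writeToPetal func([]byte), release func()) {
        if l.dirty {
            writeToPetal(l.data) // write modified data back to Petal first
        }
        l.data = nil // invalidate the cached copy
        l.dirty = false
        l.held = false
        release() // then tell the lock server the lock is given up
    }

    func main() {}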
Q: Why does Petal have a block interface? Why didn't Petal provide a file-system interface that knows about things like directories and files?

A: One reason is that the authors developed Petal first. Petal already solved many fault tolerance and scaling problems, so using it simplified some aspects of the Frangipani design. And this arrangement moves work from centralized servers to workstations, which helps Frangipani performance scale well as more workstations are added. However, the Petal/Frangipani split makes enforcing invariants on file-system-level structures harder, since no one entity is in charge. Frangipani builds its own transaction system (using the lock service and Frangipani's logs) in order to be able to make complex atomic updates to the file system stored in Petal.

Q: Can a workstation running Frangipani break security?

A: Yes. Since the file system logic is in users' workstations, the design trusts the workstations. A user could modify the local Frangipani software and read/write other users' data in Petal. This makes Frangipani unattractive if users are not trusted. It might still make sense in a small organization, or if Frangipani ran on separate dedicated servers (not on workstations) and talked to user workstations with a protocol like NFS.

Q: How does Frangipani differ from GFS?

A: A big architectural difference is that GFS has most of the file system logic in the servers, while Frangipani distributes the logic over the workstations that run Frangipani. That is, Frangipani doesn't really have a notion of a file server in the way that GFS does. Frangipani puts the file system logic in the workstations in order to allow them to perform file system operations purely in their local caches. This makes sense when most activity is workstations reading and writing a single user's (cached) files.

Frangipani has a lot of mechanism to ensure that workstation caches stay coherent, both so that a write on one workstation is immediately visible to a read on another workstation, and so that complex operations (like creating a file) are atomic even if other workstations are trying to look at the file or directory involved. This last situation is tricky for Frangipani because there's no designated file server that executes all operations on a given file or directory.

In contrast, GFS doesn't have caches at all, since its focus is sequential reads and writes of giant files that are too big to fit in any cache. It gets high performance for reads of giant files by striping each file over many GFS servers. Because GFS has no caches, GFS does not have a cache coherence protocol. Because the file system logic is in the servers, GFS clients are relatively simple; only the servers need to have locking and worry about crash recovery.

Frangipani appears as a real file system that you can use with any existing workstation program. GFS doesn't present as a file system in that sense; applications have to be written explicitly to use GFS via library calls.

Q: What is Digital Equipment Corporation (DEC)?

A: It's the company at which the authors worked. DEC sold computers and systems software. Unix was originally developed on DEC hardware (though at Bell Labs, not at Digital).

Q: What does the comment "File creation takes longer..." in Section 9.2 mean?

A: Because of logging, all meta-data changes have to be written to Petal twice: once to the log, and once to the file system. An operation must be written to the log first (thus "write-ahead log"). Then the on-disk file system structures can be written. Only after that can the operation be deleted from the log. That is, a portion of the log can only be freed after all the on-disk file system updates from the operations in that portion have been written to Petal. So when Frangipani's log fills, Frangipani has to stop processing new operations, send modified blocks from its cache to Petal for the operations in the portion of the log it wants to re-use, and only then free that portion and resume processing new operations.

You might wonder why increasing the log size improved performance. After all, Frangipani has to perform the updates to the file system in Petal no matter how long the log is, and those updates seem to be the limiting performance factor. Here's a guess. It's probably the case that the benchmark involves updating the same directory many times, to add new files to that directory. That directory's blocks will have to be written to Petal every time part of the log is freed (assuming every creation modifies the one directory). So letting the log grow longer reduces the total number of times that directory has to be written to Petal during the benchmark, which decreases the benchmark's running time. This effect is called "write absorption."
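Here is a minimal write-ahead-logging sketch in Go of the ordering rule described above (log record first, then the file-system blocks, and only then reuse of the log space), including the write-absorption effect. The types and function names are invented for illustration; this is not Frangipani's actual log format.

    // Minimal write-ahead-log sketch: log first, then the on-disk
    // structures, then free the log space. Not Frangipani's real log.
    package main

    type logRecord struct {
        seq    uint64
        blocks map[uint64][]byte // meta-data block updates described by this op
    }

    type wal struct {
        records []logRecord
        next    uint64
    }

    // An operation appends its log record before any file-system block is
    // written ("write-ahead").
    func (w *wal) append(blocks map[uint64][]byte, writeLogToPetal func(logRecord)) uint64 {
        r := logRecord{seq: w.next, blocks: blocks}
        w.next++
        w.records = append(w.records, r)
        writeLogToPetal(r) // the log entry must reach Petal before the blocks do
        return r.seq
    }

    // To reuse log space, first write the described blocks to Petal, then
    // drop the records. If the same block (e.g. a hot directory) appears in
    // many records, it is written only once here: "write absorption".
    func (w *wal) checkpoint(writeBlockToPetal func(uint64, []byte)) {
        latest := map[uint64][]byte{}
        for _, r := range w.records {
            for bn, data := range r.blocks {
                latest[bn] = data // later records overwrite earlier ones
            }
        }
        for bn, data := range latest {
            writeBlockToPetal(bn, data)
        }
        w.records = w.records[:0] // now the log space can be reused
    }

    func main() {}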
Q: Frangipani is over 20 years old now; what's the state-of-the-art in distributed file systems?

A: While some organizations today store user and project files on distributed file systems, their importance has waned with the rise of laptops (which must be self-contained) and commercial cloud services. Still, there are many users of existing file-system protocols such as SMB, NFS, and AFS, as well as more recent distributed file systems like xtreemfs, Ceph, Lustre, and Dropbox. A huge amount of network storage is sold by companies like NetApp and EMC, but I don't know to what extent people use the storage at the disk level (via e.g. iSCSI) or as file servers (e.g. talking to a NetApp server with NFS).

The rise of web sites, big data, and cloud computing has shifted the focus of storage-system development. Web sites work well with database-like servers, including key/value stores; file servers are not a good match here. Big-data processing systems often use file systems, e.g. GFS for MapReduce, but the focus is on high parallel throughput for huge files; Frangipani's caching, coherence, and locking are not needed.

Q: The paper says Frangipani only does crash recovery for its own file system meta-data (i-nodes, directories, free bitmaps), but not for users' file content. What does that mean, and why is it OK?

A: If a user on a workstation writes some data to a file on Frangipani, and the workstation immediately crashes, then each block of the recently-written data might or might not appear after recovery. That is, if the user logs into another workstation and looks at the file, they might see the file in question but only a subset of the recent writes.

The most pressing reason for not logging file content is that it would be slow, since every byte of file content would have to be written twice to Petal: once in Frangipani's log, and then again in the file system itself. And many files, perhaps most, don't need to have their content logged; for example, if a crash occurs while one is running the compiler, and the output of the compiler isn't complete after recovery, one can easily run the compiler again. One would likely have to re-start the build anyway if interrupted by a crash.

Not logging content does make it more difficult for careful applications like databases, text editors, mail servers, &c to recover their data correctly after a crash. But such programs have widely differing notions of what correctness means and what (if anything) a file system could do to help. Even if Frangipani put file content in the log, a program doing a long sequence of write()s at the time of a crash would still likely be left with a partially-written file after recovery. Programs that care about crash recovery typically use Unix's fsync() call to control the order in which data is written to disk, and they have application-level recovery software that exploits that order. Frangipani supports fsync(), so these programs will work on Frangipani as well as they do on ordinary UNIX file systems.

Ordinary Unix file systems (e.g. Linux on a laptop) have the same property: file content written just before a crash may be lost. The reason why file systems are more careful with their own meta-data is that, if they weren't careful, the file system might be completely unusable after a crash, with no possibility of recovering anything at all. So the division of labor is that the file system takes care of the meta-data, and applications do whatever they need to (e.g. call fsync()) to ensure recoverability of their own data.
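As an example of the application-level care mentioned above, here is the common "write a temporary file, fsync it, rename it over the old file" pattern, sketched in Go. The function name and file paths are made up; the pattern works the same way on Frangipani as on an ordinary Unix file system, since Frangipani supports fsync(). (A fully careful program would also fsync the containing directory after the rename.)

    // Sketch of the write-temp-file / fsync / rename pattern that careful
    // applications use so a crash leaves either the old file or the
    // complete new file, never a half-written one.
    package main

    import "os"

    func atomicWrite(path string, data []byte) error {
        tmp := path + ".tmp"
        f, err := os.Create(tmp)
        if err != nil {
            return err
        }
        if _, err := f.Write(data); err != nil {
            f.Close()
            return err
        }
        // fsync: force the file content to stable storage before the rename.
        if err := f.Sync(); err != nil {
            f.Close()
            return err
        }
        if err := f.Close(); err != nil {
            return err
        }
        // The rename switches readers from the old content to the new content.
        return os.Rename(tmp, path)
    }

    func main() {
        _ = atomicWrite("notes.txt", []byte("hello\n"))
    }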
Q: The paper mentions that the Petal implementation includes a log. What's the difference between Frangipani's logs and Petal's log?

A: The Petal paper hardly mentions Petal's log at all. My guess is that Petal logs changes to the mapping from logical block number to physical block number, changes to the physical block free list, and changes to the "busy" bits that indicate which block updates the other replica of the block may be missing. That is, Petal logs information about low-level block operations. Frangipani logs information about file system operations, which often involve updating multiple pieces of file system state in Petal. For example, a Frangipani log entry for deleting a file may say that the delete operation modified some block and i-node free bitmap bits, and that the operation erased a particular directory entry.

Both of the logs are stored on Petal's disks. From Petal's point of view, Frangipani's logs are just data stored in Petal blocks; Petal does not know anything special about Frangipani's logs.

Q: How does Petal take efficient snapshots of the large virtual disk that it represents?

A: Petal maintains a mapping from virtual to physical block numbers. The mapping is actually indexed by a pair: the virtual block number and an epoch number. There is also a notion of the current epoch number. When Petal performs a write to a virtual block, it looks at the epoch number of the current mapping; if the epoch number is less than the current epoch number, Petal creates a new mapping (and allocates a new physical block) with the current epoch. A read for a virtual block uses the mapping with the highest epoch. Creating a snapshot then merely requires incrementing the current epoch number. Reading from a snapshot requires the epoch number of the snapshot to be specified; then each read uses the mapping for the requested virtual block number that has the highest epoch number <= the snapshot epoch. (A small code sketch of this epoch-indexed mapping appears below, after the striping question.)

Have a look at Section 2.2 of the Petal paper: http://www.scs.stanford.edu/nyu/02fa/sched/petal.pdf

Q: The paper says that Frangipani doesn't immediately send new log entries to Petal. What happens if a workstation crashes after a system call completes, but before it sends the corresponding log entry to Petal?

A: Suppose an application running on a Frangipani workstation creates a file. For a while, the information about the newly created file exists only in the cache (in RAM) of that workstation. If the workstation crashes before it writes the information to Petal, the new file is completely lost. It won't be recovered. So an application that looks like

  create file x;
  print "I created file x";

might print "I created file x", but (if it then crashes) file x might nevertheless not exist. This may seem unfortunate, but it's fairly common even for local disk file systems such as Linux's ext4. An application that needs to be sure its data is really permanently saved can call fsync().

Q: What does it mean to stripe a file? Is this similar to sharding?

A: It's similar to sharding. More specifically, it means to distribute the blocks of a single file over multiple servers (Petal servers, for this paper), so that (for example) the file's block 0 goes on server 0, block 1 goes on server 1, block 2 goes on server 0, block 3 goes on server 1, &c. One reason to do this is to get high performance for reads of a single big file, since different parts of the same file can be read from different servers/disks. Another reason is to ensure that load is balanced evenly over the servers.
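Here is a tiny sketch in Go of the round-robin striping just described; the function name and the two-server layout are only for illustration (real Petal also replicates each block).

    // Round-robin striping sketch: block i of a file is stored on server
    // i % numServers. With two servers, blocks 0,2,4,... go to server 0
    // and blocks 1,3,5,... go to server 1.
    package main

    import "fmt"

    func serverForBlock(blockNum, numServers int) int {
        return blockNum % numServers
    }

    func main() {
        for b := 0; b < 6; b++ {
            fmt.Printf("file block %d -> server %d\n", b, serverForBlock(b, 2))
        }
    }

And, as promised above, a sketch of Petal's epoch-indexed block mapping for snapshots, following the description in the snapshot question (and Section 2.2 of the Petal paper). It is deliberately simplified: whole-block writes, one copy of each block, in-memory maps, and invented names.

    // Sketch of an epoch-indexed block mapping: writing in a newer epoch
    // creates a new mapping so older epochs' data survives for snapshots.
    package main

    import "fmt"

    type mapping struct {
        epoch int
        data  []byte // stands in for the physical block's contents
    }

    type petal struct {
        currentEpoch int
        blocks       map[uint64][]mapping // per virtual block, one mapping per epoch written
    }

    func newPetal() *petal {
        return &petal{blocks: map[uint64][]mapping{}}
    }

    // Write: if the newest mapping is from an older epoch, create a new
    // mapping for the current epoch, so the old data remains readable.
    func (p *petal) write(vblock uint64, data []byte) {
        ms := p.blocks[vblock]
        if len(ms) == 0 || ms[len(ms)-1].epoch < p.currentEpoch {
            p.blocks[vblock] = append(ms, mapping{epoch: p.currentEpoch, data: data})
            return
        }
        ms[len(ms)-1].data = data // same epoch: update in place
    }

    // Read of the current disk: pass the current epoch. Read of a snapshot:
    // pass the snapshot's epoch; use the highest epoch <= that epoch.
    func (p *petal) read(vblock uint64, maxEpoch int) []byte {
        var best *mapping
        ms := p.blocks[vblock]
        for i := range ms {
            if ms[i].epoch <= maxEpoch {
                best = &ms[i]
            }
        }
        if best == nil {
            return nil // never written as of that epoch
        }
        return best.data
    }

    // Snapshot: just bump the current epoch; the old epoch names the snapshot.
    func (p *petal) snapshot() int {
        snap := p.currentEpoch
        p.currentEpoch++
        return snap
    }

    func main() {
        p := newPetal()
        p.write(7, []byte("v1"))
        snap := p.snapshot()
        p.write(7, []byte("v2"))
        fmt.Println(string(p.read(7, snap)), string(p.read(7, p.currentEpoch))) // "v1 v2"
    }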
Q: What is the "false sharing" problem mentioned in the paper?

A: The system reads and writes Petal at the granularity of a 512-byte block. If it stored unrelated items X and Y on the same 512-byte block, and one workstation needed to modify X while another needed to modify Y, they would have to bounce the single copy of the block back and forth between them, because there's no way to ask Petal to write less than 512 bytes at a time. It's sharing because both workstations are using the same block; it's false because they don't fundamentally need to.

Q: I don't understand the section 4 text about how Frangipani enforces the stronger condition of never replaying a completed update.

A: Here's an example of the problem. Suppose workstation WS1 creates file xxx, then deletes it. After that, a different workstation WS2 creates file xxx. Their logs will look like this:

  WS1: create(xxx) delete(xxx)
  WS2: create(xxx)

At this point, correctness requires that xxx exist (since WS2's create came after WS1's delete). Now suppose WS1 crashes, and Frangipani recovery replays WS1's log. Recovery will see the delete(xxx) operation in the log. We know that it would be incorrect to actually do the delete, because we know that WS2's create(xxx) came after WS1's delete(xxx). But how does Frangipani's recovery software conclude that it should ignore the delete(xxx) when replaying WS1's log?

Section 4 is saying that Frangipani's recovery software knows to ignore the delete(xxx) because it "never replays a log record describing an update that has already been completed." It achieves this property by keeping a version number in every block of meta-data, and in every log entry; if the log entry's version number is <= the meta-data block's version number, Frangipani recovery knows that the meta-data block has already been updated by the log entry (or by a later operation), and thus that the log entry should be ignored. (A small code sketch of this version check appears at the end of this FAQ.)

Q: How much performance improvement could Frangipani get from having a separate lock for each block in each file?

A: It would probably be bad for performance. It would presumably mean that reading a file would require acquiring a lock for each block -- thus much more work for the workstation and the lock server. And there might be no upside in practice, because there are few applications that do read/write sharing at the level of file blocks, and thus could benefit from different workstations being able to modify different blocks of the same file at the same time. It's much more common to see sharing at the level of whole files -- as when your editor writes out the whole file, and then your compiler reads the whole file.
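Finally, as promised in the Section 4 question above, here is a minimal sketch in Go of log replay guarded by version numbers: an entry is applied only if its version is greater than the version stored in the meta-data block it describes. The data structures are invented; real Frangipani log entries describe low-level meta-data updates rather than whole-block strings.

    // Sketch of log replay with per-block version numbers: a log entry is
    // applied only if its version is greater than the version already in
    // the meta-data block, so updates that have already been completed
    // (possibly via another workstation's later operations) are never
    // replayed.
    package main

    import "fmt"

    type metaBlock struct {
        version uint64
        data    string
    }

    type logEntry struct {
        blockNum uint64
        version  uint64 // version the block will have after this update
        newData  string
    }

    // replay applies the entries of one crashed workstation's log.
    func replay(log []logEntry, blocks map[uint64]*metaBlock) {
        for _, e := range log {
            b := blocks[e.blockNum]
            if b == nil {
                continue // sketch only: block not modeled
            }
            if e.version <= b.version {
                // The block already reflects this update or a later one;
                // replaying it could undo newer work, so skip it.
                continue
            }
            b.data = e.newData
            b.version = e.version
        }
    }

    func main() {
        // Block 5 was last written at version 12 (say, by WS2's create(xxx)).
        blocks := map[uint64]*metaBlock{5: {version: 12, data: "xxx exists"}}
        // WS1's log contains an older delete(xxx) with version 11: it is skipped.
        replay([]logEntry{{blockNum: 5, version: 11, newData: "xxx deleted"}}, blocks)
        fmt.Println(blocks[5].data) // still "xxx exists"
    }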