High-performance File Systems

Required reading: soft updates.

Overview

A key problem in designing file systems is how to obtain performance on file system operations while providing consistency. With consistency, we mean, that file system invariants are maintained is on disk. These invariants include that if a file is created, it appears in its directory, etc. If the file system data structures are consistent, then it is possible to rebuild the file system to a correct state after a failure.

To ensure consistency of on-disk file system data structures, modifications to the file system must respect certain rules:

The paper calls these dependencies update dependencies.

xv6 ensures these rules by writing every block synchronously, and by ordering the writes appropriately. With synchronous, we mean that a process waits until the current disk write has been completed before continuing with execution.

xv6 is a nice example of the tension between consistency and performance. To get consistency, xv6 uses synchronous writes, but these writes are slow, because they perform at the rate of a seek instead of the rate of the maximum data transfer rate. The bandwidth to a disk is reasonable high for large transfer (around 50Mbyte/s), but latency is low, because of the cost of moving the disk arm(s) (the seek latency is about 10msec).

This tension is an implementation-dependent one. The Unix API doesn't require that writes are synchronous. Updates don't have to appear on disk until a sync, fsync, or open with O_SYNC. Thus, in principle, the UNIX API allows delayed writes, which are good for performance:

Thus the question: how to delay writes and achieve consistency? The paper provides an answer.

This paper

The paper surveys some of the existing techniques and introduces a new to achieve the goal of performance and consistency.

Techniques possible:

Soft updates is the solution explored in this paper. It doesn't require NVRAM, and performs as well as the naive strategy of keep all dirty block in main memory. Compared to logging, it is unclear if soft updates is better. The default BSD file systems uses soft updates, but most Linux file systems use logging.

Soft updates is a sophisticated variant of flusher-enforced ordering. Instead of maintaining dependencies on the block-level, it maintains dependencies on file structure level (per inode, per directory, etc.), reducing circular dependencies. Furthermore, it breaks any remaining circular dependencies by undo changes before writing the block and then redoing them to the block after writing.

Pseudocode for create:

create (f) {
   allocate inode in block i  (assuming inode is available)
   add i to directory data block d  (assuming d has space)
   mark d has dependent on i, and create undo/redo record
   update directory inode in block di
   mark di has dependent on d
}

Pseudocode for the flusher:

flushblock (b)
{
  lock b;
  for all dependencies that b is relying on
    "remove" that dependency by undoing the change to b
    mark the dependency as "unrolled"
  write b 
}

write_completed (b) {
  remove dependencies that depend on b
  reapply "unrolled" dependencies that b depended on
  unlock b
}

Apply flush algorithm to example:

An file operation that is important for file-system consistency is rename. Rename conceptually works as follows:

rename (from, to)
   unlink (to);
   link (from, to);
   unlink (from);

Rename it often used by programs to make a new version of a file the current version. Committing to a new version must happen atomically. Unfortunately, without a transaction-like support atomicity is impossible to guarantee, so a typical file systems provides weaker semantics for rename: if to already exists, an instance of to will always exist, even if the system should crash in the middle of the operation. Does the above implementation of rename guarantee this semantics? (Answer: no).

If rename is implemented as unlink, link, unlink, then it is difficult to guarantee even the weak semantics. Modern UNIXes provide rename as a file system call:

   update dir block for to point to from's inode // write block
   update dir block for from to free entry // write block

fsck may need to correct refcounts in the inode if the file system fails during rename. for example, a crash after the first write followed by fsck should set refcount to 2, since both from and to are pointing at the inode.

This semantics is sufficient, however, for an application to ensure atomicity. Before the call, there is a from and perhaps a to. If the call is successful, following the call there is only a to. If there is a crash, there may be both a from and a to, in which case the caller knows the previous attempt failed, and must retry. The subtlety is that if you now follow the two links, the "to" name may link to either the old file or the new file. If it links to the new file, that means that there was a crash and you just detected that the rename operation was composite. On the other hand, the retry procedure can be the same for either case (do the rename again), so it isn't necessary to discover how it failed. The function follows the golden rule of recoverability, and it is idempotent, so it lays all the needed groundwork for use as part of a true atomic action.

With soft updates renames becomes:

rename (from, to) {
   i = namei(from);
   add "to" directory data block td a reference to inode i
   mark td dependent on block i
   update directory inode "to" tdi
   mark tdi as dependent on td
   remove "from" directory data block fd a reference to inode i
   mark fd as dependent on tdi
   update directory inode in block fdi
   mark fdi as dependent on fd
}

No synchronous writes!

What needs to be done on recovery? (Inspect every statement in rename and see what inconsistencies could exist on the disk; e.g., refcnt inode could be too high.) None of these inconsitencies require fixing before the file system can operate; they can be fixed by a background file system repairer.

Paper discussion

Do soft updates perform any useless writes? (A useless write is a write that will be immediately overwritten.) (Answer: yes.) Fix syncer to becareful with what block to start. Fix cache replacement to selecting LRU block with no pendending dependencies.

Can a log-structured file system implement rename better? (Answer: yes, since it can get the refcnts right).

Discuss all graphs.