Reliable File systems

Required reading: soft updates.

Overview

A key problem in designing file systems is how to obtain performance on file system operations while providing consistency and stability of file systems. With consistency, we mean, that file system invariants are maintained is on disk. These invariants include that if a file is created, it appears in its directory, etc.

A block is stable if a block is stored on the disk persistently. For example, an application may write a block of data, but the file system may not write it immediately to disk to achieve better performance by, for example, delying this write to combine it with future writes. Thus, even though the application has performed the write, it may take a while for the block is actual stable.

To gain performance many file system designers are willing to have a window of vunerability (e.g., 30s in v6) in which changes are not immediately written to disk. So, if the file system fails in that window, some recently updated data my be lost.

The performance bottleneck in a file system is the disk. The bandwidth to a disk is reasonable high for large transfer (around 50Mbyte/s), but latency is low, because of the cost of moving the disk arm(s) (the seek latency is about 10msec).

Thus, in general, we want to:

Avoid disk operations. For read operations, cache data. For write operations, delay write operations, and absorb writes.
Avoid seeks. Give many operations to the disk at once so that it can do good disk scheduling. This reduces latency and improves disk throughput.
Batch small writes into a single operations. This means we want to be smart about disk layout.

To improve latency, we count on caching, read-ahead, and writing asynchronously. To improve throughput we count on batching, good disk scheduling, and good disk layout.

Because we are interested in the performance versus (consistency and stability) trade-off, we will focus on write operation (i.e., operations that modify data that needs to be written back at some point).

The naive solution is for the file system to delay writes as long as possible, because that results in big writes, high absorption, and many dirty blocks that can be written at the same time, leading to good disk scheduling. For example, the file systems waits until it all blocks in memory are dirty and it needs a new clean block. Then, it selects some of the oldest blocks and writes them to the disk.

This strategy has two potential problems: (1) stability guarantees are weak; if the power fails before the system flushes a block, the modifications are lost; (2) loss of consistency; if the oldest block, for example, contains an inode that points to dirty data block, then after the write the disk will contain an inode pointing to old data blocks. When the file system recovers the inode may be inconsistent with what is on disk (wrong data blocks, wrong size, etc.).

The solution to the first problem is pushed to the applications in UNIX. If an application desires that a file is stable, it must call fsync, or sync (to flush all dirty blocks). To keep of the window of vunerability small, the file system also flushes ditry blocks every 30 seconds.

The second problem can be solved by ensuring that the file system respects dependencies between operations:

Never point to a structure before it is initialized. An inode must be initialized before a directory entry references it. An block must be initialized before an inode references it.
Never reuse a structure before nullifying all pointers to it. An inode pointer to a disk block must be reset before the file system can reallocate the disk block.
Never reset the last point to a live structure before a new pointer is set. When renaming a file, the file system should not remove the old name for an inode until after the new name has been written.

The paper calls these dependencies update dependencies.

Does v6 guarantee them? (Answer: No).

This paper

The paper surveys some of the existing techniques and introduces a new to achieve the goal of performance and integrity for write operations in the context of a UNIX file system.

Techniques possible:

Synchronous writes. v6 used this solution partially. FFS uses this solution for all its metadata (inodes, directory blocks, etc.) updates.
Equip system with NVRAM, and put buffer cache in NVRAM.
Atomic updates using logging. Often used in UNIX file systems for metadata updates. LFS is an extreme version of this strategy.
Scheduler-enforced ordering. The disk driver is aware of the dependencies, but doesn't work well with delayed writes (ok with asynchronous write).
Flusher-enforced ordering. All writes are delayed. This flusher is aware of dependencies between blocks, but doesn't work because circular dependencies need to be broken by writing blocks out.

Soft updates is the solution explored in this paper. It doesn't require NVRAM, is simpler than complete logging (e.g., no changes on disk datastructures), and performs as well as the naive strategy of keep all dirty block in main memory.

Soft updates is a sophisticated variant of flusher-enforced ordering. Instead of maintaining dependencies on the block-level, it maintains dependencies on file structure level (per inode, per directory, etc.), reducing circular dependencies. Furthermore, it breaks any remaining circular dependencies by undo changes before writing the block and then redoing them to the block after writing.

Pseudocode for create:

create (f) {
   allocate inode in block i  (assuming inode is available)
   add i to directory data block d  (assuming d has space)
   mark d has dependent on i, and create undo/redo record
   update directory inode in block di
   mark di has dependent on d
}

Pseudocode for the flusher:

flushblock (b)
{
  lock b;
  for all dependencies that b is relying on
    "remove" that dependency by undoing the change to b
    mark the dependency as "unrolled"
  write b 
}

write_completed (b) {
  remove dependencies that depend on b
  reapply "unrolled" dependencies that b depended on
  unlock b
}

Apply flush algorithm to example:

A list of two dependencies: directory->inode, inode->directory.
Lets say syncer picks directory first
Undo directory->inode changes (i.e., unroll )
Write directory block
Remove met dependencies (i.e., remove inode->directory dependency)
Perform redo operation (i.e., redo )
Select inode block and write it
Remove met dependencies (i.e., remove directory->inode dependency)
Select directory block (it is dirty again!)
Write it.

An file operation that is important for file-system consistency is rename. Rename conceptually works as follows:

rename (from, to)
   unlink (to);
   link (from, to);
   unlink (from);

Rename it often used by programs to make a new version of a file the current version. Committing to a new version must happen atomically. Unfortunately, without a transaction-like support atomicity is impossible to guarantee, so a typical file systems provides weaker semantics for rename: if to already exists, an instance of to will always exist, even if the system should crash in the middle of the operation. Does the above implementation of rename guarantee this semantics? (Answer: no).

If rename is implemented as unlink, link, unlink, then it is difficult to guarantee even the weak semantics. Modern UNIXes provide rename as a file system call:

   update dir block for to point to from's inode // write block
   update dir block for from to free entry // write block

fsck may need to correct refcounts in the inode if the file system fails during rename. for example, a crash after the first write followed by fsck should set refcount to 2, since both from and to are pointing at the inode.

This semantics is sufficient, however, for an application to ensure atomicity. Before the call, there is a from and perhaps a to. If the call is successful, following the call there is only a to. If there is a crash, there may be both a from and a to, in which case the caller knows the previous attempt failed, and must retry. The subtlety is that if you now follow the two links, the "to" name may link to either the old file or the new file. If it links to the new file, that means that there was a crash and you just detected that the rename operation was composite. On the other hand, the retry procedure can be the same for either case (do the rename again), so it isn't necessary to discover how it failed. The function follows the golden rule of recoverability, and it is idempotent, so it lays all the needed groundwork for use as part of a true atomic action.

With soft updates renames becomes:

rename (from, to) {
   i = namei(from);
   add "to" directory data block td a reference to inode i
   mark td dependent on block i
   update directory inode "to" tdi
   mark tdi as dependent on td
   remove "from" directory data block fd a reference to inode i
   mark fd as dependent on tdi
   update directory inode in block fdi
   mark fdi as dependent on fd
}

No synchronous writes!

What needs to be done on recovery? (Inspect every statement in rename and see what inconsistencies could exist on the disk; e.g., refcnt inode could be too high.) None of these inconsitencies require fixing before the file system can operate; they can be fixed by a background file system repairer.

Paper discussion

Do soft updates perform any useless writes? (A useless write is a write that will be immediately overwritten.) (Answer: yes.) Fix syncer to becareful with what block to start. Fix cache replacement to selecting LRU block with no pendending dependencies.

Does FFS provide better stability than soft updates? (Answer: yes. FFS writes metadata updates synchronously. Before create() returns the file has been created on disk!

Can a log-structured file system implement rename better? (Answer: yes, since it can get the refcnts right).

Discuss all graphs.