6.824 2001 Lecture 11: File system layout on disk

- Why do we care?
  Critical for performance and crash recovery of individual machines.
  Interacts with distributed protocols, for both reasons.
  Some of the fault-handling techniques will re-appear in distributed systems.

- Review of file representation
  - inodes
  - block numbers
  - directories: files with a particular format
  - block free list
  - i-node free list

- Simple file system:
  - assume a simple disk layout: inodes at the beginning, the rest data
  - write-through cache

- Reliability: ordering of writes
  - How would create("f") in directory "d" be implemented?
    (Ignoring reads, since they're probably cached.)
    1. Get a new block from the block free list, update the free list on disk
    2. Write the block full of data to disk
    3. Get a new i-node from the i-node free list, update the free list on disk
    4. Write the new i-node to disk
    5. Write an entry for "f" to "d"'s directory contents
    6. Write "d"'s i-node, since "d"'s length has grown
  - The order of these six writes matters.
    Ex. if the writes are reordered and the file system crashes after
    performing steps 2, 3, 4, 5, and 6 but before 1, the on-disk free list
    still marks "f"'s data block as free, so we will have an incorrect
    file system after recovery.
  - What happens if crashes intervene at various points?

- Now let's do the same thing for deletion of file "f" from dir "d"
  1. Write "d"'s contents to eliminate the reference to "f"
  2. Write "f"'s i-node to clear it
  3. Write the i-node free list, to include "f"'s old i-node

- Performance
  How many files can we create per second?
  How fast can we write content to files?

- Suppose we *never* wrote to disk. I.e., write only to an in-memory cache,
  and write the blocks to disk later. This is what Linux does.
  How much faster can we create files?
  What can go wrong if a crash occurs?
  Worst case may be delete+create -> wrong file.

- Goal: achieve high performance *and* reliability
  Design approach:
  - Reads can be made fast by exploiting a big main-memory cache
    Result: read ops can happen at the speed of main memory
  - Writes can be made fast by performing them in the main-memory cache
    and asynchronously updating the disk
    Result:
    - write ops can happen at the speed of main memory (until main memory
      fills up with dirty blocks)
    - a queue of dirty blocks waiting to be written also lets us achieve
      high disk bandwidth (by handing them to the disk together so the disk
      can schedule them efficiently---e.g., using an elevator algorithm)
  - LOST RELIABILITY
    - applications expect that the completion of some disk operation
      (e.g., close) means that the data is on the disk and recoverable
    - writes performed out of order could make data unrecoverable

- Approaches to making disk writes asynchronous:
  1. Forget reliability (Linux)
  2. Write metadata (inodes etc.) synchronously and data blocks
     asynchronously (VMS, DOS, FFS, etc.)
  3. Use NVRAM (disadv: cannot recover by just removing the disk from a
     broken machine)
  4. Atomic updates (group a set of dependent writes into an atomic
     operation) with write-ahead logging (journaling FS, LFS, XFS, etc.)
     - each change to metadata first writes an entry to the log,
       asynchronously
     - commit ops wait until the writes to the log are stable
  5. Scheduler-enforced ordering (pass dependencies to the disk driver)
  6. The cache write-back code enforces inter-buffer dependencies
     Problem: many circular dependencies.  E.g., between an inode block
     and a directory block, after creating a file F and removing a file B
     in the same directory, where F's and B's i-nodes are in the same
     inode block:

         inode block I    dir block D
         inode 4          -
         inode 5          B

     if we add 4 (entry F) to D, then D is dependent on I
     if we delete 5 (entry B) from D, then I is dependent on D
  7. Soft updates
     - Fine-grained in-memory log of updates, used to break the
       dependencies from 6.
     - After the in-memory updates to I and D that add 4 (entry F) and
       delete 5 (entry B):

           inode block I    dir block D
           inode 4          F
           inode 5          -

     - Writing I and D:
           undo adding 4 (in main memory)
           write D
           redo adding 4 (in main memory)
           write I (the block containing inode 4)
           write D
     - Adv: no on-disk log and no transaction machinery
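
- To make the write sequence above concrete, here is a minimal C sketch of
  the example.  The data structures (16-entry inode and directory blocks)
  and the disk_write() helper are made-up stand-ins for illustration, not
  the real FFS on-disk formats or the BSD soft updates implementation; the
  point is only the undo / write D / redo / write I / write D ordering.

      /* Sketch with assumed structures; disk_write() is a hypothetical
       * stand-in for a synchronous block write. */
      #include <stdio.h>
      #include <string.h>

      struct dirent     { char name[8]; int inum; };   /* inum == 0 => empty slot */
      struct dirblock   { struct dirent e[16]; };      /* in-memory copy of D */
      struct inodeblock { int allocated[16]; };        /* in-memory copy of I */

      static void disk_write(const char *what) { printf("write %s\n", what); }

      int main(void)
      {
          struct inodeblock I = {0};
          struct dirblock   D = {0};

          /* On-disk starting state: entry B -> inode 5, inode 4 free. */
          I.allocated[5] = 1;
          strcpy(D.e[0].name, "B"); D.e[0].inum = 5;

          /* In-memory updates: create F with inode 4, delete B (inode 5). */
          I.allocated[4] = 1;
          strcpy(D.e[1].name, "F"); D.e[1].inum = 4;   /* D now depends on I   */
          memset(&D.e[0], 0, sizeof D.e[0]);           /* remove B ...         */
          I.allocated[5] = 0;                          /* ... so I depends on D */

          /* Soft-updates-style ordering: undo, write D, redo, write I, write D. */
          struct dirent saved = D.e[1];                /* undo adding 4 (in memory) */
          memset(&D.e[1], 0, sizeof D.e[1]);
          disk_write("D (B gone, F not yet present)");

          D.e[1] = saved;                              /* redo adding 4 (in memory) */
          disk_write("I (inode 4 allocated, inode 5 free)");
          disk_write("D (entry F -> 4)");
          return 0;
      }

  The undo is what lets D be written immediately without waiting for I: the
  copy that goes to disk omits the new entry for F, so B's removal becomes
  durable first, and the cycle between I and D is broken.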