6.824 2001 Lecture 8: Network file system protocols

Sun NFS Goals
  Designed in the early/mid 80s.
  Before: each computer had its own private disk + file system.
    Fine for expensive central time-sharing.
    Awkward for individual workstations.
  One server, LAN full of client workstations.  Not WAN.
  Allow users to share files easily.
  Allow a user to sit at any workstation.
  Save money -- diskless workstations.
  Had to work with existing applications.
  Had to be easy to retro-fit into the UNIX O/S.
  Had to implement the same semantics as the local UNIX FFS.
  Had to be not too UNIX specific -- work w/ DOS, for example.
  Had to be fast enough to be tolerable (but willing to sacrifice some speed).

Kernel file system structure before NFS
  Specialized to the local file system, called FFS.
  Disk inodes.
  O/S keeps in-core copies of inodes that are in use.
    Referenced by file descriptors, current directories, executing programs.
  File system system calls used inodes directly.
    E.g. to find disk addresses for read().
  Disk block cache.
    Indexed by disk block #.

Why not a network disk?
  I.e. have the server supply a block store, with JUST read/write block RPCs.
  This makes read/write sharing awkward.
    Clients would have to carefully lock the disk data structures.
  Possible, but better to move complex operations to the server.

New "vnode" plan, invented to support NFS.
  Need a layer of indirection, to hide the implementation.
    A file might be FFS, NFS, or something else.
  Replace the inode with a vnode object.
    A vnode has lots of methods: open, close, read, remove, &c.
    Each file system type has its own implementation of the methods.
  What about the disk cache?
    Replaced with a per-vnode list of cached blocks.
  (See the C sketch after the file handle section below.)

NFS client/server structure.
  Client programs have file descriptors, current directory, &c.
    Inside the kernel, these refer to vnodes of type NFS.
  When a client program makes a system call:
    The NFS vnode implementation sends an RPC to the server.
    The kernel half of that program waits for the reply.
    So we can have one outstanding RPC per program.
  Server kernel has NFS threads, waiting for incoming RPCs.
    An NFS thread acts a lot like a user program making a system call.
    Finds the *vnode* in the server corresponding to the client's vnode.
    Calls that vnode's relevant method.
    Server vnodes are usually (always) of type FFS.
      This saves a lot of code in the server.
      This means NFS will work with different local file systems.
      This means files are available on the server in the ordinary way.
    The NFS server thread blocks when needed.

How should an NFS RPC indicate which file is involved?
  For a read RPC, for example.
  Could use the file name.
    The client NFS vnode would contain the name, and send it in the RPC.
    Easy to implement in the server.
  Why doesn't this work?
    It doesn't preserve UNIX file semantics:
      Client 1: chdir("dir1"); fd = open("file");
      Client 2: rename("dir1", "dir2"); rename("dir3", "dir1");
      Client 1: read(fd, buf, n);
    Does client 1 read the current dir1/file, or dir2/file?
    UNIX says dir2/file.
    Also, using names might be slow.

The File Handle
  File systems already need a way to name files / inodes.
    How do directory entries refer to files?
    Assume it exists; provide a way to encode it.
    For UNIX: the i-number (i.e. the disk address of the inode).
  Don't want to expose details to the client.
    I.e. the client should never have to make up a file reference.
    So file handles are opaque.
      The client sees them as a 32-byte blob.
    The client gets all file handles from the server.
  Every client NFS vnode contains the file's handle.
    The client sends back the same handle to the server.
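A minimal C sketch of the vnode indirection and the opaque handle.  The
names (struct vnode, vnode_ops, vfs_read, nfs_data) are illustrative,
not the real BSD/SunOS kernel definitions:

  /* Each file system type (FFS, NFS, ...) supplies its own table of
     method implementations; the rest of the kernel calls through it. */
  struct vnode;

  struct vnode_ops {
      int (*open)(struct vnode *vn, int flags);
      int (*close)(struct vnode *vn);
      int (*read)(struct vnode *vn, void *buf, unsigned n, unsigned off);
      int (*lookup)(struct vnode *dir, const char *name, struct vnode **out);
  };

  struct nfs_data {
      unsigned char handle[32];   /* opaque file handle, obtained from the server */
  };

  struct vnode {
      struct vnode_ops *ops;      /* FFS methods or NFS methods */
      void *fs_private;           /* in-core inode (FFS) or struct nfs_data (NFS) */
      /* the per-vnode list of cached blocks would hang off here */
  };

  /* The read() system call path neither knows nor cares whether the
     file is local or remote.  An NFS read method would send a READ RPC
     carrying the handle; an FFS read method would go to the disk cache. */
  int vfs_read(struct vnode *vn, void *buf, unsigned n, unsigned off)
  {
      return vn->ops->read(vn, buf, n, off);
  }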
NFS RPCs
  lookup
  read
  write
  getattr
  create
  remove, setattr, rename, readlink, link, symlink, mkdir, rmdir, readdir
  (no open, close, chdir)

Example: fd = open("./notes", 0); read(fd, buf, n);
  The client process has a reference to the current directory's vnode.
  Sends LOOKUP(directory's file handle, "notes") to the server.
  Server extracts the i-number from the file handle.
    Asks the local file system to turn it into a local vnode.
    Every local file system must support file handles...
  Calls the local vnode's lookup method.
    dir->lookup("notes") returns the "notes" vnode.
  NFS server code extracts the i-number from that vnode, creates a new file handle.
  Server returns the new file handle to the client.
  Client creates a new vnode, sets its file handle.
  Client creates a new file descriptor pointing to the new vnode.
  Client app issues read(fd, ...).
    Results in READ(file-handle, ...) being sent to the server.

Where does the client get the first file handle?
  (Every NFS RPC has to be accompanied by a valid file handle.)
  The server's mount daemon maps a file system name to its root file handle.
  The client kernel marks the mount point on the local file system as special.
    Remembers the vnode (and thus the file handle) of the remote file system.

Crash recovery.
  Suppose the server crashes and reboots.
  Clients might not even know.
  File handles held by clients must still work!
  That's why the file handle holds the i-number, which is basically a disk address.
    Rather than, say, the server NFS code creating an arbitrary map.

What if some other client deletes a file I have open?
  UNIX semantics: the file still exists until I stop using it.
    Would require the server to keep a reference count per file.
    Would require open() and close() RPCs to help maintain that count.
    Which would have to persist across server reboots.
  So NFS just does the wrong thing!
    RPCs will fail if some other client deletes a file I have open.
    This is part of the reason why there's no open() or close() RPC.

What if another client deletes my file, then creates a new one?
  And the new file happens to use the same i-node as the old one?
  Will my RPCs appear to succeed, but use the wrong file?
  No: generation numbers.
    Which are persistent -- written in the disk i-node.
    Yet another small way in which the local file system must support NFS.
  Result: a client might see a "stale file handle" error on any system call.

What about performance?
  Does *every* program system call go over the wire to the server?
  No: client caching, for better performance.
    Per-vnode block cache, name -> file handle cache, attribute cache.
    Can satisfy read()s, for example, from the block cache.

What about cache consistency?
  Is it enough to make the data cache write-through?
    No: I read a file, another client writes it, I read it again.
    How do I realize my cache is stale?
  How would we know if we got cache consistency right?
    What are the semantics of the read and write system calls?
  One possibility: read() sees the data from the most recent write().
    This is what local UNIX file systems implement.
    How to implement these strong semantics?
      Turn off client caching altogether.
      Or have clients check w/ the server before every read.
      Or have the server notify clients when other clients write.
  NFS chooses to implement weaker consistency semantics.
    If I write() and then close(), then you open() and read(), you see my data.
    Otherwise you may see stale data.
    How to implement these weaker semantics?
      The writing client must force dirty blocks to the server during close().
      The reading client must check w/ the server during open().
        Ask if the file has been modified since the data were cached.
    This is much less expensive than strong consistency.
      Though maybe not very scalable; every open() produces an RPC.
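A minimal C sketch of the close-to-open check on the client.  All names
(nfs_node, getattr_rpc, flush_dirty_blocks, ...) are hypothetical; real
client code is considerably more involved:

  #include <time.h>

  struct nfs_node {
      unsigned char handle[32];   /* opaque file handle from the server */
      time_t cached_mtime;        /* server mtime when we last cached data */
      int    have_cache;          /* do we hold cached blocks for this file? */
  };

  /* Assumed RPC stubs, not implemented here. */
  time_t getattr_rpc(const unsigned char handle[32]);   /* GETATTR: server's mtime */
  void   flush_dirty_blocks(struct nfs_node *f);        /* WRITE RPCs for dirty blocks */
  void   invalidate_cached_blocks(struct nfs_node *f);

  /* open(): ask the server whether the file changed since we cached it. */
  void nfs_open(struct nfs_node *f)
  {
      time_t mtime = getattr_rpc(f->handle);
      if (f->have_cache && mtime != f->cached_mtime)
          invalidate_cached_blocks(f);    /* stale: re-fetch from the server */
      f->cached_mtime = mtime;
      f->have_cache = 1;
  }

  /* close(): push our writes so the next opener sees them. */
  void nfs_close(struct nfs_node *f)
  {
      flush_dirty_blocks(f);
  }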
What about security?
  Server has a list of client IP addresses.
    Fully trusts any client with one of those addresses.
  Client O/S expected to enforce user IDs, and send them to the server.
  (Sketched in C at the end of these notes.)

Other issues:
  Soft vs. hard vs. intr mounts.
  Replay cache.
  I can execute files I can't read.
  What if I open(), then chmod u= (remove my own permissions)?
    Owner is always allowed to read/write...
  dump+restore may wreck client file handles.
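A rough C sketch of the trust model above (all names made up): the server
checks only the source address against its export list, then believes
whatever uid the client kernel claimed:

  #include <netinet/in.h>

  struct rpc_cred { unsigned uid, gid; };      /* as claimed by the client O/S */

  static struct in_addr exported_clients[64];  /* the server's export list */
  static int n_exported;

  static int client_allowed(struct in_addr src)
  {
      for (int i = 0; i < n_exported; i++)
          if (exported_clients[i].s_addr == src.s_addr)
              return 1;
      return 0;
  }

  /* Returns the uid to use for permission checks, or -1 to reject.
     No passwords, no cryptography: any host that can use an exported
     address gets in, and root on a trusted client can claim any uid. */
  long authenticate(struct in_addr src, const struct rpc_cred *cred)
  {
      if (!client_allowed(src))
          return -1;
      return (long)cred->uid;   /* trust whatever the client kernel said */
  }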