Naming in file systems
Required reading: namei() and associated functions; sysfile.c.
- why naming?
- convenient way for users, applications to keep track
of their data
- isolates different apps from one another, much like VM
(e.g. different apps can use different directories)
- can help applications with concurrency and sharing,
as we will see later
- conceptual layers (Unix)
- blocks: 512 bytes on disk. named by a sector number.
- inodes: sequence of bytes, backed by a sequence of blocks.
inode named by an inode number,
bytes in an inode named by offset in inode.
- directories: map user-meaningful names to inode numbers.
directories are inodes, so named by inode number.
- pathnames: describe a way to traverse the directory
structure to get the desired file.
- (Unix FDs: user-visible binding for an inode;
more on this later.)
- pathnames
- pathnames separated by /, root directory denoted by
a leading "/"
- special entries "." and ".."
- directory hierarchy -- can create subdirs, etc
- each process has a current working directory (p->cwd),
which is used for relative paths (not starting with "/")
- syscalls to manage namespace:
open, link, unlink, mkdir, chdir
- limitations, features
- since naming layer is quite separate from files (inodes),
can easily have a file with multiple names, or no names
at all.
- how do we implement rename? rename-replace?
- hard links for directories? problem: inode numbers
don't tell us when we might be forming a cycle..
- fix: disallow hard links for directories
- Unix convention: can only remove empty directories
(in part making the user do explicit garbage-collection)
- symbolic links: does not hold a reference, but just
tells the name lookup code to continue looking for
another name.
- works fine for directories, but does not guarantee
anything exists.
- tricky bit: relative-path symlinks evaluated starting
from the directory that holds the symlink, not caller's
cwd (as if caller provided same relative-path string).
- how else could you design an FS? e.g. FAT
- puts file info (equivalent of inode) in directory entry
- no notion of an inode (file's identity aside from name)
- cannot have hard links to files
- cannot rename across directories
- NTFS
- has a notion of an inode
- doesn't have the static disk layout of xv6 fs
- most metadata (like block bitmap) is a file
- kind-of like UVPT trick, can look at inode map as
just another file
- cute trick: can even map the real boot sector at
the start of a disk as yet another file
xv6 data structures
- dirent
- name: string
- inum: inode number for that file
- empty directory entries: zeroed-out inode
- dinode
- nlink: number of links (directory references)
- inode
- ref: number of threads that have an in-memory
pointer to this inode. FDs, p->cwd, syscalls,
namei.
- flags: I_BUSY (locked), I_VALID (data present, like buf)
- we'll see why we need I_BUSY vs ref in a bit..
- struct file
- abstracts files (files, pipes, ...)
- shared offset
naming functions:
- _namei does most name-to-inode translation
- skipelem parses path names
- dirlookup does name-to-inode lookup in a single dir
- dirlink adds names to a directory
xv6 code
- open("a/b/c", O_RDWR)
- 4851: sys_open
- 4865: namei should return inode for "a/b/c"
- 4389: namei
- 4354: namex
- 4361: start in cp->cwd
- 4363: skipelem: a/b/c -> b/c, name=a
- 4374: look up "a":
dirlookup [4212] iterates over all entries in dir
- 4379: go back around looking up "b/c" in the "a" dir
- 4374: look up "b", ...
- 4385: return inode for "a/b/c"
- 4874: back in sys_open, get a new FD#
- 4883: this is where we save reference to actual inode
- 4888: return FD# to user
- open("a/b/c", O_RDWR | O_CREATE)
- 4851: sys_open
- 4801: create
- 4807: nameiparent [4396] is slightly different:
get inode for parent dir, keep name to create
- 4811: dirlookup [4212] checks if file already exists.
who sets canexist? sys_open, sys_mknod, sys_mkdir.
- 4821: get a fresh inode, fill it in
(ialloc [3952] scans for a type=0 inode)
- 4831: add name to directory, dirlink [4252]
(what if we now crash here? good example from quiz.
how to fix?)
- 4838: if mkdir("a/b/c"), create "." and ".." entries.
- what ensures two threads can't race to create same name?
- 4846: return newly created inode
- unlink("a")
- 4751: sys_unlink
- 4760: nameiparent again, get parent dir, keep name to unlink
- 4765: why do we need to guard "." and ".." here?
- 4770: find dirent slot that holds name to unlink
- 4774: we already locked directory, why lock inode?
- 4778: if removing directory, make sure it's empty
- 4784: first blank out dirent
- 4789: then drop nlink; what would happen if we did otherwise?
- 4791: iunlockput [3924]
- when does the file (inode) actually get freed?
- what happens to files with multiple names (nlink>1)?
- what happens to files still held open by some processes?
- what happens if there's a concurrent link [4663]
executing for the same inode?
- what happens if there's a process running in a child dir?
child directory's ".." holds an nlink on parent [4839].
could have accessed a free directory inode otherwise.
(and xv6 gets it wrong again; itrunc should
drop parent->nlink..)
- back to namei: concurrency
- namei vs unlink of final name
(what's the precise problem?)
- namei vs unlink/rmdir of directory being traversed
(what's the precise problem?)
- namei("d1/d2") vs namei("d1/d2/..") -- circular locking?
- key idea: differentiate between locking an inode (I_BUSY)
vs protecting it from deletion (ref)
- concurrency requires lots of work to keep data structures
consistent and to provide some reasonable semantics
- work not necessarily wasted:
lots of applications have potential concurrency
and crash-recovery issues, and it's convenient when these issues
can be relegated to some common FS code
- example: atomic file creation -- create and rename trick
- atomic file replacement? xv6 lacks rename aka
unlink-and-link, but most Unix systems do have it.
symlinks
- what are the interesting points for implementing symlinks in xv6?
TOCTTOU
- unfortunately, for all our effort on concurrency,
some things just aren't solvable in the Unix API.
- various system calls take names, but the file it refers to
can change between each call, making it difficult to know
what you're really doing
- traditionally this problem has come up in privileged code,
which needs to perform some checks on a file, and if the
checks succeed, do something to that file. if the file
name points to a different inode between the "check" and
the "do", we have a security problem.
- called "time-of-check to time-of-use" (TOCTTOU) bug.
- usual example: cleaner for a /tmp directory
- root runs: rm /tmp/*/*
(really something like
find /tmp -not-accessed-recently | xargs rm)
- two phases: expand list of files, then unlink them
- attacker: mkdir /tmp/a; echo >/tmp/a/passwd
- root's rm: finds /tmp/a/passwd
- attacker: rm /tmp/a/passwd; rmdir /tmp/a;
ln -s /etc /tmp/d
- root's rm: unlink("/tmp/a/passwd"),
unlinks passwd from /tmp/a==/etc
- what's the fix? expose more inodes (as file descriptors)
to avoid duplicate namei() calls that might return different
values.
- fstat allows us to check file status for an inode,
rather than path (real unix also has stat, which
can lead to trouble)
- how to do a directory lookup on a directory inode?
- how to execute a file with a particular inode?
rest of course
- will talk about a research paper per lecture
- next lecture: paper on getting both crash-consistency and
performance in a file system.
- read paper before class, lecture will discuss it in detail,
be prepared.