* Ordered mode

In ordered mode ext4 flushes all dirty data of a file directly to the
main file system, but it orders those updates before it flushes the
corresponding metadata to the on-disk log. Ordered mode ensures, for
example, that inodes will always be pointing to valid blocks. But, it
has the (perhaps undesirable) feature that data of a file may have
been updated on disk, but the file's metadata may not have been
updated.

Note that "ordered" mode doesn't refer to some order among file
operations such as create, read, write, etc.  "Ordered" refers to
the fact that dirty data blocks are written before the corresponding
metadata blocks.

* Implementation

There is a file system (e.g., ext4) and a logging system (e.g., jbd2).
Ext4 uses jbd2 to provide logging.

jbd2 maintains an in-memory log and an on-disk log.  It provides
handles to allow file systems to describe atomic operations.  It
may group several handles into a single transaction.

Ext4 uses jbd2 as follows. System calls in ext4 are made atomic using
jbd2's handles. A system call starts with start_handle(). All blocks
updated by that system call are associated with that handle. Each
block has a buffer head, which describes that block.  It may also
have a journal head, if jbd2 knows about it, which records the
handle of the block, whether it contains metadata, and so on.  When
ext4 opens a handle, that handle is appended to the list of handles for
the current open transaction (and sufficient space is reserved in the
log). Ext4 marks the super block, blocks containing inodes,
directory blocks, etc. as metadata blocks.

jbd2 maintains the in-memory log implicitly: it is the list of blocks
associated with each handle in this transaction.  jbd2 maintains a
list of metadata blocks.  jbd2 also maintains a list of dirty inodes
that are part of this transaction.

When committing the in-memory log to the on-disk log, jbd2 first
flushes the dirty, non-metadata blocks to their final destination
(i.e., without going through the log).  It does so by traversing the
list of inodes in this transaction, and for each block of an inode it
flushes the block if it is dirty and not marked as a metadata block.
Then, jbd2 syncs the metadata blocks to the on-disk log.
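
The commit order described above can be sketched as a toy model.  This
is not jbd2 code: Block, Inode, Transaction, and commit() are
hypothetical names made up for illustration, and "writes" are just
entries in a list.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    number: int
    is_metadata: bool
    dirty: bool = True


@dataclass
class Inode:
    blocks: list  # blocks belonging to this inode


@dataclass
class Transaction:
    inodes: list = field(default_factory=list)    # dirty inodes in this transaction
    metadata: list = field(default_factory=list)  # metadata blocks to be logged


def commit(txn):
    """Return the sequence of writes a (toy) ordered-mode commit issues."""
    writes = []
    # Step 1: dirty data blocks go straight to their final location.
    for inode in txn.inodes:
        for blk in inode.blocks:
            if blk.dirty and not blk.is_metadata:
                writes.append(("in-place", blk.number))
                blk.dirty = False
    # Step 2: only then do metadata blocks go to the on-disk log.
    for blk in txn.metadata:
        writes.append(("log", blk.number))
    return writes


data = Block(100, is_metadata=False)
inode_blk = Block(7, is_metadata=True)
txn = Transaction(inodes=[Inode(blocks=[data])], metadata=[inode_blk])
order = commit(txn)
print(order)
```

The point of the model is only the ordering: every "in-place" entry
precedes every "log" entry.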

Ext4+jbd2 guarantee a prefix property: if a file system operation x
finishes before file system operation y starts, then x will be on disk
before y.  If x and y are close to each other in time, they may be
committed in the same transaction (which is fine).  y may also end up
in a later transaction (which is fine).  But, y will never end up in
an earlier transaction than x.
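
A minimal sketch of the prefix property, assuming a toy journal in
which every operation joins whatever transaction is running when it
starts.  The Journal class and its methods are hypothetical and only
model transaction IDs, not real jbd2 state.

```python
class Journal:
    def __init__(self):
        self.running_tid = 1   # ID of the currently open transaction
        self.running_ops = []  # operations that joined it
        self.committed = []    # list of (tid, ops), in commit order

    def run_op(self, name):
        # The operation starts and finishes inside the running transaction.
        self.running_ops.append(name)
        return self.running_tid

    def commit(self):
        # Commit the running transaction and open a new one.
        self.committed.append((self.running_tid, self.running_ops))
        self.running_tid += 1
        self.running_ops = []


j = Journal()
tid_x = j.run_op("x")  # x finishes ...
tid_y = j.run_op("y")  # ... before y starts: same transaction here
j.commit()
tid_z = j.run_op("z")  # z starts after the commit: a later transaction
j.commit()

# An operation never lands in an earlier transaction than one that
# finished before it started.
assert tid_x <= tid_y <= tid_z
```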

When calling fsync(), ext4 waits for the current transaction to
commit, which flushes the in-memory log to disk.  Thus, fsync
guarantees the prefix property for metadata operations: the metadata
operations preceding fsync() are on disk when the fsync() completes.
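
From the application side, the call in question looks like the sketch
below.  write_durably() is a hypothetical helper; the code only shows
the system-call sequence, and what the kernel does underneath (such as
waiting for a jbd2 commit) depends on the file system and mount options.

```python
import os
import tempfile


def write_durably(path, data):
    """Write data to path and return only after fsync() has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                 # os.write may write fewer bytes than asked
            n = os.write(fd, view)
            view = view[n:]
        os.fsync(fd)                # returns once the file system reports durability
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "notes.txt")
    write_durably(path, b"hello\n")
    with open(path, "rb") as f:
        contents = f.read()
```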

Can a system call y observe results from some other call x by another
process in memory and be ordered before x in the on-disk log?  It is
up to the file system and its concurrency mechanism, but for ext4 the
answer is no. If y starts after x completes, the answer is definitely
no (because of prefix property).  If two calls run concurrently, then
both will be committed to the disk in the same transaction, because
once ext4 opens a handle, it is guaranteed to be part of the current
transaction. On a commit, ext4 waits until all current active handles
are closed before committing.

When opening a handle, ext4 must say how many blocks it needs in the
log to complete the handle so that jbd2 can guarantee that all active
handles can be committed in the current transaction. If there isn't
enough space, then the start handle will be delayed until the next
transaction.
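
The space reservation can be modeled roughly as follows.  This is a
toy: Journal, max_blocks, and this start_handle() are hypothetical, and
"delaying until the next transaction" is simulated by immediately
rolling over to a new transaction ID rather than by blocking.

```python
class Journal:
    def __init__(self, max_blocks):
        self.max_blocks = max_blocks  # log space available to one transaction
        self.tid = 1                  # current transaction ID
        self.reserved = 0             # blocks reserved by handles so far

    def start_handle(self, nblocks):
        """Reserve nblocks of log space; return the transaction joined."""
        if nblocks > self.max_blocks:
            raise ValueError("handle does not fit in any transaction")
        if self.reserved + nblocks > self.max_blocks:
            # Not enough room: the handle waits for the next transaction.
            self.tid += 1
            self.reserved = 0
        self.reserved += nblocks
        return self.tid


j = Journal(max_blocks=10)
a = j.start_handle(6)  # fits in transaction 1
b = j.start_handle(3)  # still fits (9 of 10 blocks reserved)
c = j.start_handle(4)  # would exceed 10, so it joins transaction 2
print(a, b, c)
```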

* Implication for applications

A nuance is that ordering of metadata isn't always sufficient for
apps.  For example, an app may overwrite a block of a file, which
won't update its metadata (xxx is this really true? size? mtime?).  As a
result, that dirty block may be written to disk much later than
metadata operations that follow it, and those metadata operations may
have read the overwritten block.  (XXX I assume this is why the ALICE
paper lists an x for [overwrite -> any operation].)
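
One conservative pattern for an app that needs the overwrite on disk
before a later metadata operation is to fsync() in between.  The sketch
below assumes that is the app's requirement; overwrite_then_publish()
and the file names are hypothetical.

```python
import os
import tempfile


def overwrite_then_publish(data_path, offset, payload, marker_tmp, marker_final):
    """Overwrite part of a file, force it out, then do a metadata operation."""
    fd = os.open(data_path, os.O_WRONLY)
    try:
        view = memoryview(payload)
        while view:                        # os.pwrite may write fewer bytes
            n = os.pwrite(fd, view, offset)
            view = view[n:]
            offset += n
        os.fsync(fd)                       # overwritten block reaches disk first
    finally:
        os.close(fd)
    os.rename(marker_tmp, marker_final)    # the metadata operation that follows


with tempfile.TemporaryDirectory() as d:
    data_path = os.path.join(d, "data")
    tmp = os.path.join(d, "marker.tmp")
    final = os.path.join(d, "marker")
    with open(data_path, "wb") as f:
        f.write(b"AAAAAAAA")
    with open(tmp, "wb") as f:
        f.write(b"done")
    overwrite_then_publish(data_path, 2, b"BB", tmp, final)
    with open(data_path, "rb") as f:
        result = f.read()
    published = os.path.exists(final) and not os.path.exists(tmp)
```

Without the fsync(), nothing in the reasoning above forces the
overwritten block to disk before the rename commits.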

An app doesn't have to sync a newly-created parent directory when
fsyncing a file in that directory (assuming that the handle for the
parent directory is ordered before the handle of the file).  If ext4
orders handles correctly, jbd2 will write them to the on-disk log 
in the order of their handles.

---

(Note: if ext4 is run without a log, the code explicitly checks for
this case, and forces a flush on the parent directory when a file in
that directory is fsynced; see ext4/fsync.c.)

XXX why does the ALICE paper have an X for ext4-ordered [append -> any
op]?  The append updates metadata, so the dirty block should be flushed
before the metadata changes.  Maybe this has to do with delayed block
allocation?

I skimmed that paper. I think I have an inkling of what each "X"
means in Table 1 for ext4-ordered. To me the most mysterious was
"append -> any op"; the answer seems to be delayed allocation.

The paper does mention fsync() on the parent a few times, and I still do
not understand what the problem is.

I had not realized that rename() in an ext3-ordered file system won't
appear on disk until all previous writes are on disk, even for unrelated
files. And that fsync(), if it requires an i-node change, also forces
all previous metadata and data updates to disk for every file.

XXX many renames and cycles.