* Ordered mode

In ordered mode, ext4 flushes all dirty data of a file directly to the main file system, but it orders those updates before it flushes the corresponding metadata to the on-disk log. Ordered mode ensures, for example, that inodes always point to valid blocks. But it has the (perhaps undesirable) feature that the data of a file may have been updated on disk while the file's metadata has not.

Note that "ordered" mode doesn't refer to some order among file operations such as create, read, write, etc. "Ordered" means that dirty data blocks are written before the corresponding metadata blocks.

* Implementation

There is a file system (e.g., ext4) and a logging system (e.g., jbd2). Ext4 uses jbd2 to provide logging. jbd2 maintains an in-memory log and an on-disk log. It provides handles to allow file systems to describe atomic operations. It may group several handles into a single transaction.

Ext4 uses jbd2 as follows. System calls in ext4 are made atomic using jbd2's handles. A system call starts with start_handle(). All blocks updated by that system call are associated with that handle. Each block has a buffer head, which describes that block. It may also have a journal head, if jbd2 knows about it, which records the handle of the block, whether it contains metadata, and so on. When ext4 opens a handle, that handle is appended to the list of handles for the current open transaction (and reserves sufficient space in the log). Ext4 marks the super block, blocks containing inodes, directory blocks, etc. as metadata blocks.

jbd2 maintains the in-memory log implicitly: it is the list of blocks associated with each handle in this transaction. jbd2 maintains a list of metadata blocks. jbd2 also maintains a list of dirty inodes that are part of this transaction.

When committing the in-memory log to the on-disk log, jbd2 first flushes the dirty, non-metadata blocks to their final destination (i.e., without going through the log).
It does so by traversing the list of inodes in this transaction and, for each block of an inode, flushing it if it is dirty and not marked as a metadata block. Then jbd2 syncs the metadata blocks to the on-disk log.

Ext4+jbd2 guarantee a prefix property: if a file system operation x finishes before file system operation y starts, then x will reach the disk no later than y. If x and y are close to each other in time, they may be committed in the same transaction (which is fine). y may also end up in a later transaction (which is fine). But y will never end up in an earlier transaction than x.

When calling fsync(), ext4 waits for the current transaction to commit, which flushes the in-memory log to disk. Thus, fsync() guarantees the prefix property for metadata operations: the metadata operations preceding fsync() are on disk when the fsync() completes.

Can a system call y observe results from some other call x by another process in memory and be ordered before x in the on-disk log? It is up to the file system and its concurrency mechanism, but for ext4 the answer is no. If y starts after x completes, the answer is definitely no (because of the prefix property). If two calls run concurrently, then both will be committed to disk in the same transaction, because once ext4 opens a handle, it is guaranteed to be part of the current transaction. On a commit, ext4 waits until all currently active handles are closed before committing. When opening a handle, ext4 must say how many blocks it needs in the log to complete the handle so that jbd2 can guarantee that all active handles can be committed in the current transaction. If there isn't enough space, then the start of the handle will be delayed until the next transaction.

* Implication for applications

A nuance is that ordering of metadata isn't always sufficient for apps. For example, an app may overwrite a block of a file, which won't update its metadata (XXX is this really true? size? mtime?).
As a result, that dirty block may be written to disk much later than the metadata operations that follow it, and those metadata operations may have read the overwritten block. (XXX I assume this is why the alice paper lists an X for [overwrite -> any operation].)

An app doesn't have to sync a newly created parent directory when fsyncing a file in that directory (assuming that the handle for the parent directory is ordered before the handle of the file). If ext4 orders handles correctly, jbd2 will write them to the on-disk log in the order of their handles.

---

(Note: if ext4 is run without a log, the code explicitly checks for this case and forces a flush of the parent directory when a file in that directory is fsynced; see ext4/fsync.c.)

XXX Why does the alice paper have an X for ext4-ordered [append -> any op]? The append updates metadata, so the dirty block should be flushed before the metadata changes. Maybe this has to do with delayed block allocation?

I skimmed that paper. I think I have an inkling of what each of the "X" is in Table 1 for ext4-ordered. To me the most mysterious was "append -> any op"; the answer seems to be delayed allocation. The paper does mention fsync() on the parent a few times, and I still do not understand what the problem is.

I had not realized that rename() in an ext3-ordered file system won't appear on disk until all previous writes are on disk, even for unrelated files. And that fsync(), if it requires an i-node change, also forces all previous metadata and data updates to disk for every file.

XXX many renames and cycles.
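
For the application side, one conservative way to avoid depending on the [overwrite -> any operation] and [append -> any op] cases discussed above is to never update a file in place: write a new file, fsync() it, and rename() it over the old one. The sketch below is not taken from these notes or from ext4; the helper name atomic_replace is hypothetical, and it assumes a POSIX system where a directory can be opened and fsynced. The final fsync() of the parent directory may be redundant on ext4 in ordered mode according to the reasoning above, but it is kept as an assumption-free default for other file systems and for ext4 without a log.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Replace the contents of `path` using write-temp, fsync, rename.

    Hypothetical helper for illustration; assumes POSIX semantics.
    """
    dirname = os.path.dirname(os.path.abspath(path))

    # Create the temporary file in the same directory so that the rename
    # stays within a single file system.
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's buffer to the kernel
            os.fsync(f.fileno())  # force the new file's data to disk
        # rename(2) is a metadata operation; the data it exposes was
        # already forced to disk by the fsync above.
        os.replace(tmp, path)
    except BaseException:
        # Best-effort cleanup of the temporary file on failure.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise

    # Conservative step: fsync the parent directory so the rename itself
    # is durable even on file systems that do not order it for us.
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

A caller would use it as atomic_replace("/some/dir/config", b"new contents"). After a crash, a reader should then see either the complete old contents or the complete new contents, provided the file system makes rename() atomic.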