diff mbox series

[v2] Documenting the crash-recovery guarantees of Linux file systems

Message ID 1552418820-18102-1-git-send-email-jaya@cs.utexas.edu (mailing list archive)
State New, archived
Headers show
Series [v2] Documenting the crash-recovery guarantees of Linux file systems | expand

Commit Message

Jayashree March 12, 2019, 7:27 p.m. UTC
In this file, we document the crash-recovery guarantees
provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
(SOMC), which is provided by xfs. It is not clear to us if other file systems
provide SOMC.

Signed-off-by: Jayashree Mohan <jaya@cs.utexas.edu>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---

We would be happy to modify the document if file-system
developers claim that their system provides (or aims to provide) SOMC.

Changes since v1:
  * Addressed few nits identified in the review
  * Added the fsync guarantees for F2FS and its SOMC compliance
---
 .../filesystems/crash-recovery-guarantees.txt      | 193 +++++++++++++++++++++
 1 file changed, 193 insertions(+)
 create mode 100644 Documentation/filesystems/crash-recovery-guarantees.txt

--
2.7.4

Comments

Filipe Manana March 13, 2019, 5:13 p.m. UTC | #1
On Tue, Mar 12, 2019 at 7:27 PM Jayashree <jaya@cs.utexas.edu> wrote:
>
> In this file, we document the crash-recovery guarantees
> provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
> present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
> (SOMC), which is provided by xfs. It is not clear to us if other file systems
> provide SOMC.
>
> Signed-off-by: Jayashree Mohan <jaya@cs.utexas.edu>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Mostly for the Btrfs part:

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Nit: you can have lines up to 80 characters, many of them are
significantly and unnecessarily smaller than that.

Thanks for writing this document.

> ---
>
> We would be happy to modify the document if file-system
> developers claim that their system provides (or aims to provide) SOMC.
>
> Changes since v1:
>   * Addressed few nits identified in the review
>   * Added the fsync guarantees for F2FS and its SOMC compliance
> ---
>  .../filesystems/crash-recovery-guarantees.txt      | 193 +++++++++++++++++++++
>  1 file changed, 193 insertions(+)
>  create mode 100644 Documentation/filesystems/crash-recovery-guarantees.txt
>
> diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt
> new file mode 100644
> index 0000000..be84964
> --- /dev/null
> +++ b/Documentation/filesystems/crash-recovery-guarantees.txt
> @@ -0,0 +1,193 @@
> +=====================================================================
> +File System Crash-Recovery Guarantees
> +=====================================================================
> +Linux file systems provide certain guarantees to user-space
> +applications about what happens to their data if the system crashes
> +(due to power loss or kernel panic). These are termed crash-recovery
> +guarantees.
> +
> +Crash-recovery guarantees only pertain to data or metadata that has
> +been explicitly persisted to storage with fsync(), fdatasync(), or
> +sync() system calls. By default, write(), mkdir(), and other
> +file-system related system calls only affect the in-memory state of
> +the file system.
> +
> +The crash-recovery guarantees provided by most Linux file systems are
> +significantly stronger than what is required by POSIX. POSIX is vague,
> +even allowing fsync() to do nothing (Mac OSX takes advantage of
> +this). However, the guarantees provided by file systems are not
> +documented, and vary between file systems. This document seeks to
> +describe the current crash-recovery guarantees provided by major Linux
> +file systems.
> +
> +What does the fsync() operation guarantee?
> +----------------------------------------------------
> +fsync() operation is meant to force the physical write of data
> +corresponding to a file from the buffer cache, along with the file
> +metadata. Note that the guarantees mentioned for each file system below
> +are in addition to the ones provided by POSIX.
> +
> +POSIX
> +-----
> +fsync(file) : Flushes the data and metadata associated with the
> +file. However, if the directory entry for the file has not been
> +previously persisted, or has been modified, it is not guaranteed to be
> +persisted by the fsync of the file [1]. What this means is, if a file
> +is newly created, you will have to fsync(parent directory) in addition
> +to fsync(file) in order to ensure that the file's directory entry has
> +safely reached the disk.
> +
> +fsync(dir) : Flushes directory data and directory entries. However if
> +you created a new file within the directory and wrote data to the
> +file, then the file data is not guaranteed to be persisted, unless an
> +explicit fsync() is issued on the file.
> +
> +ext4
> +-----
> +fsync(file) : Ensures that a newly created file's directory entry is
> +persisted (no need to explicitly persist the parent directory). However,
> +if you create multiple names of the file (hard links), then their directory
> +entries are not guaranteed to persist unless each one of the parent
> +directory entries are persisted [2].
> +
> +fsync(dir) : All file names within the persisted directory will exist,
> +but does not guarantee file data.
> +
> +xfs
> +----
> +fsync(file) : Ensures that a newly created file's directory entry is
> +persisted. Additionally, all the previous dependent modifications to
> +this file are also persisted. If any file shares an object
> +modification dependency with the fsync-ed file, then that file's
> +directory entry is also persisted.
> +
> +fsync(dir) : All file names within the persisted directory will exist,
> +but does not guarantee file data. As with files, fsync(dir) also persists
> +previous dependent metadata operations.
> +
> +btrfs
> +------
> +fsync(file) : Ensures that a newly created file's directory entry
> +is persisted, along with the directory entries of all its hard links.
> +You do not need to explicitly fsync individual hard links to the file.
> +
> +fsync(dir) : All the file names within the directory will persist. All the
> +rename and unlink operations within the directory are persisted. Due
> +to the design choices made by btrfs, fsync of a directory could lead
> +to an iterative fsync on sub-directories, thereby requiring a full
> +file system commit. So btrfs does not advocate fsync of directories
> +[2].
> +
> +F2FS
> +----
> +fsync(file) or fsync(dir) : In the default mode (fsync-mode=posix),
> +F2FS only guarantees POSIX behaviour. However, it provides xfs-like
> +guarantees if mounted with fsync-mode=strict option.
> +
> +fsync(symlink)
> +-------------
> +A symlink inode cannot be directly opened for IO, which means there is
> +no such thing as fsync of a symlink [3]. You could be tricked by the
> +fact that open and fsync of a symlink succeeds without returning a
> +error, but what happens in reality is as follows.
> +
> +Suppose we have a symlink “foo”, which points to the file “A/bar”
> +
> +fd = open(“foo”, O_CREAT | O_RDWR)
> +fsync(fd)
> +
> +Both the above operations succeed, but if you crash after fsync, the
> +symlink could be still missing.
> +
> +When you try to open the symlink “foo”, you are actually trying to
> +open the file that the symlink resolves to, which in this case is
> +“A/bar”. When you fsync the inode returned by the open system call, you
> +are actually persisting the file “A/bar” and not the symlink. Note
> +that if the file “A/bar” does not exist and you try the open the
> +symlink “foo” without the O_CREAT flag, then file open will fail. To
> +obtain the file descriptor associated with the symlink inode, you
> +could open the symlink using “O_PATH | O_NOFOLLOW” flags. However, the
> +file descriptor obtained this way can be only used to indicate a
> +location in the file-system tree and to perform operations that act
> +purely at the file descriptor level. Operations like read(), write(),
> +fsync() etc cannot be performed on such file descriptors.
> +
> +Bottomline : You cannot fsync() a symlink.
> +
> +fsync(special files)
> +--------------------
> +Special files in Linux include block and character device files
> +(created using mknod), FIFO (created using mkfifo) etc. Just like the
> +behavior of fsync on symlinks described above, these special files do
> +not have an fsync function defined. Similar to symlinks, you
> +cannot fsync a special file [4].
> +
> +
> +Strictly Ordered Metadata Consistency
> +-------------------------------------
> +With each file system providing varying levels of persistence
> +guarantees, a consensus in this regard, will benefit application
> +developers to work with certain fixed assumptions about file system
> +guarantees. Dave Chinner proposed a unified model called the
> +Strictly Ordered Metadata Consistency (SOMC) [5].
> +
> +Under this scheme, the file system guarantees to persist all previous
> +dependent modifications to the object upon fsync().  If you fsync() an
> +inode, it will persist all the changes required to reference the inode
> +and its data. SOMC can be defined as follows [6]:
> +
> +If op1 precedes op2 in program order (in-memory execution order), and
> +op1 and op2 share a dependency, then op2 must not be observed by a
> +user after recovery without also observing op1.
> +
> +Unfortunately, SOMC's definition depends upon whether two operations
> +share a dependency, which could be file-system specific. It might
> +require a developer to understand file-system internals to know if
> +SOMC would order one operation before another. It is worth noting
> +that a file system can be crash-consistent (according to POSIX),
> +without providing SOMC [7].
> +
> +As an example, consider the following test case from xfstest
> +generic/342 [8]
> +-------
> +touch A/foo
> +echo “hello” >  A/foo
> +sync
> +
> +mv A/foo A/bar
> +echo “world” > A/foo
> +fsync A/foo
> +CRASH
> +
> +What would you expect on recovery, if the file system crashed after
> +the final fsync returned successfully?
> +
> +Non-SOMC file systems will not persist the file
> +A/bar because it was not explicitly fsync-ed. But this means, you will
> +find only the file A/foo with data “world” after crash, thereby losing
> +the previously persisted file with data “hello”. You will need to
> +explicitly fsync the directory A to ensure the rename operation is
> +safely persisted on disk.
> +
> +Under SOMC, to correctly reference the new inode via A/foo,
> +the previous rename operation must persist as well. Therefore,
> +fsync() of A/foo will persist the renamed file A/bar as well.
> +On recovery you will find both A/bar (with data “hello”)
> +and A/foo (with data “world”).
> +
> +It is noteworthy that xfs, ext4, F2FS (when mounted with fsync_mode=strict)
> +and btrfs provide SOMC-like behaviour in this particular example.
> +However, in writing, only XFS claims to provide SOMC. F2FS aims to provide
> +SOMC when mounted with fsync_mode=strict. It is not clear if ext4 and
> +btrfs provide strictly ordered metadata consistency.
> +
> +--------------------------------------------------------
> +[1] http://man7.org/linux/man-pages/man2/fdatasync.2.html
> +[2] https://www.spinics.net/lists/linux-btrfs/msg77340.html
> +[3] https://www.spinics.net/lists/fstests/msg09370.html
> +[4] https://bugzilla.kernel.org/show_bug.cgi?id=202485
> +[5] https://marc.info/?l=fstests&m=155010885626284&w=2
> +[6] https://marc.info/?l=fstests&m=155011123126916&w=2
> +[7] https://www.spinics.net/lists/fstests/msg09379.html
> +[8] https://patchwork.kernel.org/patch/10132305/
> +
> --
> 2.7.4
>
Amir Goldstein March 13, 2019, 6:43 p.m. UTC | #2
On Tue, Mar 12, 2019 at 9:27 PM Jayashree <jaya@cs.utexas.edu> wrote:
>
> In this file, we document the crash-recovery guarantees
> provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
> present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
> (SOMC), which is provided by xfs. It is not clear to us if other file systems
> provide SOMC.

I think your document already claims that f2fs is SOMC, so better
update commit message.

FWIW, it is clear that ext4 also provides SOMC, because
all metadata is journalled on a single linear transaction journal.
Compared to xfs, an fsync on any dirty object is likely to flush
even more metadata.

It'd be a pitty to merge this document without Ted's ACK
on the SOMC claim for ext4.

Thanks,
Amir.

>
> Signed-off-by: Jayashree Mohan <jaya@cs.utexas.edu>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> We would be happy to modify the document if file-system
> developers claim that their system provides (or aims to provide) SOMC.
>
> Changes since v1:
>   * Addressed few nits identified in the review
>   * Added the fsync guarantees for F2FS and its SOMC compliance
> ---
>  .../filesystems/crash-recovery-guarantees.txt      | 193 +++++++++++++++++++++
>  1 file changed, 193 insertions(+)
>  create mode 100644 Documentation/filesystems/crash-recovery-guarantees.txt
>
> diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt
> new file mode 100644
> index 0000000..be84964
> --- /dev/null
> +++ b/Documentation/filesystems/crash-recovery-guarantees.txt
> @@ -0,0 +1,193 @@
> +=====================================================================
> +File System Crash-Recovery Guarantees
> +=====================================================================
> +Linux file systems provide certain guarantees to user-space
> +applications about what happens to their data if the system crashes
> +(due to power loss or kernel panic). These are termed crash-recovery
> +guarantees.
> +
> +Crash-recovery guarantees only pertain to data or metadata that has
> +been explicitly persisted to storage with fsync(), fdatasync(), or
> +sync() system calls. By default, write(), mkdir(), and other
> +file-system related system calls only affect the in-memory state of
> +the file system.
> +
> +The crash-recovery guarantees provided by most Linux file systems are
> +significantly stronger than what is required by POSIX. POSIX is vague,
> +even allowing fsync() to do nothing (Mac OSX takes advantage of
> +this). However, the guarantees provided by file systems are not
> +documented, and vary between file systems. This document seeks to
> +describe the current crash-recovery guarantees provided by major Linux
> +file systems.
> +
> +What does the fsync() operation guarantee?
> +----------------------------------------------------
> +fsync() operation is meant to force the physical write of data
> +corresponding to a file from the buffer cache, along with the file
> +metadata. Note that the guarantees mentioned for each file system below
> +are in addition to the ones provided by POSIX.
> +
> +POSIX
> +-----
> +fsync(file) : Flushes the data and metadata associated with the
> +file. However, if the directory entry for the file has not been
> +previously persisted, or has been modified, it is not guaranteed to be
> +persisted by the fsync of the file [1]. What this means is, if a file
> +is newly created, you will have to fsync(parent directory) in addition
> +to fsync(file) in order to ensure that the file's directory entry has
> +safely reached the disk.
> +
> +fsync(dir) : Flushes directory data and directory entries. However if
> +you created a new file within the directory and wrote data to the
> +file, then the file data is not guaranteed to be persisted, unless an
> +explicit fsync() is issued on the file.
> +
> +ext4
> +-----
> +fsync(file) : Ensures that a newly created file's directory entry is
> +persisted (no need to explicitly persist the parent directory). However,
> +if you create multiple names of the file (hard links), then their directory
> +entries are not guaranteed to persist unless each one of the parent
> +directory entries are persisted [2].
> +
> +fsync(dir) : All file names within the persisted directory will exist,
> +but does not guarantee file data.
> +
> +xfs
> +----
> +fsync(file) : Ensures that a newly created file's directory entry is
> +persisted. Additionally, all the previous dependent modifications to
> +this file are also persisted. If any file shares an object
> +modification dependency with the fsync-ed file, then that file's
> +directory entry is also persisted.
> +
> +fsync(dir) : All file names within the persisted directory will exist,
> +but does not guarantee file data. As with files, fsync(dir) also persists
> +previous dependent metadata operations.
> +
> +btrfs
> +------
> +fsync(file) : Ensures that a newly created file's directory entry
> +is persisted, along with the directory entries of all its hard links.
> +You do not need to explicitly fsync individual hard links to the file.
> +
> +fsync(dir) : All the file names within the directory will persist. All the
> +rename and unlink operations within the directory are persisted. Due
> +to the design choices made by btrfs, fsync of a directory could lead
> +to an iterative fsync on sub-directories, thereby requiring a full
> +file system commit. So btrfs does not advocate fsync of directories
> +[2].
> +
> +F2FS
> +----
> +fsync(file) or fsync(dir) : In the default mode (fsync-mode=posix),
> +F2FS only guarantees POSIX behaviour. However, it provides xfs-like
> +guarantees if mounted with fsync-mode=strict option.
> +
> +fsync(symlink)
> +-------------
> +A symlink inode cannot be directly opened for IO, which means there is
> +no such thing as fsync of a symlink [3]. You could be tricked by the
> +fact that open and fsync of a symlink succeeds without returning a
> +error, but what happens in reality is as follows.
> +
> +Suppose we have a symlink “foo”, which points to the file “A/bar”
> +
> +fd = open(“foo”, O_CREAT | O_RDWR)
> +fsync(fd)
> +
> +Both the above operations succeed, but if you crash after fsync, the
> +symlink could be still missing.
> +
> +When you try to open the symlink “foo”, you are actually trying to
> +open the file that the symlink resolves to, which in this case is
> +“A/bar”. When you fsync the inode returned by the open system call, you
> +are actually persisting the file “A/bar” and not the symlink. Note
> +that if the file “A/bar” does not exist and you try the open the
> +symlink “foo” without the O_CREAT flag, then file open will fail. To
> +obtain the file descriptor associated with the symlink inode, you
> +could open the symlink using “O_PATH | O_NOFOLLOW” flags. However, the
> +file descriptor obtained this way can be only used to indicate a
> +location in the file-system tree and to perform operations that act
> +purely at the file descriptor level. Operations like read(), write(),
> +fsync() etc cannot be performed on such file descriptors.
> +
> +Bottomline : You cannot fsync() a symlink.
> +
> +fsync(special files)
> +--------------------
> +Special files in Linux include block and character device files
> +(created using mknod), FIFO (created using mkfifo) etc. Just like the
> +behavior of fsync on symlinks described above, these special files do
> +not have an fsync function defined. Similar to symlinks, you
> +cannot fsync a special file [4].
> +
> +
> +Strictly Ordered Metadata Consistency
> +-------------------------------------
> +With each file system providing varying levels of persistence
> +guarantees, a consensus in this regard, will benefit application
> +developers to work with certain fixed assumptions about file system
> +guarantees. Dave Chinner proposed a unified model called the
> +Strictly Ordered Metadata Consistency (SOMC) [5].
> +
> +Under this scheme, the file system guarantees to persist all previous
> +dependent modifications to the object upon fsync().  If you fsync() an
> +inode, it will persist all the changes required to reference the inode
> +and its data. SOMC can be defined as follows [6]:
> +
> +If op1 precedes op2 in program order (in-memory execution order), and
> +op1 and op2 share a dependency, then op2 must not be observed by a
> +user after recovery without also observing op1.
> +
> +Unfortunately, SOMC's definition depends upon whether two operations
> +share a dependency, which could be file-system specific. It might
> +require a developer to understand file-system internals to know if
> +SOMC would order one operation before another. It is worth noting
> +that a file system can be crash-consistent (according to POSIX),
> +without providing SOMC [7].
> +
> +As an example, consider the following test case from xfstest
> +generic/342 [8]
> +-------
> +touch A/foo
> +echo “hello” >  A/foo
> +sync
> +
> +mv A/foo A/bar
> +echo “world” > A/foo
> +fsync A/foo
> +CRASH
> +
> +What would you expect on recovery, if the file system crashed after
> +the final fsync returned successfully?
> +
> +Non-SOMC file systems will not persist the file
> +A/bar because it was not explicitly fsync-ed. But this means, you will
> +find only the file A/foo with data “world” after crash, thereby losing
> +the previously persisted file with data “hello”. You will need to
> +explicitly fsync the directory A to ensure the rename operation is
> +safely persisted on disk.
> +
> +Under SOMC, to correctly reference the new inode via A/foo,
> +the previous rename operation must persist as well. Therefore,
> +fsync() of A/foo will persist the renamed file A/bar as well.
> +On recovery you will find both A/bar (with data “hello”)
> +and A/foo (with data “world”).
> +
> +It is noteworthy that xfs, ext4, F2FS (when mounted with fsync_mode=strict)
> +and btrfs provide SOMC-like behaviour in this particular example.
> +However, in writing, only XFS claims to provide SOMC. F2FS aims to provide
> +SOMC when mounted with fsync_mode=strict. It is not clear if ext4 and
> +btrfs provide strictly ordered metadata consistency.
> +
> +--------------------------------------------------------
> +[1] http://man7.org/linux/man-pages/man2/fdatasync.2.html
> +[2] https://www.spinics.net/lists/linux-btrfs/msg77340.html
> +[3] https://www.spinics.net/lists/fstests/msg09370.html
> +[4] https://bugzilla.kernel.org/show_bug.cgi?id=202485
> +[5] https://marc.info/?l=fstests&m=155010885626284&w=2
> +[6] https://marc.info/?l=fstests&m=155011123126916&w=2
> +[7] https://www.spinics.net/lists/fstests/msg09379.html
> +[8] https://patchwork.kernel.org/patch/10132305/
> +
> --
> 2.7.4
>
Dave Chinner March 14, 2019, 1:19 a.m. UTC | #3
On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> In this file, we document the crash-recovery guarantees
> provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
> present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
> (SOMC), which is provided by xfs. It is not clear to us if other file systems
> provide SOMC.

FWIW, new kernel documents should be written in rst markup format,
not plain ascii text.

> 
> Signed-off-by: Jayashree Mohan <jaya@cs.utexas.edu>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> ---
> 
> We would be happy to modify the document if file-system
> developers claim that their system provides (or aims to provide) SOMC.
> 
> Changes since v1:
>   * Addressed few nits identified in the review
>   * Added the fsync guarantees for F2FS and its SOMC compliance
> ---
>  .../filesystems/crash-recovery-guarantees.txt      | 193 +++++++++++++++++++++
>  1 file changed, 193 insertions(+)
>  create mode 100644 Documentation/filesystems/crash-recovery-guarantees.txt
> 
> diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt
> new file mode 100644
> index 0000000..be84964
> --- /dev/null
> +++ b/Documentation/filesystems/crash-recovery-guarantees.txt
> @@ -0,0 +1,193 @@
> +=====================================================================
> +File System Crash-Recovery Guarantees
> +=====================================================================
> +Linux file systems provide certain guarantees to user-space
> +applications about what happens to their data if the system crashes
> +(due to power loss or kernel panic). These are termed crash-recovery
> +guarantees.

These are termed "data integrity guarantees", not "crash recovery
guarantees".

i.e. crash recovery is generic phrase describing the _mechanism_
used by some filesystems to implement the data integrity guarantees
the filesystem provides to userspace applications. 

> +
> +Crash-recovery guarantees only pertain to data or metadata that has
> +been explicitly persisted to storage with fsync(), fdatasync(), or
> +sync() system calls.

Define data and metadata in terms of what they refer to when we talk
about data integrity guarantees.

Define "persisted to storage".

Also, data integrity guarantees are provided by more interfaces than
you mention. They also apply to syncfs(), FIFREEZE, files/dirs
opened with O_[D]SYNC, readv2/writev2 calls with RWF_[D]SYNC set,
inodes with the S_[DIR]SYNC on-disk attribute, mounts with
dirsync/wsync options, etc. "data integrity guarantees" encompass
all these operations, not just fsync/fdatasync/sync....

> By default, write(), mkdir(), and other
> +file-system related system calls only affect the in-memory state of
> +the file system.

That's a generalisation that is not always correct from the user's
or userspace develper's point of view. e.g.  inodes with the sync
attribute set will default to synchronous on-disk state changes,
applications can use O_DSYNC/O_SYNC by default, etc....

> +The crash-recovery guarantees provided by most Linux file systems are
> +significantly stronger than what is required by POSIX. POSIX is vague,
> +even allowing fsync() to do nothing (Mac OSX takes advantage of
> +this).

Except when _POSIX_SYNCHRONIZED_IO is asserted, and then the
semantics filesystems must provide users are very explicit:

"[SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the
fsync() function shall force all currently queued I/O operations
associated with the file indicated by file descriptor fildes to the
synchronized I/O completion state. All I/O operations shall be
completed as defined for synchronized I/O file integrity completion.
[Option End]"

glibc asserts _POSIX_SYNCHRONIZED_IO (I'll use SIO from now on):

$ getconf _POSIX_SYNCHRONIZED_IO
200809
$

This means fsync() on Linux is supposed to conform to Section 3.376
"Synchronized I/O File Integrity Completion" of the specification,
which is a superset of the 3.375 "Synchronized I/O Data Integrity
Completion". Section 3.375 says:

"For write, when the operation has been completed or diagnosed if
unsuccessful. The write is complete only when the data specified in
the write request is successfully transferred and all file system
information required to retrieve the data is successfully
transferred."

https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_375

The key phrase here is "all the file system information required to
retrieve the data". If the directory entry that points at the file
is not persisted with the file itself, then you can't retreive the
data after a crash.  i.e. when _POSIX_SYNCHRONIZED_IO is asserted by
the system, the filesystem must guarantee this:

# touch A/foo
# echo "hello world" > A/foo
# fsync A/foo

persists the foo entry in the directory A, because that is
"filesystem information required to retreive the data in the file
A/foo". i.e. if we crash here and A/foo is not present after
restart, then we've violated the POSIX specification for SIO.

IOWs, POSIX fsync w/ SIO semantics does not allow fsync() to do
nothing, but instead has explicit definitions of the behaviour
applications can expect.  The only "wiggle room" in this
specification whether the meaning of "data transfer" includes
physically persisting the data to storage media or just moving it
into the device's volatile cache. On Linux, we've explicitly chosen
the former, because the latter does not provide SIO semantics as
data or referencing metadata can still be lost from the device's
volatile cache after transfer.

> However, the guarantees provided by file systems are not
> +documented, and vary between file systems. This document seeks to
> +describe the current crash-recovery guarantees provided by major Linux
> +file systems.
> +
> +What does the fsync() operation guarantee?
> +----------------------------------------------------
> +fsync() operation is meant to force the physical write of data
> +corresponding to a file from the buffer cache, along with the file
> +metadata. Note that the guarantees mentioned for each file system below
> +are in addition to the ones provided by POSIX.

a. what is a "physical write"?
b. Linux does not have a buffer cache. What about direct IO?
c. Exactly what "file metadata" are you talking about here?
e. Actually, it's not "in addtion" to posix - what you are
documenting here is where filesystems do not conform to
the POSIX SIO specification....

> +POSIX
> +-----
> +fsync(file) : Flushes the data and metadata associated with the
> +file.  However, if the directory entry for the file has not been
> +previously persisted, or has been modified, it is not guaranteed to be
> +persisted by the fsync of the file [1].

These are the semantics defined in the linux fsync(3) man page, and
as per the above, they are substantially /weaker/ than the POSIX
SIO specification glibc says we implement.

> What this means is, if a file
> +is newly created, you will have to fsync(parent directory) in addition
> +to fsync(file) in order to ensure that the file's directory entry has
> +safely reached the disk.

Define "safely reached disk" or use the same terms as previously
defined (i.e. "persisted to storage").

> +
> +fsync(dir) : Flushes directory data and directory entries. However if
> +you created a new file within the directory and wrote data to the
> +file, then the file data is not guaranteed to be persisted, unless an
> +explicit fsync() is issued on the file.

You talk about file metadata, then ignore what fsync does with
directory metadata...

> +ext4
> +-----
> +fsync(file) : Ensures that a newly created file's directory entry is
> +persisted (no need to explicitly persist the parent directory). However,
> +if you create multiple names of the file (hard links), then their directory
> +entries are not guaranteed to persist unless each one of the parent
> +directory entries are persisted [2].

So you use a specific example to indicate an exception where ext4
needs an explicit parent directory fsync (i.e. hard links to a
single file across multiple directories). That implies ext4 POSIX
SIO compliance is questionable, and it is definitely not SOMC
compliant. Further, it implies that transactional change atomicity
requirements are also violated. i.e. the inode is journalled with a
link count equivalent to all links existing, but not all the dirents
that point to the inode are persisted at the same time.

So from this example, ext4 is not SOMC compliant.

> +fsync(dir) : All file names within the persisted directory will exist,
> +but does not guarantee file data.

what about the inodes that were created, removed or hard linked?
Does it ensure they exist (or have been correctly freed) after
fsync(dir), too?  (that hardlink behaviour makes me question
everything related to transaction atomicity in ext4 now)

> +xfs
> +----
> +fsync(file) : Ensures that a newly created file's directory entry is
> +persisted.

Actually, it ensures the path all the way up to the root inode is
persisted. i.e. it guarantees the inode can be found after crash via
a path walk. Basically, XFS demonstrates POSIX SIO compliant
behaviour.

> Additionally, all the previous dependent modifications to
> +this file are also persisted.

That's the mechanism that provides the behaviour, not sure that's
relevant here.

FWIW, this description is pretty much useless to a reader who knows
nothing about XFS and what these terms actually mean.  IOWs, you
need to define "previous dependent modifications", "modification
dependency", etc before using them. Essentially, you need to
describe the observable behaviour here, not the implementation that
creates the behaviour.

> If any file shares an object
> +modification dependency with the fsync-ed file, then that file's
> +directory entry is also persisted.

Which you need to explain with references to the ext4 hardlink
failure and how XFS will persist all the hard link directory entries
for each hardlink all the way back up to the root. i.e. don't
describe the implementation, describe the observable behaviour.

> +fsync(dir) : All file names within the persisted directory will exist,
> +but does not guarantee file data. As with files, fsync(dir) also persists
> +previous dependent metadata operations.
>
> +btrfs
> +------
> +fsync(file) : Ensures that a newly created file's directory entry
> +is persisted, along with the directory entries of all its hard links.
> +You do not need to explicitly fsync individual hard links to the file.

So how is that different to XFS? Why explicitly state the hard link
behaviour, but then not mention anything about dependencies and
propagation? Especially after doing exactly the opposite when
describing XFS....

> +fsync(dir) : All the file names within the directory will persist. All the
> +rename and unlink operations within the directory are persisted. Due
> +to the design choices made by btrfs, fsync of a directory could lead
> +to an iterative fsync on sub-directories, thereby requiring a full
> +file system commit. So btrfs does not advocate fsync of directories
> +[2].

I don't think this "recommendation" is appropriate for a document
describing behaviour. It's also indicative of btrfs not having SOMC
behaviour.

> +F2FS
> +----
> +fsync(file) or fsync(dir) : In the default mode (fsync-mode=posix),
> +F2FS only guarantees POSIX behaviour. However, it provides xfs-like

What does "only guarantees POSIX behaviour" actually mean? because
it can mean "loses all your data on crash"....

> +guarantees if mounted with fsync-mode=strict option.

So, by default, f2fs will lose all your data on crash? And they call
that "POSIX" behaviour, despite glibc telling applications that the
system provides data integrity preserving fsync functionality?

Seems like a very badly named mount option and a terrible default -
basically we have "fast-and-loose" behaviour which has "eats your
data" data integrity semantics and "strict" which should be POSIX
SIO conformant.


> +fsync(symlink)
> +-------------
> +A symlink inode cannot be directly opened for IO, which means there is
> +no such thing as fsync of a symlink [3]. You could be tricked by the
> +fact that open and fsync of a symlink succeeds without returning a
> +error, but what happens in reality is as follows.
> +
> +Suppose we have a symlink “foo”, which points to the file “A/bar”
> +
> +fd = open(“foo”, O_CREAT | O_RDWR)
> +fsync(fd)
> +
> +Both the above operations succeed, but if you crash after fsync, the
> +symlink could be still missing.
> +
> +When you try to open the symlink “foo”, you are actually trying to
> +open the file that the symlink resolves to, which in this case is
> +“A/bar”. When you fsync the inode returned by the open system call, you
> +are actually persisting the file “A/bar” and not the symlink. Note
> +that if the file “A/bar” does not exist and you try the open the
> +symlink “foo” without the O_CREAT flag, then file open will fail. To
> +obtain the file descriptor associated with the symlink inode, you
> +could open the symlink using “O_PATH | O_NOFOLLOW” flags. However, the
> +file descriptor obtained this way can be only used to indicate a
> +location in the file-system tree and to perform operations that act
> +purely at the file descriptor level. Operations like read(), write(),
> +fsync() etc cannot be performed on such file descriptors.
> +
> +Bottomline : You cannot fsync() a symlink.

You can fsync() the parent dir after it is created or removed
to persist that operation.

> +fsync(special files)
> +--------------------
> +Special files in Linux include block and character device files
> +(created using mknod), FIFO (created using mkfifo) etc. Just like the
> +behavior of fsync on symlinks described above, these special files do
> +not have an fsync function defined. Similar to symlinks, you
> +cannot fsync a special file [4].

You can fsync() the parent dir after it is created or removed
to persist that operation.

> +Strictly Ordered Metadata Consistency
> +-------------------------------------
> +With each file system providing varying levels of persistence
> +guarantees, a consensus in this regard, will benefit application
> +developers to work with certain fixed assumptions about file system
> +guarantees. Dave Chinner proposed a unified model called the
> +Strictly Ordered Metadata Consistency (SOMC) [5].
> +
> +Under this scheme, the file system guarantees to persist all previous
> +dependent modifications to the object upon fsync().  If you fsync() an
> +inode, it will persist all the changes required to reference the inode
> +and its data. SOMC can be defined as follows [6]:
> +
> +If op1 precedes op2 in program order (in-memory execution order), and
> +op1 and op2 share a dependency, then op2 must not be observed by a
> +user after recovery without also observing op1.
> +
> +Unfortunately, SOMC's definition depends upon whether two operations
> +share a dependency, which could be file-system specific. It might
> +require a developer to understand file-system internals to know if
> +SOMC would order one operation before another.

That's largely an internal implementation detail, and users should
not have to care about the internal implementation because the
fundamental dependencies are all defined by the directory heirarchy
relationships that users can see and manipulate.

i.e. fs internal dependencies only increase the size of the graph
that is persisted, but it will never be reduced to less than what
the user can observe in the directory heirarchy.

So this can be further refined:

	If op1 precedes op2 in program order (in-memory execution
	order), and op1 and op2 share a user visible reference, then
	op2 must not be observed by a user after recovery without
	also observing op1.

e.g. in the case of the parent directory - the parent has a link
count. Hence every create, unlink, rename, hard link, symlink, etc
operation in a directory modifies a user visible link count
reference.  Hence fsync of one of those children will persist the
directory link count, and then all of the other preceeding
transactions that modified the link count also need to be persisted.

But keep in mind this defines ordering, not the persistence set:

# touch {a,b,c,d}
# touch {1,2,3,4}
# fsync d
<crash>

SOMC doesn't require {1,2,3,4} to be in the persistence set and
hence present after recovery. It only requires {a,b,c,d} to be in
the persistence set.

If you observe XFS behaviour, it will result in {1,2,3,4} also being
included in the persistence set, because it aggregates all the changes to the parent
directory into a single change per journal checkpoint sequence and
hence it cannot separate them at fsync time.

This, however, is a XFS journal implementation detail and not
something required by SOMC. The resulting behaviour is that XFS
generally persists more than SOMC requires, but the persistence set
that XFS calculates always maintains SOMC semantics so should always
does the right thing.

IOWs, a finer grained implementation of change dependencies could
result in providing exact, minimal persistence SOMC behaviour in
every situation, but don't expect that from XFS. It is likely that
experimental, explicit change depedency graph based filesystems like
featherstitch would provide minimal scope SOMC persistence
behaviour, but that's out of the scope of this document.

(*) http://featherstitch.cs.ucla.edu/
http://featherstitch.cs.ucla.edu/publications/featherstitch-sosp07.pdf
https://lwn.net/Articles/354861/

> It is worth noting
> +that a file system can be crash-consistent (according to POSIX),
> +without providing SOMC [7].

"crash-consistent" doesn't mean "data integrity preserving", and
posix only talks about data integrity beahviour. "crash-consistent"
just means the filesystem is not in a corrupt state when it
recovers.

> +As an example, consider the following test case from xfstest
> +generic/342 [8]
> +-------
> +touch A/foo
> +echo “hello” >  A/foo
> +sync
> +
> +mv A/foo A/bar
> +echo “world” > A/foo
> +fsync A/foo
> +CRASH

[whacky utf-8(?) symbols.  Plain ascii text for documents, please.]

> +What would you expect on recovery, if the file system crashed after
> +the final fsync returned successfully?
> +
> +Non-SOMC file systems will not persist the file
> +A/bar because it was not explicitly fsync-ed. But this means, you will
> +find only the file A/foo with data “world” after crash, thereby losing
> +the previously persisted file with data “hello”. You will need to
> +explicitly fsync the directory A to ensure the rename operation is
> +safely persisted on disk.
> +
> +Under SOMC, to correctly reference the new inode via A/foo,
> +the previous rename operation must persist as well. Therefore,
> +fsync() of A/foo will persist the renamed file A/bar as well.
> +On recovery you will find both A/bar (with data “hello”)
> +and A/foo (with data “world”).

You should describe the SOMC behaviour up front in the document,
because that is the behaviour this document is about.  Then describe
how the "man page fsync behaviour" and individual filesystems differ
from SOMC behaviour.

it would also be worth contrasting SOMC to historic ext3 behaviour
(globally ordered metadata and data), because that is the behaviour
that many application devleopers and users still want current
filesystems to emulate.

> +It is noteworthy that xfs, ext4, F2FS (when mounted with fsync_mode=strict)
> +and btrfs provide SOMC-like behaviour in this particular example.
> +However, in writing, only XFS claims to provide SOMC. F2FS aims to provide
> +SOMC when mounted with fsync_mode=strict. It is not clear if ext4 and
> +btrfs provide strictly ordered metadata consistency.

btrfs does not provide SOMC w.r.t. fsync() - that much is clear from
the endless stream of fsync bugs that are being found and fixed.

Also, the hard link behaviour described for ext4 indicates that it
is not truly SOMC, either. From this, I'd consider ext4 a "mostly
SOMC" implementation, but it seems that there are aspects of
ext4/jbd2 dependency and/or atomicity tracking that don't fully
resolve cross-object transactional atomicity dependencies correctly.

Cheers,

Dave.
Amir Goldstein March 14, 2019, 7:19 a.m. UTC | #4
On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > In this file, we document the crash-recovery guarantees
> > provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
> > present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
> > (SOMC), which is provided by xfs. It is not clear to us if other file systems
> > provide SOMC.
>
> FWIW, new kernel documents should be written in rst markup format,
> not plain ascii text.
>
> >
> > Signed-off-by: Jayashree Mohan <jaya@cs.utexas.edu>
> > Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> > ---
> >
> > We would be happy to modify the document if file-system
> > developers claim that their system provides (or aims to provide) SOMC.
> >
> > Changes since v1:
> >   * Addressed few nits identified in the review
> >   * Added the fsync guarantees for F2FS and its SOMC compliance
> > ---
> >  .../filesystems/crash-recovery-guarantees.txt      | 193 +++++++++++++++++++++
> >  1 file changed, 193 insertions(+)
> >  create mode 100644 Documentation/filesystems/crash-recovery-guarantees.txt
> >
> > diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt
> > new file mode 100644
> > index 0000000..be84964
> > --- /dev/null
> > +++ b/Documentation/filesystems/crash-recovery-guarantees.txt
> > @@ -0,0 +1,193 @@
> > +=====================================================================
> > +File System Crash-Recovery Guarantees
> > +=====================================================================
> > +Linux file systems provide certain guarantees to user-space
> > +applications about what happens to their data if the system crashes
> > +(due to power loss or kernel panic). These are termed crash-recovery
> > +guarantees.
>
> These are termed "data integrity guarantees", not "crash recovery
> guarantees".
>
> i.e. crash recovery is generic phrase describing the _mechanism_
> used by some filesystems to implement the data integrity guarantees
> the filesystem provides to userspace applications.
>

Well, if we use the term "data integrity guarantees" we need to make sure
to explain that "data" may also refer to "metadata" as most of the examples
and corner cases in this document are not about whether or or not the file's
data is persisted, but rather about the existence of a directory entry.
Yes, when the file has data, the directory entry existence is a prerequisite
to reading the file's data, but when a file doesn't have any data, like symlinks
sparse files with xattrs, etc, it is important to clarify what we mean by
"integrity".

[...]

> > +ext4
> > +-----
> > +fsync(file) : Ensures that a newly created file's directory entry is
> > +persisted (no need to explicitly persist the parent directory). However,
> > +if you create multiple names of the file (hard links), then their directory
> > +entries are not guaranteed to persist unless each one of the parent
> > +directory entries are persisted [2].
>
> So you use a specific example to indicate an exception where ext4
> needs an explicit parent directory fsync (i.e. hard links to a
> single file across multiple directories). That implies ext4 POSIX
> SIO compliance is questionable, and it is definitely not SOMC
> compliant. Further, it implies that transactional change atomicity
> requirements are also violated. i.e. the inode is journalled with a
> link count equivalent to all links existing, but not all the dirents
> that point to the inode are persisted at the same time.
>
> So from this example, ext4 is not SOMC compliant.
>

I question the claim made by the document about ext4
behavior.
I believe Ted's words [2] may have been misinterpreted.
Ted, can you comment?

> > +fsync(dir) : All file names within the persisted directory will exist,
> > +but does not guarantee file data.
>
> what about the inodes that were created, removed or hard linked?
> Does it ensure they exist (or have been correctly freed) after
> fsync(dir), too?  (that hardlink behaviour makes me question
> everything related to transaction atomicity in ext4 now)
>

Those should also be flushed with the same (or previous)
transaction, either deleted or on orphan list.

[...]

> > +Strictly Ordered Metadata Consistency
> > +-------------------------------------
> > +With each file system providing varying levels of persistence
> > +guarantees, a consensus in this regard, will benefit application
> > +developers to work with certain fixed assumptions about file system
> > +guarantees. Dave Chinner proposed a unified model called the
> > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > +
> > +Under this scheme, the file system guarantees to persist all previous
> > +dependent modifications to the object upon fsync().  If you fsync() an
> > +inode, it will persist all the changes required to reference the inode
> > +and its data. SOMC can be defined as follows [6]:
> > +
> > +If op1 precedes op2 in program order (in-memory execution order), and
> > +op1 and op2 share a dependency, then op2 must not be observed by a
> > +user after recovery without also observing op1.
> > +
> > +Unfortunately, SOMC's definition depends upon whether two operations
> > +share a dependency, which could be file-system specific. It might
> > +require a developer to understand file-system internals to know if
> > +SOMC would order one operation before another.
>
> That's largely an internal implementation detail, and users should
> not have to care about the internal implementation because the
> fundamental dependencies are all defined by the directory heirarchy
> relationships that users can see and manipulate.
>
> i.e. fs internal dependencies only increase the size of the graph
> that is persisted, but it will never be reduced to less than what
> the user can observe in the directory heirarchy.
>
> So this can be further refined:
>
>         If op1 precedes op2 in program order (in-memory execution
>         order), and op1 and op2 share a user visible reference, then
>         op2 must not be observed by a user after recovery without
>         also observing op1.
>
> e.g. in the case of the parent directory - the parent has a link
> count. Hence every create, unlink, rename, hard link, symlink, etc
> operation in a directory modifies a user visible link count
> reference.  Hence fsync of one of those children will persist the
> directory link count, and then all of the other preceeding
> transactions that modified the link count also need to be persisted.
>

One thing that bothers me is that the definition of SOMC (as well as
your refined definition) doesn't mention fsync at all, but all the examples
only discuss use cases with fsync.

I personally find SOMC guaranty *much* more powerful in the absence
of fsync. I have an application that creates sparse files, sets xattrs, mtime
and moves them into place. The observed requirement is that after crash
those files either exist with correct mtime, xattr or not exist.
To my understanding, SOMC provides a guaranty that the application does
not need to do any fsync at all, which is very desired when many such
operations are performed while other users are doing data I/O on the same
filesystem.

For me. This is a very powerful feature of the filesystem and if we can (?)
document this behavior and commit to it, that could benefit application
developers.

Thanks,
Amir.
Dave Chinner March 15, 2019, 3:03 a.m. UTC | #5
On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > +Strictly Ordered Metadata Consistency
> > > +-------------------------------------
> > > +With each file system providing varying levels of persistence
> > > +guarantees, a consensus in this regard, will benefit application
> > > +developers to work with certain fixed assumptions about file system
> > > +guarantees. Dave Chinner proposed a unified model called the
> > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > +
> > > +Under this scheme, the file system guarantees to persist all previous
> > > +dependent modifications to the object upon fsync().  If you fsync() an
> > > +inode, it will persist all the changes required to reference the inode
> > > +and its data. SOMC can be defined as follows [6]:
> > > +
> > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > +user after recovery without also observing op1.
> > > +
> > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > +share a dependency, which could be file-system specific. It might
> > > +require a developer to understand file-system internals to know if
> > > +SOMC would order one operation before another.
> >
> > That's largely an internal implementation detail, and users should
> > not have to care about the internal implementation because the
> > fundamental dependencies are all defined by the directory heirarchy
> > relationships that users can see and manipulate.
> >
> > i.e. fs internal dependencies only increase the size of the graph
> > that is persisted, but it will never be reduced to less than what
> > the user can observe in the directory heirarchy.
> >
> > So this can be further refined:
> >
> >         If op1 precedes op2 in program order (in-memory execution
> >         order), and op1 and op2 share a user visible reference, then
> >         op2 must not be observed by a user after recovery without
> >         also observing op1.
> >
> > e.g. in the case of the parent directory - the parent has a link
> > count. Hence every create, unlink, rename, hard link, symlink, etc
> > operation in a directory modifies a user visible link count
> > reference.  Hence fsync of one of those children will persist the
> > directory link count, and then all of the other preceeding
> > transactions that modified the link count also need to be persisted.
> >
> 
> One thing that bothers me is that the definition of SOMC (as well as
> your refined definition) doesn't mention fsync at all, but all the examples
> only discuss use cases with fsync.

You can't discuss operational ordering without a point in time to
use as a reference for that ordering.  SOMC behaviour is preserved
at any point the filesystem checkpoints itself, and the only thing
that changes is the scope of that checkpoint. fsync is just a
convenient, widely understood, minimum dependecy reference point
that people can reason from. All the interesting ordering problems
come from minimum dependecy reference point (i.e. fsync()), not from
background filesystem-wide checkpoints.

> I personally find SOMC guaranty *much* more powerful in the absence
> of fsync. I have an application that creates sparse files, sets xattrs, mtime
> and moves them into place. The observed requirement is that after crash
> those files either exist with correct mtime, xattr or not exist.

SOMC does not provide the guarantees you seek in the absence of a
known data synchronisation point:

	a) a background metadata checkpoint can land anywhere in
	that series of operations and hence recovery will land in an
	intermediate state.

	b) there is data that needs writing, and SOMC provides no
	ordering guarantees for data. So after recovery file could
	exist with correct mtime and xattrs, but have no (or
	partial) data.

> To my understanding, SOMC provides a guaranty that the application does
> not need to do any fsync at all,

Absolutely not true. If the application has atomic creation
requirements that need multiple syscalls to set up, it must
implement them itself and use fsync to synchronise data and metadata
before the "atomic create" operation that makes it visible to the
application.

SOMC only guarantees what /metadata/ you see at a fileystem
synchronisation point; it does not provide ACID semantics to a
random set of system calls into the filesystem.

Cheers,

Dave.
Amir Goldstein March 15, 2019, 3:44 a.m. UTC | #6
On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
> > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > +Strictly Ordered Metadata Consistency
> > > > +-------------------------------------
> > > > +With each file system providing varying levels of persistence
> > > > +guarantees, a consensus in this regard, will benefit application
> > > > +developers to work with certain fixed assumptions about file system
> > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > +
> > > > +Under this scheme, the file system guarantees to persist all previous
> > > > +dependent modifications to the object upon fsync().  If you fsync() an
> > > > +inode, it will persist all the changes required to reference the inode
> > > > +and its data. SOMC can be defined as follows [6]:
> > > > +
> > > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > > +user after recovery without also observing op1.
> > > > +
> > > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > > +share a dependency, which could be file-system specific. It might
> > > > +require a developer to understand file-system internals to know if
> > > > +SOMC would order one operation before another.
> > >
> > > That's largely an internal implementation detail, and users should
> > > not have to care about the internal implementation because the
> > > fundamental dependencies are all defined by the directory heirarchy
> > > relationships that users can see and manipulate.
> > >
> > > i.e. fs internal dependencies only increase the size of the graph
> > > that is persisted, but it will never be reduced to less than what
> > > the user can observe in the directory heirarchy.
> > >
> > > So this can be further refined:
> > >
> > >         If op1 precedes op2 in program order (in-memory execution
> > >         order), and op1 and op2 share a user visible reference, then
> > >         op2 must not be observed by a user after recovery without
> > >         also observing op1.
> > >
> > > e.g. in the case of the parent directory - the parent has a link
> > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > operation in a directory modifies a user visible link count
> > > reference.  Hence fsync of one of those children will persist the
> > > directory link count, and then all of the other preceeding
> > > transactions that modified the link count also need to be persisted.
> > >
> >
> > One thing that bothers me is that the definition of SOMC (as well as
> > your refined definition) doesn't mention fsync at all, but all the examples
> > only discuss use cases with fsync.
>
> You can't discuss operational ordering without a point in time to
> use as a reference for that ordering.  SOMC behaviour is preserved
> at any point the filesystem checkpoints itself, and the only thing
> that changes is the scope of that checkpoint. fsync is just a
> convenient, widely understood, minimum dependecy reference point
> that people can reason from. All the interesting ordering problems
> come from minimum dependecy reference point (i.e. fsync()), not from
> background filesystem-wide checkpoints.
>

Yes, I was referring to rename as a commonly used operation used
by application as "metadata barrier".

> > I personally find SOMC guaranty *much* more powerful in the absence
> > of fsync. I have an application that creates sparse files, sets xattrs, mtime
> > and moves them into place. The observed requirement is that after crash
> > those files either exist with correct mtime, xattr or not exist.

I wasn't clear:
1. "sparse" meaning no data at all only hole.
2. "exist" meaning found at rename destination
Naturally, its applications responsibility to cleanup temp files that were
not moved into rename destination.

>
> SOMC does not provide the guarantees you seek in the absence of a
> known data synchronisation point:
>
>         a) a background metadata checkpoint can land anywhere in
>         that series of operations and hence recovery will land in an
>         intermediate state.

Yes, that results in temp files that would be cleaned up on recovery.

>
>         b) there is data that needs writing, and SOMC provides no
>         ordering guarantees for data. So after recovery file could
>         exist with correct mtime and xattrs, but have no (or
>         partial) data.
>

There is no data in my use case, only metadata, that is why
SOMC without fsync is an option.

> > To my understanding, SOMC provides a guaranty that the application does
> > not need to do any fsync at all,
>
> Absolutely not true. If the application has atomic creation
> requirements that need multiple syscalls to set up, it must
> implement them itself and use fsync to synchronise data and metadata
> before the "atomic create" operation that makes it visible to the
> application.
>
> SOMC only guarantees what /metadata/ you see at a fileystem
> synchronisation point; it does not provide ACID semantics to a
> random set of system calls into the filesystem.
>

So I re-state my claim above after having explained the use case.
IMO, SOMC guaranties is an important feature even in the absence
of any fsync, because of the ability to use some metadata operation
(e.g. rename, link) as metadata barriers.
I am wrong about this?

Thanks,
Amir.
Dave Chinner March 17, 2019, 10:16 p.m. UTC | #7
On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > +Strictly Ordered Metadata Consistency
> > > > > +-------------------------------------
> > > > > +With each file system providing varying levels of persistence
> > > > > +guarantees, a consensus in this regard, will benefit application
> > > > > +developers to work with certain fixed assumptions about file system
> > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > +
> > > > > +Under this scheme, the file system guarantees to persist all previous
> > > > > +dependent modifications to the object upon fsync().  If you fsync() an
> > > > > +inode, it will persist all the changes required to reference the inode
> > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > +
> > > > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > > > +user after recovery without also observing op1.
> > > > > +
> > > > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > > > +share a dependency, which could be file-system specific. It might
> > > > > +require a developer to understand file-system internals to know if
> > > > > +SOMC would order one operation before another.
> > > >
> > > > That's largely an internal implementation detail, and users should
> > > > not have to care about the internal implementation because the
> > > > fundamental dependencies are all defined by the directory heirarchy
> > > > relationships that users can see and manipulate.
> > > >
> > > > i.e. fs internal dependencies only increase the size of the graph
> > > > that is persisted, but it will never be reduced to less than what
> > > > the user can observe in the directory heirarchy.
> > > >
> > > > So this can be further refined:
> > > >
> > > >         If op1 precedes op2 in program order (in-memory execution
> > > >         order), and op1 and op2 share a user visible reference, then
> > > >         op2 must not be observed by a user after recovery without
> > > >         also observing op1.
> > > >
> > > > e.g. in the case of the parent directory - the parent has a link
> > > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > > operation in a directory modifies a user visible link count
> > > > reference.  Hence fsync of one of those children will persist the
> > > > directory link count, and then all of the other preceeding
> > > > transactions that modified the link count also need to be persisted.
> > > >
> > >
> > > One thing that bothers me is that the definition of SOMC (as well as
> > > your refined definition) doesn't mention fsync at all, but all the examples
> > > only discuss use cases with fsync.
> >
> > You can't discuss operational ordering without a point in time to
> > use as a reference for that ordering.  SOMC behaviour is preserved
> > at any point the filesystem checkpoints itself, and the only thing
> > that changes is the scope of that checkpoint. fsync is just a
> > convenient, widely understood, minimum dependecy reference point
> > that people can reason from. All the interesting ordering problems
> > come from minimum dependecy reference point (i.e. fsync()), not from
> > background filesystem-wide checkpoints.
> >
> 
> Yes, I was referring to rename as a commonly used operation used
> by application as "metadata barrier".

What is a "metadata barrier" and what are it's semantics supposed to
be?

> > > I personally find SOMC guaranty *much* more powerful in the absence
> > > of fsync. I have an application that creates sparse files, sets xattrs, mtime
> > > and moves them into place. The observed requirement is that after crash
> > > those files either exist with correct mtime, xattr or not exist.
> 
> I wasn't clear:
> 1. "sparse" meaning no data at all only hole.

That's not sparse, that is an empty file or "contains no data".
"Sparse" means the file has "sparse data" - the data in the file is
separated by holes. A file that is just a single hole does not
contain "sparse data", it contains no data at all.

IOWs, if you mean "file has no data in it", then say that as it is a
clear and unambiguous statement of what the file contains.

> 2. "exist" meaning found at rename destination
> Naturally, its applications responsibility to cleanup temp files that were
> not moved into rename destination.
> 
> >
> > SOMC does not provide the guarantees you seek in the absence of a
> > known data synchronisation point:
> >
> >         a) a background metadata checkpoint can land anywhere in
> >         that series of operations and hence recovery will land in an
> >         intermediate state.
> 
> Yes, that results in temp files that would be cleaned up on recovery.

Ambiguous. "recovery" is something filesystems do to bring the
filesystem into a consistent state after a crash. If you are talking
about applicaiton level behaviour, then you need to make that
explicit.

i.e. I can /assume/ you are talking about application level recovery
from your previous statement, but that assumption is obviously wrong
if the application is using O_TMPFILE and linkat rather than rename,
in which case it will be fileystem level recovery that is doing the
cleanup. Ambiguous, yes?


> >         b) there is data that needs writing, and SOMC provides no
> >         ordering guarantees for data. So after recovery file could
> >         exist with correct mtime and xattrs, but have no (or
> >         partial) data.
> >
> 
> There is no data in my use case, only metadata, that is why
> SOMC without fsync is an option.

Perhaps, but I am not clear on exactly what you are proposing
because I don't know what the hell a "metadata barrier" is, what it
does or what it implies for filesystem integrity operations...

> > > To my understanding, SOMC provides a guaranty that the application does
> > > not need to do any fsync at all,
> >
> > Absolutely not true. If the application has atomic creation
> > requirements that need multiple syscalls to set up, it must
> > implement them itself and use fsync to synchronise data and metadata
> > before the "atomic create" operation that makes it visible to the
> > application.
> >
> > SOMC only guarantees what /metadata/ you see at a fileystem
> > synchronisation point; it does not provide ACID semantics to a
> > random set of system calls into the filesystem.
> >
> 
> So I re-state my claim above after having explained the use case.

With words that I can only guess the meaning of.

Amir, if you are asking a complex question as to whether something
conforms to a specification, then please slow down and take the time
to define all the terms, the initial state, the observable behaviour
that you expect to see, etc in clear, unambiguous and well defined
terms.  Otherwise the question cannot be answered....

Cheers,

Dave.
Theodore Ts'o March 18, 2019, 2:48 a.m. UTC | #8
On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > +ext4
> > > +-----
> > > +fsync(file) : Ensures that a newly created file's directory entry is
> > > +persisted (no need to explicitly persist the parent directory). However,
> > > +if you create multiple names of the file (hard links), then their directory
> > > +entries are not guaranteed to persist unless each one of the parent
> > > +directory entries are persisted [2].
> >
> > So you use a specific example to indicate an exception where ext4
> > needs an explicit parent directory fsync (i.e. hard links to a
> > single file across multiple directories). That implies ext4 POSIX
> > SIO compliance is questionable, and it is definitely not SOMC
> > compliant. Further, it implies that transactional change atomicity
> > requirements are also violated. i.e. the inode is journalled with a
> > link count equivalent to all links existing, but not all the dirents
> > that point to the inode are persisted at the same time.
> >
> > So from this example, ext4 is not SOMC compliant.
> 
> I question the claim made by the document about ext4
> behavior.
> I believe Ted's words [2] may have been misinterpreted.
> Ted, can you comment?

We need to be really careful about claims and specifications.
Consider the following sequence of events (error checking ignored;
assume that it's there).

       unlink("foo");  // make sure the file "foo" doesn't exist
       unlink("bar");  // make sure the file "bar" doesn't exist
       fd = open("foo", O_CREAT|O_WRONLY);
       write(fd, "foo bar data", 12);
       fsync(fd);
       // if we crash here, the file "foo" will exist, and you will be able
       // to read 12 bytes, and it will be present
       link("foo", "bar");
       fdatasync(fd);
       // if we crash here, there is no guarantee that the hard link "bar"
       // will be persisted

I believe this is perfectly compliant with _POSIX_SYNCHRONIZED_IO.

If the last fsyncdata(fd) is replaced with fsync(fd), the link(2) will
touch the ctime, so in order only to persist the ctime change, as a
side effect we will also update the directory entry so the existence
of "bar" will be persisted.

The bottom line, though, is I remain *very* skeptical about people who
want to document and then tie the hands of file system developers
about guarantees that go beyond Posix unless we have a very careful
discussion about what benefit this will provide application
developers, at least in general.  If application developers start
depending on behaviors beyond POSIX, it limits the ability of file
system developers to innovate in order to improve performance.

There may be cases where it's worth it, and there may be cases where
it's pretty clear that the laws of physics such that certain things
that go beyond POSIX will always be true.  But before we encourage
application developers to go beyond POSIX, we really should have this
conversation first.

For example, ext4 implements a guarantee that goes beyond POSIX, in
that if you create a file, say, "foo.new", and then you rename that
file such that it replaces an existing file, say "foo", then after the
rename system call, we will initiate asynchronous writeback on
"foo.new".  This is free if the application programmer has alrady
called fsync on "foo.new".  However, for an sloppy application which
doesn't bother to call fsync(2), for which Darrick informs me includes
"rpm", it saves you from lost files if you immediately reboot.

I do this, because there are tons of sloppy application programmers,
and so they outnumber file system developers.  However, this is not
documented behavior, nor is it guaranteed by POSIX!  I'm told by
Darrick that XFS doesn't do this, and he believes the XFS developers
would refuse to add such hacks, because it accomodates incompetent
userspace programmers.

Perhaps the right answer is to yell at the application programmers who
make these mistakes.  After all, the fact that ext4 accomodates
incompetence could be argued is leading to decreased application
quality and decreased portability.  But at least back in the O_PONIES
era, I looked at multiple text editors supplied by both GNOME and KDE,
and I discovered to my horror they were writing files in an extremely
unsafe manner.  (Some of them also were simply opening an existing
file with O_TRUNC, and then rewriting the text file's data, and NOT
bother calling fsync afterwards; so ext4 also has a hack so for files
opened with O_TRUNC where an existing file is truncated, on the close,
we will initiate writeback.  I chose this as being relatively low
overhead, because no competently implemented text editor should be
saving files in this way....)

Whether or not ext4 should accomodate application programmers by going
on POSIX, I believe very strongly that it should *not* be documented,
since it just encourages the bad application programming practice.
It's there just as a backstop, and in fact, it's done as an
asynchronous writeback, not as a data integrity writeback.  So it is
*not* something people should be relying on.

So before we document behavior that goes beyond POSIX, we should think
*very* carefully if this is something that we want to be encouraging
application programmers to rely on this sort of thing.

					- Ted
Amir Goldstein March 18, 2019, 5:46 a.m. UTC | #9
On Mon, Mar 18, 2019 at 4:48 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > > +ext4
> > > > +-----
> > > > +fsync(file) : Ensures that a newly created file's directory entry is
> > > > +persisted (no need to explicitly persist the parent directory). However,
> > > > +if you create multiple names of the file (hard links), then their directory
> > > > +entries are not guaranteed to persist unless each one of the parent
> > > > +directory entries are persisted [2].
> > >
> > > So you use a specific example to indicate an exception where ext4
> > > needs an explicit parent directory fsync (i.e. hard links to a
> > > single file across multiple directories). That implies ext4 POSIX
> > > SIO compliance is questionable, and it is definitely not SOMC
> > > compliant. Further, it implies that transactional change atomicity
> > > requirements are also violated. i.e. the inode is journalled with a
> > > link count equivalent to all links existing, but not all the dirents
> > > that point to the inode are persisted at the same time.
> > >
> > > So from this example, ext4 is not SOMC compliant.
> >
> > I question the claim made by the document about ext4
> > behavior.
> > I believe Ted's words [2] may have been misinterpreted.
> > Ted, can you comment?
>
> We need to be really careful about claims and specifications.
> Consider the following sequence of events (error checking ignored;
> assume that it's there).
>
>        unlink("foo");  // make sure the file "foo" doesn't exist
>        unlink("bar");  // make sure the file "bar" doesn't exist
>        fd = open("foo", O_CREAT|O_WRONLY);
>        write(fd, "foo bar data", 12);
>        fsync(fd);
>        // if we crash here, the file "foo" will exist, and you will be able
>        // to read 12 bytes, and it will be present
>        link("foo", "bar");
>        fdatasync(fd);
>        // if we crash here, there is no guarantee that the hard link "bar"
>        // will be persisted
>
> I believe this is perfectly compliant with _POSIX_SYNCHRONIZED_IO.
>
> If the last fsyncdata(fd) is replaced with fsync(fd), the link(2) will
> touch the ctime, so in order only to persist the ctime change, as a
> side effect we will also update the directory entry so the existence
> of "bar" will be persisted.
>
> The bottom line, though, is I remain *very* skeptical about people who
> want to document and then tie the hands of file system developers
> about guarantees that go beyond Posix unless we have a very careful
> discussion about what benefit this will provide application
> developers, at least in general.  If application developers start
> depending on behaviors beyond POSIX, it limits the ability of file
> system developers to innovate in order to improve performance.
>

That is understandable. But I believe the ACK Jayashree is looking
for from you has to do with the SOMC guaranty. That has to do with
ordering of metadata operations, which are all journalled by default
in ext4 and has nothing to do with writeback hacks for dodgy apps.

> There may be cases where it's worth it, and there may be cases where
> it's pretty clear that the laws of physics such that certain things
> that go beyond POSIX will always be true.  But before we encourage
> application developers to go beyond POSIX, we really should have this
> conversation first.
>
> For example, ext4 implements a guarantee that goes beyond POSIX, in
> that if you create a file, say, "foo.new", and then you rename that
> file such that it replaces an existing file, say "foo", then after the
> rename system call, we will initiate asynchronous writeback on
> "foo.new".  This is free if the application programmer has alrady
> called fsync on "foo.new".  However, for an sloppy application which
> doesn't bother to call fsync(2), for which Darrick informs me includes
> "rpm", it saves you from lost files if you immediately reboot.
>
> I do this, because there are tons of sloppy application programmers,
> and so they outnumber file system developers.  However, this is not
> documented behavior, nor is it guaranteed by POSIX!  I'm told by
> Darrick that XFS doesn't do this, and he believes the XFS developers
> would refuse to add such hacks, because it accomodates incompetent
> userspace programmers.
>
> Perhaps the right answer is to yell at the application programmers who
> make these mistakes.  After all, the fact that ext4 accomodates
> incompetence could be argued is leading to decreased application
> quality and decreased portability.  But at least back in the O_PONIES
> era, I looked at multiple text editors supplied by both GNOME and KDE,
> and I discovered to my horror they were writing files in an extremely
> unsafe manner.  (Some of them also were simply opening an existing
> file with O_TRUNC, and then rewriting the text file's data, and NOT
> bother calling fsync afterwards; so ext4 also has a hack so for files
> opened with O_TRUNC where an existing file is truncated, on the close,
> we will initiate writeback.  I chose this as being relatively low
> overhead, because no competently implemented text editor should be
> saving files in this way....)
>
> Whether or not ext4 should accomodate application programmers by going
> on POSIX, I believe very strongly that it should *not* be documented,
> since it just encourages the bad application programming practice.
> It's there just as a backstop, and in fact, it's done as an
> asynchronous writeback, not as a data integrity writeback.  So it is
> *not* something people should be relying on.
>
> So before we document behavior that goes beyond POSIX, we should think
> *very* carefully if this is something that we want to be encouraging
> application programmers to rely on this sort of thing.
>

Perhaps it makes sense that if a behavior is already documented or
will be documented in an xfstest with _require_metadata_journaling
then it might as well be documented as well. Perhaps not.

Thanks,
Amir.
Amir Goldstein March 18, 2019, 7:13 a.m. UTC | #10
On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > > +Strictly Ordered Metadata Consistency
> > > > > > +-------------------------------------
> > > > > > +With each file system providing varying levels of persistence
> > > > > > +guarantees, a consensus in this regard, will benefit application
> > > > > > +developers to work with certain fixed assumptions about file system
> > > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > > +
> > > > > > +Under this scheme, the file system guarantees to persist all previous
> > > > > > +dependent modifications to the object upon fsync().  If you fsync() an
> > > > > > +inode, it will persist all the changes required to reference the inode
> > > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > > +
> > > > > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > > > > +user after recovery without also observing op1.
> > > > > > +
> > > > > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > > > > +share a dependency, which could be file-system specific. It might
> > > > > > +require a developer to understand file-system internals to know if
> > > > > > +SOMC would order one operation before another.
> > > > >
> > > > > That's largely an internal implementation detail, and users should
> > > > > not have to care about the internal implementation because the
> > > > > fundamental dependencies are all defined by the directory heirarchy
> > > > > relationships that users can see and manipulate.
> > > > >
> > > > > i.e. fs internal dependencies only increase the size of the graph
> > > > > that is persisted, but it will never be reduced to less than what
> > > > > the user can observe in the directory heirarchy.
> > > > >
> > > > > So this can be further refined:
> > > > >
> > > > >         If op1 precedes op2 in program order (in-memory execution
> > > > >         order), and op1 and op2 share a user visible reference, then
> > > > >         op2 must not be observed by a user after recovery without
> > > > >         also observing op1.
> > > > >
> > > > > e.g. in the case of the parent directory - the parent has a link
> > > > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > > > operation in a directory modifies a user visible link count
> > > > > reference.  Hence fsync of one of those children will persist the
> > > > > directory link count, and then all of the other preceeding
> > > > > transactions that modified the link count also need to be persisted.
> > > > >
> > > >
> > > > One thing that bothers me is that the definition of SOMC (as well as
> > > > your refined definition) doesn't mention fsync at all, but all the examples
> > > > only discuss use cases with fsync.
> > >
> > > You can't discuss operational ordering without a point in time to
> > > use as a reference for that ordering.  SOMC behaviour is preserved
> > > at any point the filesystem checkpoints itself, and the only thing
> > > that changes is the scope of that checkpoint. fsync is just a
> > > convenient, widely understood, minimum dependecy reference point
> > > that people can reason from. All the interesting ordering problems
> > > come from minimum dependecy reference point (i.e. fsync()), not from
> > > background filesystem-wide checkpoints.
> > >
> >
> > Yes, I was referring to rename as a commonly used operation used
> > by application as "metadata barrier".
>
> What is a "metadata barrier" and what are it's semantics supposed to
> be?
>

In this context I mean that effects of metadata operations before the
barrier (e.g. setxattr, truncate) must be observed after crash if the effects
of barrier operation (e.g. file was renamed) are observed after crash.

> > > > I personally find SOMC guaranty *much* more powerful in the absence
> > > > of fsync. I have an application that creates sparse files, sets xattrs, mtime
> > > > and moves them into place. The observed requirement is that after crash
> > > > those files either exist with correct mtime, xattr or not exist.
> >
> > I wasn't clear:
> > 1. "sparse" meaning no data at all only hole.
>
> That's not sparse, that is an empty file or "contains no data".
> "Sparse" means the file has "sparse data" - the data in the file is
> separated by holes. A file that is just a single hole does not
> contain "sparse data", it contains no data at all.
>
> IOWs, if you mean "file has no data in it", then say that as it is a
> clear and unambiguous statement of what the file contains.
>
> > 2. "exist" meaning found at rename destination
> > Naturally, its applications responsibility to cleanup temp files that were
> > not moved into rename destination.
> >
> > >
> > > SOMC does not provide the guarantees you seek in the absence of a
> > > known data synchronisation point:
> > >
> > >         a) a background metadata checkpoint can land anywhere in
> > >         that series of operations and hence recovery will land in an
> > >         intermediate state.
> >
> > Yes, that results in temp files that would be cleaned up on recovery.
>
> Ambiguous. "recovery" is something filesystems do to bring the
> filesystem into a consistent state after a crash. If you are talking
> about applicaiton level behaviour, then you need to make that
> explicit.
>
> i.e. I can /assume/ you are talking about application level recovery
> from your previous statement, but that assumption is obviously wrong
> if the application is using O_TMPFILE and linkat rather than rename,
> in which case it will be fileystem level recovery that is doing the
> cleanup. Ambiguous, yes?
>

Yes. From the application writers POV, the fact that doing things
"atomically" is possible is what matters. Whether filesystem provides
the recovery from incomplete transaction (O_TMPFILE+linkat), or
application can cleanup leftovers on startup (rename).
I have some applications that use the former and some that use the
latter for directories and for portability with OS/fs that don't have
O_TMPFILE.

>
> > >         b) there is data that needs writing, and SOMC provides no
> > >         ordering guarantees for data. So after recovery file could
> > >         exist with correct mtime and xattrs, but have no (or
> > >         partial) data.
> > >
> >
> > There is no data in my use case, only metadata, that is why
> > SOMC without fsync is an option.
>
> Perhaps, but I am not clear on exactly what you are proposing
> because I don't know what the hell a "metadata barrier" is, what it
> does or what it implies for filesystem integrity operations...
>
> > > > To my understanding, SOMC provides a guaranty that the application does
> > > > not need to do any fsync at all,
> > >
> > > Absolutely not true. If the application has atomic creation
> > > requirements that need multiple syscalls to set up, it must
> > > implement them itself and use fsync to synchronise data and metadata
> > > before the "atomic create" operation that makes it visible to the
> > > application.
> > >
> > > SOMC only guarantees what /metadata/ you see at a fileystem
> > > synchronisation point; it does not provide ACID semantics to a
> > > random set of system calls into the filesystem.
> > >
> >
> > So I re-state my claim above after having explained the use case.
>
> With words that I can only guess the meaning of.
>
> Amir, if you are asking a complex question as to whether something
> conforms to a specification, then please slow down and take the time
> to define all the terms, the initial state, the observable behaviour
> that you expect to see, etc in clear, unambiguous and well defined
> terms.  Otherwise the question cannot be answered....
>

Sure. TBH, I didn't even dare to ask the complex question yet,
because it was hard for me to define all terms. I sketched the
use case with the example of create+setxattr+truncate+rename
because I figured it is rather easy to understand.

The more complex question has do to with explicit "data dependency"
operation. At the moment, I will not explain what that means in details,
but I am sure you can figure it out.
With fdatasync+rename, fdatasync created a dependency between
data and metadata of the file, so with SOMC, if file is observed after
crash in rename destination, it also contains the data changes before
fdatasync. But fdatasync gives a stringer guaranty than what
my application actually needs, because in many cases it will cause
journal flush. What it really needs is filemap_write_and_wait().
Metadata doesn't need to be flushed as rename takes care of
metadata ordering guaranties.
As far as I can tell, there is no "official" API to do what I need
and there is certainly no documentation about this expected behavior.
Apologies, if above was not clear, I promise to explain in person
during LSF to whoever is interested.

Judging by the volume and passion of this thread, I think a
session on LSF fs track would probably be a good idea.
[CC Josef and Anna.]

I find our behavior as a group of filesystem developers on this matter
slightly bi-polar - on the one hand we wish to maintain implementation
freedom for future performance improvements and don't wish to commit
to existing behavior by documenting it. On the other hand, we wish to
not break existing application, whose expectations from filesystems are
far from what filesystems guaranty in documentation.

There is no one good answer that fits all aspects of this subject and I
personally agree with Ted on not wanting to document the ext4 "hacks"
that are meant to cater misbehaving applications.

I think it is good that Jayashree posted this patch as a basis for discussion
of what needs to be documented and how.
Eventually, instead of trying to formalize filesystem expected behavior, it
might be better to just encode the expected crash behavior tests
in a readable manner, as Jayashree already started to do.
Or maybe there is room for both documentation and tests.

Thanks,
Amir.
Vijay Chidambaram March 19, 2019, 2:37 a.m. UTC | #11
For new folks on the thread, I'm Vijay Chidambaram, prof at UT Austin
and Jayashree's advisor. We recently developed CrashMonkey, a tool for
finding crash-consistency bugs in file systems. As part of the
research effort, we had a lot of conversations with file-system
developers to understand the guarantees provided by different file
systems. This patch was inspired by the thought that we should quickly
document what we know about the data integrity guarantees of different
file systems. We did not expect to spur debate!

Thanks Dave, Amir, and Ted for the discussion. We will incorporate
these comments into the next patch. If it is better to wait until a
consensus is reached after the LSF meeting, we'd be happy to do so.

On Mon, Mar 18, 2019 at 2:14 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > > > +Strictly Ordered Metadata Consistency
> > > > > > > +-------------------------------------
> > > > > > > +With each file system providing varying levels of persistence
> > > > > > > +guarantees, a consensus in this regard, will benefit application
> > > > > > > +developers to work with certain fixed assumptions about file system
> > > > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > > > +
> > > > > > > +Under this scheme, the file system guarantees to persist all previous
> > > > > > > +dependent modifications to the object upon fsync().  If you fsync() an
> > > > > > > +inode, it will persist all the changes required to reference the inode
> > > > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > > > +
> > > > > > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > > > > > +user after recovery without also observing op1.
> > > > > > > +
> > > > > > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > > > > > +share a dependency, which could be file-system specific. It might
> > > > > > > +require a developer to understand file-system internals to know if
> > > > > > > +SOMC would order one operation before another.
> > > > > >
> > > > > > That's largely an internal implementation detail, and users should
> > > > > > not have to care about the internal implementation because the
> > > > > > fundamental dependencies are all defined by the directory heirarchy
> > > > > > relationships that users can see and manipulate.
> > > > > >
> > > > > > i.e. fs internal dependencies only increase the size of the graph
> > > > > > that is persisted, but it will never be reduced to less than what
> > > > > > the user can observe in the directory heirarchy.
> > > > > >
> > > > > > So this can be further refined:
> > > > > >
> > > > > >         If op1 precedes op2 in program order (in-memory execution
> > > > > >         order), and op1 and op2 share a user visible reference, then
> > > > > >         op2 must not be observed by a user after recovery without
> > > > > >         also observing op1.
> > > > > >
> > > > > > e.g. in the case of the parent directory - the parent has a link
> > > > > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > > > > operation in a directory modifies a user visible link count
> > > > > > reference.  Hence fsync of one of those children will persist the
> > > > > > directory link count, and then all of the other preceeding
> > > > > > transactions that modified the link count also need to be persisted.

Dave, how did SOMC come about? Even XFS persists more than the minimum
set required by SOMC. Is SOMC most useful as a sort of intuitive
guideline as to what users should expect to see after recovery?

I found your notes about POSIX SIO interesting, and will incorporate
it into the next version of the patch. Should POSIX SIO be agreed upon
between file systems as the set of guarantees to provide (especially
since this is what glibc assumes)? I think SOMC is stronger than POSIX
SIO.

> In this context I mean that effects of metadata operations before the
> barrier (e.g. setxattr, truncate) must be observed after crash if the effects
> of barrier operation (e.g. file was renamed) are observed after crash.
>
> > > > > I personally find SOMC guaranty *much* more powerful in the absence
> > > > > of fsync. I have an application that creates sparse files, sets xattrs, mtime
> > > > > and moves them into place. The observed requirement is that after crash
> > > > > those files either exist with correct mtime, xattr or not exist.
> > >
> > > I wasn't clear:
> > > 1. "sparse" meaning no data at all only hole.
> >
> > That's not sparse, that is an empty file or "contains no data".
> > "Sparse" means the file has "sparse data" - the data in the file is
> > separated by holes. A file that is just a single hole does not
> > contain "sparse data", it contains no data at all.
> >
> > IOWs, if you mean "file has no data in it", then say that as it is a
> > clear and unambiguous statement of what the file contains.
> >
> > > 2. "exist" meaning found at rename destination
> > > Naturally, its applications responsibility to cleanup temp files that were
> > > not moved into rename destination.
> > >
> > > >
> > > > SOMC does not provide the guarantees you seek in the absence of a
> > > > known data synchronisation point:
> > > >
> > > >         a) a background metadata checkpoint can land anywhere in
> > > >         that series of operations and hence recovery will land in an
> > > >         intermediate state.
> > >
> > > Yes, that results in temp files that would be cleaned up on recovery.
> >
> > Ambiguous. "recovery" is something filesystems do to bring the
> > filesystem into a consistent state after a crash. If you are talking
> > about applicaiton level behaviour, then you need to make that
> > explicit.
> >
> > i.e. I can /assume/ you are talking about application level recovery
> > from your previous statement, but that assumption is obviously wrong
> > if the application is using O_TMPFILE and linkat rather than rename,
> > in which case it will be fileystem level recovery that is doing the
> > cleanup. Ambiguous, yes?
> >
>
> Yes. From the application writers POV, the fact that doing things
> "atomically" is possible is what matters. Whether filesystem provides
> the recovery from incomplete transaction (O_TMPFILE+linkat), or
> application can cleanup leftovers on startup (rename).
> I have some applications that use the former and some that use the
> latter for directories and for portability with OS/fs that don't have
> O_TMPFILE.
>
> >
> > > >         b) there is data that needs writing, and SOMC provides no
> > > >         ordering guarantees for data. So after recovery file could
> > > >         exist with correct mtime and xattrs, but have no (or
> > > >         partial) data.
> > > >
> > >
> > > There is no data in my use case, only metadata, that is why
> > > SOMC without fsync is an option.
> >
> > Perhaps, but I am not clear on exactly what you are proposing
> > because I don't know what the hell a "metadata barrier" is, what it
> > does or what it implies for filesystem integrity operations...
> >
> > > > > To my understanding, SOMC provides a guaranty that the application does
> > > > > not need to do any fsync at all,
> > > >
> > > > Absolutely not true. If the application has atomic creation
> > > > requirements that need multiple syscalls to set up, it must
> > > > implement them itself and use fsync to synchronise data and metadata
> > > > before the "atomic create" operation that makes it visible to the
> > > > application.
> > > >
> > > > SOMC only guarantees what /metadata/ you see at a fileystem
> > > > synchronisation point; it does not provide ACID semantics to a
> > > > random set of system calls into the filesystem.
> > > >
> > >
> > > So I re-state my claim above after having explained the use case.
> >
> > With words that I can only guess the meaning of.
> >
> > Amir, if you are asking a complex question as to whether something
> > conforms to a specification, then please slow down and take the time
> > to define all the terms, the initial state, the observable behaviour
> > that you expect to see, etc in clear, unambiguous and well defined
> > terms.  Otherwise the question cannot be answered....
> >
>
> Sure. TBH, I didn't even dare to ask the complex question yet,
> because it was hard for me to define all terms. I sketched the
> use case with the example of create+setxattr+truncate+rename
> because I figured it is rather easy to understand.
>
> The more complex question has do to with explicit "data dependency"
> operation. At the moment, I will not explain what that means in details,
> but I am sure you can figure it out.
> With fdatasync+rename, fdatasync created a dependency between
> data and metadata of the file, so with SOMC, if file is observed after
> crash in rename destination, it also contains the data changes before
> fdatasync. But fdatasync gives a stringer guaranty than what
> my application actually needs, because in many cases it will cause
> journal flush. What it really needs is filemap_write_and_wait().
> Metadata doesn't need to be flushed as rename takes care of
> metadata ordering guaranties.
> As far as I can tell, there is no "official" API to do what I need
> and there is certainly no documentation about this expected behavior.
> Apologies, if above was not clear, I promise to explain in person
> during LSF to whoever is interested.

At the risk of being ambiguous in the same way as Amir:

Some applications may only care about ordering of metadata operations,
not whether they are persistent. Application-level correctness is
closely tied to ordering of different operations. Since SOMC gives us
the guarantee that if operation X is seen after recovery, all
dependent ops are also seen on recovery, this might be enough to
create a consistent application. For example, an application may not
care when file X was persisted to storage, as long as file Y was
persisted before it.

> Judging by the volume and passion of this thread, I think a
> session on LSF fs track would probably be a good idea.
> [CC Josef and Anna.]

+1 to discussion at LSF. We would be interested in hearing about the
results of the discussion.

> I find our behavior as a group of filesystem developers on this matter
> slightly bi-polar - on the one hand we wish to maintain implementation
> freedom for future performance improvements and don't wish to commit
> to existing behavior by documenting it. On the other hand, we wish to
> not break existing application, whose expectations from filesystems are
> far from what filesystems guaranty in documentation.
>
> There is no one good answer that fits all aspects of this subject and I
> personally agree with Ted on not wanting to document the ext4 "hacks"
> that are meant to cater misbehaving applications.

Completely agree with Amir here. There is a lot to be gained by
documentation data integrity guarantees of current file systems. We
currently do not know what each file system supports, without the
developers themselves weighing in. There have been multiple instances
where users/researchers like us and kernel developers like Amir were
confused about the guarantees provided by a given file system;
documentation would erase such confusion. If a standard like POSIX SIO
or SOMC is agreed upon, this allows optimizations while not breaking
application behavior.

I agree with being careful about committing to a set of guarantees,
but the ext4 "hacks" are now 10 years old. I'm not sure if they were
meant to be temporary, but clearly they are not. I highly doubt that
they are going to change anytime soon without breaking many
applications.

All I'm asking for is documenting the minimal set of guarantees each
file system already provides (or should provide in the absence of
bugs). It is alright if the file system provides more than what is
documented. The original patch does not talk about the rename hack
that Ted mentions.

> I think it is good that Jayashree posted this patch as a basis for discussion
> of what needs to be documented and how.
> Eventually, instead of trying to formalize filesystem expected behavior, it
> might be better to just encode the expected crash behavior tests
> in a readable manner, as Jayashree already started to do.
> Or maybe there is room for both documentation and tests.

Thanks for the support Amir!
Dave Chinner March 19, 2019, 3:13 a.m. UTC | #12
On Mon, Mar 18, 2019 at 09:13:58AM +0200, Amir Goldstein wrote:
> On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > > > +Strictly Ordered Metadata Consistency
> > > > > > > +-------------------------------------
> > > > > > > +With each file system providing varying levels of persistence
> > > > > > > +guarantees, a consensus in this regard, will benefit application
> > > > > > > +developers to work with certain fixed assumptions about file system
> > > > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > > > +
> > > > > > > +Under this scheme, the file system guarantees to persist all previous
> > > > > > > +dependent modifications to the object upon fsync().  If you fsync() an
> > > > > > > +inode, it will persist all the changes required to reference the inode
> > > > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > > > +
> > > > > > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > > > > > +user after recovery without also observing op1.
> > > > > > > +
> > > > > > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > > > > > +share a dependency, which could be file-system specific. It might
> > > > > > > +require a developer to understand file-system internals to know if
> > > > > > > +SOMC would order one operation before another.
> > > > > >
> > > > > > That's largely an internal implementation detail, and users should
> > > > > > not have to care about the internal implementation because the
> > > > > > fundamental dependencies are all defined by the directory heirarchy
> > > > > > relationships that users can see and manipulate.
> > > > > >
> > > > > > i.e. fs internal dependencies only increase the size of the graph
> > > > > > that is persisted, but it will never be reduced to less than what
> > > > > > the user can observe in the directory heirarchy.
> > > > > >
> > > > > > So this can be further refined:
> > > > > >
> > > > > >         If op1 precedes op2 in program order (in-memory execution
> > > > > >         order), and op1 and op2 share a user visible reference, then
> > > > > >         op2 must not be observed by a user after recovery without
> > > > > >         also observing op1.
> > > > > >
> > > > > > e.g. in the case of the parent directory - the parent has a link
> > > > > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > > > > operation in a directory modifies a user visible link count
> > > > > > reference.  Hence fsync of one of those children will persist the
> > > > > > directory link count, and then all of the other preceeding
> > > > > > transactions that modified the link count also need to be persisted.
> > > > > >
> > > > >
> > > > > One thing that bothers me is that the definition of SOMC (as well as
> > > > > your refined definition) doesn't mention fsync at all, but all the examples
> > > > > only discuss use cases with fsync.
> > > >
> > > > You can't discuss operational ordering without a point in time to
> > > > use as a reference for that ordering.  SOMC behaviour is preserved
> > > > at any point the filesystem checkpoints itself, and the only thing
> > > > that changes is the scope of that checkpoint. fsync is just a
> > > > convenient, widely understood, minimum dependecy reference point
> > > > that people can reason from. All the interesting ordering problems
> > > > come from minimum dependecy reference point (i.e. fsync()), not from
> > > > background filesystem-wide checkpoints.
> > > >
> > >
> > > Yes, I was referring to rename as a commonly used operation used
> > > by application as "metadata barrier".
> >
> > What is a "metadata barrier" and what are it's semantics supposed to
> > be?
> >
> 
> In this context I mean that effects of metadata operations before the
> barrier (e.g. setxattr, truncate) must be observed after crash if the effects
> of barrier operation (e.g. file was renamed) are observed after crash.

Ok, so you've just arbitrarily denoted a specific rename operation
to be a "recovery barrier" for your application?

In terms of SOMC, there is no operation that is an implied
"barrier". There are explicitly ordered checkpoints via data
integrity operations (i.e. sync, fsync, etc), but between those
points it's just dependency based ordering...

IOWs, if there is no direct relationship between two objects in
depnendency grpah, then then rename of one or the other does not
create a "metadata ordering barrier" between those two objects. They
are still independent, and so rename isn't a barrier in the true
sense (i.e. that it is an ordering synchronisation point).

At best rename can define a point in a dependency graph where an
independent dependency branch is merged atomically into the main
graph. This is still a powerful tool, and likely exactly what you
are wanting to know if it will work or not....

> > > > > To my understanding, SOMC provides a guaranty that the application does
> > > > > not need to do any fsync at all,
> > > >
> > > > Absolutely not true. If the application has atomic creation
> > > > requirements that need multiple syscalls to set up, it must
> > > > implement them itself and use fsync to synchronise data and metadata
> > > > before the "atomic create" operation that makes it visible to the
> > > > application.
> > > >
> > > > SOMC only guarantees what /metadata/ you see at a fileystem
> > > > synchronisation point; it does not provide ACID semantics to a
> > > > random set of system calls into the filesystem.
> > > >
> > >
> > > So I re-state my claim above after having explained the use case.
> >
> > With words that I can only guess the meaning of.
> >
> > Amir, if you are asking a complex question as to whether something
> > conforms to a specification, then please slow down and take the time
> > to define all the terms, the initial state, the observable behaviour
> > that you expect to see, etc in clear, unambiguous and well defined
> > terms.  Otherwise the question cannot be answered....
> >
> 
> Sure. TBH, I didn't even dare to ask the complex question yet,
> because it was hard for me to define all terms. I sketched the
> use case with the example of create+setxattr+truncate+rename
> because I figured it is rather easy to understand.
> 
> The more complex question has do to with explicit "data dependency"
> operation. At the moment, I will not explain what that means in details,
> but I am sure you can figure it out.
> With fdatasync+rename, fdatasync created a dependency between
> data and metadata of the file, so with SOMC, if file is observed after
> crash in rename destination, it also contains the data changes before
> fdatasync. But fdatasync gives a stringer guaranty than what
> my application actually needs, because in many cases it will cause
> journal flush. What it really needs is filemap_write_and_wait().
> Metadata doesn't need to be flushed as rename takes care of
> metadata ordering guaranties.

Ok, so what you are actually asking is whether SOMC provides a
guarantee that data writes that have completed before the rename
will be present on disk if the rename is present on disk? i.e.:

create+setxattr+write()+fdatawait()+rename

is atomic on a SOMC filesystem without a data integrity operation
being performed?

I don't think we've defined how data vs metadata ordering
persistence works in the SOMC model at all. We've really only been
discussing the metadata ordering and so I haven't really thought
all the different cases through.

OK, let's try to define how it works through examples.  Let's start
with the simple one: non-AIO O_DIRECT writes, because they send the
data straight to the device. i.e.

create
setxattr
write
  Extent Allocation
		  ----> device -+
					data volatile
		  <-- complete -+
write completion
rename					metadata volatile

At this point, we may have no direct dependency between the
write completion and the rename operation. Normally we would do
(O_DSYNC case)

write completion
    device cache flush
		  ----> device -+
		  <-- complete -+	data persisted
    journal FUA write
		  ----> device -+
		  <-- complete -+	file metadata persisted

and so we are guaranteed to have the data on disk before the rename
is started (i.e. POSIX compliance). Hence regardless of whether the
rename exists or not, we'll have the data on disk.

However, if we require a data completion rule similar to the IO
completion to device flush rule we have in the kernel:

	If data is to be ordered against a specific metadata
	operation, then the dependent data must be issued and
	completed before executing the ordering metadata operation.
	The application is responsibile for ensuring the necessary
	data has been flushed to storage and signalled complete, but
	it does not need to ensure it is persistent.

	When the ordering metadata operation is to be made
	persistent, the filesystem must ensure the dependent data is
	persistent before starting the ordered metadata persistence
	operation. It must also ensure that any data dependent
	metadata is captured and persisted in the pending ordered
	metadata persistence operation so all the metadata required
	to access the dependent data is persisted correctly.

Then we create the conditions where it is possible for data to be
ordered amongst the metadata with the same ordering guarantees
as the metadata. The above O_DIRECT example ends up as:

create
setxattr
write
  Extent Allocation			metadata volatile
		  ----> device -+
					data volatile
		  <-- complete -+
write completion
rename					metadata volatile
.....
<journal flush>
    device cache flush
		  ----> device -+
		  <-- complete -+	data persisted
    journal FUA write
		  ----> device -+
		  <-- complete -+	metadata persisted
<flush completion>


With AIO based O_DIRECT, then we cannot issue the ordering rename
until after the AIO completion has been delivered to the
application. Once that has been delivered, then it is the same case
as non AIO O_DIRECT.

BUffered IO is a bit harder, because we need flush-and-wait
primitives that don't provide data integrity guarantees. SO, after
soundly smacking down the user of sync_file_range() this morning
because it's not a data integrity operation and it has massive
gaping holes in it's behaviour, it may actually be useful here in a
very limited scope.

That is, sync_file_range() is only safe to use for this specific
sort of ordered data integrity algorithm when flushing the entire
file.(*)

create
setxattr
write					metadata volatile
  delayed allocation			data volatile
....
sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
		SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
  Extent Allocation			metadata volatile
		  ----> device -+
					data volatile
		  <-- complete -+
....
rename					metadata volatile

And so at this point, we only need a device cache flush to
make the data persistent and a journal flush to make the rename
persistent. And so it ends up the same case as non-AIO O_DIRECT.

So, yeah, I think this model will work to order completed data
writes against future metadata operations such that this is
observed:

	If a metadata operation is performed after dependent data
	has been flushed and signalled complete to userspace, then
	if that metadata operation is present after recovery the
	dependent data will also be present.

The good news here is what I described above is exactly what XFS
implements with it's journal flushes - it uses REQ_PREFLUSH |
REQ_FUA for journal writes, and so it follows the rules I outlined
above.  A quick grep shows that ext4/jbd2, f2fs and gfs2 also use
the same flags for journal and/or critical ordering IO. I can't tell
whether btrfs follows these rules or not.

> As far as I can tell, there is no "official" API to do what I need
> and there is certainly no documentation about this expected behavior.

Oh, userspace controlled data flushing is exactly what
sync_file_range() was intended for back when it was implemented back
in 2.6.17.

Unfortunately, the implementation was completely botched because it
was written from a top down "clean the page cache" perspective, not
a bottom up filesystem data integrity mechanism and by the time we
realised just how awful it was there were applications dependent on
it's existing behaviour....

> I find our behavior as a group of filesystem developers on this matter
> slightly bi-polar - on the one hand we wish to maintain implementation
> freedom for future performance improvements and don't wish to commit
> to existing behavior by documenting it. On the other hand, we wish to
> not break existing application, whose expectations from filesystems are
> far from what filesystems guaranty in documentation.

Personally I want the SOMC model to be explicitly documented so that
we can sanely discuss how we can provide sane optimisations to
userspace. It's the first step towards a model where
applications can run filesystem operations completely asynchronously
yet still provide large scale ordering and integrity guarantees
without needing copious amounts of fine-grained fsync
operations.(**)

I really don't care about the crazy vagaries of POSIX right now -
POSIX is a shit specification when it comes to integrity. The
sooner we move beyond it, the better off we'll be. And the beauty of
the SOMC model is that POSIX compliance falls out of it for free,
yet it allows us much more freedom for optimisation because we
can reason about integrity in terms of ordering and dependencies
rather than in terms of what fsync() must provide.

> There is no one good answer that fits all aspects of this subject and I
> personally agree with Ted on not wanting to document the ext4 "hacks"
> that are meant to cater misbehaving applications.

Applications "misbehave" largely because there is no definitive
documentation on what filesystems actually provide userspace. The
man pages document API behaviour, they /can't/ document things like
SOMC, which filesystems can provide it and how to use it to avoid
fsync()....

> I think it is good that Jayashree posted this patch as a basis for discussion
> of what needs to be documented and how.
> Eventually, instead of trying to formalize filesystem expected behavior, it
> might be better to just encode the expected crash behavior tests
> in a readable manner, as Jayashree already started to do.
> Or maybe there is room for both documentation and tests.

It needs documentation. crash tests do not document algorithms
behaviour, intentions, application programming models, constraints,
etc....

Cheers,

Dave.

(*) Using sync_file_range() for sub file ranges are simply broken when
it comes to data integrity style flushes as there is no guarantee it
will capture all the dirty ranges that need to be flushed (e.g.
write starting 100kb beyond EOF, then sync the range starting 100kb
beyond EOF, and it won't sync the sub-block zeroing that was done at
the old EOF, thereby exposing stale data....)

(**) That featherstitch paper I linked to earlier? Did you notice
the userspace defined "patch group" transaction interface?
http://featherstitch.cs.ucla.edu/
Dave Chinner March 19, 2019, 4:37 a.m. UTC | #13
On Mon, Mar 18, 2019 at 09:37:28PM -0500, Vijay Chidambaram wrote:
> For new folks on the thread, I'm Vijay Chidambaram, prof at UT Austin
> and Jayashree's advisor. We recently developed CrashMonkey, a tool for
> finding crash-consistency bugs in file systems. As part of the
> research effort, we had a lot of conversations with file-system
> developers to understand the guarantees provided by different file
> systems. This patch was inspired by the thought that we should quickly
> document what we know about the data integrity guarantees of different
> file systems. We did not expect to spur debate!
> 
> Thanks Dave, Amir, and Ted for the discussion. We will incorporate
> these comments into the next patch. If it is better to wait until a
> consensus is reached after the LSF meeting, we'd be happy to do so.
> 
> On Mon, Mar 18, 2019 at 2:14 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> > > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > > > > +Strictly Ordered Metadata Consistency
> > > > > > > > +-------------------------------------
> > > > > > > > +With each file system providing varying levels of persistence
> > > > > > > > +guarantees, a consensus in this regard, will benefit application
> > > > > > > > +developers to work with certain fixed assumptions about file system
> > > > > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > > > > +
> > > > > > > > +Under this scheme, the file system guarantees to persist all previous
> > > > > > > > +dependent modifications to the object upon fsync().  If you fsync() an
> > > > > > > > +inode, it will persist all the changes required to reference the inode
> > > > > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > > > > +
> > > > > > > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > > > > > > +user after recovery without also observing op1.
> > > > > > > > +
> > > > > > > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > > > > > > +share a dependency, which could be file-system specific. It might
> > > > > > > > +require a developer to understand file-system internals to know if
> > > > > > > > +SOMC would order one operation before another.
> > > > > > >
> > > > > > > That's largely an internal implementation detail, and users should
> > > > > > > not have to care about the internal implementation because the
> > > > > > > fundamental dependencies are all defined by the directory heirarchy
> > > > > > > relationships that users can see and manipulate.
> > > > > > >
> > > > > > > i.e. fs internal dependencies only increase the size of the graph
> > > > > > > that is persisted, but it will never be reduced to less than what
> > > > > > > the user can observe in the directory heirarchy.
> > > > > > >
> > > > > > > So this can be further refined:
> > > > > > >
> > > > > > >         If op1 precedes op2 in program order (in-memory execution
> > > > > > >         order), and op1 and op2 share a user visible reference, then
> > > > > > >         op2 must not be observed by a user after recovery without
> > > > > > >         also observing op1.
> > > > > > >
> > > > > > > e.g. in the case of the parent directory - the parent has a link
> > > > > > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > > > > > operation in a directory modifies a user visible link count
> > > > > > > reference.  Hence fsync of one of those children will persist the
> > > > > > > directory link count, and then all of the other preceeding
> > > > > > > transactions that modified the link count also need to be persisted.
> 
> Dave, how did SOMC come about? Even XFS persists more than the minimum
> set required by SOMC. Is SOMC most useful as a sort of intuitive
> guideline as to what users should expect to see after recovery?

Lots of things. 15+ years of fixing data and metadata recovery
ordering bugs in XFS, 20 years of reading academic filesystem
papers, many years of hating POSIX and that we should be aiming more
towards database ACID semantics in our filesystems, deciding
~10 years ago that maintainable integrity is far more important than
performance, understanding the block layer/device integrity
requirements and the model smarter people than me came up with for
ensuring integrity with minimal loss of performance, etc.

A big influence has also been that the "crash lost data" bug reports
we get from users are generally not a result of data being lost,
they are a result of incomplete and/or inconsistent recreation of
the state before the crash occurred.  e.g. files that exist with a
non-zero size but have no data in them, even though it had been
minutes between writing the data and crashing and other files were
just fine.

i.e. people don't tend to notice "stuff I just wrote is missing"
after a crash - they expect that. What they notice and complain
about is inconsistent state after recovery. e.g. file A was fine,
but file B was empty, even though I wrote file B before file A!

This is the sort of thing that Ted was refering to when he talked
about having to add hacks to ext4 to make sure certain "expected
behaviours" were maintained. ext4 inherited quite a few unrealistic
expectations from ext3, which had a much stricter data vs
metadata ordering model than ext4 does....

With XFS, the problems we've had with lost data/files have
invariably been a result of code that violated ordering semantics
for what were once considered performance benefits (hence my comment
about "integrity is more important than performance").  Those sorts
of problems (and there's been quite a others w.r.t. the XFS recovery
algorithm) have all been solved by journalling all metadata changes
(hence strict ordering against other metadata), improving the
journal format and the information we log in it, and delaying
data-dependent metadata updates until after the data IO completes.

And from that perspective, SOMC is really just a further
generalisation of the dependency and atomicity model that underlies
the existing XFS transaction engine.

> I found your notes about POSIX SIO interesting, and will incorporate
> it into the next version of the patch. Should POSIX SIO be agreed upon
> between file systems as the set of guarantees to provide (especially
> since this is what glibc assumes)? I think SOMC is stronger than POSIX
> SIO.

SOMC is stronger than POSIX SIO. POSIX SIO is still a horribly
ambiguous standard, even though it does define "data integrity"
and "file integrity" in a meaningful manner. It's an improvement,
but I still think it is terrible from efficiency and performance
perspectives.

> > The more complex question has do to with explicit "data dependency"
> > operation. At the moment, I will not explain what that means in details,
> > but I am sure you can figure it out.
> > With fdatasync+rename, fdatasync created a dependency between
> > data and metadata of the file, so with SOMC, if file is observed after
> > crash in rename destination, it also contains the data changes before
> > fdatasync. But fdatasync gives a stringer guaranty than what
> > my application actually needs, because in many cases it will cause
> > journal flush. What it really needs is filemap_write_and_wait().
> > Metadata doesn't need to be flushed as rename takes care of
> > metadata ordering guaranties.
> > As far as I can tell, there is no "official" API to do what I need
> > and there is certainly no documentation about this expected behavior.
> > Apologies, if above was not clear, I promise to explain in person
> > during LSF to whoever is interested.
> 
> At the risk of being ambiguous in the same way as Amir:
> 
> Some applications may only care about ordering of metadata operations,
> not whether they are persistent. Application-level correctness is
> closely tied to ordering of different operations. Since SOMC gives us
> the guarantee that if operation X is seen after recovery, all
> dependent ops are also seen on recovery, this might be enough to
> create a consistent application. For example, an application may not
> care when file X was persisted to storage, as long as file Y was
> persisted before it.

*nod*

Application developers have been asking for this sort of integrity
guarantee from filesystems for a long time. The problem has always
been that we've been unable to agree on a defined model that allows
us to guarantee such behaviour to userspace. Every ~5 years,
somebody comes up with a new userspace transaction proposal that
ends up going nowhere because it cannot be applied to most of the
underlying linux filesystems without severe compromises.

However, this discussion is leading me to belive that the benefits
of having a well defined and documented behavioural model (such as
SOMC) are starting to be realised. i.e. a well defined model allows
kernel and userspace to optimise indepedently but still provide the
exact integrity semantics each other requires. And that we can
expose that model as a set of tests in fstests, hence enabling both
fs developers and users to understand where filesystems behave
according to the model and where they may need further improvement.

So I think we are definitely headed in the right direction here.
That said....

> All I'm asking for is documenting the minimal set of guarantees each
> file system already provides (or should provide in the absence of
> bugs). It is alright if the file system provides more than what is
> documented. The original patch does not talk about the rename hack
> that Ted mentions.

... I'm really not that interested in documenting the limitations of
existing filesystems because that entirely backwards looking. I'm
looking forwards and aiming to provide a model that we can build
filesystems and applications around to fully exploit the performance
potential of modern storage hardware...

Cheers,

Dave.
Amir Goldstein March 19, 2019, 7:35 a.m. UTC | #14
On Tue, Mar 19, 2019 at 5:13 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Mar 18, 2019 at 09:13:58AM +0200, Amir Goldstein wrote:
> > On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner <david@fromorbit.com> wrote:
> > > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> > > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > > > > +Strictly Ordered Metadata Consistency
> > > > > > > > +-------------------------------------
> > > > > > > > +With each file system providing varying levels of persistence
> > > > > > > > +guarantees, a consensus in this regard, will benefit application
> > > > > > > > +developers to work with certain fixed assumptions about file system
> > > > > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > > > > +
> > > > > > > > +Under this scheme, the file system guarantees to persist all previous
> > > > > > > > +dependent modifications to the object upon fsync().  If you fsync() an
> > > > > > > > +inode, it will persist all the changes required to reference the inode
> > > > > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > > > > +
> > > > > > > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > > > > > > +user after recovery without also observing op1.
> > > > > > > > +
> > > > > > > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > > > > > > +share a dependency, which could be file-system specific. It might
> > > > > > > > +require a developer to understand file-system internals to know if
> > > > > > > > +SOMC would order one operation before another.
> > > > > > >
> > > > > > > That's largely an internal implementation detail, and users should
> > > > > > > not have to care about the internal implementation because the
> > > > > > > fundamental dependencies are all defined by the directory heirarchy
> > > > > > > relationships that users can see and manipulate.
> > > > > > >
> > > > > > > i.e. fs internal dependencies only increase the size of the graph
> > > > > > > that is persisted, but it will never be reduced to less than what
> > > > > > > the user can observe in the directory heirarchy.
> > > > > > >
> > > > > > > So this can be further refined:
> > > > > > >
> > > > > > >         If op1 precedes op2 in program order (in-memory execution
> > > > > > >         order), and op1 and op2 share a user visible reference, then
> > > > > > >         op2 must not be observed by a user after recovery without
> > > > > > >         also observing op1.
> > > > > > >
> > > > > > > e.g. in the case of the parent directory - the parent has a link
> > > > > > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > > > > > operation in a directory modifies a user visible link count
> > > > > > > reference.  Hence fsync of one of those children will persist the
> > > > > > > directory link count, and then all of the other preceeding
> > > > > > > transactions that modified the link count also need to be persisted.
> > > > > > >
> > > > > >
> > > > > > One thing that bothers me is that the definition of SOMC (as well as
> > > > > > your refined definition) doesn't mention fsync at all, but all the examples
> > > > > > only discuss use cases with fsync.
> > > > >
> > > > > You can't discuss operational ordering without a point in time to
> > > > > use as a reference for that ordering.  SOMC behaviour is preserved
> > > > > at any point the filesystem checkpoints itself, and the only thing
> > > > > that changes is the scope of that checkpoint. fsync is just a
> > > > > convenient, widely understood, minimum dependecy reference point
> > > > > that people can reason from. All the interesting ordering problems
> > > > > come from minimum dependecy reference point (i.e. fsync()), not from
> > > > > background filesystem-wide checkpoints.
> > > > >
> > > >
> > > > Yes, I was referring to rename as a commonly used operation used
> > > > by application as "metadata barrier".
> > >
> > > What is a "metadata barrier" and what are it's semantics supposed to
> > > be?
> > >
> >
> > In this context I mean that effects of metadata operations before the
> > barrier (e.g. setxattr, truncate) must be observed after crash if the effects
> > of barrier operation (e.g. file was renamed) are observed after crash.
>
> Ok, so you've just arbitrarily denoted a specific rename operation
> to be a "recovery barrier" for your application?
>
> In terms of SOMC, there is no operation that is an implied
> "barrier". There are explicitly ordered checkpoints via data
> integrity operations (i.e. sync, fsync, etc), but between those
> points it's just dependency based ordering...
>
> IOWs, if there is no direct relationship between two objects in
> depnendency grpah, then then rename of one or the other does not
> create a "metadata ordering barrier" between those two objects. They
> are still independent, and so rename isn't a barrier in the true
> sense (i.e. that it is an ordering synchronisation point).
>
> At best rename can define a point in a dependency graph where an
> independent dependency branch is merged atomically into the main
> graph. This is still a powerful tool, and likely exactly what you
> are wanting to know if it will work or not....

Absolutely. The application only cares about atomicity of creating
a certain file/dir with specific size/xattrs with a certain name.

>
> > > > > > To my understanding, SOMC provides a guaranty that the application does
> > > > > > not need to do any fsync at all,
> > > > >
> > > > > Absolutely not true. If the application has atomic creation
> > > > > requirements that need multiple syscalls to set up, it must
> > > > > implement them itself and use fsync to synchronise data and metadata
> > > > > before the "atomic create" operation that makes it visible to the
> > > > > application.
> > > > >
> > > > > SOMC only guarantees what /metadata/ you see at a fileystem
> > > > > synchronisation point; it does not provide ACID semantics to a
> > > > > random set of system calls into the filesystem.
> > > > >
> > > >
> > > > So I re-state my claim above after having explained the use case.
> > >
> > > With words that I can only guess the meaning of.
> > >
> > > Amir, if you are asking a complex question as to whether something
> > > conforms to a specification, then please slow down and take the time
> > > to define all the terms, the initial state, the observable behaviour
> > > that you expect to see, etc in clear, unambiguous and well defined
> > > terms.  Otherwise the question cannot be answered....
> > >
> >
> > Sure. TBH, I didn't even dare to ask the complex question yet,
> > because it was hard for me to define all terms. I sketched the
> > use case with the example of create+setxattr+truncate+rename
> > because I figured it is rather easy to understand.
> >
> > The more complex question has do to with explicit "data dependency"
> > operation. At the moment, I will not explain what that means in details,
> > but I am sure you can figure it out.
> > With fdatasync+rename, fdatasync created a dependency between
> > data and metadata of the file, so with SOMC, if file is observed after
> > crash in rename destination, it also contains the data changes before
> > fdatasync. But fdatasync gives a stringer guaranty than what
> > my application actually needs, because in many cases it will cause
> > journal flush. What it really needs is filemap_write_and_wait().
> > Metadata doesn't need to be flushed as rename takes care of
> > metadata ordering guaranties.
>
> Ok, so what you are actually asking is whether SOMC provides a
> guarantee that data writes that have completed before the rename
> will be present on disk if the rename is present on disk? i.e.:
>
> create+setxattr+write()+fdatawait()+rename
>
> is atomic on a SOMC filesystem without a data integrity operation
> being performed?
>
> I don't think we've defined how data vs metadata ordering
> persistence works in the SOMC model at all. We've really only been
> discussing the metadata ordering and so I haven't really thought
> all the different cases through.
>
> OK, let's try to define how it works through examples.  Let's start
> with the simple one: non-AIO O_DIRECT writes, because they send the
> data straight to the device. i.e.
>
> create
> setxattr
> write
>   Extent Allocation
>                   ----> device -+
>                                         data volatile
>                   <-- complete -+
> write completion
> rename                                  metadata volatile
>
> At this point, we may have no direct dependency between the
> write completion and the rename operation. Normally we would do
> (O_DSYNC case)
>
> write completion
>     device cache flush
>                   ----> device -+
>                   <-- complete -+       data persisted
>     journal FUA write
>                   ----> device -+
>                   <-- complete -+       file metadata persisted
>
> and so we are guaranteed to have the data on disk before the rename
> is started (i.e. POSIX compliance). Hence regardless of whether the
> rename exists or not, we'll have the data on disk.
>
> However, if we require a data completion rule similar to the IO
> completion to device flush rule we have in the kernel:
>
>         If data is to be ordered against a specific metadata
>         operation, then the dependent data must be issued and
>         completed before executing the ordering metadata operation.
>         The application is responsibile for ensuring the necessary
>         data has been flushed to storage and signalled complete, but
>         it does not need to ensure it is persistent.
>
>         When the ordering metadata operation is to be made
>         persistent, the filesystem must ensure the dependent data is
>         persistent before starting the ordered metadata persistence
>         operation. It must also ensure that any data dependent
>         metadata is captured and persisted in the pending ordered
>         metadata persistence operation so all the metadata required
>         to access the dependent data is persisted correctly.
>
> Then we create the conditions where it is possible for data to be
> ordered amongst the metadata with the same ordering guarantees
> as the metadata. The above O_DIRECT example ends up as:
>
> create
> setxattr
> write
>   Extent Allocation                     metadata volatile
>                   ----> device -+
>                                         data volatile
>                   <-- complete -+
> write completion
> rename                                  metadata volatile
> .....
> <journal flush>
>     device cache flush
>                   ----> device -+
>                   <-- complete -+       data persisted
>     journal FUA write
>                   ----> device -+
>                   <-- complete -+       metadata persisted
> <flush completion>
>
>
> With AIO based O_DIRECT, then we cannot issue the ordering rename
> until after the AIO completion has been delivered to the
> application. Once that has been delivered, then it is the same case
> as non AIO O_DIRECT.
>
> BUffered IO is a bit harder, because we need flush-and-wait
> primitives that don't provide data integrity guarantees. SO, after
> soundly smacking down the user of sync_file_range() this morning
> because it's not a data integrity operation and it has massive
> gaping holes in it's behaviour, it may actually be useful here in a
> very limited scope.
>
> That is, sync_file_range() is only safe to use for this specific
> sort of ordered data integrity algorithm when flushing the entire
> file.(*)
>
> create
> setxattr
> write                                   metadata volatile
>   delayed allocation                    data volatile
> ....
> sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
>                 SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
>   Extent Allocation                     metadata volatile
>                   ----> device -+
>                                         data volatile
>                   <-- complete -+
> ....
> rename                                  metadata volatile
>
> And so at this point, we only need a device cache flush to
> make the data persistent and a journal flush to make the rename
> persistent. And so it ends up the same case as non-AIO O_DIRECT.
>

Funny, I once told that story and one Dave Chinner told me
"Nice story, but wrong.":
https://patchwork.kernel.org/patch/10576303/#22190719

You pointed to the minor detail that sync_file_range() uses
WB_SYNC_NONE.
So yes, I agree, it is a nice story and we need to make it right,
by having an API (perhaps SYNC_FILE_RANGE_ALL).
When you pointed out my mistake, I switched the application to
use the FIEMAP_FLAG_SYNC API as a hack.

> So, yeah, I think this model will work to order completed data
> writes against future metadata operations such that this is
> observed:
>
>         If a metadata operation is performed after dependent data
>         has been flushed and signalled complete to userspace, then
>         if that metadata operation is present after recovery the
>         dependent data will also be present.
>
> The good news here is what I described above is exactly what XFS
> implements with it's journal flushes - it uses REQ_PREFLUSH |
> REQ_FUA for journal writes, and so it follows the rules I outlined
> above.  A quick grep shows that ext4/jbd2, f2fs and gfs2 also use
> the same flags for journal and/or critical ordering IO. I can't tell
> whether btrfs follows these rules or not.
>
> > As far as I can tell, there is no "official" API to do what I need
> > and there is certainly no documentation about this expected behavior.
>
> Oh, userspace controlled data flushing is exactly what
> sync_file_range() was intended for back when it was implemented back
> in 2.6.17.
>
> Unfortunately, the implementation was completely botched because it
> was written from a top down "clean the page cache" perspective, not
> a bottom up filesystem data integrity mechanism and by the time we
> realised just how awful it was there were applications dependent on
> it's existing behaviour....
>

Thanks a lot, Dave, for taking the time to fill in the gaps in my sketchy
requirement and for the detailed answer.

Besides tests and documentation what could be useful is a portable
user space library that just does the right thing for every filesystem.
For example, safe_rename(), could be properly documented and is
all the application developer should really care about. The default
implementation just does fdatasync() before rename and from here
things can only improve based on underlying filesystem and available
kernel APIs.

I am not volunteering to write that library, but I'd be happy to write
write the patch/tests/man page for SYNC_FILE_RANGE_ALL API
or whatever we want to call it, if we can agree that it is needed.

Thanks!
Amir.
Theodore Ts'o March 19, 2019, 3:17 p.m. UTC | #15
On Mon, Mar 18, 2019 at 09:37:28PM -0500, Vijay Chidambaram wrote:
> For new folks on the thread, I'm Vijay Chidambaram, prof at UT Austin
> and Jayashree's advisor. We recently developed CrashMonkey, a tool for
> finding crash-consistency bugs in file systems. As part of the
> research effort, we had a lot of conversations with file-system
> developers to understand the guarantees provided by different file
> systems. This patch was inspired by the thought that we should quickly
> document what we know about the data integrity guarantees of different
> file systems. We did not expect to spur debate!
> 
> Thanks Dave, Amir, and Ted for the discussion. We will incorporate
> these comments into the next patch. If it is better to wait until a
> consensus is reached after the LSF meeting, we'd be happy to do so.

Something to consider is that certain side effects of what fsync(2) or
fdatasync(2) might drag into the jbd2 transaction might change if we
were to implement (for example) something like Daejun Park and Dongkun
Shin's "iJournaling: Fine-grained journaling for improving the latency
of fsync system call" published in Usenix, ATC 2017:

   https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf

That's an example of how if we document synchronization that goes
beyond POSIX, it might change in the future.  So if it gets
documented, applications might start becoming unreliable on FreeBSD,
MacOS, etc.  And maybe as Linux developers we won't care about that;
since it increases Linux lock-in.  Win!  (If you think like Steve
Ballmer, anyway.  :-)

But then if we were to implement something like incremental journaling
for fsync, and applications were to start assuming that it would also
work, application authors might complain that we had broken their
application So they might call the new feature a *BUG* which broke
backwards compatibility, and then demand that we either withdraw the
new feature, or complicate our testing matrix by adding Yet Another
Mount Option.  (That's especially true since iJournaling is a
performance improvement that doesn't require an on-disk format change.
So this is the sort of thing that we might want to enable by default
eventually, even if initially it's only enabled via a mount option
while we are stablizing the new feature.)

So my concerns are not a theoretical, abstract concern, but something
which is very real.  Implementing something like what Park and Shin
has proposed is something that is very much that we are thinking
about.

    	     		       	       	    - Ted
Dave Chinner March 19, 2019, 8:43 p.m. UTC | #16
On Tue, Mar 19, 2019 at 09:35:19AM +0200, Amir Goldstein wrote:
> On Tue, Mar 19, 2019 at 5:13 AM Dave Chinner <david@fromorbit.com> wrote:
> > That is, sync_file_range() is only safe to use for this specific
> > sort of ordered data integrity algorithm when flushing the entire
> > file.(*)
> >
> > create
> > setxattr
> > write                                   metadata volatile
> >   delayed allocation                    data volatile
> > ....
> > sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
> >                 SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> >   Extent Allocation                     metadata volatile
> >                   ----> device -+
> >                                         data volatile
> >                   <-- complete -+
> > ....
> > rename                                  metadata volatile
> >
> > And so at this point, we only need a device cache flush to
> > make the data persistent and a journal flush to make the rename
> > persistent. And so it ends up the same case as non-AIO O_DIRECT.
> >
> 
> Funny, I once told that story and one Dave Chinner told me
> "Nice story, but wrong.":
> https://patchwork.kernel.org/patch/10576303/#22190719
> 
> You pointed to the minor detail that sync_file_range() uses
> WB_SYNC_NONE.

Ah, I forgot about that. That's what I get for not looking at the
code. Did I mention that SFR is a complete crock of shit when it
comes to data integrity operations? :/

> So yes, I agree, it is a nice story and we need to make it right,
> by having an API (perhaps SYNC_FILE_RANGE_ALL).
> When you pointed out my mistake, I switched the application to
> use the FIEMAP_FLAG_SYNC API as a hack.

Yeah, that 's a nasty hack :/

> Besides tests and documentation what could be useful is a portable
> user space library that just does the right thing for every filesystem.

*nod*

but before that, we need the model to be defined and documented.
And once we have a library, the fun part of convincing the world
that it should be the glibc default behaviour can begin....

Cheers,

Dave.
Dave Chinner March 19, 2019, 9:08 p.m. UTC | #17
On Tue, Mar 19, 2019 at 11:17:09AM -0400, Theodore Ts'o wrote:
> On Mon, Mar 18, 2019 at 09:37:28PM -0500, Vijay Chidambaram wrote:
> > For new folks on the thread, I'm Vijay Chidambaram, prof at UT Austin
> > and Jayashree's advisor. We recently developed CrashMonkey, a tool for
> > finding crash-consistency bugs in file systems. As part of the
> > research effort, we had a lot of conversations with file-system
> > developers to understand the guarantees provided by different file
> > systems. This patch was inspired by the thought that we should quickly
> > document what we know about the data integrity guarantees of different
> > file systems. We did not expect to spur debate!
> > 
> > Thanks Dave, Amir, and Ted for the discussion. We will incorporate
> > these comments into the next patch. If it is better to wait until a
> > consensus is reached after the LSF meeting, we'd be happy to do so.
> 
> Something to consider is that certain side effects of what fsync(2) or
> fdatasync(2) might drag into the jbd2 transaction might change if we
> were to implement (for example) something like Daejun Park and Dongkun
> Shin's "iJournaling: Fine-grained journaling for improving the latency
> of fsync system call" published in Usenix, ATC 2017:
> 
>    https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf
> 
> That's an example of how if we document synchronization that goes
> beyond POSIX, it might change in the future.

Sure, but again this is orthognal to what we are discussing here:
the user visible ordering of metadata operations after a crash.

If anyone implements a multi-segment or per-inode journal (say, like
NOVA), then it is up to that implementation to maintain the ordering
guarantees that a SOMC model requires. You can implement whatever
fsync() go-fast bits you want, as long as it provides the ordering
behaviour guarantees that the model defines.

IOWs, Ted, I think you have the wrong end of the stick here. This
isn't about optimising fsync() to provide better performance, it's
about guaranteeing order so that fsync() is not necessary and we
improve performance by allowing applications to omit order-only
synchornisation points in their workloads.

i.e. an order-based integrity model /reduces/ the need for a
hyper-optimised fsync operation because applications won't need to
use it as often.

Cheers,

Dave.
diff mbox series

Patch

diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt
new file mode 100644
index 0000000..be84964
--- /dev/null
+++ b/Documentation/filesystems/crash-recovery-guarantees.txt
@@ -0,0 +1,193 @@ 
+=====================================================================
+File System Crash-Recovery Guarantees
+=====================================================================
+Linux file systems provide certain guarantees to user-space
+applications about what happens to their data if the system crashes
+(due to power loss or kernel panic). These are termed crash-recovery
+guarantees.
+
+Crash-recovery guarantees only pertain to data or metadata that has
+been explicitly persisted to storage with fsync(), fdatasync(), or
+sync() system calls. By default, write(), mkdir(), and other
+file-system related system calls only affect the in-memory state of
+the file system.
+
+The crash-recovery guarantees provided by most Linux file systems are
+significantly stronger than what is required by POSIX. POSIX is vague,
+even allowing fsync() to do nothing (Mac OSX takes advantage of
+this). However, the guarantees provided by file systems are not
+documented, and vary between file systems. This document seeks to
+describe the current crash-recovery guarantees provided by major Linux
+file systems.
+
+What does the fsync() operation guarantee?
+----------------------------------------------------
+fsync() operation is meant to force the physical write of data
+corresponding to a file from the buffer cache, along with the file
+metadata. Note that the guarantees mentioned for each file system below
+are in addition to the ones provided by POSIX.
+
+POSIX
+-----
+fsync(file) : Flushes the data and metadata associated with the
+file. However, if the directory entry for the file has not been
+previously persisted, or has been modified, it is not guaranteed to be
+persisted by the fsync of the file [1]. What this means is, if a file
+is newly created, you will have to fsync(parent directory) in addition
+to fsync(file) in order to ensure that the file's directory entry has
+safely reached the disk.
+
+fsync(dir) : Flushes directory data and directory entries. However if
+you created a new file within the directory and wrote data to the
+file, then the file data is not guaranteed to be persisted, unless an
+explicit fsync() is issued on the file.
+
+ext4
+-----
+fsync(file) : Ensures that a newly created file's directory entry is
+persisted (no need to explicitly persist the parent directory). However,
+if you create multiple names of the file (hard links), then their directory
+entries are not guaranteed to persist unless each one of the parent
+directory entries are persisted [2].
+
+fsync(dir) : All file names within the persisted directory will exist,
+but does not guarantee file data.
+
+xfs
+----
+fsync(file) : Ensures that a newly created file's directory entry is
+persisted. Additionally, all the previous dependent modifications to
+this file are also persisted. If any file shares an object
+modification dependency with the fsync-ed file, then that file's
+directory entry is also persisted.
+
+fsync(dir) : All file names within the persisted directory will exist,
+but does not guarantee file data. As with files, fsync(dir) also persists
+previous dependent metadata operations.
+
+btrfs
+------
+fsync(file) : Ensures that a newly created file's directory entry
+is persisted, along with the directory entries of all its hard links.
+You do not need to explicitly fsync individual hard links to the file.
+
+fsync(dir) : All the file names within the directory will persist. All the
+rename and unlink operations within the directory are persisted. Due
+to the design choices made by btrfs, fsync of a directory could lead
+to an iterative fsync on sub-directories, thereby requiring a full
+file system commit. So btrfs does not advocate fsync of directories
+[2].
+
+F2FS
+----
+fsync(file) or fsync(dir) : In the default mode (fsync-mode=posix),
+F2FS only guarantees POSIX behaviour. However, it provides xfs-like
+guarantees if mounted with fsync-mode=strict option.
+
+fsync(symlink)
+-------------
+A symlink inode cannot be directly opened for IO, which means there is
+no such thing as fsync of a symlink [3]. You could be tricked by the
+fact that open and fsync of a symlink succeeds without returning a
+error, but what happens in reality is as follows.
+
+Suppose we have a symlink “foo”, which points to the file “A/bar”
+
+fd = open(“foo”, O_CREAT | O_RDWR)
+fsync(fd)
+
+Both the above operations succeed, but if you crash after fsync, the
+symlink could be still missing.
+
+When you try to open the symlink “foo”, you are actually trying to
+open the file that the symlink resolves to, which in this case is
+“A/bar”. When you fsync the inode returned by the open system call, you
+are actually persisting the file “A/bar” and not the symlink. Note
+that if the file “A/bar” does not exist and you try the open the
+symlink “foo” without the O_CREAT flag, then file open will fail. To
+obtain the file descriptor associated with the symlink inode, you
+could open the symlink using “O_PATH | O_NOFOLLOW” flags. However, the
+file descriptor obtained this way can be only used to indicate a
+location in the file-system tree and to perform operations that act
+purely at the file descriptor level. Operations like read(), write(),
+fsync() etc cannot be performed on such file descriptors.
+
+Bottomline : You cannot fsync() a symlink.
+
+fsync(special files)
+--------------------
+Special files in Linux include block and character device files
+(created using mknod), FIFO (created using mkfifo) etc. Just like the
+behavior of fsync on symlinks described above, these special files do
+not have an fsync function defined. Similar to symlinks, you
+cannot fsync a special file [4].
+
+
+Strictly Ordered Metadata Consistency
+-------------------------------------
+With each file system providing varying levels of persistence
+guarantees, a consensus in this regard, will benefit application
+developers to work with certain fixed assumptions about file system
+guarantees. Dave Chinner proposed a unified model called the
+Strictly Ordered Metadata Consistency (SOMC) [5].
+
+Under this scheme, the file system guarantees to persist all previous
+dependent modifications to the object upon fsync().  If you fsync() an
+inode, it will persist all the changes required to reference the inode
+and its data. SOMC can be defined as follows [6]:
+
+If op1 precedes op2 in program order (in-memory execution order), and
+op1 and op2 share a dependency, then op2 must not be observed by a
+user after recovery without also observing op1.
+
+Unfortunately, SOMC's definition depends upon whether two operations
+share a dependency, which could be file-system specific. It might
+require a developer to understand file-system internals to know if
+SOMC would order one operation before another. It is worth noting
+that a file system can be crash-consistent (according to POSIX),
+without providing SOMC [7].
+
+As an example, consider the following test case from xfstest
+generic/342 [8]
+-------
+touch A/foo
+echo “hello” >  A/foo
+sync
+
+mv A/foo A/bar
+echo “world” > A/foo
+fsync A/foo
+CRASH
+
+What would you expect on recovery, if the file system crashed after
+the final fsync returned successfully?
+
+Non-SOMC file systems will not persist the file
+A/bar because it was not explicitly fsync-ed. But this means, you will
+find only the file A/foo with data “world” after crash, thereby losing
+the previously persisted file with data “hello”. You will need to
+explicitly fsync the directory A to ensure the rename operation is
+safely persisted on disk.
+
+Under SOMC, to correctly reference the new inode via A/foo,
+the previous rename operation must persist as well. Therefore,
+fsync() of A/foo will persist the renamed file A/bar as well.
+On recovery you will find both A/bar (with data “hello”)
+and A/foo (with data “world”).
+
+It is noteworthy that xfs, ext4, F2FS (when mounted with fsync_mode=strict)
+and btrfs provide SOMC-like behaviour in this particular example.
+However, in writing, only XFS claims to provide SOMC. F2FS aims to provide
+SOMC when mounted with fsync_mode=strict. It is not clear if ext4 and
+btrfs provide strictly ordered metadata consistency.
+
+--------------------------------------------------------
+[1] http://man7.org/linux/man-pages/man2/fdatasync.2.html
+[2] https://www.spinics.net/lists/linux-btrfs/msg77340.html
+[3] https://www.spinics.net/lists/fstests/msg09370.html
+[4] https://bugzilla.kernel.org/show_bug.cgi?id=202485
+[5] https://marc.info/?l=fstests&m=155010885626284&w=2
+[6] https://marc.info/?l=fstests&m=155011123126916&w=2
+[7] https://www.spinics.net/lists/fstests/msg09379.html
+[8] https://patchwork.kernel.org/patch/10132305/
+