Message ID: 1552418820-18102-1-git-send-email-jaya@cs.utexas.edu
State: New, archived
Series: [v2] Documenting the crash-recovery guarantees of Linux file systems
On Tue, Mar 12, 2019 at 7:27 PM Jayashree <jaya@cs.utexas.edu> wrote: > > In this file, we document the crash-recovery guarantees > provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also > present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency > (SOMC), which is provided by xfs. It is not clear to us if other file systems > provide SOMC. > > Signed-off-by: Jayashree Mohan <jaya@cs.utexas.edu> > Reviewed-by: Amir Goldstein <amir73il@gmail.com> Mostly for the Btrfs part: Reviewed-by: Filipe Manana <fdmanana@suse.com> Nit: you can have lines up to 80 characters, many of them are significantly and unnecessarily smaller than that. Thanks for writing this document. > --- > > We would be happy to modify the document if file-system > developers claim that their system provides (or aims to provide) SOMC. > > Changes since v1: > * Addressed few nits identified in the review > * Added the fsync guarantees for F2FS and its SOMC compliance > --- > .../filesystems/crash-recovery-guarantees.txt | 193 +++++++++++++++++++++ > 1 file changed, 193 insertions(+) > create mode 100644 Documentation/filesystems/crash-recovery-guarantees.txt > > diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt > new file mode 100644 > index 0000000..be84964 > --- /dev/null > +++ b/Documentation/filesystems/crash-recovery-guarantees.txt > @@ -0,0 +1,193 @@ > +===================================================================== > +File System Crash-Recovery Guarantees > +===================================================================== > +Linux file systems provide certain guarantees to user-space > +applications about what happens to their data if the system crashes > +(due to power loss or kernel panic). These are termed crash-recovery > +guarantees. > + > +Crash-recovery guarantees only pertain to data or metadata that has > +been explicitly persisted to storage with fsync(), fdatasync(), or > +sync() system calls. By default, write(), mkdir(), and other > +file-system related system calls only affect the in-memory state of > +the file system. > + > +The crash-recovery guarantees provided by most Linux file systems are > +significantly stronger than what is required by POSIX. POSIX is vague, > +even allowing fsync() to do nothing (Mac OSX takes advantage of > +this). However, the guarantees provided by file systems are not > +documented, and vary between file systems. This document seeks to > +describe the current crash-recovery guarantees provided by major Linux > +file systems. > + > +What does the fsync() operation guarantee? > +---------------------------------------------------- > +fsync() operation is meant to force the physical write of data > +corresponding to a file from the buffer cache, along with the file > +metadata. Note that the guarantees mentioned for each file system below > +are in addition to the ones provided by POSIX. > + > +POSIX > +----- > +fsync(file) : Flushes the data and metadata associated with the > +file. However, if the directory entry for the file has not been > +previously persisted, or has been modified, it is not guaranteed to be > +persisted by the fsync of the file [1]. What this means is, if a file > +is newly created, you will have to fsync(parent directory) in addition > +to fsync(file) in order to ensure that the file's directory entry has > +safely reached the disk. > + > +fsync(dir) : Flushes directory data and directory entries. 
However if > +you created a new file within the directory and wrote data to the > +file, then the file data is not guaranteed to be persisted, unless an > +explicit fsync() is issued on the file. > + > +ext4 > +----- > +fsync(file) : Ensures that a newly created file's directory entry is > +persisted (no need to explicitly persist the parent directory). However, > +if you create multiple names of the file (hard links), then their directory > +entries are not guaranteed to persist unless each one of the parent > +directory entries are persisted [2]. > + > +fsync(dir) : All file names within the persisted directory will exist, > +but does not guarantee file data. > + > +xfs > +---- > +fsync(file) : Ensures that a newly created file's directory entry is > +persisted. Additionally, all the previous dependent modifications to > +this file are also persisted. If any file shares an object > +modification dependency with the fsync-ed file, then that file's > +directory entry is also persisted. > + > +fsync(dir) : All file names within the persisted directory will exist, > +but does not guarantee file data. As with files, fsync(dir) also persists > +previous dependent metadata operations. > + > +btrfs > +------ > +fsync(file) : Ensures that a newly created file's directory entry > +is persisted, along with the directory entries of all its hard links. > +You do not need to explicitly fsync individual hard links to the file. > + > +fsync(dir) : All the file names within the directory will persist. All the > +rename and unlink operations within the directory are persisted. Due > +to the design choices made by btrfs, fsync of a directory could lead > +to an iterative fsync on sub-directories, thereby requiring a full > +file system commit. So btrfs does not advocate fsync of directories > +[2]. > + > +F2FS > +---- > +fsync(file) or fsync(dir) : In the default mode (fsync-mode=posix), > +F2FS only guarantees POSIX behaviour. However, it provides xfs-like > +guarantees if mounted with fsync-mode=strict option. > + > +fsync(symlink) > +------------- > +A symlink inode cannot be directly opened for IO, which means there is > +no such thing as fsync of a symlink [3]. You could be tricked by the > +fact that open and fsync of a symlink succeeds without returning a > +error, but what happens in reality is as follows. > + > +Suppose we have a symlink “foo”, which points to the file “A/bar” > + > +fd = open(“foo”, O_CREAT | O_RDWR) > +fsync(fd) > + > +Both the above operations succeed, but if you crash after fsync, the > +symlink could be still missing. > + > +When you try to open the symlink “foo”, you are actually trying to > +open the file that the symlink resolves to, which in this case is > +“A/bar”. When you fsync the inode returned by the open system call, you > +are actually persisting the file “A/bar” and not the symlink. Note > +that if the file “A/bar” does not exist and you try the open the > +symlink “foo” without the O_CREAT flag, then file open will fail. To > +obtain the file descriptor associated with the symlink inode, you > +could open the symlink using “O_PATH | O_NOFOLLOW” flags. However, the > +file descriptor obtained this way can be only used to indicate a > +location in the file-system tree and to perform operations that act > +purely at the file descriptor level. Operations like read(), write(), > +fsync() etc cannot be performed on such file descriptors. > + > +Bottomline : You cannot fsync() a symlink. 
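To make the quoted symlink behaviour concrete, here is a minimal C sketch
(illustrative only, not part of the patch under review; error handling
elided), assuming a symlink "foo" that points at "A/bar":

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Follows the symlink: opens (and may create) A/bar, not "foo". */
	int fd = open("foo", O_CREAT | O_RDWR, 0644);
	if (fd >= 0) {
		fsync(fd);		/* persists A/bar, not the symlink */
		close(fd);
	}

	/* Refers to the symlink inode itself ... */
	int pfd = open("foo", O_PATH | O_NOFOLLOW);
	if (pfd >= 0 && fsync(pfd) < 0)	/* ... but fsync() fails (EBADF) */
		printf("fsync on O_PATH fd: %s\n", strerror(errno));
	if (pfd >= 0)
		close(pfd);
	return 0;
}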
> + > +fsync(special files) > +-------------------- > +Special files in Linux include block and character device files > +(created using mknod), FIFO (created using mkfifo) etc. Just like the > +behavior of fsync on symlinks described above, these special files do > +not have an fsync function defined. Similar to symlinks, you > +cannot fsync a special file [4]. > + > + > +Strictly Ordered Metadata Consistency > +------------------------------------- > +With each file system providing varying levels of persistence > +guarantees, a consensus in this regard, will benefit application > +developers to work with certain fixed assumptions about file system > +guarantees. Dave Chinner proposed a unified model called the > +Strictly Ordered Metadata Consistency (SOMC) [5]. > + > +Under this scheme, the file system guarantees to persist all previous > +dependent modifications to the object upon fsync(). If you fsync() an > +inode, it will persist all the changes required to reference the inode > +and its data. SOMC can be defined as follows [6]: > + > +If op1 precedes op2 in program order (in-memory execution order), and > +op1 and op2 share a dependency, then op2 must not be observed by a > +user after recovery without also observing op1. > + > +Unfortunately, SOMC's definition depends upon whether two operations > +share a dependency, which could be file-system specific. It might > +require a developer to understand file-system internals to know if > +SOMC would order one operation before another. It is worth noting > +that a file system can be crash-consistent (according to POSIX), > +without providing SOMC [7]. > + > +As an example, consider the following test case from xfstest > +generic/342 [8] > +------- > +touch A/foo > +echo “hello” > A/foo > +sync > + > +mv A/foo A/bar > +echo “world” > A/foo > +fsync A/foo > +CRASH > + > +What would you expect on recovery, if the file system crashed after > +the final fsync returned successfully? > + > +Non-SOMC file systems will not persist the file > +A/bar because it was not explicitly fsync-ed. But this means, you will > +find only the file A/foo with data “world” after crash, thereby losing > +the previously persisted file with data “hello”. You will need to > +explicitly fsync the directory A to ensure the rename operation is > +safely persisted on disk. > + > +Under SOMC, to correctly reference the new inode via A/foo, > +the previous rename operation must persist as well. Therefore, > +fsync() of A/foo will persist the renamed file A/bar as well. > +On recovery you will find both A/bar (with data “hello”) > +and A/foo (with data “world”). > + > +It is noteworthy that xfs, ext4, F2FS (when mounted with fsync_mode=strict) > +and btrfs provide SOMC-like behaviour in this particular example. > +However, in writing, only XFS claims to provide SOMC. F2FS aims to provide > +SOMC when mounted with fsync_mode=strict. It is not clear if ext4 and > +btrfs provide strictly ordered metadata consistency. > + > +-------------------------------------------------------- > +[1] http://man7.org/linux/man-pages/man2/fdatasync.2.html > +[2] https://www.spinics.net/lists/linux-btrfs/msg77340.html > +[3] https://www.spinics.net/lists/fstests/msg09370.html > +[4] https://bugzilla.kernel.org/show_bug.cgi?id=202485 > +[5] https://marc.info/?l=fstests&m=155010885626284&w=2 > +[6] https://marc.info/?l=fstests&m=155011123126916&w=2 > +[7] https://www.spinics.net/lists/fstests/msg09379.html > +[8] https://patchwork.kernel.org/patch/10132305/ > + > -- > 2.7.4 >
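A sketch of the defensive pattern the POSIX section of the document above
implies: on a file system that provides only the man-page fsync()
semantics, a newly created file needs an explicit fsync() of its parent
directory as well. The helper name is hypothetical:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 once both the file contents and its directory entry have
 * been pushed to storage, -1 on error. */
int persist_new_file(const char *dirpath, const char *filepath,
		     const void *buf, size_t len)
{
	int fd = open(filepath, O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	close(fd);

	/* POSIX does not promise the directory entry is durable after
	 * fsync(fd) alone; persist the parent directory explicitly. */
	int dfd = open(dirpath, O_RDONLY | O_DIRECTORY);
	if (dfd < 0)
		return -1;
	int ret = fsync(dfd);
	close(dfd);
	return ret;
}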
On Tue, Mar 12, 2019 at 9:27 PM Jayashree <jaya@cs.utexas.edu> wrote:
>
> In this file, we document the crash-recovery guarantees
> provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
> present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
> (SOMC), which is provided by xfs. It is not clear to us if other file systems
> provide SOMC.

I think your document already claims that f2fs is SOMC, so better to
update the commit message.

FWIW, it is clear that ext4 also provides SOMC, because all metadata
is journalled on a single linear transaction journal. Compared to
xfs, an fsync on any dirty object is likely to flush even more
metadata. It'd be a pity to merge this document without Ted's ACK on
the SOMC claim for ext4.

Thanks,
Amir.

[...]
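For readers who prefer code to shell, the generic/342 scenario discussed
in the document above can be sketched in C roughly as follows
(illustrative; error handling elided, and the original test's sync is
approximated by an fsync of the file):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("A/foo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	write(fd, "hello", 5);
	fsync(fd);
	close(fd);

	rename("A/foo", "A/bar");

	/* On a non-SOMC file system, the rename is only guaranteed
	 * durable once the parent directory is fsync-ed explicitly: */
	int dfd = open("A", O_RDONLY | O_DIRECTORY);
	fsync(dfd);
	close(dfd);

	fd = open("A/foo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	write(fd, "world", 5);
	fsync(fd);	/* under SOMC, this alone also persists the rename */
	close(fd);
	return 0;
}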
On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> In this file, we document the crash-recovery guarantees
> provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also
> present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency
> (SOMC), which is provided by xfs. It is not clear to us if other file systems
> provide SOMC.

FWIW, new kernel documents should be written in rst markup format,
not plain ascii text.

[...]

> +=====================================================================
> +File System Crash-Recovery Guarantees
> +=====================================================================
> +Linux file systems provide certain guarantees to user-space
> +applications about what happens to their data if the system crashes
> +(due to power loss or kernel panic). These are termed crash-recovery
> +guarantees.

These are termed "data integrity guarantees", not "crash recovery
guarantees".

i.e. crash recovery is a generic phrase describing the _mechanism_
used by some filesystems to implement the data integrity guarantees
the filesystem provides to userspace applications.

> +
> +Crash-recovery guarantees only pertain to data or metadata that has
> +been explicitly persisted to storage with fsync(), fdatasync(), or
> +sync() system calls.

Define data and metadata in terms of what they refer to when we talk
about data integrity guarantees. Define "persisted to storage".

Also, data integrity guarantees are provided by more interfaces than
you mention. They also apply to syncfs(), FIFREEZE, files/dirs opened
with O_[D]SYNC, preadv2/pwritev2 calls with RWF_[D]SYNC set, inodes
with the S_[DIR]SYNC on-disk attribute, mounts with dirsync/wsync
options, etc. "data integrity guarantees" encompass all these
operations, not just fsync/fdatasync/sync....

> By default, write(), mkdir(), and other
> +file-system related system calls only affect the in-memory state of
> +the file system.

That's a generalisation that is not always correct from the user's or
userspace developer's point of view. e.g. inodes with the sync
attribute set will default to synchronous on-disk state changes,
applications can use O_DSYNC/O_SYNC by default, etc....

> +The crash-recovery guarantees provided by most Linux file systems are
> +significantly stronger than what is required by POSIX. POSIX is vague,
> +even allowing fsync() to do nothing (Mac OSX takes advantage of
> +this).
Except when _POSIX_SYNCHRONIZED_IO is asserted, and then the semantics
filesystems must provide users are very explicit:

"[SIO] [Option Start] If _POSIX_SYNCHRONIZED_IO is defined, the
fsync() function shall force all currently queued I/O operations
associated with the file indicated by file descriptor fildes to the
synchronized I/O completion state. All I/O operations shall be
completed as defined for synchronized I/O file integrity completion.
[Option End]"

glibc asserts _POSIX_SYNCHRONIZED_IO (I'll use SIO from now on):

$ getconf _POSIX_SYNCHRONIZED_IO
200809
$

This means fsync() on Linux is supposed to conform to Section 3.376
"Synchronized I/O File Integrity Completion" of the specification,
which is a superset of Section 3.375 "Synchronized I/O Data Integrity
Completion". Section 3.375 says:

"For write, when the operation has been completed or diagnosed if
unsuccessful. The write is complete only when the data specified in
the write request is successfully transferred and all file system
information required to retrieve the data is successfully
transferred."

https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_375

The key phrase here is "all the file system information required to
retrieve the data". If the directory entry that points at the file is
not persisted with the file itself, then you can't retrieve the data
after a crash.

i.e. when _POSIX_SYNCHRONIZED_IO is asserted by the system, the
filesystem must guarantee this:

# touch A/foo
# echo "hello world" > A/foo
# fsync A/foo

persists the foo entry in the directory A, because that is
"filesystem information required to retrieve the data in the file
A/foo". i.e. if we crash here and A/foo is not present after restart,
then we've violated the POSIX specification for SIO.

IOWs, POSIX fsync w/ SIO semantics does not allow fsync() to do
nothing, but instead has explicit definitions of the behaviour
applications can expect. The only "wiggle room" in this specification
is whether the meaning of "data transfer" includes physically
persisting the data to storage media or just moving it into the
device's volatile cache. On Linux, we've explicitly chosen the
former, because the latter does not provide SIO semantics as data or
referencing metadata can still be lost from the device's volatile
cache after transfer.

> However, the guarantees provided by file systems are not
> +documented, and vary between file systems. This document seeks to
> +describe the current crash-recovery guarantees provided by major Linux
> +file systems.
> +
> +What does the fsync() operation guarantee?
> +----------------------------------------------------
> +fsync() operation is meant to force the physical write of data
> +corresponding to a file from the buffer cache, along with the file
> +metadata. Note that the guarantees mentioned for each file system below
> +are in addition to the ones provided by POSIX.

a. what is a "physical write"?
b. Linux does not have a buffer cache. What about direct IO?
c. Exactly what "file metadata" are you talking about here?
d. Actually, it's not "in addition" to posix - what you are
   documenting here is where filesystems do not conform to the POSIX
   SIO specification....

> +POSIX
> +-----
> +fsync(file) : Flushes the data and metadata associated with the
> +file. However, if the directory entry for the file has not been
> +previously persisted, or has been modified, it is not guaranteed to be
> +persisted by the fsync of the file [1].
These are the semantics defined in the linux fsync(3) man page, and as per the above, they are substantially /weaker/ than the POSIX SIO specification glibc says we implement. > What this means is, if a file > +is newly created, you will have to fsync(parent directory) in addition > +to fsync(file) in order to ensure that the file's directory entry has > +safely reached the disk. Define "safely reached disk" or use the same terms as previously defined (i.e. "persisted to storage"). > + > +fsync(dir) : Flushes directory data and directory entries. However if > +you created a new file within the directory and wrote data to the > +file, then the file data is not guaranteed to be persisted, unless an > +explicit fsync() is issued on the file. You talk about file metadata, then ignore what fsync does with directory metadata... > +ext4 > +----- > +fsync(file) : Ensures that a newly created file's directory entry is > +persisted (no need to explicitly persist the parent directory). However, > +if you create multiple names of the file (hard links), then their directory > +entries are not guaranteed to persist unless each one of the parent > +directory entries are persisted [2]. So you use a specific example to indicate an exception where ext4 needs an explicit parent directory fsync (i.e. hard links to a single file across multiple directories). That implies ext4 POSIX SIO compliance is questionable, and it is definitely not SOMC compliant. Further, it implies that transactional change atomicity requirements are also violated. i.e. the inode is journalled with a link count equivalent to all links existing, but not all the dirents that point to the inode are persisted at the same time. So from this example, ext4 is not SOMC compliant. > +fsync(dir) : All file names within the persisted directory will exist, > +but does not guarantee file data. what about the inodes that were created, removed or hard linked? Does it ensure they exist (or have been correctly freed) after fsync(dir), too? (that hardlink behaviour makes me question everything related to transaction atomicity in ext4 now) > +xfs > +---- > +fsync(file) : Ensures that a newly created file's directory entry is > +persisted. Actually, it ensures the path all the way up to the root inode is persisted. i.e. it guarantees the inode can be found after crash via a path walk. Basically, XFS demonstrates POSIX SIO compliant behaviour. > Additionally, all the previous dependent modifications to > +this file are also persisted. That's the mechanism that provides the behaviour, not sure that's relevant here. FWIW, this description is pretty much useless to a reader who knows nothing about XFS and what these terms actually mean. IOWs, you need to define "previous dependent modifications", "modification dependency", etc before using them. Essentially, you need to describe the observable behaviour here, not the implementation that creates the behaviour. > If any file shares an object > +modification dependency with the fsync-ed file, then that file's > +directory entry is also persisted. Which you need to explain with references to the ext4 hardlink failure and how XFS will persist all the hard link directory entries for each hardlink all the way back up to the root. i.e. don't describe the implementation, describe the observable behaviour. > +fsync(dir) : All file names within the persisted directory will exist, > +but does not guarantee file data. As with files, fsync(dir) also persists > +previous dependent metadata operations. 
> > +btrfs > +------ > +fsync(file) : Ensures that a newly created file's directory entry > +is persisted, along with the directory entries of all its hard links. > +You do not need to explicitly fsync individual hard links to the file. So how is that different to XFS? Why explicitly state the hard link behaviour, but then not mention anything about dependencies and propagation? Especially after doing exactly the opposite when describing XFS.... > +fsync(dir) : All the file names within the directory will persist. All the > +rename and unlink operations within the directory are persisted. Due > +to the design choices made by btrfs, fsync of a directory could lead > +to an iterative fsync on sub-directories, thereby requiring a full > +file system commit. So btrfs does not advocate fsync of directories > +[2]. I don't think this "recommendation" is appropriate for a document describing behaviour. It's also indicative of btrfs not having SOMC behaviour. > +F2FS > +---- > +fsync(file) or fsync(dir) : In the default mode (fsync-mode=posix), > +F2FS only guarantees POSIX behaviour. However, it provides xfs-like What does "only guarantees POSIX behaviour" actually mean? because it can mean "loses all your data on crash".... > +guarantees if mounted with fsync-mode=strict option. So, by default, f2fs will lose all your data on crash? And they call that "POSIX" behaviour, despite glibc telling applications that the system provides data integrity preserving fsync functionality? Seems like a very badly named mount option and a terrible default - basically we have "fast-and-loose" behaviour which has "eats your data" data integrity semantics and "strict" which should be POSIX SIO conformant. > +fsync(symlink) > +------------- > +A symlink inode cannot be directly opened for IO, which means there is > +no such thing as fsync of a symlink [3]. You could be tricked by the > +fact that open and fsync of a symlink succeeds without returning a > +error, but what happens in reality is as follows. > + > +Suppose we have a symlink “foo”, which points to the file “A/bar” > + > +fd = open(“foo”, O_CREAT | O_RDWR) > +fsync(fd) > + > +Both the above operations succeed, but if you crash after fsync, the > +symlink could be still missing. > + > +When you try to open the symlink “foo”, you are actually trying to > +open the file that the symlink resolves to, which in this case is > +“A/bar”. When you fsync the inode returned by the open system call, you > +are actually persisting the file “A/bar” and not the symlink. Note > +that if the file “A/bar” does not exist and you try the open the > +symlink “foo” without the O_CREAT flag, then file open will fail. To > +obtain the file descriptor associated with the symlink inode, you > +could open the symlink using “O_PATH | O_NOFOLLOW” flags. However, the > +file descriptor obtained this way can be only used to indicate a > +location in the file-system tree and to perform operations that act > +purely at the file descriptor level. Operations like read(), write(), > +fsync() etc cannot be performed on such file descriptors. > + > +Bottomline : You cannot fsync() a symlink. You can fsync() the parent dir after it is created or removed to persist that operation. > +fsync(special files) > +-------------------- > +Special files in Linux include block and character device files > +(created using mknod), FIFO (created using mkfifo) etc. Just like the > +behavior of fsync on symlinks described above, these special files do > +not have an fsync function defined. 
Similar to symlinks, you
> +cannot fsync a special file [4].

You can fsync() the parent dir after it is created or removed to
persist that operation.

> +Strictly Ordered Metadata Consistency
> +-------------------------------------
> +With each file system providing varying levels of persistence
> +guarantees, a consensus in this regard, will benefit application
> +developers to work with certain fixed assumptions about file system
> +guarantees. Dave Chinner proposed a unified model called the
> +Strictly Ordered Metadata Consistency (SOMC) [5].
> +
> +Under this scheme, the file system guarantees to persist all previous
> +dependent modifications to the object upon fsync(). If you fsync() an
> +inode, it will persist all the changes required to reference the inode
> +and its data. SOMC can be defined as follows [6]:
> +
> +If op1 precedes op2 in program order (in-memory execution order), and
> +op1 and op2 share a dependency, then op2 must not be observed by a
> +user after recovery without also observing op1.
> +
> +Unfortunately, SOMC's definition depends upon whether two operations
> +share a dependency, which could be file-system specific. It might
> +require a developer to understand file-system internals to know if
> +SOMC would order one operation before another.

That's largely an internal implementation detail, and users should
not have to care about the internal implementation because the
fundamental dependencies are all defined by the directory hierarchy
relationships that users can see and manipulate.

i.e. fs internal dependencies only increase the size of the graph
that is persisted, but it will never be reduced to less than what
the user can observe in the directory hierarchy.

So this can be further refined:

    If op1 precedes op2 in program order (in-memory execution
    order), and op1 and op2 share a user visible reference, then
    op2 must not be observed by a user after recovery without
    also observing op1.

e.g. in the case of the parent directory - the parent has a link
count. Hence every create, unlink, rename, hard link, symlink, etc
operation in a directory modifies a user visible link count
reference. Hence fsync of one of those children will persist the
directory link count, and then all of the other preceding
transactions that modified the link count also need to be persisted.

But keep in mind this defines ordering, not the persistence set:

# touch {a,b,c,d}
# touch {1,2,3,4}
# fsync d
<crash>

SOMC doesn't require {1,2,3,4} to be in the persistence set and
hence present after recovery. It only requires {a,b,c,d} to be in
the persistence set.

If you observe XFS behaviour, it will result in {1,2,3,4} also being
included in the persistence set, because it aggregates all the
changes to the parent directory into a single change per journal
checkpoint sequence and hence it cannot separate them at fsync time.
This, however, is an XFS journal implementation detail and not
something required by SOMC. The resulting behaviour is that XFS
generally persists more than SOMC requires, but the persistence set
that XFS calculates always maintains SOMC semantics, so it should
always do the right thing.

IOWs, a finer grained implementation of change dependencies could
result in providing exact, minimal persistence SOMC behaviour in
every situation, but don't expect that from XFS. It is likely that
experimental, explicit change dependency graph based filesystems
like featherstitch would provide minimal scope SOMC persistence
behaviour, but that's out of the scope of this document.
(*) http://featherstitch.cs.ucla.edu/
    http://featherstitch.cs.ucla.edu/publications/featherstitch-sosp07.pdf
    https://lwn.net/Articles/354861/

> It is worth noting
> +that a file system can be crash-consistent (according to POSIX),
> +without providing SOMC [7].

"crash-consistent" doesn't mean "data integrity preserving", and
posix only talks about data integrity behaviour. "crash-consistent"
just means the filesystem is not in a corrupt state when it recovers.

> +As an example, consider the following test case from xfstest
> +generic/342 [8]
> +-------
> +touch A/foo
> +echo “hello” > A/foo
> +sync
> +
> +mv A/foo A/bar
> +echo “world” > A/foo
> +fsync A/foo
> +CRASH

[whacky utf-8(?) symbols. Plain ascii text for documents, please.]

> +What would you expect on recovery, if the file system crashed after
> +the final fsync returned successfully?
> +
> +Non-SOMC file systems will not persist the file
> +A/bar because it was not explicitly fsync-ed. But this means, you will
> +find only the file A/foo with data “world” after crash, thereby losing
> +the previously persisted file with data “hello”. You will need to
> +explicitly fsync the directory A to ensure the rename operation is
> +safely persisted on disk.
> +
> +Under SOMC, to correctly reference the new inode via A/foo,
> +the previous rename operation must persist as well. Therefore,
> +fsync() of A/foo will persist the renamed file A/bar as well.
> +On recovery you will find both A/bar (with data “hello”)
> +and A/foo (with data “world”).

You should describe the SOMC behaviour up front in the document,
because that is the behaviour this document is about. Then describe
how the "man page fsync behaviour" and individual filesystems differ
from SOMC behaviour.

It would also be worth contrasting SOMC to historic ext3 behaviour
(globally ordered metadata and data), because that is the behaviour
that many application developers and users still want current
filesystems to emulate.

> +It is noteworthy that xfs, ext4, F2FS (when mounted with fsync_mode=strict)
> +and btrfs provide SOMC-like behaviour in this particular example.
> +However, in writing, only XFS claims to provide SOMC. F2FS aims to provide
> +SOMC when mounted with fsync_mode=strict. It is not clear if ext4 and
> +btrfs provide strictly ordered metadata consistency.

btrfs does not provide SOMC w.r.t. fsync() - that much is clear from
the endless stream of fsync bugs that are being found and fixed.

Also, the hard link behaviour described for ext4 indicates that it is
not truly SOMC, either. From this, I'd consider ext4 a "mostly SOMC"
implementation, but it seems that there are aspects of ext4/jbd2
dependency and/or atomicity tracking that don't fully resolve
cross-object transactional atomicity dependencies correctly.

Cheers,

Dave.
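The SIO assertion Dave discusses earlier in this reply can be checked on
a given system either with getconf, as he shows, or programmatically; a
small sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
	/* The C library asserts SIO; sysconf reports the POSIX revision. */
	printf("_POSIX_SYNCHRONIZED_IO: %ld\n",
	       sysconf(_SC_SYNCHRONIZED_IO));
#else
	printf("_POSIX_SYNCHRONIZED_IO is not asserted at compile time\n");
#endif
	return 0;
}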
On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote: > > In this file, we document the crash-recovery guarantees > > provided by four Linux file systems - xfs, ext4, F2FS and btrfs. We also > > present Dave Chinner's proposal of Strictly-Ordered Metadata Consistency > > (SOMC), which is provided by xfs. It is not clear to us if other file systems > > provide SOMC. > > FWIW, new kernel documents should be written in rst markup format, > not plain ascii text. > > > > > Signed-off-by: Jayashree Mohan <jaya@cs.utexas.edu> > > Reviewed-by: Amir Goldstein <amir73il@gmail.com> > > --- > > > > We would be happy to modify the document if file-system > > developers claim that their system provides (or aims to provide) SOMC. > > > > Changes since v1: > > * Addressed few nits identified in the review > > * Added the fsync guarantees for F2FS and its SOMC compliance > > --- > > .../filesystems/crash-recovery-guarantees.txt | 193 +++++++++++++++++++++ > > 1 file changed, 193 insertions(+) > > create mode 100644 Documentation/filesystems/crash-recovery-guarantees.txt > > > > diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt > > new file mode 100644 > > index 0000000..be84964 > > --- /dev/null > > +++ b/Documentation/filesystems/crash-recovery-guarantees.txt > > @@ -0,0 +1,193 @@ > > +===================================================================== > > +File System Crash-Recovery Guarantees > > +===================================================================== > > +Linux file systems provide certain guarantees to user-space > > +applications about what happens to their data if the system crashes > > +(due to power loss or kernel panic). These are termed crash-recovery > > +guarantees. > > These are termed "data integrity guarantees", not "crash recovery > guarantees". > > i.e. crash recovery is generic phrase describing the _mechanism_ > used by some filesystems to implement the data integrity guarantees > the filesystem provides to userspace applications. > Well, if we use the term "data integrity guarantees" we need to make sure to explain that "data" may also refer to "metadata" as most of the examples and corner cases in this document are not about whether or or not the file's data is persisted, but rather about the existence of a directory entry. Yes, when the file has data, the directory entry existence is a prerequisite to reading the file's data, but when a file doesn't have any data, like symlinks sparse files with xattrs, etc, it is important to clarify what we mean by "integrity". [...] > > +ext4 > > +----- > > +fsync(file) : Ensures that a newly created file's directory entry is > > +persisted (no need to explicitly persist the parent directory). However, > > +if you create multiple names of the file (hard links), then their directory > > +entries are not guaranteed to persist unless each one of the parent > > +directory entries are persisted [2]. > > So you use a specific example to indicate an exception where ext4 > needs an explicit parent directory fsync (i.e. hard links to a > single file across multiple directories). That implies ext4 POSIX > SIO compliance is questionable, and it is definitely not SOMC > compliant. Further, it implies that transactional change atomicity > requirements are also violated. i.e. 
the inode is journalled with a > link count equivalent to all links existing, but not all the dirents > that point to the inode are persisted at the same time. > > So from this example, ext4 is not SOMC compliant. > I question the claim made by the document about ext4 behavior. I believe Ted's words [2] may have been misinterpreted. Ted, can you comment? > > +fsync(dir) : All file names within the persisted directory will exist, > > +but does not guarantee file data. > > what about the inodes that were created, removed or hard linked? > Does it ensure they exist (or have been correctly freed) after > fsync(dir), too? (that hardlink behaviour makes me question > everything related to transaction atomicity in ext4 now) > Those should also be flushed with the same (or previous) transaction, either deleted or on orphan list. [...] > > +Strictly Ordered Metadata Consistency > > +------------------------------------- > > +With each file system providing varying levels of persistence > > +guarantees, a consensus in this regard, will benefit application > > +developers to work with certain fixed assumptions about file system > > +guarantees. Dave Chinner proposed a unified model called the > > +Strictly Ordered Metadata Consistency (SOMC) [5]. > > + > > +Under this scheme, the file system guarantees to persist all previous > > +dependent modifications to the object upon fsync(). If you fsync() an > > +inode, it will persist all the changes required to reference the inode > > +and its data. SOMC can be defined as follows [6]: > > + > > +If op1 precedes op2 in program order (in-memory execution order), and > > +op1 and op2 share a dependency, then op2 must not be observed by a > > +user after recovery without also observing op1. > > + > > +Unfortunately, SOMC's definition depends upon whether two operations > > +share a dependency, which could be file-system specific. It might > > +require a developer to understand file-system internals to know if > > +SOMC would order one operation before another. > > That's largely an internal implementation detail, and users should > not have to care about the internal implementation because the > fundamental dependencies are all defined by the directory heirarchy > relationships that users can see and manipulate. > > i.e. fs internal dependencies only increase the size of the graph > that is persisted, but it will never be reduced to less than what > the user can observe in the directory heirarchy. > > So this can be further refined: > > If op1 precedes op2 in program order (in-memory execution > order), and op1 and op2 share a user visible reference, then > op2 must not be observed by a user after recovery without > also observing op1. > > e.g. in the case of the parent directory - the parent has a link > count. Hence every create, unlink, rename, hard link, symlink, etc > operation in a directory modifies a user visible link count > reference. Hence fsync of one of those children will persist the > directory link count, and then all of the other preceeding > transactions that modified the link count also need to be persisted. > One thing that bothers me is that the definition of SOMC (as well as your refined definition) doesn't mention fsync at all, but all the examples only discuss use cases with fsync. I personally find SOMC guaranty *much* more powerful in the absence of fsync. I have an application that creates sparse files, sets xattrs, mtime and moves them into place. 
The observed requirement is that after a crash those files either
exist with the correct mtime and xattr, or do not exist. To my
understanding, SOMC provides a guarantee that the application does
not need to do any fsync at all, which is very desirable when many
such operations are performed while other users are doing data I/O
on the same filesystem.

For me, this is a very powerful feature of the filesystem, and if we
can (?) document this behavior and commit to it, that could benefit
application developers.

Thanks,
Amir.
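A rough sketch of the pattern Amir describes (file and xattr names are
hypothetical; error handling elided): a metadata-only temp file moved
into place with rename() as the barrier, and no fsync() anywhere:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/xattr.h>
#include <unistd.h>

int main(void)
{
	int fd = open("tmp.0001", O_CREAT | O_WRONLY, 0644);
	ftruncate(fd, 1 << 20);			/* sparse: a hole, no data */
	fsetxattr(fd, "user.state", "ready", 5, 0);

	struct timespec ts[2] = {
		{ .tv_nsec = UTIME_OMIT },	/* leave atime unchanged */
		{ .tv_sec = 1552608000 },	/* explicit mtime */
	};
	futimens(fd, ts);
	close(fd);

	/* The rename acts as the "metadata barrier": the claim is that
	 * under SOMC, after a crash the destination either exists with
	 * the xattr and mtime above, or does not exist at all. */
	rename("tmp.0001", "object.0001");
	return 0;
}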
On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote: > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote: > > > +Strictly Ordered Metadata Consistency > > > +------------------------------------- > > > +With each file system providing varying levels of persistence > > > +guarantees, a consensus in this regard, will benefit application > > > +developers to work with certain fixed assumptions about file system > > > +guarantees. Dave Chinner proposed a unified model called the > > > +Strictly Ordered Metadata Consistency (SOMC) [5]. > > > + > > > +Under this scheme, the file system guarantees to persist all previous > > > +dependent modifications to the object upon fsync(). If you fsync() an > > > +inode, it will persist all the changes required to reference the inode > > > +and its data. SOMC can be defined as follows [6]: > > > + > > > +If op1 precedes op2 in program order (in-memory execution order), and > > > +op1 and op2 share a dependency, then op2 must not be observed by a > > > +user after recovery without also observing op1. > > > + > > > +Unfortunately, SOMC's definition depends upon whether two operations > > > +share a dependency, which could be file-system specific. It might > > > +require a developer to understand file-system internals to know if > > > +SOMC would order one operation before another. > > > > That's largely an internal implementation detail, and users should > > not have to care about the internal implementation because the > > fundamental dependencies are all defined by the directory heirarchy > > relationships that users can see and manipulate. > > > > i.e. fs internal dependencies only increase the size of the graph > > that is persisted, but it will never be reduced to less than what > > the user can observe in the directory heirarchy. > > > > So this can be further refined: > > > > If op1 precedes op2 in program order (in-memory execution > > order), and op1 and op2 share a user visible reference, then > > op2 must not be observed by a user after recovery without > > also observing op1. > > > > e.g. in the case of the parent directory - the parent has a link > > count. Hence every create, unlink, rename, hard link, symlink, etc > > operation in a directory modifies a user visible link count > > reference. Hence fsync of one of those children will persist the > > directory link count, and then all of the other preceeding > > transactions that modified the link count also need to be persisted. > > > > One thing that bothers me is that the definition of SOMC (as well as > your refined definition) doesn't mention fsync at all, but all the examples > only discuss use cases with fsync. You can't discuss operational ordering without a point in time to use as a reference for that ordering. SOMC behaviour is preserved at any point the filesystem checkpoints itself, and the only thing that changes is the scope of that checkpoint. fsync is just a convenient, widely understood, minimum dependecy reference point that people can reason from. All the interesting ordering problems come from minimum dependecy reference point (i.e. fsync()), not from background filesystem-wide checkpoints. > I personally find SOMC guaranty *much* more powerful in the absence > of fsync. I have an application that creates sparse files, sets xattrs, mtime > and moves them into place. The observed requirement is that after crash > those files either exist with correct mtime, xattr or not exist. 
SOMC does not provide the guarantees you seek in the absence of a
known data synchronisation point:

a) a background metadata checkpoint can land anywhere in
   that series of operations and hence recovery will land in an
   intermediate state.

b) there is data that needs writing, and SOMC provides no
   ordering guarantees for data. So after recovery the file could
   exist with correct mtime and xattrs, but have no (or
   partial) data.

> To my understanding, SOMC provides a guarantee that the application does
> not need to do any fsync at all,

Absolutely not true. If the application has atomic creation
requirements that need multiple syscalls to set up, it must
implement them itself and use fsync to synchronise data and metadata
before the "atomic create" operation that makes it visible to the
application.

SOMC only guarantees what /metadata/ you see at a filesystem
synchronisation point; it does not provide ACID semantics to a
random set of system calls into the filesystem.

Cheers,

Dave.
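Dave's point, sketched as the usual write-fsync-rename publication idiom
(the helper is illustrative; error handling abbreviated): when file data
is involved, the data must be made durable before the rename that
publishes it, because SOMC orders metadata only:

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int publish(const char *tmp, const char *dst, const void *buf, size_t len)
{
	int fd = open(tmp, O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) < 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	close(fd);
	/* Only now is the rename a safe publication point: the data it
	 * exposes is already durable, and SOMC orders the metadata. */
	return rename(tmp, dst);
}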
On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
[...]
> > One thing that bothers me is that the definition of SOMC (as well as
> > your refined definition) doesn't mention fsync at all, but all the examples
> > only discuss use cases with fsync.
>
> You can't discuss operational ordering without a point in time to
> use as a reference for that ordering. SOMC behaviour is preserved
> at any point the filesystem checkpoints itself, and the only thing
> that changes is the scope of that checkpoint. fsync is just a
> convenient, widely understood, minimum dependency reference point
> that people can reason from. All the interesting ordering problems
> come from the minimum dependency reference point (i.e. fsync()), not
> from background filesystem-wide checkpoints.
Yes, I was referring to rename as a commonly used operation used by
applications as a "metadata barrier".

> > I personally find SOMC guaranty *much* more powerful in the absence
> > of fsync. I have an application that creates sparse files, sets xattrs, mtime
> > and moves them into place. The observed requirement is that after crash
> > those files either exist with correct mtime, xattr or not exist.

I wasn't clear:
1. "sparse" meaning no data at all, only a hole.
2. "exist" meaning found at the rename destination.
Naturally, it is the application's responsibility to clean up temp files
that were not moved into the rename destination.

> SOMC does not provide the guarantees you seek in the absence of a
> known data synchronisation point:
>
> a) a background metadata checkpoint can land anywhere in
> that series of operations and hence recovery will land in an
> intermediate state.

Yes, that results in temp files that would be cleaned up on recovery.

> b) there is data that needs writing, and SOMC provides no
> ordering guarantees for data. So after recovery file could
> exist with correct mtime and xattrs, but have no (or
> partial) data.

There is no data in my use case, only metadata; that is why
SOMC without fsync is an option.

> > To my understanding, SOMC provides a guaranty that the application does
> > not need to do any fsync at all,
>
> Absolutely not true. If the application has atomic creation
> requirements that need multiple syscalls to set up, it must
> implement them itself and use fsync to synchronise data and metadata
> before the "atomic create" operation that makes it visible to the
> application.
>
> SOMC only guarantees what /metadata/ you see at a filesystem
> synchronisation point; it does not provide ACID semantics to a
> random set of system calls into the filesystem.

So I re-state my claim above after having explained the use case.
IMO, the SOMC guarantee is an important feature even in the absence
of any fsync, because of the ability to use some metadata operations
(e.g. rename, link) as metadata barriers. Am I wrong about this?

Thanks, Amir.
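A sketch of the metadata-only pattern Amir is describing may help
(hypothetical names; note Dave's caveats above - this only works
because the files contain no data, and a background checkpoint can
still leave temp files behind for the application to clean up on
startup):

	/*
	 * Hedged sketch of the no-fsync use case: an empty (hole-only)
	 * file gets an xattr and is renamed into place. The claim under
	 * discussion is that, on a SOMC filesystem, if "dst" is observed
	 * after a crash then the ftruncate and setxattr that preceded
	 * the rename are observed too. Nothing here is guaranteed
	 * durable at any particular time.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/xattr.h>
	#include <unistd.h>

	int stage_marker(const char *tmp, const char *dst,
			 const void *val, size_t vlen, off_t size)
	{
		int fd = open(tmp, O_CREAT | O_WRONLY, 0644);

		if (fd < 0)
			return -1;
		if (ftruncate(fd, size) != 0 ||		/* a hole, no data */
		    fsetxattr(fd, "user.app_state", val, vlen, 0) != 0) {
			close(fd);
			unlink(tmp);
			return -1;
		}
		close(fd);
		/* No fsync: rename is used as the "metadata barrier". */
		return rename(tmp, dst);
	}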
On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote: > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote: > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote: > > > > > +Strictly Ordered Metadata Consistency > > > > > +------------------------------------- > > > > > +With each file system providing varying levels of persistence > > > > > +guarantees, a consensus in this regard, will benefit application > > > > > +developers to work with certain fixed assumptions about file system > > > > > +guarantees. Dave Chinner proposed a unified model called the > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5]. > > > > > + > > > > > +Under this scheme, the file system guarantees to persist all previous > > > > > +dependent modifications to the object upon fsync(). If you fsync() an > > > > > +inode, it will persist all the changes required to reference the inode > > > > > +and its data. SOMC can be defined as follows [6]: > > > > > + > > > > > +If op1 precedes op2 in program order (in-memory execution order), and > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a > > > > > +user after recovery without also observing op1. > > > > > + > > > > > +Unfortunately, SOMC's definition depends upon whether two operations > > > > > +share a dependency, which could be file-system specific. It might > > > > > +require a developer to understand file-system internals to know if > > > > > +SOMC would order one operation before another. > > > > > > > > That's largely an internal implementation detail, and users should > > > > not have to care about the internal implementation because the > > > > fundamental dependencies are all defined by the directory heirarchy > > > > relationships that users can see and manipulate. > > > > > > > > i.e. fs internal dependencies only increase the size of the graph > > > > that is persisted, but it will never be reduced to less than what > > > > the user can observe in the directory heirarchy. > > > > > > > > So this can be further refined: > > > > > > > > If op1 precedes op2 in program order (in-memory execution > > > > order), and op1 and op2 share a user visible reference, then > > > > op2 must not be observed by a user after recovery without > > > > also observing op1. > > > > > > > > e.g. in the case of the parent directory - the parent has a link > > > > count. Hence every create, unlink, rename, hard link, symlink, etc > > > > operation in a directory modifies a user visible link count > > > > reference. Hence fsync of one of those children will persist the > > > > directory link count, and then all of the other preceeding > > > > transactions that modified the link count also need to be persisted. > > > > > > > > > > One thing that bothers me is that the definition of SOMC (as well as > > > your refined definition) doesn't mention fsync at all, but all the examples > > > only discuss use cases with fsync. > > > > You can't discuss operational ordering without a point in time to > > use as a reference for that ordering. SOMC behaviour is preserved > > at any point the filesystem checkpoints itself, and the only thing > > that changes is the scope of that checkpoint. fsync is just a > > convenient, widely understood, minimum dependecy reference point > > that people can reason from. 
> > All the interesting ordering problems
> > come from minimum dependency reference point (i.e. fsync()), not from
> > background filesystem-wide checkpoints.
>
> Yes, I was referring to rename as a commonly used operation used
> by applications as a "metadata barrier".

What is a "metadata barrier" and what are its semantics supposed to
be?

> > > I personally find SOMC guaranty *much* more powerful in the absence
> > > of fsync. I have an application that creates sparse files, sets xattrs, mtime
> > > and moves them into place. The observed requirement is that after crash
> > > those files either exist with correct mtime, xattr or not exist.
>
> I wasn't clear:
> 1. "sparse" meaning no data at all, only a hole.

That's not sparse, that is an empty file or "contains no data".
"Sparse" means the file has "sparse data" - the data in the file is
separated by holes. A file that is just a single hole does not
contain "sparse data", it contains no data at all.

IOWs, if you mean "file has no data in it", then say that, as it is a
clear and unambiguous statement of what the file contains.

> 2. "exist" meaning found at the rename destination.
> Naturally, it is the application's responsibility to clean up temp files
> that were not moved into the rename destination.
>
> > SOMC does not provide the guarantees you seek in the absence of a
> > known data synchronisation point:
> >
> > a) a background metadata checkpoint can land anywhere in
> > that series of operations and hence recovery will land in an
> > intermediate state.
>
> Yes, that results in temp files that would be cleaned up on recovery.

Ambiguous. "Recovery" is something filesystems do to bring the
filesystem into a consistent state after a crash. If you are talking
about application level behaviour, then you need to make that
explicit.

i.e. I can /assume/ you are talking about application level recovery
from your previous statement, but that assumption is obviously wrong
if the application is using O_TMPFILE and linkat rather than rename,
in which case it will be filesystem level recovery that is doing the
cleanup. Ambiguous, yes?

> > b) there is data that needs writing, and SOMC provides no
> > ordering guarantees for data. So after recovery file could
> > exist with correct mtime and xattrs, but have no (or
> > partial) data.
>
> There is no data in my use case, only metadata; that is why
> SOMC without fsync is an option.

Perhaps, but I am not clear on exactly what you are proposing,
because I don't know what the hell a "metadata barrier" is, what it
does or what it implies for filesystem integrity operations...

> > > To my understanding, SOMC provides a guaranty that the application does
> > > not need to do any fsync at all,
> >
> > Absolutely not true. If the application has atomic creation
> > requirements that need multiple syscalls to set up, it must
> > implement them itself and use fsync to synchronise data and metadata
> > before the "atomic create" operation that makes it visible to the
> > application.
> >
> > SOMC only guarantees what /metadata/ you see at a filesystem
> > synchronisation point; it does not provide ACID semantics to a
> > random set of system calls into the filesystem.
>
> So I re-state my claim above after having explained the use case.

With words that I can only guess the meaning of.
Amir, if you are asking a complex question as to whether something
conforms to a specification, then please slow down and take the time
to define all the terms, the initial state, the observable behaviour
that you expect to see, etc. in clear, unambiguous and well-defined
terms. Otherwise the question cannot be answered....

Cheers, Dave.
On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > +ext4
> > > +-----
> > > +fsync(file) : Ensures that a newly created file's directory entry is
> > > +persisted (no need to explicitly persist the parent directory). However,
> > > +if you create multiple names of the file (hard links), then their directory
> > > +entries are not guaranteed to persist unless each one of the parent
> > > +directory entries are persisted [2].
> >
> > So you use a specific example to indicate an exception where ext4
> > needs an explicit parent directory fsync (i.e. hard links to a
> > single file across multiple directories). That implies ext4 POSIX
> > SIO compliance is questionable, and it is definitely not SOMC
> > compliant. Further, it implies that transactional change atomicity
> > requirements are also violated. i.e. the inode is journalled with a
> > link count equivalent to all links existing, but not all the dirents
> > that point to the inode are persisted at the same time.
> >
> > So from this example, ext4 is not SOMC compliant.
>
> I question the claim made by the document about ext4
> behavior.
> I believe Ted's words [2] may have been misinterpreted.
> Ted, can you comment?

We need to be really careful about claims and specifications.
Consider the following sequence of events (error checking ignored;
assume that it's there).

unlink("foo"); // make sure the file "foo" doesn't exist
unlink("bar"); // make sure the file "bar" doesn't exist
fd = open("foo", O_CREAT|O_WRONLY);
write(fd, "foo bar data", 12);
fsync(fd);
// if we crash here, the file "foo" will exist, and you will be able
// to read 12 bytes, and it will be present
link("foo", "bar");
fdatasync(fd);
// if we crash here, there is no guarantee that the hard link "bar"
// will be persisted

I believe this is perfectly compliant with _POSIX_SYNCHRONIZED_IO.

If the last fdatasync(fd) is replaced with fsync(fd), the link(2) will
touch the ctime, so in order to persist the ctime change, as a side
effect we will also update the directory entry, and so the existence
of "bar" will be persisted.

The bottom line, though, is I remain *very* skeptical about people who
want to document, and then tie the hands of file system developers
about, guarantees that go beyond POSIX, unless we have a very careful
discussion about what benefit this will provide application
developers, at least in general. If application developers start
depending on behaviors beyond POSIX, it limits the ability of file
system developers to innovate in order to improve performance.

There may be cases where it's worth it, and there may be cases where
it's pretty clear that the laws of physics are such that certain
things that go beyond POSIX will always be true. But before we
encourage application developers to go beyond POSIX, we really should
have this conversation first.

For example, ext4 implements a guarantee that goes beyond POSIX, in
that if you create a file, say, "foo.new", and then you rename that
file such that it replaces an existing file, say "foo", then after the
rename system call, we will initiate asynchronous writeback on
"foo.new". This is free if the application programmer has already
called fsync on "foo.new". However, for a sloppy application which
doesn't bother to call fsync(2), which Darrick informs me includes
"rpm", it saves you from lost files if you immediately reboot.

I do this because there are tons of sloppy application programmers,
and so they outnumber file system developers.
However, this is not documented behavior, nor is it guaranteed by
POSIX! I'm told by Darrick that XFS doesn't do this, and he believes
the XFS developers would refuse to add such hacks, because it
accommodates incompetent userspace programmers.

Perhaps the right answer is to yell at the application programmers who
make these mistakes. After all, it could be argued that the fact that
ext4 accommodates incompetence is leading to decreased application
quality and decreased portability. But at least back in the O_PONIES
era, I looked at multiple text editors supplied by both GNOME and KDE,
and I discovered to my horror that they were writing files in an
extremely unsafe manner. (Some of them were simply opening an
existing file with O_TRUNC, rewriting the text file's data, and NOT
bothering to call fsync afterwards; so ext4 also has a hack such that
for files opened with O_TRUNC where an existing file is truncated, on
the close we will initiate writeback. I chose this as being
relatively low overhead, because no competently implemented text
editor should be saving files in this way....)

Whether or not ext4 should accommodate application programmers by
going beyond POSIX, I believe very strongly that it should *not* be
documented, since that just encourages the bad application programming
practice. It's there just as a backstop, and in fact it's done as an
asynchronous writeback, not as a data integrity writeback. So it is
*not* something people should be relying on.

So before we document behavior that goes beyond POSIX, we should think
*very* carefully about whether this is something that we want to be
encouraging application programmers to rely on.

- Ted
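For contrast, this is roughly the unsafe save pattern Ted is
describing, which the ext4 O_TRUNC-on-close hack exists to backstop (a
hypothetical sketch of the anti-pattern, not a recommendation):

	/*
	 * Hedged sketch of the unsafe editor save: truncate the only
	 * copy in place and rewrite it, with no fsync() and no rename().
	 * A crash after open() but before writeback can leave an empty
	 * or partial file; the ext4 hack narrows this window, it does
	 * not close it. (Error checking omitted - it usually is, which
	 * is part of the problem.)
	 */
	#include <fcntl.h>
	#include <unistd.h>

	void unsafe_save(const char *path, const void *buf, size_t len)
	{
		int fd = open(path, O_WRONLY | O_TRUNC); /* old data gone */

		write(fd, buf, len);	/* still only in the page cache */
		close(fd);		/* no fsync(), nothing durable */
	}

The safe alternative is the write-to-temp, fsync(), rename() sequence
sketched earlier in the thread.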
On Mon, Mar 18, 2019 at 4:48 AM Theodore Ts'o <tytso@mit.edu> wrote: > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote: > > > > +ext4 > > > > +----- > > > > +fsync(file) : Ensures that a newly created file's directory entry is > > > > +persisted (no need to explicitly persist the parent directory). However, > > > > +if you create multiple names of the file (hard links), then their directory > > > > +entries are not guaranteed to persist unless each one of the parent > > > > +directory entries are persisted [2]. > > > > > > So you use a specific example to indicate an exception where ext4 > > > needs an explicit parent directory fsync (i.e. hard links to a > > > single file across multiple directories). That implies ext4 POSIX > > > SIO compliance is questionable, and it is definitely not SOMC > > > compliant. Further, it implies that transactional change atomicity > > > requirements are also violated. i.e. the inode is journalled with a > > > link count equivalent to all links existing, but not all the dirents > > > that point to the inode are persisted at the same time. > > > > > > So from this example, ext4 is not SOMC compliant. > > > > I question the claim made by the document about ext4 > > behavior. > > I believe Ted's words [2] may have been misinterpreted. > > Ted, can you comment? > > We need to be really careful about claims and specifications. > Consider the following sequence of events (error checking ignored; > assume that it's there). > > unlink("foo"); // make sure the file "foo" doesn't exist > unlink("bar"); // make sure the file "bar" doesn't exist > fd = open("foo", O_CREAT|O_WRONLY); > write(fd, "foo bar data", 12); > fsync(fd); > // if we crash here, the file "foo" will exist, and you will be able > // to read 12 bytes, and it will be present > link("foo", "bar"); > fdatasync(fd); > // if we crash here, there is no guarantee that the hard link "bar" > // will be persisted > > I believe this is perfectly compliant with _POSIX_SYNCHRONIZED_IO. > > If the last fsyncdata(fd) is replaced with fsync(fd), the link(2) will > touch the ctime, so in order only to persist the ctime change, as a > side effect we will also update the directory entry so the existence > of "bar" will be persisted. > > The bottom line, though, is I remain *very* skeptical about people who > want to document and then tie the hands of file system developers > about guarantees that go beyond Posix unless we have a very careful > discussion about what benefit this will provide application > developers, at least in general. If application developers start > depending on behaviors beyond POSIX, it limits the ability of file > system developers to innovate in order to improve performance. > That is understandable. But I believe the ACK Jayashree is looking for from you has to do with the SOMC guaranty. That has to do with ordering of metadata operations, which are all journalled by default in ext4 and has nothing to do with writeback hacks for dodgy apps. > There may be cases where it's worth it, and there may be cases where > it's pretty clear that the laws of physics such that certain things > that go beyond POSIX will always be true. But before we encourage > application developers to go beyond POSIX, we really should have this > conversation first. 
> > For example, ext4 implements a guarantee that goes beyond POSIX, in > that if you create a file, say, "foo.new", and then you rename that > file such that it replaces an existing file, say "foo", then after the > rename system call, we will initiate asynchronous writeback on > "foo.new". This is free if the application programmer has alrady > called fsync on "foo.new". However, for an sloppy application which > doesn't bother to call fsync(2), for which Darrick informs me includes > "rpm", it saves you from lost files if you immediately reboot. > > I do this, because there are tons of sloppy application programmers, > and so they outnumber file system developers. However, this is not > documented behavior, nor is it guaranteed by POSIX! I'm told by > Darrick that XFS doesn't do this, and he believes the XFS developers > would refuse to add such hacks, because it accomodates incompetent > userspace programmers. > > Perhaps the right answer is to yell at the application programmers who > make these mistakes. After all, the fact that ext4 accomodates > incompetence could be argued is leading to decreased application > quality and decreased portability. But at least back in the O_PONIES > era, I looked at multiple text editors supplied by both GNOME and KDE, > and I discovered to my horror they were writing files in an extremely > unsafe manner. (Some of them also were simply opening an existing > file with O_TRUNC, and then rewriting the text file's data, and NOT > bother calling fsync afterwards; so ext4 also has a hack so for files > opened with O_TRUNC where an existing file is truncated, on the close, > we will initiate writeback. I chose this as being relatively low > overhead, because no competently implemented text editor should be > saving files in this way....) > > Whether or not ext4 should accomodate application programmers by going > on POSIX, I believe very strongly that it should *not* be documented, > since it just encourages the bad application programming practice. > It's there just as a backstop, and in fact, it's done as an > asynchronous writeback, not as a data integrity writeback. So it is > *not* something people should be relying on. > > So before we document behavior that goes beyond POSIX, we should think > *very* carefully if this is something that we want to be encouraging > application programmers to rely on this sort of thing. > Perhaps it makes sense that if a behavior is already documented or will be documented in an xfstest with _require_metadata_journaling then it might as well be documented as well. Perhaps not. Thanks, Amir.
On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote: > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote: > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote: > > > > > > +Strictly Ordered Metadata Consistency > > > > > > +------------------------------------- > > > > > > +With each file system providing varying levels of persistence > > > > > > +guarantees, a consensus in this regard, will benefit application > > > > > > +developers to work with certain fixed assumptions about file system > > > > > > +guarantees. Dave Chinner proposed a unified model called the > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5]. > > > > > > + > > > > > > +Under this scheme, the file system guarantees to persist all previous > > > > > > +dependent modifications to the object upon fsync(). If you fsync() an > > > > > > +inode, it will persist all the changes required to reference the inode > > > > > > +and its data. SOMC can be defined as follows [6]: > > > > > > + > > > > > > +If op1 precedes op2 in program order (in-memory execution order), and > > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a > > > > > > +user after recovery without also observing op1. > > > > > > + > > > > > > +Unfortunately, SOMC's definition depends upon whether two operations > > > > > > +share a dependency, which could be file-system specific. It might > > > > > > +require a developer to understand file-system internals to know if > > > > > > +SOMC would order one operation before another. > > > > > > > > > > That's largely an internal implementation detail, and users should > > > > > not have to care about the internal implementation because the > > > > > fundamental dependencies are all defined by the directory heirarchy > > > > > relationships that users can see and manipulate. > > > > > > > > > > i.e. fs internal dependencies only increase the size of the graph > > > > > that is persisted, but it will never be reduced to less than what > > > > > the user can observe in the directory heirarchy. > > > > > > > > > > So this can be further refined: > > > > > > > > > > If op1 precedes op2 in program order (in-memory execution > > > > > order), and op1 and op2 share a user visible reference, then > > > > > op2 must not be observed by a user after recovery without > > > > > also observing op1. > > > > > > > > > > e.g. in the case of the parent directory - the parent has a link > > > > > count. Hence every create, unlink, rename, hard link, symlink, etc > > > > > operation in a directory modifies a user visible link count > > > > > reference. Hence fsync of one of those children will persist the > > > > > directory link count, and then all of the other preceeding > > > > > transactions that modified the link count also need to be persisted. > > > > > > > > > > > > > One thing that bothers me is that the definition of SOMC (as well as > > > > your refined definition) doesn't mention fsync at all, but all the examples > > > > only discuss use cases with fsync. > > > > > > You can't discuss operational ordering without a point in time to > > > use as a reference for that ordering. 
> > > SOMC behaviour is preserved
> > > at any point the filesystem checkpoints itself, and the only thing
> > > that changes is the scope of that checkpoint. fsync is just a
> > > convenient, widely understood, minimum dependency reference point
> > > that people can reason from. All the interesting ordering problems
> > > come from minimum dependency reference point (i.e. fsync()), not from
> > > background filesystem-wide checkpoints.
> >
> > Yes, I was referring to rename as a commonly used operation used
> > by applications as a "metadata barrier".
>
> What is a "metadata barrier" and what are its semantics supposed to
> be?

In this context I mean that the effects of metadata operations before
the barrier (e.g. setxattr, truncate) must be observed after a crash
if the effects of the barrier operation (e.g. the file was renamed)
are observed after the crash.

> > > > I personally find SOMC guaranty *much* more powerful in the absence
> > > > of fsync. I have an application that creates sparse files, sets xattrs, mtime
> > > > and moves them into place. The observed requirement is that after crash
> > > > those files either exist with correct mtime, xattr or not exist.
> >
> > I wasn't clear:
> > 1. "sparse" meaning no data at all, only a hole.
>
> That's not sparse, that is an empty file or "contains no data".
> "Sparse" means the file has "sparse data" - the data in the file is
> separated by holes. A file that is just a single hole does not
> contain "sparse data", it contains no data at all.
>
> IOWs, if you mean "file has no data in it", then say that, as it is a
> clear and unambiguous statement of what the file contains.
>
> > 2. "exist" meaning found at the rename destination.
> > Naturally, it is the application's responsibility to clean up temp files
> > that were not moved into the rename destination.
>
> > > SOMC does not provide the guarantees you seek in the absence of a
> > > known data synchronisation point:
> > >
> > > a) a background metadata checkpoint can land anywhere in
> > > that series of operations and hence recovery will land in an
> > > intermediate state.
> >
> > Yes, that results in temp files that would be cleaned up on recovery.
>
> Ambiguous. "Recovery" is something filesystems do to bring the
> filesystem into a consistent state after a crash. If you are talking
> about application level behaviour, then you need to make that
> explicit.
>
> i.e. I can /assume/ you are talking about application level recovery
> from your previous statement, but that assumption is obviously wrong
> if the application is using O_TMPFILE and linkat rather than rename,
> in which case it will be filesystem level recovery that is doing the
> cleanup. Ambiguous, yes?

Yes. From the application writer's POV, what matters is that doing
things "atomically" is possible, whether the filesystem provides the
recovery from an incomplete transaction (O_TMPFILE+linkat) or the
application cleans up leftovers on startup (rename).
I have some applications that use the former, and some that use the
latter for directories and for portability with OSes/filesystems that
don't have O_TMPFILE.

> > > b) there is data that needs writing, and SOMC provides no
> > > ordering guarantees for data. So after recovery file could
> > > exist with correct mtime and xattrs, but have no (or
> > > partial) data.
> >
> > There is no data in my use case, only metadata; that is why
> > SOMC without fsync is an option.
>
> Perhaps, but I am not clear on exactly what you are proposing,
> because I don't know what the hell a "metadata barrier" is, what it
> does or what it implies for filesystem integrity operations...
>
> > > > To my understanding, SOMC provides a guaranty that the application does
> > > > not need to do any fsync at all,
> > >
> > > Absolutely not true. If the application has atomic creation
> > > requirements that need multiple syscalls to set up, it must
> > > implement them itself and use fsync to synchronise data and metadata
> > > before the "atomic create" operation that makes it visible to the
> > > application.
> > >
> > > SOMC only guarantees what /metadata/ you see at a filesystem
> > > synchronisation point; it does not provide ACID semantics to a
> > > random set of system calls into the filesystem.
> >
> > So I re-state my claim above after having explained the use case.
>
> With words that I can only guess the meaning of.
>
> Amir, if you are asking a complex question as to whether something
> conforms to a specification, then please slow down and take the time
> to define all the terms, the initial state, the observable behaviour
> that you expect to see, etc. in clear, unambiguous and well-defined
> terms. Otherwise the question cannot be answered....

Sure. TBH, I didn't even dare to ask the complex question yet,
because it was hard for me to define all the terms. I sketched the
use case with the example of create+setxattr+truncate+rename
because I figured it is rather easy to understand.

The more complex question has to do with an explicit "data dependency"
operation. At the moment, I will not explain what that means in
detail, but I am sure you can figure it out.
With fdatasync+rename, fdatasync creates a dependency between the
data and metadata of the file, so with SOMC, if the file is observed
after a crash at the rename destination, it also contains the data
changes made before fdatasync. But fdatasync gives a stronger
guarantee than what my application actually needs, because in many
cases it will cause a journal flush. What it really needs is
filemap_write_and_wait(). Metadata doesn't need to be flushed, as
rename takes care of the metadata ordering guarantees.
As far as I can tell, there is no "official" API to do what I need,
and there is certainly no documentation about this expected behavior.
Apologies if the above was not clear; I promise to explain in person
during LSF to whoever is interested.

Judging by the volume and passion of this thread, I think a
session on the LSF fs track would probably be a good idea.

[CC Josef and Anna.]

I find our behavior as a group of filesystem developers on this matter
slightly bi-polar - on the one hand, we wish to maintain implementation
freedom for future performance improvements and don't wish to commit
to existing behavior by documenting it. On the other hand, we wish
not to break existing applications, whose expectations from
filesystems are far from what filesystems guarantee in documentation.

There is no one good answer that fits all aspects of this subject, and
I personally agree with Ted on not wanting to document the ext4
"hacks" that are meant to cater to misbehaving applications.

I think it is good that Jayashree posted this patch as a basis for
discussion of what needs to be documented and how.
Eventually, instead of trying to formalize filesystem expected
behavior, it might be better to just encode the expected crash
behavior tests in a readable manner, as Jayashree already started to
do. Or maybe there is room for both documentation and tests.
Thanks, Amir.
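A sketch of the O_TMPFILE+linkat variant mentioned above, where it is
filesystem-level recovery rather than application startup code that
disposes of unpublished files (illustrative names; the /proc/self/fd
idiom is the one described in the open(2) man page):

	/*
	 * Hedged sketch: prepare an anonymous O_TMPFILE inode and
	 * publish it with linkat(). If we crash before linkat(), the
	 * inode has no visible name and filesystem recovery reclaims
	 * it, so no temp-file cleanup pass is needed on application
	 * startup.
	 */
	#define _GNU_SOURCE		/* O_TMPFILE is Linux-specific */
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int publish_tmpfile(const char *dir, const char *dst,
			    const void *buf, size_t len)
	{
		char proc[64];
		int fd, ret;

		fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
		if (fd < 0)
			return -1;
		if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
			close(fd);
			return -1;
		}
		snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
		/* Publication: give the inode its first visible name. */
		ret = linkat(AT_FDCWD, proc, AT_FDCWD, dst,
			     AT_SYMLINK_FOLLOW);
		close(fd);
		return ret;
	}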
For new folks on the thread, I'm Vijay Chidambaram, prof at UT Austin and Jayashree's advisor. We recently developed CrashMonkey, a tool for finding crash-consistency bugs in file systems. As part of the research effort, we had a lot of conversations with file-system developers to understand the guarantees provided by different file systems. This patch was inspired by the thought that we should quickly document what we know about the data integrity guarantees of different file systems. We did not expect to spur debate! Thanks Dave, Amir, and Ted for the discussion. We will incorporate these comments into the next patch. If it is better to wait until a consensus is reached after the LSF meeting, we'd be happy to do so. On Mon, Mar 18, 2019 at 2:14 AM Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote: > > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote: > > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote: > > > > > > > +Strictly Ordered Metadata Consistency > > > > > > > +------------------------------------- > > > > > > > +With each file system providing varying levels of persistence > > > > > > > +guarantees, a consensus in this regard, will benefit application > > > > > > > +developers to work with certain fixed assumptions about file system > > > > > > > +guarantees. Dave Chinner proposed a unified model called the > > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5]. > > > > > > > + > > > > > > > +Under this scheme, the file system guarantees to persist all previous > > > > > > > +dependent modifications to the object upon fsync(). If you fsync() an > > > > > > > +inode, it will persist all the changes required to reference the inode > > > > > > > +and its data. SOMC can be defined as follows [6]: > > > > > > > + > > > > > > > +If op1 precedes op2 in program order (in-memory execution order), and > > > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a > > > > > > > +user after recovery without also observing op1. > > > > > > > + > > > > > > > +Unfortunately, SOMC's definition depends upon whether two operations > > > > > > > +share a dependency, which could be file-system specific. It might > > > > > > > +require a developer to understand file-system internals to know if > > > > > > > +SOMC would order one operation before another. > > > > > > > > > > > > That's largely an internal implementation detail, and users should > > > > > > not have to care about the internal implementation because the > > > > > > fundamental dependencies are all defined by the directory heirarchy > > > > > > relationships that users can see and manipulate. > > > > > > > > > > > > i.e. fs internal dependencies only increase the size of the graph > > > > > > that is persisted, but it will never be reduced to less than what > > > > > > the user can observe in the directory heirarchy. > > > > > > > > > > > > So this can be further refined: > > > > > > > > > > > > If op1 precedes op2 in program order (in-memory execution > > > > > > order), and op1 and op2 share a user visible reference, then > > > > > > op2 must not be observed by a user after recovery without > > > > > > also observing op1. > > > > > > > > > > > > e.g. 
in the case of the parent directory - the parent has a link > > > > > > count. Hence every create, unlink, rename, hard link, symlink, etc > > > > > > operation in a directory modifies a user visible link count > > > > > > reference. Hence fsync of one of those children will persist the > > > > > > directory link count, and then all of the other preceeding > > > > > > transactions that modified the link count also need to be persisted. Dave, how did SOMC come about? Even XFS persists more than the minimum set required by SOMC. Is SOMC most useful as a sort of intuitive guideline as to what users should expect to see after recovery? I found your notes about POSIX SIO interesting, and will incorporate it into the next version of the patch. Should POSIX SIO be agreed upon between file systems as the set of guarantees to provide (especially since this is what glibc assumes)? I think SOMC is stronger than POSIX SIO. > In this context I mean that effects of metadata operations before the > barrier (e.g. setxattr, truncate) must be observed after crash if the effects > of barrier operation (e.g. file was renamed) are observed after crash. > > > > > > I personally find SOMC guaranty *much* more powerful in the absence > > > > > of fsync. I have an application that creates sparse files, sets xattrs, mtime > > > > > and moves them into place. The observed requirement is that after crash > > > > > those files either exist with correct mtime, xattr or not exist. > > > > > > I wasn't clear: > > > 1. "sparse" meaning no data at all only hole. > > > > That's not sparse, that is an empty file or "contains no data". > > "Sparse" means the file has "sparse data" - the data in the file is > > separated by holes. A file that is just a single hole does not > > contain "sparse data", it contains no data at all. > > > > IOWs, if you mean "file has no data in it", then say that as it is a > > clear and unambiguous statement of what the file contains. > > > > > 2. "exist" meaning found at rename destination > > > Naturally, its applications responsibility to cleanup temp files that were > > > not moved into rename destination. > > > > > > > > > > > SOMC does not provide the guarantees you seek in the absence of a > > > > known data synchronisation point: > > > > > > > > a) a background metadata checkpoint can land anywhere in > > > > that series of operations and hence recovery will land in an > > > > intermediate state. > > > > > > Yes, that results in temp files that would be cleaned up on recovery. > > > > Ambiguous. "recovery" is something filesystems do to bring the > > filesystem into a consistent state after a crash. If you are talking > > about applicaiton level behaviour, then you need to make that > > explicit. > > > > i.e. I can /assume/ you are talking about application level recovery > > from your previous statement, but that assumption is obviously wrong > > if the application is using O_TMPFILE and linkat rather than rename, > > in which case it will be fileystem level recovery that is doing the > > cleanup. Ambiguous, yes? > > > > Yes. From the application writers POV, the fact that doing things > "atomically" is possible is what matters. Whether filesystem provides > the recovery from incomplete transaction (O_TMPFILE+linkat), or > application can cleanup leftovers on startup (rename). > I have some applications that use the former and some that use the > latter for directories and for portability with OS/fs that don't have > O_TMPFILE. 
> > > > > > > b) there is data that needs writing, and SOMC provides no > > > > ordering guarantees for data. So after recovery file could > > > > exist with correct mtime and xattrs, but have no (or > > > > partial) data. > > > > > > > > > > There is no data in my use case, only metadata, that is why > > > SOMC without fsync is an option. > > > > Perhaps, but I am not clear on exactly what you are proposing > > because I don't know what the hell a "metadata barrier" is, what it > > does or what it implies for filesystem integrity operations... > > > > > > > To my understanding, SOMC provides a guaranty that the application does > > > > > not need to do any fsync at all, > > > > > > > > Absolutely not true. If the application has atomic creation > > > > requirements that need multiple syscalls to set up, it must > > > > implement them itself and use fsync to synchronise data and metadata > > > > before the "atomic create" operation that makes it visible to the > > > > application. > > > > > > > > SOMC only guarantees what /metadata/ you see at a fileystem > > > > synchronisation point; it does not provide ACID semantics to a > > > > random set of system calls into the filesystem. > > > > > > > > > > So I re-state my claim above after having explained the use case. > > > > With words that I can only guess the meaning of. > > > > Amir, if you are asking a complex question as to whether something > > conforms to a specification, then please slow down and take the time > > to define all the terms, the initial state, the observable behaviour > > that you expect to see, etc in clear, unambiguous and well defined > > terms. Otherwise the question cannot be answered.... > > > > Sure. TBH, I didn't even dare to ask the complex question yet, > because it was hard for me to define all terms. I sketched the > use case with the example of create+setxattr+truncate+rename > because I figured it is rather easy to understand. > > The more complex question has do to with explicit "data dependency" > operation. At the moment, I will not explain what that means in details, > but I am sure you can figure it out. > With fdatasync+rename, fdatasync created a dependency between > data and metadata of the file, so with SOMC, if file is observed after > crash in rename destination, it also contains the data changes before > fdatasync. But fdatasync gives a stringer guaranty than what > my application actually needs, because in many cases it will cause > journal flush. What it really needs is filemap_write_and_wait(). > Metadata doesn't need to be flushed as rename takes care of > metadata ordering guaranties. > As far as I can tell, there is no "official" API to do what I need > and there is certainly no documentation about this expected behavior. > Apologies, if above was not clear, I promise to explain in person > during LSF to whoever is interested. At the risk of being ambiguous in the same way as Amir: Some applications may only care about ordering of metadata operations, not whether they are persistent. Application-level correctness is closely tied to ordering of different operations. Since SOMC gives us the guarantee that if operation X is seen after recovery, all dependent ops are also seen on recovery, this might be enough to create a consistent application. For example, an application may not care when file X was persisted to storage, as long as file Y was persisted before it. > Judging by the volume and passion of this thread, I think a > session on LSF fs track would probably be a good idea. 
> [CC Josef and Anna.]

+1 to discussion at LSF. We would be interested in hearing about the
results of the discussion.

> I find our behavior as a group of filesystem developers on this matter
> slightly bi-polar - on the one hand, we wish to maintain implementation
> freedom for future performance improvements and don't wish to commit
> to existing behavior by documenting it. On the other hand, we wish
> not to break existing applications, whose expectations from
> filesystems are far from what filesystems guarantee in documentation.
>
> There is no one good answer that fits all aspects of this subject, and
> I personally agree with Ted on not wanting to document the ext4
> "hacks" that are meant to cater to misbehaving applications.

Completely agree with Amir here. There is a lot to be gained by
documenting the data integrity guarantees of current file systems. We
currently do not know what each file system supports without the
developers themselves weighing in. There have been multiple instances
where users/researchers like us and kernel developers like Amir were
confused about the guarantees provided by a given file system;
documentation would erase such confusion. If a standard like POSIX SIO
or SOMC is agreed upon, this allows optimizations while not breaking
application behavior.

I agree with being careful about committing to a set of guarantees,
but the ext4 "hacks" are now 10 years old. I'm not sure if they were
meant to be temporary, but clearly they are not. I highly doubt that
they are going to change anytime soon without breaking many
applications. All I'm asking for is documenting the minimal set of
guarantees each file system already provides (or should provide in the
absence of bugs). It is alright if the file system provides more than
what is documented. The original patch does not talk about the rename
hack that Ted mentions.

> I think it is good that Jayashree posted this patch as a basis for
> discussion of what needs to be documented and how.
> Eventually, instead of trying to formalize filesystem expected
> behavior, it might be better to just encode the expected crash
> behavior tests in a readable manner, as Jayashree already started to
> do. Or maybe there is room for both documentation and tests.

Thanks for the support, Amir!
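A hedged sketch of the ordering-only reliance Vijay describes
(hypothetical names; no file data is involved, and nothing is made
durable at any particular time - the two names share a dependency
because they link to the same inode, which matters given Dave's later
point that rename does not order unrelated objects):

	/*
	 * Hedged sketch: rely purely on SOMC metadata ordering.
	 * "commit" is a hard link to the record's inode, created
	 * strictly after the record itself, so after a crash "commit"
	 * should never be observed without "record" - though both may
	 * be lost together.
	 */
	#include <fcntl.h>
	#include <unistd.h>

	int log_record(const char *record, const char *commit)
	{
		int fd = open(record, O_CREAT | O_WRONLY | O_EXCL, 0600);

		if (fd < 0)
			return -1;
		close(fd);
		/* Dependent operation: shares the inode created above. */
		return link(record, commit);
	}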
On Mon, Mar 18, 2019 at 09:13:58AM +0200, Amir Goldstein wrote: > On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote: > > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote: > > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote: > > > > > > > +Strictly Ordered Metadata Consistency > > > > > > > +------------------------------------- > > > > > > > +With each file system providing varying levels of persistence > > > > > > > +guarantees, a consensus in this regard, will benefit application > > > > > > > +developers to work with certain fixed assumptions about file system > > > > > > > +guarantees. Dave Chinner proposed a unified model called the > > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5]. > > > > > > > + > > > > > > > +Under this scheme, the file system guarantees to persist all previous > > > > > > > +dependent modifications to the object upon fsync(). If you fsync() an > > > > > > > +inode, it will persist all the changes required to reference the inode > > > > > > > +and its data. SOMC can be defined as follows [6]: > > > > > > > + > > > > > > > +If op1 precedes op2 in program order (in-memory execution order), and > > > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a > > > > > > > +user after recovery without also observing op1. > > > > > > > + > > > > > > > +Unfortunately, SOMC's definition depends upon whether two operations > > > > > > > +share a dependency, which could be file-system specific. It might > > > > > > > +require a developer to understand file-system internals to know if > > > > > > > +SOMC would order one operation before another. > > > > > > > > > > > > That's largely an internal implementation detail, and users should > > > > > > not have to care about the internal implementation because the > > > > > > fundamental dependencies are all defined by the directory heirarchy > > > > > > relationships that users can see and manipulate. > > > > > > > > > > > > i.e. fs internal dependencies only increase the size of the graph > > > > > > that is persisted, but it will never be reduced to less than what > > > > > > the user can observe in the directory heirarchy. > > > > > > > > > > > > So this can be further refined: > > > > > > > > > > > > If op1 precedes op2 in program order (in-memory execution > > > > > > order), and op1 and op2 share a user visible reference, then > > > > > > op2 must not be observed by a user after recovery without > > > > > > also observing op1. > > > > > > > > > > > > e.g. in the case of the parent directory - the parent has a link > > > > > > count. Hence every create, unlink, rename, hard link, symlink, etc > > > > > > operation in a directory modifies a user visible link count > > > > > > reference. Hence fsync of one of those children will persist the > > > > > > directory link count, and then all of the other preceeding > > > > > > transactions that modified the link count also need to be persisted. > > > > > > > > > > > > > > > > One thing that bothers me is that the definition of SOMC (as well as > > > > > your refined definition) doesn't mention fsync at all, but all the examples > > > > > only discuss use cases with fsync. 
> > > > You can't discuss operational ordering without a point in time to
> > > > use as a reference for that ordering. SOMC behaviour is preserved
> > > > at any point the filesystem checkpoints itself, and the only thing
> > > > that changes is the scope of that checkpoint. fsync is just a
> > > > convenient, widely understood, minimum dependency reference point
> > > > that people can reason from. All the interesting ordering problems
> > > > come from minimum dependency reference point (i.e. fsync()), not from
> > > > background filesystem-wide checkpoints.
> > >
> > > Yes, I was referring to rename as a commonly used operation used
> > > by applications as a "metadata barrier".
> >
> > What is a "metadata barrier" and what are its semantics supposed to
> > be?
>
> In this context I mean that the effects of metadata operations before
> the barrier (e.g. setxattr, truncate) must be observed after a crash
> if the effects of the barrier operation (e.g. the file was renamed)
> are observed after the crash.

Ok, so you've just arbitrarily denoted a specific rename operation to
be a "recovery barrier" for your application?

In terms of SOMC, there is no operation that is an implied "barrier".
There are explicitly ordered checkpoints via data integrity operations
(i.e. sync, fsync, etc), but between those points it's just dependency
based ordering...

IOWs, if there is no direct relationship between two objects in the
dependency graph, then the rename of one or the other does not create
a "metadata ordering barrier" between those two objects. They are
still independent, and so rename isn't a barrier in the true sense
(i.e. that it is an ordering synchronisation point). At best rename
can define a point in a dependency graph where an independent
dependency branch is merged atomically into the main graph.

This is still a powerful tool, and likely exactly what you are wanting
to know if it will work or not....

> > > > > To my understanding, SOMC provides a guaranty that the application does
> > > > > not need to do any fsync at all,
> > > >
> > > > Absolutely not true. If the application has atomic creation
> > > > requirements that need multiple syscalls to set up, it must
> > > > implement them itself and use fsync to synchronise data and metadata
> > > > before the "atomic create" operation that makes it visible to the
> > > > application.
> > > >
> > > > SOMC only guarantees what /metadata/ you see at a filesystem
> > > > synchronisation point; it does not provide ACID semantics to a
> > > > random set of system calls into the filesystem.
> > >
> > > So I re-state my claim above after having explained the use case.
> >
> > With words that I can only guess the meaning of.
> >
> > Amir, if you are asking a complex question as to whether something
> > conforms to a specification, then please slow down and take the time
> > to define all the terms, the initial state, the observable behaviour
> > that you expect to see, etc. in clear, unambiguous and well-defined
> > terms. Otherwise the question cannot be answered....
>
> Sure. TBH, I didn't even dare to ask the complex question yet,
> because it was hard for me to define all the terms. I sketched the
> use case with the example of create+setxattr+truncate+rename
> because I figured it is rather easy to understand.
>
> The more complex question has to do with an explicit "data dependency"
> operation. At the moment, I will not explain what that means in
> detail, but I am sure you can figure it out.
> With fdatasync+rename, fdatasync creates a dependency between the
> data and metadata of the file, so with SOMC, if the file is observed
> after a crash at the rename destination, it also contains the data
> changes made before fdatasync. But fdatasync gives a stronger
> guarantee than what my application actually needs, because in many
> cases it will cause a journal flush. What it really needs is
> filemap_write_and_wait(). Metadata doesn't need to be flushed, as
> rename takes care of the metadata ordering guarantees.

Ok, so what you are actually asking is whether SOMC provides a
guarantee that data writes that have completed before the rename will
be present on disk if the rename is present on disk? i.e.:

	create+setxattr+write()+fdatawait()+rename

is atomic on a SOMC filesystem without a data integrity operation
being performed?

I don't think we've defined how data vs metadata ordering persistence
works in the SOMC model at all. We've really only been discussing the
metadata ordering, and so I haven't really thought all the different
cases through.

OK, let's try to define how it works through examples. Let's start
with the simple one: non-AIO O_DIRECT writes, because they send the
data straight to the device. i.e.

	create
	setxattr
	write
	  Extent Allocation
				----> device -+
					      data volatile
				<-- complete -+
	write completion
	rename
	  metadata volatile

At this point, we may have no direct dependency between the write
completion and the rename operation. Normally we would do (O_DSYNC
case):

	write completion
	  device cache flush
				----> device -+
				<-- complete -+
	  data persisted
	  journal FUA write
				----> device -+
				<-- complete -+
	  file metadata persisted

and so we are guaranteed to have the data on disk before the rename is
started (i.e. POSIX compliance). Hence regardless of whether the
rename exists or not, we'll have the data on disk.

However, if we require a data completion rule similar to the IO
completion to device flush rule we have in the kernel:

	If data is to be ordered against a specific metadata operation,
	then the dependent data must be issued and completed before
	executing the ordering metadata operation. The application is
	responsible for ensuring the necessary data has been flushed
	to storage and signalled complete, but it does not need to
	ensure it is persistent.

	When the ordering metadata operation is to be made persistent,
	the filesystem must ensure the dependent data is persistent
	before starting the ordered metadata persistence operation.
	It must also ensure that any data dependent metadata is
	captured and persisted in the pending ordered metadata
	persistence operation so all the metadata required to access
	the dependent data is persisted correctly.

then we create the conditions where it is possible for data to be
ordered amongst the metadata with the same ordering guarantees as the
metadata. The above O_DIRECT example ends up as:

	create
	setxattr
	write
	  Extent Allocation
	    metadata volatile
				----> device -+
					      data volatile
				<-- complete -+
	write completion
	rename
	  metadata volatile
	.....
	<journal flush>
	  device cache flush
				----> device -+
				<-- complete -+
	  data persisted
	  journal FUA write
				----> device -+
				<-- complete -+
	  metadata persisted
	<flush completion>

With AIO based O_DIRECT, we cannot issue the ordering rename until
after the AIO completion has been delivered to the application. Once
that has been delivered, it is the same case as non-AIO O_DIRECT.

Buffered IO is a bit harder, because we need flush-and-wait primitives
that don't provide data integrity guarantees.
So, after soundly smacking down the user of sync_file_range() this
morning because it's not a data integrity operation and it has massive
gaping holes in its behaviour, it may actually be useful here in a
very limited scope. That is, sync_file_range() is only safe to use
for this specific sort of ordered data integrity algorithm when
flushing the entire file.(*)

	create
	setxattr
	write
	  metadata volatile
	  delayed allocation data volatile
	....
	sync_file_range(fd, 0, 0,
			SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
			SYNC_FILE_RANGE_WAIT_AFTER);
	  Extent Allocation
	    metadata volatile
				----> device -+
					      data volatile
				<-- complete -+
	....
	rename
	  metadata volatile

And so at this point, we only need a device cache flush to make the
data persistent and a journal flush to make the rename persistent.
And so it ends up the same case as non-AIO O_DIRECT.

So, yeah, I think this model will work to order completed data writes
against future metadata operations such that this is observed:

	If a metadata operation is performed after dependent data has
	been flushed and signalled complete to userspace, then if that
	metadata operation is present after recovery, the dependent
	data will also be present.

The good news here is that what I described above is exactly what XFS
implements with its journal flushes - it uses REQ_PREFLUSH | REQ_FUA
for journal writes, and so it follows the rules I outlined above. A
quick grep shows that ext4/jbd2, f2fs and gfs2 also use the same flags
for journal and/or critical ordering IO. I can't tell whether btrfs
follows these rules or not.

> As far as I can tell, there is no "official" API to do what I need,
> and there is certainly no documentation about this expected behavior.

Oh, userspace controlled data flushing is exactly what
sync_file_range() was intended for back when it was implemented in
2.6.17. Unfortunately, the implementation was completely botched
because it was written from a top down "clean the page cache"
perspective, not a bottom up filesystem data integrity mechanism, and
by the time we realised just how awful it was, there were applications
dependent on its existing behaviour....

> I find our behavior as a group of filesystem developers on this matter
> slightly bi-polar - on the one hand, we wish to maintain implementation
> freedom for future performance improvements and don't wish to commit
> to existing behavior by documenting it. On the other hand, we wish
> not to break existing applications, whose expectations from
> filesystems are far from what filesystems guarantee in documentation.

Personally, I want the SOMC model to be explicitly documented so that
we can sanely discuss how we can provide sane optimisations to
userspace. It's the first step towards a model where applications can
run filesystem operations completely asynchronously yet still provide
large scale ordering and integrity guarantees without needing copious
amounts of fine-grained fsync operations.(**)

I really don't care about the crazy vagaries of POSIX right now -
POSIX is a shit specification when it comes to integrity. The sooner
we move beyond it, the better off we'll be. And the beauty of the
SOMC model is that POSIX compliance falls out of it for free, yet it
allows us much more freedom for optimisation, because we can reason
about integrity in terms of ordering and dependencies rather than in
terms of what fsync() must provide.
> There is no one good answer that fits all aspects of this subject and I
> personally agree with Ted on not wanting to document the ext4 "hacks"
> that are meant to cater to misbehaving applications.

Applications "misbehave" largely because there is no definitive
documentation on what filesystems actually provide userspace. The
man pages document API behaviour; they /can't/ document things like
SOMC, which filesystems can provide it and how to use it to avoid
fsync()....

> I think it is good that Jayashree posted this patch as a basis for discussion
> of what needs to be documented and how.
> Eventually, instead of trying to formalize filesystem expected behavior, it
> might be better to just encode the expected crash behavior tests
> in a readable manner, as Jayashree already started to do.
> Or maybe there is room for both documentation and tests.

It needs documentation. Crash tests do not document algorithms'
behaviour, intentions, application programming models, constraints,
etc....

Cheers,

Dave.

(*) Using sync_file_range() for sub-file ranges is simply broken
when it comes to data integrity style flushes, as there is no
guarantee it will capture all the dirty ranges that need to be
flushed (e.g. write starting 100kb beyond EOF, then sync the range
starting 100kb beyond EOF, and it won't sync the sub-block zeroing
that was done at the old EOF, thereby exposing stale data....)

(**) That featherstitch paper I linked to earlier? Did you notice
the userspace defined "patch group" transaction interface?

http://featherstitch.cs.ucla.edu/
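To make the non-AIO O_DIRECT example above concrete from the
application's side, here is a minimal C sketch of the
write-then-rename sequence. It assumes the *proposed* SOMC
data-ordering rule from the discussion above, which no current
filesystem documents as a guarantee; the path names, the 4096-byte
size, and the error handling are illustrative only:

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;

	/* O_DIRECT I/O must be aligned; 4096 covers most devices. */
	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 'x', 4096);

	int fd = open("A/tmpfile", O_CREAT | O_WRONLY | O_DIRECT, 0644);
	if (fd < 0)
		return 1;

	/* Synchronous O_DIRECT: when pwrite() returns, the data has
	 * been issued to and completed by the device, i.e. "issued
	 * and completed" in the sense of the rule above (volatile in
	 * the device cache, not necessarily persistent). */
	if (pwrite(fd, buf, 4096, 0) != 4096)
		return 1;
	close(fd);

	/* Under the proposed rule, if this rename is observed after
	 * a crash, the data written above is observed too -- no
	 * fsync()/fdatasync() is needed for *ordering*, only for
	 * durability of the rename itself. */
	if (rename("A/tmpfile", "A/file"))
		return 1;
	return 0;
}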
On Mon, Mar 18, 2019 at 09:37:28PM -0500, Vijay Chidambaram wrote:
> For new folks on the thread, I'm Vijay Chidambaram, prof at UT Austin
> and Jayashree's advisor. We recently developed CrashMonkey, a tool for
> finding crash-consistency bugs in file systems. As part of the
> research effort, we had a lot of conversations with file-system
> developers to understand the guarantees provided by different file
> systems. This patch was inspired by the thought that we should quickly
> document what we know about the data integrity guarantees of different
> file systems. We did not expect to spur debate!
>
> Thanks Dave, Amir, and Ted for the discussion. We will incorporate
> these comments into the next patch. If it is better to wait until a
> consensus is reached after the LSF meeting, we'd be happy to do so.
>
> On Mon, Mar 18, 2019 at 2:14 AM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> > > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > > > > +Strictly Ordered Metadata Consistency
> > > > > > > > +-------------------------------------
> > > > > > > > +With each file system providing varying levels of persistence
> > > > > > > > +guarantees, a consensus in this regard, will benefit application
> > > > > > > > +developers to work with certain fixed assumptions about file system
> > > > > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > > > > +
> > > > > > > > +Under this scheme, the file system guarantees to persist all previous
> > > > > > > > +dependent modifications to the object upon fsync(). If you fsync() an
> > > > > > > > +inode, it will persist all the changes required to reference the inode
> > > > > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > > > > +
> > > > > > > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > > > > > > +user after recovery without also observing op1.
> > > > > > > > +
> > > > > > > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > > > > > > +share a dependency, which could be file-system specific. It might
> > > > > > > > +require a developer to understand file-system internals to know if
> > > > > > > > +SOMC would order one operation before another.
> > > > > > >
> > > > > > > That's largely an internal implementation detail, and users should
> > > > > > > not have to care about the internal implementation because the
> > > > > > > fundamental dependencies are all defined by the directory hierarchy
> > > > > > > relationships that users can see and manipulate.
> > > > > > >
> > > > > > > i.e. fs internal dependencies only increase the size of the graph
> > > > > > > that is persisted, but it will never be reduced to less than what
> > > > > > > the user can observe in the directory hierarchy.
> > > > > > >
> > > > > > > So this can be further refined:
> > > > > > >
> > > > > > > 	If op1 precedes op2 in program order (in-memory execution
> > > > > > > 	order), and op1 and op2 share a user visible reference, then
> > > > > > > 	op2 must not be observed by a user after recovery without
> > > > > > > 	also observing op1.
> > > > > > >
> > > > > > > e.g. in the case of the parent directory - the parent has a link
> > > > > > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > > > > > operation in a directory modifies a user visible link count
> > > > > > > reference. Hence fsync of one of those children will persist the
> > > > > > > directory link count, and then all of the other preceding
> > > > > > > transactions that modified the link count also need to be persisted.

> Dave, how did SOMC come about? Even XFS persists more than the minimum
> set required by SOMC. Is SOMC most useful as a sort of intuitive
> guideline as to what users should expect to see after recovery?

Lots of things. 15+ years of fixing data and metadata recovery
ordering bugs in XFS, 20 years of reading academic filesystem papers,
many years of hating POSIX and believing we should be aiming more
towards database ACID semantics in our filesystems, deciding ~10
years ago that maintainable integrity is far more important than
performance, understanding the block layer/device integrity
requirements and the model smarter people than me came up with for
ensuring integrity with minimal loss of performance, etc.

A big influence has also been that the "crash lost data" bug reports
we get from users are generally not a result of data being lost, they
are a result of incomplete and/or inconsistent recreation of the
state before the crash occurred. e.g. files that exist with a
non-zero size but have no data in them, even though it had been
minutes between writing the data and crashing, and other files were
just fine.

i.e. people don't tend to notice "stuff I just wrote is missing"
after a crash - they expect that. What they notice and complain
about is inconsistent state after recovery. e.g. file A was fine,
but file B was empty, even though I wrote file B before file A!

This is the sort of thing that Ted was referring to when he talked
about having to add hacks to ext4 to make sure certain "expected
behaviours" were maintained. ext4 inherited quite a few unrealistic
expectations from ext3, which had a much stricter data vs metadata
ordering model than ext4 does....

With XFS, the problems we've had with lost data/files have invariably
been a result of code that violated ordering semantics for what were
once considered performance benefits (hence my comment about
"integrity is more important than performance"). Those sorts of
problems (and there have been quite a few others w.r.t. the XFS
recovery algorithm) have all been solved by journalling all metadata
changes (hence strict ordering against other metadata), improving
the journal format and the information we log in it, and delaying
data-dependent metadata updates until after the data IO completes.

And from that perspective, SOMC is really just a further
generalisation of the dependency and atomicity model that underlies
the existing XFS transaction engine.

> I found your notes about POSIX SIO interesting, and will incorporate
> it into the next version of the patch. Should POSIX SIO be agreed upon
> between file systems as the set of guarantees to provide (especially
> since this is what glibc assumes)? I think SOMC is stronger than POSIX
> SIO.
SOMC is stronger than POSIX SIO. POSIX SIO is still a horribly
ambiguous standard, even though it does define "data integrity" and
"file integrity" in a meaningful manner. It's an improvement, but I
still think it is terrible from efficiency and performance
perspectives.

> > The more complex question has to do with an explicit "data
> > dependency" operation. At the moment, I will not explain what that
> > means in detail, but I am sure you can figure it out.
> > With fdatasync+rename, fdatasync created a dependency between
> > data and metadata of the file, so with SOMC, if the file is observed
> > after a crash at the rename destination, it also contains the data
> > changes made before fdatasync. But fdatasync gives a stronger
> > guarantee than what my application actually needs, because in many
> > cases it will cause a journal flush. What it really needs is
> > filemap_write_and_wait(). Metadata doesn't need to be flushed as
> > rename takes care of metadata ordering guarantees.
> > As far as I can tell, there is no "official" API to do what I need
> > and there is certainly no documentation about this expected behavior.
> > Apologies if the above was not clear, I promise to explain in person
> > during LSF to whoever is interested.
>
> At the risk of being ambiguous in the same way as Amir:
>
> Some applications may only care about ordering of metadata operations,
> not whether they are persistent. Application-level correctness is
> closely tied to ordering of different operations. Since SOMC gives us
> the guarantee that if operation X is seen after recovery, all
> dependent ops are also seen on recovery, this might be enough to
> create a consistent application. For example, an application may not
> care when file X was persisted to storage, as long as file Y was
> persisted before it.

*nod* Application developers have been asking for this sort of
integrity guarantee from filesystems for a long time. The problem
has always been that we've been unable to agree on a defined model
that allows us to guarantee such behaviour to userspace. Every ~5
years, somebody comes up with a new userspace transaction proposal
that ends up going nowhere because it cannot be applied to most of
the underlying linux filesystems without severe compromises.

However, this discussion is leading me to believe that the benefits
of having a well defined and documented behavioural model (such as
SOMC) are starting to be realised. i.e. a well defined model allows
kernel and userspace to optimise independently but still provide the
exact integrity semantics each other requires. And that we can
expose that model as a set of tests in fstests, hence enabling both
fs developers and users to understand where filesystems behave
according to the model and where they may need further improvement.

So I think we are definitely headed in the right direction here.
That said....

> All I'm asking for is documenting the minimal set of guarantees each
> file system already provides (or should provide in the absence of
> bugs). It is alright if the file system provides more than what is
> documented. The original patch does not talk about the rename hack
> that Ted mentions.

... I'm really not that interested in documenting the limitations of
existing filesystems because that is entirely backwards looking. I'm
looking forwards and aiming to provide a model that we can build
filesystems and applications around to fully exploit the performance
potential of modern storage hardware...

Cheers,

Dave.
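A minimal sketch of the ordering-only pattern Vijay describes,
assuming a SOMC filesystem. The helper name and file paths are
hypothetical, and fdatasync() stands in as today's portable
flush-and-wait, even though, as Amir notes, it is stronger than the
ordering guarantee actually requires:

#include <fcntl.h>
#include <unistd.h>

int commit_with_marker(const char *data_path, const char *marker_path,
		       const void *buf, size_t len)
{
	int fd = open(data_path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	/* Make the data durable before anything that follows. A
	 * cheaper flush-and-wait primitive would do for ordering
	 * alone, but no official API for that exists today. */
	if (fdatasync(fd)) {
		close(fd);
		return -1;
	}
	close(fd);

	/* Create the marker only after the data is durable. Under
	 * SOMC, observing the marker after recovery implies the
	 * data file written above is intact. */
	int mfd = open(marker_path, O_CREAT | O_WRONLY, 0644);
	if (mfd < 0)
		return -1;
	close(mfd);
	return 0;
}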
On Tue, Mar 19, 2019 at 5:13 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Mar 18, 2019 at 09:13:58AM +0200, Amir Goldstein wrote:
> > On Mon, Mar 18, 2019 at 12:16 AM Dave Chinner <david@fromorbit.com> wrote:
> > > On Fri, Mar 15, 2019 at 05:44:49AM +0200, Amir Goldstein wrote:
> > > > On Fri, Mar 15, 2019 at 5:03 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > On Thu, Mar 14, 2019 at 09:19:03AM +0200, Amir Goldstein wrote:
> > > > > > On Thu, Mar 14, 2019 at 3:19 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > > > On Tue, Mar 12, 2019 at 02:27:00PM -0500, Jayashree wrote:
> > > > > > > > +Strictly Ordered Metadata Consistency
> > > > > > > > +-------------------------------------
> > > > > > > > +With each file system providing varying levels of persistence
> > > > > > > > +guarantees, a consensus in this regard, will benefit application
> > > > > > > > +developers to work with certain fixed assumptions about file system
> > > > > > > > +guarantees. Dave Chinner proposed a unified model called the
> > > > > > > > +Strictly Ordered Metadata Consistency (SOMC) [5].
> > > > > > > > +
> > > > > > > > +Under this scheme, the file system guarantees to persist all previous
> > > > > > > > +dependent modifications to the object upon fsync(). If you fsync() an
> > > > > > > > +inode, it will persist all the changes required to reference the inode
> > > > > > > > +and its data. SOMC can be defined as follows [6]:
> > > > > > > > +
> > > > > > > > +If op1 precedes op2 in program order (in-memory execution order), and
> > > > > > > > +op1 and op2 share a dependency, then op2 must not be observed by a
> > > > > > > > +user after recovery without also observing op1.
> > > > > > > > +
> > > > > > > > +Unfortunately, SOMC's definition depends upon whether two operations
> > > > > > > > +share a dependency, which could be file-system specific. It might
> > > > > > > > +require a developer to understand file-system internals to know if
> > > > > > > > +SOMC would order one operation before another.
> > > > > > >
> > > > > > > That's largely an internal implementation detail, and users should
> > > > > > > not have to care about the internal implementation because the
> > > > > > > fundamental dependencies are all defined by the directory hierarchy
> > > > > > > relationships that users can see and manipulate.
> > > > > > >
> > > > > > > i.e. fs internal dependencies only increase the size of the graph
> > > > > > > that is persisted, but it will never be reduced to less than what
> > > > > > > the user can observe in the directory hierarchy.
> > > > > > >
> > > > > > > So this can be further refined:
> > > > > > >
> > > > > > > 	If op1 precedes op2 in program order (in-memory execution
> > > > > > > 	order), and op1 and op2 share a user visible reference, then
> > > > > > > 	op2 must not be observed by a user after recovery without
> > > > > > > 	also observing op1.
> > > > > > >
> > > > > > > e.g. in the case of the parent directory - the parent has a link
> > > > > > > count. Hence every create, unlink, rename, hard link, symlink, etc
> > > > > > > operation in a directory modifies a user visible link count
> > > > > > > reference. Hence fsync of one of those children will persist the
> > > > > > > directory link count, and then all of the other preceding
> > > > > > > transactions that modified the link count also need to be persisted.
> > > > > >
> > > > > > One thing that bothers me is that the definition of SOMC (as well as
> > > > > > your refined definition) doesn't mention fsync at all, but all the
> > > > > > examples only discuss use cases with fsync.
> > > > >
> > > > > You can't discuss operational ordering without a point in time to
> > > > > use as a reference for that ordering. SOMC behaviour is preserved
> > > > > at any point the filesystem checkpoints itself, and the only thing
> > > > > that changes is the scope of that checkpoint. fsync is just a
> > > > > convenient, widely understood, minimum dependency reference point
> > > > > that people can reason from. All the interesting ordering problems
> > > > > come from the minimum dependency reference point (i.e. fsync()),
> > > > > not from background filesystem-wide checkpoints.
> > > >
> > > > Yes, I was referring to rename as a commonly used operation used
> > > > by applications as a "metadata barrier".
> > >
> > > What is a "metadata barrier" and what are its semantics supposed to
> > > be?
> >
> > In this context I mean that the effects of metadata operations before the
> > barrier (e.g. setxattr, truncate) must be observed after a crash if the
> > effects of the barrier operation (e.g. file was renamed) are observed
> > after the crash.
>
> Ok, so you've just arbitrarily denoted a specific rename operation
> to be a "recovery barrier" for your application?
>
> In terms of SOMC, there is no operation that is an implied
> "barrier". There are explicitly ordered checkpoints via data
> integrity operations (i.e. sync, fsync, etc), but between those
> points it's just dependency based ordering...
>
> IOWs, if there is no direct relationship between two objects in the
> dependency graph, then the rename of one or the other does not
> create a "metadata ordering barrier" between those two objects. They
> are still independent, and so rename isn't a barrier in the true
> sense (i.e. that it is an ordering synchronisation point).
>
> At best rename can define a point in a dependency graph where an
> independent dependency branch is merged atomically into the main
> graph. This is still a powerful tool, and likely exactly what you
> are wanting to know if it will work or not....

Absolutely. The application only cares about atomicity of creating a
certain file/dir with specific size/xattrs with a certain name.

> > > > > > To my understanding, SOMC provides a guarantee that the application
> > > > > > does not need to do any fsync at all,
> > > > >
> > > > > Absolutely not true. If the application has atomic creation
> > > > > requirements that need multiple syscalls to set up, it must
> > > > > implement them itself and use fsync to synchronise data and metadata
> > > > > before the "atomic create" operation that makes it visible to the
> > > > > application.
> > > > >
> > > > > SOMC only guarantees what /metadata/ you see at a filesystem
> > > > > synchronisation point; it does not provide ACID semantics to a
> > > > > random set of system calls into the filesystem.
> > > >
> > > > So I re-state my claim above after having explained the use case.
> > >
> > > With words that I can only guess the meaning of.
> > >
> > > Amir, if you are asking a complex question as to whether something
> > > conforms to a specification, then please slow down and take the time
> > > to define all the terms, the initial state, the observable behaviour
> > > that you expect to see, etc in clear, unambiguous and well defined
> > > terms. Otherwise the question cannot be answered....
> >
> > Sure. TBH, I didn't even dare to ask the complex question yet,
> > because it was hard for me to define all the terms. I sketched the
> > use case with the example of create+setxattr+truncate+rename
> > because I figured it is rather easy to understand.
> >
> > The more complex question has to do with an explicit "data
> > dependency" operation. At the moment, I will not explain what that
> > means in detail, but I am sure you can figure it out.
> > With fdatasync+rename, fdatasync created a dependency between
> > data and metadata of the file, so with SOMC, if the file is observed
> > after a crash at the rename destination, it also contains the data
> > changes made before fdatasync. But fdatasync gives a stronger
> > guarantee than what my application actually needs, because in many
> > cases it will cause a journal flush. What it really needs is
> > filemap_write_and_wait(). Metadata doesn't need to be flushed as
> > rename takes care of metadata ordering guarantees.
>
> Ok, so what you are actually asking is whether SOMC provides a
> guarantee that data writes that have completed before the rename
> will be present on disk if the rename is present on disk? i.e.:
>
> create+setxattr+write()+fdatawait()+rename
>
> is atomic on a SOMC filesystem without a data integrity operation
> being performed?
>
> I don't think we've defined how data vs metadata ordering
> persistence works in the SOMC model at all. We've really only been
> discussing the metadata ordering and so I haven't really thought
> all the different cases through.
>
> OK, let's try to define how it works through examples. Let's start
> with the simple one: non-AIO O_DIRECT writes, because they send the
> data straight to the device. i.e.
>
> create
> setxattr
> write
>   Extent Allocation
>                         ----> device -+
>                                       data volatile
>                         <-- complete -+
>   write completion
> rename                  metadata volatile
>
> At this point, we may have no direct dependency between the
> write completion and the rename operation. Normally we would do
> (O_DSYNC case)
>
>   write completion
>     device cache flush
>                         ----> device -+
>                         <-- complete -+  data persisted
>     journal FUA write
>                         ----> device -+
>                         <-- complete -+  file metadata persisted
>
> and so we are guaranteed to have the data on disk before the rename
> is started (i.e. POSIX compliance). Hence regardless of whether the
> rename exists or not, we'll have the data on disk.
>
> However, if we require a data completion rule similar to the IO
> completion to device flush rule we have in the kernel:
>
> 	If data is to be ordered against a specific metadata
> 	operation, then the dependent data must be issued and
> 	completed before executing the ordering metadata operation.
> 	The application is responsible for ensuring the necessary
> 	data has been flushed to storage and signalled complete, but
> 	it does not need to ensure it is persistent.
>
> 	When the ordering metadata operation is to be made
> 	persistent, the filesystem must ensure the dependent data is
> 	persistent before starting the ordered metadata persistence
> 	operation. It must also ensure that any data dependent
> 	metadata is captured and persisted in the pending ordered
> 	metadata persistence operation so all the metadata required
> 	to access the dependent data is persisted correctly.
>
> Then we create the conditions where it is possible for data to be
> ordered amongst the metadata with the same ordering guarantees
> as the metadata. The above O_DIRECT example ends up as:
>
> create
> setxattr
> write
>   Extent Allocation     metadata volatile
>                         ----> device -+
>                                       data volatile
>                         <-- complete -+
>   write completion
> rename                  metadata volatile
> .....
> <journal flush>
>   device cache flush
>                         ----> device -+
>                         <-- complete -+  data persisted
>   journal FUA write
>                         ----> device -+
>                         <-- complete -+  metadata persisted
> <flush completion>
>
> With AIO based O_DIRECT, then we cannot issue the ordering rename
> until after the AIO completion has been delivered to the
> application. Once that has been delivered, then it is the same case
> as non-AIO O_DIRECT.
>
> Buffered IO is a bit harder, because we need flush-and-wait
> primitives that don't provide data integrity guarantees. So, after
> soundly smacking down the user of sync_file_range() this morning
> because it's not a data integrity operation and it has massive
> gaping holes in its behaviour, it may actually be useful here in a
> very limited scope.
>
> That is, sync_file_range() is only safe to use for this specific
> sort of ordered data integrity algorithm when flushing the entire
> file.(*)
>
> create
> setxattr
> write                   metadata volatile
>   delayed allocation    data volatile
> ....
> sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
>                 SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
>   Extent Allocation     metadata volatile
>                         ----> device -+
>                                       data volatile
>                         <-- complete -+
> ....
> rename                  metadata volatile
>
> And so at this point, we only need a device cache flush to
> make the data persistent and a journal flush to make the rename
> persistent. And so it ends up the same case as non-AIO O_DIRECT.
>

Funny, I once told that story and one Dave Chinner told me
"Nice story, but wrong.":
https://patchwork.kernel.org/patch/10576303/#22190719

You pointed to the minor detail that sync_file_range() uses
WB_SYNC_NONE.

So yes, I agree, it is a nice story and we need to make it right,
by having an API (perhaps SYNC_FILE_RANGE_ALL).
When you pointed out my mistake, I switched the application to
use the FIEMAP_FLAG_SYNC API as a hack.

> So, yeah, I think this model will work to order completed data
> writes against future metadata operations such that this is
> observed:
>
> 	If a metadata operation is performed after dependent data
> 	has been flushed and signalled complete to userspace, then
> 	if that metadata operation is present after recovery the
> 	dependent data will also be present.
>
> The good news here is that what I described above is exactly what
> XFS implements with its journal flushes - it uses REQ_PREFLUSH |
> REQ_FUA for journal writes, and so it follows the rules I outlined
> above. A quick grep shows that ext4/jbd2, f2fs and gfs2 also use
> the same flags for journal and/or critical ordering IO. I can't tell
> whether btrfs follows these rules or not.
>
> > As far as I can tell, there is no "official" API to do what I need
> > and there is certainly no documentation about this expected behavior.
>
> Oh, userspace controlled data flushing is exactly what
> sync_file_range() was intended for back when it was implemented in
> 2.6.17.
>
> Unfortunately, the implementation was completely botched because it
> was written from a top down "clean the page cache" perspective, not
> a bottom up filesystem data integrity mechanism, and by the time we
> realised just how awful it was there were applications dependent on
> its existing behaviour....
>

Thanks a lot, Dave, for taking the time to fill in the gaps in my
sketchy requirement and for the detailed answer.
Besides tests and documentation, what could be useful is a portable
user-space library that just does the right thing for every
filesystem.

For example, safe_rename() could be properly documented and is all
the application developer should really care about. The default
implementation just does fdatasync() before rename, and from there
things can only improve based on the underlying filesystem and
available kernel APIs.

I am not volunteering to write that library, but I'd be happy to
write the patch/tests/man page for the SYNC_FILE_RANGE_ALL API, or
whatever we want to call it, if we can agree that it is needed.

Thanks!
Amir.
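A sketch of what that portable fallback could look like.
safe_rename() is the hypothetical helper name from the paragraph
above, not an existing library API, and the error handling is
illustrative only:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int safe_rename(const char *src, const char *dst)
{
	int fd = open(src, O_RDONLY);
	if (fd < 0)
		return -1;
	/* Portable default: make the source file's data durable
	 * before the rename. A smarter implementation could use a
	 * cheaper flush-and-wait primitive (e.g. the proposed
	 * SYNC_FILE_RANGE_ALL) on filesystems known to provide
	 * SOMC ordering. */
	if (fdatasync(fd)) {
		close(fd);
		return -1;
	}
	close(fd);
	return rename(src, dst);
}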
On Mon, Mar 18, 2019 at 09:37:28PM -0500, Vijay Chidambaram wrote:
> For new folks on the thread, I'm Vijay Chidambaram, prof at UT Austin
> and Jayashree's advisor. We recently developed CrashMonkey, a tool for
> finding crash-consistency bugs in file systems. As part of the
> research effort, we had a lot of conversations with file-system
> developers to understand the guarantees provided by different file
> systems. This patch was inspired by the thought that we should quickly
> document what we know about the data integrity guarantees of different
> file systems. We did not expect to spur debate!
>
> Thanks Dave, Amir, and Ted for the discussion. We will incorporate
> these comments into the next patch. If it is better to wait until a
> consensus is reached after the LSF meeting, we'd be happy to do so.

Something to consider is that certain side effects of what fsync(2)
or fdatasync(2) might drag into the jbd2 transaction might change if
we were to implement (for example) something like Daejun Park and
Dongkun Shin's "iJournaling: Fine-grained journaling for improving
the latency of fsync system call", published at Usenix ATC 2017:

https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf

That's an example of how, if we document synchronization behaviour
that goes beyond POSIX, it might change in the future. So if it gets
documented, applications might start becoming unreliable on FreeBSD,
MacOS, etc. And maybe as Linux developers we won't care about that,
since it increases Linux lock-in. Win! (If you think like Steve
Ballmer, anyway. :-)

But then if we were to implement something like incremental
journaling for fsync, and applications were to start assuming that
it would also work, application authors might complain that we had
broken their application. So they might call the new feature a *BUG*
which broke backwards compatibility, and then demand that we either
withdraw the new feature, or complicate our testing matrix by adding
Yet Another Mount Option. (That's especially true since iJournaling
is a performance improvement that doesn't require an on-disk format
change. So this is the sort of thing that we might want to enable by
default eventually, even if initially it's only enabled via a mount
option while we are stabilizing the new feature.)

So my concerns are not theoretical, abstract ones, but something
which is very real. Implementing something like what Park and Shin
have proposed is very much the sort of thing that we are thinking
about.

					- Ted
On Tue, Mar 19, 2019 at 09:35:19AM +0200, Amir Goldstein wrote:
> On Tue, Mar 19, 2019 at 5:13 AM Dave Chinner <david@fromorbit.com> wrote:
> > That is, sync_file_range() is only safe to use for this specific
> > sort of ordered data integrity algorithm when flushing the entire
> > file.(*)
> >
> > create
> > setxattr
> > write                   metadata volatile
> >   delayed allocation    data volatile
> > ....
> > sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
> >                 SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
> >   Extent Allocation     metadata volatile
> >                         ----> device -+
> >                                       data volatile
> >                         <-- complete -+
> > ....
> > rename                  metadata volatile
> >
> > And so at this point, we only need a device cache flush to
> > make the data persistent and a journal flush to make the rename
> > persistent. And so it ends up the same case as non-AIO O_DIRECT.
> >
>
> Funny, I once told that story and one Dave Chinner told me
> "Nice story, but wrong.":
> https://patchwork.kernel.org/patch/10576303/#22190719
>
> You pointed to the minor detail that sync_file_range() uses
> WB_SYNC_NONE.

Ah, I forgot about that. That's what I get for not looking at the
code. Did I mention that SFR is a complete crock of shit when it
comes to data integrity operations? :/

> So yes, I agree, it is a nice story and we need to make it right,
> by having an API (perhaps SYNC_FILE_RANGE_ALL).
> When you pointed out my mistake, I switched the application to
> use the FIEMAP_FLAG_SYNC API as a hack.

Yeah, that's a nasty hack :/

> Besides tests and documentation, what could be useful is a portable
> user-space library that just does the right thing for every
> filesystem.

*nod* But before that, we need the model to be defined and
documented. And once we have a library, the fun part of convincing
the world that it should be the glibc default behaviour can begin....

Cheers,

Dave.
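For reference, the FIEMAP_FLAG_SYNC hack mentioned above could look
roughly like the sketch below. It abuses a side effect of the fiemap
ioctl: with FIEMAP_FLAG_SYNC set, the kernel writes back the file's
dirty data before mapping extents, which gives a flush-and-wait on
the whole file without an explicit journal flush. The helper name is
hypothetical; this is a workaround, not a supported interface:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int flush_file_data(int fd)
{
	struct fiemap fm;

	memset(&fm, 0, sizeof(fm));
	fm.fm_start = 0;
	fm.fm_length = FIEMAP_MAX_OFFSET;	/* whole file */
	fm.fm_flags = FIEMAP_FLAG_SYNC;		/* write back dirty data first */
	fm.fm_extent_count = 0;			/* we only want the side effect */

	return ioctl(fd, FS_IOC_FIEMAP, &fm);
}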
On Tue, Mar 19, 2019 at 11:17:09AM -0400, Theodore Ts'o wrote:
> On Mon, Mar 18, 2019 at 09:37:28PM -0500, Vijay Chidambaram wrote:
> > For new folks on the thread, I'm Vijay Chidambaram, prof at UT Austin
> > and Jayashree's advisor. We recently developed CrashMonkey, a tool for
> > finding crash-consistency bugs in file systems. As part of the
> > research effort, we had a lot of conversations with file-system
> > developers to understand the guarantees provided by different file
> > systems. This patch was inspired by the thought that we should quickly
> > document what we know about the data integrity guarantees of different
> > file systems. We did not expect to spur debate!
> >
> > Thanks Dave, Amir, and Ted for the discussion. We will incorporate
> > these comments into the next patch. If it is better to wait until a
> > consensus is reached after the LSF meeting, we'd be happy to do so.
>
> Something to consider is that certain side effects of what fsync(2)
> or fdatasync(2) might drag into the jbd2 transaction might change if
> we were to implement (for example) something like Daejun Park and
> Dongkun Shin's "iJournaling: Fine-grained journaling for improving
> the latency of fsync system call", published at Usenix ATC 2017:
>
> https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf
>
> That's an example of how, if we document synchronization behaviour
> that goes beyond POSIX, it might change in the future.

Sure, but again this is orthogonal to what we are discussing here:
the user visible ordering of metadata operations after a crash.

If anyone implements a multi-segment or per-inode journal (say, like
NOVA), then it is up to that implementation to maintain the ordering
guarantees that a SOMC model requires. You can implement whatever
fsync() go-fast bits you want, as long as it provides the ordering
behaviour guarantees that the model defines.

IOWs, Ted, I think you have the wrong end of the stick here. This
isn't about optimising fsync() to provide better performance, it's
about guaranteeing order so that fsync() is not necessary and we
improve performance by allowing applications to omit order-only
synchronisation points in their workloads.

i.e. an order-based integrity model /reduces/ the need for a
hyper-optimised fsync operation because applications won't need to
use it as often.

Cheers,

Dave.
diff --git a/Documentation/filesystems/crash-recovery-guarantees.txt b/Documentation/filesystems/crash-recovery-guarantees.txt
new file mode 100644
index 0000000..be84964
--- /dev/null
+++ b/Documentation/filesystems/crash-recovery-guarantees.txt
@@ -0,0 +1,193 @@
+=====================================================================
+File System Crash-Recovery Guarantees
+=====================================================================
+Linux file systems provide certain guarantees to user-space
+applications about what happens to their data if the system crashes
+(due to power loss or kernel panic). These are termed crash-recovery
+guarantees.
+
+Crash-recovery guarantees only pertain to data or metadata that has
+been explicitly persisted to storage with the fsync(), fdatasync(), or
+sync() system calls. By default, write(), mkdir(), and other
+file-system related system calls only affect the in-memory state of
+the file system.
+
+The crash-recovery guarantees provided by most Linux file systems are
+significantly stronger than what is required by POSIX. POSIX is vague,
+even allowing fsync() to do nothing (Mac OSX takes advantage of
+this). However, the guarantees provided by file systems are not
+documented, and vary between file systems. This document seeks to
+describe the current crash-recovery guarantees provided by major Linux
+file systems.
+
+What does the fsync() operation guarantee?
+----------------------------------------------------
+The fsync() operation is meant to force the physical write of data
+corresponding to a file from the buffer cache, along with the file
+metadata. Note that the guarantees mentioned for each file system below
+are in addition to the ones provided by POSIX.
+
+POSIX
+-----
+fsync(file) : Flushes the data and metadata associated with the
+file. However, if the directory entry for the file has not been
+previously persisted, or has been modified, it is not guaranteed to be
+persisted by the fsync of the file [1]. What this means is, if a file
+is newly created, you will have to fsync(parent directory) in addition
+to fsync(file) in order to ensure that the file's directory entry has
+safely reached the disk.
+
+fsync(dir) : Flushes directory data and directory entries. However, if
+you created a new file within the directory and wrote data to the
+file, then the file data is not guaranteed to be persisted, unless an
+explicit fsync() is issued on the file.
+
+ext4
+-----
+fsync(file) : Ensures that a newly created file's directory entry is
+persisted (no need to explicitly persist the parent directory). However,
+if you create multiple names for the file (hard links), then their directory
+entries are not guaranteed to persist unless each one of the parent
+directory entries is persisted [2].
+
+fsync(dir) : All file names within the persisted directory will exist,
+but does not guarantee file data.
+
+xfs
+----
+fsync(file) : Ensures that a newly created file's directory entry is
+persisted. Additionally, all the previous dependent modifications to
+this file are also persisted. If any file shares an object
+modification dependency with the fsync-ed file, then that file's
+directory entry is also persisted.
+
+fsync(dir) : All file names within the persisted directory will exist,
+but does not guarantee file data. As with files, fsync(dir) also persists
+previous dependent metadata operations.
+
+btrfs
+------
+fsync(file) : Ensures that a newly created file's directory entry
+is persisted, along with the directory entries of all its hard links.
+You do not need to explicitly fsync individual hard links to the file.
+
+fsync(dir) : All the file names within the directory will persist. All the
+rename and unlink operations within the directory are persisted. Due
+to the design choices made by btrfs, fsync of a directory could lead
+to an iterative fsync on sub-directories, thereby requiring a full
+file system commit. So btrfs does not advocate fsync of directories
+[2].
+
+F2FS
+----
+fsync(file) or fsync(dir) : In the default mode (fsync_mode=posix),
+F2FS only guarantees POSIX behaviour. However, it provides xfs-like
+guarantees if mounted with the fsync_mode=strict option.
+
+fsync(symlink)
+-------------
+A symlink inode cannot be directly opened for IO, which means there is
+no such thing as fsync of a symlink [3]. You could be tricked by the
+fact that open and fsync of a symlink succeed without returning an
+error, but what happens in reality is as follows.
+
+Suppose we have a symlink “foo”, which points to the file “A/bar”
+
+fd = open(“foo”, O_CREAT | O_RDWR)
+fsync(fd)
+
+Both the above operations succeed, but if you crash after fsync, the
+symlink could still be missing.
+
+When you try to open the symlink “foo”, you are actually trying to
+open the file that the symlink resolves to, which in this case is
+“A/bar”. When you fsync the inode returned by the open system call, you
+are actually persisting the file “A/bar” and not the symlink. Note
+that if the file “A/bar” does not exist and you try to open the
+symlink “foo” without the O_CREAT flag, then the open will fail. To
+obtain the file descriptor associated with the symlink inode, you
+could open the symlink using the “O_PATH | O_NOFOLLOW” flags. However,
+the file descriptor obtained this way can only be used to indicate a
+location in the file-system tree and to perform operations that act
+purely at the file descriptor level. Operations like read(), write(),
+fsync() etc. cannot be performed on such file descriptors.
+
+Bottom line : You cannot fsync() a symlink.
+
+fsync(special files)
+--------------------
+Special files in Linux include block and character device files
+(created using mknod), FIFOs (created using mkfifo) etc. Just like the
+behavior of fsync on symlinks described above, these special files do
+not have an fsync function defined. Similar to symlinks, you
+cannot fsync a special file [4].
+
+
+Strictly Ordered Metadata Consistency
+-------------------------------------
+With each file system providing varying levels of persistence
+guarantees, a consensus in this regard would benefit application
+developers, who could then work with certain fixed assumptions about
+file system guarantees. Dave Chinner proposed a unified model called
+Strictly Ordered Metadata Consistency (SOMC) [5].
+
+Under this scheme, the file system guarantees to persist all previous
+dependent modifications to the object upon fsync(). If you fsync() an
+inode, it will persist all the changes required to reference the inode
+and its data. SOMC can be defined as follows [6]:
+
+If op1 precedes op2 in program order (in-memory execution order), and
+op1 and op2 share a dependency, then op2 must not be observed by a
+user after recovery without also observing op1.
+
+Unfortunately, SOMC's definition depends upon whether two operations
+share a dependency, which could be file-system specific. It might
+require a developer to understand file-system internals to know if
+SOMC would order one operation before another. It is worth noting
+that a file system can be crash-consistent (according to POSIX)
+without providing SOMC [7].
+
+As an example, consider the following test case from xfstests
+generic/342 [8]
+-------
+touch A/foo
+echo “hello” > A/foo
+sync
+
+mv A/foo A/bar
+echo “world” > A/foo
+fsync A/foo
+CRASH
+
+What would you expect on recovery if the file system crashed after
+the final fsync returned successfully?
+
+Non-SOMC file systems will not persist the file
+A/bar because it was not explicitly fsync-ed. But this means you will
+find only the file A/foo with data “world” after the crash, thereby
+losing the previously persisted file with data “hello”. You will need
+to explicitly fsync the directory A to ensure the rename operation is
+safely persisted on disk.
+
+Under SOMC, to correctly reference the new inode via A/foo,
+the previous rename operation must persist as well. Therefore,
+fsync() of A/foo will persist the renamed file A/bar as well.
+On recovery you will find both A/bar (with data “hello”)
+and A/foo (with data “world”).
+
+It is noteworthy that xfs, ext4, F2FS (when mounted with fsync_mode=strict)
+and btrfs provide SOMC-like behaviour in this particular example.
+However, in writing, only XFS claims to provide SOMC. F2FS aims to provide
+SOMC when mounted with fsync_mode=strict. It is not clear if ext4 and
+btrfs provide strictly ordered metadata consistency.
+
+--------------------------------------------------------
+[1] http://man7.org/linux/man-pages/man2/fdatasync.2.html
+[2] https://www.spinics.net/lists/linux-btrfs/msg77340.html
+[3] https://www.spinics.net/lists/fstests/msg09370.html
+[4] https://bugzilla.kernel.org/show_bug.cgi?id=202485
+[5] https://marc.info/?l=fstests&m=155010885626284&w=2
+[6] https://marc.info/?l=fstests&m=155011123126916&w=2
+[7] https://www.spinics.net/lists/fstests/msg09379.html
+[8] https://patchwork.kernel.org/patch/10132305/
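As a concrete illustration of the fsync(symlink) pitfall described in
the document above, here is a minimal C sketch using the “foo” ->
“A/bar” example. The EBADF failure for the O_PATH descriptor is the
expected outcome given the file-descriptor-level restrictions noted
in [3]; paths and error handling are illustrative only:

#define _GNU_SOURCE		/* for O_PATH */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* This resolves the symlink and opens (or creates) the
	 * *target* A/bar, not the symlink itself... */
	int fd = open("foo", O_CREAT | O_RDWR, 0644);
	if (fd >= 0) {
		fsync(fd);	/* persists A/bar, not the symlink */
		close(fd);
	}

	/* ...while an O_PATH descriptor refers to the symlink inode
	 * itself, but fsync() on it fails, so the symlink still
	 * cannot be fsync()ed. */
	int pfd = open("foo", O_PATH | O_NOFOLLOW);
	if (pfd >= 0) {
		if (fsync(pfd) < 0)
			perror("fsync(O_PATH fd)");	/* expected: EBADF */
		close(pfd);
	}
	return 0;
}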