Message ID | 20190304140622.23997-1-fdmanana@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | generic: add test for fsync after shrinking truncate and rename | expand |
On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > From: Filipe Manana <fdmanana@suse.com> > > Test that if we truncate a file to reduce its size, rename it and then > fsync it, after a power failure the file has a correct size and name. > I am not sure that ext4/xfs semantics guaranty anything about persisting file name after fsync of file?... > This test is motivated by a bug found in btrfs, which is fixed by a > patch for the linux kernel titled: > > "Btrfs: fix incorrect file size after shrinking truncate and fsync" At least the title of this patch says nothing about persisting the name. > > This test currently passes on ext4, xfs, f2fs and patched btrfs. > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > tests/generic/532 | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++ > tests/generic/532.out | 8 +++++++ > tests/generic/group | 1 + > 3 files changed, 75 insertions(+) > create mode 100755 tests/generic/532 > create mode 100644 tests/generic/532.out > > diff --git a/tests/generic/532 b/tests/generic/532 > new file mode 100755 > index 00000000..64992d85 > --- /dev/null > +++ b/tests/generic/532 > @@ -0,0 +1,66 @@ > +#! /bin/bash > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved. > +# > +# FS QA Test No. 532 > +# > +# Test that if we truncate a file to reduce its size, rename it and then fsync > +# it, after a power failure the file has a correct size and name. > +# > +seq=`basename $0` > +seqres=$RESULT_DIR/$seq > +echo "QA output created by $seq" > +tmp=/tmp/$$ > +status=1 # failure is the default! > +trap "_cleanup; exit \$status" 0 1 2 3 15 > + > +_cleanup() > +{ > + _cleanup_flakey > + cd / > + rm -f $tmp.* > +} > + > +# get standard environment, filters and checks > +. ./common/rc > +. ./common/filter > +. ./common/dmflakey > + > +# real QA test starts here > +_supported_fs generic > +_supported_os Linux > +_require_scratch > +_require_dm_target flakey > + > +rm -f $seqres.full > + > +_scratch_mkfs >>$seqres.full 2>&1 > +_require_metadata_journaling $SCRATCH_DEV > +_init_flakey > +_mount_flakey > + > +# Create our test file with an initial size of 8000 bytes, then fsync it, > +# followed by a truncate that reduces its size down to 3000 bytes. > +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 8000" \ > + -c "fsync" \ > + -c "truncate 3000" \ > + $SCRATCH_MNT/foo | _filter_xfs_io > + > +# Now rename the file and fsync it again. > +mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar > +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar > + > +# Simulate a power failure and mount the filesystem to check that the file was > +# persisted with the new name and has a size of 3000 bytes. > +_flakey_drop_and_remount > + > +[ -f $SCRATCH_MNT/bar ] || echo "file name 'bar' is missing" > +[ -f $SCRATCH_MNT/foo ] && echo "file name 'foo' still exists" > + > +echo "File content after power failure:" > +od -A d -t x1 $SCRATCH_MNT/bar > + > +_unmount_flakey > + > +status=0 > +exit > diff --git a/tests/generic/532.out b/tests/generic/532.out > new file mode 100644 > index 00000000..554fbe2a > --- /dev/null > +++ b/tests/generic/532.out > @@ -0,0 +1,8 @@ > +QA output created by 532 > +wrote 8000/8000 bytes at offset 0 > +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > +File content after power failure: > +0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab > +* > +0002992 ab ab ab ab ab ab ab ab > +0003000 > diff --git a/tests/generic/group b/tests/generic/group > index 31011ac8..7d63f303 100644 > --- a/tests/generic/group > +++ b/tests/generic/group > @@ -534,3 +534,4 @@ > 529 auto quick attr > 530 auto quick unlink > 531 auto quick unlink > +532 auto quick log > -- > 2.11.0 >
On Mon, Mar 4, 2019 at 3:04 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > From: Filipe Manana <fdmanana@suse.com> > > > > Test that if we truncate a file to reduce its size, rename it and then > > fsync it, after a power failure the file has a correct size and name. > > > > I am not sure that ext4/xfs semantics guaranty anything about > persisting file name after fsync of file?... > > > This test is motivated by a bug found in btrfs, which is fixed by a > > patch for the linux kernel titled: > > > > "Btrfs: fix incorrect file size after shrinking truncate and fsync" > > At least the title of this patch says nothing about persisting the > name. Because the bug in btrfs is not about persisting the name, it's about persisting the correct inode size. > > > > > This test currently passes on ext4, xfs, f2fs and patched btrfs. > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > --- > > tests/generic/532 | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++ > > tests/generic/532.out | 8 +++++++ > > tests/generic/group | 1 + > > 3 files changed, 75 insertions(+) > > create mode 100755 tests/generic/532 > > create mode 100644 tests/generic/532.out > > > > diff --git a/tests/generic/532 b/tests/generic/532 > > new file mode 100755 > > index 00000000..64992d85 > > --- /dev/null > > +++ b/tests/generic/532 > > @@ -0,0 +1,66 @@ > > +#! /bin/bash > > +# SPDX-License-Identifier: GPL-2.0 > > +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved. > > +# > > +# FS QA Test No. 532 > > +# > > +# Test that if we truncate a file to reduce its size, rename it and then fsync > > +# it, after a power failure the file has a correct size and name. > > +# > > +seq=`basename $0` > > +seqres=$RESULT_DIR/$seq > > +echo "QA output created by $seq" > > +tmp=/tmp/$$ > > +status=1 # failure is the default! > > +trap "_cleanup; exit \$status" 0 1 2 3 15 > > + > > +_cleanup() > > +{ > > + _cleanup_flakey > > + cd / > > + rm -f $tmp.* > > +} > > + > > +# get standard environment, filters and checks > > +. ./common/rc > > +. ./common/filter > > +. ./common/dmflakey > > + > > +# real QA test starts here > > +_supported_fs generic > > +_supported_os Linux > > +_require_scratch > > +_require_dm_target flakey > > + > > +rm -f $seqres.full > > + > > +_scratch_mkfs >>$seqres.full 2>&1 > > +_require_metadata_journaling $SCRATCH_DEV > > +_init_flakey > > +_mount_flakey > > + > > +# Create our test file with an initial size of 8000 bytes, then fsync it, > > +# followed by a truncate that reduces its size down to 3000 bytes. > > +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 8000" \ > > + -c "fsync" \ > > + -c "truncate 3000" \ > > + $SCRATCH_MNT/foo | _filter_xfs_io > > + > > +# Now rename the file and fsync it again. > > +mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar > > +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar > > + > > +# Simulate a power failure and mount the filesystem to check that the file was > > +# persisted with the new name and has a size of 3000 bytes. > > +_flakey_drop_and_remount > > + > > +[ -f $SCRATCH_MNT/bar ] || echo "file name 'bar' is missing" > > +[ -f $SCRATCH_MNT/foo ] && echo "file name 'foo' still exists" > > + > > +echo "File content after power failure:" > > +od -A d -t x1 $SCRATCH_MNT/bar > > + > > +_unmount_flakey > > + > > +status=0 > > +exit > > diff --git a/tests/generic/532.out b/tests/generic/532.out > > new file mode 100644 > > index 00000000..554fbe2a > > --- /dev/null > > +++ b/tests/generic/532.out > > @@ -0,0 +1,8 @@ > > +QA output created by 532 > > +wrote 8000/8000 bytes at offset 0 > > +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > > +File content after power failure: > > +0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab > > +* > > +0002992 ab ab ab ab ab ab ab ab > > +0003000 > > diff --git a/tests/generic/group b/tests/generic/group > > index 31011ac8..7d63f303 100644 > > --- a/tests/generic/group > > +++ b/tests/generic/group > > @@ -534,3 +534,4 @@ > > 529 auto quick attr > > 530 auto quick unlink > > 531 auto quick unlink > > +532 auto quick log > > -- > > 2.11.0 > >
On Mon, Mar 4, 2019 at 5:23 PM Filipe Manana <fdmanana@kernel.org> wrote: > > On Mon, Mar 4, 2019 at 3:04 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > Test that if we truncate a file to reduce its size, rename it and then > > > fsync it, after a power failure the file has a correct size and name. > > > > > > > I am not sure that ext4/xfs semantics guaranty anything about > > persisting file name after fsync of file?... > > > > > This test is motivated by a bug found in btrfs, which is fixed by a > > > patch for the linux kernel titled: > > > > > > "Btrfs: fix incorrect file size after shrinking truncate and fsync" > > > > At least the title of this patch says nothing about persisting the > > name. > > Because the bug in btrfs is not about persisting the name, it's about > persisting the correct inode size. > OK. Still AFAIK, ext4/xfs don't guaranty that the name bar will exist after power failure unless the parent dir was fsynced. If you fsync the parent dir instead of the file after rename, it will make the test correct for ext4/xfs (and POSIX for that matter). Will that test modification still catch the btrfs bug (pre fix)? Thanks, Amir. > > > > > > > > This test currently passes on ext4, xfs, f2fs and patched btrfs. > > > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > --- > > > tests/generic/532 | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++ > > > tests/generic/532.out | 8 +++++++ > > > tests/generic/group | 1 + > > > 3 files changed, 75 insertions(+) > > > create mode 100755 tests/generic/532 > > > create mode 100644 tests/generic/532.out > > > > > > diff --git a/tests/generic/532 b/tests/generic/532 > > > new file mode 100755 > > > index 00000000..64992d85 > > > --- /dev/null > > > +++ b/tests/generic/532 > > > @@ -0,0 +1,66 @@ > > > +#! /bin/bash > > > +# SPDX-License-Identifier: GPL-2.0 > > > +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved. > > > +# > > > +# FS QA Test No. 532 > > > +# > > > +# Test that if we truncate a file to reduce its size, rename it and then fsync > > > +# it, after a power failure the file has a correct size and name. > > > +# > > > +seq=`basename $0` > > > +seqres=$RESULT_DIR/$seq > > > +echo "QA output created by $seq" > > > +tmp=/tmp/$$ > > > +status=1 # failure is the default! > > > +trap "_cleanup; exit \$status" 0 1 2 3 15 > > > + > > > +_cleanup() > > > +{ > > > + _cleanup_flakey > > > + cd / > > > + rm -f $tmp.* > > > +} > > > + > > > +# get standard environment, filters and checks > > > +. ./common/rc > > > +. ./common/filter > > > +. ./common/dmflakey > > > + > > > +# real QA test starts here > > > +_supported_fs generic > > > +_supported_os Linux > > > +_require_scratch > > > +_require_dm_target flakey > > > + > > > +rm -f $seqres.full > > > + > > > +_scratch_mkfs >>$seqres.full 2>&1 > > > +_require_metadata_journaling $SCRATCH_DEV > > > +_init_flakey > > > +_mount_flakey > > > + > > > +# Create our test file with an initial size of 8000 bytes, then fsync it, > > > +# followed by a truncate that reduces its size down to 3000 bytes. > > > +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 8000" \ > > > + -c "fsync" \ > > > + -c "truncate 3000" \ > > > + $SCRATCH_MNT/foo | _filter_xfs_io > > > + > > > +# Now rename the file and fsync it again. > > > +mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar > > > +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar > > > + > > > +# Simulate a power failure and mount the filesystem to check that the file was > > > +# persisted with the new name and has a size of 3000 bytes. > > > +_flakey_drop_and_remount > > > + > > > +[ -f $SCRATCH_MNT/bar ] || echo "file name 'bar' is missing" > > > +[ -f $SCRATCH_MNT/foo ] && echo "file name 'foo' still exists" > > > + > > > +echo "File content after power failure:" > > > +od -A d -t x1 $SCRATCH_MNT/bar > > > + > > > +_unmount_flakey > > > + > > > +status=0 > > > +exit > > > diff --git a/tests/generic/532.out b/tests/generic/532.out > > > new file mode 100644 > > > index 00000000..554fbe2a > > > --- /dev/null > > > +++ b/tests/generic/532.out > > > @@ -0,0 +1,8 @@ > > > +QA output created by 532 > > > +wrote 8000/8000 bytes at offset 0 > > > +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > > > +File content after power failure: > > > +0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab > > > +* > > > +0002992 ab ab ab ab ab ab ab ab > > > +0003000 > > > diff --git a/tests/generic/group b/tests/generic/group > > > index 31011ac8..7d63f303 100644 > > > --- a/tests/generic/group > > > +++ b/tests/generic/group > > > @@ -534,3 +534,4 @@ > > > 529 auto quick attr > > > 530 auto quick unlink > > > 531 auto quick unlink > > > +532 auto quick log > > > -- > > > 2.11.0 > > >
On Mon, Mar 4, 2019 at 5:59 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, Mar 4, 2019 at 5:23 PM Filipe Manana <fdmanana@kernel.org> wrote: > > > > On Mon, Mar 4, 2019 at 3:04 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > Test that if we truncate a file to reduce its size, rename it and then > > > > fsync it, after a power failure the file has a correct size and name. > > > > > > > > > > I am not sure that ext4/xfs semantics guaranty anything about > > > persisting file name after fsync of file?... > > > > > > > This test is motivated by a bug found in btrfs, which is fixed by a > > > > patch for the linux kernel titled: > > > > > > > > "Btrfs: fix incorrect file size after shrinking truncate and fsync" > > > > > > At least the title of this patch says nothing about persisting the > > > name. > > > > Because the bug in btrfs is not about persisting the name, it's about > > persisting the correct inode size. > > > > OK. Still AFAIK, ext4/xfs don't guaranty that the name bar will exist > after power failure unless the parent dir was fsynced. > If you fsync the parent dir instead of the file after rename, it will > make the test correct for ext4/xfs (and POSIX for that matter). I'm not so sure about that. Couldn't find anything explicit anywhere about that. There are other tests that do similar assumptions, either after renames or after adding a new hard link (fsyncing just the file being enough to guarantee the new hard link exits after a power failure, without the need to fsync the parent directory). I don't recall ever having had such comment before, even for much more complex cases involving renames such as [1] for example. [1] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=e2d866668a98af4cd47cd6913418e410cd80aa93 In this case both old and new names are in the same parent directory, but if they were in different directories, and requiring the user/application to fsync both directories, it could be dangerous, if the old parent dir is fsync'ed and a power failure happened before fsync'ing the new parent (requiring the user/application to know about this and fsync the new parent before the old parent dir also seems very demanding), leading to a potential file loss after replaying the log/journal (there were such cases in btrfs before). > Will that test modification still catch the btrfs bug (pre fix)? Yes, it will still trigger the btrfs bug. Well, I don't mind doing such change, but I would like to hear other opinions too, perhaps Dave could shed some light on this? Even old reiserfs and f2fs (both on posix and strict fsync modes) currently pass this test. Thanks. > > Thanks, > Amir. > > > > > > > > > > > > This test currently passes on ext4, xfs, f2fs and patched btrfs. > > > > > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > > > --- > > > > tests/generic/532 | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++ > > > > tests/generic/532.out | 8 +++++++ > > > > tests/generic/group | 1 + > > > > 3 files changed, 75 insertions(+) > > > > create mode 100755 tests/generic/532 > > > > create mode 100644 tests/generic/532.out > > > > > > > > diff --git a/tests/generic/532 b/tests/generic/532 > > > > new file mode 100755 > > > > index 00000000..64992d85 > > > > --- /dev/null > > > > +++ b/tests/generic/532 > > > > @@ -0,0 +1,66 @@ > > > > +#! /bin/bash > > > > +# SPDX-License-Identifier: GPL-2.0 > > > > +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved. > > > > +# > > > > +# FS QA Test No. 532 > > > > +# > > > > +# Test that if we truncate a file to reduce its size, rename it and then fsync > > > > +# it, after a power failure the file has a correct size and name. > > > > +# > > > > +seq=`basename $0` > > > > +seqres=$RESULT_DIR/$seq > > > > +echo "QA output created by $seq" > > > > +tmp=/tmp/$$ > > > > +status=1 # failure is the default! > > > > +trap "_cleanup; exit \$status" 0 1 2 3 15 > > > > + > > > > +_cleanup() > > > > +{ > > > > + _cleanup_flakey > > > > + cd / > > > > + rm -f $tmp.* > > > > +} > > > > + > > > > +# get standard environment, filters and checks > > > > +. ./common/rc > > > > +. ./common/filter > > > > +. ./common/dmflakey > > > > + > > > > +# real QA test starts here > > > > +_supported_fs generic > > > > +_supported_os Linux > > > > +_require_scratch > > > > +_require_dm_target flakey > > > > + > > > > +rm -f $seqres.full > > > > + > > > > +_scratch_mkfs >>$seqres.full 2>&1 > > > > +_require_metadata_journaling $SCRATCH_DEV > > > > +_init_flakey > > > > +_mount_flakey > > > > + > > > > +# Create our test file with an initial size of 8000 bytes, then fsync it, > > > > +# followed by a truncate that reduces its size down to 3000 bytes. > > > > +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 8000" \ > > > > + -c "fsync" \ > > > > + -c "truncate 3000" \ > > > > + $SCRATCH_MNT/foo | _filter_xfs_io > > > > + > > > > +# Now rename the file and fsync it again. > > > > +mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar > > > > +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar > > > > + > > > > +# Simulate a power failure and mount the filesystem to check that the file was > > > > +# persisted with the new name and has a size of 3000 bytes. > > > > +_flakey_drop_and_remount > > > > + > > > > +[ -f $SCRATCH_MNT/bar ] || echo "file name 'bar' is missing" > > > > +[ -f $SCRATCH_MNT/foo ] && echo "file name 'foo' still exists" > > > > + > > > > +echo "File content after power failure:" > > > > +od -A d -t x1 $SCRATCH_MNT/bar > > > > + > > > > +_unmount_flakey > > > > + > > > > +status=0 > > > > +exit > > > > diff --git a/tests/generic/532.out b/tests/generic/532.out > > > > new file mode 100644 > > > > index 00000000..554fbe2a > > > > --- /dev/null > > > > +++ b/tests/generic/532.out > > > > @@ -0,0 +1,8 @@ > > > > +QA output created by 532 > > > > +wrote 8000/8000 bytes at offset 0 > > > > +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) > > > > +File content after power failure: > > > > +0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab > > > > +* > > > > +0002992 ab ab ab ab ab ab ab ab > > > > +0003000 > > > > diff --git a/tests/generic/group b/tests/generic/group > > > > index 31011ac8..7d63f303 100644 > > > > --- a/tests/generic/group > > > > +++ b/tests/generic/group > > > > @@ -534,3 +534,4 @@ > > > > 529 auto quick attr > > > > 530 auto quick unlink > > > > 531 auto quick unlink > > > > +532 auto quick log > > > > -- > > > > 2.11.0 > > > >
On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote: > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > From: Filipe Manana <fdmanana@suse.com> > > > > Test that if we truncate a file to reduce its size, rename it and then > > fsync it, after a power failure the file has a correct size and name. > > > > I am not sure that ext4/xfs semantics guaranty anything about > persisting file name after fsync of file?... They do. It's that pesky "strictly ordered metadata" thing I keep having to explain to people... i.e. if you fsync an inode, then you are persisting all the changes needed to reference that file and it's data. And so if there was a rename in the history of that file, then that is persisted, too. Which means that both the original and the new directory modifications are persisted, too. *POSIX* doesn't require this - it says that if you O_DSYNC data, then it also includes all the metadata needed to reference that data. So even if the data is there, POSIX doesn't define whether the rename is there or noti, just that you can get to the fsync'd data via either the old or new name. IOWs, POSIX allows the behaviour to be implementation specific. In this case, file systems with strictly ordered metadata will end up making the rename visible because the rename occurred before the truncate that the fsync() is persisting... Cheers, Dave.
On Tue, Mar 05, 2019 at 11:50:20AM +1100, Dave Chinner wrote: > On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote: > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > Test that if we truncate a file to reduce its size, rename it and then > > > fsync it, after a power failure the file has a correct size and name. > > > > > > > I am not sure that ext4/xfs semantics guaranty anything about > > persisting file name after fsync of file?... > > They do. It's that pesky "strictly ordered metadata" thing I keep > having to explain to people... https://marc.info/?l=fstests&m=155010885626284&w=2 -Dave. > > i.e. if you fsync an inode, then you are persisting all the changes > needed to reference that file and it's data. And so if there was a > rename in the history of that file, then that is persisted, too. > Which means that both the original and the new directory > modifications are persisted, too. > > *POSIX* doesn't require this - it says that if you O_DSYNC data, > then it also includes all the metadata needed to reference that > data. So even if the data is there, POSIX doesn't define whether the > rename is there or noti, just that you can get to the fsync'd data > via either the old or new name. IOWs, POSIX allows the behaviour to > be implementation specific. > > In this case, file systems with strictly ordered metadata will end > up making the rename visible because the rename occurred before the > truncate that the fsync() is persisting... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com >
On Mon, Mar 4, 2019 at 7:00 PM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 05, 2019 at 11:50:20AM +1100, Dave Chinner wrote: > > On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote: > > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > Test that if we truncate a file to reduce its size, rename it and then > > > > fsync it, after a power failure the file has a correct size and name. > > > > > > > > > > I am not sure that ext4/xfs semantics guaranty anything about > > > persisting file name after fsync of file?... > > > > They do. It's that pesky "strictly ordered metadata" thing I keep > > having to explain to people... > > https://marc.info/?l=fstests&m=155010885626284&w=2 With reference to that previous conversation, Jayashree and I are planning to submit a patch adding something like Documentation/filesystems/crash-recovery.txt to document crash guarantees and strictly ordered metadata semantics soon! So hopefully there will be in-kernel documentation we can point folks to soon.
On Tue, Mar 5, 2019 at 2:50 AM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote: > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > Test that if we truncate a file to reduce its size, rename it and then > > > fsync it, after a power failure the file has a correct size and name. > > > > > > > I am not sure that ext4/xfs semantics guaranty anything about > > persisting file name after fsync of file?... > > They do. It's that pesky "strictly ordered metadata" thing I keep > having to explain to people... > > i.e. if you fsync an inode, then you are persisting all the changes > needed to reference that file and it's data. And so if there was a > rename in the history of that file, then that is persisted, too. > Which means that both the original and the new directory > modifications are persisted, too. > > *POSIX* doesn't require this - it says that if you O_DSYNC data, > then it also includes all the metadata needed to reference that > data. So even if the data is there, POSIX doesn't define whether the > rename is there or noti, just that you can get to the fsync'd data > via either the old or new name. IOWs, POSIX allows the behaviour to > be implementation specific. > > In this case, file systems with strictly ordered metadata will end > up making the rename visible because the rename occurred before the > truncate that the fsync() is persisting... > That is not what is happening in Filipe's test. Test has: - ftruncate A - fsync A - rename A B - fsync B So the reason this is working is because 2nd fsync needs to persist ctime of B and not because it needs to persist the truncate. XFS does it, but it doesn't seem like something that any filesystem is guaranteed to do the same: /* * We always want to hit the ctime on the source inode. * * This isn't strictly required by the standards since the source * inode isn't really being changed, but old unix file systems did * it and some incremental backup programs won't work without it. */ xfs_trans_ichgtime(tp, src_ip, XFS_ICHGTIME_CHG); So for the purpose of the test itself, which needs to guaranty that btrfs persists the size, fsync of parent would be more robust for any filesystem. Thanks, Amir.
On Tue, Mar 5, 2019 at 12:31 AM Filipe Manana <fdmanana@kernel.org> wrote: > > On Mon, Mar 4, 2019 at 5:59 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > On Mon, Mar 4, 2019 at 5:23 PM Filipe Manana <fdmanana@kernel.org> wrote: > > > > > > On Mon, Mar 4, 2019 at 3:04 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > > > Test that if we truncate a file to reduce its size, rename it and then > > > > > fsync it, after a power failure the file has a correct size and name. > > > > > > > > > > > > > I am not sure that ext4/xfs semantics guaranty anything about > > > > persisting file name after fsync of file?... > > > > > > > > > This test is motivated by a bug found in btrfs, which is fixed by a > > > > > patch for the linux kernel titled: > > > > > > > > > > "Btrfs: fix incorrect file size after shrinking truncate and fsync" > > > > > > > > At least the title of this patch says nothing about persisting the > > > > name. > > > > > > Because the bug in btrfs is not about persisting the name, it's about > > > persisting the correct inode size. > > > > > > > OK. Still AFAIK, ext4/xfs don't guaranty that the name bar will exist > > after power failure unless the parent dir was fsynced. > > If you fsync the parent dir instead of the file after rename, it will > > make the test correct for ext4/xfs (and POSIX for that matter). > > I'm not so sure about that. Couldn't find anything explicit anywhere about that. man fsync(2): " Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed." > There are other tests that do similar assumptions, either after > renames or after adding a new > hard link (fsyncing just the file being enough to guarantee the new > hard link exits after a power > failure, without the need to fsync the parent directory). I don't > recall ever having had such comment before, Because I missed it in review... > even for much more complex cases involving renames > such as [1] for example. > > [1] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=e2d866668a98af4cd47cd6913418e410cd80aa93 > > In this case both old and new names are in the same parent directory, > but if they were in different directories, and > requiring the user/application to fsync both directories, it could be > dangerous, if the old parent dir is fsync'ed and > a power failure happened before fsync'ing the new parent (requiring > the user/application to know about this > and fsync the new parent before the old parent dir also seems very > demanding), leading to a potential file loss > after replaying the log/journal (there were such cases in btrfs before). > This is not a problem for ext4/xfs. fsync(newdir) guarantees persisting new filename metadata ordering implies that olddir changes (and file nlink) also persist. To meet requirement of both btrfs and ext4/xfs in that case your test may fsync both file AND newdir, as long as this still allows your test to catch the bug? > > Will that test modification still catch the btrfs bug (pre fix)? > > Yes, it will still trigger the btrfs bug. > > Well, I don't mind doing such change, but I would like to hear other > opinions too, perhaps Dave could shed some light on this? > Even old reiserfs and f2fs (both on posix and strict fsync modes) > currently pass this test. > As I wrote to Dave, if file is not "metadata dirty" before rename, then whether or not rename dirties to file for fsync is a filesystem specific implementation detail that is not in any standard. Since filesystems tend to try to optimize out unneeded journal commits, so the observation about what filesystems do today empirically is not a sufficiently strong argument. Again, if your test can afford to do fsync of file and parent dir it is best to be on the safe side (please audit other similar tests that you know about). If your test cannot afford to fsync parent dir, then at least a fat comment about this issue is worth about the expected (but not guaranteed) effects of fsync of file. Thanks, Amir.
On Tue, Mar 5, 2019 at 5:59 AM Amir Goldstein <amir73il@gmail.com> wrote: > > On Tue, Mar 5, 2019 at 12:31 AM Filipe Manana <fdmanana@kernel.org> wrote: > > > > On Mon, Mar 4, 2019 at 5:59 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > On Mon, Mar 4, 2019 at 5:23 PM Filipe Manana <fdmanana@kernel.org> wrote: > > > > > > > > On Mon, Mar 4, 2019 at 3:04 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > > > > > Test that if we truncate a file to reduce its size, rename it and then > > > > > > fsync it, after a power failure the file has a correct size and name. > > > > > > > > > > > > > > > > I am not sure that ext4/xfs semantics guaranty anything about > > > > > persisting file name after fsync of file?... > > > > > > > > > > > This test is motivated by a bug found in btrfs, which is fixed by a > > > > > > patch for the linux kernel titled: > > > > > > > > > > > > "Btrfs: fix incorrect file size after shrinking truncate and fsync" > > > > > > > > > > At least the title of this patch says nothing about persisting the > > > > > name. > > > > > > > > Because the bug in btrfs is not about persisting the name, it's about > > > > persisting the correct inode size. > > > > > > > > > > OK. Still AFAIK, ext4/xfs don't guaranty that the name bar will exist > > > after power failure unless the parent dir was fsynced. > > > If you fsync the parent dir instead of the file after rename, it will > > > make the test correct for ext4/xfs (and POSIX for that matter). > > > > I'm not so sure about that. Couldn't find anything explicit anywhere about that. > > man fsync(2): > " Calling fsync() does not necessarily ensure that the entry in > the directory > containing the file has also reached disk. For that an > explicit fsync() on a file > descriptor for the directory is also needed." > > > There are other tests that do similar assumptions, either after > > renames or after adding a new > > hard link (fsyncing just the file being enough to guarantee the new > > hard link exits after a power > > failure, without the need to fsync the parent directory). I don't > > recall ever having had such comment before, > > Because I missed it in review... > > > even for much more complex cases involving renames > > such as [1] for example. > > > > [1] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=e2d866668a98af4cd47cd6913418e410cd80aa93 > > > > In this case both old and new names are in the same parent directory, > > but if they were in different directories, and > > requiring the user/application to fsync both directories, it could be > > dangerous, if the old parent dir is fsync'ed and > > a power failure happened before fsync'ing the new parent (requiring > > the user/application to know about this > > and fsync the new parent before the old parent dir also seems very > > demanding), leading to a potential file loss > > after replaying the log/journal (there were such cases in btrfs before). > > > > This is not a problem for ext4/xfs. > fsync(newdir) guarantees persisting new filename > metadata ordering implies that olddir changes (and file nlink) > also persist. > > To meet requirement of both btrfs and ext4/xfs in that case > your test may fsync both file AND newdir, as long as this > still allows your test to catch the bug? > > > > Will that test modification still catch the btrfs bug (pre fix)? > > > > Yes, it will still trigger the btrfs bug. > > > > Well, I don't mind doing such change, but I would like to hear other > > opinions too, perhaps Dave could shed some light on this? > > Even old reiserfs and f2fs (both on posix and strict fsync modes) > > currently pass this test. > > > > As I wrote to Dave, if file is not "metadata dirty" before rename, > then whether or not rename dirties to file for fsync is a filesystem > specific implementation detail that is not in any standard. > > Since filesystems tend to try to optimize out unneeded journal > commits, so the observation about what filesystems do today > empirically is not a sufficiently strong argument. Even if the strictly ordered metadata guarantees Dave explained so many time before is not formally defined anywhere, I think it's a behavior that shouldn't change, as not only it makes sense, but applications might be relying on it and therefore useful to make sure no regressions happen. If such tests should be shared (and run only filesystems known to support the strictly ordered metadata) or require something special, is another question. > > Again, if your test can afford to do fsync of file and parent dir > it is best to be on the safe side (please audit other similar tests > that you know about). If your test cannot afford to fsync parent > dir, then at least a fat comment about this issue is worth about > the expected (but not guaranteed) effects of fsync of file. I'm sending a v2 with the fsync against the parent directory, despite not being needed for any fs I could run the tests against, because it's a generic test. Thanks. > > Thanks, > Amir.
> > As I wrote to Dave, if file is not "metadata dirty" before rename, > > then whether or not rename dirties to file for fsync is a filesystem > > specific implementation detail that is not in any standard. > > > > Since filesystems tend to try to optimize out unneeded journal > > commits, so the observation about what filesystems do today > > empirically is not a sufficiently strong argument. > > Even if the strictly ordered metadata guarantees Dave explained so > many time before is not formally defined anywhere, > I think it's a behavior that shouldn't change, as not only it makes > sense, but applications might be relying on it and therefore > useful to make sure no regressions happen. If such tests should be > shared (and run only filesystems known to support > the strictly ordered metadata) or require something special, is > another question. > I agree with you that it is good to test and verify those semantics. I was not arguing against assuming strictly ordered metadata. I was arguing against assuming that rename marks the inode dirty. If rename does not mark the inode dirty, then fsync(file) may be optimized away by filesystem and will not require persisting anything, without breaking strictly ordered metadata semantics. > > > > Again, if your test can afford to do fsync of file and parent dir > > it is best to be on the safe side (please audit other similar tests > > that you know about). If your test cannot afford to fsync parent > > dir, then at least a fat comment about this issue is worth about > > the expected (but not guaranteed) effects of fsync of file. > > I'm sending a v2 with the fsync against the parent directory, despite > not being needed for any fs I could > run the tests against, because it's a generic test. ACK. Thanks, Amir.
On Tue, Mar 05, 2019 at 07:39:28AM +0200, Amir Goldstein wrote: > On Tue, Mar 5, 2019 at 2:50 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote: > > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > Test that if we truncate a file to reduce its size, rename it and then > > > > fsync it, after a power failure the file has a correct size and name. This says: - ftruncate A - rename A B - fsync B > > > I am not sure that ext4/xfs semantics guaranty anything about > > > persisting file name after fsync of file?... > > > > They do. It's that pesky "strictly ordered metadata" thing I keep > > having to explain to people... > > > > i.e. if you fsync an inode, then you are persisting all the changes > > needed to reference that file and it's data. And so if there was a > > rename in the history of that file, then that is persisted, too. > > Which means that both the original and the new directory > > modifications are persisted, too. > > > > *POSIX* doesn't require this - it says that if you O_DSYNC data, > > then it also includes all the metadata needed to reference that > > data. So even if the data is there, POSIX doesn't define whether the > > rename is there or noti, just that you can get to the fsync'd data > > via either the old or new name. IOWs, POSIX allows the behaviour to > > be implementation specific. > > > > In this case, file systems with strictly ordered metadata will end > > up making the rename visible because the rename occurred before the > > truncate that the fsync() is persisting... > > > > That is not what is happening in Filipe's test. Test has: > - ftruncate A > - fsync A > - rename A B > - fsync B And this does not match the test description. /me goes and looks at the test again to check. Ok, the test is as Filipe describes: - pwrite 0 0x8000 A - fsync A - truncate 3000 A - rename A B - fsync B There is no fsync between truncate and rename. > So the reason this is working is because 2nd fsync needs to > persist ctime of B and not because it needs to persist the > truncate. ctime modifications during rename are irrelevent because there's no fsync between the truncate and the rename so the file inode is already dirty due to the truncate. I think you've got the wrong end of the stick here, Amir. :) Cheers, Dave.
On Wed, Mar 6, 2019 at 12:33 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Mar 05, 2019 at 07:39:28AM +0200, Amir Goldstein wrote: > > On Tue, Mar 5, 2019 at 2:50 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote: > > > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote: > > > > > > > > > > From: Filipe Manana <fdmanana@suse.com> > > > > > > > > > > Test that if we truncate a file to reduce its size, rename it and then > > > > > fsync it, after a power failure the file has a correct size and name. > > This says: > > - ftruncate A > - rename A B > - fsync B > > > > > I am not sure that ext4/xfs semantics guaranty anything about > > > > persisting file name after fsync of file?... > > > > > > They do. It's that pesky "strictly ordered metadata" thing I keep > > > having to explain to people... > > > > > > i.e. if you fsync an inode, then you are persisting all the changes > > > needed to reference that file and it's data. And so if there was a > > > rename in the history of that file, then that is persisted, too. > > > Which means that both the original and the new directory > > > modifications are persisted, too. > > > > > > *POSIX* doesn't require this - it says that if you O_DSYNC data, > > > then it also includes all the metadata needed to reference that > > > data. So even if the data is there, POSIX doesn't define whether the > > > rename is there or noti, just that you can get to the fsync'd data > > > via either the old or new name. IOWs, POSIX allows the behaviour to > > > be implementation specific. > > > > > > In this case, file systems with strictly ordered metadata will end > > > up making the rename visible because the rename occurred before the > > > truncate that the fsync() is persisting... > > > > > > > That is not what is happening in Filipe's test. Test has: > > - ftruncate A > > - fsync A > > - rename A B > > - fsync B > > And this does not match the test description. > > /me goes and looks at the test again to check. > > Ok, the test is as Filipe describes: > > - pwrite 0 0x8000 A > - fsync A > - truncate 3000 A > - rename A B > - fsync B > > There is no fsync between truncate and rename. > > > So the reason this is working is because 2nd fsync needs to > > persist ctime of B and not because it needs to persist the > > truncate. > > ctime modifications during rename are irrelevent because there's no > fsync between the truncate and the rename so the file inode is > already dirty due to the truncate. I think you've got the wrong end > of the stick here, Amir. :) > Doh! The discussion is still interesting because people have hard time to understand that those hidden details like ctime update on rename may have different behavior on different fs regardless if they obay ordered metadata or not. Btrfs is different in the respect of metadata dependencies from xfs/ext4 in many ways as seen in the different rename/link crash consistency discussions. Thanks, Amir.
On Wed, Mar 06, 2019 at 09:51:23AM +0200, Amir Goldstein wrote: > On Wed, Mar 6, 2019 at 12:33 AM Dave Chinner <david@fromorbit.com> wrote: > > > So the reason this is working is because 2nd fsync needs to > > > persist ctime of B and not because it needs to persist the > > > truncate. > > > > ctime modifications during rename are irrelevent because there's no > > fsync between the truncate and the rename so the file inode is > > already dirty due to the truncate. I think you've got the wrong end > > of the stick here, Amir. :) > > Doh! The discussion is still interesting because people have > hard time to understand that those hidden details like ctime > update on rename may have different behavior on different fs > regardless if they obay ordered metadata or not. > Btrfs is different in the respect of metadata dependencies from > xfs/ext4 in many ways as seen in the different rename/link > crash consistency discussions. Yes, little things like can result in different behaviour, but what we are trying to do is get to the point where there is minimal difference between all crash-recovery-capable linux filesystems. e.g. what we see here is that by always including the inode being moved in the rename transaction (regardless of how a filesystem acheives that), we provide consistent, reliable, predictable behaviour in all cases of "fsync after rename". IOWs, the SOMC model that _require_metadata_journaling tests are supposed to conform to is far more strict that POSIX requires and our tests need to reflect this stricter consistency model. IOWs, we should be encoding the behaviour we want in these tests rather than implementing yet another "test POSIX compatible behaviour" - POSIX is a complete crapshoot when it comes to persistence requirements. And if a filesystem fails a SOMC-model test, then the filesystem needs to be fixed, not have the test "relaxed" to only exercise POSIX-defined behaviour. Cheers, Dave.
On Wed, Mar 6, 2019 at 11:48 PM Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Mar 06, 2019 at 09:51:23AM +0200, Amir Goldstein wrote: > > On Wed, Mar 6, 2019 at 12:33 AM Dave Chinner <david@fromorbit.com> wrote: > > > > So the reason this is working is because 2nd fsync needs to > > > > persist ctime of B and not because it needs to persist the > > > > truncate. > > > > > > ctime modifications during rename are irrelevent because there's no > > > fsync between the truncate and the rename so the file inode is > > > already dirty due to the truncate. I think you've got the wrong end > > > of the stick here, Amir. :) > > > > Doh! The discussion is still interesting because people have > > hard time to understand that those hidden details like ctime > > update on rename may have different behavior on different fs > > regardless if they obay ordered metadata or not. > > Btrfs is different in the respect of metadata dependencies from > > xfs/ext4 in many ways as seen in the different rename/link > > crash consistency discussions. > > Yes, little things like can result in different behaviour, but what > we are trying to do is get to the point where there is minimal > difference between all crash-recovery-capable linux filesystems. > > e.g. what we see here is that by always including the inode being > moved in the rename transaction (regardless of how a filesystem > acheives that), we provide consistent, reliable, predictable > behaviour in all cases of "fsync after rename". IOWs, the SOMC model > that _require_metadata_journaling tests are supposed to conform to > is far more strict that POSIX requires and our tests need to reflect > this stricter consistency model. > > IOWs, we should be encoding the behaviour we want in these tests > rather than implementing yet another "test POSIX compatible > behaviour" - POSIX is a complete crapshoot when it comes to > persistence requirements. And if a filesystem fails a SOMC-model > test, then the filesystem needs to be fixed, not have the test > "relaxed" to only exercise POSIX-defined behaviour. > Agreed! v1 is better than v2. Sorry for my mistake in v1 review. I went back to look at similar fsync tests by Filipe: generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512} I found some alleged subtle mistakes about SOMC assumptions. generic/336 does: touch $SCRATCH_MNT/a/foo ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link touch $SCRATCH_MNT/b/bar sync unlink $SCRATCH_MNT/b/foo_link mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/ $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo And expects both unlink and rename to persist. However, this is only true in the *very likely* case that there is no journal commit in between unlink and rename, because fsync foo is only guaranteed to persist metadata changes that depend on the unlink and happened BEFORE it, which is not the case for the rename of bar. generic/343 and generic/510 do something similar. At first glance, generic/498 is actually broken (for xfs) or at least I don't understand why it works. Thanks, Amir.
Hi Amir, > I went back to look at similar fsync tests by Filipe: > generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512} > > I found some alleged subtle mistakes about SOMC assumptions. > > generic/336 does: > touch $SCRATCH_MNT/a/foo > ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link > touch $SCRATCH_MNT/b/bar > sync > unlink $SCRATCH_MNT/b/foo_link > mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/ > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo This is probably what's happening in this particular test : SOMC requires: fsync(a/foo) must ensure unlink(b/foo_link) (because they were linked at some point) But what happens is: fsync(a/foo) --> unlink(b/foo_link) unlink(b/foo_link) --> fsync(b) fsync(b) --> rename goes through SOMC should only require that the unlink persists. The rename operation persists due to the side-effect of SOMC. So we should be only testing if the unlink operation went through. Take a look at this thread that describes the bug which resulted in this test case (https://patchwork.kernel.org/patch/8293181/). The file bar was lost during the rename operation and I think this test simply wanted to verify that file bar was intact in either one of the directories. Probably the reason this test was written was not to test SOMC, but to verify that the log replay in btrfs does not delete the renamed file - luckily it happens to pass on other file systems because they all persist the rename operation as a side effect of SOMC. To me, this test currently checks for something more than SOMC. The check must be modified to only test if b/foo_link is removed, and that the file bar is *either* in directory b or c. Test 343 is a similar case where in btrfs log replay ended up persisting the file in both the source and destination directories after a rename operation (https://patchwork.kernel.org/patch/8766401/). The checks must be updated here as well. Thanks, Jayashree
On Thu, Mar 07, 2019 at 09:52:03AM +0200, Amir Goldstein wrote: > On Wed, Mar 6, 2019 at 11:48 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Wed, Mar 06, 2019 at 09:51:23AM +0200, Amir Goldstein wrote: > > > On Wed, Mar 6, 2019 at 12:33 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > So the reason this is working is because 2nd fsync needs to > > > > > persist ctime of B and not because it needs to persist the > > > > > truncate. > > > > > > > > ctime modifications during rename are irrelevent because there's no > > > > fsync between the truncate and the rename so the file inode is > > > > already dirty due to the truncate. I think you've got the wrong end > > > > of the stick here, Amir. :) > > > > > > Doh! The discussion is still interesting because people have > > > hard time to understand that those hidden details like ctime > > > update on rename may have different behavior on different fs > > > regardless if they obay ordered metadata or not. > > > Btrfs is different in the respect of metadata dependencies from > > > xfs/ext4 in many ways as seen in the different rename/link > > > crash consistency discussions. > > > > Yes, little things like can result in different behaviour, but what > > we are trying to do is get to the point where there is minimal > > difference between all crash-recovery-capable linux filesystems. > > > > e.g. what we see here is that by always including the inode being > > moved in the rename transaction (regardless of how a filesystem > > acheives that), we provide consistent, reliable, predictable > > behaviour in all cases of "fsync after rename". IOWs, the SOMC model > > that _require_metadata_journaling tests are supposed to conform to > > is far more strict that POSIX requires and our tests need to reflect > > this stricter consistency model. > > > > IOWs, we should be encoding the behaviour we want in these tests > > rather than implementing yet another "test POSIX compatible > > behaviour" - POSIX is a complete crapshoot when it comes to > > persistence requirements. And if a filesystem fails a SOMC-model > > test, then the filesystem needs to be fixed, not have the test > > "relaxed" to only exercise POSIX-defined behaviour. > > > > Agreed! v1 is better than v2. Sorry for my mistake in v1 review. > > I went back to look at similar fsync tests by Filipe: > generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512} > > I found some alleged subtle mistakes about SOMC assumptions. > > generic/336 does: > touch $SCRATCH_MNT/a/foo > ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link > touch $SCRATCH_MNT/b/bar > sync > unlink $SCRATCH_MNT/b/foo_link > mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/ > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo > > And expects both unlink and rename to persist. *nod*. Yup, that's a typical async transactional ordering behaviour because the dirty state is carried on the inode, not the path or fd that is being used to access it. > However, this is only true in the *very likely* case that there is no > journal commit in between unlink and rename, because fsync foo > is only guaranteed to persist metadata changes that depend on the > unlink and happened BEFORE it, which is not the case for the rename > of bar. Yup, most likely. I haven't looked at any of these btrfs-inspired fsync tests in any detail so it wouldn't surprise me that there are issues like this in them - I just don't have time to look at everything. Indeed, if there are mistaken assumptions in the tests based around pending dirty state and async transaction aggregation, then running the tests with "-o dirsync" or even "-o wsync" should cause such tests to fail. > At first glance, generic/498 is actually broken (for xfs) or at least > I don't understand why it works. The test does this: mkdir $SCRATCH_MNT/A mkdir $SCRATCH_MNT/B mkdir $SCRATCH_MNT/A/C touch $SCRATCH_MNT/B/foo $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/B/foo # It is important the new hard link is located in a hierarchy of new directories # (not yet persisted). ln $SCRATCH_MNT/B/foo $SCRATCH_MNT/A/C/foo $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/A _flakey_drop_and_remount And it expects /A and B/foo to be there afterwards. The comment "(not yet persisted)" is about the buggy btrfs behaviour, not how a SOMC fs should behave. So, yeah, that should always work on XFS, because the first fsync persists everything the test checks (due to common ancestors), and second fsync is a no-opt because dir A has already been checkpointed. Cheers, Dave.
On Thu, Mar 07, 2019 at 05:19:51PM -0600, Jayashree Mohan wrote: > Hi Amir, > > > I went back to look at similar fsync tests by Filipe: > > generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512} > > > > I found some alleged subtle mistakes about SOMC assumptions. > > > > generic/336 does: > > touch $SCRATCH_MNT/a/foo > > ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link > > touch $SCRATCH_MNT/b/bar > > sync > > unlink $SCRATCH_MNT/b/foo_link > > mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/ > > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo > > This is probably what's happening in this particular test : > > SOMC requires: > fsync(a/foo) must ensure unlink(b/foo_link) (because they > were linked at some point) > > But what happens is: > fsync(a/foo) --> unlink(b/foo_link) > unlink(b/foo_link) --> fsync(b) > fsync(b) --> rename goes through > > SOMC should only require that the unlink persists. That's a /fsync/ requirement, not SOMC. SOMC says: "If the rename after the fsync()d unlink is present after recovery, then every metadata operation completed between the unlink and the rename must also be present after recovery." > The rename > operation persists due to the side-effect of SOMC. So we should be > only testing if the unlink operation went through. No, it persists as a side effect of the a filesystem fsync() implementation, not SOMC. i.e. on fsync(), the filesystem must commit the journal at a point in time that *includes the unlink*. That means it can chose any time between the unlink and the moment the fsync() is received by the filesystem. Once the point in time has been chosen by the fsync operation, SOMC then defines what else must be journalled and recovered with the information that must be fsync()d. > Take a look at this thread that describes the bug which resulted in > this test case (https://patchwork.kernel.org/patch/8293181/). I really wouldn't try to infer anything from the bugs in btrfs fsync behaviour or the test cases that expose them. 'Behave like other filesystems" is not a substitute for having solid fundamental algorithms... Cheers, Dave.
On Thu, Mar 7, 2019 at 10:35 PM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Mar 07, 2019 at 05:19:51PM -0600, Jayashree Mohan wrote: > > Hi Amir, > > > > > I went back to look at similar fsync tests by Filipe: > > > generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512} > > > > > > I found some alleged subtle mistakes about SOMC assumptions. > > > > > > generic/336 does: > > > touch $SCRATCH_MNT/a/foo > > > ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link > > > touch $SCRATCH_MNT/b/bar > > > sync > > > unlink $SCRATCH_MNT/b/foo_link > > > mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/ > > > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo > > > > This is probably what's happening in this particular test : > > > > SOMC requires: > > fsync(a/foo) must ensure unlink(b/foo_link) (because they > > were linked at some point) > > > > But what happens is: > > fsync(a/foo) --> unlink(b/foo_link) > > unlink(b/foo_link) --> fsync(b) > > fsync(b) --> rename goes through > > > > SOMC should only require that the unlink persists. > > That's a /fsync/ requirement, not SOMC. > > SOMC says: > > "If the rename after the fsync()d unlink is present after recovery, > then every metadata operation completed between the unlink and the > rename must also be present after recovery." Isn't this a bit too broad? That sounds more like total ordering of metadata operations rather than SOMC. Shouldn't SOMC say "all operations dependent upon the rename should be present after recovery if rename is present?"
(Sorry, missed this email and only just noticed it...) On Fri, Mar 08, 2019 at 09:11:19AM -0600, Vijay Chidambaram wrote: > On Thu, Mar 7, 2019 at 10:35 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, Mar 07, 2019 at 05:19:51PM -0600, Jayashree Mohan wrote: > > > Hi Amir, > > > > > > > I went back to look at similar fsync tests by Filipe: > > > > generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512} > > > > > > > > I found some alleged subtle mistakes about SOMC assumptions. > > > > > > > > generic/336 does: > > > > touch $SCRATCH_MNT/a/foo > > > > ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link > > > > touch $SCRATCH_MNT/b/bar > > > > sync > > > > unlink $SCRATCH_MNT/b/foo_link > > > > mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/ > > > > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo > > > > > > This is probably what's happening in this particular test : > > > > > > SOMC requires: > > > fsync(a/foo) must ensure unlink(b/foo_link) (because they > > > were linked at some point) > > > > > > But what happens is: > > > fsync(a/foo) --> unlink(b/foo_link) > > > unlink(b/foo_link) --> fsync(b) > > > fsync(b) --> rename goes through > > > > > > SOMC should only require that the unlink persists. > > > > That's a /fsync/ requirement, not SOMC. > > > > SOMC says: > > > > "If the rename after the fsync()d unlink is present after recovery, > > then every metadata operation completed between the unlink and the > > rename must also be present after recovery." > > Isn't this a bit too broad? That sounds more like total ordering of > metadata operations rather than SOMC. Shouldn't SOMC say "all > operations dependent upon the rename should be present after recovery > if rename is present?" Yes. I was describing the explicit behaivour to expect in the context of the given example, in which all the operations between the unlink and rename were dependent operations. You've just restated it as the general rule that I applied... Cheers, Dave.
diff --git a/tests/generic/532 b/tests/generic/532 new file mode 100755 index 00000000..64992d85 --- /dev/null +++ b/tests/generic/532 @@ -0,0 +1,66 @@ +#! /bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved. +# +# FS QA Test No. 532 +# +# Test that if we truncate a file to reduce its size, rename it and then fsync +# it, after a power failure the file has a correct size and name. +# +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + _cleanup_flakey + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/dmflakey + +# real QA test starts here +_supported_fs generic +_supported_os Linux +_require_scratch +_require_dm_target flakey + +rm -f $seqres.full + +_scratch_mkfs >>$seqres.full 2>&1 +_require_metadata_journaling $SCRATCH_DEV +_init_flakey +_mount_flakey + +# Create our test file with an initial size of 8000 bytes, then fsync it, +# followed by a truncate that reduces its size down to 3000 bytes. +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 8000" \ + -c "fsync" \ + -c "truncate 3000" \ + $SCRATCH_MNT/foo | _filter_xfs_io + +# Now rename the file and fsync it again. +mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar + +# Simulate a power failure and mount the filesystem to check that the file was +# persisted with the new name and has a size of 3000 bytes. +_flakey_drop_and_remount + +[ -f $SCRATCH_MNT/bar ] || echo "file name 'bar' is missing" +[ -f $SCRATCH_MNT/foo ] && echo "file name 'foo' still exists" + +echo "File content after power failure:" +od -A d -t x1 $SCRATCH_MNT/bar + +_unmount_flakey + +status=0 +exit diff --git a/tests/generic/532.out b/tests/generic/532.out new file mode 100644 index 00000000..554fbe2a --- /dev/null +++ b/tests/generic/532.out @@ -0,0 +1,8 @@ +QA output created by 532 +wrote 8000/8000 bytes at offset 0 +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +File content after power failure: +0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab +* +0002992 ab ab ab ab ab ab ab ab +0003000 diff --git a/tests/generic/group b/tests/generic/group index 31011ac8..7d63f303 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -534,3 +534,4 @@ 529 auto quick attr 530 auto quick unlink 531 auto quick unlink +532 auto quick log