generic: add test for fsync after shrinking truncate and rename
diff mbox series

Message ID 20190304140622.23997-1-fdmanana@kernel.org
State New
Headers show
Series
  • generic: add test for fsync after shrinking truncate and rename
Related show

Commit Message

Filipe Manana March 4, 2019, 2:06 p.m. UTC
From: Filipe Manana <fdmanana@suse.com>

Test that if we truncate a file to reduce its size, rename it and then
fsync it, after a power failure the file has a correct size and name.

This test is motivated by a bug found in btrfs, which is fixed by a
patch for the linux kernel titled:

  "Btrfs: fix incorrect file size after shrinking truncate and fsync"

This test currently passes on ext4, xfs, f2fs and patched btrfs.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 tests/generic/532     | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/generic/532.out |  8 +++++++
 tests/generic/group   |  1 +
 3 files changed, 75 insertions(+)
 create mode 100755 tests/generic/532
 create mode 100644 tests/generic/532.out

Comments

Amir Goldstein March 4, 2019, 3:04 p.m. UTC | #1
On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> Test that if we truncate a file to reduce its size, rename it and then
> fsync it, after a power failure the file has a correct size and name.
>

I am not sure that ext4/xfs semantics guaranty anything about
persisting file name after fsync of file?...

> This test is motivated by a bug found in btrfs, which is fixed by a
> patch for the linux kernel titled:
>
>   "Btrfs: fix incorrect file size after shrinking truncate and fsync"

At least the title of this patch says nothing about persisting the
name.

>
> This test currently passes on ext4, xfs, f2fs and patched btrfs.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  tests/generic/532     | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/532.out |  8 +++++++
>  tests/generic/group   |  1 +
>  3 files changed, 75 insertions(+)
>  create mode 100755 tests/generic/532
>  create mode 100644 tests/generic/532.out
>
> diff --git a/tests/generic/532 b/tests/generic/532
> new file mode 100755
> index 00000000..64992d85
> --- /dev/null
> +++ b/tests/generic/532
> @@ -0,0 +1,66 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test No. 532
> +#
> +# Test that if we truncate a file to reduce its size, rename it and then fsync
> +# it, after a power failure the file has a correct size and name.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +tmp=/tmp/$$
> +status=1       # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +       _cleanup_flakey
> +       cd /
> +       rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/dmflakey
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +_require_dm_target flakey
> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_require_metadata_journaling $SCRATCH_DEV
> +_init_flakey
> +_mount_flakey
> +
> +# Create our test file with an initial size of 8000 bytes, then fsync it,
> +# followed by a truncate that reduces its size down to 3000 bytes.
> +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 8000" \
> +            -c "fsync" \
> +            -c "truncate 3000" \
> +            $SCRATCH_MNT/foo | _filter_xfs_io
> +
> +# Now rename the file and fsync it again.
> +mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
> +
> +# Simulate a power failure and mount the filesystem to check that the file was
> +# persisted with the new name and has a size of 3000 bytes.
> +_flakey_drop_and_remount
> +
> +[ -f $SCRATCH_MNT/bar ] || echo "file name 'bar' is missing"
> +[ -f $SCRATCH_MNT/foo ] && echo "file name 'foo' still exists"
> +
> +echo "File content after power failure:"
> +od -A d -t x1 $SCRATCH_MNT/bar
> +
> +_unmount_flakey
> +
> +status=0
> +exit
> diff --git a/tests/generic/532.out b/tests/generic/532.out
> new file mode 100644
> index 00000000..554fbe2a
> --- /dev/null
> +++ b/tests/generic/532.out
> @@ -0,0 +1,8 @@
> +QA output created by 532
> +wrote 8000/8000 bytes at offset 0
> +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> +File content after power failure:
> +0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
> +*
> +0002992 ab ab ab ab ab ab ab ab
> +0003000
> diff --git a/tests/generic/group b/tests/generic/group
> index 31011ac8..7d63f303 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -534,3 +534,4 @@
>  529 auto quick attr
>  530 auto quick unlink
>  531 auto quick unlink
> +532 auto quick log
> --
> 2.11.0
>
Filipe Manana March 4, 2019, 3:23 p.m. UTC | #2
On Mon, Mar 4, 2019 at 3:04 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> >
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Test that if we truncate a file to reduce its size, rename it and then
> > fsync it, after a power failure the file has a correct size and name.
> >
>
> I am not sure that ext4/xfs semantics guaranty anything about
> persisting file name after fsync of file?...
>
> > This test is motivated by a bug found in btrfs, which is fixed by a
> > patch for the linux kernel titled:
> >
> >   "Btrfs: fix incorrect file size after shrinking truncate and fsync"
>
> At least the title of this patch says nothing about persisting the
> name.

Because the bug in btrfs is not about persisting the name, it's about
persisting the correct inode size.

>
> >
> > This test currently passes on ext4, xfs, f2fs and patched btrfs.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  tests/generic/532     | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/generic/532.out |  8 +++++++
> >  tests/generic/group   |  1 +
> >  3 files changed, 75 insertions(+)
> >  create mode 100755 tests/generic/532
> >  create mode 100644 tests/generic/532.out
> >
> > diff --git a/tests/generic/532 b/tests/generic/532
> > new file mode 100755
> > index 00000000..64992d85
> > --- /dev/null
> > +++ b/tests/generic/532
> > @@ -0,0 +1,66 @@
> > +#! /bin/bash
> > +# SPDX-License-Identifier: GPL-2.0
> > +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
> > +#
> > +# FS QA Test No. 532
> > +#
> > +# Test that if we truncate a file to reduce its size, rename it and then fsync
> > +# it, after a power failure the file has a correct size and name.
> > +#
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +tmp=/tmp/$$
> > +status=1       # failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +       _cleanup_flakey
> > +       cd /
> > +       rm -f $tmp.*
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +. ./common/dmflakey
> > +
> > +# real QA test starts here
> > +_supported_fs generic
> > +_supported_os Linux
> > +_require_scratch
> > +_require_dm_target flakey
> > +
> > +rm -f $seqres.full
> > +
> > +_scratch_mkfs >>$seqres.full 2>&1
> > +_require_metadata_journaling $SCRATCH_DEV
> > +_init_flakey
> > +_mount_flakey
> > +
> > +# Create our test file with an initial size of 8000 bytes, then fsync it,
> > +# followed by a truncate that reduces its size down to 3000 bytes.
> > +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 8000" \
> > +            -c "fsync" \
> > +            -c "truncate 3000" \
> > +            $SCRATCH_MNT/foo | _filter_xfs_io
> > +
> > +# Now rename the file and fsync it again.
> > +mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> > +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
> > +
> > +# Simulate a power failure and mount the filesystem to check that the file was
> > +# persisted with the new name and has a size of 3000 bytes.
> > +_flakey_drop_and_remount
> > +
> > +[ -f $SCRATCH_MNT/bar ] || echo "file name 'bar' is missing"
> > +[ -f $SCRATCH_MNT/foo ] && echo "file name 'foo' still exists"
> > +
> > +echo "File content after power failure:"
> > +od -A d -t x1 $SCRATCH_MNT/bar
> > +
> > +_unmount_flakey
> > +
> > +status=0
> > +exit
> > diff --git a/tests/generic/532.out b/tests/generic/532.out
> > new file mode 100644
> > index 00000000..554fbe2a
> > --- /dev/null
> > +++ b/tests/generic/532.out
> > @@ -0,0 +1,8 @@
> > +QA output created by 532
> > +wrote 8000/8000 bytes at offset 0
> > +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > +File content after power failure:
> > +0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
> > +*
> > +0002992 ab ab ab ab ab ab ab ab
> > +0003000
> > diff --git a/tests/generic/group b/tests/generic/group
> > index 31011ac8..7d63f303 100644
> > --- a/tests/generic/group
> > +++ b/tests/generic/group
> > @@ -534,3 +534,4 @@
> >  529 auto quick attr
> >  530 auto quick unlink
> >  531 auto quick unlink
> > +532 auto quick log
> > --
> > 2.11.0
> >
Amir Goldstein March 4, 2019, 5:59 p.m. UTC | #3
On Mon, Mar 4, 2019 at 5:23 PM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Mon, Mar 4, 2019 at 3:04 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> > >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Test that if we truncate a file to reduce its size, rename it and then
> > > fsync it, after a power failure the file has a correct size and name.
> > >
> >
> > I am not sure that ext4/xfs semantics guaranty anything about
> > persisting file name after fsync of file?...
> >
> > > This test is motivated by a bug found in btrfs, which is fixed by a
> > > patch for the linux kernel titled:
> > >
> > >   "Btrfs: fix incorrect file size after shrinking truncate and fsync"
> >
> > At least the title of this patch says nothing about persisting the
> > name.
>
> Because the bug in btrfs is not about persisting the name, it's about
> persisting the correct inode size.
>

OK. Still AFAIK, ext4/xfs don't guaranty that the name bar will exist
after power failure unless the parent dir was fsynced.
If you fsync the parent dir instead of the file after rename, it will
make the test correct for ext4/xfs (and POSIX for that matter).
Will that test modification still catch the btrfs bug (pre fix)?

Thanks,
Amir.

> >
> > >
> > > This test currently passes on ext4, xfs, f2fs and patched btrfs.
> > >
> > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > ---
> > >  tests/generic/532     | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  tests/generic/532.out |  8 +++++++
> > >  tests/generic/group   |  1 +
> > >  3 files changed, 75 insertions(+)
> > >  create mode 100755 tests/generic/532
> > >  create mode 100644 tests/generic/532.out
> > >
> > > diff --git a/tests/generic/532 b/tests/generic/532
> > > new file mode 100755
> > > index 00000000..64992d85
> > > --- /dev/null
> > > +++ b/tests/generic/532
> > > @@ -0,0 +1,66 @@
> > > +#! /bin/bash
> > > +# SPDX-License-Identifier: GPL-2.0
> > > +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
> > > +#
> > > +# FS QA Test No. 532
> > > +#
> > > +# Test that if we truncate a file to reduce its size, rename it and then fsync
> > > +# it, after a power failure the file has a correct size and name.
> > > +#
> > > +seq=`basename $0`
> > > +seqres=$RESULT_DIR/$seq
> > > +echo "QA output created by $seq"
> > > +tmp=/tmp/$$
> > > +status=1       # failure is the default!
> > > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > > +
> > > +_cleanup()
> > > +{
> > > +       _cleanup_flakey
> > > +       cd /
> > > +       rm -f $tmp.*
> > > +}
> > > +
> > > +# get standard environment, filters and checks
> > > +. ./common/rc
> > > +. ./common/filter
> > > +. ./common/dmflakey
> > > +
> > > +# real QA test starts here
> > > +_supported_fs generic
> > > +_supported_os Linux
> > > +_require_scratch
> > > +_require_dm_target flakey
> > > +
> > > +rm -f $seqres.full
> > > +
> > > +_scratch_mkfs >>$seqres.full 2>&1
> > > +_require_metadata_journaling $SCRATCH_DEV
> > > +_init_flakey
> > > +_mount_flakey
> > > +
> > > +# Create our test file with an initial size of 8000 bytes, then fsync it,
> > > +# followed by a truncate that reduces its size down to 3000 bytes.
> > > +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 8000" \
> > > +            -c "fsync" \
> > > +            -c "truncate 3000" \
> > > +            $SCRATCH_MNT/foo | _filter_xfs_io
> > > +
> > > +# Now rename the file and fsync it again.
> > > +mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> > > +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
> > > +
> > > +# Simulate a power failure and mount the filesystem to check that the file was
> > > +# persisted with the new name and has a size of 3000 bytes.
> > > +_flakey_drop_and_remount
> > > +
> > > +[ -f $SCRATCH_MNT/bar ] || echo "file name 'bar' is missing"
> > > +[ -f $SCRATCH_MNT/foo ] && echo "file name 'foo' still exists"
> > > +
> > > +echo "File content after power failure:"
> > > +od -A d -t x1 $SCRATCH_MNT/bar
> > > +
> > > +_unmount_flakey
> > > +
> > > +status=0
> > > +exit
> > > diff --git a/tests/generic/532.out b/tests/generic/532.out
> > > new file mode 100644
> > > index 00000000..554fbe2a
> > > --- /dev/null
> > > +++ b/tests/generic/532.out
> > > @@ -0,0 +1,8 @@
> > > +QA output created by 532
> > > +wrote 8000/8000 bytes at offset 0
> > > +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > > +File content after power failure:
> > > +0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
> > > +*
> > > +0002992 ab ab ab ab ab ab ab ab
> > > +0003000
> > > diff --git a/tests/generic/group b/tests/generic/group
> > > index 31011ac8..7d63f303 100644
> > > --- a/tests/generic/group
> > > +++ b/tests/generic/group
> > > @@ -534,3 +534,4 @@
> > >  529 auto quick attr
> > >  530 auto quick unlink
> > >  531 auto quick unlink
> > > +532 auto quick log
> > > --
> > > 2.11.0
> > >
Filipe Manana March 4, 2019, 10:30 p.m. UTC | #4
On Mon, Mar 4, 2019 at 5:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Mar 4, 2019 at 5:23 PM Filipe Manana <fdmanana@kernel.org> wrote:
> >
> > On Mon, Mar 4, 2019 at 3:04 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> > > >
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > Test that if we truncate a file to reduce its size, rename it and then
> > > > fsync it, after a power failure the file has a correct size and name.
> > > >
> > >
> > > I am not sure that ext4/xfs semantics guaranty anything about
> > > persisting file name after fsync of file?...
> > >
> > > > This test is motivated by a bug found in btrfs, which is fixed by a
> > > > patch for the linux kernel titled:
> > > >
> > > >   "Btrfs: fix incorrect file size after shrinking truncate and fsync"
> > >
> > > At least the title of this patch says nothing about persisting the
> > > name.
> >
> > Because the bug in btrfs is not about persisting the name, it's about
> > persisting the correct inode size.
> >
>
> OK. Still AFAIK, ext4/xfs don't guaranty that the name bar will exist
> after power failure unless the parent dir was fsynced.
> If you fsync the parent dir instead of the file after rename, it will
> make the test correct for ext4/xfs (and POSIX for that matter).

I'm not so sure about that. Couldn't find anything explicit anywhere about that.
There are other tests that do similar assumptions, either after
renames or after adding a new
hard link (fsyncing just the file being enough to guarantee the new
hard link exits after a power
failure, without the need to fsync the parent directory). I don't
recall ever having had such
comment before, even for much more complex cases involving renames
such as [1] for example.

[1] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=e2d866668a98af4cd47cd6913418e410cd80aa93

In this case both old and new names are in the same parent directory,
but if they were in different directories, and
requiring the user/application to fsync both directories, it could be
dangerous, if the old parent dir is fsync'ed and
a power failure happened before fsync'ing the new parent (requiring
the user/application to know about this
and fsync the new parent before the old parent dir also seems very
demanding), leading to a potential file loss
after replaying the log/journal (there were such cases in btrfs before).

> Will that test modification still catch the btrfs bug (pre fix)?

Yes, it will still trigger the btrfs bug.

Well, I don't mind doing such change, but I would like to hear other
opinions too, perhaps Dave could shed some light on this?
Even old reiserfs and f2fs (both on posix and strict fsync modes)
currently pass this test.

Thanks.

>
> Thanks,
> Amir.
>
> > >
> > > >
> > > > This test currently passes on ext4, xfs, f2fs and patched btrfs.
> > > >
> > > > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > > > ---
> > > >  tests/generic/532     | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  tests/generic/532.out |  8 +++++++
> > > >  tests/generic/group   |  1 +
> > > >  3 files changed, 75 insertions(+)
> > > >  create mode 100755 tests/generic/532
> > > >  create mode 100644 tests/generic/532.out
> > > >
> > > > diff --git a/tests/generic/532 b/tests/generic/532
> > > > new file mode 100755
> > > > index 00000000..64992d85
> > > > --- /dev/null
> > > > +++ b/tests/generic/532
> > > > @@ -0,0 +1,66 @@
> > > > +#! /bin/bash
> > > > +# SPDX-License-Identifier: GPL-2.0
> > > > +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
> > > > +#
> > > > +# FS QA Test No. 532
> > > > +#
> > > > +# Test that if we truncate a file to reduce its size, rename it and then fsync
> > > > +# it, after a power failure the file has a correct size and name.
> > > > +#
> > > > +seq=`basename $0`
> > > > +seqres=$RESULT_DIR/$seq
> > > > +echo "QA output created by $seq"
> > > > +tmp=/tmp/$$
> > > > +status=1       # failure is the default!
> > > > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > > > +
> > > > +_cleanup()
> > > > +{
> > > > +       _cleanup_flakey
> > > > +       cd /
> > > > +       rm -f $tmp.*
> > > > +}
> > > > +
> > > > +# get standard environment, filters and checks
> > > > +. ./common/rc
> > > > +. ./common/filter
> > > > +. ./common/dmflakey
> > > > +
> > > > +# real QA test starts here
> > > > +_supported_fs generic
> > > > +_supported_os Linux
> > > > +_require_scratch
> > > > +_require_dm_target flakey
> > > > +
> > > > +rm -f $seqres.full
> > > > +
> > > > +_scratch_mkfs >>$seqres.full 2>&1
> > > > +_require_metadata_journaling $SCRATCH_DEV
> > > > +_init_flakey
> > > > +_mount_flakey
> > > > +
> > > > +# Create our test file with an initial size of 8000 bytes, then fsync it,
> > > > +# followed by a truncate that reduces its size down to 3000 bytes.
> > > > +$XFS_IO_PROG -f -c "pwrite -S 0xab 0 8000" \
> > > > +            -c "fsync" \
> > > > +            -c "truncate 3000" \
> > > > +            $SCRATCH_MNT/foo | _filter_xfs_io
> > > > +
> > > > +# Now rename the file and fsync it again.
> > > > +mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar
> > > > +$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
> > > > +
> > > > +# Simulate a power failure and mount the filesystem to check that the file was
> > > > +# persisted with the new name and has a size of 3000 bytes.
> > > > +_flakey_drop_and_remount
> > > > +
> > > > +[ -f $SCRATCH_MNT/bar ] || echo "file name 'bar' is missing"
> > > > +[ -f $SCRATCH_MNT/foo ] && echo "file name 'foo' still exists"
> > > > +
> > > > +echo "File content after power failure:"
> > > > +od -A d -t x1 $SCRATCH_MNT/bar
> > > > +
> > > > +_unmount_flakey
> > > > +
> > > > +status=0
> > > > +exit
> > > > diff --git a/tests/generic/532.out b/tests/generic/532.out
> > > > new file mode 100644
> > > > index 00000000..554fbe2a
> > > > --- /dev/null
> > > > +++ b/tests/generic/532.out
> > > > @@ -0,0 +1,8 @@
> > > > +QA output created by 532
> > > > +wrote 8000/8000 bytes at offset 0
> > > > +XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> > > > +File content after power failure:
> > > > +0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
> > > > +*
> > > > +0002992 ab ab ab ab ab ab ab ab
> > > > +0003000
> > > > diff --git a/tests/generic/group b/tests/generic/group
> > > > index 31011ac8..7d63f303 100644
> > > > --- a/tests/generic/group
> > > > +++ b/tests/generic/group
> > > > @@ -534,3 +534,4 @@
> > > >  529 auto quick attr
> > > >  530 auto quick unlink
> > > >  531 auto quick unlink
> > > > +532 auto quick log
> > > > --
> > > > 2.11.0
> > > >
Dave Chinner March 5, 2019, 12:50 a.m. UTC | #5
On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote:
> On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> >
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Test that if we truncate a file to reduce its size, rename it and then
> > fsync it, after a power failure the file has a correct size and name.
> >
> 
> I am not sure that ext4/xfs semantics guaranty anything about
> persisting file name after fsync of file?...

They do.  It's that pesky "strictly ordered metadata" thing I keep
having to explain to people...

i.e. if you fsync an inode, then you are persisting all the changes
needed to reference that file and it's data. And so if there was a
rename in the history of that file, then that is persisted, too.
Which means that both the original and the new directory
modifications are persisted, too.

*POSIX* doesn't require this - it says that if you O_DSYNC data,
then it also includes all the metadata needed to reference that
data. So even if the data is there, POSIX doesn't define whether the
rename is there or noti, just that you can get to the fsync'd data
via either the old or new name. IOWs, POSIX allows the behaviour to
be implementation specific.

In this case, file systems with strictly ordered metadata will end
up making the rename visible because the rename occurred before the
truncate that the fsync() is persisting...

Cheers,

Dave.
Dave Chinner March 5, 2019, 1 a.m. UTC | #6
On Tue, Mar 05, 2019 at 11:50:20AM +1100, Dave Chinner wrote:
> On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote:
> > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> > >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Test that if we truncate a file to reduce its size, rename it and then
> > > fsync it, after a power failure the file has a correct size and name.
> > >
> > 
> > I am not sure that ext4/xfs semantics guaranty anything about
> > persisting file name after fsync of file?...
> 
> They do.  It's that pesky "strictly ordered metadata" thing I keep
> having to explain to people...

https://marc.info/?l=fstests&m=155010885626284&w=2

-Dave.

> 
> i.e. if you fsync an inode, then you are persisting all the changes
> needed to reference that file and it's data. And so if there was a
> rename in the history of that file, then that is persisted, too.
> Which means that both the original and the new directory
> modifications are persisted, too.
> 
> *POSIX* doesn't require this - it says that if you O_DSYNC data,
> then it also includes all the metadata needed to reference that
> data. So even if the data is there, POSIX doesn't define whether the
> rename is there or noti, just that you can get to the fsync'd data
> via either the old or new name. IOWs, POSIX allows the behaviour to
> be implementation specific.
> 
> In this case, file systems with strictly ordered metadata will end
> up making the rename visible because the rename occurred before the
> truncate that the fsync() is persisting...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Vijay Chidambaram March 5, 2019, 1:08 a.m. UTC | #7
On Mon, Mar 4, 2019 at 7:00 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Mar 05, 2019 at 11:50:20AM +1100, Dave Chinner wrote:
> > On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote:
> > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> > > >
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > Test that if we truncate a file to reduce its size, rename it and then
> > > > fsync it, after a power failure the file has a correct size and name.
> > > >
> > >
> > > I am not sure that ext4/xfs semantics guaranty anything about
> > > persisting file name after fsync of file?...
> >
> > They do.  It's that pesky "strictly ordered metadata" thing I keep
> > having to explain to people...
>
> https://marc.info/?l=fstests&m=155010885626284&w=2

With reference to that previous conversation, Jayashree and I are
planning to submit a patch adding something like
Documentation/filesystems/crash-recovery.txt to document crash
guarantees and strictly ordered metadata semantics soon! So hopefully
there will be in-kernel documentation we can point folks to soon.
Amir Goldstein March 5, 2019, 5:39 a.m. UTC | #8
On Tue, Mar 5, 2019 at 2:50 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote:
> > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> > >
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > Test that if we truncate a file to reduce its size, rename it and then
> > > fsync it, after a power failure the file has a correct size and name.
> > >
> >
> > I am not sure that ext4/xfs semantics guaranty anything about
> > persisting file name after fsync of file?...
>
> They do.  It's that pesky "strictly ordered metadata" thing I keep
> having to explain to people...
>
> i.e. if you fsync an inode, then you are persisting all the changes
> needed to reference that file and it's data. And so if there was a
> rename in the history of that file, then that is persisted, too.
> Which means that both the original and the new directory
> modifications are persisted, too.
>
> *POSIX* doesn't require this - it says that if you O_DSYNC data,
> then it also includes all the metadata needed to reference that
> data. So even if the data is there, POSIX doesn't define whether the
> rename is there or noti, just that you can get to the fsync'd data
> via either the old or new name. IOWs, POSIX allows the behaviour to
> be implementation specific.
>
> In this case, file systems with strictly ordered metadata will end
> up making the rename visible because the rename occurred before the
> truncate that the fsync() is persisting...
>

That is not what is happening in Filipe's test. Test has:
- ftruncate A
- fsync A
- rename A B
- fsync B

So the reason this is working is because 2nd fsync needs to
persist ctime of B and not because it needs to persist the
truncate.

XFS does it, but it doesn't seem like something that any
filesystem is guaranteed to do the same:
        /*
         * We always want to hit the ctime on the source inode.
         *
         * This isn't strictly required by the standards since the source
         * inode isn't really being changed, but old unix file systems did
         * it and some incremental backup programs won't work without it.
         */
        xfs_trans_ichgtime(tp, src_ip, XFS_ICHGTIME_CHG);

So for the purpose of the test itself, which needs to guaranty that
btrfs persists the size, fsync of parent would be more robust for
any filesystem.

Thanks,
Amir.
Amir Goldstein March 5, 2019, 5:59 a.m. UTC | #9
On Tue, Mar 5, 2019 at 12:31 AM Filipe Manana <fdmanana@kernel.org> wrote:
>
> On Mon, Mar 4, 2019 at 5:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > On Mon, Mar 4, 2019 at 5:23 PM Filipe Manana <fdmanana@kernel.org> wrote:
> > >
> > > On Mon, Mar 4, 2019 at 3:04 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > >
> > > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> > > > >
> > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > >
> > > > > Test that if we truncate a file to reduce its size, rename it and then
> > > > > fsync it, after a power failure the file has a correct size and name.
> > > > >
> > > >
> > > > I am not sure that ext4/xfs semantics guaranty anything about
> > > > persisting file name after fsync of file?...
> > > >
> > > > > This test is motivated by a bug found in btrfs, which is fixed by a
> > > > > patch for the linux kernel titled:
> > > > >
> > > > >   "Btrfs: fix incorrect file size after shrinking truncate and fsync"
> > > >
> > > > At least the title of this patch says nothing about persisting the
> > > > name.
> > >
> > > Because the bug in btrfs is not about persisting the name, it's about
> > > persisting the correct inode size.
> > >
> >
> > OK. Still AFAIK, ext4/xfs don't guaranty that the name bar will exist
> > after power failure unless the parent dir was fsynced.
> > If you fsync the parent dir instead of the file after rename, it will
> > make the test correct for ext4/xfs (and POSIX for that matter).
>
> I'm not so sure about that. Couldn't find anything explicit anywhere about that.

man fsync(2):
"       Calling  fsync() does not necessarily ensure that the entry in
the directory
        containing the file has also reached disk.  For that an
explicit fsync() on a file
        descriptor for the directory is also needed."

> There are other tests that do similar assumptions, either after
> renames or after adding a new
> hard link (fsyncing just the file being enough to guarantee the new
> hard link exits after a power
> failure, without the need to fsync the parent directory). I don't
> recall ever having had such comment before,

Because I missed it in review...

> even for much more complex cases involving renames
> such as [1] for example.
>
> [1] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=e2d866668a98af4cd47cd6913418e410cd80aa93
>
> In this case both old and new names are in the same parent directory,
> but if they were in different directories, and
> requiring the user/application to fsync both directories, it could be
> dangerous, if the old parent dir is fsync'ed and
> a power failure happened before fsync'ing the new parent (requiring
> the user/application to know about this
> and fsync the new parent before the old parent dir also seems very
> demanding), leading to a potential file loss
> after replaying the log/journal (there were such cases in btrfs before).
>

This is not a problem for ext4/xfs.
fsync(newdir) guarantees persisting new filename
metadata ordering implies that olddir changes (and file nlink)
also persist.

To meet requirement of both btrfs and ext4/xfs in that case
your test may fsync both file AND newdir, as long as this
still allows your test to catch the bug?

> > Will that test modification still catch the btrfs bug (pre fix)?
>
> Yes, it will still trigger the btrfs bug.
>
> Well, I don't mind doing such change, but I would like to hear other
> opinions too, perhaps Dave could shed some light on this?
> Even old reiserfs and f2fs (both on posix and strict fsync modes)
> currently pass this test.
>

As I wrote to Dave, if file is not "metadata dirty" before rename,
then whether or not rename dirties to file for fsync is a filesystem
specific implementation detail that is not in any standard.

Since filesystems tend to try to optimize out unneeded journal
commits, so the observation about what filesystems do today
empirically is not a sufficiently strong argument.

Again, if your test can afford to do fsync of file and parent dir
it is best to be on the safe side (please audit other similar tests
that you know about). If your test cannot afford to fsync parent
dir, then at least a fat comment about this issue is worth about
the expected (but not guaranteed) effects of fsync of file.

Thanks,
Amir.
Filipe Manana March 5, 2019, 9:26 a.m. UTC | #10
On Tue, Mar 5, 2019 at 5:59 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Mar 5, 2019 at 12:31 AM Filipe Manana <fdmanana@kernel.org> wrote:
> >
> > On Mon, Mar 4, 2019 at 5:59 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > On Mon, Mar 4, 2019 at 5:23 PM Filipe Manana <fdmanana@kernel.org> wrote:
> > > >
> > > > On Mon, Mar 4, 2019 at 3:04 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > > > >
> > > > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> > > > > >
> > > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > > >
> > > > > > Test that if we truncate a file to reduce its size, rename it and then
> > > > > > fsync it, after a power failure the file has a correct size and name.
> > > > > >
> > > > >
> > > > > I am not sure that ext4/xfs semantics guaranty anything about
> > > > > persisting file name after fsync of file?...
> > > > >
> > > > > > This test is motivated by a bug found in btrfs, which is fixed by a
> > > > > > patch for the linux kernel titled:
> > > > > >
> > > > > >   "Btrfs: fix incorrect file size after shrinking truncate and fsync"
> > > > >
> > > > > At least the title of this patch says nothing about persisting the
> > > > > name.
> > > >
> > > > Because the bug in btrfs is not about persisting the name, it's about
> > > > persisting the correct inode size.
> > > >
> > >
> > > OK. Still AFAIK, ext4/xfs don't guaranty that the name bar will exist
> > > after power failure unless the parent dir was fsynced.
> > > If you fsync the parent dir instead of the file after rename, it will
> > > make the test correct for ext4/xfs (and POSIX for that matter).
> >
> > I'm not so sure about that. Couldn't find anything explicit anywhere about that.
>
> man fsync(2):
> "       Calling  fsync() does not necessarily ensure that the entry in
> the directory
>         containing the file has also reached disk.  For that an
> explicit fsync() on a file
>         descriptor for the directory is also needed."
>
> > There are other tests that do similar assumptions, either after
> > renames or after adding a new
> > hard link (fsyncing just the file being enough to guarantee the new
> > hard link exits after a power
> > failure, without the need to fsync the parent directory). I don't
> > recall ever having had such comment before,
>
> Because I missed it in review...
>
> > even for much more complex cases involving renames
> > such as [1] for example.
> >
> > [1] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=e2d866668a98af4cd47cd6913418e410cd80aa93
> >
> > In this case both old and new names are in the same parent directory,
> > but if they were in different directories, and
> > requiring the user/application to fsync both directories, it could be
> > dangerous, if the old parent dir is fsync'ed and
> > a power failure happened before fsync'ing the new parent (requiring
> > the user/application to know about this
> > and fsync the new parent before the old parent dir also seems very
> > demanding), leading to a potential file loss
> > after replaying the log/journal (there were such cases in btrfs before).
> >
>
> This is not a problem for ext4/xfs.
> fsync(newdir) guarantees persisting new filename
> metadata ordering implies that olddir changes (and file nlink)
> also persist.
>
> To meet requirement of both btrfs and ext4/xfs in that case
> your test may fsync both file AND newdir, as long as this
> still allows your test to catch the bug?
>
> > > Will that test modification still catch the btrfs bug (pre fix)?
> >
> > Yes, it will still trigger the btrfs bug.
> >
> > Well, I don't mind doing such change, but I would like to hear other
> > opinions too, perhaps Dave could shed some light on this?
> > Even old reiserfs and f2fs (both on posix and strict fsync modes)
> > currently pass this test.
> >
>
> As I wrote to Dave, if file is not "metadata dirty" before rename,
> then whether or not rename dirties to file for fsync is a filesystem
> specific implementation detail that is not in any standard.
>
> Since filesystems tend to try to optimize out unneeded journal
> commits, so the observation about what filesystems do today
> empirically is not a sufficiently strong argument.

Even if the strictly ordered metadata guarantees Dave explained so
many time before is not formally defined anywhere,
I think it's a behavior that shouldn't change, as not only it makes
sense, but applications might be relying on it and therefore
useful to make sure no regressions happen. If such tests should be
shared (and run only filesystems known to support
the strictly ordered metadata) or require something special, is
another question.

>
> Again, if your test can afford to do fsync of file and parent dir
> it is best to be on the safe side (please audit other similar tests
> that you know about). If your test cannot afford to fsync parent
> dir, then at least a fat comment about this issue is worth about
> the expected (but not guaranteed) effects of fsync of file.

I'm sending a v2 with the fsync against the parent directory, despite
not being needed for any fs I could
run the tests against, because it's a generic test.

Thanks.

>
> Thanks,
> Amir.
Amir Goldstein March 5, 2019, 10:51 a.m. UTC | #11
> > As I wrote to Dave, if file is not "metadata dirty" before rename,
> > then whether or not rename dirties to file for fsync is a filesystem
> > specific implementation detail that is not in any standard.
> >
> > Since filesystems tend to try to optimize out unneeded journal
> > commits, so the observation about what filesystems do today
> > empirically is not a sufficiently strong argument.
>
> Even if the strictly ordered metadata guarantees Dave explained so
> many time before is not formally defined anywhere,
> I think it's a behavior that shouldn't change, as not only it makes
> sense, but applications might be relying on it and therefore
> useful to make sure no regressions happen. If such tests should be
> shared (and run only filesystems known to support
> the strictly ordered metadata) or require something special, is
> another question.
>

I agree with you that it is good to test and verify those semantics.
I was not arguing against assuming strictly ordered metadata.
I was arguing against assuming that rename marks the inode dirty.
If rename does not mark the inode dirty, then fsync(file) may be optimized
away by filesystem and will not require persisting anything, without
breaking strictly ordered metadata semantics.


> >
> > Again, if your test can afford to do fsync of file and parent dir
> > it is best to be on the safe side (please audit other similar tests
> > that you know about). If your test cannot afford to fsync parent
> > dir, then at least a fat comment about this issue is worth about
> > the expected (but not guaranteed) effects of fsync of file.
>
> I'm sending a v2 with the fsync against the parent directory, despite
> not being needed for any fs I could
> run the tests against, because it's a generic test.

ACK.

Thanks,
Amir.
Dave Chinner March 5, 2019, 10:33 p.m. UTC | #12
On Tue, Mar 05, 2019 at 07:39:28AM +0200, Amir Goldstein wrote:
> On Tue, Mar 5, 2019 at 2:50 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote:
> > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> > > >
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > Test that if we truncate a file to reduce its size, rename it and then
> > > > fsync it, after a power failure the file has a correct size and name.

This says:

- ftruncate A
- rename A B
- fsync B

> > > I am not sure that ext4/xfs semantics guaranty anything about
> > > persisting file name after fsync of file?...
> >
> > They do.  It's that pesky "strictly ordered metadata" thing I keep
> > having to explain to people...
> >
> > i.e. if you fsync an inode, then you are persisting all the changes
> > needed to reference that file and it's data. And so if there was a
> > rename in the history of that file, then that is persisted, too.
> > Which means that both the original and the new directory
> > modifications are persisted, too.
> >
> > *POSIX* doesn't require this - it says that if you O_DSYNC data,
> > then it also includes all the metadata needed to reference that
> > data. So even if the data is there, POSIX doesn't define whether the
> > rename is there or noti, just that you can get to the fsync'd data
> > via either the old or new name. IOWs, POSIX allows the behaviour to
> > be implementation specific.
> >
> > In this case, file systems with strictly ordered metadata will end
> > up making the rename visible because the rename occurred before the
> > truncate that the fsync() is persisting...
> >
> 
> That is not what is happening in Filipe's test. Test has:
> - ftruncate A
> - fsync A
> - rename A B
> - fsync B

And this does not match the test description.

/me goes and looks at the test again to check.

Ok, the test is as Filipe describes:

- pwrite 0 0x8000 A
- fsync A
- truncate 3000 A
- rename A B
- fsync B

There is no fsync between truncate and rename.

> So the reason this is working is because 2nd fsync needs to
> persist ctime of B and not because it needs to persist the
> truncate.

ctime modifications during rename are irrelevent because there's no
fsync between the truncate and the rename so the file inode is
already dirty due to the truncate. I think you've got the wrong end
of the stick here, Amir. :)

Cheers,

Dave.
Amir Goldstein March 6, 2019, 7:51 a.m. UTC | #13
On Wed, Mar 6, 2019 at 12:33 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Mar 05, 2019 at 07:39:28AM +0200, Amir Goldstein wrote:
> > On Tue, Mar 5, 2019 at 2:50 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Mar 04, 2019 at 05:04:23PM +0200, Amir Goldstein wrote:
> > > > On Mon, Mar 4, 2019 at 4:44 PM <fdmanana@kernel.org> wrote:
> > > > >
> > > > > From: Filipe Manana <fdmanana@suse.com>
> > > > >
> > > > > Test that if we truncate a file to reduce its size, rename it and then
> > > > > fsync it, after a power failure the file has a correct size and name.
>
> This says:
>
> - ftruncate A
> - rename A B
> - fsync B
>
> > > > I am not sure that ext4/xfs semantics guaranty anything about
> > > > persisting file name after fsync of file?...
> > >
> > > They do.  It's that pesky "strictly ordered metadata" thing I keep
> > > having to explain to people...
> > >
> > > i.e. if you fsync an inode, then you are persisting all the changes
> > > needed to reference that file and it's data. And so if there was a
> > > rename in the history of that file, then that is persisted, too.
> > > Which means that both the original and the new directory
> > > modifications are persisted, too.
> > >
> > > *POSIX* doesn't require this - it says that if you O_DSYNC data,
> > > then it also includes all the metadata needed to reference that
> > > data. So even if the data is there, POSIX doesn't define whether the
> > > rename is there or noti, just that you can get to the fsync'd data
> > > via either the old or new name. IOWs, POSIX allows the behaviour to
> > > be implementation specific.
> > >
> > > In this case, file systems with strictly ordered metadata will end
> > > up making the rename visible because the rename occurred before the
> > > truncate that the fsync() is persisting...
> > >
> >
> > That is not what is happening in Filipe's test. Test has:
> > - ftruncate A
> > - fsync A
> > - rename A B
> > - fsync B
>
> And this does not match the test description.
>
> /me goes and looks at the test again to check.
>
> Ok, the test is as Filipe describes:
>
> - pwrite 0 0x8000 A
> - fsync A
> - truncate 3000 A
> - rename A B
> - fsync B
>
> There is no fsync between truncate and rename.
>
> > So the reason this is working is because 2nd fsync needs to
> > persist ctime of B and not because it needs to persist the
> > truncate.
>
> ctime modifications during rename are irrelevent because there's no
> fsync between the truncate and the rename so the file inode is
> already dirty due to the truncate. I think you've got the wrong end
> of the stick here, Amir. :)
>

Doh! The discussion is still interesting because people have
hard time to understand that those hidden details like ctime
update on rename may have different behavior on different fs
regardless if they obay ordered metadata or not.
Btrfs is different in the respect of metadata dependencies from
xfs/ext4 in many ways as seen in the different rename/link
crash consistency discussions.

Thanks,
Amir.
Dave Chinner March 6, 2019, 9:48 p.m. UTC | #14
On Wed, Mar 06, 2019 at 09:51:23AM +0200, Amir Goldstein wrote:
> On Wed, Mar 6, 2019 at 12:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > > So the reason this is working is because 2nd fsync needs to
> > > persist ctime of B and not because it needs to persist the
> > > truncate.
> >
> > ctime modifications during rename are irrelevent because there's no
> > fsync between the truncate and the rename so the file inode is
> > already dirty due to the truncate. I think you've got the wrong end
> > of the stick here, Amir. :)
> 
> Doh! The discussion is still interesting because people have
> hard time to understand that those hidden details like ctime
> update on rename may have different behavior on different fs
> regardless if they obay ordered metadata or not.
> Btrfs is different in the respect of metadata dependencies from
> xfs/ext4 in many ways as seen in the different rename/link
> crash consistency discussions.

Yes, little things like can result in different behaviour, but what
we are trying to do is get to the point where there is minimal
difference between all crash-recovery-capable linux filesystems.

e.g. what we see here is that by always including the inode being
moved in the rename transaction (regardless of how a filesystem
acheives that), we provide consistent, reliable, predictable
behaviour in all cases of "fsync after rename". IOWs, the SOMC model
that _require_metadata_journaling tests are supposed to conform to
is far more strict that POSIX requires and our tests need to reflect
this stricter consistency model.

IOWs, we should be encoding the behaviour we want in these tests
rather than implementing yet another "test POSIX compatible
behaviour" - POSIX is a complete crapshoot when it comes to
persistence requirements. And if a filesystem fails a SOMC-model
test, then the filesystem needs to be fixed, not have the test
"relaxed" to only exercise POSIX-defined behaviour.

Cheers,

Dave.
Amir Goldstein March 7, 2019, 7:52 a.m. UTC | #15
On Wed, Mar 6, 2019 at 11:48 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Mar 06, 2019 at 09:51:23AM +0200, Amir Goldstein wrote:
> > On Wed, Mar 6, 2019 at 12:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > So the reason this is working is because 2nd fsync needs to
> > > > persist ctime of B and not because it needs to persist the
> > > > truncate.
> > >
> > > ctime modifications during rename are irrelevent because there's no
> > > fsync between the truncate and the rename so the file inode is
> > > already dirty due to the truncate. I think you've got the wrong end
> > > of the stick here, Amir. :)
> >
> > Doh! The discussion is still interesting because people have
> > hard time to understand that those hidden details like ctime
> > update on rename may have different behavior on different fs
> > regardless if they obay ordered metadata or not.
> > Btrfs is different in the respect of metadata dependencies from
> > xfs/ext4 in many ways as seen in the different rename/link
> > crash consistency discussions.
>
> Yes, little things like can result in different behaviour, but what
> we are trying to do is get to the point where there is minimal
> difference between all crash-recovery-capable linux filesystems.
>
> e.g. what we see here is that by always including the inode being
> moved in the rename transaction (regardless of how a filesystem
> acheives that), we provide consistent, reliable, predictable
> behaviour in all cases of "fsync after rename". IOWs, the SOMC model
> that _require_metadata_journaling tests are supposed to conform to
> is far more strict that POSIX requires and our tests need to reflect
> this stricter consistency model.
>
> IOWs, we should be encoding the behaviour we want in these tests
> rather than implementing yet another "test POSIX compatible
> behaviour" - POSIX is a complete crapshoot when it comes to
> persistence requirements. And if a filesystem fails a SOMC-model
> test, then the filesystem needs to be fixed, not have the test
> "relaxed" to only exercise POSIX-defined behaviour.
>

Agreed! v1 is better than v2. Sorry for my mistake in v1 review.

I went back to look at similar fsync tests by Filipe:
generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512}

I found some alleged subtle mistakes about SOMC assumptions.

generic/336 does:
touch $SCRATCH_MNT/a/foo
ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link
touch $SCRATCH_MNT/b/bar
sync
unlink $SCRATCH_MNT/b/foo_link
mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo

And expects both unlink and rename to persist.
However, this is only true in the *very likely* case that there is no
journal commit in between unlink and rename, because fsync foo
is only guaranteed to persist metadata changes that depend on the
unlink and happened BEFORE it, which is not the case for the rename
of bar.

generic/343 and generic/510 do something similar.

At first glance, generic/498 is actually broken (for xfs) or at least
I don't understand why it works.

Thanks,
Amir.
Jayashree Mohan March 7, 2019, 11:19 p.m. UTC | #16
Hi Amir,

> I went back to look at similar fsync tests by Filipe:
> generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512}
>
> I found some alleged subtle mistakes about SOMC assumptions.
>
> generic/336 does:
> touch $SCRATCH_MNT/a/foo
> ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link
> touch $SCRATCH_MNT/b/bar
> sync
> unlink $SCRATCH_MNT/b/foo_link
> mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/
> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo

This is probably what's happening in this particular test :

SOMC requires:
          fsync(a/foo) must ensure unlink(b/foo_link)  (because they
were linked at some point)

But what happens is:
           fsync(a/foo)          -->  unlink(b/foo_link)
           unlink(b/foo_link)  -->  fsync(b)
           fsync(b)                 -->  rename goes through

SOMC should only require that the unlink persists. The rename
operation persists due to the side-effect of SOMC. So we should be
only testing if the unlink operation went through.

Take a look at this thread that describes the bug which resulted in
this test case (https://patchwork.kernel.org/patch/8293181/).
The file bar was lost during the rename operation and I think this
test simply wanted to verify that file bar was intact in either one of
the directories. Probably the reason this test was written was not to
test SOMC, but to verify that the log replay in btrfs does not delete
the renamed file - luckily it happens to pass on other file systems
because they all persist the rename operation as a side effect of
SOMC. To me, this test currently checks for something more than SOMC.
The check must be modified to only test if b/foo_link is removed, and
that the file bar is *either* in directory b or c.

Test 343 is a similar case where in btrfs log replay ended up
persisting the file in both the source and destination directories
after a rename operation
(https://patchwork.kernel.org/patch/8766401/). The checks must be
updated here as well.

Thanks,
Jayashree
Dave Chinner March 8, 2019, 3:46 a.m. UTC | #17
On Thu, Mar 07, 2019 at 09:52:03AM +0200, Amir Goldstein wrote:
> On Wed, Mar 6, 2019 at 11:48 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Mar 06, 2019 at 09:51:23AM +0200, Amir Goldstein wrote:
> > > On Wed, Mar 6, 2019 at 12:33 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > So the reason this is working is because 2nd fsync needs to
> > > > > persist ctime of B and not because it needs to persist the
> > > > > truncate.
> > > >
> > > > ctime modifications during rename are irrelevent because there's no
> > > > fsync between the truncate and the rename so the file inode is
> > > > already dirty due to the truncate. I think you've got the wrong end
> > > > of the stick here, Amir. :)
> > >
> > > Doh! The discussion is still interesting because people have
> > > hard time to understand that those hidden details like ctime
> > > update on rename may have different behavior on different fs
> > > regardless if they obay ordered metadata or not.
> > > Btrfs is different in the respect of metadata dependencies from
> > > xfs/ext4 in many ways as seen in the different rename/link
> > > crash consistency discussions.
> >
> > Yes, little things like can result in different behaviour, but what
> > we are trying to do is get to the point where there is minimal
> > difference between all crash-recovery-capable linux filesystems.
> >
> > e.g. what we see here is that by always including the inode being
> > moved in the rename transaction (regardless of how a filesystem
> > acheives that), we provide consistent, reliable, predictable
> > behaviour in all cases of "fsync after rename". IOWs, the SOMC model
> > that _require_metadata_journaling tests are supposed to conform to
> > is far more strict that POSIX requires and our tests need to reflect
> > this stricter consistency model.
> >
> > IOWs, we should be encoding the behaviour we want in these tests
> > rather than implementing yet another "test POSIX compatible
> > behaviour" - POSIX is a complete crapshoot when it comes to
> > persistence requirements. And if a filesystem fails a SOMC-model
> > test, then the filesystem needs to be fixed, not have the test
> > "relaxed" to only exercise POSIX-defined behaviour.
> >
> 
> Agreed! v1 is better than v2. Sorry for my mistake in v1 review.
> 
> I went back to look at similar fsync tests by Filipe:
> generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512}
> 
> I found some alleged subtle mistakes about SOMC assumptions.
> 
> generic/336 does:
> touch $SCRATCH_MNT/a/foo
> ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link
> touch $SCRATCH_MNT/b/bar
> sync
> unlink $SCRATCH_MNT/b/foo_link
> mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/
> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo
> 
> And expects both unlink and rename to persist.

*nod*. Yup, that's a typical async transactional ordering
behaviour because the dirty state is carried on the inode, not the
path or fd that is being used to access it.

> However, this is only true in the *very likely* case that there is no
> journal commit in between unlink and rename, because fsync foo
> is only guaranteed to persist metadata changes that depend on the
> unlink and happened BEFORE it, which is not the case for the rename
> of bar.

Yup, most likely. I haven't looked at any of these btrfs-inspired
fsync tests in any detail so it wouldn't surprise me that there are
issues like this in them - I just don't have time to look at
everything.

Indeed, if there are mistaken assumptions in the tests based around
pending dirty state and async transaction aggregation, then running
the tests with "-o dirsync" or even "-o wsync" should cause such
tests to fail.

> At first glance, generic/498 is actually broken (for xfs) or at least
> I don't understand why it works.

The test does this:

mkdir $SCRATCH_MNT/A
mkdir $SCRATCH_MNT/B
mkdir $SCRATCH_MNT/A/C
touch $SCRATCH_MNT/B/foo
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/B/foo

# It is important the new hard link is located in a hierarchy of new directories
# (not yet persisted).
ln $SCRATCH_MNT/B/foo $SCRATCH_MNT/A/C/foo
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/A

_flakey_drop_and_remount

And it expects /A and B/foo to be there afterwards. The comment
"(not yet persisted)" is about the buggy btrfs behaviour, not how a
SOMC fs should behave.

So, yeah, that should always work on XFS, because the first fsync
persists everything the test checks (due to common ancestors), and
second fsync is a no-opt because dir A has already been
checkpointed.


Cheers,

Dave.
Dave Chinner March 8, 2019, 4:35 a.m. UTC | #18
On Thu, Mar 07, 2019 at 05:19:51PM -0600, Jayashree Mohan wrote:
> Hi Amir,
> 
> > I went back to look at similar fsync tests by Filipe:
> > generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512}
> >
> > I found some alleged subtle mistakes about SOMC assumptions.
> >
> > generic/336 does:
> > touch $SCRATCH_MNT/a/foo
> > ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link
> > touch $SCRATCH_MNT/b/bar
> > sync
> > unlink $SCRATCH_MNT/b/foo_link
> > mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/
> > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo
> 
> This is probably what's happening in this particular test :
> 
> SOMC requires:
>           fsync(a/foo) must ensure unlink(b/foo_link)  (because they
> were linked at some point)
> 
> But what happens is:
>            fsync(a/foo)          -->  unlink(b/foo_link)
>            unlink(b/foo_link)  -->  fsync(b)
>            fsync(b)                 -->  rename goes through
> 
> SOMC should only require that the unlink persists.

That's a /fsync/ requirement, not SOMC.

SOMC says:

"If the rename after the fsync()d unlink is present after recovery,
then every metadata operation completed between the unlink and the
rename must also be present after recovery."

> The rename
> operation persists due to the side-effect of SOMC. So we should be
> only testing if the unlink operation went through.

No, it persists as a side effect of the a filesystem fsync()
implementation, not SOMC.

i.e. on fsync(), the filesystem must commit the journal at a point
in time that *includes the unlink*. That means it can chose any time
between the unlink and the moment the fsync() is received by the
filesystem. Once the point in time has been chosen by the fsync
operation, SOMC then defines what else must be journalled and
recovered with the information that must be fsync()d.

> Take a look at this thread that describes the bug which resulted in
> this test case (https://patchwork.kernel.org/patch/8293181/).

I really wouldn't try to infer anything from the bugs in btrfs fsync
behaviour or the test cases that expose them. 'Behave like other
filesystems" is not a substitute for having solid fundamental
algorithms...

Cheers,

Dave.
Vijay Chidambaram March 8, 2019, 3:11 p.m. UTC | #19
On Thu, Mar 7, 2019 at 10:35 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Mar 07, 2019 at 05:19:51PM -0600, Jayashree Mohan wrote:
> > Hi Amir,
> >
> > > I went back to look at similar fsync tests by Filipe:
> > > generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512}
> > >
> > > I found some alleged subtle mistakes about SOMC assumptions.
> > >
> > > generic/336 does:
> > > touch $SCRATCH_MNT/a/foo
> > > ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link
> > > touch $SCRATCH_MNT/b/bar
> > > sync
> > > unlink $SCRATCH_MNT/b/foo_link
> > > mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/
> > > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo
> >
> > This is probably what's happening in this particular test :
> >
> > SOMC requires:
> >           fsync(a/foo) must ensure unlink(b/foo_link)  (because they
> > were linked at some point)
> >
> > But what happens is:
> >            fsync(a/foo)          -->  unlink(b/foo_link)
> >            unlink(b/foo_link)  -->  fsync(b)
> >            fsync(b)                 -->  rename goes through
> >
> > SOMC should only require that the unlink persists.
>
> That's a /fsync/ requirement, not SOMC.
>
> SOMC says:
>
> "If the rename after the fsync()d unlink is present after recovery,
> then every metadata operation completed between the unlink and the
> rename must also be present after recovery."

Isn't this a bit too broad? That sounds more like total ordering of
metadata operations rather than SOMC. Shouldn't SOMC say "all
operations dependent upon the rename should be present after recovery
if rename is present?"
Dave Chinner March 19, 2019, 1:13 a.m. UTC | #20
(Sorry, missed this email and only just noticed it...)

On Fri, Mar 08, 2019 at 09:11:19AM -0600, Vijay Chidambaram wrote:
> On Thu, Mar 7, 2019 at 10:35 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Mar 07, 2019 at 05:19:51PM -0600, Jayashree Mohan wrote:
> > > Hi Amir,
> > >
> > > > I went back to look at similar fsync tests by Filipe:
> > > > generic/{106,107,335,336,341,342,343,348,498,501,502,509,510,512}
> > > >
> > > > I found some alleged subtle mistakes about SOMC assumptions.
> > > >
> > > > generic/336 does:
> > > > touch $SCRATCH_MNT/a/foo
> > > > ln $SCRATCH_MNT/a/foo $SCRATCH_MNT/b/foo_link
> > > > touch $SCRATCH_MNT/b/bar
> > > > sync
> > > > unlink $SCRATCH_MNT/b/foo_link
> > > > mv $SCRATCH_MNT/b/bar $SCRATCH_MNT/c/
> > > > $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a/foo
> > >
> > > This is probably what's happening in this particular test :
> > >
> > > SOMC requires:
> > >           fsync(a/foo) must ensure unlink(b/foo_link)  (because they
> > > were linked at some point)
> > >
> > > But what happens is:
> > >            fsync(a/foo)          -->  unlink(b/foo_link)
> > >            unlink(b/foo_link)  -->  fsync(b)
> > >            fsync(b)                 -->  rename goes through
> > >
> > > SOMC should only require that the unlink persists.
> >
> > That's a /fsync/ requirement, not SOMC.
> >
> > SOMC says:
> >
> > "If the rename after the fsync()d unlink is present after recovery,
> > then every metadata operation completed between the unlink and the
> > rename must also be present after recovery."
> 
> Isn't this a bit too broad? That sounds more like total ordering of
> metadata operations rather than SOMC. Shouldn't SOMC say "all
> operations dependent upon the rename should be present after recovery
> if rename is present?"

Yes. I was describing the explicit behaivour to expect in the
context of the given example, in which all the operations between
the unlink and rename were dependent operations. You've just
restated it as the general rule that I applied...

Cheers,

Dave.

Patch
diff mbox series

diff --git a/tests/generic/532 b/tests/generic/532
new file mode 100755
index 00000000..64992d85
--- /dev/null
+++ b/tests/generic/532
@@ -0,0 +1,66 @@ 
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# FS QA Test No. 532
+#
+# Test that if we truncate a file to reduce its size, rename it and then fsync
+# it, after a power failure the file has a correct size and name.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	_cleanup_flakey
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_require_metadata_journaling $SCRATCH_DEV
+_init_flakey
+_mount_flakey
+
+# Create our test file with an initial size of 8000 bytes, then fsync it,
+# followed by a truncate that reduces its size down to 3000 bytes.
+$XFS_IO_PROG -f -c "pwrite -S 0xab 0 8000" \
+	     -c "fsync" \
+	     -c "truncate 3000" \
+	     $SCRATCH_MNT/foo | _filter_xfs_io
+
+# Now rename the file and fsync it again.
+mv $SCRATCH_MNT/foo $SCRATCH_MNT/bar
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/bar
+
+# Simulate a power failure and mount the filesystem to check that the file was
+# persisted with the new name and has a size of 3000 bytes.
+_flakey_drop_and_remount
+
+[ -f $SCRATCH_MNT/bar ] || echo "file name 'bar' is missing"
+[ -f $SCRATCH_MNT/foo ] && echo "file name 'foo' still exists"
+
+echo "File content after power failure:"
+od -A d -t x1 $SCRATCH_MNT/bar
+
+_unmount_flakey
+
+status=0
+exit
diff --git a/tests/generic/532.out b/tests/generic/532.out
new file mode 100644
index 00000000..554fbe2a
--- /dev/null
+++ b/tests/generic/532.out
@@ -0,0 +1,8 @@ 
+QA output created by 532
+wrote 8000/8000 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File content after power failure:
+0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
+*
+0002992 ab ab ab ab ab ab ab ab
+0003000
diff --git a/tests/generic/group b/tests/generic/group
index 31011ac8..7d63f303 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -534,3 +534,4 @@ 
 529 auto quick attr
 530 auto quick unlink
 531 auto quick unlink
+532 auto quick log