Message ID | 20241015153957.2099812-1-maharmstone@fb.com (mailing list archive) |
---|---|
State | New |
Series | generic: add test for missing btrfs csums in log when doing async on subpage vol |
On Tue, Oct 15, 2024 at 4:42 PM Mark Harmstone <maharmstone@fb.com> wrote:
>
> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
> race could mean that csums weren't getting written to the log tree,
> leading to corruption when it was replayed.
>
> The patches to detect this log tree corruption are in btrfs-progs 6.11.

This shouldn't be needed, right?
Because after log replay the csums are missing and 'btrfs check'
detects (IIRC) missing csums for extents referred to by file extent items
in a subvolume tree - if it doesn't then it should be improved.

>
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
> This is a genericized version of the test I originally proposed as
> btrfs/333.
>
>  tests/generic/757     | 71 +++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/757.out |  2 ++
>  2 files changed, 73 insertions(+)
>  create mode 100755 tests/generic/757
>  create mode 100644 tests/generic/757.out
>
> diff --git a/tests/generic/757 b/tests/generic/757
> new file mode 100755
> index 00000000..6ad3d01e
> --- /dev/null
> +++ b/tests/generic/757
> @@ -0,0 +1,71 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# FS QA Test 757
> +#
> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
> +# weren't getting written to the log tree, causing corruptions on remount.
> +# This can be seen on subpage FSes on Linux 6.4.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick metadata log recoveryloop
> +
> +_fixed_by_kernel_commit e917ff56c8e7 \
> +	"btrfs: determine synchronous writers from bio or writeback control"

For generic tests what we do is:

[ $FSTYP == "btrfs" ] && _fixed_by_kernel_commit .....

As long as the failure has not been observed and fixed on other filesystems.
In case one day a regression happens in another fs, we just add a
corresponding line using the same logic.

Otherwise if the test one day fails on another fs and fstests
suggests that that commit is missing, it would be odd.

Everything else looks good, so with that fixed (maybe Zorro can change
that when picking the patch):

Reviewed-by: Filipe Manana <fdmanana@suse.com>

Thanks.

> +
> +fio_config=$tmp.fio
> +
> +. ./common/dmlogwrites
> +
> +_require_scratch
> +_require_log_writes
> +
> +cat >$fio_config <<EOF
> +[global]
> +iodepth=128
> +direct=1
> +ioengine=libaio
> +rw=randwrite
> +runtime=1s
> +[job0]
> +rw=randwrite
> +filename=$SCRATCH_MNT/file
> +size=1g
> +fdatasync=1
> +EOF
> +
> +_require_fio $fio_config
> +
> +cat $fio_config >> $seqres.full
> +
> +_log_writes_init $SCRATCH_DEV
> +_log_writes_mkfs >> $seqres.full 2>&1
> +_log_writes_mark mkfs
> +
> +_log_writes_mount
> +
> +$FIO_PROG $fio_config > /dev/null 2>&1
> +_log_writes_unmount
> +
> +_log_writes_remove
> +
> +prev=$(_log_writes_mark_to_entry_number mkfs)
> +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
> +cur=$(_log_writes_find_next_fua $prev)
> +[ -z "$cur" ] && _fail "failed to locate next FUA write"
> +
> +while [ ! -z "$cur" ]; do
> +	_log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
> +
> +	_check_scratch_fs
> +
> +	prev=$cur
> +	cur=$(_log_writes_find_next_fua $(($cur + 1)))
> +	[ -z "$cur" ] && break
> +done
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/757.out b/tests/generic/757.out
> new file mode 100644
> index 00000000..dfbc8094
> --- /dev/null
> +++ b/tests/generic/757.out
> @@ -0,0 +1,2 @@
> +QA output created by 757
> +Silence is golden
> --
> 2.44.2
>
>
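The gating suggested in the review above can be tried outside fstests with a small sketch. It assumes a stub in place of the real `_fixed_by_kernel_commit` helper (the stub only prints the hint), and `known_fix_hint` is a hypothetical wrapper that exists only so the behaviour can be observed for different `FSTYP` values.

```shell
# Stand-in for the fstests helper of the same name (assumption: the real
# helper records the commit so it can be suggested when the test fails).
_fixed_by_kernel_commit() {
	echo "known fix: $1 ($2)"
}

# Hypothetical wrapper, for illustration only: emit the hint for btrfs
# and stay silent for every other filesystem type.
known_fix_hint() {
	local FSTYP="$1"
	# "=" is the portable spelling of the "==" used in the review comment.
	[ "$FSTYP" = "btrfs" ] && _fixed_by_kernel_commit e917ff56c8e7 \
		"btrfs: determine synchronous writers from bio or writeback control"
	return 0
}

known_fix_hint btrfs   # prints the hint
known_fix_hint xfs     # prints nothing
```

If another filesystem later gets its own fix for the same test, a second line of the same shape with a different `FSTYP` value can sit next to the first one.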
On 16/10/24 12:09, Filipe Manana wrote:
>
> On Tue, Oct 15, 2024 at 4:42 PM Mark Harmstone <maharmstone@fb.com> wrote:
>>
>> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
>> race could mean that csums weren't getting written to the log tree,
>> leading to corruption when it was replayed.
>>
>> The patches to detect this log tree corruption are in btrfs-progs 6.11.
>
> This shouldn't be needed, right?
> Because after log replay the csums are missing and 'btrfs check'
> detects (IIRC) missing csums for extents referred to by file extent items
> in a subvolume tree - if it doesn't then it should be improved.

Yes, but we're not mounting it in the tests between the log_writes
calls, so the log isn't getting replayed. The patches to btrfs check
make it so that it identifies filesystems that would get corrupted as
soon as they're next mounted.

>
>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>> This is a genericized version of the test I originally proposed as
>> btrfs/333.
>>
>>  tests/generic/757     | 71 +++++++++++++++++++++++++++++++++++++++++++
>>  tests/generic/757.out |  2 ++
>>  2 files changed, 73 insertions(+)
>>  create mode 100755 tests/generic/757
>>  create mode 100644 tests/generic/757.out
>>
>> diff --git a/tests/generic/757 b/tests/generic/757
>> new file mode 100755
>> index 00000000..6ad3d01e
>> --- /dev/null
>> +++ b/tests/generic/757
>> @@ -0,0 +1,71 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +#
>> +# FS QA Test 757
>> +#
>> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
>> +# weren't getting written to the log tree, causing corruptions on remount.
>> +# This can be seen on subpage FSes on Linux 6.4.
>> +#
>> +. ./common/preamble
>> +_begin_fstest auto quick metadata log recoveryloop
>> +
>> +_fixed_by_kernel_commit e917ff56c8e7 \
>> +	"btrfs: determine synchronous writers from bio or writeback control"
>
> For generic tests what we do is:
>
> [ $FSTYP == "btrfs" ] && _fixed_by_kernel_commit .....
>
> As long as the failure has not been observed and fixed on other filesystems.
> In case one day a regression happens in another fs, we just add a
> corresponding line using the same logic.
>
> Otherwise if the test one day fails on another fs and fstests
> suggests that that commit is missing, it would be odd.
>
> Everything else looks good, so with that fixed (maybe Zorro can change
> that when picking the patch):
>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks.
>

Thanks Filipe. Zorro, let me know if you're happy making this change, or
otherwise I'll resubmit.

Mark
On Fri, Oct 18, 2024 at 05:36:26PM +0000, Mark Harmstone wrote:
> On 16/10/24 12:09, Filipe Manana wrote:
> >
> > On Tue, Oct 15, 2024 at 4:42 PM Mark Harmstone <maharmstone@fb.com> wrote:
> >>
> >> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
> >> race could mean that csums weren't getting written to the log tree,
> >> leading to corruption when it was replayed.
> >>
> >> The patches to detect this log tree corruption are in btrfs-progs 6.11.
> >
> > This shouldn't be needed, right?
> > Because after log replay the csums are missing and 'btrfs check'
> > detects (IIRC) missing csums for extents referred to by file extent items
> > in a subvolume tree - if it doesn't then it should be improved.
>
> Yes, but we're not mounting it in the tests between the log_writes
> calls, so the log isn't getting replayed. The patches to btrfs check
> make it so that it identifies filesystems that would get corrupted as
> soon as they're next mounted.
>
> >
> >>
> >> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> >> ---
> >> This is a genericized version of the test I originally proposed as
> >> btrfs/333.
> >>
> >>  tests/generic/757     | 71 +++++++++++++++++++++++++++++++++++++++++++
> >>  tests/generic/757.out |  2 ++
> >>  2 files changed, 73 insertions(+)
> >>  create mode 100755 tests/generic/757
> >>  create mode 100644 tests/generic/757.out
> >>
> >> diff --git a/tests/generic/757 b/tests/generic/757
> >> new file mode 100755
> >> index 00000000..6ad3d01e
> >> --- /dev/null
> >> +++ b/tests/generic/757
> >> @@ -0,0 +1,71 @@
> >> +#! /bin/bash
> >> +# SPDX-License-Identifier: GPL-2.0
> >> +#
> >> +# FS QA Test 757
> >> +#
> >> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
> >> +# weren't getting written to the log tree, causing corruptions on remount.
> >> +# This can be seen on subpage FSes on Linux 6.4.
> >> +#
> >> +. ./common/preamble
> >> +_begin_fstest auto quick metadata log recoveryloop
> >> +
> >> +_fixed_by_kernel_commit e917ff56c8e7 \
> >> +	"btrfs: determine synchronous writers from bio or writeback control"
> >
> > For generic tests what we do is:
> >
> > [ $FSTYP == "btrfs" ] && _fixed_by_kernel_commit .....
> >
> > As long as the failure has not been observed and fixed on other filesystems.
> > In case one day a regression happens in another fs, we just add a
> > corresponding line using the same logic.
> >
> > Otherwise if the test one day fails on another fs and fstests
> > suggests that that commit is missing, it would be odd.
> >
> > Everything else looks good, so with that fixed (maybe Zorro can change
> > that when picking the patch):
> >
> > Reviewed-by: Filipe Manana <fdmanana@suse.com>
> >
> > Thanks.
> >
>
> Thanks Filipe. Zorro, let me know if you're happy making this change, or
> otherwise I'll resubmit.

I can help to change that when I merge it. Thank you and Filipe.

Thanks,
Zorro

>
> Mark
>
>
On Tue, Oct 15, 2024 at 04:39:34PM +0100, Mark Harmstone wrote:
> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
> race could mean that csums weren't getting written to the log tree,
> leading to corruption when it was replayed.
>
> The patches to detect this log tree corruption are in btrfs-progs 6.11.
>
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---
> This is a genericized version of the test I originally proposed as
> btrfs/333.
>
>  tests/generic/757     | 71 +++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/757.out |  2 ++
>  2 files changed, 73 insertions(+)
>  create mode 100755 tests/generic/757
>  create mode 100644 tests/generic/757.out
>
> diff --git a/tests/generic/757 b/tests/generic/757
> new file mode 100755
> index 00000000..6ad3d01e
> --- /dev/null
> +++ b/tests/generic/757
> @@ -0,0 +1,71 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# FS QA Test 757
> +#
> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
> +# weren't getting written to the log tree, causing corruptions on remount.
> +# This can be seen on subpage FSes on Linux 6.4.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick metadata log recoveryloop
> +
> +_fixed_by_kernel_commit e917ff56c8e7 \
> +	"btrfs: determine synchronous writers from bio or writeback control"
> +
> +fio_config=$tmp.fio
> +
> +. ./common/dmlogwrites
> +
> +_require_scratch
> +_require_log_writes
> +
> +cat >$fio_config <<EOF
> +[global]
> +iodepth=128
> +direct=1
> +ioengine=libaio
> +rw=randwrite
> +runtime=1s
> +[job0]
> +rw=randwrite
> +filename=$SCRATCH_MNT/file
> +size=1g
> +fdatasync=1
> +EOF
> +
> +_require_fio $fio_config
> +
> +cat $fio_config >> $seqres.full
> +
> +_log_writes_init $SCRATCH_DEV
> +_log_writes_mkfs >> $seqres.full 2>&1
> +_log_writes_mark mkfs
> +
> +_log_writes_mount

For dmlogwrites tests, we generally call _log_writes_cleanup in _cleanup
to restore SCRATCH_DEV even if the test is killed in the middle of its
testing phase, so that later tests don't fail. I'll add the code below
when I merge it, if there's no objection from you.

_cleanup()
{
	cd /
	_log_writes_cleanup
	rm -f $tmp.*
}

Thanks,
Zorro

> +
> +$FIO_PROG $fio_config > /dev/null 2>&1
> +_log_writes_unmount
> +
> +_log_writes_remove
> +
> +prev=$(_log_writes_mark_to_entry_number mkfs)
> +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
> +cur=$(_log_writes_find_next_fua $prev)
> +[ -z "$cur" ] && _fail "failed to locate next FUA write"
> +
> +while [ ! -z "$cur" ]; do
> +	_log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
> +
> +	_check_scratch_fs
> +
> +	prev=$cur
> +	cur=$(_log_writes_find_next_fua $(($cur + 1)))
> +	[ -z "$cur" ] && break
> +done
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/757.out b/tests/generic/757.out
> new file mode 100644
> index 00000000..dfbc8094
> --- /dev/null
> +++ b/tests/generic/757.out
> @@ -0,0 +1,2 @@
> +QA output created by 757
> +Silence is golden
> --
> 2.44.2
>
>
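Why the `_cleanup` override matters can be shown with a small standalone sketch. It assumes the harness runs `_cleanup` from an exit trap; `_log_writes_cleanup` is replaced by a stub that only prints a message, and `tmp` and `run_interrupted_test` are hypothetical names used for the demonstration.

```shell
# Stand-in for the dmlogwrites helper; the real one tears down the
# log-writes device so SCRATCH_DEV is usable again.
_log_writes_cleanup() {
	echo "log-writes target torn down"
}

tmp=/tmp/cleanup-sketch.$$   # hypothetical temp-file prefix

_cleanup() {
	cd /
	_log_writes_cleanup
	rm -f $tmp.*
}

# Hypothetical test body that dies part-way through. The EXIT trap plays
# the role the harness is assumed to play in a real run.
run_interrupted_test() {
	(
		trap '_cleanup' EXIT
		touch "$tmp.fio"
		exit 1          # simulate the test being aborted mid-run
	)
	return 0            # ignore the simulated failure status
}

run_interrupted_test
```

Even though the body exits early, the stubbed teardown still runs and the temporary file is removed, which is the property later tests depend on.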
On Tue, Oct 15, 2024 at 04:39:34PM +0100, Mark Harmstone wrote:
> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
> race could mean that csums weren't getting written to the log tree,
> leading to corruption when it was replayed.
>
> The patches to detect this log tree corruption are in btrfs-progs 6.11.
>
> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
> ---

Sorry, more review points below. I can help to change these if you say "yes"
to all :)

> This is a genericized version of the test I originally proposed as
> btrfs/333.
>
>  tests/generic/757     | 71 +++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/757.out |  2 ++
>  2 files changed, 73 insertions(+)
>  create mode 100755 tests/generic/757
>  create mode 100644 tests/generic/757.out
>
> diff --git a/tests/generic/757 b/tests/generic/757
> new file mode 100755
> index 00000000..6ad3d01e
> --- /dev/null
> +++ b/tests/generic/757
> @@ -0,0 +1,71 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# FS QA Test 757
> +#
> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
> +# weren't getting written to the log tree, causing corruptions on remount.
> +# This can be seen on subpage FSes on Linux 6.4.
> +#
> +. ./common/preamble
> +_begin_fstest auto quick metadata log recoveryloop
                                        ^^^
                                        aio

> +
> +_fixed_by_kernel_commit e917ff56c8e7 \
> +	"btrfs: determine synchronous writers from bio or writeback control"
> +
> +fio_config=$tmp.fio
> +
> +. ./common/dmlogwrites
> +
> +_require_scratch
> +_require_log_writes
> +
> +cat >$fio_config <<EOF
> +[global]
> +iodepth=128
> +direct=1
> +ioengine=libaio

_require_aiodio ?

> +rw=randwrite
> +runtime=1s
> +[job0]
> +rw=randwrite
> +filename=$SCRATCH_MNT/file
> +size=1g
> +fdatasync=1
> +EOF
> +
> +_require_fio $fio_config
> +
> +cat $fio_config >> $seqres.full
> +
> +_log_writes_init $SCRATCH_DEV
> +_log_writes_mkfs >> $seqres.full 2>&1
> +_log_writes_mark mkfs
> +
> +_log_writes_mount
> +
> +$FIO_PROG $fio_config > /dev/null 2>&1

Don't you care about the fio output anymore? Maybe use > $seqres.full ?

And just to make sure, do you want to ignore fio failures, as you do with "2>&1"?

Thanks,
Zorro

> +_log_writes_unmount
> +
> +_log_writes_remove
> +
> +prev=$(_log_writes_mark_to_entry_number mkfs)
> +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
> +cur=$(_log_writes_find_next_fua $prev)
> +[ -z "$cur" ] && _fail "failed to locate next FUA write"
> +
> +while [ ! -z "$cur" ]; do
> +	_log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
> +
> +	_check_scratch_fs
> +
> +	prev=$cur
> +	cur=$(_log_writes_find_next_fua $(($cur + 1)))
> +	[ -z "$cur" ] && break
> +done
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/757.out b/tests/generic/757.out
> new file mode 100644
> index 00000000..dfbc8094
> --- /dev/null
> +++ b/tests/generic/757.out
> @@ -0,0 +1,2 @@
> +QA output created by 757
> +Silence is golden
> --
> 2.44.2
>
>
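The two questions about the fio invocation are separate concerns: the redirection decides where messages go, while failures go unnoticed only because the exit status is never checked. A minimal sketch of keeping both, assuming a stand-in command instead of fio (`fake_fio`, `seqres_full` and `run_and_log` are hypothetical names, not fstests identifiers):

```shell
# Hypothetical stand-in for "$FIO_PROG $fio_config": writes to stdout and
# stderr, then fails.
fake_fio() {
	echo "job0: starting"
	echo "fio: io error" >&2
	return 1
}

seqres_full=/tmp/fio-sketch.$$.full   # plays the role of $seqres.full

run_and_log() {
	# Keep both output streams in the log, and act on the exit status
	# instead of discarding it.
	if ! fake_fio >> "$seqres_full" 2>&1; then
		echo "fio failed, see $seqres_full"
	fi
}

run_and_log
```

Whether a fio failure should actually fail the test is a decision for the patch author; the sketch only shows that logging the output and checking the status do not conflict.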
These look good to me. Thank you for your help.

On 23/10/24 04:53, Zorro Lang wrote:
>
> On Tue, Oct 15, 2024 at 04:39:34PM +0100, Mark Harmstone wrote:
>> Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
>> race could mean that csums weren't getting written to the log tree,
>> leading to corruption when it was replayed.
>>
>> The patches to detect this log tree corruption are in btrfs-progs 6.11.
>>
>> Signed-off-by: Mark Harmstone <maharmstone@fb.com>
>> ---
>
> Sorry, more review points below. I can help to change these if you say "yes"
> to all :)
>
>> This is a genericized version of the test I originally proposed as
>> btrfs/333.
>>
>>  tests/generic/757     | 71 +++++++++++++++++++++++++++++++++++++++++++
>>  tests/generic/757.out |  2 ++
>>  2 files changed, 73 insertions(+)
>>  create mode 100755 tests/generic/757
>>  create mode 100644 tests/generic/757.out
>>
>> diff --git a/tests/generic/757 b/tests/generic/757
>> new file mode 100755
>> index 00000000..6ad3d01e
>> --- /dev/null
>> +++ b/tests/generic/757
>> @@ -0,0 +1,71 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +#
>> +# FS QA Test 757
>> +#
>> +# Test async dio with fsync to test a btrfs bug where a race meant that csums
>> +# weren't getting written to the log tree, causing corruptions on remount.
>> +# This can be seen on subpage FSes on Linux 6.4.
>> +#
>> +. ./common/preamble
>> +_begin_fstest auto quick metadata log recoveryloop
>                                         ^^^
>                                         aio
>
>> +
>> +_fixed_by_kernel_commit e917ff56c8e7 \
>> +	"btrfs: determine synchronous writers from bio or writeback control"
>> +
>> +fio_config=$tmp.fio
>> +
>> +. ./common/dmlogwrites
>> +
>> +_require_scratch
>> +_require_log_writes
>> +
>> +cat >$fio_config <<EOF
>> +[global]
>> +iodepth=128
>> +direct=1
>> +ioengine=libaio
>
> _require_aiodio ?
>
>> +rw=randwrite
>> +runtime=1s
>> +[job0]
>> +rw=randwrite
>> +filename=$SCRATCH_MNT/file
>> +size=1g
>> +fdatasync=1
>> +EOF
>> +
>> +_require_fio $fio_config
>> +
>> +cat $fio_config >> $seqres.full
>> +
>> +_log_writes_init $SCRATCH_DEV
>> +_log_writes_mkfs >> $seqres.full 2>&1
>> +_log_writes_mark mkfs
>> +
>> +_log_writes_mount
>> +
>> +$FIO_PROG $fio_config > /dev/null 2>&1
>
> Don't you care about the fio output anymore? Maybe use > $seqres.full ?
>
> And just to make sure, do you want to ignore fio failures, as you do with "2>&1"?
>
> Thanks,
> Zorro
>
>> +_log_writes_unmount
>> +
>> +_log_writes_remove
>> +
>> +prev=$(_log_writes_mark_to_entry_number mkfs)
>> +[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
>> +cur=$(_log_writes_find_next_fua $prev)
>> +[ -z "$cur" ] && _fail "failed to locate next FUA write"
>> +
>> +while [ ! -z "$cur" ]; do
>> +	_log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
>> +
>> +	_check_scratch_fs
>> +
>> +	prev=$cur
>> +	cur=$(_log_writes_find_next_fua $(($cur + 1)))
>> +	[ -z "$cur" ] && break
>> +done
>> +
>> +echo "Silence is golden"
>> +
>> +# success, all done
>> +status=0
>> +exit
>> diff --git a/tests/generic/757.out b/tests/generic/757.out
>> new file mode 100644
>> index 00000000..dfbc8094
>> --- /dev/null
>> +++ b/tests/generic/757.out
>> @@ -0,0 +1,2 @@
>> +QA output created by 757
>> +Silence is golden
>> --
>> 2.44.2
>>
>>
>
diff --git a/tests/generic/757 b/tests/generic/757
new file mode 100755
index 00000000..6ad3d01e
--- /dev/null
+++ b/tests/generic/757
@@ -0,0 +1,71 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# FS QA Test 757
+#
+# Test async dio with fsync to test a btrfs bug where a race meant that csums
+# weren't getting written to the log tree, causing corruptions on remount.
+# This can be seen on subpage FSes on Linux 6.4.
+#
+. ./common/preamble
+_begin_fstest auto quick metadata log recoveryloop
+
+_fixed_by_kernel_commit e917ff56c8e7 \
+	"btrfs: determine synchronous writers from bio or writeback control"
+
+fio_config=$tmp.fio
+
+. ./common/dmlogwrites
+
+_require_scratch
+_require_log_writes
+
+cat >$fio_config <<EOF
+[global]
+iodepth=128
+direct=1
+ioengine=libaio
+rw=randwrite
+runtime=1s
+[job0]
+rw=randwrite
+filename=$SCRATCH_MNT/file
+size=1g
+fdatasync=1
+EOF
+
+_require_fio $fio_config
+
+cat $fio_config >> $seqres.full
+
+_log_writes_init $SCRATCH_DEV
+_log_writes_mkfs >> $seqres.full 2>&1
+_log_writes_mark mkfs
+
+_log_writes_mount
+
+$FIO_PROG $fio_config > /dev/null 2>&1
+_log_writes_unmount
+
+_log_writes_remove
+
+prev=$(_log_writes_mark_to_entry_number mkfs)
+[ -z "$prev" ] && _fail "failed to locate entry mark 'mkfs'"
+cur=$(_log_writes_find_next_fua $prev)
+[ -z "$cur" ] && _fail "failed to locate next FUA write"
+
+while [ ! -z "$cur" ]; do
+	_log_writes_replay_log_range $cur $SCRATCH_DEV >> $seqres.full
+
+	_check_scratch_fs
+
+	prev=$cur
+	cur=$(_log_writes_find_next_fua $(($cur + 1)))
+	[ -z "$cur" ] && break
+done
+
+echo "Silence is golden"
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/757.out b/tests/generic/757.out
new file mode 100644
index 00000000..dfbc8094
--- /dev/null
+++ b/tests/generic/757.out
@@ -0,0 +1,2 @@
+QA output created by 757
+Silence is golden
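The replay loop at the end of the test walks the recorded write log from one FUA write to the next and checks the filesystem at each point. The control flow can be sketched without a block device, assuming a stub for `_log_writes_find_next_fua` that prints the first FUA entry at or after its argument and nothing when none is left (assumed semantics, inferred from the `$cur + 1` step in the patch). `FUA_ENTRIES` and `walk_fua_points` are hypothetical names.

```shell
# Hypothetical set of log entry numbers that carry a FUA write.
FUA_ENTRIES="3 7 12"

# Stand-in for the dmlogwrites helper (assumed semantics, see above).
_log_writes_find_next_fua() {
	local start="$1" e
	for e in $FUA_ENTRIES; do
		if [ "$e" -ge "$start" ]; then
			echo "$e"
			return 0
		fi
	done
}

# Visit every FUA point from the given start entry onwards.
walk_fua_points() {
	local cur
	cur=$(_log_writes_find_next_fua "$1")
	while [ -n "$cur" ]; do
		# The real test replays the log up to $cur and runs fsck here.
		echo "replay to entry $cur, then check"
		cur=$(_log_writes_find_next_fua $((cur + 1)))
	done
}

walk_fua_points 0
```

In this sketch the loop condition alone ends the walk, so no extra `break` or `prev` bookkeeping is needed; that is an observation about the sketch, not a change to the submitted patch.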
Adds a test for a bug we encountered on Linux 6.4 on aarch64, where a
race could mean that csums weren't getting written to the log tree,
leading to corruption when it was replayed.

The patches to detect this log tree corruption are in btrfs-progs 6.11.

Signed-off-by: Mark Harmstone <maharmstone@fb.com>
---
This is a genericized version of the test I originally proposed as
btrfs/333.

 tests/generic/757     | 71 +++++++++++++++++++++++++++++++++++++++++++
 tests/generic/757.out |  2 ++
 2 files changed, 73 insertions(+)
 create mode 100755 tests/generic/757
 create mode 100644 tests/generic/757.out