Message ID | ff12321e5ddfdb763b9258f746e67fe3f6ea1321.1684489375.git.fdmanana@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] btrfs/213: avoid occasional failure due to already finished balance | expand |
On 2023/5/19 17:57, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > btrfs/213 writes data, in 1M extents, for 4 seconds into a file, then > triggers a balance and then after 2 seconds it tries to cancel the > balance operation. More often than not, this works because the balance > is still running after 2 seconds. However it also fails sporadically > because balance has finished in less than 2 seconds, which is plausible > since data and metadata are cached or other factors such as virtualized > environment. When that's the case, it fails like this: > > $ ./check btrfs/213 > FSTYP -- btrfs > PLATFORM -- Linux/x86_64 debian0 6.4.0-rc1-btrfs-next-131+ #1 SMP PREEMPT_DYNAMIC Thu May 11 11:26:19 WEST 2023 > MKFS_OPTIONS -- /dev/sdc > MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 > > btrfs/213 51s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad) > --- tests/btrfs/213.out 2020-06-10 19:29:03.822519250 +0100 > +++ /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad 2023-05-17 15:39:32.653727223 +0100 > @@ -1,2 +1,3 @@ > QA output created by 213 > +ERROR: balance cancel on '/home/fdmanana/btrfs-tests/scratch_1' failed: Not in progress > Silence is golden > ... > (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/213.out /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad' to see the entire diff) > Ran: btrfs/213 > Failures: btrfs/213 > Failed 1 of 1 tests > > To make it much less likely that balance has already finished before we > try to cancel it, unmount and mount again the filesystem before starting > balance, to clear cached metadata and data, and also double the time we > spend writing 1M data extents. Also make the test not run with an > informative message if we detect that balance finished before we could > cancel it. > > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks, Qu > --- > > v2: Make the test _notrun if we detect that balance finished before we > could cancel it. > > tests/btrfs/213 | 16 +++++++++++++--- > 1 file changed, 13 insertions(+), 3 deletions(-) > > diff --git a/tests/btrfs/213 b/tests/btrfs/213 > index e16e41c0..5666d9b9 100755 > --- a/tests/btrfs/213 > +++ b/tests/btrfs/213 > @@ -31,7 +31,7 @@ _fixed_by_kernel_commit 1dae7e0e58b4 \ > _scratch_mkfs >> $seqres.full > _scratch_mount > > -runtime=4 > +runtime=8 > > # Create enough IO so that we need around $runtime seconds to relocate it. > # > @@ -42,11 +42,21 @@ sleep $runtime > kill $write_pid > wait $write_pid > > +# Unmount and mount again the fs to clear any cached data and metadata, so that > +# it's less likely balance has already finished when we try to cancel it below. > +_scratch_cycle_mount > + > # Now balance should take at least $runtime seconds, we can cancel it at > # $runtime/2 to ensure a success cancel. > _run_btrfs_balance_start -d --bg "$SCRATCH_MNT" > -sleep $(($runtime / 2)) > -$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT" > +sleep $(($runtime / 4)) > +# It's possible that balance has already completed. It's unlikely but often > +# it may happen due to virtualization, caching and other factors, so ignore > +# any error about no balance currently running. > +$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT" 2>&1 | grep -iq 'not in progress' > +if [ $? -eq 0 ]; then > + _not_run "balance finished before we could cancel it" > +fi > > # Now check if we can finish relocating metadata, which should finish very > # quickly.
LGTM
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Hi, > From: Filipe Manana <fdmanana@suse.com> > > btrfs/213 writes data, in 1M extents, for 4 seconds into a file, then > triggers a balance and then after 2 seconds it tries to cancel the > balance operation. More often than not, this works because the balance > is still running after 2 seconds. However it also fails sporadically > because balance has finished in less than 2 seconds, which is plausible > since data and metadata are cached or other factors such as virtualized > environment. When that's the case, it fails like this: > > $ ./check btrfs/213 > FSTYP -- btrfs > PLATFORM -- Linux/x86_64 debian0 6.4.0-rc1-btrfs-next-131+ #1 SMP PREEMPT_DYNAMIC Thu May 11 11:26:19 WEST 2023 > MKFS_OPTIONS -- /dev/sdc > MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 > > btrfs/213 51s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad) > --- tests/btrfs/213.out 2020-06-10 19:29:03.822519250 +0100 > +++ /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad 2023-05-17 15:39:32.653727223 +0100 > @@ -1,2 +1,3 @@ > QA output created by 213 > +ERROR: balance cancel on '/home/fdmanana/btrfs-tests/scratch_1' failed: Not in progress > Silence is golden > ... > (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/213.out /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad' to see the entire diff) > Ran: btrfs/213 > Failures: btrfs/213 > Failed 1 of 1 tests > > To make it much less likely that balance has already finished before we > try to cancel it, unmount and mount again the filesystem before starting > balance, to clear cached metadata and data, and also double the time we > spend writing 1M data extents. Also make the test not run with an > informative message if we detect that balance finished before we could > cancel it. > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > --- > > v2: Make the test _notrun if we detect that balance finished before we > could cancel it. > > tests/btrfs/213 | 16 +++++++++++++--- > 1 file changed, 13 insertions(+), 3 deletions(-) > > diff --git a/tests/btrfs/213 b/tests/btrfs/213 > index e16e41c0..5666d9b9 100755 > --- a/tests/btrfs/213 > +++ b/tests/btrfs/213 > @@ -31,7 +31,7 @@ _fixed_by_kernel_commit 1dae7e0e58b4 \ > _scratch_mkfs >> $seqres.full > _scratch_mount > > -runtime=4 > +runtime=8 > > # Create enough IO so that we need around $runtime seconds to relocate it. > # > @@ -42,11 +42,21 @@ sleep $runtime > kill $write_pid > wait $write_pid > > +# Unmount and mount again the fs to clear any cached data and metadata, so that > +# it's less likely balance has already finished when we try to cancel it below. > +_scratch_cycle_mount > + > # Now balance should take at least $runtime seconds, we can cancel it at > # $runtime/2 to ensure a success cancel. > _run_btrfs_balance_start -d --bg "$SCRATCH_MNT" > -sleep $(($runtime / 2)) > -$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT" > +sleep $(($runtime / 4)) > +# It's possible that balance has already completed. It's unlikely but often > +# it may happen due to virtualization, caching and other factors, so ignore > +# any error about no balance currently running. > +$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT" 2>&1 | grep -iq 'not in progress' > +if [ $? -eq 0 ]; then > + _not_run "balance finished before we could cancel it" > +fi fstests(btrfs/213) failed once here. btrfs/213 22s ... - output mismatch (see /usr/hpc-bio/xfstests/results//btrfs/213.out.bad) --- tests/btrfs/213.out 2023-03-28 06:09:10.372680814 +0800 +++ /usr/hpc-bio/xfstests/results//btrfs/213.out.bad 2023-08-12 20:31:47.848303940 +0800 @@ -1,2 +1,5 @@ QA output created by 213 +/usr/hpc-bio/xfstests/tests/btrfs/213: line 59: _not_run: command not found +ERROR: error during balancing '/mnt/scratch': No space left on device +There may be more info in syslog - try dmesg | tail Silence is golden we need to fix the error of '_not_run: command not found' firstly. I will update the info if fstests(btrfs/213) fails again. Best Regards Wang Yugui (wangyugui@e16-tech.com) 2023/08/12
On Sat, Aug 12, 2023 at 1:54 PM Wang Yugui <wangyugui@e16-tech.com> wrote: > > Hi, > > > From: Filipe Manana <fdmanana@suse.com> > > > > btrfs/213 writes data, in 1M extents, for 4 seconds into a file, then > > triggers a balance and then after 2 seconds it tries to cancel the > > balance operation. More often than not, this works because the balance > > is still running after 2 seconds. However it also fails sporadically > > because balance has finished in less than 2 seconds, which is plausible > > since data and metadata are cached or other factors such as virtualized > > environment. When that's the case, it fails like this: > > > > $ ./check btrfs/213 > > FSTYP -- btrfs > > PLATFORM -- Linux/x86_64 debian0 6.4.0-rc1-btrfs-next-131+ #1 SMP PREEMPT_DYNAMIC Thu May 11 11:26:19 WEST 2023 > > MKFS_OPTIONS -- /dev/sdc > > MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 > > > > btrfs/213 51s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad) > > --- tests/btrfs/213.out 2020-06-10 19:29:03.822519250 +0100 > > +++ /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad 2023-05-17 15:39:32.653727223 +0100 > > @@ -1,2 +1,3 @@ > > QA output created by 213 > > +ERROR: balance cancel on '/home/fdmanana/btrfs-tests/scratch_1' failed: Not in progress > > Silence is golden > > ... > > (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/213.out /home/fdmanana/git/hub/xfstests/results//btrfs/213.out.bad' to see the entire diff) > > Ran: btrfs/213 > > Failures: btrfs/213 > > Failed 1 of 1 tests > > > > To make it much less likely that balance has already finished before we > > try to cancel it, unmount and mount again the filesystem before starting > > balance, to clear cached metadata and data, and also double the time we > > spend writing 1M data extents. Also make the test not run with an > > informative message if we detect that balance finished before we could > > cancel it. > > > > Signed-off-by: Filipe Manana <fdmanana@suse.com> > > --- > > > > v2: Make the test _notrun if we detect that balance finished before we > > could cancel it. > > > > tests/btrfs/213 | 16 +++++++++++++--- > > 1 file changed, 13 insertions(+), 3 deletions(-) > > > > diff --git a/tests/btrfs/213 b/tests/btrfs/213 > > index e16e41c0..5666d9b9 100755 > > --- a/tests/btrfs/213 > > +++ b/tests/btrfs/213 > > @@ -31,7 +31,7 @@ _fixed_by_kernel_commit 1dae7e0e58b4 \ > > _scratch_mkfs >> $seqres.full > > _scratch_mount > > > > -runtime=4 > > +runtime=8 > > > > # Create enough IO so that we need around $runtime seconds to relocate it. > > # > > @@ -42,11 +42,21 @@ sleep $runtime > > kill $write_pid > > wait $write_pid > > > > +# Unmount and mount again the fs to clear any cached data and metadata, so that > > +# it's less likely balance has already finished when we try to cancel it below. > > +_scratch_cycle_mount > > + > > # Now balance should take at least $runtime seconds, we can cancel it at > > # $runtime/2 to ensure a success cancel. > > _run_btrfs_balance_start -d --bg "$SCRATCH_MNT" > > -sleep $(($runtime / 2)) > > -$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT" > > +sleep $(($runtime / 4)) > > +# It's possible that balance has already completed. It's unlikely but often > > +# it may happen due to virtualization, caching and other factors, so ignore > > +# any error about no balance currently running. > > +$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT" 2>&1 | grep -iq 'not in progress' > > +if [ $? -eq 0 ]; then > > + _not_run "balance finished before we could cancel it" > > +fi > > fstests(btrfs/213) failed once here. > > btrfs/213 22s ... - output mismatch (see /usr/hpc-bio/xfstests/results//btrfs/213.out.bad) > --- tests/btrfs/213.out 2023-03-28 06:09:10.372680814 +0800 > +++ /usr/hpc-bio/xfstests/results//btrfs/213.out.bad 2023-08-12 20:31:47.848303940 +0800 > @@ -1,2 +1,5 @@ > QA output created by 213 > +/usr/hpc-bio/xfstests/tests/btrfs/213: line 59: _not_run: command not found > +ERROR: error during balancing '/mnt/scratch': No space left on device > +There may be more info in syslog - try dmesg | tail > Silence is golden > > we need to fix the error of '_not_run: command not found' firstly. > > I will update the info if fstests(btrfs/213) fails again. It's a misspelled function name, it should be _notrun instead of _not_run. I just sent a fix for that. The test is not very reliable as balance can finish quickly, so occasionally it may be skipped in some environments. Thanks. > > Best Regards > Wang Yugui (wangyugui@e16-tech.com) > 2023/08/12 > > >
diff --git a/tests/btrfs/213 b/tests/btrfs/213 index e16e41c0..5666d9b9 100755 --- a/tests/btrfs/213 +++ b/tests/btrfs/213 @@ -31,7 +31,7 @@ _fixed_by_kernel_commit 1dae7e0e58b4 \ _scratch_mkfs >> $seqres.full _scratch_mount -runtime=4 +runtime=8 # Create enough IO so that we need around $runtime seconds to relocate it. # @@ -42,11 +42,21 @@ sleep $runtime kill $write_pid wait $write_pid +# Unmount and mount again the fs to clear any cached data and metadata, so that +# it's less likely balance has already finished when we try to cancel it below. +_scratch_cycle_mount + # Now balance should take at least $runtime seconds, we can cancel it at # $runtime/2 to ensure a success cancel. _run_btrfs_balance_start -d --bg "$SCRATCH_MNT" -sleep $(($runtime / 2)) -$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT" +sleep $(($runtime / 4)) +# It's possible that balance has already completed. It's unlikely but often +# it may happen due to virtualization, caching and other factors, so ignore +# any error about no balance currently running. +$BTRFS_UTIL_PROG balance cancel "$SCRATCH_MNT" 2>&1 | grep -iq 'not in progress' +if [ $? -eq 0 ]; then + _not_run "balance finished before we could cancel it" +fi # Now check if we can finish relocating metadata, which should finish very # quickly.