[RFC] fstests: btrfs: Test if balance takes too long time or return -ENOSPC unexpected
diff mbox series

Message ID 20190114082043.31220-1-wqu@suse.com
State New
Headers show
Series
  • [RFC] fstests: btrfs: Test if balance takes too long time or return -ENOSPC unexpected
Related show

Commit Message

Qu Wenruo Jan. 14, 2019, 8:20 a.m. UTC
Kernel commit 64403612b73a ("btrfs: rework
btrfs_check_space_for_delayed_refs") is introducing a regression for
btrfs balance performance.

Originally, the workload used in the test case only takes seconds for
balance on v4.20 while now it takes over 400 seconds for balance.

During that 400 seconds balance, it commits over 2000 transactions just
for nothing, compare to original several transactions.

Add test cases to detect such regression.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Extra note:
In fact, without the snapshots created in the test case, it would return
-ENOSPC even we have enough unallocated space.
---
 tests/btrfs/180     | 80 +++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/180.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 83 insertions(+)
 create mode 100755 tests/btrfs/180
 create mode 100644 tests/btrfs/180.out

Comments

David Sterba Jan. 25, 2019, 6:57 p.m. UTC | #1
On Mon, Jan 14, 2019 at 04:20:43PM +0800, Qu Wenruo wrote:
> Kernel commit 64403612b73a ("btrfs: rework
> btrfs_check_space_for_delayed_refs") is introducing a regression for
> btrfs balance performance.
> 
> Originally, the workload used in the test case only takes seconds for
> balance on v4.20 while now it takes over 400 seconds for balance.
> 
> During that 400 seconds balance, it commits over 2000 transactions just
> for nothing, compare to original several transactions.
> 
> Add test cases to detect such regression.

How do you detect that the test regressed, other than comparing the
runtime recorded between runs of the testsuite? I don't actually know
how this could be done inside the script, as there can't be a fixed
number of seconds to compare with. The run time depends on the
underlying storage.

I takes 14 seconds on a physical machine on a rotational disk and 9
seconds on a SSD.

Other than the, I've used this test to validate the fix, for that

Tested-by: David Sterba <dsterba@suse.com>

It would be good to have the test in the testsuite so possibly some
large timeout can be set and the test will print a message, this should
catch attention at least and let us decide.

> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Extra note:
> In fact, without the snapshots created in the test case, it would return
> -ENOSPC even we have enough unallocated space.
> ---
>  tests/btrfs/180     | 80 +++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/180.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 83 insertions(+)
>  create mode 100755 tests/btrfs/180
>  create mode 100644 tests/btrfs/180.out
> 
> diff --git a/tests/btrfs/180 b/tests/btrfs/180
> new file mode 100755
> index 00000000..534fea01
> --- /dev/null
> +++ b/tests/btrfs/180
> @@ -0,0 +1,80 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FS QA Test 180
> +#
> +# Test if metadata balance would take forever to finish or return ENOSPC even
> +# there there are tons of space.
> +#
> +# This is regression caused by upstream commit 64403612b73a
> +# ("btrfs: rework btrfs_check_space_for_delayed_refs")
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_xfs_io_command falloc
> +
> +i=0
> +loop=16384
> +
> +_scratch_mkfs > /dev/null 2>&1
> +_scratch_mount
> +
> +$BTRFS_UTIL_PROG subvolume create "$SCRATCH_MNT/src" > /dev/null
> +
> +while [ $i -le $loop ]; do
> +	# Use small file to create inline extent
> +	_pwrite_byte 0x00 0 2K "$SCRATCH_MNT/src/inline_$i" > /dev/null
> +	#$XFS_IO_PROG -f -c "falloc 0 4K" "$SCRATCH_MNT/src/regular_$i" > /dev/null

You can delete the commented command

> +	i=$((i + 1))
> +done
> +
> +# Create enough snapshots so at space reservation part of relocation, we could
> +# generate enough space pressure
> +for i in $(seq -w 0 16); do
> +	$BTRFS_UTIL_PROG subvolume snapshot "$SCRATCH_MNT/src" \
> +		"$SCRATCH_MNT/snapshot_$i" > /dev/null
> +	# touch random files to create some new tree blocks
> +	for j in $(seq -w 0 16); do
> +		victim="$(ls $SCRATCH_MNT/snapshot_$i | sort -R | head -n1)"
> +		touch "$SCRATCH_MNT/snapshot_$i/$victim"
> +	done
> +done
> +
> +# Balancing metadata shouldn't be too time consuming, as the amount of metadata
> +# is less than 8M, thus normally it should finish very quick.
> +#
> +# However with that offending commit, it will take forever to finish or return
> +# ENOSPC after a long wait.
> +$BTRFS_UTIL_PROG balance start -m "$SCRATCH_MNT" > /dev/null
> +
> +echo "Silence is golden"
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/180.out b/tests/btrfs/180.out
> new file mode 100644
> index 00000000..50aba766
> --- /dev/null
> +++ b/tests/btrfs/180.out
> @@ -0,0 +1,2 @@
> +QA output created by 180
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index 46dd3c95..e724968b 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -182,3 +182,4 @@
>  177 auto quick swap balance
>  178 auto quick send
>  179 auto qgroup dangerous
> +180 auto balance dangerous

I don't think it falls to the dangerous category, it does not crash the
machine, only that the test could take long.
Qu Wenruo Jan. 25, 2019, 11:38 p.m. UTC | #2
On 2019/1/26 上午2:57, David Sterba wrote:
> On Mon, Jan 14, 2019 at 04:20:43PM +0800, Qu Wenruo wrote:
>> Kernel commit 64403612b73a ("btrfs: rework
>> btrfs_check_space_for_delayed_refs") is introducing a regression for
>> btrfs balance performance.
>>
>> Originally, the workload used in the test case only takes seconds for
>> balance on v4.20 while now it takes over 400 seconds for balance.
>>
>> During that 400 seconds balance, it commits over 2000 transactions just
>> for nothing, compare to original several transactions.
>>
>> Add test cases to detect such regression.
> 
> How do you detect that the test regressed, other than comparing the
> runtime recorded between runs of the testsuite? I don't actually know
> how this could be done inside the script, as there can't be a fixed
> number of seconds to compare with. The run time depends on the
> underlying storage.

I have a better idea for detecting this, by using super generation as
failure criteria.

So then we could have a simple FAIL/PASS check.
Although, we still need to take a long time to run if we hit that
regression.

Thanks,
Qu

> 
> I takes 14 seconds on a physical machine on a rotational disk and 9
> seconds on a SSD.
> 
> Other than the, I've used this test to validate the fix, for that
> 
> Tested-by: David Sterba <dsterba@suse.com>
> 
> It would be good to have the test in the testsuite so possibly some
> large timeout can be set and the test will print a message, this should
> catch attention at least and let us decide.
> 
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Extra note:
>> In fact, without the snapshots created in the test case, it would return
>> -ENOSPC even we have enough unallocated space.
>> ---
>>  tests/btrfs/180     | 80 +++++++++++++++++++++++++++++++++++++++++++++
>>  tests/btrfs/180.out |  2 ++
>>  tests/btrfs/group   |  1 +
>>  3 files changed, 83 insertions(+)
>>  create mode 100755 tests/btrfs/180
>>  create mode 100644 tests/btrfs/180.out
>>
>> diff --git a/tests/btrfs/180 b/tests/btrfs/180
>> new file mode 100755
>> index 00000000..534fea01
>> --- /dev/null
>> +++ b/tests/btrfs/180
>> @@ -0,0 +1,80 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
>> +#
>> +# FS QA Test 180
>> +#
>> +# Test if metadata balance would take forever to finish or return ENOSPC even
>> +# there there are tons of space.
>> +#
>> +# This is regression caused by upstream commit 64403612b73a
>> +# ("btrfs: rework btrfs_check_space_for_delayed_refs")
>> +#
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +
>> +here=`pwd`
>> +tmp=/tmp/$$
>> +status=1	# failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> +	cd /
>> +	rm -f $tmp.*
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +
>> +# remove previous $seqres.full before test
>> +rm -f $seqres.full
>> +
>> +# real QA test starts here
>> +
>> +# Modify as appropriate.
>> +_supported_fs btrfs
>> +_supported_os Linux
>> +_require_scratch
>> +_require_xfs_io_command falloc
>> +
>> +i=0
>> +loop=16384
>> +
>> +_scratch_mkfs > /dev/null 2>&1
>> +_scratch_mount
>> +
>> +$BTRFS_UTIL_PROG subvolume create "$SCRATCH_MNT/src" > /dev/null
>> +
>> +while [ $i -le $loop ]; do
>> +	# Use small file to create inline extent
>> +	_pwrite_byte 0x00 0 2K "$SCRATCH_MNT/src/inline_$i" > /dev/null
>> +	#$XFS_IO_PROG -f -c "falloc 0 4K" "$SCRATCH_MNT/src/regular_$i" > /dev/null
> 
> You can delete the commented command
> 
>> +	i=$((i + 1))
>> +done
>> +
>> +# Create enough snapshots so at space reservation part of relocation, we could
>> +# generate enough space pressure
>> +for i in $(seq -w 0 16); do
>> +	$BTRFS_UTIL_PROG subvolume snapshot "$SCRATCH_MNT/src" \
>> +		"$SCRATCH_MNT/snapshot_$i" > /dev/null
>> +	# touch random files to create some new tree blocks
>> +	for j in $(seq -w 0 16); do
>> +		victim="$(ls $SCRATCH_MNT/snapshot_$i | sort -R | head -n1)"
>> +		touch "$SCRATCH_MNT/snapshot_$i/$victim"
>> +	done
>> +done
>> +
>> +# Balancing metadata shouldn't be too time consuming, as the amount of metadata
>> +# is less than 8M, thus normally it should finish very quick.
>> +#
>> +# However with that offending commit, it will take forever to finish or return
>> +# ENOSPC after a long wait.
>> +$BTRFS_UTIL_PROG balance start -m "$SCRATCH_MNT" > /dev/null
>> +
>> +echo "Silence is golden"
>> +# success, all done
>> +status=0
>> +exit
>> diff --git a/tests/btrfs/180.out b/tests/btrfs/180.out
>> new file mode 100644
>> index 00000000..50aba766
>> --- /dev/null
>> +++ b/tests/btrfs/180.out
>> @@ -0,0 +1,2 @@
>> +QA output created by 180
>> +Silence is golden
>> diff --git a/tests/btrfs/group b/tests/btrfs/group
>> index 46dd3c95..e724968b 100644
>> --- a/tests/btrfs/group
>> +++ b/tests/btrfs/group
>> @@ -182,3 +182,4 @@
>>  177 auto quick swap balance
>>  178 auto quick send
>>  179 auto qgroup dangerous
>> +180 auto balance dangerous
> 
> I don't think it falls to the dangerous category, it does not crash the
> machine, only that the test could take long.
>

Patch
diff mbox series

diff --git a/tests/btrfs/180 b/tests/btrfs/180
new file mode 100755
index 00000000..534fea01
--- /dev/null
+++ b/tests/btrfs/180
@@ -0,0 +1,80 @@ 
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# FS QA Test 180
+#
+# Test if metadata balance would take forever to finish or return ENOSPC even
+# there there are tons of space.
+#
+# This is regression caused by upstream commit 64403612b73a
+# ("btrfs: rework btrfs_check_space_for_delayed_refs")
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_xfs_io_command falloc
+
+i=0
+loop=16384
+
+_scratch_mkfs > /dev/null 2>&1
+_scratch_mount
+
+$BTRFS_UTIL_PROG subvolume create "$SCRATCH_MNT/src" > /dev/null
+
+while [ $i -le $loop ]; do
+	# Use small file to create inline extent
+	_pwrite_byte 0x00 0 2K "$SCRATCH_MNT/src/inline_$i" > /dev/null
+	#$XFS_IO_PROG -f -c "falloc 0 4K" "$SCRATCH_MNT/src/regular_$i" > /dev/null
+	i=$((i + 1))
+done
+
+# Create enough snapshots so at space reservation part of relocation, we could
+# generate enough space pressure
+for i in $(seq -w 0 16); do
+	$BTRFS_UTIL_PROG subvolume snapshot "$SCRATCH_MNT/src" \
+		"$SCRATCH_MNT/snapshot_$i" > /dev/null
+	# touch random files to create some new tree blocks
+	for j in $(seq -w 0 16); do
+		victim="$(ls $SCRATCH_MNT/snapshot_$i | sort -R | head -n1)"
+		touch "$SCRATCH_MNT/snapshot_$i/$victim"
+	done
+done
+
+# Balancing metadata shouldn't be too time consuming, as the amount of metadata
+# is less than 8M, thus normally it should finish very quick.
+#
+# However with that offending commit, it will take forever to finish or return
+# ENOSPC after a long wait.
+$BTRFS_UTIL_PROG balance start -m "$SCRATCH_MNT" > /dev/null
+
+echo "Silence is golden"
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/180.out b/tests/btrfs/180.out
new file mode 100644
index 00000000..50aba766
--- /dev/null
+++ b/tests/btrfs/180.out
@@ -0,0 +1,2 @@ 
+QA output created by 180
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 46dd3c95..e724968b 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -182,3 +182,4 @@ 
 177 auto quick swap balance
 178 auto quick send
 179 auto qgroup dangerous
+180 auto balance dangerous