diff mbox series

[5/5] fstests: add stress truncation + writeback test

Message ID 20240611030203.1719072-6-mcgrof@kernel.org (mailing list archive)
State New
Headers show
Series fstests: add some new LBS inspired tests | expand

Commit Message

Luis Chamberlain June 11, 2024, 3:02 a.m. UTC
Stress test folio splits by using the new debugfs interface to a target
a new smaller folio order while triggering writeback at the same time.

This is known to only creates a crash with min order enabled, so for example
with a 16k block sized XFS test profile, an xarray fix for that is merged
already. This issue is fixed by kernel commit 2a0774c2886d ("XArray: set the
marks correctly when splitting an entry").

If inspecting more closely, you'll want to enable on your kernel boot:

	dyndbg='file mm/huge_memory.c +p'

Since we want to race large folio splits we also augment the full test
output log $seqres.full with the test specific number of successful
splits from vmstat thp_split_page and thp_split_page_failed. The larger
the vmstat thp_split_page the more we stress test this path.

This test reproduces a really hard to reproduce crash immediately.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 common/rc             |  14 +++++
 tests/generic/751     | 127 ++++++++++++++++++++++++++++++++++++++++++
 tests/generic/751.out |   2 +
 3 files changed, 143 insertions(+)
 create mode 100755 tests/generic/751
 create mode 100644 tests/generic/751.out

Comments

Darrick J. Wong June 11, 2024, 2:45 p.m. UTC | #1
On Mon, Jun 10, 2024 at 08:02:02PM -0700, Luis Chamberlain wrote:
> Stress test folio splits by using the new debugfs interface to a target
> a new smaller folio order while triggering writeback at the same time.
> 
> This is known to only creates a crash with min order enabled, so for example
> with a 16k block sized XFS test profile, an xarray fix for that is merged
> already. This issue is fixed by kernel commit 2a0774c2886d ("XArray: set the
> marks correctly when splitting an entry").
> 
> If inspecting more closely, you'll want to enable on your kernel boot:
> 
> 	dyndbg='file mm/huge_memory.c +p'
> 
> Since we want to race large folio splits we also augment the full test
> output log $seqres.full with the test specific number of successful
> splits from vmstat thp_split_page and thp_split_page_failed. The larger
> the vmstat thp_split_page the more we stress test this path.
> 
> This test reproduces a really hard to reproduce crash immediately.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  common/rc             |  14 +++++
>  tests/generic/751     | 127 ++++++++++++++++++++++++++++++++++++++++++
>  tests/generic/751.out |   2 +
>  3 files changed, 143 insertions(+)
>  create mode 100755 tests/generic/751
>  create mode 100644 tests/generic/751.out
> 
> diff --git a/common/rc b/common/rc
> index 30beef4e5c02..60f572108818 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -158,6 +158,20 @@ _require_vm_compaction()
>  	    _notrun "Need compaction enabled CONFIG_COMPACTION=y"
>  	fi
>  }
> +
> +# Requires CONFIG_DEBUGFS and truncation knobs
> +_require_split_debugfs()

Er... I thought "split" referred to debugfs itself.

_require_split_huge_pages_knob?

> +{
> +       if [ ! -f $DEBUGFS_MNT/split_huge_pages ]; then
> +           _notrun "Needs CONFIG_DEBUGFS and split_huge_pages"
> +       fi
> +}
> +
> +_split_huge_pages_all()
> +{
> +	echo 1 > $DEBUGFS_MNT/split_huge_pages
> +}
> +
>  # Get hugepagesize in bytes
>  _get_hugepagesize()
>  {
> diff --git a/tests/generic/751 b/tests/generic/751
> new file mode 100755
> index 000000000000..7cc96054a5a9
> --- /dev/null
> +++ b/tests/generic/751
> @@ -0,0 +1,127 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2024 Luis Chamberlain. All Rights Reserved.
> +#
> +# FS QA Test No. 751
> +#
> +# stress page cache truncation + writeback
> +#
> +# This aims at trying to reproduce a difficult to reproduce bug found with
> +# min order. The issue was root caused to an xarray bug when we split folios
> +# to another order other than 0. This functionality is used to support min
> +# order. The crash:
> +#
> +# https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df

You might want to paste the stacktrace in here directly, in case the
gist ever goes away.

> +#
> +# This may also find future truncation bugs in the future, as truncating any
> +# mapped file through the collateral of using echo 1 > split_huge_pages will
> +# always respect the min order. Truncating to a larger order then is excercised
> +# when this test is run against any filesystem LBS profile or an LBS device.
> +#
> +# If you're enabling this and want to check underneath the hood you may want to
> +# enable:
> +#
> +# dyndbg='file mm/huge_memory.c +p'
> +#
> +# This tests aims at increasing the rate of successful truncations so we want
> +# to increase the value of thp_split_page in $seqres.full. Using echo 1 >
> +# split_huge_pages is extremely aggressive, and even accounts for anonymous
> +# memory on a system, however we accept that tradeoff for the efficiency of
> +# doing the work in-kernel for any mapped file too. Our general goal here is to
> +# race with folio truncation + writeback.
> +
> +. ./common/preamble
> +
> +_begin_fstest auto long_rw stress soak smoketest
> +
> +# Override the default cleanup function.
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +	rm -f $runfile
> +	kill -9 $split_huge_pages_files_pid > /dev/null 2>&1
> +}
> +
> +fio_config=$tmp.fio
> +fio_out=$tmp.fio.out
> +
> +# real QA test starts here
> +_supported_fs generic
> +_require_test
> +_require_scratch
> +_require_debugfs
> +_require_split_debugfs
> +_require_command "$KILLALL_PROG" "killall"
> +_fixed_by_git_commit kernel 2a0774c2886d \
> +	"XArray: set the marks correctly when splitting an entry"
> +
> +# we need buffered IO to force truncation races with writeback in the
> +# page cache
> +cat >$fio_config <<EOF
> +[force_large_large_folio_parallel_writes]
> +nrfiles=10
> +direct=0
> +bs=4M
> +group_reporting=1
> +filesize=1GiB
> +readwrite=write
> +fallocate=none
> +numjobs=$(nproc)
> +directory=$SCRATCH_MNT
> +runtime=100*${TIME_FACTOR}
> +time_based
> +EOF
> +
> +_require_fio $fio_config
> +
> +echo "Silence is golden"
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount >> $seqres.full 2>&1
> +
> +# used to let our loops know when to stop
> +runfile="$tmp.keep.running.loop"
> +touch $runfile
> +
> +# The background ops are out of bounds, the goal is to race with fsstress.
> +
> +# Force folio split if possible, this seems to be screaming for MADV_NOHUGEPAGE
> +# for large folios.
> +while [ -e $runfile ]; do
> +	_split_huge_pages_all >/dev/null 2>&1
> +done &
> +split_huge_pages_files_pid=$!
> +
> +split_count_before=0
> +split_count_failed_before=0
> +
> +if grep -q thp_split_page /proc/vmstat; then
> +	split_count_before=$(grep ^thp_split_page /proc/vmstat | head -1 | awk '{print $2}')
> +	split_count_failed_before=$(grep ^thp_split_page_failed /proc/vmstat | head -1 | awk '{print $2}')
> +else
> +	echo "no thp_split_page in /proc/vmstat" >> $seqres.full
> +fi
> +
> +# we blast away with large writes to force large folio writes when
> +# possible.
> +echo -e "Running fio with config:\n" >> $seqres.full
> +cat $fio_config >> $seqres.full
> +$FIO_PROG $fio_config --alloc-size=$(( $(nproc) * 8192 )) --output=$fio_out
> +
> +rm -f $runfile
> +
> +wait > /dev/null 2>&1
> +
> +if grep -q thp_split_page /proc/vmstat; then
> +	split_count_after=$(grep ^thp_split_page /proc/vmstat | head -1 | awk '{print $2}')
> +	split_count_failed_after=$(grep ^thp_split_page_failed /proc/vmstat | head -1 | awk '{print $2}')

I think this ought to be a separate function for cleanliness?

_proc_vmstat()
{
	awk -v name="$1" '{if ($1 ~ name) {print($2)}}' /proc/vmstat
}

	split_count_after=$(_proc_vmstat thp_split_page)

Otherwise this test looks fine to me.

--D

> +	thp_split_page=$((split_count_after - split_count_before))
> +	thp_split_page_failed=$((split_count_failed_after - split_count_failed_before))
> +
> +	echo "vmstat thp_split_page: $thp_split_page" >> $seqres.full
> +	echo "vmstat thp_split_page_failed: $thp_split_page_failed" >> $seqres.full
> +fi
> +
> +status=0
> +exit
> diff --git a/tests/generic/751.out b/tests/generic/751.out
> new file mode 100644
> index 000000000000..6479fa6f1404
> --- /dev/null
> +++ b/tests/generic/751.out
> @@ -0,0 +1,2 @@
> +QA output created by 751
> +Silence is golden
> -- 
> 2.43.0
> 
>
Luis Chamberlain June 11, 2024, 6:15 p.m. UTC | #2
On Tue, Jun 11, 2024 at 07:45:03AM -0700, Darrick J. Wong wrote:
> On Mon, Jun 10, 2024 at 08:02:02PM -0700, Luis Chamberlain wrote:
> > +# Requires CONFIG_DEBUGFS and truncation knobs
> > +_require_split_debugfs()
> 
> Er... I thought "split" referred to debugfs itself.
> 
> _require_split_huge_pages_knob?

Much better, thanks.

> > +# This aims at trying to reproduce a difficult to reproduce bug found with
> > +# min order. The issue was root caused to an xarray bug when we split folios
> > +# to another order other than 0. This functionality is used to support min
> > +# order. The crash:
> > +#
> > +# https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> 
> You might want to paste the stacktrace in here directly, in case the
> gist ever goes away.

Its not a simple crash trace, it is pretty enourmous considering I
decoded it, and it has all locking candidates. Even including it after
the "---" lines of the patch might make someone go: TLDR. Thoughts?

> > +if grep -q thp_split_page /proc/vmstat; then
> > +	split_count_after=$(grep ^thp_split_page /proc/vmstat | head -1 | awk '{print $2}')
> > +	split_count_failed_after=$(grep ^thp_split_page_failed /proc/vmstat | head -1 | awk '{print $2}')
> 
> I think this ought to be a separate function for cleanliness?
> 
> _proc_vmstat()
> {
> 	awk -v name="$1" '{if ($1 ~ name) {print($2)}}' /proc/vmstat
> }

> Otherwise this test looks fine to me.

Thanks!

  Luis
Darrick J. Wong June 11, 2024, 6:29 p.m. UTC | #3
On Tue, Jun 11, 2024 at 11:15:52AM -0700, Luis Chamberlain wrote:
> On Tue, Jun 11, 2024 at 07:45:03AM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 10, 2024 at 08:02:02PM -0700, Luis Chamberlain wrote:
> > > +# Requires CONFIG_DEBUGFS and truncation knobs
> > > +_require_split_debugfs()
> > 
> > Er... I thought "split" referred to debugfs itself.
> > 
> > _require_split_huge_pages_knob?
> 
> Much better, thanks.
> 
> > > +# This aims at trying to reproduce a difficult to reproduce bug found with
> > > +# min order. The issue was root caused to an xarray bug when we split folios
> > > +# to another order other than 0. This functionality is used to support min
> > > +# order. The crash:
> > > +#
> > > +# https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> > 
> > You might want to paste the stacktrace in here directly, in case the
> > gist ever goes away.
> 
> Its not a simple crash trace, it is pretty enourmous considering I
> decoded it, and it has all locking candidates. Even including it after
> the "---" lines of the patch might make someone go: TLDR. Thoughts?

I'd paste it in, even if it's quite lengthy.  I don't even think it's all that
much if you remove some of the less useful bits of the unwind:

"Crash excerpt is as follows:

"BUG: kernel NULL pointer dereference, address: 0000000000000036
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 PID: 2190 Comm: kworker/u38:5 Not tainted 6.9.0-rc5+ #14
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: writeback wb_workfn (flush-7:5)
RIP: 0010:filemap_get_folios_tag+0xa9/0x200
Call Trace:
 <TASK>
 writeback_iter+0x17d/0x310
 write_cache_pages+0x42/0xa0
 iomap_writepages+0x33/0x50
 xfs_vm_writepages+0x63/0x90 [xfs]
 do_writepages+0xcc/0x260
 __writeback_single_inode+0x3d/0x340
 writeback_sb_inodes+0x1ed/0x4b0
 __writeback_inodes_wb+0x4c/0xe0
 wb_writeback+0x267/0x2d0
 wb_workfn+0x2a4/0x440
 process_one_work+0x189/0x3b0
 worker_thread+0x273/0x390
 kthread+0xda/0x110
 ret_from_fork+0x2d/0x50
 ret_from_fork_asm+0x1a/0x30
 </TASK>"

--D

> > > +if grep -q thp_split_page /proc/vmstat; then
> > > +	split_count_after=$(grep ^thp_split_page /proc/vmstat | head -1 | awk '{print $2}')
> > > +	split_count_failed_after=$(grep ^thp_split_page_failed /proc/vmstat | head -1 | awk '{print $2}')
> > 
> > I think this ought to be a separate function for cleanliness?
> > 
> > _proc_vmstat()
> > {
> > 	awk -v name="$1" '{if ($1 ~ name) {print($2)}}' /proc/vmstat
> > }
> 
> > Otherwise this test looks fine to me.
> 
> Thanks!
> 
>   Luis
>
Luis Chamberlain June 11, 2024, 6:59 p.m. UTC | #4
On Tue, Jun 11, 2024 at 11:29:59AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 11, 2024 at 11:15:52AM -0700, Luis Chamberlain wrote:
> > On Tue, Jun 11, 2024 at 07:45:03AM -0700, Darrick J. Wong wrote:
> > > On Mon, Jun 10, 2024 at 08:02:02PM -0700, Luis Chamberlain wrote:
> > > > +# Requires CONFIG_DEBUGFS and truncation knobs
> > > > +_require_split_debugfs()
> > > 
> > > Er... I thought "split" referred to debugfs itself.
> > > 
> > > _require_split_huge_pages_knob?
> > 
> > Much better, thanks.
> > 
> > > > +# This aims at trying to reproduce a difficult to reproduce bug found with
> > > > +# min order. The issue was root caused to an xarray bug when we split folios
> > > > +# to another order other than 0. This functionality is used to support min
> > > > +# order. The crash:
> > > > +#
> > > > +# https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> > > 
> > > You might want to paste the stacktrace in here directly, in case the
> > > gist ever goes away.
> > 
> > Its not a simple crash trace, it is pretty enourmous considering I
> > decoded it, and it has all locking candidates. Even including it after
> > the "---" lines of the patch might make someone go: TLDR. Thoughts?
> 
> I'd paste it in, even if it's quite lengthy.  I don't even think it's all that
> much if you remove some of the less useful bits of the unwind:
> 
> "Crash excerpt is as follows:
> 
> "BUG: kernel NULL pointer dereference, address: 0000000000000036
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 7 PID: 2190 Comm: kworker/u38:5 Not tainted 6.9.0-rc5+ #14
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> Workqueue: writeback wb_workfn (flush-7:5)
> RIP: 0010:filemap_get_folios_tag+0xa9/0x200
> Call Trace:
>  <TASK>
>  writeback_iter+0x17d/0x310
>  write_cache_pages+0x42/0xa0
>  iomap_writepages+0x33/0x50
>  xfs_vm_writepages+0x63/0x90 [xfs]
>  do_writepages+0xcc/0x260
>  __writeback_single_inode+0x3d/0x340
>  writeback_sb_inodes+0x1ed/0x4b0
>  __writeback_inodes_wb+0x4c/0xe0
>  wb_writeback+0x267/0x2d0
>  wb_workfn+0x2a4/0x440
>  process_one_work+0x189/0x3b0
>  worker_thread+0x273/0x390
>  kthread+0xda/0x110
>  ret_from_fork+0x2d/0x50
>  ret_from_fork_asm+0x1a/0x30
>  </TASK>"

Ah, sorry yes, this crash dump is small, the other one is the one that
was I thinking, which we still deadlock on and have only a lockdep hint
about likely what is going on. I'll include this dump on v2.

  Luis
diff mbox series

Patch

diff --git a/common/rc b/common/rc
index 30beef4e5c02..60f572108818 100644
--- a/common/rc
+++ b/common/rc
@@ -158,6 +158,20 @@  _require_vm_compaction()
 	    _notrun "Need compaction enabled CONFIG_COMPACTION=y"
 	fi
 }
+
+# Requires CONFIG_DEBUGFS and truncation knobs
+_require_split_debugfs()
+{
+       if [ ! -f $DEBUGFS_MNT/split_huge_pages ]; then
+           _notrun "Needs CONFIG_DEBUGFS and split_huge_pages"
+       fi
+}
+
+_split_huge_pages_all()
+{
+	echo 1 > $DEBUGFS_MNT/split_huge_pages
+}
+
 # Get hugepagesize in bytes
 _get_hugepagesize()
 {
diff --git a/tests/generic/751 b/tests/generic/751
new file mode 100755
index 000000000000..7cc96054a5a9
--- /dev/null
+++ b/tests/generic/751
@@ -0,0 +1,127 @@ 
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2024 Luis Chamberlain. All Rights Reserved.
+#
+# FS QA Test No. 751
+#
+# stress page cache truncation + writeback
+#
+# This aims at trying to reproduce a difficult to reproduce bug found with
+# min order. The issue was root caused to an xarray bug when we split folios
+# to another order other than 0. This functionality is used to support min
+# order. The crash:
+#
+# https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
+#
+# This may also find future truncation bugs in the future, as truncating any
+# mapped file through the collateral of using echo 1 > split_huge_pages will
+# always respect the min order. Truncating to a larger order then is excercised
+# when this test is run against any filesystem LBS profile or an LBS device.
+#
+# If you're enabling this and want to check underneath the hood you may want to
+# enable:
+#
+# dyndbg='file mm/huge_memory.c +p'
+#
+# This tests aims at increasing the rate of successful truncations so we want
+# to increase the value of thp_split_page in $seqres.full. Using echo 1 >
+# split_huge_pages is extremely aggressive, and even accounts for anonymous
+# memory on a system, however we accept that tradeoff for the efficiency of
+# doing the work in-kernel for any mapped file too. Our general goal here is to
+# race with folio truncation + writeback.
+
+. ./common/preamble
+
+_begin_fstest auto long_rw stress soak smoketest
+
+# Override the default cleanup function.
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+	rm -f $runfile
+	kill -9 $split_huge_pages_files_pid > /dev/null 2>&1
+}
+
+fio_config=$tmp.fio
+fio_out=$tmp.fio.out
+
+# real QA test starts here
+_supported_fs generic
+_require_test
+_require_scratch
+_require_debugfs
+_require_split_debugfs
+_require_command "$KILLALL_PROG" "killall"
+_fixed_by_git_commit kernel 2a0774c2886d \
+	"XArray: set the marks correctly when splitting an entry"
+
+# we need buffered IO to force truncation races with writeback in the
+# page cache
+cat >$fio_config <<EOF
+[force_large_large_folio_parallel_writes]
+nrfiles=10
+direct=0
+bs=4M
+group_reporting=1
+filesize=1GiB
+readwrite=write
+fallocate=none
+numjobs=$(nproc)
+directory=$SCRATCH_MNT
+runtime=100*${TIME_FACTOR}
+time_based
+EOF
+
+_require_fio $fio_config
+
+echo "Silence is golden"
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount >> $seqres.full 2>&1
+
+# used to let our loops know when to stop
+runfile="$tmp.keep.running.loop"
+touch $runfile
+
+# The background ops are out of bounds, the goal is to race with fsstress.
+
+# Force folio split if possible, this seems to be screaming for MADV_NOHUGEPAGE
+# for large folios.
+while [ -e $runfile ]; do
+	_split_huge_pages_all >/dev/null 2>&1
+done &
+split_huge_pages_files_pid=$!
+
+split_count_before=0
+split_count_failed_before=0
+
+if grep -q thp_split_page /proc/vmstat; then
+	split_count_before=$(grep ^thp_split_page /proc/vmstat | head -1 | awk '{print $2}')
+	split_count_failed_before=$(grep ^thp_split_page_failed /proc/vmstat | head -1 | awk '{print $2}')
+else
+	echo "no thp_split_page in /proc/vmstat" >> $seqres.full
+fi
+
+# we blast away with large writes to force large folio writes when
+# possible.
+echo -e "Running fio with config:\n" >> $seqres.full
+cat $fio_config >> $seqres.full
+$FIO_PROG $fio_config --alloc-size=$(( $(nproc) * 8192 )) --output=$fio_out
+
+rm -f $runfile
+
+wait > /dev/null 2>&1
+
+if grep -q thp_split_page /proc/vmstat; then
+	split_count_after=$(grep ^thp_split_page /proc/vmstat | head -1 | awk '{print $2}')
+	split_count_failed_after=$(grep ^thp_split_page_failed /proc/vmstat | head -1 | awk '{print $2}')
+	thp_split_page=$((split_count_after - split_count_before))
+	thp_split_page_failed=$((split_count_failed_after - split_count_failed_before))
+
+	echo "vmstat thp_split_page: $thp_split_page" >> $seqres.full
+	echo "vmstat thp_split_page_failed: $thp_split_page_failed" >> $seqres.full
+fi
+
+status=0
+exit
diff --git a/tests/generic/751.out b/tests/generic/751.out
new file mode 100644
index 000000000000..6479fa6f1404
--- /dev/null
+++ b/tests/generic/751.out
@@ -0,0 +1,2 @@ 
+QA output created by 751
+Silence is golden