diff mbox series

[14/23] generic/032: fix pinned mount failure

Message ID 173706974288.1927324.17585931341351454094.stgit@frogsfrogsfrogs (mailing list archive)
State New
Headers show
Series [01/23] generic/476: fix fsstress process management | expand

Commit Message

Darrick J. Wong Jan. 16, 2025, 11:28 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

generic/032 now periodically fails with:

 --- /tmp/fstests/tests/generic/032.out	2025-01-05 11:42:14.427388698 -0800
 +++ /var/tmp/fstests/generic/032.out.bad	2025-01-06 18:20:17.122818195 -0800
 @@ -1,5 +1,7 @@
  QA output created by 032
  100 iterations
 -000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd  >................<
 -*
 -100000
 +umount: /opt: target is busy.
 +mount: /opt: /dev/sda4 already mounted on /opt.
 +       dmesg(1) may have more information after failed mount system call.
 +cycle mount failed
 +(see /var/tmp/fstests/generic/032.full for details)

The root cause of this regression is the _syncloop subshell.  This
background process runs _scratch_sync, which is actually an xfs_io
process that calls syncfs on the scratch mount.

Unfortunately, while the test kills the _syncloop subshell, it doesn't
actually kill the xfs_io process.  If the xfs_io process is in D state
running the syncfs, it won't react to the signal, but it will pin the
mount.  Then the _scratch_cycle_mount fails because the mount is pinned.

Prior to commit 8973af00ec212f the _syncloop ran sync(1) which avoided
pinning the scratch filesystem.

Fix this by pgrepping for the xfs_io process and killing and waiting for
it if necessary.

Cc: <fstests@vger.kernel.org> # v2024.12.08
Fixes: 8973af00ec212f ("fstests: cleanup fsstress process management")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 common/rc         |    6 ++++++
 tests/generic/032 |    9 +++++++++
 2 files changed, 15 insertions(+)

Comments

Dave Chinner Jan. 21, 2025, 5:03 a.m. UTC | #1
On Thu, Jan 16, 2025 at 03:28:49PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> generic/032 now periodically fails with:
> 
>  --- /tmp/fstests/tests/generic/032.out	2025-01-05 11:42:14.427388698 -0800
>  +++ /var/tmp/fstests/generic/032.out.bad	2025-01-06 18:20:17.122818195 -0800
>  @@ -1,5 +1,7 @@
>   QA output created by 032
>   100 iterations
>  -000000 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd  >................<
>  -*
>  -100000
>  +umount: /opt: target is busy.
>  +mount: /opt: /dev/sda4 already mounted on /opt.
>  +       dmesg(1) may have more information after failed mount system call.
>  +cycle mount failed
>  +(see /var/tmp/fstests/generic/032.full for details)
> 
> The root cause of this regression is the _syncloop subshell.  This
> background process runs _scratch_sync, which is actually an xfs_io
> process that calls syncfs on the scratch mount.
> 
> Unfortunately, while the test kills the _syncloop subshell, it doesn't
> actually kill the xfs_io process.  If the xfs_io process is in D state
> running the syncfs, it won't react to the signal, but it will pin the
> mount.  Then the _scratch_cycle_mount fails because the mount is pinned.
> 
> Prior to commit 8973af00ec212f the _syncloop ran sync(1) which avoided
> pinning the scratch filesystem.

How does running sync(1) prevent this? they run the same kernel
code, so I'm a little confused as to why this is a problem caused
by using the syncfs() syscall rather than the sync() syscall...

> Fix this by pgrepping for the xfs_io process and killing and waiting for
> it if necessary.

Change looks fine, though.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
diff mbox series

Patch

diff --git a/common/rc b/common/rc
index 4419cfc3188374..d7f3c48eafe590 100644
--- a/common/rc
+++ b/common/rc
@@ -36,6 +36,12 @@  _pkill()
 	pkill --session 0 "$@"
 }
 
+# Find only the test processes started by this test
+_pgrep()
+{
+	pgrep --session 0 "$@"
+}
+
 # Common execution handling for fsstress invocation.
 #
 # We need per-test fsstress binaries because of the way fsstress forks and
diff --git a/tests/generic/032 b/tests/generic/032
index 30290c7225a2fa..48d594fe9315b8 100755
--- a/tests/generic/032
+++ b/tests/generic/032
@@ -81,6 +81,15 @@  echo $iters iterations
 kill $syncpid
 wait
 
+# The xfs_io instance started by _scratch_sync could be stuck in D state when
+# the subshell running _syncloop & is killed.  That xfs_io process pins the
+# mount so we must kill it and wait for it to die before cycling the mount.
+dead_syncfs_pid=$(_pgrep xfs_io)
+if [ -n "$dead_syncfs_pid" ]; then
+	_pkill xfs_io
+	wait $dead_syncfs_pid
+fi
+
 # clear page cache and dump the file
 _scratch_cycle_mount
 _hexdump $SCRATCH_MNT/file