Message ID | 173706974243.1927324.9105721327110864014.stgit@frogsfrogsfrogs (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [01/23] generic/476: fix fsstress process management | expand |
On Thu, Jan 16, 2025 at 03:28:02PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > xfs/336 does this somewhat sketchy thing where it mdrestores into a > regular file, and then does this to validate the restored metadata: > > SCRATCH_DEV=$TEST_DIR/image _scratch_mount That's a canonical example of what is called "stepping on a landmine". We validate that the SCRATCH_DEV is a block device at the start of check and each section it reads and runs (via common/config), and then make the assumption in all the infrastructure that SCRATCH_DEV always points to a valid block device. Now we have one new test that overwrites SCRATCH_DEV temporarily with a file and so we have to add checks all through the infrastructure to handle this one whacky test? > Unfortunately, commit 1a49022fab9b4d causes the following regression: > > --- /tmp/fstests/tests/xfs/336.out 2024-11-12 16:17:36.733447713 -0800 > +++ /var/tmp/fstests/xfs/336.out.bad 2025-01-04 19:10:39.861871114 -0800 > @@ -5,4 +5,5 @@ Create big file > Explode the rtrmapbt > Create metadump file > Restore metadump > -Check restored fs > +Usage: _set_fs_sysfs_attr <mounted_device> <attr> <content> > +(see /var/tmp/fstests/xfs/336.full for details) > > This is due to the fact that SCRATCH_DEV is temporarily reassigned to > the regular file. That path is passed straight through _scratch_mount > to _xfs_prepare_for_eio_shutdown, but that helper _fails because the > "dev" argument isn't actually a path to a block device. _scratch_mount assumes that SCRATCH_DEV points to a valid block device. xfs/336 is the problem here, not the code that assumes SCRATCH_DEV points to a block device.... Why are these hacks needed? Why can't _xfs_verify_metadumps() loopdev usage be extended to handle the new rt rmap code that this test is supposed to be exercising? > Fix this by detecting non-bdevs and finding (we hope) the loop device > that was created to handle the mount. What loop device? xfs/336 doesn't use loop devices at all. Oh, this is assuming that mount will silently do a loopback mount when passed a file rather than a block device. IOWs, it's relying on some third party to do the loop device creation and hence allow it to be mounted. IOWs, this change is addressing a landmine by adding another landmine. I really think that xfs/336 needs to be fixed - one off test hacks like this, while they may work, only make modifying and maintaining the fstests infrastructure that much harder.... > While we're at it, have the > helper return the exit code from mount, not _prepare_for_eio_shutdown. That should be a separate patch. -Dave.
diff --git a/common/rc b/common/rc index 885669beeb5e26..4419cfc3188374 100644 --- a/common/rc +++ b/common/rc @@ -441,6 +441,7 @@ _try_scratch_mount() [ $mount_ret -ne 0 ] && return $mount_ret _idmapped_mount $SCRATCH_DEV $SCRATCH_MNT _prepare_for_eio_shutdown $SCRATCH_DEV + return $mount_ret } # mount scratch device with given options and _fail if mount fails @@ -658,6 +659,7 @@ _test_mount() [ $mount_ret -ne 0 ] && return $mount_ret _idmapped_mount $TEST_DEV $TEST_DIR _prepare_for_eio_shutdown $TEST_DEV + return $mount_ret } _test_unmount() @@ -4469,6 +4471,12 @@ _destroy_loop_device() losetup -d $dev || _fail "Cannot destroy loop device $dev" } +# Find the loop bdev for a given file, if there is one. +_find_loop_device() +{ + losetup --list -n -O NAME -j "$1" +} + _scale_fsstress_args() { local args="" diff --git a/common/xfs b/common/xfs index 0417a40adba3e2..c68bd6d7c773ac 100644 --- a/common/xfs +++ b/common/xfs @@ -1110,6 +1110,12 @@ _xfs_prepare_for_eio_shutdown() local dev="$1" local ctlfile="error/fail_at_unmount" + # Is this a regular file? Check if there's a loop device somewhere. + # Hopefully that lines up with a mounted filesystem. + if [ ! -b "$dev" ]; then + dev=$(_find_loop_device "$1" | tail -n 1) + fi + # Once we enable IO errors, it's possible that a writer thread will # trip over EIO, cancel the transaction, and shut down the system. # This is expected behavior, so we need to remove the "Internal error"