
[10/40] fstests: fix DM device creation/removal vs udev races

Message ID 20241127045403.3665299-11-david@fromorbit.com (mailing list archive)
State New
Series fstests: concurrent test execution

Commit Message

Dave Chinner Nov. 27, 2024, 4:51 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

When there is load on the system, newly created DM devices don't
seem to be created consistently. When a new device is created,
it is supposed to be created as /dev/dm-X, and then a udev rule
creates the symlink from /dev/mapper/<dev name> to /dev/dm-X.

Unfortunately, the dynamically created dm devices that a lot of the
tests use (dmerror, dmflakey) are not being created with this device
node structure. As a result we get the wrong short device name for
the block device, and hence we can't find the sysfs attribute
directory for the filesystem on that block device.
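
To make the dependency on the symlink concrete, here is a minimal
sketch of a short-name lookup. The helper names short_dev_name and
xfs_sysfs_dir are hypothetical and are not the fstests helpers; the
sketch only assumes that the short name is derived by following
symlinks and taking the last path component.

```shell
# Hypothetical helpers for illustration; not the fstests implementation.

# Follow symlinks and print the final path component. With the udev
# layout, /dev/mapper/<name> resolves to /dev/dm-X and this prints dm-X.
# If /dev/mapper/<name> is a real node, it prints <name> instead.
short_dev_name()
{
	basename "$(readlink -f "$1")"
}

# Build the per-filesystem XFS sysfs directory from the short name.
xfs_sysfs_dir()
{
	echo "/sys/fs/xfs/$(short_dev_name "$1")"
}
```

When the path is not a symlink, the lookup returns the mapper name
rather than dm-X, which gives a sysfs path that does not exist.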

For example, with added debug to check what device name was being
passed around and resolved:

generic/489       - output mismatch (see /mnt/xfs/runner-10/results/xfs/generic/489.out.bad)
    --- tests/generic/489.out   2022-12-21 15:53:25.503043574 +1100
    +++ /mnt/xfs/runner-10/results/xfs/generic/489.out.bad      2024-10-24 10:27:29.767196340 +1100
    @@ -1,4 +1,10 @@
     QA output created by 489
    +./common/rc: line 4955: /sys/fs/xfs/flakey-test.489/error/fail_at_unmount: No such file or directory
    +dev: /dev/mapper/flakey-test.489
    +resolved dev: /dev/mapper/flakey-test.489
    +brw-rw----. 1 root disk 251, 5 Oct 24 10:27 /dev/mapper/flakey-test.489
    +./common/rc: line 4955: /sys/fs/xfs/flakey-test.489/error/metadata/EIO/max_retries: No such file or directory
    +./common/rc: line 4955: /sys/fs/xfs/flakey-test.489/error/metadata/EIO/retry_timeout_seconds: No such file or directory
    ...
    (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/489.out /mnt/xfs/runner-10/results/xfs/generic/489.out.bad'  to see the entire diff)

Here we see that the block device node is actually at
/dev/mapper/flakey-test.489, not a link to a /dev/dm-X device node.

This implies that the udev rule to create the /dev/dm-X node and
the symlink to it at /dev/mapper/flakey-test.489 has not run, and
something else created the device node.
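
The same check can be scripted. The following is a sketch only; the
function name dm_node_layout is hypothetical and is not part of
fstests. It reports whether a path is a symlink (the layout udev is
expected to create) or a block device node in its own right (the
layout seen in the failure above).

```shell
# Hypothetical debug helper; not part of fstests.
# Classify a device path by how the node was created.
dm_node_layout()
{
	if [ -L "$1" ]; then
		# Expected udev layout: symlink to the real node.
		echo "symlink -> $(readlink -f "$1")"
	elif [ -b "$1" ]; then
		# A real block device node sitting at this path.
		echo "block device node"
	elif [ -e "$1" ]; then
		echo "other"
	else
		echo "missing"
	fi
}
```

For example, 'dm_node_layout /dev/mapper/flakey-test.489' would print
"block device node" in the broken case.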

That looks like a bug in _dmsetup_create(). It creates the new DM
device, then runs 'dmsetup mknodes', then waits for udev to settle.
This means the mknodes command - which makes sure the dm device
nodes exist - is racing with udev to create the device nodes. They
don't use the same rules to create nodes, so we end up with this
broken situation.

'dmsetup mknodes' is considered legacy functionality, intended for
systems that have no udev capability. For systems that have udev
enabled (i.e. all modern distros), mknodes should not be run because
it creates a different device node structure to what udev creates
and can race with udev as we see here.

Fix it by removing the 'dmsetup mknodes' call, as it is not needed
to create the device node layout the rest of the system is expecting
to see.

Additionally, _dmsetup_remove() calls 'dmsetup mknodes' and that can
also race with udev and cause issues. Hence we need to remove that
call from the remove operation as well.

Further, 'dmsetup remove' is also subject to races with udev, which
results in device removal failing. This problem is documented in the
dmsetup man page, which suggests the use of the "--retry" option.
With this option, dmsetup will retry several times over a few
seconds before failing the removal.
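
Conceptually, retrying is a bounded loop around the failing
operation. The sketch below illustrates that idea only; it is not
how dmsetup implements "--retry", and retry_cmd with its attempt
count and delay parameters is a hypothetical helper.

```shell
# Hypothetical helper, for illustration only.
# Usage: retry_cmd <attempts> <delay_seconds> <command> [args...]
# Runs the command until it succeeds or the attempts are used up.
retry_cmd()
{
	attempts="$1"
	delay="$2"
	shift 2

	i=1
	while [ "$i" -le "$attempts" ]; do
		# Stop at the first successful run.
		"$@" && return 0
		i=$((i + 1))
		# Only sleep if another attempt will follow.
		[ "$i" -le "$attempts" ] && sleep "$delay"
	done
	return 1
}
```

A bounded loop like this narrows the race window but cannot close
it: if the device stays busy for longer than the total retry time,
the removal still fails.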

This reduces the remove failure rate substantially, but it can still
occasionally fail when the system is under heavy load and udev
processing is very slow. This is fixable, but requires fstests udev
infrastructure changes as it requires udevadm functionality that is
relatively new. Hence that will be done as a separate fix.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 common/rc | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Patch

diff --git a/common/rc b/common/rc
index 391370fd5..a601e2c80 100644
--- a/common/rc
+++ b/common/rc
@@ -5162,8 +5162,8 @@  _require_label_get_max()
 _dmsetup_remove()
 {
 	$UDEV_SETTLE_PROG >/dev/null 2>&1
-	$DMSETUP_PROG remove "$@" >>$seqres.full 2>&1
-	$DMSETUP_PROG mknodes >/dev/null 2>&1
+	$DMSETUP_PROG remove --retry "$@" >>$seqres.full 2>&1
+	$UDEV_SETTLE_PROG >/dev/null 2>&1
 }
 
 _dmsetup_create()
@@ -5174,7 +5174,6 @@  _dmsetup_create()
 	# device open won't also fail.
 	$UDEV_SETTLE_PROG >/dev/null 2>&1
 	$DMSETUP_PROG create "$@" >>$seqres.full 2>&1 || return 1
-	$DMSETUP_PROG mknodes >/dev/null 2>&1
 	$UDEV_SETTLE_PROG >/dev/null 2>&1
 }