diff mbox series

fstests: UDEV_SETTLE_PROG before dmsetup create

Message ID ebf63c27065c5fa676701184501353a9f2014832.1638382705.git.josef@toxicpanda.com (mailing list archive)
State New, archived
Headers show
Series fstests: UDEV_SETTLE_PROG before dmsetup create | expand

Commit Message

Josef Bacik Dec. 1, 2021, 6:18 p.m. UTC
We've been seeing transient errors with any test that uses a dm device
for the entirety of the time that we've been running nightly xfstests
runs.  This turns out to be because sometimes we get EBUSY while trying
to create our new dm device.  Generally this is because the test comes
right after another test that messes with the dm device, and thus we
still have udev messing around with the device when DM tries to O_EXCL
the block device.

Add a UDEV_SETTLE_PROG before creating the device to make sure we can
create our new dm device without getting this transient error.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 common/rc | 1 +
 1 file changed, 1 insertion(+)
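
For context, UDEV_SETTLE_PROG typically resolves to "udevadm settle", which blocks until the udev event queue has drained. A rough sketch of that resolution (not verbatim from common/config; the /bin/true fallback is an assumption):

    # Sketch: how UDEV_SETTLE_PROG is typically resolved.  "udevadm settle"
    # waits until the udev event queue is empty, which is what the patch
    # relies on before "dmsetup create" opens the block device with O_EXCL.
    UDEV_SETTLE_PROG="$(type -P udevadm)"
    if [ -z "$UDEV_SETTLE_PROG" ]; then
            UDEV_SETTLE_PROG=/bin/true      # no udev installed: settling is a no-op
    else
            UDEV_SETTLE_PROG="$UDEV_SETTLE_PROG settle"
    fi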

Comments

Filipe Manana Dec. 2, 2021, 2:45 p.m. UTC | #1
On Wed, Dec 01, 2021 at 01:18:52PM -0500, Josef Bacik wrote:
> We've been seeing transient errors with any test that uses a dm device
> for the entirety of the time that we've been running nightly xfstests

I have been seeing it on my test VMs forever as well.
It's really annoying, but fortunately it doesn't happen too often.

> runs.  This turns out to be because sometimes we get EBUSY while trying
> to create our new dm device.  Generally this is because the test comes
> right after another test that messes with the dm device, and thus we
> still have udev messing around with the device when DM tries to O_EXCL
> the block device.
> 
> Add a UDEV_SETTLE_PROG before creating the device to make sure we can
> create our new dm device without getting this transient error.

I suspect this might only make it seem like the problem goes away, without
really fixing it.

I say that for 2 reasons:

1) All tests that use dm end up calling _dmsetup_remove(), for example through
   _log_writes_remove() or _cleanup_flakey(). Normally those
   are called in the _cleanup() function, which ensures it's done even if
   the test fails for some reason.

   So I don't understand why we need that UDEV_SETTLE_PROG at _dmsetup_create().

   And I've seen the EBUSY failure happen even when the previous tests did
   not use any dm device;

2) Some tests fail after creating the dm device and using it. For example
   btrfs/206 often fails when it tries to fsck the filesystem:

   btrfs/206 3s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad)
        --- tests/btrfs/206.out     2020-10-16 23:13:46.554152652 +0100
        +++ /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad      2021-12-01 21:09:46.317632589 +0000
        @@ -3,3 +3,5 @@
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        +_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
        +(see /home/fdmanana/git/hub/xfstests/results//btrfs/206.full for details)
        ...

       (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/206.out /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad'  to see the entire diff)

    In the .full file I got:

    (...)
    replaying 1239@11201: sector 2173408, size 16384, flags 0x10(METADATA)
    replaying 1240@11234: sector 0, size 0, flags 0x1(FLUSH)
    replaying 1241@11235: sector 128, size 4096, flags 0x12(FUA|METADATA)
    _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
    *** fsck.btrfs output ***
    ERROR: cannot open device '/dev/sdc': Device or resource busy
    ERROR: cannot open file system
    Opening filesystem to check...
    *** end fsck.btrfs output
    *** mount output ***

   The EBUSY failure is not when the test starts, but somewhere in the middle
   of the replay loop when it calls fsck, or when it ends and the fstests
   framework calls fsck.

   I've seen that with btrfs/172 too, which also uses dm logwrites in a similar way.

So to me this suggests 2 things:

1) Calling UDEV_SETTLE_PROG at _dmsetup_create() doesn't solve that problem with
   btrfs/206 (and other tests) - the problem is fsck failing to open the scratch
   device after it called _log_writes_remove() -> _dmsetup_remove(), and not a
   failure to create the dm device;

2) The problem is likely something missing at _dmsetup_remove(). Perhaps add
   another UDEV_SETTLE_PROG there:

   diff --git a/common/rc b/common/rc
   index 8e351f17..22b34677 100644
   --- a/common/rc
   +++ b/common/rc
   @@ -4563,6 +4563,7 @@ _dmsetup_remove()
            $UDEV_SETTLE_PROG >/dev/null 2>&1
            $DMSETUP_PROG remove "$@" >>$seqres.full 2>&1
            $DMSETUP_PROG mknodes >/dev/null 2>&1
    +       $UDEV_SETTLE_PROG >/dev/null 2>&1
     }
 
    _dmsetup_create()

  I can't say if that change to _dmsetup_remove() is correct, or whether it's
  needed, as I really haven't spent time trying to figure out why the issue
  happens.

    


> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  common/rc | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/common/rc b/common/rc
> index 8e351f17..35e861ec 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -4567,6 +4567,7 @@ _dmsetup_remove()
>  
>  _dmsetup_create()
>  {
> +	$UDEV_SETTLE_PROG >/dev/null 2>&1
>  	$DMSETUP_PROG create "$@" >>$seqres.full 2>&1 || return 1
>  	$DMSETUP_PROG mknodes >/dev/null 2>&1
>  	$UDEV_SETTLE_PROG >/dev/null 2>&1
> -- 
> 2.26.3
>
Josef Bacik Dec. 2, 2021, 3:45 p.m. UTC | #2
On Thu, Dec 02, 2021 at 02:45:24PM +0000, Filipe Manana wrote:
> On Wed, Dec 01, 2021 at 01:18:52PM -0500, Josef Bacik wrote:
> > We've been seeing transient errors with any test that uses a dm device
> > for the entirety of the time that we've been running nightly xfstests
> 
> I have been having it on my tests vms since ever as well.
> It's really annoying, but fortunatelly it doesn't happen too often.
> 

Yeah before this change we'd fail ~6 tests on every configuration on every
overnight run.  With this change we fail 0.  It's rare, but with our continual
testing it happens sooooooo much.

> > runs.  This turns out to be because sometimes we get EBUSY while trying
> > to create our new dm device.  Generally this is because the test comes
> > right after another test that messes with the dm device, and thus we
> > still have udev messing around with the device when DM tries to O_EXCL
> > the block device.
> > 
> > Add a UDEV_SETTLE_PROG before creating the device to make sure we can
> > create our new dm device without getting this transient error.
> 
> I suspect this might only make it seem the problem goes away but does not
> really fix it.
> 
> I say that for 2 reasons:
> 
> 1) All tests that use dm end up calling _dmsetup_remove(), like through
>    _log_writes_remove() or _cleanup_flakey() for example. Normally those
>    are called in the _cleanup() function, which ensures it's done even if
>    the test fails for some reason.
> 
>    So I don't understand why we need that UDEV_SETTLE_PROG at _dmsetup_create().
> 
>    And I've seen the ebusy failure happen even when the previous tests did
>    not use any dm device;
> 
> 2) Some tests fail after creating the dm device and using it. For example
>    btrfs/206 often fails when it tries to fsck the filesystem:
> 
>    btrfs/206 3s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad)
>         --- tests/btrfs/206.out     2020-10-16 23:13:46.554152652 +0100
>         +++ /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad      2021-12-01 21:09:46.317632589 +0000
>         @@ -3,3 +3,5 @@
>         XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>         wrote 8192/8192 bytes at offset 0
>         XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>         +_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>         +(see /home/fdmanana/git/hub/xfstests/results//btrfs/206.full for details)
>         ...
> 
>        (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/206.out /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad'  to see the entire diff)
> 
>     In the .full file I got:
> 
>     (...)
>     replaying 1239@11201: sector 2173408, size 16384, flags 0x10(METADATA)
>     replaying 1240@11234: sector 0, size 0, flags 0x1(FLUSH)
>     replaying 1241@11235: sector 128, size 4096, flags 0x12(FUA|METADATA)
>     _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>     *** fsck.btrfs output ***
>     ERROR: cannot open device '/dev/sdc': Device or resource busy
>     ERROR: cannot open file system
>     Opening filesystem to check...
>     *** end fsck.btrfs output
>     *** mount output ***
> 
>    The ebusy failure is not when the test starts, but when somewhere in the middle
>    of the replay loop when it calls fsck, or when it ends and the fstests framework
>    calls fsck.
> 
>    I've seen that with btrfs/172 too, which also uses dm logwrites in a similar way.
> 
> So to me this suggests 2 things:
> 
> 1) Calling UDEV_SETTLE_PROG at _dmsetup_create() doesn't solve that problem with
>    btrfs/206 (and other tests) - the problem is fsck failing to open the scratch
>    device after it called _log_writes_remove() -> _dmsetup_remove(), and not a
>    failure to create the dm device;
> 
> 2) The problem is likely something missing at _dmsetup_remove(). Perhaps add
>    another UDEV_SETTLE_PROG there:
> 
>    diff --git a/common/rc b/common/rc
>    index 8e351f17..22b34677 100644
>    --- a/common/rc
>    +++ b/common/rc
>    @@ -4563,6 +4563,7 @@ _dmsetup_remove()
>             $UDEV_SETTLE_PROG >/dev/null 2>&1
>             $DMSETUP_PROG remove "$@" >>$seqres.full 2>&1
>             $DMSETUP_PROG mknodes >/dev/null 2>&1
>     +       $UDEV_SETTLE_PROG >/dev/null 2>&1
>      }
>  
>     _dmsetup_create()
> 
>   I can't say if that change to _dmsetup_remove() is correct, or what it's
>   needed, as I really haven't spent time trying to figure out why the issue
>   happens.
> 

I actually tried a few iterations before I settled on this one, but I was only
trying to reproduce the EBUSY when creating the flakey device; I hadn't seen it
with fsck.  So I originally started with your change, but it didn't fix the
problem.  Then I did both, UDEV_SETTLE_PROG at the end of remove and at the
beginning of create, and the problem went away; then I removed the one from
remove and the problem stayed gone.

But since I've made this change I've also been investigating another problem
where we'll get EBUSY at mkfs time when we use SCRATCH_DEV_POOL.  I have a test
patch in our staging branch to make sure it actually fixes it, but I have to add
this to the start of _scratch_pool_mkfs as well.

It turns out that udev is doing this thing where it writes to
/sys/block/whatever/uevent to make sure that a KOBJ_CHANGE event gets sent out
for the device.

This is racing with the test doing a mount.  So the mount gets O_EXCL, which
means the uevent doesn't get emitted until umount.  This would explain what
you're seeing: we umount, we get the KOBJ_CHANGE event once the O_EXCL is
dropped, udev runs, and then fsck gets an EBUSY.
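
One hedged way to watch this race from a shell (assuming /dev/sdc is the device under test; writing "change" to the uevent file is the same synthetic-event mechanism udev uses):

    # Watch kernel and udev events in the background ...
    udevadm monitor --kernel --udev --property &
    monitor_pid=$!
    # ... and synthesize the KOBJ_CHANGE event for the device.
    echo change > /sys/block/sdc/uevent
    # While udev's rules (blkid probing, the "watch" option) briefly hold the
    # device open, an O_EXCL open such as dmsetup create or fsck can hit EBUSY.
    udevadm settle                  # wait for the event queue to drain
    kill "$monitor_pid"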

This is a very long email to say that udev is causing spurious failures because
of behavior I don't entirely understand.  We're going to need to sprinkle
UDEV_SETTLE_PROG into different places to kill all of these different scenarios.

What do we think is best here?  Put UDEV_SETTLE_PROG at the start of any
function that needs to do an O_EXCL open?  So this would mean (sketch below):

_dmsetup_create
_dmsetup_remove
*_mkfs
*_mount
*_check
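
A minimal sketch of such a guard (illustrative only; the _udev_wait helper name is made up and this is not the actual fstests code):

    _udev_wait()
    {
            # Block until udev has processed all queued events, so nothing is
            # still holding the block device open when we need O_EXCL.
            $UDEV_SETTLE_PROG >/dev/null 2>&1
    }

    _dmsetup_create()
    {
            _udev_wait
            $DMSETUP_PROG create "$@" >>$seqres.full 2>&1 || return 1
            $DMSETUP_PROG mknodes >/dev/null 2>&1
            _udev_wait
    }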

That would be safest, and I can certainly do that.  My initial approach was just
to do it where it was problematic, but the debugging I did yesterday around
btrfs/223 failures, and the fact that udev is queueing up events that get
delivered at some point in the future, make it kind of hard to handle on a
case-by-case basis.  Thanks,

Josef
Chandan Babu R Dec. 6, 2021, 5:47 a.m. UTC | #3
On 02 Dec 2021 at 21:15, Josef Bacik wrote:
> On Thu, Dec 02, 2021 at 02:45:24PM +0000, Filipe Manana wrote:
>> On Wed, Dec 01, 2021 at 01:18:52PM -0500, Josef Bacik wrote:
>> > We've been seeing transient errors with any test that uses a dm device
>> > for the entirety of the time that we've been running nightly xfstests
>> 
>> I have been having it on my tests vms since ever as well.
>> It's really annoying, but fortunatelly it doesn't happen too often.
>> 
>
> Yeah before this change we'd fail ~6 tests on every configruation on every
> overnight run.  With this change we fail 0.  It's rare, but with our continual
> testing it happens sooooooo much.
>
>> > runs.  This turns out to be because sometimes we get EBUSY while trying
>> > to create our new dm device.  Generally this is because the test comes
>> > right after another test that messes with the dm device, and thus we
>> > still have udev messing around with the device when DM tries to O_EXCL
>> > the block device.
>> > 
>> > Add a UDEV_SETTLE_PROG before creating the device to make sure we can
>> > create our new dm device without getting this transient error.
>> 
>> I suspect this might only make it seem the problem goes away but does not
>> really fix it.
>> 
>> I say that for 2 reasons:
>> 
>> 1) All tests that use dm end up calling _dmsetup_remove(), like through
>>    _log_writes_remove() or _cleanup_flakey() for example. Normally those
>>    are called in the _cleanup() function, which ensures it's done even if
>>    the test fails for some reason.
>> 
>>    So I don't understand why we need that UDEV_SETTLE_PROG at _dmsetup_create().
>> 
>>    And I've seen the ebusy failure happen even when the previous tests did
>>    not use any dm device;
>> 
>> 2) Some tests fail after creating the dm device and using it. For example
>>    btrfs/206 often fails when it tries to fsck the filesystem:
>> 
>>    btrfs/206 3s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad)
>>         --- tests/btrfs/206.out     2020-10-16 23:13:46.554152652 +0100
>>         +++ /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad      2021-12-01 21:09:46.317632589 +0000
>>         @@ -3,3 +3,5 @@
>>         XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>         wrote 8192/8192 bytes at offset 0
>>         XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>         +_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>>         +(see /home/fdmanana/git/hub/xfstests/results//btrfs/206.full for details)
>>         ...
>> 
>>        (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/206.out /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad'  to see the entire diff)
>> 
>>     In the .full file I got:
>> 
>>     (...)
>>     replaying 1239@11201: sector 2173408, size 16384, flags 0x10(METADATA)
>>     replaying 1240@11234: sector 0, size 0, flags 0x1(FLUSH)
>>     replaying 1241@11235: sector 128, size 4096, flags 0x12(FUA|METADATA)
>>     _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>>     *** fsck.btrfs output ***
>>     ERROR: cannot open device '/dev/sdc': Device or resource busy
>>     ERROR: cannot open file system
>>     Opening filesystem to check...
>>     *** end fsck.btrfs output
>>     *** mount output ***
>> 
>>    The ebusy failure is not when the test starts, but when somewhere in the middle
>>    of the replay loop when it calls fsck, or when it ends and the fstests framework
>>    calls fsck.
>> 
>>    I've seen that with btrfs/172 too, which also uses dm logwrites in a similar way.
>> 
>> So to me this suggests 2 things:
>> 
>> 1) Calling UDEV_SETTLE_PROG at _dmsetup_create() doesn't solve that problem with
>>    btrfs/206 (and other tests) - the problem is fsck failing to open the scratch
>>    device after it called _log_writes_remove() -> _dmsetup_remove(), and not a
>>    failure to create the dm device;
>> 
>> 2) The problem is likely something missing at _dmsetup_remove(). Perhaps add
>>    another UDEV_SETTLE_PROG there:
>> 
>>    diff --git a/common/rc b/common/rc
>>    index 8e351f17..22b34677 100644
>>    --- a/common/rc
>>    +++ b/common/rc
>>    @@ -4563,6 +4563,7 @@ _dmsetup_remove()
>>             $UDEV_SETTLE_PROG >/dev/null 2>&1
>>             $DMSETUP_PROG remove "$@" >>$seqres.full 2>&1
>>             $DMSETUP_PROG mknodes >/dev/null 2>&1
>>     +       $UDEV_SETTLE_PROG >/dev/null 2>&1
>>      }
>>  
>>     _dmsetup_create()
>> 
>>   I can't say if that change to _dmsetup_remove() is correct, or what it's
>>   needed, as I really haven't spent time trying to figure out why the issue
>>   happens.
>> 
>
> I actually tried a few iterations before I settled on this one, but I was only
> trying to reproduce the EBUSY when creating the flakey device, I hadn't seen it
> with fsck.  So I originally started with your change, but it didn't fix the
> problem.  Then I did both, UDEV_SETTLE at the end of remove and at the beginning
> of create and the problem went away, and then I removed the one from remove and
> the problem still was gone.
>
> But since I've made this change I also have been investigating another problem
> where we'll get EBUSY at mkfs time when we use SCRATCH_DEV_POOL.  I have a test
> patch in our staging branch to make sure it actuall fixes it, but I have to add
> this to the start of _scratch_pool_mkfs as well.
>
> It turns out that udev is doing this thing where it writes to
> /sys/block/whatever/uevent to make sure that a KOBJ_CHANGE event gets sent out
> for the device.
>
> This is racing with the test doing a mount.  So the mount gets O_EXCL, which
> means the uevent doesn't get emitted until umount.  This would explain what
> you're seeing, we umount, we get the KOBJ_CHANGE event once the O_EXCL is
> dropped, udev runs, and then fsck gets an EBUSY.

I started debugging this issue late last week. Among several tests which
failed, xfs/033 was failing once in a while because the umount syscall
returned -EBUSY. Debugging this further led me to the following:

1. Mounting an fs causes udev to invoke xfs_spaceman (via udisksd => xfs_info).
   This causes the per-cpu counter at mount->mnt_pcp->mnt_count to increase by
   1.
2. Unmount at this stage causes the umount syscall to fail since the refcount
   on struct mount is greater than 2.

I changed my test disks from the regular /dev/sd* devices to loop devices. I
then made udev ignore loop devices with the following change in
/lib/udev/rules.d/60-persistent-storage.rules:

KERNEL!="sr*|loop*", IMPORT{builtin}="blkid"

This let xfs/033 run 100 times without failure. However, other
tests which use device-mapper devices are still failing arbitrarily.

It looks like a simple mount operation causes udev to indirectly invoke
xfs_spaceman, as shown by the output of perf script:

bash 50034 [002]  2541.601278: sched:sched_process_fork: comm=bash pid=50034 child_comm=bash child_pid=50044                                                
mount 50044 [003]  2541.601851: sched:sched_process_exec: filename=/usr/bin/mount pid=50044 old_pid=50044
systemd-udevd   173 [000]  2541.620525: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50048
systemd-udevd   173 [000]  2541.621071: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50051
systemd-udevd   173 [000]  2541.621562: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50052
systemd-udevd   173 [000]  2541.621971: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50053
systemd-udevd   173 [000]  2541.622657: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50054
udisksd   588 [000]  2541.675852: sched:sched_process_fork: comm=udisksd pid=588 child_comm=udisksd child_pid=50058
xfs_admin 50058 [000]  2541.676645: sched:sched_process_exec: filename=/usr/sbin/xfs_admin pid=50058 old_pid=50058
xfs_admin 50058 [000]  2541.677317: sched:sched_process_fork: comm=xfs_admin pid=50058 child_comm=xfs_admin child_pid=50059
xfs_db 50059 [000]  2541.677569: sched:sched_process_exec: filename=/usr/sbin/xfs_db pid=50059 old_pid=50059
udisksd   588 [003]  2541.687075: sched:sched_process_fork: comm=udisksd pid=588 child_comm=udisksd child_pid=50060
xfs_info 50060 [001]  2541.687843: sched:sched_process_exec: filename=/usr/sbin/xfs_info pid=50060 old_pid=50060
xfs_info 50060 [001]  2541.688362: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50061
xfs_info 50061 [003]  2541.688615: sched:sched_process_fork: comm=xfs_info pid=50061 child_comm=xfs_info child_pid=50062
xfs_info 50062 [001]  2541.688810: sched:sched_process_fork: comm=xfs_info pid=50062 child_comm=xfs_info child_pid=50063
xfs_info 50062 [001]  2541.688944: sched:sched_process_fork: comm=xfs_info pid=50062 child_comm=xfs_info child_pid=50064
losetup 50063 [003]  2541.689058: sched:sched_process_exec: filename=/usr/sbin/losetup pid=50063 old_pid=50063
tail 50064 [000]  2541.689169: sched:sched_process_exec: filename=/usr/bin/tail pid=50064 old_pid=50064
xfs_info 50060 [001]  2541.690200: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50065
findmnt 50065 [003]  2541.690501: sched:sched_process_exec: filename=/usr/bin/findmnt pid=50065 old_pid=50065
xfs_info 50060 [001]  2541.692154: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50066
xfs_spaceman 50066 [003]  2541.692345: sched:sched_process_exec: filename=/usr/sbin/xfs_spaceman pid=50066 old_pid=50066

xfs_spaceman in turn causes the refcount on struct mount to increase. Hence an
unmount operation can fail with -EBUSY.
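
One hedged workaround sketch for the umount case (illustrative only; it simply retries while udev-triggered helpers may still hold a reference on the mount):

    # Retry the unmount while blkid/xfs_spaceman (run indirectly by udevd and
    # udisksd on the change uevent) may still pin the mount.
    for i in $(seq 1 10); do
            umount "$SCRATCH_MNT" && break
            udevadm settle
            sleep 1
    done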

>
> This is a very long email to say that udev is causing spurious failures because
> of behavior I don't entirely understand.  We're going to need to sprinkle in
> UDEV_SETTLE_PROG in different places to kill all of these different scenarios.
>
> What do we think is best here?  Put UDEV_SETTLE_PROG at the start of any
> function that needs to do O_EXCL?  So this would mean
>
> _dmsetup_create
> _dmsetup_remove
> *_mkfs
> *_mount
> *_check
>

To be honest, I don't understand udev well enough. But as pointed out above, a
simple invocation of mount causes xfs_spaceman to be executed (indirectly)
by udevd.  Hence, executing UDEV_SETTLE_PROG alone is probably not
sufficient.

> That would be safest, and I can certainly do that.  My initial approach was just
> to do it where it was problematic, but the debugging I did yesterday around
> btrfs/223 failures and the fact that udev is queue'ing up events that get
> delivered at some point in the future makes it kind of hard to handle on a
> case-by-case basis.  Thanks,
>
> Josef
Chandan Babu R Dec. 6, 2021, 2:07 p.m. UTC | #4
On 06 Dec 2021 at 11:17, Chandan Babu R wrote:
> On 02 Dec 2021 at 21:15, Josef Bacik wrote:
>> On Thu, Dec 02, 2021 at 02:45:24PM +0000, Filipe Manana wrote:
>>> On Wed, Dec 01, 2021 at 01:18:52PM -0500, Josef Bacik wrote:
>>> > We've been seeing transient errors with any test that uses a dm device
>>> > for the entirety of the time that we've been running nightly xfstests
>>> 
>>> I have been having it on my tests vms since ever as well.
>>> It's really annoying, but fortunatelly it doesn't happen too often.
>>> 
>>
>> Yeah before this change we'd fail ~6 tests on every configruation on every
>> overnight run.  With this change we fail 0.  It's rare, but with our continual
>> testing it happens sooooooo much.
>>
>>> > runs.  This turns out to be because sometimes we get EBUSY while trying
>>> > to create our new dm device.  Generally this is because the test comes
>>> > right after another test that messes with the dm device, and thus we
>>> > still have udev messing around with the device when DM tries to O_EXCL
>>> > the block device.
>>> > 
>>> > Add a UDEV_SETTLE_PROG before creating the device to make sure we can
>>> > create our new dm device without getting this transient error.
>>> 
>>> I suspect this might only make it seem the problem goes away but does not
>>> really fix it.
>>> 
>>> I say that for 2 reasons:
>>> 
>>> 1) All tests that use dm end up calling _dmsetup_remove(), like through
>>>    _log_writes_remove() or _cleanup_flakey() for example. Normally those
>>>    are called in the _cleanup() function, which ensures it's done even if
>>>    the test fails for some reason.
>>> 
>>>    So I don't understand why we need that UDEV_SETTLE_PROG at _dmsetup_create().
>>> 
>>>    And I've seen the ebusy failure happen even when the previous tests did
>>>    not use any dm device;
>>> 
>>> 2) Some tests fail after creating the dm device and using it. For example
>>>    btrfs/206 often fails when it tries to fsck the filesystem:
>>> 
>>>    btrfs/206 3s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad)
>>>         --- tests/btrfs/206.out     2020-10-16 23:13:46.554152652 +0100
>>>         +++ /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad      2021-12-01 21:09:46.317632589 +0000
>>>         @@ -3,3 +3,5 @@
>>>         XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>>         wrote 8192/8192 bytes at offset 0
>>>         XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
>>>         +_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>>>         +(see /home/fdmanana/git/hub/xfstests/results//btrfs/206.full for details)
>>>         ...
>>> 
>>>        (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/206.out /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad'  to see the entire diff)
>>> 
>>>     In the .full file I got:
>>> 
>>>     (...)
>>>     replaying 1239@11201: sector 2173408, size 16384, flags 0x10(METADATA)
>>>     replaying 1240@11234: sector 0, size 0, flags 0x1(FLUSH)
>>>     replaying 1241@11235: sector 128, size 4096, flags 0x12(FUA|METADATA)
>>>     _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
>>>     *** fsck.btrfs output ***
>>>     ERROR: cannot open device '/dev/sdc': Device or resource busy
>>>     ERROR: cannot open file system
>>>     Opening filesystem to check...
>>>     *** end fsck.btrfs output
>>>     *** mount output ***
>>> 
>>>    The ebusy failure is not when the test starts, but when somewhere in the middle
>>>    of the replay loop when it calls fsck, or when it ends and the fstests framework
>>>    calls fsck.
>>> 
>>>    I've seen that with btrfs/172 too, which also uses dm logwrites in a similar way.
>>> 
>>> So to me this suggests 2 things:
>>> 
>>> 1) Calling UDEV_SETTLE_PROG at _dmsetup_create() doesn't solve that problem with
>>>    btrfs/206 (and other tests) - the problem is fsck failing to open the scratch
>>>    device after it called _log_writes_remove() -> _dmsetup_remove(), and not a
>>>    failure to create the dm device;
>>> 
>>> 2) The problem is likely something missing at _dmsetup_remove(). Perhaps add
>>>    another UDEV_SETTLE_PROG there:
>>> 
>>>    diff --git a/common/rc b/common/rc
>>>    index 8e351f17..22b34677 100644
>>>    --- a/common/rc
>>>    +++ b/common/rc
>>>    @@ -4563,6 +4563,7 @@ _dmsetup_remove()
>>>             $UDEV_SETTLE_PROG >/dev/null 2>&1
>>>             $DMSETUP_PROG remove "$@" >>$seqres.full 2>&1
>>>             $DMSETUP_PROG mknodes >/dev/null 2>&1
>>>     +       $UDEV_SETTLE_PROG >/dev/null 2>&1
>>>      }
>>>  
>>>     _dmsetup_create()
>>> 
>>>   I can't say if that change to _dmsetup_remove() is correct, or what it's
>>>   needed, as I really haven't spent time trying to figure out why the issue
>>>   happens.
>>> 
>>
>> I actually tried a few iterations before I settled on this one, but I was only
>> trying to reproduce the EBUSY when creating the flakey device, I hadn't seen it
>> with fsck.  So I originally started with your change, but it didn't fix the
>> problem.  Then I did both, UDEV_SETTLE at the end of remove and at the beginning
>> of create and the problem went away, and then I removed the one from remove and
>> the problem still was gone.
>>
>> But since I've made this change I also have been investigating another problem
>> where we'll get EBUSY at mkfs time when we use SCRATCH_DEV_POOL.  I have a test
>> patch in our staging branch to make sure it actuall fixes it, but I have to add
>> this to the start of _scratch_pool_mkfs as well.
>>
>> It turns out that udev is doing this thing where it writes to
>> /sys/block/whatever/uevent to make sure that a KOBJ_CHANGE event gets sent out
>> for the device.
>>
>> This is racing with the test doing a mount.  So the mount gets O_EXCL, which
>> means the uevent doesn't get emitted until umount.  This would explain what
>> you're seeing, we umount, we get the KOBJ_CHANGE event once the O_EXCL is
>> dropped, udev runs, and then fsck gets an EBUSY.
>
> I started debugging this issue late last week. Among several tests which
> failed, xfs/033 was failing once in a while because the umount syscall
> returned -EBUSY. Debugging this further led me to the following,
>
> 1. Mounting an fs causes udev to invoke xfs_spaceman (via udisksd => xfs_info).
>    This causes the per-cpu counter at mount->mnt_pcp->mnt_count to increase by
>    1.
> 2. Unmount at this stage causes the umount syscall to fail since the refcount
>    on struct mount is greater than 2.
>
> I changed my test disks from the regular /dev/sd* devices to loop devices. I
> then added loop devices to be ignored by udev with the following change in
> /lib/udev/rules.d/60-persistent-storage.rules,
>
> KERNEL!="sr*|loop*", IMPORT{builtin}="blkid"
>
> This led to xfs/033 execute fine for 100 times without failure.

> However, other tests which use device mapper devices are still failing
> arbitrarily. 

This failure occurs because 60-persistent-storage-dm.rules has the following:

# Add inotify watch to track changes on this device.
# Using the watch rule is not optimal - it generates a lot of spurious
# and useless events whenever the device opened for read-write is closed.
# The best would be to generete the event directly in the tool changing
# relevant information so only relevant events will be processed
# (like creating a filesystem, changing filesystem label etc.).
#
# But let's use this until we have something better...                                                                                          
LABEL="dm_watch"
OPTIONS+="watch"

Hence any change to the device will generate a uevent, causing udev's internal
blkid to be invoked. This ends up forking and executing xfs_spaceman.
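
A hedged sketch of an override that drops the watch for the dm devices fstests creates (untested; the DM_NAME patterns are illustrative and would need verification against the names fstests actually uses):

    # Write an override rule and reload udev.
    cat > /etc/udev/rules.d/99-fstests-dm-nowatch.rules <<'EOF'
    ENV{DM_NAME}=="flakey*|log-writes*", OPTIONS:="nowatch", ENV{UDEV_DISABLE_PERSISTENT_STORAGE_RULES_FLAG}="1"
    EOF
    udevadm control --reload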

>
> Looks like a simple mount operation causes udev to indirectly invoke
> xfs_spaceman as described by the output of perf script,
>
> bash 50034 [002]  2541.601278: sched:sched_process_fork: comm=bash pid=50034 child_comm=bash child_pid=50044
> mount 50044 [003]  2541.601851: sched:sched_process_exec: filename=/usr/bin/mount pid=50044 old_pid=50044
> systemd-udevd   173 [000]  2541.620525: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50048
> systemd-udevd   173 [000]  2541.621071: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50051
> systemd-udevd   173 [000]  2541.621562: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50052
> systemd-udevd   173 [000]  2541.621971: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50053
> systemd-udevd   173 [000]  2541.622657: sched:sched_process_fork: comm=systemd-udevd pid=173 child_comm=systemd-udevd child_pid=50054
> udisksd   588 [000]  2541.675852: sched:sched_process_fork: comm=udisksd pid=588 child_comm=udisksd child_pid=50058
> xfs_admin 50058 [000]  2541.676645: sched:sched_process_exec: filename=/usr/sbin/xfs_admin pid=50058 old_pid=50058
> xfs_admin 50058 [000]  2541.677317: sched:sched_process_fork: comm=xfs_admin pid=50058 child_comm=xfs_admin child_pid=50059
> xfs_db 50059 [000]  2541.677569: sched:sched_process_exec: filename=/usr/sbin/xfs_db pid=50059 old_pid=50059
> udisksd   588 [003]  2541.687075: sched:sched_process_fork: comm=udisksd pid=588 child_comm=udisksd child_pid=50060
> xfs_info 50060 [001]  2541.687843: sched:sched_process_exec: filename=/usr/sbin/xfs_info pid=50060 old_pid=50060
> xfs_info 50060 [001]  2541.688362: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50061
> xfs_info 50061 [003]  2541.688615: sched:sched_process_fork: comm=xfs_info pid=50061 child_comm=xfs_info child_pid=50062
> xfs_info 50062 [001]  2541.688810: sched:sched_process_fork: comm=xfs_info pid=50062 child_comm=xfs_info child_pid=50063
> xfs_info 50062 [001]  2541.688944: sched:sched_process_fork: comm=xfs_info pid=50062 child_comm=xfs_info child_pid=50064
> losetup 50063 [003]  2541.689058: sched:sched_process_exec: filename=/usr/sbin/losetup pid=50063 old_pid=50063
> tail 50064 [000]  2541.689169: sched:sched_process_exec: filename=/usr/bin/tail pid=50064 old_pid=50064
> xfs_info 50060 [001]  2541.690200: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50065
> findmnt 50065 [003]  2541.690501: sched:sched_process_exec: filename=/usr/bin/findmnt pid=50065 old_pid=50065
> xfs_info 50060 [001]  2541.692154: sched:sched_process_fork: comm=xfs_info pid=50060 child_comm=xfs_info child_pid=50066
> xfs_spaceman 50066 [003]  2541.692345: sched:sched_process_exec: filename=/usr/sbin/xfs_spaceman pid=50066 old_pid=50066
>
> xfs_spaceman in turn causes ref count on struct mount to increase. Hence an
> unmount operation can fail with -EBUSY error.
>
>>
>> This is a very long email to say that udev is causing spurious failures because
>> of behavior I don't entirely understand.  We're going to need to sprinkle in
>> UDEV_SETTLE_PROG in different places to kill all of these different scenarios.
>>
>> What do we think is best here?  Put UDEV_SETTLE_PROG at the start of any
>> function that needs to do O_EXCL?  So this would mean
>>
>> _dmsetup_create
>> _dmsetup_remove
>> *_mkfs
>> *_mount
>> *_check
>>
>
> To be honest, I don't understand udev well enough. But as pointed above, a
> simple invocation of mount is causing xfs_spaceman to be executed (indirectly)
> by udevd.  Hence, may be executing UDEV_SETTLE_PROG is probably not
> sufficient.
>
>> That would be safest, and I can certainly do that.  My initial approach was just
>> to do it where it was problematic, but the debugging I did yesterday around
>> btrfs/223 failures and the fact that udev is queue'ing up events that get
>> delivered at some point in the future makes it kind of hard to handle on a
>> case-by-case basis.  Thanks,
>>
>> Josef
Josef Bacik Dec. 6, 2021, 7:28 p.m. UTC | #5
On Mon, Dec 06, 2021 at 07:37:31PM +0530, Chandan Babu R wrote:
> On 06 Dec 2021 at 11:17, Chandan Babu R wrote:
> > On 02 Dec 2021 at 21:15, Josef Bacik wrote:
> >> On Thu, Dec 02, 2021 at 02:45:24PM +0000, Filipe Manana wrote:
> >>> On Wed, Dec 01, 2021 at 01:18:52PM -0500, Josef Bacik wrote:
> >>> > We've been seeing transient errors with any test that uses a dm device
> >>> > for the entirety of the time that we've been running nightly xfstests
> >>> 
> >>> I have been having it on my tests vms since ever as well.
> >>> It's really annoying, but fortunatelly it doesn't happen too often.
> >>> 
> >>
> >> Yeah before this change we'd fail ~6 tests on every configruation on every
> >> overnight run.  With this change we fail 0.  It's rare, but with our continual
> >> testing it happens sooooooo much.
> >>
> >>> > runs.  This turns out to be because sometimes we get EBUSY while trying
> >>> > to create our new dm device.  Generally this is because the test comes
> >>> > right after another test that messes with the dm device, and thus we
> >>> > still have udev messing around with the device when DM tries to O_EXCL
> >>> > the block device.
> >>> > 
> >>> > Add a UDEV_SETTLE_PROG before creating the device to make sure we can
> >>> > create our new dm device without getting this transient error.
> >>> 
> >>> I suspect this might only make it seem the problem goes away but does not
> >>> really fix it.
> >>> 
> >>> I say that for 2 reasons:
> >>> 
> >>> 1) All tests that use dm end up calling _dmsetup_remove(), like through
> >>>    _log_writes_remove() or _cleanup_flakey() for example. Normally those
> >>>    are called in the _cleanup() function, which ensures it's done even if
> >>>    the test fails for some reason.
> >>> 
> >>>    So I don't understand why we need that UDEV_SETTLE_PROG at _dmsetup_create().
> >>> 
> >>>    And I've seen the ebusy failure happen even when the previous tests did
> >>>    not use any dm device;
> >>> 
> >>> 2) Some tests fail after creating the dm device and using it. For example
> >>>    btrfs/206 often fails when it tries to fsck the filesystem:
> >>> 
> >>>    btrfs/206 3s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad)
> >>>         --- tests/btrfs/206.out     2020-10-16 23:13:46.554152652 +0100
> >>>         +++ /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad      2021-12-01 21:09:46.317632589 +0000
> >>>         @@ -3,3 +3,5 @@
> >>>         XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >>>         wrote 8192/8192 bytes at offset 0
> >>>         XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
> >>>         +_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
> >>>         +(see /home/fdmanana/git/hub/xfstests/results//btrfs/206.full for details)
> >>>         ...
> >>> 
> >>>        (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/206.out /home/fdmanana/git/hub/xfstests/results//btrfs/206.out.bad'  to see the entire diff)
> >>> 
> >>>     In the .full file I got:
> >>> 
> >>>     (...)
> >>>     replaying 1239@11201: sector 2173408, size 16384, flags 0x10(METADATA)
> >>>     replaying 1240@11234: sector 0, size 0, flags 0x1(FLUSH)
> >>>     replaying 1241@11235: sector 128, size 4096, flags 0x12(FUA|METADATA)
> >>>     _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
> >>>     *** fsck.btrfs output ***
> >>>     ERROR: cannot open device '/dev/sdc': Device or resource busy
> >>>     ERROR: cannot open file system
> >>>     Opening filesystem to check...
> >>>     *** end fsck.btrfs output
> >>>     *** mount output ***
> >>> 
> >>>    The ebusy failure is not when the test starts, but when somewhere in the middle
> >>>    of the replay loop when it calls fsck, or when it ends and the fstests framework
> >>>    calls fsck.
> >>> 
> >>>    I've seen that with btrfs/172 too, which also uses dm logwrites in a similar way.
> >>> 
> >>> So to me this suggests 2 things:
> >>> 
> >>> 1) Calling UDEV_SETTLE_PROG at _dmsetup_create() doesn't solve that problem with
> >>>    btrfs/206 (and other tests) - the problem is fsck failing to open the scratch
> >>>    device after it called _log_writes_remove() -> _dmsetup_remove(), and not a
> >>>    failure to create the dm device;
> >>> 
> >>> 2) The problem is likely something missing at _dmsetup_remove(). Perhaps add
> >>>    another UDEV_SETTLE_PROG there:
> >>> 
> >>>    diff --git a/common/rc b/common/rc
> >>>    index 8e351f17..22b34677 100644
> >>>    --- a/common/rc
> >>>    +++ b/common/rc
> >>>    @@ -4563,6 +4563,7 @@ _dmsetup_remove()
> >>>             $UDEV_SETTLE_PROG >/dev/null 2>&1
> >>>             $DMSETUP_PROG remove "$@" >>$seqres.full 2>&1
> >>>             $DMSETUP_PROG mknodes >/dev/null 2>&1
> >>>     +       $UDEV_SETTLE_PROG >/dev/null 2>&1
> >>>      }
> >>>  
> >>>     _dmsetup_create()
> >>> 
> >>>   I can't say if that change to _dmsetup_remove() is correct, or what it's
> >>>   needed, as I really haven't spent time trying to figure out why the issue
> >>>   happens.
> >>> 
> >>
> >> I actually tried a few iterations before I settled on this one, but I was only
> >> trying to reproduce the EBUSY when creating the flakey device, I hadn't seen it
> >> with fsck.  So I originally started with your change, but it didn't fix the
> >> problem.  Then I did both, UDEV_SETTLE at the end of remove and at the beginning
> >> of create and the problem went away, and then I removed the one from remove and
> >> the problem still was gone.
> >>
> >> But since I've made this change I also have been investigating another problem
> >> where we'll get EBUSY at mkfs time when we use SCRATCH_DEV_POOL.  I have a test
> >> patch in our staging branch to make sure it actuall fixes it, but I have to add
> >> this to the start of _scratch_pool_mkfs as well.
> >>
> >> It turns out that udev is doing this thing where it writes to
> >> /sys/block/whatever/uevent to make sure that a KOBJ_CHANGE event gets sent out
> >> for the device.
> >>
> >> This is racing with the test doing a mount.  So the mount gets O_EXCL, which
> >> means the uevent doesn't get emitted until umount.  This would explain what
> >> you're seeing, we umount, we get the KOBJ_CHANGE event once the O_EXCL is
> >> dropped, udev runs, and then fsck gets an EBUSY.
> >
> > I started debugging this issue late last week. Among several tests which
> > failed, xfs/033 was failing once in a while because the umount syscall
> > returned -EBUSY. Debugging this further led me to the following,
> >
> > 1. Mounting an fs causes udev to invoke xfs_spaceman (via udisksd => xfs_info).
> >    This causes the per-cpu counter at mount->mnt_pcp->mnt_count to increase by
> >    1.
> > 2. Unmount at this stage causes the umount syscall to fail since the refcount
> >    on struct mount is greater than 2.
> >
> > I changed my test disks from the regular /dev/sd* devices to loop devices. I
> > then added loop devices to be ignored by udev with the following change in
> > /lib/udev/rules.d/60-persistent-storage.rules,
> >
> > KERNEL!="sr*|loop*", IMPORT{builtin}="blkid"
> >
> > This led to xfs/033 execute fine for 100 times without failure.
> 
> > However, other tests which use device mapper devices are still failing
> > arbitrarily. 
> 
> This failure occurs because 60-persistent-storage-dm.rules has the following,
> 
> # Add inotify watch to track changes on this device.
> # Using the watch rule is not optimal - it generates a lot of spurious
> # and useless events whenever the device opened for read-write is closed.
> # The best would be to generete the event directly in the tool changing
> # relevant information so only relevant events will be processed
> # (like creating a filesystem, changing filesystem label etc.).
> #
> # But let's use this until we have something better...                                                                                          
> LABEL="dm_watch"
> OPTIONS+="watch"
> 
> Hence any change to the device will generate a uevent causing udev's internal
> blkid to be invoked. This ends up forking and executing xfs_spaceman.
> 

First of all, excellent work digging into this, Chandan; thanks so much!

<snip>

> >
> > To be honest, I don't understand udev well enough. But as pointed above, a
> > simple invocation of mount is causing xfs_spaceman to be executed (indirectly)
> > by udevd.  Hence, may be executing UDEV_SETTLE_PROG is probably not
> > sufficient.
> >

Yeah, so the question is what we do here to make xfstests and udev coexist
nicely.  I imagine that if we added UDEV_SETTLE_PROG around every operation we
do (for example before the umount in your test) we would stop hitting these
problems.

But this clearly sucks.  Should we instead come up with a couple of udev rules
and put them here in fstests with the disclaimer "hey, if you see problems,
install these udev rules" and call it good enough?  I see there's
UDEV_DISABLE_PERSISTENT_STORAGE_RULES_FLAG and DM_NOSCAN.  I don't know how to
write udev rules, but is there a way we could write a script that uses the
device settings in the user's local.config to generate a udev rule, to be
copied into place, that would keep udev from touching the devices we're using?
Thanks,

Josef
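
A rough sketch of the kind of generator script being suggested (all file names and match keys here are assumptions and untested; it only illustrates the idea of deriving the rule from local.config):

    #!/bin/bash
    # Read the devices from local.config and emit a udev rule that keeps the
    # persistent-storage rules and the inotify watch off of them.
    . ./local.config
    rule=/etc/udev/rules.d/99-fstests-local.rules
    : > "$rule"
    for dev in $TEST_DEV $SCRATCH_DEV $SCRATCH_DEV_POOL; do
            kname=$(basename "$(readlink -f "$dev")")
            echo "KERNEL==\"$kname\", ENV{UDEV_DISABLE_PERSISTENT_STORAGE_RULES_FLAG}=\"1\", OPTIONS:=\"nowatch\"" >> "$rule"
    done
    udevadm control --reload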
diff mbox series

Patch

diff --git a/common/rc b/common/rc
index 8e351f17..35e861ec 100644
--- a/common/rc
+++ b/common/rc
@@ -4567,6 +4567,7 @@  _dmsetup_remove()
 
 _dmsetup_create()
 {
+	$UDEV_SETTLE_PROG >/dev/null 2>&1
 	$DMSETUP_PROG create "$@" >>$seqres.full 2>&1 || return 1
 	$DMSETUP_PROG mknodes >/dev/null 2>&1
 	$UDEV_SETTLE_PROG >/dev/null 2>&1