diff mbox series

[blktests] dm/002: repeat dmsetup remove command on failure with EBUSY

Message ID 20240709124441.139769-1-shinichiro.kawasaki@wdc.com (mailing list archive)
State New, archived
Headers show
Series [blktests] dm/002: repeat dmsetup remove command on failure with EBUSY | expand

Commit Message

Shinichiro Kawasaki July 9, 2024, 12:44 p.m. UTC
The test case dm/002 rarely fails with the message below:

dm/002 => nvme0n1 (dm-dust general functionality test)       [failed]
    runtime  0.204s  ...  0.174s
    --- tests/dm/002.out        2024-06-14 14:37:40.480794693 +0900
    +++ /home/shin/Blktests/blktests/results/nvme0n1/dm/002.out.bad     2024-06-14 21:38:18.588976499 +0900
    @@ -7,4 +7,6 @@
     countbadblocks: 0 badblock(s) found
     countbadblocks: 3 badblock(s) found
     countbadblocks: 0 badblock(s) found
    +device-mapper: remove ioctl on dust1  failed: Device or resource busy
    +Command failed.
     Test complete
modprobe: FATAL: Module dm_dust is in use.

This failure happens at "dmsetup remove" command, when the previous
operation on the dm device is still ongoing. In this case,
dm_open_count() is non-zero, then IOCTL for device remove fails and
EBUSY is returned.

To avoid the failure, retry the "dmsetup remove" command when it fails
with EBUSY. Introduce the helper function _dm_remove for this purpose.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
This patch addresses a failure found during the debug work for another
dm/002 failure [1].

[1] https://lore.kernel.org/linux-block/42ecobcsduvlqh77iavjj2p3ewdh7u4opdz4xruauz4u5ddljz@yr7ye4fq72tr/

 tests/dm/002 |  2 +-
 tests/dm/rc  | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+), 1 deletion(-)

Comments

Bryan Gurney July 10, 2024, 1:43 p.m. UTC | #1
On Tue, Jul 9, 2024 at 8:44 AM Shin'ichiro Kawasaki
<shinichiro.kawasaki@wdc.com> wrote:
>
> The test case dm/002 rarely fails with the message below:
>
> dm/002 => nvme0n1 (dm-dust general functionality test)       [failed]
>     runtime  0.204s  ...  0.174s
>     --- tests/dm/002.out        2024-06-14 14:37:40.480794693 +0900
>     +++ /home/shin/Blktests/blktests/results/nvme0n1/dm/002.out.bad     2024-06-14 21:38:18.588976499 +0900
>     @@ -7,4 +7,6 @@
>      countbadblocks: 0 badblock(s) found
>      countbadblocks: 3 badblock(s) found
>      countbadblocks: 0 badblock(s) found
>     +device-mapper: remove ioctl on dust1  failed: Device or resource busy
>     +Command failed.
>      Test complete
> modprobe: FATAL: Module dm_dust is in use.
>
> This failure happens at "dmsetup remove" command, when the previous
> operation on the dm device is still ongoing. In this case,
> dm_open_count() is non-zero, then IOCTL for device remove fails and
> EBUSY is returned.
>
> To avoid the failure, retry the "dmsetup remove" command when it fails
> with EBUSY. Introduce the helper function _dm_remove for this purpose.
>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>

I think this looks good, and I tested it on my system:

Reviewed-by: Bryan Gurney <bgurney@redhat.com>


I want to cc dm-devel and the device-mapper maintainers, in case there
are any questions on this test.  (It's probably a good idea to cc
dm-devel for any "dm" blktests.)


Thanks,

Bryan
Mikulas Patocka July 10, 2024, 3:46 p.m. UTC | #2
On Wed, 10 Jul 2024, Bryan Gurney wrote:

> On Tue, Jul 9, 2024 at 8:44 AM Shin'ichiro Kawasaki
> <shinichiro.kawasaki@wdc.com> wrote:
> >
> > The test case dm/002 rarely fails with the message below:
> >
> > dm/002 => nvme0n1 (dm-dust general functionality test)       [failed]
> >     runtime  0.204s  ...  0.174s
> >     --- tests/dm/002.out        2024-06-14 14:37:40.480794693 +0900
> >     +++ /home/shin/Blktests/blktests/results/nvme0n1/dm/002.out.bad     2024-06-14 21:38:18.588976499 +0900
> >     @@ -7,4 +7,6 @@
> >      countbadblocks: 0 badblock(s) found
> >      countbadblocks: 3 badblock(s) found
> >      countbadblocks: 0 badblock(s) found
> >     +device-mapper: remove ioctl on dust1  failed: Device or resource busy
> >     +Command failed.
> >      Test complete
> > modprobe: FATAL: Module dm_dust is in use.
> >
> > This failure happens at "dmsetup remove" command, when the previous
> > operation on the dm device is still ongoing. In this case,
> > dm_open_count() is non-zero, then IOCTL for device remove fails and
> > EBUSY is returned.
> >
> > To avoid the failure, retry the "dmsetup remove" command when it fails
> > with EBUSY. Introduce the helper function _dm_remove for this purpose.
> >
> > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> 
> I think this looks good, and I tested it on my system:
> 
> Reviewed-by: Bryan Gurney <bgurney@redhat.com>
> 
> 
> I want to cc dm-devel and the device-mapper maintainers, in case there
> are any questions on this test.  (It's probably a good idea to cc
> dm-devel for any "dm" blktests.)

I think it would be better to find out which code keeps the DM device open 
and fix that.

It is generally assumed that a DM device can be removed when you close it. 
If not, there's a bug somewhere else (and this patch is just papering over 
the bug).

Mikulas
Shinichiro Kawasaki July 11, 2024, 10:39 a.m. UTC | #3
On Jul 10, 2024 / 17:46, Mikulas Patocka wrote:
> 
> 
> On Wed, 10 Jul 2024, Bryan Gurney wrote:
> 
> > On Tue, Jul 9, 2024 at 8:44 AM Shin'ichiro Kawasaki
> > <shinichiro.kawasaki@wdc.com> wrote:
> > >
> > > The test case dm/002 rarely fails with the message below:
> > >
> > > dm/002 => nvme0n1 (dm-dust general functionality test)       [failed]
> > >     runtime  0.204s  ...  0.174s
> > >     --- tests/dm/002.out        2024-06-14 14:37:40.480794693 +0900
> > >     +++ /home/shin/Blktests/blktests/results/nvme0n1/dm/002.out.bad     2024-06-14 21:38:18.588976499 +0900
> > >     @@ -7,4 +7,6 @@
> > >      countbadblocks: 0 badblock(s) found
> > >      countbadblocks: 3 badblock(s) found
> > >      countbadblocks: 0 badblock(s) found
> > >     +device-mapper: remove ioctl on dust1  failed: Device or resource busy
> > >     +Command failed.
> > >      Test complete
> > > modprobe: FATAL: Module dm_dust is in use.
> > >
> > > This failure happens at "dmsetup remove" command, when the previous
> > > operation on the dm device is still ongoing. In this case,
> > > dm_open_count() is non-zero, then IOCTL for device remove fails and
> > > EBUSY is returned.
> > >
> > > To avoid the failure, retry the "dmsetup remove" command when it fails
> > > with EBUSY. Introduce the helper function _dm_remove for this purpose.
> > >
> > > Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> > 
> > I think this looks good, and I tested it on my system:
> > 
> > Reviewed-by: Bryan Gurney <bgurney@redhat.com>
> > 
> > 
> > I want to cc dm-devel and the device-mapper maintainers, in case there
> > are any questions on this test.  (It's probably a good idea to cc
> > dm-devel for any "dm" blktests.)
> 
> I think it would be better to find out which code keeps the DM device open 
> and fix that.
> 
> It is generally assumed that a DM device can be removed when you close it. 
> If not, there's a bug somewhere else (and this patch is just papering over 
> the bug).

Miklas, Bryan, thanks for the comments.

To understand what keeps the DM device open, I created a debug patch [1]. This
patch sets a debug flag when dm_lock_for_deletion() reports EBUSY. When the
flag is set, the process which kept the DM device open prints WARN at DM device
close. With this patch, I recreated the failure and checked the WARN [2]. It
showed "Comm: (udev-worker)" is the process which keeps the DM device open.

Question, should "dmsetup remove" succeed while udev accesses the DM device?

If the "dmsetup remove" failure is allowed in such cases, blktests side will
need a change. Now I know that the udev is the interfering process, I can
think of simpler change than this patch. Just adding "udevadm settle" before
"dmsetup remove" will avoid the failure.

[1]

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 08700bfc3e23..a4b561353dd2 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -161,6 +161,7 @@ struct mapped_device {
 #define DMF_SUSPENDED_INTERNALLY 7
 #define DMF_POST_SUSPENDING 8
 #define DMF_EMULATE_ZONE_APPEND 9
+#define DMF_DEBUG 10
 
 void disable_discard(struct mapped_device *md);
 void disable_write_zeroes(struct mapped_device *md);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 13037d6a6f62..542bc7cc2e79 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -344,9 +344,11 @@ static void dm_blk_close(struct gendisk *disk)
 	if (WARN_ON(!md))
 		goto out;
 
-	if (atomic_dec_and_test(&md->open_count) &&
-	    (test_bit(DMF_DEFERRED_REMOVE, &md->flags)))
-		queue_work(deferred_remove_workqueue, &deferred_remove_work);
+	if (atomic_dec_and_test(&md->open_count)) {
+		WARN_ON(test_bit(DMF_DEBUG, &md->flags));
+		if (test_bit(DMF_DEFERRED_REMOVE, &md->flags))
+			queue_work(deferred_remove_workqueue, &deferred_remove_work);
+	}
 
 	dm_put(md);
 out:
@@ -369,6 +371,8 @@ int dm_lock_for_deletion(struct mapped_device *md, bool mark_deferred, bool only
 
 	if (dm_open_count(md)) {
 		r = -EBUSY;
+		set_bit(DMF_DEBUG, &md->flags);
+		printk("%s: EBUSY\n", __func__);
 		if (mark_deferred)
 			set_bit(DMF_DEFERRED_REMOVE, &md->flags);
 	} else if (only_deferred && !test_bit(DMF_DEFERRED_REMOVE, &md->flags))

[2]

[  558.341218] run blktests dm/002 at 2024-07-11 14:42:03
[  558.381437] device-mapper: dust: dust_add_block: badblock added at block 60 with write fail count 0
[  558.392876] device-mapper: dust: dust_add_block: badblock added at block 67 with write fail count 0
[  558.404448] device-mapper: dust: dust_add_block: badblock added at block 72 with write fail count 0
[  558.425527] device-mapper: dust: dust_add_block: badblock added at block 60 with write fail count 0
[  558.436912] device-mapper: dust: dust_add_block: badblock added at block 67 with write fail count 0
[  558.448627] device-mapper: dust: dust_add_block: badblock added at block 72 with write fail count 0
[  558.462604] device-mapper: dust: enabling read failures on bad sectors
[  558.472068] device-mapper: dust: block 7 removed from badblocklist by write
[  558.479491] device-mapper: dust: block 8 removed from badblocklist by write
[  558.486624] device-mapper: dust: block 9 removed from badblocklist by write
[  558.504312] dm_lock_for_deletion: EBUSY
[  558.508184] ------------[ cut here ]------------
[  558.512835] WARNING: CPU: 0 PID: 13231 at drivers/md/dm.c:348 dm_blk_close+0x74/0x80
[  558.520640] Modules linked in: dm_dust ipmi_ssif snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq snd_seq_device snd_pcm x86_pkg_temp_thermal coretemp acpi_ipmi igb i2c_i801 snd_timer snd soundcore i2c_smbus atlantic ipmi_si ipmi_devintf ipmi_msghandler crc32_pclmul usbhid scsi_dh_rdac scsi_dh_emc [last unloaded: dm_dust]
[  558.557538] CPU: 0 PID: 13231 Comm: (udev-worker) Not tainted 6.10.0-rc5-dirty #29
[  558.565121] Hardware name: Supermicro C9X299-PG300F/C9X299-PG300F, BIOS 2.4 09/08/2021
[  558.573056] RIP: 0010:dm_blk_close+0x74/0x80
[  558.577346] Code: 00 48 8b 35 46 ec b5 02 48 c7 c2 e0 b6 f4 bc bf 02 00 00 00 e8 2d d3 e1 fe eb d2 0f 0b 5b 48 c7 c7 00 06 ee bd e9 1c f5 46 00 <0f> 0b eb b7 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90
[  558.596098] RSP: 0018:ffffaffac539bea8 EFLAGS: 00010202
[  558.601318] RAX: 0000000000000400 RBX: ffff9d4ccc9ea800 RCX: 0000000000000000
[  558.608450] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffffbdee0600
[  558.615582] RBP: ffff9d4cd13fc340 R08: ffffaffac539bc48 R09: 0000000000000408
[  558.622717] R10: 0000000000000000 R11: 00000000000007e0 R12: ffff9d4cc405e400
[  558.629849] R13: 0000000000000000 R14: ffff9d4cc405e5f8 R15: ffff9d4cc5f0b640
[  558.636981] FS:  00007f1b516be980(0000) GS:ffff9d5bffe00000(0000) knlGS:0000000000000000
[  558.645068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  558.650811] CR2: 00007f1b51fd0000 CR3: 00000001066b4003 CR4: 00000000003706f0
[  558.657947] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  558.665079] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  558.672211] Call Trace:
[  558.674666]  <TASK>
[  558.676766]  ? dm_blk_close+0x74/0x80
[  558.680425]  ? __warn.cold+0x8e/0xf4
[  558.684005]  ? dm_blk_close+0x74/0x80
[  558.687673]  ? report_bug+0xea/0x170
[  558.691251]  ? handle_bug+0x3c/0x80
[  558.694742]  ? exc_invalid_op+0x17/0x70
[  558.698585]  ? asm_exc_invalid_op+0x1a/0x20
[  558.702769]  ? dm_blk_close+0x74/0x80
[  558.706436]  ? dm_blk_close+0x14/0x80
[  558.710098]  bdev_release+0xec/0x160
[  558.713673]  blkdev_release+0xc/0x20
[  558.717251]  __fput+0xe9/0x2c0
[  558.720309]  __x64_sys_close+0x37/0x80
[  558.724063]  do_syscall_64+0x4f/0x120
[  558.727727]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  558.732783] RIP: 0033:0x7f1b51918ea4
[  558.736360] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 80 3d a5 51 0e 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d
[  558.755109] RSP: 002b:00007ffff3790248 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
[  558.762674] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f1b51918ea4
[  558.769806] RDX: 0000000000000004 RSI: 000056136fe1fa00 RDI: 0000000000000013
[  558.776938] RBP: 00007ffff3790260 R08: 00000000ffffffff R09: 0000000000000000
[  558.784070] R10: 00007f1b51dfa6c0 R11: 0000000000000202 R12: 00007f1b516be7e0
[  558.791202] R13: 0000000000000000 R14: 00007ffff3790370 R15: 000056136fe23970
[  558.798329]  </TASK>
[  558.800517] ---[ end trace 0000000000000000 ]---
Bart Van Assche July 11, 2024, 5:59 p.m. UTC | #4
On 7/11/24 3:39 AM, Shinichiro Kawasaki wrote:
> It showed "Comm: (udev-worker)" is the process which keeps the DM
> device open. [ ... ] Just adding "udevadm settle" before "dmsetup
> remove" will avoid the failure.

I'm OK with either the retry loop from the patch at the start of this
email thread or adding "udevadm settle"

Thanks,

Bart.
Shinichiro Kawasaki July 18, 2024, 11:22 p.m. UTC | #5
On Jul 11, 2024 / 10:59, Bart Van Assche wrote:
> On 7/11/24 3:39 AM, Shinichiro Kawasaki wrote:
> > It showed "Comm: (udev-worker)" is the process which keeps the DM
> > device open. [ ... ] Just adding "udevadm settle" before "dmsetup
> > remove" will avoid the failure.
> 
> I'm OK with either the retry loop from the patch at the start of this
> email thread or adding "udevadm settle"

Thanks Bart. I think "udevadm settle" is the better. Will post v2.
diff mbox series

Patch

diff --git a/tests/dm/002 b/tests/dm/002
index fae3986..8ae8438 100755
--- a/tests/dm/002
+++ b/tests/dm/002
@@ -37,7 +37,7 @@  test_device() {
 	sync
 	dmsetup message dust1 0 countbadblocks
 	sync
-	dmsetup remove dust1
+	_dm_remove dust1
 
 	echo "Test complete"
 }
diff --git a/tests/dm/rc b/tests/dm/rc
index 0486db0..21a35f6 100644
--- a/tests/dm/rc
+++ b/tests/dm/rc
@@ -11,3 +11,26 @@  group_requires() {
 	_have_program dmsetup
 	_have_driver dm-mod
 }
+
+_dm_remove() {
+	local dm_dev=${1}
+	local i out
+
+	# Retry dmsetup remove command in case it fails with EBUSY because of
+	# non-zero dm open count.
+	for ((i = 0; i < 10; i++)); do
+		if out=$(dmsetup remove "${dm_dev}" 2>&1); then
+			break
+		fi
+		echo "$out" >> "$FULL"
+		if ! [[ $out =~ "Device or resource busy" ]]; then
+			echo "$out"
+			break
+		fi
+		sleep 1
+	done
+	if ((i == 10)); then
+		echo "dmsetup remove failed with EBUSY"
+	fi
+
+}