diff mbox series

common/scsi_debug: make sure scsi_debug been removed correctly

Message ID 20190109054845.93810-1-yangerkun@huawei.com (mailing list archive)
State New, archived
Headers show
Series common/scsi_debug: make sure scsi_debug been removed correctly | expand

Commit Message

yangerkun Jan. 9, 2019, 5:48 a.m. UTC
Case generic/108 sometimes will fail while testing ext2, and the reson
is that rmmod scsi_debug in _put_scsi_debug_dev may randomly fail. Now
add the loop to do rmmod to make sure scsi_debug can be removed
correctly.

Signed-off-by: yangerkun <yangerkun@huawei.com>
---
 common/scsi_debug | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Comments

Dave Chinner Jan. 9, 2019, 9:02 p.m. UTC | #1
On Wed, Jan 09, 2019 at 01:48:45PM +0800, yangerkun wrote:
> Case generic/108 sometimes will fail while testing ext2, and the reson
> is that rmmod scsi_debug in _put_scsi_debug_dev may randomly fail. Now
> add the loop to do rmmod to make sure scsi_debug can be removed
> correctly.

Why does 'rmmod scsi_debug' randomly fail?

What bug does ext2 have that prevents the scsi debug module from
being released and hence removed?

Cheers,

Dave.
yangerkun Jan. 12, 2019, 2:20 a.m. UTC | #2
Dave Chinner wrote on 2019/1/10 5:02:
> On Wed, Jan 09, 2019 at 01:48:45PM +0800, yangerkun wrote:
>> Case generic/108 sometimes will fail while testing ext2, and the reson
>> is that rmmod scsi_debug in _put_scsi_debug_dev may randomly fail. Now
>> add the loop to do rmmod to make sure scsi_debug can be removed
>> correctly.
> 
> Why does 'rmmod scsi_debug' randomly fail?
> 
> What bug does ext2 have that prevents the scsi debug module from
> being released and hence removed?

It's not a bug with ext2, ever been existing in ext4 too. This patch is 
a reference to 'https://gitlab.com/cryptsetup/cryptsetup.git' commit 
d7b9ed05f0931b416c33c8eb2ff1e6efa39270ff, and the reason why need this 
is that the behavior of udev cannot be speculated, so scsi_debug may 
rmmod failed since udev scan open the device and take the reference of 
module scsi_debug.

Thanks,
Kun.
> 
> Cheers,
> 
> Dave.
>
Eryu Guan Jan. 13, 2019, 3:21 p.m. UTC | #3
On Sat, Jan 12, 2019 at 10:20:26AM +0800, yangerkun wrote:
> 
> 
> Dave Chinner wrote on 2019/1/10 5:02:
> > On Wed, Jan 09, 2019 at 01:48:45PM +0800, yangerkun wrote:
> > > Case generic/108 sometimes will fail while testing ext2, and the reson
> > > is that rmmod scsi_debug in _put_scsi_debug_dev may randomly fail. Now
> > > add the loop to do rmmod to make sure scsi_debug can be removed
> > > correctly.
> > 
> > Why does 'rmmod scsi_debug' randomly fail?
> > 
> > What bug does ext2 have that prevents the scsi debug module from
> > being released and hence removed?
> 
> It's not a bug with ext2, ever been existing in ext4 too. This patch is a
> reference to 'https://gitlab.com/cryptsetup/cryptsetup.git' commit
> d7b9ed05f0931b416c33c8eb2ff1e6efa39270ff, and the reason why need this is
> that the behavior of udev cannot be speculated, so scsi_debug may rmmod
> failed since udev scan open the device and take the reference of module
> scsi_debug.

It's probably a good thing to have the specific reason in commit log
too.

Thanks,
Eryu
Dave Chinner Jan. 15, 2019, 3:47 a.m. UTC | #4
On Sat, Jan 12, 2019 at 10:20:26AM +0800, yangerkun wrote:
> 
> 
> Dave Chinner wrote on 2019/1/10 5:02:
> >On Wed, Jan 09, 2019 at 01:48:45PM +0800, yangerkun wrote:
> >>Case generic/108 sometimes will fail while testing ext2, and the reson
> >>is that rmmod scsi_debug in _put_scsi_debug_dev may randomly fail. Now
> >>add the loop to do rmmod to make sure scsi_debug can be removed
> >>correctly.
> >
> >Why does 'rmmod scsi_debug' randomly fail?
> >
> >What bug does ext2 have that prevents the scsi debug module from
> >being released and hence removed?
> 
> It's not a bug with ext2, ever been existing in ext4 too. This patch
> is a reference to 'https://gitlab.com/cryptsetup/cryptsetup.git'
> commit d7b9ed05f0931b416c33c8eb2ff1e6efa39270ff, and the reason why
> need this is that the behavior of udev cannot be speculated, so
> scsi_debug may rmmod failed since udev scan open the device and take
> the reference of module scsi_debug.

IOWs, you copied a hack from cryptsetup tests because you didn't
know about $UDEV_SETTLE_PROG and didn't think to ask if anyone knew
a solution to this problem?

Cheers,

Dave.
yangerkun Jan. 15, 2019, 5:31 a.m. UTC | #5
Dave Chinner wrote on 2019/1/15 11:47:
> On Sat, Jan 12, 2019 at 10:20:26AM +0800, yangerkun wrote:
>>
>>
>> Dave Chinner wrote on 2019/1/10 5:02:
>>> On Wed, Jan 09, 2019 at 01:48:45PM +0800, yangerkun wrote:
>>>> Case generic/108 sometimes will fail while testing ext2, and the reson
>>>> is that rmmod scsi_debug in _put_scsi_debug_dev may randomly fail. Now
>>>> add the loop to do rmmod to make sure scsi_debug can be removed
>>>> correctly.
>>>
>>> Why does 'rmmod scsi_debug' randomly fail?
>>>
>>> What bug does ext2 have that prevents the scsi debug module from
>>> being released and hence removed?
>>
>> It's not a bug with ext2, ever been existing in ext4 too. This patch
>> is a reference to 'https://gitlab.com/cryptsetup/cryptsetup.git'
>> commit d7b9ed05f0931b416c33c8eb2ff1e6efa39270ff, and the reason why
>> need this is that the behavior of udev cannot be speculated, so
>> scsi_debug may rmmod failed since udev scan open the device and take
>> the reference of module scsi_debug.
> 
> IOWs, you copied a hack from cryptsetup tests because you didn't
> know about $UDEV_SETTLE_PROG and didn't think to ask if anyone knew
> a solution to this problem?

I have test to add udev settle before really do 'rmmod scsi_debug', and 
the result is the same, sometimes 'rmmod scsi_debug' may failed too. For 
this question, i have ask Ondrej Konzina, the question and answer are as 
follow:

Question:
"Hi,

While we are testing fs with xfstests, generic/108 will fail at some
times with the reson that rmmod find the refcnt of scsi_debug module
isn't 0 by checking /proc/modules. And this your patch is really
describe the same thing. But, i am confusing about the scene as you said:
    cleanup() {
           if [ -d "$MNT_DIR" ] ; then
               umount -f $MNT_DIR 2>/dev/null
               rmdir $MNT_DIR 2>/dev/null
           fi
           rm -f $LOCAL_FILE 2> /dev/null
-       udevadm settle >/dev/null 2>&1
-       rmmod scsi_debug 2> /dev/null
+       scsi_debug_teardown "$DEV" || exit 100
    }

After the settle, all the event has been finish, so who will trigger the
scan, and how to scan? Can you help to explain this? Thanks a lot!
"
Answer:
"Well, almost true. But by our experience udev settle command behaviour 
is not strictly defined (and also discouraged from using even by udev 
maintainers themselves IIRC). For example it will not synchronize you 
against events coming after you trigger the settle command. And you have 
zero control over when the event is added. So the long-term goal is to 
get rid of it from all tests in cryptsetup (not being on top of our todo 
list as you can clearly see if you grep scripts across tests 
subdirectory :)).

Due to many existing udev versions among many distributions we found out 
there's no more reliable way than trying to remove scsi_debug in retry 
loop. And that loop also handles some weird versions (legacy versions 
but we do sometimes test even old distros) where udev fails to remove 
device nodes from /dev from time to time."

Since settle can not resolve the problem, so there has no choice add the 
loop to do these. Maybe it is not a good way for this problem, do you 
have any suggestion?

Cheer,

Kun.
> 
> Cheers,
> 
> Dave.
>
Dave Chinner Jan. 15, 2019, 6:08 a.m. UTC | #6
On Tue, Jan 15, 2019 at 01:31:24PM +0800, yangerkun wrote:
> 
> 
> Dave Chinner wrote on 2019/1/15 11:47:
> >On Sat, Jan 12, 2019 at 10:20:26AM +0800, yangerkun wrote:
> >>
> >>
> >>Dave Chinner wrote on 2019/1/10 5:02:
> >>>On Wed, Jan 09, 2019 at 01:48:45PM +0800, yangerkun wrote:
> >>>>Case generic/108 sometimes will fail while testing ext2, and the reson
> >>>>is that rmmod scsi_debug in _put_scsi_debug_dev may randomly fail. Now
> >>>>add the loop to do rmmod to make sure scsi_debug can be removed
> >>>>correctly.
> >>>
> >>>Why does 'rmmod scsi_debug' randomly fail?
> >>>
> >>>What bug does ext2 have that prevents the scsi debug module from
> >>>being released and hence removed?
> >>
> >>It's not a bug with ext2, ever been existing in ext4 too. This patch
> >>is a reference to 'https://gitlab.com/cryptsetup/cryptsetup.git'
> >>commit d7b9ed05f0931b416c33c8eb2ff1e6efa39270ff, and the reason why
> >>need this is that the behavior of udev cannot be speculated, so
> >>scsi_debug may rmmod failed since udev scan open the device and take
> >>the reference of module scsi_debug.
> >
> >IOWs, you copied a hack from cryptsetup tests because you didn't
> >know about $UDEV_SETTLE_PROG and didn't think to ask if anyone knew
> >a solution to this problem?
> 
> I have test to add udev settle before really do 'rmmod scsi_debug',
> and the result is the same, sometimes 'rmmod scsi_debug' may failed
> too.i

So what is generating the events that udev is running that pins the
scsi debug device given that it is no longer in use?

> For this question, i have ask Ondrej Konzina, the question and
> answer are as follow:
> 
> Question:
> "Hi,
> 
> While we are testing fs with xfstests, generic/108 will fail at some
> times with the reson that rmmod find the refcnt of scsi_debug module
> isn't 0 by checking /proc/modules. And this your patch is really
> describe the same thing. But, i am confusing about the scene as you said:
>    cleanup() {
>           if [ -d "$MNT_DIR" ] ; then
>               umount -f $MNT_DIR 2>/dev/null
>               rmdir $MNT_DIR 2>/dev/null
>           fi
>           rm -f $LOCAL_FILE 2> /dev/null
> -       udevadm settle >/dev/null 2>&1
> -       rmmod scsi_debug 2> /dev/null
> +       scsi_debug_teardown "$DEV" || exit 100
>    }
> 
> After the settle, all the event has been finish, so who will trigger the
> scan, and how to scan? Can you help to explain this? Thanks a lot!
> "
> Answer:
> "Well, almost true. But by our experience udev settle command
> behaviour is not strictly defined (and also discouraged from using
> even by udev maintainers themselves IIRC). For example it will not
> synchronize you against events coming after you trigger the settle
> command.

So, in summary:

We used to have wait loops like you are proposing, then the udevadm
settle command was added and we replaced all the wait loops with
settles because that worked reliably.

Now we're at the stage where udevadm settle doesn't work any more,
the maintainers have effectively deprecated it without a
replacement, and so we have to go back to hard coding hacky wait
loops everywhere to retry until udev releases whatever random
devices it holds references to?

Forget I said anything, I'm rapidly coming to the conclusion that I
have far too high standards for writing "working" software these
days.

-Dave.
diff mbox series

Patch

diff --git a/common/scsi_debug b/common/scsi_debug
index d9aa0bd..890ec68 100644
--- a/common/scsi_debug
+++ b/common/scsi_debug
@@ -45,11 +45,13 @@  _put_scsi_debug_dev()
 {
 	lsmod | grep -wq scsi_debug || return
 
-	n=2
+	n=15
 	# use redirection not -q option of modprobe here, because -q of old
 	# modprobe is only quiet when the module is not found, not when the
 	# module is in use.
-	while [ $n -ge 0 ] && ! modprobe -nr scsi_debug >/dev/null 2>&1; do
+	while [ $n -ge 0 ]; do
+		modprobe -r scsi_debug >/dev/null 2>&1
+		[ $? -eq 0 ] && return 0
 		sleep 1
 		n=$((n-1))
 	done