diff mbox series

usb: uas: fix usb subsystem hang after power off hub port

Message ID AE5419EAB4965843B3C0C1FE29F1FFE5872241@JPYOKXMS103.jp.sony.com (mailing list archive)
State Changes Requested
Headers show
Series usb: uas: fix usb subsystem hang after power off hub port | expand

Commit Message

Kento.A.Kobayashi@sony.com March 8, 2019, 9:13 a.m. UTC
The issue happens with following steps:
Access usb3.0/3.1 device that uses uas driver.
Power off hub port connecting device by ioctl(USBDEVFS_CONTROL).
Wait longer than 30s(scsi layer timeout period is 30s).
Execute commands like lsusb, no response and usb subsytem hangs.

After scsi layer timeout, uas_eh_bus_reset_handler works and enter
usb_reset_device. During reset, current uas_post_reset returns 1 if
uas_configure_endpoints fails with -ENODEV, then it tries to rebind
uas driver. The unbind request cannot complete because program goes
to host_not_ready in scsi_request_fn. As result, it stops at device
reset process and the lock took before reset will not be released.

The usb_reset_and_verify_device included in usb_reset_device fails
with -ENODEV after power off hub port, and the -ENODEV error will
be reported to uas_eh_bus_reset_handler and upper layer, so it
doesn't need to do rebind if -ENODEV happens.

Signed-off-by: Kento Kobayashi <Kento.A.Kobayashi@sony.com>
---
drivers/usb/storage/uas.c | 3 +++
1 file changed, 3 insertions(+)

Comments

Oliver Neukum March 8, 2019, 4:52 p.m. UTC | #1
On Fr, 2019-03-08 at 09:13 +0000, Kento.A.Kobayashi@sony.com wrote:
> The usb_reset_and_verify_device included in usb_reset_device fails
> with -ENODEV after power off hub port, and the -ENODEV error will
> be reported to uas_eh_bus_reset_handler and upper layer, so it
> doesn't need to do rebind if -ENODEV happens.

Hi,

no I am sorry, that is an assumption you just cannot make.
Anything can trigger a reset. That being SCSI is the common
case certainly, but not the only case. And in those cases
we cannot depend on upper layers doing the right thing, if
we just ignore an error.

NACK

	Sorry
		Oliver
Alan Stern March 8, 2019, 5:33 p.m. UTC | #2
On Fri, 8 Mar 2019, Oliver Neukum wrote:

> On Fr, 2019-03-08 at 09:13 +0000, Kento.A.Kobayashi@sony.com wrote:
> > The usb_reset_and_verify_device included in usb_reset_device fails
> > with -ENODEV after power off hub port, and the -ENODEV error will
> > be reported to uas_eh_bus_reset_handler and upper layer, so it
> > doesn't need to do rebind if -ENODEV happens.
> 
> Hi,
> 
> no I am sorry, that is an assumption you just cannot make.
> Anything can trigger a reset. That being SCSI is the common
> case certainly, but not the only case. And in those cases
> we cannot depend on upper layers doing the right thing, if
> we just ignore an error.
> 
> NACK
> 
> 	Sorry
> 		Oliver

Note that the reset routines in usb-storage do not make any exceptions
for -ENODEV.

Alan Stern
Kento.A.Kobayashi@sony.com March 11, 2019, 8:36 a.m. UTC | #3
Hi,

>no I am sorry, that is an assumption you just cannot make.
>Anything can trigger a reset. That being SCSI is the common case certainly, but not the only case. And in those cases we cannot depend on upper layers doing the right thing, if we just ignore an error.
While we investigate this issue, we debugged and found usb_reset_and_verify_device will return -NODEV before enter post_reset operation.
And the return value(-ENODEV) will be returned to error handler.
uas_eh_device_reset_handler->usb_reset_device -> usb_reset_and_verify_device (return -ENODEV) Then I wrote that commit message that we think even if we ignore "ENODEV" in post reset to avoid hang issue but the error will also be reported to error handler.
#If ignore an error and the error will not be reported then it is not good.

Additional information about usb-storage driver(usb/storage/usb.c) in usb_stor_post_reset() function, it always return 0 that means rebind will not be execute and this issue doesn't happen.

Regards,
Kento Kobayashi

-----Original Message-----
From: Oliver Neukum <oneukum@suse.com> 
Sent: Saturday, March 9, 2019 1:52 AM
To: Kobayashi, Kento (Sony) <Kento.A.Kobayashi@sony.com>; gregkh@linuxfoundation.org; stern@rowland.harvard.edu
Cc: usb-storage@lists.one-eyed-alien.net; linux-kernel@vger.kernel.org; linux-scsi@vger.kernel.org; linux-usb@vger.kernel.org
Subject: Re: [PATCH] usb: uas: fix usb subsystem hang after power off hub port

On Fr, 2019-03-08 at 09:13 +0000, Kento.A.Kobayashi@sony.com wrote:
> The usb_reset_and_verify_device included in usb_reset_device fails 
> with -ENODEV after power off hub port, and the -ENODEV error will be 
> reported to uas_eh_bus_reset_handler and upper layer, so it doesn't 
> need to do rebind if -ENODEV happens.

Hi,

no I am sorry, that is an assumption you just cannot make.
Anything can trigger a reset. That being SCSI is the common case certainly, but not the only case. And in those cases we cannot depend on upper layers doing the right thing, if we just ignore an error.

NACK

	Sorry
		Oliver
Oliver Neukum March 12, 2019, 3:37 p.m. UTC | #4
On Mo, 2019-03-11 at 08:36 +0000, Kento.A.Kobayashi@sony.com wrote:
> Hi,
> 
> > no I am sorry, that is an assumption you just cannot make.
> > Anything can trigger a reset. That being SCSI is the common case certainly, but not the only case. And in those cases we cannot depend on upper layers doing the right thing, if we just ignore an error.
> 
> While we investigate this issue, we debugged and found usb_reset_and_verify_device will return -NODEV before enter post_reset operation.

Yes, this can happen.

> And the return value(-ENODEV) will be returned to error handler.
> uas_eh_device_reset_handler->usb_reset_device -> usb_reset_and_verify_device (return -ENODEV) Then I wrote that commit message that we think even if we ignore "ENODEV" in post reset to avoid hang issue but the error will also be reported to error handler.
> #If ignore an error and the error will not be reported then it is not good.

Well, what do we do now? Are you saying that the state model SCSI is
using is wrong?

> Additional information about usb-storage driver(usb/storage/usb.c) in usb_stor_post_reset() function, it always return 0 that means rebind will not be execute and this issue doesn't happen.

I am afraid this is only partially correct. The device descriptors can
still fail to match.

	Regards
		Oliver
Kento.A.Kobayashi@sony.com March 15, 2019, 2:28 a.m. UTC | #5
Hi,

>> And the return value(-ENODEV) will be returned to error handler.
>> uas_eh_device_reset_handler->usb_reset_device -> usb_reset_and_verify_device (return -ENODEV) Then I wrote that commit message that we think even if we ignore "ENODEV" in post reset to avoid hang issue but the error will also be reported to error handler.
>> #If ignore an error and the error will not be reported then it is not good.
>
>Well, what do we do now? Are you saying that the state model SCSI is using is wrong?

No, it is not wrong.
But I think we may want to prevend hang by sleep peocess with continuing to take a lock at blk_execute_rq() from usb_unbind_and_rebind_marded_interfaces().

Regards,
Kento Kobayashi
Kento.A.Kobayashi@sony.com March 25, 2019, 10:21 a.m. UTC | #6
Hi,

>>> And the return value(-ENODEV) will be returned to error handler.
>>> uas_eh_device_reset_handler->usb_reset_device -> usb_reset_and_verify_device (return -ENODEV) Then I wrote that commit message that we think even if we ignore "ENODEV" in post reset to avoid hang issue but the error will also be reported to error handler.
>>> #If ignore an error and the error will not be reported then it is not good.
>>
>>Well, what do we do now? Are you saying that the state model SCSI is using is wrong?

>No, it is not wrong.
>But I think we may want to prevend hang by sleep peocess with continuing to take a lock at blk_execute_rq() from >usb_unbind_and_rebind_marded_interfaces().

Could you please you can apply this patch for mainline or not?
If not, could you please tell me why you can't allow to apply this patch?

Regards,
Kento Kobayashi
Oliver Neukum March 25, 2019, 10:34 a.m. UTC | #7
On Mo, 2019-03-25 at 10:21 +0000, Kento.A.Kobayashi@sony.com wrote:
> Hi,
> 
> > > > And the return value(-ENODEV) will be returned to error handler.
> > > > uas_eh_device_reset_handler->usb_reset_device -> usb_reset_and_verify_device (return -ENODEV) Then I wrote that commit message that we think even if we ignore "ENODEV" in post reset to avoid hang issue but the error will also be reported to error handler.
> > > > #If ignore an error and the error will not be reported then it is not good.
> > > 
> > > Well, what do we do now? Are you saying that the state model SCSI is using is wrong?
> > No, it is not wrong.
> > But I think we may want to prevend hang by sleep peocess with continuing to take a lock at blk_execute_rq() from >usb_unbind_and_rebind_marded_interfaces().
> 
> Could you please you can apply this patch for mainline or not?
> If not, could you please tell me why you can't allow to apply this patch?

Sorry,

I thought this was clear. Your patch is making the assumption that
the reset is triggered by the SCSI layer. You cannot make that
assumption, as there is an ioctl for resetting a USB device.
In case we are getting an error during the reset (our endpoints
vanish), the device driver must report that to the USB layer,
so the driver will always be disconnected.
We cannot drop errors.

	Regards
		Oliver
Kento.A.Kobayashi@sony.com March 28, 2019, 7:53 a.m. UTC | #8
Hi,

>Sorry,
>
>I thought this was clear. Your patch is making the assumption that the reset is triggered by the SCSI layer. You cannot make that assumption, as there is an ioctl for resetting a USB device.
>In case we are getting an error during the reset (our endpoints vanish), the device driver must report that to the USB layer, so the driver will always be disconnected.
>We cannot drop errors.
>
>	Regards
>		Oliver

This patch modified uas_post_reset to skip rebind operation to avoid exception while -ENODEV happens not drop error.
If uas_post_reset happens -ENODEV, usb_reset_and_verify_device must happen error.
So,when we use ioctl(USBDEVFS_RESET) to reset device, if usb_reset_and_verify_device happens error, the error will be reported through ioctl return value. 

Regards,
Kento Kobayashi
Oliver Neukum March 28, 2019, 3:15 p.m. UTC | #9
On Do, 2019-03-28 at 07:53 +0000, Kento.A.Kobayashi@sony.com wrote:
> Hi,
> 
> > Sorry,
> > 
> > I thought this was clear. Your patch is making the assumption that the reset is triggered by the SCSI layer. You cannot make that assumption, as there is an ioctl for resetting a USB device.
> > In case we are getting an error during the reset (our endpoints vanish), the device driver must report that to the USB layer, so the driver will always be disconnected.
> > We cannot drop errors.
> > 
> > 	Regards
> > 		Oliver
> 
> This patch modified uas_post_reset to skip rebind operation to avoid exception while -ENODEV happens not drop error.
> If uas_post_reset happens -ENODEV, usb_reset_and_verify_device must happen error.
> So,when we use ioctl(USBDEVFS_RESET) to reset device, if usb_reset_and_verify_device happens error, the error will be reported through ioctl return value. 

OK, It is possible that I am stupid. We must rebind if uas_post_reset()
fails. The driver will crash without the endpoints. Can you please
explain again in greater detail, what you are trying to achieve?

	Regards
		Oliver
Alan Stern March 28, 2019, 3:57 p.m. UTC | #10
On Thu, 28 Mar 2019, Oliver Neukum wrote:

> On Do, 2019-03-28 at 07:53 +0000, Kento.A.Kobayashi@sony.com wrote:
> > Hi,
> > 
> > > Sorry,
> > > 
> > > I thought this was clear. Your patch is making the assumption that the reset is triggered by the SCSI layer. You cannot make that assumption, as there is an ioctl for resetting a USB device.
> > > In case we are getting an error during the reset (our endpoints vanish), the device driver must report that to the USB layer, so the driver will always be disconnected.
> > > We cannot drop errors.
> > > 
> > > 	Regards
> > > 		Oliver
> > 
> > This patch modified uas_post_reset to skip rebind operation to avoid exception while -ENODEV happens not drop error.
> > If uas_post_reset happens -ENODEV, usb_reset_and_verify_device must happen error.
> > So,when we use ioctl(USBDEVFS_RESET) to reset device, if usb_reset_and_verify_device happens error, the error will be reported through ioctl return value. 
> 
> OK, It is possible that I am stupid. We must rebind if uas_post_reset()
> fails. The driver will crash without the endpoints. Can you please
> explain again in greater detail, what you are trying to achieve?

Actually no, the driver doesn't have to do anything if the post-reset 
method fails.  usbcore will take care of rebinding automatically.

Alan Stern
Oliver Neukum March 28, 2019, 4:49 p.m. UTC | #11
On Do, 2019-03-28 at 11:57 -0400, Alan Stern wrote:
> On Thu, 28 Mar 2019, Oliver Neukum wrote:
> 
> > On Do, 2019-03-28 at 07:53 +0000, Kento.A.Kobayashi@sony.com wrote:
> > > Hi,
> > > 
> > > > Sorry,
> > > > 
> > > > I thought this was clear. Your patch is making the assumption that the reset is triggered by the SCSI layer. You cannot make that assumption, as there is an ioctl for resetting a USB device.
> > > > In case we are getting an error during the reset (our endpoints vanish), the device driver must report that to the USB layer, so the driver will always be disconnected.
> > > > We cannot drop errors.
> > > > 
> > > >   Regards
> > > >           Oliver
> > > 
> > > This patch modified uas_post_reset to skip rebind operation to avoid exception while -ENODEV happens not drop error.
> > > If uas_post_reset happens -ENODEV, usb_reset_and_verify_device must happen error.
> > > So,when we use ioctl(USBDEVFS_RESET) to reset device, if usb_reset_and_verify_device happens error, the error will be reported through ioctl return value. 
> > 
> > OK, It is possible that I am stupid. We must rebind if uas_post_reset()
> > fails. The driver will crash without the endpoints. Can you please
> > explain again in greater detail, what you are trying to achieve?
> 
> Actually no, the driver doesn't have to do anything if the post-reset 
> method fails.  usbcore will take care of rebinding automatically.

Yes, but the rebinding is not optional. Hence, either the post_reset()
must fail to trigger it, or it will be triggered anyway.
So if the rebinding hangs the machine, I cannot see how alter
changing the return value of uas_post_reset() would help.

It looks to me like the state transitions of SCSI are too
restrictive.

	Regards
		Oliver
Alan Stern March 29, 2019, 2:13 p.m. UTC | #12
On Thu, 28 Mar 2019, Oliver Neukum wrote:

> On Do, 2019-03-28 at 11:57 -0400, Alan Stern wrote:
> > On Thu, 28 Mar 2019, Oliver Neukum wrote:
> > 
> > > On Do, 2019-03-28 at 07:53 +0000, Kento.A.Kobayashi@sony.com wrote:
> > > > Hi,
> > > > 
> > > > > Sorry,
> > > > > 
> > > > > I thought this was clear. Your patch is making the assumption that the reset is triggered by the SCSI layer. You cannot make that assumption, as there is an ioctl for resetting a USB device.
> > > > > In case we are getting an error during the reset (our endpoints vanish), the device driver must report that to the USB layer, so the driver will always be disconnected.
> > > > > We cannot drop errors.
> > > > > 
> > > > >   Regards
> > > > >           Oliver
> > > > 
> > > > This patch modified uas_post_reset to skip rebind operation to avoid exception while -ENODEV happens not drop error.
> > > > If uas_post_reset happens -ENODEV, usb_reset_and_verify_device must happen error.
> > > > So,when we use ioctl(USBDEVFS_RESET) to reset device, if usb_reset_and_verify_device happens error, the error will be reported through ioctl return value. 
> > > 
> > > OK, It is possible that I am stupid. We must rebind if uas_post_reset()
> > > fails. The driver will crash without the endpoints. Can you please
> > > explain again in greater detail, what you are trying to achieve?
> > 
> > Actually no, the driver doesn't have to do anything if the post-reset 
> > method fails.  usbcore will take care of rebinding automatically.
> 
> Yes, but the rebinding is not optional. Hence, either the post_reset()
> must fail to trigger it, or it will be triggered anyway.
> So if the rebinding hangs the machine, I cannot see how alter
> changing the return value of uas_post_reset() would help.
> 
> It looks to me like the state transitions of SCSI are too
> restrictive.

Maybe I don't understand the problem correctly.  It sounds like the
problem occurs because uas tries to rebind (Where, how, and why does it
do this?  I don't see anything special in uas_post_reset) using the
same scsi_host structure as before instead of creating a new one.  
Trying to reuse a defunct host is definitely not allowed.

But if uas didn't try to do anything special, it would be unbound from
the device, causing the scsi_host to be destroyed.  Then when usbcore
automatically attempted a rebind, uas would create a new scsi_host and
the rebind would have a chance to succeed, without hanging anything.

If this description is wrong, please explain more fully.

Alan Stern
Kento.A.Kobayashi@sony.com April 2, 2019, 12:28 a.m. UTC | #13
Hi,

>> Hi,
>> 
>> > Sorry,
>> > 
>> > I thought this was clear. Your patch is making the assumption that the reset is triggered by the SCSI layer. You cannot make that assumption, as there is an ioctl for resetting a USB device.
>> > In case we are getting an error during the reset (our endpoints vanish), the device driver must report that to the USB layer, so the driver will always be disconnected.
>> > We cannot drop errors.
>> > 
>> > 	Regards
>> > 		Oliver
>> 
>> This patch modified uas_post_reset to skip rebind operation to avoid exception while -ENODEV happens not drop error.
>> If uas_post_reset happens -ENODEV, usb_reset_and_verify_device must happen error.
>> So,when we use ioctl(USBDEVFS_RESET) to reset device, if usb_reset_and_verify_device happens error, the error will be reported through ioctl return value. 
>
>OK, It is possible that I am stupid. We must rebind if uas_post_reset() fails. The driver will crash without the endpoints. Can you please explain again in greater detail, what you are trying to achieve?

Follow is details for this patch.

Issue
- USB subsystem hangs if power off the hub port connecting UAS USB3.0/3.1 device by calling ioctl(USBDEVFS_CONTROL) to do Hub Class Request(CLEAR_FEATURE:PORT_POWER) while the device is being accessed. 
- Status of the process that is accessing the device becomes DEAD and cannot be killed.

Root Cause
- Block layer timeout happens after power off UAS USB device which is accessed as reproduce step. During timeout error handler process, scsi host state becomes SHOST_CANCEL_RECOVERY that causes IO hangs up and lock cannot be released. And in final, usb subsystem hangs up.
Follow is function call:
blk_mq_timeout_work 
  …->scsi_times_out  (… means some functions are not listed before this function.)
    …-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY) 
      … -> scsi_error_handler
        …-> uas_eh_device_reset_handler
            -> usb_lock_device_for_reset  <- take lock
              -> usb_reset_device
                …-> rebind = uas_post_reset (return 1 since ENODEV) 
                …-> usb_unbind_and_rebind_marked_interfaces (rebind=1)
                   …-> uas_disconnect  (scsi_host_set_state to SHOST_CANCEL_RECOVERY)
                        … -> scsi_queue_rq
                             -> scsi_host_queue_ready(return 0 causes IO hangs up.)
            -> usb_unlock_device          <- lock cannot be release since usb_reset_device not finish.


Countermeasure
- Make uas_post_reset doesn’t return 1 when ENODEV returns from uas_configure_endpoints since usb_unbind_and_rebind_marded_interfaces doesn’t need to do unbind/rebind operations in this situation.
blk_mq_timeout_work
  …->scsi_times_out  (… means some functions are not listed before this function.)
    …-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY) 
      … -> scsi_error_handler
       …-> uas_eh_device_reset_handler (*1)
           -> usb_lock_device_for_reset  <- take lock
             -> usb_reset_device
               -> usb_reset_and_verify_device (return ENODEV and FAILED will be reported to *1)
               -> uas_post_reset returns 0 when ENODEV => rebind=0 
               -> usb_unbind_and_rebind_marked_interfaces (rebind=0)
           -> usb_unlock_device          <- release lock


We can get error(-ENODEV) at uas_eh_device_reset_handler from usb_reset_and_verify_device.

Regards,
Kento Kobayashi
Alan Stern April 2, 2019, 2:38 p.m. UTC | #14
On Tue, 2 Apr 2019 Kento.A.Kobayashi@sony.com wrote:

> Hi,
> 
> >> Hi,
> >> 
> >> > Sorry,
> >> > 
> >> > I thought this was clear. Your patch is making the assumption that the reset is triggered by the SCSI layer. You cannot make that assumption, as there is an ioctl for resetting a USB device.
> >> > In case we are getting an error during the reset (our endpoints vanish), the device driver must report that to the USB layer, so the driver will always be disconnected.
> >> > We cannot drop errors.
> >> > 
> >> > 	Regards
> >> > 		Oliver
> >> 
> >> This patch modified uas_post_reset to skip rebind operation to avoid exception while -ENODEV happens not drop error.
> >> If uas_post_reset happens -ENODEV, usb_reset_and_verify_device must happen error.
> >> So,when we use ioctl(USBDEVFS_RESET) to reset device, if usb_reset_and_verify_device happens error, the error will be reported through ioctl return value. 
> >
> >OK, It is possible that I am stupid. We must rebind if uas_post_reset() fails. The driver will crash without the endpoints. Can you please explain again in greater detail, what you are trying to achieve?
> 
> Follow is details for this patch.
> 
> Issue
> - USB subsystem hangs if power off the hub port connecting UAS USB3.0/3.1 device by calling ioctl(USBDEVFS_CONTROL) to do Hub Class Request(CLEAR_FEATURE:PORT_POWER) while the device is being accessed. 
> - Status of the process that is accessing the device becomes DEAD and cannot be killed.
> 
> Root Cause
> - Block layer timeout happens after power off UAS USB device which is accessed as reproduce step. During timeout error handler process, scsi host state becomes SHOST_CANCEL_RECOVERY that causes IO hangs up and lock cannot be released. And in final, usb subsystem hangs up.
> Follow is function call:
> blk_mq_timeout_work 
>   …->scsi_times_out  (… means some functions are not listed before this function.)
>     …-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY) 
>       … -> scsi_error_handler
>         …-> uas_eh_device_reset_handler
>             -> usb_lock_device_for_reset  <- take lock
>               -> usb_reset_device
>                 …-> rebind = uas_post_reset (return 1 since ENODEV) 
>                 …-> usb_unbind_and_rebind_marked_interfaces (rebind=1)
>                    …-> uas_disconnect  (scsi_host_set_state to SHOST_CANCEL_RECOVERY)
>                         … -> scsi_queue_rq

How does scsi_queue_rq get called here?  As far as I can see, this 
shouldn't happen.

>                              -> scsi_host_queue_ready(return 0 causes IO hangs up.)
>             -> usb_unlock_device          <- lock cannot be release since usb_reset_device not finish.
> 
> 
> Countermeasure
> - Make uas_post_reset doesn’t return 1 when ENODEV returns from uas_configure_endpoints since usb_unbind_and_rebind_marded_interfaces doesn’t need to do unbind/rebind operations in this situation.
> blk_mq_timeout_work
>   …->scsi_times_out  (… means some functions are not listed before this function.)
>     …-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY) 
>       … -> scsi_error_handler
>        …-> uas_eh_device_reset_handler (*1)
>            -> usb_lock_device_for_reset  <- take lock
>              -> usb_reset_device
>                -> usb_reset_and_verify_device (return ENODEV and FAILED will be reported to *1)
>                -> uas_post_reset returns 0 when ENODEV => rebind=0 
>                -> usb_unbind_and_rebind_marked_interfaces (rebind=0)

The difference is that uas_disconnect wasn't called here.  But that
routine should not cause any problems -- you're always supposed to be
able to unbind a driver from a device.  So it looks like this is not
the right way to solve the problem.

Alan Stern

>            -> usb_unlock_device          <- release lock
> 
> 
> We can get error(-ENODEV) at uas_eh_device_reset_handler from usb_reset_and_verify_device.
> 
> Regards,
> Kento Kobayashi
>
Kento.A.Kobayashi@sony.com April 4, 2019, 3:57 a.m. UTC | #15
Hi,

>> Root Cause
>> - Block layer timeout happens after power off UAS USB device which is accessed as reproduce step. During timeout error handler process, scsi host state becomes SHOST_CANCEL_RECOVERY that causes IO hangs up and lock cannot be released. And in final, usb subsystem hangs up.
>> Follow is function call:
>> blk_mq_timeout_work 
>>   …->scsi_times_out  (… means some functions are not listed before this function.)
>>     …-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY) 
>>       … -> scsi_error_handler
>>         …-> uas_eh_device_reset_handler
>>             -> usb_lock_device_for_reset  <- take lock
>>               -> usb_reset_device
>>                 …-> rebind = uas_post_reset (return 1 since ENODEV) 
>>                 …-> usb_unbind_and_rebind_marked_interfaces (rebind=1)
>>                    …-> uas_disconnect  (scsi_host_set_state to SHOST_CANCEL_RECOVERY)
>>                         … -> scsi_queue_rq
>
>How does scsi_queue_rq get called here?  As far as I can see, this shouldn't happen.

We confirmed the function call path on linux 4.9 when this problem occured since we are working on it. In linux 4.9, the last function is scsi_request_fn instead of scsi_queue_rq. In staging.git, we think the scsi_queue_rq is called by follow path.
uas_disconnect
|- scsi_remove_host
 |- scsi_forget_host
  |- __scsi_remove_device
   |- device_del
    |- bus_remove_device
     |- device_release_driver
      |- device_release_driver_internal
       |- __device_release_driver
        |- drv->remove(dev) (sd_remove)  
         |- sd_shutdown
          |- sd_sync_cache
           |- scsi_execute
            |- __scsi_execute
             |- blk_execute_rq
              |- blk_execute_rq_nowait
               |- blk_mq_sched_insert_request
                |- blk_mq_run_hw_queue
                 |- __blk_mq_delay_run_hw_queue
                  |- __blk_mq_run_hw_queue
                   |- blk_mq_sched_dispatch_requests
                    |- blk_mq_dispatch_rq_list
                     |- q->mq_ops->queue_rq (scsi_queue_rq)

>> Countermeasure
>> - Make uas_post_reset doesn’t return 1 when ENODEV returns from uas_configure_endpoints since usb_unbind_and_rebind_marded_interfaces doesn’t need to do unbind/rebind operations in this situation.
>> blk_mq_timeout_work
>>   …->scsi_times_out  (… means some functions are not listed before this function.)
>>     …-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY) 
>>       … -> scsi_error_handler
>>        …-> uas_eh_device_reset_handler (*1)
>>            -> usb_lock_device_for_reset  <- take lock
>>              -> usb_reset_device
>>                -> usb_reset_and_verify_device (return ENODEV and FAILED will be reported to *1)
>>                -> uas_post_reset returns 0 when ENODEV => rebind=0 
>>                -> usb_unbind_and_rebind_marked_interfaces (rebind=0)
>
>The difference is that uas_disconnect wasn't called here.  But that routine should not cause any problems -- you're always supposed to be able to unbind a driver from a device.  So it looks like this is not the right way to solve the problem.

We confirmed usb_driver_release_interface will call usb_unbind_interface when this problem occurs.
So usb_unbind_interface will call driver disconnect callbak.

Regards,
Kento Kobayashi
Alan Stern April 4, 2019, 7:33 p.m. UTC | #16
On Thu, 4 Apr 2019 Kento.A.Kobayashi@sony.com wrote:

> Hi,
> 
> >> Root Cause
> >> - Block layer timeout happens after power off UAS USB device which is accessed as reproduce step. During timeout error handler process, scsi host state becomes SHOST_CANCEL_RECOVERY that causes IO hangs up and lock cannot be released. And in final, usb subsystem hangs up.
> >> Follow is function call:
> >> blk_mq_timeout_work 
> >>   …->scsi_times_out  (… means some functions are not listed before this function.)
> >>     …-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY) 
> >>       … -> scsi_error_handler
> >>         …-> uas_eh_device_reset_handler
> >>             -> usb_lock_device_for_reset  <- take lock
> >>               -> usb_reset_device
> >>                 …-> rebind = uas_post_reset (return 1 since ENODEV) 
> >>                 …-> usb_unbind_and_rebind_marked_interfaces (rebind=1)
> >>                    …-> uas_disconnect  (scsi_host_set_state to SHOST_CANCEL_RECOVERY)
> >>                         … -> scsi_queue_rq
  >>                              -> scsi_host_queue_ready(return 0 causes IO hangs up.)
> >
> >How does scsi_queue_rq get called here?  As far as I can see, this shouldn't happen.
> 
> We confirmed the function call path on linux 4.9 when this problem occured since we are working on it. In linux 4.9, the last function is scsi_request_fn instead of scsi_queue_rq. In staging.git, we think the scsi_queue_rq is called by follow path.
> uas_disconnect
> |- scsi_remove_host
>  |- scsi_forget_host
>   |- __scsi_remove_device
>    |- device_del
>     |- bus_remove_device
>      |- device_release_driver
>       |- device_release_driver_internal
>        |- __device_release_driver
>         |- drv->remove(dev) (sd_remove)  
>          |- sd_shutdown
>           |- sd_sync_cache
>            |- scsi_execute
... (unnecessary internal details elided)
>                     |- blk_mq_dispatch_rq_list
>                      |- q->mq_ops->queue_rq (scsi_queue_rq)

So it looks as though the SCSI subsystem doesn't like to have a reset 
handler call scsi_remove_host.  Commands dispatched by the removal 
routines are forced to wait for the reset recovery to finish, which 
won't happen until those commands have been completed.

Is this a bug in the SCSI core?  If not, we need to know what is the
right way to do things when a reset handler detects that the SCSI host
has been hot-unplugged.

James, Martin, any suggestions?

Alan Stern
Kento.A.Kobayashi@sony.com April 9, 2019, 12:28 a.m. UTC | #17
Hi,

>So it looks as though the SCSI subsystem doesn't like to have a reset handler call scsi_remove_host.  Commands dispatched by the removal routines are forced to wait for the reset recovery to finish, which won't happen until those commands have been completed.

You mean reset handeler should not call scsi_remove_host, right?
But actually, scsi_remove_host is called from reset handler in this case.

>Is this a bug in the SCSI core? 
No, I think this is a bug in uas driver.

Regards,
Kento Kobayashi
Alan Stern April 9, 2019, 1:21 a.m. UTC | #18
On Tue, 9 Apr 2019 Kento.A.Kobayashi@sony.com wrote:

> Hi,
> 
> >So it looks as though the SCSI subsystem doesn't like to have a reset handler call scsi_remove_host.  Commands dispatched by the removal routines are forced to wait for the reset recovery to finish, which won't happen until those commands have been completed.
> 
> You mean reset handeler should not call scsi_remove_host, right?
> But actually, scsi_remove_host is called from reset handler in this case.
> 
> >Is this a bug in the SCSI core? 
> No, I think this is a bug in uas driver.

Actually I was asking James and Martin, not you.

In any case, even assuming the cause of the problem is on the USB side, 
it's not clear whether the bug is in uas or in usbcore.

Alan Stern
Martin K. Petersen April 9, 2019, 2:10 a.m. UTC | #19
Alan,

> So it looks as though the SCSI subsystem doesn't like to have a reset 
> handler call scsi_remove_host.

Are you talking about a PCI device removal handler or a SCSI error
handler?

> Commands dispatched by the removal routines are forced to wait for the
> reset recovery to finish, which won't happen until those commands have
> been completed.
>
> Is this a bug in the SCSI core?  If not, we need to know what is the
> right way to do things when a reset handler detects that the SCSI host
> has been hot-unplugged.

PCI surprise removal should generally work. But it's somewhat unusual
for a SCSI host to evaporate in the middle of error handling. After all,
the main purpose of eh is to leverage the interfaces provided by the
host to try to reconnect to a target that tripped and fell off the
bus...
Alan Stern April 9, 2019, 2:44 p.m. UTC | #20
On Mon, 8 Apr 2019, Martin K. Petersen wrote:

> 
> Alan,
> 
> > So it looks as though the SCSI subsystem doesn't like to have a reset 
> > handler call scsi_remove_host.
> 
> Are you talking about a PCI device removal handler or a SCSI error
> handler?

The context of this discussion is a USB mass-storage device where the
device's port on its upstream hub has been powered off.  The
powered-off port causes an executing command to time out.  As a result
the SCSI error handler runs and calls the USB reset routine, but the
reset fails because the kernel is unable to communicate with the device
through the powered-off port.  This causes the USB reset routine to
unbind the device from its USB driver, which in turn calls
scsi_remove_host -- while the error handler is still running.

> > Commands dispatched by the removal routines are forced to wait for the
> > reset recovery to finish, which won't happen until those commands have
> > been completed.
> >
> > Is this a bug in the SCSI core?  If not, we need to know what is the
> > right way to do things when a reset handler detects that the SCSI host
> > has been hot-unplugged.
> 
> PCI surprise removal should generally work. But it's somewhat unusual
> for a SCSI host to evaporate in the middle of error handling. After all,
> the main purpose of eh is to leverage the interfaces provided by the
> host to try to reconnect to a target that tripped and fell off the
> bus...

Still, it's not impossible for a SCSI host to evaporate in the middle
of error handling, given an appropriately mistimed hot-unplug event.  
How does the SCSI layer expect this to be handled?  Should the
low-level driver wait to call scsi_remove_host until after the error
handling is finished?

What about races?  In theory, scsi_remove_host could be called just as 
the error handler is starting up.

Alan Stern
Bart Van Assche April 9, 2019, 3:16 p.m. UTC | #21
On Tue, 2019-04-09 at 10:44 -0400, Alan Stern wrote:
> On Mon, 8 Apr 2019, Martin K. Petersen wrote:
> 
> > 
> > Alan,
> > 
> > > So it looks as though the SCSI subsystem doesn't like to have a reset 
> > > handler call scsi_remove_host.
> > 
> > Are you talking about a PCI device removal handler or a SCSI error
> > handler?
> 
> The context of this discussion is a USB mass-storage device where the
> device's port on its upstream hub has been powered off.  The
> powered-off port causes an executing command to time out.  As a result
> the SCSI error handler runs and calls the USB reset routine, but the
> reset fails because the kernel is unable to communicate with the device
> through the powered-off port.  This causes the USB reset routine to
> unbind the device from its USB driver, which in turn calls
> scsi_remove_host -- while the error handler is still running.

From which context does that unbind happen? From inside a SCSI EH callback
or from the context of a workqueue? I think the former is not allowed but
that the latter is allowed. The SRP initiator driver (ib_srp.c) follows the
latter approach. See also srp_queue_remove_work().

Bart.
Alan Stern April 9, 2019, 4:45 p.m. UTC | #22
On Tue, 9 Apr 2019, Bart Van Assche wrote:

> On Tue, 2019-04-09 at 10:44 -0400, Alan Stern wrote:
> +AD4 On Mon, 8 Apr 2019, Martin K. Petersen wrote:
> +AD4 
> +AD4 +AD4 
> +AD4 +AD4 Alan,
> +AD4 +AD4 
> +AD4 +AD4 +AD4 So it looks as though the SCSI subsystem doesn't like to have a reset 
> +AD4 +AD4 +AD4 handler call scsi+AF8-remove+AF8-host.
> +AD4 +AD4 
> +AD4 +AD4 Are you talking about a PCI device removal handler or a SCSI error
> +AD4 +AD4 handler?
> +AD4 
> +AD4 The context of this discussion is a USB mass-storage device where the
> +AD4 device's port on its upstream hub has been powered off.  The
> +AD4 powered-off port causes an executing command to time out.  As a result
> +AD4 the SCSI error handler runs and calls the USB reset routine, but the
> +AD4 reset fails because the kernel is unable to communicate with the device
> +AD4 through the powered-off port.  This causes the USB reset routine to
> +AD4 unbind the device from its USB driver, which in turn calls
> +AD4 scsi+AF8-remove+AF8-host -- while the error handler is still running.
> 
> From which context does that unbind happen? From inside a SCSI EH callback
> or from the context of a workqueue? I think the former is not allowed but
> that the latter is allowed. The SRP initiator driver (ib+AF8-srp.c) follows the
> latter approach. See also srp+AF8-queue+AF8-remove+AF8-work().

The unbind happens from inside the SCSI EH callback.  If that really is
not allowed, we'll need to change it.  Or we can just change it
regardless, since the effort required is pretty small.

Kento, please try the patch below.  Does it help with your problem?

Alan Stern



Index: usb-4.x/drivers/usb/core/hub.c
===================================================================
--- usb-4.x.orig/drivers/usb/core/hub.c
+++ usb-4.x/drivers/usb/core/hub.c
@@ -5902,7 +5902,9 @@ int usb_reset_device(struct usb_device *
 					cintf->needs_binding = 1;
 			}
 		}
-		usb_unbind_and_rebind_marked_interfaces(udev);
+		/* If the reset failed, hub_wq will unbind drivers later */
+		if (ret == 0)
+			usb_unbind_and_rebind_marked_interfaces(udev);
 	}
 
 	usb_autosuspend_device(udev);
Martin K. Petersen April 10, 2019, 2:11 a.m. UTC | #23
Bart,

> From which context does that unbind happen? From inside a SCSI EH
> callback or from the context of a workqueue? I think the former is not
> allowed but that the latter is allowed. The SRP initiator driver
> (ib_srp.c) follows the latter approach.

Yeah. It's better to punt the actual tear down.
Kento.A.Kobayashi@sony.com April 15, 2019, 12:27 a.m. UTC | #24
Hi

>The unbind happens from inside the SCSI EH callback.  If that really is not allowed, we'll need to change it.  Or we can just change it regardless, since the effort required is pretty small.
>
>Kento, please try the patch below.  Does it help with your problem?

Thank you for suggestion about this problem.
I confirmed your patch fixes this problem.
I think you change policy for error handler to not calling unbind, right?
In addition, I have a question about this patch.
Could you please tell me why it should not be allowed that the unbind is occurred from eh callback?

This patch will ignore all error which is returned from usb_reset_and_verify_device.
But my patch will ignore error only being returned ENODEV case.
I think side effect of your patch is bigger than my patch.
So I want to know why the unbind is occurred from eh callback should not be allowed.

Regards,
Kento Kobayashi
Alan Stern April 15, 2019, 3:18 p.m. UTC | #25
On Mon, 15 Apr 2019 Kento.A.Kobayashi@sony.com wrote:

> Hi
> 
> >The unbind happens from inside the SCSI EH callback.  If that really is not allowed, we'll need to change it.  Or we can just change it regardless, since the effort required is pretty small.
> >
> >Kento, please try the patch below.  Does it help with your problem?
> 
> Thank you for suggestion about this problem.
> I confirmed your patch fixes this problem.

Good; I will submit it.

> I think you change policy for error handler to not calling unbind, right?

That's right.

> In addition, I have a question about this patch.
> Could you please tell me why it should not be allowed that the unbind is occurred from eh callback?

The SCSI core does not handle unbind properly when it occurs inside an 
error handler callback.  If you want to know more about why the SCSI 
core behaves this way, you should ask the SCSI developers -- I don't 
know the answer.

> This patch will ignore all error which is returned from usb_reset_and_verify_device.
> But my patch will ignore error only being returned ENODEV case.
> I think side effect of your patch is bigger than my patch.

The only other errors that usb_reset_and_verify_device() can return are 
EINVAL and EISDIR.  These error codes occur in three situations:

	The device has already been disconnected;

	The device has no parent hub (it is a root hub);

	The device is currently suspended.

The first situation is just as bad as -ENODEV.  The second cannot
happen for a USB mass storage device.  The third can happen only if an
error occurs when usb_reset_device() tries to carry out a resume, and
that's also just as bad as -ENODEV.

So although the side effects are larger than with your patch, they are 
not any worse.  Furthermore, they handle correctly some situations that 
your patch does not handle.

> So I want to know why the unbind is occurred from eh callback should not be allowed.

Ask the SCSI developers.

Alan Stern
Alan Stern April 15, 2019, 3:32 p.m. UTC | #26
On Mon, 15 Apr 2019, Alan Stern wrote:

> On Mon, 15 Apr 2019 Kento.A.Kobayashi@sony.com wrote:
> 
> > Hi
> > 
> > >The unbind happens from inside the SCSI EH callback.  If that really is not allowed, we'll need to change it.  Or we can just change it regardless, since the effort required is pretty small.
> > >
> > >Kento, please try the patch below.  Does it help with your problem?
> > 
> > Thank you for suggestion about this problem.
> > I confirmed your patch fixes this problem.
> 
> Good; I will submit it.

I forgot to ask: Is it okay to add

	Tested-by: Kento Kobayashi <Kento.A.Kobayashi@sony.com>

along with the patch?

Alan Stern
Kento.A.Kobayashi@sony.com April 16, 2019, 2:31 a.m. UTC | #27
Hi,

>On Mon, 15 Apr 2019, Alan Stern wrote:
>
>> On Mon, 15 Apr 2019 Kento.A.Kobayashi@sony.com wrote:
>> 
>> > Hi
>> > 
>> > >The unbind happens from inside the SCSI EH callback.  If that really is not allowed, we'll need to change it.  Or we can just change it regardless, since the effort required is pretty small.
>> > >
>> > >Kento, please try the patch below.  Does it help with your problem?
>> > 
>> > Thank you for suggestion about this problem.
>> > I confirmed your patch fixes this problem.
>> 
>> Good; I will submit it.

Thank you for allowing submit about fixing this problem.

>
>I forgot to ask: Is it okay to add
>
>	Tested-by: Kento Kobayashi <Kento.A.Kobayashi@sony.com>
>
>along with the patch?

Sure.
In addition, if you can, could you please add member who contribute to this problem fixing?
Cao, Jacky <Jacky.Cao@sony.com>

Regards,
Kento Kobayashi
diff mbox series

Patch

diff --git a/drivers/usb/storage/uas.c b/drivers/usb/storage/uas.c
index 36742e8..24b09fd 100644
--- a/drivers/usb/storage/uas.c
+++ b/drivers/usb/storage/uas.c
@@ -1116,6 +1116,9 @@  static int uas_post_reset(struct usb_interface *intf)

        scsi_unblock_requests(shost);

+       if (err == -ENODEV)
+               return 0;
+
        return err ? 1 : 0;
 }