Message ID | ZEllnjMKT8ulZbJh@sakura (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Reproducible deadlock when usb-storage scsi command timeouts twice | expand |
[Adding linux-scsi and Martin to the CC: list, since this looks like it might be a problem in the SCSI core] On Wed, Apr 26, 2023 at 07:55:42PM +0200, Maxime Bizon wrote: > > Hello, > > I have a faulty 128MB USB stick that happens to randomly timeout when > reading a specific block. > > After the first timeout, the scsi layer retries the command. If that > retry work, then everything goes well. But if the retried command > timeouts as well, then I get a deadlock. What version of the kernel are you using? > To reproduce fast & reliable, I'm using these settings: > > # echo 10 >/proc/sys/kernel/hung_task_timeout_secs > # echo 0x3f > /proc/sys/dev/scsi/logging_level > # cd /sys/block/sda/queue/ > # echo none > scheduler > # echo 1 > nr_requests > # echo 1 > nomerges > # echo 4 > max_sectors_kb > # echo 5000 > io_timeout > # cat /dev/sda > /dev/null > > > dmesg with timestamps so it's easy to spot the timeouts: > [ 211.733162] *** thread awakened > [ 211.733174] Command READ_10 (10 bytes) > [ 211.733183] bytes: 28 00 00 02 02 00 00 00 08 00 > [ 211.733192] Bulk Command S 0x43425355 T 0xb8cb L 4096 F 128 Trg 0 LUN 0 CL 10 > [ 211.733204] xfer 31 bytes > [ 211.733288] Status code 0; transferred 31/31 > [ 211.733300] -- transfer complete > [ 211.733306] Bulk command transfer result=0 > [ 211.733313] xfer 4096 bytes, 1 entries > [ 218.089304] sd 0:0:0:0: [sda] tag#0 abort scheduled > [ 218.109297] sd 0:0:0:0: [sda] tag#0 aborting command > [ 218.109315] command_abort called > [ 218.109324] -- cancelling sg request > [ 218.112380] Status code -104; transferred 3072/4096 > [ 218.112393] -- transfer cancelled > [ 218.112400] Bulk data transfer result 0x4 > [ 218.112407] -- command was aborted > [ 218.209278] usb 1-1.2: reset high-speed USB device number 3 using orion-ehci > [ 218.359923] usb_reset_device returns 0 > [ 218.359936] scsi command aborted > [ 218.359947] *** thread sleeping > [ 218.359964] sd 0:0:0:0: [sda] tag#0 retry aborted command > [ 218.399298] *** thread awakened > [ 218.399311] Command READ_10 (10 bytes) > [ 218.399320] bytes: 28 00 00 02 02 00 00 00 08 00 > [ 218.399330] Bulk Command S 0x43425355 T 0xb8cc L 4096 F 128 Trg 0 LUN 0 CL 10 > [ 218.399342] xfer 31 bytes > [ 218.399544] Status code 0; transferred 31/31 > [ 218.399556] -- transfer complete > [ 218.399562] Bulk command transfer result=0 > [ 218.399570] xfer 4096 bytes, 1 entries > [ 225.129297] sd 0:0:0:0: [sda] tag#0 previous abort failed > [ 225.129337] scsi host0: Waking error handler thread > [ 225.129358] scsi host0: scsi_eh_0: waking up 0/1/1 > [ 225.129375] scsi host0: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1 > [ 225.129387] scsi host0: Total of 1 commands on 1 devices require eh work > [ 225.129402] sd 0:0:0:0: scsi_eh_0: Sending BDR > [ 225.129414] device_reset called > [ 235.369290] INFO: task scsi_eh_0:173 blocked for more than 10 seconds. > [ 235.369311] Not tainted 6.3.0-00615-gffe64935a4a2 #9 > [ 235.369320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 235.369327] task:scsi_eh_0 state:D stack:0 pid:173 ppid:2 flags:0x00000000 > [ 235.369350] __schedule from schedule+0x70/0xac > [ 235.369374] schedule from schedule_preempt_disabled+0x24/0x34 > [ 235.369393] schedule_preempt_disabled from __mutex_lock.constprop.0+0x14c/0x280 > [ 235.369412] __mutex_lock.constprop.0 from device_reset+0x28/0x64 > [ 235.369433] device_reset from scsi_try_bus_device_reset+0x24/0x58 > [ 235.369452] scsi_try_bus_device_reset from scsi_eh_ready_devs+0x2f0/0x97c > [ 235.369470] scsi_eh_ready_devs from scsi_error_handler+0x238/0x49c > [ 235.369488] scsi_error_handler from kthread+0xc4/0xdc > [ 235.369507] kthread from ret_from_fork+0x14/0x3c > [ 235.369567] INFO: task usb-storage:176 blocked for more than 10 seconds. > [ 235.369576] Not tainted 6.3.0-00615-gffe64935a4a2 #9 > [ 235.369583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 235.369590] task:usb-storage state:D stack:0 pid:176 ppid:2 flags:0x00000000 > [ 235.369606] __schedule from schedule+0x70/0xac > [ 235.369622] schedule from schedule_timeout+0x18/0xd0 > [ 235.369641] schedule_timeout from __wait_for_common+0xc0/0x13c > [ 235.369660] __wait_for_common from usb_sg_wait+0x10c/0x118 > [ 235.369677] usb_sg_wait from usb_stor_bulk_transfer_sglist+0xc4/0x118 > [ 235.369695] usb_stor_bulk_transfer_sglist from usb_stor_bulk_srb+0x24/0x3c > [ 235.369713] usb_stor_bulk_srb from usb_stor_Bulk_transport+0x164/0x44c > [ 235.369731] usb_stor_Bulk_transport from usb_stor_invoke_transport+0x20/0x5c4 > [ 235.369750] usb_stor_invoke_transport from usb_stor_control_thread+0x1a4/0x2bc > [ 235.369769] usb_stor_control_thread from kthread+0xc4/0xdc > [ 235.369786] kthread from ret_from_fork+0x14/0x3c > > > Turns out eh_device_reset_handler() is called with a pending command > (srb == us->srb), and I don't know if the usb code was expecting > eh_abort_handler() to be called first. usb-storage definitely expects any pending command to be aborted before a reset is attempted. > This patch fixes the issue, not sure if it's correct: > > --- a/drivers/usb/storage/scsiglue.c > +++ b/drivers/usb/storage/scsiglue.c > @@ -455,6 +455,9 @@ static int device_reset(struct scsi_cmnd *srb) > > usb_stor_dbg(us, "%s called\n", __func__); > > + if (us->srb == srb) > + command_abort(srb); > + > /* lock the device pointers and do the reset */ > mutex_lock(&(us->dev_mutex)); > result = us->transport_reset(us); Maybe... But it would be better to check first whether the SCSI core is supposed to be reusing an active srb in this way. Martin, can tell us what is supposed to happen here? Is the eh_device_reset_handler routine supposed to be called with a scsi_cmnd for a currently active command? Alan Stern
On Wednesday 26 Apr 2023 à 15:20:07 (-0400), Alan Stern wrote:
> What version of the kernel are you using?
6.3
On 4/26/23 12:20, Alan Stern wrote: >> This patch fixes the issue, not sure if it's correct: >> >> --- a/drivers/usb/storage/scsiglue.c >> +++ b/drivers/usb/storage/scsiglue.c >> @@ -455,6 +455,9 @@ static int device_reset(struct scsi_cmnd *srb) >> >> usb_stor_dbg(us, "%s called\n", __func__); >> >> + if (us->srb == srb) >> + command_abort(srb); >> + >> /* lock the device pointers and do the reset */ >> mutex_lock(&(us->dev_mutex)); >> result = us->transport_reset(us); > > Maybe... But it would be better to check first whether the SCSI core is > supposed to be reusing an active srb in this way. > > Martin, can tell us what is supposed to happen here? Is the > eh_device_reset_handler routine supposed to be called with a scsi_cmnd > for a currently active command? Hi Alan, I'm not aware of any other .eh_device_reset_handler implementation that aborts the command that is passed to this callback before it aborts other SCSI commands. However, I'm not aware of an equivalent of us_data.dev_mutex in other SCSI LLDs either. Maybe this deadlock is specific to the USB storage handler? Martin, feel free to correct me if I got anything wrong. Bart.
On Sun, Apr 30, 2023 at 12:39:12PM -0700, Bart Van Assche wrote: > On 4/26/23 12:20, Alan Stern wrote: > > > This patch fixes the issue, not sure if it's correct: > > > > > > --- a/drivers/usb/storage/scsiglue.c > > > +++ b/drivers/usb/storage/scsiglue.c > > > @@ -455,6 +455,9 @@ static int device_reset(struct scsi_cmnd *srb) > > > usb_stor_dbg(us, "%s called\n", __func__); > > > + if (us->srb == srb) > > > + command_abort(srb); > > > + > > > /* lock the device pointers and do the reset */ > > > mutex_lock(&(us->dev_mutex)); > > > result = us->transport_reset(us); > > > > Maybe... But it would be better to check first whether the SCSI core is > > supposed to be reusing an active srb in this way. > > > > Martin, can tell us what is supposed to happen here? Is the > > eh_device_reset_handler routine supposed to be called with a scsi_cmnd > > for a currently active command? > > Hi Alan, > > I'm not aware of any other .eh_device_reset_handler implementation that > aborts the command that is passed to this callback before it aborts other > SCSI commands. That's not a good description of what's happening. For one thing, usb-storage can process only one command at a time, so the "before it aborts other SCSI commands" part doesn't apply. In fact, the proposed patch aborts the current command (which happens to be the same as the one passed to the .eh_device_reset_handler routine) before beginning a reset. > However, I'm not aware of an equivalent of us_data.dev_mutex > in other SCSI LLDs either. Maybe this deadlock is specific to the USB > storage handler? The deadlock, if that's what it is, is related to the fact that usb-storage does only one thing at a time. The mutex is there merely to enforce this; removing it wouldn't help matters since the driver still wouldn't be capable of doing more than one thing at a time. The underlying unstated issue is that usb-storage doesn't have any timeouts of its own. If a device fails to complete a command properly and the SCSI error handler doesn't abort the command, then the command will never finish and the driver will hang. (Until the device is hot-unplugged, of course.) I suppose a bus reset would probably cause an ongoing command to fail. A device reset almost certainly would not, since most device implementations of the USB bulk-only mass-storage protocol do a very bad job of handling device resets. Either way, it's a lot cleaner simply to abort the command before starting the reset. If usb-storage can't depend on the SCSI error handler to do this for us, it will have to do so itself. Alan Stern > Martin, feel free to correct me if I got anything wrong. > > Bart.
On Wed, Apr 26, 2023 at 03:20:07PM -0400, Alan Stern wrote: > > dmesg with timestamps so it's easy to spot the timeouts: > > > [ 211.733162] *** thread awakened > > [ 211.733174] Command READ_10 (10 bytes) > > [ 211.733183] bytes: 28 00 00 02 02 00 00 00 08 00 > > [ 211.733192] Bulk Command S 0x43425355 T 0xb8cb L 4096 F 128 Trg 0 LUN 0 CL 10 > > [ 211.733204] xfer 31 bytes > > [ 211.733288] Status code 0; transferred 31/31 > > [ 211.733300] -- transfer complete > > [ 211.733306] Bulk command transfer result=0 > > [ 211.733313] xfer 4096 bytes, 1 entries > > [ 218.089304] sd 0:0:0:0: [sda] tag#0 abort scheduled > > [ 218.109297] sd 0:0:0:0: [sda] tag#0 aborting command > > [ 218.109315] command_abort called > > [ 218.109324] -- cancelling sg request > > [ 218.112380] Status code -104; transferred 3072/4096 > > [ 218.112393] -- transfer cancelled > > [ 218.112400] Bulk data transfer result 0x4 > > [ 218.112407] -- command was aborted > > [ 218.209278] usb 1-1.2: reset high-speed USB device number 3 using orion-ehci > > [ 218.359923] usb_reset_device returns 0 > > [ 218.359936] scsi command aborted > > [ 218.359947] *** thread sleeping > > [ 218.359964] sd 0:0:0:0: [sda] tag#0 retry aborted command > > [ 218.399298] *** thread awakened > > [ 218.399311] Command READ_10 (10 bytes) > > [ 218.399320] bytes: 28 00 00 02 02 00 00 00 08 00 > > [ 218.399330] Bulk Command S 0x43425355 T 0xb8cc L 4096 F 128 Trg 0 LUN 0 CL 10 > > [ 218.399342] xfer 31 bytes > > [ 218.399544] Status code 0; transferred 31/31 > > [ 218.399556] -- transfer complete > > [ 218.399562] Bulk command transfer result=0 > > [ 218.399570] xfer 4096 bytes, 1 entries > > [ 225.129297] sd 0:0:0:0: [sda] tag#0 previous abort failed > > [ 225.129337] scsi host0: Waking error handler thread > > [ 225.129358] scsi host0: scsi_eh_0: waking up 0/1/1 > > [ 225.129375] scsi host0: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1 > > [ 225.129387] scsi host0: Total of 1 commands on 1 devices require eh work > > [ 225.129402] sd 0:0:0:0: scsi_eh_0: Sending BDR > > [ 225.129414] device_reset called > > [ 235.369290] INFO: task scsi_eh_0:173 blocked for more than 10 seconds. > > [ 235.369311] Not tainted 6.3.0-00615-gffe64935a4a2 #9 > > [ 235.369320] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > [ 235.369327] task:scsi_eh_0 state:D stack:0 pid:173 ppid:2 flags:0x00000000 > > [ 235.369350] __schedule from schedule+0x70/0xac > > [ 235.369374] schedule from schedule_preempt_disabled+0x24/0x34 > > [ 235.369393] schedule_preempt_disabled from __mutex_lock.constprop.0+0x14c/0x280 > > [ 235.369412] __mutex_lock.constprop.0 from device_reset+0x28/0x64 > > [ 235.369433] device_reset from scsi_try_bus_device_reset+0x24/0x58 > > [ 235.369452] scsi_try_bus_device_reset from scsi_eh_ready_devs+0x2f0/0x97c > > [ 235.369470] scsi_eh_ready_devs from scsi_error_handler+0x238/0x49c > > [ 235.369488] scsi_error_handler from kthread+0xc4/0xdc > > [ 235.369507] kthread from ret_from_fork+0x14/0x3c > > [ 235.369567] INFO: task usb-storage:176 blocked for more than 10 seconds. > > [ 235.369576] Not tainted 6.3.0-00615-gffe64935a4a2 #9 > > [ 235.369583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > [ 235.369590] task:usb-storage state:D stack:0 pid:176 ppid:2 flags:0x00000000 > > [ 235.369606] __schedule from schedule+0x70/0xac > > [ 235.369622] schedule from schedule_timeout+0x18/0xd0 > > [ 235.369641] schedule_timeout from __wait_for_common+0xc0/0x13c > > [ 235.369660] __wait_for_common from usb_sg_wait+0x10c/0x118 > > [ 235.369677] usb_sg_wait from usb_stor_bulk_transfer_sglist+0xc4/0x118 > > [ 235.369695] usb_stor_bulk_transfer_sglist from usb_stor_bulk_srb+0x24/0x3c > > [ 235.369713] usb_stor_bulk_srb from usb_stor_Bulk_transport+0x164/0x44c > > [ 235.369731] usb_stor_Bulk_transport from usb_stor_invoke_transport+0x20/0x5c4 > > [ 235.369750] usb_stor_invoke_transport from usb_stor_control_thread+0x1a4/0x2bc > > [ 235.369769] usb_stor_control_thread from kthread+0xc4/0xdc > > [ 235.369786] kthread from ret_from_fork+0x14/0x3c > > > > > > Turns out eh_device_reset_handler() is called with a pending command > > (srb == us->srb), and I don't know if the usb code was expecting > > eh_abort_handler() to be called first. > > usb-storage definitely expects any pending command to be aborted before > a reset is attempted. From a cursory look at the logs above, SCSI ML does just try that: > > [ 218.089304] sd 0:0:0:0: [sda] tag#0 abort scheduled > > [ 218.109297] sd 0:0:0:0: [sda] tag#0 aborting command calls `hostt->eh_abort_handler()` on the late request, and retries it after success. > > [ 218.359964] sd 0:0:0:0: [sda] tag#0 retry aborted command > > [ 225.129297] sd 0:0:0:0: [sda] tag#0 previous abort failed but it times out again, then we go straight into EH: > > [ 225.129337] scsi host0: Waking error handler thread > > [ 225.129358] scsi host0: scsi_eh_0: waking up 0/1/1 > > [ 225.129375] scsi host0: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1 > > [ 225.129387] scsi host0: Total of 1 commands on 1 devices require eh work > > [ 225.129402] sd 0:0:0:0: scsi_eh_0: Sending BDR IIRC in the past we used to call Abort a second time from within the EH thread before trying the different resets, but that was removed at some point a couple of years ago. Now we add the command straight to the EH list, and start with the TMF LUN reset, which ought to implicitly abort the command as well on the target. > > This patch fixes the issue, not sure if it's correct: > > > > --- a/drivers/usb/storage/scsiglue.c > > +++ b/drivers/usb/storage/scsiglue.c > > @@ -455,6 +455,9 @@ static int device_reset(struct scsi_cmnd *srb) > > > > usb_stor_dbg(us, "%s called\n", __func__); > > > > + if (us->srb == srb) > > + command_abort(srb); > > + > > /* lock the device pointers and do the reset */ > > mutex_lock(&(us->dev_mutex)); > > result = us->transport_reset(us); > > Maybe... But it would be better to check first whether the SCSI core is > supposed to be reusing an active srb in this way. IIRC it does indeed doe that by design. It saves the old command's meta-data, and piggy-backs the TMF onto the existing request to push it into the device driver. Hannes probably knows the complete story. > > Martin, can tell us what is supposed to happen here? Is the > eh_device_reset_handler routine supposed to be called with a scsi_cmnd > for a currently active command?
On 03.05.23 12:24, Benjamin Block wrote: > On Wed, Apr 26, 2023 at 03:20:07PM -0400, Alan Stern wrote: > From a cursory look at the logs above, SCSI ML does just try that: > >>> [ 218.089304] sd 0:0:0:0: [sda] tag#0 abort scheduled >>> [ 218.109297] sd 0:0:0:0: [sda] tag#0 aborting command > > calls `hostt->eh_abort_handler()` on the late request, and retries it > after success. > >>> [ 218.359964] sd 0:0:0:0: [sda] tag#0 retry aborted command >>> [ 225.129297] sd 0:0:0:0: [sda] tag#0 previous abort failed > > but it times out again, then we go straight into EH: And that is problematic to usb-storage > >>> [ 225.129337] scsi host0: Waking error handler thread >>> [ 225.129358] scsi host0: scsi_eh_0: waking up 0/1/1 >>> [ 225.129375] scsi host0: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1 >>> [ 225.129387] scsi host0: Total of 1 commands on 1 devices require eh work >>> [ 225.129402] sd 0:0:0:0: scsi_eh_0: Sending BDR > > IIRC in the past we used to call Abort a second time from within the EH > thread before trying the different resets, but that was removed at some > point a couple of years ago. Now we add the command straight to the EH > list, and start with the TMF LUN reset, which ought to implicitly abort > the command as well on the target. usb-storage can do a reset only on the USB device level, which translates to a bus reset on the SCSI level. And we are supposed to cancel any communication with the device before that. Regards Oliver
On Wed, May 03, 2023 at 12:55:03PM +0200, Oliver Neukum wrote: > On 03.05.23 12:24, Benjamin Block wrote: > > On Wed, Apr 26, 2023 at 03:20:07PM -0400, Alan Stern wrote: > > > From a cursory look at the logs above, SCSI ML does just try that: > > > >>> [ 218.089304] sd 0:0:0:0: [sda] tag#0 abort scheduled > >>> [ 218.109297] sd 0:0:0:0: [sda] tag#0 aborting command > > > > calls `hostt->eh_abort_handler()` on the late request, and retries it > > after success. > > > >>> [ 218.359964] sd 0:0:0:0: [sda] tag#0 retry aborted command > >>> [ 225.129297] sd 0:0:0:0: [sda] tag#0 previous abort failed > > > > but it times out again, then we go straight into EH: > > And that is problematic to usb-storage > > > >>> [ 225.129337] scsi host0: Waking error handler thread > >>> [ 225.129358] scsi host0: scsi_eh_0: waking up 0/1/1 > >>> [ 225.129375] scsi host0: scsi_eh_prt_fail_stats: cmds failed: 0, cancel: 1 > >>> [ 225.129387] scsi host0: Total of 1 commands on 1 devices require eh work > >>> [ 225.129402] sd 0:0:0:0: scsi_eh_0: Sending BDR > > > > IIRC in the past we used to call Abort a second time from within the EH > > thread before trying the different resets, but that was removed at some > > point a couple of years ago. Seems like I misremembered. Can't find the commit I was thinking happened, and the only thing that changed was that aborts moved out of the EH thread and be asynchronous. > > Now we add the command straight to the EH > > list, and start with the TMF LUN reset, which ought to implicitly abort > > the command as well on the target. > > usb-storage can do a reset only on the USB device level, > which translates to a bus reset on the SCSI level. > > And we are supposed to cancel any communication with the device > before that. Is that a limitation of the devices or drivers? Because then you don't match SCSI semantics for LU reset - which aborts all running commands on that scope among things. Which might explain the reason/choice behind this unexpected behavior for you. On random thought I had: in theory you could implement your own EH strategy handler if the default one doesn't work for you. ATA and SAS do so. [drivers/scsi/scsi_error.c:2285 `shost->transportt->eh_strategy_handler()`] This can re-use parts/all of the existing escalation sequence in `scsi_eh_ready_devs()`. But that's no short-term fix.
On 03.05.23 14:51, Benjamin Block wrote: >> usb-storage can do a reset only on the USB device level, >> which translates to a bus reset on the SCSI level. >> >> And we are supposed to cancel any communication with the device >> before that. > > Is that a limitation of the devices or drivers? Because then you don't > match SCSI semantics for LU reset - which aborts all running commands We do not support a LUN reset. That's a limitation of the protocol. If something goes wrong you need to reset the whole USB device, which corresponds to a host adaptor on the SCSI level. > on that scope among things. Which might explain the reason/choice behind > this unexpected behavior for you. For the device a reset presumably does wipe out the command currently under execution. The problem is within the driver. It thinks that a command is still active. And we are limited to one command at a time (on the whole bus - again protocol limitation) > On random thought I had: in theory you could implement your own EH > strategy handler if the default one doesn't work for you. ATA and SAS do so. > [drivers/scsi/scsi_error.c:2285 `shost->transportt->eh_strategy_handler()`] > This can re-use parts/all of the existing escalation sequence in > `scsi_eh_ready_devs()`. > > But that's no short-term fix. That looks like using a sledge hammer. Regards Oliver
On Wed, May 03, 2023 at 03:54:45PM +0200, Oliver Neukum wrote: > On 03.05.23 14:51, Benjamin Block wrote: > > > > usb-storage can do a reset only on the USB device level, > > > which translates to a bus reset on the SCSI level. > > > > > > And we are supposed to cancel any communication with the device > > > before that. > > > > Is that a limitation of the devices or drivers? Because then you don't > > match SCSI semantics for LU reset - which aborts all running commands > > We do not support a LUN reset. That's a limitation of the protocol. > If something goes wrong you need to reset the whole USB device, which > corresponds to a host adaptor on the SCSI level. There is the Bulk-Only Mass Storage Reset request, which is what usb-storage issues when asked to perform a device reset. But few devices implement it correctly. > > on that scope among things. Which might explain the reason/choice behind > > this unexpected behavior for you. > > For the device a reset presumably does wipe out the command currently > under execution. The problem is within the driver. It thinks that > a command is still active. And we are limited to one command at a time > (on the whole bus - again protocol limitation) > > > On random thought I had: in theory you could implement your own EH > > strategy handler if the default one doesn't work for you. ATA and SAS do so. > > [drivers/scsi/scsi_error.c:2285 `shost->transportt->eh_strategy_handler()`] > > This can re-use parts/all of the existing escalation sequence in > > `scsi_eh_ready_devs()`. > > > > But that's no short-term fix. > > That looks like using a sledge hammer. I think the best answer is to accept the patch that started this email thread, perhaps with a minor change. The driver can easily abort the currently running command (if any) on its own before starting a reset. Alan Stern
On Wed, May 03, 2023 at 10:25:10AM -0400, Alan Stern wrote: > On Wed, May 03, 2023 at 03:54:45PM +0200, Oliver Neukum wrote: > > For the device a reset presumably does wipe out the command currently > > under execution. The problem is within the driver. It thinks that > > a command is still active. And we are limited to one command at a time > > (on the whole bus - again protocol limitation) > > > > > On random thought I had: in theory you could implement your own EH > > > strategy handler if the default one doesn't work for you. ATA and SAS do so. > > > [drivers/scsi/scsi_error.c:2285 `shost->transportt->eh_strategy_handler()`] > > > This can re-use parts/all of the existing escalation sequence in > > > `scsi_eh_ready_devs()`. > > > > > > But that's no short-term fix. > > > > That looks like using a sledge hammer. > > I think the best answer is to accept the patch that started this email > thread, perhaps with a minor change. The driver can easily abort the > currently running command (if any) on its own before starting a reset. I don't think we would add an other layer of aborts in front of device/LUN reset for all SCSI targets. That's just overkill for - it seems - everything but USB storage, and would slow down error recovery considerable in some cases. If this is supposed to be done in the SCSI ML, it would have to be a quirk/option IMO.
On Thu, May 04, 2023 at 08:52:26AM +0000, Benjamin Block wrote: > On Wed, May 03, 2023 at 10:25:10AM -0400, Alan Stern wrote: > > I think the best answer is to accept the patch that started this email > > thread, perhaps with a minor change. The driver can easily abort the > > currently running command (if any) on its own before starting a reset. > > I don't think we would add an other layer of aborts in front of > device/LUN reset for all SCSI targets. That's just overkill for - it > seems - everything but USB storage, and would slow down error recovery > considerable in some cases. This is consistent with what I wrote. The patch that started this email thread made a change to the usb-storage driver; it did not touch any part of the SCSI core. Maxime, would you like to submit a revised version of your patch? The key difference is that it should abort the currently executing command (if there is one), regardless of whether the srb value matches. Alan Stern > If this is supposed to be done in the SCSI ML, it would have to be a > quirk/option IMO. > > -- > Best Regards, Benjamin Block / Linux on IBM Z Kernel Development > IBM Deutschland Research & Development GmbH / https://www.ibm.com/privacy > Vors. Aufs.-R.: Gregor Pillen / Geschäftsführung: David Faller > Sitz der Ges.: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294
On Thursday 04 May 2023 à 10:05:38 (-0400), Alan Stern wrote: > Maxime, would you like to submit a revised version of your patch? The > key difference is that it should abort the currently executing command > (if there is one), regardless of whether the srb value matches. Yes before I do a format submission, is this what you have in mind ? diff --git a/drivers/usb/storage/scsiglue.c b/drivers/usb/storage/scsiglue.c index 8931df5a85fd..380b04273969 100644 --- a/drivers/usb/storage/scsiglue.c +++ b/drivers/usb/storage/scsiglue.c @@ -406,10 +406,8 @@ static DEF_SCSI_QCMD(queuecommand) ***********************************************************************/ /* Command timeout and abort */ -static int command_abort(struct scsi_cmnd *srb) +static int command_abort_matching(struct us_data *us, struct scsi_cmnd *srb_match) { - struct us_data *us = host_to_us(srb->device->host); - usb_stor_dbg(us, "%s called\n", __func__); /* @@ -418,10 +416,17 @@ static int command_abort(struct scsi_cmnd *srb) */ scsi_lock(us_to_host(us)); - /* Is this command still active? */ - if (us->srb != srb) { + /* is there any active pending command to abort ? */ + if (!us->srb) { scsi_unlock(us_to_host(us)); usb_stor_dbg(us, "-- nothing to abort\n"); + return SUCCESS; + } + + /* Does the command match the passed srb if any ? */ + if (srb_match && us->srb != srb_match) { + scsi_unlock(us_to_host(us)); + usb_stor_dbg(us, "-- pending command mismatch\n"); return FAILED; } @@ -444,6 +449,16 @@ static int command_abort(struct scsi_cmnd *srb) return SUCCESS; } +static int command_abort_any(struct us_data *us) +{ + return command_abort_matching(us, NULL); +} + +static int command_abort(struct scsi_cmnd *srb) +{ + return command_abort_matching(host_to_us(srb->device->host), srb); +} + /* * This invokes the transport reset mechanism to reset the state of the * device @@ -455,6 +470,8 @@ static int device_reset(struct scsi_cmnd *srb) usb_stor_dbg(us, "%s called\n", __func__); + command_abort_any(us); + /* lock the device pointers and do the reset */ mutex_lock(&(us->dev_mutex)); result = us->transport_reset(us);
On Thu, May 04, 2023 at 04:41:11PM +0200, Maxime Bizon wrote: > > On Thursday 04 May 2023 à 10:05:38 (-0400), Alan Stern wrote: > > > Maxime, would you like to submit a revised version of your patch? The > > key difference is that it should abort the currently executing command > > (if there is one), regardless of whether the srb value matches. > > Yes > > before I do a format submission, is this what you have in mind ? Yes, except that I would not make command_abort_any() a separate routine. Just put it inline in device_reset(). Alan Stern > > diff --git a/drivers/usb/storage/scsiglue.c b/drivers/usb/storage/scsiglue.c > index 8931df5a85fd..380b04273969 100644 > --- a/drivers/usb/storage/scsiglue.c > +++ b/drivers/usb/storage/scsiglue.c > @@ -406,10 +406,8 @@ static DEF_SCSI_QCMD(queuecommand) > ***********************************************************************/ > > /* Command timeout and abort */ > -static int command_abort(struct scsi_cmnd *srb) > +static int command_abort_matching(struct us_data *us, struct scsi_cmnd *srb_match) > { > - struct us_data *us = host_to_us(srb->device->host); > - > usb_stor_dbg(us, "%s called\n", __func__); > > /* > @@ -418,10 +416,17 @@ static int command_abort(struct scsi_cmnd *srb) > */ > scsi_lock(us_to_host(us)); > > - /* Is this command still active? */ > - if (us->srb != srb) { > + /* is there any active pending command to abort ? */ > + if (!us->srb) { > scsi_unlock(us_to_host(us)); > usb_stor_dbg(us, "-- nothing to abort\n"); > + return SUCCESS; > + } > + > + /* Does the command match the passed srb if any ? */ > + if (srb_match && us->srb != srb_match) { > + scsi_unlock(us_to_host(us)); > + usb_stor_dbg(us, "-- pending command mismatch\n"); > return FAILED; > } > > @@ -444,6 +449,16 @@ static int command_abort(struct scsi_cmnd *srb) > return SUCCESS; > } > > +static int command_abort_any(struct us_data *us) > +{ > + return command_abort_matching(us, NULL); > +} > + > +static int command_abort(struct scsi_cmnd *srb) > +{ > + return command_abort_matching(host_to_us(srb->device->host), srb); > +} > + > /* > * This invokes the transport reset mechanism to reset the state of the > * device > @@ -455,6 +470,8 @@ static int device_reset(struct scsi_cmnd *srb) > > usb_stor_dbg(us, "%s called\n", __func__); > > + command_abort_any(us); > + > /* lock the device pointers and do the reset */ > mutex_lock(&(us->dev_mutex)); > result = us->transport_reset(us); > > > -- > Maxime
--- a/drivers/usb/storage/scsiglue.c +++ b/drivers/usb/storage/scsiglue.c @@ -455,6 +455,9 @@ static int device_reset(struct scsi_cmnd *srb) usb_stor_dbg(us, "%s called\n", __func__); + if (us->srb == srb) + command_abort(srb); + /* lock the device pointers and do the reset */ mutex_lock(&(us->dev_mutex)); result = us->transport_reset(us);