diff mbox series

RFC: Retrying SCSI pass-through commands

Message ID b0f92045-d10f-4038-a746-e3d87e5830e8@acm.org (mailing list archive)
State Changes Requested
Headers show
Series RFC: Retrying SCSI pass-through commands | expand

Commit Message

Bart Van Assche July 31, 2024, 8:22 p.m. UTC
Hi,

Recently I noticed that a particular UFS-based device does not resume
correctly. The logs of the device show that sd_start_stop_device() does
not retry the START STOP UNIT command if the device reports a unit
attention. I think that's a bug in the SCSI core. The following hack
makes resume work again. I think this confirms my understanding of this
issue (sd_start_stop_device() sets RQF_PM):

Bart.

Comments

Damien Le Moal Aug. 1, 2024, 3:33 a.m. UTC | #1
On 8/1/24 5:22 AM, Bart Van Assche wrote:
> Hi,
> 
> Recently I noticed that a particular UFS-based device does not resume
> correctly. The logs of the device show that sd_start_stop_device() does
> not retry the START STOP UNIT command if the device reports a unit
> attention. I think that's a bug in the SCSI core. The following hack
> makes resume work again. I think this confirms my understanding of this
> issue (sd_start_stop_device() sets RQF_PM):
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index da7dac77f8cd..e21becc5bcf9 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1816,6 +1816,8 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
>          * assume caller has checked sense and determined
>          * the check condition was retryable.
>          */
> +       if (req->rq_flags & RQF_PM)
> +               return false;
>         if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
>                 return true;
> 
> My understanding is that SCSI pass-through commands submitted from
> user space must not be retried. Are there any objections against
> modifying the behavior of the SCSI core such that it retries
> REQ_OP_DRV_* operations submitted by the SCSI core, as illustrated
> by the pseudo-code below?

Looking at the code, e.g. sd_start_stop_device():

	res = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, NULL, 0, SD_TIMEOUT,
                               sdkp->max_retries, &exec_args);

It seems that it is expected that the retry count will be honored. But that
indeed is not the case as scsi_noretry_cmd() will always return false for
REQ_OP_DRV_* commands.

So may be we should have a RQF_USER_OP_DRV flag to differentiate user
REQ_OP_DRV_* passthrough commands from internally issued REQ_OP_DRV_* commands.
Or the reverse flag, e.g. RQF_INTERNAL_OP_DRV, that we can set in e.g.
scsi_execute_cmnd().

> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index da7dac77f8cd..e21becc5bcf9 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1816,6 +1816,12 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
>          * assume caller has checked sense and determined
>          * the check condition was retryable.
>          */
> -       if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
> -               return true;
> +       if (req->cmd_flags & REQ_FAILFAST_DEV)
> +               return true;
> +       if (/* submitted by the SCSI core */)
> +               return false;
> +       if (blk_rq_is_passthrough(req))
> +               return true;
> 
> Thanks,
> 
> Bart.
Damien Le Moal Aug. 1, 2024, 7:04 a.m. UTC | #2
On 8/1/24 5:22 AM, Bart Van Assche wrote:
> Hi,
> 
> Recently I noticed that a particular UFS-based device does not resume
> correctly. The logs of the device show that sd_start_stop_device() does
> not retry the START STOP UNIT command if the device reports a unit
> attention. I think that's a bug in the SCSI core. The following hack
> makes resume work again. I think this confirms my understanding of this
> issue (sd_start_stop_device() sets RQF_PM):
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index da7dac77f8cd..e21becc5bcf9 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1816,6 +1816,8 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
>          * assume caller has checked sense and determined
>          * the check condition was retryable.
>          */
> +       if (req->rq_flags & RQF_PM)
> +               return false;
>         if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
>                 return true;
> 
> My understanding is that SCSI pass-through commands submitted from
> user space must not be retried. Are there any objections against
> modifying the behavior of the SCSI core such that it retries
> REQ_OP_DRV_* operations submitted by the SCSI core, as illustrated
> by the pseudo-code below?

Another thing: may be check scsi_check_passthrough(). See the call to that in
scsi_execute_cmd() and how it is used to drive retry of the command. Tweaking
that may be what you need for your UFS device problem ?

> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index da7dac77f8cd..e21becc5bcf9 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1816,6 +1816,12 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
>          * assume caller has checked sense and determined
>          * the check condition was retryable.
>          */
> -       if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
> -               return true;
> +       if (req->cmd_flags & REQ_FAILFAST_DEV)
> +               return true;
> +       if (/* submitted by the SCSI core */)
> +               return false;
> +       if (blk_rq_is_passthrough(req))
> +               return true;
> 
> Thanks,
> 
> Bart.
Bart Van Assche Aug. 1, 2024, 6 p.m. UTC | #3
On 8/1/24 12:04 AM, Damien Le Moal wrote:
> Another thing: may be check scsi_check_passthrough(). See the call to that in
> scsi_execute_cmd() and how it is used to drive retry of the command. Tweaking
> that may be what you need for your UFS device problem ?

My understanding is that making scsi_check_passthrough() retry commands would
require modifying all scsi_execute_cmd() callers. If the scsi_exec_args.
failures pointer is NULL, the scsi_check_passthrough() function does not
trigger a retry. If scsi_noretry_cmd() is modified, none of the
scsi_execute_cmd() callers have to be modified.

Thanks,

Bart.
Bart Van Assche Aug. 1, 2024, 8:12 p.m. UTC | #4
On 7/31/24 8:33 PM, Damien Le Moal wrote:
> Looking at the code, e.g. sd_start_stop_device():
> 
> 	res = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, NULL, 0, SD_TIMEOUT,
>                                 sdkp->max_retries, &exec_args);
> 
> It seems that it is expected that the retry count will be honored. But that
> indeed is not the case as scsi_noretry_cmd() will always return false for
> REQ_OP_DRV_* commands.
> 
> So may be we should have a RQF_USER_OP_DRV flag to differentiate user
> REQ_OP_DRV_* passthrough commands from internally issued REQ_OP_DRV_* commands.
> Or the reverse flag, e.g. RQF_INTERNAL_OP_DRV, that we can set in e.g.
> scsi_execute_cmnd().

Hmm ... how about using the simplest possible patch? The patch below seems
to work fine.

Thanks,

Bart.

[PATCH] scsi: core: Retry commands submitted by the SCSI core

Pass-through commands either come from user space (SG_IO ioctl,
SCSI_IOCTL_*, BSG), from the SCSI core or from a SCSI driver (ch,
cxlflash, pktcdvd, scsi_scan, scsi_dh_*, scsi_transport_spi, sd,
sg, st, virtio_scsi). All this code sets scsi_cmnd.allowed to the
maximum number of allowed retries. Hence, removing the
blk_rq_is_passthrough() check from scsi_noretry_cmd() has the
effect that scsi_cmnd.allowed is respected for pass-through
commands.

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 612489afe8d2..c09f65cfa898 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1855,7 +1855,7 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
  	 * assume caller has checked sense and determined
  	 * the check condition was retryable.
  	 */
-	if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
+	if (req->cmd_flags & REQ_FAILFAST_DEV)
  		return true;

  	return false;
Damien Le Moal Aug. 2, 2024, 12:47 a.m. UTC | #5
On 8/2/24 05:12, Bart Van Assche wrote:
> On 7/31/24 8:33 PM, Damien Le Moal wrote:
>> Looking at the code, e.g. sd_start_stop_device():
>>
>> 	res = scsi_execute_cmd(sdp, cmd, REQ_OP_DRV_IN, NULL, 0, SD_TIMEOUT,
>>                                 sdkp->max_retries, &exec_args);
>>
>> It seems that it is expected that the retry count will be honored. But that
>> indeed is not the case as scsi_noretry_cmd() will always return false for
>> REQ_OP_DRV_* commands.
>>
>> So may be we should have a RQF_USER_OP_DRV flag to differentiate user
>> REQ_OP_DRV_* passthrough commands from internally issued REQ_OP_DRV_* commands.
>> Or the reverse flag, e.g. RQF_INTERNAL_OP_DRV, that we can set in e.g.
>> scsi_execute_cmnd().
> 
> Hmm ... how about using the simplest possible patch? The patch below seems
> to work fine.
> 
> Thanks,
> 
> Bart.
> 
> [PATCH] scsi: core: Retry commands submitted by the SCSI core
> 
> Pass-through commands either come from user space (SG_IO ioctl,
> SCSI_IOCTL_*, BSG), from the SCSI core or from a SCSI driver (ch,
> cxlflash, pktcdvd, scsi_scan, scsi_dh_*, scsi_transport_spi, sd,
> sg, st, virtio_scsi). All this code sets scsi_cmnd.allowed to the
> maximum number of allowed retries. Hence, removing the
> blk_rq_is_passthrough() check from scsi_noretry_cmd() has the
> effect that scsi_cmnd.allowed is respected for pass-through
> commands.
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 612489afe8d2..c09f65cfa898 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1855,7 +1855,7 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
>   	 * assume caller has checked sense and determined
>   	 * the check condition was retryable.
>   	 */
> -	if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
> +	if (req->cmd_flags & REQ_FAILFAST_DEV)
>   		return true;

But that will cause *all* passthrough requests to be retried, including those
issued from userspace, no ? I do not think that such change would be acceptable
as that can break userspace using passthrough and expecting to deal with errors
and handling them however they see fit, which may not be retrying commands.

> 
>   	return false;
>
Bart Van Assche Aug. 2, 2024, 4:52 p.m. UTC | #6
On 8/1/24 5:47 PM, Damien Le Moal wrote:
> On 8/2/24 05:12, Bart Van Assche wrote:
>> [PATCH] scsi: core: Retry commands submitted by the SCSI core
>>
>> Pass-through commands either come from user space (SG_IO ioctl,
>> SCSI_IOCTL_*, BSG), from the SCSI core or from a SCSI driver (ch,
>> cxlflash, pktcdvd, scsi_scan, scsi_dh_*, scsi_transport_spi, sd,
>> sg, st, virtio_scsi). All this code sets scsi_cmnd.allowed to the
>> maximum number of allowed retries. Hence, removing the
>> blk_rq_is_passthrough() check from scsi_noretry_cmd() has the
>> effect that scsi_cmnd.allowed is respected for pass-through
>> commands.
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index 612489afe8d2..c09f65cfa898 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -1855,7 +1855,7 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
>>    	 * assume caller has checked sense and determined
>>    	 * the check condition was retryable.
>>    	 */
>> -	if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
>> +	if (req->cmd_flags & REQ_FAILFAST_DEV)
>>    		return true;
> 
> But that will cause *all* passthrough requests to be retried, including those
> issued from userspace, no ? I do not think that such change would be acceptable
> as that can break userspace using passthrough and expecting to deal with errors
> and handling them however they see fit, which may not be retrying commands.

In the SG_IO ioctl implementation I found the following assignment:

	scmd->allowed = 0;

In drivers/scsi/sg.c I found the following (SG_DEFAULT_RETRIES == 0):

	scmd->allowed = SG_DEFAULT_RETRIES;

Isn't that sufficient to prevent retries in the scsi_noretry_cmd() callers that
check scmd->allowed? The only scsi_noretry_cmd() that does not check
scmd->allowed is scsi_io_completion(). I propose to add an explicit
blk_rq_is_passthrough() check in that function next to the scsi_noretry_cmd()
call.

Thanks,

Bart.
Mike Christie Aug. 3, 2024, 8:42 p.m. UTC | #7
On 7/31/24 3:22 PM, Bart Van Assche wrote:
> 
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index da7dac77f8cd..e21becc5bcf9 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1816,6 +1816,12 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
>          * assume caller has checked sense and determined
>          * the check condition was retryable.
>          */
> -       if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
> -               return true;
> +       if (req->cmd_flags & REQ_FAILFAST_DEV)
> +               return true;
> +       if (/* submitted by the SCSI core */)
> +               return false;
> +       if (blk_rq_is_passthrough(req))
> +               return true;

Do you just want to retry UAs or all internal passthrough command
errors that go through here?

For just the specific UA or NOT_READY/0x04/0x01 case for an example.
Does every scsi core passthrough caller want that retried? If so, then
I can see where you are coming from where you feel it's a bug. I would
agree we would normally want to retry that in general. Maybe others know
about some specific old case though.

However, I'm not sure for MEDIUM_ERROR or ABORTED_COMMAND.
I think MEDIUM_ERROR probably would not come up for the cases we are
talking about though.

I don't think we want to always retry DID_TIME_OUT though. The funny
thing is that I think Martin just wanted to retry one specific case
for that error. We had to do the scsi_check_passthrough patches so
we could retry just for scsi_probe_lun.
Damien Le Moal Aug. 5, 2024, 5:47 p.m. UTC | #8
On 2024/08/03 13:42, Mike Christie wrote:
> On 7/31/24 3:22 PM, Bart Van Assche wrote:
>>
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index da7dac77f8cd..e21becc5bcf9 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -1816,6 +1816,12 @@ bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
>>          * assume caller has checked sense and determined
>>          * the check condition was retryable.
>>          */
>> -       if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
>> -               return true;
>> +       if (req->cmd_flags & REQ_FAILFAST_DEV)
>> +               return true;
>> +       if (/* submitted by the SCSI core */)
>> +               return false;
>> +       if (blk_rq_is_passthrough(req))
>> +               return true;
> 
> Do you just want to retry UAs or all internal passthrough command
> errors that go through here?

I think that we definitely do *not* want to retry all internal commands. E.g. I
do not see the point in retrying INQUIRY or VPD page accesses failing on device
scan.

> 
> For just the specific UA or NOT_READY/0x04/0x01 case for an example.
> Does every scsi core passthrough caller want that retried? If so, then
> I can see where you are coming from where you feel it's a bug. I would
> agree we would normally want to retry that in general. Maybe others know
> about some specific old case though.

I think the command Bart had a problem with is START_STOP_UNIT. If that is the
only case causing problems, then we should probably improve
sd_start_stop_device() to allow retries. But if there are more commands that
needs the same, then the patch that Bart is proposing makes sense I think, but
it will require an audit of all REQ_OP_DRV_XXX internal users to make sure that
they all use that command with -> allowed set to 0. But hopefully, most cases go
through scsi_execute_cmnd(), which should simplify this audit.

> 
> However, I'm not sure for MEDIUM_ERROR or ABORTED_COMMAND.
> I think MEDIUM_ERROR probably would not come up for the cases we are
> talking about though.
> 
> I don't think we want to always retry DID_TIME_OUT though. The funny
> thing is that I think Martin just wanted to retry one specific case
> for that error. We had to do the scsi_check_passthrough patches so
> we could retry just for scsi_probe_lun.
Bart Van Assche Aug. 5, 2024, 5:56 p.m. UTC | #9
On 8/5/24 10:47 AM, Damien Le Moal wrote:
> On 2024/08/03 13:42, Mike Christie wrote:
>> Do you just want to retry UAs or all internal passthrough command
>> errors that go through here?
> 
> I think that we definitely do *not* want to retry all internal commands. E.g. I
> do not see the point in retrying INQUIRY or VPD page accesses failing on device
> scan.
> 
>>
>> For just the specific UA or NOT_READY/0x04/0x01 case for an example.
>> Does every scsi core passthrough caller want that retried? If so, then
>> I can see where you are coming from where you feel it's a bug. I would
>> agree we would normally want to retry that in general. Maybe others know
>> about some specific old case though.
> 
> I think the command Bart had a problem with is START_STOP_UNIT. If that is the
> only case causing problems, then we should probably improve
> sd_start_stop_device() to allow retries. But if there are more commands that
> needs the same, then the patch that Bart is proposing makes sense I think, but
> it will require an audit of all REQ_OP_DRV_XXX internal users to make sure that
> they all use that command with -> allowed set to 0. But hopefully, most cases go
> through scsi_execute_cmnd(), which should simplify this audit.

Thanks Mike and Damien for having shared your thoughts. I only need the
START STOP UNIT command to be retried. So the lowest risk approach is to
introduce a mechanism that only causes that command to be retried. I
plan to look into this and prepare a patch.

Bart.
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index da7dac77f8cd..e21becc5bcf9 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1816,6 +1816,8 @@  bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
          * assume caller has checked sense and determined
          * the check condition was retryable.
          */
+       if (req->rq_flags & RQF_PM)
+               return false;
         if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
                 return true;

My understanding is that SCSI pass-through commands submitted from
user space must not be retried. Are there any objections against
modifying the behavior of the SCSI core such that it retries
REQ_OP_DRV_* operations submitted by the SCSI core, as illustrated
by the pseudo-code below?

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index da7dac77f8cd..e21becc5bcf9 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1816,6 +1816,12 @@  bool scsi_noretry_cmd(struct scsi_cmnd *scmd)
          * assume caller has checked sense and determined
          * the check condition was retryable.
          */
-       if (req->cmd_flags & REQ_FAILFAST_DEV || blk_rq_is_passthrough(req))
-               return true;
+       if (req->cmd_flags & REQ_FAILFAST_DEV)
+               return true;
+       if (/* submitted by the SCSI core */)
+               return false;
+       if (blk_rq_is_passthrough(req))
+               return true;

Thanks,