[v3] scsi: ufs: Cleanup completed request without interrupt notification

Message ID 20200706132113.21096-1-stanley.chu@mediatek.com (mailing list archive)
State New, archived

Commit Message

Stanley Chu July 6, 2020, 1:21 p.m. UTC
If no interrupt notification is raised for a completed request and
its doorbell bit has been cleared by the host, the UFS driver needs
to clean up its outstanding bit in ufshcd_abort().

Otherwise, the system may crash through the following abnormal flow:

After this request is requeued by the SCSI layer with its
outstanding bit still set, the next completed request will trigger
ufshcd_transfer_req_compl() to handle all "completed outstanding
bits". At this point, the "abnormal outstanding bit" will be detected
and the "requeued request" will be chosen for the request
post-processing flow. This is wrong, and blk_finish_request() will
hit BUG_ON because this request is still "alive".

Note that before ufshcd_abort() cleans up the timed-out request, the
driver needs to check again whether this request has really not been
handled by __ufshcd_transfer_req_compl() yet, because the interrupt
may arrive very late, just before the cleanup.

Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
---
 drivers/scsi/ufs/ufshcd.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Avri Altman July 9, 2020, 8:31 a.m. UTC | #1
> 
> If somehow no interrupt notification is raised for a completed request
> and its doorbell bit is cleared by host, UFS driver needs to cleanup
> its outstanding bit in ufshcd_abort().
Theoretically, this case is already accounted for -
see line 6407: a proper error is issued and eventually the outstanding req is cleared.

Can you go over the scenario you are addressing line by line,
and explain why ufshcd_abort does not account for it?

> 
> Otherwise, system may crash by below abnormal flow:
> 
> After this request is requeued by SCSI layer with its
> outstanding bit set, the next completed request will trigger
> ufshcd_transfer_req_compl() to handle all "completed outstanding
> bits". In this time, the "abnormal outstanding bit" will be detected
> and the "requeued request" will be chosen to execute request
> post-processing flow. This is wrong and blk_finish_request() will
> BUG_ON because this request is still "alive".
> 
> It is worth mentioning that before ufshcd_abort() cleans the timed-out
> request, driver need to check again if this request is really not
> handled by __ufshcd_transfer_req_compl() yet because it may be
> possible that the interrupt comes very lately before the cleaning.
What do you mean? Why isn't checking the outstanding reqs enough?

> 
> Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
> ---
>  drivers/scsi/ufs/ufshcd.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index 8603b07045a6..f23fb14df9f6 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -6462,7 +6462,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>                         /* command completed already */
>                         dev_err(hba->dev, "%s: cmd at tag %d successfully cleared from DB.\n",
>                                 __func__, tag);
> -                       goto out;
> +                       goto cleanup;
But you've arrived here only if (!(test_bit(tag, &hba->outstanding_reqs))) - 
See line 6400. 

>                 } else {
>                         dev_err(hba->dev,
>                                 "%s: no response from device. tag = %d, err %d\n",
> @@ -6496,9 +6496,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
>                 goto out;
>         }
> 
> +cleanup:
> +       spin_lock_irqsave(host->host_lock, flags);
> +       if (!test_bit(tag, &hba->outstanding_reqs)) {
> +               spin_unlock_irqrestore(host->host_lock, flags);
> +               goto out;
> +       }
>         scsi_dma_unmap(cmd);
> 
> -       spin_lock_irqsave(host->host_lock, flags);
>         ufshcd_outstanding_req_clear(hba, tag);
>         hba->lrb[tag].cmd = NULL;
>         spin_unlock_irqrestore(host->host_lock, flags);
> --
> 2.18.0
Stanley Chu July 12, 2020, 1:26 a.m. UTC | #2
Hi Avri,

On Thu, 2020-07-09 at 08:31 +0000, Avri Altman wrote:
> > 
> > If somehow no interrupt notification is raised for a completed request
> > and its doorbell bit is cleared by host, UFS driver needs to cleanup
> > its outstanding bit in ufshcd_abort().
> Theoretically, this case is already accounted for - 
> See line 6407: a proper error is issued and eventually outstanding req is cleared.
> 
> Can you go over the scenario you are attending line by line,
> And explain why ufshcd_abort does not account for it?

Sure.

If a request using tag N is completed by the UFS device but no interrupt
notification arrives before the timeout, ufshcd_abort() will be invoked.

Since the request completion flow has not been executed, the state may be

- Tag N in hba->outstanding_reqs is set
- Tag N in the doorbell register is not set

In this case, the ufshcd_abort() flow would be

- This log is printed: "ufshcd_abort: cmd was completed, but without a
notifying intr, tag = N"
- This log is printed: "ufshcd_abort: Device abort task at tag N"
- If hba->req_abort_skip is zero, a QUERY_TASK command is sent
- The device responds with "UPIU_TASK_MANAGEMENT_FUNC_COMPL"
- This log is printed: "ufshcd_abort: cmd at tag N not pending in the
device."
- The doorbell shows that tag N is not set, so the driver goes to label
"out" with this log printed: "ufshcd_abort: cmd at tag %d successfully
cleared from DB."
- In the "out" section no cleanup is done, and then ufshcd_abort
exits
- This request is re-queued to the request queue by the SCSI timeout
handler

Now an inconsistent state shows up: the request is "re-queued" but its
corresponding resources in the UFS layer are not cleared. The following
flow then triggers bad things:

- A new request with tag M is finished
- An interrupt is raised and ufshcd_transfer_req_compl() finds that both
tag N and tag M can go through the completion flow
- The post-processing flow for tag N is executed while its request
is still alive

I am sorry that the messages below apply only to old kernels in the
non-blk-mq case. However, the above scenario will also trigger bad
things in the blk-mq case.
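
The state walk-through above can be reproduced with a small user-space
model. This is only a sketch under stated assumptions: struct ufs_model
and the model_*() helpers are hypothetical names, one bit stands for one
tag, and the completion pass is assumed to pick every tag that is set in
the outstanding bitmap but clear in the doorbell.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the two bitmaps discussed above (not driver code). */
struct ufs_model {
	uint32_t outstanding_reqs; /* tags the driver believes are in flight */
	uint32_t doorbell;         /* tags the controller still owns */
};

/* Issue a request: the driver marks the tag and rings the doorbell. */
static inline void model_issue(struct ufs_model *m, unsigned int tag)
{
	m->outstanding_reqs |= UINT32_C(1) << tag;
	m->doorbell |= UINT32_C(1) << tag;
}

/* The device finishes a request: only the doorbell bit is cleared. */
static inline void model_device_done(struct ufs_model *m, unsigned int tag)
{
	m->doorbell &= ~(UINT32_C(1) << tag);
}

/*
 * Tags a completion pass would handle. Doorbell bits are always a subset
 * of outstanding bits in this model, so XOR yields "tracked but no longer
 * owned by the controller".
 */
static inline uint32_t model_completed(const struct ufs_model *m)
{
	return m->outstanding_reqs ^ m->doorbell;
}

/* What the proposed "cleanup" label adds for the aborted tag. */
static inline void model_abort_cleanup(struct ufs_model *m, unsigned int tag)
{
	m->outstanding_reqs &= ~(UINT32_C(1) << tag);
}
```

With tag 3 finished silently and tag 5 finished afterwards,
model_completed() reports both tags unless model_abort_cleanup() ran for
tag 3 first, which is the stale completion described in the list above.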

> 
> > 
> > Otherwise, system may crash by below abnormal flow:
> > 
> > After this request is requeued by SCSI layer with its
> > outstanding bit set, the next completed request will trigger
> > ufshcd_transfer_req_compl() to handle all "completed outstanding
> > bits". In this time, the "abnormal outstanding bit" will be detected
> > and the "requeued request" will be chosen to execute request
> > post-processing flow. This is wrong and blk_finish_request() will
> > BUG_ON because this request is still "alive".
> > 
> > It is worth mentioning that before ufshcd_abort() cleans the timed-out
> > request, driver need to check again if this request is really not
> > handled by __ufshcd_transfer_req_compl() yet because it may be
> > possible that the interrupt comes very lately before the cleaning.
> What do you mean? Why checking the outstanding reqs isn't enough?
> 
> > 
> > Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
> > ---
> >  drivers/scsi/ufs/ufshcd.c | 9 +++++++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> > index 8603b07045a6..f23fb14df9f6 100644
> > --- a/drivers/scsi/ufs/ufshcd.c
> > +++ b/drivers/scsi/ufs/ufshcd.c
> > @@ -6462,7 +6462,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> >                         /* command completed already */
> >                         dev_err(hba->dev, "%s: cmd at tag %d successfully cleared from DB.\n",
> >                                 __func__, tag);
> > -                       goto out;
> > +                       goto cleanup;
> But you've arrived here only if (!(test_bit(tag, &hba->outstanding_reqs))) - 
> See line 6400. 
> 
> >                 } else {
> >                         dev_err(hba->dev,
> >                                 "%s: no response from device. tag = %d, err %d\n",
> > @@ -6496,9 +6496,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> >                 goto out;
> >         }
> > 
> > +cleanup:
> > +       spin_lock_irqsave(host->host_lock, flags);
> > +       if (!test_bit(tag, &hba->outstanding_reqs)) {
> > +               spin_unlock_irqrestore(host->host_lock, flags);
> > +               goto out;
> > +       }
> >         scsi_dma_unmap(cmd);
> > 
> > -       spin_lock_irqsave(host->host_lock, flags);
> >         ufshcd_outstanding_req_clear(hba, tag);
> >         hba->lrb[tag].cmd = NULL;
> >         spin_unlock_irqrestore(host->host_lock, flags);
> > --
> > 2.18.0
Avri Altman July 12, 2020, 10:04 a.m. UTC | #3
> 
> Hi Avri,
> 
> On Thu, 2020-07-09 at 08:31 +0000, Avri Altman wrote:
> > >
> > > If somehow no interrupt notification is raised for a completed request
> > > and its doorbell bit is cleared by host, UFS driver needs to cleanup
> > > its outstanding bit in ufshcd_abort().
> > Theoretically, this case is already accounted for -
> > See line 6407: a proper error is issued and eventually outstanding req is cleared.
> >
> > Can you go over the scenario you are attending line by line,
> > And explain why ufshcd_abort does not account for it?
> 
> Sure.
> 
> If a request using tag N is completed by UFS device without interrupt
> notification till timeout happens, ufshcd_abort() will be invoked.
> 
> Since request completion flow is not executed, current status may be
> 
> - Tag N in hba->outstanding_reqs is set
> - Tag N in doorbell register is not set
> 
> In this case, ufshcd_abort() flow would be
> 
> - This log is printed: "ufshcd_abort: cmd was completed, but without a
> notifying intr, tag = N"
> - This log is printed: "ufshcd_abort: Device abort task at tag N"
> - If hba->req_abort_skip is zero, QUERY_TASK command is sent
> - Device responds "UPIU_TASK_MANAGEMENT_FUNC_COMPL"
> - This log is printed: "ufshcd_abort: cmd at tag N not pending in the
> device."
> - Doorbell tells that tag N is not set, so the driver goes to label
> "out" with this log printed: "ufshcd_abort: cmd at tag %d successfully
> cleared from DB."
> - In label "out" section, no cleanup will be made, and then ufshcd_abort
> exits
> - This request will be re-queued to request queue by SCSI timeout
> handler
> 
> Now, Inconsistent state shows-up: A request is "re-queued" but its
> corresponding resource in UFS layer is not cleared, below flow will
> trigger bad things,
> 
> - A new request with tag M is finished
> - Interrupt is raised and ufshcd_transfer_req_compl() found both tag N
> and M can process the completion flow
> - The post-processing flow for tag N will be executed while its request
> is still alive
> 
> I am sorry that below messages are only for old kernel in non-blk-mq
> case. However above scenario will also trigger bad thing in blk-mq case.

Ok.  Thanks.

> 
> >
> > >
> > > Otherwise, system may crash by below abnormal flow:
> > >
> > > After this request is requeued by SCSI layer with its
> > > outstanding bit set, the next completed request will trigger
> > > ufshcd_transfer_req_compl() to handle all "completed outstanding
> > > bits". In this time, the "abnormal outstanding bit" will be detected
> > > and the "requeued request" will be chosen to execute request
> > > post-processing flow. This is wrong and blk_finish_request() will
> > > BUG_ON because this request is still "alive".
> > >
> > > It is worth mentioning that before ufshcd_abort() cleans the timed-out
> > > request, driver need to check again if this request is really not
> > > handled by __ufshcd_transfer_req_compl() yet because it may be
> > > possible that the interrupt comes very lately before the cleaning.
> > What do you mean? Why checking the outstanding reqs isn't enough?
> >
> > >
> > > Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
> > > ---
> > >  drivers/scsi/ufs/ufshcd.c | 9 +++++++--
> > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> > > index 8603b07045a6..f23fb14df9f6 100644
> > > --- a/drivers/scsi/ufs/ufshcd.c
> > > +++ b/drivers/scsi/ufs/ufshcd.c
> > > @@ -6462,7 +6462,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> > >                         /* command completed already */
> > >                         dev_err(hba->dev, "%s: cmd at tag %d successfully cleared from DB.\n",
> > >                                 __func__, tag);
> > > -                       goto out;
> > > +                       goto cleanup;
> > But you've arrived here only if (!(test_bit(tag, &hba->outstanding_reqs))) -
> > See line 6400.
> >
> > >                 } else {
> > >                         dev_err(hba->dev,
> > >                                 "%s: no response from device. tag = %d, err %d\n",
> > > @@ -6496,9 +6496,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> > >                 goto out;
> > >         }
> > >
> > > +cleanup:
> > > +       spin_lock_irqsave(host->host_lock, flags);
> > > +       if (!test_bit(tag, &hba->outstanding_reqs)) {
Is this needed? It was already checked in line 6439.

Thanks,
Avri

> > > +               spin_unlock_irqrestore(host->host_lock, flags);
> > > +               goto out;
> > > +       }
> > >         scsi_dma_unmap(cmd);
> > >
> > > -       spin_lock_irqsave(host->host_lock, flags);
> > >         ufshcd_outstanding_req_clear(hba, tag);
> > >         hba->lrb[tag].cmd = NULL;
> > >         spin_unlock_irqrestore(host->host_lock, flags);
> > > --
> > > 2.18.0
Bart Van Assche July 13, 2020, 1:39 a.m. UTC | #4
On 2020-07-06 06:21, Stanley Chu wrote:
> If somehow no interrupt notification is raised for a completed request
> and its doorbell bit is cleared by host, UFS driver needs to cleanup
> its outstanding bit in ufshcd_abort().

How is it possible that no interrupt notification is raised for a completed
request? Is this the result of a hardware shortcoming or rather the result
of how the UFS driver works? In the latter case, is this patch perhaps a
workaround? If so, has it been considered to fix the root cause instead of
implementing a workaround?

In section 7.2.3 of the UFS specification I found the following about how
to process request completions: "Software determines if new TRs have
completed since step #2, by repeating one of the two methods described in
step #2. If new TRs have completed, software repeats the sequence from step
#3." Is such a loop perhaps missing from the Linux UFS driver?

Thanks,

Bart.
Stanley Chu July 13, 2020, 2:27 a.m. UTC | #5
Hi Bart and Avri,

On Sun, 2020-07-12 at 18:39 -0700, Bart Van Assche wrote:
> On 2020-07-06 06:21, Stanley Chu wrote:
> > If somehow no interrupt notification is raised for a completed request
> > and its doorbell bit is cleared by host, UFS driver needs to cleanup
> > its outstanding bit in ufshcd_abort().
> 
> How is it possible that no interrupt notification is raised for a completed
> request? Is this the result of a hardware shortcoming or rather the result
> of how the UFS driver works? In the latter case, is this patch perhaps a
> workaround? If so, has it been considered to fix the root cause instead of
> implementing a workaround?

Actually this failure was triggered by "error injection" that produces a
command timeout event, to check whether anything can be improved or fixed.

I agree that "no interrupt notification" probably means something is
wrong in hardware and that the root cause should be fixed with the
highest priority. However, from this injection we found that
ufshcd_abort() indeed has a defective flow for a corner case, so we are
looking for a solution to fix the "hole".

Do you think the Linux driver should handle this case? If this is not
necessary, I would drop this patch : )

Thanks a lot,
Stanley Chu

> 
> In section 7.2.3 of the UFS specification I found the following about how
> to process request completions: "Software determines if new TRs have
> completed since step #2, by repeating one of the two methods described in
> step #2. If new TRs have completed, software repeats the sequence from step
> #3." Is such a loop perhaps missing from the Linux UFS driver?
> 
> Thanks,
> 
> Bart.
Avri Altman July 13, 2020, 8:10 a.m. UTC | #6
> 
> Hi Bart and Avri,
> 
> On Sun, 2020-07-12 at 18:39 -0700, Bart Van Assche wrote:
> > On 2020-07-06 06:21, Stanley Chu wrote:
> > > If somehow no interrupt notification is raised for a completed request
> > > and its doorbell bit is cleared by host, UFS driver needs to cleanup
> > > its outstanding bit in ufshcd_abort().
> >
> > How is it possible that no interrupt notification is raised for a completed
> > request? Is this the result of a hardware shortcoming or rather the result
> > of how the UFS driver works? In the latter case, is this patch perhaps a
> > workaround? If so, has it been considered to fix the root cause instead of
> > implementing a workaround?
> 
> Actually this fail is triggered by "error injection" to produce a
> command timeout event for checking if anything can be improved or fixed.
> 
> I agree that "no interrupt notification" may be something wrong in
> hardware and the root cause shall be fixed in the highest priority.
> However from this injection, we found ufshcd_abort() indeed has a defect
> flow for a corner case, so we are looking for the solution to fix the
> "hole".
> 
> What would you think if Linux driver shall consider this case? If this
> is not necessary, I would drop this patch : )
Artificially injecting errors is a very common validation mechanism,
provided that you are not breaking anything in the upper layers,
which I don't think you are doing.

Can you please respond to my last comment?

> 
> Thanks a lot,
> Stanley Chu
> 
> >
> > In section 7.2.3 of the UFS specification I found the following about how
> > to process request completions: "Software determines if new TRs have
> > completed since step #2, by repeating one of the two methods described in
> > step #2. If new TRs have completed, software repeats the sequence from step
> > #3." Is such a loop perhaps missing from the Linux UFS driver?
Could not find that citation.
What version of the spec are you using?

Thanks,
Avri
> >
> > Thanks,
> >
> > Bart.
Stanley Chu July 14, 2020, 8:48 a.m. UTC | #7
Hi Avri,

Sorry for the late response.

On Sun, 2020-07-12 at 10:04 +0000, Avri Altman wrote:
> 
> > 
> > Hi Avri,
> > 
> > On Thu, 2020-07-09 at 08:31 +0000, Avri Altman wrote:
> > > >
> > > > If somehow no interrupt notification is raised for a completed request
> > > > and its doorbell bit is cleared by host, UFS driver needs to cleanup
> > > > its outstanding bit in ufshcd_abort().
> > > Theoretically, this case is already accounted for -
> > > > See line 6407: a proper error is issued and eventually outstanding req is cleared.
> > >
> > > Can you go over the scenario you are attending line by line,
> > > And explain why ufshcd_abort does not account for it?
> > 
> > Sure.
> > 
> > If a request using tag N is completed by UFS device without interrupt
> > notification till timeout happens, ufshcd_abort() will be invoked.
> > 
> > Since request completion flow is not executed, current status may be
> > 
> > - Tag N in hba->outstanding_reqs is set
> > - Tag N in doorbell register is not set
> > 
> > In this case, ufshcd_abort() flow would be
> > 
> > - This log is printed: "ufshcd_abort: cmd was completed, but without a
> > notifying intr, tag = N"
> > - This log is printed: "ufshcd_abort: Device abort task at tag N"
> > - If hba->req_abort_skip is zero, QUERY_TASK command is sent
> > - Device responds "UPIU_TASK_MANAGEMENT_FUNC_COMPL"
> > - This log is printed: "ufshcd_abort: cmd at tag N not pending in the
> > device."
> > - Doorbell tells that tag N is not set, so the driver goes to label
> > "out" with this log printed: "ufshcd_abort: cmd at tag %d successfully
> > cleared from DB."
> > - In label "out" section, no cleanup will be made, and then ufshcd_abort
> > exits
> > - This request will be re-queued to request queue by SCSI timeout
> > handler
> > 
> > Now, Inconsistent state shows-up: A request is "re-queued" but its
> > corresponding resource in UFS layer is not cleared, below flow will
> > trigger bad things,
> > 
> > - A new request with tag M is finished
> > - Interrupt is raised and ufshcd_transfer_req_compl() found both tag N
> > and M can process the completion flow
> > - The post-processing flow for tag N will be executed while its request
> > is still alive
> > 
> > I am sorry that below messages are only for old kernel in non-blk-mq
> > case. However above scenario will also trigger bad thing in blk-mq case.
> 
> Ok.  Thanks.
> 
> > 
> > >
> > > >
> > > > Otherwise, system may crash by below abnormal flow:
> > > >
> > > > After this request is requeued by SCSI layer with its
> > > > outstanding bit set, the next completed request will trigger
> > > > ufshcd_transfer_req_compl() to handle all "completed outstanding
> > > > bits". In this time, the "abnormal outstanding bit" will be detected
> > > > and the "requeued request" will be chosen to execute request
> > > > post-processing flow. This is wrong and blk_finish_request() will
> > > > BUG_ON because this request is still "alive".
> > > >
> > > > It is worth mentioning that before ufshcd_abort() cleans the timed-out
> > > > request, driver need to check again if this request is really not
> > > > handled by __ufshcd_transfer_req_compl() yet because it may be
> > > > possible that the interrupt comes very lately before the cleaning.
> > > What do you mean? Why checking the outstanding reqs isn't enough?
> > >
> > > >
> > > > Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
> > > > ---
> > > >  drivers/scsi/ufs/ufshcd.c | 9 +++++++--
> > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> > > > index 8603b07045a6..f23fb14df9f6 100644
> > > > --- a/drivers/scsi/ufs/ufshcd.c
> > > > +++ b/drivers/scsi/ufs/ufshcd.c
> > > > @@ -6462,7 +6462,7 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> > > >                         /* command completed already */
> > > >                         dev_err(hba->dev, "%s: cmd at tag %d successfully cleared from DB.\n",
> > > >                                 __func__, tag);
> > > > -                       goto out;
> > > > +                       goto cleanup;
> > > But you've arrived here only if (!(test_bit(tag, &hba->outstanding_reqs))) -
> > > See line 6400.
> > >
> > > >                 } else {
> > > >                         dev_err(hba->dev,
> > > >                                 "%s: no response from device. tag = %d, err %d\n",
> > > > @@ -6496,9 +6496,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
> > > >                 goto out;
> > > >         }
> > > >
> > > > +cleanup:
> > > > +       spin_lock_irqsave(host->host_lock, flags);
> > > > +       if (!test_bit(tag, &hba->outstanding_reqs)) {
> Is this needed?  it was already checked in line 6439.
> 

I am worried about the case where the interrupt arrives very late. For
example, if the interrupt finally arrives while ufshcd_abort() is
handling this command, the command may be completed first by the
interrupt handler. In that case, ufshcd_abort() must not clear this
command again. Conversely, if ufshcd_abort() clears this command first,
the interrupt handler must not complete it. Thus, checking
hba->outstanding_reqs with the host lock held is required here to
prevent the above race.
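
The "whoever gets there first owns the cleanup" rule above can be shown
in user space. The patch itself relies on host->host_lock around
test_bit(); the sketch below swaps in a C11 atomic fetch-and as a
stand-in, which is an assumption made for illustration and not what the
driver does. claim_tag() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Atomically clear the bit for a tag (tag must be below 32) and report
 * whether this caller was the one that cleared it. Only the caller that
 * gets "true" may go on to unmap DMA and release the slot; a second
 * caller (abort path or completion path) sees "false" and backs off.
 */
static inline bool claim_tag(_Atomic uint32_t *outstanding_reqs,
			     unsigned int tag)
{
	uint32_t bit = UINT32_C(1) << tag;
	uint32_t old = atomic_fetch_and(outstanding_reqs, ~bit);

	return (old & bit) != 0;
}
```

If the abort path and the interrupt path both call claim_tag() for the
same tag, exactly one of them sees true, so the cleanup cannot run
twice.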

Thanks,
Stanley Chu
Avri Altman July 14, 2020, 9:29 a.m. UTC | #8
> > > > > +cleanup:
> > > > > +       spin_lock_irqsave(host->host_lock, flags);
> > > > > +       if (!test_bit(tag, &hba->outstanding_reqs)) {
> > Is this needed?  it was already checked in line 6439.
> >
> 
> I am worried about the case that interrupt comes very lately. 
scsi timeout is 30sec - do you expect an interrupt to arrive after that?

Thanks,
Avri

> For
> example, if interrupt finally comes while ufshcd_abort() is handling
> this command, then probably this command may be completed first by
> interrupt handler. In this case, ufshcd_abort() shall not clear this
> command again. In contrast, if ufshcd_abort() clears this command first,
> then interrupt shall not complete it. Thus here checking
> hba->outstanding_req with host lock held is required to prevent above
> racing.
> 
> Thanks,
> Stanley Chu
>
Stanley Chu July 14, 2020, 10 a.m. UTC | #9
Hi Avri,

On Tue, 2020-07-14 at 09:29 +0000, Avri Altman wrote:
> > > > > > +cleanup:
> > > > > > +       spin_lock_irqsave(host->host_lock, flags);
> > > > > > +       if (!test_bit(tag, &hba->outstanding_reqs)) {
> > > Is this needed?  it was already checked in line 6439.
> > >
> > 
> > I am worried about the case that interrupt comes very lately. 
> scsi timeout is 30sec - do you expect an interrupt to arrive after that?
> 

Yeah, I agree that a 30s delayed interrupt sounds kind of ridiculous.
This check is just to make the cleanup flow safer.

Thanks,
Stanley Chu
Bart Van Assche July 15, 2020, 4 a.m. UTC | #10
On 2020-07-13 01:10, Avri Altman wrote:
> Artificially injecting errors is a very common validation mechanism,
> Provided that you are not breaking anything of the upper-layers,
> Which I don't think you are doing.

Hi Avri,

My concern is that the code that is being added in the abort handler
sooner or later will evolve into a duplicate of the regular completion
path. Wouldn't it be better to poll for completions from the timeout
handler by calling ufshcd_transfer_req_compl() instead of duplicating
that function?

>>> In section 7.2.3 of the UFS specification I found the following about how
>>> to process request completions: "Software determines if new TRs have
>>> completed since step #2, by repeating one of the two methods described in
>>> step #2. If new TRs have completed, software repeats the sequence from
>>> step #3." Is such a loop perhaps missing from the Linux UFS driver?
>
> Could not find that citation.
> What version of the spec are you using?

That quote comes from the following document: "Universal Flash Storage
Host Controller Interface (UFSHCI); Version 2.1; JESD223C; (Revision of
JESD223B, September 2013); MARCH 2016".

Bart.
Stanley Chu July 22, 2020, 10:07 a.m. UTC | #11
Hi Bart, Avri,

On Tue, 2020-07-14 at 21:00 -0700, Bart Van Assche wrote:
> On 2020-07-13 01:10, Avri Altman wrote:
> > Artificially injecting errors is a very common validation mechanism,
> > Provided that you are not breaking anything of the upper-layers,
> > Which I don't think you are doing.
> 

Regarding the concern raised in the question below,

"scsi timeout is 30sec - do you expect an interrupt to arrive after
that?"

Actually, in my test scenario the flow works well without re-checking
"outstanding_reqs" in the "cleanup" section of ufshcd_abort(), so I
will remove this check first and resend this fix (with the commit
message refined for blk-mq, not legacy blk). Please let me know if you
have any suggestions.

> Hi Avri,
> 
> My concern is that the code that is being added in the abort handler
> sooner or later will evolve into a duplicate of the regular completion
> path. Wouldn't it be better to poll for completions from the timeout
> handler by calling ufshcd_transfer_req_compl() instead of duplicating
> that function?
> 

The duplicated calls in the cleanup job would be as below,

scsi_dma_unmap(cmd);
hba->lrb[tag].cmd = NULL;
ufshcd_outstanding_req_clear(hba, tag);

As you suggest, the above calls could be refactored, but the third call
is done more efficiently in __ufshcd_transfer_req_compl() by

hba->outstanding_reqs ^= completed_reqs;

for all requests handled in the interrupt handler.


Here we cannot directly use ufshcd_transfer_req_compl() or its inner
function __ufshcd_transfer_req_compl(), since at least scsi_done() is
not wanted in ufshcd_abort(): the completion flow will be handled by
the SCSI error handler, not by ufshcd_abort() itself.
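
To make the efficiency remark concrete, here is a small user-space
comparison of clearing bits one tag at a time against the single XOR
quoted above. Both function names are hypothetical, and the sketch
assumes a 32-bit bitmap rather than the driver's actual types.

```c
#include <assert.h>
#include <stdint.h>

/* Clear each completed tag individually, one bit per iteration. */
static inline uint32_t clear_per_tag(uint32_t outstanding, uint32_t completed)
{
	for (unsigned int tag = 0; tag < 32; tag++) {
		if (completed & (UINT32_C(1) << tag))
			outstanding &= ~(UINT32_C(1) << tag);
	}
	return outstanding;
}

/*
 * Clear all completed tags in one operation. This matches clear_per_tag()
 * only when every completed bit is currently set in outstanding; a bit
 * that is not outstanding would be turned on instead of staying clear.
 */
static inline uint32_t clear_batch_xor(uint32_t outstanding, uint32_t completed)
{
	return outstanding ^ completed;
}
```

The two agree as long as the completed mask is a subset of the
outstanding mask, which is the case when the mask was derived from the
outstanding bitmap in the first place.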

> >>> In section 7.2.3 of the UFS specification I found the following about how
> >>> to process request completions: "Software determines if new TRs have
> >>> completed since step #2, by repeating one of the two methods described in
> >>> step #2. If new TRs have completed, software repeats the sequence from
> >>> step #3." Is such a loop perhaps missing from the Linux UFS driver?
> >
> > Could not find that citation.
> > What version of the spec are you using?
> 
> That quote comes from the following document: "Universal Flash Storage
> Host Controller Interface (UFSHCI); Version 2.1; JESD223C; (Revision of
> JESD223B, September 2013); MARCH 2016".

The above description has already been implemented in ufshcd_intr() and
ufshcd_transfer_req_compl(). But this loop cannot rescue a "missing
interrupt" such as this injected error case.
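
For readers without the spec at hand, the re-check loop in the quoted
text can be sketched roughly as follows. This is a user-space
illustration only: struct fake_doorbell, fake_doorbell_read() and
drain_completions() are hypothetical names, and the scripted register
reads stand in for real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the doorbell register: scripted read values. */
struct fake_doorbell {
	const uint32_t *reads; /* values returned by successive reads */
	size_t count;          /* number of scripted values, at least 1 */
	size_t next;           /* index of the next value to return */
};

static inline uint32_t fake_doorbell_read(struct fake_doorbell *db)
{
	/* Repeat the last scripted value once the script runs out. */
	size_t i = db->next < db->count ? db->next++ : db->count - 1;

	return db->reads[i];
}

/*
 * Keep re-reading the doorbell until a pass finds nothing newly
 * completed. Returns the mask of all tags handled during the call.
 */
static inline uint32_t drain_completions(uint32_t *outstanding,
					 struct fake_doorbell *db)
{
	uint32_t handled = 0;

	for (;;) {
		uint32_t completed = *outstanding & ~fake_doorbell_read(db);

		if (completed == 0)
			break;
		*outstanding &= ~completed;
		handled |= completed;
	}
	return handled;
}
```

The loop only helps once something calls it. If the interrupt that
would trigger the call never fires, nothing drains the completed tag,
which is the limitation pointed out above.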

Thanks,
Stanley Chu
Patch

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 8603b07045a6..f23fb14df9f6 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -6462,7 +6462,7 @@  static int ufshcd_abort(struct scsi_cmnd *cmd)
 			/* command completed already */
 			dev_err(hba->dev, "%s: cmd at tag %d successfully cleared from DB.\n",
 				__func__, tag);
-			goto out;
+			goto cleanup;
 		} else {
 			dev_err(hba->dev,
 				"%s: no response from device. tag = %d, err %d\n",
@@ -6496,9 +6496,14 @@  static int ufshcd_abort(struct scsi_cmnd *cmd)
 		goto out;
 	}
 
+cleanup:
+	spin_lock_irqsave(host->host_lock, flags);
+	if (!test_bit(tag, &hba->outstanding_reqs)) {
+		spin_unlock_irqrestore(host->host_lock, flags);
+		goto out;
+	}
 	scsi_dma_unmap(cmd);
 
-	spin_lock_irqsave(host->host_lock, flags);
 	ufshcd_outstanding_req_clear(hba, tag);
 	hba->lrb[tag].cmd = NULL;
 	spin_unlock_irqrestore(host->host_lock, flags);