mbox series

[v1,0/2] scsi: ufs: Fix broken hba->outstanding_tasks

Message ID 1562906656-27154-1-git-send-email-stanley.chu@mediatek.com (mailing list archive)
Headers show
Series scsi: ufs: Fix broken hba->outstanding_tasks | expand

Message

Stanley Chu July 12, 2019, 4:44 a.m. UTC
Currently bits in hba->outstanding_tasks are cleared only after their
corresponding task management commands are successfully done by
__ufshcd_issue_tm_cmd().

If timeout happens in a task management command, its corresponding
bit in hba->outstanding_tasks will not be cleared until next task
management command with the same tag used successfully finishes.‧

This is wrong and can lead to some issues, like power consumpton issue.
For example, ufshcd_release() and ufshcd_gate_work() will do nothing
if hba->outstanding_tasks is not zero even if both UFS host and devices
are actually idle.

Because error handling flow, i.e., ufshcd_reset_and_restore(), will be
triggered after any task management command times out, we fix this by
clearing corresponding hba->outstanding_tasks bits during this flow.
To achieve this, we need a mask to track timed-out commands and thus
error handling flow can clear their tags specifically.

Stanley Chu (2):
  scsi: ufs: Make new function for clearing outstanding task bits
  scsi: ufs: Fix broken hba->outstanding_tasks

 drivers/scsi/ufs/ufshcd.c | 49 +++++++++++++++++++++++++++++++++------
 drivers/scsi/ufs/ufshcd.h |  1 +
 2 files changed, 43 insertions(+), 7 deletions(-)

Comments

Stanley Chu July 18, 2019, 5:21 a.m. UTC | #1
Hi Avri, Alim and Pedrom,

Gentle ping for this fix.

On Fri, 2019-07-12 at 12:44 +0800, Stanley Chu wrote:
> Currently bits in hba->outstanding_tasks are cleared only after their
> corresponding task management commands are successfully done by
> __ufshcd_issue_tm_cmd().
> 
> If timeout happens in a task management command, its corresponding
> bit in hba->outstanding_tasks will not be cleared until next task
> management command with the same tag used successfully finishes.‧
> 
> This is wrong and can lead to some issues, like power consumpton issue.
> For example, ufshcd_release() and ufshcd_gate_work() will do nothing
> if hba->outstanding_tasks is not zero even if both UFS host and devices
> are actually idle.
> 
> Because error handling flow, i.e., ufshcd_reset_and_restore(), will be
> triggered after any task management command times out, we fix this by
> clearing corresponding hba->outstanding_tasks bits during this flow.
> To achieve this, we need a mask to track timed-out commands and thus
> error handling flow can clear their tags specifically.
> 
> Stanley Chu (2):
>   scsi: ufs: Make new function for clearing outstanding task bits
>   scsi: ufs: Fix broken hba->outstanding_tasks
> 
>  drivers/scsi/ufs/ufshcd.c | 49 +++++++++++++++++++++++++++++++++------
>  drivers/scsi/ufs/ufshcd.h |  1 +
>  2 files changed, 43 insertions(+), 7 deletions(-)
> 

Thanks,
Stanley
Avri Altman July 21, 2019, 12:46 p.m. UTC | #2
Hi,

> 
> Currently bits in hba->outstanding_tasks are cleared only after their
> corresponding task management commands are successfully done by
> __ufshcd_issue_tm_cmd().
> 
> If timeout happens in a task management command, its corresponding
> bit in hba->outstanding_tasks will not be cleared until next task
> management command with the same tag used successfully finishes.‧
ufshcd_clear_tm_cmd is also called as part of ufshcd_err_handler.
Does this change something in your assumptions?

Thanks,
Avri

> 
> This is wrong and can lead to some issues, like power consumpton issue.
> For example, ufshcd_release() and ufshcd_gate_work() will do nothing
> if hba->outstanding_tasks is not zero even if both UFS host and devices
> are actually idle.
> 
> Because error handling flow, i.e., ufshcd_reset_and_restore(), will be
> triggered after any task management command times out, we fix this by
> clearing corresponding hba->outstanding_tasks bits during this flow.
> To achieve this, we need a mask to track timed-out commands and thus
> error handling flow can clear their tags specifically.
> 
> Stanley Chu (2):
>   scsi: ufs: Make new function for clearing outstanding task bits
>   scsi: ufs: Fix broken hba->outstanding_tasks
> 
>  drivers/scsi/ufs/ufshcd.c | 49 +++++++++++++++++++++++++++++++++------
>  drivers/scsi/ufs/ufshcd.h |  1 +
>  2 files changed, 43 insertions(+), 7 deletions(-)
> 
> --
> 2.18.0
Avri Altman July 21, 2019, 12:52 p.m. UTC | #3
> 
> Hi,
> 
> >
> > Currently bits in hba->outstanding_tasks are cleared only after their
> > corresponding task management commands are successfully done by
> > __ufshcd_issue_tm_cmd().
> >
> > If timeout happens in a task management command, its corresponding
> > bit in hba->outstanding_tasks will not be cleared until next task
> > management command with the same tag used successfully finishes.‧
> ufshcd_clear_tm_cmd is also called as part of ufshcd_err_handler.
> Does this change something in your assumptions?
And BTW there is a specific __clear_bit in __ufshcd_issue_tm_cmd() in case of a TO.

> 
> Thanks,
> Avri
>
Avri Altman July 22, 2019, 6:10 a.m. UTC | #4
> 
> >
> > Hi,
> >
> > >
> > > Currently bits in hba->outstanding_tasks are cleared only after their
> > > corresponding task management commands are successfully done by
> > > __ufshcd_issue_tm_cmd().
> > >
> > > If timeout happens in a task management command, its corresponding
> > > bit in hba->outstanding_tasks will not be cleared until next task
> > > management command with the same tag used successfully finishes.‧
> > ufshcd_clear_tm_cmd is also called as part of ufshcd_err_handler.
> > Does this change something in your assumptions?
> And BTW there is a specific __clear_bit in __ufshcd_issue_tm_cmd() in case
> of a TO.

Gave it another look - 
If indeed this bit isn't cleared as part of the error flow that the timeout triggers,
I think you should relate to ufshcd_clear_tm_cmd specifically in your commit log - 
Because this is the obvious place where the bit cleanup should take place.

Also the fix should be much more intuitive IMO - 
Today we do __clear_bit() on success, ufshcd_clear_tm_cmd() on error,
And also ufshcd_put_tm_slot() either way?

Maybe you can choose a single place to clear it, without any additional code?

Thanks,
Avri
Stanley Chu July 24, 2019, 5:08 a.m. UTC | #5
Hi Avri,

On Mon, 2019-07-22 at 06:10 +0000, Avri Altman wrote:
> > 
> > >
> > > Hi,
> > >
> > > >
> > > > Currently bits in hba->outstanding_tasks are cleared only after their
> > > > corresponding task management commands are successfully done by
> > > > __ufshcd_issue_tm_cmd().
> > > >
> > > > If timeout happens in a task management command, its corresponding
> > > > bit in hba->outstanding_tasks will not be cleared until next task
> > > > management command with the same tag used successfully finishes.‧
> > > ufshcd_clear_tm_cmd is also called as part of ufshcd_err_handler.
> > > Does this change something in your assumptions?
> > And BTW there is a specific __clear_bit in __ufshcd_issue_tm_cmd() in case
> > of a TO.
> 
> Gave it another look - 
> If indeed this bit isn't cleared as part of the error flow that the timeout triggers,
> I think you should relate to ufshcd_clear_tm_cmd specifically in your commit log - 
> Because this is the obvious place where the bit cleanup should take place.
> 
> Also the fix should be much more intuitive IMO - 
> Today we do __clear_bit() on success, ufshcd_clear_tm_cmd() on error,
> And also ufshcd_put_tm_slot() either way?
> 
> Maybe you can choose a single place to clear it, without any additional code?

ufshcd_clear_tm_cmd() is similar to ufshcd_clear_cmd() which tries to
write timed-out bit in "clear register". They do not clean bits in both
outstanding masks.

Another reason to not to do outstanding bits cleanup in
ufshcd_clear_tm_cmd() is: if host is unable to clear TM command by
setting "clear register", the TM command may be still "outstanding" in
the device. In this case, it may be better to do cleanup after reset is
done. Cleanup includes bits in hba->outstanding_tasks and
hba->tm_slots_in_use which is possibly cleaned too early by
ufshcd_put_tm_slot() if command is timed-out now.

Referring to error handling flow of hba->outstanding_reqs, all timed-out
bits will be cleaned by ufshcd_reset_and_restore() =>
ufshcd_transfer_req_compl() after reset is done. Similar handling for
hba->outstanding_tasks could be applied, i.e., do cleanup by
ufshcd_reset_and_restore() => ufshcd_tmc_handler().

The next thing is what you suggested: How to make the fix more
intuitive. Patchset v2 will be uploaded for review: It tries to
"re-factor" cleanup jobs first, and then add fixed flow to make the
whole patch more readable.

One more thing, above description and flow is done through UFSHCD SCSI
error handling routines registered with SCSI Midlayer. For TM command
timeout happening in bsg path without error handling triggered by SCSI
layer, we may need to consider to handle those tasks in future patches.

> 
> Thanks,
> Avri

Thanks,
Stanley

> _______________________________________________
> Linux-mediatek mailing list
> Linux-mediatek@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-mediatek
Avri Altman July 24, 2019, 7:09 a.m. UTC | #6
Stanley,

> 
> Hi Avri,
> 
> On Mon, 2019-07-22 at 06:10 +0000, Avri Altman wrote:
> > >
> > > >
> > > > Hi,
> > > >
> > > > >
> > > > > Currently bits in hba->outstanding_tasks are cleared only after their
> > > > > corresponding task management commands are successfully done
> by
> > > > > __ufshcd_issue_tm_cmd().
> > > > >
> > > > > If timeout happens in a task management command, its
> corresponding
> > > > > bit in hba->outstanding_tasks will not be cleared until next task
> > > > > management command with the same tag used successfully
> finishes.‧
> > > > ufshcd_clear_tm_cmd is also called as part of ufshcd_err_handler.
> > > > Does this change something in your assumptions?
> > > And BTW there is a specific __clear_bit in __ufshcd_issue_tm_cmd() in
> case
> > > of a TO.
> >
> > Gave it another look -
> > If indeed this bit isn't cleared as part of the error flow that the timeout
> triggers,
> > I think you should relate to ufshcd_clear_tm_cmd specifically in your
> commit log -
> > Because this is the obvious place where the bit cleanup should take
> place.
> >
> > Also the fix should be much more intuitive IMO -
> > Today we do __clear_bit() on success, ufshcd_clear_tm_cmd() on error,
> > And also ufshcd_put_tm_slot() either way?
> >
> > Maybe you can choose a single place to clear it, without any additional
> code?
> 
> ufshcd_clear_tm_cmd() is similar to ufshcd_clear_cmd() which tries to
> write timed-out bit in "clear register". They do not clean bits in both
> outstanding masks.
> 
> Another reason to not to do outstanding bits cleanup in
> ufshcd_clear_tm_cmd() is: if host is unable to clear TM command by
> setting "clear register", the TM command may be still "outstanding" in
> the device. In this case, it may be better to do cleanup after reset is
> done. Cleanup includes bits in hba->outstanding_tasks and
> hba->tm_slots_in_use which is possibly cleaned too early by
> ufshcd_put_tm_slot() if command is timed-out now.
> 
> Referring to error handling flow of hba->outstanding_reqs, all timed-out
> bits will be cleaned by ufshcd_reset_and_restore() =>
> ufshcd_transfer_req_compl() after reset is done. Similar handling for
> hba->outstanding_tasks could be applied, i.e., do cleanup by
> ufshcd_reset_and_restore() => ufshcd_tmc_handler().
Fair enough.  Thank you for the detailed explanation.

> 
> The next thing is what you suggested: How to make the fix more
> intuitive. Patchset v2 will be uploaded for review: It tries to
> "re-factor" cleanup jobs first, and then add fixed flow to make the
> whole patch more readable.
> 
> One more thing, above description and flow is done through UFSHCD SCSI
> error handling routines registered with SCSI Midlayer. For TM command
> timeout happening in bsg path without error handling triggered by SCSI
> layer, we may need to consider to handle those tasks in future patches.
Please do.

Thanks,
Avri