diff mbox series

[v1] ufs: core: bypass get rpm when err handling with pm_op_in_progress

Message ID 20220915115858.7642-1-peter.wang@mediatek.com (mailing list archive)
State Rejected, archived
Headers show
Series [v1] ufs: core: bypass get rpm when err handling with pm_op_in_progress | expand

Commit Message

Peter Wang (王信友) Sept. 15, 2022, 11:58 a.m. UTC
From: Peter Wang <peter.wang@mediatek.com>

If error happened in rpm flow, get rpm will stuck because rpm is
suspending or resuming. And it cause IO hang.
This patch bypass get rpm when err handling with pm_op_in_progress.

Signed-off-by: Peter Wang <peter.wang@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

Comments

Bart Van Assche Sept. 16, 2022, 9:39 p.m. UTC | #1
On 9/15/22 04:58, peter.wang@mediatek.com wrote:
> -static void ufshcd_err_handling_prepare(struct ufs_hba *hba)
> +static void ufshcd_err_handling_prepare(struct ufs_hba *hba, bool *rpm_put)
>   {
> -	ufshcd_rpm_get_sync(hba);
> +	if (!hba->pm_op_in_progress) {
> +		ufshcd_rpm_get_sync(hba);
> +		*rpm_put = true;
> +	}
> +

Hi Peter,

I don't think that this patch is sufficient. If 
ufshcd_err_handling_prepare() is called by the host reset handler 
(ufshcd_eh_host_reset_handler()) then the host state will be 
SHOST_RECOVERY. In that state SCSI command submission will hang and 
hence any ufshcd_rpm_get_sync() call will hang.

How about removing the ufshcd_rpm_get_sync() call from 
ufshcd_err_handling_prepare() and the ufshcd_rpm_put() call from 
ufshcd_err_handling_unprepare()? It is guaranteed that no commands are 
in progress for a runtime suspended LUN so the code for aborting pending 
requests in the UFS error handler will be skipped anyway if it is 
invoked for a runtime suspended device.

Thanks,

Bart.
Peter Wang (王信友) Sept. 19, 2022, 2:47 p.m. UTC | #2
On 9/17/22 05:39, Bart Van Assche wrote:
> On 9/15/22 04:58, peter.wang@mediatek.com wrote:
>> -static void ufshcd_err_handling_prepare(struct ufs_hba *hba)
>> +static void ufshcd_err_handling_prepare(struct ufs_hba *hba, bool 
>> *rpm_put)
>>   {
>> -    ufshcd_rpm_get_sync(hba);
>> +    if (!hba->pm_op_in_progress) {
>> +        ufshcd_rpm_get_sync(hba);
>> +        *rpm_put = true;
>> +    }
>> +
>
> Hi Peter,
>
> I don't think that this patch is sufficient. If 
> ufshcd_err_handling_prepare() is called by the host reset handler 
> (ufshcd_eh_host_reset_handler()) then the host state will be 
> SHOST_RECOVERY. In that state SCSI command submission will hang and 
> hence any ufshcd_rpm_get_sync() call will hang.
>
> How about removing the ufshcd_rpm_get_sync() call from 
> ufshcd_err_handling_prepare() and the ufshcd_rpm_put() call from 
> ufshcd_err_handling_unprepare()? It is guaranteed that no commands are 
> in progress for a runtime suspended LUN so the code for aborting 
> pending requests in the UFS error handler will be skipped anyway if it 
> is invoked for a runtime suspended device.
>
> Thanks,
>
> Bart.

Hi Bart,

If the scsi error happened and need do ufshcd_eh_host_reset_handler, the 
rpm state should in RPM_ACTIVE.
Because scsi need wakeup suspended LUN, and send command to LUN then get 
error, right?
So, ufshcd_rpm_get_sync should not hang in this case.

If remove ufshcd_rpm_get_sync directly, think about this case.
ufshcd_err_handler is on going and try to abort some task (which may get 
stuck and timeout too).
Then rpm count down try to suspend. Finally runtime suspend callback may 
return IO error and IO hang.

Thanks.
Peter
Bart Van Assche Sept. 19, 2022, 4:25 p.m. UTC | #3
On 9/19/22 07:47, Peter Wang wrote:
> If the scsi error happened and need do ufshcd_eh_host_reset_handler, the 
> rpm state should in RPM_ACTIVE.
> Because scsi need wakeup suspended LUN, and send command to LUN then get 
> error, right?

The following sequence may activate the SCSI error handler while the RPM 
state is RPM_RESUMING:
* The RPM state is RPM_SUSPENDED.
* The RPM state is changed into RPM_RESUMING and ufshcd_wl_resume() is 
called.
* ufshcd_set_dev_pwr_mode() calls scsi_execute() and the START STOP UNIT 
command times out.
* Because of this timeout the SCSI error handler is activated.

> If remove ufshcd_rpm_get_sync directly, think about this case.
> ufshcd_err_handler is on going and try to abort some task (which may get 
> stuck and timeout too).
> Then rpm count down try to suspend. Finally runtime suspend callback may 
> return IO error and IO hang.

Hmm ... suspending a UFS device involves calling ufshcd_wl_shutdown(), 
ufshcd_set_dev_pwr_mode() and scsi_execute(). scsi_execute() is 
serialized against the UFS error handler because the latter calls 
ufshcd_scsi_block_requests().

Thanks,

Bart.
Peter Wang (王信友) Sept. 20, 2022, 2 a.m. UTC | #4
On 9/20/22 00:25, Bart Van Assche wrote:
> On 9/19/22 07:47, Peter Wang wrote:
>> If the scsi error happened and need do ufshcd_eh_host_reset_handler, 
>> the rpm state should in RPM_ACTIVE.
>> Because scsi need wakeup suspended LUN, and send command to LUN then 
>> get error, right?
>
> The following sequence may activate the SCSI error handler while the 
> RPM state is RPM_RESUMING:
> * The RPM state is RPM_SUSPENDED.
> * The RPM state is changed into RPM_RESUMING and ufshcd_wl_resume() is 
> called.
> * ufshcd_set_dev_pwr_mode() calls scsi_execute() and the START STOP 
> UNIT command times out.
> * Because of this timeout the SCSI error handler is activated.

This case will not get rpm, because pm_op_in_progress is true.

So it won't hang with ufshcd_rpm_get_sync.


Thanks

Peter
Bart Van Assche Sept. 20, 2022, 6:25 p.m. UTC | #5
On 9/19/22 19:00, Peter Wang wrote:
> 
> On 9/20/22 00:25, Bart Van Assche wrote:
>> On 9/19/22 07:47, Peter Wang wrote:
>>> If the scsi error happened and need do ufshcd_eh_host_reset_handler, 
>>> the rpm state should in RPM_ACTIVE.
>>> Because scsi need wakeup suspended LUN, and send command to LUN then 
>>> get error, right?
>>
>> The following sequence may activate the SCSI error handler while the 
>> RPM state is RPM_RESUMING:
>> * The RPM state is RPM_SUSPENDED.
>> * The RPM state is changed into RPM_RESUMING and ufshcd_wl_resume() is 
>> called.
>> * ufshcd_set_dev_pwr_mode() calls scsi_execute() and the START STOP 
>> UNIT command times out.
>> * Because of this timeout the SCSI error handler is activated.
> 
> This case will not get rpm, because pm_op_in_progress is true.
> 
> So it won't hang with ufshcd_rpm_get_sync.

Right, but I think the following scenario will result in a hang:
* The RPM state is changed from RPM_SUSPENDED into RPM_RESUMING and
   ufshcd_wl_resume() has not yet been called.
* ufshcd_eh_host_reset_handler() queues ufshcd_err_handler() and the
   latter function calls ufshcd_rpm_get_sync().
* This results in a deadlock: the scsi_execute() call by
   ufshcd_wl_resume() cannot make progress because the SCSI host state is
   SHOST_RECOVERY and the error handler cannot make progress because it
   keeps waiting until ufshcd_rpm_get_sync() has finished.

Thanks,

Bart.
diff mbox series

Patch

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index a202d7d5240d..cc58fb585df2 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -6086,9 +6086,13 @@  static void ufshcd_clk_scaling_suspend(struct ufs_hba *hba, bool suspend)
 	}
 }
 
-static void ufshcd_err_handling_prepare(struct ufs_hba *hba)
+static void ufshcd_err_handling_prepare(struct ufs_hba *hba, bool *rpm_put)
 {
-	ufshcd_rpm_get_sync(hba);
+	if (!hba->pm_op_in_progress) {
+		ufshcd_rpm_get_sync(hba);
+		*rpm_put = true;
+	}
+
 	if (pm_runtime_status_suspended(&hba->ufs_device_wlun->sdev_gendev) ||
 	    hba->is_sys_suspended) {
 		enum ufs_pm_op pm_op;
@@ -6122,13 +6126,14 @@  static void ufshcd_err_handling_prepare(struct ufs_hba *hba)
 	cancel_work_sync(&hba->eeh_work);
 }
 
-static void ufshcd_err_handling_unprepare(struct ufs_hba *hba)
+static void ufshcd_err_handling_unprepare(struct ufs_hba *hba, bool rpm_put)
 {
 	ufshcd_scsi_unblock_requests(hba);
 	ufshcd_release(hba);
 	if (ufshcd_is_clkscaling_supported(hba))
 		ufshcd_clk_scaling_suspend(hba, false);
-	ufshcd_rpm_put(hba);
+	if (rpm_put)
+		ufshcd_rpm_put(hba);
 }
 
 static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba)
@@ -6210,6 +6215,7 @@  static void ufshcd_err_handler(struct work_struct *work)
 	bool err_tm;
 	int pmc_err;
 	int tag;
+	bool rpm_put = false;
 
 	hba = container_of(work, struct ufs_hba, eh_work);
 
@@ -6231,7 +6237,7 @@  static void ufshcd_err_handler(struct work_struct *work)
 	}
 	ufshcd_set_eh_in_progress(hba);
 	spin_unlock_irqrestore(hba->host->host_lock, flags);
-	ufshcd_err_handling_prepare(hba);
+	ufshcd_err_handling_prepare(hba, &rpm_put);
 	/* Complete requests that have door-bell cleared by h/w */
 	ufshcd_complete_requests(hba);
 	spin_lock_irqsave(hba->host->host_lock, flags);
@@ -6394,7 +6400,7 @@  static void ufshcd_err_handler(struct work_struct *work)
 	}
 	ufshcd_clear_eh_in_progress(hba);
 	spin_unlock_irqrestore(hba->host->host_lock, flags);
-	ufshcd_err_handling_unprepare(hba);
+	ufshcd_err_handling_unprepare(hba, rpm_put);
 	up(&hba->host_sem);
 
 	dev_info(hba->dev, "%s finished; HBA state %s\n", __func__,