[v3] ufs: core: wlun resume SSU(Acitve) fail recovery

Message ID	20221228060157.24852-1-peter.wang@mediatek.com (mailing list archive)
State	Changes Requested
Headers	show Return-Path: <linux-scsi-owner@vger.kernel.org> From: <peter.wang@mediatek.com> To: <stanley.chu@mediatek.com>, <linux-scsi@vger.kernel.org>, <martin.petersen@oracle.com>, <avri.altman@wdc.com>, <alim.akhtar@samsung.com>, <jejb@linux.ibm.com> CC: <wsd_upstream@mediatek.com>, <linux-mediatek@lists.infradead.org>, <peter.wang@mediatek.com>, <chun-hung.wu@mediatek.com>, <alice.chao@mediatek.com>, <cc.chou@mediatek.com>, <chaotian.jing@mediatek.com>, <jiajie.hao@mediatek.com>, <powen.kao@mediatek.com>, <qilin.tan@mediatek.com>, <lin.gui@mediatek.com>, <tun-yu.yu@mediatek.com>, <eddie.huang@mediatek.com>, <naomi.chu@mediatek.com> Subject: [PATCH v3] ufs: core: wlun resume SSU(Acitve) fail recovery Date: Wed, 28 Dec 2022 14:01:57 +0800 Message-ID: <20221228060157.24852-1-peter.wang@mediatek.com> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk
Series	[v3] ufs: core: wlun resume SSU(Acitve) fail recovery \| expand [v3] ufs: core: wlun resume SSU(Acitve) fail recovery

Message ID

20221228060157.24852-1-peter.wang@mediatek.com (mailing list archive)

State

Changes Requested

Headers

From: <peter.wang@mediatek.com>
To: <stanley.chu@mediatek.com>, <linux-scsi@vger.kernel.org>,
        <martin.petersen@oracle.com>, <avri.altman@wdc.com>,
        <alim.akhtar@samsung.com>, <jejb@linux.ibm.com>
CC: <wsd_upstream@mediatek.com>, <linux-mediatek@lists.infradead.org>,
        <peter.wang@mediatek.com>, <chun-hung.wu@mediatek.com>,
        <alice.chao@mediatek.com>, <cc.chou@mediatek.com>,
        <chaotian.jing@mediatek.com>, <jiajie.hao@mediatek.com>,
        <powen.kao@mediatek.com>, <qilin.tan@mediatek.com>,
        <lin.gui@mediatek.com>, <tun-yu.yu@mediatek.com>,
        <eddie.huang@mediatek.com>, <naomi.chu@mediatek.com>
Subject: [PATCH v3] ufs: core: wlun resume SSU(Acitve) fail recovery
Date: Wed, 28 Dec 2022 14:01:57 +0800
Message-ID: <20221228060157.24852-1-peter.wang@mediatek.com>
MIME-Version: 1.0
Content-Type: text/plain
Precedence: bulk

Series

[v3] ufs: core: wlun resume SSU(Acitve) fail recovery | expand

Commit Message

Peter Wang (王信友) Dec. 28, 2022, 6:01 a.m. UTC

From: Peter Wang <peter.wang@mediatek.com>

When wlun resume SSU(Active) timeout, scsi try eh_host_reset_handler.
But ufshcd_eh_host_reset_handler hang at wait flush_work(&hba->eh_work).
And ufshcd_err_handler hang at wait rpm resume.
Do link recovery only in this case. Below is IO hang stack dump.

<ffffffdd78e02b34> schedule+0x110/0x204
<ffffffdd78e0be60> schedule_timeout+0x98/0x138
<ffffffdd78e040e8> wait_for_common_io+0x130/0x2d0
<ffffffdd77d6a000> blk_execute_rq+0x10c/0x16c
<ffffffdd78126d90> __scsi_execute+0xfc/0x278
<ffffffdd7813891c> ufshcd_set_dev_pwr_mode+0x1c8/0x40c
<ffffffdd78137d1c> __ufshcd_wl_resume+0xf0/0x5cc
<ffffffdd78137ae0> ufshcd_wl_runtime_resume+0x40/0x18c
<ffffffdd78136108> scsi_runtime_resume+0x88/0x104
<ffffffdd7809a4f8> __rpm_callback+0x1a0/0xaec
<ffffffdd7809b624> rpm_resume+0x7e0/0xcd0
<ffffffdd7809a788> __rpm_callback+0x430/0xaec
<ffffffdd7809b644> rpm_resume+0x800/0xcd0
<ffffffdd780a0778> pm_runtime_work+0x148/0x198

<ffffffdd78e02b34> schedule+0x110/0x204
<ffffffdd78e0be10> schedule_timeout+0x48/0x138
<ffffffdd78e03d9c> wait_for_common+0x144/0x2dc
<ffffffdd7758bba4> __flush_work+0x3d0/0x508
<ffffffdd7815572c> ufshcd_eh_host_reset_handler+0x134/0x3a8
<ffffffdd781216f4> scsi_try_host_reset+0x54/0x204
<ffffffdd78120594> scsi_eh_ready_devs+0xb30/0xd48
<ffffffdd7812373c> scsi_error_handler+0x260/0x874

<ffffffdd78e02b34> schedule+0x110/0x204
<ffffffdd7809af64> rpm_resume+0x120/0xcd0
<ffffffdd7809fde8> __pm_runtime_resume+0xa0/0x17c
<ffffffdd7815193c> ufshcd_err_handling_prepare+0x40/0x430
<ffffffdd7814cce8> ufshcd_err_handler+0x1c4/0xd4c

Signed-off-by: Peter Wang <peter.wang@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Comments

Bart Van Assche Jan. 3, 2023, 12:29 a.m. UTC | #1

On 12/27/22 22:01, peter.wang@mediatek.com wrote:
> When wlun resume SSU(Active) timeout, scsi try eh_host_reset_handler.
                    ^^^^^^^^^^^
Please use the same spelling in the patch subject (Acitve -> Active).

timeout -> times out
scsi try -> the SCSI core invokes

> But ufshcd_eh_host_reset_handler hang at wait flush_work(&hba->eh_work).

hang at -> hangs in

> And ufshcd_err_handler hang at wait rpm resume.

hang at wait rpm resume -> hangs in rpm_resume().

 > <ffffffdd78e02b34> schedule+0x110/0x204
 > <ffffffdd78e0be60> schedule_timeout+0x98/0x138
 > <ffffffdd78e040e8> wait_for_common_io+0x130/0x2d0
 > <ffffffdd77d6a000> blk_execute_rq+0x10c/0x16c
 > <ffffffdd78126d90> __scsi_execute+0xfc/0x278
 > <ffffffdd7813891c> ufshcd_set_dev_pwr_mode+0x1c8/0x40c
 > <ffffffdd78137d1c> __ufshcd_wl_resume+0xf0/0x5cc
 > <ffffffdd78137ae0> ufshcd_wl_runtime_resume+0x40/0x18c
 > <ffffffdd78136108> scsi_runtime_resume+0x88/0x104
 > <ffffffdd7809a4f8> __rpm_callback+0x1a0/0xaec
 > <ffffffdd7809b624> rpm_resume+0x7e0/0xcd0
 > <ffffffdd7809a788> __rpm_callback+0x430/0xaec
 > <ffffffdd7809b644> rpm_resume+0x800/0xcd0
 > <ffffffdd780a0778> pm_runtime_work+0x148/0x198
 >
 > <ffffffdd78e02b34> schedule+0x110/0x204
 > <ffffffdd78e0be10> schedule_timeout+0x48/0x138
 > <ffffffdd78e03d9c> wait_for_common+0x144/0x2dc
 > <ffffffdd7758bba4> __flush_work+0x3d0/0x508
 > <ffffffdd7815572c> ufshcd_eh_host_reset_handler+0x134/0x3a8
 > <ffffffdd781216f4> scsi_try_host_reset+0x54/0x204
 > <ffffffdd78120594> scsi_eh_ready_devs+0xb30/0xd48
 > <ffffffdd7812373c> scsi_error_handler+0x260/0x874
 >
 > <ffffffdd78e02b34> schedule+0x110/0x204
 > <ffffffdd7809af64> rpm_resume+0x120/0xcd0
 > <ffffffdd7809fde8> __pm_runtime_resume+0xa0/0x17c
 > <ffffffdd7815193c> ufshcd_err_handling_prepare+0x40/0x430
 > <ffffffdd7814cce8> ufshcd_err_handler+0x1c4/0xd4c

On top of which kernel version has this patch been developed?
I think this deadlock has already been fixed by commit 7029e2151a7c 
("scsi: ufs: Fix a deadlock between PM and the SCSI error handler").
Please check whether that commit by itself (without this patch) is 
sufficient to fix the reported deadlock.

> ---
>   drivers/ufs/core/ufshcd.c | 17 +++++++++++++++++
>   1 file changed, 17 insertions(+)

The changelog is missing. Please include a changelog when posting v2 or 
a later version of a patch.

> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index e18c9f4463ec..0dfb9a35bf66 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -7366,6 +7366,23 @@ static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
>   
>   	hba = shost_priv(cmd->device->host);
>   
> +	/*
> +	 * If pm op resume fail and wait err recovery, do link recovery only.
> +	 * Because schedule eh work will get dead lock in ufshcd_rpm_get_sync
> +	 * and wait wlun resume, but wlun resume error wait eh work finish.
> +	 */

The above comment has grammar issues and some parts are 
incomprehensible. What does e.g. "wait err recovery" mean? Please 
improve this source code comment.

Thanks,

Bart.

Peter Wang (王信友) Sept. 22, 2023, 9:08 a.m. UTC | #2

On Mon, 2023-01-02 at 16:29 -0800, Bart Van Assche wrote:
> On 12/27/22 22:01, peter.wang@mediatek.com wrote:
> > When wlun resume SSU(Active) timeout, scsi try
> > eh_host_reset_handler.
> 
>                     ^^^^^^^^^^^
> Please use the same spelling in the patch subject (Acitve -> Active).
> 
> timeout -> times out
> scsi try -> the SCSI core invokes
> 
> > But ufshcd_eh_host_reset_handler hang at wait flush_work(&hba-
> > >eh_work).
> 
> hang at -> hangs in
> 
> > And ufshcd_err_handler hang at wait rpm resume.
> 
> hang at wait rpm resume -> hangs in rpm_resume().
> 
>  > <ffffffdd78e02b34> schedule+0x110/0x204
>  > <ffffffdd78e0be60> schedule_timeout+0x98/0x138
>  > <ffffffdd78e040e8> wait_for_common_io+0x130/0x2d0
>  > <ffffffdd77d6a000> blk_execute_rq+0x10c/0x16c
>  > <ffffffdd78126d90> __scsi_execute+0xfc/0x278
>  > <ffffffdd7813891c> ufshcd_set_dev_pwr_mode+0x1c8/0x40c
>  > <ffffffdd78137d1c> __ufshcd_wl_resume+0xf0/0x5cc
>  > <ffffffdd78137ae0> ufshcd_wl_runtime_resume+0x40/0x18c
>  > <ffffffdd78136108> scsi_runtime_resume+0x88/0x104
>  > <ffffffdd7809a4f8> __rpm_callback+0x1a0/0xaec
>  > <ffffffdd7809b624> rpm_resume+0x7e0/0xcd0
>  > <ffffffdd7809a788> __rpm_callback+0x430/0xaec
>  > <ffffffdd7809b644> rpm_resume+0x800/0xcd0
>  > <ffffffdd780a0778> pm_runtime_work+0x148/0x198
>  >
>  > <ffffffdd78e02b34> schedule+0x110/0x204
>  > <ffffffdd78e0be10> schedule_timeout+0x48/0x138
>  > <ffffffdd78e03d9c> wait_for_common+0x144/0x2dc
>  > <ffffffdd7758bba4> __flush_work+0x3d0/0x508
>  > <ffffffdd7815572c> ufshcd_eh_host_reset_handler+0x134/0x3a8
>  > <ffffffdd781216f4> scsi_try_host_reset+0x54/0x204
>  > <ffffffdd78120594> scsi_eh_ready_devs+0xb30/0xd48
>  > <ffffffdd7812373c> scsi_error_handler+0x260/0x874
>  >
>  > <ffffffdd78e02b34> schedule+0x110/0x204
>  > <ffffffdd7809af64> rpm_resume+0x120/0xcd0
>  > <ffffffdd7809fde8> __pm_runtime_resume+0xa0/0x17c
>  > <ffffffdd7815193c> ufshcd_err_handling_prepare+0x40/0x430
>  > <ffffffdd7814cce8> ufshcd_err_handler+0x1c4/0xd4c
> 
> On top of which kernel version has this patch been developed?
> I think this deadlock has already been fixed by commit 7029e2151a7c 
> ("scsi: ufs: Fix a deadlock between PM and the SCSI error handler").
> Please check whether that commit by itself (without this patch) is 
> sufficient to fix the reported deadlock.
> 

Hi Bart, 

7029e2151a7c is not fix this dead lock, we still found this issue
happen in kernel-6.1
I will update this patch and correct comment.
Please have a look again.

Thanks.
Peter


> > ---
> >   drivers/ufs/core/ufshcd.c | 17 +++++++++++++++++
> >   1 file changed, 17 insertions(+)
> 
> The changelog is missing. Please include a changelog when posting v2
> or 
> a later version of a patch.
> 
> > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> > index e18c9f4463ec..0dfb9a35bf66 100644
> > --- a/drivers/ufs/core/ufshcd.c
> > +++ b/drivers/ufs/core/ufshcd.c
> > @@ -7366,6 +7366,23 @@ static int
> > ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
> >   
> >   	hba = shost_priv(cmd->device->host);
> >   
> > +	/*
> > +	 * If pm op resume fail and wait err recovery, do link recovery
> > only.
> > +	 * Because schedule eh work will get dead lock in
> > ufshcd_rpm_get_sync
> > +	 * and wait wlun resume, but wlun resume error wait eh work
> > finish.
> > +	 */
> 
> The above comment has grammar issues and some parts are 
> incomprehensible. What does e.g. "wait err recovery" mean? Please 
> improve this source code comment.
> 
> Thanks,
> 
> Bart.

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index e18c9f4463ec..0dfb9a35bf66 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -7366,6 +7366,23 @@  static int ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
 
 	hba = shost_priv(cmd->device->host);
 
+	/*
+	 * If pm op resume fail and wait err recovery, do link recovery only.
+	 * Because schedule eh work will get dead lock in ufshcd_rpm_get_sync
+	 * and wait wlun resume, but wlun resume error wait eh work finish.
+	 */
+	if (hba->pm_op_in_progress &&
+	    hba->curr_dev_pwr_mode != UFS_ACTIVE_PWR_MODE) {
+		err = ufshcd_link_recovery(hba);
+		if (err) {
+			dev_err(hba->dev, "%s: power state %d pm_progress %d",
+				__func__,
+				hba->curr_dev_pwr_mode,
+				hba->pm_op_in_progress);
+		}
+		return err;
+	}
+
 	spin_lock_irqsave(hba->host->host_lock, flags);
 	hba->force_reset = true;
 	ufshcd_schedule_eh_work(hba);

[v3] ufs: core: wlun resume SSU(Acitve) fail recovery

Commit Message

Comments

Patch