diff mbox series

[v1,1/1] scsi: Synchronize request queue PM status only on successful resume

Message ID 1546410308-13486-3-git-send-email-stanley.chu@mediatek.com (mailing list archive)
State Superseded
Headers show
Series scsi: Synchronize request queue PM status only on successful resume | expand

Commit Message

Stanley Chu Jan. 2, 2019, 6:25 a.m. UTC
From: Stanley Chu <stanley.chu@mediatek.com>

The commit 356fd2663cff ("scsi: Set request queue runtime PM status
back to active on resume") fixed up the inconsistent RPM status between
request queue and device. However changing request queue RPM status
shall be done only on successful resume, otherwise status may be still
inconsistent as below,

Request queue: RPM_ACTIVE
Device: RPM_SUSPENDED

This ends up soft lockup because requests can be submitted to
underlying devices but those devices and their required resource
are not resumed.

Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
---
 drivers/scsi/scsi_pm.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

Comments

Bart Van Assche Jan. 2, 2019, 4:15 p.m. UTC | #1
On Wed, 2019-01-02 at 14:25 +0800, stanley.chu@mediatek.com wrote:
> From: Stanley Chu <stanley.chu@mediatek.com>
> 
> The commit 356fd2663cff ("scsi: Set request queue runtime PM status
> back to active on resume") fixed up the inconsistent RPM status between
> request queue and device. However changing request queue RPM status
> shall be done only on successful resume, otherwise status may be still
> inconsistent as below,
> 
> Request queue: RPM_ACTIVE
> Device: RPM_SUSPENDED
> 
> This ends up soft lockup because requests can be submitted to
> underlying devices but those devices and their required resource
> are not resumed.
> 
> Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>

Please add "Fixes:" and "Cc: stable" tags and also Cc the author of commit
356fd2663cff.


> ---
>  drivers/scsi/scsi_pm.c | 24 ++++++++++++++----------
>  1 file changed, 14 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
> index a2b4179..eff3e59 100644
> --- a/drivers/scsi/scsi_pm.c
> +++ b/drivers/scsi/scsi_pm.c
> @@ -82,6 +82,20 @@ static int scsi_dev_type_resume(struct device *dev,
>  		pm_runtime_disable(dev);
>  		pm_runtime_set_active(dev);
>  		pm_runtime_enable(dev);
> +
> +		/*
> +		 * Forcibly set runtime PM status of request queue to "active"
> +		 * to make sure we can again get requests from the queue
> +		 * (see also blk_pm_peek_request()).
> +		 *
> +		 * The resume hook will correct runtime PM status of the disk.
> +		 */
> +		if (!err && scsi_is_sdev_device(dev)) {
> +			struct scsi_device *sdev = to_scsi_device(dev);
> +
> +			if (sdev->request_queue->dev)
> +				blk_set_runtime_active(sdev->request_queue);
> +		}

What makes you think that the sdev->request_queue->dev test is necessary? The
scsi_dev_type_resume() function is only called after blk_pm_runtime_init() has
finished so I don't think that test is necessary.

Additionally, since the above code occurs inside a block controlled by an
"if (err == 0)" statement, I think the !err test is redundant and should be
left out.

Thanks,

Bart.
Stanley Chu Jan. 3, 2019, 6:38 a.m. UTC | #2
Hi Bart,

On Wed, 2019-01-02 at 08:15 -0800, Bart Van Assche wrote:
> On Wed, 2019-01-02 at 14:25 +0800, stanley.chu@mediatek.com wrote:
> > From: Stanley Chu <stanley.chu@mediatek.com>
> > 
> > The commit 356fd2663cff ("scsi: Set request queue runtime PM status
> > back to active on resume") fixed up the inconsistent RPM status between
> > request queue and device. However changing request queue RPM status
> > shall be done only on successful resume, otherwise status may be still
> > inconsistent as below,
> > 
> > Request queue: RPM_ACTIVE
> > Device: RPM_SUSPENDED
> > 
> > This ends up soft lockup because requests can be submitted to
> > underlying devices but those devices and their required resource
> > are not resumed.
> > 
> > Signed-off-by: Stanley Chu <stanley.chu@mediatek.com>
> 
> Please add "Fixes:" and "Cc: stable" tags and also Cc the author of commit
> 356fd2663cff.

Sure. Thanks for remind.

> 
> 
> > ---
> >  drivers/scsi/scsi_pm.c | 24 ++++++++++++++----------
> >  1 file changed, 14 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
> > index a2b4179..eff3e59 100644
> > --- a/drivers/scsi/scsi_pm.c
> > +++ b/drivers/scsi/scsi_pm.c
> > @@ -82,6 +82,20 @@ static int scsi_dev_type_resume(struct device *dev,
> >  		pm_runtime_disable(dev);
> >  		pm_runtime_set_active(dev);
> >  		pm_runtime_enable(dev);
> > +
> > +		/*
> > +		 * Forcibly set runtime PM status of request queue to "active"
> > +		 * to make sure we can again get requests from the queue
> > +		 * (see also blk_pm_peek_request()).
> > +		 *
> > +		 * The resume hook will correct runtime PM status of the disk.
> > +		 */
> > +		if (!err && scsi_is_sdev_device(dev)) {
> > +			struct scsi_device *sdev = to_scsi_device(dev);
> > +
> > +			if (sdev->request_queue->dev)
> > +				blk_set_runtime_active(sdev->request_queue);
> > +		}
> 
> What makes you think that the sdev->request_queue->dev test is necessary? The
> scsi_dev_type_resume() function is only called after blk_pm_runtime_init() has
> finished so I don't think that test is necessary.

We found NULL sdev->request_queue->dev may be dereferenced during below
system resume flow,

scsi_bus_resume_common()
 => async_schedule_domain(async_sdev_resume)

And then async_sdev_resume() is invoked asynchronously,
 
async_sdev_resume()
 => scsi_dev_type_resume(dev, do_scsi_resume)
  => blk_set_runtime_active(sdev->request_queue)

If a SCSI device does not have upper layer driver (like SCSI disk), it
may not be applied blk_pm_runtime_init() invoked by sd_probe() while
this SCSI device is added.

For example, some SCSI devices (like UFS Boot W-LUN) are added
explicitly in __scsi_add_device() by ufshcd_scsi_add_wlus() first and
thus sd_probe() for them is skipped because they are already visible.

For those SCSI devices, null sdev->request_queue->dev will be
dereferenced in blk_set_runtime_active() during above system resume
flow, therefore we add a null pointer checking for this case.

The same issue also happens on those SCSI devices before this patch as
below system resume flow while devices are already runtime-suspended.

scsi_bus_resume_common()
 => blk_set_runtime_active(to_scsi_device(dev)->request_queue)

> 
> Additionally, since the above code occurs inside a block controlled by an
> "if (err == 0)" statement, I think the !err test is redundant and should be
> left out.

Sorry this is my code merge defect.
"err" here shall be returned value from pm_runtime_set_active().

I will fix it in v2.

> 
> Thanks,
> 
> Bart.
> 
> _______________________________________________
> Linux-mediatek mailing list
> Linux-mediatek@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-mediatek
Bart Van Assche Jan. 3, 2019, 11:15 p.m. UTC | #3
On Thu, 2019-01-03 at 14:38 +0800, Stanley Chu wrote:
> We found NULL sdev->request_queue->dev may be dereferenced during below
> system resume flow,
> 
> scsi_bus_resume_common()
>  => async_schedule_domain(async_sdev_resume)
> 
> And then async_sdev_resume() is invoked asynchronously,
>  
> async_sdev_resume()
>  => scsi_dev_type_resume(dev, do_scsi_resume)
>   => blk_set_runtime_active(sdev->request_queue)
> 
> If a SCSI device does not have upper layer driver (like SCSI disk), it
> may not be applied blk_pm_runtime_init() invoked by sd_probe() while
> this SCSI device is added.
> 
> For example, some SCSI devices (like UFS Boot W-LUN) are added
> explicitly in __scsi_add_device() by ufshcd_scsi_add_wlus() first and
> thus sd_probe() for them is skipped because they are already visible.
> 
> For those SCSI devices, null sdev->request_queue->dev will be
> dereferenced in blk_set_runtime_active() during above system resume
> flow, therefore we add a null pointer checking for this case.
> 
> The same issue also happens on those SCSI devices before this patch as
> below system resume flow while devices are already runtime-suspended.
> 
> scsi_bus_resume_common()
>  => blk_set_runtime_active(to_scsi_device(dev)->request_queue)

Hi Stanley,

Thanks, this is helpful information. If you would have to repost your
patch please add a comment that refers to the __scsi_add_device() calls
in the UFS driver.

Bart.
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_pm.c b/drivers/scsi/scsi_pm.c
index a2b4179..eff3e59 100644
--- a/drivers/scsi/scsi_pm.c
+++ b/drivers/scsi/scsi_pm.c
@@ -82,6 +82,20 @@  static int scsi_dev_type_resume(struct device *dev,
 		pm_runtime_disable(dev);
 		pm_runtime_set_active(dev);
 		pm_runtime_enable(dev);
+
+		/*
+		 * Forcibly set runtime PM status of request queue to "active"
+		 * to make sure we can again get requests from the queue
+		 * (see also blk_pm_peek_request()).
+		 *
+		 * The resume hook will correct runtime PM status of the disk.
+		 */
+		if (!err && scsi_is_sdev_device(dev)) {
+			struct scsi_device *sdev = to_scsi_device(dev);
+
+			if (sdev->request_queue->dev)
+				blk_set_runtime_active(sdev->request_queue);
+		}
 	}
 
 	return err;
@@ -140,16 +154,6 @@  static int scsi_bus_resume_common(struct device *dev,
 	else
 		fn = NULL;
 
-	/*
-	 * Forcibly set runtime PM status of request queue to "active" to
-	 * make sure we can again get requests from the queue (see also
-	 * blk_pm_peek_request()).
-	 *
-	 * The resume hook will correct runtime PM status of the disk.
-	 */
-	if (scsi_is_sdev_device(dev) && pm_runtime_suspended(dev))
-		blk_set_runtime_active(to_scsi_device(dev)->request_queue);
-
 	if (fn) {
 		async_schedule_domain(fn, dev, &scsi_sd_pm_domain);