diff mbox series

scsi: storvsc: Fix a panic in the hibernation procedure

Message ID 1587514644-47058-1-git-send-email-decui@microsoft.com (mailing list archive)
State Rejected
Headers show
Series scsi: storvsc: Fix a panic in the hibernation procedure | expand

Commit Message

Dexuan Cui April 22, 2020, 12:17 a.m. UTC
During hibernation, the sdevs are suspended automatically in
drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
storvsc_suspend(), there is no disk I/O from the file systems, but there
can still be disk I/O from the kernel space, e.g. disk_check_events() ->
sr_block_check_events() -> cdrom_check_events() can still submit I/O
to the storvsc driver, which causes a panic due to a NULL pointer dereference,
since storvsc has closed the vmbus channel in storvsc_suspend(): refer
to the below links for more info:
  https://lkml.org/lkml/2020/4/10/47
  https://lkml.org/lkml/2020/4/17/1103

Fix the panic by blocking/unblocking all the I/O queues properly.

Note: this patch depends on another patch "scsi: core: Allow the state
change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).

Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/scsi/storvsc_drv.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Ming Lei April 22, 2020, 1:28 a.m. UTC | #1
On Tue, Apr 21, 2020 at 05:17:24PM -0700, Dexuan Cui wrote:
> During hibernation, the sdevs are suspended automatically in
> drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> storvsc_suspend(), there is no disk I/O from the file systems, but there
> can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> sr_block_check_events() -> cdrom_check_events() can still submit I/O
> to the storvsc driver, which causes a panic due to a NULL pointer dereference,
> since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> to the below links for more info:
>   https://lkml.org/lkml/2020/4/10/47
>   https://lkml.org/lkml/2020/4/17/1103
> 
> Fix the panic by blocking/unblocking all the I/O queues properly.
> 
> Note: this patch depends on another patch "scsi: core: Allow the state
> change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
> 
> Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/scsi/storvsc_drv.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index fb41636519ee..fd51d2f03778 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
>  	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
>  	struct Scsi_Host *host = stor_device->host;
>  	struct hv_host_device *host_dev = shost_priv(host);
> +	int ret;
> +
> +	ret = scsi_host_block(host);
> +	if (ret)
> +		return ret;
>  
>  	storvsc_wait_to_drain(stor_device);
>  
> @@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)
>  
>  static int storvsc_resume(struct hv_device *hv_dev)
>  {
> +	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> +	struct Scsi_Host *host = stor_device->host;
>  	int ret;
>  
>  	ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
>  				     hv_dev_is_fc(hv_dev));
> +	if (!ret)
> +		ret = scsi_host_unblock(host, SDEV_RUNNING);
> +
>  	return ret;
>  }

scsi_host_block() is actually too heavy for just avoiding
SCSI internal commands, which can be done simply with one
atomic variable.
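
A minimal sketch of that atomic-variable approach (the flag and function
names here are hypothetical, not code from this thread):

#include <linux/atomic.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

/* Hypothetical per-host flag; nonzero while the host is suspended. */
static atomic_t host_suspended = ATOMIC_INIT(0);

static int lld_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
{
	/*
	 * One atomic_read() on the fast path: while suspended, hand the
	 * command back to the midlayer instead of touching a torn-down
	 * channel.
	 */
	if (atomic_read(&host_suspended))
		return SCSI_MLQUEUE_HOST_BUSY;

	/* ... normal submission path ... */
	return 0;
}

static int lld_suspend(void)
{
	atomic_set(&host_suspended, 1);
	/* ... drain outstanding commands, then close the channel ... */
	return 0;
}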

Not to mention that scsi_host_block() is implemented clumsily:
nr_luns * synchronize_rcu() calls are required in scsi_host_block(),
when a single one should suffice.

scsi_device_quiesce() is heavy too: it still takes two
synchronize_rcu() calls for a single LUN.

That said, SCSI suspend may take (3 * nr_luns) synchronize_rcu() calls
in case the HBA's suspend handler needs scsi_host_block().

Thanks,
Ming
Dexuan Cui April 22, 2020, 1:48 a.m. UTC | #2
> From: Ming Lei <ming.lei@redhat.com>
> Sent: Tuesday, April 21, 2020 6:28 PM
> To: Dexuan Cui <decui@microsoft.com>
> 
> On Tue, Apr 21, 2020 at 05:17:24PM -0700, Dexuan Cui wrote:
> > During hibernation, the sdevs are suspended automatically in
> > drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> > storvsc_suspend(), there is no disk I/O from the file systems, but there
> > can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> > sr_block_check_events() -> cdrom_check_events() can still submit I/O
> > to the storvsc driver, which causes a panic due to a NULL pointer dereference,
> > since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> > to the below links for more info:
> >
> > Fix the panic by blocking/unblocking all the I/O queues properly.
> >
> > Note: this patch depends on another patch "scsi: core: Allow the state
> > change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
> >
> > Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
> > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > ---
> >  drivers/scsi/storvsc_drv.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > index fb41636519ee..fd51d2f03778 100644
> > --- a/drivers/scsi/storvsc_drv.c
> > +++ b/drivers/scsi/storvsc_drv.c
> > @@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
> >  	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> >  	struct Scsi_Host *host = stor_device->host;
> >  	struct hv_host_device *host_dev = shost_priv(host);
> > +	int ret;
> > +
> > +	ret = scsi_host_block(host);
> > +	if (ret)
> > +		return ret;
> >
> >  	storvsc_wait_to_drain(stor_device);
> >
> > @@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)
> >
> >  static int storvsc_resume(struct hv_device *hv_dev)
> >  {
> > +	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> > +	struct Scsi_Host *host = stor_device->host;
> >  	int ret;
> >
> >  	ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
> >  				     hv_dev_is_fc(hv_dev));
> > +	if (!ret)
> > +		ret = scsi_host_unblock(host, SDEV_RUNNING);
> > +
> >  	return ret;
> >  }
> 
> scsi_host_block() is actually too heavy for just avoiding
> SCSI internal commands, which can be done simply with one
> atomic variable.
> 
> Not to mention that scsi_host_block() is implemented clumsily:
> nr_luns * synchronize_rcu() calls are required in scsi_host_block(),
> when a single one should suffice.
> 
> scsi_device_quiesce() is heavy too: it still takes two
> synchronize_rcu() calls for a single LUN.
> 
> That said, SCSI suspend may take (3 * nr_luns) synchronize_rcu() calls
> in case the HBA's suspend handler needs scsi_host_block().
> 
> Thanks,
> Ming

When we're in storvsc_suspend(), all the userspace processes have been
frozen and all the file systems have been flushed, and there should not
be too much I/O from the kernel space, so IMO scsi_host_block() should be
pretty fast here. 

Of course, any performance improvement to the scsi_host_block() API
would still be great. :-)

Thanks,
-- Dexuan
Ming Lei April 22, 2020, 2:01 a.m. UTC | #3
On Wed, Apr 22, 2020 at 01:48:25AM +0000, Dexuan Cui wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> > Sent: Tuesday, April 21, 2020 6:28 PM
> > To: Dexuan Cui <decui@microsoft.com>
> > 
> > On Tue, Apr 21, 2020 at 05:17:24PM -0700, Dexuan Cui wrote:
> > > During hibernation, the sdevs are suspended automatically in
> > > drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> > > storvsc_suspend(), there is no disk I/O from the file systems, but there
> > > can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> > > sr_block_check_events() -> cdrom_check_events() can still submit I/O
> > > to the storvsc driver, which causes a panic due to a NULL pointer dereference,
> > > since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> > > to the below links for more info:
> > >
> > > Fix the panic by blocking/unblocking all the I/O queues properly.
> > >
> > > Note: this patch depends on another patch "scsi: core: Allow the state
> > > change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
> > >
> > > Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
> > > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > > ---
> > >  drivers/scsi/storvsc_drv.c | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > > index fb41636519ee..fd51d2f03778 100644
> > > --- a/drivers/scsi/storvsc_drv.c
> > > +++ b/drivers/scsi/storvsc_drv.c
> > > @@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
> > >  	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> > >  	struct Scsi_Host *host = stor_device->host;
> > >  	struct hv_host_device *host_dev = shost_priv(host);
> > > +	int ret;
> > > +
> > > +	ret = scsi_host_block(host);
> > > +	if (ret)
> > > +		return ret;
> > >
> > >  	storvsc_wait_to_drain(stor_device);
> > >
> > > @@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)
> > >
> > >  static int storvsc_resume(struct hv_device *hv_dev)
> > >  {
> > > +	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> > > +	struct Scsi_Host *host = stor_device->host;
> > >  	int ret;
> > >
> > >  	ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
> > >  				     hv_dev_is_fc(hv_dev));
> > > +	if (!ret)
> > > +		ret = scsi_host_unblock(host, SDEV_RUNNING);
> > > +
> > >  	return ret;
> > >  }
> > 
> > scsi_host_block() is actually too heavy for just avoiding
> > SCSI internal commands, which can be done simply with one
> > atomic variable.
> > 
> > Not to mention that scsi_host_block() is implemented clumsily:
> > nr_luns * synchronize_rcu() calls are required in scsi_host_block(),
> > when a single one should suffice.
> > 
> > scsi_device_quiesce() is heavy too: it still takes two
> > synchronize_rcu() calls for a single LUN.
> > 
> > That said, SCSI suspend may take (3 * nr_luns) synchronize_rcu() calls
> > in case the HBA's suspend handler needs scsi_host_block().
> > 
> > Thanks,
> > Ming
> 
> When we're in storvsc_suspend(), all the userspace processes have been
> frozen and all the file systems have been flushed, and there should not
> be too much I/O from the kernel space, so IMO scsi_host_block() should be
> pretty fast here. 

I guess it depends on RCU's implementation, so CC RCU guys.

Hello Paul & Josh,

Could you clarify whether synchronize_rcu() becomes quick during
system suspend?

Thanks,
Ming
Ming Lei April 22, 2020, 2:06 a.m. UTC | #4
On Wed, Apr 22, 2020 at 01:48:25AM +0000, Dexuan Cui wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> > Sent: Tuesday, April 21, 2020 6:28 PM
> > To: Dexuan Cui <decui@microsoft.com>
> > 
> > On Tue, Apr 21, 2020 at 05:17:24PM -0700, Dexuan Cui wrote:
> > > During hibernation, the sdevs are suspended automatically in
> > > drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> > > storvsc_suspend(), there is no disk I/O from the file systems, but there
> > > can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> > > sr_block_check_events() -> cdrom_check_events() can still submit I/O
> > > to the storvsc driver, which causes a panic due to a NULL pointer dereference,
> > > since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> > > to the below links for more info:
> > >
> > > Fix the panic by blocking/unblocking all the I/O queues properly.
> > >
> > > Note: this patch depends on another patch "scsi: core: Allow the state
> > > change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
> > >
> > > Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
> > > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > > ---
> > >  drivers/scsi/storvsc_drv.c | 10 ++++++++++
> > >  1 file changed, 10 insertions(+)
> > >
> > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > > index fb41636519ee..fd51d2f03778 100644
> > > --- a/drivers/scsi/storvsc_drv.c
> > > +++ b/drivers/scsi/storvsc_drv.c
> > > @@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
> > >  	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> > >  	struct Scsi_Host *host = stor_device->host;
> > >  	struct hv_host_device *host_dev = shost_priv(host);
> > > +	int ret;
> > > +
> > > +	ret = scsi_host_block(host);
> > > +	if (ret)
> > > +		return ret;
> > >
> > >  	storvsc_wait_to_drain(stor_device);
> > >
> > > @@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)
> > >
> > >  static int storvsc_resume(struct hv_device *hv_dev)
> > >  {
> > > +	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> > > +	struct Scsi_Host *host = stor_device->host;
> > >  	int ret;
> > >
> > >  	ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
> > >  				     hv_dev_is_fc(hv_dev));
> > > +	if (!ret)
> > > +		ret = scsi_host_unblock(host, SDEV_RUNNING);
> > > +
> > >  	return ret;
> > >  }
> > 
> > scsi_host_block() is actually too heavy for just avoiding
> > SCSI internal commands, which can be done simply with one
> > atomic variable.
> > 
> > Not to mention that scsi_host_block() is implemented clumsily:
> > nr_luns * synchronize_rcu() calls are required in scsi_host_block(),
> > when a single one should suffice.
> > 
> > scsi_device_quiesce() is heavy too: it still takes two
> > synchronize_rcu() calls for a single LUN.
> > 
> > That said, SCSI suspend may take (3 * nr_luns) synchronize_rcu() calls
> > in case the HBA's suspend handler needs scsi_host_block().
> > 
> > Thanks,
> > Ming
> 
> When we're in storvsc_suspend(), all the userspace processes have been
> frozen and all the file systems have been flushed, and there should not
> be too much I/O from the kernel space, so IMO scsi_host_block() should be
> pretty fast here. 

Forgot to mention: I guess storvsc_suspend() may be called from the runtime
PM path; at that time, the whole system is running, and RCU can't be quick.

Thanks, 
Ming
Paul E. McKenney April 22, 2020, 3:08 a.m. UTC | #5
On Wed, Apr 22, 2020 at 10:01:34AM +0800, Ming Lei wrote:
> On Wed, Apr 22, 2020 at 01:48:25AM +0000, Dexuan Cui wrote:
> > > From: Ming Lei <ming.lei@redhat.com>
> > > Sent: Tuesday, April 21, 2020 6:28 PM
> > > To: Dexuan Cui <decui@microsoft.com>
> > > 
> > > On Tue, Apr 21, 2020 at 05:17:24PM -0700, Dexuan Cui wrote:
> > > > During hibernation, the sdevs are suspended automatically in
> > > > drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> > > > storvsc_suspend(), there is no disk I/O from the file systems, but there
> > > > can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> > > > sr_block_check_events() -> cdrom_check_events() can still submit I/O
> > > > to the storvsc driver, which causes a panic due to a NULL pointer dereference,
> > > > since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> > > > to the below links for more info:
> > > >
> > > > Fix the panic by blocking/unblocking all the I/O queues properly.
> > > >
> > > > Note: this patch depends on another patch "scsi: core: Allow the state
> > > > change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
> > > >
> > > > Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
> > > > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > > > ---
> > > >  drivers/scsi/storvsc_drv.c | 10 ++++++++++
> > > >  1 file changed, 10 insertions(+)
> > > >
> > > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > > > index fb41636519ee..fd51d2f03778 100644
> > > > --- a/drivers/scsi/storvsc_drv.c
> > > > +++ b/drivers/scsi/storvsc_drv.c
> > > > @@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
> > > >  	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> > > >  	struct Scsi_Host *host = stor_device->host;
> > > >  	struct hv_host_device *host_dev = shost_priv(host);
> > > > +	int ret;
> > > > +
> > > > +	ret = scsi_host_block(host);
> > > > +	if (ret)
> > > > +		return ret;
> > > >
> > > >  	storvsc_wait_to_drain(stor_device);
> > > >
> > > > @@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)
> > > >
> > > >  static int storvsc_resume(struct hv_device *hv_dev)
> > > >  {
> > > > +	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> > > > +	struct Scsi_Host *host = stor_device->host;
> > > >  	int ret;
> > > >
> > > >  	ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
> > > >  				     hv_dev_is_fc(hv_dev));
> > > > +	if (!ret)
> > > > +		ret = scsi_host_unblock(host, SDEV_RUNNING);
> > > > +
> > > >  	return ret;
> > > >  }
> > > 
> > > scsi_host_block() is actually too heavy for just avoiding
> > > SCSI internal commands, which can be done simply with one
> > > atomic variable.
> > > 
> > > Not to mention that scsi_host_block() is implemented clumsily:
> > > nr_luns * synchronize_rcu() calls are required in scsi_host_block(),
> > > when a single one should suffice.
> > > 
> > > scsi_device_quiesce() is heavy too: it still takes two
> > > synchronize_rcu() calls for a single LUN.
> > > 
> > > That said, SCSI suspend may take (3 * nr_luns) synchronize_rcu() calls
> > > in case the HBA's suspend handler needs scsi_host_block().
> > > 
> > > Thanks,
> > > Ming
> > 
> > When we're in storvsc_suspend(), all the userspace processes have been
> > frozen and all the file systems have been flushed, and there should not
> > be too much I/O from the kernel space, so IMO scsi_host_block() should be
> > pretty fast here. 
> 
> I guess it depends on RCU's implementation, so CC RCU guys.
> 
> Hello Paul & Josh,
> 
> Could you clarify whether synchronize_rcu() becomes quick during
> system suspend?

Once you have all but one CPU offlined, it becomes extremely fast, as
in roughly a no-op (which is an idea of Josh's from back in the day).
But if there is more than one CPU online, then synchronize_rcu() still
takes on the order of several to several tens of jiffies.

So, yes, in some portions of system suspend, synchronize_rcu() becomes
very fast indeed.

							Thanx, Paul
Ming Lei April 22, 2020, 4:16 a.m. UTC | #6
On Tue, Apr 21, 2020 at 08:08:07PM -0700, Paul E. McKenney wrote:
> On Wed, Apr 22, 2020 at 10:01:34AM +0800, Ming Lei wrote:
> > On Wed, Apr 22, 2020 at 01:48:25AM +0000, Dexuan Cui wrote:
> > > > From: Ming Lei <ming.lei@redhat.com>
> > > > Sent: Tuesday, April 21, 2020 6:28 PM
> > > > To: Dexuan Cui <decui@microsoft.com>
> > > > 
> > > > On Tue, Apr 21, 2020 at 05:17:24PM -0700, Dexuan Cui wrote:
> > > > > During hibernation, the sdevs are suspended automatically in
> > > > > drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> > > > > storvsc_suspend(), there is no disk I/O from the file systems, but there
> > > > > can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> > > > > sr_block_check_events() -> cdrom_check_events() can still submit I/O
> > > > > to the storvsc driver, which causes a panic due to a NULL pointer dereference,
> > > > > since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> > > > > to the below links for more info:
> > > > >
> > > > > Fix the panic by blocking/unblocking all the I/O queues properly.
> > > > >
> > > > > Note: this patch depends on another patch "scsi: core: Allow the state
> > > > > change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
> > > > >
> > > > > Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
> > > > > Signed-off-by: Dexuan Cui <decui@microsoft.com>
> > > > > ---
> > > > >  drivers/scsi/storvsc_drv.c | 10 ++++++++++
> > > > >  1 file changed, 10 insertions(+)
> > > > >
> > > > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > > > > index fb41636519ee..fd51d2f03778 100644
> > > > > --- a/drivers/scsi/storvsc_drv.c
> > > > > +++ b/drivers/scsi/storvsc_drv.c
> > > > > @@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
> > > > >  	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> > > > >  	struct Scsi_Host *host = stor_device->host;
> > > > >  	struct hv_host_device *host_dev = shost_priv(host);
> > > > > +	int ret;
> > > > > +
> > > > > +	ret = scsi_host_block(host);
> > > > > +	if (ret)
> > > > > +		return ret;
> > > > >
> > > > >  	storvsc_wait_to_drain(stor_device);
> > > > >
> > > > > @@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)
> > > > >
> > > > >  static int storvsc_resume(struct hv_device *hv_dev)
> > > > >  {
> > > > > +	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> > > > > +	struct Scsi_Host *host = stor_device->host;
> > > > >  	int ret;
> > > > >
> > > > >  	ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
> > > > >  				     hv_dev_is_fc(hv_dev));
> > > > > +	if (!ret)
> > > > > +		ret = scsi_host_unblock(host, SDEV_RUNNING);
> > > > > +
> > > > >  	return ret;
> > > > >  }
> > > > 
> > > > scsi_host_block() is actually too heavy for just avoiding
> > > > SCSI internal commands, which can be done simply with one
> > > > atomic variable.
> > > > 
> > > > Not to mention that scsi_host_block() is implemented clumsily:
> > > > nr_luns * synchronize_rcu() calls are required in scsi_host_block(),
> > > > when a single one should suffice.
> > > > 
> > > > scsi_device_quiesce() is heavy too: it still takes two
> > > > synchronize_rcu() calls for a single LUN.
> > > > 
> > > > That said, SCSI suspend may take (3 * nr_luns) synchronize_rcu() calls
> > > > in case the HBA's suspend handler needs scsi_host_block().
> > > > 
> > > > Thanks,
> > > > Ming
> > > 
> > > When we're in storvsc_suspend(), all the userspace processes have been
> > > frozen and all the file systems have been flushed, and there should not
> > > be too much I/O from the kernel space, so IMO scsi_host_block() should be
> > > pretty fast here. 
> > 
> > I guess it depends on RCU's implementation, so CC RCU guys.
> > 
> > Hello Paul & Josh,
> > 
> > Could you clarify whether synchronize_rcu() becomes quick during
> > system suspend?
> 
> Once you have all but one CPU offlined, it becomes extremely fast, as
> in roughly a no-op (which is an idea of Josh's from back in the day).
> But if there is more than one CPU online, then synchronize_rcu() still
> takes on the order of several to several tens of jiffies.
> 
> So, yes, in some portions of system suspend, synchronize_rcu() becomes
> very fast indeed.

Hi Paul,

Thanks for your clarification.

In the system suspend path, devices are suspended before
suspend_disable_secondary_cpus(), so I guess synchronize_rcu() is not
quick enough even though user-space processes and some kernel threads
are frozen.

Thanks,
Ming
Dexuan Cui April 22, 2020, 4:58 a.m. UTC | #7
> From: Ming Lei <ming.lei@redhat.com>
> Sent: Tuesday, April 21, 2020 9:16 PM
> ...
> > > > When we're in storvsc_suspend(), all the userspace processes have been
> > > > frozen and all the file systems have been flushed, and there should not
> > > > be too much I/O from the kernel space, so IMO scsi_host_block() should be
> > > > pretty fast here.
> > >
> > > I guess it depends on RCU's implementation, so CC RCU guys.
> > >
> > > Hello Paul & Josh,
> > >
> > > Could you clarify whether synchronize_rcu() becomes quick during
> > > system suspend?
> >
> > Once you have all but one CPU offlined, it becomes extremely fast, as
> > in roughly a no-op (which is an idea of Josh's from back in the day).
> > But if there is more than one CPU online, then synchronize_rcu() still
> > takes on the order of several to several tens of jiffies.
> >
> > So, yes, in some portions of system suspend, synchronize_rcu() becomes
> > very fast indeed.
> 
> Hi Paul,
> 
> Thanks for your clarification.
> 
> In the system suspend path, devices are suspended before
> suspend_disable_secondary_cpus(), so I guess synchronize_rcu() is not
> quick enough even though user-space processes and some kernel threads
> are frozen.
> 
> Thanks,
> Ming

storvsc_suspend() -> scsi_host_block() is only called in the hibernation
path, which is not a hot path at all, so IMHO we don't really care if it
takes 10ms or 100ms or even 1s. :-)  BTW, in my test, typically the
scsi_host_block() here takes about 3ms in my 40-vCPU VM.
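
(For reference, such a number can be obtained with hypothetical
instrumentation like the following; this is a sketch, not code from the
thread, and it assumes the scsi_host_block() prototype of that era:)

#include <linux/ktime.h>
#include <linux/printk.h>
#include <scsi/scsi_host.h>

/* Hypothetical wrapper that logs how long scsi_host_block() takes. */
static int scsi_host_block_timed(struct Scsi_Host *host)
{
	ktime_t t0 = ktime_get();
	int ret = scsi_host_block(host);

	pr_info("scsi_host_block() took %lld us\n",
		ktime_us_delta(ktime_get(), t0));
	return ret;
}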

storvsc_suspend() is not called from the runtime PM path, because the
runtime_suspend/runtime_resume/runtime_idle ops are not defined 
at all for the devices on the Hyper-V VMBus bus: these are pure 
software-emulated devices, so runtime PM is unnecessary for them.

Thanks,
-- Dexuan
Bart Van Assche April 22, 2020, 5:02 a.m. UTC | #8
On 4/21/20 5:17 PM, Dexuan Cui wrote:
> During hibernation, the sdevs are suspended automatically in
> drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> storvsc_suspend(), there is no disk I/O from the file systems, but there
> can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> sr_block_check_events() -> cdrom_check_events() can still submit I/O
> to the storvsc driver, which causes a panic due to a NULL pointer dereference,
> since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> to the below links for more info:
>    https://lkml.org/lkml/2020/4/10/47
>    https://lkml.org/lkml/2020/4/17/1103
> 
> Fix the panic by blocking/unblocking all the I/O queues properly.
> 
> Note: this patch depends on another patch "scsi: core: Allow the state
> change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
> 
> Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>   drivers/scsi/storvsc_drv.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index fb41636519ee..fd51d2f03778 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
>   	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
>   	struct Scsi_Host *host = stor_device->host;
>   	struct hv_host_device *host_dev = shost_priv(host);
> +	int ret;
> +
> +	ret = scsi_host_block(host);
> +	if (ret)
> +		return ret;
>   
>   	storvsc_wait_to_drain(stor_device);
>   
> @@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)
>   
>   static int storvsc_resume(struct hv_device *hv_dev)
>   {
> +	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> +	struct Scsi_Host *host = stor_device->host;
>   	int ret;
>   
>   	ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
>   				     hv_dev_is_fc(hv_dev));
> +	if (!ret)
> +		ret = scsi_host_unblock(host, SDEV_RUNNING);
> +
>   	return ret;
>   }

I don't like this patch. It makes the behavior of the storvsc driver
different from every other SCSI LLD. Other SCSI LLDs don't need this
change because they don't destroy their I/O queues upon suspend.

Bart.
Dexuan Cui April 22, 2020, 6:24 a.m. UTC | #9
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Tuesday, April 21, 2020 10:02 PM
> On 4/21/20 5:17 PM, Dexuan Cui wrote:
> > During hibernation, the sdevs are suspended automatically in
> > drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> > storvsc_suspend(), there is no disk I/O from the file systems, but there
> > can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> > sr_block_check_events() -> cdrom_check_events() can still submit I/O
> > to the storvsc driver, which causes a panic due to a NULL pointer dereference,
> > since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> > to the below links for more info:
> > ...
> > Fix the panic by blocking/unblocking all the I/O queues properly.
> >
> > Note: this patch depends on another patch "scsi: core: Allow the state
> > change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
> 
> I don't like this patch. It makes the behavior of the storvsc driver
> different from every other SCSI LLD. Other SCSI LLDs don't need this
> change because they don't destroy their I/O queues upon suspend.
> 
> Bart.

Upon suspend, I suppose the other LLDs cannot accept I/O either; what
do they do when some kernel thread still tries to submit I/O? Do
they "block" the I/O until resume, or just return an error immediately?

I had a look at drivers/scsi/xen-scsifront.c. It looks like this LLD
implements a mechanism for marking the device busy upon suspend, so any
I/O will immediately get a SCSI_MLQUEUE_HOST_BUSY error. IMO the
disadvantage is that the busy-marking mechanism looks complex, and the
hot path .queuecommand() has to take the shost->host_lock spinlock,
which should affect performance.
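
A sketch of that busy-marking pattern, in the spirit of xen-scsifront
(the structure and names are hypothetical; the real driver differs in
detail):

#include <linux/spinlock.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

/* Hypothetical LLD-private state, stored via shost_priv(). */
struct lld_info {
	unsigned int pause;	/* set by the suspend handler */
};

static int lld_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *scmnd)
{
	struct lld_info *info = shost_priv(shost);
	unsigned long flags;

	/* The hot path has to take host_lock just to check the flag. */
	spin_lock_irqsave(shost->host_lock, flags);
	if (info->pause) {
		spin_unlock_irqrestore(shost->host_lock, flags);
		/* Suspended: hand the command back to the midlayer. */
		return SCSI_MLQUEUE_HOST_BUSY;
	}
	spin_unlock_irqrestore(shost->host_lock, flags);

	/* ... submit the command to the backend ... */
	return 0;
}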

It looks like drivers/scsi/nsp32.c: nsp32_suspend() and
drivers/scsi/3w-9xxx.c: twa_suspend() do nothing to handle new I/O
after suspend. I doubt this is correct.

It looks like drivers/scsi/hisi_sas/hisi_sas_v3_hw.c: hisi_sas_v3_suspend()
tries to use scsi_block_requests() as a means of blocking new I/O, but
I doubt it can work: scsi_block_requests() only sets a variable --
how could this prevent another concurrent kernel thread from submitting
new I/O without a race condition?

So it looks to me like there is no simple mechanism to handle the scenario
here, and I guess that's why the scsi_host_block/unblock APIs were
introduced; actually, there is already a user of the APIs:
3d3ca53b1639 ("scsi: aacraid: use scsi_host_(block,unblock) to block I/O").

The aacraid patch says: "This has the advantage that the block layer will
stop sending I/O to the adapter instead of having the SCSI midlayer
requeueing I/O internally". This may imply that using the new APIs is
encouraged?

Sorry for my lack of knowledge of the Linux SCSI system, but IMHO the new
APIs are simple and effective, and they really look like a good fit for the
hibernation scenario. BTW, storvsc_suspend() is not a hot path here, so
I don't have any performance concern.

PS, here storvsc has to destroy and reconstruct the I/O queues: the
I/O queues are shared-memory ring buffers between the guest and the
host; in the resume path of the hibernation procedure, the memory
pages allocated by the 'new' kernel are different from those allocated by
the 'old' kernel, so before jumping to the 'old' kernel, the 'new' kernel
must destroy the mapping of the pages, and later, after we jump to
the 'old' kernel, we'll re-create the mapping using the pages allocated
by the 'old' kernel. Here "create the mapping" means the guest tells
the host about the physical addresses of the pages.

Thanks,
-- Dexuan
Ming Lei April 22, 2020, 9:23 a.m. UTC | #10
On Wed, Apr 22, 2020 at 04:58:14AM +0000, Dexuan Cui wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> > Sent: Tuesday, April 21, 2020 9:16 PM
> > ...
> > > > > When we're in storvsc_suspend(), all the userspace processes have been
> > > > > frozen and all the file systems have been flushed, and there should not
> > > > > be too much I/O from the kernel space, so IMO scsi_host_block() should be
> > > > > pretty fast here.
> > > >
> > > > I guess it depends on RCU's implementation, so CC RCU guys.
> > > >
> > > > Hello Paul & Josh,
> > > >
> > > > Could you clarify whether synchronize_rcu() becomes quick during
> > > > system suspend?
> > >
> > > Once you have all but one CPU offlined, it becomes extremely fast, as
> > > in roughly a no-op (which is an idea of Josh's from back in the day).
> > > But if there is more than one CPU online, then synchronize_rcu() still
> > > takes on the order of several to several tens of jiffies.
> > >
> > > So, yes, in some portions of system suspend, synchronize_rcu() becomes
> > > very fast indeed.
> > 
> > Hi Paul,
> > 
> > Thanks for your clarification.
> > 
> > In the system suspend path, devices are suspended before
> > suspend_disable_secondary_cpus(), so I guess synchronize_rcu() is not
> > quick enough even though user-space processes and some kernel threads
> > are frozen.
> > 
> > Thanks,
> > Ming
> 
> storvsc_suspend() -> scsi_host_block() is only called in the hibernation
> path, which is not a hot path at all, so IMHO we don't really care if it
> takes 10ms or 100ms or even 1s. :-)  BTW, in my test, typically the

Are you sure the 'we' can cover all users?

> scsi_host_block() here takes about 3ms in my 40-vCPU VM.

If more LUNs are added, the time should increase proportionally;
that is why I think scsi_host_block() is bad.


Thanks,
Ming
Paul E. McKenney April 22, 2020, 4:19 p.m. UTC | #11
On Wed, Apr 22, 2020 at 05:23:51PM +0800, Ming Lei wrote:
> On Wed, Apr 22, 2020 at 04:58:14AM +0000, Dexuan Cui wrote:
> > > From: Ming Lei <ming.lei@redhat.com>
> > > Sent: Tuesday, April 21, 2020 9:16 PM
> > > ...
> > > > > > When we're in storvsc_suspend(), all the userspace processes have been
> > > > > > frozen and all the file systems have been flushed, and there should not
> > > > > > be too much I/O from the kernel space, so IMO scsi_host_block() should be
> > > > > > pretty fast here.
> > > > >
> > > > > I guess it depends on RCU's implementation, so CC RCU guys.
> > > > >
> > > > > Hello Paul & Josh,
> > > > >
> > > > > Could you clarify whether synchronize_rcu() becomes quick during
> > > > > system suspend?
> > > >
> > > > Once you have all but one CPU offlined, it becomes extremely fast, as
> > > > in roughly a no-op (which is an idea of Josh's from back in the day).
> > > > But if there is more than one CPU online, then synchronize_rcu() still
> > > > takes on the order of several to several tens of jiffies.
> > > >
> > > > So, yes, in some portions of system suspend, synchronize_rcu() becomes
> > > > very fast indeed.
> > > 
> > > Hi Paul,
> > > 
> > > Thanks for your clarification.
> > > 
> > > In the system suspend path, devices are suspended before
> > > suspend_disable_secondary_cpus(), so I guess synchronize_rcu() is not
> > > quick enough even though user-space processes and some kernel threads
> > > are frozen.
> > > 
> > > Thanks,
> > > Ming
> > 
> > storvsc_suspend() -> scsi_host_block() is only called in the hibernation
> > path, which is not a hot path at all, so IMHO we don't really care if it
> > takes 10ms or 100ms or even 1s. :-)  BTW, in my test, typically the
> 
> Are you sure the 'we' can cover all users?
> 
> > scsi_host_block() here takes about 3ms in my 40-vCPU VM.
> 
> If more LUNs are added, the time should increase proportionally;
> that is why I think scsi_host_block() is bad.

If the caller must wait until the grace period ends, then the traditional
approach is to use a single synchronize_rcu() to cover all LUNs.  This of
course can require some reworking of the code.

If the caller does not need to wait, then either call_rcu() or kfree_rcu()
can work quite well.
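
A sketch of the first suggestion as applied to scsi_host_block() (this
paraphrases the idea, assuming the existing non-waiting helper
scsi_internal_device_block_nowait(); the eventual upstream rework may
differ):

#include <linux/rcupdate.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

static int scsi_host_block_sketch(struct Scsi_Host *shost)
{
	struct scsi_device *sdev;
	int ret = 0;

	/* Mark every LUN SDEV_BLOCK without waiting per queue. */
	shost_for_each_device(sdev, shost) {
		mutex_lock(&sdev->state_mutex);
		ret = scsi_internal_device_block_nowait(sdev);
		mutex_unlock(&sdev->state_mutex);
		if (ret) {
			scsi_device_put(sdev);
			break;
		}
	}

	/* A single grace period now covers all of the blocked queues. */
	if (!ret)
		synchronize_rcu();

	return ret;
}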

							Thanx, Paul
Dexuan Cui April 22, 2020, 6:01 p.m. UTC | #12
> From: Ming Lei <ming.lei@redhat.com>
> Sent: Wednesday, April 22, 2020 2:24 AM
> > ...
> > storvsc_suspend() -> scsi_host_block() is only called in the hibernation
> > path, which is not a hot path at all, so IMHO we don't really care if it
> > takes 10ms or 100ms or even 1s. :-)  BTW, in my test, typically the
> 
> Are you sure the 'we' can cover all users?

I actually only meant that I don't have any performance concern in this
path, storvsc_suspend() -> scsi_host_block(). :-)

Thanks for your effort to improve the scsi_host_block() API in the patch
[PATCH] scsi: avoid to run synchronize_rcu for each device in scsi_host_block.
 
> > scsi_host_block() here takes about 3ms in my 40-vCPU VM.
> 
> If more LUNs are added, the time should be increased proportionallly,
> that is why I think scsi_host_block() is bad.
> 
> Thanks,
> Ming

FWIW, even if the time increases linearly with the number of LUNs, it
typically still looks fast enough for storvsc_suspend().

Thanks,
-- Dexuan
Bart Van Assche April 22, 2020, 7:56 p.m. UTC | #13
On 4/21/20 11:24 PM, Dexuan Cui wrote:
> Upon suspend, I suppose the other LLDs cannot accept I/O either; what
> do they do when some kernel thread still tries to submit I/O? Do
> they "block" the I/O until resume, or just return an error immediately?

This is my understanding of how other SCSI LLDs handle suspend/resume:
- The ULP driver, e.g. the SCSI sd driver, implements power management 
support by providing callbacks in struct scsi_driver.gendrv.pm and also 
in scsi_bus_type.pm. The SCSI sd suspend callbacks flush the device 
cache and send a STOP command to the device.
- SCSI LLDs for PCIe devices optionally provide pci_driver.suspend and 
resume callbacks. These callbacks can be used to make the PCIe device 
enter/leave a power saving state. No new SCSI commands should be 
submitted after pci_driver.suspend has been called.
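
For reference, the wiring described above looks roughly like this in
sd.c (an abridged sketch from memory, with stub bodies; the exact
upstream fields and callbacks may differ):

#include <linux/device.h>
#include <linux/pm.h>
#include <scsi/scsi_driver.h>

/* Stubs standing in for the real sd.c callbacks. */
static int sd_suspend_system(struct device *dev) { return 0; } /* flush + STOP */
static int sd_resume(struct device *dev) { return 0; }	       /* START */
static int sd_suspend_runtime(struct device *dev) { return 0; }

static const struct dev_pm_ops sd_pm_ops = {
	.suspend	 = sd_suspend_system,
	.resume		 = sd_resume,
	.poweroff	 = sd_suspend_system,
	.restore	 = sd_resume,
	.runtime_suspend = sd_suspend_runtime,
	.runtime_resume	 = sd_resume,
};

static struct scsi_driver sd_template = {
	.gendrv = {
		.name	= "sd",
		.pm	= &sd_pm_ops,
	},
};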

> I had a look at drivers/scsi/xen-scsifront.c. It looks like this LLD
> implements a mechanism for marking the device busy upon suspend, so any
> I/O will immediately get a SCSI_MLQUEUE_HOST_BUSY error. IMO the
> disadvantage is that the busy-marking mechanism looks complex, and the
> hot path .queuecommand() has to take the shost->host_lock spinlock,
> which should affect performance.

I think the code you are referring to is the code in
scsifront_suspend(). A pointer to that function is stored in a struct
xenbus_driver instance. That's a different structure from the structures
mentioned above.

Wouldn't it be better to make sure that any hypervisor suspend 
operations happen after it is guaranteed that no further SCSI commands 
will be submitted such that hypervisor suspend operations do not have to 
deal with SCSI commands submitted during or after the hypervisor suspend 
callback?

> It looks like drivers/scsi/nsp32.c: nsp32_suspend() and
> drivers/scsi/3w-9xxx.c: twa_suspend() do nothing to handle new I/O
> after suspend. I doubt this is correct.

nsp32_suspend() is a PCI suspend callback. If any SCSI commands would be 
submitted after that callback has started that would mean that the SCSI 
suspend and PCIe suspend operations are called in the wrong order. I do 
not agree that code for suspending SCSI commands should be added in 
nsp32_suspend().

> So it looks to me like there is no simple mechanism to handle the scenario
> here, and I guess that's why the scsi_host_block/unblock APIs were
> introduced; actually, there is already a user of the APIs:
> 3d3ca53b1639 ("scsi: aacraid: use scsi_host_(block,unblock) to block I/O").
> 
> The aacraid patch says: "This has the advantage that the block layer will
> stop sending I/O to the adapter instead of having the SCSI midlayer
> requeueing I/O internally". This may imply that using the new APIs is
> encouraged?

I'm fine with using these new functions in device reset handlers. Using 
these functions in power management handlers seems wrong to me.

> PS, here storvsc has to destroy and reconstruct the I/O queues: the
> I/O queues are shared-memory ring buffers between the guest and the
> host; in the resume path of the hibernation procedure, the memory
> pages allocated by the 'new' kernel are different from those allocated by
> the 'old' kernel, so before jumping to the 'old' kernel, the 'new' kernel
> must destroy the mapping of the pages, and later, after we jump to
> the 'old' kernel, we'll re-create the mapping using the pages allocated
> by the 'old' kernel. Here "create the mapping" means the guest tells
> the host about the physical addresses of the pages.

Thank you for having clarified this. This helps me to understand the HV 
driver framework better. I think this means that the hv_driver.suspend 
function should be called at a later time than SCSI suspend. From 
Documentation/driver-api/device_link.rst: "By default, the driver core 
only enforces dependencies between devices that are borne out of a 
parent/child relationship within the device hierarchy: When suspending, 
resuming or shutting down the system, devices are ordered based on this 
relationship, i.e. children are always suspended before their parent, 
and the parent is always resumed before its children." Is there a single 
storvsc_drv instance for all SCSI devices supported by storvsc_drv? Has 
it been considered to make storvsc_drv the parent device of all SCSI 
devices created by the storvsc driver?

Thanks,

Bart.
Dexuan Cui April 23, 2020, 7:04 a.m. UTC | #14
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Wednesday, April 22, 2020 12:56 PM
> On 4/21/20 11:24 PM, Dexuan Cui wrote:
> > Upon suspend, I suppose the other LLDs can not accept I/O either, then
> > what do they do when some kernel thread still tries to submit I/O? Do
> > they "block" the I/O until resume, or just return an error immediately?
> 
> This is my understanding of how other SCSI LLDs handle suspend/resume:
> - The ULP driver, e.g. the SCSI sd driver, implements power management
> support by providing callbacks in struct scsi_driver.gendrv.pm and also
> in scsi_bus_type.pm. The SCSI sd suspend callbacks flush the device
> cache and send a STOP command to the device.

It looks like the sd suspend callbacks only cover the I/O from the disk, e.g.
from the file system that lives in some partition of some disk.

The panic I'm seeing is not from sd. I think it's from a kernel thread
that tries to detect the status of the SCSI CDROM. These are the snipped
messages (the full version is at https://lkml.org/lkml/2020/4/10/47): here
the suspend callbacks of the sd, sr and scsi_bus_type.pm have been called,
and later the storvsc LLD's suspend callback is also called, but
sr_block_check_events() can still try to submit SCSI commands to storvsc:

[   11.668741] sr 0:0:0:1: bus quiesce
[   11.668804] sd 0:0:0:0: bus quiesce
[   11.698082] scsi target0:0:0: bus quiesce
[   11.703296] scsi host0: bus quiesce
[   11.781730] hv_storvsc bf78936f-7d8f-45ce-ab03-6c341452e55d: noirq bus quiesce
[   11.796479] hv_netvsc dda5a2be-b8b8-4237-b330-be8a516a72c0: noirq bus quiesce
[   11.804042] BUG: kernel NULL pointer dereference, address: 0000000000000090
[   11.804996] Workqueue: events_freezable_power_ disk_events_workfn
[   11.804996] RIP: 0010:storvsc_queuecommand+0x261/0x714 [hv_storvsc]
[   11.804996] Call Trace:
[   11.804996]  scsi_queue_rq+0x593/0xa10
[   11.804996]  blk_mq_dispatch_rq_list+0x8d/0x510
[   11.804996]  blk_mq_sched_dispatch_requests+0xed/0x170
[   11.804996]  __blk_mq_run_hw_queue+0x55/0x110
[   11.804996]  __blk_mq_delay_run_hw_queue+0x141/0x160
[   11.804996]  blk_mq_sched_insert_request+0xc3/0x170
[   11.804996]  blk_execute_rq+0x4b/0xa0
[   11.804996]  __scsi_execute+0xeb/0x250
[   11.804996]  sr_check_events+0x9f/0x270 [sr_mod]
[   11.804996]  cdrom_check_events+0x1a/0x30 [cdrom]
[   11.804996]  sr_block_check_events+0xcc/0x110 [sr_mod]
[   11.804996]  disk_check_events+0x68/0x160
[   11.804996]  process_one_work+0x20c/0x3d0
[   11.804996]  worker_thread+0x2d/0x3e0
[   11.804996]  kthread+0x10c/0x130
[   11.804996]  ret_from_fork+0x35/0x40

It looks like the issue is that scsi_bus_freeze() -> ... -> scsi_dev_type_suspend()
-> scsi_device_quiesce() does not guarantee the device is totally quiescent:

/**
 *      scsi_device_quiesce - Block user issued commands.
 *      @sdev:  scsi device to quiesce.
 *
 *      This works by trying to transition to the SDEV_QUIESCE state
 *      (which must be a legal transition).  When the device is in this
 *      state, only special requests will be accepted, all others will
 *      be deferred.  Since special requests may also be requeued requests,
 *      a successful return doesn't guarantee the device will be
 *      totally quiescent.

I guess the "special requests" may include the "detect the status of the
SCSI CDROM" request from sr_block_check_events().

It looks like scsi_device_quiesce() does not freeze the queue; it just puts
the device into the SDEV_QUIESCE state, which is not enough to prevent
sr_block_check_events() from submitting SCSI commands.
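
For context, the "special requests" are the RQF_PREEMPT ones: a
paraphrase of that era's state check in scsi_lib.c (the exact upstream
code may differ) shows why __scsi_execute(), which allocates its
requests with BLK_MQ_REQ_PREEMPT, still gets through a quiesced device:

#include <linux/blkdev.h>
#include <scsi/scsi_device.h>

/* Paraphrase of the SDEV_QUIESCE branch of scsi_prep_state_check(). */
static blk_status_t sdev_state_check_sketch(struct scsi_device *sdev,
					    struct request *req)
{
	if (sdev->sdev_state == SDEV_QUIESCE) {
		/* Defer normal commands; let RQF_PREEMPT ones through. */
		if (req && !(req->rq_flags & RQF_PREEMPT))
			return BLK_STS_DEV_RESOURCE;
	}
	return BLK_STS_OK;
}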

It looks like scsi_host_block() freezes the queue and flushes all the pending
I/O, so it can fix the aforementioned NULL pointer dereference panic for me.

> - SCSI LLDs for PCIe devices optionally provide pci_driver.suspend and
> resume callbacks. These callbacks can be used to make the PCIe device
> enter/leave a power saving state. No new SCSI commands should be
> submitted after pci_driver.suspend has been called.
> 
> > I had a look at drivers/scsi/xen-scsifront.c. It looks like this LLD
> > implements a mechanism for marking the device busy upon suspend, so any
> > I/O will immediately get a SCSI_MLQUEUE_HOST_BUSY error. IMO the
> > disadvantage is that the busy-marking mechanism looks complex, and the
> > hot path .queuecommand() has to take the shost->host_lock spinlock,
> > which should affect performance.
> 
> I think the code you are referring to is the code in
> scsifront_suspend(). A pointer to that function is stored in a struct
> xenbus_driver instance. That's another structure than the structures
> mentioned above.
> 
> Wouldn't it be better to make sure that any hypervisor suspend
> operations happen after it is guaranteed that no further SCSI commands
> will be submitted such that hypervisor suspend operations do not have to
> deal with SCSI commands submitted during or after the hypervisor suspend
> callback?

I agree with you, but it looks like scsi_bus_freeze() doesn't guarantee that
no further SCSI commands will be submitted, as I described above. Maybe that
is why scsifront_suspend() has to invent a mechanism to cope with the issue.
 
> > It looks like drivers/scsi/nsp32.c: nsp32_suspend() and
> > drivers/scsi/3w-9xxx.c: twa_suspend() do nothing to handle new I/O
> > after suspend. I doubt this is correct.
> 
> nsp32_suspend() is a PCI suspend callback. If any SCSI commands would be
> submitted after that callback has started that would mean that the SCSI
> suspend and PCIe suspend operations are called in the wrong order. I do
> not agree that code for suspending SCSI commands should be added in
> nsp32_suspend().
> 
> > So it looks to me like there is no simple mechanism to handle the scenario
> > here, and I guess that's why the scsi_host_block/unblock APIs were
> > introduced; actually, there is already a user of the APIs:
> > 3d3ca53b1639 ("scsi: aacraid: use scsi_host_(block,unblock) to block I/O").
> >
> > The aacraid patch says: "This has the advantage that the block layer will
> > stop sending I/O to the adapter instead of having the SCSI midlayer
> > requeueing I/O internally". This may imply that using the new APIs is
> > encouraged?
> 
> I'm fine with using these new functions in device reset handlers. Using
> these functions in power management handlers seems wrong to me.

It looks like you're suggesting that the scsi_host_block() in aac_suspend()
should be removed? :-)

> > PS, here storvsc has to destroy and reconstruct the I/O queues: the
> > I/O queues are shared-memory ring buffers between the guest and the
> > host; in the resume path of the hibernation procedure, the memory
> > pages allocated by the 'new' kernel are different from those allocated by
> > the 'old' kernel, so before jumping to the 'old' kernel, the 'new' kernel
> > must destroy the mapping of the pages, and later, after we jump to
> > the 'old' kernel, we'll re-create the mapping using the pages allocated
> > by the 'old' kernel. Here "create the mapping" means the guest tells
> > the host about the physical addresses of the pages.
> 
> Thank you for having clarified this. This helps me to understand the HV
> driver framework better. I think this means that the hv_driver.suspend
> function should be called at a later time than SCSI suspend. From

I agree, but it looks like scsi_bus_freeze() doesn't work here as we expected?

> Documentation/driver-api/device_link.rst: "By default, the driver core
> only enforces dependencies between devices that are borne out of a
> parent/child relationship within the device hierarchy: When suspending,
> resuming or shutting down the system, devices are ordered based on this
> relationship, i.e. children are always suspended before their parent,
> and the parent is always resumed before its children." Is there a single
> storvsc_drv instance for all SCSI devices supported by storvsc_drv? Has

Yes.

> it been considered to make storvsc_drv the parent device of all SCSI
> devices created by the storvsc driver?
> 
> Thanks,
> 
> Bart.

Yes, I think so:

root@localhost:~# ls -rtl  /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver
lrwxrwxrwx 1 root root 0 Apr 22 01:10 /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver -> ../../../../../../../../../../bus/scsi/drivers/sd

Here the driver of /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943
is storvsc, which creates host3/target3:0:0/3:0:0:0.

So it looks like there is no ordering issue.

Thanks,
-- Dexuan
Bart Van Assche April 23, 2020, 4:37 p.m. UTC | #15
On 4/23/20 12:04 AM, Dexuan Cui wrote:
> It looks like the sd suspend callbacks only cover the I/O from the disk, e.g.
> from the file system that lives in some partition of some disk.
> 
> The panic I'm seeing is not from sd. I think it's from a kernel thread
> that tries to detect the status of the SCSI CDROM. These are the snipped
> messages (the full version is at https://lkml.org/lkml/2020/4/10/47): here
> the suspend callbacks of the sd, sr and scsi_bus_type.pm have been called,
> and later the storvsc LLD's suspend callback is also called, but
> sr_block_check_events() can still try to submit SCSI commands to storvsc:
> 
> [   11.668741] sr 0:0:0:1: bus quiesce
> [   11.668804] sd 0:0:0:0: bus quiesce
> [   11.698082] scsi target0:0:0: bus quiesce
> [   11.703296] scsi host0: bus quiesce
> [   11.781730] hv_storvsc bf78936f-7d8f-45ce-ab03-6c341452e55d: noirq bus quiesce
> [   11.796479] hv_netvsc dda5a2be-b8b8-4237-b330-be8a516a72c0: noirq bus quiesce
> [   11.804042] BUG: kernel NULL pointer dereference, address: 0000000000000090
> [   11.804996] Workqueue: events_freezable_power_ disk_events_workfn
> [   11.804996] RIP: 0010:storvsc_queuecommand+0x261/0x714 [hv_storvsc]
> [   11.804996] Call Trace:
> [   11.804996]  scsi_queue_rq+0x593/0xa10
> [   11.804996]  blk_mq_dispatch_rq_list+0x8d/0x510
> [   11.804996]  blk_mq_sched_dispatch_requests+0xed/0x170
> [   11.804996]  __blk_mq_run_hw_queue+0x55/0x110
> [   11.804996]  __blk_mq_delay_run_hw_queue+0x141/0x160
> [   11.804996]  blk_mq_sched_insert_request+0xc3/0x170
> [   11.804996]  blk_execute_rq+0x4b/0xa0
> [   11.804996]  __scsi_execute+0xeb/0x250
> [   11.804996]  sr_check_events+0x9f/0x270 [sr_mod]
> [   11.804996]  cdrom_check_events+0x1a/0x30 [cdrom]
> [   11.804996]  sr_block_check_events+0xcc/0x110 [sr_mod]
> [   11.804996]  disk_check_events+0x68/0x160
> [   11.804996]  process_one_work+0x20c/0x3d0
> [   11.804996]  worker_thread+0x2d/0x3e0
> [   11.804996]  kthread+0x10c/0x130
> [   11.804996]  ret_from_fork+0x35/0x40
> 
> It looks like the issue is that scsi_bus_freeze() -> ... -> scsi_dev_type_suspend()
> -> scsi_device_quiesce() does not guarantee the device is totally quiescent:

During hibernation processes are frozen before devices are quiesced. 
freeze_processes() calls try_to_freeze_tasks() and that function in turn 
calls freeze_workqueues_begin() and freeze_workqueues_busy(). 
freeze_workqueues_busy() freezes all freezable workqueues including 
system_freezable_power_efficient_wq, the workqueue from which 
check_events functions are called. Some time after freezable workqueues 
are frozen dpm_suspend(PMSG_FREEZE) is called. That last call triggers 
the pm_ops.freeze callbacks, including the pm_ops.freeze callbacks 
defined in the SCSI core.

The above trace seems to indicate that freezing workqueues had not
happened before the devices were frozen. How about doing the following to
retrieve more information about what is going on?
* Enable CONFIG_PM_DEBUG in the kernel configuration.
* Run echo 1 > /sys/power/pm_print_times and echo 1 > 
/sys/power/pm_debug_messages before hibernation starts.

>> Documentation/driver-api/device_link.rst: "By default, the driver core
>> only enforces dependencies between devices that are borne out of a
>> parent/child relationship within the device hierarchy: When suspending,
>> resuming or shutting down the system, devices are ordered based on this
>> relationship, i.e. children are always suspended before their parent,
>> and the parent is always resumed before its children." Is there a single
>> storvsc_drv instance for all SCSI devices supported by storvsc_drv? Has
>> it been considered to make storvsc_drv the parent device of all SCSI
>> devices created by the storvsc driver?
> 
> Yes, I think so:
> 
> root@localhost:~# ls -rtl  /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver
> lrwxrwxrwx 1 root root 0 Apr 22 01:10 /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver -> ../../../../../../../../../../bus/scsi/drivers/sd
> 
> Here the driver of /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943
> is storvsc, which creates host3/target3:0:0/3:0:0:0.
> 
> So it looks there is no ordering issue.

Right, I had overlooked the code in storvsc_probe() that associates SCSI 
devices with storvsc_drv.

Bart.
Dexuan Cui April 23, 2020, 6:29 p.m. UTC | #16
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Thursday, April 23, 2020 9:38 AM
> 
> On 4/23/20 12:04 AM, Dexuan Cui wrote:
> > It looks like the sd suspend callbacks only cover the I/O from the disk, e.g.
> > from the file system that lives in some partition of some disk.
> >
> > The panic I'm seeing is not from sd. I think it's from a kernel thread
> > that tries to detect the status of the SCSI CDROM. These are the snipped
> > messages (the full version is at ...): here
> > the suspend callbacks of the sd, sr and scsi_bus_type.pm have been called,
> > and later the storvsc LLD's suspend callback is also called, but
> > sr_block_check_events() can still try to submit SCSI commands to storvsc:
> >
> > [   11.668741] sr 0:0:0:1: bus quiesce
> > [   11.668804] sd 0:0:0:0: bus quiesce
> > [   11.698082] scsi target0:0:0: bus quiesce
> > [   11.703296] scsi host0: bus quiesce
> > [   11.781730] hv_storvsc bf78936f-7d8f-45ce-ab03-6c341452e55d: noirq bus quiesce
> > [   11.796479] hv_netvsc dda5a2be-b8b8-4237-b330-be8a516a72c0: noirq bus quiesce
> > [   11.804042] BUG: kernel NULL pointer dereference, address: 0000000000000090
> > [   11.804996] Workqueue: events_freezable_power_ disk_events_workfn
> > [   11.804996] RIP: 0010:storvsc_queuecommand+0x261/0x714 [hv_storvsc]
> > [   11.804996] Call Trace:
> > [   11.804996]  scsi_queue_rq+0x593/0xa10
> > [   11.804996]  blk_mq_dispatch_rq_list+0x8d/0x510
> > [   11.804996]  blk_mq_sched_dispatch_requests+0xed/0x170
> > [   11.804996]  __blk_mq_run_hw_queue+0x55/0x110
> > [   11.804996]  __blk_mq_delay_run_hw_queue+0x141/0x160
> > [   11.804996]  blk_mq_sched_insert_request+0xc3/0x170
> > [   11.804996]  blk_execute_rq+0x4b/0xa0
> > [   11.804996]  __scsi_execute+0xeb/0x250
> > [   11.804996]  sr_check_events+0x9f/0x270 [sr_mod]
> > [   11.804996]  cdrom_check_events+0x1a/0x30 [cdrom]
> > [   11.804996]  sr_block_check_events+0xcc/0x110 [sr_mod]
> > [   11.804996]  disk_check_events+0x68/0x160
> > [   11.804996]  process_one_work+0x20c/0x3d0
> > [   11.804996]  worker_thread+0x2d/0x3e0
> > [   11.804996]  kthread+0x10c/0x130
> > [   11.804996]  ret_from_fork+0x35/0x40
> >
> > It looks like the issue is that scsi_bus_freeze() -> ... -> scsi_dev_type_suspend()
> > -> scsi_device_quiesce() does not guarantee the device is totally quiescent:
> 
> During hibernation processes are frozen before devices are quiesced.
> freeze_processes() calls try_to_freeze_tasks() and that function in turn
> calls freeze_workqueues_begin() and freeze_workqueues_busy().
> freeze_workqueues_busy() freezes all freezable workqueues including
> system_freezable_power_efficient_wq, the workqueue from which
> check_events functions are called. Some time after freezable workqueues
> are frozen dpm_suspend(PMSG_FREEZE) is called. That last call triggers
> the pm_ops.freeze callbacks, including the pm_ops.freeze callbacks
> defined in the SCSI core.
> 
> The above trace seems to indicate that freezing workqueues has not
> happened before devices were frozen. How about doing the following to
> retrieve more information about what is going on?
> * Enable CONFIG_PM_DEBUG in the kernel configuration.
> * Run echo 1 > /sys/power/pm_print_times and echo 1 >
> /sys/power/pm_debug_messages before hibernation starts.
> 
> Bart.

Good point! My panic happens in the "restore" path, not the "save" path.

In the "save" path, as you described, it looks everything is done correctly.
BTW, freeze_processes() only freezes the userspace processes. After
hibernate() -> freeze_processes(), hibernate() -> hibernation_snapshot()
-> freeze_kernel_threads() also freezes the "freezable" kernel processes,
and then we call dpm_suspend(PMSG_FREEZE), and next we do a lot of
other things, and finally the system is powered off.
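
Roughly, the call ordering in the "save" path is (a simplified sketch
from my reading of kernel/power/hibernate.c; exact call sites may differ):

  hibernate()
    -> freeze_processes()             /* user-space processes only */
    -> hibernation_snapshot()
         -> freeze_kernel_threads()   /* freezable kernel threads */
         -> dpm_suspend(PMSG_FREEZE)  /* driver freeze callbacks */
    -> ...                            /* write the image, power off */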

In the "restore" path of the hibernation:
1. We start the system and a fresh new kernel starts to run.

2. The 'new' kernel itself finishes the initialization and the 'init' script
in the initramfs starts to run. The 'init' script notices there is a saved
state of the previous 'old' kernel in some swap file/partition, so it won't
do a usual start-up; instead, the 'init' script runs a program called 'resume'.

3. The 'resume' program talks to the kernel via /sys/power/resume and
asks the kernel to do a resume-from-disk.

4. The kernel starts to run resume_store() -> software_resume() ->
freeze_processes(), which freezes the user-space processes, but the
"freezable" kernel threads are not frozen!!!
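
That is, the call chain in the restore path is roughly (a simplified
sketch):

  resume_store()                    /* write to /sys/power/resume */
    -> software_resume()
         -> freeze_processes()      /* user space frozen */
         /* no freeze_kernel_threads() here! */
         -> load_image_and_restore()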

So it looks like the below patch also works for me:

--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -898,6 +898,11 @@ static int software_resume(void)
        error = freeze_processes();
        if (error)
                goto Close_Finish;
+
+       error = freeze_kernel_threads();
+       if (error)
+               goto Close_Finish;
+
        error = load_image_and_restore();
        thaw_processes();
  Finish:

Just to be sure, I'll do more tests, but I believe the panic can be fixed
by this, according to the tests I have done so far.

I'm still not sure what the comment before scsi_device_quiesce() means:
 *  ... Since special requests may also be requeued requests,
 *      a successful return doesn't guarantee the device will be
 *      totally quiescent.

I don't know whether some other I/O can be submitted after
scsi_device_quiesce() returns in the case of hibernation, and I don't
know whether aac_suspend() -> scsi_host_block() should be fixed or removed,
but as far as the panic is concerned, I'm very glad I have found a better
fix with your help. 

Thank you so much, Bart! 

Thanks,
-- Dexuan
Bart Van Assche April 23, 2020, 11:25 p.m. UTC | #17
On 2020-04-23 11:29, Dexuan Cui wrote:
> So it looks like the below patch also works for me:
> 
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -898,6 +898,11 @@ static int software_resume(void)
>         error = freeze_processes();
>         if (error)
>                 goto Close_Finish;
> +
> +       error = freeze_kernel_threads();
> +       if (error)
> +               goto Close_Finish;
> +
>         error = load_image_and_restore();
>         thaw_processes();
>   Finish:
> 
> Just to be sure, I'll do more tests, but I believe the panic can be fixed
> by this, according to the tests I have done so far.

If a freeze_kernel_threads() call is added in software_resume(), should
a thaw_kernel_threads() call be added too?

Anyway, please Cc me if a patch for software_resume() is submitted.

> I'm still not sure what the comment before scsi_device_quiesce() means:
>  *  ... Since special requests may also be requeued requests,
>  *      a successful return doesn't guarantee the device will be
>  *      totally quiescent.
> 
> I don't know if there can be some other I/O submitted after
> scsi_device_quiesce() returns in the case of hibernation, and I don't
> know if aac_suspend() -> scsi_host_block() should be fixed/removed,
> but as far as the panic is concerned, I'm very glad I have found a better
> fix with your help. 

The function blk_set_pm_only() increments the q->pm_only counter while
the blk_clear_pm_only() function decrements the q->pm_only counter.
If q->pm_only > 0, blk_queue_enter() only succeeds if the flag
BLK_MQ_REQ_PREEMPT is set in the second argument passed to that
function. blk_get_request() calls blk_queue_enter(). The result is that
while q->pm_only > 0, blk_get_request() only submits a request without
waiting if the BLK_MQ_REQ_PREEMPT flag is set in its second argument.
scsi_execute() sets the BLK_MQ_REQ_PREEMPT flag. In other words,
scsi_device_quiesce() blocks requests submitted by filesystems but still
allows SCSI commands submitted by the SCSI core to be executed.
"special" refers to requests with the BLK_MQ_REQ_PREEMPT flag set.

Bart.
Dexuan Cui April 24, 2020, 2:40 a.m. UTC | #18
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Thursday, April 23, 2020 4:25 PM
> On 2020-04-23 11:29, Dexuan Cui wrote:
> > So it looks like the below patch also works for me:
> >
> > --- a/kernel/power/hibernate.c
> > +++ b/kernel/power/hibernate.c
> > @@ -898,6 +898,11 @@ static int software_resume(void)
> >         error = freeze_processes();
> >         if (error)
> >                 goto Close_Finish;
> > +
> > +       error = freeze_kernel_threads();
> > +       if (error)
> > +               goto Close_Finish;
> > +
> >         error = load_image_and_restore();
> >         thaw_processes();
> >   Finish:
> >
> > Just to be sure, I'll do more tests, but I believe the panic can be fixed
> > by this, according to the tests I have done so far.
> 
> If a freeze_kernel_threads() call is added in software_resume(), should
> a thaw_kernel_threads() call be added too?

Good catch! 

Note: thaw_processes() thaws every frozen process, including both user
space processes and kernel processes.

In software_resume():
1. If freeze_kernel_threads() fails, I should add a "thaw_processes();"
before the "goto Close_Finish;" so that all the user space processes can
be thawed (see the sketch after item 3 below).

2. If freeze_kernel_threads() succeeds, but load_image_and_restore()
fails, there is already a thaw_processes().

3. If load_image_and_restore() succeeds, it won't return; instead,
execution resumes in the 'old' kernel, returning from the 'old' kernel's
hibernate() -> hibernation_snapshot() -> create_image() ->
swsusp_arch_suspend(), and later hibernate() -> thaw_processes() will
thaw every frozen process of the 'old' kernel.
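
So the corrected error handling might look like the sketch below (just a
sketch based on the analysis above; the patch I actually post may differ):

--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -898,6 +898,13 @@ static int software_resume(void)
        error = freeze_processes();
        if (error)
                goto Close_Finish;
+
+       error = freeze_kernel_threads();
+       if (error) {
+               thaw_processes();
+               goto Close_Finish;
+       }
+
        error = load_image_and_restore();
        thaw_processes();
  Finish: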
 
> Anyway, please Cc me if a patch for software_resume() is submitted.

Sure. Will do.

> > I'm still not sure what the comment before scsi_device_quiesce() means:
> >  *  ... Since special requests may also be requeued requests,
> >  *      a successful return doesn't guarantee the device will be
> >  *      totally quiescent.
> >
> > I don't know if there can be some other I/O submitted after
> > scsi_device_quiesce() returns in the case of hibernation, and I don't
> > know if aac_suspend() -> scsi_host_block() should be fixed/removed,
> > but as far as the panic is concerned, I'm very glad I have found a better
> > fix with your help.
> 
> The function blk_set_pm_only() increments the q->pm_only counter while
> the blk_clear_pm_only() function decrements the q->pm_only counter.
> If q->pm_only > 0, blk_queue_enter() only succeeds if the flag
> BLK_MQ_REQ_PREEMPT is set in the second argument passed to that
> function. blk_get_request() calls blk_queue_enter(). The result is that
> while q->pm_only > 0, blk_get_request() only submits a request without
> waiting if the BLK_MQ_REQ_PREEMPT flag is set in its second argument.
> scsi_execute() sets the BLK_MQ_REQ_PREEMPT flag. In other words,
> scsi_device_quiesce() blocks requests submitted by filesystems but still
> allows SCSI commands submitted by the SCSI core to be executed.
> "special" refers to requests with the BLK_MQ_REQ_PREEMPT flag set.
> 
> Bart.

Thanks for the detailed clarification! So it sounds like we're safe here,
and I guess the scsi_host_block() in aac_suspend() should be removed
to discourage people from trying to use scsi_host_block() in a .suspend()
callback. :-)

Thanks,
-- Dexuan
Dexuan Cui April 24, 2020, 3:45 a.m. UTC | #19
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Thursday, April 23, 2020 4:25 PM
> On 2020-04-23 11:29, Dexuan Cui wrote:
> > So it looks like the below patch also works for me:
> >
> > --- a/kernel/power/hibernate.c
> > +++ b/kernel/power/hibernate.c
> > @@ -898,6 +898,11 @@ static int software_resume(void)
> >         error = freeze_processes();
> >         if (error)
> >                 goto Close_Finish;
> > +
> > +       error = freeze_kernel_threads();
> > +       if (error)
> > +               goto Close_Finish;
> > +
> >         error = load_image_and_restore();
> >         thaw_processes();
> >   Finish:
> >
> > Just to be sure, I'll do more tests, but I believe the panic can be fixed
> > by this, according to the tests I have done so far.
> 
> If a freeze_kernel_threads() call is added in software_resume(), should
> a thaw_kernel_threads() call be added too?
> 
> Anyway, please Cc me if a patch for software_resume() is submitted.

FYI, I posted a fix: https://lkml.org/lkml/2020/4/23/1540

Thanks,
-- Dexuan
diff mbox series

Patch

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index fb41636519ee..fd51d2f03778 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
 	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
 	struct Scsi_Host *host = stor_device->host;
 	struct hv_host_device *host_dev = shost_priv(host);
+	int ret;
+
+	ret = scsi_host_block(host);
+	if (ret)
+		return ret;
 
 	storvsc_wait_to_drain(stor_device);
 
@@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)
 
 static int storvsc_resume(struct hv_device *hv_dev)
 {
+	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
+	struct Scsi_Host *host = stor_device->host;
 	int ret;
 
 	ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
 				     hv_dev_is_fc(hv_dev));
+	if (!ret)
+		ret = scsi_host_unblock(host, SDEV_RUNNING);
+
 	return ret;
 }