Message ID | 1587514644-47058-1-git-send-email-decui@microsoft.com (mailing list archive) |
---|---|
State | Rejected |
Series | scsi: storvsc: Fix a panic in the hibernation procedure |
On Tue, Apr 21, 2020 at 05:17:24PM -0700, Dexuan Cui wrote:
> During hibernation, the sdevs are suspended automatically in
> drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> storvsc_suspend(), there is no disk I/O from the file systems, but there
> can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> sr_block_check_events() -> cdrom_check_events() can still submit I/O
> to the storvsc driver, which causes a panic of NULL pointer dereference,
> since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> to the below links for more info:
> https://lkml.org/lkml/2020/4/10/47
> https://lkml.org/lkml/2020/4/17/1103
>
> Fix the panic by blocking/unblocking all the I/O queues properly.
>
> Note: this patch depends on another patch "scsi: core: Allow the state
> change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
>
> Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/scsi/storvsc_drv.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index fb41636519ee..fd51d2f03778 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
>  	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
>  	struct Scsi_Host *host = stor_device->host;
>  	struct hv_host_device *host_dev = shost_priv(host);
> +	int ret;
> +
> +	ret = scsi_host_block(host);
> +	if (ret)
> +		return ret;
>
>  	storvsc_wait_to_drain(stor_device);
>
> @@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)
>
>  static int storvsc_resume(struct hv_device *hv_dev)
> {
> +	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
> +	struct Scsi_Host *host = stor_device->host;
>  	int ret;
>
>  	ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
>  				     hv_dev_is_fc(hv_dev));
> +	if (!ret)
> +		ret = scsi_host_unblock(host, SDEV_RUNNING);
> +
>  	return ret;
> }

scsi_host_block() is actually too heavy for just avoiding the SCSI
internal commands, which can be done simply with one atomic variable.

Not to mention that scsi_host_block() is implemented too clumsily:
nr_luns * synchronize_rcu() calls are required in scsi_host_block(),
which should have been optimized down to just one.

Also scsi_device_quiesce() is heavy too; it still takes 2
synchronize_rcu() calls for one LUN.

That is to say, SCSI suspend may take (3 * nr_luns) synchronize_rcu()
calls in case the HBA's suspend handler needs scsi_host_block().

Thanks,
Ming
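[Editor's illustration: a minimal sketch of the atomic-variable gating Ming has in mind. The flag name and function bodies are hypothetical, not the actual storvsc code; the point is only that a lock-free check in .queuecommand() can refuse commands while the channel is down.]

#include <linux/atomic.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

/* Hypothetical flag: set in the .suspend() callback, cleared in .resume(). */
static atomic_t storvsc_suspended = ATOMIC_INIT(0);

static int storvsc_queuecommand_sketch(struct Scsi_Host *host,
				       struct scsi_cmnd *scmnd)
{
	/*
	 * Refuse commands while the vmbus channel is torn down; the SCSI
	 * midlayer requeues the command instead of crashing in the LLD.
	 */
	if (atomic_read(&storvsc_suspended))
		return SCSI_MLQUEUE_HOST_BUSY;

	/* ... the normal submission path goes here ... */
	return 0;
}

[This avoids any RCU grace period on the suspend path, at the cost of one atomic read per command in the hot path.]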
> From: Ming Lei <ming.lei@redhat.com>
> Sent: Tuesday, April 21, 2020 6:28 PM
> To: Dexuan Cui <decui@microsoft.com>
>
> On Tue, Apr 21, 2020 at 05:17:24PM -0700, Dexuan Cui wrote:
> > During hibernation, the sdevs are suspended automatically in
> > drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> > storvsc_suspend(), there is no disk I/O from the file systems, but there
> > can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> > sr_block_check_events() -> cdrom_check_events() can still submit I/O
> > to the storvsc driver, which causes a panic of NULL pointer dereference,
> > since storvsc has closed the vmbus channel in storvsc_suspend().
> > [...]
>
> scsi_host_block() is actually too heavy for just avoiding the SCSI
> internal commands, which can be done simply with one atomic variable.
>
> Not to mention that scsi_host_block() is implemented too clumsily:
> nr_luns * synchronize_rcu() calls are required in scsi_host_block(),
> which should have been optimized down to just one.
>
> Also scsi_device_quiesce() is heavy too; it still takes 2
> synchronize_rcu() calls for one LUN.
>
> That is to say, SCSI suspend may take (3 * nr_luns) synchronize_rcu()
> calls in case the HBA's suspend handler needs scsi_host_block().
>
> Thanks,
> Ming

When we're in storvsc_suspend(), all the userspace processes have been
frozen and all the file systems have been flushed, and there should not
be too much I/O from the kernel space, so IMO scsi_host_block() should be
pretty fast here.

Of course, any performance improvement to the scsi_host_block() API is
still great. :-)

Thanks,
-- Dexuan
On Wed, Apr 22, 2020 at 01:48:25AM +0000, Dexuan Cui wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> > Sent: Tuesday, April 21, 2020 6:28 PM
> > To: Dexuan Cui <decui@microsoft.com>
> >
> > [...]
> >
> > scsi_host_block() is actually too heavy for just avoiding the SCSI
> > internal commands, which can be done simply with one atomic variable.
> >
> > Not to mention that scsi_host_block() is implemented too clumsily:
> > nr_luns * synchronize_rcu() calls are required in scsi_host_block(),
> > which should have been optimized down to just one.
> >
> > Also scsi_device_quiesce() is heavy too; it still takes 2
> > synchronize_rcu() calls for one LUN.
> >
> > That is to say, SCSI suspend may take (3 * nr_luns) synchronize_rcu()
> > calls in case the HBA's suspend handler needs scsi_host_block().
> >
> > Thanks,
> > Ming
>
> When we're in storvsc_suspend(), all the userspace processes have been
> frozen and all the file systems have been flushed, and there should not
> be too much I/O from the kernel space, so IMO scsi_host_block() should be
> pretty fast here.

I guess it depends on RCU's implementation, so CC the RCU guys.

Hello Paul & Josh,

Could you clarify whether synchronize_rcu() becomes quick during
system suspend?

Thanks,
Ming
On Wed, Apr 22, 2020 at 01:48:25AM +0000, Dexuan Cui wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> > Sent: Tuesday, April 21, 2020 6:28 PM
> > To: Dexuan Cui <decui@microsoft.com>
> >
> > [...]
>
> When we're in storvsc_suspend(), all the userspace processes have been
> frozen and all the file systems have been flushed, and there should not
> be too much I/O from the kernel space, so IMO scsi_host_block() should be
> pretty fast here.

Forgot to mention: I guess storvsc_suspend() may be called from the
runtime PM path; at that time the whole system is running, and RCU can't
be quick.

Thanks,
Ming
On Wed, Apr 22, 2020 at 10:01:34AM +0800, Ming Lei wrote:
> On Wed, Apr 22, 2020 at 01:48:25AM +0000, Dexuan Cui wrote:
> > [...]
> >
> > When we're in storvsc_suspend(), all the userspace processes have been
> > frozen and all the file systems have been flushed, and there should not
> > be too much I/O from the kernel space, so IMO scsi_host_block() should be
> > pretty fast here.
>
> I guess it depends on RCU's implementation, so CC the RCU guys.
>
> Hello Paul & Josh,
>
> Could you clarify whether synchronize_rcu() becomes quick during
> system suspend?

Once you have all but one CPU offlined, it becomes extremely fast, as
in roughly a no-op (which is an idea of Josh's from back in the day).
But if there is more than one CPU online, then synchronize_rcu() still
takes on the order of several to several tens of jiffies.

So, yes, in some portions of system suspend, synchronize_rcu() becomes
very fast indeed.

Thanx, Paul
On Tue, Apr 21, 2020 at 08:08:07PM -0700, Paul E. McKenney wrote:
> On Wed, Apr 22, 2020 at 10:01:34AM +0800, Ming Lei wrote:
> > [...]
> >
> > Hello Paul & Josh,
> >
> > Could you clarify whether synchronize_rcu() becomes quick during
> > system suspend?
>
> Once you have all but one CPU offlined, it becomes extremely fast, as
> in roughly a no-op (which is an idea of Josh's from back in the day).
> But if there is more than one CPU online, then synchronize_rcu() still
> takes on the order of several to several tens of jiffies.
>
> So, yes, in some portions of system suspend, synchronize_rcu() becomes
> very fast indeed.

Hi Paul,

Thanks for your clarification.

In the system suspend path, devices are suspended before
suspend_disable_secondary_cpus(), so I guess synchronize_rcu() is not
quick enough there, even though user space processes and some kernel
threads are frozen.

Thanks,
Ming
> From: Ming Lei <ming.lei@redhat.com>
> Sent: Tuesday, April 21, 2020 9:16 PM
> ...
> > > > When we're in storvsc_suspend(), all the userspace processes have been
> > > > frozen and all the file systems have been flushed, and there should not
> > > > be too much I/O from the kernel space, so IMO scsi_host_block() should be
> > > > pretty fast here.
> > >
> > > I guess it depends on RCU's implementation, so CC the RCU guys.
> > >
> > > Hello Paul & Josh,
> > >
> > > Could you clarify whether synchronize_rcu() becomes quick during
> > > system suspend?
> >
> > Once you have all but one CPU offlined, it becomes extremely fast, as
> > in roughly a no-op (which is an idea of Josh's from back in the day).
> > But if there is more than one CPU online, then synchronize_rcu() still
> > takes on the order of several to several tens of jiffies.
> >
> > So, yes, in some portions of system suspend, synchronize_rcu() becomes
> > very fast indeed.
>
> Hi Paul,
>
> Thanks for your clarification.
>
> In the system suspend path, devices are suspended before
> suspend_disable_secondary_cpus(), so I guess synchronize_rcu() is not
> quick enough there, even though user space processes and some kernel
> threads are frozen.
>
> Thanks,
> Ming

storvsc_suspend() -> scsi_host_block() is only called in the hibernation
path, which is not a hot path at all, so IMHO we don't really care if it
takes 10ms or 100ms or even 1s. :-)

BTW, in my test, scsi_host_block() here typically takes about 3ms in
my 40-vCPU VM.

storvsc_suspend() is not called from the runtime PM path, because the
runtime_suspend/runtime_resume/runtime_idle ops are not defined at all
for the devices on the Hyper-V VMBus bus: these are pure
software-emulated devices, so runtime PM is unnecessary for them.

Thanks,
-- Dexuan
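[Editor's note: for reference, the hv_driver registration Dexuan refers to, paraphrased from drivers/scsi/storvsc_drv.c. Only system-PM suspend/resume hooks are wired up; there are no runtime-PM ops on the VMBus bus, so storvsc_suspend() cannot be reached from runtime PM.]

static struct hv_driver storvsc_drv = {
	.name	  = KBUILD_MODNAME,
	.id_table = id_table,
	.probe	  = storvsc_probe,
	.remove	  = storvsc_remove,
	.suspend  = storvsc_suspend,	/* system/hibernation PM only */
	.resume	  = storvsc_resume,
};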
On 4/21/20 5:17 PM, Dexuan Cui wrote:
> During hibernation, the sdevs are suspended automatically in
> drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> storvsc_suspend(), there is no disk I/O from the file systems, but there
> can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> sr_block_check_events() -> cdrom_check_events() can still submit I/O
> to the storvsc driver, which causes a panic of NULL pointer dereference,
> since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> to the below links for more info:
> https://lkml.org/lkml/2020/4/10/47
> https://lkml.org/lkml/2020/4/17/1103
>
> Fix the panic by blocking/unblocking all the I/O queues properly.
>
> Note: this patch depends on another patch "scsi: core: Allow the state
> change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
>
> Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
> [...]

I don't like this patch. It makes the behavior of the storvsc driver
different from every other SCSI LLD. Other SCSI LLDs don't need this
change because they don't destroy their I/O queues upon suspend.

Bart.
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Tuesday, April 21, 2020 10:02 PM
> On 4/21/20 5:17 PM, Dexuan Cui wrote:
> > During hibernation, the sdevs are suspended automatically in
> > drivers/scsi/scsi_pm.c before storvsc_suspend(), so after
> > storvsc_suspend(), there is no disk I/O from the file systems, but there
> > can still be disk I/O from the kernel space, e.g. disk_check_events() ->
> > sr_block_check_events() -> cdrom_check_events() can still submit I/O
> > to the storvsc driver, which causes a panic of NULL pointer dereference,
> > since storvsc has closed the vmbus channel in storvsc_suspend(): refer
> > to the below links for more info:
> > ...
> > Fix the panic by blocking/unblocking all the I/O queues properly.
> >
> > Note: this patch depends on another patch "scsi: core: Allow the state
> > change from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).
>
> I don't like this patch. It makes the behavior of the storvsc driver
> different from every other SCSI LLD. Other SCSI LLDs don't need this
> change because they don't destroy their I/O queues upon suspend.
>
> Bart.

Upon suspend, I suppose the other LLDs cannot accept I/O either, so
what do they do when some kernel thread still tries to submit I/O? Do
they "block" the I/O until resume, or just return an error immediately?

I had a look at drivers/scsi/xen-scsifront.c. It looks like this LLD
implements a mechanism of marking the device busy upon suspend, so any
I/O will immediately get an error SCSI_MLQUEUE_HOST_BUSY. IMO the
disadvantage is: the mechanism of marking the device busy looks complex,
and the hot path .queuecommand() has to take the spinlock
shost->host_lock, which should affect performance.

It looks like drivers/scsi/nsp32.c: nsp32_suspend() and
drivers/scsi/3w-9xxx.c: twa_suspend() do nothing to handle new I/O
after suspend. I doubt this is correct.

It looks like drivers/scsi/hisi_sas/hisi_sas_v3_hw.c:
hisi_sas_v3_suspend() tries to use scsi_block_requests() as a means of
blocking new I/O, but I doubt it can work: scsi_block_requests() only
sets a variable -- how could this prevent another concurrent kernel
thread from submitting new I/O without a race condition?

So it looks to me like there is no simple mechanism to handle the
scenario here, and I guess that's why the scsi_host_block/unblock APIs
were introduced; actually there is already a user of the APIs:
3d3ca53b1639 ("scsi: aacraid: use scsi_host_(block,unblock) to block I/O").

The aacraid patch says: "This has the advantage that the block layer will
stop sending I/O to the adapter instead of having the SCSI midlayer
requeueing I/O internally". It looks like this may imply that using the
new APIs is encouraged?

Sorry for my lack of knowledge of the Linux SCSI system, but IMHO the
new APIs are simple and effective, and they really look like a good fit
for the hibernation scenario. BTW, storvsc_suspend() is not a hot path
here, so IMO I don't have any performance concern.

PS, here storvsc has to destroy and re-construct the I/O queues: the
I/O queues are shared-memory ringbuffers between the guest and the
host; in the resume path of the hibernation procedure, the memory pages
allocated by the 'new' kernel are different from those allocated by the
'old' kernel, so before jumping to the 'old' kernel, the 'new' kernel
must destroy the mapping of the pages, and later, after we jump to the
'old' kernel, we'll re-create the mapping using the pages allocated by
the 'old' kernel. Here "create the mapping" means the guest tells the
host about the physical addresses of the pages.

Thanks,
-- Dexuan
On Wed, Apr 22, 2020 at 04:58:14AM +0000, Dexuan Cui wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> > Sent: Tuesday, April 21, 2020 9:16 PM
> > ...
> > [...]
> >
> > In the system suspend path, devices are suspended before
> > suspend_disable_secondary_cpus(), so I guess synchronize_rcu() is not
> > quick enough there, even though user space processes and some kernel
> > threads are frozen.
> >
> > Thanks,
> > Ming
>
> storvsc_suspend() -> scsi_host_block() is only called in the hibernation
> path, which is not a hot path at all, so IMHO we don't really care if it
> takes 10ms or 100ms or even 1s. :-)

Are you sure the 'we' can cover all users?

> BTW, in my test, scsi_host_block() here typically takes about 3ms in
> my 40-vCPU VM.

If more LUNs are added, the time should increase proportionally;
that is why I think scsi_host_block() is bad.

Thanks,
Ming
On Wed, Apr 22, 2020 at 05:23:51PM +0800, Ming Lei wrote:
> On Wed, Apr 22, 2020 at 04:58:14AM +0000, Dexuan Cui wrote:
> > [...]
> >
> > storvsc_suspend() -> scsi_host_block() is only called in the hibernation
> > path, which is not a hot path at all, so IMHO we don't really care if it
> > takes 10ms or 100ms or even 1s. :-)
>
> Are you sure the 'we' can cover all users?
>
> > BTW, in my test, scsi_host_block() here typically takes about 3ms in
> > my 40-vCPU VM.
>
> If more LUNs are added, the time should increase proportionally;
> that is why I think scsi_host_block() is bad.

If the caller must wait until the grace period ends, then the
traditional approach is to use a single synchronize_rcu() to cover all
LUNs. This of course can require some reworking of the code.

If the caller does not need to wait, then either call_rcu() or
kfree_rcu() can work quite well.

Thanx, Paul
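[Editor's illustration: a sketch of the single-grace-period rework Paul describes, assuming scsi_host_block() were restructured along these lines. This is a hypothetical function body, not the mainline implementation; it assumes scsi_internal_device_block_nowait() is available to the caller.]

#include <linux/rcupdate.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

static int scsi_host_block_sketch(struct Scsi_Host *shost)
{
	struct scsi_device *sdev;
	int ret = 0;

	/* Move every LUN to SDEV_BLOCK without waiting per device. */
	shost_for_each_device(sdev, shost) {
		mutex_lock(&sdev->state_mutex);
		ret = scsi_internal_device_block_nowait(sdev);
		mutex_unlock(&sdev->state_mutex);
		if (ret)
			break;
	}

	/* A single grace period now covers all of the LUNs blocked above. */
	if (!ret)
		synchronize_rcu();

	return ret;
}

[The cost drops from nr_luns grace periods to one, which is exactly the optimization discussed later in this thread.]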
> From: Ming Lei <ming.lei@redhat.com>
> Sent: Wednesday, April 22, 2020 2:24 AM
> > ...
> > storvsc_suspend() -> scsi_host_block() is only called in the hibernation
> > path, which is not a hot path at all, so IMHO we don't really care if it
> > takes 10ms or 100ms or even 1s. :-)
>
> Are you sure the 'we' can cover all users?

I actually only meant that I don't have any performance concern in this
path: storvsc_suspend() -> scsi_host_block(). :-)

Thanks for your effort to improve the scsi_host_block() API in the patch
"[PATCH] scsi: avoid to run synchronize_rcu for each device in
scsi_host_block".

> > BTW, in my test, scsi_host_block() here typically takes about 3ms in
> > my 40-vCPU VM.
>
> If more LUNs are added, the time should increase proportionally;
> that is why I think scsi_host_block() is bad.
>
> Thanks,
> Ming

FWIW, if the time increases linearly with the number of LUNs, it looks
like it's typically still fast enough for storvsc_suspend().

Thanks,
-- Dexuan
On 4/21/20 11:24 PM, Dexuan Cui wrote:
> Upon suspend, I suppose the other LLDs cannot accept I/O either, so
> what do they do when some kernel thread still tries to submit I/O? Do
> they "block" the I/O until resume, or just return an error immediately?

This is my understanding of how other SCSI LLDs handle suspend/resume:
- The ULP driver, e.g. the SCSI sd driver, implements power management
  support by providing callbacks in struct scsi_driver.gendrv.pm and also
  in scsi_bus_type.pm. The SCSI sd suspend callbacks flush the device
  cache and send a STOP command to the device.
- SCSI LLDs for PCIe devices optionally provide pci_driver.suspend and
  resume callbacks. These callbacks can be used to make the PCIe device
  enter/leave a power saving state. No new SCSI commands should be
  submitted after pci_driver.suspend has been called.

> I had a look at drivers/scsi/xen-scsifront.c. It looks like this LLD
> implements a mechanism of marking the device busy upon suspend, so any
> I/O will immediately get an error SCSI_MLQUEUE_HOST_BUSY. IMO the
> disadvantage is: the mechanism of marking the device busy looks complex,
> and the hot path .queuecommand() has to take the spinlock
> shost->host_lock, which should affect performance.

I think the code you are referring to is the code in
scsifront_suspend(). A pointer to that function is stored in a struct
xenbus_driver instance. That's another structure than the structures
mentioned above.

Wouldn't it be better to make sure that any hypervisor suspend
operations happen after it is guaranteed that no further SCSI commands
will be submitted, such that hypervisor suspend operations do not have
to deal with SCSI commands submitted during or after the hypervisor
suspend callback?

> It looks like drivers/scsi/nsp32.c: nsp32_suspend() and
> drivers/scsi/3w-9xxx.c: twa_suspend() do nothing to handle new I/O
> after suspend. I doubt this is correct.

nsp32_suspend() is a PCI suspend callback. If any SCSI commands were
submitted after that callback has started, that would mean that the SCSI
suspend and PCIe suspend operations are called in the wrong order. I do
not agree that code for suspending SCSI commands should be added in
nsp32_suspend().

> So it looks to me like there is no simple mechanism to handle the
> scenario here, and I guess that's why the scsi_host_block/unblock APIs
> were introduced; actually there is already a user of the APIs:
> 3d3ca53b1639 ("scsi: aacraid: use scsi_host_(block,unblock) to block I/O").
>
> The aacraid patch says: "This has the advantage that the block layer will
> stop sending I/O to the adapter instead of having the SCSI midlayer
> requeueing I/O internally". It looks like this may imply that using the
> new APIs is encouraged?

I'm fine with using these new functions in device reset handlers. Using
these functions in power management handlers seems wrong to me.

> PS, here storvsc has to destroy and re-construct the I/O queues: the
> I/O queues are shared-memory ringbuffers between the guest and the
> host; in the resume path of the hibernation procedure, the memory pages
> allocated by the 'new' kernel are different from those allocated by the
> 'old' kernel, so before jumping to the 'old' kernel, the 'new' kernel
> must destroy the mapping of the pages, and later, after we jump to the
> 'old' kernel, we'll re-create the mapping using the pages allocated by
> the 'old' kernel. Here "create the mapping" means the guest tells the
> host about the physical addresses of the pages.

Thank you for having clarified this. This helps me to understand the HV
driver framework better. I think this means that the hv_driver.suspend
function should be called at a later time than SCSI suspend. From
Documentation/driver-api/device_link.rst: "By default, the driver core
only enforces dependencies between devices that are borne out of a
parent/child relationship within the device hierarchy: When suspending,
resuming or shutting down the system, devices are ordered based on this
relationship, i.e. children are always suspended before their parent,
and the parent is always resumed before its children." Is there a single
storvsc_drv instance for all SCSI devices supported by storvsc_drv? Has
it been considered to make storvsc_drv the parent device of all SCSI
devices created by the storvsc driver?

Thanks,

Bart.
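[Editor's illustration: beyond implicit parent/child ordering, device_link.rst also covers explicit links. A sketch of how two unrelated devices could be given the same suspend/resume ordering; the function and device names are placeholders.]

#include <linux/device.h>

/* The supplier is suspended after, and resumed before, the consumer,
 * mirroring the parent/child ordering quoted above. */
static int add_pm_ordering(struct device *consumer, struct device *supplier)
{
	struct device_link *link;

	link = device_link_add(consumer, supplier, DL_FLAG_AUTOREMOVE_CONSUMER);
	return link ? 0 : -EINVAL;
}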
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Wednesday, April 22, 2020 12:56 PM
> On 4/21/20 11:24 PM, Dexuan Cui wrote:
> > Upon suspend, I suppose the other LLDs cannot accept I/O either, so
> > what do they do when some kernel thread still tries to submit I/O? Do
> > they "block" the I/O until resume, or just return an error immediately?
>
> This is my understanding of how other SCSI LLDs handle suspend/resume:
> - The ULP driver, e.g. the SCSI sd driver, implements power management
>   support by providing callbacks in struct scsi_driver.gendrv.pm and also
>   in scsi_bus_type.pm. The SCSI sd suspend callbacks flush the device
>   cache and send a STOP command to the device.

It looks like the sd suspend callbacks are only for the I/O from the
disk, e.g. from the file system that lives in some partition of some
disk.

The panic I'm seeing is not from sd. I think it's from a kernel thread
that tries to detect the status of the SCSI CDROM. This is the snipped
log (the full version is at https://lkml.org/lkml/2020/4/10/47): here
the suspend callbacks of sd, sr and scsi_bus_type.pm have been called,
and later the storvsc LLD's suspend callback is also called, but
sr_block_check_events() can still try to submit SCSI commands to
storvsc:

[ 11.668741] sr 0:0:0:1: bus quiesce
[ 11.668804] sd 0:0:0:0: bus quiesce
[ 11.698082] scsi target0:0:0: bus quiesce
[ 11.703296] scsi host0: bus quiesce
[ 11.781730] hv_storvsc bf78936f-7d8f-45ce-ab03-6c341452e55d: noirq bus quiesce
[ 11.796479] hv_netvsc dda5a2be-b8b8-4237-b330-be8a516a72c0: noirq bus quiesce
[ 11.804042] BUG: kernel NULL pointer dereference, address: 0000000000000090
[ 11.804996] Workqueue: events_freezable_power_ disk_events_workfn
[ 11.804996] RIP: 0010:storvsc_queuecommand+0x261/0x714 [hv_storvsc]
[ 11.804996] Call Trace:
[ 11.804996]  scsi_queue_rq+0x593/0xa10
[ 11.804996]  blk_mq_dispatch_rq_list+0x8d/0x510
[ 11.804996]  blk_mq_sched_dispatch_requests+0xed/0x170
[ 11.804996]  __blk_mq_run_hw_queue+0x55/0x110
[ 11.804996]  __blk_mq_delay_run_hw_queue+0x141/0x160
[ 11.804996]  blk_mq_sched_insert_request+0xc3/0x170
[ 11.804996]  blk_execute_rq+0x4b/0xa0
[ 11.804996]  __scsi_execute+0xeb/0x250
[ 11.804996]  sr_check_events+0x9f/0x270 [sr_mod]
[ 11.804996]  cdrom_check_events+0x1a/0x30 [cdrom]
[ 11.804996]  sr_block_check_events+0xcc/0x110 [sr_mod]
[ 11.804996]  disk_check_events+0x68/0x160
[ 11.804996]  process_one_work+0x20c/0x3d0
[ 11.804996]  worker_thread+0x2d/0x3e0
[ 11.804996]  kthread+0x10c/0x130
[ 11.804996]  ret_from_fork+0x35/0x40

It looks like the issue is: scsi_bus_freeze() -> ... ->
scsi_dev_type_suspend() -> scsi_device_quiesce() does not guarantee the
device is totally quiescent:

/**
 * scsi_device_quiesce - Block user issued commands.
 * @sdev:	scsi device to quiesce.
 *
 * This works by trying to transition to the SDEV_QUIESCE state
 * (which must be a legal transition).  When the device is in this
 * state, only special requests will be accepted, all others will
 * be deferred.  Since special requests may also be requeued requests,
 * a successful return doesn't guarantee the device will be
 * totally quiescent.

I guess the "special requests" may include the "detect the status of
the SCSI CDROM" request from sr_block_check_events(). It looks like
scsi_device_quiesce() does not freeze the queue; it just puts the device
into the SDEV_QUIESCE state, which is not enough to prevent
sr_block_check_events() from submitting SCSI commands.

It looks like scsi_host_block() freezes the queue and flushes all the
pending I/O, so it can fix the aforementioned NULL pointer dereference
panic for me.

> - SCSI LLDs for PCIe devices optionally provide pci_driver.suspend and
>   resume callbacks. These callbacks can be used to make the PCIe device
>   enter/leave a power saving state. No new SCSI commands should be
>   submitted after pci_driver.suspend has been called.
>
> I think the code you are referring to is the code in
> scsifront_suspend(). A pointer to that function is stored in a struct
> xenbus_driver instance. That's another structure than the structures
> mentioned above.
>
> Wouldn't it be better to make sure that any hypervisor suspend
> operations happen after it is guaranteed that no further SCSI commands
> will be submitted, such that hypervisor suspend operations do not have
> to deal with SCSI commands submitted during or after the hypervisor
> suspend callback?

I agree with you, but it looks like scsi_bus_freeze() doesn't guarantee
no further SCSI commands will be submitted, as I described above. Maybe
that is why scsifront_suspend() has to invent a mechanism to cope with
the issue.

> nsp32_suspend() is a PCI suspend callback. If any SCSI commands were
> submitted after that callback has started, that would mean that the SCSI
> suspend and PCIe suspend operations are called in the wrong order. I do
> not agree that code for suspending SCSI commands should be added in
> nsp32_suspend().
>
> I'm fine with using these new functions in device reset handlers. Using
> these functions in power management handlers seems wrong to me.

It looks like you're suggesting the scsi_host_block() in aac_suspend()
should be removed? :-)

> Thank you for having clarified this. This helps me to understand the HV
> driver framework better. I think this means that the hv_driver.suspend
> function should be called at a later time than SCSI suspend.

I agree, but it looks like scsi_bus_freeze() doesn't work here as we
expected?

> From Documentation/driver-api/device_link.rst: "By default, the driver
> core only enforces dependencies between devices that are borne out of a
> parent/child relationship within the device hierarchy: When suspending,
> resuming or shutting down the system, devices are ordered based on this
> relationship, i.e. children are always suspended before their parent,
> and the parent is always resumed before its children."
>
> Is there a single storvsc_drv instance for all SCSI devices supported
> by storvsc_drv?

Yes.

> Has it been considered to make storvsc_drv the parent device of all
> SCSI devices created by the storvsc driver?

Yes, I think so:

root@localhost:~# ls -rtl /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver
lrwxrwxrwx 1 root root 0 Apr 22 01:10 /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver -> ../../../../../../../../../../bus/scsi/drivers/sd

Here the driver of
/sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943 is storvsc,
which creates host3/target3:0:0/3:0:0:0. So it looks like there is no
ordering issue.

Thanks,
-- Dexuan
On 4/23/20 12:04 AM, Dexuan Cui wrote:
> It looks like the sd suspend callbacks are only for the I/O from the
> disk, e.g. from the file system that lives in some partition of some
> disk.
>
> The panic I'm seeing is not from sd. I think it's from a kernel thread
> that tries to detect the status of the SCSI CDROM. This is the snipped
> log (the full version is at https://lkml.org/lkml/2020/4/10/47): here
> the suspend callbacks of sd, sr and scsi_bus_type.pm have been called,
> and later the storvsc LLD's suspend callback is also called, but
> sr_block_check_events() can still try to submit SCSI commands to
> storvsc:
>
> [ 11.668741] sr 0:0:0:1: bus quiesce
> [ 11.668804] sd 0:0:0:0: bus quiesce
> [ 11.698082] scsi target0:0:0: bus quiesce
> [ 11.703296] scsi host0: bus quiesce
> [ 11.781730] hv_storvsc bf78936f-7d8f-45ce-ab03-6c341452e55d: noirq bus quiesce
> [ 11.796479] hv_netvsc dda5a2be-b8b8-4237-b330-be8a516a72c0: noirq bus quiesce
> [ 11.804042] BUG: kernel NULL pointer dereference, address: 0000000000000090
> [ 11.804996] Workqueue: events_freezable_power_ disk_events_workfn
> [ 11.804996] RIP: 0010:storvsc_queuecommand+0x261/0x714 [hv_storvsc]
> [ 11.804996] Call Trace:
> [...]
>
> It looks like the issue is: scsi_bus_freeze() -> ... ->
> scsi_dev_type_suspend() -> scsi_device_quiesce() does not guarantee the
> device is totally quiescent:

During hibernation, processes are frozen before devices are quiesced.
freeze_processes() calls try_to_freeze_tasks() and that function in turn
calls freeze_workqueues_begin() and freeze_workqueues_busy().
freeze_workqueues_busy() freezes all freezable workqueues, including
system_freezable_power_efficient_wq, the workqueue from which
check_events functions are called. Some time after freezable workqueues
are frozen, dpm_suspend(PMSG_FREEZE) is called. That last call triggers
the pm_ops.freeze callbacks, including the pm_ops.freeze callbacks
defined in the SCSI core.

The above trace seems to indicate that freezing workqueues has not
happened before devices were frozen. How about doing the following to
retrieve more information about what is going on?
* Enable CONFIG_PM_DEBUG in the kernel configuration.
* Run echo 1 > /sys/power/pm_print_times and
  echo 1 > /sys/power/pm_debug_messages before hibernation starts.

>> Documentation/driver-api/device_link.rst: "By default, the driver core
>> only enforces dependencies between devices that are borne out of a
>> parent/child relationship within the device hierarchy: When suspending,
>> resuming or shutting down the system, devices are ordered based on this
>> relationship, i.e. children are always suspended before their parent,
>> and the parent is always resumed before its children." Is there a single
>> storvsc_drv instance for all SCSI devices supported by storvsc_drv?
>> Has it been considered to make storvsc_drv the parent device of all
>> SCSI devices created by the storvsc driver?
>
> Yes, I think so:
>
> root@localhost:~# ls -rtl /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver
> lrwxrwxrwx 1 root root 0 Apr 22 01:10 /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943/host3/target3:0:0/3:0:0:0/driver -> ../../../../../../../../../../bus/scsi/drivers/sd
>
> Here the driver of
> /sys/bus/vmbus/devices/9be03cb2-d37b-409f-b09b-81059b4f6943 is storvsc,
> which creates host3/target3:0:0/3:0:0:0. So it looks like there is no
> ordering issue.

Right, I had overlooked the code in storvsc_probe() that associates
SCSI devices with storvsc_drv.

Bart.
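[Editor's note: for context, a heavily abridged sketch of that association in storvsc_probe(); error handling and most of the channel/host setup are omitted.]

static int storvsc_probe(struct hv_device *device,
			 const struct hv_vmbus_device_id *dev_id)
{
	struct Scsi_Host *host;
	int ret;

	host = scsi_host_alloc(&scsi_driver, sizeof(struct hv_host_device));
	if (!host)
		return -ENOMEM;

	/* ... vmbus channel setup, host parameter initialization ... */

	/* &device->device becomes the parent of the host/target/sdev
	 * nodes, which gives the suspend ordering discussed above. */
	ret = scsi_add_host(host, &device->device);
	if (ret)
		scsi_host_put(host);
	return ret;
}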
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Thursday, April 23, 2020 9:38 AM
>
> On 4/23/20 12:04 AM, Dexuan Cui wrote:
> > It looks like the sd suspend callbacks are only for the I/O from the
> > disk, e.g. from the file system that lives in some partition of some
> > disk.
> >
> > The panic I'm seeing is not from sd. I think it's from a kernel thread
> > that tries to detect the status of the SCSI CDROM.
> > [...]
> >
> > It looks like the issue is: scsi_bus_freeze() -> ... ->
> > scsi_dev_type_suspend() -> scsi_device_quiesce() does not guarantee the
> > device is totally quiescent:
>
> During hibernation, processes are frozen before devices are quiesced.
> freeze_processes() calls try_to_freeze_tasks() and that function in turn
> calls freeze_workqueues_begin() and freeze_workqueues_busy().
> freeze_workqueues_busy() freezes all freezable workqueues, including
> system_freezable_power_efficient_wq, the workqueue from which
> check_events functions are called. Some time after freezable workqueues
> are frozen, dpm_suspend(PMSG_FREEZE) is called. That last call triggers
> the pm_ops.freeze callbacks, including the pm_ops.freeze callbacks
> defined in the SCSI core.
>
> The above trace seems to indicate that freezing workqueues has not
> happened before devices were frozen. How about doing the following to
> retrieve more information about what is going on?
> * Enable CONFIG_PM_DEBUG in the kernel configuration.
> * Run echo 1 > /sys/power/pm_print_times and
>   echo 1 > /sys/power/pm_debug_messages before hibernation starts.
>
> Bart.

Good point! My panic happens in the "restore" path, not the "save"
path. In the "save" path, as you described, it looks like everything is
done correctly.

BTW, freeze_processes() only freezes the userspace processes. After
hibernate() -> freeze_processes(), hibernate() -> hibernation_snapshot()
-> freeze_kernel_threads() also freezes the "freezable" kernel
processes, and then we call dpm_suspend(PMSG_FREEZE), and next we do a
lot of other things, and finally the system is powered off.

In the "restore" path of the hibernation:

1. We start the system and a fresh new kernel starts to run.

2. The 'new' kernel itself finishes the initialization and the 'init'
script in the initramfs starts to run. The 'init' script notices there
is a saved state of the previous 'old' kernel in some swap
file/partition, so it won't do a usual start-up; instead, the 'init'
script runs a program called 'resume'.

3. The 'resume' program talks to the kernel via /sys/power/resume and
asks the kernel to do a resume-from-disk.

4. The kernel starts to run resume_store() -> software_resume() ->
freeze_processes(), which freezes the userspace, but the "freezable"
kernel threads are not frozen!!!

So it looks like the below patch also works for me:

--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -898,6 +898,11 @@ static int software_resume(void)
 	error = freeze_processes();
 	if (error)
 		goto Close_Finish;
+
+	error = freeze_kernel_threads();
+	if (error)
+		goto Close_Finish;
+
 	error = load_image_and_restore();
 	thaw_processes();
  Finish:

Just to be sure, I'll do more tests, but I believe the panic can be
fixed by this, according to the tests I have done so far.

I'm still not sure what the comment before scsi_device_quiesce() means:

 * ... Since special requests may also be requeued requests,
 * a successful return doesn't guarantee the device will be
 * totally quiescent.

I don't know if there can be some other I/O submitted after
scsi_device_quiesce() returns in the case of hibernation, and I don't
know if aac_suspend() -> scsi_host_block() should be fixed/removed, but
as far as the panic is concerned, I'm very glad I have found a better
fix with your help. Thank you so much, Bart!

Thanks,
-- Dexuan
On 2020-04-23 11:29, Dexuan Cui wrote:
> So it looks like the below patch also works for me:
>
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -898,6 +898,11 @@ static int software_resume(void)
>  	error = freeze_processes();
>  	if (error)
>  		goto Close_Finish;
> +
> +	error = freeze_kernel_threads();
> +	if (error)
> +		goto Close_Finish;
> +
>  	error = load_image_and_restore();
>  	thaw_processes();
>   Finish:
>
> Just to be sure, I'll do more tests, but I believe the panic can be
> fixed by this, according to the tests I have done so far.

If a freeze_kernel_threads() call is added in software_resume(), should
a thaw_kernel_threads() call be added too?

Anyway, please Cc me if a patch for software_resume() is submitted.

> I'm still not sure what the comment before scsi_device_quiesce() means:
>
>  * ... Since special requests may also be requeued requests,
>  * a successful return doesn't guarantee the device will be
>  * totally quiescent.
>
> I don't know if there can be some other I/O submitted after
> scsi_device_quiesce() returns in the case of hibernation, and I don't
> know if aac_suspend() -> scsi_host_block() should be fixed/removed, but
> as far as the panic is concerned, I'm very glad I have found a better
> fix with your help.

The function blk_set_pm_only() increments the q->pm_only counter while
the blk_clear_pm_only() function decrements the q->pm_only counter.
If q->pm_only > 0, blk_queue_enter() only succeeds if the flag
BLK_MQ_REQ_PREEMPT is set in the second argument passed to that
function. blk_get_request() calls blk_queue_enter(). The result is that
while q->pm_only > 0, blk_get_request() only submits a request without
waiting if the BLK_MQ_REQ_PREEMPT flag is set in its second argument.
scsi_execute() sets the BLK_MQ_REQ_PREEMPT flag. In other words,
scsi_device_quiesce() blocks requests submitted by filesystems but still
allows SCSI commands submitted by the SCSI core to be executed.
"special" refers to requests with the BLK_MQ_REQ_PREEMPT flag set.

Bart.
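[Editor's illustration: a condensed sketch of the gate Bart describes, simplified from the blk_queue_enter() logic of that era; it is not a verbatim copy of block/blk-core.c.]

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* While pm_only is raised by blk_set_pm_only(), only "special" requests
 * carrying BLK_MQ_REQ_PREEMPT (e.g. those issued via scsi_execute())
 * may enter the queue; filesystem I/O is held back until
 * blk_clear_pm_only() drops the counter back to zero. */
static bool sketch_may_enter_queue(struct request_queue *q,
				   blk_mq_req_flags_t flags)
{
	const bool pm = flags & BLK_MQ_REQ_PREEMPT;

	return pm || !blk_queue_pm_only(q);
}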
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Thursday, April 23, 2020 4:25 PM
>
> On 2020-04-23 11:29, Dexuan Cui wrote:
> > So it looks like the patch below also works for me:
> >
> > --- a/kernel/power/hibernate.c
> > +++ b/kernel/power/hibernate.c
> > @@ -898,6 +898,11 @@ static int software_resume(void)
> >  	error = freeze_processes();
> >  	if (error)
> >  		goto Close_Finish;
> > +
> > +	error = freeze_kernel_threads();
> > +	if (error)
> > +		goto Close_Finish;
> > +
> >  	error = load_image_and_restore();
> >  	thaw_processes();
> >  Finish:
> >
> > Just to be sure, I'll do more tests, but I believe this fixes the panic,
> > according to the tests I have done so far.
>
> If a freeze_kernel_threads() call is added in software_resume(), should
> a thaw_kernel_threads() call be added too?

Good catch! Note: thaw_processes() thaws every frozen process, including both
userspace processes and kernel threads. In software_resume() (see the sketch
after this message):

1. If freeze_kernel_threads() fails, I should add a "thaw_processes();" before
"goto Close_Finish;" so that all the userspace processes can be thawed.

2. If freeze_kernel_threads() succeeds but load_image_and_restore() fails,
there is already a thaw_processes().

3. If load_image_and_restore() succeeds, it won't return; execution resumes
from the 'old' kernel's hibernate() -> hibernation_snapshot() ->
create_image() -> swsusp_arch_suspend(), and later hibernate() ->
thaw_processes() will thaw every frozen process of the 'old' kernel.

> Anyway, please Cc me if a patch for software_resume() is submitted.

Sure. Will do.

> > I'm still not sure what the comment before scsi_device_quiesce() means:
> >  * ... Since special requests may also be requeued requests,
> >  * a successful return doesn't guarantee the device will be
> >  * totally quiescent.
> >
> > I don't know if there can be some other I/O submitted after
> > scsi_device_quiesce() returns in the case of hibernation, and I don't
> > know if aac_suspend() -> scsi_host_block() should be fixed/removed,
> > but as far as the panic is concerned, I'm very glad I have found a better
> > fix with your help.
>
> The function blk_set_pm_only() increments the q->pm_only counter while
> blk_clear_pm_only() decrements it. If q->pm_only > 0, blk_queue_enter()
> only succeeds without waiting if the BLK_MQ_REQ_PREEMPT flag is set in
> its second argument. blk_get_request() calls blk_queue_enter(), so while
> q->pm_only > 0, blk_get_request() only submits a request without waiting
> if the BLK_MQ_REQ_PREEMPT flag is set. scsi_execute() sets that flag.
> In other words, scsi_device_quiesce() blocks requests submitted by
> filesystems but still allows SCSI commands submitted by the SCSI core
> to be executed. "Special" in that comment refers to requests with the
> BLK_MQ_REQ_PREEMPT flag set.
>
> Bart.

Thanks for the detailed clarification! So it sounds like we're safe here, and I
guess the scsi_host_block() in aac_suspend() should be removed to discourage
people from trying to use scsi_host_block() in a .suspend() callback. :-)

Thanks,
-- Dexuan
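A sketch of the error handling implied by cases 1-3 above. The labels and
function names are taken from kernel/power/hibernate.c, but this fragment
illustrates the discussion rather than reproducing the patch that was actually
submitted (see the link in the next message for that).

	/* Sketch of software_resume() error handling, per cases 1-3 above. */
	error = freeze_processes();
	if (error)
		goto Close_Finish;

	error = freeze_kernel_threads();
	if (error) {
		thaw_processes();	/* case 1: re-thaw userspace */
		goto Close_Finish;
	}

	error = load_image_and_restore();	/* case 3: no return on success */
	thaw_processes();	/* case 2: thaws everything on failure */
 Finish:
	...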
> From: Bart Van Assche <bvanassche@acm.org>
> Sent: Thursday, April 23, 2020 4:25 PM
>
> On 2020-04-23 11:29, Dexuan Cui wrote:
> > So it looks like the patch below also works for me:
> >
> > --- a/kernel/power/hibernate.c
> > +++ b/kernel/power/hibernate.c
> > @@ -898,6 +898,11 @@ static int software_resume(void)
> >  	error = freeze_processes();
> >  	if (error)
> >  		goto Close_Finish;
> > +
> > +	error = freeze_kernel_threads();
> > +	if (error)
> > +		goto Close_Finish;
> > +
> >  	error = load_image_and_restore();
> >  	thaw_processes();
> >  Finish:
> >
> > Just to be sure, I'll do more tests, but I believe this fixes the panic,
> > according to the tests I have done so far.
>
> If a freeze_kernel_threads() call is added in software_resume(), should
> a thaw_kernel_threads() call be added too?
>
> Anyway, please Cc me if a patch for software_resume() is submitted.

FYI, I posted a fix: https://lkml.org/lkml/2020/4/23/1540

Thanks,
-- Dexuan
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index fb41636519ee..fd51d2f03778 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1948,6 +1948,11 @@ static int storvsc_suspend(struct hv_device *hv_dev)
 	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
 	struct Scsi_Host *host = stor_device->host;
 	struct hv_host_device *host_dev = shost_priv(host);
+	int ret;
+
+	ret = scsi_host_block(host);
+	if (ret)
+		return ret;
 
 	storvsc_wait_to_drain(stor_device);
 
@@ -1968,10 +1973,15 @@ static int storvsc_suspend(struct hv_device *hv_dev)
 
 static int storvsc_resume(struct hv_device *hv_dev)
 {
+	struct storvsc_device *stor_device = hv_get_drvdata(hv_dev);
+	struct Scsi_Host *host = stor_device->host;
 	int ret;
 
 	ret = storvsc_connect_to_vsp(hv_dev, storvsc_ringbuffer_size,
 				     hv_dev_is_fc(hv_dev));
+	if (!ret)
+		ret = scsi_host_unblock(host, SDEV_RUNNING);
+
 	return ret;
 }
During hibernation, the sdevs are suspended automatically in
drivers/scsi/scsi_pm.c before storvsc_suspend(), so after storvsc_suspend()
there is no disk I/O from the file systems, but there can still be disk I/O
from kernel space, e.g. disk_check_events() -> sr_block_check_events() ->
cdrom_check_events() can still submit I/O to the storvsc driver, which causes
a panic of NULL pointer dereference, since storvsc has closed the vmbus
channel in storvsc_suspend(); refer to the links below for more info:

https://lkml.org/lkml/2020/4/10/47
https://lkml.org/lkml/2020/4/17/1103

Fix the panic by blocking/unblocking all the I/O queues properly.

Note: this patch depends on another patch, "scsi: core: Allow the state change
from SDEV_QUIESCE to SDEV_BLOCK" (refer to the second link above).

Fixes: 56fb10585934 ("scsi: storvsc: Add the support of hibernation")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
 drivers/scsi/storvsc_drv.c | 10 ++++++++++
 1 file changed, 10 insertions(+)