
nvme/pci: Use host managed power state for suspend

Message ID 20190510212937.11661-1-keith.busch@intel.com (mailing list archive)
State Not Applicable, archived
Series nvme/pci: Use host managed power state for suspend

Commit Message

Keith Busch May 10, 2019, 9:29 p.m. UTC
The nvme pci driver prepares its devices for power loss during suspend
by shutting down the controllers, and the power setting is deferred to
pci driver's power management before the platform removes power. The
suspend-to-idle mode, however, does not remove power.

NVMe devices that implement host managed power settings can achieve
lower power and better transition latencies than using generic PCI
power settings. Try to use this feature if the platform is not involved
with the suspend. If successful, restore the previous power state on
resume.

Cc: Mario Limonciello <Mario.Limonciello@dell.com>
Cc: Kai Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
Disclaimer: I've tested only on emulation faking support for the feature.

General question: different devices potentially have divergent values
for power consumption and transition latencies. Would it be useful to
allow a user tunable setting to select the desired target power state
instead of assuming the lowest one?

 drivers/nvme/host/core.c | 27 ++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  2 ++
 drivers/nvme/host/pci.c  | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 82 insertions(+)

Comments

Limonciello, Mario May 11, 2019, 12:52 a.m. UTC | #1
> 
> Cc: Mario Limonciello <Mario.Limonciello@dell.com>
> Cc: Kai Heng Feng <kai.heng.feng@canonical.com>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
> Disclaimer: I've tested only on emulation faking support for the feature.

Thanks for sharing.  I'll arrange some testing of this with storage partners early 
next week.

> 
> General question: different devices potentially have divergent values for power
> consumption and transition latencies. Would it be useful to allow a user tunable
> setting to select the desired target power state instead of assuming the lowest
> one?

Since this action only happens on the way down to suspend-to-idle, I don't think
there is a lot of value in making the target state user tunable.

If you don't go into the deepest state for NVME, at least on Intel the PCH generally
won't be able to go into its deepest state either.

> 
>  drivers/nvme/host/core.c | 27 ++++++++++++++++++++++++
>  drivers/nvme/host/nvme.h |  2 ++
>  drivers/nvme/host/pci.c  | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 82 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index a6644a2c3ef7..eb3640fd8838 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1132,6 +1132,33 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
>  	return ret;
>  }
> 
> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps)
> +{
> +	return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL);
> +}
> +EXPORT_SYMBOL_GPL(nvme_set_power);
> +
> +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result)
> +{
> +	struct nvme_command c;
> +	union nvme_result res;
> +	int ret;
> +
> +	if (!result)
> +		return -EINVAL;
> +
> +	memset(&c, 0, sizeof(c));
> +	c.features.opcode = nvme_admin_get_features;
> +	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);
> +
> +	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
> +			NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
> +	if (ret >= 0)
> +		*result = le32_to_cpu(res.u32);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(nvme_get_power);
> +
>  int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>  {
>  	u32 q_count = (*count - 1) | ((*count - 1) << 16);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 5ee75b5ff83f..eaa571ac06d2 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -459,6 +459,8 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
>  		unsigned timeout, int qid, int at_head,
>  		blk_mq_req_flags_t flags, bool poll);
>  int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps);
> +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result);
>  void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
>  int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
>  int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 3e4fb891a95a..0d5d91e5b293 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -18,6 +18,7 @@
>  #include <linux/mutex.h>
>  #include <linux/once.h>
>  #include <linux/pci.h>
> +#include <linux/suspend.h>
>  #include <linux/t10-pi.h>
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
> @@ -116,6 +117,7 @@ struct nvme_dev {
>  	u32 cmbsz;
>  	u32 cmbloc;
>  	struct nvme_ctrl ctrl;
> +	u32 last_ps;
> 
>  	mempool_t *iod_mempool;
> 
> @@ -2828,11 +2830,59 @@ static void nvme_remove(struct pci_dev *pdev)
>  }
> 
>  #ifdef CONFIG_PM_SLEEP
> +static int nvme_deep_state(struct nvme_dev *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> +	int ret;
> +
> +	/*
> +	 * Save the current power state in case a user tool set a power policy
> +	 * for this device. We'll restore that state on resume.
> +	 */
> +	dev->last_ps = 0;
> +	ret = nvme_get_power(&dev->ctrl, &dev->last_ps);
> +
> +	/*
> +	 * Return the error to halt suspend if the driver either couldn't
> +	 * submit a command or didn't see a response.
> +	 */
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = nvme_set_power(&dev->ctrl, dev->ctrl.npss);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (!ret) {
> +		/*
> +		 * A saved state prevents pci pm from generically controlling
> +		 * the device's power. We're using protocol specific settings
> +		 * so we don't want pci interfering.
> +		 */
> +		pci_save_state(pdev);
> +	} else {
> +		/*
> +		 * The drive failed the low power request. Fallback to device
> +		 * shutdown and clear npss to force a controller reset on
> +		 * resume. The value will be rediscovered during reset.
> +		 */
> +		dev->ctrl.npss = 0;
> +		nvme_dev_disable(dev, true);
> +	}
> +	return 0;
> +}
> +
>  static int nvme_suspend(struct device *dev)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  	struct nvme_dev *ndev = pci_get_drvdata(pdev);
> 
> +	/*
> +	 * Try to use nvme if the device supports host managed power settings
> +	 * and platform firmware is not involved.
> +	 */
> +	if (!pm_suspend_via_firmware() && ndev->ctrl.npss)
> +		return nvme_deep_state(ndev);
>  	nvme_dev_disable(ndev, true);
>  	return 0;
>  }
> @@ -2842,6 +2892,9 @@ static int nvme_resume(struct device *dev)
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  	struct nvme_dev *ndev = pci_get_drvdata(pdev);
> 
> +	if (!pm_resume_via_firmware() && ndev->ctrl.npss)
> +		if (nvme_set_power(&ndev->ctrl, ndev->last_ps) == 0)
> +			return 0;
>  	nvme_reset_ctrl(&ndev->ctrl);
>  	return 0;
>  }
> --
> 2.14.4
Edmund Nadolski May 11, 2019, 1:33 a.m. UTC | #2
On 5/10/19 2:29 PM, Keith Busch wrote:
> The nvme pci driver prepares its devices for power loss during suspend
> by shutting down the controllers, and the power setting is deferred to
> pci driver's power management before the platform removes power. The
> suspend-to-idle mode, however, does not remove power.
> 
> NVMe devices that implement host managed power settings can achieve
> lower power and better transition latencies than using generic PCI
> power settings. Try to use this feature if the platform is not involved
> with the suspend. If successful, restore the previous power state on
> resume.
> 
> Cc: Mario Limonciello <Mario.Limonciello@dell.com>
> Cc: Kai Heng Feng <kai.heng.feng@canonical.com>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
> Disclaimer: I've tested only on emulation faking support for the feature.
> 
> General question: different devices potentially have divergent values
> for power consumption and transition latencies. Would it be useful to
> allow a user tunable setting to select the desired target power state
> instead of assuming the lowest one?

In general I prefer fewer knobs; but for a new setting it seems
advisable not to presume that one value is appropriate for everyone (as
long as they can't set an invalid value or otherwise hose things).
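
If such a knob were added, one minimal form would be a module parameter
along these lines (sketch only; the parameter name is hypothetical):

	/*
	 * Hypothetical knob: power state to request on suspend-to-idle.
	 * -1 (default) means "use the deepest state", i.e. ctrl->npss.
	 */
	static int suspend_ps = -1;
	module_param(suspend_ps, int, 0644);
	MODULE_PARM_DESC(suspend_ps,
			"power state to enter on suspend-to-idle (-1 = deepest)");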


>   drivers/nvme/host/core.c | 27 ++++++++++++++++++++++++
>   drivers/nvme/host/nvme.h |  2 ++
>   drivers/nvme/host/pci.c  | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 82 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index a6644a2c3ef7..eb3640fd8838 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1132,6 +1132,33 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
>   	return ret;
>   }
>   
> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps)
> +{
> +	return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL);
> +}
> +EXPORT_SYMBOL_GPL(nvme_set_power);
> +
> +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result)
> +{
> +	struct nvme_command c;
> +	union nvme_result res;
> +	int ret;
> +
> +	if (!result)
> +		return -EINVAL;

As this is only called from one place, is this check really needed?


> +
> +	memset(&c, 0, sizeof(c));
> +	c.features.opcode = nvme_admin_get_features;
> +	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);
> +
> +	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
> +			NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
> +	if (ret >= 0)
> +		*result = le32_to_cpu(res.u32);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(nvme_get_power);
> +
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>   {
>   	u32 q_count = (*count - 1) | ((*count - 1) << 16);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 5ee75b5ff83f..eaa571ac06d2 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -459,6 +459,8 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
>   		unsigned timeout, int qid, int at_head,
>   		blk_mq_req_flags_t flags, bool poll);
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps);
> +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result);
>   void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
>   int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
>   int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 3e4fb891a95a..0d5d91e5b293 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -18,6 +18,7 @@
>   #include <linux/mutex.h>
>   #include <linux/once.h>
>   #include <linux/pci.h>
> +#include <linux/suspend.h>
>   #include <linux/t10-pi.h>
>   #include <linux/types.h>
>   #include <linux/io-64-nonatomic-lo-hi.h>
> @@ -116,6 +117,7 @@ struct nvme_dev {
>   	u32 cmbsz;
>   	u32 cmbloc;
>   	struct nvme_ctrl ctrl;
> +	u32 last_ps;
>   
>   	mempool_t *iod_mempool;
>   
> @@ -2828,11 +2830,59 @@ static void nvme_remove(struct pci_dev *pdev)
>   }
>   
>   #ifdef CONFIG_PM_SLEEP
> +static int nvme_deep_state(struct nvme_dev *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> +	int ret;
> +
> +	/*
> +	 * Save the current power state in case a user tool set a power policy
> +	 * for this device. We'll restore that state on resume.
> +	 */
> +	dev->last_ps = 0;
> +	ret = nvme_get_power(&dev->ctrl, &dev->last_ps);

Should we validate (range check) the value reported by the device?
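
For example (sketch only), a simple sanity check after the get could be:

	if (dev->last_ps > dev->ctrl.npss)
		return -EINVAL;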


> +
> +	/*
> +	 * Return the error to halt suspend if the driver either couldn't
> +	 * submit a command or didn't see a response.
> +	 */
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = nvme_set_power(&dev->ctrl, dev->ctrl.npss);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (!ret) {
> +		/*
> +		 * A saved state prevents pci pm from generically controlling
> +		 * the device's power. We're using protocol specific settings
> +		 * so we don't want pci interfering.
> +		 */
> +		pci_save_state(pdev);
> +	} else {
> +		/*
> +		 * The drive failed the low power request. Fallback to device
> +		 * shutdown and clear npss to force a controller reset on
> +		 * resume. The value will be rediscovered during reset.
> +		 */
> +		dev->ctrl.npss = 0;
> +		nvme_dev_disable(dev, true);
> +	}
> +	return 0;
> +}
> +
>   static int nvme_suspend(struct device *dev)
>   {
>   	struct pci_dev *pdev = to_pci_dev(dev);
>   	struct nvme_dev *ndev = pci_get_drvdata(pdev);
>   
> +	/*
> +	 * Try to use nvme if the device supports host managed power settings
> +	 * and platform firmware is not involved.
> +	 */
> +	if (!pm_suspend_via_firmware() && ndev->ctrl.npss)
> +		return nvme_deep_state(ndev);
>   	nvme_dev_disable(ndev, true);
>   	return 0;
>   }
> @@ -2842,6 +2892,9 @@ static int nvme_resume(struct device *dev)
>   	struct pci_dev *pdev = to_pci_dev(dev);
>   	struct nvme_dev *ndev = pci_get_drvdata(pdev);
>   
> +	if (!pm_resume_via_firmware() && ndev->ctrl.npss)
> +		if (nvme_set_power(&ndev->ctrl, ndev->last_ps) == 0)
> +			return 0;
>   	nvme_reset_ctrl(&ndev->ctrl);
>   	return 0;
>   }
>
Christoph Hellwig May 11, 2019, 7:22 a.m. UTC | #3
A couple nitpicks, mostly leftover from the previous iteration
(I didn't see replies to those comments from you, despite seeing
a reply to my mail, assuming it didn't get lost):

> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps)
> +{
> +	return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL);
> +}
> +EXPORT_SYMBOL_GPL(nvme_set_power);
> +
> +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result)
> +{
> +	struct nvme_command c;
> +	union nvme_result res;
> +	int ret;
> +
> +	if (!result)
> +		return -EINVAL;
> +
> +	memset(&c, 0, sizeof(c));
> +	c.features.opcode = nvme_admin_get_features;
> +	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);
> +
> +	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
> +			NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
> +	if (ret >= 0)
> +		*result = le32_to_cpu(res.u32);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(nvme_get_power);

At this point I'd rather see those in the PCIe driver.  While the
power state feature is generic in the spec I don't see it actually
being used anywhere else any time soon.

But maybe we can add a nvme_get_features helper ala nvme_set_features
in the core to avoid a little boilerplate code for the future?
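
A minimal sketch of what such a helper might look like, modelled on the
nvme_get_power() body above (the exact name, signature and export are just
an assumption here):

	int nvme_get_features(struct nvme_ctrl *ctrl, unsigned int fid,
			unsigned int dword11, u32 *result)
	{
		struct nvme_command c = { };
		union nvme_result res;
		int ret;

		c.features.opcode = nvme_admin_get_features;
		c.features.fid = cpu_to_le32(fid);
		c.features.dword11 = cpu_to_le32(dword11);

		ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
				NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
		if (ret >= 0 && result)
			*result = le32_to_cpu(res.u32);
		return ret;
	}
	EXPORT_SYMBOL_GPL(nvme_get_features);

nvme_get_power() could then reduce to a one-line wrapper around it.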

> +	ret = nvme_set_power(&dev->ctrl, dev->ctrl.npss);
> +	if (ret < 0)
> +		return ret;

I can't find any wording in the spec that guarantees the highest
numerical power state is the deepest.  But maybe I'm just missing
something as such an ordering would be really helpful?

>  static int nvme_suspend(struct device *dev)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  	struct nvme_dev *ndev = pci_get_drvdata(pdev);
>  
> +	/*
> +	 * Try to use nvme if the device supports host managed power settings
> +	 * and platform firmware is not involved.
> +	 */

This just comments the what, but I think we need a why here, as the
what is fairly obvious.
Chaitanya Kulkarni May 12, 2019, 6:06 a.m. UTC | #4
Hi Keith,

Thanks for the patch, few comments below mostly nitpicks.

On 5/10/19 2:35 PM, Keith Busch wrote:
> The nvme pci driver prepares its devices for power loss during suspend
> by shutting down the controllers, and the power setting is deferred to
> pci driver's power management before the platform removes power. The
> suspend-to-idle mode, however, does not remove power.
> 
> NVMe devices that implement host managed power settings can achieve
> lower power and better transition latencies than using generic PCI
> power settings. Try to use this feature if the platform is not involved
> with the suspend. If successful, restore the previous power state on
> resume.
> 
> Cc: Mario Limonciello <Mario.Limonciello@dell.com>
> Cc: Kai Heng Feng <kai.heng.feng@canonical.com>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
> Disclaimer: I've tested only on emulation faking support for the feature.
> 
> General question: different devices potentially have divergent values
> for power consumption and transition latencies. Would it be useful to
> allow a user tunable setting to select the desired target power state
> instead of assuming the lowest one?
> 
>   drivers/nvme/host/core.c | 27 ++++++++++++++++++++++++
>   drivers/nvme/host/nvme.h |  2 ++
>   drivers/nvme/host/pci.c  | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 82 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index a6644a2c3ef7..eb3640fd8838 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1132,6 +1132,33 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
>   	return ret;
>   }
>   
> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps)
dev->ctrl.npss is u8; can we use the same data type here?
If this is due to last_ps being used as a result and then passed to set_power,
maybe we can change the type of last_ps?
OR
can we please use unsigned int to avoid possible warnings?
> +{
> +	return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL);
> +}
> +EXPORT_SYMBOL_GPL(nvme_set_power);
> +
> +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result)
> +{
> +	struct nvme_command c;
Maybe use struct nvme_command c = { } so we can get rid of the memset() call.
> +	union nvme_result res;
> +	int ret;
> +
> +	if (!result)
> +		return -EINVAL;
> +
> +	memset(&c, 0, sizeof(c));
> +	c.features.opcode = nvme_admin_get_features;
> +	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);
> +
> +	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
> +			NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
> +	if (ret >= 0)
> +		*result = le32_to_cpu(res.u32);
Maybe add a check for result here in the above if before dereferencing the pointer:
	if (ret >= 0 && result)

> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(nvme_get_power);
> +
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>   {
>   	u32 q_count = (*count - 1) | ((*count - 1) << 16);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 5ee75b5ff83f..eaa571ac06d2 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -459,6 +459,8 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
>   		unsigned timeout, int qid, int at_head,
>   		blk_mq_req_flags_t flags, bool poll);
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps);
> +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result);
>   void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
>   int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
>   int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 3e4fb891a95a..0d5d91e5b293 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -18,6 +18,7 @@
>   #include <linux/mutex.h>
>   #include <linux/once.h>
>   #include <linux/pci.h>
> +#include <linux/suspend.h>
>   #include <linux/t10-pi.h>
>   #include <linux/types.h>
>   #include <linux/io-64-nonatomic-lo-hi.h>
> @@ -116,6 +117,7 @@ struct nvme_dev {
>   	u32 cmbsz;
>   	u32 cmbloc;
>   	struct nvme_ctrl ctrl;
> +	u32 last_ps;
>   
>   	mempool_t *iod_mempool;
>   
> @@ -2828,11 +2830,59 @@ static void nvme_remove(struct pci_dev *pdev)
>   }
>   
>   #ifdef CONFIG_PM_SLEEP
> +static int nvme_deep_state(struct nvme_dev *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> +	int ret;
> +
> +	/*
> +	 * Save the current power state in case a user tool set a power policy
> +	 * for this device. We'll restore that state on resume.
> +	 */
> +	dev->last_ps = 0;
Should we only overwrite the dev->last_ps value after nvme_get_power() is
successful, and use a temp variable to fetch the current power value in the
following call?
> +	ret = nvme_get_power(&dev->ctrl, &dev->last_ps);
> +
> +	/*
> +	 * Return the error to halt suspend if the driver either couldn't
> +	 * submit a command or didn't see a response.
> +	 */
> +	if (ret < 0)
> +		return ret;
> +
> +	ret = nvme_set_power(&dev->ctrl, dev->ctrl.npss);
> +	if (ret < 0)
> +		return ret;

> +
> +	if (!ret) {
> +		/*
> +		 * A saved state prevents pci pm from generically controlling
> +		 * the device's power. We're using protocol specific settings
> +		 * so we don't want pci interfering.
> +		 */
> +		pci_save_state(pdev);
> +	} else {
> +		/*
> +		 * The drive failed the low power request. Fallback to device
> +		 * shutdown and clear npss to force a controller reset on
> +		 * resume. The value will be rediscovered during reset.
> +		 */
> +		dev->ctrl.npss = 0;
> +		nvme_dev_disable(dev, true);
> +	}
> +	return 0;
> +}
> +
>   static int nvme_suspend(struct device *dev)
>   {
>   	struct pci_dev *pdev = to_pci_dev(dev);
>   	struct nvme_dev *ndev = pci_get_drvdata(pdev);
>   
> +	/*
> +	 * Try to use nvme if the device supports host managed power settings
> +	 * and platform firmware is not involved.
> +	 */
> +	if (!pm_suspend_via_firmware() && ndev->ctrl.npss)
> +		return nvme_deep_state(ndev);
>   	nvme_dev_disable(ndev, true);
>   	return 0;
>   }
> @@ -2842,6 +2892,9 @@ static int nvme_resume(struct device *dev)
>   	struct pci_dev *pdev = to_pci_dev(dev);
>   	struct nvme_dev *ndev = pci_get_drvdata(pdev);
>   
> +	if (!pm_resume_via_firmware() && ndev->ctrl.npss)
> +		if (nvme_set_power(&ndev->ctrl, ndev->last_ps) == 0)
> +			return 0;
>   	nvme_reset_ctrl(&ndev->ctrl);
>   	return 0;
>   }
>
Minwoo Im May 12, 2019, 2:30 p.m. UTC | #5
> > +	union nvme_result res;
> > +	int ret;
> > +
> > +	if (!result)
> > +		return -EINVAL;
> > +
> > +	memset(&c, 0, sizeof(c));
> > +	c.features.opcode = nvme_admin_get_features;
> > +	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);
> > +
> > +	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
> > +			NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
> > +	if (ret >= 0)
> > +		*result = le32_to_cpu(res.u32);
> Maybe add a check for result here in the above if before dereferencing the pointer:
> 	if (ret >= 0 && result)
> 

'result' has already been checked a few lines above.
Sagi Grimberg May 12, 2019, 2:34 p.m. UTC | #6
>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps)
>> +{
>> +	return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL);
>> +}
>> +EXPORT_SYMBOL_GPL(nvme_set_power);
>> +
>> +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result)
>> +{
>> +	struct nvme_command c;
>> +	union nvme_result res;
>> +	int ret;
>> +
>> +	if (!result)
>> +		return -EINVAL;
>> +
>> +	memset(&c, 0, sizeof(c));
>> +	c.features.opcode = nvme_admin_get_features;
>> +	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);
>> +
>> +	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
>> +			NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
>> +	if (ret >= 0)
>> +		*result = le32_to_cpu(res.u32);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(nvme_get_power);
> 
> At this point I'd rather see those in the PCIe driver.  While the
> power state feature is generic in the spec I don't see it actually
> being used anywhere else any time soon.
> 
> But maybe we can add a nvme_get_features helper ala nvme_set_features
> in the core to avoid a little boilerplate code for the future?

+1 on this comment.
Chaitanya Kulkarni May 12, 2019, 3:20 p.m. UTC | #7
Disregard my comment.

On 5/12/19 7:31 AM, Minwoo Im wrote:
>>> +	union nvme_result res;
>>> +	int ret;
>>> +
>>> +	if (!result)
>>> +		return -EINVAL;
>>> +
>>> +	memset(&c, 0, sizeof(c));
>>> +	c.features.opcode = nvme_admin_get_features;
>>> +	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);
>>> +
>>> +	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
>>> +			NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
>>> +	if (ret >= 0)
>>> +		*result = le32_to_cpu(res.u32);
>> Maybe add a check for result here in the above if before dereferencing the pointer:
>> 	if (ret >= 0 && result)
>>
> 'result' has already been checked a few lines above.
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
>
Keith Busch May 13, 2019, 1:36 p.m. UTC | #8
On Sat, May 11, 2019 at 12:22:58AM -0700, Christoph Hellwig wrote:
> A couple nitpicks, mostly leftover from the previous iteration
> (I didn't see replies to those comments from you, despite seeing
> a reply to my mail, assuming it didn't get lost):

I thought you just meant the freeze/unfreeze sequence. I removed that
part entirely, but yes, I can move all of this out of the core. I will
just need to export 'nvme_set_features'.
 
> > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps)
> > +{
> > +	return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL);
> > +}
> > +EXPORT_SYMBOL_GPL(nvme_set_power);
> > +
> > +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result)
> > +{
> > +	struct nvme_command c;
> > +	union nvme_result res;
> > +	int ret;
> > +
> > +	if (!result)
> > +		return -EINVAL;
> > +
> > +	memset(&c, 0, sizeof(c));
> > +	c.features.opcode = nvme_admin_get_features;
> > +	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);
> > +
> > +	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
> > +			NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
> > +	if (ret >= 0)
> > +		*result = le32_to_cpu(res.u32);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(nvme_get_power);
> 
> At this point I'd rather see those in the PCIe driver.  While the
> power state feature is generic in the spec I don't see it actually
> being used anywhere else any time soon.
> 
> But maybe we can add a nvme_get_features helper ala nvme_set_features
> in the core to avoid a little boilerplate code for the future?

Sounds good.
 
> > +	ret = nvme_set_power(&dev->ctrl, dev->ctrl.npss);
> > +	if (ret < 0)
> > +		return ret;
> 
> I can't find any wording in the spec that guarantees the highest
> numerical power state is the deepest.  But maybe I'm just missing
> something as such an ordering would be really helpful?

I actually only noticed APST made this assumption, and I had to search
the spec to see where it calls this out. It is in section 8.4:

  Power states are contiguously numbered starting with zero such that
  each subsequent power state consumes less than or equal to the maximum
  power consumed in the previous state.

> >  static int nvme_suspend(struct device *dev)
> >  {
> >  	struct pci_dev *pdev = to_pci_dev(dev);
> >  	struct nvme_dev *ndev = pci_get_drvdata(pdev);
> >  
> > +	/*
> > +	 * Try to use nvme if the device supports host managed power settings
> > +	 * and platform firmware is not involved.
> > +	 */
> 
> This just comments the what, but I think we need a why here, as the
> what is fairly obvious.

Sounds good.
Keith Busch May 13, 2019, 1:47 p.m. UTC | #9
On Sat, May 11, 2019 at 11:06:35PM -0700, Chaitanya Kulkarni wrote:
> On 5/10/19 2:35 PM, Keith Busch wrote:
> >   
> > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps)

> dev->ctrl.npss is u8; can we use the same data type here?
> If this is due to last_ps being used as a result and then passed to set_power,
> maybe we can change the type of last_ps?
> OR
> can we please use unsigned int to avoid possible warnings?

Right, the feature uses a 32-bit value even though only the first byte
is defined at the moment. It's just for forward compatibility. Will make
this a u32.

> > +int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result)
> > +{
> > +	struct nvme_command c;
> Maybe use struct nvme_command c = { } so we can get rid of the memset() call.

Good point.
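
For reference, the designated-initializer form that avoids the memset()
would be:

	struct nvme_command c = { };	/* zero-initializes every field */

	c.features.opcode = nvme_admin_get_features;
	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);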

> > +	union nvme_result res;
> > +	int ret;
> > +
> > +	if (!result)
> > +		return -EINVAL;
> > +
> > +	memset(&c, 0, sizeof(c));
> > +	c.features.opcode = nvme_admin_get_features;
> > +	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);
> > +
> > +	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
> > +			NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
> > +	if (ret >= 0)
> > +		*result = le32_to_cpu(res.u32);
>
> Maybe add a check for result here in the above if before dereferencing the pointer:
> 	if (ret >= 0 && result)

I have it checked at the top of the function since it doesn't make much
sense to call this without a way to return the result.
Limonciello, Mario May 13, 2019, 2:24 p.m. UTC | #10
> > Cc: Mario Limonciello <Mario.Limonciello@dell.com>
> > Cc: Kai Heng Feng <kai.heng.feng@canonical.com>
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> > Disclaimer: I've tested only on emulation faking support for the feature.
> 
> Thanks for sharing.  I'll arrange some testing with this with storage partners early
> next week.
> 

I've received a result from one of my partners that this patch doesn't work
properly and the platform doesn't go into a lower power state.  (The patch
submitted by KH/me did work on said system to get it into a lower power state
for them.)

This was not a disk with HMB, but with regard to the HMB I believe it needs to be
removed during s0ix so that there isn't any mistake that SSD thinks it can access HMB
memory in s0ix.
Christoph Hellwig May 13, 2019, 2:37 p.m. UTC | #11
On Mon, May 13, 2019 at 02:24:41PM +0000, Mario.Limonciello@dell.com wrote:
> I've received a result from one of my partners that this patch doesn't
> work properly and the platform doesn't go into a lower power state.

Well, it sounds like your partners device does not work properly in this
case.  There is nothing in the NVMe spec that says queues should be
torn down for deep power states, and that whole idea seems rather
counter productive to low-latency suspend/resume cycles.

> This was not a disk with HMB, but with regard to the HMB I believe it
> needs to be removed during s0ix so that there isn't any mistake that SSD
> thinks it can access HMB memory in s0ix.

There is no mistake - the device is allowed to use the HMB from the
point that we give it the memory range until the point where we either
disable it, or shut the controller down.  If something else requires the
device not to use the HMB after ->suspend is called we need to disable
the HMB, and we better have a good reason for that and document it in
the code.  Note that shutting down queues or having CPU memory barriers
is not going to help with any of that.
Keith Busch May 13, 2019, 2:37 p.m. UTC | #12
On Mon, May 13, 2019 at 02:24:41PM +0000, Mario.Limonciello@dell.com wrote:
> This was not a disk with HMB, but with regard to the HMB I believe it needs to be
> removed during s0ix so that there isn't any mistake that SSD thinks it can access HMB
> memory in s0ix.

Is that really the case, though? Where may I find that DMA is not
allowed in this state? I just want an authoritative reference to attach
to the behavior.
Limonciello, Mario May 13, 2019, 2:43 p.m. UTC | #13
> -----Original Message-----
> From: Christoph Hellwig <hch@lst.de>
> Sent: Monday, May 13, 2019 9:38 AM
> To: Limonciello, Mario
> Cc: keith.busch@intel.com; hch@lst.de; sagi@grimberg.me; linux-
> nvme@lists.infradead.org; rafael@kernel.org; linux-kernel@vger.kernel.org; linux-
> pm@vger.kernel.org; kai.heng.feng@canonical.com
> Subject: Re: [PATCH] nvme/pci: Use host managed power state for suspend
> 
> 
> [EXTERNAL EMAIL]
> 
> On Mon, May 13, 2019 at 02:24:41PM +0000, Mario.Limonciello@dell.com wrote:
> > I've received a result from one of my partners that this patch doesn't
> > work properly and the platform doesn't go into a lower power state.
> 
> Well, it sounds like your partners device does not work properly in this
> case.  There is nothing in the NVMe spec that says queues should be
> torn down for deep power states, and that whole idea seems rather
> counter productive to low-latency suspend/resume cycles.

Well I've got a thought, quoting the NVME spec:
"After a successful completion of a Set Features command for this feature, the controller shall be in the
Power State specified. If enabled, autonomous power state transitions continue to occur from the new state."

If APST is enabled on this disk, what is to stop an autonomous  reverse
transition from queue activity on the way down?

> 
> > This was not a disk with HMB, but with regard to the HMB I believe it
> > needs to be removed during s0ix so that there isn't any mistake that SSD
> > thinks it can access HMB memory in s0ix.
> 
> There is no mistake - the device is allowed to use the HMB from the
> point that we give it the memory range until the point where we either
> disable it, or shut the controller down.  If something else requires the
> device not to use the HMB after ->suspend is called we need to disable
> the HMB, and we better have a good reason for that and document it in
> the code.  Note that shutting down queues or having CPU memory barriers
> is not going to help with any of that.

So maybe the CPU memory barriers were just working around it in terms of
timing.
Christoph Hellwig May 13, 2019, 2:54 p.m. UTC | #14
On Mon, May 13, 2019 at 02:43:43PM +0000, Mario.Limonciello@dell.com wrote:
> Well I've got a thought, quoting the NVME spec:
> "After a successful completion of a Set Features command for this feature, the controller shall be in the
> Power State specified. If enabled, autonomous power state transitions continue to occur from the new state."
> 
> If APST is enabled on this disk, what is to stop an autonomous  reverse
> transition from queue activity on the way down?

Nothing.  But once the system is suspending we should not see I/O.

If we see I/O the queue freezing in the original patch Kai Heng and the
previous one from Keith is probably required, although I suspect it
just papers over problems higher up in the queue.  If we don't see I/O
the device is just behaving oddly.
Limonciello, Mario May 13, 2019, 2:54 p.m. UTC | #15
> -----Original Message-----
> From: Keith Busch <kbusch@kernel.org>
> Sent: Monday, May 13, 2019 9:38 AM
> To: Limonciello, Mario
> Cc: keith.busch@intel.com; hch@lst.de; sagi@grimberg.me; linux-
> nvme@lists.infradead.org; rafael@kernel.org; linux-kernel@vger.kernel.org; linux-
> pm@vger.kernel.org; kai.heng.feng@canonical.com
> Subject: Re: [PATCH] nvme/pci: Use host managed power state for suspend
> 
> 
> [EXTERNAL EMAIL]
> 
> On Mon, May 13, 2019 at 02:24:41PM +0000, Mario.Limonciello@dell.com wrote:
> > This was not a disk with HMB, but with regard to the HMB I believe it needs to be
> > removed during s0ix so that there isn't any mistake that SSD thinks it can access
> HMB
> > memory in s0ix.
> 
> Is that really the case, though? Where may I find that DMA is not
> allowed in this state? I just want an authoritative reference to attach
> to the behavior.

The Intel DMA controller suspend callbacks in drivers/dma/idma64.c look to me to
turn off the controller.

And NVME spec made it sound to me that while in a low power state it shouldn't
be available if the memory isn't available.

NVME spec in 8.9:

"Host software should request that the controller release the
assigned ranges prior to a shutdown event, a Runtime D3 event, or any other event
that requires host software to reclaim the assigned ranges."
Keith Busch May 13, 2019, 2:55 p.m. UTC | #16
On Mon, May 13, 2019 at 02:43:43PM +0000, Mario.Limonciello@dell.com wrote:
> > Well, it sounds like your partners device does not work properly in this
> > case.  There is nothing in the NVMe spec that says queues should be
> > torn down for deep power states, and that whole idea seems rather
> > counter productive to low-latency suspend/resume cycles.
> 
> Well I've got a thought, quoting the NVME spec:
> "After a successful completion of a Set Features command for this feature, the controller shall be in the
> Power State specified. If enabled, autonomous power state transitions continue to occur from the new state."
> 
> If APST is enabled on this disk, what is to stop an autonomous  reverse
> transition from queue activity on the way down?

Regardless of whether APST is enabled or not, the controller may
autonomously transition out of a host requested low power state in
response to host activity. Exiting a low power state should mean some
other task is actively using the device, and if that's the case, why are
you trying to enter a low power state in the first place? Alternatively,
if host activity really is idle, then why is the device autonomously
leaving the requested state?
Christoph Hellwig May 13, 2019, 2:57 p.m. UTC | #17
On Mon, May 13, 2019 at 02:54:49PM +0000, Mario.Limonciello@dell.com wrote:
> The Intel DMA controller suspend callbacks in drivers/dma/idma64.c look to me to
> turn off the controller.

How is that relevant?  That thing is neither a NVMe controller, nor
even an PCIe device..

> And NVME spec made it sound to me that while in a low power state it shouldn't
> be available if the memory isn't available.
> 
> NVME spec in 8.9:
> 
> "Host software should request that the controller release the
> assigned ranges prior to a shutdown event, a Runtime D3 event, or any other event
> that requires host software to reclaim the assigned ranges."

The last part of the quoted text is the key - if the assigned range
is reclaimed, that is the memory is going to be used for something else,
we need to release the ranges.  But we do not release the ranges,
as we want to keep the memory in use so that we can quickly use it
again.
Keith Busch May 13, 2019, 3:04 p.m. UTC | #18
On Mon, May 13, 2019 at 03:05:42PM +0000, Mario.Limonciello@dell.com wrote:
> This system power state - suspend to idle is going to freeze threads.
> But we're talking a multi threaded kernel.  Can't there be a timing problem going
> on then too?  With a disk flush being active in one task and the other task trying
> to put the disk into the deepest power state.  If you don't freeze the queues how
> can you guarantee that didn't happen?

But if an active data flush task is running, then we're not idle and
shouldn't go to low power.
Limonciello, Mario May 13, 2019, 3:05 p.m. UTC | #19
> -----Original Message-----
> From: Keith Busch <kbusch@kernel.org>
> Sent: Monday, May 13, 2019 9:55 AM
> To: Limonciello, Mario
> Cc: hch@lst.de; keith.busch@intel.com; sagi@grimberg.me; linux-
> nvme@lists.infradead.org; rafael@kernel.org; linux-kernel@vger.kernel.org; linux-
> pm@vger.kernel.org; kai.heng.feng@canonical.com
> Subject: Re: [PATCH] nvme/pci: Use host managed power state for suspend
> 
> 
> [EXTERNAL EMAIL]
> 
> On Mon, May 13, 2019 at 02:43:43PM +0000, Mario.Limonciello@dell.com wrote:
> > > Well, it sounds like your partners device does not work properly in this
> > > case.  There is nothing in the NVMe spec that says queues should be
> > > torn down for deep power states, and that whole idea seems rather
> > > counter productive to low-latency suspend/resume cycles.
> >
> > Well I've got a thought, quoting the NVME spec:
> > "After a successful completion of a Set Features command for this feature, the
> controller shall be in the
> > Power State specified. If enabled, autonomous power state transitions continue
> to occur from the new state."
> >
> > If APST is enabled on this disk, what is to stop an autonomous  reverse
> > transition from queue activity on the way down?
> 
> Regardless of whether APST is enabled or not, the controller may
> autonomously transition out of a host requested low power state in
> response to host activity. Exiting a low power state should mean some
> other task is actively using the device, and if that's the case, why are
> you trying to enter a low power state in the first place? Alternatively,
> if host activity really is idle, then why is the device autonomously
> leaving the requested state?

This system power state - suspend to idle is going to freeze threads.
But we're talking a multi threaded kernel.  Can't there be a timing problem going
on then too?  With a disk flush being active in one task and the other task trying
to put the disk into the deepest power state.  If you don't freeze the queues how
can you guarantee that didn't happen?
Keith Busch May 13, 2019, 3:16 p.m. UTC | #20
On Mon, May 13, 2019 at 04:57:08PM +0200, Christoph Hellwig wrote:
> On Mon, May 13, 2019 at 02:54:49PM +0000, Mario.Limonciello@dell.com wrote:
> > And NVME spec made it sound to me that while in a low power state it shouldn't
> > be available if the memory isn't available.
> > 
> > NVME spec in 8.9:
> > 
> > "Host software should request that the controller release the
> > assigned ranges prior to a shutdown event, a Runtime D3 event, or any other event
> > that requires host software to reclaim the assigned ranges."
> 
> The last part of the quoted text is the key - if the assigned range
> is reclaimed, that is the memory is going to be used for something else,
> we need to release the ranges.  But we do not release the ranges,
> as we want to keep the memory in use so that we can quickly use it
> again.

Yes, this. As far as I know, the host memory range is still accessible in
the idle suspend state. If there are states in which host memory is not
accessible, a reference explaining the requirement will be most helpful.
Kai-Heng Feng May 13, 2019, 5:16 p.m. UTC | #21
at 23:16, Keith Busch <kbusch@kernel.org> wrote:

> On Mon, May 13, 2019 at 04:57:08PM +0200, Christoph Hellwig wrote:
>> On Mon, May 13, 2019 at 02:54:49PM +0000, Mario.Limonciello@dell.com  
>> wrote:
>>> And NVME spec made it sound to me that while in a low power state it  
>>> shouldn't
>>> be available if the memory isn't available.
>>>
>>> NVME spec in 8.9:
>>>
>>> "Host software should request that the controller release the
>>> assigned ranges prior to a shutdown event, a Runtime D3 event, or any  
>>> other event
>>> that requires host software to reclaim the assigned ranges."
>>
>> The last part of the quoted text is the key - if the assigned range
>> is reclaimed, that is the memory is going to be used for something else,
>> we need to release the ranges.  But we do not release the ranges,
>> as we want to keep the memory in use so that we can quickly use it
>> again.
>
> Yes, this. As far as I know, the host memory range is still accessible in
> the idle suspend state. If there are states in which host memory is not
> accessible, a reference explaining the requirement will be most helpful.

Disabling HMB prior to suspend makes my original patch work without the memory
barrier.

However, using the same trick on this patch still freezes the system during  
S2I.
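
Roughly, the HMB release amounts to something like this in the suspend path
(sketch only, assuming pci.c's existing nvme_set_host_mem() helper):

	/* Release the host memory buffer before the power transition. */
	if (ndev->nr_host_mem_descs)
		nvme_set_host_mem(ndev, 0);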

Kai-Heng
Limonciello, Mario May 13, 2019, 6:01 p.m. UTC | #22
> -----Original Message-----
> From: Christoph Hellwig <hch@lst.de>
> Sent: Monday, May 13, 2019 9:57 AM
> To: Limonciello, Mario
> Cc: kbusch@kernel.org; keith.busch@intel.com; hch@lst.de; sagi@grimberg.me;
> linux-nvme@lists.infradead.org; rafael@kernel.org; linux-kernel@vger.kernel.org;
> linux-pm@vger.kernel.org; kai.heng.feng@canonical.com
> Subject: Re: [PATCH] nvme/pci: Use host managed power state for suspend
> 
> 
> [EXTERNAL EMAIL]
> 
> On Mon, May 13, 2019 at 02:54:49PM +0000, Mario.Limonciello@dell.com wrote:
> > The Intel DMA controller suspend callbacks in drivers/dma/idma64.c look to me
> to
> > turn off the controller.
> 
> How is that relevant?  That thing is neither a NVMe controller, nor
> even an PCIe device..

When using HMB the SSD will be writing to some memory mapped region.  Writing to
that region would use DMA to access host memory, no?
If the DMA controller is not functional writing to that region won't work properly as 
it can't access that memory.

> 
> > And NVME spec made it sound to me that while in a low power state it shouldn't
> > be available if the memory isn't available.
> >
> > NVME spec in 8.9:
> >
> > "Host software should request that the controller release the
> > assigned ranges prior to a shutdown event, a Runtime D3 event, or any other
> event
> > that requires host software to reclaim the assigned ranges."
> 
> The last part of the quoted text is the key - if the assigned range
> is reclaimed, that is the memory is going to be used for something else,
> we need to release the ranges.  But we do not release the ranges,
> as we want to keep the memory in use so that we can quickly use it
> again.
Keith Busch May 13, 2019, 6:16 p.m. UTC | #23
On Tue, May 14, 2019 at 01:16:22AM +0800, Kai-Heng Feng wrote:
> Disabling HMB prior to suspend makes my original patch work without the memory
> barrier.
> 
> However, using the same trick on this patch still freezes the system during
> S2I.

Could you post your code, please?
Christoph Hellwig May 14, 2019, 6:11 a.m. UTC | #24
On Mon, May 13, 2019 at 06:01:39PM +0000, Mario.Limonciello@dell.com wrote:
> When using HMB the SSD will be writing to some memory mapped region.
> Writing to
> that region would use DMA to access host memory, no?

Memory mapped region?  It will use the devices DMA engine to write
host memory, which we explicitly allowed it.

> If the DMA controller is not functional writing to that region won't work properly as 
> it can't access that memory.

The DMA controller is in the device.  External DMA controllers are only
used on very low-end peripherals, usually cheap IP blocks like SPI
controllers or similar.
Rafael J. Wysocki May 14, 2019, 8:04 a.m. UTC | #25
On Mon, May 13, 2019 at 5:10 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Mon, May 13, 2019 at 03:05:42PM +0000, Mario.Limonciello@dell.com wrote:
> > This system power state - suspend to idle is going to freeze threads.
> > But we're talking a multi threaded kernel.  Can't there be a timing problem going
> > on then too?  With a disk flush being active in one task and the other task trying
> > to put the disk into the deepest power state.  If you don't freeze the queues how
> > can you guarantee that didn't happen?
>
> But if an active data flush task is running, then we're not idle and
> shouldn't go to low power.

To be entirely precise, system suspend prevents user space from
running while it is in progress.  It doesn't do that to kernel
threads, at least not by default, though, so if there is a kernel
thread flushing the data, it needs to be stopped or suspended somehow
directly in the system suspend path.  [And yes, system suspend (or
hibernation) may take place at any time so long as all user space can
be prevented from running then (by means of the tasks freezer).]

However, freezing the queues from a driver ->suspend callback doesn't
help in general and the reason why is hibernation.  Roughly speaking,
hibernation works in two steps, the first of which creates a snapshot
image of system memory and the second one writes that image to
persistent storage.  Devices are resumed between the two steps in
order to make it possible to do the write, but that would unfreeze the
queues and let the data flusher run.  If it runs, it may cause the
memory snapshot image that has just been created to become outdated
and restoring the system memory contents from that image going forward
may cause corruption to occur.

Thus freezing the queues from a driver ->suspend callback should not
be relied on for correctness if the same callback is used for system
suspend and hibernation, which is the case here.  If doing that
prevents the system from crashing, it is critical to find out why IMO,
as that may very well indicate a broader issue, not necessarily in the
driver itself.

But note that even if the device turns out to behave oddly, it still
needs to be handled, unless it may be prevented from shipping to users
in that shape.  If it ships, users will face the odd behavior anyway.
Keith Busch May 14, 2019, 10:16 p.m. UTC | #26
On Tue, May 14, 2019 at 10:04:22AM +0200, Rafael J. Wysocki wrote:
> On Mon, May 13, 2019 at 5:10 PM Keith Busch <kbusch@kernel.org> wrote:
> >
> > On Mon, May 13, 2019 at 03:05:42PM +0000, Mario.Limonciello@dell.com wrote:
> > > This system power state - suspend to idle is going to freeze threads.
> > > But we're talking a multi threaded kernel.  Can't there be a timing problem going
> > > on then too?  With a disk flush being active in one task and the other task trying
> > > to put the disk into the deepest power state.  If you don't freeze the queues how
> > > can you guarantee that didn't happen?
> >
> > But if an active data flush task is running, then we're not idle and
> > shouldn't go to low power.
> 
> To be entirely precise, system suspend prevents user space from
> running while it is in progress.  It doesn't do that to kernel
> threads, at least not by default, though, so if there is a kernel
> thread flushing the data, it needs to be stopped or suspended somehow
> directly in the system suspend path.  [And yes, system suspend (or
> hibernation) may take place at any time so long as all user space can
> be prevented from running then (by means of the tasks freezer).]
> 
> However, freezing the queues from a driver ->suspend callback doesn't
> help in general and the reason why is hibernation.  Roughly speaking,
> hibernation works in two steps, the first of which creates a snapshot
> image of system memory and the second one writes that image to
> persistent storage.  Devices are resumed between the two steps in
> order to make it possible to do the write, but that would unfreeze the
> queues and let the data flusher run.  If it runs, it may cause the
> memory snapshot image that has just been created to become outdated
> and restoring the system memory contents from that image going forward
> may cause corruption to occur.
> 
> Thus freezing the queues from a driver ->suspend callback should not
> be relied on for correctness if the same callback is used for system
> suspend and hibernation, which is the case here.  If doing that
> prevents the system from crashing, it is critical to find out why IMO,
> as that may very well indicate a broader issue, not necessarily in the
> driver itself.
> 
> But note that even if the device turns out to behave oddly, it still
> needs to be handled, unless it may be prevented from shipping to users
> in that shape.  If it ships, users will face the odd behavior anyway.

Thanks for all the information. I'll take another shot at this, should
have it posted tomorrow.

It's mostly not a problem to ensure enqueued and dispatched requests are
completed before returning from our suspend callback. I originally had
that behavior and backed it out when I thought it wasn't necessary. So
I'll reintroduce that. I'm not sure yet how we may handle kernel tasks
that are about to read/write pages, but haven't yet enqueued their
requests.
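
Roughly, that first part would look something like the following in the
suspend path (sketch only, using the driver's existing freeze helpers):

	nvme_start_freeze(&ndev->ctrl);
	nvme_wait_freeze(&ndev->ctrl);	/* dispatched requests complete here */
	ret = nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
	nvme_unfreeze(&ndev->ctrl);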

IO timeouts, should they occur, have some problems here, so I'll need to
send prep patches to address a few issues with that.
Rafael J. Wysocki May 15, 2019, 9 a.m. UTC | #27
On Wed, May 15, 2019 at 12:21 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Tue, May 14, 2019 at 10:04:22AM +0200, Rafael J. Wysocki wrote:
> > On Mon, May 13, 2019 at 5:10 PM Keith Busch <kbusch@kernel.org> wrote:
> > >
> > > On Mon, May 13, 2019 at 03:05:42PM +0000, Mario.Limonciello@dell.com wrote:
> > > > This system power state - suspend to idle is going to freeze threads.
> > > > But we're talking a multi threaded kernel.  Can't there be a timing problem going
> > > > on then too?  With a disk flush being active in one task and the other task trying
> > > > to put the disk into the deepest power state.  If you don't freeze the queues how
> > > > can you guarantee that didn't happen?
> > >
> > > But if an active data flush task is running, then we're not idle and
> > > shouldn't go to low power.
> >
> > To be entirely precise, system suspend prevents user space from
> > running while it is in progress.  It doesn't do that to kernel
> > threads, at least not by default, though, so if there is a kernel
> > thread flushing the data, it needs to be stopped or suspended somehow
> > directly in the system suspend path.  [And yes, system suspend (or
> > hibernation) may take place at any time so long as all user space can
> > be prevented from running then (by means of the tasks freezer).]
> >
> > However, freezing the queues from a driver ->suspend callback doesn't
> > help in general and the reason why is hibernation.  Roughly speaking,
> > hibernation works in two steps, the first of which creates a snapshot
> > image of system memory and the second one writes that image to
> > persistent storage.  Devices are resumed between the two steps in
> > order to make it possible to do the write, but that would unfreeze the
> > queues and let the data flusher run.  If it runs, it may cause the
> > memory snapshot image that has just been created to become outdated
> > and restoring the system memory contents from that image going forward
> > may cause corruption to occur.
> >
> > Thus freezing the queues from a driver ->suspend callback should not
> > be relied on for correctness if the same callback is used for system
> > suspend and hibernation, which is the case here.  If doing that
> > prevents the system from crashing, it is critical to find out why IMO,
> > as that may very well indicate a broader issue, not necessarily in the
> > driver itself.
> >
> > But note that even if the device turns out to behave oddly, it still
> > needs to be handled, unless it may be prevented from shipping to users
> > in that shape.  If it ships, users will face the odd behavior anyway.
>
> Thanks for all the information. I'll take another shot at this, should
> have it posted tomorrow.
>
> It's mostly not a problem to ensure enqueued and dispatched requests are
> completed before returning from our suspend callback. I originally had
> that behavior and backed it out when I thought it wasn't necessary. So
> I'll reintroduce that. I'm not sure yet how we may handle kernel tasks
> that are about to read/write pages, but haven't yet enqueued their
> requests.

That is a hard problem in general.

Currently, there are two ways to prevent a kernel thread from running
across hibernation.  One of them is to freeze it with the help of the
tasks freezer, but that isn't nice.  The second one would be to
register a PM notifier with register_pm_notifier() and then stop the
thread on the PM_HIBERNATION_PREPARE events (and restart it on the
PM_POST_HIBERNATION and PM_POST_RESTORE events).  The drawback here
is, though, that PM_HIBERNATION_PREPARE is signaled before freezing
user space which may produce some deadlock scenarios in principle if
anything in user space may be blocked waiting for the stopped kernel
thread.
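
For illustration, that notifier approach would look roughly like the sketch
below (the function and notifier names are placeholders, not existing code):

	#include <linux/suspend.h>

	static int flusher_pm_notify(struct notifier_block *nb,
			unsigned long action, void *data)
	{
		switch (action) {
		case PM_HIBERNATION_PREPARE:
			/* stop or park the flushing kernel thread here */
			break;
		case PM_POST_HIBERNATION:
		case PM_POST_RESTORE:
			/* restart the kernel thread here */
			break;
		}
		return NOTIFY_OK;
	}

	static struct notifier_block flusher_pm_nb = {
		.notifier_call = flusher_pm_notify,
	};

	/* registered once at init time: register_pm_notifier(&flusher_pm_nb); */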

It may be necessary to add PM_HIBERNATION/SUSPEND_POSTFREEZE and
_PRETHAW events to the PM notifier for that, which will be signaled
after running the tasks freezer and before unfreezing tasks,
respectively.  Or something else to that effect.

If you need something like that, please let me know and I will prepare
a core patch to add it.

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a6644a2c3ef7..eb3640fd8838 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1132,6 +1132,33 @@  static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
 	return ret;
 }
 
+int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps)
+{
+	return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps, NULL, 0, NULL);
+}
+EXPORT_SYMBOL_GPL(nvme_set_power);
+
+int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result)
+{
+	struct nvme_command c;
+	union nvme_result res;
+	int ret;
+
+	if (!result)
+		return -EINVAL;
+
+	memset(&c, 0, sizeof(c));
+	c.features.opcode = nvme_admin_get_features;
+	c.features.fid = cpu_to_le32(NVME_FEAT_POWER_MGMT);
+
+	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &c, &res,
+			NULL, 0, 0, NVME_QID_ANY, 0, 0, false);
+	if (ret >= 0)
+		*result = le32_to_cpu(res.u32);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(nvme_get_power);
+
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 {
 	u32 q_count = (*count - 1) | ((*count - 1) << 16);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5ee75b5ff83f..eaa571ac06d2 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -459,6 +459,8 @@  int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		unsigned timeout, int qid, int at_head,
 		blk_mq_req_flags_t flags, bool poll);
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
+int nvme_set_power(struct nvme_ctrl *ctrl, unsigned ps);
+int nvme_get_power(struct nvme_ctrl *ctrl, u32 *result);
 void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
 int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3e4fb891a95a..0d5d91e5b293 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -18,6 +18,7 @@ 
 #include <linux/mutex.h>
 #include <linux/once.h>
 #include <linux/pci.h>
+#include <linux/suspend.h>
 #include <linux/t10-pi.h>
 #include <linux/types.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
@@ -116,6 +117,7 @@  struct nvme_dev {
 	u32 cmbsz;
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
+	u32 last_ps;
 
 	mempool_t *iod_mempool;
 
@@ -2828,11 +2830,59 @@  static void nvme_remove(struct pci_dev *pdev)
 }
 
 #ifdef CONFIG_PM_SLEEP
+static int nvme_deep_state(struct nvme_dev *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	int ret;
+
+	/*
+	 * Save the current power state in case a user tool set a power policy
+	 * for this device. We'll restore that state on resume.
+	 */
+	dev->last_ps = 0;
+	ret = nvme_get_power(&dev->ctrl, &dev->last_ps);
+
+	/*
+	 * Return the error to halt suspend if the driver either couldn't
+	 * submit a command or didn't see a response.
+	 */
+	if (ret < 0)
+		return ret;
+
+	ret = nvme_set_power(&dev->ctrl, dev->ctrl.npss);
+	if (ret < 0)
+		return ret;
+
+	if (!ret) {
+		/*
+		 * A saved state prevents pci pm from generically controlling
+		 * the device's power. We're using protocol specific settings
+		 * so we don't want pci interfering.
+		 */
+		pci_save_state(pdev);
+	} else {
+		/*
+		 * The drive failed the low power request. Fallback to device
+		 * shutdown and clear npss to force a controller reset on
+		 * resume. The value will be rediscovered during reset.
+		 */
+		dev->ctrl.npss = 0;
+		nvme_dev_disable(dev, true);
+	}
+	return 0;
+}
+
 static int nvme_suspend(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
+	/*
+	 * Try to use nvme if the device supports host managed power settings
+	 * and platform firmware is not involved.
+	 */
+	if (!pm_suspend_via_firmware() && ndev->ctrl.npss)
+		return nvme_deep_state(ndev);
 	nvme_dev_disable(ndev, true);
 	return 0;
 }
@@ -2842,6 +2892,9 @@  static int nvme_resume(struct device *dev)
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
+	if (!pm_resume_via_firmware() && ndev->ctrl.npss)
+		if (nvme_set_power(&ndev->ctrl, ndev->last_ps) == 0)
+			return 0;
 	nvme_reset_ctrl(&ndev->ctrl);
 	return 0;
 }