[v2,11/13] PM / sleep: Allow opt-out from runtime resume after direct-complete

Message ID 9f0713b23baf7c6139a4b476cd9637d96e18cc95.1463134232.git.lukas@wunner.de (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas

Commit Message

Lukas Wunner May 13, 2016, 11:15 a.m. UTC
Since commit aae4518b3124 ("PM / sleep: Mechanism to avoid resuming
runtime-suspended devices unnecessarily"), we no longer wake up devices
which are already runtime suspended upon entering system sleep
("direct-complete").

However commit 58a1fbbb2ee8 ("PM / PCI / ACPI: Kick devices that might
have been reset by firmware") changed this to mandatorily runtime resume
such devices after the system is woken.  The motivation was to ensure
that devices do not remain in a reset-power-on state after system
resume, potentially preventing deep SoC-wide low-power states from being
entered on idle.

This is counter-productive for devices of which we know that the
mandatory runtime resume is unnecessary.  Thunderbolt on the Mac is a
case in point: Runtime resume not just powers up the controller, but
multiple adjacent chips, including a 15V boost converter, multiplexers
and an eeprom.  Gratuitously powering this up after every system sleep
burns a not insignificant amount of energy and needlessly strains the
hardware.

Perhaps it would have been better to carry out the mandatory runtime
resume only for those devices that actually need it, but at least we
should allow an opt-out.

Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 drivers/base/power/generic_ops.c | 3 ++-
 include/linux/pm.h               | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

Comments

Rafael J. Wysocki July 18, 2016, 1:18 p.m. UTC | #1
On Friday, May 13, 2016 01:15:31 PM Lukas Wunner wrote:
> Since commit aae4518b3124 ("PM / sleep: Mechanism to avoid resuming
> runtime-suspended devices unnecessarily"), we no longer wake up devices
> which are already runtime suspended upon entering system sleep
> ("direct-complete").
> 
> However commit 58a1fbbb2ee8 ("PM / PCI / ACPI: Kick devices that might
> have been reset by firmware") changed this to mandatorily runtime resume
> such devices after the system is woken.  The motivation was to ensure
> that devices do not remain in a reset-power-on state after system
> resume, potentially preventing deep SoC-wide low-power states from being
> entered on idle.
> 
> This is counter-productive for devices of which we know that the
> mandatory runtime resume is unnecessary.  Thunderbolt on the Mac is a
> case in point: Runtime resume not just powers up the controller, but
> multiple adjacent chips, including a 15V boost converter, multiplexers
> and an eeprom.  Gratuitously powering this up after every system sleep
> burns a not insignificant amount of energy and needlessly strains the
> hardware.
> 
> Perhaps it would have been better to carry out the mandatory runtime
> resume only for those devices that actually need it, but at least we
> should allow an opt-out.
> 
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> Signed-off-by: Lukas Wunner <lukas@wunner.de>

I don't like this patch and especially adding a new dev_pm_info flag to
work around something that you're seeing as an issue in the generic ops.

It is sort of like saying "the generic ops don't work for me, so modify
them as well as struct dev_pm_info", but maybe it's better to change the
PCI bus type to do something different from calling the generic function?

Or you can add a ->complete callback to your driver that will clear
power.direct_complete for the device in question.
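
The second suggestion can be sketched as a small user-space model (this is
not kernel code). All model_* names are hypothetical stand-ins, and the
model assumes that the generic helper invokes the driver's ->complete
callback before it performs the resume check. Under that assumption,
clearing the flag in ->complete is enough to suppress the resume request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device; only what the thread needs. */
struct model_device {
	bool direct_complete;	/* models dev->power.direct_complete */
	bool resume_requested;	/* models the effect of pm_request_resume() */
	void (*complete)(struct model_device *dev);	/* driver ->complete */
};

/* Stand-in for pm_resume_via_firmware(); assumed true for the demo. */
static bool model_resume_via_firmware = true;

/*
 * Models the unpatched pm_complete_with_resume_check(): run the driver's
 * ->complete first (an assumption of this sketch), then apply the check.
 */
static void model_complete_with_resume_check(struct model_device *dev)
{
	assert(dev != NULL);
	if (dev->complete)
		dev->complete(dev);
	if (dev->direct_complete && model_resume_via_firmware)
		dev->resume_requested = true;
}

/* Driver callback following the suggestion: clear the flag. */
static void model_driver_complete(struct model_device *dev)
{
	dev->direct_complete = false;
}
```

A device without the callback still gets a resume request in this model,
while one that installs model_driver_complete() does not.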

> ---
>  drivers/base/power/generic_ops.c | 3 ++-
>  include/linux/pm.h               | 1 +
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/power/generic_ops.c b/drivers/base/power/generic_ops.c
> index 07c3c4a..6e88f55 100644
> --- a/drivers/base/power/generic_ops.c
> +++ b/drivers/base/power/generic_ops.c
> @@ -316,7 +316,8 @@ void pm_complete_with_resume_check(struct device *dev)
>  	 * the sleep state it is going out of and it has never been resumed till
>  	 * now, resume it in case the firmware powered it up.
>  	 */
> -	if (dev->power.direct_complete && pm_resume_via_firmware())
> +	if (dev->power.direct_complete && pm_resume_via_firmware() &&
> +	    !dev->power.direct_complete_noresume)
>  		pm_request_resume(dev);
>  }
>  EXPORT_SYMBOL_GPL(pm_complete_with_resume_check);
> diff --git a/include/linux/pm.h b/include/linux/pm.h
> index 6a5d654..023de94 100644
> --- a/include/linux/pm.h
> +++ b/include/linux/pm.h
> @@ -596,6 +596,7 @@ struct dev_pm_info {
>  	unsigned int		use_autosuspend:1;
>  	unsigned int		timer_autosuspends:1;
>  	unsigned int		memalloc_noio:1;
> +	unsigned int		direct_complete_noresume:1;
>  	enum rpm_request	request;
>  	enum rpm_status		runtime_status;
>  	int			runtime_error;
>
Lukas Wunner Aug. 7, 2016, 9:56 a.m. UTC | #2
On Mon, Jul 18, 2016 at 03:18:25PM +0200, Rafael J. Wysocki wrote:
> On Friday, May 13, 2016 01:15:31 PM Lukas Wunner wrote:
> > Since commit aae4518b3124 ("PM / sleep: Mechanism to avoid resuming
> > runtime-suspended devices unnecessarily"), we no longer wake up devices
> > which are already runtime suspended upon entering system sleep
> > ("direct-complete").
> > 
> > However commit 58a1fbbb2ee8 ("PM / PCI / ACPI: Kick devices that might
> > have been reset by firmware") changed this to mandatorily runtime resume
> > such devices after the system is woken.  The motivation was to ensure
> > that devices do not remain in a reset-power-on state after system
> > resume, potentially preventing deep SoC-wide low-power states from being
> > entered on idle.
> > 
> > This is counter-productive for devices of which we know that the
> > mandatory runtime resume is unnecessary.  Thunderbolt on the Mac is a
> > case in point: Runtime resume not just powers up the controller, but
> > multiple adjacent chips, including a 15V boost converter, multiplexers
> > and an eeprom.  Gratuitously powering this up after every system sleep
> > burns a not insignificant amount of energy and needlessly strains the
> > hardware.
> > 
> > Perhaps it would have been better to carry out the mandatory runtime
> > resume only for those devices that actually need it, but at least we
> > should allow an opt-out.
> > 
> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Cc: Alan Stern <stern@rowland.harvard.edu>
> > Signed-off-by: Lukas Wunner <lukas@wunner.de>
> 
> I don't like this patch and especially adding a new dev_pm_info flag to
> work around something that you're seeing as an issue in the generic ops.
> 
> It is sort of like saying "the generic ops don't work for me, so modify
> them as well as struct dev_pm_info", but maybe it's better to change the
> PCI bus type to do something different from calling the generic function?
> 
> Or you can add a ->complete callback to your driver that will clear
> power.direct_complete for the device in question.

First of all, the direct_complete flag is marked "Owned by the PM core"
in include/linux/pm.h. So I would have expected that a driver is not
supposed to fudge it.

Second, yes it's possible to make it work by clearing direct_complete
in the ->complete callback, but there's a catch: The device tree is
traversed bottom-up in dpm_complete(). Recall that a Thunderbolt
controller consists of multiple devices and that power control is
governed by its top-most device (upstream bridge). But because we're
going bottom-up, clearing the direct_complete flag must be done by
the bottom-most device (NHI)! So I've got all the power management
stuff nicely separated in functions executed for the upstream bridge,
but a small portion needs to be executed for the NHI. That's ugly.

Normally the device hierarchy is traversed bottom-up during suspend
and top-down during resume. However ->prepare and ->complete do it
the other way round. In the case of ->prepare, this is even documented
in Documentation/power/devices.txt but the reason thereof is not.
Could you explain this please?
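
The two traversal orders can be pictured with a small user-space sketch.
It is not the PM core's implementation: the device names and the model_*
helpers are hypothetical, and the list is simply assumed to hold parents
before their children.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NDEV 3

/* Hypothetical Thunderbolt hierarchy, parent before child. */
static const char *const model_dpm_list[MODEL_NDEV] = {
	"upstream-bridge", "downstream-bridge", "nhi",
};

/* ->prepare order: walk head to tail, so parents come first (top-down). */
static size_t model_prepare_order(const char *out[], size_t cap)
{
	size_t n = 0;

	assert(out != NULL);
	for (size_t i = 0; i < MODEL_NDEV && n < cap; i++)
		out[n++] = model_dpm_list[i];
	return n;
}

/* ->complete order: walk tail to head, so children come first (bottom-up). */
static size_t model_complete_order(const char *out[], size_t cap)
{
	size_t n = 0;

	assert(out != NULL);
	for (size_t i = MODEL_NDEV; i > 0 && n < cap; i--)
		out[n++] = model_dpm_list[i - 1];
	return n;
}
```

In this model the NHI is the first device to see ->complete and the
upstream bridge is the last, which is the ordering problem described above.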

Third, I'm irritated by your question "maybe it's better to change the
PCI bus type to do something different from calling the generic function".
What should that be? Under which circumstances can we leave a PCI device
asleep after direct-complete?

I'm generally irritated by commit 58a1fbbb2ee8, it's a significant change
to mandatorily wake all devices, it wastes a not insignificant amount of
energy, yet the reasoning in the commit message sounds vague and handwavy
("There is a concern [...] devices that are most likely to be affected").

Are there clear indications for or against a device requiring a resume?
E.g. the commit message names SoCs, perhaps those can be recognized by
having child devices of certain types?

Thanks,

Lukas

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Aug. 7, 2016, 3:33 p.m. UTC | #3
On Sun, 7 Aug 2016, Lukas Wunner wrote:

> Normally the device hierarchy is traversed bottom-up during suspend
> and top-down during resume. However ->prepare and ->complete do it
> the other way round. In the case of ->prepare, this is even documented
> in Documentation/power/devices.txt but the reason thereof is not.
> Could you explain this please?

The purpose of ->prepare is to tell drivers that a system sleep is
beginning and accordingly they should stop registering new children.  
This is necessary for the PM core to be able to traverse the entire
device tree safely; we want to avoid races where a new child is added
below a device concurrently with that device being suspended.  (Or if
you want to be more precise, races in which a new child is added below
a device while the PM core is acquiring the device's lock just prior to
invoking its ->suspend callback.)

Telling drivers to stop registering new children below a device has to
be done top-down, because if it were done bottom-up then it would be
subject to the same race described above.  Doing it top-down avoids 
problems; if a device registers new children while the PM core is 
acquiring its lock prior to invoking ->prepare, it doesn't matter.  The 
new children will be handled later, right along with the existing ones.
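
As a rough user-space illustration of why the top-down walk tolerates late
registration: the sketch below assumes new devices are appended to the tail
of the list being walked. The model_* names and integer device ids are
hypothetical, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX 8

/* Hypothetical device list in registration order (parents first). */
struct model_list {
	int ids[MODEL_MAX];
	size_t len;
};

/* Register a device by appending it to the tail of the list. */
static void model_register(struct model_list *l, int id)
{
	assert(l->len < MODEL_MAX);
	l->ids[l->len++] = id;
}

/*
 * Top-down ->prepare walk. Just before late_parent is prepared, it
 * registers late_child (the two ids must differ). The child lands at the
 * tail, so the same walk still reaches it. prepared[] needs room for
 * MODEL_MAX entries. Returns the number of devices prepared.
 */
static size_t model_prepare_all(struct model_list *l, int late_parent,
				int late_child, int prepared[])
{
	size_t n = 0;

	for (size_t i = 0; i < l->len; i++) {	/* len is re-read each pass */
		if (l->ids[i] == late_parent)
			model_register(l, late_child);
		prepared[n++] = l->ids[i];
	}
	return n;
}
```

A bottom-up walk over the same list would already have passed the tail by
the time the parent is reached, so the late child would be missed.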

Alan Stern

Lukas Wunner Aug. 12, 2016, 4:39 p.m. UTC | #4
On Sun, Aug 07, 2016 at 11:33:17AM -0400, Alan Stern wrote:
> On Sun, 7 Aug 2016, Lukas Wunner wrote:
> 
> > Normally the device hierarchy is traversed bottom-up during suspend
> > and top-down during resume. However ->prepare and ->complete do it
> > the other way round. In the case of ->prepare, this is even documented
> > in Documentation/power/devices.txt but the reason thereof is not.
> > Could you explain this please?
> 
> The purpose of ->prepare is to tell drivers that a system sleep is
> beginning and accordingly they should stop registering new children.  
> This is necessary for the PM core to be able to traverse the entire
> device tree safely; we want to avoid races where a new child is added
> below a device concurrently with that device being suspended.  (Or if
> you want to be more precise, races in which a new child is added below
> a device while the PM core is acquiring the device's lock just prior to
> invoking its ->suspend callback.)
> 
> Telling drivers to stop registering new children below a device has to
> be done top-down, because if it were done bottom-up then it would be
> subject to the same race described above.  Doing it top-down avoids 
> problems; if a device registers new children while the PM core is 
> acquiring its lock prior to invoking ->prepare, it doesn't matter.  The
> new children will be handled later, right along with the existing ones.

Thank you for explaining the motivation to carry out ->prepare top-down.
However my problem is really that ->complete is carried out bottom-up.
What's the motivation for that? Merely to mirror the behaviour of
->prepare? Would it be possible to change it to top-down? Note that
re-enablement of device addition is already allowed in ->resume,
which is called top-down.

By the way, neither the PCI nor USB bus-level ->prepare callbacks perform
any action that would stop device addition. Same for the pciehp driver
(we don't even have a ->prepare callback defined for PCIe port services).
So it *is* possible to hotplug PCI devices after ->prepare.

Best regards,

Lukas
Alan Stern Aug. 12, 2016, 5:30 p.m. UTC | #5
On Fri, 12 Aug 2016, Lukas Wunner wrote:

> Thank you for explaining the motivation to carry out ->prepare top-down.
> However my problem is really that ->complete is carried out bottom-up.
> What's the motivation for that? Merely to mirror the behaviour of
> ->prepare? Would it be possible to change it to top-down? Note that
> re-enablement of device addition is already allowed in ->resume,
> which is called top-down.

I'm not aware of any particular reason why making ->complete run
top-down wouldn't work.  Of course, if you did then the environment at
the start of the ->complete callback wouldn't be the same as it was at
the end of the ->prepare callback.

I think originally the idea was just to mirror ->prepare.  Perhaps
Rafael will remember something that has escaped me.

> By the way, neither the PCI nor USB bus-level ->prepare callbacks perform
> any action that would stop device addition. Same for the pciehp driver
> (we don't even have a ->prepare callback defined for PCIe port services).
> So it *is* possible to hotplug PCI devices after ->prepare.

I don't know about PCI (although what you describe sounds like a bug).  

USB relies on a freezable workqueue for adding child devices, so it
stops adding children even before the prepare phase begins.

Alan Stern

Rafael J. Wysocki Aug. 12, 2016, 10:40 p.m. UTC | #6
On Friday, August 12, 2016 01:30:04 PM Alan Stern wrote:
> On Fri, 12 Aug 2016, Lukas Wunner wrote:
> 
> > Thank you for explaining the motivation to carry out ->prepare top-down.
> > However my problem is really that ->complete is carried out bottom-up.
> > What's the motivation for that? Merely to mirror the behaviour of
> > ->prepare? Would it be possible to change it to top-down? Note that
> > re-enablement of device addition is already allowed in ->resume,
> > which is called top-down.
> 
> I'm not aware of any particular reason why making ->complete run
> top-down wouldn't work.  Of course, if you did then the environment at
> the start of the ->complete callback wouldn't be the same as it was at
> the end of the ->prepare callback.
> 
> I think originally the idea was just to mirror ->prepare.  Perhaps
> Rafael will remember something that has escaped me.

Nothing specific off the top of my head.

> > By the way, neither the PCI nor USB bus-level ->prepare callbacks perform
> > any action that would stop device addition. Same for the pciehp driver
> > (we don't even have a ->prepare callback defined for PCIe port services).
> > So it *is* possible to hotplug PCI devices after ->prepare.

Not via ACPI, though.  The ACPI core blocks all hotplug events at the
beginning of the suspend sequence and releases them at the end of device
resume.

> I don't know about PCI (although what you describe sounds like a bug).  
> 
> USB relies on a freezable workqueue for adding child devices, so it
> stops adding children even before the prepare phase begins.

Right.

Thanks,
Rafael

Patch

diff --git a/drivers/base/power/generic_ops.c b/drivers/base/power/generic_ops.c
index 07c3c4a..6e88f55 100644
--- a/drivers/base/power/generic_ops.c
+++ b/drivers/base/power/generic_ops.c
@@ -316,7 +316,8 @@  void pm_complete_with_resume_check(struct device *dev)
 	 * the sleep state it is going out of and it has never been resumed till
 	 * now, resume it in case the firmware powered it up.
 	 */
-	if (dev->power.direct_complete && pm_resume_via_firmware())
+	if (dev->power.direct_complete && pm_resume_via_firmware() &&
+	    !dev->power.direct_complete_noresume)
 		pm_request_resume(dev);
 }
 EXPORT_SYMBOL_GPL(pm_complete_with_resume_check);
diff --git a/include/linux/pm.h b/include/linux/pm.h
index 6a5d654..023de94 100644
--- a/include/linux/pm.h
+++ b/include/linux/pm.h
@@ -596,6 +596,7 @@  struct dev_pm_info {
 	unsigned int		use_autosuspend:1;
 	unsigned int		timer_autosuspends:1;
 	unsigned int		memalloc_noio:1;
+	unsigned int		direct_complete_noresume:1;
 	enum rpm_request	request;
 	enum rpm_status		runtime_status;
 	int			runtime_error;
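
The effect of the patched condition can be checked in isolation with a
small user-space model. The struct and function names below are
hypothetical; only the two flag names are taken from the patch, and the
via_firmware argument stands in for the result of pm_resume_via_firmware().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two relevant bits of struct dev_pm_info. */
struct model_power {
	unsigned int direct_complete:1;
	unsigned int direct_complete_noresume:1;	/* the proposed opt-out */
};

/*
 * Models the patched check: a resume is requested only if the device went
 * through direct-complete, the firmware performed the resume, and the
 * device has not opted out.
 */
static bool model_needs_resume(const struct model_power *p, bool via_firmware)
{
	assert(p != NULL);
	return p->direct_complete && via_firmware &&
	       !p->direct_complete_noresume;
}
```

With the opt-out bit clear the model behaves like the unpatched check;
setting it suppresses the resume request regardless of the other inputs.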