diff mbox

[RFC] PCI: Runtime power management

Message ID 20090813002925.GA2532@srcf.ucam.org (mailing list archive)
State RFC, archived
Headers show

Commit Message

Matthew Garrett Aug. 13, 2009, 12:29 a.m. UTC
I got a fixed BIOS from Dell and have been able to get this working now. 
It seems entirely happy with USB, but I'd like some sanity checks on 
whether I'm doing this correctly. There's certainly a couple of quirks 
related to setting the ACPI GPE type that would need a little bit of 
work in the ACPI layer, and it breaks ACPI-mediated PCI hotplug though 
that's easy enough to fix by just calling into the hotplug code from the 
core notifier.

This patch builds on top of Rafael's work on systemwide runtime power
management. It supports suspending and resuming PCI devices at runtime,
enabling platform wakeup events that allow the devices to automatically
resume when appropriate. It currently requires platform support, but PCIe
setups could be supported natively once native PCIe PME code has been added
to the kernel.
---
 drivers/pci/pci-acpi.c   |   55 +++++++++++++++++++++++++
 drivers/pci/pci-driver.c |  100 ++++++++++++++++++++++++++++++++++++++++++++++
 drivers/pci/pci.c        |   87 ++++++++++++++++++++++++++++++++++++++++
 drivers/pci/pci.h        |    3 +
 include/linux/pci.h      |    3 +
 5 files changed, 248 insertions(+), 0 deletions(-)

Comments

Alan Stern Aug. 13, 2009, 3:17 p.m. UTC | #1
On Thu, 13 Aug 2009, Matthew Garrett wrote:

> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c

> +#ifdef CONFIG_PM_RUNTIME
> +
> +static int pci_pm_runtime_suspend(struct device *dev)
> +{
> +	struct pci_dev *pci_dev = to_pci_dev(dev);
> +	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +	int error;
> +
> +	device_set_wakeup_enable(dev, 1);

This is a userspace policy parameter.  Kernel code should not alter it.
Instead you should test device_may_wakeup.

> +	error = pci_enable_runtime_wake(pci_dev, true);
> +
> +	if (error)
> +		return -EBUSY;
> +
> +	if (pm && pm->runtime_suspend)
> +		error = pm->runtime_suspend(dev);
> +
> +	if (error)
> +		goto out;
> +
> +	error = pci_pm_suspend(dev);
> +
> +	if (error)
> +		goto resume;
> +
> +	disable_irq(pci_dev->irq);
> +	error = pci_pm_suspend_noirq(dev);
> +	enable_irq(pci_dev->irq);
> +
> +	if (error)
> +		goto resume_noirq;
> +
> +	return 0;
> +
> +resume_noirq:
> +	disable_irq(pci_dev->irq);
> +	pci_pm_resume_noirq(dev);
> +	enable_irq(pci_dev->irq);
> +resume:
> +	pci_pm_resume(dev);
> +out:
> +	pci_enable_runtime_wake(pci_dev, false);
> +	return error;
> +}

The goto statements and unwinding code don't match up.

> +static int pci_pm_runtime_resume(struct device *dev)
> +{
> +	struct pci_dev *pci_dev = to_pci_dev(dev);
> +	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +	int error = 0;
> +
> +	disable_irq(pci_dev->irq);
> +	error = pci_pm_resume_noirq(dev);
> +	enable_irq(pci_dev->irq);
> +
> +	if (error)
> +		return error;
> +
> +	error = pci_pm_resume(dev);
> +
> +	if (error)
> +		return error;
> +
> +	if (pm->runtime_resume)
> +		error = pm->runtime_resume(dev);
> +
> +	if (error)
> +		return error;
> +
> +	error = pci_enable_runtime_wake(pci_dev, false);
> +
> +	if (error)
> +		return error;
> +
> +	return 0;
> +}

Log an error message when something goes wrong?

> +static void pci_pm_runtime_idle(struct device *dev)
> +{
> +	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +
> +	if (pm && pm->runtime_idle)
> +		pm->runtime_idle(dev);
> +
> +	pm_schedule_suspend(dev, 0);
> +}

This misses the point.  The whole idea of runtime_idle is to tell you 
that the device is idle and might be ready to be suspended.  If you're 
going to call pm_schedule_suspend anyway, there's no reason to invoke 
pm->runtime_idle.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Garrett Aug. 13, 2009, 9:47 p.m. UTC | #2
On Thu, Aug 13, 2009 at 11:17:05AM -0400, Alan Stern wrote:
> On Thu, 13 Aug 2009, Matthew Garrett wrote:
> 
> > --- a/drivers/pci/pci-driver.c
> > +++ b/drivers/pci/pci-driver.c
> 
> > +#ifdef CONFIG_PM_RUNTIME
> > +
> > +static int pci_pm_runtime_suspend(struct device *dev)
> > +{
> > +	struct pci_dev *pci_dev = to_pci_dev(dev);
> > +	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> > +	int error;
> > +
> > +	device_set_wakeup_enable(dev, 1);
> 
> This is a userspace policy parameter.  Kernel code should not alter it.
> Instead you should test device_may_wakeup.

Ugh. I'd really prefer us to assume that drivers are able to cope unless 
proven otherwise. Userspace policy makes sense where we don't have any 
idea whether something will work or not, but I'd really expect that most 
PCI drivers will either cope (in which case they'll have enabling code) 
or won't (in which case they won't). Why would we want userspace to 
influence this?

> > +	enable_irq(pci_dev->irq);
> > +resume:
> > +	pci_pm_resume(dev);
> > +out:
> > +	pci_enable_runtime_wake(pci_dev, false);
> > +	return error;
> > +}
> 
> The goto statements and unwinding code don't match up.

I'll look at that.

> > +	if (error)
> > +		return error;
> > +
> > +	return 0;
> > +}
> 
> Log an error message when something goes wrong?

Seems fair.

> > +static void pci_pm_runtime_idle(struct device *dev)
> > +{
> > +	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> > +
> > +	if (pm && pm->runtime_idle)
> > +		pm->runtime_idle(dev);
> > +
> > +	pm_schedule_suspend(dev, 0);
> > +}
> 
> This misses the point.  The whole idea of runtime_idle is to tell you 
> that the device is idle and might be ready to be suspended.  If you're 
> going to call pm_schedule_suspend anyway, there's no reason to invoke 
> pm->runtime_idle.

My understanding of the API was that pm_device_put() invokes 
runtime_idle if the refcount hits 0. The bus layer has no idea of the 
refcount, and calling suspend directly from the driver would defeat the 
point of the system-wide recounting.
Matthew Garrett Aug. 14, 2009, 12:30 p.m. UTC | #3
On Thu, Aug 13, 2009 at 10:47:01PM +0100, Matthew Garrett wrote:

> Ugh. I'd really prefer us to assume that drivers are able to cope unless 
> proven otherwise. Userspace policy makes sense where we don't have any 
> idea whether something will work or not, but I'd really expect that most 
> PCI drivers will either cope (in which case they'll have enabling code) 
> or won't (in which case they won't). Why would we want userspace to 
> influence this?

Though, thinking about it, you're right that setting this does override 
user policy. I think we need an additional flag to indicate that the 
device supports runtime wakeup and test that as well when doing 
device_may_wakeup().

> > This misses the point.  The whole idea of runtime_idle is to tell you 
> > that the device is idle and might be ready to be suspended.  If you're 
> > going to call pm_schedule_suspend anyway, there's no reason to invoke 
> > pm->runtime_idle.
> 
> My understanding of the API was that pm_device_put() invokes 
> runtime_idle if the refcount hits 0. The bus layer has no idea of the 
> refcount, and calling suspend directly from the driver would defeat the 
> point of the system-wide recounting.

From the API docs:

"The action performed by a bus type's ->runtime_idle() callback is 
totally dependent on the bus type in question, but the expected and 
recommended action is to check if the device can be suspended (i.e. if 
all of the conditions necessary for suspending the device are satisfied) 
and to queue up a suspend request for the device in that case."

Though perhaps the device level runtime_idle shouldn't be void - that 
way the bus can ask the driver whether its suspend conditions have been 
satisfied? Right now there doesn't seem to be any way for the bus to ask 
that.
Alan Stern Aug. 14, 2009, 2:43 p.m. UTC | #4
On Thu, 13 Aug 2009, Matthew Garrett wrote:

> On Thu, Aug 13, 2009 at 11:22:44AM -0400, Alan Stern wrote:
>
> > You have to call the HCD's pci_suspend method!  Not to mention calling
> > synchronize_irq and all the other stuff in hcd_pci_suspend and
> > hcd_pci_suspend_noirq.
>
> The bus level code does this, assuming that the driver-level code
> doesn't return an error.

So it does; my mistake.


On Fri, 14 Aug 2009, Matthew Garrett wrote:

> On Thu, Aug 13, 2009 at 10:47:01PM +0100, Matthew Garrett wrote:
> 
> > Ugh. I'd really prefer us to assume that drivers are able to cope unless 
> > proven otherwise. Userspace policy makes sense where we don't have any 
> > idea whether something will work or not, but I'd really expect that most 
> > PCI drivers will either cope (in which case they'll have enabling code) 
> > or won't (in which case they won't). Why would we want userspace to 
> > influence this?
> 
> Though, thinking about it, you're right that setting this does override 
> user policy. I think we need an additional flag to indicate that the 
> device supports runtime wakeup and test that as well when doing 
> device_may_wakeup().

You are suggesting separate flag sets for system-wide wakeup and
runtime wakeup?  I don't disagree, but implementing them will be
problematical.

That's because it's not always possible to change a device's wakeup 
setting while it is suspended.  Thus if a device was runtime suspended 
with wakeup enabled, and then we want to do a system sleep and change 
the device's wakeup setting to disabled, we would have to wake the 
device back up in order to do it.


> > > This misses the point.  The whole idea of runtime_idle is to tell you 
> > > that the device is idle and might be ready to be suspended.  If you're 
> > > going to call pm_schedule_suspend anyway, there's no reason to invoke 
> > > pm->runtime_idle.
> > 
> > My understanding of the API was that pm_device_put() invokes 
> > runtime_idle if the refcount hits 0. The bus layer has no idea of the 
> > refcount, and calling suspend directly from the driver would defeat the 
> > point of the system-wide recounting.
> 
> From the API docs:
> 
> "The action performed by a bus type's ->runtime_idle() callback is 
> totally dependent on the bus type in question, but the expected and 
> recommended action is to check if the device can be suspended (i.e. if 
> all of the conditions necessary for suspending the device are satisfied) 
> and to queue up a suspend request for the device in that case."
> 
> Though perhaps the device level runtime_idle shouldn't be void - that 
> way the bus can ask the driver whether its suspend conditions have been 
> satisfied? Right now there doesn't seem to be any way for the bus to ask 
> that.

If you want to get the device-level runtime_idle involved, you can make
_it_ responsible for scheduling the suspend.  Then the bus-level code
simply has to check whether everything is okay at the bus level, and if
it is, call the device-level routine.

However changing the return type wouldn't hurt anything, and it would 
allow the pm_schedule_suspend call to be centralized in the bus code.  
You could ask Rafael about it, or just send him a patch.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael Wysocki Aug. 14, 2009, 5:05 p.m. UTC | #5
On Friday 14 August 2009, Alan Stern wrote:
> On Thu, 13 Aug 2009, Matthew Garrett wrote:
> 
> > On Thu, Aug 13, 2009 at 11:22:44AM -0400, Alan Stern wrote:
> >
> > > You have to call the HCD's pci_suspend method!  Not to mention calling
> > > synchronize_irq and all the other stuff in hcd_pci_suspend and
> > > hcd_pci_suspend_noirq.
> >
> > The bus level code does this, assuming that the driver-level code
> > doesn't return an error.
> 
> So it does; my mistake.
> 
> 
> On Fri, 14 Aug 2009, Matthew Garrett wrote:
> 
> > On Thu, Aug 13, 2009 at 10:47:01PM +0100, Matthew Garrett wrote:
> > 
> > > Ugh. I'd really prefer us to assume that drivers are able to cope unless 
> > > proven otherwise. Userspace policy makes sense where we don't have any 
> > > idea whether something will work or not, but I'd really expect that most 
> > > PCI drivers will either cope (in which case they'll have enabling code) 
> > > or won't (in which case they won't). Why would we want userspace to 
> > > influence this?
> > 
> > Though, thinking about it, you're right that setting this does override 
> > user policy. I think we need an additional flag to indicate that the 
> > device supports runtime wakeup and test that as well when doing 
> > device_may_wakeup().
> 
> You are suggesting separate flag sets for system-wide wakeup and
> runtime wakeup?  I don't disagree, but implementing them will be
> problematical.
> 
> That's because it's not always possible to change a device's wakeup 
> setting while it is suspended.  Thus if a device was runtime suspended 
> with wakeup enabled, and then we want to do a system sleep and change 
> the device's wakeup setting to disabled, we would have to wake the 
> device back up in order to do it.
> 
> 
> > > > This misses the point.  The whole idea of runtime_idle is to tell you 
> > > > that the device is idle and might be ready to be suspended.  If you're 
> > > > going to call pm_schedule_suspend anyway, there's no reason to invoke 
> > > > pm->runtime_idle.
> > > 
> > > My understanding of the API was that pm_device_put() invokes 
> > > runtime_idle if the refcount hits 0. The bus layer has no idea of the 
> > > refcount, and calling suspend directly from the driver would defeat the 
> > > point of the system-wide recounting.
> > 
> > From the API docs:
> > 
> > "The action performed by a bus type's ->runtime_idle() callback is 
> > totally dependent on the bus type in question, but the expected and 
> > recommended action is to check if the device can be suspended (i.e. if 
> > all of the conditions necessary for suspending the device are satisfied) 
> > and to queue up a suspend request for the device in that case."
> > 
> > Though perhaps the device level runtime_idle shouldn't be void - that 
> > way the bus can ask the driver whether its suspend conditions have been 
> > satisfied? Right now there doesn't seem to be any way for the bus to ask 
> > that.
> 
> If you want to get the device-level runtime_idle involved, you can make
> _it_ responsible for scheduling the suspend.  Then the bus-level code
> simply has to check whether everything is okay at the bus level, and if
> it is, call the device-level routine.
> 
> However changing the return type wouldn't hurt anything, and it would 
> allow the pm_schedule_suspend call to be centralized in the bus code.  
> You could ask Rafael about it, or just send him a patch.

Well, I'm not against that, but what should pm_runtime_idle() do with the
result returned by it?  Just pass it to the caller?

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael Wysocki Aug. 14, 2009, 5:13 p.m. UTC | #6
On Friday 14 August 2009, Rafael J. Wysocki wrote:
> On Friday 14 August 2009, Alan Stern wrote:
> > On Thu, 13 Aug 2009, Matthew Garrett wrote:
> > 
> > > On Thu, Aug 13, 2009 at 11:22:44AM -0400, Alan Stern wrote:
...
> > > Though perhaps the device level runtime_idle shouldn't be void - that 
> > > way the bus can ask the driver whether its suspend conditions have been 
> > > satisfied? Right now there doesn't seem to be any way for the bus to ask 
> > > that.
> > 
> > If you want to get the device-level runtime_idle involved, you can make
> > _it_ responsible for scheduling the suspend.  Then the bus-level code
> > simply has to check whether everything is okay at the bus level, and if
> > it is, call the device-level routine.
> > 
> > However changing the return type wouldn't hurt anything, and it would 
> > allow the pm_schedule_suspend call to be centralized in the bus code.  
> > You could ask Rafael about it, or just send him a patch.
> 
> Well, I'm not against that, but what should pm_runtime_idle() do with the
> result returned by it?  Just pass it to the caller?

Hm, perhaps its better to ignore it, though.

Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Barnes Aug. 14, 2009, 5:37 p.m. UTC | #7
On Thu, 13 Aug 2009 01:29:25 +0100
Matthew Garrett <mjg59@srcf.ucam.org> wrote:

> I got a fixed BIOS from Dell and have been able to get this working
> now. It seems entirely happy with USB, but I'd like some sanity
> checks on whether I'm doing this correctly. There's certainly a
> couple of quirks related to setting the ACPI GPE type that would need
> a little bit of work in the ACPI layer, and it breaks ACPI-mediated
> PCI hotplug though that's easy enough to fix by just calling into the
> hotplug code from the core notifier.
> 
> This patch builds on top of Rafael's work on systemwide runtime power
> management. It supports suspending and resuming PCI devices at
> runtime, enabling platform wakeup events that allow the devices to
> automatically resume when appropriate. It currently requires platform
> support, but PCIe setups could be supported natively once native PCIe
> PME code has been added to the kernel.

PCI bits look pretty good to me, though Rafael should take a look too.
Card readers and firewire could benefit from similar treatment, maybe
that would get us to .5W territory on some machines.
Alan Stern Aug. 14, 2009, 7:01 p.m. UTC | #8
On Fri, 14 Aug 2009, Rafael J. Wysocki wrote:

> On Friday 14 August 2009, Rafael J. Wysocki wrote:
> > On Friday 14 August 2009, Alan Stern wrote:
> > > On Thu, 13 Aug 2009, Matthew Garrett wrote:
> > > 
> > > > On Thu, Aug 13, 2009 at 11:22:44AM -0400, Alan Stern wrote:
> ...
> > > > Though perhaps the device level runtime_idle shouldn't be void - that 
> > > > way the bus can ask the driver whether its suspend conditions have been 
> > > > satisfied? Right now there doesn't seem to be any way for the bus to ask 
> > > > that.
> > > 
> > > If you want to get the device-level runtime_idle involved, you can make
> > > _it_ responsible for scheduling the suspend.  Then the bus-level code
> > > simply has to check whether everything is okay at the bus level, and if
> > > it is, call the device-level routine.
> > > 
> > > However changing the return type wouldn't hurt anything, and it would 
> > > allow the pm_schedule_suspend call to be centralized in the bus code.  
> > > You could ask Rafael about it, or just send him a patch.
> > 
> > Well, I'm not against that, but what should pm_runtime_idle() do with the
> > result returned by it?  Just pass it to the caller?
> 
> Hm, perhaps its better to ignore it, though.

That's what I was going to say.  The return value is intended for use 
by bus-level code when calling a driver-level routine.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael Wysocki Aug. 14, 2009, 7:15 p.m. UTC | #9
On Friday 14 August 2009, Jesse Barnes wrote:
> On Thu, 13 Aug 2009 01:29:25 +0100
> Matthew Garrett <mjg59@srcf.ucam.org> wrote:
> 
> > I got a fixed BIOS from Dell and have been able to get this working
> > now. It seems entirely happy with USB, but I'd like some sanity
> > checks on whether I'm doing this correctly. There's certainly a
> > couple of quirks related to setting the ACPI GPE type that would need
> > a little bit of work in the ACPI layer, and it breaks ACPI-mediated
> > PCI hotplug though that's easy enough to fix by just calling into the
> > hotplug code from the core notifier.
> > 
> > This patch builds on top of Rafael's work on systemwide runtime power
> > management. It supports suspending and resuming PCI devices at
> > runtime, enabling platform wakeup events that allow the devices to
> > automatically resume when appropriate. It currently requires platform
> > support, but PCIe setups could be supported natively once native PCIe
> > PME code has been added to the kernel.
> 
> PCI bits look pretty good to me, though Rafael should take a look too.

I'm going to do that shortly.

> Card readers and firewire could benefit from similar treatment,

As well us network adapters.

> maybe that would get us to .5W territory on some machines.

Hopefully. :-)

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael Wysocki Aug. 14, 2009, 8:05 p.m. UTC | #10
On Friday 14 August 2009, Alan Stern wrote:
> On Thu, 13 Aug 2009, Matthew Garrett wrote:
> 
> > On Thu, Aug 13, 2009 at 11:22:44AM -0400, Alan Stern wrote:
> >
> > > You have to call the HCD's pci_suspend method!  Not to mention calling
> > > synchronize_irq and all the other stuff in hcd_pci_suspend and
> > > hcd_pci_suspend_noirq.
> >
> > The bus level code does this, assuming that the driver-level code
> > doesn't return an error.
> 
> So it does; my mistake.
> 
> 
> On Fri, 14 Aug 2009, Matthew Garrett wrote:
> 
> > On Thu, Aug 13, 2009 at 10:47:01PM +0100, Matthew Garrett wrote:
> > 
> > > Ugh. I'd really prefer us to assume that drivers are able to cope unless 
> > > proven otherwise. Userspace policy makes sense where we don't have any 
> > > idea whether something will work or not, but I'd really expect that most 
> > > PCI drivers will either cope (in which case they'll have enabling code) 
> > > or won't (in which case they won't). Why would we want userspace to 
> > > influence this?
> > 
> > Though, thinking about it, you're right that setting this does override 
> > user policy. I think we need an additional flag to indicate that the 
> > device supports runtime wakeup and test that as well when doing 
> > device_may_wakeup().
> 
> You are suggesting separate flag sets for system-wide wakeup and
> runtime wakeup?  I don't disagree, but implementing them will be
> problematical.
> 
> That's because it's not always possible to change a device's wakeup 
> setting while it is suspended.  Thus if a device was runtime suspended 
> with wakeup enabled, and then we want to do a system sleep and change 
> the device's wakeup setting to disabled, we would have to wake the 
> device back up in order to do it.

Well, sometimes the user may want a device to be power managed at run time
and not to be able to wake up the system from sleep states.  For example,
I'd like the USB controller in my box to be suspended at run time whenever it's
not used, but surely I wouldn't like it to do system-wide wakeup, because it
does that when I move the mouse which is a cordless one.  Simply turning the
mouse on causes the system to wake up. :-)

So, I don't think we have a choice here.  If the user forbids the device to
wake up the system from sleep states, we have to do what he wants, even if
that means the devices has to be resumed before putting the system into a
sleep state.

Why don't we add a flag indicating whether or not the device is allowed to
be power managed at run time, something like runtime_forbidden, that the
user space will be able to set through sysfs?

If the user space does that, we can't power manage the device at run time,
so it can't do runtime wakeup as well.  Otherwise, the device is power
manageable at run time, which implies runtime wakeup if supported.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael Wysocki Aug. 14, 2009, 9:22 p.m. UTC | #11
Hi,

First of all, thanks a lot for looking into this!

On Thursday 13 August 2009, Matthew Garrett wrote:
> I got a fixed BIOS from Dell and have been able to get this working now. 
> It seems entirely happy with USB, but I'd like some sanity checks on 
> whether I'm doing this correctly. There's certainly a couple of quirks 
> related to setting the ACPI GPE type that would need a little bit of 
> work in the ACPI layer, and it breaks ACPI-mediated PCI hotplug though 
> that's easy enough to fix by just calling into the hotplug code from the 
> core notifier.
> 
> This patch builds on top of Rafael's work on systemwide runtime power
> management. It supports suspending and resuming PCI devices at runtime,
> enabling platform wakeup events that allow the devices to automatically
> resume when appropriate. It currently requires platform support, but PCIe
> setups could be supported natively once native PCIe PME code has been added
> to the kernel.

Do you have any prototypes for that?  I started working on it some time ago,
but then I focused on the core runtime PM framework.

> ---
>  drivers/pci/pci-acpi.c   |   55 +++++++++++++++++++++++++
>  drivers/pci/pci-driver.c |  100 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/pci/pci.c        |   87 ++++++++++++++++++++++++++++++++++++++++
>  drivers/pci/pci.h        |    3 +
>  include/linux/pci.h      |    3 +
>  5 files changed, 248 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
> index ea15b05..a98a777 100644
> --- a/drivers/pci/pci-acpi.c
> +++ b/drivers/pci/pci-acpi.c
> @@ -12,6 +12,7 @@
>  #include <linux/pci.h>
>  #include <linux/module.h>
>  #include <linux/pci-aspm.h>
> +#include <linux/pm_runtime.h>
>  #include <acpi/acpi.h>
>  #include <acpi/acpi_bus.h>
>  
> @@ -120,14 +121,62 @@ static int acpi_pci_sleep_wake(struct pci_dev *dev, bool enable)
>  	return error;
>  }
>  
> +static int acpi_pci_runtime_wake(struct pci_dev *dev, bool enable)
> +{
> +	acpi_status status;
> +	acpi_handle handle = DEVICE_ACPI_HANDLE(&dev->dev);
> +	struct acpi_device *acpi_dev;
> +

Hm, I'd move that into ACPI as

int acp_runtime_wake_enable(acpi_handle handle, bool enable)

in which form it could also be useful to non-PCI devices.

> +	if (!handle)
> +		return -ENODEV;
> +
> +	status = acpi_bus_get_device(handle, &acpi_dev);
> +	if (ACPI_FAILURE(status))
> +		return -ENODEV;
> +
> +	if (enable) {
> +		acpi_set_gpe_type(acpi_dev->wakeup.gpe_device,
> +				  acpi_dev->wakeup.gpe_number,
> +				  ACPI_GPE_TYPE_WAKE_RUN);
> +		acpi_enable_gpe(acpi_dev->wakeup.gpe_device,
> +				acpi_dev->wakeup.gpe_number);
> +	} else {
> +		acpi_set_gpe_type(acpi_dev->wakeup.gpe_device,
> +				  acpi_dev->wakeup.gpe_number,
> +				  ACPI_GPE_TYPE_WAKE);
> +		acpi_disable_gpe(acpi_dev->wakeup.gpe_device,
> +				 acpi_dev->wakeup.gpe_number);
> +	}
> +	return 0;
> +}

Ah, that's the part I've always been missing!

How exactly do we figure out which GPE is a wake-up one for given device?
IOW, how are the wakeup.gpe_device and wakeup.gpe_number fields populated?

> +
> +
>  static struct pci_platform_pm_ops acpi_pci_platform_pm = {
>  	.is_manageable = acpi_pci_power_manageable,
>  	.set_state = acpi_pci_set_power_state,
>  	.choose_state = acpi_pci_choose_state,
>  	.can_wakeup = acpi_pci_can_wakeup,
>  	.sleep_wake = acpi_pci_sleep_wake,
> +	.runtime_wake = acpi_pci_runtime_wake,
>  };
>  
> +static void pci_device_notify(acpi_handle handle, u32 event, void *data)
> +{
> +	struct device *dev = data;
> +
> +	if (event == ACPI_NOTIFY_DEVICE_WAKE)
> +		pm_runtime_resume(dev);
> +}
> +
> +static void pci_root_bridge_notify(acpi_handle handle, u32 event, void *data)
> +{
> +	struct device *dev = data;
> +	struct pci_dev *pci_dev = to_pci_dev(dev);
> +
> +	if (event == ACPI_NOTIFY_DEVICE_WAKE)
> +		pci_bus_pme_event(pci_dev);
> +}
> +
>  /* ACPI bus type */
>  static int acpi_pci_find_device(struct device *dev, acpi_handle *handle)
>  {
> @@ -140,6 +189,9 @@ static int acpi_pci_find_device(struct device *dev, acpi_handle *handle)
>  	*handle = acpi_get_child(DEVICE_ACPI_HANDLE(dev->parent), addr);
>  	if (!*handle)
>  		return -ENODEV;
> +
> +	acpi_install_notify_handler(*handle, ACPI_SYSTEM_NOTIFY,
> +				    pci_device_notify, dev);
>  	return 0;
>  }
>  
> @@ -158,6 +210,9 @@ static int acpi_pci_find_root_bridge(struct device *dev, acpi_handle *handle)
>  	*handle = acpi_get_pci_rootbridge_handle(seg, bus);
>  	if (!*handle)
>  		return -ENODEV;
> +
> +	acpi_install_notify_handler(*handle, ACPI_SYSTEM_NOTIFY,
> +				    pci_root_bridge_notify, dev);
>  	return 0;
>  }
>  
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index d76c4c8..1f605d8 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -11,12 +11,14 @@
>  #include <linux/pci.h>
>  #include <linux/module.h>
>  #include <linux/init.h>
> +#include <linux/interrupt.h>
>  #include <linux/device.h>
>  #include <linux/mempolicy.h>
>  #include <linux/string.h>
>  #include <linux/slab.h>
>  #include <linux/sched.h>
>  #include <linux/cpu.h>
> +#include <linux/pm_runtime.h>
>  #include "pci.h"
>  
>  /*
> @@ -910,6 +912,101 @@ static int pci_pm_restore(struct device *dev)
>  
>  #endif /* !CONFIG_HIBERNATION */
>  
> +#ifdef CONFIG_PM_RUNTIME
> +
> +static int pci_pm_runtime_suspend(struct device *dev)
> +{
> +	struct pci_dev *pci_dev = to_pci_dev(dev);
> +	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +	int error;
> +
> +	device_set_wakeup_enable(dev, 1);
> +	error = pci_enable_runtime_wake(pci_dev, true);
> +
> +	if (error)
> +		return -EBUSY;
> +
> +	if (pm && pm->runtime_suspend)
> +		error = pm->runtime_suspend(dev);
> +
> +	if (error)
> +		goto out;
> +
> +	error = pci_pm_suspend(dev);

This has a chance to be confusing IMO.  pci_pm_suspend() calls the driver's
->suspend() routine, which is specific to suspend to RAM.  So, this means
that drivers are supposed to implement ->runtime_suspend() only if they
want to do something _in_ _addition_ to the things done by
->suspend() and ->suspend_noirq().

> +
> +	if (error)
> +		goto resume;
> +
> +	disable_irq(pci_dev->irq);

I don't really think it's necessary to disable the interrupt here.  We prevent
drivers from receiving interrupts while pci_pm_suspend_noirq() is being run
during system-wide power transitions to protect them from receiving "alien"
interrupts they might be unable to handle, but in the runtime case I think the
driver should take care of protecting itself from that.

The generic part of pci_pm_suspend_noirq() doesn't do things that require
interrupts to be disabled.  The driver's ->suspend_noirq() may do such things
in principle, but then I'd prefer it not to be called directly from here.

IMO pci_pm_runtime_suspend() should work like a combination of
pci_pm_suspend() and pci_pm_suspend_noirq(), except that instead of executing
two driver callbacks it will only call one ->runtime_suspend() callback without
disabling the device interrupt.

If the driver has to disable its interrupt in ->runtime_suspend(), it should do
that by itself.

Of course, analogous comment apply to pci_pm_runtime_resume() below.

> +	error = pci_pm_suspend_noirq(dev);
> +	enable_irq(pci_dev->irq);
> +
> +	if (error)
> +		goto resume_noirq;
> +
> +	return 0;
> +
> +resume_noirq:
> +	disable_irq(pci_dev->irq);
> +	pci_pm_resume_noirq(dev);
> +	enable_irq(pci_dev->irq);
> +resume:
> +	pci_pm_resume(dev);
> +out:
> +	pci_enable_runtime_wake(pci_dev, false);
> +	return error;
> +}
> +
> +static int pci_pm_runtime_resume(struct device *dev)
> +{
> +	struct pci_dev *pci_dev = to_pci_dev(dev);
> +	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +	int error = 0;
> +
> +	disable_irq(pci_dev->irq);
> +	error = pci_pm_resume_noirq(dev);
> +	enable_irq(pci_dev->irq);
> +
> +	if (error)
> +		return error;
> +
> +	error = pci_pm_resume(dev);
> +
> +	if (error)
> +		return error;
> +
> +	if (pm->runtime_resume)
> +		error = pm->runtime_resume(dev);
> +
> +	if (error)
> +		return error;
> +
> +	error = pci_enable_runtime_wake(pci_dev, false);
> +
> +	if (error)
> +		return error;
> +
> +	return 0;
> +}
> +
> +static void pci_pm_runtime_idle(struct device *dev)
> +{
> +	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
> +
> +	if (pm && pm->runtime_idle)
> +		pm->runtime_idle(dev);
> +
> +	pm_schedule_suspend(dev, 0);
> +}

That has already been discussed.

> +
> +#else /* !CONFIG_PM_RUNTIME */
> +
> +#define pci_pm_runtime_suspend	NULL
> +#define pci_pm_runtime_resume	NULL
> +#define pci_pm_runtime_idle	NULL
> +
> +#endif
> +
>  struct dev_pm_ops pci_dev_pm_ops = {
>  	.prepare = pci_pm_prepare,
>  	.complete = pci_pm_complete,
> @@ -925,6 +1022,9 @@ struct dev_pm_ops pci_dev_pm_ops = {
>  	.thaw_noirq = pci_pm_thaw_noirq,
>  	.poweroff_noirq = pci_pm_poweroff_noirq,
>  	.restore_noirq = pci_pm_restore_noirq,
> +	.runtime_suspend = pci_pm_runtime_suspend,
> +	.runtime_resume = pci_pm_runtime_resume,
> +	.runtime_idle = pci_pm_runtime_idle,
>  };
>  
>  #define PCI_PM_OPS_PTR	(&pci_dev_pm_ops)
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index dbd0f94..ab3a116 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -18,6 +18,7 @@
>  #include <linux/log2.h>
>  #include <linux/pci-aspm.h>
>  #include <linux/pm_wakeup.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/interrupt.h>
>  #include <asm/dma.h>	/* isa_dma_bridge_buggy */
>  #include <linux/device.h>
> @@ -428,6 +429,12 @@ static inline int platform_pci_sleep_wake(struct pci_dev *dev, bool enable)
>  			pci_platform_pm->sleep_wake(dev, enable) : -ENODEV;
>  }
>  
> +static inline int platform_pci_runtime_wake(struct pci_dev *dev, bool enable)
> +{
> +	return pci_platform_pm ?
> +			pci_platform_pm->runtime_wake(dev, enable) : -ENODEV;
> +}
> +
>  /**
>   * pci_raw_set_power_state - Use PCI PM registers to set the power state of
>   *                           given PCI device
> @@ -1239,6 +1246,38 @@ int pci_enable_wake(struct pci_dev *dev, pci_power_t state, bool enable)
>  }
>  
>  /**
> + * pci_enable_runtime_wake - enable PCI device as runtime wakeup event source
> + * @dev: PCI device affected
> + * @enable: True to enable event generation; false to disable
> + *
> + * This enables the device as a runtime wakeup event source, or disables it.
> + * This typically requires platform support.
> + *
> + * RETURN VALUE:
> + * 0 is returned on success
> + * -EINVAL is returned if device is not supposed to wake up the system
> + * -ENODEV is returned if platform cannot support runtime PM on the device
> + */
> +int pci_enable_runtime_wake(struct pci_dev *dev, bool enable)
> +{
> +	int error = 0;
> +	bool pme_done = false;
> +
> +	if (!enable && platform_pci_can_wakeup(dev))
> +		error = platform_pci_runtime_wake(dev, false);
> +
> +	if (!enable || pci_pme_capable(dev, PCI_D3hot)) {
> +		pci_pme_active(dev, enable);
> +		pme_done = true;
> +	}

I don't really follow your intention here.  The condition means that PME is
going to be enabled unless 'enable' is set and the device is not capable
of generating PMEs.  However, if 'enable' is unset, we're still going to try
to enable the PME, even if the device can't generate it.  Shouldn't that
be

if (enable && pci_pme_capable(dev, PCI_D3hot)) ?

Also, that assumes the device is going to be put into D3_hot, but do we know
that for sure?

> +
> +	if (enable && platform_pci_can_wakeup(dev))
> +		error = platform_pci_runtime_wake(dev, true);
> +
> +	return pme_done ? 0 : error;
> +}

I have no comments to the part below.

> +/**
>   * pci_wake_from_d3 - enable/disable device to wake up from D3_hot or D3_cold
>   * @dev: PCI device to prepare
>   * @enable: True to enable wake-up event generation; false to disable
> @@ -1346,6 +1385,54 @@ int pci_back_from_sleep(struct pci_dev *dev)
>  }
>  
>  /**
> + * pci_dev_pme_event - check if a device has a pending pme
> + *
> + * @dev: Device to handle.
> + */
> +
> +int pci_dev_pme_event(struct pci_dev *dev)
> +{
> +	u16 pmcsr;
> +
> +	if (!dev->pm_cap)
> +		return -ENODEV;
> +
> +	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> +
> +	if (pmcsr & PCI_PM_CTRL_PME_STATUS) {
> +		pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);
> +		pm_runtime_get(&dev->dev);
> +		return 0;
> +	}
> +
> +	return -ENODEV;
> +}
> +
> +/**
> + * pci_bus_pme_event - search for subordinate devices with a pending
> + *		   pme and handle them
> + *
> + * @dev: Parent device to handle
> + */
> +int pci_bus_pme_event(struct pci_dev *dev)
> +{
> +	struct pci_bus *bus;
> +	struct pci_dev *pdev;
> +
> +	if (pci_is_root_bus(dev->bus))
> +		bus = dev->bus;
> +	else if (dev->subordinate)
> +		bus = dev->subordinate;
> +	else
> +		return -ENODEV;
> +
> +	list_for_each_entry(pdev, &bus->devices, bus_list)
> +		pci_dev_pme_event(pdev);
> +
> +	return 0;
> +}
> +
> +/**
>   * pci_pm_init - Initialize PM functions of given PCI device
>   * @dev: PCI device to handle.
>   */
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index f73bcbe..a81aff2 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -34,6 +34,8 @@ extern int pci_mmap_fits(struct pci_dev *pdev, int resno,
>   *
>   * @sleep_wake: enables/disables the system wake up capability of given device
>   *
> + * @runtime_wake: enables/disables the runtime wakeup capability of given device
> + *
>   * If given platform is generally capable of power managing PCI devices, all of
>   * these callbacks are mandatory.
>   */
> @@ -43,6 +45,7 @@ struct pci_platform_pm_ops {
>  	pci_power_t (*choose_state)(struct pci_dev *dev);
>  	bool (*can_wakeup)(struct pci_dev *dev);
>  	int (*sleep_wake)(struct pci_dev *dev, bool enable);
> +	int (*runtime_wake)(struct pci_dev *dev, bool enable);
>  };
>  
>  extern int pci_set_platform_pm(struct pci_platform_pm_ops *ops);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 115fb7b..8a3fea0 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -734,10 +734,13 @@ pci_power_t pci_choose_state(struct pci_dev *dev, pm_message_t state);
>  bool pci_pme_capable(struct pci_dev *dev, pci_power_t state);
>  void pci_pme_active(struct pci_dev *dev, bool enable);
>  int pci_enable_wake(struct pci_dev *dev, pci_power_t state, bool enable);
> +int pci_enable_runtime_wake(struct pci_dev *dev, bool enable);
>  int pci_wake_from_d3(struct pci_dev *dev, bool enable);
>  pci_power_t pci_target_state(struct pci_dev *dev);
>  int pci_prepare_to_sleep(struct pci_dev *dev);
>  int pci_back_from_sleep(struct pci_dev *dev);
> +int pci_dev_pme_event(struct pci_dev *dev);
> +int pci_bus_pme_event(struct pci_dev *dev);
>  
>  /* Functions for PCI Hotplug drivers to use */
>  int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap);

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Garrett Aug. 14, 2009, 10:21 p.m. UTC | #12
On Fri, Aug 14, 2009 at 10:05:19PM +0200, Rafael J. Wysocki wrote:

> Well, sometimes the user may want a device to be power managed at run time
> and not to be able to wake up the system from sleep states.  For example,
> I'd like the USB controller in my box to be suspended at run time whenever it's
> not used, but surely I wouldn't like it to do system-wide wakeup, because it
> does that when I move the mouse which is a cordless one.  Simply turning the
> mouse on causes the system to wake up. :-)

Right, so clearly my code is broken right now.

> Why don't we add a flag indicating whether or not the device is allowed to
> be power managed at run time, something like runtime_forbidden, that the
> user space will be able to set through sysfs?

I think even having a runtime_wakeup flag (which defaults to on) would 
be sufficient. If the worst case is scenario is that we have to resume 
devices in order to apply the correct policy when going into a 
systemwide suspend state, I think that's acceptable.
Matthew Garrett Aug. 14, 2009, 10:30 p.m. UTC | #13
On Fri, Aug 14, 2009 at 11:22:27PM +0200, Rafael J. Wysocki wrote:

> Do you have any prototypes for that?  I started working on it some time ago,
> but then I focused on the core runtime PM framework.

The native PCIe PME code? There's some in the final patchset at 
http://bugzilla.kernel.org/show_bug.cgi?id=6892 but I haven't had time 
to look into merging that into the current kernel. I also don't have 
anything to test against, which makes life more awkward.

> > +static int acpi_pci_runtime_wake(struct pci_dev *dev, bool enable)
> > +{
> > +	acpi_status status;
> > +	acpi_handle handle = DEVICE_ACPI_HANDLE(&dev->dev);
> > +	struct acpi_device *acpi_dev;
> > +
> 
> Hm, I'd move that into ACPI as
> 
> int acp_runtime_wake_enable(acpi_handle handle, bool enable)
> 
> in which form it could also be useful to non-PCI devices.

Hm. Yeah, that's not too bad an idea.

> > +		acpi_disable_gpe(acpi_dev->wakeup.gpe_device,
> > +				 acpi_dev->wakeup.gpe_number);
> > +	}
> > +	return 0;
> > +}
> 
> Ah, that's the part I've always been missing!
> 
> How exactly do we figure out which GPE is a wake-up one for given device?
> IOW, how are the wakeup.gpe_device and wakeup.gpe_number fields populated?

There's a field in the ACPI device definition in the DSDT that defines 
the needed GPE and which runlevels it can resume from.

> > +	error = pci_pm_suspend(dev);
> 
> This has a chance to be confusing IMO.  pci_pm_suspend() calls the driver's
> ->suspend() routine, which is specific to suspend to RAM.  So, this means
> that drivers are supposed to implement ->runtime_suspend() only if they
> want to do something _in_ _addition_ to the things done by
> ->suspend() and ->suspend_noirq().

Yes, that was how I'd planned it. An alternative would be for 
runtime_suspend to return a negative value if there's an error, 0 if the 
bus code should continue or a positive value if the runtime_suspend() 
call handles all of it and the bus code should just return immediately?

> > +	disable_irq(pci_dev->irq);
> 
> I don't really think it's necessary to disable the interrupt here.  We prevent
> drivers from receiving interrupts while pci_pm_suspend_noirq() is being run
> during system-wide power transitions to protect them from receiving "alien"
> interrupts they might be unable to handle, but in the runtime case I think the
> driver should take care of protecting itself from that.

That sounds fine. I didn't want to take a risk in that respect, but if 
we should be safe here I can just drop that.

> > +	if (!enable || pci_pme_capable(dev, PCI_D3hot)) {
> > +		pci_pme_active(dev, enable);
> > +		pme_done = true;
> > +	}
> 
> I don't really follow your intention here.  The condition means that PME is
> going to be enabled unless 'enable' is set and the device is not capable
> of generating PMEs.  However, if 'enable' is unset, we're still going to try
> to enable the PME, even if the device can't generate it.  Shouldn't that
> be

Hmm. That was copied from pci_enable_wake() just above, but it does seem 
a little bit odd. I suspect that that needs some clarification as well.

> Also, that assumes the device is going to be put into D3_hot, but do we know
> that for sure?

I'd be surprised if there's any hardware that supports wakeups from D2 
but not D3hot, so I just kept the code simple for now.
Rafael Wysocki Aug. 15, 2009, 2:18 p.m. UTC | #14
On Saturday 15 August 2009, Matthew Garrett wrote:
> On Fri, Aug 14, 2009 at 10:05:19PM +0200, Rafael J. Wysocki wrote:
> 
> > Well, sometimes the user may want a device to be power managed at run time
> > and not to be able to wake up the system from sleep states.  For example,
> > I'd like the USB controller in my box to be suspended at run time whenever it's
> > not used, but surely I wouldn't like it to do system-wide wakeup, because it
> > does that when I move the mouse which is a cordless one.  Simply turning the
> > mouse on causes the system to wake up. :-)
> 
> Right, so clearly my code is broken right now.
> 
> > Why don't we add a flag indicating whether or not the device is allowed to
> > be power managed at run time, something like runtime_forbidden, that the
> > user space will be able to set through sysfs?
> 
> I think even having a runtime_wakeup flag (which defaults to on) would 
> be sufficient.

Perhaps it would, but then unsetting runtime_wakeup would effectively disable
runtime PM for devices that need it to be power managed at run time (probably
all input devices).  Also there may be situations in which user space may
really want to disable runtime PM for some devices (think of broken hardware
for one example).

> If the worst case is scenario is that we have to resume 
> devices in order to apply the correct policy when going into a 
> systemwide suspend state, I think that's acceptable.

I agree.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael Wysocki Aug. 15, 2009, 2:41 p.m. UTC | #15
On Saturday 15 August 2009, Matthew Garrett wrote:
> On Fri, Aug 14, 2009 at 11:22:27PM +0200, Rafael J. Wysocki wrote:
> 
> > Do you have any prototypes for that?  I started working on it some time ago,
> > but then I focused on the core runtime PM framework.
> 
> The native PCIe PME code?

Yes.

> There's some in the final patchset at 
> http://bugzilla.kernel.org/show_bug.cgi?id=6892

Well, I don't like that very much.

> but I haven't had time to look into merging that into the current kernel.
> I also don't have anything to test against, which makes life more awkward.

One of my AMD-based boxes should be suitable for that.

> > > +static int acpi_pci_runtime_wake(struct pci_dev *dev, bool enable)
> > > +{
> > > +	acpi_status status;
> > > +	acpi_handle handle = DEVICE_ACPI_HANDLE(&dev->dev);
> > > +	struct acpi_device *acpi_dev;
> > > +
> > 
> > Hm, I'd move that into ACPI as
> > 
> > int acp_runtime_wake_enable(acpi_handle handle, bool enable)
> > 
> > in which form it could also be useful to non-PCI devices.
> 
> Hm. Yeah, that's not too bad an idea.
> 
> > > +		acpi_disable_gpe(acpi_dev->wakeup.gpe_device,
> > > +				 acpi_dev->wakeup.gpe_number);
> > > +	}
> > > +	return 0;
> > > +}
> > 
> > Ah, that's the part I've always been missing!
> > 
> > How exactly do we figure out which GPE is a wake-up one for given device?
> > IOW, how are the wakeup.gpe_device and wakeup.gpe_number fields populated?
> 
> There's a field in the ACPI device definition in the DSDT that defines 
> the needed GPE and which runlevels it can resume from.
> 
> > > +	error = pci_pm_suspend(dev);
> > 
> > This has a chance to be confusing IMO.  pci_pm_suspend() calls the driver's
> > ->suspend() routine, which is specific to suspend to RAM.  So, this means
> > that drivers are supposed to implement ->runtime_suspend() only if they
> > want to do something _in_ _addition_ to the things done by
> > ->suspend() and ->suspend_noirq().
> 
> Yes, that was how I'd planned it. An alternative would be for 
> runtime_suspend to return a negative value if there's an error, 0 if the 
> bus code should continue or a positive value if the runtime_suspend() 
> call handles all of it and the bus code should just return immediately?

I just don't think that the existing suspend-resume callbacks are suitable for
runtime PM.

For example, in the majority of cases the existing suspend callbacks put
devices into D3_hot (or into the deepest low power state allowed by the
platform and hardware), but that means 10 ms latency, also for the resume
part.  Do we want that for runtime PM for all drivers?

Perhaps a more suitable model would be to put devices into D1 first, if
available, and then put them into D2 and D3 after specific delays?  Currently
the core framework doesn't provide any tools for that, but it may be worth
extending it for this purpose.

Also, I think it should be impossible to use the "legacy" callbacks for runtime
PM.  They surely are not designed with that in mind.

> > > +	disable_irq(pci_dev->irq);
> > 
> > I don't really think it's necessary to disable the interrupt here.  We prevent
> > drivers from receiving interrupts while pci_pm_suspend_noirq() is being run
> > during system-wide power transitions to protect them from receiving "alien"
> > interrupts they might be unable to handle, but in the runtime case I think the
> > driver should take care of protecting itself from that.
> 
> That sounds fine. I didn't want to take a risk in that respect, but if 
> we should be safe here I can just drop that.

As far as the PCI PM core is concerned, we should.

> > > +	if (!enable || pci_pme_capable(dev, PCI_D3hot)) {
> > > +		pci_pme_active(dev, enable);
> > > +		pme_done = true;
> > > +	}
> > 
> > I don't really follow your intention here.  The condition means that PME is
> > going to be enabled unless 'enable' is set and the device is not capable
> > of generating PMEs.  However, if 'enable' is unset, we're still going to try
> > to enable the PME, even if the device can't generate it.  Shouldn't that
> > be
> 
> Hmm. That was copied from pci_enable_wake() just above, but it does seem 
> a little bit odd. I suspect that that needs some clarification as well.

Ah, OK.  If 'enabled' is unset, we want to disable the PME for all states, so
we call pci_pm_active(dev, false) unconditionally in that case.  If 'enable' is
set, we only want to enable the PME if it's supported in given state.  Sorry
for the noise.

> > Also, that assumes the device is going to be put into D3_hot, but do we know
> > that for sure?
> 
> I'd be surprised if there's any hardware that supports wakeups from D2 
> but not D3hot, so I just kept the code simple for now.

OK

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael Wysocki Aug. 15, 2009, 3:24 p.m. UTC | #16
On Saturday 15 August 2009, Rafael J. Wysocki wrote:
> On Saturday 15 August 2009, Matthew Garrett wrote:
> > On Fri, Aug 14, 2009 at 11:22:27PM +0200, Rafael J. Wysocki wrote:
...
> > > > +	error = pci_pm_suspend(dev);
> > > 
> > > This has a chance to be confusing IMO.  pci_pm_suspend() calls the driver's
> > > ->suspend() routine, which is specific to suspend to RAM.  So, this means
> > > that drivers are supposed to implement ->runtime_suspend() only if they
> > > want to do something _in_ _addition_ to the things done by
> > > ->suspend() and ->suspend_noirq().
> > 
> > Yes, that was how I'd planned it. An alternative would be for 
> > runtime_suspend to return a negative value if there's an error, 0 if the 
> > bus code should continue or a positive value if the runtime_suspend() 
> > call handles all of it and the bus code should just return immediately?
> 
> I just don't think that the existing suspend-resume callbacks are suitable for
> runtime PM.
> 
> For example, in the majority of cases the existing suspend callbacks put
> devices into D3_hot (or into the deepest low power state allowed by the
> platform and hardware), but that means 10 ms latency, also for the resume
> part.  Do we want that for runtime PM for all drivers?
> 
> Perhaps a more suitable model would be to put devices into D1 first, if
> available, and then put them into D2 and D3 after specific delays?  Currently
> the core framework doesn't provide any tools for that, but it may be worth
> extending it for this purpose.
> 
> Also, I think it should be impossible to use the "legacy" callbacks for runtime
> PM.  They surely are not designed with that in mind.

To be more specific, let's go through pci_pm_suspend() and
pci_pm_suspend_noirq() and see what parts of these are useful for runtime PM.

pci_legacy_suspend() shouldn't be used at run time IMO, as I said above.

For runtime PM we shouldn't even continue if pm is NULL.  There's nothing
like "default runtime PM", either the driver supports it, or not.  We also
should require the callback to be implemented and IMO that should be
->runtime_suspend().

Of course we need to invoke the callback and call pci_fixup_device() after
that.

pci_legacy_suspend_late() shouldn't be used at run time.

Since I think we should require pm to be not NULL for runtime PM, the second
block in pci_pm_suspend_noirq() is redundant in that case.

Now, I don't think we need the _noirq callback for runtime PM, because
we've just executed the "regular" callback and I bet there are no devices
requiring additional driver-specific operations after pci_fixup_device().
So ->runtime_suspend() should be sufficient.

The remaining part of pci_pm_suspend_noirq() is useful, so it can be executed
by pci_pm_runtime_suspend() directly.

Thus we get

static int pci_pm_runtime_suspend(struct device *dev)
{
	struct pci_dev *pci_dev = to_pci_dev(dev);
	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
	int error;

	if (!pm || !pm->runtime_suspend)
		return -ENOSYS;

	pci_dev->state_saved = false;

	error = pm->runtime_suspend(dev);
	suspend_report_result(pm->runtime_suspend, error);
	if (error)
		return error;

	pci_fixup_device(pci_fixup_suspend, pci_dev);

	if (!pci_dev->state_saved && pci_dev->current_state != PCI_D0
	    && pci_dev->current_state != PCI_UNKNOWN) {
		WARN_ONCE(pci_dev->current_state != prev,
				"PCI PM: State of device not saved by %pF\n",
				pm->runtime_suspend);
		return 0;
	}

	if (!pci_dev->state_saved) {
		pci_save_state(pci_dev);
		if (!pci_is_bridge(pci_dev))
			pci_prepare_to_sleep(pci_dev);
	}

	pci_pm_set_unknown_state(pci_dev);

	return 0;
}

Now, if the driver has a universal suspend routine, like for example r8169,
it only needs to point .runtime_suspend to that one and it should work.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Aug. 15, 2009, 3:53 p.m. UTC | #17
On Sat, 15 Aug 2009, Rafael J. Wysocki wrote:

> > > Why don't we add a flag indicating whether or not the device is allowed to
> > > be power managed at run time, something like runtime_forbidden, that the
> > > user space will be able to set through sysfs?
> > 
> > I think even having a runtime_wakeup flag (which defaults to on) would 
> > be sufficient.
> 
> Perhaps it would, but then unsetting runtime_wakeup would effectively disable
> runtime PM for devices that need it to be power managed at run time (probably
> all input devices).  Also there may be situations in which user space may
> really want to disable runtime PM for some devices (think of broken hardware
> for one example).

It sounds like there are really three choices here, and the decision 
should largely be left up to the user:

	1. don't use runtime PM,

	2. allow runtime PM but disable remote wakeup,

	3. allow runtime PM with remote wakeup enabled.

Now, a driver may say "I can't do my job without remote wakeup".  Such
a driver would refuse to do runtime_suspend in case 2.  But otherwise
we should follow the preference of the user.

The only remaining question is how to expose this in sysfs in a way 
that won't be confusing and that won't be confused with the "wakeup" 
attribute.  One possibility is to use the "level" attribute introduced 
in USB; possible levels are "on" (no runtime PM) and "auto" (runtime 
PM allowed).  Then a new "runtime_wakeup" attribute could contain 
nothing (if wakeup is not available), "enabled", or "disabled".

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael Wysocki Aug. 15, 2009, 8:54 p.m. UTC | #18
On Saturday 15 August 2009, Alan Stern wrote:
> On Sat, 15 Aug 2009, Rafael J. Wysocki wrote:
> 
> > > > Why don't we add a flag indicating whether or not the device is allowed to
> > > > be power managed at run time, something like runtime_forbidden, that the
> > > > user space will be able to set through sysfs?
> > > 
> > > I think even having a runtime_wakeup flag (which defaults to on) would 
> > > be sufficient.
> > 
> > Perhaps it would, but then unsetting runtime_wakeup would effectively disable
> > runtime PM for devices that need it to be power managed at run time (probably
> > all input devices).  Also there may be situations in which user space may
> > really want to disable runtime PM for some devices (think of broken hardware
> > for one example).
> 
> It sounds like there are really three choices here, and the decision 
> should largely be left up to the user:
> 
> 	1. don't use runtime PM,
> 
> 	2. allow runtime PM but disable remote wakeup,
> 
> 	3. allow runtime PM with remote wakeup enabled.
> 
> Now, a driver may say "I can't do my job without remote wakeup".  Such
> a driver would refuse to do runtime_suspend in case 2.  But otherwise
> we should follow the preference of the user.
> 
> The only remaining question is how to expose this in sysfs in a way 
> that won't be confusing and that won't be confused with the "wakeup" 
> attribute.  One possibility is to use the "level" attribute introduced 
> in USB; possible levels are "on" (no runtime PM) and "auto" (runtime 
> PM allowed).  Then a new "runtime_wakeup" attribute could contain 
> nothing (if wakeup is not available), "enabled", or "disabled".

That seems to require two flags.

runtime_forbidden - if unset, the driver decides whether or not to use runtime
  PM; that could be exposed through sysfs as 'runtime' under the 'power'
  subdirectory with the following values:
  * 'disabled' - runtime_forbidden is set by the user space
  * 'on' - runtime_forbidden is unset, runtime PM is used (disable_depth == 0)
  * 'off' - runtime_forbidden is unset, runtime PM is not used
  To set/unset the user space writes 'enabled'/'disabled' to it, respectively.
  The default is unset.

runtime_wakeup - if set, the device is allowed to do remote wakeup at run time
  That could be represented as 'runtime_wakeup' under 'power' with the
  following values:
  * no value (empty file) is 'runtime' is 'disabled'
  * 'enabled'
  * 'disabled'
  To set/unset the user space writes 'enabled'/'disabled' to it, respectively.
  The default is set.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Garrett Aug. 15, 2009, 8:58 p.m. UTC | #19
On Sat, Aug 15, 2009 at 10:54:23PM +0200, Rafael J. Wysocki wrote:

> runtime_wakeup - if set, the device is allowed to do remote wakeup at run time
>   That could be represented as 'runtime_wakeup' under 'power' with the
>   following values:
>   * no value (empty file) is 'runtime' is 'disabled'
>   * 'enabled'
>   * 'disabled'
>   To set/unset the user space writes 'enabled'/'disabled' to it, respectively.
>   The default is set.

Why would you ever want runtime_wakeup to be false unless 
runtime_forbidden is true? Surely the point of runtime power management 
is to be transparent to the user, in which case remote wakeup is 
required?
Rafael Wysocki Aug. 15, 2009, 9:21 p.m. UTC | #20
On Saturday 15 August 2009, Matthew Garrett wrote:
> On Sat, Aug 15, 2009 at 10:54:23PM +0200, Rafael J. Wysocki wrote:
> 
> > runtime_wakeup - if set, the device is allowed to do remote wakeup at run time
> >   That could be represented as 'runtime_wakeup' under 'power' with the
> >   following values:
> >   * no value (empty file) is 'runtime' is 'disabled'
> >   * 'enabled'
> >   * 'disabled'
> >   To set/unset the user space writes 'enabled'/'disabled' to it, respectively.
> >   The default is set.
> 
> Why would you ever want runtime_wakeup to be false unless 
> runtime_forbidden is true? Surely the point of runtime power management 
> is to be transparent to the user, in which case remote wakeup is 
> required?

Well, this was exactly my point previously. :-)

Still, although for the majority of devices 'runtime_wakeup' disabled would
mean no runtime PM at all IMO, there may be devices that actually work without
remote wakeup, although they support it in general.

I can even imagine a scenario where this setting might be useful, like when
we don't want a network adapter to be woken up from the outside.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Garrett Aug. 15, 2009, 9:27 p.m. UTC | #21
On Sat, Aug 15, 2009 at 11:21:53PM +0200, Rafael J. Wysocki wrote:

> I can even imagine a scenario where this setting might be useful, like when
> we don't want a network adapter to be woken up from the outside.

I think in that case we'd probably just want the interface to be downed? 
Some of this is going to require device-specific policy, I think - for 
the network case we probably want something in between IF_RUNNING and 
IF_DOWN (IF_CARRIER, perhaps) that indicates that we want the PHY to be 
powered. Pushing this out to sysfs would mean we'd have a consistent 
interface but varying semantics, and I'm not convinced that's an 
especially helpful interface.
Rafael Wysocki Aug. 15, 2009, 9:44 p.m. UTC | #22
On Saturday 15 August 2009, Matthew Garrett wrote:
> On Sat, Aug 15, 2009 at 11:21:53PM +0200, Rafael J. Wysocki wrote:
> 
> > I can even imagine a scenario where this setting might be useful, like when
> > we don't want a network adapter to be woken up from the outside.
> 
> I think in that case we'd probably just want the interface to be downed? 
> Some of this is going to require device-specific policy, I think - for 
> the network case we probably want something in between IF_RUNNING and 
> IF_DOWN (IF_CARRIER, perhaps) that indicates that we want the PHY to be 
> powered. Pushing this out to sysfs would mean we'd have a consistent 
> interface but varying semantics, and I'm not convinced that's an 
> especially helpful interface.

I'm not disagreeing with that.

At this point I'd like to know the Alan's opinion.  I would gladly use the
'runtime_forbidden' flag only, but if we overlook something now, it's going
to be difficult to fix later.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Aug. 16, 2009, 3:50 p.m. UTC | #23
On Sat, 15 Aug 2009, Rafael J. Wysocki wrote:

> > The only remaining question is how to expose this in sysfs in a way 
> > that won't be confusing and that won't be confused with the "wakeup" 
> > attribute.  One possibility is to use the "level" attribute introduced 
> > in USB; possible levels are "on" (no runtime PM) and "auto" (runtime 
> > PM allowed).  Then a new "runtime_wakeup" attribute could contain 
> > nothing (if wakeup is not available), "enabled", or "disabled".
> 
> That seems to require two flags.
> 
> runtime_forbidden - if unset, the driver decides whether or not to use runtime
>   PM; that could be exposed through sysfs as 'runtime' under the 'power'
>   subdirectory with the following values:
>   * 'disabled' - runtime_forbidden is set by the user space
>   * 'on' - runtime_forbidden is unset, runtime PM is used (disable_depth == 0)
>   * 'off' - runtime_forbidden is unset, runtime PM is not used
>   To set/unset the user space writes 'enabled'/'disabled' to it, respectively.
>   The default is unset.

Too confusing.  The difference between "disabled" and "off" is so
subtle that even I don't understand it!

All we need for this is a simple on/off switch.  Or to be consistent
with the terms that are already exposed to userspace for USB devices:  
"auto"/"on", where "auto" means the system automatically changes power
levels (runtime PM is enabled) and "on" means the device is always on
(runtime PM is disabled).

And I don't like the "runtime_forbidden" name either.  It's long and
it's a negative, making it harder to understand.  ("off" means that the
device is permanently on!)  What's wrong with "level", as in
"power/level"?  That seems very intuitive.

> runtime_wakeup - if set, the device is allowed to do remote wakeup at run time
>   That could be represented as 'runtime_wakeup' under 'power' with the
>   following values:
>   * no value (empty file) is 'runtime' is 'disabled'
>   * 'enabled'
>   * 'disabled'
>   To set/unset the user space writes 'enabled'/'disabled' to it, respectively.
>   The default is set.

This is okay except for your definition of empty file.  This makes it
dependent on the power/level setting.  The two should be independent.  
Thus, empty file should mean that the device doesn't support remote
wakeup, in which case writes are ignored.

This scheme, like yours, requires two new bitflags.  We could call them
something like "may_runtime_suspend" and "may_runtime_wakeup".

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern Aug. 16, 2009, 3:57 p.m. UTC | #24
On Sat, 15 Aug 2009, Rafael J. Wysocki wrote:

> > Why would you ever want runtime_wakeup to be false unless 
> > runtime_forbidden is true? Surely the point of runtime power management 
> > is to be transparent to the user, in which case remote wakeup is 
> > required?

Matthew, what makes you think remote wakeup is required?  Lots of
power-manageable devices don't support it at all (consider disk drives
or display screens).

> Well, this was exactly my point previously. :-)
> 
> Still, although for the majority of devices 'runtime_wakeup' disabled would
> mean no runtime PM at all IMO, there may be devices that actually work without
> remote wakeup, although they support it in general.

That last part is quite true.  For example, we might suspend the device
whenever no process has opened the device file.  It would be a degraded 
form of power management, but better than nothing.

> I can even imagine a scenario where this setting might be useful, like when
> we don't want a network adapter to be woken up from the outside.

Or if the device's support for remote wakeup is broken.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Garrett Aug. 16, 2009, 4:04 p.m. UTC | #25
On Sun, Aug 16, 2009 at 11:57:53AM -0400, Alan Stern wrote:
> On Sat, 15 Aug 2009, Rafael J. Wysocki wrote:
> 
> > > Why would you ever want runtime_wakeup to be false unless 
> > > runtime_forbidden is true? Surely the point of runtime power management 
> > > is to be transparent to the user, in which case remote wakeup is 
> > > required?
> 
> Matthew, what makes you think remote wakeup is required?  Lots of
> power-manageable devices don't support it at all (consider disk drives
> or display screens).

Sorry, I meant in cases where remote wakeup is a sensible concept. For 
things like storage there's obviously no reason to require it, but for 
something like USB the absence of remote wakeup would effectively break 
the driver for most users.
Alan Stern Aug. 16, 2009, 4:09 p.m. UTC | #26
On Sat, 15 Aug 2009, Rafael J. Wysocki wrote:

> On Saturday 15 August 2009, Matthew Garrett wrote:
> > On Sat, Aug 15, 2009 at 11:21:53PM +0200, Rafael J. Wysocki wrote:
> > 
> > > I can even imagine a scenario where this setting might be useful, like when
> > > we don't want a network adapter to be woken up from the outside.
> > 
> > I think in that case we'd probably just want the interface to be downed? 
> > Some of this is going to require device-specific policy, I think - for 
> > the network case we probably want something in between IF_RUNNING and 
> > IF_DOWN (IF_CARRIER, perhaps) that indicates that we want the PHY to be 
> > powered. Pushing this out to sysfs would mean we'd have a consistent 
> > interface but varying semantics, and I'm not convinced that's an 
> > especially helpful interface.
> 
> I'm not disagreeing with that.
> 
> At this point I'd like to know the Alan's opinion.  I would gladly use the
> 'runtime_forbidden' flag only, but if we overlook something now, it's going
> to be difficult to fix later.

The notion of "remote wakeup" is more or less the same for all devices,
and it can effectively be abstracted into the core.  Other notions,
like IF_CARRIER, are more device-dependent.  They must be handled at
the bus or driver level, not in the core.

The real question is whether we should export a "may_runtime_wakeup"  
flag.  If we don't then we prevent the use of a degraded
power-management mode in devices with broken wakeup support.  Perhaps 
that's okay, I don't know.

Which reminds me...  In addition to these flags controlling what
settings should be enabled, maybe we should add a flag to record
whether or not remote wakeup was turned on when the device was last
suspended.  Although the core wouldn't use it, such a flag might be
very convenient for drivers.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index ea15b05..a98a777 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -12,6 +12,7 @@ 
 #include <linux/pci.h>
 #include <linux/module.h>
 #include <linux/pci-aspm.h>
+#include <linux/pm_runtime.h>
 #include <acpi/acpi.h>
 #include <acpi/acpi_bus.h>
 
@@ -120,14 +121,62 @@  static int acpi_pci_sleep_wake(struct pci_dev *dev, bool enable)
 	return error;
 }
 
+static int acpi_pci_runtime_wake(struct pci_dev *dev, bool enable)
+{
+	acpi_status status;
+	acpi_handle handle = DEVICE_ACPI_HANDLE(&dev->dev);
+	struct acpi_device *acpi_dev;
+
+	if (!handle)
+		return -ENODEV;
+
+	status = acpi_bus_get_device(handle, &acpi_dev);
+	if (ACPI_FAILURE(status))
+		return -ENODEV;
+
+	if (enable) {
+		acpi_set_gpe_type(acpi_dev->wakeup.gpe_device,
+				  acpi_dev->wakeup.gpe_number,
+				  ACPI_GPE_TYPE_WAKE_RUN);
+		acpi_enable_gpe(acpi_dev->wakeup.gpe_device,
+				acpi_dev->wakeup.gpe_number);
+	} else {
+		acpi_set_gpe_type(acpi_dev->wakeup.gpe_device,
+				  acpi_dev->wakeup.gpe_number,
+				  ACPI_GPE_TYPE_WAKE);
+		acpi_disable_gpe(acpi_dev->wakeup.gpe_device,
+				 acpi_dev->wakeup.gpe_number);
+	}
+	return 0;
+}
+
+
 static struct pci_platform_pm_ops acpi_pci_platform_pm = {
 	.is_manageable = acpi_pci_power_manageable,
 	.set_state = acpi_pci_set_power_state,
 	.choose_state = acpi_pci_choose_state,
 	.can_wakeup = acpi_pci_can_wakeup,
 	.sleep_wake = acpi_pci_sleep_wake,
+	.runtime_wake = acpi_pci_runtime_wake,
 };
 
+static void pci_device_notify(acpi_handle handle, u32 event, void *data)
+{
+	struct device *dev = data;
+
+	if (event == ACPI_NOTIFY_DEVICE_WAKE)
+		pm_runtime_resume(dev);
+}
+
+static void pci_root_bridge_notify(acpi_handle handle, u32 event, void *data)
+{
+	struct device *dev = data;
+	struct pci_dev *pci_dev = to_pci_dev(dev);
+
+	if (event == ACPI_NOTIFY_DEVICE_WAKE)
+		pci_bus_pme_event(pci_dev);
+}
+
 /* ACPI bus type */
 static int acpi_pci_find_device(struct device *dev, acpi_handle *handle)
 {
@@ -140,6 +189,9 @@  static int acpi_pci_find_device(struct device *dev, acpi_handle *handle)
 	*handle = acpi_get_child(DEVICE_ACPI_HANDLE(dev->parent), addr);
 	if (!*handle)
 		return -ENODEV;
+
+	acpi_install_notify_handler(*handle, ACPI_SYSTEM_NOTIFY,
+				    pci_device_notify, dev);
 	return 0;
 }
 
@@ -158,6 +210,9 @@  static int acpi_pci_find_root_bridge(struct device *dev, acpi_handle *handle)
 	*handle = acpi_get_pci_rootbridge_handle(seg, bus);
 	if (!*handle)
 		return -ENODEV;
+
+	acpi_install_notify_handler(*handle, ACPI_SYSTEM_NOTIFY,
+				    pci_root_bridge_notify, dev);
 	return 0;
 }
 
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index d76c4c8..1f605d8 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -11,12 +11,14 @@ 
 #include <linux/pci.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/interrupt.h>
 #include <linux/device.h>
 #include <linux/mempolicy.h>
 #include <linux/string.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/cpu.h>
+#include <linux/pm_runtime.h>
 #include "pci.h"
 
 /*
@@ -910,6 +912,101 @@  static int pci_pm_restore(struct device *dev)
 
 #endif /* !CONFIG_HIBERNATION */
 
+#ifdef CONFIG_PM_RUNTIME
+
+static int pci_pm_runtime_suspend(struct device *dev)
+{
+	struct pci_dev *pci_dev = to_pci_dev(dev);
+	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	int error;
+
+	device_set_wakeup_enable(dev, 1);
+	error = pci_enable_runtime_wake(pci_dev, true);
+
+	if (error)
+		return -EBUSY;
+
+	if (pm && pm->runtime_suspend)
+		error = pm->runtime_suspend(dev);
+
+	if (error)
+		goto out;
+
+	error = pci_pm_suspend(dev);
+
+	if (error)
+		goto resume;
+
+	disable_irq(pci_dev->irq);
+	error = pci_pm_suspend_noirq(dev);
+	enable_irq(pci_dev->irq);
+
+	if (error)
+		goto resume_noirq;
+
+	return 0;
+
+resume_noirq:
+	disable_irq(pci_dev->irq);
+	pci_pm_resume_noirq(dev);
+	enable_irq(pci_dev->irq);
+resume:
+	pci_pm_resume(dev);
+out:
+	pci_enable_runtime_wake(pci_dev, false);
+	return error;
+}
+
+static int pci_pm_runtime_resume(struct device *dev)
+{
+	struct pci_dev *pci_dev = to_pci_dev(dev);
+	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+	int error = 0;
+
+	disable_irq(pci_dev->irq);
+	error = pci_pm_resume_noirq(dev);
+	enable_irq(pci_dev->irq);
+
+	if (error)
+		return error;
+
+	error = pci_pm_resume(dev);
+
+	if (error)
+		return error;
+
+	if (pm->runtime_resume)
+		error = pm->runtime_resume(dev);
+
+	if (error)
+		return error;
+
+	error = pci_enable_runtime_wake(pci_dev, false);
+
+	if (error)
+		return error;
+
+	return 0;
+}
+
+static void pci_pm_runtime_idle(struct device *dev)
+{
+	struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
+
+	if (pm && pm->runtime_idle)
+		pm->runtime_idle(dev);
+
+	pm_schedule_suspend(dev, 0);
+}
+
+#else /* !CONFIG_PM_RUNTIME */
+
+#define pci_pm_runtime_suspend	NULL
+#define pci_pm_runtime_resume	NULL
+#define pci_pm_runtime_idle	NULL
+
+#endif
+
 struct dev_pm_ops pci_dev_pm_ops = {
 	.prepare = pci_pm_prepare,
 	.complete = pci_pm_complete,
@@ -925,6 +1022,9 @@  struct dev_pm_ops pci_dev_pm_ops = {
 	.thaw_noirq = pci_pm_thaw_noirq,
 	.poweroff_noirq = pci_pm_poweroff_noirq,
 	.restore_noirq = pci_pm_restore_noirq,
+	.runtime_suspend = pci_pm_runtime_suspend,
+	.runtime_resume = pci_pm_runtime_resume,
+	.runtime_idle = pci_pm_runtime_idle,
 };
 
 #define PCI_PM_OPS_PTR	(&pci_dev_pm_ops)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index dbd0f94..ab3a116 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -18,6 +18,7 @@ 
 #include <linux/log2.h>
 #include <linux/pci-aspm.h>
 #include <linux/pm_wakeup.h>
+#include <linux/pm_runtime.h>
 #include <linux/interrupt.h>
 #include <asm/dma.h>	/* isa_dma_bridge_buggy */
 #include <linux/device.h>
@@ -428,6 +429,12 @@  static inline int platform_pci_sleep_wake(struct pci_dev *dev, bool enable)
 			pci_platform_pm->sleep_wake(dev, enable) : -ENODEV;
 }
 
+static inline int platform_pci_runtime_wake(struct pci_dev *dev, bool enable)
+{
+	return pci_platform_pm ?
+			pci_platform_pm->runtime_wake(dev, enable) : -ENODEV;
+}
+
 /**
  * pci_raw_set_power_state - Use PCI PM registers to set the power state of
  *                           given PCI device
@@ -1239,6 +1246,38 @@  int pci_enable_wake(struct pci_dev *dev, pci_power_t state, bool enable)
 }
 
 /**
+ * pci_enable_runtime_wake - enable PCI device as runtime wakeup event source
+ * @dev: PCI device affected
+ * @enable: True to enable event generation; false to disable
+ *
+ * This enables the device as a runtime wakeup event source, or disables it.
+ * This typically requires platform support.
+ *
+ * RETURN VALUE:
+ * 0 is returned on success
+ * -EINVAL is returned if device is not supposed to wake up the system
+ * -ENODEV is returned if platform cannot support runtime PM on the device
+ */
+int pci_enable_runtime_wake(struct pci_dev *dev, bool enable)
+{
+	int error = 0;
+	bool pme_done = false;
+
+	if (!enable && platform_pci_can_wakeup(dev))
+		error = platform_pci_runtime_wake(dev, false);
+
+	if (!enable || pci_pme_capable(dev, PCI_D3hot)) {
+		pci_pme_active(dev, enable);
+		pme_done = true;
+	}
+
+	if (enable && platform_pci_can_wakeup(dev))
+		error = platform_pci_runtime_wake(dev, true);
+
+	return pme_done ? 0 : error;
+}
+
+/**
  * pci_wake_from_d3 - enable/disable device to wake up from D3_hot or D3_cold
  * @dev: PCI device to prepare
  * @enable: True to enable wake-up event generation; false to disable
@@ -1346,6 +1385,54 @@  int pci_back_from_sleep(struct pci_dev *dev)
 }
 
 /**
+ * pci_dev_pme_event - check if a device has a pending pme
+ *
+ * @dev: Device to handle.
+ */
+
+int pci_dev_pme_event(struct pci_dev *dev)
+{
+	u16 pmcsr;
+
+	if (!dev->pm_cap)
+		return -ENODEV;
+
+	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
+
+	if (pmcsr & PCI_PM_CTRL_PME_STATUS) {
+		pci_write_config_word(dev, dev->pm_cap + PCI_PM_CTRL, pmcsr);
+		pm_runtime_get(&dev->dev);
+		return 0;
+	}
+
+	return -ENODEV;
+}
+
+/**
+ * pci_bus_pme_event - search for subordinate devices with a pending
+ *		   pme and handle them
+ *
+ * @dev: Parent device to handle
+ */
+int pci_bus_pme_event(struct pci_dev *dev)
+{
+	struct pci_bus *bus;
+	struct pci_dev *pdev;
+
+	if (pci_is_root_bus(dev->bus))
+		bus = dev->bus;
+	else if (dev->subordinate)
+		bus = dev->subordinate;
+	else
+		return -ENODEV;
+
+	list_for_each_entry(pdev, &bus->devices, bus_list)
+		pci_dev_pme_event(pdev);
+
+	return 0;
+}
+
+/**
  * pci_pm_init - Initialize PM functions of given PCI device
  * @dev: PCI device to handle.
  */
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index f73bcbe..a81aff2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -34,6 +34,8 @@  extern int pci_mmap_fits(struct pci_dev *pdev, int resno,
  *
  * @sleep_wake: enables/disables the system wake up capability of given device
  *
+ * @runtime_wake: enables/disables the runtime wakeup capability of given device
+ *
  * If given platform is generally capable of power managing PCI devices, all of
  * these callbacks are mandatory.
  */
@@ -43,6 +45,7 @@  struct pci_platform_pm_ops {
 	pci_power_t (*choose_state)(struct pci_dev *dev);
 	bool (*can_wakeup)(struct pci_dev *dev);
 	int (*sleep_wake)(struct pci_dev *dev, bool enable);
+	int (*runtime_wake)(struct pci_dev *dev, bool enable);
 };
 
 extern int pci_set_platform_pm(struct pci_platform_pm_ops *ops);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 115fb7b..8a3fea0 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -734,10 +734,13 @@  pci_power_t pci_choose_state(struct pci_dev *dev, pm_message_t state);
 bool pci_pme_capable(struct pci_dev *dev, pci_power_t state);
 void pci_pme_active(struct pci_dev *dev, bool enable);
 int pci_enable_wake(struct pci_dev *dev, pci_power_t state, bool enable);
+int pci_enable_runtime_wake(struct pci_dev *dev, bool enable);
 int pci_wake_from_d3(struct pci_dev *dev, bool enable);
 pci_power_t pci_target_state(struct pci_dev *dev);
 int pci_prepare_to_sleep(struct pci_dev *dev);
 int pci_back_from_sleep(struct pci_dev *dev);
+int pci_dev_pme_event(struct pci_dev *dev);
+int pci_bus_pme_event(struct pci_dev *dev);
 
 /* Functions for PCI Hotplug drivers to use */
 int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap);