
PM/domains: add delayed power off capability

Message ID: 1363078159-17571-1-git-send-email-mkulkarni@nvidia.com (mailing list archive)
State: Changes Requested, archived

Commit Message

Mayuresh Kulkarni March 12, 2013, 8:49 a.m. UTC
- this commit adds the capability to delay the powering off
of a domain
- callers can use pm_genpd_set_poweroff_delay() to set the
power-off delay for a domain
- the delay is expected in milliseconds
- it also adds a pm_notifier per PM domain which cancels
the delayed power-off work when system suspend is invoked

Signed-off-by: Mayuresh Kulkarni <mkulkarni@nvidia.com>
---

- The idea here is to avoid powering off a domain if the use case will cause
it to turn back on within a specific time in the near future (see the usage
sketch below)
- This saves power and improves response by avoiding the context save/restore,
which is a heavy operation in most cases
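
For illustration, a platform would opt a domain into this delayed power off
roughly as follows (a minimal sketch: the "example" domain, its power_on/
power_off callbacks and the 100 ms value are made up; pm_genpd_init() is the
existing genpd registration call and pm_genpd_set_poweroff_delay() is the
helper added by this patch, taking the delay in milliseconds):

#include <linux/init.h>
#include <linux/pm_domain.h>

/* hypothetical platform callbacks for the example partition */
static int example_domain_power_off(struct generic_pm_domain *domain)
{
	/* save context and gate the partition here */
	return 0;
}

static int example_domain_power_on(struct generic_pm_domain *domain)
{
	/* ungate the partition and restore context here */
	return 0;
}

static struct generic_pm_domain example_domain = {
	.name		= "example",
	.power_off	= example_domain_power_off,
	.power_on	= example_domain_power_on,
};

static int __init example_domain_init(void)
{
	/* register the domain, initially powered off, no custom governor */
	pm_genpd_init(&example_domain, NULL, true);
	/* keep the partition powered for 100 ms after its last device idles */
	pm_genpd_set_poweroff_delay(&example_domain, 100);
	return 0;
}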

 drivers/base/power/domain.c | 84 ++++++++++++++++++++++++++++++++++++++++++---
 include/linux/pm_domain.h   | 14 ++++++++
 2 files changed, 93 insertions(+), 5 deletions(-)

Comments

Greg KH March 12, 2013, 1:49 p.m. UTC | #1
On Tue, Mar 12, 2013 at 02:19:19PM +0530, Mayuresh Kulkarni wrote:
> - this commit adds a capability to delay the powering off
> of the domain

Why?

> - callers can use pm_genpd_set_poweroff_delay to set the
> power off delay for a domain

Why?

> - it expects the delay in milli-seconds

Why?

> - it also adds a pm_notifier per pm domain which cancels
> the delayed power off work when system suspend is invoked

Why?

Come on, please tell us why something is needed and not what it does in
a changelog comment.

> Signed-off-by: Mayuresh Kulkarni <mkulkarni@nvidia.com>
> ---
> 
> - The idea here is to avoid the powering off of a domain if use cases causes
> it to turn on within specific time in coming future

What use case?

> - This seems to save power and better response by avoiding context save/restore
> which are heavy operations in most of the cases

Now these four lines are bit better, why are you hiding it down here?

Who is going to use this code?  Where is those callers?  We don't add
new infrastructure without in-kernel users, unless there is a very good
reason.

greg k-h
Mayuresh Kulkarni March 12, 2013, 2:56 p.m. UTC | #2
On Tuesday 12 March 2013 07:19 PM, Greg KH wrote:
> On Tue, Mar 12, 2013 at 02:19:19PM +0530, Mayuresh Kulkarni wrote:
>> - this commit adds a capability to delay the powering off
>> of the domain
>
> Why?
>

- As I mentioned below (sorry for mentioning it below rather than at the
start), powering off a domain means the context save needs to be done &&
it needs to be restored during power on.
- It is often observed that saving and restoring context is expensive in
terms of power consumed as well as overall system response.

>> - callers can use pm_genpd_set_poweroff_delay to set the
>> power off delay for a domain
>
> Why?
>

So, the idea is to delay powering off a domain until some time in the
future, with the anticipation that the domain will really be needed again
within that window (from now until the power-off delay expires).

>> - it expects the delay in milli-seconds
>
> Why?
>

For Tegra we found that a delay on the order of milliseconds works well in
terms of both power and system response.

>> - it also adds a pm_notifier per pm domain which cancels
>> the delayed power off work when system suspend is invoked
>
> Why?
>

- This patch uses pm_wq, which is freezable during system suspend.
- It is observed that sometimes work queued *before* system suspend is never
canceled before the suspend actually happens. As a result, pm_wq gets frozen
with the work still pending, and the work runs after the system resumes,
causing unwanted issues (like messed-up ref-counts etc.).
- So, I wanted a notification that system suspend has been invoked && that
arrives *before* the work queue is frozen. Hence, I used the pm_notifier,
which is executed before the work queues are frozen, to cancel any pending
work for a domain which implements delayed power off (see the sketch below).
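
For reference, this is the standard pm_notifier pattern; a minimal sketch of
the idea with made-up names (not the patch code itself):

#include <linux/notifier.h>
#include <linux/suspend.h>
#include <linux/workqueue.h>

static void example_power_off_fn(struct work_struct *work)
{
	/* actually power the domain off here */
}

static DECLARE_DELAYED_WORK(example_power_off_work, example_power_off_fn);

static int example_pm_notify(struct notifier_block *nb,
			     unsigned long event, void *unused)
{
	if (event == PM_SUSPEND_PREPARE) {
		/*
		 * PM_SUSPEND_PREPARE is delivered before the freezable
		 * workqueues (pm_wq included) are frozen, so a delayed
		 * power off queued earlier can still be cancelled here
		 * and the power off carried out synchronously instead.
		 */
		cancel_delayed_work_sync(&example_power_off_work);
		return NOTIFY_OK;
	}
	return NOTIFY_DONE;
}

static struct notifier_block example_pm_nb = {
	.notifier_call = example_pm_notify,
};

register_pm_notifier(&example_pm_nb) hooks this in at init time; the patch
below does the equivalent per generic_pm_domain.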

> Come on, please tell us why something is needed and not what it does in
> a changelog comment.
>
>> Signed-off-by: Mayuresh Kulkarni <mkulkarni@nvidia.com>
>> ---
>>
>> - The idea here is to avoid the powering off of a domain if use cases causes
>> it to turn on within specific time in coming future
>
> What use case?

- We use this concept in the downstream kernel for Tegra and have observed
that delayed power off reduces the number of on->off->on oscillations of a
domain.
- A typical use case is boot-Android-to-home-screen. The delay helps reduce
the 3d power partition toggling on->off->on when user space is waiting for
something other than 3d && needs to use 3d again after it gets that event.
- The delay used here has to be tuned by someone who is working on power
optimization for the various use cases.

>
>> - This seems to save power and better response by avoiding context save/restore
>> which are heavy operations in most of the cases
>
> Now these four lines are bit better, why are you hiding it down here?
>
> Who is going to use this code?  Where is those callers?  We don't add
> new infrastructure without in-kernel users, unless there is a very good
> reason.
>

- The downstream kernel has a graphics-host (host1x) driver which has a
home-grown power management framework called ACM (auto-clock-management).
We implement both delayed clock off and delayed power partition off in it.
- With the preliminary host1x driver going upstream, the next logical step
for us is to add power management feature(s) to it. However, the home-grown
ACM framework will not be acceptable upstream when runtime PM (for clock
management) & PM domains (for power partition management) are available to
do the same tasks.
- Runtime PM supports the auto-suspend concept, so we exploit it to
implement delayed clock off. However, PM domains have no concept of delayed
power off, hence this patch adds that functionality.
- Having said that, there is no in-kernel component that uses this API as
of now, but we anticipate that one will appear in the near future, at least
for Tegra (specifically the host1x driver).

> greg k-h
>

P.S.: All the above information is available in the Tegra TRM and at
http://nv-tegra.nvidia.com/gitweb/?p=linux-2.6.git;a=summary (public git
repository for NV-released kernels).
Greg KH March 12, 2013, 3:19 p.m. UTC | #3
On Tue, Mar 12, 2013 at 08:26:08PM +0530, Mayuresh Kulkarni wrote:
> On Tuesday 12 March 2013 07:19 PM, Greg KH wrote:
> >On Tue, Mar 12, 2013 at 02:19:19PM +0530, Mayuresh Kulkarni wrote:
> >>- this commit adds a capability to delay the powering off
> >>of the domain
> >
> >Why?
> >
> 
> - As I mentioned below (sorry for mentioning below rather than at
> start), powering off a domain means the context save needs to be
> done && it needs to be restored during power on.

<good stuff snipped>

All of this should be in the changelog section of the patch; please add it
there when you resend.

> - The downstream kernel has a graphics-host (host1x) driver which
> has a home grown power management framework called ACM
> (auto-clock-management). We implement both delayed clock and power
> partition off functionality in it.

We should wait for this driver to be submitted and accepted before
we can add these new api hooks.  When is that going to happen?

thanks,

greg k-h
Peter De Schrijver March 12, 2013, 3:24 p.m. UTC | #4
On Tue, Mar 12, 2013 at 04:19:01PM +0100, Greg KH wrote:
> On Tue, Mar 12, 2013 at 08:26:08PM +0530, Mayuresh Kulkarni wrote:
> > On Tuesday 12 March 2013 07:19 PM, Greg KH wrote:
> > >On Tue, Mar 12, 2013 at 02:19:19PM +0530, Mayuresh Kulkarni wrote:
> > >>- this commit adds a capability to delay the powering off
> > >>of the domain
> > >
> > >Why?
> > >
> > 
> > - As I mentioned below (sorry for mentioning below rather than at
> > start), powering off a domain means the context save needs to be
> > done && it needs to be restored during power on.
> 
> <good stuff snipped>
> 
> All of this should be in the changelog section of the patch, please,
> when you resend this, add it there.
> 
> > - The downstream kernel has a graphics-host (host1x) driver which
> > has a home grown power management framework called ACM
> > (auto-clock-management). We implement both delayed clock and power
> > partition off functionality in it.
> 
> We should wait for this driver to be submitted and accepted before
> we can add these new api hooks.  When is that going to happen?
> 

The basic host1x driver has been under review and rework for quite some
time now. See http://www.spinics.net/lists/linux-tegra/msg10128.html
for the latest version. The PM features are still missing though, but
will be submitted once the basic version is accepted.

Cheers,

Peter.
Greg KH March 12, 2013, 3:44 p.m. UTC | #5
On Tue, Mar 12, 2013 at 05:24:50PM +0200, Peter De Schrijver wrote:
> On Tue, Mar 12, 2013 at 04:19:01PM +0100, Greg KH wrote:
> > On Tue, Mar 12, 2013 at 08:26:08PM +0530, Mayuresh Kulkarni wrote:
> > > On Tuesday 12 March 2013 07:19 PM, Greg KH wrote:
> > > >On Tue, Mar 12, 2013 at 02:19:19PM +0530, Mayuresh Kulkarni wrote:
> > > >>- this commit adds a capability to delay the powering off
> > > >>of the domain
> > > >
> > > >Why?
> > > >
> > > 
> > > - As I mentioned below (sorry for mentioning below rather than at
> > > start), powering off a domain means the context save needs to be
> > > done && it needs to be restored during power on.
> > 
> > <good stuff snipped>
> > 
> > All of this should be in the changelog section of the patch, please,
> > when you resend this, add it there.
> > 
> > > - The downstream kernel has a graphics-host (host1x) driver which
> > > has a home grown power management framework called ACM
> > > (auto-clock-management). We implement both delayed clock and power
> > > partition off functionality in it.
> > 
> > We should wait for this driver to be submitted and accepted before
> > we can add these new api hooks.  When is that going to happen?
> > 
> 
> The basic host1x driver has been under review and rework for quite some
> time now. See http://www.spinics.net/lists/linux-tegra/msg10128.html
> for the latest version. The PM features are still missing though, but
> will be submitted once the basic version is accepted.

That sounds good to me, thanks.

greg k-h
Kevin Hilman March 13, 2013, 7:27 p.m. UTC | #6
Mayuresh Kulkarni <mkulkarni@nvidia.com> writes:

> On Tuesday 12 March 2013 07:19 PM, Greg KH wrote:
>> On Tue, Mar 12, 2013 at 02:19:19PM +0530, Mayuresh Kulkarni wrote:
>>> - this commit adds a capability to delay the powering off
>>> of the domain
>>
>> Why?
>>
>
> - As I mentioned below (sorry for mentioning below rather than at
> start), powering off a domain means the context save needs to be done
> && it needs to be restored during power on.
> - It often observed that saving and restoring of context is expensive
> in terms of power consumed as well as overall system response.

Are you talking about per-device context save/restore?

Your drivers should be using runtime PM, which has an autosuspend
timeout feature.  Using autosuspend on any device in the power domain
will have the same effect as this proposed patch.  
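
For comparison, the autosuspend setup being suggested here lives entirely in
the driver; a minimal sketch with made-up function names and an illustrative
100 ms delay (these are the standard runtime PM calls):

#include <linux/platform_device.h>
#include <linux/pm_runtime.h>

static int example_probe(struct platform_device *pdev)
{
	/* suspend the device 100 ms after its last use */
	pm_runtime_set_autosuspend_delay(&pdev->dev, 100);
	pm_runtime_use_autosuspend(&pdev->dev);
	pm_runtime_enable(&pdev->dev);
	return 0;
}

static void example_do_io(struct device *dev)
{
	pm_runtime_get_sync(dev);
	/* ... touch the hardware ... */
	pm_runtime_mark_last_busy(dev);
	pm_runtime_put_autosuspend(dev);
}

Once the last active device in the domain defers its runtime suspend this
way, the genpd power off is deferred by the same delay.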

Using runtime PM to target specific devices that have expensive
save/restore paths is also more understandable and maintainable IMO,
because the "fix" is targetted.

If instead you're talking about CPU context/save restore, you should be
coupling the powerdomain with CPUidle (genpd provides this for you) and
having the C-state latency/threshold values set properly so CPUidle will
not pick those states.

If instead, you're talking about per-powerdomain QoS, the genpd code
also provides pluggable governors based on per-device PM QoS that can be
used for making decisions about whether or not to actually cut power.

IOW, I'm not sure exactly which problem you're trying to solve, since it's
not described very thoroughly, but I'm pretty sure one of the existing
solutions should be used instead.

Kevin
Peter De Schrijver March 14, 2013, 8:59 a.m. UTC | #7
On Wed, Mar 13, 2013 at 08:27:02PM +0100, Kevin Hilman wrote:
> Mayuresh Kulkarni <mkulkarni@nvidia.com> writes:
> 
> > On Tuesday 12 March 2013 07:19 PM, Greg KH wrote:
> >> On Tue, Mar 12, 2013 at 02:19:19PM +0530, Mayuresh Kulkarni wrote:
> >>> - this commit adds a capability to delay the powering off
> >>> of the domain
> >>
> >> Why?
> >>
> >
> > - As I mentioned below (sorry for mentioning below rather than at
> > start), powering off a domain means the context save needs to be done
> > && it needs to be restored during power on.
> > - It often observed that saving and restoring of context is expensive
> > in terms of power consumed as well as overall system response.
> 
> Are you talking about per-device context save/restore?
> 
> Your drivers should be using runtime PM, which has an autosuspend
> timeout feature.  Using autosuspend on any device in the power domain
> will have the same effect as this proposed patch.  
> 
> Using runtime PM to target specific devices that have expensive
> save/restore paths is also more understandable and maintainable IMO,
> because the "fix" is targetted.
> 

No. The goal of this patch is to introduce something similar to autosuspend
for turning off domains. Even if all peripherals in a domain are idle, you
might not want to turn off the domain immediately, but wait a little in case
a new request comes in. We are using this scheme today in our GPU driver and
it works quite well, but that's done entirely without PM domains.

Cheers,

Peter.
Mayuresh Kulkarni March 14, 2013, 2:12 p.m. UTC | #8
On Thursday 14 March 2013 02:29 PM, Peter De Schrijver wrote:
> On Wed, Mar 13, 2013 at 08:27:02PM +0100, Kevin Hilman wrote:
>> Mayuresh Kulkarni <mkulkarni@nvidia.com> writes:
>>
>>> On Tuesday 12 March 2013 07:19 PM, Greg KH wrote:
>>>> On Tue, Mar 12, 2013 at 02:19:19PM +0530, Mayuresh Kulkarni wrote:
>>>>> - this commit adds a capability to delay the powering off
>>>>> of the domain
>>>>
>>>> Why?
>>>>
>>>
>>> - As I mentioned below (sorry for mentioning below rather than at
>>> start), powering off a domain means the context save needs to be done
>>> && it needs to be restored during power on.
>>> - It often observed that saving and restoring of context is expensive
>>> in terms of power consumed as well as overall system response.
>>
>> Are you talking about per-device context save/restore?
>>
>> Your drivers should be using runtime PM, which has an autosuspend
>> timeout feature.  Using autosuspend on any device in the power domain
>> will have the same effect as this proposed patch.
>>
>> Using runtime PM to target specific devices that have expensive
>> save/restore paths is also more understandable and maintainable IMO,
>> because the "fix" is targetted.
>>
>
> No. The goal of this patch is to introduce something similar to autosuspend
> for turning off domains. Even if all peripherals in a domain are idle, you
> might not want to turn off the domain immediately, but wait a little in case
> a new request comes in. We are using this scheme today in our GPU driver and
> it works quite well, but that's done entirely without PM domains.
>
> Cheers,
>
> Peter.
>

- As Peter mentioned above, the idea is to wait for some time before 
turning off a domain, with the anticipation that it will be needed again 
soon for a particular use case.
- My previous comments would sound valid if you imagine a domain that has 
a single device attached to it. In such a case, two things happen during 
power-on/off: context save/restore of the device and powering the domain 
on/off. Both of them need energy and possibly time to settle down 
(stabilize the domain power).
- If such a domain is going to be needed very soon in the future, it makes 
sense to avoid powering it down for at least that amount of time (which is 
what the proposed patch does).
Alan Stern March 14, 2013, 2:59 p.m. UTC | #9
On Thu, 14 Mar 2013, Peter De Schrijver wrote:

> > Your drivers should be using runtime PM, which has an autosuspend
> > timeout feature.  Using autosuspend on any device in the power domain
> > will have the same effect as this proposed patch.  
> > 
> > Using runtime PM to target specific devices that have expensive
> > save/restore paths is also more understandable and maintainable IMO,
> > because the "fix" is targetted.
> > 
> 
> No. The goal of this patch is to introduce something similar to autosuspend
> for turning off domains. Even if all peripherals in a domain are idle, you
> might not want to turn off the domain immediately, but wait a little in case
> a new request comes in. We are using this scheme today in our GPU driver and
> it works quite well, but that's done entirely without PM domains.

The real question here is whether it would be better to implement 
delayed suspend for PM domains or to use autosuspend for all the 
devices in those domains.

The implications of doing it one way rather than the other aren't 
entirely clear.  One thing: PM domains aren't exposed in sysfs, are 
they?  If not, there is no way for userspace to adjust the length of 
the delay.  That's a definite disadvantage.

Alan Stern

Kevin Hilman March 15, 2013, 6:04 p.m. UTC | #10
Mayuresh Kulkarni <mkulkarni@nvidia.com> writes:

> On Thursday 14 March 2013 02:29 PM, Peter De Schrijver wrote:
>> On Wed, Mar 13, 2013 at 08:27:02PM +0100, Kevin Hilman wrote:
>>> Mayuresh Kulkarni <mkulkarni@nvidia.com> writes:
>>>
>>>> On Tuesday 12 March 2013 07:19 PM, Greg KH wrote:
>>>>> On Tue, Mar 12, 2013 at 02:19:19PM +0530, Mayuresh Kulkarni wrote:
>>>>>> - this commit adds a capability to delay the powering off
>>>>>> of the domain
>>>>>
>>>>> Why?
>>>>>
>>>>
>>>> - As I mentioned below (sorry for mentioning below rather than at
>>>> start), powering off a domain means the context save needs to be done
>>>> && it needs to be restored during power on.
>>>> - It often observed that saving and restoring of context is expensive
>>>> in terms of power consumed as well as overall system response.
>>>
>>> Are you talking about per-device context save/restore?
>>>
>>> Your drivers should be using runtime PM, which has an autosuspend
>>> timeout feature.  Using autosuspend on any device in the power domain
>>> will have the same effect as this proposed patch.
>>>
>>> Using runtime PM to target specific devices that have expensive
>>> save/restore paths is also more understandable and maintainable IMO,
>>> because the "fix" is targetted.
>>>
>>
>> No. The goal of this patch is to introduce something similar to autosuspend
>> for turning off domains. Even if all peripherals in a domain are idle, you
>> might not want to turn off the domain immediately, but wait a little in case
>> a new request comes in. We are using this scheme today in our GPU driver and
>> it works quite well, but that's done entirely without PM domains.
>>
>> Cheers,
>>
>> Peter.
>>
>
> - As Peter mentioned above, the idea is to wait for some time before
> turning off a domain with the anticipation that it will be needed soon
> for a particular use case.
> - My previous comments would sound valid if you imagine a domain that
> has a single device attached to it. In such case, there are 2 things
> that happen during power-on/off: context save/restore of the device
> and powering on/off of a domain. Both of them needs energy and
> possibly time to settle down (stabilize the domain power).
> - If such a domain is going to be needed very soon in future, it makes
> sense to avoid its power down for at-least that much amount of time
> (which is what the proposed patch does).

...and is also what the pluggable governors inside genpd are meant to
allow you to do.

More specifically, what you said above: "if such a domain is going to be
needed very soon in the future" is just another way of saying that there
is a wakeup latency constraint.  Wakeup latency constraints are handled
by per-device PM QoS, which can be queried by a governor associated with
a genpd.

Kevin

Peter De Schrijver March 18, 2013, 10:07 a.m. UTC | #11
> > - If such a domain is going to be needed very soon in future, it makes
> > sense to avoid its power down for at-least that much amount of time
> > (which is what the proposed patch does).
> 
> ...and is also what the pluggable governors inside genpd are meant to
> allow you to do.
> 

Unfortunately this is rather awkward to implement in a genpd governor. The
governor only gets called when the genpd core wants to power off a domain,
and it can then only say yes or no. You could start a timer and, on expiry,
call into genpd and use a flag to indicate to the governor (which will be
called again) that it should now allow the power off, roughly as sketched
below.
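
Such a governor workaround could look like this (a sketch only, with made-up
names; it assumes the dev_power_governor interface with a power_down_ok()
callback, reuses the domain's existing power_off_work so the governor gets
called again, and glosses over locking/races):

#include <linux/jiffies.h>
#include <linux/pm_domain.h>
#include <linux/pm_runtime.h>
#include <linux/timer.h>
#include <linux/workqueue.h>

static bool example_allow_off;			/* hold-off expired? */
static struct timer_list example_timer;
static struct generic_pm_domain *example_genpd;

static void example_timer_fn(unsigned long data)
{
	/* hold-off expired: let the next power_down_ok() say yes */
	example_allow_off = true;
	queue_work(pm_wq, &example_genpd->power_off_work);
}

static bool example_power_down_ok(struct dev_pm_domain *pd)
{
	if (example_allow_off) {
		example_allow_off = false;
		return true;
	}
	/* refuse for now and re-ask ourselves after the hold-off */
	mod_timer(&example_timer, jiffies + msecs_to_jiffies(100));
	return false;
}

static struct dev_power_governor example_gov = {
	.power_down_ok	= example_power_down_ok,
};

static void example_gov_setup(struct generic_pm_domain *genpd)
{
	example_genpd = genpd;
	setup_timer(&example_timer, example_timer_fn, 0);
}

example_gov would then be passed as the gov argument to pm_genpd_init(),
which shows the extra plumbing needed compared to a simple per-domain delay.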

> More specifically, what you said above: "if such a domain is going to be
> needed very soon in the future" is just another way of saying it there
> is a wakeup latency constraint.   Wakeup latency constratints are
> handled by per-device PM QoS, which can be queried by a governor
> associated with a genpd.

No. This is not a wakeup latency constraint but rather an energy breakeven
point constraint.

Cheers,

Peter.
Kevin Hilman March 22, 2013, 4:58 p.m. UTC | #12
Peter De Schrijver <pdeschrijver@nvidia.com> writes:

>> > - If such a domain is going to be needed very soon in future, it makes
>> > sense to avoid its power down for at-least that much amount of time
>> > (which is what the proposed patch does).
>> 
>> ...and is also what the pluggable governors inside genpd are meant to
>> allow you to do.
>> 
>
> Unfortunately this is rather akward to implement in a genpd governor. The
> governor only gets called when the genpd core wants to power off a domain.
> It can then say yes or no. You could start a timer and on expiry call into
> genpd and use a flag to indicate to the governor (which will be called
> again), it should now allow the power off.
>
>> More specifically, what you said above: "if such a domain is going to be
>> needed very soon in the future" is just another way of saying it there
>> is a wakeup latency constraint.   Wakeup latency constratints are
>> handled by per-device PM QoS, which can be queried by a governor
>> associated with a genpd.
>
> No. This is not a wakeup latency constraint but rather an energy breakeven
> point constraint.

OK, that makes more sense, but the way it was described in the
changelog, it sounded like a wakeup latency constraint.

Speaking of the changelog, I would suggest it be updated to describe why
the other methods proposed here would not work.

Kevin


Patch

diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index 9a6b05a..349f778 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -585,6 +585,19 @@  static int pm_genpd_poweroff(struct generic_pm_domain *genpd)
 	return ret;
 }
 
+static int __pm_genpd_poweroff(struct generic_pm_domain *genpd)
+{
+	int ret = 0;
+
+	mutex_lock(&genpd->lock);
+	genpd->in_progress++;
+	ret = pm_genpd_poweroff(genpd);
+	genpd->in_progress--;
+	mutex_unlock(&genpd->lock);
+
+	return ret;
+}
+
 /**
  * genpd_power_off_work_fn - Power off PM domain whose subdomain count is 0.
  * @work: Work structure used for scheduling the execution of this function.
@@ -601,6 +614,22 @@  static void genpd_power_off_work_fn(struct work_struct *work)
 }
 
 /**
+ * genpd_delayed_power_off_work_fn - Power off PM domain after the delay.
+ * @work: Delayed work structure used for scheduling the
+ *        execution of this function.
+ */
+static void genpd_delayed_power_off_work_fn(struct work_struct *work)
+{
+	struct generic_pm_domain *genpd;
+	struct delayed_work *delay_work = to_delayed_work(work);
+
+	genpd = container_of(delay_work, struct generic_pm_domain,
+		power_off_delayed_work);
+
+	__pm_genpd_poweroff(genpd);
+}
+
+/**
  * pm_genpd_runtime_suspend - Suspend a device belonging to I/O PM domain.
  * @dev: Device to suspend.
  *
@@ -637,11 +666,11 @@  static int pm_genpd_runtime_suspend(struct device *dev)
 	if (dev->power.irq_safe)
 		return 0;
 
-	mutex_lock(&genpd->lock);
-	genpd->in_progress++;
-	pm_genpd_poweroff(genpd);
-	genpd->in_progress--;
-	mutex_unlock(&genpd->lock);
+	if (genpd->power_off_delay)
+		queue_delayed_work(pm_wq, &genpd->power_off_delayed_work,
+			msecs_to_jiffies(genpd->power_off_delay));
+	else
+		__pm_genpd_poweroff(genpd);
 
 	return 0;
 }
@@ -672,6 +701,12 @@  static int pm_genpd_runtime_resume(struct device *dev)
 	if (dev->power.irq_safe)
 		return genpd_start_dev_no_timing(genpd, dev);
 
+	if (genpd->power_off_delay) {
+		if (delayed_work_pending(&genpd->power_off_delayed_work))
+			cancel_delayed_work_sync(
+				&genpd->power_off_delayed_work);
+	}
+
 	mutex_lock(&genpd->lock);
 	ret = __pm_genpd_poweron(genpd);
 	if (ret) {
@@ -730,6 +765,7 @@  static inline int genpd_dev_pm_qos_notifier(struct notifier_block *nb,
 }
 
 static inline void genpd_power_off_work_fn(struct work_struct *work) {}
+static inline void genpd_delayed_power_off_work_fn(struct work_struct *work) {}
 
 #define pm_genpd_runtime_suspend	NULL
 #define pm_genpd_runtime_resume		NULL
@@ -2101,6 +2137,36 @@  static int pm_genpd_default_thaw(struct device *dev)
 	return cb ? cb(dev) : pm_generic_thaw(dev);
 }
 
+static int pm_genpd_suspend_notifier(struct notifier_block *notifier,
+	unsigned long pm_event, void *unused)
+{
+	struct generic_pm_domain *genpd = container_of(notifier,
+		struct generic_pm_domain, system_suspend_notifier);
+
+	if (!genpd)
+		return NOTIFY_DONE;
+
+	switch (pm_event) {
+	case PM_SUSPEND_PREPARE:
+		if (genpd->power_off_delay) {
+			/* if a domain has scheduled a delayed work */
+			if (delayed_work_pending(
+				&genpd->power_off_delayed_work)) {
+
+				/* cancel it now */
+				cancel_delayed_work_sync(
+					&genpd->power_off_delayed_work);
+
+				/* call its power off */
+				__pm_genpd_poweroff(genpd);
+			}
+		}
+		return NOTIFY_OK;
+	}
+
+	return NOTIFY_DONE;
+}
+
 #else /* !CONFIG_PM_SLEEP */
 
 #define pm_genpd_default_suspend	NULL
@@ -2132,7 +2198,10 @@  void pm_genpd_init(struct generic_pm_domain *genpd,
 	mutex_init(&genpd->lock);
 	genpd->gov = gov;
 	INIT_WORK(&genpd->power_off_work, genpd_power_off_work_fn);
+	INIT_DELAYED_WORK(&genpd->power_off_delayed_work,
+		genpd_delayed_power_off_work_fn);
 	genpd->in_progress = 0;
+	genpd->power_off_delay = 0;
 	atomic_set(&genpd->sd_count, 0);
 	genpd->status = is_off ? GPD_STATE_POWER_OFF : GPD_STATE_ACTIVE;
 	init_waitqueue_head(&genpd->status_wait_queue);
@@ -2174,6 +2243,11 @@  void pm_genpd_init(struct generic_pm_domain *genpd,
 	genpd->dev_ops.freeze_late = pm_genpd_default_freeze_late;
 	genpd->dev_ops.thaw_early = pm_genpd_default_thaw_early;
 	genpd->dev_ops.thaw = pm_genpd_default_thaw;
+#ifdef CONFIG_PM_SLEEP
+	genpd->system_suspend_notifier.notifier_call =
+		pm_genpd_suspend_notifier;
+	register_pm_notifier(&genpd->system_suspend_notifier);
+#endif
 	mutex_lock(&gpd_list_lock);
 	list_add(&genpd->gpd_list_node, &gpd_list);
 	mutex_unlock(&gpd_list_lock);
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index 7c1d252..3ffb068 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -82,6 +82,11 @@  struct generic_pm_domain {
 	bool cached_power_down_ok;
 	struct device_node *of_node; /* Node in device tree */
 	struct gpd_cpu_data *cpu_data;
+	struct delayed_work power_off_delayed_work;
+	s64 power_off_delay;
+#ifdef CONFIG_PM_SLEEP
+	struct notifier_block system_suspend_notifier;
+#endif
 };
 
 static inline struct generic_pm_domain *pd_to_genpd(struct dev_pm_domain *pd)
@@ -127,6 +132,12 @@  static inline struct generic_pm_domain_data *dev_gpd_data(struct device *dev)
 	return to_gpd_data(dev->power.subsys_data->domain_data);
 }
 
+static inline void pm_genpd_set_poweroff_delay(struct generic_pm_domain *genpd,
+	s64 delay)
+{
+	genpd->power_off_delay = delay;
+}
+
 extern struct dev_power_governor simple_qos_governor;
 
 extern struct generic_pm_domain *dev_to_genpd(struct device *dev);
@@ -170,6 +181,9 @@  extern bool default_stop_ok(struct device *dev);
 extern struct dev_power_governor pm_domain_always_on_gov;
 #else
 
+static inline void pm_genpd_set_poweroff_delay(struct generic_pm_domain *genpd,
+	s64 delay) {}
+
 static inline struct generic_pm_domain_data *dev_gpd_data(struct device *dev)
 {
 	return ERR_PTR(-ENOSYS);