[v8,07/26] PM / Domains: Add genpd governor for CPUs

Message ID 20180620172226.15012-8-ulf.hansson@linaro.org (mailing list archive)
State New, archived

Commit Message

Ulf Hansson June 20, 2018, 5:22 p.m. UTC
As it's now perfectly possible that a PM domain managed by genpd contains
devices belonging to CPUs, we should start to take into account the
residency values of the idle states during the state selection process.
The residency value specifies the minimum amount of time that the CPU, or
a group of CPUs, needs to spend in an idle state for entering it not to
waste energy.

To deal with this, let's add a new genpd governor, pm_domain_cpu_gov, that
may be used for a PM domain that has CPU devices attached, either directly
or through subdomains.

The new governor computes the minimum expected idle duration for the
online CPUs attached to the PM domain and its subdomains. Then, during
state selection, trying the deepest state first, it verifies that the
idle duration satisfies the state's residency value.

It should be noted that, when computing the minimum expected idle
duration, we use the information from tick_nohz_get_next_wakeup() to find
the next wakeup for the related CPUs. Going forward, this may deserve to
be improved, as there are more reasons why a CPU may be woken up from
idle.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Lina Iyer <ilina@codeaurora.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Co-developed-by: Lina Iyer <lina.iyer@linaro.org>
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
---
 drivers/base/power/domain_governor.c | 58 ++++++++++++++++++++++++++++
 include/linux/pm_domain.h            |  2 +
 2 files changed, 60 insertions(+)
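
For context, a platform would opt in by initializing its CPU PM domain
with the new governor. A minimal sketch (GENPD_FLAG_CPU_DOMAIN and
pm_domain_cpu_gov come from this series; the domain setup is otherwise
hypothetical and would also need states, callbacks, etc.):

#include <linux/pm_domain.h>

static struct generic_pm_domain cluster_pd = {
	.name	= "cluster-pd",
	.flags	= GENPD_FLAG_CPU_DOMAIN,
};

static int __init cluster_pd_init(void)
{
	/* Let pm_domain_cpu_gov do residency-aware state selection. */
	return pm_genpd_init(&cluster_pd, &pm_domain_cpu_gov, false);
}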

Comments

Rafael J. Wysocki July 19, 2018, 10:32 a.m. UTC | #1
On Wednesday, June 20, 2018 7:22:07 PM CEST Ulf Hansson wrote:
> [...]
> 
> diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
> index 99896fbf18e4..1aad55719537 100644
> --- a/drivers/base/power/domain_governor.c
> +++ b/drivers/base/power/domain_governor.c
> @@ -10,6 +10,9 @@
>  #include <linux/pm_domain.h>
>  #include <linux/pm_qos.h>
>  #include <linux/hrtimer.h>
> +#include <linux/cpumask.h>
> +#include <linux/ktime.h>
> +#include <linux/tick.h>
>  
>  static int dev_update_qos_constraint(struct device *dev, void *data)
>  {
> @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
>  	return false;
>  }
>  
> +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> +{
> +	struct generic_pm_domain *genpd = pd_to_genpd(pd);
> +	ktime_t domain_wakeup, cpu_wakeup;
> +	s64 idle_duration_ns;
> +	int cpu, i;
> +
> +	if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> +		return true;
> +
> +	/*
> +	 * Find the next wakeup for any of the online CPUs within the PM domain
> +	 * and its subdomains. Note, we only need the genpd->cpus, as it already
> +	 * contains a mask of all CPUs from subdomains.
> +	 */
> +	domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> +	for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> +		cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> +		if (ktime_before(cpu_wakeup, domain_wakeup))
> +			domain_wakeup = cpu_wakeup;
> +	}
> +
> +	/* The minimum idle duration is from now - until the next wakeup. */
> +	idle_duration_ns = ktime_to_ns(ktime_sub(domain_wakeup, ktime_get()));
> +

If idle_duration_ns is negative at this point, you can return false right
away and then you won't need to bother with this case below.

> +	/*
> +	 * Find the deepest idle state that has its residency value satisfied
> +	 * and by also taking into account the power off latency for the state.
> +	 * Start at the deepest supported state.
> +	 */
> +	i = genpd->state_count - 1;
> +	do {
> +		if (!genpd->states[i].residency_ns)
> +			break;
> +
> +		/* Check idle_duration_ns >= 0 to compare signed/unsigned. */
> +		if (idle_duration_ns >= 0 && idle_duration_ns >=
> +		    (genpd->states[i].residency_ns +
> +		     genpd->states[i].power_off_latency_ns))

Why don't you set state_idx and return true right here?

Then you'll only need to return false if you haven't found a matching state.
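
Along those lines, combined with the early return for a negative
idle_duration_ns, the selection could look like this (just a sketch of
the suggestion, not the actual patch):

	if (idle_duration_ns < 0)
		return false;

	/*
	 * Find the deepest state whose residency, plus the power off
	 * latency, is covered by the expected idle duration. A state
	 * without a residency value is always considered satisfied.
	 */
	for (i = genpd->state_count - 1; i >= 0; i--) {
		if (!genpd->states[i].residency_ns ||
		    idle_duration_ns >= (genpd->states[i].residency_ns +
					 genpd->states[i].power_off_latency_ns)) {
			genpd->state_idx = i;
			return true;
		}
	}

	return false;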

> +			break;
> +		i--;
> +	} while (i >= 0);
> +
> +	if (i < 0)
> +		return false;
> +
> +	genpd->state_idx = i;
> +	return true;
> +}
> +
>  struct dev_power_governor simple_qos_governor = {
>  	.suspend_ok = default_suspend_ok,
>  	.power_down_ok = default_power_down_ok,
> @@ -257,3 +310,8 @@ struct dev_power_governor pm_domain_always_on_gov = {
>  	.power_down_ok = always_on_power_down_ok,
>  	.suspend_ok = default_suspend_ok,
>  };
> +
> +struct dev_power_governor pm_domain_cpu_gov = {
> +	.suspend_ok = NULL,
> +	.power_down_ok = cpu_power_down_ok,

I see that I haven't got your code flow right after all. :-)

Which means that this should work AFAICS.

> +};
> diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
> index 2c09cf80b285..97901c833108 100644
> --- a/include/linux/pm_domain.h
> +++ b/include/linux/pm_domain.h
> @@ -160,6 +160,7 @@ int dev_pm_genpd_set_performance_state(struct device *dev, unsigned int state);
>  
>  extern struct dev_power_governor simple_qos_governor;
>  extern struct dev_power_governor pm_domain_always_on_gov;
> +extern struct dev_power_governor pm_domain_cpu_gov;
>  #else
>  
>  static inline struct generic_pm_domain_data *dev_gpd_data(struct device *dev)
> @@ -203,6 +204,7 @@ static inline int dev_pm_genpd_set_performance_state(struct device *dev,
>  
>  #define simple_qos_governor		(*(struct dev_power_governor *)(NULL))
>  #define pm_domain_always_on_gov		(*(struct dev_power_governor *)(NULL))
> +#define pm_domain_cpu_gov		(*(struct dev_power_governor *)(NULL))
>  #endif
>  
>  #ifdef CONFIG_PM_GENERIC_DOMAINS_SLEEP
>
Rafael J. Wysocki July 26, 2018, 9:14 a.m. UTC | #2
On Thursday, July 19, 2018 12:32:52 PM CEST Rafael J. Wysocki wrote:
> On Wednesday, June 20, 2018 7:22:07 PM CEST Ulf Hansson wrote:
> > [...]
> >
> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > +{
> > +	struct generic_pm_domain *genpd = pd_to_genpd(pd);
> > +	ktime_t domain_wakeup, cpu_wakeup;
> > +	s64 idle_duration_ns;
> > +	int cpu, i;
> > +
> > +	if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> > +		return true;
> > +
> > +	/*
> > +	 * Find the next wakeup for any of the online CPUs within the PM domain
> > +	 * and its subdomains. Note, we only need the genpd->cpus, as it already
> > +	 * contains a mask of all CPUs from subdomains.
> > +	 */
> > +	domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> > +	for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> > +		cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> > +		if (ktime_before(cpu_wakeup, domain_wakeup))
> > +			domain_wakeup = cpu_wakeup;
> > +	}

Here's a concern I have missed before. :-/

Say, one of the CPUs you're walking here is woken up in the meantime.

I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
to update domain_wakeup.  We really should just avoid the domain power off in
that case at all IMO.

Sure enough, if the domain power off is already started and one of the CPUs
in the domain is woken up then, too bad, it will suffer the latency (but in
that case the hardware should be able to help somewhat), but otherwise CPU
wakeup should prevent domain power off from being carried out.

Thanks,
Rafael
Ulf Hansson Aug. 3, 2018, 2:28 p.m. UTC | #3
On 26 July 2018 at 11:14, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Thursday, July 19, 2018 12:32:52 PM CEST Rafael J. Wysocki wrote:
>> On Wednesday, June 20, 2018 7:22:07 PM CEST Ulf Hansson wrote:
>> > [...]
>> >
>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
>> > +{
>> > +   struct generic_pm_domain *genpd = pd_to_genpd(pd);
>> > +   ktime_t domain_wakeup, cpu_wakeup;
>> > +   s64 idle_duration_ns;
>> > +   int cpu, i;
>> > +
>> > +   if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
>> > +           return true;
>> > +
>> > +   /*
>> > +    * Find the next wakeup for any of the online CPUs within the PM domain
>> > +    * and its subdomains. Note, we only need the genpd->cpus, as it already
>> > +    * contains a mask of all CPUs from subdomains.
>> > +    */
>> > +   domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
>> > +   for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
>> > +           cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
>> > +           if (ktime_before(cpu_wakeup, domain_wakeup))
>> > +                   domain_wakeup = cpu_wakeup;
>> > +   }
>
> Here's a concern I have missed before. :-/
>
> Say, one of the CPUs you're walking here is woken up in the meantime.

Yes, that can happen - when we have mis-predicted the "next wakeup".

>
> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
> to update domain_wakeup.  We really should just avoid the domain power off in
> that case at all IMO.

Correct.

However, we also want to avoid locking contentions in the idle path,
which is what this boils down to.

>
> Sure enough, if the domain power off is already started and one of the CPUs
> in the domain is woken up then, too bad, it will suffer the latency (but in
> that case the hardware should be able to help somewhat), but otherwise CPU
> wakeup should prevent domain power off from being carried out.

The CPU is not prevented from waking up, as we rely on the FW to deal with that.

Even if the above computation turns out to wrongly suggest that the
cluster can be powered off, the FW shall together with the genpd
backend driver prevent it.

To cover this case for PSCI, we also use a per cpu variable for the
CPU's power off state, as can be seen later in the series.
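
For reference, that per cpu variable looks roughly like this in the
later PSCI patches (a sketch, names approximated from the series):

#include <linux/percpu.h>

/* Each CPU publishes the power off state it has selected. */
static DEFINE_PER_CPU(u32, domain_state);

static void psci_set_domain_state(u32 state)
{
	__this_cpu_write(domain_state, state);
}

static u32 psci_get_domain_state(void)
{
	return __this_cpu_read(domain_state);
}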

Hope this clarifies your concern; if not, tell me and I will elaborate a bit more.

Kind regards
Uffe
Rafael J. Wysocki Aug. 6, 2018, 9:20 a.m. UTC | #4
On Fri, Aug 3, 2018 at 4:28 PM, Ulf Hansson <ulf.hansson@linaro.org> wrote:
> On 26 July 2018 at 11:14, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> On Thursday, July 19, 2018 12:32:52 PM CEST Rafael J. Wysocki wrote:
>>> On Wednesday, June 20, 2018 7:22:07 PM CEST Ulf Hansson wrote:
>>> > [...]
>>> >
>>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
>>> > +{
>>> > +   struct generic_pm_domain *genpd = pd_to_genpd(pd);
>>> > +   ktime_t domain_wakeup, cpu_wakeup;
>>> > +   s64 idle_duration_ns;
>>> > +   int cpu, i;
>>> > +
>>> > +   if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
>>> > +           return true;
>>> > +
>>> > +   /*
>>> > +    * Find the next wakeup for any of the online CPUs within the PM domain
>>> > +    * and its subdomains. Note, we only need the genpd->cpus, as it already
>>> > +    * contains a mask of all CPUs from subdomains.
>>> > +    */
>>> > +   domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
>>> > +   for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
>>> > +           cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
>>> > +           if (ktime_before(cpu_wakeup, domain_wakeup))
>>> > +                   domain_wakeup = cpu_wakeup;
>>> > +   }
>>
>> Here's a concern I have missed before. :-/
>>
>> Say, one of the CPUs you're walking here is woken up in the meantime.
>
> Yes, that can happen - when we have mis-predicted the "next wakeup".
>
>>
>> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
>> to update domain_wakeup.  We really should just avoid the domain power off in
>> that case at all IMO.
>
> Correct.
>
> However, we also want to avoid locking contentions in the idle path,
> which is what this boils down to.

This already is done under genpd_lock() AFAICS, so I'm not quite sure
what exactly you mean.

Besides, this is not just about increased latency, which is a concern
by itself but maybe not so much in all environments, but also about the
possibility of missing a CPU wakeup, which is a major issue.

If one of the CPUs sharing the domain with the current one is woken up
during cpu_power_down_ok() and the wakeup is an edge-triggered
interrupt and the domain is turned off regardless, the wakeup may be
missed entirely if I'm not mistaken.

It looks like there needs to be a way for the hardware to prevent a
domain poweroff when there's a pending interrupt or I don't quite see
how this can be handled correctly.

>> Sure enough, if the domain power off is already started and one of the CPUs
>> in the domain is woken up then, too bad, it will suffer the latency (but in
>> that case the hardware should be able to help somewhat), but otherwise CPU
>> wakeup should prevent domain power off from being carried out.
>
> The CPU is not prevented from waking up, as we rely on the FW to deal with that.
>
> Even if the above computation turns out to wrongly suggest that the
> cluster can be powered off, the FW shall together with the genpd
> backend driver prevent it.

Fine, but then the solution depends on specific FW/HW behavior, so I'm
not sure how generic it really is.  At least, that expectation should
be clearly documented somewhere, preferably in code comments.

> To cover this case for PSCI, we also use a per cpu variable for the
> CPU's power off state, as can be seen later in the series.

Oh great, but the generic part should be independent of the underlying
implementation of the driver.  If it isn't, then it also is not
generic.

> Hope this clarifies your concern; if not, tell me and I will elaborate a bit more.

Not really.

There also is one more problem and that is the interaction between
this code and the idle governor.

Namely, the idle governor may select a shallower state for some
reason, for example due to an additional latency limit derived from
CPU utilization (like in the menu governor), and how does the code in
cpu_power_down_ok() know what state has been selected and how does it
honor the selection made by the idle governor?
Lorenzo Pieralisi Aug. 9, 2018, 3:39 p.m. UTC | #5
On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:

[...]

> [...]
>
> There also is one more problem and that is the interaction between
> this code and the idle governor.
> 
> Namely, the idle governor may select a shallower state for some
> reason, for example due to an additional latency limit derived from
> CPU utilization (like in the menu governor), and how does the code in
> cpu_power_down_ok() know what state has been selected and how does it
> honor the selection made by the idle governor?

That's a good question and it maybe gives a path towards a solution.

AFAICS the genPD governor only selects the idle state parameter that
determines the idle state at, say, GenPD cpumask level; it does not touch
the CPUidle decision, which works on a subset of idle states (at cpu
level).

That's my understanding, which can be wrong, so please correct me if
that's the case, because that's a bit confusing.

Let's imagine that we flattened out the list of idle states and fed
CPUidle with it (all of them - cpu, cluster, package, system - as it is
in the mainline _now_). Then the GenPD governor can run through the
CPUidle selection and _demote_ the idle state if necessary, since it
understands that some CPUs in the GenPD will wake up shortly and break
the target residency hypothesis the CPUidle governor is expecting.
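
For illustration, the demotion could be a helper along these lines (a
hypothetical sketch, not part of this series):

static int genpd_demote_state(struct generic_pm_domain *genpd,
			      int selected, s64 idle_duration_ns)
{
	int i;

	/*
	 * Walk from the state CPUidle selected towards shallower domain
	 * states until the expected idle duration of all CPUs in the
	 * genpd satisfies the state's residency.
	 */
	for (i = selected; i >= 0; i--) {
		if (idle_duration_ns >= genpd->states[i].residency_ns)
			return i;
	}

	return -1;	/* no domain state fits; demote to a CPU-only state */
}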

The whole idea about this series is improving CPUidle decision when
the target idle state is _shared_ among groups of cpus (again, please
do correct me if I am wrong).

It is obvious that a GenPD governor must only demote - never promote -
a CPU idle state selection, given that hierarchy implies more power
savings and higher required target residencies.

This whole series would become more generic and won't depend on
PSCI OSI at all - actually that would become a hierarchical
CPUidle governor.

I still think that PSCI firmware and most certainly mwait() play the
role the GenPD governor does, since they can detect in FW/HW whether
it is worthwhile to switch off a domain; the information is obviously
there, and the kernel would just add latency to the idle path in that
case, but let's gloss over this for the sake of this discussion.

Lorenzo
Ulf Hansson Aug. 24, 2018, 8:29 a.m. UTC | #6
On 6 August 2018 at 11:20, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Aug 3, 2018 at 4:28 PM, Ulf Hansson <ulf.hansson@linaro.org> wrote:
>> On 26 July 2018 at 11:14, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>> On Thursday, July 19, 2018 12:32:52 PM CEST Rafael J. Wysocki wrote:
>>>> On Wednesday, June 20, 2018 7:22:07 PM CEST Ulf Hansson wrote:
>>>> > [...]
>>>> >
>>>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
>>>> > +{
>>>> > +   struct generic_pm_domain *genpd = pd_to_genpd(pd);
>>>> > +   ktime_t domain_wakeup, cpu_wakeup;
>>>> > +   s64 idle_duration_ns;
>>>> > +   int cpu, i;
>>>> > +
>>>> > +   if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
>>>> > +           return true;
>>>> > +
>>>> > +   /*
>>>> > +    * Find the next wakeup for any of the online CPUs within the PM domain
>>>> > +    * and its subdomains. Note, we only need the genpd->cpus, as it already
>>>> > +    * contains a mask of all CPUs from subdomains.
>>>> > +    */
>>>> > +   domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
>>>> > +   for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
>>>> > +           cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
>>>> > +           if (ktime_before(cpu_wakeup, domain_wakeup))
>>>> > +                   domain_wakeup = cpu_wakeup;
>>>> > +   }
>>>
>>> Here's a concern I have missed before. :-/
>>>
>>> Say, one of the CPUs you're walking here is woken up in the meantime.
>>
>> Yes, that can happen - when we have mis-predicted the "next wakeup".
>>
>>>
>>> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
>>> to update domain_wakeup.  We really should just avoid the domain power off in
>>> that case at all IMO.
>>
>> Correct.
>>
>> However, we also want to avoid locking contentions in the idle path,
>> which is what this boils down to.
>
> This already is done under genpd_lock() AFAICS, so I'm not quite sure
> what exactly you mean.
>
> Besides, this is not just about increased latency, which is a concern
> by itself but maybe not so much in all environments, but also about the
> possibility of missing a CPU wakeup, which is a major issue.
>
> If one of the CPUs sharing the domain with the current one is woken up
> during cpu_power_down_ok() and the wakeup is an edge-triggered
> interrupt and the domain is turned off regardless, the wakeup may be
> missed entirely if I'm not mistaken.
>
> It looks like there needs to be a way for the hardware to prevent a
> domain poweroff when there's a pending interrupt or I don't quite see
> how this can be handled correctly.

Well, the job of genpd and its new cpu governor is not directly to
power off the PM domain, but rather to try to select/promote an idle
state for it, along the lines of what Lorenzo explained in the other
thread.

Then what happens in the genpd backend driver's ->power_off()
callback is platform specific. In other words, it's the job of the
backend driver to understand how its FW works and thus to correctly
deal with the last man standing algorithm.

In regards to the PSCI FW, the race condition you are referring to is
dealt with in the FW (which makes it easier), no matter if it's
running in OS-initiated mode or platform-coordinated mode.
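
For illustration, the PSCI backend's ->power_off() callback later in
the series ends up along these lines (a sketch; state->data holding the
PSCI state parameter is an assumption here, and the FW composes this
with the per cpu states and has the final say):

static int psci_pd_power_off(struct generic_pm_domain *pd)
{
	struct genpd_power_state *state = &pd->states[pd->state_idx];
	u32 *pd_state = state->data;	/* assumed to hold the PSCI parameter */

	/*
	 * Stash the selected domain state; the FW may still keep the
	 * cluster on if a wakeup races with the power off.
	 */
	psci_set_domain_state(*pd_state);

	return 0;
}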

>
>>> Sure enough, if the domain power off is already started and one of the CPUs
>>> in the domain is woken up then, too bad, it will suffer the latency (but in
>>> that case the hardware should be able to help somewhat), but otherwise CPU
>>> wakeup should prevent domain power off from being carried out.
>>
>> The CPU is not prevented from waking up, as we rely on the FW to deal with that.
>>
>> Even if the above computation turns out to wrongly suggest that the
>> cluster can be powered off, the FW shall together with the genpd
>> backend driver prevent it.
>
> Fine, but then the solution depends on specific FW/HW behavior, so I'm
> not sure how generic it really is.  At least, that expectation should
> be clearly documented somewhere, preferably in code comments.

Alright, let me add some comments somewhere in the code, to explain a
bit about what a genpd backend driver should expect when using the
GENPD_FLAG_CPU_DOMAIN flag.

>
>> To cover this case for PSCI, we also use a per cpu variable for the
>> CPU's power off state, as can be seen later in the series.
>
> Oh great, but the generic part should be independent of the underlying
> implementation of the driver.  If it isn't, then it also is not
> generic.
>
>> Hope this clarifies your concern; if not, tell me and I will elaborate a bit more.
>
> Not really.
>
> There also is one more problem and that is the interaction between
> this code and the idle governor.
>
> Namely, the idle governor may select a shallower state for some
> reason, for example due to an additional latency limit derived from
> CPU utilization (like in the menu governor), and how does the code in
> cpu_power_down_ok() know what state has been selected and how does it
> honor the selection made by the idle governor?

This is indeed a valid concern. I must have failed to explain this
during various conferences, but at least I have tried. :-)

Ideally, we need the menu idle governor and genpd's new cpu governor
to share code or exchange information, somehow. I am looking into that
as a next step of improvements, count on it!

The idea at this point was instead to take a simplified approach to
the problem, to at least get some support for cpu cluster idle
management in place, then improve it on top.

This means, for PSCI, we are using the new genpd cpu governor *only*
for the cluster PM domain (master), but not for the genpd subdomains,
each of which contains a single CPU device. So, the subdomains don't
have a genpd governor assigned, but instead rely on the existing menu
idle governor to select an idle state for the CPU. This means that
*most* of the problem disappears, as it's only when the last CPU in the
cluster goes idle that the selection could be "wrong". In the worst
case, genpd will promote an idle state for the cluster PM domain when
it shouldn't.
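
In other words, the setup described above looks roughly like this
(hypothetical setup code):

	/* Cluster (master) PM domain: the new CPU governor picks its state. */
	pm_genpd_init(&cluster_pd, &pm_domain_cpu_gov, false);

	/* Per CPU subdomains: no genpd governor, menu picks the CPU state. */
	for_each_possible_cpu(cpu) {
		pm_genpd_init(&cpu_pd[cpu], NULL, false);
		pm_genpd_add_subdomain(&cluster_pd, &cpu_pd[cpu]);
	}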

Moreover, for the QCOM 410c case, this isn't even a potential
problem, because there is only *one* idle state for the menu idle
governor to pick for the CPU (besides WFI). Hence, when the genpd cpu
governor runs to pick an idle state, we know that the menu idle
governor has already selected the deepest idle state for each CPU.

Kind regards
Uffe
Ulf Hansson Aug. 24, 2018, 9:26 a.m. UTC | #7
On 9 August 2018 at 17:39, Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> wrote:
> On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:
>
> [...]
>
>> [...]
>>
>> There also is one more problem and that is the interaction between
>> this code and the idle governor.
>>
>> Namely, the idle governor may select a shallower state for some
>> reason, for example due to an additional latency limit derived from
>> CPU utilization (like in the menu governor), and how does the code in
>> cpu_power_down_ok() know what state has been selected and how does it
>> honor the selection made by the idle governor?
>
> [...]
>
> The whole idea about this series is improving CPUidle decision when
> the target idle state is _shared_ among groups of cpus (again, please
> do correct me if I am wrong).

Absolutely, this is one of the main reasons for the series!

>
> It is obvious that a GenPD governor must only demote - never promote -
> a CPU idle state selection, given that hierarchy implies more power
> savings and higher required target residencies.

Absolutely. I apologize if I have been using the word "promote"
wrongly; I realize it may be a bit confusing.

>
> This whole series would become more generic and won't depend on
> PSCI OSI at all - actually that would become a hierarchical
> CPUidle governor.

Well, to me we need a first user of the new infrastructure code in
genpd and PSCI is probably the easiest one to start with. An option
would be to start with an old ARM32 platform, but it seems a bit silly
to me.

In regards to OS-initiated mode vs platform coordinated mode, let's
discuss that in detail in the other email thread instead.

>
> I still think that PSCI firmware and most certainly mwait() play the
> role the GenPD governor does, since they can detect in FW/HW whether
> it is worthwhile to switch off a domain; the information is obviously
> there, and the kernel would just add latency to the idle path in that
> case, but let's gloss over this for the sake of this discussion.

Yep, let's discuss that separately.

That said, can I interpret your comments on the series up until this
change as meaning that you seem rather happy with where the series is
going?

Kind regards
Uffe
Lorenzo Pieralisi Aug. 24, 2018, 10:38 a.m. UTC | #8
On Fri, Aug 24, 2018 at 11:26:19AM +0200, Ulf Hansson wrote:

[...]

> [...]
> 
> >
> > This whole series would become more generic and won't depend on
> > PSCI OSI at all - actually that would become a hierarchical
> > CPUidle governor.
> 
> Well, to me we need a first user of the new infrastructure code in
> genpd and PSCI is probably the easiest one to start with. An option
> would be to start with an old ARM32 platform, but it seems a bit silly
> to me.

If the code can be structured, as described above, as a hierarchical
(possibly optional through a Kconfig entry or sysfs tuning) idle
decision, you can apply it to _any_ PSCI based platform out there,
provided that the new governor improves power savings.

> In regards to OS-initiated mode vs platform coordinated mode, let's
> discuss that in details in the other email thread instead.

I think it's crystal clear by now that, IMHO, PSCI OS-initiated mode is
a red herring: it has nothing to do with this series; it is there just
because the QC firmware does not support PSCI platform-coordinated
suspend mode.

You can apply the concept in this series to _any_ arch provided
the power domains representation is correct (and again, I would sound
like a broken record but the series must improve power savings over
vanilla CPUidle menu governor).

> [...]
> 
> That said, can I interpret your comments on the series up until this
> change as meaning that you seem rather happy with where the series is
> going?

It is something we have been discussing with Daniel since generic idle
was merged for Arm a long while back. I have nothing against describing
idle states with power domains but it must improve idle decisions
against the mainline. As I said before, runtime PM can also be used
to get rid of CPU PM notifiers (because with power domains we KNOW
what devices, e.g. the PMU, are switched off on idle entry, we do not guess
any longer; replacing CPU PM notifiers is challenging and can be
tackled - if required - in a different series).

Bottom line (talk is cheap, I know, and I apologise for that): this
series (up until this change) adds complexity to the idle path and lots
of code; if its usage is made optional and can be switched on on systems
where it saves power, that's fine by me, as long as we keep PSCI
OS-initiated idle states out of the equation; that's an orthogonal
discussion, as I hope I managed to convey.

Thanks,
Lorenzo
Ulf Hansson Aug. 30, 2018, 1:36 p.m. UTC | #9
On 24 August 2018 at 12:38, Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> wrote:
> On Fri, Aug 24, 2018 at 11:26:19AM +0200, Ulf Hansson wrote:
>
> [...]
>
>> [...]
>
> If the code can be structured, as described above, as a hierarchical
> (possibly optional through a Kconfig entry or sysfs tuning) idle
> decision, you can apply it to _any_ PSCI based platform out there,
> provided that the new governor improves power savings.
>
>> In regards to OS-initiated mode vs platform coordinated mode, let's
>> discuss that in detail in the other email thread instead.
>
> I think it's crystal clear by now that, IMHO, PSCI OS-initiated mode is
> a red herring: it has nothing to do with this series; it is there just
> because the QC firmware does not support PSCI platform-coordinated
> suspend mode.

I fully agree that the series isn't specific to PSCI OSI mode. On the
other hand, PSCI OSI mode is where I see this series fitting naturally,
in particular for the QCOM 410c board.

When it comes to the PSCI PC mode, it may under certain circumstances
be useful to deploy this approach for that as well, and I agree that
it seems reasonable to have that configurable as opt-in, somehow.

Although, let's discuss that separately, in a next step. Or at least
let's try to keep PSCI related technical discussions to the other
thread, as that makes it easier to follow.

>
> You can apply the concept in this series to _any_ arch provided
> the power domains representation is correct (and again, I would sound
> like a broken record but the series must improve power savings over
> vanilla CPUidle menu governor).

I agree, but let me elaborate a bit, to hopefully add some clarity,
which I may not have been able to communicate earlier.

The goal of the series is to enable platforms to support all their
available idle states, which are shared among a group of CPUs. This is
the case for QCOM 410c, for example.

To my knowledge, we have other ARM32 based platforms that currently
have disabled some of their cluster idle states. That's because they
can't know when it's safe to power off the cluster "coherency domain",
in cases when the platform also has other shared resources in it.

The point is, to see improved power savings, additional platform
deployment may be needed, and that just takes time. For example, runtime
PM support is needed in those drivers that deal with the "shared
resources", a correctly modeled PM domain topology using genpd is
needed, etc.

>
>> [...]
>>
>> Yep, let's discuss that separately.
>>
>> That said, can I interpret your comments on the series up until this
>> change as meaning that you seem rather happy with where the series is
>> going?
>
> It is something we have been discussing with Daniel since generic idle
> was merged for Arm a long while back. I have nothing against describing
> idle states with power domains but it must improve idle decisions
> against the mainline. As I said before, runtime PM can also be used
> to get rid of CPU PM notifiers (because with power domains we KNOW
> what devices eg PMU are switched off on idle entry, we do not guess
> any longer; replacing CPU PM notifiers is challenging and can be
> tackled - if required - in a different series).

Yes, we have been talking about the CPU PM and CPU_CLUSTER_PM notifiers,
and I fully agree. It's something that we should look into in future
steps.

>
> Bottom line (talk is cheap, I know, and I apologise for that): this
> series (up until this change) adds complexity to the idle path and lots
> of code; if its usage is made optional and can be switched on on systems
> where it saves power, that's fine by me, as long as we keep PSCI
> OS-initiated idle states out of the equation; that's an orthogonal
> discussion, as I hope I managed to convey.
>
> Thanks,
> Lorenzo

Lorenzo, thanks for your feedback!

Please, when you have time, could you also reply to the other thread
we started? I would like to understand how I should proceed with this
series.

Kind regards
Uffe
Lorenzo Pieralisi Sept. 13, 2018, 3:37 p.m. UTC | #10
On Thu, Aug 30, 2018 at 03:36:02PM +0200, Ulf Hansson wrote:
> On 24 August 2018 at 12:38, Lorenzo Pieralisi <lorenzo.pieralisi@arm.com> wrote:
> > On Fri, Aug 24, 2018 at 11:26:19AM +0200, Ulf Hansson wrote:
> >
> > [...]
> >
> >> > That's a good question and it maybe gives a path towards a solution.
> >> >
> >> > AFAICS the genPD governor only selects the idle state parameter that
> >> > determines the idle state at, say, GenPD cpumask level it does not touch
> >> > the CPUidle decision, that works on a subset of idle states (at cpu
> >> > level).
> >> >
> >> > That's my understanding, which can be wrong so please correct me
> >> > if that's the case because that's a bit confusing.
> >> >
> >> > Let's imagine that we flattened out the list of idle states and feed
> >> > CPUidle with it (all of them - cpu, cluster, package, system - as it is
> >> > in the mainline _now_). Then the GenPD governor can run-through the
> >> > CPUidle selection and _demote_ the idle state if necessary since it
> >> > understands that some CPUs in the GenPD will wake up shortly and break
> >> > the target residency hyphothesis the CPUidle governor is expecting.
> >> >
> >> > The whole idea about this series is improving CPUidle decision when
> >> > the target idle state is _shared_ among groups of cpus (again, please
> >> > do correct me if I am wrong).
> >>
> >> Absolutely, this is one of the main reason for the series!
> >>
> >> >
> >> > It is obvious that a GenPD governor must only demote - never promote a
> >> > CPU idle state selection given that hierarchy implies more power
> >> > savings and higher target residencies required.
> >>
> >> Absolutely. I apologize if I have been using the word "promote"
> >> wrongly, I realize it may be a bit confusing.
> >>
> >> >
> >> > This whole series would become more generic and won't depend on
> >> > PSCI OSI at all - actually that would become a hierarchical
> >> > CPUidle governor.
> >>
> >> Well, to me we need a first user of the new infrastructure code in
> >> genpd and PSCI is probably the easiest one to start with. An option
> >> would be to start with an old ARM32 platform, but it seems a bit silly
> >> to me.
> >
> > If the code can be structured as described above as a hierarchical
> > (possibly optional through a Kconfig entry or sysfs tuning) idle
> > decision, you can apply it to _any_ PSCI-based platform out there,
> > provided that the new governor improves power savings.
> >
> >> In regards to OS-initiated mode vs platform coordinated mode, let's
> >> discuss that in detail in the other email thread instead.
> >
> > I think it's crystal clear by now that IMHO PSCI OS-initiated mode is
> > a red herring; it has nothing to do with this series, it is there just
> > because QC firmware does not support PSCI platform coordinated suspend
> > mode.
> 
> I fully agree that the series isn't specific to PSCI OSI mode. On the
> other hand, PSCI OSI mode is where I see this series fitting
> naturally, and in particular for the QCOM 410c board.
> 
> When it comes to the PSCI PC mode, it may under certain circumstances
> be useful to deploy this approach for that as well, and I agree that
> it seems reasonable to have that configurable as opt-in, somehow.
> 
> Although, let's discuss that separately, as a next step. Or at least
> let's try to keep PSCI related technical discussions to the other
> thread, as that makes it easier to follow.
> 
> >
> > You can apply the concept in this series to _any_ arch provided
> > the power domains representation is correct (and again, at the risk
> > of sounding like a broken record, the series must improve power
> > savings over the vanilla CPUidle menu governor).
> 
> I agree, but let me elaborate a bit, to hopefully add some clarity,
> which I may not have been able to communicate earlier.
> 
> The goal with the series is to enable platforms to support all of
> their available idle states, which are shared among a group of CPUs.
> This is the case for QCOM 410c, for example.
> 
> To my knowledge, we have other ARM32 based platforms that currently
> have disabled some of their cluster idle states. That's because they
> can't know when it's safe to power off the cluster "coherency domain"
> in cases when the platform also has other shared resources in it.
> 
> The point is, to see improved power savings, additional platform
> deployment may be needed and that just takes time. For example,
> runtime PM support is needed in those drivers that deal with the
> "shared resources", a correctly modeled PM domain topology using
> genpd is needed, etc.
> 
> >
> >> > I still think that PSCI firmware and most certainly mwait() play the
> >> > role the GenPD governor does, since they can detect in FW/HW whether
> >> > it's worthwhile to switch off a domain; the information is obviously
> >> > there and the kernel would just add latency to the idle path in that
> >> > case, but let's gloss over this for the sake of this discussion.
> >>
> >> Yep, let's discuss that separately.
> >>
> >> That said, can I interpret your comments on the series up until this
> >> change, to mean that you seem rather happy with where the series is going?
> >
> > It is something we have been discussing with Daniel since generic idle
> > was merged for Arm a long while back. I have nothing against describing
> > idle states with power domains, but it must improve idle decisions
> > against the mainline. As I said before, runtime PM can also be used
> > to get rid of CPU PM notifiers (because with power domains we KNOW
> > which devices, e.g. the PMU, are switched off on idle entry; we do not
> > guess any longer; replacing CPU PM notifiers is challenging and can be
> > tackled - if required - in a different series).
> 
> Yes, we have been talking about the CPU PM and CPU_CLUSTER_PM notifiers
> and I fully agree. It's something that we should look into in future
> steps.
> 
> >
> > Bottom line (talk is cheap, I know, and I apologise for that): this
> > series (up until this change) adds complexity to the idle path and lots
> > of code; if its usage is made optional and can be switched on on systems
> > where it saves power, that's fine by me, as long as we keep PSCI
> > OS-initiated idle states out of the equation; that's an orthogonal
> > discussion as, I hope, I managed to convey.
> >
> > Thanks,
> > Lorenzo
> 
> Lorenzo, thanks for your feedback!
> 
> Please, when you have time, could you also reply to the other thread
> we started? I would like to understand how I should proceed with this
> series.

OK, thanks, I will, sorry for the delay in responding.

Lorenzo
Rafael J. Wysocki Sept. 14, 2018, 9:50 a.m. UTC | #11
On Thursday, August 9, 2018 5:39:25 PM CEST Lorenzo Pieralisi wrote:
> On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:
> 
> [...]
> 
> > >>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
> > >>> >     return false;
> > >>> >  }
> > >>> >
> > >>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > >>> > +{
> > >>> > +   struct generic_pm_domain *genpd = pd_to_genpd(pd);
> > >>> > +   ktime_t domain_wakeup, cpu_wakeup;
> > >>> > +   s64 idle_duration_ns;
> > >>> > +   int cpu, i;
> > >>> > +
> > >>> > +   if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> > >>> > +           return true;
> > >>> > +
> > >>> > +   /*
> > >>> > +    * Find the next wakeup for any of the online CPUs within the PM domain
> > >>> > +    * and its subdomains. Note, we only need the genpd->cpus, as it already
> > >>> > +    * contains a mask of all CPUs from subdomains.
> > >>> > +    */
> > >>> > +   domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> > >>> > +   for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> > >>> > +           cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> > >>> > +           if (ktime_before(cpu_wakeup, domain_wakeup))
> > >>> > +                   domain_wakeup = cpu_wakeup;
> > >>> > +   }
> > >>
> > >> Here's a concern I have missed before. :-/
> > >>
> > >> Say, one of the CPUs you're walking here is woken up in the meantime.
> > >
> > > Yes, that can happen - when we mispredicted the "next wakeup".
> > >
> > >>
> > >> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
> > >> to update domain_wakeup.  We really should just avoid the domain power off in
> > >> that case at all IMO.
> > >
> > > Correct.
> > >
> > > However, we also want to avoid locking contentions in the idle path,
> > > which is what this boils down to.
> > 
> > This already is done under genpd_lock() AFAICS, so I'm not quite sure
> > what exactly you mean.
> > 
> > Besides, this is not just about increased latency, which is a concern
> > by itself but maybe not so much in all environments, but also about
> > the possibility of missing a CPU wakeup, which is a major issue.
> > 
> > If one of the CPUs sharing the domain with the current one is woken up
> > during cpu_power_down_ok() and the wakeup is an edge-triggered
> > interrupt and the domain is turned off regardless, the wakeup may be
> > missed entirely if I'm not mistaken.
> > 
> > It looks like there needs to be a way for the hardware to prevent a
> > domain poweroff when there's a pending interrupt or I don't quite see
> > how this can be handled correctly.
> > 
> > >> Sure enough, if the domain power off is already started and one of the CPUs
> > >> in the domain is woken up then, too bad, it will suffer the latency (but in
> > >> that case the hardware should be able to help somewhat), but otherwise CPU
> > >> wakeup should prevent domain power off from being carried out.
> > >
> > > The CPU is not prevented from waking up, as we rely on the FW to deal with that.
> > >
> > > Even if the above computation turns out to wrongly suggest that the
> > > cluster can be powered off, the FW shall together with the genpd
> > > backend driver prevent it.
> > 
> > Fine, but then the solution depends on specific FW/HW behavior, so I'm
> > not sure how generic it really is.  At least, that expectation should
> > be clearly documented somewhere, preferably in code comments.
> > 
> > > To cover this case for PSCI, we also use a per cpu variable for the
> > > CPU's power off state, as can be seen later in the series.
> > 
> > Oh great, but the generic part should be independent of the underlying
> > implementation of the driver.  If it isn't, then it also is not
> > generic.
> > 
> > > Hope this clarifies your concern; if not, tell me and I will elaborate a bit more.
> > 
> > Not really.
> > 
> > There also is one more problem and that is the interaction between
> > this code and the idle governor.
> > 
> > Namely, the idle governor may select a shallower state for some
> > reason, for example due to an additional latency limit derived from
> > CPU utilization (like in the menu governor), and how does the code in
> > cpu_power_down_ok() know what state has been selected and how does it
> > honor the selection made by the idle governor?
> 
> That's a good question and it maybe gives a path towards a solution.
> 
> AFAICS the genPD governor only selects the idle state parameter that
> determines the idle state at, say, GenPD cpumask level; it does not
> touch the CPUidle decision, which works on a subset of idle states (at
> cpu level).

I've deferred responding to this as I wasn't quite sure if I followed you
at that time, but I'm afraid I'm still not following you now. :-)

The idle governor has to take the total worst-case wakeup latency into
account.  Not just from the logical CPU itself, but also from whatever
state the SoC may end up in as a result of this particular logical CPU
going idle, this way or another.

So for example, if your logical CPU has an idle state A that may trigger an
idle state X at the cluster level (if the other logical CPUs happen to be in
the right states and so on), then the worst-case exit latency for that
is that of state X.
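
To illustrate the arithmetic, here is a minimal C sketch (userspace,
with invented names - nothing here is an existing kernel interface):

#include <stdint.h>

/* Illustrative idle state descriptor - not a kernel structure. */
struct state_desc {
	uint64_t exit_latency_ns;
};

/*
 * The latency budget for a CPU entering state A must cover the deepest
 * SoC-level state X that entering A may end up triggering.
 */
static uint64_t worst_case_exit_latency_ns(const struct state_desc *a,
					   const struct state_desc *x)
{
	return x->exit_latency_ns > a->exit_latency_ns ?
	       x->exit_latency_ns : a->exit_latency_ns;
}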

> That's my understanding, which can be wrong so please correct me
> if that's the case because that's a bit confusing.
> 
> > Let's imagine that we flattened out the list of idle states and fed
> > CPUidle with it (all of them - cpu, cluster, package, system - as it is
> > in the mainline _now_). Then the GenPD governor can run through the
> > CPUidle selection and _demote_ the idle state if necessary, since it
> > understands that some CPUs in the GenPD will wake up shortly and break
> > the target residency hypothesis the CPUidle governor is expecting.
> > 
> > The whole idea about this series is improving the CPUidle decision when
> > the target idle state is _shared_ among groups of CPUs (again, please
> > do correct me if I am wrong).
> > 
> > It is obvious that a GenPD governor must only demote - never promote -
> > a CPU idle state selection, given that hierarchy implies more power
> > savings and higher required target residencies.

So I see a problem here, because the way patch 9 in this series is done,
the genpd governor for CPUs has no idea what states have been selected by
the idle governor, so how does it know how deep it can go with turning
off domains?

My point is that the selection made by the idle governor need not be
based only on timers, which is the only thing that the genpd governor
seems to be looking at.  The genpd governor should rather look at what
idle states have been selected for each CPU in the domain by the idle
governor and work within the boundaries of those.
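
As a rough sketch of the kind of bounding I mean (plain C, invented
names, not a proposal for the actual interface): the domain may only go
as deep as the shallowest per-CPU selection permits.

#include <limits.h>

/*
 * deepest_allowed[i] is the deepest domain state compatible with the
 * idle state that the CPUidle governor selected for CPU i.
 */
static int domain_state_bound(const int *deepest_allowed, int ncpus)
{
	int i, bound = INT_MAX;

	for (i = 0; i < ncpus; i++)
		if (deepest_allowed[i] < bound)
			bound = deepest_allowed[i];

	/* The domain must not go deeper than any CPU's selection allows. */
	return bound;
}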

Thanks,
Rafael
Lorenzo Pieralisi Sept. 14, 2018, 10:44 a.m. UTC | #12
On Fri, Sep 14, 2018 at 11:50:15AM +0200, Rafael J. Wysocki wrote:
> On Thursday, August 9, 2018 5:39:25 PM CEST Lorenzo Pieralisi wrote:
> > On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:
> > 
> > [...]
> > 
> > > >>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
> > > >>> >     return false;
> > > >>> >  }
> > > >>> >
> > > >>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > > >>> > +{
> > > >>> > +   struct generic_pm_domain *genpd = pd_to_genpd(pd);
> > > >>> > +   ktime_t domain_wakeup, cpu_wakeup;
> > > >>> > +   s64 idle_duration_ns;
> > > >>> > +   int cpu, i;
> > > >>> > +
> > > >>> > +   if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> > > >>> > +           return true;
> > > >>> > +
> > > >>> > +   /*
> > > >>> > +    * Find the next wakeup for any of the online CPUs within the PM domain
> > > >>> > +    * and its subdomains. Note, we only need the genpd->cpus, as it already
> > > >>> > +    * contains a mask of all CPUs from subdomains.
> > > >>> > +    */
> > > >>> > +   domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> > > >>> > +   for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> > > >>> > +           cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> > > >>> > +           if (ktime_before(cpu_wakeup, domain_wakeup))
> > > >>> > +                   domain_wakeup = cpu_wakeup;
> > > >>> > +   }
> > > >>
> > > >> Here's a concern I have missed before. :-/
> > > >>
> > > >> Say, one of the CPUs you're walking here is woken up in the meantime.
> > > >
> > > > Yes, that can happen - when we mispredicted the "next wakeup".
> > > >
> > > >>
> > > >> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
> > > >> to update domain_wakeup.  We really should just avoid the domain power off in
> > > >> that case at all IMO.
> > > >
> > > > Correct.
> > > >
> > > > However, we also want to avoid locking contentions in the idle path,
> > > > which is what this boils down to.
> > > 
> > > This already is done under genpd_lock() AFAICS, so I'm not quite sure
> > > what exactly you mean.
> > > 
> > > Besides, this is not just about increased latency, which is a concern
> > > by itself but maybe not so much in all environments, but also about
> > > the possibility of missing a CPU wakeup, which is a major issue.
> > > 
> > > If one of the CPUs sharing the domain with the current one is woken up
> > > during cpu_power_down_ok() and the wakeup is an edge-triggered
> > > interrupt and the domain is turned off regardless, the wakeup may be
> > > missed entirely if I'm not mistaken.
> > > 
> > > It looks like there needs to be a way for the hardware to prevent a
> > > domain poweroff when there's a pending interrupt or I don't quite see
> > > how this can be handled correctly.
> > > 
> > > >> Sure enough, if the domain power off is already started and one of the CPUs
> > > >> in the domain is woken up then, too bad, it will suffer the latency (but in
> > > >> that case the hardware should be able to help somewhat), but otherwise CPU
> > > >> wakeup should prevent domain power off from being carried out.
> > > >
> > > > The CPU is not prevented from waking up, as we rely on the FW to deal with that.
> > > >
> > > > Even if the above computation turns out to wrongly suggest that the
> > > > cluster can be powered off, the FW shall together with the genpd
> > > > backend driver prevent it.
> > > 
> > > Fine, but then the solution depends on specific FW/HW behavior, so I'm
> > > not sure how generic it really is.  At least, that expectation should
> > > be clearly documented somewhere, preferably in code comments.
> > > 
> > > > To cover this case for PSCI, we also use a per cpu variable for the
> > > > CPU's power off state, as can be seen later in the series.
> > > 
> > > Oh great, but the generic part should be independent of the underlying
> > > implementation of the driver.  If it isn't, then it also is not
> > > generic.
> > > 
> > > > Hope this clarifies your concern; if not, tell me and I will elaborate a bit more.
> > > 
> > > Not really.
> > > 
> > > There also is one more problem and that is the interaction between
> > > this code and the idle governor.
> > > 
> > > Namely, the idle governor may select a shallower state for some
> > > reason, for example due to an additional latency limit derived from
> > > CPU utilization (like in the menu governor), and how does the code in
> > > cpu_power_down_ok() know what state has been selected and how does it
> > > honor the selection made by the idle governor?
> > 
> > That's a good question and it maybe gives a path towards a solution.
> > 
> > AFAICS the genPD governor only selects the idle state parameter that
> > determines the idle state at, say, GenPD cpumask level; it does not
> > touch the CPUidle decision, which works on a subset of idle states (at
> > cpu level).
> 
> I've deferred responding to this as I wasn't quite sure if I followed you
> at that time, but I'm afraid I'm still not following you now. :-)
> 
> The idle governor has to take the total worst-case wakeup latency into
> account.  Not just from the logical CPU itself, but also from whatever
> state the SoC may end up in as a result of this particular logical CPU
> going idle, this way or another.
> 
> So for example, if your logical CPU has an idle state A that may trigger an
> idle state X at the cluster level (if the other logical CPUs happen to be in
> the right states and so on), then the worst-case exit latency for that
> is that of state X.

I will provide an example:

IDLE STATE A (affects CPU {0,1}): exit latency 1ms, min-residency 1.5ms

CPU 0 is about to enter IDLE state A since its "next-event" fulfills the
residency requirements and exit latency constraints.

CPU 1 is in idle state A (given that CPU 0 is ON, some of the common
logic shared between CPU {0,1} is still ON, but, as soon as CPU 0
enters idle state A, CPU {0,1} can enter the "full" idle state A
power savings mode).

The current CPUidle governor does not check the "next-event" for CPU 1,
which may wake up in, say, 10us.

Requesting IDLE STATE A is then a waste of power (unless firmware or
hardware peeks at CPU 1's next-event and actually demotes CPU 0's
request).

The current flat list of idle states has no notion of CPUs sharing an
idle state request; that's where I think this series kicks in, and
that's the reason I say that the genPD governor can only demote an
idle state request.

Linking power domains to idle states is the only sensible way I see to
define which logical CPUs are affected by an idle state entry; this
information is missing in the current kernel (whether it's worthwhile
to add it is another question).
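
For the example above, the demote-only rule such a cpumask-aware
governor would apply boils down to something like the following plain C
sketch (all names invented, numbers taken from the example):

#include <stdbool.h>
#include <stdint.h>

#define STATE_A_MIN_RESIDENCY_NS	1500000LL	/* 1.5ms */

/*
 * Keep the shared state A request only if every affected CPU stays
 * idle long enough to satisfy min-residency. With CPU 1 waking up in
 * 10us, CPU 0's request for state A is demoted - never promoted.
 */
static bool shared_state_ok(int64_t now_ns,
			    const int64_t *next_event_ns, int ncpus)
{
	int i;

	for (i = 0; i < ncpus; i++)
		if (next_event_ns[i] - now_ns < STATE_A_MIN_RESIDENCY_NS)
			return false;	/* demote */

	return true;
}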

> > That's my understanding, which can be wrong so please correct me
> > if that's the case because that's a bit confusing.
> > 
> > Let's imagine that we flattened out the list of idle states and fed
> > CPUidle with it (all of them - cpu, cluster, package, system - as it is
> > in the mainline _now_). Then the GenPD governor can run through the
> > CPUidle selection and _demote_ the idle state if necessary, since it
> > understands that some CPUs in the GenPD will wake up shortly and break
> > the target residency hypothesis the CPUidle governor is expecting.
> > 
> > The whole idea about this series is improving the CPUidle decision when
> > the target idle state is _shared_ among groups of CPUs (again, please
> > do correct me if I am wrong).
> > 
> > It is obvious that a GenPD governor must only demote - never promote -
> > a CPU idle state selection, given that hierarchy implies more power
> > savings and higher required target residencies.
> 
> So I see a problem here, because the way patch 9 in this series is done,
> the genpd governor for CPUs has no idea what states have been selected by
> the idle governor, so how does it know how deep it can go with turning
> off domains?
> 
> My point is that the selection made by the idle governor need not be
> based only on timers, which is the only thing that the genpd governor
> seems to be looking at.  The genpd governor should rather look at what
> idle states have been selected for each CPU in the domain by the idle
> governor and work within the boundaries of those.

That's agreed.

Lorenzo
Rafael J. Wysocki Sept. 14, 2018, 11:34 a.m. UTC | #13
On Fri, Sep 14, 2018 at 12:44 PM Lorenzo Pieralisi
<lorenzo.pieralisi@arm.com> wrote:
>
> On Fri, Sep 14, 2018 at 11:50:15AM +0200, Rafael J. Wysocki wrote:
> > On Thursday, August 9, 2018 5:39:25 PM CEST Lorenzo Pieralisi wrote:
> > > On Mon, Aug 06, 2018 at 11:20:59AM +0200, Rafael J. Wysocki wrote:
> > >
> > > [...]
> > >
> > > > >>> > @@ -245,6 +248,56 @@ static bool always_on_power_down_ok(struct dev_pm_domain *domain)
> > > > >>> >     return false;
> > > > >>> >  }
> > > > >>> >
> > > > >>> > +static bool cpu_power_down_ok(struct dev_pm_domain *pd)
> > > > >>> > +{
> > > > >>> > +   struct generic_pm_domain *genpd = pd_to_genpd(pd);
> > > > >>> > +   ktime_t domain_wakeup, cpu_wakeup;
> > > > >>> > +   s64 idle_duration_ns;
> > > > >>> > +   int cpu, i;
> > > > >>> > +
> > > > >>> > +   if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
> > > > >>> > +           return true;
> > > > >>> > +
> > > > >>> > +   /*
> > > > >>> > +    * Find the next wakeup for any of the online CPUs within the PM domain
> > > > >>> > +    * and its subdomains. Note, we only need the genpd->cpus, as it already
> > > > >>> > +    * contains a mask of all CPUs from subdomains.
> > > > >>> > +    */
> > > > >>> > +   domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
> > > > >>> > +   for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
> > > > >>> > +           cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
> > > > >>> > +           if (ktime_before(cpu_wakeup, domain_wakeup))
> > > > >>> > +                   domain_wakeup = cpu_wakeup;
> > > > >>> > +   }
> > > > >>
> > > > >> Here's a concern I have missed before. :-/
> > > > >>
> > > > >> Say, one of the CPUs you're walking here is woken up in the meantime.
> > > > >
> > > > > Yes, that can happen - when we mispredicted the "next wakeup".
> > > > >
> > > > >>
> > > > >> I don't think it is valid to evaluate tick_nohz_get_next_wakeup() for it then
> > > > >> to update domain_wakeup.  We really should just avoid the domain power off in
> > > > >> that case at all IMO.
> > > > >
> > > > > Correct.
> > > > >
> > > > > However, we also want to avoid locking contentions in the idle path,
> > > > > which is what this boils down to.
> > > >
> > > > This already is done under genpd_lock() AFAICS, so I'm not quite sure
> > > > what exactly you mean.
> > > >
> > > > Besides, this is not just about increased latency, which is a concern
> > > > by itself but maybe not so much in all environments, but also about
> > > > the possibility of missing a CPU wakeup, which is a major issue.
> > > >
> > > > If one of the CPUs sharing the domain with the current one is woken up
> > > > during cpu_power_down_ok() and the wakeup is an edge-triggered
> > > > interrupt and the domain is turned off regardless, the wakeup may be
> > > > missed entirely if I'm not mistaken.
> > > >
> > > > It looks like there needs to be a way for the hardware to prevent a
> > > > domain poweroff when there's a pending interrupt or I don't quite see
> > > > how this can be handled correctly.
> > > >
> > > > >> Sure enough, if the domain power off is already started and one of the CPUs
> > > > >> in the domain is woken up then, too bad, it will suffer the latency (but in
> > > > >> that case the hardware should be able to help somewhat), but otherwise CPU
> > > > >> wakeup should prevent domain power off from being carried out.
> > > > >
> > > > > The CPU is not prevented from waking up, as we rely on the FW to deal with that.
> > > > >
> > > > > Even if the above computation turns out to wrongly suggest that the
> > > > > cluster can be powered off, the FW shall together with the genpd
> > > > > backend driver prevent it.
> > > >
> > > > Fine, but then the solution depends on specific FW/HW behavior, so I'm
> > > > not sure how generic it really is.  At least, that expectation should
> > > > be clearly documented somewhere, preferably in code comments.
> > > >
> > > > > To cover this case for PSCI, we also use a per cpu variable for the
> > > > > CPU's power off state, as can be seen later in the series.
> > > >
> > > > Oh great, but the generic part should be independent of the underlying
> > > > implementation of the driver.  If it isn't, then it also is not
> > > > generic.
> > > >
> > > > > Hope this clarifies your concern; if not, tell me and I will elaborate a bit more.
> > > >
> > > > Not really.
> > > >
> > > > There also is one more problem and that is the interaction between
> > > > this code and the idle governor.
> > > >
> > > > Namely, the idle governor may select a shallower state for some
> > > > reason, for example due to an additional latency limit derived from
> > > > CPU utilization (like in the menu governor), and how does the code in
> > > > cpu_power_down_ok() know what state has been selected and how does it
> > > > honor the selection made by the idle governor?
> > >
> > > That's a good question and it maybe gives a path towards a solution.
> > >
> > > AFAICS the genPD governor only selects the idle state parameter that
> > > determines the idle state at, say, GenPD cpumask level; it does not
> > > touch the CPUidle decision, which works on a subset of idle states (at
> > > cpu level).
> >
> > I've deferred responding to this as I wasn't quite sure if I followed you
> > at that time, but I'm afraid I'm still not following you now. :-)
> >
> > The idle governor has to take the total worst-case wakeup latency into
> > account.  Not just from the logical CPU itself, but also from whatever
> > state the SoC may end up in as a result of this particular logical CPU
> > going idle, this way or another.
> >
> > So for example, if your logical CPU has an idle state A that may trigger an
> > idle state X at the cluster level (if the other logical CPUs happen to be in
> > the right states and so on), then the worst-case exit latency for that
> > is that of state X.
>
> I will provide an example:
>
> IDLE STATE A (affects CPU {0,1}): exit latency 1ms, min-residency 1.5ms
>
> CPU 0 is about to enter IDLE state A since its "next-event" fulfills the
> residency requirements and exit latency constraints.
>
> CPU 1 is in idle state A (given that CPU 0 is ON, some of the common
> logic shared between CPU {0,1} is still ON, but, as soon as CPU 0
> enters idle state A, CPU {0,1} can enter the "full" idle state A
> power savings mode).
>
> The current CPUidle governor does not check the "next-event" for CPU 1,
> which may wake up in, say, 10us.

Right.

> Requesting IDLE STATE A is then a waste of power (unless firmware or
> hardware peeks at CPU 1's next-event and actually demotes CPU 0's
> request).

OK, I see.

That's because the state is "collaborative", so to speak.  But wasn't
that supposed to be covered by the "coupled" thing?

> The current flat list of idle states has no notion of CPUs sharing an
> idle state request; that's where I think this series kicks in, and
> that's the reason I say that the genPD governor can only demote an
> idle state request.
>
> Linking power domains to idle states is the only sensible way I see to
> define which logical CPUs are affected by an idle state entry; this
> information is missing in the current kernel (whether it's worthwhile
> to add it is another question).

OK, thanks for the clarification!

Cheers,
Rafael
Lorenzo Pieralisi Sept. 14, 2018, 12:30 p.m. UTC | #14
On Fri, Sep 14, 2018 at 01:34:14PM +0200, Rafael J. Wysocki wrote:

[...]

> > > So for example, if your logical CPU has an idle state A that may trigger an
> > > idle state X at the cluster level (if the other logical CPUs happen to be in
> > > the right states and so on), then the worst-case exit latency for that
> > > is that of state X.
> >
> > I will provide an example:
> >
> > IDLE STATE A (affects CPU {0,1}): exit latency 1ms, min-residency 1.5ms
> >
> > CPU 0 is about to enter IDLE state A since its "next-event" fulfills the
> > residency requirements and exit latency constraints.
> >
> > CPU 1 is in idle state A (given that CPU 0 is ON, some of the common
> > logic shared between CPU {0,1} is still ON, but, as soon as CPU 0
> > enters idle state A, CPU {0,1} can enter the "full" idle state A
> > power savings mode).
> >
> > The current CPUidle governor does not check the "next-event" for CPU 1,
> > which may wake up in, say, 10us.
> 
> Right.
> 
> > Requesting IDLE STATE A is then a waste of power (unless firmware or
> > hardware peeks at CPU 1's next-event and actually demotes CPU 0's
> > request).
> 
> OK, I see.
> 
> That's because the state is "collaborative", so to speak.  But wasn't
> that supposed to be covered by the "coupled" thing?

The coupled idle states code was merged because on some early SMP
ARM platforms CPUs had to enter cluster idle states in an orderly
fashion, otherwise the system would break; "coupled" as in
"synchronized idle state entry".

Basically, the coupled idle code fixed a HW bug. The code in this
series instead applies to all arches where an idle state may span
multiple CPUs (x86 inclusive, though as I mentioned it is probably not
needed there, since the FW/HW behind mwait is capable of detecting
whether it's worthwhile to shut down, say, a package; PSCI, whether in
OSI or PC mode, can work the same way).

Entering an idle state spanning multiple CPUs need not be synchronized,
but a sort of cpumask-aware governor may help optimize idle state
selection.

I hope this makes the whole point clearer.

Cheers,
Lorenzo
diff mbox

Patch

diff --git a/drivers/base/power/domain_governor.c b/drivers/base/power/domain_governor.c
index 99896fbf18e4..1aad55719537 100644
--- a/drivers/base/power/domain_governor.c
+++ b/drivers/base/power/domain_governor.c
@@ -10,6 +10,9 @@ 
 #include <linux/pm_domain.h>
 #include <linux/pm_qos.h>
 #include <linux/hrtimer.h>
+#include <linux/cpumask.h>
+#include <linux/ktime.h>
+#include <linux/tick.h>
 
 static int dev_update_qos_constraint(struct device *dev, void *data)
 {
@@ -245,6 +248,56 @@  static bool always_on_power_down_ok(struct dev_pm_domain *domain)
 	return false;
 }
 
+static bool cpu_power_down_ok(struct dev_pm_domain *pd)
+{
+	struct generic_pm_domain *genpd = pd_to_genpd(pd);
+	ktime_t domain_wakeup, cpu_wakeup;
+	s64 idle_duration_ns;
+	int cpu, i;
+
+	if (!(genpd->flags & GENPD_FLAG_CPU_DOMAIN))
+		return true;
+
+	/*
+	 * Find the next wakeup for any of the online CPUs within the PM domain
+	 * and its subdomains. Note, we only need the genpd->cpus, as it already
+	 * contains a mask of all CPUs from subdomains.
+	 */
+	domain_wakeup = ktime_set(KTIME_SEC_MAX, 0);
+	for_each_cpu_and(cpu, genpd->cpus, cpu_online_mask) {
+		cpu_wakeup = tick_nohz_get_next_wakeup(cpu);
+		if (ktime_before(cpu_wakeup, domain_wakeup))
+			domain_wakeup = cpu_wakeup;
+	}
+
+	/* The minimum idle duration is from now - until the next wakeup. */
+	idle_duration_ns = ktime_to_ns(ktime_sub(domain_wakeup, ktime_get()));
+
+	/*
+	 * Find the deepest idle state that has its residency value satisfied
+	 * and by also taking into account the power off latency for the state.
+	 * Start at the deepest supported state.
+	 */
+	i = genpd->state_count - 1;
+	do {
+		if (!genpd->states[i].residency_ns)
+			break;
+
+		/* Check idle_duration_ns >= 0 to compare signed/unsigned. */
+		if (idle_duration_ns >= 0 && idle_duration_ns >=
+		    (genpd->states[i].residency_ns +
+		     genpd->states[i].power_off_latency_ns))
+			break;
+		i--;
+	} while (i >= 0);
+
+	if (i < 0)
+		return false;
+
+	genpd->state_idx = i;
+	return true;
+}
+
 struct dev_power_governor simple_qos_governor = {
 	.suspend_ok = default_suspend_ok,
 	.power_down_ok = default_power_down_ok,
@@ -257,3 +310,8 @@  struct dev_power_governor pm_domain_always_on_gov = {
 	.power_down_ok = always_on_power_down_ok,
 	.suspend_ok = default_suspend_ok,
 };
+
+struct dev_power_governor pm_domain_cpu_gov = {
+	.suspend_ok = NULL,
+	.power_down_ok = cpu_power_down_ok,
+};
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index 2c09cf80b285..97901c833108 100644
--- a/include/linux/pm_domain.h
+++ b/include/linux/pm_domain.h
@@ -160,6 +160,7 @@  int dev_pm_genpd_set_performance_state(struct device *dev, unsigned int state);
 
 extern struct dev_power_governor simple_qos_governor;
 extern struct dev_power_governor pm_domain_always_on_gov;
+extern struct dev_power_governor pm_domain_cpu_gov;
 #else
 
 static inline struct generic_pm_domain_data *dev_gpd_data(struct device *dev)
@@ -203,6 +204,7 @@  static inline int dev_pm_genpd_set_performance_state(struct device *dev,
 
 #define simple_qos_governor		(*(struct dev_power_governor *)(NULL))
 #define pm_domain_always_on_gov		(*(struct dev_power_governor *)(NULL))
+#define pm_domain_cpu_gov		(*(struct dev_power_governor *)(NULL))
 #endif
 
 #ifdef CONFIG_PM_GENERIC_DOMAINS_SLEEP
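
For reference, registering the new governor from platform code would
presumably look along these lines (a hypothetical sketch, not part of
this patch; GENPD_FLAG_CPU_DOMAIN comes from an earlier patch in the
series, pm_genpd_init() is the existing genpd API, and the domain below
is made up):

#include <linux/init.h>
#include <linux/pm_domain.h>

static struct generic_pm_domain cluster_pd = {
	.name = "cluster",		/* illustrative name */
	.flags = GENPD_FLAG_CPU_DOMAIN,	/* mark the domain as holding CPUs */
};

static int __init cluster_pd_setup(void)
{
	/* Let cpu_power_down_ok() gate this domain's power off decisions. */
	return pm_genpd_init(&cluster_pd, &pm_domain_cpu_gov, false);
}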