[RFC,v4,0/6] sched/cpufreq: Make schedutil energy aware

Message ID	20200122173538.1142069-1-douglas.raillard@arm.com (mailing list archive)
From: Douglas RAILLARD <douglas.raillard@arm.com>
To: linux-kernel@vger.kernel.org, rjw@rjwysocki.net, viresh.kumar@linaro.org, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org
Cc: douglas.raillard@arm.com, dietmar.eggemann@arm.com, qperret@google.com, linux-pm@vger.kernel.org
Subject: [RFC PATCH v4 0/6] sched/cpufreq: Make schedutil energy aware
Date: Wed, 22 Jan 2020 17:35:32 +0000

Douglas RAILLARD Jan. 22, 2020, 5:35 p.m. UTC

Make schedutil cpufreq governor energy-aware.

- patch 1 introduces a function to retrieve a frequency given a base
  frequency and an energy cost margin.
- patch 2 links Energy Model perf_domain to sugov_policy.
- patch 3 updates get_next_freq() to make use of the Energy Model.
- patch 4 adds sugov_cpu_ramp_boost() function.
- patch 5 updates sugov_update_(single|shared)() to make use of
  sugov_cpu_ramp_boost().
- patch 6 introduces a tracepoint in get_next_freq() for
  testing/debugging. Since it's not a trace event, it's not exposed to
  userspace in a directly usable way, allowing for painless future
  updates/removal.

The benefits of using the EM in schedutil are twofold:

1) Selecting the highest possible frequency for a given cost. Some
   platforms can have lower frequencies that are less efficient than
   higher ones, in which case they should be skipped for most purposes.
   They can still be useful to give more freedom to thermal throttling
   mechanisms, but not under normal circumstances.
   note: the EM framework will warn about such OPPs "hertz/watts ratio
   non-monotonically decreasing"

2) Driving the frequency selection with power in mind, in addition to
   maximizing the utilization of the non-idle CPUs in the system.
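
As an illustration of point 1), here is a minimal userspace sketch of the
kind of lookup em_pd_get_higher_freq() performs. The struct cap_state type,
the get_higher_freq() name and the table layout are assumptions made for this
example rather than the code from the series. The margin is treated as
absolute, scaled against the cost of the highest OPP as described in the
v3 -> v4 changelog.

```c
#include <assert.h>
#include <stddef.h>

#define EM_COST_MARGIN_SCALE 1024UL

/* Hypothetical stand-in for one Energy Model capacity state. */
struct cap_state {
	unsigned long frequency;	/* kHz */
	unsigned long cost;		/* abstract energy cost of this OPP */
};

/*
 * Return the highest frequency whose cost does not exceed the cost of the
 * lowest OPP at or above min_freq, plus an absolute margin given as a
 * fraction (out of EM_COST_MARGIN_SCALE) of the cost of the highest OPP.
 * The table is assumed to be sorted by ascending frequency.
 */
static unsigned long get_higher_freq(const struct cap_state *table, size_t nr,
				     unsigned long min_freq,
				     unsigned long cost_margin)
{
	unsigned long max_cost = 0;
	size_t i;

	assert(table && nr > 0);

	/* Turn the scaled margin into an absolute cost. */
	cost_margin = (cost_margin * table[nr - 1].cost) / EM_COST_MARGIN_SCALE;

	/* Cost budget: first OPP that satisfies min_freq, plus the margin. */
	for (i = 0; i < nr; i++) {
		if (table[i].frequency >= min_freq) {
			max_cost = table[i].cost + cost_margin;
			break;
		}
	}

	/* Highest OPP that fits in the budget. */
	for (i = nr; i-- > 0;) {
		if (table[i].cost <= max_cost)
			return table[i].frequency;
	}

	/* Nothing fits, e.g. min_freq is above the highest OPP. */
	return min_freq;
}
```

With a table where the 500 MHz OPP costs more than the 800 MHz one, asking
for 500 MHz with a zero margin returns 800 MHz, so the inefficient OPP is
skipped.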

Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
enabled in schedutil by
"sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".

Point 2) is enabled in
"sched/cpufreq: Boost schedutil frequency ramp up". It allows using
higher frequencies when it is known that the true utilization of
currently running tasks is exceeding their previous stable point.
The benefits are:

* Boosting the frequency when the behavior of a runnable task changes,
  leading to an increase in utilization. That shortens the frequency
  ramp up duration, which in turn allows the utilization signal to
  reach stable values quicker.  Since the allowed frequency boost is
  bounded in energy, it will behave consistently across platforms,
  regardless of the OPP cost range.

* The boost is only transient, and should not significantly affect the
  energy consumed by workloads with very stable utilization signals.

This has been lightly tested with an rt-app task ramping from 10% to 75%
utilisation on a big core.

v1 -> v2:

  * Split the new sugov_cpu_ramp_boost() from the existing
    sugov_cpu_is_busy() as they seem to seek a different goal.

  * Implement sugov_cpu_ramp_boost() based on CFS util_avg and
    util_est_enqueued signals, rather than using idle calls count.
    This makes the ramp boost much more accurate in finding boost
    opportunities, and gives a "continuous" output rather than a boolean.

  * Add EM_COST_MARGIN_SCALE=1024 to represent the
    margin values of em_pd_get_higher_freq().

v2 -> v3:

  * Check util_avg >= sg_cpu->util_avg in sugov_cpu_ramp_boost_update()
    to avoid boosting when the utilization is decreasing.

  * Add a tracepoint for testing. 

v3 -> v4:

  * em_pd_get_higher_freq() now interprets the margin as absolute,
    rather than relative to the cost of the base frequency.

  * Modify misleading comment in em_pd_get_higher_freq() since min_freq
    can actually be higher than the max available frequency in normal
    operations.

Douglas RAILLARD (6):
  PM: Introduce em_pd_get_higher_freq()
  sched/cpufreq: Attach perf domain to sugov policy
  sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()
  sched/cpufreq: Introduce sugov_cpu_ramp_boost
  sched/cpufreq: Boost schedutil frequency ramp up
  sched/cpufreq: Add schedutil_em_tp tracepoint

 include/linux/energy_model.h     |  56 ++++++++++++++
 include/trace/events/power.h     |   9 +++
 kernel/sched/cpufreq_schedutil.c | 124 +++++++++++++++++++++++++++++--
 3 files changed, 182 insertions(+), 7 deletions(-)

Douglas RAILLARD Jan. 22, 2020, 6:14 p.m. UTC | #1

Hi Peter,

Since v3 was posted a while ago, here is a short recap of the outstanding
comments:

* The boost margin was relative, but we came to the conclusion that it would
  make more sense to make it absolute (done in this v4).

* The main remaining unclear point was why defining boost=(util - util_est)
  makes sense. The justification is that we use a PELT-shaped signal to drive
  the frequency, so using a PELT-shaped signal for the boost makes sense for
  the same reasons.

AFAIK there are no specific criteria to meet for the shape of the frequency
selection signal for anything other than periodic tasks (if we don't add other
constraints on top), so (util - util_est)=(util - constant) seems as good as
anything else. Especially since util is deemed to be a good fit in practice for
frequency selection. Let me know if I missed anything on that front.


v3 thread: https://lore.kernel.org/lkml/20191011134500.235736-1-douglas.raillard@arm.com/

Cheers,
Douglas


Rafael J. Wysocki Jan. 23, 2020, 3:43 p.m. UTC | #2

On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
<douglas.raillard@arm.com> wrote:
>
> Make schedutil cpufreq governor energy-aware.

I have to say that your terminology is confusing to me, like what
exactly does "energy-aware" mean in the first place?

> - patch 1 introduces a function to retrieve a frequency given a base
>   frequency and an energy cost margin.
> - patch 2 links Energy Model perf_domain to sugov_policy.
> - patch 3 updates get_next_freq() to make use of the Energy Model.
> - patch 4 adds sugov_cpu_ramp_boost() function.
> - patch 5 updates sugov_update_(single|shared)() to make use of
>   sugov_cpu_ramp_boost().
> - patch 6 introduces a tracepoint in get_next_freq() for
>   testing/debugging. Since it's not a trace event, it's not exposed to
>   userspace in a directly usable way, allowing for painless future
>   updates/removal.
>
> The benefits of using the EM in schedutil are twofold:

I guess you mean using the EM directly in schedutil (note that it is
used indirectly already, because of EAS), but that needs to be clearly
stated.

> 1) Selecting the highest possible frequency for a given cost. Some
>    platforms can have lower frequencies that are less efficient than
>    higher ones, in which case they should be skipped for most purposes.
>    They can still be useful to give more freedom to thermal throttling
>    mechanisms, but not under normal circumstances.
>    note: the EM framework will warn about such OPPs "hertz/watts ratio
>    non-monotonically decreasing"

While all of that is fair enough for platforms using the EM, do you
realize that the EM is not available on the majority of architectures
(including some fairly significant ones) and so adding overhead
related to it for all of them is quite less than welcome?

> 2) Driving the frequency selection with power in mind, in addition to
>    maximizing the utilization of the non-idle CPUs in the system.

Care to explain this?  I'm totally unsure what you mean here.

> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
> enabled in schedutil by
> "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".
>
> Point 2) is enabled in
> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
> higher frequencies when it is known that the true utilization of
> currently running tasks is exceeding their previous stable point.

Please explain "true utilization" and "stable point".

> The benefits are:
>
> * Boosting the frequency when the behavior of a runnable task changes,
>   leading to an increase in utilization. That shortens the frequency
>   ramp up duration, which in turns allows the utilization signal to
>   reach stable values quicker.  Since the allowed frequency boost is
>   bounded in energy, it will behave consistently across platforms,
>   regardless of the OPP cost range.

Sounds good.

Can you please describe the algorithm applied to achieve that?

> * The boost is only transient, and should not impact a lot the energy
>   consumed of workloads with very stable utilization signals.

Douglas RAILLARD Jan. 23, 2020, 5:16 p.m. UTC | #3

Hi Rafael,

On 1/23/20 3:43 PM, Rafael J. Wysocki wrote:
> On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
> <douglas.raillard@arm.com> wrote:
>>
>> Make schedutil cpufreq governor energy-aware.
> 
> I have to say that your terminology is confusing to me, like what
> exactly does "energy-aware" mean in the first place?

This would be better phrased as "Make schedutil cpufreq governor use the
energy model", I guess. Schedutil is indeed already energy aware since it
tries to use the lowest frequency possible for the job to be done (kind of).

> 
>> - patch 1 introduces a function to retrieve a frequency given a base
>>   frequency and an energy cost margin.
>> - patch 2 links Energy Model perf_domain to sugov_policy.
>> - patch 3 updates get_next_freq() to make use of the Energy Model.
>> - patch 4 adds sugov_cpu_ramp_boost() function.
>> - patch 5 updates sugov_update_(single|shared)() to make use of
>>   sugov_cpu_ramp_boost().
>> - patch 6 introduces a tracepoint in get_next_freq() for
>>   testing/debugging. Since it's not a trace event, it's not exposed to
>>   userspace in a directly usable way, allowing for painless future
>>   updates/removal.
>>
>> The benefits of using the EM in schedutil are twofold:
> 
> I guess you mean using the EM directly in schedutil (note that it is
> used indirectly already, because of EAS), but that needs to be clearly
> stated.

In the current state (of the code and my knowledge), the EM "leaks" into
schedutil only by the fact that tasks are moved around by EAS, so the
CPU util seen by schedutil is impacted compared to the same workload on
non-EAS setup.

Other than that, the only energy-related information schedutil uses is
the assumption that lower freq == better efficiency. Explicit use of the
EM allows refining this assumption.

> 
>> 1) Selecting the highest possible frequency for a given cost. Some
>>    platforms can have lower frequencies that are less efficient than
>>    higher ones, in which case they should be skipped for most purposes.
>>    They can still be useful to give more freedom to thermal throttling
>>    mechanisms, but not under normal circumstances.
>>    note: the EM framework will warn about such OPPs "hertz/watts ratio
>>    non-monotonically decreasing"
> 
> While all of that is fair enough for platforms using the EM, do you
> realize that the EM is not available on the majority of architectures
> (including some fairly significant ones) and so adding overhead
> related to it for all of them is quite less than welcome?

When CONFIG_ENERGY_MODEL is not defined, em_pd_get_higher_freq() is
defined to a static inline no-op function, so that feature won't incur
overhead (patch 1+2+3).

Patch 4 and 5 do add some new logic that could be used on any platform.
Current code will use the boost as an energy margin, but it would be
straightforward to make a util-based version (like iowait boost) on
non-EM platforms.

>> 2) Driving the frequency selection with power in mind, in addition to
>>    maximizing the utilization of the non-idle CPUs in the system.
> 
> Care to explain this?  I'm totally unsure what you mean here.

Currently, schedutil is basically tailoring the CPU capacity to the util
of the tasks on it. That's all good for periodic tasks, but there are
situations where we can do better than assuming the task is periodic
with a fixed duty cycle.

The case improved by this series is when a task increases its duty
cycle. In that specific case, it can be a good idea to increase the
frequency until the util stabilizes again. We don't have a crystal ball,
so we can't adjust the freq right away. However, we do want to avoid
leaving the task starved for speed until schedutil realizes it needs it.
Using the EM here allows boosting within reasonable limits, without
destroying the average energy consumption.

> 
>> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
>> enabled in schedutil by
>> "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".
>>
>> Point 2) is enabled in
>> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
>> higher frequencies when it is known that the true utilization of
>> currently running tasks is exceeding their previous stable point.
> 
> Please explain "true utilization" and "stable point".

"true utilization" would be an instantaneous duty cycle. If a task
suddenly starts doing twice as much work, its "true utilization" will
double instantly. "stable point" would be util est enqueued here. If a
task is periodic, util est enqueued will be constant once it reaches a
steady state. As soon as the duty cycle of the task changes, util est
enqueued will change.

> 
>> The benefits are:
>>
>> * Boosting the frequency when the behavior of a runnable task changes,
>>   leading to an increase in utilization. That shortens the frequency
>>   ramp up duration, which in turns allows the utilization signal to
>>   reach stable values quicker.  Since the allowed frequency boost is
>>   bounded in energy, it will behave consistently across platforms,
>>   regardless of the OPP cost range.
> 
> Sounds good.
> 
> Can you please describe the algorithm applied to achieve that?

The util est enqueued of a task is basically a snapshot of the util of
the task just before it's dequeued. This means that when the util has
stabilized, util est enqueued will be a constant signal. Specifically,
util est enqueued will be an upper bound of the swing of util avg.

When the task starts doing more work than at the previous activation,
its util avg will rise above the current util est enqueued. This means
we cannot assume anymore that util est enqueued represents an upper
bound of the duty cycle, so we can decide to boost until util avg
"stabilizes" again [note].

At the CPU level, we can track that in the rq aggregated signals:
  - "stable rq's util est enqueued" is assumed to mean "same set of
enqueued tasks as the last time we looked at that rq".

  - task util est enqueued and util avg can be replaced by the rq
signal. This will hide cases where a task's util increases while another
one decreases by the same amount.

The limitations of both assumptions can be fixed by more invasive
changes (a rq cookie to know the set of enqueued tasks and an
OR-aggregated per-task flag to ask for boosting), but these heuristics
allow using the existing signals with changes limited to schedutil.

Once we detected this situation, we can decide to boost. We don't want
black&white boosting, since a tiny increase in util should lead to a
tiny boost. Here, we use (util - util_est_enqueued). If the increase is
small, that boost will be small.
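
The detection described above can be sketched as follows, as a standalone
userspace approximation. struct ramp_state and ramp_boost_update() are
illustrative names assumed for this example; the inputs stand for the
rq-level util est enqueued and util avg sampled at each schedutil update.

```c
#include <assert.h>

/* Signals remembered from the previous update (hypothetical layout). */
struct ramp_state {
	unsigned long util_est_enqueued;
	unsigned long util_avg;
};

/*
 * Return the boost for this update and remember the new signal values.
 * A boost is only granted when:
 *  - util est enqueued is unchanged (assumed: same set of enqueued tasks),
 *  - util avg is not decreasing,
 *  - util avg has risen above util est enqueued.
 * The boost is proportional to how far util avg overshoots.
 */
static unsigned long ramp_boost_update(struct ramp_state *prev,
				       unsigned long util_est_enqueued,
				       unsigned long util_avg)
{
	unsigned long boost = 0;

	assert(prev);

	if (util_est_enqueued == prev->util_est_enqueued &&
	    util_avg >= prev->util_avg &&
	    util_avg > util_est_enqueued)
		boost = util_avg - util_est_enqueued;

	prev->util_est_enqueued = util_est_enqueued;
	prev->util_avg = util_avg;

	return boost;
}
```

A small overshoot of util avg over util est enqueued gives a small boost, and
the boost drops back to zero as soon as util avg starts decreasing or the
rq's util est enqueued changes.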

[note]:
util avg of a periodic task never actually stabilizes, it just enters an
interval and never leaves it. When the duty cycle changes, it will leave
that interval to enter another one. The centre of that interval is the
task's duty cycle.
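
To make the note more concrete, here is a simplified floating-point model of
a PELT-like signal for a strictly periodic task. It is only a sketch: it
assumes 1 ms steps and a decay factor y with y^32 = 0.5, and ignores the
integer arithmetic and finer-grained accounting of the real implementation.
periodic_util_range() and struct util_range are names made up for this
example.

```c
#include <assert.h>

#define PELT_Y		0.97857206	/* y such that y^32 is about 0.5 */
#define UTIL_MAX	1024.0

struct util_range {
	double min;
	double max;
};

/*
 * Simulate a task running run_ms out of every period_ms, and return the
 * interval covered by the signal during the last simulated period.
 */
static struct util_range periodic_util_range(unsigned int run_ms,
					     unsigned int period_ms,
					     unsigned int nr_periods)
{
	struct util_range r = { UTIL_MAX, 0.0 };
	double util = 0.0;
	unsigned int p, t;

	assert(period_ms > 0 && run_ms <= period_ms && nr_periods > 0);

	for (p = 0; p < nr_periods; p++) {
		for (t = 0; t < period_ms; t++) {
			/* Geometric decay, plus a contribution while running. */
			util *= PELT_Y;
			if (t < run_ms)
				util += UTIL_MAX * (1.0 - PELT_Y);

			/* Only record the interval once the signal has settled. */
			if (p == nr_periods - 1) {
				if (util < r.min)
					r.min = util;
				if (util > r.max)
					r.max = util;
			}
		}
	}

	return r;
}
```

For a task running 4 ms every 16 ms, the signal in this model settles into
an interval that contains 256 (25% of 1024) without ever converging to a
single value.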

>> * The boost is only transient, and should not impact a lot the energy
>>   consumed of workloads with very stable utilization signals.

Thanks,
Douglas

Vincent Guittot Jan. 27, 2020, 5:16 p.m. UTC | #4

On Wed, 22 Jan 2020 at 18:36, Douglas RAILLARD <douglas.raillard@arm.com> wrote:
>
> Make schedutil cpufreq governor energy-aware.
>
> - patch 1 introduces a function to retrieve a frequency given a base
>   frequency and an energy cost margin.
> - patch 2 links Energy Model perf_domain to sugov_policy.
> - patch 3 updates get_next_freq() to make use of the Energy Model.
> - patch 4 adds sugov_cpu_ramp_boost() function.
> - patch 5 updates sugov_update_(single|shared)() to make use of
>   sugov_cpu_ramp_boost().
> - patch 6 introduces a tracepoint in get_next_freq() for
>   testing/debugging. Since it's not a trace event, it's not exposed to
>   userspace in a directly usable way, allowing for painless future
>   updates/removal.
>
> The benefits of using the EM in schedutil are twofold:
>
> 1) Selecting the highest possible frequency for a given cost. Some
>    platforms can have lower frequencies that are less efficient than
>    higher ones, in which case they should be skipped for most purposes.

This makes sense. Why use a lower frequency when a higher one is more
power efficient?

>    They can still be useful to give more freedom to thermal throttling
>    mechanisms, but not under normal circumstances.
>    note: the EM framework will warn about such OPPs "hertz/watts ratio
>    non-monotonically decreasing"
>
> 2) Driving the frequency selection with power in mind, in addition to
>    maximizing the utilization of the non-idle CPUs in the system.
>
> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
> enabled in schedutil by
> "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".
>
> Point 2) is enabled in
> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
> higher frequencies when it is known that the true utilization of
> currently running tasks is exceeding their previous stable point.
> The benefits are:
>
> * Boosting the frequency when the behavior of a runnable task changes,
>   leading to an increase in utilization. That shortens the frequency
>   ramp up duration, which in turns allows the utilization signal to
>   reach stable values quicker.  Since the allowed frequency boost is
>   bounded in energy, it will behave consistently across platforms,
>   regardless of the OPP cost range.

Could you explain this a bit more ?

>
> * The boost is only transient, and should not impact a lot the energy
>   consumed of workloads with very stable utilization signals.
>
> This has been lightly tested with a rtapp task ramping from 10% to 75%
> utilisation on a big core.

Which kind of UC are you targeting ?

Do you have some benchmark showing the benefit and how you can bound
the increase of energy ?

The benefit of point 2 is less obvious to me. We already have uclamp,
which helps to override the "utilization" that is seen by schedutil
to boost or cap the frequency when some tasks are running. I'm curious
to see what the benefit of this would be on top.


Douglas RAILLARD Feb. 10, 2020, 11:37 a.m. UTC | #5

Hi Vincent,

On 1/27/20 5:16 PM, Vincent Guittot wrote:
> On Wed, 22 Jan 2020 at 18:36, Douglas RAILLARD <douglas.raillard@arm.com> wrote:
>>
>> Make schedutil cpufreq governor energy-aware.
>>
>> - patch 1 introduces a function to retrieve a frequency given a base
>>   frequency and an energy cost margin.
>> - patch 2 links Energy Model perf_domain to sugov_policy.
>> - patch 3 updates get_next_freq() to make use of the Energy Model.
>> - patch 4 adds sugov_cpu_ramp_boost() function.
>> - patch 5 updates sugov_update_(single|shared)() to make use of
>>   sugov_cpu_ramp_boost().
>> - patch 6 introduces a tracepoint in get_next_freq() for
>>   testing/debugging. Since it's not a trace event, it's not exposed to
>>   userspace in a directly usable way, allowing for painless future
>>   updates/removal.
>>
>> The benefits of using the EM in schedutil are twofold:
>>
>> 1) Selecting the highest possible frequency for a given cost. Some
>>    platforms can have lower frequencies that are less efficient than
>>    higher ones, in which case they should be skipped for most purposes.
> 
> This make sense. Why using a lower frequency when a higher one is more
> power efficient

Apparently in some cases it can be useful for thermal capping. AFAIU the
alternate solution is to race to idle with a more efficient OPP (idle
injection work of Linaro).

>>    They can still be useful to give more freedom to thermal throttling
>>    mechanisms, but not under normal circumstances.
>>    note: the EM framework will warn about such OPPs "hertz/watts ratio
>>    non-monotonically decreasing"
>>
>> 2) Driving the frequency selection with power in mind, in addition to
>>    maximizing the utilization of the non-idle CPUs in the system.
>>
>> Point 1) is implemented in "PM: Introduce em_pd_get_higher_freq()" and
>> enabled in schedutil by
>> "sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()".
>>
>> Point 2) is enabled in
>> "sched/cpufreq: Boost schedutil frequency ramp up". It allows using
>> higher frequencies when it is known that the true utilization of
>> currently running tasks is exceeding their previous stable point.
>> The benefits are:
>>
>> * Boosting the frequency when the behavior of a runnable task changes,
>>   leading to an increase in utilization. That shortens the frequency
>>   ramp up duration, which in turns allows the utilization signal to
>>   reach stable values quicker.  Since the allowed frequency boost is
>>   bounded in energy, it will behave consistently across platforms,
>>   regardless of the OPP cost range.
> 
> Could you explain this a bit more ?

The goal is to detect when the task starts asking for more CPU time than it
did during the previous period. At this stage, we don't know how much
more, so we increase the frequency faster to allow signals to settle
more quickly.

The PELT signal does increase independently of the chosen frequency,
but only up until idle time shows up. At this point, the util
will drop again, and the frequency with it.

>>
>> * The boost is only transient, and should not impact a lot the energy
>>   consumed of workloads with very stable utilization signals.
>>
>> This has been lightly tested with a rtapp task ramping from 10% to 75%
>> utilisation on a big core.
> 
> Which kind of UC are you targeting ?

One case is tasks with "random" behavior, like threads in thread pools
that can end up doing very different things. There may be other cases as
well, but I'll need to do more extensive testing with actual applications.

> 
> Do you have some benchmark showing the benefit and how you can bound
> the increase of energy ?

In the test setup described above, it increases the energy consumption
by ~2.5%.

I also did some preliminary experiments to reduce the margin taken in
map_util_freq(), which becomes less necessary if the frequency is
boosted in the cases where the util increase is getting out of hand. That
can recover some amount of lost power.

The real cost in practice heavily depends on:
* the workload (if its util jumps around, it will boost more frequently)
* the discrete frequencies available (if boosting does not bring us to
the next freq, no boost is actually applied).

> 
> The benefit of point2 is less obvious for me. We already have uclamp
> which helps to overwrite the "utilization" that is seen by schedutil
> to boost or cap the frequency when some tasks are running. I'm curious
> to see what would be the benefit of this on top.

uclamp is only useful when a target utilization is known beforehand by
the task itself or some kind of manager. In all the cases relying on
plain PELT, we can decrease the freq change reaction time.

Note that schedutil is already built around the duty cycle detection
with a bias for higher frequency when the task period increases (using
util est enqueued). What this series brings is a way to detect when util
est enqueued turns from a set-point into a lower bound.


Peter Zijlstra Feb. 10, 2020, 1:21 p.m. UTC | #6

On Wed, Jan 22, 2020 at 06:14:24PM +0000, Douglas Raillard wrote:
> Hi Peter,
> 
> Since the v3 was posted a while ago, here is a short recap of the hanging
> comments:
> 
> * The boost margin was relative, but we came to the conclusion it would make
>   more sense to make it absolute (done in that v4).

As per (patch #1):

+       max_cost = pd->table[pd->nr_cap_states - 1].cost;
+       cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;

So we'll allow the boost to double energy consumption (or rather, since
you cannot go above the max OPP, we're allowed that).

> * The main remaining blur point was why defining boost=(util - util_est) makes
>   sense. The justification for that is that we use PELT-shaped signal to drive
>   the frequency, so using a PELT-shaped signal for the boost makes sense for the
>   same reasons.

As per (patch #4):

+       unsigned long boost = 0;

+       if (util_est_enqueued == sg_cpu->util_est_enqueued &&
+           util_avg >= sg_cpu->util_avg &&
+           util_avg > util_est_enqueued)
+               boost = util_avg - util_est_enqueued;

The result of that is not, strictly speaking, a PELT-shaped signal,
although when it is non-zero the curves are similar, albeit offset.

> AFAIK there is no specific criteria to meet for frequency selection signal shape
> for anything else than periodic tasks (if we don't add other constraints on
> top), so (util - util_est)=(util - constant) seems as good as anything else.
> Especially since util is deemed to be a good fit in practice for frequency
> selection. Let me know if I missed anything on that front.

Given:

  sugov_get_util() <- cpu_util_cfs() <- UTIL_EST ? util_est.enqueued : util_avg.

our next_f becomes:

  next_f = 1.25 * util_est * max_freq / max;

so our min_freq in em_pd_get_higher_freq() will already be compensated
for the offset.

So even when:

  boost = util_avg - util_est

is small, despite util_avg being huge (~1024), due to large util_est,
we'll still get an effective boost to max_cost ASSUMING cs[].cost and
cost_margin have the same curve.

They have not.

assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a
factor f^2 on the table.

So the higher the min_freq, the less effective the boost.

Maybe it all works out in practice, but I'm missing a big picture
description of it all somewhere.

Peter Zijlstra Feb. 10, 2020, 1:30 p.m. UTC | #7

On Thu, Jan 23, 2020 at 05:16:52PM +0000, Douglas Raillard wrote:
> Hi Rafael,
> 
> On 1/23/20 3:43 PM, Rafael J. Wysocki wrote:
> > On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
> > <douglas.raillard@arm.com> wrote:
> >>
> >> Make schedutil cpufreq governor energy-aware.
> > 
> > I have to say that your terminology is confusing to me, like what
> > exactly does "energy-aware" mean in the first place?
> 
> Should be better rephrased as "Make schedutil cpufreq governor use the
> energy model" I guess. Schedutil is indeed already energy aware since it
> tries to use the lowest frequency possible for the job to be done (kind of).

So ARM64 will soon get x86-like power management if I read these here
patches right:

  https://lkml.kernel.org/r/20191218182607.21607-2-ionela.voinescu@arm.com

And I'm thinking a part of Rafael's concerns will also apply to those
platforms.

> Other than that, the only energy-related information schedutil uses is
> the assumption that lower freq == better efficiency. Explicit use of the
> EM allows to refine this assumption.

I'm thinking that such platforms guarantee this on their own, if not,
there just isn't anything we can do about it, so that assumption is
fair.

(I've always found it weird to have less efficient OPPs listed anyway)

> >> 1) Selecting the highest possible frequency for a given cost. Some
> >>    platforms can have lower frequencies that are less efficient than
> >>    higher ones, in which case they should be skipped for most purposes.
> >>    They can still be useful to give more freedom to thermal throttling
> >>    mechanisms, but not under normal circumstances.
> >>    note: the EM framework will warn about such OPPs "hertz/watts ratio
> >>    non-monotonically decreasing"
> > 
> > While all of that is fair enough for platforms using the EM, do you
> > realize that the EM is not available on the majority of architectures
> > (including some fairly significant ones) and so adding overhead
> > related to it for all of them is quite less than welcome?
> 
> When CONFIG_ENERGY_MODEL is not defined, em_pd_get_higher_freq() is
> defined to a static inline no-op function, so that feature won't incur
> overhead (patch 1+2+3).
> 
> Patch 4 and 5 do add some new logic that could be used on any platform.
> Current code will use the boost as an energy margin, but it would be
> straightforward to make a util-based version (like iowait boost) on
> non-EM platforms.

Right, so the condition 'util_avg > util_est' makes sense to trigger
some sort of boost off of.

What kind would make sense for these platforms? One possibility would be,
instead of frobbing the energy margin as you do here, to frob the C
in get_next_freq().

(I have vague memories of this being proposed earlier; it also avoids
that double OPP iteration thing complained about elsewhere in this
thread if I'm not mistaken).


That is; I'm thinking it is important (esp. now that we got frequency
invariance sorted for x86), to have this patch also work for !EM
architectures (as those ARM64-AMU things would be).

Douglas RAILLARD Feb. 13, 2020, 11:55 a.m. UTC | #8

On 2/10/20 1:30 PM, Peter Zijlstra wrote:
> On Thu, Jan 23, 2020 at 05:16:52PM +0000, Douglas Raillard wrote:
>> Hi Rafael,
>>
>> On 1/23/20 3:43 PM, Rafael J. Wysocki wrote:
>>> On Wed, Jan 22, 2020 at 6:36 PM Douglas RAILLARD
>>> <douglas.raillard@arm.com> wrote:
>>>>
>>>> Make schedutil cpufreq governor energy-aware.
>>>
>>> I have to say that your terminology is confusing to me, like what
>>> exactly does "energy-aware" mean in the first place?
>>
>> Should be better rephrased as "Make schedutil cpufreq governor use the
>> energy model" I guess. Schedutil is indeed already energy aware since it
>> tries to use the lowest frequency possible for the job to be done (kind of).
> 
> So ARM64 will soon get x86-like power management if I read these here
> patches right:
> 
>   https://lkml.kernel.org/r/20191218182607.21607-2-ionela.voinescu@arm.com
> 
> And I'm thinking a part of Rafael's concerns will also apply to those
> platforms.

AFAIU there is an important difference: ARM64 firmware should not end up
increasing frequency on its own, it should only cap the frequency. That
means that the situation stays the same for that boost:

Let's say you let schedutil select a freq that is +2% more power
hungry. That will probably not be enough to make it jump to the next
OPP, so you end up not boosting. Now if there is a firmware that decides
for some reason to cap the frequency, it will be a similar situation.

> 
>> Other than that, the only energy-related information schedutil uses is
>> the assumption that lower freq == better efficiency. Explicit use of the
>> EM allows to refine this assumption.
> 
> I'm thinking that such platforms guarantee this on their own, if not,
> there just isn't anything we can do about it, so that assumption is
> fair.
> 
> (I've always found it weird to have less efficient OPPs listed anyway)

Ultimately, (mostly) the piece of code involved in thermal capping needs
to know about these inefficient OPPs (be it the firmware or some kernel
subsystem). The rest of the world doesn't need to care.

>>>> 1) Selecting the highest possible frequency for a given cost. Some
>>>>    platforms can have lower frequencies that are less efficient than
>>>>    higher ones, in which case they should be skipped for most purposes.
>>>>    They can still be useful to give more freedom to thermal throttling
>>>>    mechanisms, but not under normal circumstances.
>>>>    note: the EM framework will warn about such OPPs "hertz/watts ratio
>>>>    non-monotonically decreasing"
>>>
>>> While all of that is fair enough for platforms using the EM, do you
>>> realize that the EM is not available on the majority of architectures
>>> (including some fairly significant ones) and so adding overhead
>>> related to it for all of them is quite less than welcome?
>>
>> When CONFIG_ENERGY_MODEL is not defined, em_pd_get_higher_freq() is
>> defined to a static inline no-op function, so that feature won't incur
>> overhead (patch 1+2+3).
>>
>> Patch 4 and 5 do add some new logic that could be used on any platform.
>> Current code will use the boost as an energy margin, but it would be
>> straightforward to make a util-based version (like iowait boost) on
>> non-EM platforms.
> 
> Right, so the condition 'util_avg > util_est' makes sense to trigger
> some sort of boost off of.
> 
> What kind would make sense for these platforms? One possibility would be
> to instead of frobbing the energy margin, as you do here, to frob the C
> in get_next_freq().

If I'm correct, changing the C value would be somewhat similar to the
relative boosting I had in a previous version. Maybe adding a fixed
offset would give more predictable results as was discussed with Vincent
Guittot. In any case, it would change the perceived util (like iowait
boost).

> (I have vague memories of this being proposed earlier; it also avoids
> that double OPP iteration thing complained about elsewhere in this
> thread if I'm not mistaken).

It should be possible to get rid of the double iteration mentioned by
Quentin. Choosing to boost the util or the energy boils down to:

1) If you care more about predictable battery life (or energy bill) than
predictability of the boost feature, EM should be used.

2) If you don't have an EM or you care more about having a predictable
boost for a given workload, use util (or disable that boost).

The rationale is that with 1), you will get a different speed boost for a
given workload depending on the other things executing at the same time,
as the speed up is not linear with the task-related metric (util -
util_est). If you are already at high freq because of another workload,
the speed up will be small because the next +100 MHz will cost much more
than the same +100 MHz delta starting from a low OPP.
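
To make the last point concrete, here is a minimal sketch. It assumes a
continuous, normalized cost curve cost(f) = f^3 instead of a real EM
table, and the helper names boosted_freq() and freq_gain() are
hypothetical, not part of the series. It only illustrates that the same
energy margin buys less frequency when starting from a high OPP.

```python
# Illustrative only: cost(f) = f**3 on a normalized [0, 1] frequency axis.
# A real Energy Model is a discrete table of OPPs with measured costs.

def boosted_freq(base_freq: float, margin: float) -> float:
    """Highest normalized frequency whose cost fits in cost(base) + margin."""
    allowed_cost = base_freq ** 3 + margin
    # Frequencies are capped at 1.0, i.e. the highest OPP.
    return min(1.0, allowed_cost ** (1.0 / 3.0))


def freq_gain(base_freq: float, margin: float) -> float:
    """Frequency increase obtained by spending `margin` extra cost."""
    return boosted_freq(base_freq, margin) - base_freq


# Same margin (5% of the maximum cost), two different starting points.
low = freq_gain(0.2, 0.05)   # starting from a low frequency
high = freq_gain(0.8, 0.05)  # starting from a high frequency
print(f"gain from f=0.2: {low:.3f}, gain from f=0.8: {high:.3f}")
```

Under this assumed cubic model the gain from the low starting point is
several times larger than from the high one, and it drops to zero once
the base frequency is already the maximum.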

> That is; I'm thinking it is important (esp. now that we got frequency
> invariance sorted for x86), to have this patch also work for !EM
> architectures (as those ARM64-AMU things would be).

For sure, that feature is supposed to help in cases that would be
impossible to pinpoint with hardware, since it has to know what tasks
execute.

Peter Zijlstra Feb. 13, 2020, 1:20 p.m. UTC | #9

On Thu, Feb 13, 2020 at 11:55:32AM +0000, Douglas Raillard wrote:
> On 2/10/20 1:30 PM, Peter Zijlstra wrote:

> > So ARM64 will soon get x86-like power management if I read these here
> > patches right:
> > 
> >   https://lkml.kernel.org/r/20191218182607.21607-2-ionela.voinescu@arm.com
> > 
> > And I'm thinking a part of Rafael's concerns will also apply to those
> > platforms.
> 
> AFAIU there is an important difference: ARM64 firmware should not end up
> increasing frequency on its own, it should only cap the frequency. That
> means that the situation stays the same for that boost:
> 
> Let's say you let schedutil select a freq that is +2% more power
> hungry. That will probably not be enough to make it jump to the next
> OPP, so you end up not boosting. Now if there is a firmware that decides
> for some reason to cap the frequency, it will be a similar situation.

The moment you give out OPP selection to a 3rd party (be it firmware or
a micro-controller) things are uncertain at best anyway.

Still, in general, if you give it higher input, it tends to at least
consider going faster -- which might be all you can ask for...

So I'm not exactly seeing what your argument is here.

> > Right, so the condition 'util_avg > util_est' makes sense to trigger
> > some sort of boost off of.
> > 
> > What kind would make sense for these platforms? One possibility would be
> > to instead of frobbing the energy margin, as you do here, to frob the C
> > in get_next_freq().
> 
> If I'm correct, changing the C value would be somewhat similar to the
> relative boosting I had in a previous version. Maybe adding a fixed
> offset would give more predictable results as was discussed with Vincent
> Guittot. In any case, it would change the perceived util (like iowait
> boost).

It depends a bit on what you change C into. If we do something trivial
like:
		1.25 ; !(util_avg > util_est)
	C := {
		2    ;  (util_avg > util_est)

ie. a binary selection of constants, then yes, I suppose that is the
case.

But nothing stops us from making it more complicated; or having it
depend on the presence of EM data.
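
A minimal sketch of that binary selection, for illustration only. It
assumes the simplified mapping next_f = C * util * max_freq / max quoted
earlier in the thread, clamped to max_freq; pick_c() and the Python
get_next_freq() below are hypothetical stand-ins and do not reproduce the
kernel function.

```python
def pick_c(util_avg: int, util_est: int) -> float:
    """Binary choice of the headroom constant C, as suggested above."""
    return 2.0 if util_avg > util_est else 1.25


def get_next_freq(util: int, max_cap: int, max_freq: int, c: float = 1.25) -> int:
    """Simplified mapping: C * util * max_freq / max, clamped to max_freq."""
    return min(max_freq, int(c * util * max_freq / max_cap))


# Steady state: util_avg == util_est, so the usual 1.25 headroom applies.
steady = get_next_freq(512, 1024, 2_000_000, pick_c(512, 512))
# Ramp up: util_avg > util_est, so the request is made with C = 2.
ramping = get_next_freq(400, 1024, 2_000_000, pick_c(450, 400))
print(steady, ramping)
```

Because C scales the request, the extra frequency obtained this way is
proportional to util, which is why it behaves like a relative boost.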

> > (I have vague memories of this being proposed earlier; it also avoids
> > that double OPP iteration thing complained about elsewhere in this
> > thread if I'm not mistaken).
> 
> It should be possible to get rid of the double iteration mentioned by
> Quentin. Choosing to boost the util or the energy boils down to:
> 
> 1) If you care more about predictable battery life (or energy bill) than
> predictability of the boost feature, EM should be used.
> 
> 2) If you don't have an EM or you care more about having a predictable
> boost for a given workload, use util (or disable that boost).
> 
> The rationale is that with 1), you will get a different speed boost for a
> given workload depending on the other things executing at the same time,
> as the speed up is not linear with the task-related metric (util -
> util_est). If you are already at high freq because of another workload,
> the speed up will be small because the next +100 MHz will cost much more
> than the same +100 MHz delta starting from a low OPP.

It's just that I'm not seeing how 1 actually works or provides that more
predictable battery life I suppose. We have this other sub-thread to
argue about that :-)

> > That is; I'm thinking it is important (esp. now that we got frequency
> > invariance sorted for x86), to have this patch also work for !EM
> > architectures (as those ARM64-AMU things would be).
> 
> For sure, that feature is supposed to help in cases that would be
> impossible to pinpoint with hardware, since it has to know what tasks
> execute.

OK, so I'm thinking we're agreeing that it would be good to have this
support !EM systems too.

Douglas RAILLARD Feb. 13, 2020, 5:49 p.m. UTC | #10

On 2/10/20 1:21 PM, Peter Zijlstra wrote:
> On Wed, Jan 22, 2020 at 06:14:24PM +0000, Douglas Raillard wrote:
>> Hi Peter,
>>
>> Since the v3 was posted a while ago, here is a short recap of the hanging
>> comments:
>>
>> * The boost margin was relative, but we came to the conclusion it would make
>>   more sense to make it absolute (done in that v4).
> 
> As per (patch #1):
> 
> +       max_cost = pd->table[pd->nr_cap_states - 1].cost;
> +       cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;
> 
> So we'll allow the boost to double energy consumption (or rather, since
> you cannot go above the max OPP, we're allowed that).

Indeed. This might need some tweaking based on testing, maybe +50% is
enough, or maybe +200% is even better.

>> * The main remaining blur point was why defining boost=(util - util_est) makes
>>   sense. The justification for that is that we use PELT-shaped signal to drive
>>   the frequency, so using a PELT-shaped signal for the boost makes sense for the
>>   same reasons.
> 
> As per (patch #4):
> 
> +       unsigned long boost = 0;
> 
> +       if (util_est_enqueued == sg_cpu->util_est_enqueued &&
> +           util_avg >= sg_cpu->util_avg &&
> +           util_avg > util_est_enqueued)
> +               boost = util_avg - util_est_enqueued;
> 
> The result of that is not, strictly speaking, a PELT shaped signal.
> Although when it is !0 the curves are similar, albeit offset.

Yes, it has the same rate of increase as PELT.

> 
>> AFAIK there is no specific criteria to meet for frequency selection signal shape
>> for anything else than periodic tasks (if we don't add other constraints on
>> top), so (util - util_est)=(util - constant) seems as good as anything else.
>> Especially since util is deemed to be a good fit in practice for frequency
>> selection. Let me know if I missed anything on that front.
> 
> 
> Given:
> 
>   sugov_get_util() <- cpu_util_cfs() <- UTIL_EST ? util_est.enqueued : util_avg.

cpu_util_cfs uses max_t (maybe irrelevant for this discussion):
UTIL_EST ? max(util_est.enqueued, util_avg) : util_avg

> our next_f becomes:
> 
>   next_f = 1.25 * util_est * max_freq / max;

> so our min_freq in em_pd_get_higher_freq() will already be compensated
> for the offset.

Yes, the boost is added on top of the existing behavior.

> So even when:
> 
>   boost = util_avg - util_est
> 
> is small, despite util_avg being huge (~1024), due to large util_est,
> we'll still get an effective boost to max_cost ASSUMING cs[].cost and
> cost_margin have the same curve.

I'm not sure I follow: cs[].cost can be plotted against cs[].freq, but
cost_margin is a time-based signal (the boost value), so it would be
plotted against time.

> 
> They have not.
> 
> assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a
> factor f^2 on the table.

I'm guessing that you arrived at `cost_margin ~ f` this way:

cost_margin = util - util_est_enqueued
cost_margin = util - constant

# with constant small enough
cost_margin ~ util

# with util ~ 1/f
cost_margin ~ 1/f

In the case you describe, `constant` is actually almost equal to `util`,
so `cost_margin !~ util`, and this series assumes frequency-invariant
util_avg, so `util !~ 1/f` (I'll probably have to fix that).

> So the higher the min_freq, the less effective the boost.

Yes, since the boost is allowing a fixed amount of extra power. Higher
OPPs are less efficient than lower ones, so if min_freq is high, we
won't speed up as much as if min_freq was low.

> Maybe it all works out in practice, but I'm missing a big picture

Here is a big picture :)

https://gist.github.com/douglas-raillard-arm/f76586428836ec70c6db372993e0b731#file-ramp_boost-svg

The board is a Juno R0, with a periodic task pinned on a big CPU
(capa=1024):
* phase 1:  5% duty cycle (=51 PELT units)
* phase 2: 75% duty cycle (=768 PELT units)

Legend:
* blue square wave: when the task executes (like in kernelshark)
* base_cost = cost of frequency as selected by schedutil in normal
operations
* allowed_cost = base_cost + cost_margin
* util = util_avg

note: the small gaps right after the duty cycle transition between
t=4.15 and 4.25 are due to sugov task executing, so there is no dequeue
and no util_est update.

> description of it all somewhere.

Now a textual version of it:

em_pd_get_higher_freq() does the following:

# Turn the abstract cost margin on the EM_COST_MARGIN_SCALE into a
# concrete value. cost_margin=EM_COST_MARGIN_SCALE will give a concrete
# value of "max_cost", which is the cost of the highest OPP on that CPU.
concrete_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;

# Then it finds the lowest OPP satisfying min_freq:
min_opp = OPP_AT_FREQ(min_freq)

# It takes the associated cost, adds the concrete margin, and finds the
# highest OPP whose cost does not exceed that budget:
allowed_cost = COST_OF(min_opp) + concrete_margin

final_freq = MAX(
	FREQ_OF(opp)
	for opp in available_opps
	if COST_OF(opp) <= allowed_cost
)

So this means that:
   util - util_est_enqueued ~= 0
=> cost_margin              ~= 0
=> concrete_margin          ~= 0
=> allowed_cost = COST_OF(min_opp) + 0
=> final_freq   = FREQ_OF(min_opp)

The effective boost is ~0, so you will get the current behaviour of
schedutil.
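
The textual description above can be turned into a small runnable
sketch. This is not the kernel implementation: the OPP table is made up,
higher_freq() is a hypothetical stand-in for em_pd_get_higher_freq(), and
falling back to the highest OPP when min_freq exceeds every available
frequency is an assumption of this sketch.

```python
from typing import List, Tuple

EM_COST_MARGIN_SCALE = 1024

# Hypothetical perf domain: (frequency in kHz, cost), sorted by frequency.
# The numbers are invented for illustration.
OPPS: List[Tuple[int, int]] = [
    (500_000, 100),
    (1_000_000, 250),
    (1_500_000, 500),
    (2_000_000, 1000),
]


def higher_freq(opps: List[Tuple[int, int]], min_freq: int, cost_margin: int) -> int:
    """Highest frequency whose cost fits in cost(min_opp) + margin."""
    # Cost of the highest OPP, used to scale the abstract margin.
    max_cost = opps[-1][1]
    concrete_margin = cost_margin * max_cost // EM_COST_MARGIN_SCALE

    # Lowest OPP satisfying min_freq (assumption: use the highest OPP
    # if min_freq is above every available frequency).
    min_opp = next((opp for opp in opps if opp[0] >= min_freq), opps[-1])

    allowed_cost = min_opp[1] + concrete_margin
    return max(freq for freq, cost in opps if cost <= allowed_cost)


# No margin: the result is just the OPP schedutil would pick anyway.
print(higher_freq(OPPS, 900_000, 0))
# A margin of 256/1024 of the max cost allows one OPP higher from 1 GHz...
print(higher_freq(OPPS, 900_000, 256))
# ...but is not enough to leave the 1.5 GHz OPP.
print(higher_freq(OPPS, 1_500_000, 256))
```

With this made-up table, the same margin moves a 1 GHz request up by one
OPP but leaves a 1.5 GHz request unchanged, which matches the remark that
the boost gets less effective as min_freq grows.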

If the task starts needing more cycles than during its previous period,
`util - util_est_enqueued` will grow like util since util_est_enqueued
is constant. The longer we wait, the higher the boost, until the task
goes to sleep again.

At next wakeup, util_est_enqueued has caught up and either:
1) util becomes stable, so no more boosting
2) util keeps increasing, so go for another round of boosting
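
The boost condition from patch 4 can be sketched as follows. The three
checks mirror the snippet quoted earlier in this thread; the
RampBoostState class and the way the previous values are stored after
each update are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class RampBoostState:
    """Values seen at the previous update (hypothetical container)."""
    util_avg: int = 0
    util_est_enqueued: int = 0


def ramp_boost_update(state: RampBoostState, util_avg: int, util_est_enqueued: int) -> int:
    """Return the boost for this update and remember the new signals."""
    boost = 0
    # Boost only if util_est_enqueued has not changed since the last
    # update, util_avg is not decreasing, and util_avg is above
    # util_est_enqueued.
    if (util_est_enqueued == state.util_est_enqueued
            and util_avg >= state.util_avg
            and util_avg > util_est_enqueued):
        boost = util_avg - util_est_enqueued
    # Assumption: the previous values are refreshed on every update.
    state.util_avg = util_avg
    state.util_est_enqueued = util_est_enqueued
    return boost


state = RampBoostState(util_avg=51, util_est_enqueued=51)
ramping = [ramp_boost_update(state, u, 51) for u in (100, 200)]
decreasing = ramp_boost_update(state, 150, 51)
caught_up = ramp_boost_update(state, 768, 768)
print(ramping, decreasing, caught_up)
```

While util_est_enqueued stays constant the boost grows with util_avg; it
returns to zero as soon as util_avg decreases or util_est_enqueued
catches up.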

Thanks,
Douglas

Peter Zijlstra Feb. 14, 2020, 12:21 p.m. UTC | #11

On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:

> > So even when:
> > 
> >   boost = util_avg - util_est
> > 
> > is small, despite util_avg being huge (~1024), due to large util_est,
> > we'll still get an effective boost to max_cost ASSUMING cs[].cost and
> > cost_margin have the same curve.
> 
> I'm not sure I follow: cs[].cost can be plotted against cs[].freq, but
> cost_margin is a time-based signal (the boost value), so it would be
> plotted against time.

Suppose we have the normalized energy vs frequency curve: x^3

( P ~ V^2 * f, due to lack of better: V ~ f -> P ~ f^3 )

  1 +--------------------------------------------------------------------+
    |             +             +            +             +            *|
    |                                                       x**3 ******* |
    |                                                                **  |
0.8 |-+                                                            **  +-|
    |                                                             **     |
    |                                                            *       |
    |                                                          **        |
0.6 |-+                                                       **       +-|
    |                                                       **           |
    |                                                     **             |
    |                                                   ***              |
0.4 |-+                                               ***              +-|
    |                                               **                   |
    |                                            ***                     |
    |                                          ***                       |
0.2 |-+                                    ****                        +-|
    |                                  ****                              |
    |                            ******                                  |
    |             +     **********           +             +             |
  0 +--------------------------------------------------------------------+
    0            0.2           0.4          0.6           0.8            1


where x is our normalized frequency and y is the normalized energy.

Further, remember that schedutil does (per construction; for lack of
better):

  f ~ u

So at u=0.6, we're at f=0.6 and P=0.2

+               boost = util_avg - util_est_enqueued;

So for util_est = 0.6, we're limited to: boost = 0.4.

+       max_cost = pd->table[pd->nr_cap_states - 1].cost;
+       cost_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;

Which then gives:

  cost_margin = boost = 0.4

And we find that:

  P' = P + cost_margin = 0.2 + 0.4 = 0.6 < 1

So even though we set out to allow a 100% boost in energy usage, we were in
fact incapable of achieving this, because our cost_margin is linear in u
while the energy (or cost) curve is cubic in u.

That was my argument; but I think that now that I've expanded on it, I
see a flaw, because when we do have boost = 0.4, this means util_avg =
1, and we would've selected f = 1, and boosting would've been pointless.

So let me try again:

  f = util_avg, P = f^3, boost = util_avg - util_est

  P' = util_avg ^ 3 + util_avg - util_est

And I'm then failing to make further sense of that; it of course means
that P'(u) is larger than P(2u) for some u, but I don't think we set
that as a goal either.

Let me ponder this a little more while I go read the rest of your email.
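
For reference, the arithmetic of both attempts above can be checked with
a short sketch. It assumes the same normalized model (f = u, P = f^3,
boost = util_avg - util_est); base_power() and boosted_power() are
hypothetical helper names. Note that 0.6^3 is 0.216, which the email
rounds to 0.2.

```python
def base_power(util: float) -> float:
    """Normalized power at f = u under the assumed P = f**3 model."""
    return util ** 3


def boosted_power(util_avg: float, util_est: float) -> float:
    """P' = util_avg**3 + (util_avg - util_est), with the boost floored at 0."""
    boost = max(0.0, util_avg - util_est)
    return base_power(util_avg) + boost


# First attempt: P taken at util_est = 0.6, plus the largest possible
# boost (util_avg = 1.0, so boost = 0.4).
first_attempt = base_power(0.6) + (1.0 - 0.6)

# Second attempt: P taken at util_avg, e.g. util_avg = 0.7, util_est = 0.6.
second_attempt = boosted_power(0.7, 0.6)
print(f"{first_attempt:.3f} {second_attempt:.3f}")
```

In the first case the allowed power stays below 1 even with the maximum
boost; in the second case the allowance is the base power at util_avg
plus a term that is linear in the utilization gap.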

Peter Zijlstra Feb. 14, 2020, 12:52 p.m. UTC | #12

On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:
> On 2/10/20 1:21 PM, Peter Zijlstra wrote:

> > assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a
> > factor f^2 on the table.
> 
> I'm guessing that you arrived at `cost_margin ~ f` this way:
> 
> cost_margin = util - util_est_enqueued
> cost_margin = util - constant
> 
> # with constant small enough
> cost_margin ~ util
> 
> # with util ~ 1/f
> cost_margin ~ 1/f
> 
> In the case you describe, `constant` is actually almost equal to `util`,
> so `cost_margin !~ util`, and this series assumes frequency-invariant
> util_avg, so `util !~ 1/f` (I'll probably have to fix that).

Nah, perhaps already clear from the other email; but it goes like:

  boost = util_avg - util_est
  cost_margin = boost * C = C * util_avg - C * util_est

And since u ~ f (per schedutil construction), cost_margin is a function
linear in either u or f.

> > So the higher the min_freq, the less effective the boost.
> 
> Yes, since the boost is allowing a fixed amount of extra power. Higher
> OPPs are less efficient than lower ones, so if min_freq is high, we
> won't speed up as much as if min_freq was low.
> 
> > Maybe it all works out in practice, but I'm missing a big picture
> 
> Here is a big picture :)
> 
> https://gist.github.com/douglas-raillard-arm/f76586428836ec70c6db372993e0b731#file-ramp_boost-svg
> 
> The board is a Juno R0, with a periodic task pinned on a big CPU
> (capa=1024):
> * phase 1:  5% duty cycle (=51 PELT units)
> * phase 2: 75% duty cycle (=768 PELT units)
> 
> Legend:
> * blue square wave: when the task executes (like in kernelshark)
> * base_cost = cost of frequency as selected by schedutil in normal
> operations
> * allowed_cost = base_cost + cost_margin
> * util = util_avg
> 
> note: the small gaps right after the duty cycle transition between
> t=4.15 and 4.25 are due to sugov task executing, so there is no dequeue
> and no util_est update.

I'm confused by the giant drop in frequency (blue line) around 4.18

schedutil shouldn't select f < max(util_avg, util_est), which is
violated right about there.

I'm also confused by the base_cost line; how can that be flat until
somewhere around 4.16. Sadly there is no line for pure schedutil freq to
compare against.

Other than that, I can see the green line is consistent with
util_avg>util_est, and how it helps grow the frequency (blue).

Peter Zijlstra Feb. 14, 2020, 1:37 p.m. UTC | #13

On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:

> > description of it all somewhere.
> 
> Now a textual version of it:
> 
> em_pd_get_higher_freq() does the following:
> 
> # Turn the abstract cost margin on the EM_COST_MARGIN_SCALE into a
> # concrete value. cost_margin=EM_COST_MARGIN_SCALE will give a concrete
> # value of "max_cost", which is the highest OPP on that CPU.
> concrete_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;
> 
> # Then it finds the lowest OPP satisfying min_freq:
> min_opp = OPP_AT_FREQ(min_freq)
> 
> # It takes the cost associated, and finds the highest OPP that has a
> # cost lower than that:
> max_cost = COST_OF(min_opp) + concrete_margin
> 
> final_freq = MAX(
> 	FREQ_OF(opp)
> 	for opp in available_opps
> 	if COST_OF(opp) <= max_cost
> )

Right; I got that.

> So this means that:
>    util - util_est_enqueued ~= 0

Only if you assume the task will get scheduled out reasonably frequently.

> => cost_margin              ~= 0
> => concrete_cost_margin     ~= 0
> => max_cost   = COST_OF(min_opp) + 0
> => final_freq = FREQ_OF(min_opp)
> 
> The effective boost is ~0, so you will get the current behaviour of
> schedutil.

But the argument holds; because if things don't get scheduled out, we'll
peg u = 1 and hit f = 1 and all is well anyway.

Which is a useful property; it shows that in the steady state, this
patch-set is a NOP, but the above argument only relies on 'util_avg >
util_est' being used as a trigger.

> If the task starts needing more cycles than during its previous period,
> `util - util_est_enqueued` will grow like util since util_est_enqueued
> is constant. The longer we wait, the higher the boost, until the task
> goes to sleep again.
> 
> At next wakeup, util_est_enqueued has caught up and either:
> 1) util becomes stable, so no more boosting
> 2) util keeps increasing, so go for another round of boosting

Agreed; however elsewhere you wrote:

> 1) If you care more about predictable battery life (or energy bill) than
> predictability of the boost feature, EM should be used.
>
> 2) If you don't have an EM or you care more about having a predictable
> boost for a given workload, use util (or disable that boost).

This is the part I'm still not sure about: how do the specifics of the
cost_margin setup lead to 1), or how would some frobbing with frequency
selection destroy that property?

Douglas RAILLARD Feb. 27, 2020, 3:50 p.m. UTC | #14

Hi Peter,

On 2/13/20 1:20 PM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2020 at 11:55:32AM +0000, Douglas Raillard wrote:
>> On 2/10/20 1:30 PM, Peter Zijlstra wrote:
> 
>>> So ARM64 will soon get x86-like power management if I read these here
>>> patches right:
>>>
>>>   https://lkml.kernel.org/r/20191218182607.21607-2-ionela.voinescu@arm.com
>>>
>>> And I'm thinking a part of Rafael's concerns will also apply to those
>>> platforms.
>>
>> AFAIU there is an important difference: ARM64 firmware should not end up
>> increasing frequency on its own, it should only cap the frequency. That
>> means that the situation stays the same for that boost:
>>
>> Let's say you let schedutil select a freq that is +2% more power
>> hungry. That will probably not be enough to make it jump to the next
>> OPP, so you end up not boosting. Now if there is a firmware that decides
>> for some reason to cap the frequency, it will be a similar situation.
> 
> The moment you give out OPP selection to a 3rd party (be it firmware or
> a micro-controller) things are uncertain at best anyway.
> 
> Still, in general, if you give it higher input, it tends to at least
> consider going faster -- which might be all you can ask for...
> 
> So I'm not exactly seeing what your argument is here.

My point is that a +2% boost will give *up to* +2% increase in energy
use. With or without a fancy firmware, having cost_margin > 0 does not
mean you will always actually get a speedup.

>>> Right, so the condition 'util_avg > util_est' makes sense to trigger
>>> some sort of boost off of.
>>>
>>> What kind would make sense for these platforms? One possibility would be
>>> to instead of frobbing the energy margin, as you do here, to frob the C
>>> in get_next_freq().
>>
>> If I'm correct, changing the C value would be somewhat similar to the
>> relative boosting I had in a previous version. Maybe adding a fixed
>> offset would give more predictable results as was discussed with Vincent
>> Guittot. In any case, it would change the perceived util (like iowait
>> boost).
> 
> It depends a bit on what you change C into. If we do something trivial
> like:
> 		1.25 ; !(util_avg > util_est)
> 	C := {
> 		2    ;  (util_avg > util_est)
> 
> ie. a binary selection of constants, then yes, I suppose that is the
> case.
> 
> But nothing stops us from making it more complicated; or having it
> depend on the presence of EM data.

The series currently fiddles with energy cost directly, but it's
possible to have the exact same effect by fiddling with util if we have
the function `(base_util, cost_margin) -> boosted_util`. It's just that
this forces us to:
1. map util to freq
2. find a higher freq for the given cost_margin
3. map freq to util
4. re-inject the new util, which will eventually get remapped to a freq

While it's easier to just do it directly:
1. map util to freq
2. find higher_freq for the given cost_margin
3. use the increased freq
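
The two paths can be compared with a small sketch. Everything here is
hypothetical: the OPP table is invented, util_to_freq() uses the
simplified 1.25 * util * max_freq / max mapping, and freq_to_util()
rounds down so that re-injecting the boosted util resolves to the same
OPP. It is meant only to show that the util-based path needs an extra
round trip to reach the same frequency.

```python
EM_COST_MARGIN_SCALE = 1024
MAX_CAP = 1024

# Hypothetical (frequency in kHz, cost) table, sorted by frequency.
OPPS = [(500_000, 100), (1_000_000, 250), (1_500_000, 500), (2_000_000, 1000)]
MAX_FREQ = OPPS[-1][0]


def util_to_freq(util: int) -> int:
    """Simplified mapping: 1.25 * util * max_freq / max."""
    return (MAX_FREQ + MAX_FREQ // 4) * util // MAX_CAP


def freq_to_util(freq: int) -> int:
    """Inverse of util_to_freq(), rounded down."""
    return freq * MAX_CAP // (MAX_FREQ + MAX_FREQ // 4)


def resolve(freq: int) -> int:
    """Lowest OPP frequency that satisfies the request."""
    return next((f for f, _ in OPPS if f >= freq), MAX_FREQ)


def higher_freq(min_freq: int, cost_margin: int) -> int:
    """Highest OPP frequency whose cost fits in cost(min_opp) + margin."""
    margin = cost_margin * OPPS[-1][1] // EM_COST_MARGIN_SCALE
    base_cost = next((c for f, c in OPPS if f >= min_freq), OPPS[-1][1])
    return max(f for f, c in OPPS if c <= base_cost + margin)


def direct(util: int, cost_margin: int) -> int:
    # Steps 1-3 of the direct path: util -> freq -> higher freq.
    return higher_freq(util_to_freq(util), cost_margin)


def via_util(util: int, cost_margin: int) -> int:
    # Steps 1-4 of the util path: the boosted freq is mapped back to a
    # util, which is then mapped to a freq again.
    boosted_util = freq_to_util(higher_freq(util_to_freq(util), cost_margin))
    return resolve(util_to_freq(boosted_util))


print(direct(400, 256), via_util(400, 256))
```

Both paths end on the same OPP in this example, but the util-based one
depends on how the freq-to-util rounding is done, which is part of why
the direct path is simpler.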

> 
>>> (I have vague memories of this being proposed earlier; it also avoids
>>> that double OPP iteration thing complained about elsewhere in this
>>> thread if I'm not mistaken).
>>
>> It should be possible to get rid of the double iteration mentioned by
>> Quentin. Choosing to boost the util or the energy boils down to:
>>
>> 1) If you care more about predictable battery life (or energy bill) than
>> predictability of the boost feature, EM should be used.
>>
>> 2) If you don't have an EM or you care more about having a predictable
>> boost for a given workload, use util (or disable that boost).
>>
>> The rational is that with 1), you will get a different speed boost for a
>> given workload depending on the other things executing at the same time,
>> as the speed up is not linear with the task-related metric (util -
>> util_est). If you are already at high freq because of another workload,
>> the speed up will be small because the next 100MHz will cost much more
>> than the same +100MHz delta starting from a low OPP.
> 
> It's just that I'm not seeing how 1 actually works or provides that more
> predictable battery life I suppose. We have this other sub-thread to
> argue about that :-)

Ok, I've posted the answer there, so this thread can focus on
boost=(util - util_est_enqueued) logic, and the other one on how to make
actual use of the boost value.

>>> That is; I'm thinking it is important (esp. now that we got frequency
>>> invariance sorted for x86), to have this patch also work for !EM
>>> architectures (as those ARM64-AMU things would be).
>>
>> For sure, that feature is supposed to help in cases that would be
>> impossible to pinpoint with hardware, since it has to know what tasks
>> execute.
> 
> OK, so I'm thinking we're agreeing that it would be good to have this
> support !EM systems too.
> 

Thanks,
Douglas

Douglas RAILLARD March 11, 2020, 12:25 p.m. UTC | #15

Hi Peter,

The preemption stack unwound back to schedutil :)

On 2/14/20 12:52 PM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:
>> On 2/10/20 1:21 PM, Peter Zijlstra wrote:
> 
>>> assuming cs[].cost ~ f^3, and given our cost_margin ~ f, that leaves a
>>> factor f^2 on the table.
>>
>> I'm guessing that you arrived at `cost_margin ~ f` this way:
>>
>> cost_margin = util - util_est_enqueued
>> cost_margin = util - constant
>>
>> # with constant small enough
>> cost_margin ~ util
>>
>> # with util ~ 1/f
>> cost_margin ~ 1/f
>>
>> In the case you describe, `constant` is actually almost equal to `util`
>> so `cost_margin ~! util`, and that series assumes frequency invariant
>> util_avg so `util !~ 1/f` (I'll probably have to fix that).
> 
> Nah, perhaps already clear from the other email; but it goes like:
> 
>   boost = util_avg - util_est
>   cost_margin = boost * C = C * util_avg - C * util_est
> 
> And since u ~ f (per schedutil construction), cost_margin is a function
> linear in either u or f.

cost_margin(u) is not linear in `u` or `f`, as the following equality
does not hold:

cost_margin(A*u) =  A*u - CONSTANT
                 != A*u - A*CONSTANT
                 =  A * (u - CONSTANT)
                 =  A * cost_margin(u)

This would only approximately work if CONSTANT was much smaller than u,
which it isn't.
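
As a quick numeric illustration of the point above, here is a minimal
sketch. The values of `A` and `u` are made up for the example, and 427 is
only borrowed from the util_est_enqueued figure quoted further down; none
of this is code from the series.

```python
def cost_margin(util, util_est_enqueued):
    """Boost signal: util minus a util_est_enqueued treated as a constant."""
    return util - util_est_enqueued


A = 2    # hypothetical scaling factor
u = 500  # hypothetical util value

for constant in (427, 5):
    scaled_input = cost_margin(A * u, constant)    # cost_margin(A*u)
    scaled_output = A * cost_margin(u, constant)   # A * cost_margin(u)
    # The two differ by (A - 1) * constant, which is only negligible
    # when the constant is much smaller than u.
    print(constant, scaled_input, scaled_output, scaled_input - scaled_output)
```

With a constant close to u the two values are far apart, and they only
get close when the constant is tiny compared to u.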

The tricky part is that CONSTANT is actually CONSTANT(u), but the
relation between u and CONSTANT is time-dependent.

That said, the problem can be simplified if we split it into two phases:

* when boost != 0 for a given task, in which case CONSTANT(u) is a true
constant by construction, i.e. cost_margin(u) !~ u.

* when boost = 0 for a given task: all that matters is that only one
task is boosted at a time, so that we can easily relate task composition
and boost composition as done in:
https://lore.kernel.org/lkml/5d732dc1-d343-24d2-bda9-072021a510ed@arm.com/
note: as mentioned in that email, this reasoning relies on
util_est_enqueued(wload) at rq level being linear in wload, which does
not hold with preemption. That would be fixed by working on task signals
rather than rq aggregated ones.

> 
>>> So the higher the min_freq, the less effective the boost.

Ultimately I can't remember exactly what the cost_margin(u) linearity
was intended to be used for, but the relative impact of boosting does
become smaller if u is already high.

This is by design, as the dynamic of that boosting is purposely mostly
tied to local PELT variations, by removing a fixed offset from it. I
don't think big tasks specifically deserve more boosting, as it would
imply the uncertainty on their new period is higher. We would need a
more real-world trace to evaluate how duty cycle changes relate to
the initial duty cycle. That should not be particularly difficult to do,
but the trace parsing infrastructure in LISA currently does not handle
large traces very nicely, so that will probably have to wait until I get
around to fixing that by using libtraceevent from the trace-cmd project.

This means boosting currently depends on two things:
1) the PELT parameters (half life and window size)
2) the former task period

1) is fixed by definition in the kernel.
2) will make boosting more aggressive for short periods (i.e. the boost
will rise faster). The initial rate of increase of boosting once it
triggers should relate to the former task period approximately as
follows:

|   period (s) |   boost rate of increase (PELT unit/ms) |
|-------------:|----------------------------------------:|
|        0     |                                21.3839  |
|        0.016 |                                15.3101  |
|        0.032 |                                10.9615  |
|        0.048 |                                 7.84807 |
|        0.064 |                                 5.61895 |
|        0.08  |                                 4.02297 |
|        0.096 |                                 2.8803  |
|        0.112 |                                 2.0622  |
|        0.128 |                                 1.47646 |

This assumes that while the task is running:
util_avg(t) =
   util_avg(t_enqueue) + 1024 * (1-e^(-(t - t_enqueue)/tau))

with tau:
    # tau as defined by:
    # https://en.wikipedia.org/wiki/Low-pass_filter#Simple_infinite_impulse_response_filter
    # alpha as defined in https://en.wikipedia.org/wiki/Moving_average
    # half_life is the PELT half-life expressed in windows (32)
    decay = (1 / 2)^(1 / half_life)
    alpha = 1 - decay
    # window size in seconds (1024 * 1024 ns)
    window = 1024 * 1024 * 1e-9
    tau = window * ((1 - alpha) / alpha)
    tau = 0.047886417385345964
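
For reference, a minimal sketch that recomputes tau and the table above.
It assumes half_life = 32 windows, which is what reproduces the tau value
given. The closed form used for the rate,
(1024 / tau) * exp(-period / tau), is inferred from the numbers in the
table rather than taken from any script, so treat it as an assumption.

```python
import math

PELT_HALF_LIFE = 32                # assumed PELT half-life, in windows
PELT_WINDOW = 1024 * 1024 * 1e-9   # window size in seconds (1024 * 1024 ns)
MAX_UTIL = 1024                    # full-scale PELT value


def pelt_tau(half_life=PELT_HALF_LIFE, window=PELT_WINDOW):
    """Time constant of the equivalent first-order low-pass filter."""
    decay = 0.5 ** (1 / half_life)
    alpha = 1 - decay
    return window * ((1 - alpha) / alpha)


def boost_rate(period, tau=None):
    """Initial boost rate of increase in PELT units per ms.

    Inferred closed form: slope of the util_avg(t) expression above at
    t_enqueue, attenuated by exp(-period / tau).
    """
    tau = pelt_tau() if tau is None else tau
    return MAX_UTIL / tau * math.exp(-period / tau) / 1000


if __name__ == "__main__":
    print(f"tau = {pelt_tau():.6f} s")
    for i in range(9):
        period = 0.016 * i
        print(f"{period:6.3f} s -> {boost_rate(period):8.4f} PELT units/ms")
```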
>>
>> Yes, since the boost is allowing a fixed amount of extra power. Higher
>> OPPs are less efficient than lower ones, so if min_freq is high, we
>> won't speed up as much as if min_freq was low.
>>
>>> Maybe it all works out in practise, but I'm missing a big picture
>>
>> Here is a big picture :)
>>
>> https://gist.github.com/douglas-raillard-arm/f76586428836ec70c6db372993e0b731#file-ramp_boost-svg
>>
>> The board is a Juno R0, with a periodic task pinned on a big CPU
>> (capa=1024):
>> * phase 1:  5% duty cycle (=51 PELT units)
>> * phase 2: 75% duty cycle (=768 PELT units)
>>
>> Legend:
>> * blue square wave: when the task executes (like in kernelshark)
>> * base_cost = cost of frequency as selected by schedutil in normal
>> operations
>> * allowed_cost = base_cost + cost_margin
>> * util = util_avg
>>
>> note: the small gaps right after the duty cycle transition between
>> t=4.15 and 4.25 are due to sugov task executing, so there is no dequeue
>> and no util_est update.
> 
> I'm confused by the giant drop in frequency (blue line) around 4.18
>
> schedutil shouldn't select f < max(util_avg, util_est), which is
> violated right about there.

AFAICT that's due to the boost being disabled, which means the allowed
cost becomes smaller. The boost is disabled because the task's
util_est_enqueued has been updated since the last schedutil invocation,
so the rq one changes, which in turn disables boosting.

Here are the normalized costs on the CPU the workload was run on:

|    freq |     cost |     capa |
|--------:|---------:|---------:|
|  450000 |  67.086  |  418.909 |
|  625000 |  72.1509 |  581.818 |
|  800000 |  80.8962 |  744.727 |
|  950000 |  90.1688 |  884.364 |
| 1100000 | 100      | 1024     |

In our case, it seems that schedutil is invoked when the task is
preempted (presumably by schedutil kthread) and we end up with:
util_avg          ~= 370
util_est_enqueued ~= 427.
We therefore end up at freq=625000 (capa=581).
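
To make that selection concrete, here is a rough sketch of the unboosted
OPP choice using the table above. It is a simplification, not the
schedutil code: frequencies are presumably in kHz, and the 1.25 headroom
(the C constant discussed earlier in the thread) is applied directly to
capacity, which is an assumption about how it maps here.

```python
# (freq, normalized cost, capacity) for the CPU the workload ran on
OPPS = [
    (450000, 67.086, 418.909),
    (625000, 72.1509, 581.818),
    (800000, 80.8962, 744.727),
    (950000, 90.1688, 884.364),
    (1100000, 100.0, 1024.0),
]


def select_opp(util_avg, util_est_enqueued, headroom=1.25):
    """Return (freq, capa) of the lowest OPP able to serve the request."""
    util = max(util_avg, util_est_enqueued)
    max_capa = OPPS[-1][2]
    # Clamp so that a request above the CPU capacity maps to the top OPP.
    target = min(headroom * util, max_capa)
    for freq, _cost, capa in OPPS:
        if capa >= target:
            return freq, capa
    return OPPS[-1][0], max_capa


if __name__ == "__main__":
    print(select_opp(370, 427))   # the case described above
    print(select_opp(51, 51))     # small util stays at the lowest OPP
```

With util_avg ~= 370 and util_est_enqueued ~= 427 this lands on 625000
with or without the headroom, and any request below a capacity of 418
stays on the lowest OPP.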

> I'm also confused by the base_cost line; how can that be flat until
> somewhere around 4.16.

The minimum frequency on the board being used provides a capacity of
418, so it stays flat until max(util_avg, util_est) goes above that.

> Sadly there is no line for pure schedutil freq to
> compare against.

I could add that one fairly easily. Would you be interested in the
"real" frequency selected by normal schedutil or the "ideal" frequency
schedutil would like to apply?

> Other than that, I can see the green line is consistent with
> util_avg>util_est, and how it help grow the frequency (blue).
>

Douglas RAILLARD March 11, 2020, 12:40 p.m. UTC | #16

On 2/14/20 1:37 PM, Peter Zijlstra wrote:
> On Thu, Feb 13, 2020 at 05:49:48PM +0000, Douglas Raillard wrote:
> 
>>> description of it all somewhere.
>>
>> Now a textual version of it:
>>
>> em_pd_get_higher_freq() does the following:
>>
>> # Turn the abstract cost margin on the EM_COST_MARGIN_SCALE into a
>> # concrete value. cost_margin=EM_COST_MARGIN_SCALE will give a concrete
>> # value of "max_cost", which is the highest OPP on that CPU.
>> concrete_margin = (cost_margin * max_cost) / EM_COST_MARGIN_SCALE;
>>
>> # Then it finds the lowest OPP satisfying min_freq:
>> min_opp = OPP_AT_FREQ(min_freq)
>>
>> # It takes the cost associated, and finds the highest OPP that has a
>> # cost lower than that:
>> max_cost = COST_OF(min_opp) + concrete_margin
>>
>> final_freq = MAX(
>> 	FREQ_OF(opp)
>> 	for opp in available_opps
>> 	if COST_OF(opp) <= max_cost
>> )
> 
> Right; I got that.
> 
>> So this means that:
>>    util - util_est_enqueued ~= 0
> 
> Only if you assume the task will get scheduled out reasonably frequent.
> 
>> => cost_margin              ~= 0
>> => concrete_cost_margin     ~= 0
>> => max_cost   = COST_OF(min_opp) + 0
>> => final_freq = FREQ_OF(min_opp)
>>
>> The effective boost is ~0, so you will get the current behaviour of
>> schedutil.
> 
> But the argument holds; because if things don't get scheduled out, we'll
> peg u = 1 and hit f = 1 and all is well anyway.
> 
> Which is a useful property; it shows that in the steady state, this
> patch-set is a NOP, but the above argument only relies on 'util_avg >
> util_est' being used as a trigger.

Yes, `util_avg > util_est` can only happen when the task's duty cycle is
changing, which does not happen at steady state.
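
The steady-state NOP property can be checked with a small runnable
version of the em_pd_get_higher_freq() pseudocode quoted above. This is
only a sketch: the EM_COST_MARGIN_SCALE value of 1024 is an assumption,
the OPP costs are the normalized Juno R0 ones posted elsewhere in this
thread, and `allowed_cost` is my name for what the pseudocode calls the
second `max_cost`.

```python
EM_COST_MARGIN_SCALE = 1024  # assumed scale of the abstract cost margin

# (freq, normalized cost) of the Juno R0 big CPU, lowest OPP first
OPPS = [
    (450000, 67.086),
    (625000, 72.1509),
    (800000, 80.8962),
    (950000, 90.1688),
    (1100000, 100.0),
]


def em_pd_get_higher_freq(opps, min_freq, cost_margin):
    """Highest frequency whose cost fits within the allowed extra cost."""
    highest_cost = max(cost for _freq, cost in opps)
    concrete_margin = cost_margin * highest_cost / EM_COST_MARGIN_SCALE

    # Lowest OPP satisfying min_freq (fall back to the top OPP).
    candidates = [opp for opp in opps if opp[0] >= min_freq]
    min_opp = min(candidates) if candidates else max(opps)

    allowed_cost = min_opp[1] + concrete_margin
    return max(freq for freq, cost in opps if cost <= allowed_cost)


if __name__ == "__main__":
    # Steady state: util - util_est_enqueued ~= 0, so no boost at all.
    print(em_pd_get_higher_freq(OPPS, 625000, 0))
    # A margin of 100/1024 of the top OPP cost buys one extra OPP here.
    print(em_pd_get_higher_freq(OPPS, 625000, 100))
```

With a zero margin the function returns the frequency of min_opp, i.e.
the current schedutil behaviour.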

Either it's periodic and the boost is legitimate, or it's not periodic
and we assume it's a periodic task well represented by its last
activation and sleep (for the purpose of boosting).

Tasks with a high variability in their activation durations (i.e. not
periodic at all) will likely get more boosting on average, which is
probably good since we can't predict much about them, so when in doubt
we tilt the behaviour of schedutil toward racing to completion.

>> If the task starts needing more cycles than during its previous period,
>> `util - util_est_enqueued` will grow like util since util_est_enqueued
>> is constant. The longer we wait, the higher the boost, until the task
>> goes to sleep again.
>>
>> At next wakeup, util_est_enqueued has caught up and either:
>> 1) util becomes stable, so no more boosting
>> 2) util keeps increasing, so go for another round of boosting
> 
> Agreed; however elsewhere you wrote:
> 
>> 1) If you care more about predictable battery life (or energy bill) than
>> predictability of the boost feature, EM should be used.
>>
>> 2) If you don't have an EM or you care more about having a predictable
>> boost for a given workload, use util (or disable that boost).
> 
> This is the part I'm still not sure about; how do the specifics of the
> cost_margin setup lead to 1), or how would some frobbing with frequency
> selection destroy that property.

This should be answered by this other thread:
https://lore.kernel.org/lkml/5d732dc1-d343-24d2-bda9-072021a510ed@arm.com/#t
