diff mbox series

[RFC,v4,3/6] sched/cpufreq: Hook em_pd_get_higher_power() into get_next_freq()

Message ID 20200122173538.1142069-4-douglas.raillard@arm.com (mailing list archive)
State RFC, archived
Headers show
Series sched/cpufreq: Make schedutil energy aware | expand

Commit Message

Douglas RAILLARD Jan. 22, 2020, 5:35 p.m. UTC
Choose the highest OPP for a given energy cost, allowing to skip lower
frequencies that would not be cheaper in terms of consumed power. These
frequencies can still be interesting to keep in the energy model to give
more freedom to thermal throttling, but should not be selected under
normal circumstances.

This also prepares the ground for energy-aware frequency boosting.

Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Quentin Perret Jan. 23, 2020, 4:16 p.m. UTC | #1
On Wednesday 22 Jan 2020 at 17:35:35 (+0000), Douglas RAILLARD wrote:
> @@ -210,9 +211,16 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>  	struct cpufreq_policy *policy = sg_policy->policy;
>  	unsigned int freq = arch_scale_freq_invariant() ?
>  				policy->cpuinfo.max_freq : policy->cur;
> +	struct em_perf_domain *pd = sugov_policy_get_pd(sg_policy);
>  
>  	freq = map_util_freq(util, freq, max);
>  
> +	/*
> +	 * Try to get a higher frequency if one is available, given the extra
> +	 * power we are ready to spend.
> +	 */
> +	freq = em_pd_get_higher_freq(pd, freq, 0);

I find it sad that the call just below to cpufreq_driver_resolve_freq()
and cpufreq_frequency_table_target() iterates the OPPs all over again.
It's especially a shame since most existing users of the EM stuff do
have a cpufreq frequency table.

Have you looked at hooking this inside cpufreq_driver_resolve_freq()
instead ? If we have a well-formed EM available, the call to
cpufreq_frequency_table_target() feels redundant, so we might want to
skip it.

Thoughts ?

Quentin
Douglas RAILLARD Jan. 23, 2020, 5:52 p.m. UTC | #2
On 1/23/20 4:16 PM, Quentin Perret wrote:
> On Wednesday 22 Jan 2020 at 17:35:35 (+0000), Douglas RAILLARD wrote:
>> @@ -210,9 +211,16 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>>  	struct cpufreq_policy *policy = sg_policy->policy;
>>  	unsigned int freq = arch_scale_freq_invariant() ?
>>  				policy->cpuinfo.max_freq : policy->cur;
>> +	struct em_perf_domain *pd = sugov_policy_get_pd(sg_policy);
>>  
>>  	freq = map_util_freq(util, freq, max);
>>  
>> +	/*
>> +	 * Try to get a higher frequency if one is available, given the extra
>> +	 * power we are ready to spend.
>> +	 */
>> +	freq = em_pd_get_higher_freq(pd, freq, 0);
> 
> I find it sad that the call just below to cpufreq_driver_resolve_freq()
> and cpufreq_frequency_table_target() iterates the OPPs all over again.
> It's especially a shame since most existing users of the EM stuff do
> have a cpufreq frequency table.
> 
> Have you looked at hooking this inside cpufreq_driver_resolve_freq()
> instead ? If we have a well-formed EM available, the call to
> cpufreq_frequency_table_target() feels redundant, so we might want to
> skip it.

We can't really move the call to em_pd_get_higher_freq() into
cpufreq_driver_resolve_freq() since that's a schedutil-specific feature,
and we would loose the !sg_policy->need_freq_update optimization.

Maybe we can add a flag to cpufreq_driver_resolve_freq() that promises
that the frequency is already a valid one. We have to be careful though,
since a number of things can make that untrue:
 - em_pd_get_higher_freq() will return the passed freq verbatim if it's
higher than the max freq, so em_pd_get_higher_freq() will have to set
the flag itself in case that logic changes.
 - policy limits can change the value
 - future things could tinker with the freq and forget to reset the flag.

If you think it's worth it I can make these changes.

> Thoughts ?
> 
> Quentin
> 

Cheers,
Douglas
Quentin Perret Jan. 24, 2020, 2:37 p.m. UTC | #3
On Thursday 23 Jan 2020 at 17:52:53 (+0000), Douglas Raillard wrote:
> We can't really move the call to em_pd_get_higher_freq() into
> cpufreq_driver_resolve_freq() since that's a schedutil-specific feature,
> and we would loose the !sg_policy->need_freq_update optimization.

Depends how you do it. You could add a new method to cpufreq_policy that
is defined only for sugov or something along those lines. And you'd call
that instead of cpufreq_frequency_table_target() when that makes sense.

> Maybe we can add a flag to cpufreq_driver_resolve_freq() that promises
> that the frequency is already a valid one. We have to be careful though,
> since a number of things can make that untrue:
>  - em_pd_get_higher_freq() will return the passed freq verbatim if it's
> higher than the max freq, so em_pd_get_higher_freq() will have to set
> the flag itself in case that logic changes.
>  - policy limits can change the value
>  - future things could tinker with the freq and forget to reset the flag.
> 
> If you think it's worth it I can make these changes.

The thing is, not only with the current patch we end up iterating the
frequencies twice for nothing, but also I think it'd be interesting to
use the EM for consistency with EAS. It'd be nice to use the same data
structure for the predictions we do in compute_energy() and for the
actual request.

Thoughts ?

Quentin
Quentin Perret Jan. 24, 2020, 2:58 p.m. UTC | #4
On Friday 24 Jan 2020 at 14:37:04 (+0000), Quentin Perret wrote:
> On Thursday 23 Jan 2020 at 17:52:53 (+0000), Douglas Raillard wrote:
> > We can't really move the call to em_pd_get_higher_freq() into
> > cpufreq_driver_resolve_freq() since that's a schedutil-specific feature,
> > and we would loose the !sg_policy->need_freq_update optimization.
> 
> Depends how you do it. You could add a new method to cpufreq_policy that

s/cpufreq_policy/cpufreq_governor

> is defined only for sugov or something along those lines. And you'd call
> that instead of cpufreq_frequency_table_target() when that makes sense.
> 
> > Maybe we can add a flag to cpufreq_driver_resolve_freq() that promises
> > that the frequency is already a valid one. We have to be careful though,
> > since a number of things can make that untrue:
> >  - em_pd_get_higher_freq() will return the passed freq verbatim if it's
> > higher than the max freq, so em_pd_get_higher_freq() will have to set
> > the flag itself in case that logic changes.
> >  - policy limits can change the value
> >  - future things could tinker with the freq and forget to reset the flag.
> > 
> > If you think it's worth it I can make these changes.
> 
> The thing is, not only with the current patch we end up iterating the
> frequencies twice for nothing, but also I think it'd be interesting to
> use the EM for consistency with EAS. It'd be nice to use the same data
> structure for the predictions we do in compute_energy() and for the
> actual request.
> 
> Thoughts ?
> 
> Quentin
Douglas RAILLARD Feb. 27, 2020, 3:51 p.m. UTC | #5
Let's this thread be about util boosting vs energy boosting.

Short recap of the conversation:

Peter:
>>> (I have vague memories of this (the util boosting) being proposed
earlier; it also avoids
>>> that double OPP iteration thing complained about elsewhere in this
>>> thread if I'm not mistaken).
>>

Douglas:
>> It should be possible to get rid of the double iteration mentioned by
>> Quentin. Choosing to boost the util or the energy boils down to:
>>
>> 1) If you care more about predictable battery life (or energy bill) than
>> predictability of the boost feature, EM should be used.
>>
>> 2) If you don't have an EM or you care more about having a predictable
>> boost for a given workload, use util (or disable that boost).
>>
>> The rational is that with 1), you will get a different speed boost for a
>> given workload depending on the other things executing at the same time,
>> as the speed up is not linear with the task-related metric (util -
>> util_est). If you are already at high freq because of another workload,
>> the speed up will be small because the next 100Mhz will cost much more
>> than the same +100Mhz delta starting from a low OPP.

Peter:
>
> It's just that I'm not seeing how 1 actually works or provides that more
> predictable battery life I suppose. We have this other sub-thread to
> argue about that :-)

Here is a more detailed version of util boost vs energy boost.
The goal of what follows is to show that util-based boost gives predictable
performance speedup, while energy-based boost gives predictable increase in
energy consumption.

In both cases, the boost value is computed as (util - ue.enqueued),
and only its interpretation differs:
* as an increase in util for util-based boost
* as an OPP cost margin for the energy-based boost



util(wload) = util_avg(wload)
util_est_enqueued(wload)
	| wload is enqueued = f(util_avg(wload))
	| otherwise         = 0

(wloadA + wloadB) denotes a "virtual" task that is running whenever either
wloadA or wloadB is running.

# Assuming wloadA and wloadB are attached to the same rq:
util(wloadA + wloadB) = util(wloadA) + util(wloadB)

# Assuming wloadA and wloadB do not preempt each other:
util_est_enqueued(wloadA + wloadB) =
	util_est_enqueued(wloadA) + util_est_enqueued(wloadB)

# boostU(wload) is the increase of util due to the util-based boost.
# boostE(wload) is the increase of *util* due to the energy-based boost.

boostU(wload) 
	| wload enqueued and util(wload) > util_est_enqueued(wload) =
		util(wload) - util_est_enqueued(wload)
	| otherwise = 0

boostU(wloadA + wloadB) = util(wloadA + wloadB) -
    util_est_enqueued(wloadA + wloadB)
boostU(wloadA + wloadB) = util(wloadA) + util(wloadB) -
    util_est_enqueued(wloadA) - util_est_enqueued(wloadB)
boostU(wloadA + wloadB) = boostU(wloadA) + boostU(wloadB)

# Now if we now intepret the same boost value as a freq energy cost margin:
boostE(wload) 
	| wload enqueued and util(wload) > util_est_enqueued(wload) =
		apply_boostE(util(wload), util_est_enqueued(wload), map_util_freq(util(wload)))
	| otherwise = 0

# with:
apply_boostE(util, util_est_enqueued, base_freq) = 
	boost = util - util_est_enqueued
	boosted_freq = freq_at(boost + cost_of(base_freq))
	freq_delta = boosted_freq - base_freq
	speedup = 1 + freq_delta / max_freq
	return util * speedup

Since freq_at(cost) function is not a linear function of cost
and util(wloadA + wloadB) = util(wloadA) + util(wloadB),
apply_boostE() is not a linear function of wload, which means:
boostE(wloadA + wloadB) != boostE(wloadA) + boostE(wloadB)

This means the speedup (util increase) given by boostE cannot be evaluated for
one task without considering all other attached tasks on the same rq. Therefore
the speedup introduced by boostE is not easily predictable.

On the other hand, the increase in energy cost is linear in the workload for
energy-based boost. but not linear for the util-based boost.

That made me realize that to achieve that, EM_COST_MARGIN_SCALE needs to
map to the highest cost in the system, not the highest cost of the
current CPU, I will fix that.

Also, util_est_enqueued is not linear in the general case when wloadA and
wloadB preempt each-other. There could be ways of making that work but it's
probably a better idea to move the logic at the task level and aggregate the
flag at runqueue level. This will also allow the boost to work when the set of
enqueued tasks varies, which is likely to happen in a real system.
diff mbox series

Patch

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 09e284dc918a..608963da4916 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -10,6 +10,7 @@ 
 
 #include "sched.h"
 
+#include <linux/energy_model.h>
 #include <linux/sched/cpufreq.h>
 #include <trace/events/power.h>
 
@@ -210,9 +211,16 @@  static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	struct cpufreq_policy *policy = sg_policy->policy;
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
+	struct em_perf_domain *pd = sugov_policy_get_pd(sg_policy);
 
 	freq = map_util_freq(util, freq, max);
 
+	/*
+	 * Try to get a higher frequency if one is available, given the extra
+	 * power we are ready to spend.
+	 */
+	freq = em_pd_get_higher_freq(pd, freq, 0);
+
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
 		return sg_policy->next_freq;