diff mbox series

[v6,11/16] sched/fair: Add uclamp support to energy_compute()

Message ID 20190115101513.2822-12-patrick.bellasi@arm.com (mailing list archive)
State Not Applicable, archived
Headers show
Series Add utilization clamping support | expand

Commit Message

Patrick Bellasi Jan. 15, 2019, 10:15 a.m. UTC
The Energy Aware Scheduler (AES) estimates the energy impact of waking
up a task on a given CPU. This estimation is based on:
 a) an (active) power consumptions defined for each CPU frequency
 b) an estimation of which frequency will be used on each CPU
 c) an estimation of the busy time (utilization) of each CPU

Utilization clamping can affect both b) and c) estimations. A CPU is
expected to run:
 - on an higher than required frequency, but for a shorter time, in case
   its estimated utilization will be smaller then the minimum utilization
   enforced by uclamp
 - on a smaller than required frequency, but for a longer time, in case
   its estimated utilization is bigger then the maximum utilization
   enforced by uclamp

While effects on busy time for both boosted/capped tasks are already
considered by compute_energy(), clamping effects on frequency selection
are currently ignored by that function.

Fix it by considering how CPU clamp values will be affected by a
task waking up and being RUNNABLE on that CPU.

Do that by refactoring schedutil_freq_util() to take an additional
task_struct* which allows EAS to evaluate the impact on clamp values of
a task being eventually queued in a CPU. Clamp values are applied to the
RT+CFS utilization only when a FREQUENCY_UTIL is required by
compute_energy().

Since we are at that:
- rename schedutil_freq_util() into schedutil_cpu_util(),
  since it's not only used for frequency selection.
- use "unsigned int" instead of "unsigned long" whenever the tracked
  utilization value is not expected to overflow 32bit.

Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 kernel/sched/cpufreq_schedutil.c | 26 +++++++++++-----------
 kernel/sched/fair.c              | 37 ++++++++++++++++++++++++++------
 kernel/sched/sched.h             | 19 +++++-----------
 3 files changed, 48 insertions(+), 34 deletions(-)

Comments

Quentin Perret Jan. 22, 2019, 12:13 p.m. UTC | #1
On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> The Energy Aware Scheduler (AES) estimates the energy impact of waking

s/AES/EAS :-)

[...]
> +		for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> +			cfs_util = cpu_util_next(cpu, p, dst_cpu);
> +
> +			/*
> +			 * Busy time computation: utilization clamping is not
> +			 * required since the ratio (sum_util / cpu_capacity)
> +			 * is already enough to scale the EM reported power
> +			 * consumption at the (eventually clamped) cpu_capacity.
> +			 */

Right.

> +			sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> +						       ENERGY_UTIL, NULL);
> +
> +			/*
> +			 * Performance domain frequency: utilization clamping
> +			 * must be considered since it affects the selection
> +			 * of the performance domain frequency.
> +			 */

So that actually affects the way we deal with RT I think. I assume the
idea is to say if you don't want to reflect the RT-go-to-max-freq thing
in EAS (which is what we do now) you should set the min cap for RT to 0.
Is that correct ?

I'm fine with this conceptually but maybe the specific case of RT should
be mentioned somewhere in the commit message or so ? I think it's
important to say that clearly since this patch changes the default
behaviour.

> +			cpu_util = schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> +						      FREQUENCY_UTIL,
> +						      cpu == dst_cpu ? p : NULL);
> +			max_util = max(max_util, cpu_util);
>  		}
>  
>  		energy += em_pd_energy(pd->em_pd, max_util, sum_util);

Thanks,
Quentin
Patrick Bellasi Jan. 22, 2019, 12:45 p.m. UTC | #2
On 22-Jan 12:13, Quentin Perret wrote:
> On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> > The Energy Aware Scheduler (AES) estimates the energy impact of waking

[...]

> > +		for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > +			cfs_util = cpu_util_next(cpu, p, dst_cpu);
> > +
> > +			/*
> > +			 * Busy time computation: utilization clamping is not
> > +			 * required since the ratio (sum_util / cpu_capacity)
> > +			 * is already enough to scale the EM reported power
> > +			 * consumption at the (eventually clamped) cpu_capacity.
> > +			 */
> 
> Right.
> 
> > +			sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> > +						       ENERGY_UTIL, NULL);
> > +
> > +			/*
> > +			 * Performance domain frequency: utilization clamping
> > +			 * must be considered since it affects the selection
> > +			 * of the performance domain frequency.
> > +			 */
> 
> So that actually affects the way we deal with RT I think. I assume the
> idea is to say if you don't want to reflect the RT-go-to-max-freq thing
> in EAS (which is what we do now) you should set the min cap for RT to 0.
> Is that correct ?

By default configuration, RT tasks still go to max when uclamp is
enabled, since they get a util_min=1024.

If we want to save power on RT tasks, we can set a smaller util_min...
but not necessarily 0. A util_min=0 for RT tasks means to use just
cpu_util_rt() for that class.

> I'm fine with this conceptually but maybe the specific case of RT should
> be mentioned somewhere in the commit message or so ? I think it's
> important to say that clearly since this patch changes the default
> behaviour.

Default behavior for RT should not be affected. While a capping is
possible for those tasks... where do you see issues ?
Here we are just figuring out what's the capacity the task will run
at, if we will have clamped RT tasks will not be the max but: is that
a problem ?

> > +			cpu_util = schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> > +						      FREQUENCY_UTIL,
> > +						      cpu == dst_cpu ? p : NULL);
> > +			max_util = max(max_util, cpu_util);
> >  		}
> >  
> >  		energy += em_pd_energy(pd->em_pd, max_util, sum_util);
> 
> Thanks,
> Quentin
Quentin Perret Jan. 22, 2019, 1:29 p.m. UTC | #3
On Tuesday 22 Jan 2019 at 12:45:46 (+0000), Patrick Bellasi wrote:
> On 22-Jan 12:13, Quentin Perret wrote:
> > On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> > > The Energy Aware Scheduler (AES) estimates the energy impact of waking
> 
> [...]
> 
> > > +		for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > > +			cfs_util = cpu_util_next(cpu, p, dst_cpu);
> > > +
> > > +			/*
> > > +			 * Busy time computation: utilization clamping is not
> > > +			 * required since the ratio (sum_util / cpu_capacity)
> > > +			 * is already enough to scale the EM reported power
> > > +			 * consumption at the (eventually clamped) cpu_capacity.
> > > +			 */
> > 
> > Right.
> > 
> > > +			sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> > > +						       ENERGY_UTIL, NULL);
> > > +
> > > +			/*
> > > +			 * Performance domain frequency: utilization clamping
> > > +			 * must be considered since it affects the selection
> > > +			 * of the performance domain frequency.
> > > +			 */
> > 
> > So that actually affects the way we deal with RT I think. I assume the
> > idea is to say if you don't want to reflect the RT-go-to-max-freq thing
> > in EAS (which is what we do now) you should set the min cap for RT to 0.
> > Is that correct ?
> 
> By default configuration, RT tasks still go to max when uclamp is
> enabled, since they get a util_min=1024.
> 
> If we want to save power on RT tasks, we can set a smaller util_min...
> but not necessarily 0. A util_min=0 for RT tasks means to use just
> cpu_util_rt() for that class.

Ah, sorry, I guess my message was misleading. I'm saying this is
changing the way _EAS_ deals with RT tasks. Right now we don't actually
consider the RT-go-to-max thing at all in the EAS prediction. Your
patch is changing that AFAICT. It actually changes the way EAS sees RT
tasks even without uclamp ...

But I'm not hostile to the idea since it's possible to enable uclamp and
set the min cap to 0 for RT if you want. So there is a story there.
However, I think this needs be documented somewhere, at the very least.

Thanks,
Quentin
Patrick Bellasi Jan. 22, 2019, 2:26 p.m. UTC | #4
On 22-Jan 13:29, Quentin Perret wrote:
> On Tuesday 22 Jan 2019 at 12:45:46 (+0000), Patrick Bellasi wrote:
> > On 22-Jan 12:13, Quentin Perret wrote:
> > > On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> > > > The Energy Aware Scheduler (AES) estimates the energy impact of waking
> > 
> > [...]
> > 
> > > > +		for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > > > +			cfs_util = cpu_util_next(cpu, p, dst_cpu);
> > > > +
> > > > +			/*
> > > > +			 * Busy time computation: utilization clamping is not
> > > > +			 * required since the ratio (sum_util / cpu_capacity)
> > > > +			 * is already enough to scale the EM reported power
> > > > +			 * consumption at the (eventually clamped) cpu_capacity.
> > > > +			 */
> > > 
> > > Right.
> > > 
> > > > +			sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> > > > +						       ENERGY_UTIL, NULL);
> > > > +
> > > > +			/*
> > > > +			 * Performance domain frequency: utilization clamping
> > > > +			 * must be considered since it affects the selection
> > > > +			 * of the performance domain frequency.
> > > > +			 */
> > > 
> > > So that actually affects the way we deal with RT I think. I assume the
> > > idea is to say if you don't want to reflect the RT-go-to-max-freq thing
> > > in EAS (which is what we do now) you should set the min cap for RT to 0.
> > > Is that correct ?
> > 
> > By default configuration, RT tasks still go to max when uclamp is
> > enabled, since they get a util_min=1024.
> > 
> > If we want to save power on RT tasks, we can set a smaller util_min...
> > but not necessarily 0. A util_min=0 for RT tasks means to use just
> > cpu_util_rt() for that class.
> 
> Ah, sorry, I guess my message was misleading. I'm saying this is
> changing the way _EAS_ deals with RT tasks. Right now we don't actually
> consider the RT-go-to-max thing at all in the EAS prediction. Your
> patch is changing that AFAICT. It actually changes the way EAS sees RT
> tasks even without uclamp ...

Lemme see if I get it right.

Currently, whenever we look at CPU utilization for ENERGY_UTIL, we
always use cpu_util_rt() for RT tasks:

---8<---
        util = util_cfs;
        util += cpu_util_rt(rq);
        util += dl_util;
---8<---

Thus, even when RT tasks are RUNNABLE, we don't always assume the CPU
running at the max capacity but just whatever is the aggregated
utilization across all the classes.

With uclamp, we have:

---8<---
        util = cpu_util_rt(rq) + util_cfs;
        if (type == FREQUENCY_UTIL)
            util = uclamp_util_with(rq, util, p);
        dl_util = cpu_util_dl(rq);
        if (type == ENERGY_UTIL)
            util += dl_util;
---8<---

So, I would say that, in terms of ENERGY_UTIL we do the same both
w/ and w/o uclamp. Isn't it?


> But I'm not hostile to the idea since it's possible to enable uclamp and
> set the min cap to 0 for RT if you want. So there is a story there.
> However, I think this needs be documented somewhere, at the very least.

The only difference I see is that the actual frequency could be
different (lower then max) when a clamped RT task is RUNNABLE.

Are you worried that running RT on a lower freq could have side
effects on the estimated busy time the CPU ?

I also still don't completely get why you say it could be useful to
   "set the min cap to 0 for RT if you want"

IMO this will be an even bigger difference wrt mainline, since the RT
tasks will never have a granted minimum freq but just whatever
utilization we measure for them.
Quentin Perret Jan. 22, 2019, 2:39 p.m. UTC | #5
On Tuesday 22 Jan 2019 at 14:26:06 (+0000), Patrick Bellasi wrote:
> On 22-Jan 13:29, Quentin Perret wrote:
> > On Tuesday 22 Jan 2019 at 12:45:46 (+0000), Patrick Bellasi wrote:
> > > On 22-Jan 12:13, Quentin Perret wrote:
> > > > On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> > > > > The Energy Aware Scheduler (AES) estimates the energy impact of waking
> > > 
> > > [...]
> > > 
> > > > > +		for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > > > > +			cfs_util = cpu_util_next(cpu, p, dst_cpu);
> > > > > +
> > > > > +			/*
> > > > > +			 * Busy time computation: utilization clamping is not
> > > > > +			 * required since the ratio (sum_util / cpu_capacity)
> > > > > +			 * is already enough to scale the EM reported power
> > > > > +			 * consumption at the (eventually clamped) cpu_capacity.
> > > > > +			 */
> > > > 
> > > > Right.
> > > > 
> > > > > +			sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
> > > > > +						       ENERGY_UTIL, NULL);
> > > > > +
> > > > > +			/*
> > > > > +			 * Performance domain frequency: utilization clamping
> > > > > +			 * must be considered since it affects the selection
> > > > > +			 * of the performance domain frequency.
> > > > > +			 */
> > > > 
> > > > So that actually affects the way we deal with RT I think. I assume the
> > > > idea is to say if you don't want to reflect the RT-go-to-max-freq thing
> > > > in EAS (which is what we do now) you should set the min cap for RT to 0.
> > > > Is that correct ?
> > > 
> > > By default configuration, RT tasks still go to max when uclamp is
> > > enabled, since they get a util_min=1024.
> > > 
> > > If we want to save power on RT tasks, we can set a smaller util_min...
> > > but not necessarily 0. A util_min=0 for RT tasks means to use just
> > > cpu_util_rt() for that class.
> > 
> > Ah, sorry, I guess my message was misleading. I'm saying this is
> > changing the way _EAS_ deals with RT tasks. Right now we don't actually
> > consider the RT-go-to-max thing at all in the EAS prediction. Your
> > patch is changing that AFAICT. It actually changes the way EAS sees RT
> > tasks even without uclamp ...
> 
> Lemme see if I get it right.
> 
> Currently, whenever we look at CPU utilization for ENERGY_UTIL, we
> always use cpu_util_rt() for RT tasks:
> 
> ---8<---
>         util = util_cfs;
>         util += cpu_util_rt(rq);
>         util += dl_util;
> ---8<---
> 
> Thus, even when RT tasks are RUNNABLE, we don't always assume the CPU
> running at the max capacity but just whatever is the aggregated
> utilization across all the classes.
> 
> With uclamp, we have:
> 
> ---8<---
>         util = cpu_util_rt(rq) + util_cfs;
>         if (type == FREQUENCY_UTIL)
>             util = uclamp_util_with(rq, util, p);
>         dl_util = cpu_util_dl(rq);
>         if (type == ENERGY_UTIL)
>             util += dl_util;
> ---8<---
> 
> So, I would say that, in terms of ENERGY_UTIL we do the same both
> w/ and w/o uclamp. Isn't it?

Yes but now you use FREQUENCY_UTIL for computing 'max_util' in the EAS
prediction.

Let's take an example. You have a perf domain with two CPUs. One CPU is
busy running a RT task, the other CPU runs a CFS task. Right now in
compute_energy() we only use ENERGY_UTIL, so 'max_util' ends up being
the max between the utilization of the two tasks. So we don't predict
we're going to max freq.

With your patch, we use FREQUENCY_UTIL to compute 'max_util', so we
_will_ predict that we're going to max freq. And we will do that even if
CONFIG_UCLAMP_TASK=n.

The default EAS calculation will be different with your patch when there
are runnable RT tasks in the system. This needs to be documented, I
think.

> > But I'm not hostile to the idea since it's possible to enable uclamp and
> > set the min cap to 0 for RT if you want. So there is a story there.
> > However, I think this needs be documented somewhere, at the very least.
> 
> The only difference I see is that the actual frequency could be
> different (lower then max) when a clamped RT task is RUNNABLE.
> 
> Are you worried that running RT on a lower freq could have side
> effects on the estimated busy time the CPU ?
> 
> I also still don't completely get why you say it could be useful to
>    "set the min cap to 0 for RT if you want"

I'm not saying it's useful, I'm saying userspace can decide to do that
if it thinks it is a good idea. The default should be min_cap = 1024 for
RT, no questions. But you _can_ change it at runtime if you want to.
That's my point. And doing that basically provides the same behaviour as
what we have right now in terms of EAS calculation (but it changes the
freq selection obviously) which is why I'm not fundamentally opposed to
your patch.

So in short, I'm fine with the behavioural change, but please at least
mention it somewhere :-)

Thanks,
Quentin
Patrick Bellasi Jan. 22, 2019, 3:01 p.m. UTC | #6
On 22-Jan 14:39, Quentin Perret wrote:
> On Tuesday 22 Jan 2019 at 14:26:06 (+0000), Patrick Bellasi wrote:
> > On 22-Jan 13:29, Quentin Perret wrote:
> > > On Tuesday 22 Jan 2019 at 12:45:46 (+0000), Patrick Bellasi wrote:
> > > > On 22-Jan 12:13, Quentin Perret wrote:
> > > > > On Tuesday 15 Jan 2019 at 10:15:08 (+0000), Patrick Bellasi wrote:
> > > > > > The Energy Aware Scheduler (AES) estimates the energy impact of waking

[...]

> > > Ah, sorry, I guess my message was misleading. I'm saying this is
> > > changing the way _EAS_ deals with RT tasks. Right now we don't actually
> > > consider the RT-go-to-max thing at all in the EAS prediction. Your
> > > patch is changing that AFAICT. It actually changes the way EAS sees RT
> > > tasks even without uclamp ...
> > 
> > Lemme see if I get it right.
> > 
> > Currently, whenever we look at CPU utilization for ENERGY_UTIL, we
> > always use cpu_util_rt() for RT tasks:
> > 
> > ---8<---
> >         util = util_cfs;
> >         util += cpu_util_rt(rq);
> >         util += dl_util;
> > ---8<---
> > 
> > Thus, even when RT tasks are RUNNABLE, we don't always assume the CPU
> > running at the max capacity but just whatever is the aggregated
> > utilization across all the classes.
> > 
> > With uclamp, we have:
> > 
> > ---8<---
> >         util = cpu_util_rt(rq) + util_cfs;
> >         if (type == FREQUENCY_UTIL)
> >             util = uclamp_util_with(rq, util, p);
> >         dl_util = cpu_util_dl(rq);
> >         if (type == ENERGY_UTIL)
> >             util += dl_util;
> > ---8<---
> > 
> > So, I would say that, in terms of ENERGY_UTIL we do the same both
> > w/ and w/o uclamp. Isn't it?
> 
> Yes but now you use FREQUENCY_UTIL for computing 'max_util' in the EAS
> prediction.

Right, I overlook that "little" detail... :/

> Let's take an example. You have a perf domain with two CPUs. One CPU is
> busy running a RT task, the other CPU runs a CFS task. Right now in
> compute_energy() we only use ENERGY_UTIL, so 'max_util' ends up being
> the max between the utilization of the two tasks. So we don't predict
> we're going to max freq.

+1

> With your patch, we use FREQUENCY_UTIL to compute 'max_util', so we
> _will_ predict that we're going to max freq.

Right, with the default conf yes.

> And we will do that even if CONFIG_UCLAMP_TASK=n.

While this should not happen, as I wrote in the RT integration patch,
that's happening because I'm missing some compilation guard or
similar. In this configurations we should always go to max... will
look into that.

> The default EAS calculation will be different with your patch when there
> are runnable RT tasks in the system. This needs to be documented, I
> think.

Sure...

> > > But I'm not hostile to the idea since it's possible to enable uclamp and
> > > set the min cap to 0 for RT if you want. So there is a story there.
> > > However, I think this needs be documented somewhere, at the very least.
> > 
> > The only difference I see is that the actual frequency could be
> > different (lower then max) when a clamped RT task is RUNNABLE.
> > 
> > Are you worried that running RT on a lower freq could have side
> > effects on the estimated busy time the CPU ?
> > 
> > I also still don't completely get why you say it could be useful to
> >    "set the min cap to 0 for RT if you want"
> 
> I'm not saying it's useful, I'm saying userspace can decide to do that
> if it thinks it is a good idea. The default should be min_cap = 1024 for
> RT, no questions. But you _can_ change it at runtime if you want to.
> That's my point. And doing that basically provides the same behaviour as
> what we have right now in terms of EAS calculation (but it changes the
> freq selection obviously) which is why I'm not fundamentally opposed to
> your patch.

Well, I think it's tricky to say whether the current or new approach
is better... it probably depends on the use-case.

> So in short, I'm fine with the behavioural change, but please at least
> mention it somewhere :-)

Anyway... agree, it's just that to add some documentation I need to
get what you are pointing out ;)

Will come up with some additional text to be added to the changelog

Maybe we can add a more detailed explanation of the different
behaviors you can get in the EAS documentation which is coming to
mainline ?

> Thanks,
> Quentin
Quentin Perret Jan. 22, 2019, 3:14 p.m. UTC | #7
On Tuesday 22 Jan 2019 at 15:01:37 (+0000), Patrick Bellasi wrote:
> > I'm not saying it's useful, I'm saying userspace can decide to do that
> > if it thinks it is a good idea. The default should be min_cap = 1024 for
> > RT, no questions. But you _can_ change it at runtime if you want to.
> > That's my point. And doing that basically provides the same behaviour as
> > what we have right now in terms of EAS calculation (but it changes the
> > freq selection obviously) which is why I'm not fundamentally opposed to
> > your patch.
> 
> Well, I think it's tricky to say whether the current or new approach
> is better... it probably depends on the use-case.

Agreed.

> > So in short, I'm fine with the behavioural change, but please at least
> > mention it somewhere :-)
> 
> Anyway... agree, it's just that to add some documentation I need to
> get what you are pointing out ;)
> 
> Will come up with some additional text to be added to the changelog

Sounds good.

> Maybe we can add a more detailed explanation of the different
> behaviors you can get in the EAS documentation which is coming to
> mainline ?

Yeah, if you feel like it, I guess that won't hurt :-)

Thanks,
Quentin
diff mbox series

Patch

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 38a05a4f78cc..4e02b419c482 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -195,10 +195,11 @@  static unsigned int get_next_freq(struct sugov_policy *sg_policy,
  * based on the task model parameters and gives the minimal utilization
  * required to meet deadlines.
  */
-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
-				  unsigned long max, enum schedutil_type type)
+unsigned int schedutil_cpu_util(int cpu, unsigned int util_cfs,
+				 unsigned int max, enum schedutil_type type,
+				 struct task_struct *p)
 {
-	unsigned long dl_util, util, irq;
+	unsigned int dl_util, util, irq;
 	struct rq *rq = cpu_rq(cpu);
 
 	/*
@@ -222,13 +223,9 @@  unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
 	 * When there are no CFS RUNNABLE tasks, clamps are released and
 	 * frequency will be gracefully reduced with the utilization decay.
 	 */
-	util = cpu_util_rt(rq);
-	if (type == FREQUENCY_UTIL) {
-		util += cpu_util_cfs(rq);
-		util  = uclamp_util(rq, util);
-	} else {
-		util += util_cfs;
-	}
+	util = cpu_util_rt(rq) + util_cfs;
+	if (type == FREQUENCY_UTIL)
+		util = uclamp_util_with(rq, util, p);
 
 	dl_util = cpu_util_dl(rq);
 
@@ -282,13 +279,14 @@  unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
 static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 {
 	struct rq *rq = cpu_rq(sg_cpu->cpu);
-	unsigned long util = cpu_util_cfs(rq);
-	unsigned long max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+	unsigned int util_cfs = cpu_util_cfs(rq);
+	unsigned int cpu_cap = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
 
-	sg_cpu->max = max;
+	sg_cpu->max = cpu_cap;
 	sg_cpu->bw_dl = cpu_bw_dl(rq);
 
-	return schedutil_freq_util(sg_cpu->cpu, util, max, FREQUENCY_UTIL);
+	return schedutil_cpu_util(sg_cpu->cpu, util_cfs, cpu_cap,
+				  FREQUENCY_UTIL, NULL);
 }
 
 /**
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5de061b055d2..7f8ca3b02dec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6424,11 +6424,20 @@  static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
 static long
 compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 {
-	long util, max_util, sum_util, energy = 0;
+	unsigned int max_util, cfs_util, cpu_util, cpu_cap;
+	unsigned long sum_util, energy = 0;
 	int cpu;
 
 	for (; pd; pd = pd->next) {
+		struct cpumask *pd_mask = perf_domain_span(pd);
+
+		/*
+		 * The energy model mandate all the CPUs of a performance
+		 * domain have the same capacity.
+		 */
+		cpu_cap = arch_scale_cpu_capacity(NULL, cpumask_first(pd_mask));
 		max_util = sum_util = 0;
+
 		/*
 		 * The capacity state of CPUs of the current rd can be driven by
 		 * CPUs of another rd if they belong to the same performance
@@ -6439,11 +6448,27 @@  compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * it will not appear in its pd list and will not be accounted
 		 * by compute_energy().
 		 */
-		for_each_cpu_and(cpu, perf_domain_span(pd), cpu_online_mask) {
-			util = cpu_util_next(cpu, p, dst_cpu);
-			util = schedutil_energy_util(cpu, util);
-			max_util = max(util, max_util);
-			sum_util += util;
+		for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
+			cfs_util = cpu_util_next(cpu, p, dst_cpu);
+
+			/*
+			 * Busy time computation: utilization clamping is not
+			 * required since the ratio (sum_util / cpu_capacity)
+			 * is already enough to scale the EM reported power
+			 * consumption at the (eventually clamped) cpu_capacity.
+			 */
+			sum_util += schedutil_cpu_util(cpu, cfs_util, cpu_cap,
+						       ENERGY_UTIL, NULL);
+
+			/*
+			 * Performance domain frequency: utilization clamping
+			 * must be considered since it affects the selection
+			 * of the performance domain frequency.
+			 */
+			cpu_util = schedutil_cpu_util(cpu, cfs_util, cpu_cap,
+						      FREQUENCY_UTIL,
+						      cpu == dst_cpu ? p : NULL);
+			max_util = max(max_util, cpu_util);
 		}
 
 		energy += em_pd_energy(pd->em_pd, max_util, sum_util);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b7ce3023d023..a70f4bf66285 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2317,7 +2317,6 @@  static inline unsigned int uclamp_util(struct rq *rq, unsigned int util)
 # define arch_scale_freq_invariant()	false
 #endif
 
-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
 /**
  * enum schedutil_type - CPU utilization type
  * @FREQUENCY_UTIL:	Utilization used to select frequency
@@ -2333,15 +2332,10 @@  enum schedutil_type {
 	ENERGY_UTIL,
 };
 
-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
-				  unsigned long max, enum schedutil_type type);
-
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
-	unsigned long max = arch_scale_cpu_capacity(NULL, cpu);
-
-	return schedutil_freq_util(cpu, cfs, max, ENERGY_UTIL);
-}
+#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+unsigned int schedutil_cpu_util(int cpu, unsigned int util_cfs,
+				unsigned int max, enum schedutil_type type,
+				struct task_struct *p);
 
 static inline unsigned long cpu_bw_dl(struct rq *rq)
 {
@@ -2370,10 +2364,7 @@  static inline unsigned long cpu_util_rt(struct rq *rq)
 	return READ_ONCE(rq->avg_rt.util_avg);
 }
 #else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
-static inline unsigned long schedutil_energy_util(int cpu, unsigned long cfs)
-{
-	return cfs;
-}
+#define schedutil_cpu_util(cpu, util_cfs, max, type, p) 0
 #endif
 
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ