
[v7,02/14] sched/cpufreq: Prepare schedutil for Energy Aware Scheduling

Message ID 20180912091309.7551-3-quentin.perret@arm.com (mailing list archive)
State Changes Requested, archived
Series Energy Aware Scheduling

Commit Message

Quentin Perret Sept. 12, 2018, 9:12 a.m. UTC
Schedutil requests frequency by aggregating utilization signals from
the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of
them. Since Energy Aware Scheduling (EAS) needs to be able to predict
the frequency requests, it needs to forecast the decisions made by the
governor.

In order to prepare the introduction of EAS, introduce
schedutil_freq_util() to centralize the aforementioned signal
aggregation and make it available to both schedutil and EAS. Since
frequency selection and energy estimation still need to deal with RT and
DL signals slightly differently, schedutil_freq_util() is called with a
different 'type' parameter in those two contexts, and returns an
aggregated utilization signal accordingly. While at it, introduce the
map_util_freq() function, which is designed to make schedutil's 25%
margin easily usable by both sugov and EAS.

As EAS will be able to predict schedutil's frequency requests more
accurately than any other governor by design, it'd be sensible to make
sure EAS cannot be used without schedutil. This will be done later, once
EAS has actually been introduced.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 include/linux/sched/cpufreq.h    |  6 +++
 kernel/sched/cpufreq_schedutil.c | 89 +++++++++++++++++++++-----------
 kernel/sched/sched.h             | 14 +++++
 3 files changed, 80 insertions(+), 29 deletions(-)

Comments

Vincent Guittot Sept. 12, 2018, 2:56 p.m. UTC | #1
Hi Quentin,

On Wed, 12 Sep 2018 at 11:13, Quentin Perret <quentin.perret@arm.com> wrote:
>
> Schedutil requests frequency by aggregating utilization signals from
> the scheduler (CFS, RT, DL, IRQ) and applying and 25% margin on top of

quite a minor thing but s/and 25%/a 25%/

> them. Since Energy Aware Scheduling (EAS) needs to be able to predict
> the frequency requests, it needs to forecast the decisions made by the
> governor.
>
> In order to prepare the introduction of EAS, introduce
> schedutil_freq_util() to centralize the aforementioned signal
> aggregation and make it available to both schedutil and EAS. Since
> frequency selection and energy estimation still need to deal with RT and
> DL signals slightly differently, schedutil_freq_util() is called with a
> different 'type' parameter in those two contexts, and returns an
> aggregated utilization signal accordingly. While at it, introduce the
> map_util_freq() function which is designed to make schedutil's 25%
> margin usable easily for both sugov and EAS.
>
> As EAS will be able to predict schedutil's frequency requests more
> accurately than any other governor by design, it'd be sensible to make
> sure EAS cannot be used without schedutil. This will be done later, once
> EAS has actually been introduced.
>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Rafael J. Wysocki Sept. 18, 2018, 9:33 p.m. UTC | #2
On Wed, Sep 12, 2018 at 11:14 AM Quentin Perret <quentin.perret@arm.com> wrote:
>
> Schedutil requests frequency by aggregating utilization signals from
> the scheduler (CFS, RT, DL, IRQ) and applying and 25% margin on top of
> them. Since Energy Aware Scheduling (EAS) needs to be able to predict
> the frequency requests, it needs to forecast the decisions made by the
> governor.
>
> In order to prepare the introduction of EAS, introduce
> schedutil_freq_util() to centralize the aforementioned signal
> aggregation and make it available to both schedutil and EAS. Since
> frequency selection and energy estimation still need to deal with RT and
> DL signals slightly differently, schedutil_freq_util() is called with a
> different 'type' parameter in those two contexts, and returns an
> aggregated utilization signal accordingly. While at it, introduce the
> map_util_freq() function which is designed to make schedutil's 25%
> margin usable easily for both sugov and EAS.
>
> As EAS will be able to predict schedutil's frequency requests more
> accurately than any other governor by design, it'd be sensible to make
> sure EAS cannot be used without schedutil. This will be done later, once
> EAS has actually been introduced.
>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>

While the overall idea is fine by me, I would do a couple of things
differently (see below).

> ---
>  include/linux/sched/cpufreq.h    |  6 +++
>  kernel/sched/cpufreq_schedutil.c | 89 +++++++++++++++++++++-----------
>  kernel/sched/sched.h             | 14 +++++
>  3 files changed, 80 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index 59667444669f..afa940cd50dc 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -20,6 +20,12 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
>                         void (*func)(struct update_util_data *data, u64 time,
>                                     unsigned int flags));
>  void cpufreq_remove_update_util_hook(int cpu);
> +
> +static inline unsigned long map_util_freq(unsigned long util,
> +                                       unsigned long freq, unsigned long cap)
> +{
> +       return (freq + (freq >> 2)) * util / cap;
> +}
>  #endif /* CONFIG_CPU_FREQ */
>
>  #endif /* _LINUX_SCHED_CPUFREQ_H */
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 3fffad3bc8a8..8356cb0072a6 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -13,6 +13,7 @@
>
>  #include "sched.h"
>
> +#include <linux/sched/cpufreq.h>
>  #include <trace/events/power.h>
>
>  struct sugov_tunables {
> @@ -167,7 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>         unsigned int freq = arch_scale_freq_invariant() ?
>                                 policy->cpuinfo.max_freq : policy->cur;
>
> -       freq = (freq + (freq >> 2)) * util / max;
> +       freq = map_util_freq(util, freq, max);
>
>         if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
>                 return sg_policy->next_freq;
> @@ -197,15 +198,15 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>   * based on the task model parameters and gives the minimal utilization
>   * required to meet deadlines.
>   */
> -static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> +unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> +                                 enum schedutil_type type)

The new "type" argument should be documented.

Also IMO using the special enum for it is quite confusing, because you
ever only check one value from it directly.  What would be wrong with
using a plain "bool" instead?

>  {
> -       struct rq *rq = cpu_rq(sg_cpu->cpu);
> +       struct rq *rq = cpu_rq(cpu);
>         unsigned long util, irq, max;
>
> -       sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> -       sg_cpu->bw_dl = cpu_bw_dl(rq);
> +       max = arch_scale_cpu_capacity(NULL, cpu);
>
> -       if (rt_rq_is_runnable(&rq->rt))
> +       if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
>                 return max;
>
>         /*
> @@ -223,20 +224,33 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
>          * utilization (PELT windows are synchronized) we can directly add them
>          * to obtain the CPU's actual utilization.
>          */
> -       util = cpu_util_cfs(rq);
> +       util = util_cfs;
>         util += cpu_util_rt(rq);
>
> -       /*
> -        * We do not make cpu_util_dl() a permanent part of this sum because we
> -        * want to use cpu_bw_dl() later on, but we need to check if the
> -        * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
> -        * f_max when there is no idle time.
> -        *
> -        * NOTE: numerical errors or stop class might cause us to not quite hit
> -        * saturation when we should -- something for later.
> -        */
> -       if ((util + cpu_util_dl(rq)) >= max)
> -               return max;
> +       if (type == FREQUENCY_UTIL) {
> +               /*
> +                * For frequency selection we do not make cpu_util_dl() a
> +                * permanent part of this sum because we want to use
> +                * cpu_bw_dl() later on, but we need to check if the
> +                * CFS+RT+DL sum is saturated (ie. no idle time) such
> +                * that we select f_max when there is no idle time.
> +                *
> +                * NOTE: numerical errors or stop class might cause us
> +                * to not quite hit saturation when we should --
> +                * something for later.
> +                */
> +
> +               if ((util + cpu_util_dl(rq)) >= max)
> +                       return max;
> +       } else {
> +               /*
> +                * OTOH, for energy computation we need the estimated
> +                * running time, so include util_dl and ignore dl_bw.
> +                */
> +               util += cpu_util_dl(rq);
> +               if (util >= max)
> +                       return max;
> +       }
>
>         /*
>          * There is still idle time; further improve the number by using the
> @@ -250,17 +264,34 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
>         util = scale_irq_capacity(util, irq, max);
>         util += irq;
>
> -       /*
> -        * Bandwidth required by DEADLINE must always be granted while, for
> -        * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
> -        * to gracefully reduce the frequency when no tasks show up for longer
> -        * periods of time.
> -        *
> -        * Ideally we would like to set bw_dl as min/guaranteed freq and util +
> -        * bw_dl as requested freq. However, cpufreq is not yet ready for such
> -        * an interface. So, we only do the latter for now.
> -        */
> -       return min(max, util + sg_cpu->bw_dl);
> +       if (type == FREQUENCY_UTIL) {
> +               /*
> +                * Bandwidth required by DEADLINE must always be granted
> +                * while, for FAIR and RT, we use blocked utilization of
> +                * IDLE CPUs as a mechanism to gracefully reduce the
> +                * frequency when no tasks show up for longer periods of
> +                * time.
> +                *
> +                * Ideally we would like to set bw_dl as min/guaranteed
> +                * freq and util + bw_dl as requested freq. However,
> +                * cpufreq is not yet ready for such an interface. So,
> +                * we only do the latter for now.
> +                */
> +               util += cpu_bw_dl(rq);
> +       }
> +
> +       return min(max, util);
> +}
> +
> +static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> +{
> +       struct rq *rq = cpu_rq(sg_cpu->cpu);
> +       unsigned long util = cpu_util_cfs(rq);
> +
> +       sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> +       sg_cpu->bw_dl = cpu_bw_dl(rq);
> +
> +       return schedutil_freq_util(sg_cpu->cpu, util, FREQUENCY_UTIL);

If you add a "max" argument to schedutil_freq_util(), you can avoid
the second (and arguably redundant) evaluation of
arch_scale_cpu_capacity() in there.

>  }
>
>  /**
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 481e6759441b..d59effb34786 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2172,7 +2172,15 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>  # define arch_scale_freq_invariant()   false
>  #endif
>
> +enum schedutil_type {
> +       FREQUENCY_UTIL,
> +       ENERGY_UTIL,
> +};

As I said above, I would just use "bool" instead of this new enum (it
has two values too) or the new type needs to be documented.

> +
>  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> +unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> +                                 enum schedutil_type type);
> +
>  static inline unsigned long cpu_bw_dl(struct rq *rq)
>  {
>         return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> @@ -2199,6 +2207,12 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
>  {
>         return READ_ONCE(rq->avg_rt.util_avg);
>  }
> +#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
> +static inline unsigned long schedutil_freq_util(int cpu, unsigned long util,
> +                                 enum schedutil_type type)
> +{
> +       return util;
> +}
>  #endif

And I would add a wrapper around schedutil_freq_util(), called say
schedutil_energy_util(), that would pass a specific value as the
"type".

>
>  #ifdef HAVE_SCHED_AVG_IRQ
> --

Cheers,
Rafael
Quentin Perret Sept. 27, 2018, 12:17 p.m. UTC | #3
Hi Rafael,

Very sorry for the late reply ...

On Tuesday 18 Sep 2018 at 23:33:22 (+0200), Rafael J. Wysocki wrote:
[...]
> The new "type" argument should be documented.
> 
> Also IMO using the special enum for it is quite confusing, because you
> only ever check one value from it directly.  What would be wrong with
> using a plain "bool" instead?

So, this part of the code was originally proposed by Peter. I basically
took it from the following message (hence the Suggested-by) which was
fine by me:

https://lore.kernel.org/lkml/20180709120138.GQ2494@hirez.programming.kicks-ass.net/

Also, one of the things that has been mentioned during reviews was that
other clients (such as cpuidle, IIRC) could potentially be interested
in a 'global' cpu util value. And since those clients might have
different needs than EAS or sugov, they might need a new entry in the
enum.

So that's probably the main argument for the enum: it is easy to extend.

[...]
> > +static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> > +{
> > +       struct rq *rq = cpu_rq(sg_cpu->cpu);
> > +       unsigned long util = cpu_util_cfs(rq);
> > +
> > +       sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> > +       sg_cpu->bw_dl = cpu_bw_dl(rq);
> > +
> > +       return schedutil_freq_util(sg_cpu->cpu, util, FREQUENCY_UTIL);
> 
> If you add a "max" argument to schedutil_freq_util(), you can avoid
> the second (and arguably redundant) evaluation of
> arch_scale_cpu_capacity() in there.

OK

[...]
> > +enum schedutil_type {
> > +       FREQUENCY_UTIL,
> > +       ENERGY_UTIL,
> > +};
> 
> As I said above, I would just use "bool" instead of this new enum (it
> has two values too) or the new type needs to be documented.

As I said above, the enum has the good side of being easier to extend.
So, if we care about that, I guess I'd rather add a doc for the new
type.

> > +
> >  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > +unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> > +                                 enum schedutil_type type);
> > +
> >  static inline unsigned long cpu_bw_dl(struct rq *rq)
> >  {
> >         return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > @@ -2199,6 +2207,12 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
> >  {
> >         return READ_ONCE(rq->avg_rt.util_avg);
> >  }
> > +#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
> > +static inline unsigned long schedutil_freq_util(int cpu, unsigned long util,
> > +                                 enum schedutil_type type)
> > +{
> > +       return util;
> > +}
> >  #endif
> 
> And I would add a wrapper around schedutil_freq_util(), called say
> schedutil_energy_util(), that would pass a specific value as the
> "type".

OK, that's fine by me.

Other than that, do you think these changes could be done later? Or do
you see that as mandatory before the patches can be picked up?

Thanks,
Quentin
Rafael J. Wysocki Sept. 28, 2018, 8:23 a.m. UTC | #4
On Thursday, September 27, 2018 2:17:49 PM CEST Quentin Perret wrote:
> Hi Rafael,
> 
> Very sorry for the late reply ...
> 
> On Tuesday 18 Sep 2018 at 23:33:22 (+0200), Rafael J. Wysocki wrote:
> [...]
> > The new "type" argument should be documented.
> > 
> > Also IMO using the special enum for it is quite confusing, because you
> > only ever check one value from it directly.  What would be wrong with
> > using a plain "bool" instead?
> 
> So, this part of the code was originally proposed by Peter. I basically
> took it from the following message (hence the Suggested-by) which was
> fine by me:
> 
> https://lore.kernel.org/lkml/20180709120138.GQ2494@hirez.programming.kicks-ass.net/
> 
> Also, one of the things that has been mentioned during reviews was that
> other clients (such as cpuidle, IIRC) could potentially be interested
> in a 'global' cpu util value. And since those clients might have
> different needs than EAS or sugov, they might need a new entry in the
> enum.
> 
> So that's probably the main argument for the enum, it is easy to extend.

OK, so please document the enum type.

> [...]
> > > +static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> > > +{
> > > +       struct rq *rq = cpu_rq(sg_cpu->cpu);
> > > +       unsigned long util = cpu_util_cfs(rq);
> > > +
> > > +       sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> > > +       sg_cpu->bw_dl = cpu_bw_dl(rq);
> > > +
> > > +       return schedutil_freq_util(sg_cpu->cpu, util, FREQUENCY_UTIL);
> > 
> > If you add a "max" argument to schedutil_freq_util(), you can avoid
> > the second (and arguably redundant) evaluation of
> > arch_scale_cpu_capacity() in there.
> 
> OK
> 
> [...]
> > > +enum schedutil_type {
> > > +       FREQUENCY_UTIL,
> > > +       ENERGY_UTIL,
> > > +};
> > 
> > As I said above, I would just use "bool" instead of this new enum (it
> > has two values too) or the new type needs to be documented.
> 
> As I said above, the enum has the good side of being easier to extend.
> So, if we care about that, I guess I'd rather add a doc for the new
> type.

Right, so please add a kerneldoc description here.

> > > +
> > >  #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > > +unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> > > +                                 enum schedutil_type type);
> > > +
> > >  static inline unsigned long cpu_bw_dl(struct rq *rq)
> > >  {
> > >         return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > @@ -2199,6 +2207,12 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
> > >  {
> > >         return READ_ONCE(rq->avg_rt.util_avg);
> > >  }
> > > +#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
> > > +static inline unsigned long schedutil_freq_util(int cpu, unsigned long util,
> > > +                                 enum schedutil_type type)
> > > +{
> > > +       return util;
> > > +}
> > >  #endif
> > 
> > And I would add a wrapper around schedutil_freq_util(), called say
> > schedutil_energy_util(), that would pass a specific value as the
> > "type".
> 
> OK, that's fine by me.
> 
> Other than that, do you think these changes could be done later ? Or do
> you see that as mandatory before the patches can be picked up ?

Documenting things properly is absolutely required.

The other changes suggested by me are rather straightforward, so why would it
be a problem to make them right away if you agree to make them?

Thanks,
Rafael
Quentin Perret Sept. 28, 2018, 8:35 a.m. UTC | #5
Hi Rafael,

On Friday 28 Sep 2018 at 10:23:42 (+0200), Rafael J. Wysocki wrote:
[...]
> OK, so please document the enum type.

Will do.

[...]
> Documenting things properly is absolutely required.

Understood.

> The other changes suggested by me are rather straightforward, so why would it
> be a problem to make them right away if you agree to make them?

Right, if I need to spin a v8 for doc purpose anyway, I guess I can do
the other small fixes at the same time.

I'll wait a bit before sending the update to see if others have some
feedback. It's probably a bit late for 4.20 anyway... Or is it not?

Thanks,
Quentin
Rafael J. Wysocki Sept. 28, 2018, 8:35 a.m. UTC | #6
On Friday, September 28, 2018 10:35:05 AM CEST Quentin Perret wrote:
> Hi Rafael,
> 
> On Friday 28 Sep 2018 at 10:23:42 (+0200), Rafael J. Wysocki wrote:
> [...]
> > OK, so please document the enum type.
> 
> Will do.
> 
> [...]
> > Documenting things properly is absolutely required.
> 
> Understood.
> 
> > The other changes suggested by me are rather straightforward, so why would it
> > be a problem to make them right away if you agree to make them?
> 
> Right, if I need to spin a v8 for doc purpose anyway, I guess I can do
> the other small fixes at the same time.
> 
> I'll wait a bit before sending the update to see if others have some
> feedback. It's probably a bit late for 4.20 anyway ... Or is it not ?

I think this is scheduler material anyway, even though it affects cpufreq
somewhat, so that depends on what Peter thinks really.

Thanks,
Rafael

Patch

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index 59667444669f..afa940cd50dc 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -20,6 +20,12 @@  void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
                        void (*func)(struct update_util_data *data, u64 time,
 				    unsigned int flags));
 void cpufreq_remove_update_util_hook(int cpu);
+
+static inline unsigned long map_util_freq(unsigned long util,
+					unsigned long freq, unsigned long cap)
+{
+	return (freq + (freq >> 2)) * util / cap;
+}
 #endif /* CONFIG_CPU_FREQ */
 
 #endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 3fffad3bc8a8..8356cb0072a6 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -13,6 +13,7 @@ 
 
 #include "sched.h"
 
+#include <linux/sched/cpufreq.h>
 #include <trace/events/power.h>
 
 struct sugov_tunables {
@@ -167,7 +168,7 @@  static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
-	freq = (freq + (freq >> 2)) * util / max;
+	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
 		return sg_policy->next_freq;
@@ -197,15 +198,15 @@  static unsigned int get_next_freq(struct sugov_policy *sg_policy,
  * based on the task model parameters and gives the minimal utilization
  * required to meet deadlines.
  */
-static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
+unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
+				  enum schedutil_type type)
 {
-	struct rq *rq = cpu_rq(sg_cpu->cpu);
+	struct rq *rq = cpu_rq(cpu);
 	unsigned long util, irq, max;
 
-	sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
-	sg_cpu->bw_dl = cpu_bw_dl(rq);
+	max = arch_scale_cpu_capacity(NULL, cpu);
 
-	if (rt_rq_is_runnable(&rq->rt))
+	if (type == FREQUENCY_UTIL && rt_rq_is_runnable(&rq->rt))
 		return max;
 
 	/*
@@ -223,20 +224,33 @@  static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 	 * utilization (PELT windows are synchronized) we can directly add them
 	 * to obtain the CPU's actual utilization.
 	 */
-	util = cpu_util_cfs(rq);
+	util = util_cfs;
 	util += cpu_util_rt(rq);
 
-	/*
-	 * We do not make cpu_util_dl() a permanent part of this sum because we
-	 * want to use cpu_bw_dl() later on, but we need to check if the
-	 * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
-	 * f_max when there is no idle time.
-	 *
-	 * NOTE: numerical errors or stop class might cause us to not quite hit
-	 * saturation when we should -- something for later.
-	 */
-	if ((util + cpu_util_dl(rq)) >= max)
-		return max;
+	if (type == FREQUENCY_UTIL) {
+		/*
+		 * For frequency selection we do not make cpu_util_dl() a
+		 * permanent part of this sum because we want to use
+		 * cpu_bw_dl() later on, but we need to check if the
+		 * CFS+RT+DL sum is saturated (ie. no idle time) such
+		 * that we select f_max when there is no idle time.
+		 *
+		 * NOTE: numerical errors or stop class might cause us
+		 * to not quite hit saturation when we should --
+		 * something for later.
+		 */
+
+		if ((util + cpu_util_dl(rq)) >= max)
+			return max;
+	} else {
+		/*
+		 * OTOH, for energy computation we need the estimated
+		 * running time, so include util_dl and ignore dl_bw.
+		 */
+		util += cpu_util_dl(rq);
+		if (util >= max)
+			return max;
+	}
 
 	/*
 	 * There is still idle time; further improve the number by using the
@@ -250,17 +264,34 @@  static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 	util = scale_irq_capacity(util, irq, max);
 	util += irq;
 
-	/*
-	 * Bandwidth required by DEADLINE must always be granted while, for
-	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
-	 * to gracefully reduce the frequency when no tasks show up for longer
-	 * periods of time.
-	 *
-	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
-	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
-	 * an interface. So, we only do the latter for now.
-	 */
-	return min(max, util + sg_cpu->bw_dl);
+	if (type == FREQUENCY_UTIL) {
+		/*
+		 * Bandwidth required by DEADLINE must always be granted
+		 * while, for FAIR and RT, we use blocked utilization of
+		 * IDLE CPUs as a mechanism to gracefully reduce the
+		 * frequency when no tasks show up for longer periods of
+		 * time.
+		 *
+		 * Ideally we would like to set bw_dl as min/guaranteed
+		 * freq and util + bw_dl as requested freq. However,
+		 * cpufreq is not yet ready for such an interface. So,
+		 * we only do the latter for now.
+		 */
+		util += cpu_bw_dl(rq);
+	}
+
+	return min(max, util);
+}
+
+static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
+{
+	struct rq *rq = cpu_rq(sg_cpu->cpu);
+	unsigned long util = cpu_util_cfs(rq);
+
+	sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+	sg_cpu->bw_dl = cpu_bw_dl(rq);
+
+	return schedutil_freq_util(sg_cpu->cpu, util, FREQUENCY_UTIL);
 }
 
 /**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 481e6759441b..d59effb34786 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2172,7 +2172,15 @@  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 # define arch_scale_freq_invariant()	false
 #endif
 
+enum schedutil_type {
+	FREQUENCY_UTIL,
+	ENERGY_UTIL,
+};
+
 #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
+				  enum schedutil_type type);
+
 static inline unsigned long cpu_bw_dl(struct rq *rq)
 {
 	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
@@ -2199,6 +2207,12 @@  static inline unsigned long cpu_util_rt(struct rq *rq)
 {
 	return READ_ONCE(rq->avg_rt.util_avg);
 }
+#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+static inline unsigned long schedutil_freq_util(int cpu, unsigned long util,
+				  enum schedutil_type type)
+{
+	return util;
+}
 #endif
 
 #ifdef HAVE_SCHED_AVG_IRQ