[RFC,v3,09/10] sched/fair: Select an energy-efficient CPU on task wake-up

Message ID 20180521142505.6522-10-quentin.perret@arm.com (mailing list archive)
State Superseded, archived

Commit Message

Quentin Perret May 21, 2018, 2:25 p.m. UTC
If an energy model is available, and if the system isn't overutilized,
waking tasks are re-routed into a new energy-aware placement algorithm.
The selection of an energy-efficient CPU for a task is achieved by
estimating the impact on system-level active energy resulting from the
placement of the task on the CPU with the highest spare capacity in each
frequency domain. This strategy spreads tasks in a frequency domain and
avoids overly aggressive task packing. The best CPU energy-wise is then
selected if it saves a large enough amount of energy with respect to
prev_cpu.

Although it has already shown significant benefits on some existing
targets, this approach cannot scale to platforms with numerous CPUs.
This patch is an attempt to do something useful, as writing a fast
heuristic that performs reasonably well on a broad spectrum of
architectures isn't an easy task. As such, the scope of usability of the
energy-aware wake-up path is restricted to systems with the
SD_ASYM_CPUCAPACITY flag set, and where the EM isn't too complex.

In addition, the energy-aware wake-up path is accessible only if
sched_energy_enabled() is true. For systems which don't meet all
dependencies for EAS (CONFIG_ENERGY_MODEL, for example) at compile time,
sched_energy_enabled() defaults to a constant "false" value, hence
letting the compiler remove the unused EAS code entirely.
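
For reference, this is roughly the pattern being described: a minimal
sketch, assuming a static-key-based implementation (the key name
sched_energy_present is hypothetical here):

	#ifdef CONFIG_ENERGY_MODEL
	/* Hypothetical key, enabled once an energy model is registered. */
	extern struct static_key_false sched_energy_present;

	static inline bool sched_energy_enabled(void)
	{
		return static_branch_unlikely(&sched_energy_present);
	}
	#else
	static inline bool sched_energy_enabled(void)
	{
		/* Constant false: the compiler can elide all EAS code. */
		return false;
	}
	#endif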

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c | 84 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 83 insertions(+), 1 deletion(-)

Comments

Juri Lelli June 8, 2018, 10:24 a.m. UTC | #1
Hi,

On 21/05/18 15:25, Quentin Perret wrote:

[...]

> +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> +{
> +	unsigned long cur_energy, prev_energy, best_energy, cpu_cap, task_util;
> +	int cpu, best_energy_cpu = prev_cpu;
> +	struct sched_energy_fd *sfd;
> +	struct sched_domain *sd;
> +
> +	sync_entity_load_avg(&p->se);
> +
> +	task_util = task_util_est(p);
> +	if (!task_util)
> +		return prev_cpu;
> +
> +	/*
> +	 * Energy-aware wake-up happens on the lowest sched_domain starting
> +	 * from sd_ea spanning over this_cpu and prev_cpu.
> +	 */
> +	sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
> +	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> +		sd = sd->parent;
> +	if (!sd)
> +		return -1;

Shouldn't this be return prev_cpu?

> +
> +	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> +		prev_energy = best_energy = compute_energy(p, prev_cpu);
> +	else
> +		prev_energy = best_energy = ULONG_MAX;
> +
> +	for_each_freq_domain(sfd) {
> +		unsigned long spare_cap, max_spare_cap = 0;
> +		int max_spare_cap_cpu = -1;
> +		unsigned long util;
> +
> +		/* Find the CPU with the max spare cap in the freq. dom. */

I understand this being a heuristic to cut some overhead, but shouldn't
the model tell between packing vs. spreading?

Thanks,

 -Juri
Quentin Perret June 8, 2018, 11:19 a.m. UTC | #2
On Friday 08 Jun 2018 at 12:24:46 (+0200), Juri Lelli wrote:
> Hi,
> 
> On 21/05/18 15:25, Quentin Perret wrote:
> 
> [...]
> 
> > +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > +{
> > +	unsigned long cur_energy, prev_energy, best_energy, cpu_cap, task_util;
> > +	int cpu, best_energy_cpu = prev_cpu;
> > +	struct sched_energy_fd *sfd;
> > +	struct sched_domain *sd;
> > +
> > +	sync_entity_load_avg(&p->se);
> > +
> > +	task_util = task_util_est(p);
> > +	if (!task_util)
> > +		return prev_cpu;
> > +
> > +	/*
> > +	 * Energy-aware wake-up happens on the lowest sched_domain starting
> > +	 * from sd_ea spanning over this_cpu and prev_cpu.
> > +	 */
> > +	sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
> > +	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > +		sd = sd->parent;
> > +	if (!sd)
> > +		return -1;
> 
> Shouldn't this be return prev_cpu?

Well, you shouldn't be entering this function without an sd_ea pointer,
so this case is a sort of bug I think. By returning -1 I think we should
end up picking a CPU using select_fallback_rq(), which sort of makes
sense ?

> 
> > +
> > +	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > +		prev_energy = best_energy = compute_energy(p, prev_cpu);
> > +	else
> > +		prev_energy = best_energy = ULONG_MAX;
> > +
> > +	for_each_freq_domain(sfd) {
> > +		unsigned long spare_cap, max_spare_cap = 0;
> > +		int max_spare_cap_cpu = -1;
> > +		unsigned long util;
> > +
> > +		/* Find the CPU with the max spare cap in the freq. dom. */
> 
> I understand this being a heuristic to cut some overhead, but shouldn't
> the model tell between packing vs. spreading?

Ah, that's a very interesting one :-) !

So, with only active costs of the CPUs in the model, we can't really
tell what's best between packing or spreading between identical CPUs if
the migration of the task doesn't change the OPP request.

In a frequency domain, all the "best" CPU candidates for a task are
those for which we'll request a low OPP. When there are several CPUs for
which the OPP request will be the same, we just don't know which one to
pick from an energy standpoint, because we don't have other energy costs
(for idle states for ex) to break the tie.

With this EM, the interesting thing is that if you assume that OPP
requests follow utilization, you are _guaranteed_ that the CPU with
the max spare capacity in a freq domain will always be among the best
candidates of this freq domain. And since we don't know how to
differentiate those candidates, why not use this one ?

Yes, it _might_ be better from an energy standpoint to pack small tasks
on a CPU in order to let other CPUs go in deeper idle states. But that
also hurts your chances to go cluster idle. Which solution is the best ?
It depends, and we have no way to tell with this EM.

This approach basically favors cluster-packing, and spreading inside a
cluster. That should at least be a good thing for latency, and this is
consistent with the idea that most of the energy savings come from the
asymmetry of the system, and not so much from breaking the tie between
identical CPUs. That's also the reason why EAS is enabled only if your
system has SD_ASYM_CPUCAPACITY set, as we already discussed for patch
05/10 :-).

Does that make sense ?

Thanks,
Quentin
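
To make the reasoning above concrete, here is a toy user-space example
(not kernel code), under the stated assumption that the frequency-domain
OPP request follows the maximum utilization among its CPUs:

	#include <stdio.h>

	int main(void)
	{
		/* Three identical CPUs in one freq domain, capacity 1024. */
		unsigned long util[] = { 300, 500, 200 };
		unsigned long task_util = 100;
		int ncpus = 3;

		for (int cpu = 0; cpu < ncpus; cpu++) {
			unsigned long max_util = 0;

			for (int i = 0; i < ncpus; i++) {
				unsigned long u = util[i];

				if (i == cpu)
					u += task_util;
				if (u > max_util)
					max_util = u;
			}
			/* OPP request assumed to follow the domain max util. */
			printf("place on CPU%d -> domain max util %lu\n",
			       cpu, max_util);
		}
		return 0;
	}

This prints 500, 600 and 500: CPU2, the max-spare-capacity CPU, ties CPU0
for the lowest resulting OPP request, so it is always among the best
candidates even though the model cannot break the tie between them.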
Juri Lelli June 8, 2018, 11:59 a.m. UTC | #3
On 08/06/18 12:19, Quentin Perret wrote:
> On Friday 08 Jun 2018 at 12:24:46 (+0200), Juri Lelli wrote:
> > Hi,
> > 
> > On 21/05/18 15:25, Quentin Perret wrote:
> > 
> > [...]
> > 
> > > +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > +{
> > > +	unsigned long cur_energy, prev_energy, best_energy, cpu_cap, task_util;
> > > +	int cpu, best_energy_cpu = prev_cpu;
> > > +	struct sched_energy_fd *sfd;
> > > +	struct sched_domain *sd;
> > > +
> > > +	sync_entity_load_avg(&p->se);
> > > +
> > > +	task_util = task_util_est(p);
> > > +	if (!task_util)
> > > +		return prev_cpu;
> > > +
> > > +	/*
> > > +	 * Energy-aware wake-up happens on the lowest sched_domain starting
> > > +	 * from sd_ea spanning over this_cpu and prev_cpu.
> > > +	 */
> > > +	sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
> > > +	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > > +		sd = sd->parent;
> > > +	if (!sd)
> > > +		return -1;
> > 
> > Shouldn't this be return prev_cpu?
> 
> Well, you shouldn't be entering this function without an sd_ea pointer,
> so this case is a sort of bug I think. By returning -1 I think we should
> end up picking a CPU using select_fallback_rq(), which sort of makes
> sense ?

I fear cpumask_test_cpu() and such won't be happy with a -1 arg.
If it's a recoverable bug, I'd say return prev and WARN_ON_ONCE() ?

> > > +
> > > +	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > > +		prev_energy = best_energy = compute_energy(p, prev_cpu);
> > > +	else
> > > +		prev_energy = best_energy = ULONG_MAX;
> > > +
> > > +	for_each_freq_domain(sfd) {
> > > +		unsigned long spare_cap, max_spare_cap = 0;
> > > +		int max_spare_cap_cpu = -1;
> > > +		unsigned long util;
> > > +
> > > +		/* Find the CPU with the max spare cap in the freq. dom. */
> > 
> > I understand this being a heuristic to cut some overhead, but shouldn't
> > the model tell between packing vs. spreading?
> 
> Ah, that's a very interesting one :-) !
> 
> So, with only active costs of the CPUs in the model, we can't really
> tell what's best between packing or spreading between identical CPUs if
> the migration of the task doesn't change the OPP request.
> 
> In a frequency domain, all the "best" CPU candidates for a task are
> those for which we'll request a low OPP. When there are several CPUs for
> which the OPP request will be the same, we just don't know which one to
> pick from an energy standpoint, because we don't have other energy costs
> (for idle states for ex) to break the tie.
> 
> With this EM, the interesting thing is that if you assume that OPP
> requests follow utilization, you are _guaranteed_ that the CPU with
> the max spare capacity in a freq domain will always be among the best
> candidates of this freq domain. And since we don't know how to
> differentiate those candidates, why not use this one ?
> 
> Yes, it _might_ be better from an energy standpoint to pack small tasks
> on a CPU in order to let other CPUs go in deeper idle states. But that
> also hurts your chances to go cluster idle. Which solution is the best ?
> It depends, and we have no way to tell with this EM.
> 
> This approach basically favors cluster-packing, and spreading inside a
> cluster. That should at least be a good thing for latency, and this is
> consistent with the idea that most of the energy savings come from the
> asymmetry of the system, and not so much from breaking the tie between
> identical CPUs. That's also the reason why EAS is enabled only if your
> system has SD_ASYM_CPUCAPACITY set, as we already discussed for patch
> 05/10 :-).
> 
> Does that make sense ?

Yes, thanks for the explanation. It would probably make sense to copy
and paste your text above somewhere in comment/doc for future ref.
Quentin Perret June 8, 2018, 4:26 p.m. UTC | #4
On Friday 08 Jun 2018 at 13:59:28 (+0200), Juri Lelli wrote:
> On 08/06/18 12:19, Quentin Perret wrote:
> > On Friday 08 Jun 2018 at 12:24:46 (+0200), Juri Lelli wrote:
> > > Hi,
> > > 
> > > On 21/05/18 15:25, Quentin Perret wrote:
> > > 
> > > [...]
> > > 
> > > > +static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
> > > > +{
> > > > +	unsigned long cur_energy, prev_energy, best_energy, cpu_cap, task_util;
> > > > +	int cpu, best_energy_cpu = prev_cpu;
> > > > +	struct sched_energy_fd *sfd;
> > > > +	struct sched_domain *sd;
> > > > +
> > > > +	sync_entity_load_avg(&p->se);
> > > > +
> > > > +	task_util = task_util_est(p);
> > > > +	if (!task_util)
> > > > +		return prev_cpu;
> > > > +
> > > > +	/*
> > > > +	 * Energy-aware wake-up happens on the lowest sched_domain starting
> > > > +	 * from sd_ea spanning over this_cpu and prev_cpu.
> > > > +	 */
> > > > +	sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
> > > > +	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
> > > > +		sd = sd->parent;
> > > > +	if (!sd)
> > > > +		return -1;
> > > 
> > > Shouldn't this be return prev_cpu?
> > 
> > Well, you shouldn't be entering this function without an sd_ea pointer,
> > so this case is a sort of bug I think. By returning -1 I think we should
> > end up picking a CPU using select_fallback_rq(), which sort of makes
> > sense ?
> 
> I fear cpumask_test_cpu() and such won't be happy with a -1 arg.
> If it's a recoverable bug, I'd say return prev and WARN_ON_ONCE() ?

Hmmm, yes, prev + WARN_ON_ONCE is probably appropriate here then.

> 
> > > > +
> > > > +	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > > > +		prev_energy = best_energy = compute_energy(p, prev_cpu);
> > > > +	else
> > > > +		prev_energy = best_energy = ULONG_MAX;
> > > > +
> > > > +	for_each_freq_domain(sfd) {
> > > > +		unsigned long spare_cap, max_spare_cap = 0;
> > > > +		int max_spare_cap_cpu = -1;
> > > > +		unsigned long util;
> > > > +
> > > > +		/* Find the CPU with the max spare cap in the freq. dom. */
> > > 
> > > I understand this being a heuristic to cut some overhead, but shouldn't
> > > the model tell between packing vs. spreading?
> > 
> > Ah, that's a very interesting one :-) !
> > 
> > So, with only active costs of the CPUs in the model, we can't really
> > tell what's best between packing or spreading between identical CPUs if
> > the migration of the task doesn't change the OPP request.
> > 
> > In a frequency domain, all the "best" CPU candidates for a task are
> > those for which we'll request a low OPP. When there are several CPUs for
> > which the OPP request will be the same, we just don't know which one to
> > pick from an energy standpoint, because we don't have other energy costs
> > (for idle states for ex) to break the tie.
> > 
> > With this EM, the interesting thing is that if you assume that OPP
> > requests follow utilization, you are _guaranteed_ that the CPU with
> > the max spare capacity in a freq domain will always be among the best
> > candidates of this freq domain. And since we don't know how to
> > differentiate those candidates, why not use this one ?
> > 
> > Yes, it _might_ be better from an energy standpoint to pack small tasks
> > on a CPU in order to let other CPUs go in deeper idle states. But that
> > also hurts your chances to go cluster idle. Which solution is the best ?
> > It depends, and we have no way to tell with this EM.
> > 
> > This approach basically favors cluster-packing, and spreading inside a
> > cluster. That should at least be a good thing for latency, and this is
> > consistent with the idea that most of the energy savings come from the
> > asymmetry of the system, and not so much from breaking the tie between
> > identical CPUs. That's also the reason why EAS is enabled only if your
> > system has SD_ASYM_CPUCAPACITY set, as we already discussed for patch
> > 05/10 :-).
> > 
> > Does that make sense ?
> 
> Yes, thanks for the explanation. It would probably make sense to copy
> and paste your text above somewhere in comment/doc for future ref.

OK, will do.

Thanks !
Quentin
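
A minimal sketch of the change agreed above (an assumption based on this
discussion, not an actual posted patch):

	sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
		sd = sd->parent;
	/* A missing sd_ea here is a recoverable bug: warn, keep prev_cpu. */
	if (WARN_ON_ONCE(!sd))
		return prev_cpu;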
Pavan Kondeti June 19, 2018, 5:06 a.m. UTC | #5
On Mon, May 21, 2018 at 03:25:04PM +0100, Quentin Perret wrote:

<snip>

> +	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> +		prev_energy = best_energy = compute_energy(p, prev_cpu);
> +	else
> +		prev_energy = best_energy = ULONG_MAX;
> +
> +	for_each_freq_domain(sfd) {
> +		unsigned long spare_cap, max_spare_cap = 0;
> +		int max_spare_cap_cpu = -1;
> +		unsigned long util;
> +
> +		/* Find the CPU with the max spare cap in the freq. dom. */
> +		for_each_cpu_and(cpu, freq_domain_span(sfd), sched_domain_span(sd)) {
> +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> +				continue;
> +
> +			if (cpu == prev_cpu)
> +				continue;
> +
> +			/* Skip CPUs that will be overutilized */
> +			util = cpu_util_wake(cpu, p) + task_util;
> +			cpu_cap = capacity_of(cpu);
> +			if (cpu_cap * 1024 < util * capacity_margin)
> +				continue;
> +
> +			spare_cap = cpu_cap - util;
> +			if (spare_cap > max_spare_cap) {
> +				max_spare_cap = spare_cap;
> +				max_spare_cap_cpu = cpu;
> +			}
> +		}
> +
> +		/* Evaluate the energy impact of using this CPU. */
> +		if (max_spare_cap_cpu >= 0) {
> +			cur_energy = compute_energy(p, max_spare_cap_cpu);
> +			if (cur_energy < best_energy) {
> +				best_energy = cur_energy;
> +				best_energy_cpu = max_spare_cap_cpu;
> +			}
> +		}
> +	}
> +
> +	/*
> +	 * We pick the best CPU only if it saves at least 1.5% of the
> +	 * energy used by prev_cpu.
> +	 */
> +	if ((prev_energy - best_energy) > (prev_energy >> 6))
> +		return best_energy_cpu;
> +
> +	return prev_cpu;
> +}

We are comparing the best_energy_cpu against prev_cpu even when prev_cpu
can't accommodate the waking task. Is this intentional? Should we not
discard the prev_cpu if it can't accommodate the task?

This can potentially make a BIG task run on a lower capacity CPU until
the load balancer kicks in and corrects the situation.

Thanks,
Pavan
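
As an aside on the margin check quoted above: prev_energy >> 6 equals
prev_energy / 64, i.e. roughly 1.56% of prev_energy, which is where the
"at least 1.5%" figure in the comment comes from. For example, with
prev_energy = 10000, 10000 >> 6 = 156, so best_energy_cpu is only
returned when best_energy is at most 9843.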
Quentin Perret June 19, 2018, 7:57 a.m. UTC | #6
Hi Pavan,

On Tuesday 19 Jun 2018 at 10:36:01 (+0530), Pavan Kondeti wrote:
> On Mon, May 21, 2018 at 03:25:04PM +0100, Quentin Perret wrote:
> 
> <snip>
> 
> > +	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > +		prev_energy = best_energy = compute_energy(p, prev_cpu);
> > +	else
> > +		prev_energy = best_energy = ULONG_MAX;
> > +
> > +	for_each_freq_domain(sfd) {
> > +		unsigned long spare_cap, max_spare_cap = 0;
> > +		int max_spare_cap_cpu = -1;
> > +		unsigned long util;
> > +
> > +		/* Find the CPU with the max spare cap in the freq. dom. */
> > +		for_each_cpu_and(cpu, freq_domain_span(sfd), sched_domain_span(sd)) {
> > +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > +				continue;
> > +
> > +			if (cpu == prev_cpu)
> > +				continue;
> > +
> > +			/* Skip CPUs that will be overutilized */
> > +			util = cpu_util_wake(cpu, p) + task_util;
> > +			cpu_cap = capacity_of(cpu);
> > +			if (cpu_cap * 1024 < util * capacity_margin)
> > +				continue;
> > +
> > +			spare_cap = cpu_cap - util;
> > +			if (spare_cap > max_spare_cap) {
> > +				max_spare_cap = spare_cap;
> > +				max_spare_cap_cpu = cpu;
> > +			}
> > +		}
> > +
> > +		/* Evaluate the energy impact of using this CPU. */
> > +		if (max_spare_cap_cpu >= 0) {
> > +			cur_energy = compute_energy(p, max_spare_cap_cpu);
> > +			if (cur_energy < best_energy) {
> > +				best_energy = cur_energy;
> > +				best_energy_cpu = max_spare_cap_cpu;
> > +			}
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * We pick the best CPU only if it saves at least 1.5% of the
> > +	 * energy used by prev_cpu.
> > +	 */
> > +	if ((prev_energy - best_energy) > (prev_energy >> 6))
> > +		return best_energy_cpu;
> > +
> > +	return prev_cpu;
> > +}
> 
> We are comparing the best_energy_cpu against prev_cpu even when prev_cpu
> can't accommodate the waking task. Is this intentional? Should we not
> discard the prev_cpu if it can't accommodate the task?
> 
> This can potentially make a BIG task run on a lower capacity CPU until
> the load balancer kicks in and corrects the situation.

We shouldn't enter find_energy_efficient_cpu() in the first place if the
system is overutilized, so that shouldn't be too much of an issue in
general.

But yeah, there is one small corner case where prev_cpu is overutilized
and the system has not been flagged as such yet (when the task wakes up
shortly before the tick, for example). I think it's possible to cover this case
by extending the "if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))"
condition at the top of the function with a check on capacity_margin.

I'll change that in v4.

Thanks !
Quentin
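
A sketch of what that extended condition might look like, mirroring the
capacity_margin test used inside the freq-domain loop (an assumption
about the upcoming v4, not the actual patch):

	/* Only treat prev_cpu as a candidate if it can hold the task. */
	util = cpu_util_wake(prev_cpu, p) + task_util;
	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed) &&
	    capacity_of(prev_cpu) * 1024 >= util * capacity_margin)
		prev_energy = best_energy = compute_energy(p, prev_cpu);
	else
		prev_energy = best_energy = ULONG_MAX;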
Pavan Kondeti June 19, 2018, 8:41 a.m. UTC | #7
On Tue, Jun 19, 2018 at 08:57:23AM +0100, Quentin Perret wrote:
> Hi Pavan,
> 
> On Tuesday 19 Jun 2018 at 10:36:01 (+0530), Pavan Kondeti wrote:
> > On Mon, May 21, 2018 at 03:25:04PM +0100, Quentin Perret wrote:
> > 
> > <snip>
> > 
> > > +	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > > +		prev_energy = best_energy = compute_energy(p, prev_cpu);
> > > +	else
> > > +		prev_energy = best_energy = ULONG_MAX;
> > > +
> > > +	for_each_freq_domain(sfd) {
> > > +		unsigned long spare_cap, max_spare_cap = 0;
> > > +		int max_spare_cap_cpu = -1;
> > > +		unsigned long util;
> > > +
> > > +		/* Find the CPU with the max spare cap in the freq. dom. */
> > > +		for_each_cpu_and(cpu, freq_domain_span(sfd), sched_domain_span(sd)) {
> > > +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > > +				continue;
> > > +
> > > +			if (cpu == prev_cpu)
> > > +				continue;
> > > +
> > > +			/* Skip CPUs that will be overutilized */
> > > +			util = cpu_util_wake(cpu, p) + task_util;
> > > +			cpu_cap = capacity_of(cpu);
> > > +			if (cpu_cap * 1024 < util * capacity_margin)
> > > +				continue;
> > > +
> > > +			spare_cap = cpu_cap - util;
> > > +			if (spare_cap > max_spare_cap) {
> > > +				max_spare_cap = spare_cap;
> > > +				max_spare_cap_cpu = cpu;
> > > +			}
> > > +		}
> > > +
> > > +		/* Evaluate the energy impact of using this CPU. */
> > > +		if (max_spare_cap_cpu >= 0) {
> > > +			cur_energy = compute_energy(p, max_spare_cap_cpu);
> > > +			if (cur_energy < best_energy) {
> > > +				best_energy = cur_energy;
> > > +				best_energy_cpu = max_spare_cap_cpu;
> > > +			}
> > > +		}
> > > +	}
> > > +
> > > +	/*
> > > +	 * We pick the best CPU only if it saves at least 1.5% of the
> > > +	 * energy used by prev_cpu.
> > > +	 */
> > > +	if ((prev_energy - best_energy) > (prev_energy >> 6))
> > > +		return best_energy_cpu;
> > > +
> > > +	return prev_cpu;
> > > +}
> > 
> > We are comparing the best_energy_cpu against prev_cpu even when prev_cpu
> > can't accommodate the waking task. Is this intentional? Should we not
> > discard the prev_cpu if it can't accommodate the task?
> > 
> > This can potentially make a BIG task run on a lower capacity CPU until
> > the load balancer kicks in and corrects the situation.
> 
> We shouldn't enter find_energy_efficient_cpu() in the first place if the
> system is overutilized, so that shouldn't be too much of an issue in
> general.
> 

With UTIL_EST enabled, the overutilization condition may get turned off when
that 1 BIG task goes to sleep. When it wakes up again, we may place it on
the previous CPU due to the above-mentioned issue.

It is not just about an existing overutilization condition. By placing this
task on the prev_cpu, we may enter the overutilization state.

> But yeah, there is one small corner case where prev_cpu is overutilized
> and the system has not been flagged as such yet (when the task wakes up
> shortly before the tick, for example). I think it's possible to cover this case
> by extending the "if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))"
> condition at the top of the function with a check on capacity_margin.
> 

LGTM.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1f7029258df2..eb44829be17f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6683,6 +6683,80 @@  static long compute_energy(struct task_struct *p, int dst_cpu)
 	return energy;
 }
 
+static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
+{
+	unsigned long cur_energy, prev_energy, best_energy, cpu_cap, task_util;
+	int cpu, best_energy_cpu = prev_cpu;
+	struct sched_energy_fd *sfd;
+	struct sched_domain *sd;
+
+	sync_entity_load_avg(&p->se);
+
+	task_util = task_util_est(p);
+	if (!task_util)
+		return prev_cpu;
+
+	/*
+	 * Energy-aware wake-up happens on the lowest sched_domain starting
+	 * from sd_ea spanning over this_cpu and prev_cpu.
+	 */
+	sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
+	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+		sd = sd->parent;
+	if (!sd)
+		return -1;
+
+	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
+		prev_energy = best_energy = compute_energy(p, prev_cpu);
+	else
+		prev_energy = best_energy = ULONG_MAX;
+
+	for_each_freq_domain(sfd) {
+		unsigned long spare_cap, max_spare_cap = 0;
+		int max_spare_cap_cpu = -1;
+		unsigned long util;
+
+		/* Find the CPU with the max spare cap in the freq. dom. */
+		for_each_cpu_and(cpu, freq_domain_span(sfd), sched_domain_span(sd)) {
+			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+				continue;
+
+			if (cpu == prev_cpu)
+				continue;
+
+			/* Skip CPUs that will be overutilized */
+			util = cpu_util_wake(cpu, p) + task_util;
+			cpu_cap = capacity_of(cpu);
+			if (cpu_cap * 1024 < util * capacity_margin)
+				continue;
+
+			spare_cap = cpu_cap - util;
+			if (spare_cap > max_spare_cap) {
+				max_spare_cap = spare_cap;
+				max_spare_cap_cpu = cpu;
+			}
+		}
+
+		/* Evaluate the energy impact of using this CPU. */
+		if (max_spare_cap_cpu >= 0) {
+			cur_energy = compute_energy(p, max_spare_cap_cpu);
+			if (cur_energy < best_energy) {
+				best_energy = cur_energy;
+				best_energy_cpu = max_spare_cap_cpu;
+			}
+		}
+	}
+
+	/*
+	 * We pick the best CPU only if it saves at least 1.5% of the
+	 * energy used by prev_cpu.
+	 */
+	if ((prev_energy - best_energy) > (prev_energy >> 6))
+		return best_energy_cpu;
+
+	return prev_cpu;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6701,16 +6775,23 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	struct sched_domain *tmp, *sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
-	int want_affine = 0;
+	int want_affine = 0, want_energy = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
+		want_energy = sched_energy_enabled() &&
+			      !READ_ONCE(cpu_rq(cpu)->rd->overutilized);
 		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
 			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
 	}
 
 	rcu_read_lock();
+	if (want_energy) {
+		new_cpu = find_energy_efficient_cpu(p, prev_cpu);
+		goto unlock;
+	}
+
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6745,6 +6826,7 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (want_affine)
 			current->recent_used_cpu = cpu;
 	}
+unlock:
 	rcu_read_unlock();
 
 	return new_cpu;