
[v7,12/14] sched/fair: Select an energy-efficient CPU on task wake-up

Message ID: 20180912091309.7551-13-quentin.perret@arm.com (mailing list archive)
State: Changes Requested, archived
Series: Energy Aware Scheduling

Commit Message

Quentin Perret Sept. 12, 2018, 9:13 a.m. UTC
If an Energy Model (EM) is available and if the system isn't
overutilized, re-route waking tasks into an energy-aware placement
algorithm. The selection of an energy-efficient CPU for a task
is achieved by estimating the impact on system-level active energy
resulting from the placement of the task on the CPU with the highest
spare capacity in each performance domain. This strategy spreads tasks
in a performance domain and avoids overly aggressive task packing. The
best CPU energy-wise is then selected if it saves a large enough amount
of energy with respect to prev_cpu.

Although it has already shown significant benefits on some existing
targets, this approach cannot scale to platforms with numerous CPUs.
This is an attempt to do something useful, since writing a fast heuristic
that performs reasonably well on a broad spectrum of architectures isn't
an easy task. As such, the scope of usability of the energy-aware
wake-up path is restricted to systems with the SD_ASYM_CPUCAPACITY flag
set, and where the EM isn't too complex.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c | 139 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 136 insertions(+), 3 deletions(-)
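
For reference, the selection rule described above reduces to one small
comparison, sketched here as standalone C (worth_migrating() is an
illustrative name, not a helper from this series): the best candidate is
preferred over prev_cpu only when prev_cpu wasn't a valid candidate at
all, or when the estimated saving exceeds 1/16th (about 6.25%) of
prev_cpu's energy.

#include <limits.h>
#include <stdbool.h>

/*
 * Illustration only: mirrors the final check in
 * find_energy_efficient_cpu() below. prev_energy == ULONG_MAX means
 * prev_cpu was never evaluated as a candidate (e.g. it would have been
 * overutilized).
 */
static bool worth_migrating(unsigned long prev_energy,
			    unsigned long best_energy)
{
	if (prev_energy == ULONG_MAX)
		return true;

	/* Require a saving strictly greater than prev_energy / 16. */
	return (prev_energy - best_energy) > (prev_energy >> 4);
}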

Comments

Peter Zijlstra Oct. 4, 2018, 9:44 a.m. UTC | #1
On Wed, Sep 12, 2018 at 10:13:07AM +0100, Quentin Perret wrote:
> +	while (pd) {
> +		unsigned long cur_energy, spare_cap, max_spare_cap = 0;
> +		int max_spare_cap_cpu = -1;
> +
> +		for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {

Which of the two masks do we expect to be the smallest?

> +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> +				continue;
> +
> +			/* Skip CPUs that will be overutilized. */
> +			util = cpu_util_next(cpu, p, cpu);
> +			cpu_cap = capacity_of(cpu);
> +			if (cpu_cap * 1024 < util * capacity_margin)
> +				continue;
> +
> +			/* Always use prev_cpu as a candidate. */
> +			if (cpu == prev_cpu) {
> +				prev_energy = compute_energy(p, prev_cpu, head);
> +				if (prev_energy < best_energy)
> +					best_energy = prev_energy;

				best_energy = min(best_energy, prev_energy);

That's both shorter and clearer.

> +				continue;
> +			}
> +
> +			/*
> +			 * Find the CPU with the maximum spare capacity in
> +			 * the performance domain
> +			 */
> +			spare_cap = cpu_cap - util;
> +			if (spare_cap > max_spare_cap) {
> +				max_spare_cap = spare_cap;
> +				max_spare_cap_cpu = cpu;
> +			}

Sometimes I wonder if something like:

#define min_filter(varp, val)		\
({					\
	typeof(varp) _varp = (varp);	\
	typeof(val)  _val  = (val);	\
	bool f = false;			\
					\
	if (_val < *_varp) {		\
		*_varp = _val;		\
		f = true;		\
	}				\
					\
	f;				\
})

and the corresponding max_filter() are worth the trouble; it would allow
writing:

	if (max_filter(&max_spare_cap, spare_cap))
		max_spare_cap_cpu = cpu;

and:

> +		}
> +
> +		/* Evaluate the energy impact of using this CPU. */
> +		if (max_spare_cap_cpu >= 0) {
> +			cur_energy = compute_energy(p, max_spare_cap_cpu, head);
> +			if (cur_energy < best_energy) {
> +				best_energy = cur_energy;
> +				best_energy_cpu = max_spare_cap_cpu;
> +			}

	if (min_filter(&best_energy, cur_energy))
		best_energy_cpu = max_spare_cap_cpu;

But then I figure, it is not... dunno. We do lots of this stuff.

> +		}
> +		pd = pd->next;
> +	}
> +
> +	/*
> +	 * Pick the best CPU if prev_cpu cannot be used, or if it saves at
> +	 * least 6% of the energy used by prev_cpu.
> +	 */
> +	if (prev_energy == ULONG_MAX ||
> +			(prev_energy - best_energy) > (prev_energy >> 4))
> +		return best_energy_cpu;

Does that become more readable if we split that into two conditions?

	if (prev_energy == ULONG_MAX)
		return best_energy_cpu;

	if ((prev_energy - best_energy) > (prev_energy >> 4))
		return best_energy_cpu;

> +	return prev_cpu;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -6360,13 +6468,37 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  	int want_affine = 0;
>  	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>  
> +	rcu_read_lock();
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> +
> +		/*
> +		 * Forkees are not accepted in the energy-aware wake-up path
> +		 * because they don't have any useful utilization data yet and
> +		 * it's not possible to forecast their impact on energy
> +		 * consumption. Consequently, they will be placed by
> +		 * find_idlest_cpu() on the least loaded CPU, which might turn
> +		 * out to be energy-inefficient in some use-cases. The
> +		 * alternative would be to bias new tasks towards specific types
> +		 * of CPUs first, or to try to infer their util_avg from the
> +		 * parent task, but those heuristics could hurt other use-cases
> +		 * too. So, until someone finds a better way to solve this,
> +		 * let's keep things simple by re-using the existing slow path.
> +		 */
> +		if (sched_feat(ENERGY_AWARE)) {
> +			struct root_domain *rd = cpu_rq(cpu)->rd;
> +			struct perf_domain *pd = rcu_dereference(rd->pd);
> +
> +			if (pd && !READ_ONCE(rd->overutilized)) {
> +				new_cpu = find_energy_efficient_cpu(p, prev_cpu, pd);
> +				goto unlock;
> +			}
> +		}
> +
> +		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
> +			      cpumask_test_cpu(cpu, &p->cpus_allowed);
>  	}

I would much prefer this to be something like:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a8f601edd958..5475a885ec9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6299,12 +6299,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 {
 	struct sched_domain *tmp, *sd = NULL;
 	int cpu = smp_processor_id();
-	int new_cpu = prev_cpu;
+	unsigned int new_cpu = prev_cpu;
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
+
+		if (static_branch_unlikely(&sched_eas_balance)) {
+			new_cpu = select_task_rq_eas(p, prev_cpu, sd_flag, wake_flags);
+			if (new_cpu < nr_cpu_ids)
+				return new_cpu;
+		}
+
 		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
 			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
 	}


and then hide everything (including that giant comment) in
select_task_rq_eas().
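
For completeness, the max_filter() counterpart Peter alludes to above
would presumably be the same macro with the comparison flipped; a
sketch, not something posted in the thread:

#define max_filter(varp, val)		\
({					\
	typeof(varp) _varp = (varp);	\
	typeof(val)  _val  = (val);	\
	bool f = false;			\
					\
	if (_val > *_varp) {		\
		*_varp = _val;		\
		f = true;		\
	}				\
					\
	f;				\
})

With that, "if (max_filter(&max_spare_cap, spare_cap)) max_spare_cap_cpu = cpu;"
reads exactly as in the example above.
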
Quentin Perret Oct. 4, 2018, 10:27 a.m. UTC | #2
On Thursday 04 Oct 2018 at 11:44:12 (+0200), Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 10:13:07AM +0100, Quentin Perret wrote:
> > +	while (pd) {
> > +		unsigned long cur_energy, spare_cap, max_spare_cap = 0;
> > +		int max_spare_cap_cpu = -1;
> > +
> > +		for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
> 
> Which of the two masks do we expect to be the smallest?

Typically, perf_domain_span is smaller.

> > +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > +				continue;
> > +
> > +			/* Skip CPUs that will be overutilized. */
> > +			util = cpu_util_next(cpu, p, cpu);
> > +			cpu_cap = capacity_of(cpu);
> > +			if (cpu_cap * 1024 < util * capacity_margin)
> > +				continue;
> > +
> > +			/* Always use prev_cpu as a candidate. */
> > +			if (cpu == prev_cpu) {
> > +				prev_energy = compute_energy(p, prev_cpu, head);
> > +				if (prev_energy < best_energy)
> > +					best_energy = prev_energy;
> 
> 				best_energy = min(best_energy, prev_energy);
> 
> That's both shorter and clearer.

OK.


> > +				continue;
> > +			}
> > +
> > +			/*
> > +			 * Find the CPU with the maximum spare capacity in
> > +			 * the performance domain
> > +			 */
> > +			spare_cap = cpu_cap - util;
> > +			if (spare_cap > max_spare_cap) {
> > +				max_spare_cap = spare_cap;
> > +				max_spare_cap_cpu = cpu;
> > +			}
> 
> Sometimes I wonder if something like:
> 
> #define min_filter(varp, val)		\
> ({					\
> 	typeof(varp) _varp = (varp);	\
> 	typeof(val)  _val  = (val);	\
> 	bool f = false;			\
> 					\
> 	if (_val < *_varp) {		\
> 		*_varp = _val;		\
> 		f = true;		\
> 	}				\
> 					\
> 	f;				\
> })
> 
> and the corresponding max_filter() are worth the trouble; it would allow
> writing:
> 
> 	if (max_filter(&max_spare_cap, spare_cap))
> 		max_spare_cap_cpu = cpu;
> 
> and:
> 
> > +		}
> > +
> > +		/* Evaluate the energy impact of using this CPU. */
> > +		if (max_spare_cap_cpu >= 0) {
> > +			cur_energy = compute_energy(p, max_spare_cap_cpu, head);
> > +			if (cur_energy < best_energy) {
> > +				best_energy = cur_energy;
> > +				best_energy_cpu = max_spare_cap_cpu;
> > +			}
> 
> 	if (min_filter(&best_energy, cur_energy))
> 		best_energy_cpu = max_spare_cap_cpu;
> 
> But then I figure, it is not... dunno. We do lots of this stuff.

If there are occurrences of this stuff all over the place, we could do
that in a separate clean-up patch that does just that, for the entire
file. Or maybe more?

> > +		}
> > +		pd = pd->next;
> > +	}
> > +
> > +	/*
> > +	 * Pick the best CPU if prev_cpu cannot be used, or if it saves at
> > +	 * least 6% of the energy used by prev_cpu.
> > +	 */
> > +	if (prev_energy == ULONG_MAX ||
> > +			(prev_energy - best_energy) > (prev_energy >> 4))
> > +		return best_energy_cpu;
> 
> Does that become more readable if we split that into two conditions?
> 
> 	if (prev_energy == ULONG_MAX)
> 		return best_energy_cpu;
> 
> 	if ((prev_energy - best_energy) > (prev_energy >> 4))
> 		return best_energy_cpu;

Yeah, why not :-)

> > +	return prev_cpu;
> > +}
> > +
> >  /*
> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -6360,13 +6468,37 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >  	int want_affine = 0;
> >  	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> >  
> > +	rcu_read_lock();
> >  	if (sd_flag & SD_BALANCE_WAKE) {
> >  		record_wakee(p);
> > +
> > +		/*
> > +		 * Forkees are not accepted in the energy-aware wake-up path
> > +		 * because they don't have any useful utilization data yet and
> > +		 * it's not possible to forecast their impact on energy
> > +		 * consumption. Consequently, they will be placed by
> > +		 * find_idlest_cpu() on the least loaded CPU, which might turn
> > +		 * out to be energy-inefficient in some use-cases. The
> > +		 * alternative would be to bias new tasks towards specific types
> > +		 * of CPUs first, or to try to infer their util_avg from the
> > +		 * parent task, but those heuristics could hurt other use-cases
> > +		 * too. So, until someone finds a better way to solve this,
> > +		 * let's keep things simple by re-using the existing slow path.
> > +		 */
> > +		if (sched_feat(ENERGY_AWARE)) {
> > +			struct root_domain *rd = cpu_rq(cpu)->rd;
> > +			struct perf_domain *pd = rcu_dereference(rd->pd);
> > +
> > +			if (pd && !READ_ONCE(rd->overutilized)) {
> > +				new_cpu = find_energy_efficient_cpu(p, prev_cpu, pd);
> > +				goto unlock;
> > +			}
> > +		}
> > +
> > +		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
> > +			      cpumask_test_cpu(cpu, &p->cpus_allowed);
> >  	}
> 
> I would much prefer this to be something like:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a8f601edd958..5475a885ec9f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6299,12 +6299,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  {
>  	struct sched_domain *tmp, *sd = NULL;
>  	int cpu = smp_processor_id();
> -	int new_cpu = prev_cpu;
> +	unsigned int new_cpu = prev_cpu;
>  	int want_affine = 0;
>  	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>  
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> +
> +		if (static_branch_unlikely(&sched_eas_balance)) {
> +			new_cpu = select_task_rq_eas(p, prev_cpu, sd_flag, wake_flags);
> +			if (new_cpu < nr_cpu_ids)
> +				return new_cpu;
> +		}
> +
>  		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
>  			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
>  	}
> and then hide everything (including that giant comment) in
> select_task_rq_eas().

So you think we should rename find_energy_efficient_cpu and put all the
checks in there? Or should select_task_rq_eas do the checks and then
call find_energy_efficient_cpu?

Not a huge deal, but that'll save some time if we agree on that one
upfront.

Thanks
Peter Zijlstra Oct. 4, 2018, 10:41 a.m. UTC | #3
On Thu, Oct 04, 2018 at 11:27:22AM +0100, Quentin Perret wrote:
> > > +		for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
> > 
> > Which of the two masks do we expect to be the smallest?
> 
> Typically, perf_domain_span is smaller.

OK, then the above expression is in the right order :-)
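
Context on why the order can matter: for_each_cpu_and() visits the
intersection of the two masks, and iterators of this shape commonly scan
the first mask and filter each candidate against the second, roughly:

	for (cpu = cpumask_first(mask1); cpu < nr_cpu_ids;
	     cpu = cpumask_next(cpu, mask1)) {
		if (!cpumask_test_cpu(cpu, mask2))
			continue;
		/* body runs for each CPU in mask1 & mask2 */
	}

so the walk is cheapest when the smaller mask comes first. This is an
illustration, not the kernel's actual definition, which varies across
versions.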

> > > +			if (spare_cap > max_spare_cap) {
> > > +				max_spare_cap = spare_cap;
> > > +				max_spare_cap_cpu = cpu;
> > > +			}
> > 
> > Sometimes I wonder if something like:
> > 
> > #define min_filter(varp, val)		\
> > ({					\
> > 	typeof(varp) _varp = (varp);	\
> > 	typeof(val)  _val  = (val);	\
> > 	bool f = false;			\
> > 					\
> > 	if (_val < *_varp) {		\
> > 		*_varp = _val;		\
> > 		f = true;		\
> > 	}				\
> > 					\
> > 	f;				\
> > })
> > 
> > and the corresponding max_filter() are worth the trouble; it would allow
> > writing:
> > 
> > 	if (max_filter(&max_spare_cap, spare_cap))
> > 		max_spare_cap_cpu = cpu;
> > 
> > and:
> > 
> > > +		}
> > > +
> > > +		/* Evaluate the energy impact of using this CPU. */
> > > +		if (max_spare_cap_cpu >= 0) {
> > > +			cur_energy = compute_energy(p, max_spare_cap_cpu, head);
> > > +			if (cur_energy < best_energy) {
> > > +				best_energy = cur_energy;
> > > +				best_energy_cpu = max_spare_cap_cpu;
> > > +			}
> > 
> > 	if (min_filter(&best_energy, cur_energy))
> > 		best_energy_cpu = max_spare_cap_cpu;
> > 
> > But then I figure, it is not... dunno. We do lots of this stuff.
> 
> If there are occurrences of this stuff all over the place, we could do
> that in a separate clean-up patch that does just that, for the entire
> file. Or maybe more ?

Sure, not something that needs done now. I just always think of this
when I see this pattern repeated, but never seem to get around to doing
anything about it.

I figured I'd mention it ;-)

> > I would much prefer this to be something like:
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a8f601edd958..5475a885ec9f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6299,12 +6299,19 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >  {
> >  	struct sched_domain *tmp, *sd = NULL;
> >  	int cpu = smp_processor_id();
> > -	int new_cpu = prev_cpu;
> > +	unsigned int new_cpu = prev_cpu;
> >  	int want_affine = 0;
> >  	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> >  
> >  	if (sd_flag & SD_BALANCE_WAKE) {
> >  		record_wakee(p);
> > +
> > +		if (static_branch_unlikely(&sched_eas_balance)) {
> > +			new_cpu = select_task_rq_eas(p, prev_cpu, sd_flag, wake_flags);
> > +			if (new_cpu < nr_cpu_ids)
> > +				return new_cpu;
> > +		}
> > +
> >  		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> >  			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> >  	}
> > and then hide everything (including that giant comment) in
> > select_task_rq_eas().
> 
> So you think we should rename find_energy_efficient_cpu and put all the
> checks in there? Or should select_task_rq_eas do the checks and then
> call find_energy_efficient_cpu?
> 
> Not a huge deal, but that'll save some time if we agree on that one
> upfront.

Not sure, see what it looks like ;-) My main concern here was to get rid
of that giant blob in select_task_rq_fair().
Quentin Perret Oct. 4, 2018, 10:55 a.m. UTC | #4
On Thursday 04 Oct 2018 at 12:41:07 (+0200), Peter Zijlstra wrote:
> Not sure, see what it looks like ;-) My main concern here was to get rid
> of that giant blob in select_task_rq_fair().

OK, got it. I'll probably just move the checks into the function and
merge that large comment into the already massive comment above
find_energy_efficient_cpu() or something. We'll see :-)

And do you actually care about the function name? We've been using
find_energy_efficient_cpu() for some time in Android now, so that's just
slightly easier for us if that can be re-used. Not a big deal either,
though.

Thanks,
Quentin
Peter Zijlstra Oct. 4, 2018, 12:18 p.m. UTC | #5
On Thu, Oct 04, 2018 at 11:55:09AM +0100, Quentin Perret wrote:
> On Thursday 04 Oct 2018 at 12:41:07 (+0200), Peter Zijlstra wrote:
> > Not sure, see what it looks like ;-) My main concern here was to get rid
> > of that giant blob in select_task_rq_fair().
> 
> OK, got it. I'll probably just move the checks into the function and
> merge that large comment into the already massive comment above
> find_energy_efficient_cpu() or something. We'll see :-)
> 
> And do you actually care about the function name? We've been using
> find_energy_efficient_cpu() for some time in Android now, so that's just
> slightly easier for us if that can be re-used. Not a big deal either,
> though.

Sure, that's fine I suppose.
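
Reading the thread's conclusion together: the find_energy_efficient_cpu()
name stays, the feature, pd and overutilized checks (plus the big
comment) move into the function itself, and the function signals "not
applicable" back to the caller. A sketch of what the call site could
then look like, as an assumption about a later revision rather than a
quote from one:

	if (sd_flag & SD_BALANCE_WAKE) {
		record_wakee(p);

		if (sched_feat(ENERGY_AWARE)) {
			/*
			 * Assumed to do the pd/overutilized checks itself
			 * and to return -1 when the EAS path doesn't apply.
			 */
			new_cpu = find_energy_efficient_cpu(p, prev_cpu);
			if (new_cpu >= 0)
				goto unlock;
			new_cpu = prev_cpu;
		}

		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
			      cpumask_test_cpu(cpu, &p->cpus_allowed);
	}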

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ee3e43cdaf2..6e988a01011c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6339,6 +6339,114 @@  static long compute_energy(struct task_struct *p, int dst_cpu,
 	return energy;
 }
 
+/*
+ * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
+ * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
+ * spare capacity in each performance domain and uses it as a potential
+ * candidate to execute the task. Then, it uses the Energy Model to figure
+ * out which of the CPU candidates is the most energy-efficient.
+ *
+ * The rationale for this heuristic is as follows. In a performance domain,
+ * all the most energy efficient CPU candidates (according to the Energy
+ * Model) are those for which we'll request a low frequency. When there are
+ * several CPUs for which the frequency request will be the same, we don't
+ * have enough data to break the tie between them, because the Energy Model
+ * only includes active power costs. With this model, if we assume that
+ * frequency requests follow utilization (e.g. using schedutil), the CPU with
+ * the maximum spare capacity in a performance domain is guaranteed to be among
+ * the best candidates of the performance domain.
+ *
+ * In practice, it could be preferable from an energy standpoint to pack
+ * small tasks on a CPU in order to let other CPUs go in deeper idle states,
+ * but that could also hurt our chances to go cluster idle, and we have no
+ * ways to tell with the current Energy Model if this is actually a good
+ * idea or not. So, find_energy_efficient_cpu() basically favors
+ * cluster-packing, and spreading inside a cluster. That should at least be
+ * a good thing for latency, and this is consistent with the idea that most
+ * of the energy savings of EAS come from the asymmetry of the system, and
+ * not so much from breaking the tie between identical CPUs. That's also the
+ * reason why EAS is enabled in the topology code only for systems where
+ * SD_ASYM_CPUCAPACITY is set.
+ */
+static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu,
+							struct perf_domain *pd)
+{
+	unsigned long prev_energy = ULONG_MAX, best_energy = ULONG_MAX;
+	int cpu, best_energy_cpu = prev_cpu;
+	struct perf_domain *head = pd;
+	unsigned long cpu_cap, util;
+	struct sched_domain *sd;
+
+	sync_entity_load_avg(&p->se);
+
+	if (!task_util_est(p))
+		return prev_cpu;
+
+	/*
+	 * Energy-aware wake-up happens on the lowest sched_domain starting
+	 * from sd_asym_cpucapacity spanning over this_cpu and prev_cpu.
+	 */
+	sd = rcu_dereference(*this_cpu_ptr(&sd_asym_cpucapacity));
+	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+		sd = sd->parent;
+	if (!sd)
+		return prev_cpu;
+
+	while (pd) {
+		unsigned long cur_energy, spare_cap, max_spare_cap = 0;
+		int max_spare_cap_cpu = -1;
+
+		for_each_cpu_and(cpu, perf_domain_span(pd), sched_domain_span(sd)) {
+			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+				continue;
+
+			/* Skip CPUs that will be overutilized. */
+			util = cpu_util_next(cpu, p, cpu);
+			cpu_cap = capacity_of(cpu);
+			if (cpu_cap * 1024 < util * capacity_margin)
+				continue;
+
+			/* Always use prev_cpu as a candidate. */
+			if (cpu == prev_cpu) {
+				prev_energy = compute_energy(p, prev_cpu, head);
+				if (prev_energy < best_energy)
+					best_energy = prev_energy;
+				continue;
+			}
+
+			/*
+			 * Find the CPU with the maximum spare capacity in
+			 * the performance domain
+			 */
+			spare_cap = cpu_cap - util;
+			if (spare_cap > max_spare_cap) {
+				max_spare_cap = spare_cap;
+				max_spare_cap_cpu = cpu;
+			}
+		}
+
+		/* Evaluate the energy impact of using this CPU. */
+		if (max_spare_cap_cpu >= 0) {
+			cur_energy = compute_energy(p, max_spare_cap_cpu, head);
+			if (cur_energy < best_energy) {
+				best_energy = cur_energy;
+				best_energy_cpu = max_spare_cap_cpu;
+			}
+		}
+		pd = pd->next;
+	}
+
+	/*
+	 * Pick the best CPU if prev_cpu cannot be used, or if it saves at
+	 * least 6% of the energy used by prev_cpu.
+	 */
+	if (prev_energy == ULONG_MAX ||
+			(prev_energy - best_energy) > (prev_energy >> 4))
+		return best_energy_cpu;
+
+	return prev_cpu;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6360,13 +6468,37 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int want_affine = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	rcu_read_lock();
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
-			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
+
+		/*
+		 * Forkees are not accepted in the energy-aware wake-up path
+		 * because they don't have any useful utilization data yet and
+		 * it's not possible to forecast their impact on energy
+		 * consumption. Consequently, they will be placed by
+		 * find_idlest_cpu() on the least loaded CPU, which might turn
+		 * out to be energy-inefficient in some use-cases. The
+		 * alternative would be to bias new tasks towards specific types
+		 * of CPUs first, or to try to infer their util_avg from the
+		 * parent task, but those heuristics could hurt other use-cases
+		 * too. So, until someone finds a better way to solve this,
+		 * let's keep things simple by re-using the existing slow path.
+		 */
+		if (sched_feat(ENERGY_AWARE)) {
+			struct root_domain *rd = cpu_rq(cpu)->rd;
+			struct perf_domain *pd = rcu_dereference(rd->pd);
+
+			if (pd && !READ_ONCE(rd->overutilized)) {
+				new_cpu = find_energy_efficient_cpu(p, prev_cpu, pd);
+				goto unlock;
+			}
+		}
+
+		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
+			      cpumask_test_cpu(cpu, &p->cpus_allowed);
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6401,6 +6533,7 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (want_affine)
 			current->recent_used_cpu = cpu;
 	}
+unlock:
 	rcu_read_unlock();
 
 	return new_cpu;
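
To put numbers on the two thresholds in this patch: capacity_margin
defaulted to 1280 in fair.c at the time (roughly 20% headroom), so the
"cpu_cap * 1024 < util * capacity_margin" filter skips a CPU once the
task's projected utilization exceeds 80% of that CPU's capacity. On a
cpu_cap = 1024 CPU, util = 820 gives 820 * 1280 = 1049600, which is
greater than 1024 * 1024 = 1048576, so the CPU is skipped. For the final
check, if prev_energy is 1600 abstract energy units, prev_energy >> 4 =
100, and best_energy_cpu is returned only when its estimated cost is
below 1500 units, i.e. a saving of more than 6.25% (the "at least 6%" in
the comment rounds this down).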