[RFC,v2,5/6] sched/fair: Select an energy-efficient CPU on task wake-up

Message ID 20180406153607.17815-6-dietmar.eggemann@arm.com (mailing list archive)
State RFC, archived

Commit Message

Dietmar Eggemann April 6, 2018, 3:36 p.m. UTC
From: Quentin Perret <quentin.perret@arm.com>

When an energy model is available, waking tasks are re-routed to a
new energy-aware placement algorithm. The CPUs eligible for the
energy-aware wake-up path are restricted to the highest non-overutilized
sched_domain containing prev_cpu and this_cpu. If no such domain is found,
tasks go through the usual wake-up path, hence energy-aware placement
happens only in lightly utilized scenarios.

The selection of the most energy-efficient CPU for a task is achieved by
estimating the impact on system-level active energy resulting from the
placement of the task on the CPU with the highest spare capacity in each
frequency domain. The best CPU energy-wise is then selected if it saves
a large enough amount of energy with respect to prev_cpu.
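
Concretely, "large enough" is defined later in the patch as more than
prev_energy >> 6, i.e. roughly 1.5% (1/64) of prev_cpu's estimated
energy. A worked instance of that acceptance test, with illustrative
numbers (the unit is whatever compute_energy() returns):

	unsigned long prev_energy = 6400;	/* p stays on prev_cpu */
	unsigned long best_energy = 6250;	/* best candidate found */

	/* Saving is 150, threshold is 6400 >> 6 = 100: migrate. */
	if ((prev_energy - best_energy) > (prev_energy >> 6))
		return best_energy_cpu;

	return prev_cpu;	/* otherwise the saving is too small */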

Although it has already shown significant benefits on some existing
targets, this approach cannot scale to platforms with numerous CPUs.
This patch is an attempt to do something useful, as writing a fast
heuristic that performs reasonably well on a broad spectrum of
architectures isn't an easy task. As a consequence, the scope of
usability of the energy-aware wake-up path is restricted to systems
with the SD_ASYM_CPUCAPACITY flag set. These systems not only show the
most promising opportunities for saving energy but also typically
feature a limited number of logical CPUs.

Moreover, the energy-aware wake-up path is accessible only if
sched_energy_enabled() is true. For systems that don't meet all of the
EAS dependencies (CONFIG_PM_OPP, for example) at compile time,
sched_energy_enabled() defaults to a constant "false" value, letting
the compiler remove the unused EAS code entirely.
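
A minimal sketch of that compile-time pattern, assuming a helper along
the lines of the one introduced earlier in this series (the
sched_energy_present static key and the config guard are assumptions,
not part of this patch):

	#ifdef CONFIG_ENERGY_MODEL	/* assumed stand-in for the EAS dependencies */
	static inline bool sched_energy_enabled(void)
	{
		return static_branch_unlikely(&sched_energy_present);
	}
	#else
	static inline bool sched_energy_enabled(void)
	{
		/* Constant false: the compiler can drop all EAS code paths. */
		return false;
	}
	#endif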

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 93 insertions(+), 4 deletions(-)

Comments

Peter Zijlstra April 9, 2018, 4:30 p.m. UTC | #1
On Fri, Apr 06, 2018 at 04:36:06PM +0100, Dietmar Eggemann wrote:
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> +		want_energy = wake_energy(p, prev_cpu);
>  		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> +			      && cpumask_test_cpu(cpu, &p->cpus_allowed)
> +			      && !want_energy;

Could you please fix that and put the operators at the end of the
previous line?
Quentin Perret April 9, 2018, 4:43 p.m. UTC | #2
On Monday 09 Apr 2018 at 18:30:29 (+0200), Peter Zijlstra wrote:
> On Fri, Apr 06, 2018 at 04:36:06PM +0100, Dietmar Eggemann wrote:
> >  	if (sd_flag & SD_BALANCE_WAKE) {
> >  		record_wakee(p);
> > +		want_energy = wake_energy(p, prev_cpu);
> >  		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> > -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> > +			      && cpumask_test_cpu(cpu, &p->cpus_allowed)
> > +			      && !want_energy;
> 
> Could you please fix that and put the operators at the end of the
> previous line?

Will do.
Peter Zijlstra April 10, 2018, 5:29 p.m. UTC | #3
On Fri, Apr 06, 2018 at 04:36:06PM +0100, Dietmar Eggemann wrote:
> +	for_each_freq_domain(fd) {
> +		unsigned long spare_cap, max_spare_cap = 0;
> +		int max_spare_cap_cpu = -1;
> +		unsigned long util;
> +
> +		/* Find the CPU with the max spare cap in the freq. dom. */
> +		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
> +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> +				continue;
> +
> +			if (cpu == prev_cpu)
> +				continue;
> +
> +			util = cpu_util_wake(cpu, p);
> +			cpu_cap = capacity_of(cpu);
> +			if (!util_fits_capacity(util + task_util, cpu_cap))
> +				continue;
> +
> +			spare_cap = cpu_cap - util;
> +			if (spare_cap > max_spare_cap) {
> +				max_spare_cap = spare_cap;
> +				max_spare_cap_cpu = cpu;
> +			}
> +		}
> +
> +		/* Evaluate the energy impact of using this CPU. */
> +		if (max_spare_cap_cpu >= 0) {
> +			cur_energy = compute_energy(p, max_spare_cap_cpu);
> +			if (cur_energy < best_energy) {
> +				best_energy = cur_energy;
> +				best_energy_cpu = max_spare_cap_cpu;
> +			}
> +		}
> +	}

If each CPU has its own frequency domain, then the above loop ends up
being O(n^2), no? Is there really nothing we can do about that? Also, I
feel that warrants a comment warning about this.

Someone, somewhere will try and build a 64+64 cpu system and get
surprised it doesn't work :-)
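
To put numbers on that worst case (one frequency domain per CPU here is
an assumption for illustration):

	/*
	 * n CPUs, one freq. domain each:
	 *  - find_energy_efficient_cpu() visits n frequency domains, and
	 *  - each compute_energy() call sums the energy of all n domains,
	 * so a single wake-up costs on the order of n * n domain
	 * evaluations; for the hypothetical 64+64 system above,
	 * 128 * 128 = 16384.
	 */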
Quentin Perret April 10, 2018, 6:14 p.m. UTC | #4
On Tuesday 10 Apr 2018 at 19:29:32 (+0200), Peter Zijlstra wrote:
> On Fri, Apr 06, 2018 at 04:36:06PM +0100, Dietmar Eggemann wrote:
> > +	for_each_freq_domain(fd) {
> > +		unsigned long spare_cap, max_spare_cap = 0;
> > +		int max_spare_cap_cpu = -1;
> > +		unsigned long util;
> > +
> > +		/* Find the CPU with the max spare cap in the freq. dom. */
> > +		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
> > +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > +				continue;
> > +
> > +			if (cpu == prev_cpu)
> > +				continue;
> > +
> > +			util = cpu_util_wake(cpu, p);
> > +			cpu_cap = capacity_of(cpu);
> > +			if (!util_fits_capacity(util + task_util, cpu_cap))
> > +				continue;
> > +
> > +			spare_cap = cpu_cap - util;
> > +			if (spare_cap > max_spare_cap) {
> > +				max_spare_cap = spare_cap;
> > +				max_spare_cap_cpu = cpu;
> > +			}
> > +		}
> > +
> > +		/* Evaluate the energy impact of using this CPU. */
> > +		if (max_spare_cap_cpu >= 0) {
> > +			cur_energy = compute_energy(p, max_spare_cap_cpu);
> > +			if (cur_energy < best_energy) {
> > +				best_energy = cur_energy;
> > +				best_energy_cpu = max_spare_cap_cpu;
> > +			}
> > +		}
> > +	}
> 
> If each CPU has its own frequency domain, then the above loop ends up
> being O(n^2), no?

Hmmm, yes, that should be O(n^2) indeed.

> Is there really nothing we can do about that?

So, the only thing I see just now would be to make compute_energy()
smarter somehow. Today we compute the energy consumed by each frequency
domain and then sum them up to get the system energy. We re-compute the
energy of each frequency domain, even when it is not involved in the
migration. In the case you describe, we will end up re-computing the
energy of many frequency domains on which nothing happens every time
we re-call compute_energy(). So there is probably something we could do
by caching those values somehow.
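
A minimal sketch of that caching idea (fd_energy(), a per-domain slice
of compute_energy() which places p on dst_cpu, or on no CPU at all for
dst_cpu == -1, and the fixed-size cache are assumptions, not code from
this series):

	unsigned long fd_cache[NR_FREQ_DOMAINS];	/* assumed bound */
	unsigned long base_energy = 0;
	struct freq_domain *fd;
	int i = 0;

	/* Pass 1: cache each domain's energy with p placed nowhere. */
	for_each_freq_domain(fd) {
		fd_cache[i] = fd_energy(fd, p, -1);
		base_energy += fd_cache[i++];
	}

	/*
	 * Evaluating a candidate dst_cpu then only re-computes the one
	 * domain dst_cpu belongs to; every other domain keeps its
	 * cached value:
	 *
	 *	energy = base_energy - fd_cache[fd_idx(dst_cpu)] +
	 *		 fd_energy(fd_of(dst_cpu), p, dst_cpu);
	 *
	 * bringing the wake-up cost back from O(n^2) towards O(n).
	 */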

Otherwise, on systems with 2 frequency domains (e.g. big.LITTLE), the
current code should behave relatively well. And I think that covers a
large portion of the real-world systems for which EAS is useful, as of
today at least ... :-)

> Also, I
> feel that warrants a comment warning about this.
> 
> Someone, somewhere will try and build a 64+64 cpu system and get
> surprised it doesn't work :-)
Leo Yan April 17, 2018, 3:39 p.m. UTC | #5
On Fri, Apr 06, 2018 at 04:36:06PM +0100, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>
> 
> When an energy model is available, waking tasks are re-routed to a
> new energy-aware placement algorithm. The CPUs eligible for the
> energy-aware wake-up path are restricted to the highest non-overutilized
> sched_domain containing prev_cpu and this_cpu. If no such domain is found,
> tasks go through the usual wake-up path, hence energy-aware placement
> happens only in lightly utilized scenarios.
> 
> The selection of the most energy-efficient CPU for a task is achieved by
> estimating the impact on system-level active energy resulting from the
> placement of the task on the CPU with the highest spare capacity in each
> frequency domain. The best CPU energy-wise is then selected if it saves
> a large enough amount of energy with respect to prev_cpu.
> 
> Although it has already shown significant benefits on some existing
> targets, this approach cannot scale to platforms with numerous CPUs.
> This patch is an attempt to do something useful, as writing a fast
> heuristic that performs reasonably well on a broad spectrum of
> architectures isn't an easy task. As a consequence, the scope of
> usability of the energy-aware wake-up path is restricted to systems
> with the SD_ASYM_CPUCAPACITY flag set. These systems not only show the
> most promising opportunities for saving energy but also typically
> feature a limited number of logical CPUs.
> 
> Moreover, the energy-aware wake-up path is accessible only if
> sched_energy_enabled() is true. For systems that don't meet all of the
> EAS dependencies (CONFIG_PM_OPP, for example) at compile time,
> sched_energy_enabled() defaults to a constant "false" value, letting
> the compiler remove the unused EAS code entirely.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/fair.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 93 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8cb9fb04fff2..5ebb2d0306c7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6700,6 +6700,81 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
>  	return energy;
>  }
>  
> +static int find_energy_efficient_cpu(struct sched_domain *sd,
> +					struct task_struct *p, int prev_cpu)
> +{
> +	unsigned long cur_energy, prev_energy, best_energy, cpu_cap;
> +	unsigned long task_util = task_util_est(p);
> +	int cpu, best_energy_cpu = prev_cpu;
> +	struct freq_domain *fd;
> +
> +	if (!task_util)
> +		return prev_cpu;
> +
> +	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> +		prev_energy = best_energy = compute_energy(p, prev_cpu);
> +	else
> +		prev_energy = best_energy = ULONG_MAX;
> +
> +	for_each_freq_domain(fd) {
> +		unsigned long spare_cap, max_spare_cap = 0;
> +		int max_spare_cap_cpu = -1;
> +		unsigned long util;
> +
> +		/* Find the CPU with the max spare cap in the freq. dom. */
> +		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
> +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> +				continue;
> +
> +			if (cpu == prev_cpu)
> +				continue;
> +
> +			util = cpu_util_wake(cpu, p);
> +			cpu_cap = capacity_of(cpu);
> +			if (!util_fits_capacity(util + task_util, cpu_cap))
> +				continue;
> +
> +			spare_cap = cpu_cap - util;
> +			if (spare_cap > max_spare_cap) {
> +				max_spare_cap = spare_cap;
> +				max_spare_cap_cpu = cpu;
> +			}
> +		}

If there are two clusters, and we iterate over the big cluster first,
then max_spare_cap holds a big value from the big cluster, and the
LITTLE cluster later has no chance of a higher spare_cap. In this
case, will the LITTLE CPU be skipped for the energy computation?

> +
> +		/* Evaluate the energy impact of using this CPU. */
> +		if (max_spare_cap_cpu >= 0) {
> +			cur_energy = compute_energy(p, max_spare_cap_cpu);
> +			if (cur_energy < best_energy) {
> +				best_energy = cur_energy;
> +				best_energy_cpu = max_spare_cap_cpu;
> +			}
> +		}
> +	}
> +
> +	/*
> +	 * We pick the best CPU only if it saves at least 1.5% of the
> +	 * energy used by prev_cpu.
> +	 */
> +	if ((prev_energy - best_energy) > (prev_energy >> 6))
> +		return best_energy_cpu;
> +
> +	return prev_cpu;
> +}
> +
> +static inline bool wake_energy(struct task_struct *p, int prev_cpu)
> +{
> +	struct sched_domain *sd;
> +
> +	if (!sched_energy_enabled())
> +		return false;
> +
> +	sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
> +	if (!sd || sd_overutilized(sd))
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -6716,18 +6791,22 @@ static int
>  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
>  {
>  	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> +	struct sched_domain *energy_sd = NULL;
>  	int cpu = smp_processor_id();
>  	int new_cpu = prev_cpu;
> -	int want_affine = 0;
> +	int want_affine = 0, want_energy = 0;
>  	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>  
> +	rcu_read_lock();
> +
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> +		want_energy = wake_energy(p, prev_cpu);
>  		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> +			      && cpumask_test_cpu(cpu, &p->cpus_allowed)
> +			      && !want_energy;
>  	}
>  
> -	rcu_read_lock();
>  	for_each_domain(cpu, tmp) {
>  		if (!(tmp->flags & SD_LOAD_BALANCE))
>  			break;
> @@ -6742,6 +6821,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			break;
>  		}
>  
> +		/*
> +		 * Energy-aware task placement is performed on the highest
> +		 * non-overutilized domain spanning over cpu and prev_cpu.
> +		 */
> +		if (want_energy && !sd_overutilized(tmp) &&
> +		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> +			energy_sd = tmp;
> +
>  		if (tmp->flags & sd_flag)
>  			sd = tmp;
>  		else if (!want_affine)
> @@ -6765,7 +6852,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  		sync_entity_load_avg(&p->se);
>  	}
>  
> -	if (!sd) {
> +	if (energy_sd) {
> +		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
> +	} else if (!sd) {
>  pick_cpu:
>  		if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
>  			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> -- 
> 2.11.0
>
Quentin Perret April 18, 2018, 7:57 a.m. UTC | #6
Hi Leo,

On Tuesday 17 Apr 2018 at 23:39:44 (+0800), Leo Yan wrote:
> > +	for_each_freq_domain(fd) {
> > +		unsigned long spare_cap, max_spare_cap = 0;
> > +		int max_spare_cap_cpu = -1;
> > +		unsigned long util;
> > +
> > +		/* Find the CPU with the max spare cap in the freq. dom. */
> > +		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
> > +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > +				continue;
> > +
> > +			if (cpu == prev_cpu)
> > +				continue;
> > +
> > +			util = cpu_util_wake(cpu, p);
> > +			cpu_cap = capacity_of(cpu);
> > +			if (!util_fits_capacity(util + task_util, cpu_cap))
> > +				continue;
> > +
> > +			spare_cap = cpu_cap - util;
> > +			if (spare_cap > max_spare_cap) {
> > +				max_spare_cap = spare_cap;
> > +				max_spare_cap_cpu = cpu;
> > +			}
> > +		}
> 
> If there are two clusters, and we iterate over the big cluster first,
> then max_spare_cap holds a big value from the big cluster, and the
> LITTLE cluster later has no chance of a higher spare_cap. In this
> case, will the LITTLE CPU be skipped for the energy computation?

max_spare_cap is reset to 0 at the top of the for_each_freq_domain()
loop above, so that shouldn't happen.
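
For reference, the per-domain re-initialization Quentin points at is
the pair of initializers at the top of for_each_freq_domain() in the
patch below:

	unsigned long spare_cap, max_spare_cap = 0;	/* reset per domain */
	int max_spare_cap_cpu = -1;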

Thanks,
Quentin

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cb9fb04fff2..5ebb2d0306c7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6700,6 +6700,81 @@  static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
 	return energy;
 }
 
+static int find_energy_efficient_cpu(struct sched_domain *sd,
+					struct task_struct *p, int prev_cpu)
+{
+	unsigned long cur_energy, prev_energy, best_energy, cpu_cap;
+	unsigned long task_util = task_util_est(p);
+	int cpu, best_energy_cpu = prev_cpu;
+	struct freq_domain *fd;
+
+	if (!task_util)
+		return prev_cpu;
+
+	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
+		prev_energy = best_energy = compute_energy(p, prev_cpu);
+	else
+		prev_energy = best_energy = ULONG_MAX;
+
+	for_each_freq_domain(fd) {
+		unsigned long spare_cap, max_spare_cap = 0;
+		int max_spare_cap_cpu = -1;
+		unsigned long util;
+
+		/* Find the CPU with the max spare cap in the freq. dom. */
+		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
+			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+				continue;
+
+			if (cpu == prev_cpu)
+				continue;
+
+			util = cpu_util_wake(cpu, p);
+			cpu_cap = capacity_of(cpu);
+			if (!util_fits_capacity(util + task_util, cpu_cap))
+				continue;
+
+			spare_cap = cpu_cap - util;
+			if (spare_cap > max_spare_cap) {
+				max_spare_cap = spare_cap;
+				max_spare_cap_cpu = cpu;
+			}
+		}
+
+		/* Evaluate the energy impact of using this CPU. */
+		if (max_spare_cap_cpu >= 0) {
+			cur_energy = compute_energy(p, max_spare_cap_cpu);
+			if (cur_energy < best_energy) {
+				best_energy = cur_energy;
+				best_energy_cpu = max_spare_cap_cpu;
+			}
+		}
+	}
+
+	/*
+	 * We pick the best CPU only if it saves at least 1.5% of the
+	 * energy used by prev_cpu.
+	 */
+	if ((prev_energy - best_energy) > (prev_energy >> 6))
+		return best_energy_cpu;
+
+	return prev_cpu;
+}
+
+static inline bool wake_energy(struct task_struct *p, int prev_cpu)
+{
+	struct sched_domain *sd;
+
+	if (!sched_energy_enabled())
+		return false;
+
+	sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
+	if (!sd || sd_overutilized(sd))
+		return false;
+
+	return true;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6716,18 +6791,22 @@  static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+	struct sched_domain *energy_sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
-	int want_affine = 0;
+	int want_affine = 0, want_energy = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	rcu_read_lock();
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
+		want_energy = wake_energy(p, prev_cpu);
 		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
-			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
+			      && cpumask_test_cpu(cpu, &p->cpus_allowed)
+			      && !want_energy;
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6742,6 +6821,14 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			break;
 		}
 
+		/*
+		 * Energy-aware task placement is performed on the highest
+		 * non-overutilized domain spanning over cpu and prev_cpu.
+		 */
+		if (want_energy && !sd_overutilized(tmp) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
+			energy_sd = tmp;
+
 		if (tmp->flags & sd_flag)
 			sd = tmp;
 		else if (!want_affine)
@@ -6765,7 +6852,9 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		sync_entity_load_avg(&p->se);
 	}
 
-	if (!sd) {
+	if (energy_sd) {
+		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
+	} else if (!sd) {
 pick_cpu:
 		if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);