
[RFCv5,32/46] sched: Energy-aware wake-up task placement

Message ID 1436293469-25707-33-git-send-email-morten.rasmussen@arm.com (mailing list archive)
State RFC

Commit Message

Morten Rasmussen July 7, 2015, 6:24 p.m. UTC
Let available compute capacity and estimated energy impact select the
wake-up target cpu when energy-aware scheduling is enabled and the
system is not over-utilized (above the tipping point).

energy_aware_wake_cpu() attempts to find a group of cpus with sufficient
compute capacity to accommodate the task, and then a cpu with enough spare
capacity to handle the task within that group. Preference is given to
cpus with enough spare capacity at the current OPP. Finally, the energy
impact of the new target cpu versus the previous task cpu is compared to
select the wake-up target cpu.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 1 deletion(-)

Comments

Sai July 17, 2015, 12:10 a.m. UTC | #1
Hi Morten,

On 07/07/2015 11:24 AM, Morten Rasmussen wrote:
> ---

> +static int energy_aware_wake_cpu(struct task_struct *p, int target)
> +{
> +	struct sched_domain *sd;
> +	struct sched_group *sg, *sg_target;
> +	int target_max_cap = INT_MAX;
> +	int target_cpu = task_cpu(p);
> +	int i;
> +
> +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> +
> +	if (!sd)
> +		return target;
> +
> +	sg = sd->groups;
> +	sg_target = sg;
> +
> +	/*
> +	 * Find group with sufficient capacity. We only get here if no cpu is
> +	 * overutilized. We may end up overutilizing a cpu by adding the task,
> +	 * but that should not be any worse than select_idle_sibling().
> +	 * load_balance() should sort it out later as we get above the tipping
> +	 * point.
> +	 */
> +	do {
> +		/* Assuming all cpus are the same in group */
> +		int max_cap_cpu = group_first_cpu(sg);
> +
> +		/*
> +		 * Assume smaller max capacity means more energy-efficient.
> +		 * Ideally we should query the energy model for the right
> +		 * answer but it easily ends up in an exhaustive search.
> +		 */
> +		if (capacity_of(max_cap_cpu) < target_max_cap &&
> +		    task_fits_capacity(p, max_cap_cpu)) {
> +			sg_target = sg;
> +			target_max_cap = capacity_of(max_cap_cpu);
> +		}
> +	} while (sg = sg->next, sg != sd->groups);

Should be capacity_orig_of(max_cap_cpu) right? Might select a suboptimal
sg_target if max_cap_cpu has a significant amount of IRQ/RT activity.

> +
> +	/* Find cpu with sufficient capacity */
> +	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> +		/*
> +		 * p's blocked utilization is still accounted for on prev_cpu
> +		 * so prev_cpu will receive a negative bias due the double
> +		 * accouting. However, the blocked utilization may be zero.
> +		 */
> +		int new_usage = get_cpu_usage(i) + task_utilization(p);
> +
> +		if (new_usage >	capacity_orig_of(i))
> +			continue;

Is this supposed to be capacity_of(i) instead?

> +
> +		if (new_usage <	capacity_curr_of(i)) {
> +			target_cpu = i;
> +			if (cpu_rq(i)->nr_running)
> +				break;
> +		}
> +
> +		/* cpu has capacity at higher OPP, keep it as fallback */
> +		if (target_cpu == task_cpu(p))
> +			target_cpu = i;
> +	}
> +
> +	if (target_cpu != task_cpu(p)) {
> +		struct energy_env eenv = {
> +			.usage_delta	= task_utilization(p),
> +			.src_cpu	= task_cpu(p),
> +			.dst_cpu	= target_cpu,
> +		};
> +
> +		/* Not enough spare capacity on previous cpu */
> +		if (cpu_overutilized(task_cpu(p)))
> +			return target_cpu;
> +
> +		if (energy_diff(&eenv) >= 0)
> +			return task_cpu(p);
> +	}
> +
> +	return target_cpu;
> +}
> +

-Sai
Morten Rasmussen July 20, 2015, 3:38 p.m. UTC | #2
On Thu, Jul 16, 2015 at 05:10:52PM -0700, Sai Gurrappadi wrote:
> Hi Morten,
> 
> On 07/07/2015 11:24 AM, Morten Rasmussen wrote:
> > ---
> 
> > +static int energy_aware_wake_cpu(struct task_struct *p, int target)
> > +{
> > +	struct sched_domain *sd;
> > +	struct sched_group *sg, *sg_target;
> > +	int target_max_cap = INT_MAX;
> > +	int target_cpu = task_cpu(p);
> > +	int i;
> > +
> > +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> > +
> > +	if (!sd)
> > +		return target;
> > +
> > +	sg = sd->groups;
> > +	sg_target = sg;
> > +
> > +	/*
> > +	 * Find group with sufficient capacity. We only get here if no cpu is
> > +	 * overutilized. We may end up overutilizing a cpu by adding the task,
> > +	 * but that should not be any worse than select_idle_sibling().
> > +	 * load_balance() should sort it out later as we get above the tipping
> > +	 * point.
> > +	 */
> > +	do {
> > +		/* Assuming all cpus are the same in group */
> > +		int max_cap_cpu = group_first_cpu(sg);
> > +
> > +		/*
> > +		 * Assume smaller max capacity means more energy-efficient.
> > +		 * Ideally we should query the energy model for the right
> > +		 * answer but it easily ends up in an exhaustive search.
> > +		 */
> > +		if (capacity_of(max_cap_cpu) < target_max_cap &&
> > +		    task_fits_capacity(p, max_cap_cpu)) {
> > +			sg_target = sg;
> > +			target_max_cap = capacity_of(max_cap_cpu);
> > +		}
> > +	} while (sg = sg->next, sg != sd->groups);
> 
> Should be capacity_orig_of(max_cap_cpu) right? Might select a suboptimal
> sg_target if max_cap_cpu has a significant amount of IRQ/RT activity.

Right, this heuristic isn't as good as I had hoped for.
task_fits_capacity() uses capacity_of() to check whether we have
available capacity after subtracting RT/IRQ activity, which should be
right, but I only check the first cpu. So I might discard a group due to
RT/IRQ activity on the first cpu while one of the sibling cpus could be
fine. Then going for the lowest capacity_of() means preferring the group
with the most RT/IRQ activity that still has enough capacity to fit the
task.

Using capacity_orig_of() we would ignore RT/IRQ activity, but it is
likely to be better as we can try to avoid RT/IRQ activity later. I will
use capacity_orig_of() here instead. Thanks.
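
For illustration, the group-selection test would then look roughly like
this (an untested sketch of the change described above, not the final
patch):

	do {
		/* Assuming all cpus are the same in group */
		int max_cap_cpu = group_first_cpu(sg);

		/*
		 * Compare original (max) capacity so RT/IRQ pressure on
		 * the first cpu does not skew which group is picked;
		 * whether the task fits is still checked as before.
		 */
		if (capacity_orig_of(max_cap_cpu) < target_max_cap &&
		    task_fits_capacity(p, max_cap_cpu)) {
			sg_target = sg;
			target_max_cap = capacity_orig_of(max_cap_cpu);
		}
	} while (sg = sg->next, sg != sd->groups);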

> > +
> > +	/* Find cpu with sufficient capacity */
> > +	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> > +		/*
> > +		 * p's blocked utilization is still accounted for on prev_cpu
> > +		 * so prev_cpu will receive a negative bias due the double
> > +		 * accouting. However, the blocked utilization may be zero.
> > +		 */
> > +		int new_usage = get_cpu_usage(i) + task_utilization(p);
> > +
> > +		if (new_usage >	capacity_orig_of(i))
> > +			continue;
> 
> Is this supposed to be capacity_of(i) instead?

Yes, we should skip cpus with too much RT/IRQ activity here. Thanks.
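
In other words, roughly (just a sketch of the agreed fix, not the final
patch):

		int new_usage = get_cpu_usage(i) + task_utilization(p);

		/* skip cpus where RT/IRQ pressure leaves too little room */
		if (new_usage > capacity_of(i))
			continue;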

Morten
Leo Yan Aug. 17, 2015, 4:23 p.m. UTC | #3
Hi Morten,

On Tue, Jul 07, 2015 at 07:24:15PM +0100, Morten Rasmussen wrote:
> Let available compute capacity and estimated energy impact select
> wake-up target cpu when energy-aware scheduling is enabled and the
> system in not over-utilized (above the tipping point).
> 
> energy_aware_wake_cpu() attempts to find group of cpus with sufficient
> compute capacity to accommodate the task and find a cpu with enough spare
> capacity to handle the task within that group. Preference is given to
> cpus with enough spare capacity at the current OPP. Finally, the energy
> impact of the new target and the previous task cpu is compared to select
> the wake-up target cpu.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f7dbda4..01f7337 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5427,6 +5427,86 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	return target;
>  }
>  
> +static int energy_aware_wake_cpu(struct task_struct *p, int target)
> +{
> +	struct sched_domain *sd;
> +	struct sched_group *sg, *sg_target;
> +	int target_max_cap = INT_MAX;
> +	int target_cpu = task_cpu(p);
> +	int i;
> +
> +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> +
> +	if (!sd)
> +		return target;
> +
> +	sg = sd->groups;
> +	sg_target = sg;
> +
> +	/*
> +	 * Find group with sufficient capacity. We only get here if no cpu is
> +	 * overutilized. We may end up overutilizing a cpu by adding the task,
> +	 * but that should not be any worse than select_idle_sibling().
> +	 * load_balance() should sort it out later as we get above the tipping
> +	 * point.
> +	 */
> +	do {
> +		/* Assuming all cpus are the same in group */
> +		int max_cap_cpu = group_first_cpu(sg);
> +
> +		/*
> +		 * Assume smaller max capacity means more energy-efficient.
> +		 * Ideally we should query the energy model for the right
> +		 * answer but it easily ends up in an exhaustive search.
> +		 */
> +		if (capacity_of(max_cap_cpu) < target_max_cap &&
> +		    task_fits_capacity(p, max_cap_cpu)) {
> +			sg_target = sg;
> +			target_max_cap = capacity_of(max_cap_cpu);
> +		}
> +	} while (sg = sg->next, sg != sd->groups);
> +
> +	/* Find cpu with sufficient capacity */
> +	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> +		/*
> +		 * p's blocked utilization is still accounted for on prev_cpu
> +		 * so prev_cpu will receive a negative bias due the double
> +		 * accouting. However, the blocked utilization may be zero.
> +		 */
> +		int new_usage = get_cpu_usage(i) + task_utilization(p);
> +
> +		if (new_usage >	capacity_orig_of(i))
> +			continue;
> +
> +		if (new_usage <	capacity_curr_of(i)) {
> +			target_cpu = i;
> +			if (cpu_rq(i)->nr_running)
> +				break;
> +		}
> +
> +		/* cpu has capacity at higher OPP, keep it as fallback */
> +		if (target_cpu == task_cpu(p))
> +			target_cpu = i;

If the CPU's current capacity cannot meet the requirement, why not keep
the task on its prev CPU so it has a chance of using a hot cache? Or is
the purpose to place tasks on the first CPU in the sched_group whenever
possible?

> +	}
> +
> +	if (target_cpu != task_cpu(p)) {
> +		struct energy_env eenv = {
> +			.usage_delta	= task_utilization(p),
> +			.src_cpu	= task_cpu(p),
> +			.dst_cpu	= target_cpu,
> +		};
> +
> +		/* Not enough spare capacity on previous cpu */
> +		if (cpu_overutilized(task_cpu(p)))
> +			return target_cpu;
> +
> +		if (energy_diff(&eenv) >= 0)
> +			return task_cpu(p);
> +	}
> +
> +	return target_cpu;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5479,7 +5559,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  		prev_cpu = cpu;
>  
>  	if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
> -		new_cpu = select_idle_sibling(p, prev_cpu);
> +		if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
> +			new_cpu = energy_aware_wake_cpu(p, prev_cpu);
> +		else
> +			new_cpu = select_idle_sibling(p, prev_cpu);
>  		goto unlock;
>  	}
>  
> -- 
> 1.9.1
> 
Leo Yan Sept. 2, 2015, 5:11 p.m. UTC | #4
On Tue, Jul 07, 2015 at 07:24:15PM +0100, Morten Rasmussen wrote:
> Let available compute capacity and estimated energy impact select
> wake-up target cpu when energy-aware scheduling is enabled and the
> system in not over-utilized (above the tipping point).
> 
> energy_aware_wake_cpu() attempts to find group of cpus with sufficient
> compute capacity to accommodate the task and find a cpu with enough spare
> capacity to handle the task within that group. Preference is given to
> cpus with enough spare capacity at the current OPP. Finally, the energy
> impact of the new target and the previous task cpu is compared to select
> the wake-up target cpu.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 84 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f7dbda4..01f7337 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5427,6 +5427,86 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	return target;
>  }
>  
> +static int energy_aware_wake_cpu(struct task_struct *p, int target)
> +{
> +	struct sched_domain *sd;
> +	struct sched_group *sg, *sg_target;
> +	int target_max_cap = INT_MAX;
> +	int target_cpu = task_cpu(p);
> +	int i;
> +
> +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> +
> +	if (!sd)
> +		return target;
> +
> +	sg = sd->groups;
> +	sg_target = sg;
> +
> +	/*
> +	 * Find group with sufficient capacity. We only get here if no cpu is
> +	 * overutilized. We may end up overutilizing a cpu by adding the task,
> +	 * but that should not be any worse than select_idle_sibling().
> +	 * load_balance() should sort it out later as we get above the tipping
> +	 * point.
> +	 */
> +	do {
> +		/* Assuming all cpus are the same in group */
> +		int max_cap_cpu = group_first_cpu(sg);
> +
> +		/*
> +		 * Assume smaller max capacity means more energy-efficient.
> +		 * Ideally we should query the energy model for the right
> +		 * answer but it easily ends up in an exhaustive search.
> +		 */
> +		if (capacity_of(max_cap_cpu) < target_max_cap &&
> +		    task_fits_capacity(p, max_cap_cpu)) {
> +			sg_target = sg;
> +			target_max_cap = capacity_of(max_cap_cpu);
> +		}

Should this also consider the scenario where two groups have the same
capacity? That would benefit the LITTLE.LITTLE case. The code would then
look something like below:

	int target_sg_cpu = INT_MAX;

	if (capacity_of(max_cap_cpu) <= target_max_cap &&
            task_fits_capacity(p, max_cap_cpu)) {

                if ((capacity_of(max_cap_cpu) == target_max_cap) &&
		    (target_sg_cpu < max_cap_cpu))
		        continue;

		target_sg_cpu = max_cap_cpu;
		sg_target = sg;
		target_max_cap = capacity_of(max_cap_cpu);
	}

> +	} while (sg = sg->next, sg != sd->groups);
> +
> +	/* Find cpu with sufficient capacity */
> +	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> +		/*
> +		 * p's blocked utilization is still accounted for on prev_cpu
> +		 * so prev_cpu will receive a negative bias due the double
> +		 * accouting. However, the blocked utilization may be zero.
> +		 */
> +		int new_usage = get_cpu_usage(i) + task_utilization(p);
> +
> +		if (new_usage >	capacity_orig_of(i))
> +			continue;
> +
> +		if (new_usage <	capacity_curr_of(i)) {
> +			target_cpu = i;
> +			if (cpu_rq(i)->nr_running)
> +				break;
> +		}
> +
> +		/* cpu has capacity at higher OPP, keep it as fallback */
> +		if (target_cpu == task_cpu(p))
> +			target_cpu = i;
> +	}
> +
> +	if (target_cpu != task_cpu(p)) {
> +		struct energy_env eenv = {
> +			.usage_delta	= task_utilization(p),
> +			.src_cpu	= task_cpu(p),
> +			.dst_cpu	= target_cpu,
> +		};
> +
> +		/* Not enough spare capacity on previous cpu */
> +		if (cpu_overutilized(task_cpu(p)))
> +			return target_cpu;
> +
> +		if (energy_diff(&eenv) >= 0)
> +			return task_cpu(p);
> +	}
> +
> +	return target_cpu;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5479,7 +5559,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  		prev_cpu = cpu;
>  
>  	if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
> -		new_cpu = select_idle_sibling(p, prev_cpu);
> +		if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
> +			new_cpu = energy_aware_wake_cpu(p, prev_cpu);
> +		else
> +			new_cpu = select_idle_sibling(p, prev_cpu);
>  		goto unlock;
>  	}
>  
> -- 
> 1.9.1
> 
Dietmar Eggemann Sept. 18, 2015, 10:34 a.m. UTC | #5
On 02/09/15 18:11, Leo Yan wrote:
> On Tue, Jul 07, 2015 at 07:24:15PM +0100, Morten Rasmussen wrote:
>> Let available compute capacity and estimated energy impact select
>> wake-up target cpu when energy-aware scheduling is enabled and the
>> system in not over-utilized (above the tipping point).
>>
>> energy_aware_wake_cpu() attempts to find group of cpus with sufficient
>> compute capacity to accommodate the task and find a cpu with enough spare
>> capacity to handle the task within that group. Preference is given to
>> cpus with enough spare capacity at the current OPP. Finally, the energy
>> impact of the new target and the previous task cpu is compared to select
>> the wake-up target cpu.
>>
>> cc: Ingo Molnar <mingo@redhat.com>
>> cc: Peter Zijlstra <peterz@infradead.org>
>>
>> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> ---
>>  kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 84 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0f7dbda4..01f7337 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5427,6 +5427,86 @@ static int select_idle_sibling(struct task_struct *p, int target)
>>  	return target;
>>  }
>>  
>> +static int energy_aware_wake_cpu(struct task_struct *p, int target)
>> +{
>> +	struct sched_domain *sd;
>> +	struct sched_group *sg, *sg_target;
>> +	int target_max_cap = INT_MAX;
>> +	int target_cpu = task_cpu(p);
>> +	int i;
>> +
>> +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
>> +
>> +	if (!sd)
>> +		return target;
>> +
>> +	sg = sd->groups;
>> +	sg_target = sg;
>> +
>> +	/*
>> +	 * Find group with sufficient capacity. We only get here if no cpu is
>> +	 * overutilized. We may end up overutilizing a cpu by adding the task,
>> +	 * but that should not be any worse than select_idle_sibling().
>> +	 * load_balance() should sort it out later as we get above the tipping
>> +	 * point.
>> +	 */
>> +	do {
>> +		/* Assuming all cpus are the same in group */
>> +		int max_cap_cpu = group_first_cpu(sg);
>> +
>> +		/*
>> +		 * Assume smaller max capacity means more energy-efficient.
>> +		 * Ideally we should query the energy model for the right
>> +		 * answer but it easily ends up in an exhaustive search.
>> +		 */
>> +		if (capacity_of(max_cap_cpu) < target_max_cap &&
>> +		    task_fits_capacity(p, max_cap_cpu)) {
>> +			sg_target = sg;
>> +			target_max_cap = capacity_of(max_cap_cpu);
>> +		}
> 
> Here should consider scenario for two groups have same capacity?
> This will benefit for the case LITTLE.LITTLE. So the code will be
> looks like below:
> 
> 	int target_sg_cpu = INT_MAX;
> 
> 	if (capacity_of(max_cap_cpu) <= target_max_cap &&
>             task_fits_capacity(p, max_cap_cpu)) {
> 
>                 if ((capacity_of(max_cap_cpu) == target_max_cap) &&
> 		    (target_sg_cpu < max_cap_cpu))
> 		        continue;
> 
> 		target_sg_cpu = max_cap_cpu;
> 		sg_target = sg;
> 		target_max_cap = capacity_of(max_cap_cpu);
> 	}
> 

It's true that on your SMP system the target sched_group 'sg_target'
depends only on 'task_cpu(p)' because this determines sched_domain 'sd'
(and so the order of sched_groups for the iteration).

So the current do-while loop to select 'sg_target' for an SMP system
makes little sense.
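
To illustrate (a sketch, assuming the usual ordering where the first
sched_group of task_cpu(p)'s sd_ea domain is the one containing
task_cpu(p)): with two identical clusters the loop effectively does

	sg_target = sd->groups;		/* prev_cpu's cluster */
	target_max_cap = INT_MAX;

	/* 1st group: capacity < INT_MAX, sg_target = prev_cpu's cluster */
	/* 2nd group: equal capacity, the '<' test fails, no change      */

so 'sg_target' always ends up being the waking task's previous cluster.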

But why should we favour the first sched_group (cluster) (the one w/ the
lower max_cap_cpu number) in this situation?

[...]

Steve Muckle Sept. 20, 2015, 6:39 p.m. UTC | #6
On 09/18/2015 03:34 AM, Dietmar Eggemann wrote:
>> Here should consider scenario for two groups have same capacity?
>> This will benefit for the case LITTLE.LITTLE. So the code will be
>> looks like below:
>>
>> 	int target_sg_cpu = INT_MAX;
>>
>> 	if (capacity_of(max_cap_cpu) <= target_max_cap &&
>>             task_fits_capacity(p, max_cap_cpu)) {
>>
>>                 if ((capacity_of(max_cap_cpu) == target_max_cap) &&
>> 		    (target_sg_cpu < max_cap_cpu))
>> 		        continue;
>>
>> 		target_sg_cpu = max_cap_cpu;
>> 		sg_target = sg;
>> 		target_max_cap = capacity_of(max_cap_cpu);
>> 	}
>>
> 
> It's true that on your SMP system the target sched_group 'sg_target'
> depends only on 'task_cpu(p)' because this determines sched_domain 'sd'
> (and so the order of sched_groups for the iteration).
> 
> So the current do-while loop to select 'sg_target' for an SMP system
> makes little sense.
> 
> But why should we favour the first sched_group (cluster) (the one w/ the
> lower max_cap_cpu number) in this situation?

Running the originally proposed code on a system with two identical
clusters, it looks like we'll always end up doing an energy-aware search
in the task's prev_cpu cluster (sched_group). If you had small tasks
scattered across both clusters, energy_aware_wake_cpu() would not
consider condensing them on a single cluster. Leo, was this the issue
you were seeing?

However, I think there may be negative side effects with the proposed
policy above as well: won't this cause us to pack the first cluster
until it's 100% full (running at fmax) before using the second cluster?
That would also be bad for power.

thanks,
Steve

Leo Yan Sept. 20, 2015, 10:03 p.m. UTC | #7
On Sun, Sep 20, 2015 at 11:39:16AM -0700, Steve Muckle wrote:
> On 09/18/2015 03:34 AM, Dietmar Eggemann wrote:
> >> Here should consider scenario for two groups have same capacity?
> >> This will benefit for the case LITTLE.LITTLE. So the code will be
> >> looks like below:
> >>
> >> 	int target_sg_cpu = INT_MAX;
> >>
> >> 	if (capacity_of(max_cap_cpu) <= target_max_cap &&
> >>             task_fits_capacity(p, max_cap_cpu)) {
> >>
> >>                 if ((capacity_of(max_cap_cpu) == target_max_cap) &&
> >> 		    (target_sg_cpu < max_cap_cpu))
> >> 		        continue;
> >>
> >> 		target_sg_cpu = max_cap_cpu;
> >> 		sg_target = sg;
> >> 		target_max_cap = capacity_of(max_cap_cpu);
> >> 	}
> >>
> > 
> > It's true that on your SMP system the target sched_group 'sg_target'
> > depends only on 'task_cpu(p)' because this determines sched_domain 'sd'
> > (and so the order of sched_groups for the iteration).
> > 
> > So the current do-while loop to select 'sg_target' for an SMP system
> > makes little sense.
> > 
> > But why should we favour the first sched_group (cluster) (the one w/ the
> > lower max_cap_cpu number) in this situation?
> 
> Running the originally proposed code on a system with two identical
> clusters, it looks like we'll always end up doing an energy-aware search
> in the task's prev_cpu cluster (sched_group). If you had small tasks
> scattered across both clusters, energy_aware_wake_cpu() would not
> consider condensing them on a single cluster. Leo was this the issue you
> were seeing?

Exactly.

> However I think there may be negative side effects with the proposed
> policy above as well - won't this cause us to pack the first cluster
> until it's 100% full (running at fmax) before using the second cluster?
> That would also be bad for power.

In the case where a CPU is running at fmax, it's true that
task_fits_capacity() will return true. But I think cpu_overutilized()
will also return true, which means the scheduler will go back to CFS's
old way of load balancing. In the end tasks will be spread across the
two clusters.
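
For reference, the wake-up path in this patch also falls back to the old
behaviour once the root domain is marked overutilized (from the
select_task_rq_fair() hunk of this patch):

	if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
		new_cpu = energy_aware_wake_cpu(p, prev_cpu);
	else
		new_cpu = select_idle_sibling(p, prev_cpu);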

I also reviewed the profiling results on Hikey with this modification
[1]: with rt-app at 6%/13%/19%/25%, the 8 tasks are packed onto one
cluster as far as possible, but at 31%/38%/44%/50% tasks are also placed
on the second cluster. NOTE: this conclusion is based on CPU idle duty
cycles, not on real power data.

[1] https://lists.linaro.org/pipermail/eas-dev/2015-September/000218.html

Thanks,
Leo Yan
Steve Muckle Sept. 29, 2015, 12:15 a.m. UTC | #8
On 09/20/2015 03:03 PM, Leo Yan wrote:
> In this case of CPU is running at fmax, it's true that
> task_fits_capacity() will return true. But here i think
> cpu_overutilized() also will return true, so that means scheduler will
> go back to use CFS's old way for loading balance. Finally tasks also
> will be spread into two clusters.

Agreed that once the first cluster is overutilized, the load will
definitely spread to both clusters. My concern, though, is that for this
to occur, the first cluster will likely be pushed to a high OPP. For
power (and perhaps even performance) spreading the load earlier may be
better. Or it may not; my observation is really just that we're encoding
policy here which ideally would be the result of calculations in the
energy model.

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f7dbda4..01f7337 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5427,6 +5427,86 @@  static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 
+static int energy_aware_wake_cpu(struct task_struct *p, int target)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg, *sg_target;
+	int target_max_cap = INT_MAX;
+	int target_cpu = task_cpu(p);
+	int i;
+
+	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
+
+	if (!sd)
+		return target;
+
+	sg = sd->groups;
+	sg_target = sg;
+
+	/*
+	 * Find group with sufficient capacity. We only get here if no cpu is
+	 * overutilized. We may end up overutilizing a cpu by adding the task,
+	 * but that should not be any worse than select_idle_sibling().
+	 * load_balance() should sort it out later as we get above the tipping
+	 * point.
+	 */
+	do {
+		/* Assuming all cpus are the same in group */
+		int max_cap_cpu = group_first_cpu(sg);
+
+		/*
+		 * Assume smaller max capacity means more energy-efficient.
+		 * Ideally we should query the energy model for the right
+		 * answer but it easily ends up in an exhaustive search.
+		 */
+		if (capacity_of(max_cap_cpu) < target_max_cap &&
+		    task_fits_capacity(p, max_cap_cpu)) {
+			sg_target = sg;
+			target_max_cap = capacity_of(max_cap_cpu);
+		}
+	} while (sg = sg->next, sg != sd->groups);
+
+	/* Find cpu with sufficient capacity */
+	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+		/*
+		 * p's blocked utilization is still accounted for on prev_cpu
+		 * so prev_cpu will receive a negative bias due the double
+		 * accouting. However, the blocked utilization may be zero.
+		 */
+		int new_usage = get_cpu_usage(i) + task_utilization(p);
+
+		if (new_usage >	capacity_orig_of(i))
+			continue;
+
+		if (new_usage <	capacity_curr_of(i)) {
+			target_cpu = i;
+			if (cpu_rq(i)->nr_running)
+				break;
+		}
+
+		/* cpu has capacity at higher OPP, keep it as fallback */
+		if (target_cpu == task_cpu(p))
+			target_cpu = i;
+	}
+
+	if (target_cpu != task_cpu(p)) {
+		struct energy_env eenv = {
+			.usage_delta	= task_utilization(p),
+			.src_cpu	= task_cpu(p),
+			.dst_cpu	= target_cpu,
+		};
+
+		/* Not enough spare capacity on previous cpu */
+		if (cpu_overutilized(task_cpu(p)))
+			return target_cpu;
+
+		if (energy_diff(&eenv) >= 0)
+			return task_cpu(p);
+	}
+
+	return target_cpu;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5479,7 +5559,10 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		prev_cpu = cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
+			new_cpu = energy_aware_wake_cpu(p, prev_cpu);
+		else
+			new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}