
[RFC,5/6] sched/fair: Select an energy-efficient CPU on task wake-up

Message ID 20180320094312.24081-6-dietmar.eggemann@arm.com (mailing list archive)
State RFC, archived

Commit Message

Dietmar Eggemann March 20, 2018, 9:43 a.m. UTC
From: Quentin Perret <quentin.perret@arm.com>

In case an energy model is available, waking tasks are re-routed into a
new energy-aware placement algorithm. The eligible CPUs to be used in the
energy-aware wakeup path are restricted to the highest non-overutilized
sched_domain containing prev_cpu and this_cpu. If no such domain is found,
the tasks go through the usual wake-up path, hence energy-aware placement
happens only in lightly utilized scenarios.

The selection of the most energy-efficient CPU for a task is achieved by
estimating the impact on system-level active energy resulting from the
placement of the task on each candidate CPU. The best CPU energy-wise is
then selected if it saves a large enough amount of energy with respect to
prev_cpu.

Although it has already shown significant benefits on some existing
targets, this brute force approach clearly cannot scale to platforms with
numerous CPUs. This patch is an attempt to do something useful as writing
a fast heuristic that performs reasonably well on a broad spectrum of
architectures isn't an easy task. As a consequence, the scope of usability
of the energy-aware wake-up path is restricted to systems with the
SD_ASYM_CPUCAPACITY flag set. These systems not only show the most
promising opportunities for saving energy but also typically feature a
limited number of logical CPUs.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 71 insertions(+), 3 deletions(-)

Comments

Patrick Bellasi March 21, 2018, 3:35 p.m. UTC | #1
On 20-Mar 09:43, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 76bd46502486..65a1bead0773 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
>  	return energy;
>  }
> 
> +static bool task_fits(struct task_struct *p, int cpu)
> +{
> +	unsigned long next_util = cpu_util_next(cpu, p, cpu);
> +
> +	return util_fits_capacity(next_util, capacity_orig_of(cpu));
                                             ^^^^^^^^^^^^^^^^^^^^^

Since we are scheduling CFS tasks here, should we not rather use
capacity_of() to account for RT/IRQ pressure?

> +}
> +
> +static int find_energy_efficient_cpu(struct sched_domain *sd,
> +					struct task_struct *p, int prev_cpu)
> +{
> +	unsigned long cur_energy, prev_energy, best_energy;
> +	int cpu, best_cpu = prev_cpu;
> +
> +	if (!task_util(p))

We are still waking up a task... what if the task was previously
running on a big CPU which is now idle?

I understand that from a _relative_ energy_diff standpoint there is
not much to do for a 0-utilization task. However, for those tasks we
can still try to return the most energy-efficient CPU among the ones
in their cpus_allowed mask.

It should add relatively low overhead (maybe contained in a fallback
most_energy_efficient_cpu() kind of function) and would allow, for
example on ARM big.LITTLE systems, consolidating those tasks on
LITTLE CPUs instead of keeping them running on a big CPU.

> +		return prev_cpu;
> +
> +	/* Compute the energy impact of leaving the task on prev_cpu. */
> +	prev_energy = best_energy = compute_energy(p, prev_cpu);
> +
> +	/* Look for the CPU that minimizes the energy. */
                                           ^^^^^^^^^^
nit-pick: would say explicitly "best_energy" here, just to avoid
confusion about (non) possible overflows in the following if check ;)

> +	for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) {
> +		if (!task_fits(p, cpu) || cpu == prev_cpu)

nit-pick: to me it would read better as:

                if (cpu == prev_cpu)
                        continue;
                if (!task_fits(p, cpu))
                        continue;

but it's more a matter of (personal) taste than efficiency.

> +			continue;
> +		cur_energy = compute_energy(p, cpu);
> +		if (cur_energy < best_energy) {
> +			best_energy = cur_energy;
> +			best_cpu = cpu;
> +		}
> +	}
> +
> +	/*
> +	 * We pick the best CPU only if it saves at least 1.5% of the
> +	 * energy used by prev_cpu.
> +	 */
> +	if ((prev_energy - best_energy) > (prev_energy >> 6))
> +		return best_cpu;
> +
> +	return prev_cpu;
> +}

[...]

> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			break;
>  		}
> 
> +		/*
> +		 * Energy-aware task placement is performed on the highest
> +		 * non-overutilized domain spanning over cpu and prev_cpu.
> +		 */
> +		if (want_energy && !sd_overutilized(tmp) &&
> +		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> +			energy_sd = tmp;
> +

Not entirely sure, but I was trying to understand if we can avoid
modifying the definition of want_affine (in the previous chunk) and move
this block before the previous "if (want_affine..." (in mainline but
not in this chunk), which would become an else, e.g.

        if (want_energy && !sd_overutilized(tmp) &&
                // ...
        else if (want_energy && !sd_overutilized(tmp) &&
                // ...

Isn't that the same?

Maybe there is a code path I'm missing... but otherwise it seems a
more self-contained modification of select_task_rq_fair...

>  		if (tmp->flags & sd_flag)
>  			sd = tmp;
>  		else if (!want_affine)
> @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			if (want_affine)
>  				current->recent_used_cpu = cpu;
>  		}
> +	} else if (energy_sd) {
> +		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
>  	} else {
>  		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
>  	}
Joel Fernandes March 22, 2018, 4:27 p.m. UTC | #2
Hi,

On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> From: Quentin Perret <quentin.perret@arm.com>
>
> In case an energy model is available, waking tasks are re-routed into a
> new energy-aware placement algorithm. The eligible CPUs to be used in the
> energy-aware wakeup path are restricted to the highest non-overutilized
> sched_domain containing prev_cpu and this_cpu. If no such domain is found,
> the tasks go through the usual wake-up path, hence energy-aware placement
> happens only in lightly utilized scenarios.
>
> The selection of the most energy-efficient CPU for a task is achieved by
> estimating the impact on system-level active energy resulting from the
> placement of the task on each candidate CPU. The best CPU energy-wise is
> then selected if it saves a large enough amount of energy with respect to
> prev_cpu.
>
> Although it has already shown significant benefits on some existing
> targets, this brute force approach clearly cannot scale to platforms with
> numerous CPUs. This patch is an attempt to do something useful as writing
> a fast heuristic that performs reasonably well on a broad spectrum of
> architectures isn't an easy task. As a consequence, the scope of usability
> of the energy-aware wake-up path is restricted to systems with the
> SD_ASYM_CPUCAPACITY flag set. These systems not only show the most
> promising opportunities for saving energy but also typically feature a
> limited number of logical CPUs.
>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 71 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 76bd46502486..65a1bead0773 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
>         return energy;
>  }
>
> +static bool task_fits(struct task_struct *p, int cpu)
> +{
> +       unsigned long next_util = cpu_util_next(cpu, p, cpu);
> +
> +       return util_fits_capacity(next_util, capacity_orig_of(cpu));
> +}
> +
> +static int find_energy_efficient_cpu(struct sched_domain *sd,
> +                                       struct task_struct *p, int prev_cpu)
> +{
> +       unsigned long cur_energy, prev_energy, best_energy;
> +       int cpu, best_cpu = prev_cpu;
> +
> +       if (!task_util(p))
> +               return prev_cpu;
> +
> +       /* Compute the energy impact of leaving the task on prev_cpu. */
> +       prev_energy = best_energy = compute_energy(p, prev_cpu);

Is it possible that before the wakeup, the task's affinity is changed
so that p->cpus_allowed no longer contains prev_cpu? In that case
prev_energy wouldn't matter since the previous CPU is no longer an option?

> +
> +       /* Look for the CPU that minimizes the energy. */
> +       for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) {
> +               if (!task_fits(p, cpu) || cpu == prev_cpu)
> +                       continue;
> +               cur_energy = compute_energy(p, cpu);
> +               if (cur_energy < best_energy) {
> +                       best_energy = cur_energy;
> +                       best_cpu = cpu;
> +               }
> +       }
> +
> +       /*
> +        * We pick the best CPU only if it saves at least 1.5% of the
> +        * energy used by prev_cpu.
> +        */
> +       if ((prev_energy - best_energy) > (prev_energy >> 6))
> +               return best_cpu;
> +
> +       return prev_cpu;
> +}
> +
> +static inline bool wake_energy(struct task_struct *p, int prev_cpu)
> +{
> +       struct sched_domain *sd;
> +
> +       if (!static_branch_unlikely(&sched_energy_present))
> +               return false;
> +
> +       sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
> +       if (!sd || sd_overutilized(sd))

Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here?

> +               return false;
> +
> +       return true;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -6529,18 +6583,22 @@ static int
>  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
>  {
>         struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> +       struct sched_domain *energy_sd = NULL;
>         int cpu = smp_processor_id();
>         int new_cpu = prev_cpu;
> -       int want_affine = 0;
> +       int want_affine = 0, want_energy = 0;
>         int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>
> +       rcu_read_lock();
> +
>         if (sd_flag & SD_BALANCE_WAKE) {
>                 record_wakee(p);
> +               want_energy = wake_energy(p, prev_cpu);
>                 want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> -                             && cpumask_test_cpu(cpu, &p->cpus_allowed);
> +                             && cpumask_test_cpu(cpu, &p->cpus_allowed)
> +                             && !want_energy;
>         }
>
> -       rcu_read_lock();
>         for_each_domain(cpu, tmp) {
>                 if (!(tmp->flags & SD_LOAD_BALANCE))
>                         break;
> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>                         break;
>                 }
>
> +               /*
> +                * Energy-aware task placement is performed on the highest
> +                * non-overutilized domain spanning over cpu and prev_cpu.
> +                */
> +               if (want_energy && !sd_overutilized(tmp) &&
> +                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))

Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?


> +                       energy_sd = tmp;
> +
>                 if (tmp->flags & sd_flag)
>                         sd = tmp;
>                 else if (!want_affine)
> @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>                         if (want_affine)
>                                 current->recent_used_cpu = cpu;
>                 }
> +       } else if (energy_sd) {
> +               new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);

Even if want_affine = 0 (want_energy = 1), we can have sd = NULL if
sd_flag and tmp->flags don't match. In this case we won't enter the EAS
selection path because sd will be NULL. So should you be setting sd
= NULL explicitly if energy_sd != NULL, or rather do the if
(energy_sd) before doing the if (!sd)?

If you still want to keep the logic this way, then probably you should
also check that (tmp->flags & sd_flag) is true in the loop. That way
energy_sd won't be set at all, since we're basically saying we don't
want to do wake-up across this sd (in an energy-aware fashion in this
case) if the domain flags don't match the wake-up sd_flag.

thanks,

- Joel
Patrick Bellasi March 22, 2018, 6:06 p.m. UTC | #3
On 22-Mar 09:27, Joel Fernandes wrote:
> Hi,
> 
> On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
> >
> > From: Quentin Perret <quentin.perret@arm.com>

[...]

> > +static inline bool wake_energy(struct task_struct *p, int prev_cpu)
> > +{
> > +       struct sched_domain *sd;
> > +
> > +       if (!static_branch_unlikely(&sched_energy_present))
> > +               return false;
> > +
> > +       sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
> > +       if (!sd || sd_overutilized(sd))
> 
> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here?

I think that should be already covered by the static key check
above...

> 
> > +               return false;
> > +
> > +       return true;
> > +}
> > +
> >  /*
> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -6529,18 +6583,22 @@ static int
> >  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
> >  {
> >         struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> > +       struct sched_domain *energy_sd = NULL;
> >         int cpu = smp_processor_id();
> >         int new_cpu = prev_cpu;
> > -       int want_affine = 0;
> > +       int want_affine = 0, want_energy = 0;
> >         int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> >
> > +       rcu_read_lock();
> > +
> >         if (sd_flag & SD_BALANCE_WAKE) {
> >                 record_wakee(p);
> > +               want_energy = wake_energy(p, prev_cpu);
> >                 want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> > -                             && cpumask_test_cpu(cpu, &p->cpus_allowed);
> > +                             && cpumask_test_cpu(cpu, &p->cpus_allowed)
> > +                             && !want_energy;
> >         }
> >
> > -       rcu_read_lock();
> >         for_each_domain(cpu, tmp) {
> >                 if (!(tmp->flags & SD_LOAD_BALANCE))
> >                         break;
> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >                         break;
> >                 }
> >
> > +               /*
> > +                * Energy-aware task placement is performed on the highest
> > +                * non-overutilized domain spanning over cpu and prev_cpu.
> > +                */
> > +               if (want_energy && !sd_overutilized(tmp) &&
> > +                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> 
> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?

... and this then should be covered by the previous check in
wake_energy(), which sets want_energy.

> 
> > +                       energy_sd = tmp;
> > +
> >                 if (tmp->flags & sd_flag)
> >                         sd = tmp;
> >                 else if (!want_affine)
> > @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >                         if (want_affine)
> >                                 current->recent_used_cpu = cpu;
> >                 }
> > +       } else if (energy_sd) {
> > +               new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
> 
> Even if want_affine = 0 (want_energy = 1), we can have sd = NULL if
> sd_flag and tmp->flags don't match. In this case we won't enter the EAS
> selection path because sd will be NULL. So should you be setting sd
> = NULL explicitly if energy_sd != NULL, or rather do the if
> (energy_sd) before doing the if (!sd)?

That's the same thing I was also proposing in my reply to this patch.
But in my case the point was mainly to make the code easier to
follow... which in the end also avoids all the considerations on
dependencies you describe above.

Joel, can you have a look at what I proposed... I was not entirely
sure whether we'd miss some code paths doing it that way.

> If you still want to keep the logic this way, then probably you should
> also check that (tmp->flags & sd_flag) is true in the loop. That way
> energy_sd won't be set at all, since we're basically saying we don't
> want to do wake-up across this sd (in an energy-aware fashion in this
> case) if the domain flags don't match the wake-up sd_flag.
> 
> thanks,
> 
> - Joel
Joel Fernandes March 22, 2018, 8:10 p.m. UTC | #4
On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
> [...]
>
>> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>>                       break;
>>               }
>>
>> +             /*
>> +              * Energy-aware task placement is performed on the highest
>> +              * non-overutilized domain spanning over cpu and prev_cpu.
>> +              */
>> +             if (want_energy && !sd_overutilized(tmp) &&
>> +                 cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>> +                     energy_sd = tmp;
>> +
>
> Not entirely sure, but I was trying to understand if we can avoid
> modifying the definition of want_affine (in the previous chunk) and move
> this block before the previous "if (want_affine..." (in mainline but
> not in this chunk), which would become an else, e.g.
>
>         if (want_energy && !sd_overutilized(tmp) &&
>                 // ...
>         else if (want_energy && !sd_overutilized(tmp) &&
>                 // ...
>
> Isn't that the same?
>
> Maybe there is a code path I'm missing... but otherwise it seems a
> more self-contained modification of select_task_rq_fair...

Just replying to this here Patrick instead of the other thread.

I think this is the right place for the block from Quentin quoted
above because we want to search for the highest domain that is
!overutilized and look among those for the candidates. So from that
perspective, we can't move the block to the beginning and it seems to
be in the right place. My main concern on the other thread was
different, I was talking about the cases where sd_flag & tmp->flags
don't match. In that case, sd = NULL would trump EAS and I was
wondering if that's the right thing to do...

thanks,

- Joel

[...]
Joel Fernandes March 22, 2018, 8:19 p.m. UTC | #5
On Thu, Mar 22, 2018 at 11:06 AM, Patrick Bellasi
<patrick.bellasi@arm.com> wrote:
[..]
>> > +static inline bool wake_energy(struct task_struct *p, int prev_cpu)
>> > +{
>> > +       struct sched_domain *sd;
>> > +
>> > +       if (!static_branch_unlikely(&sched_energy_present))
>> > +               return false;
>> > +
>> > +       sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
>> > +       if (!sd || sd_overutilized(sd))
>>
>> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here?
>
> I think that should be already covered by the static key check
> above...
>

I understand there is an assumption like that, but I was wondering if
it's more future-proof to check it explicitly. I am OK if everyone
thinks it's a valid assumption...

>>
>> > +               return false;
>> > +
>> > +       return true;
>> > +}
>> > +
>> >  /*
>> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
>> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>> > @@ -6529,18 +6583,22 @@ static int
>> >  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
>> >  {
>> >         struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
>> > +       struct sched_domain *energy_sd = NULL;
>> >         int cpu = smp_processor_id();
>> >         int new_cpu = prev_cpu;
>> > -       int want_affine = 0;
>> > +       int want_affine = 0, want_energy = 0;
>> >         int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>> >
>> > +       rcu_read_lock();
>> > +
>> >         if (sd_flag & SD_BALANCE_WAKE) {
>> >                 record_wakee(p);
>> > +               want_energy = wake_energy(p, prev_cpu);
>> >                 want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
>> > -                             && cpumask_test_cpu(cpu, &p->cpus_allowed);
>> > +                             && cpumask_test_cpu(cpu, &p->cpus_allowed)
>> > +                             && !want_energy;
>> >         }
>> >
>> > -       rcu_read_lock();
>> >         for_each_domain(cpu, tmp) {
>> >                 if (!(tmp->flags & SD_LOAD_BALANCE))
>> >                         break;
>> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> >                         break;
>> >                 }
>> >
>> > +               /*
>> > +                * Energy-aware task placement is performed on the highest
>> > +                * non-overutilized domain spanning over cpu and prev_cpu.
>> > +                */
>> > +               if (want_energy && !sd_overutilized(tmp) &&
>> > +                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>>
>> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?
>
> ... and this then should be covered by the previous check in
> wake_energy(), which sets want_energy.

Right, but in a scenario which probably doesn't exist today, where we
have both SD_ASYM_CPUCAPACITY and !SD_ASYM_CPUCAPACITY domains in the
hierarchy for which want_energy = 1, I was wondering whether it's more
future-proof to check it and not make assumptions...

>>
>> > +                       energy_sd = tmp;
>> > +
>> >                 if (tmp->flags & sd_flag)
>> >                         sd = tmp;
>> >                 else if (!want_affine)
>> > @@ -6586,6 +6652,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> >                         if (want_affine)
>> >                                 current->recent_used_cpu = cpu;
>> >                 }
>> > +       } else if (energy_sd) {
>> > +               new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
>>
>> Even if want_affine = 0 (want_energy = 1), we can have sd = NULL if
>> sd_flag and tmp->flags don't match. In this case we won't enter the EAS
>> selection path because sd will be NULL. So should you be setting sd
>> = NULL explicitly if energy_sd != NULL, or rather do the if
>> (energy_sd) before doing the if (!sd)?
>
> That's the same thing I was also proposing in my reply to this patch.
> But in my case the point was mainly to make the code easier to
> follow... which in the end also avoids all the considerations on
> dependencies you describe above.
>
> Joel, can you have a look at what I proposed... I was not entirely
> sure whether we'd miss some code paths doing it that way.

Replied to this in the other thread.

thanks,

- Joel
Morten Rasmussen March 23, 2018, 3:47 p.m. UTC | #6
On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:
> On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:
> > [...]
> >
> >> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >>                       break;
> >>               }
> >>
> >> +             /*
> >> +              * Energy-aware task placement is performed on the highest
> >> +              * non-overutilized domain spanning over cpu and prev_cpu.
> >> +              */
> >> +             if (want_energy && !sd_overutilized(tmp) &&
> >> +                 cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> >> +                     energy_sd = tmp;
> >> +
> >
> > Not entirely sure, but I was trying to understand if we can avoid
> > modifying the definition of want_affine (in the previous chunk) and move
> > this block before the previous "if (want_affine..." (in mainline but
> > not in this chunk), which would become an else, e.g.
> >
> >         if (want_energy && !sd_overutilized(tmp) &&
> >                 // ...
> >         else if (want_energy && !sd_overutilized(tmp) &&
> >                 // ...
> >
> > Isn't that the same?
> >
> > Maybe there is a code path I'm missing... but otherwise it seems a
> > more self-contained modification of select_task_rq_fair...
> 
> Just replying to this here Patrick instead of the other thread.
> 
> I think this is the right place for the block from Quentin quoted
> above because we want to search for the highest domain that is
> !overutilized and look among those for the candidates. So from that
> perspective, we can't move the block to the beginning and it seems to
> be in the right place. My main concern on the other thread was
> different: I was talking about the cases where sd_flag & tmp->flags
> don't match. In that case, sd = NULL would trump EAS and I was
> wondering if that's the right thing to do...

You mean if SD_BALANCE_WAKE isn't set on sched_domains?

The current code seems to rely on that flag to be set to work correctly.
Otherwise, the loop might bail out on !want_affine and we end up doing
find_energy_efficient_cpu() on the lowest-level sched_domain even if
there is a higher-level one which isn't over-utilized.

However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
sd == NULL shouldn't be possible? This only holds as long as we only
want EAS for asymmetric systems.
Morten Rasmussen March 23, 2018, 4 p.m. UTC | #7
On Thu, Mar 22, 2018 at 09:27:43AM -0700, Joel Fernandes wrote:
> Hi,
> 
> On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
> >
> > From: Quentin Perret <quentin.perret@arm.com>
> >
> > In case an energy model is available, waking tasks are re-routed into a
> > new energy-aware placement algorithm. The eligible CPUs to be used in the
> > energy-aware wakeup path are restricted to the highest non-overutilized
> > sched_domain containing prev_cpu and this_cpu. If no such domain is found,
> > the tasks go through the usual wake-up path, hence energy-aware placement
> > happens only in lightly utilized scenarios.
> >
> > The selection of the most energy-efficient CPU for a task is achieved by
> > estimating the impact on system-level active energy resulting from the
> > placement of the task on each candidate CPU. The best CPU energy-wise is
> > then selected if it saves a large enough amount of energy with respect to
> > prev_cpu.
> >
> > Although it has already shown significant benefits on some existing
> > targets, this brute force approach clearly cannot scale to platforms with
> > numerous CPUs. This patch is an attempt to do something useful as writing
> > a fast heuristic that performs reasonably well on a broad spectrum of
> > architectures isn't an easy task. As a consequence, the scope of usability
> > of the energy-aware wake-up path is restricted to systems with the
> > SD_ASYM_CPUCAPACITY flag set. These systems not only show the most
> > promising opportunities for saving energy but also typically feature a
> > limited number of logical CPUs.
> >
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > ---
> >  kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 71 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 76bd46502486..65a1bead0773 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> >         return energy;
> >  }
> >
> > +static bool task_fits(struct task_struct *p, int cpu)
> > +{
> > +       unsigned long next_util = cpu_util_next(cpu, p, cpu);
> > +
> > +       return util_fits_capacity(next_util, capacity_orig_of(cpu));
> > +}
> > +
> > +static int find_energy_efficient_cpu(struct sched_domain *sd,
> > +                                       struct task_struct *p, int prev_cpu)
> > +{
> > +       unsigned long cur_energy, prev_energy, best_energy;
> > +       int cpu, best_cpu = prev_cpu;
> > +
> > +       if (!task_util(p))
> > +               return prev_cpu;
> > +
> > +       /* Compute the energy impact of leaving the task on prev_cpu. */
> > +       prev_energy = best_energy = compute_energy(p, prev_cpu);
> 
> Is it possible that before the wakeup, the task's affinity is changed
> so that p->cpus_allowed no longer contains prev_cpu? In that case
> prev_energy wouldn't matter since the previous CPU is no longer an option?

It is possible to wake up with a disallowed prev_cpu. In fact,
select_idle_sibling() may happily return a disallowed cpu in that case.
The mistake gets fixed in select_task_rq(), which uses
select_fallback_rq() to find an allowed cpu instead.

Could we fix the issue in find_energy_efficient_cpu() by a simple test
like the one below?

if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
	prev_energy = best_energy = compute_energy(p, prev_cpu);
else
	prev_energy = best_energy = ULONG_MAX;
Joel Fernandes March 24, 2018, 12:36 a.m. UTC | #8
On Fri, Mar 23, 2018 at 9:00 AM, Morten Rasmussen
<morten.rasmussen@arm.com> wrote:
> On Thu, Mar 22, 2018 at 09:27:43AM -0700, Joel Fernandes wrote:
>> >
>> > In case an energy model is available, waking tasks are re-routed into a
>> > new energy-aware placement algorithm. The eligible CPUs to be used in the
>> > energy-aware wakeup path are restricted to the highest non-overutilized
>> > sched_domain containing prev_cpu and this_cpu. If no such domain is found,
>> > the tasks go through the usual wake-up path, hence energy-aware placement
>> > happens only in lightly utilized scenarios.
>> >
>> > The selection of the most energy-efficient CPU for a task is achieved by
>> > estimating the impact on system-level active energy resulting from the
>> > placement of the task on each candidate CPU. The best CPU energy-wise is
>> > then selected if it saves a large enough amount of energy with respect to
>> > prev_cpu.
>> >
>> > Although it has already shown significant benefits on some existing
>> > targets, this brute force approach clearly cannot scale to platforms with
>> > numerous CPUs. This patch is an attempt to do something useful as writing
>> > a fast heuristic that performs reasonably well on a broad spectrum of
>> > architectures isn't an easy task. As a consequence, the scope of usability
>> > of the energy-aware wake-up path is restricted to systems with the
>> > SD_ASYM_CPUCAPACITY flag set. These systems not only show the most
>> > promising opportunities for saving energy but also typically feature a
>> > limited number of logical CPUs.
>> >
>> > Cc: Ingo Molnar <mingo@redhat.com>
>> > Cc: Peter Zijlstra <peterz@infradead.org>
>> > Signed-off-by: Quentin Perret <quentin.perret@arm.com>
>> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> > ---
>> >  kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>> >  1 file changed, 71 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 76bd46502486..65a1bead0773 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
>> >         return energy;
>> >  }
>> >
>> > +static bool task_fits(struct task_struct *p, int cpu)
>> > +{
>> > +       unsigned long next_util = cpu_util_next(cpu, p, cpu);
>> > +
>> > +       return util_fits_capacity(next_util, capacity_orig_of(cpu));
>> > +}
>> > +
>> > +static int find_energy_efficient_cpu(struct sched_domain *sd,
>> > +                                       struct task_struct *p, int prev_cpu)
>> > +{
>> > +       unsigned long cur_energy, prev_energy, best_energy;
>> > +       int cpu, best_cpu = prev_cpu;
>> > +
>> > +       if (!task_util(p))
>> > +               return prev_cpu;
>> > +
>> > +       /* Compute the energy impact of leaving the task on prev_cpu. */
>> > +       prev_energy = best_energy = compute_energy(p, prev_cpu);
>>
>> Is it possible that before the wakeup, the task's affinity is changed
>> so that p->cpus_allowed no longer contains prev_cpu ? In that case
>> prev_energy wouldn't matter since previous CPU is no longer an option?
>
> It is possible to wake-up with a disallowed prev_cpu. In fact
> select_idle_sibling() may happily return a disallowed cpu in that case.
> The mistake gets fixed in select_task_rq() which uses
> select_fallback_rq() to find an allowed cpu instead.
>
> Could we fix the issue in find_energy_efficient_cpu() by a simple test
> like below
>
> if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
>         prev_energy = best_energy = compute_energy(p, prev_cpu);
> else
>         prev_energy = best_energy = ULONG_MAX;

Yes, I think setting it to ULONG_MAX in this case is OK with me.

thanks,

- Joel
Joel Fernandes March 24, 2018, 1:13 a.m. UTC | #9
Hi Morten,

On Fri, Mar 23, 2018 at 8:47 AM, Morten Rasmussen
<morten.rasmussen@arm.com> wrote:
> On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:
>> On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi
>> <patrick.bellasi@arm.com> wrote:
>> > [...]
>> >
>> >> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> >>                       break;
>> >>               }
>> >>
>> >> +             /*
>> >> +              * Energy-aware task placement is performed on the highest
>> >> +              * non-overutilized domain spanning over cpu and prev_cpu.
>> >> +              */
>> >> +             if (want_energy && !sd_overutilized(tmp) &&
>> >> +                 cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>> >> +                     energy_sd = tmp;
>> >> +
>> >
>> > Not entirely sure, but I was trying to understand if we can avoid to
>> > modify the definition of want_affine (in the previous chunk) and move
>> > this block before the previous "if (want_affine..." (in mainline but
>> > not in this chunk), which will became an else, e.g.
>> >
>> >         if (want_energy && !sd_overutilized(tmp) &&
>> >                 // ...
>> >         else if (want_energy && !sd_overutilized(tmp) &&
>> >                 // ...
>> >
>> > Isn't that the same?
>> >
>> > Maybe there is a code path I'm missing... but otherwise it seems a
>> > more self contained modification of select_task_rq_fair...
>>
>> Just replying to this here Patrick instead of the other thread.
>>
>> I think this is the right place for the block from Quentin quoted
>> above because we want to search for the highest domain that is
>> !overutilized and look among those for the candidates. So from that
>> perspective, we can't move the block to the beginning and it seems to
>> be in the right place. My main concern on the other thread was
>> different, I was talking about the cases where sd_flag & tmp->flags
>> don't match. In that case, sd = NULL would trump EAS and I was
>> wondering if that's the right thing to do...
>
> You mean if SD_BALANCE_WAKE isn't set on sched_domains?

Yes.

> The current code seems to rely on that flag to be set to work correctly.
> Otherwise, the loop might bail out on !want_affine and we end up doing
> the find_energy_efficient_cpu() on the lowest level sched_domain even if
> there is higher level one which isn't over-utilized.
>
> However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
> sd == NULL shouldn't be possible? This only holds as long as we only
> want EAS for asymmetric systems.

Yes, I see you had topology code that set SD_BALANCE_WAKE for ASYM. It
makes sense to me then, thanks for the clarification.

Still I feel it is a bit tedious/confusing when reading code to draw
the conclusion about why sd is checked first before doing
find_energy_efficient_cpu (and that sd will != NULL for ASYM systems).
If energy_sd is set, then we can just proceed with EAS without
checking that sd != NULL. This function in mainline is already pretty
confusing as it is :-(

Regards,

- Joel
Quentin Perret March 24, 2018, 1:22 a.m. UTC | #10
On Friday 23 Mar 2018 at 15:47:45 (+0000), Morten Rasmussen wrote:
> On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:
> > On Wed, Mar 21, 2018 at 8:35 AM, Patrick Bellasi
> > <patrick.bellasi@arm.com> wrote:
> > > [...]
> > >
> > >> @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> > >>                       break;
> > >>               }
> > >>
> > >> +             /*
> > >> +              * Energy-aware task placement is performed on the highest
> > >> +              * non-overutilized domain spanning over cpu and prev_cpu.
> > >> +              */
> > >> +             if (want_energy && !sd_overutilized(tmp) &&
> > >> +                 cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> > >> +                     energy_sd = tmp;
> > >> +
> > >
> > > Not entirely sure, but I was trying to understand if we can avoid to
> > > modify the definition of want_affine (in the previous chunk) and move
> > > this block before the previous "if (want_affine..." (in mainline but
> > > not in this chunk), which will became an else, e.g.
> > >
> > >         if (want_energy && !sd_overutilized(tmp) &&
> > >                 // ...
> > >         else if (want_energy && !sd_overutilized(tmp) &&
> > >                 // ...
> > >
> > > Isn't that the same?
> > >
> > > Maybe there is a code path I'm missing... but otherwise it seems a
> > > more self contained modification of select_task_rq_fair...
> > 
> > Just replying to this here Patrick instead of the other thread.
> > 
> > I think this is the right place for the block from Quentin quoted
> > above because we want to search for the highest domain that is
> > !overutilized and look among those for the candidates. So from that
> > perspective, we can't move the block to the beginning and it seems to
> > be in the right place. My main concern on the other thread was
> > different, I was talking about the cases where sd_flag & tmp->flags
> > don't match. In that case, sd = NULL would trump EAS and I was
> > wondering if that's the right thing to do...
> 
> You mean if SD_BALANCE_WAKE isn't set on sched_domains?
> 
> The current code seems to rely on that flag to be set to work correctly.
> Otherwise, the loop might bail out on !want_affine and we end up doing
> the find_energy_efficient_cpu() on the lowest level sched_domain even if
> there is higher level one which isn't over-utilized.
> 
> However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
> sd == NULL shouldn't be possible? This only holds as long as we only
> want EAS for asymmetric systems.

That's correct, we are under the assumption that the SD_ASYM_CPUCAPACITY
flag is set somewhere in the hierarchy here. If a sched domain has this
flag set, SD_BALANCE_WAKE is propagated to all lower sched domains
(see sd_init() in kernel/sched/topology.c) so we should be fine.
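A rough model of that propagation, with levels indexed bottom-up and hypothetical flag values rather than the kernel's actual SD_* encoding:

```c
#include <assert.h>

#define MODEL_SD_ASYM_CPUCAPACITY 0x1
#define MODEL_SD_BALANCE_WAKE     0x2

/* Sketch of the behavior described above: once a domain in the hierarchy
 * has SD_ASYM_CPUCAPACITY, SD_BALANCE_WAKE ends up set on that level and
 * every level below it. flags[0] is the lowest level (e.g. MC),
 * flags[nr_levels - 1] the highest (e.g. DIE). */
static void model_propagate_balance_wake(int *flags, int nr_levels)
{
	int i, seen_asym = 0;

	for (i = nr_levels - 1; i >= 0; i--) {
		if (flags[i] & MODEL_SD_ASYM_CPUCAPACITY)
			seen_asym = 1;
		if (seen_asym)
			flags[i] |= MODEL_SD_BALANCE_WAKE;
	}
}
```

For a typical big.LITTLE topology (asymmetry flagged at DIE level only), both DIE and MC come out with the wake-balance flag, which is why sd == NULL should not be reachable on the EAS path.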
Quentin Perret March 24, 2018, 1:34 a.m. UTC | #11
On Friday 23 Mar 2018 at 18:13:56 (-0700), Joel Fernandes wrote:
> Hi Morten,
> 
> On Fri, Mar 23, 2018 at 8:47 AM, Morten Rasmussen
> <morten.rasmussen@arm.com> wrote:
> > On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:

[...]

> > You mean if SD_BALANCE_WAKE isn't set on sched_domains?
> 
> Yes.
> 
> > The current code seems to rely on that flag to be set to work correctly.
> > Otherwise, the loop might bail out on !want_affine and we end up doing
> > the find_energy_efficient_cpu() on the lowest level sched_domain even if
> > there is higher level one which isn't over-utilized.
> >
> > However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
> > sd == NULL shouldn't be possible? This only holds as long as we only
> > want EAS for asymmetric systems.
> 
> Yes, I see you had topology code that set SD_BALANCE_WAKE for ASYM. It
> makes sense to me then, thanks for the clarification.
> 
> Still I feel it is a bit tedious/confusing when reading code to draw
> the conclusion about why sd is checked first before doing
> find_energy_efficient_cpu (and that sd will != NULL for ASYM systems).
> If energy_sd is set, then we can just proceed with EAS without
> checking that sd != NULL. This function in mainline is already pretty
> confusing as it is :-(

Right I see your point. The code is correct as is, but I agree that having
a code structured as

	if (energy_sd) {
		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
	} else if (!sd) {
		...

might be easier to understand and functionally equivalent. What do you
think?

Thanks,
Quentin
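The reordering proposed in this message can be reduced to a tiny dispatch model (enum and function names hypothetical): an energy-aware domain, if one was found during the domain walk, takes priority, and otherwise the mainline split between the fast path (sd == NULL, select_idle_sibling()) and the slow path (find_idlest_cpu()) applies unchanged.

```c
#include <assert.h>

enum wake_path { PATH_EAS, PATH_FAST, PATH_SLOW };

/* Proposed ordering: check energy_sd before sd, so the EAS path no
 * longer depends on the reader knowing that sd != NULL on asymmetric
 * systems when energy_sd is set. */
static enum wake_path pick_wake_path(int have_energy_sd, int have_sd)
{
	if (have_energy_sd)
		return PATH_EAS;
	if (!have_sd)
		return PATH_FAST;
	return PATH_SLOW;
}
```

Under the assumption discussed above (SD_BALANCE_WAKE set whenever SD_ASYM_CPUCAPACITY is, so energy_sd implies sd != NULL), this is functionally equivalent to the patch's original `if (!sd) ... else if (energy_sd) ...` structure.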
Quentin Perret March 24, 2018, 1:47 a.m. UTC | #12
On Thursday 22 Mar 2018 at 13:19:03 (-0700), Joel Fernandes wrote:
> On Thu, Mar 22, 2018 at 11:06 AM, Patrick Bellasi
> <patrick.bellasi@arm.com> wrote:

[...]

> >> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >> >                         break;
> >> >                 }
> >> >
> >> > +               /*
> >> > +                * Energy-aware task placement is performed on the highest
> >> > +                * non-overutilized domain spanning over cpu and prev_cpu.
> >> > +                */
> >> > +               if (want_energy && !sd_overutilized(tmp) &&
> >> > +                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> >>
> >> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?
> >
> > ... and this then should be covered by the previous check in
> > wake_energy(), which sets want_energy.
> 
> Right, but in a scenario which probably doesn't exist today where we
> have both SD_ASYM_CPUCAPACITY and !SD_ASYM_CPUCAPACITY domains in the
> hierarchy for which want_energy = 1, I was thinking if its more future
> proof to check it and not make assumptions...

So we can definitely have cases where SD_ASYM_CPUCAPACITY is not set at all
sd levels. Today, on mobile systems, this flag is typically set only at DIE
level for big.LITTLE platforms, and not at MC level.
We enable EAS if we find _at least_ one domain that has this flag in the
hierarchy, just to make sure we don't enable EAS for symmetric platforms.
It's just a way to check a property about the topology when EAS starts, not
really a way to actually select the sd at which we do scheduling at
runtime.

I hope that helps !

Thanks,
Quentin
Joel Fernandes March 24, 2018, 6:06 a.m. UTC | #13
On March 23, 2018 6:34:22 PM PDT, Quentin Perret <quentin.perret@arm.com> wrote:
>On Friday 23 Mar 2018 at 18:13:56 (-0700), Joel Fernandes wrote:
>> Hi Morten,
>> 
>> On Fri, Mar 23, 2018 at 8:47 AM, Morten Rasmussen
>> <morten.rasmussen@arm.com> wrote:
>> > On Thu, Mar 22, 2018 at 01:10:22PM -0700, Joel Fernandes wrote:
>
>[...]
>
>> > You mean if SD_BALANCE_WAKE isn't set on sched_domains?
>> 
>> Yes.
>> 
>> > The current code seems to rely on that flag to be set to work correctly.
>> > Otherwise, the loop might bail out on !want_affine and we end up doing
>> > the find_energy_efficient_cpu() on the lowest level sched_domain even if
>> > there is higher level one which isn't over-utilized.
>> >
>> > However, SD_BALANCE_WAKE should be set if SD_ASYM_CPUCAPACITY is set so
>> > sd == NULL shouldn't be possible? This only holds as long as we only
>> > want EAS for asymmetric systems.
>> 
>> Yes, I see you had topology code that set SD_BALANCE_WAKE for ASYM. It
>> makes sense to me then, thanks for the clarification.
>> 
>> Still I feel it is a bit tedious/confusing when reading code to draw
>> the conclusion about why sd is checked first before doing
>> find_energy_efficient_cpu (and that sd will != NULL for ASYM systems).
>> If energy_sd is set, then we can just proceed with EAS without
>> checking that sd != NULL. This function in mainline is already pretty
>> confusing as it is :-(
>
>Right I see your point. The code is correct as is, but I agree that having
>a code structured as
>
>	if (energy_sd) {
>		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
>	} else if (!sd) {
>		...
>
>might be easier to understand and functionally equivalent. What do you
>think?

Yeah definitely. Go for it.

- Joel
Joel Fernandes March 25, 2018, 12:12 a.m. UTC | #14
On Fri, Mar 23, 2018 at 6:47 PM, Quentin Perret <quentin.perret@arm.com> wrote:
> On Thursday 22 Mar 2018 at 13:19:03 (-0700), Joel Fernandes wrote:
>> On Thu, Mar 22, 2018 at 11:06 AM, Patrick Bellasi
>> <patrick.bellasi@arm.com> wrote:
>
> [...]
>
>> >> > @@ -6555,6 +6613,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> >> >                         break;
>> >> >                 }
>> >> >
>> >> > +               /*
>> >> > +                * Energy-aware task placement is performed on the highest
>> >> > +                * non-overutilized domain spanning over cpu and prev_cpu.
>> >> > +                */
>> >> > +               if (want_energy && !sd_overutilized(tmp) &&
>> >> > +                   cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
>> >>
>> >> Shouldn't you check for the SD_ASYM_CPUCAPACITY flag here for tmp level?
>> >
>> > ... and this then should be covered by the previous check in
>> > wake_energy(), which sets want_energy.
>>
>> Right, but in a scenario which probably doesn't exist today where we
>> have both SD_ASYM_CPUCAPACITY and !SD_ASYM_CPUCAPACITY domains in the
>> hierarchy for which want_energy = 1, I was thinking if its more future
>> proof to check it and not make assumptions...
>
> So we can definitely have cases where SD_ASYM_CPUCAPACITY is not set at all
> sd levels. Today, on mobile systems, this flag is typically set only at DIE
> level for big.LITTLE platforms, and not at MC level.
> We enable EAS if we find _at least_ one domain that has this flag in the
> hierarchy, just to make sure we don't enable EAS for symmetric platform.
> It's just a way to check a property about the topology when EAS starts, not
> really a way to actually select the sd at which we do scheduling at
> runtime.

Yes, OK, you're right: we do have the ASYM flag set at some sd levels but
not at others at the moment. Sorry about the hasty comment. I understand
what you're doing now, and I am OK with that.

thanks,

- Joel
Quentin Perret March 25, 2018, 1:38 a.m. UTC | #15
On Friday 23 Mar 2018 at 16:00:59 (+0000), Morten Rasmussen wrote:
> On Thu, Mar 22, 2018 at 09:27:43AM -0700, Joel Fernandes wrote:
> > Hi,
> > 
> > On Tue, Mar 20, 2018 at 2:43 AM, Dietmar Eggemann
> > <dietmar.eggemann@arm.com> wrote:

[...]

> > Is it possible that before the wakeup, the task's affinity is changed
> > so that p->cpus_allowed no longer contains prev_cpu ? In that case
> > prev_energy wouldn't matter since previous CPU is no longer an option?
> 
> It is possible to wake-up with a disallowed prev_cpu. In fact
> select_idle_sibling() may happily return a disallowed cpu in that case.
> The mistake gets fixed in select_task_rq() which uses
> select_fallback_rq() to find an allowed cpu instead.
> 
> Could we fix the issue in find_energy_efficient_cpu() by a simple test
> like below
> 
> if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> 	prev_energy = best_energy = compute_energy(p, prev_cpu);
> else
> 	prev_energy = best_energy = ULONG_MAX;

Right, that should work. I'll change this in v2.

Thanks,
Quentin
Quentin Perret March 25, 2018, 1:52 a.m. UTC | #16
On Wednesday 21 Mar 2018 at 15:35:18 (+0000), Patrick Bellasi wrote:
> On 20-Mar 09:43, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>
> 
> [...]
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 76bd46502486..65a1bead0773 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6513,6 +6513,60 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> >  	return energy;
> >  }
> > 
> > +static bool task_fits(struct task_struct *p, int cpu)
> > +{
> > +	unsigned long next_util = cpu_util_next(cpu, p, cpu);
> > +
> > +	return util_fits_capacity(next_util, capacity_orig_of(cpu));
>                                              ^^^^^^^^^^^^^^^^^^^^^
> 
> Since here we are at scheduling CFS tasks, should we not better use
> capacity_of() to account for RT/IRQ pressure ?

Yes, definitely. I'll change this in v2.

> 
> > +}
> > +
> > +static int find_energy_efficient_cpu(struct sched_domain *sd,
> > +					struct task_struct *p, int prev_cpu)
> > +{
> > +	unsigned long cur_energy, prev_energy, best_energy;
> > +	int cpu, best_cpu = prev_cpu;
> > +
> > +	if (!task_util(p))
> 
> We are still waking up a task... what if the task was previously
> running on a big CPU which is now idle?
> 
> I understand that from a _relative_ energy_diff standpoint there is
> not much to do for a 0 utilization task. However, for those tasks we
> can still try to return the most energy efficient CPU among the ones
> in their cpus_allowed mask.
> 
> It should be a relatively low overhead (maybe contained in a fallback
> most_energy_efficient_cpu() kind of function) which allows, for
> example on ARM big.LITTLE systems, to consolidate those tasks on
> LITTLE CPUs instead for example keep running them on a big CPU.

Hmmmm so the difficult thing about a task with 0 util is that you don't
know if this is really a small task, or a big task with a very long
period. The only useful thing you know for sure about the task is where
it ran last time, so I guess that makes sense to use that information
rather than make assumptions. There is no perfect solution using the
util_avg of the task.

Now, UTIL_EST is changing the game here. If we use it for task placement
(which I think is the right thing to do), this issue should be a lot
easier to solve. What do you think?

Thanks,
Quentin

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76bd46502486..65a1bead0773 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6513,6 +6513,60 @@  static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
 	return energy;
 }
 
+static bool task_fits(struct task_struct *p, int cpu)
+{
+	unsigned long next_util = cpu_util_next(cpu, p, cpu);
+
+	return util_fits_capacity(next_util, capacity_orig_of(cpu));
+}
+
+static int find_energy_efficient_cpu(struct sched_domain *sd,
+					struct task_struct *p, int prev_cpu)
+{
+	unsigned long cur_energy, prev_energy, best_energy;
+	int cpu, best_cpu = prev_cpu;
+
+	if (!task_util(p))
+		return prev_cpu;
+
+	/* Compute the energy impact of leaving the task on prev_cpu. */
+	prev_energy = best_energy = compute_energy(p, prev_cpu);
+
+	/* Look for the CPU that minimizes the energy. */
+	for_each_cpu_and(cpu, &p->cpus_allowed, sched_domain_span(sd)) {
+		if (!task_fits(p, cpu) || cpu == prev_cpu)
+			continue;
+		cur_energy = compute_energy(p, cpu);
+		if (cur_energy < best_energy) {
+			best_energy = cur_energy;
+			best_cpu = cpu;
+		}
+	}
+
+	/*
+	 * We pick the best CPU only if it saves at least 1.5% of the
+	 * energy used by prev_cpu.
+	 */
+	if ((prev_energy - best_energy) > (prev_energy >> 6))
+		return best_cpu;
+
+	return prev_cpu;
+}
+
+static inline bool wake_energy(struct task_struct *p, int prev_cpu)
+{
+	struct sched_domain *sd;
+
+	if (!static_branch_unlikely(&sched_energy_present))
+		return false;
+
+	sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
+	if (!sd || sd_overutilized(sd))
+		return false;
+
+	return true;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6529,18 +6583,22 @@  static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+	struct sched_domain *energy_sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
-	int want_affine = 0;
+	int want_affine = 0, want_energy = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	rcu_read_lock();
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
+		want_energy = wake_energy(p, prev_cpu);
 		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
-			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
+			      && cpumask_test_cpu(cpu, &p->cpus_allowed)
+			      && !want_energy;
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6555,6 +6613,14 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			break;
 		}
 
+		/*
+		 * Energy-aware task placement is performed on the highest
+		 * non-overutilized domain spanning over cpu and prev_cpu.
+		 */
+		if (want_energy && !sd_overutilized(tmp) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
+			energy_sd = tmp;
+
 		if (tmp->flags & sd_flag)
 			sd = tmp;
 		else if (!want_affine)
@@ -6586,6 +6652,8 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			if (want_affine)
 				current->recent_used_cpu = cpu;
 		}
+	} else if (energy_sd) {
+		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
 	} else {
 		new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
 	}
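The selection loop in the hunk above, including the `prev_energy >> 6` margin (1/64, i.e. roughly the 1.5% the comment mentions), can be exercised as a standalone sketch. The stub energy and fitness tables stand in for compute_energy() and task_fits(); all names and numbers here are hypothetical:

```c
#include <assert.h>

#define NR_CPUS 4

/* Stub system-level energy estimate for placing the task on each CPU,
 * standing in for compute_energy(); units are arbitrary. */
static unsigned long stub_energy[NR_CPUS];

/* Stub for task_fits(): nonzero if the CPU has spare capacity. */
static int stub_fits[NR_CPUS];

static int model_find_energy_efficient_cpu(unsigned long cpus_allowed,
					   int prev_cpu)
{
	unsigned long cur_energy, prev_energy, best_energy;
	int cpu, best_cpu = prev_cpu;

	/* Energy impact of leaving the task on prev_cpu. */
	prev_energy = best_energy = stub_energy[prev_cpu];

	/* Look for the CPU that minimizes the energy, as in the patch. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(cpus_allowed & (1UL << cpu)))
			continue;
		if (!stub_fits[cpu] || cpu == prev_cpu)
			continue;
		cur_energy = stub_energy[cpu];
		if (cur_energy < best_energy) {
			best_energy = cur_energy;
			best_cpu = cpu;
		}
	}

	/* Pick best_cpu only if it saves more than prev_energy / 64. */
	if ((prev_energy - best_energy) > (prev_energy >> 6))
		return best_cpu;

	return prev_cpu;
}
```

With prev_energy = 1000 the margin is 1000 >> 6 == 15 units, so a candidate must save at least 16 units before the task migrates away from prev_cpu; a saving of 10 keeps the task where it is.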