
[v5,09/14] sched: Add over-utilization/tipping point indicator

Message ID 20180724122521.22109-10-quentin.perret@arm.com (mailing list archive)
State Changes Requested, archived
Delegated to: Eduardo Valentin
Series Energy Aware Scheduling

Commit Message

Quentin Perret July 24, 2018, 12:25 p.m. UTC
From: Morten Rasmussen <morten.rasmussen@arm.com>

Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point, task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority-scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criterion for when we
make the switch.

The util_avg for each cpu converges towards 100% (1024) regardless of
how many additional tasks we may put on it. If we define
over-utilized as:

sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with the nice=-10 tasks sharing cpus and the nice=0
tasks getting cpus of their own, as we have 1.5*n_cpus tasks in total
and 55%+55% is less over-utilized than 55%+60% for those cpus that have
to be shared. The total system utilization is only 55%*n_cpus +
60%*n_cpus/2 = 85% of the system capacity, but we are breaking smp_nice.

To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at its
highest frequency instead:

cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.
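
As an illustration, the per-cpu check described above boils down to
something like the following (a simplified sketch; capacity_of(),
cpu_util() and the capacity_margin constant are pre-existing scheduler
helpers this builds on, and the exact code in the patch may differ
slightly):

	static inline bool cpu_overutilized(int cpu)
	{
		/* ~80% threshold: util * 1280 > capacity * 1024 */
		return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
	}

With capacity_margin = 1280, a cpu is flagged once its utilization
crosses roughly 80% of its capacity, which plays the role of the
'margin' in the formula above.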

With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does, however, mean that some scenarios that could still
benefit from energy-aware decisions while one cpu is fully utilized will
not get those benefits.

For systems where some cpus might have reduced capacity (due to
RT-pressure and/or big.LITTLE), we want periodic load-balance checks as
soon as just a single cpu is fully utilized, as it might be one of those
with reduced capacity, and in that case we want to migrate the load away
from it.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c  | 48 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  4 ++++
 2 files changed, 50 insertions(+), 2 deletions(-)
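
For context in the discussion below: the enqueue_task_fair() hunk being
debated calls update_overutilized_status(). A rough sketch of what that
helper does in this series (approximate; the exact v5 code may differ in
the flag value and helpers used):

	static inline void update_overutilized_status(struct rq *rq)
	{
		if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
			WRITE_ONCE(rq->rd->overutilized, 1);
	}

That is, the root domain is flagged over-utilized as soon as any single
rq trips the per-cpu check described in the commit message.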

Comments

Peter Zijlstra Aug. 2, 2018, 12:26 p.m. UTC | #1
On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		update_cfs_group(se);
>  	}
>  
> -	if (!se)
> +	if (!se) {
>  		add_nr_running(rq, 1);
> +		/*
> +		 * The utilization of a new task is 'wrong' so wait for it
> +		 * to build some utilization history before trying to detect
> +		 * the overutilized flag.
> +		 */
> +		if (flags & ENQUEUE_WAKEUP)
> +			update_overutilized_status(rq);
> +
> +	}
>  
>  	hrtick_update(rq);
>  }

That is a somewhat dodgy hack. There is no guarantee whatsoever that
when the task next wakes its history is any better. The comment doesn't
reflect this, I feel.
Quentin Perret Aug. 2, 2018, 1:03 p.m. UTC | #2
On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  		update_cfs_group(se);
> >  	}
> >  
> > -	if (!se)
> > +	if (!se) {
> >  		add_nr_running(rq, 1);
> > +		/*
> > +		 * The utilization of a new task is 'wrong' so wait for it
> > +		 * to build some utilization history before trying to detect
> > +		 * the overutilized flag.
> > +		 */
> > +		if (flags & ENQUEUE_WAKEUP)
> > +			update_overutilized_status(rq);
> > +
> > +	}
> >  
> >  	hrtick_update(rq);
> >  }
> 
> That is a somewhat dodgy hack. There is no guarantee what so ever that
> when the task wakes next its history is any better. The comment doesn't
> reflect this I feel.

AFAICT the main use-case here is to avoid re-enabling the load balance
and ruining all the task placement because of a tiny task. I don't
really see how we can do that differently ...

Or am I missing something Morten ?

In the meantime, I can try to improve the comment at least :-)

Thanks,
Quentin
Peter Zijlstra Aug. 2, 2018, 1:08 p.m. UTC | #3
On Thu, Aug 02, 2018 at 02:03:38PM +0100, Quentin Perret wrote:
> On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> > On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  		update_cfs_group(se);
> > >  	}
> > >  
> > > -	if (!se)
> > > +	if (!se) {
> > >  		add_nr_running(rq, 1);
> > > +		/*
> > > +		 * The utilization of a new task is 'wrong' so wait for it
> > > +		 * to build some utilization history before trying to detect
> > > +		 * the overutilized flag.
> > > +		 */
> > > +		if (flags & ENQUEUE_WAKEUP)
> > > +			update_overutilized_status(rq);
> > > +
> > > +	}
> > >  
> > >  	hrtick_update(rq);
> > >  }
> > 
> > That is a somewhat dodgy hack. There is no guarantee what so ever that
> > when the task wakes next its history is any better. The comment doesn't
> > reflect this I feel.
> 
> AFAICT the main use-case here is to avoid re-enabling the load balance
> and ruining all the task placement because of a tiny task. I don't
> really see how we can do that differently ...

Sure I realize that.. but it doesn't completely avoid it. Suppose this
new task instantly blocks and wakes up again. Then its util signal will
be exactly what you didn't want but we'll account it and cause the above
scenario you wanted to avoid.

Now, I suppose in practice it works well enough.

The alternative is trying to track when a running signal has converged,
but that's not a simple problem either I suppose.
Quentin Perret Aug. 2, 2018, 1:18 p.m. UTC | #4
On Thursday 02 Aug 2018 at 15:08:01 (+0200), Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 02:03:38PM +0100, Quentin Perret wrote:
> > On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> > > On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > > > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >  		update_cfs_group(se);
> > > >  	}
> > > >  
> > > > -	if (!se)
> > > > +	if (!se) {
> > > >  		add_nr_running(rq, 1);
> > > > +		/*
> > > > +		 * The utilization of a new task is 'wrong' so wait for it
> > > > +		 * to build some utilization history before trying to detect
> > > > +		 * the overutilized flag.
> > > > +		 */
> > > > +		if (flags & ENQUEUE_WAKEUP)
> > > > +			update_overutilized_status(rq);
> > > > +
> > > > +	}
> > > >  
> > > >  	hrtick_update(rq);
> > > >  }
> > > 
> > > That is a somewhat dodgy hack. There is no guarantee what so ever that
> > > when the task wakes next its history is any better. The comment doesn't
> > > reflect this I feel.
> > 
> > AFAICT the main use-case here is to avoid re-enabling the load balance
> > and ruining all the task placement because of a tiny task. I don't
> > really see how we can do that differently ...
> 
> Sure I realize that.. but it doesn't completely avoid it. Suppose this
> new task instantly blocks and wakes up again. Then its util signal will
> be exactly what you didn't want but we'll account it and cause the above
> scenario you wanted to avoid.

That is true. ... I also realize now that this patch was written long
before util_est, and that also has an impact here, especially in the
scenario you described where the task blocks. So any wake-up after the
first enqueue will risk over-utilizing the system, even if the task
blocked for ages.

Hmm ...
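
(For reference, the util_est interaction mentioned above: at wake-up the
scheduler looks at the max of the PELT util_avg and the utilization
estimate, and the estimate is not decayed while the task sleeps, so an
inflated initial value can survive arbitrarily long blocks. A simplified
sketch, modelled on the mainline helper of that era; treat the exact
fields and names as approximate:)

	static inline unsigned long _task_util_est(struct task_struct *p)
	{
		struct util_est ue = READ_ONCE(p->se.avg.util_est);

		/* no decay while sleeping: the last enqueued value sticks */
		return max(ue.ewma, ue.enqueued);
	}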
Vincent Guittot Aug. 2, 2018, 1:48 p.m. UTC | #5
On Thu, 2 Aug 2018 at 15:19, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 15:08:01 (+0200), Peter Zijlstra wrote:
> > On Thu, Aug 02, 2018 at 02:03:38PM +0100, Quentin Perret wrote:
> > > On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> > > > On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > > > > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > >                 update_cfs_group(se);
> > > > >         }
> > > > >
> > > > > -       if (!se)
> > > > > +       if (!se) {
> > > > >                 add_nr_running(rq, 1);
> > > > > +               /*
> > > > > +                * The utilization of a new task is 'wrong' so wait for it
> > > > > +                * to build some utilization history before trying to detect
> > > > > +                * the overutilized flag.
> > > > > +                */
> > > > > +               if (flags & ENQUEUE_WAKEUP)
> > > > > +                       update_overutilized_status(rq);
> > > > > +
> > > > > +       }
> > > > >
> > > > >         hrtick_update(rq);
> > > > >  }
> > > >
> > > > That is a somewhat dodgy hack. There is no guarantee what so ever that
> > > > when the task wakes next its history is any better. The comment doesn't
> > > > reflect this I feel.
> > >
> > > AFAICT the main use-case here is to avoid re-enabling the load balance
> > > and ruining all the task placement because of a tiny task. I don't
> > > really see how we can do that differently ...
> >
> > Sure I realize that.. but it doesn't completely avoid it. Suppose this
> > new task instantly blocks and wakes up again. Then its util signal will
> > be exactly what you didn't want but we'll account it and cause the above
> > scenario you wanted to avoid.
>
> That is true. ... I also realize now that this patch was written long
> before util_est, and that also has an impact here, especially in the
> scenario you described where the task blocks. So any wake-up after the
> first enqueue will risk to overutilize the system, even if the task
> blocked for ages.
>
> Hmm ...

Would an initial value of 0 for the util_avg of newly created tasks
help EAS in this case?
The current initial value is computed to prevent packing newly created
tasks on the same CPUs because it hurts the performance of some benches.
In fact it somehow assumes that a newly created task will use a
significant part of the remaining capacity of a CPU, and therefore wants
to spread tasks. In the EAS case, it seems preferable to assume that
newly created tasks are small, so we can pack them and wait a bit to
make sure whether the new task really is a big one that will overload
the CPU.
Quentin Perret Aug. 2, 2018, 2:14 p.m. UTC | #6
On Thursday 02 Aug 2018 at 15:48:01 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 15:19, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 15:08:01 (+0200), Peter Zijlstra wrote:
> > > On Thu, Aug 02, 2018 at 02:03:38PM +0100, Quentin Perret wrote:
> > > > On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> > > > > On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > > > > > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > > >                 update_cfs_group(se);
> > > > > >         }
> > > > > >
> > > > > > -       if (!se)
> > > > > > +       if (!se) {
> > > > > >                 add_nr_running(rq, 1);
> > > > > > +               /*
> > > > > > +                * The utilization of a new task is 'wrong' so wait for it
> > > > > > +                * to build some utilization history before trying to detect
> > > > > > +                * the overutilized flag.
> > > > > > +                */
> > > > > > +               if (flags & ENQUEUE_WAKEUP)
> > > > > > +                       update_overutilized_status(rq);
> > > > > > +
> > > > > > +       }
> > > > > >
> > > > > >         hrtick_update(rq);
> > > > > >  }
> > > > >
> > > > > That is a somewhat dodgy hack. There is no guarantee what so ever that
> > > > > when the task wakes next its history is any better. The comment doesn't
> > > > > reflect this I feel.
> > > >
> > > > AFAICT the main use-case here is to avoid re-enabling the load balance
> > > > and ruining all the task placement because of a tiny task. I don't
> > > > really see how we can do that differently ...
> > >
> > > Sure I realize that.. but it doesn't completely avoid it. Suppose this
> > > new task instantly blocks and wakes up again. Then its util signal will
> > > be exactly what you didn't want but we'll account it and cause the above
> > > scenario you wanted to avoid.
> >
> > That is true. ... I also realize now that this patch was written long
> > before util_est, and that also has an impact here, especially in the
> > scenario you described where the task blocks. So any wake-up after the
> > first enqueue will risk to overutilize the system, even if the task
> > blocked for ages.
> >
> > Hmm ...
> 
> Does a init value set to 0 for util_avg for newly created task can
> help in EAS in this case ?
> Current initial value is computed to prevent packing newly created
> tasks on same CPUs because it hurts performance of some benches. In
> fact it somehow assumes that newly created task will use significant
> part of the remaining capacity of a CPU and want to spread tasks. In
> EAS case, it seems that it prefer to assume that  newly created task
> are small and we can pack them and wait a bit to make sure the new
> task will be a big task and will overload the CPU

Good point, setting the util_avg to 0 for new tasks should help
filter out those tiny tasks too. And that would match the idea of
letting tasks build their history before looking at their util_avg ...

But there is one difference w.r.t. frequency selection. The current code
won't mark the system overutilized, but will let sugov raise the
frequency when a new task is enqueued. So in the case of a fork bomb, we
sort of fall back on the existing mainline strategy for both task
placement (because forkees don't go in find_energy_efficient_cpu()) and
frequency selection. And I would argue this is the right thing to do
since EAS can't really help in this case.

Thoughts ?

Thanks,
Quentin
Vincent Guittot Aug. 2, 2018, 3:14 p.m. UTC | #7
On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 15:48:01 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 15:19, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Thursday 02 Aug 2018 at 15:08:01 (+0200), Peter Zijlstra wrote:
> > > > On Thu, Aug 02, 2018 at 02:03:38PM +0100, Quentin Perret wrote:
> > > > > On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> > > > > > On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > > > > > > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > > > >                 update_cfs_group(se);
> > > > > > >         }
> > > > > > >
> > > > > > > -       if (!se)
> > > > > > > +       if (!se) {
> > > > > > >                 add_nr_running(rq, 1);
> > > > > > > +               /*
> > > > > > > +                * The utilization of a new task is 'wrong' so wait for it
> > > > > > > +                * to build some utilization history before trying to detect
> > > > > > > +                * the overutilized flag.
> > > > > > > +                */
> > > > > > > +               if (flags & ENQUEUE_WAKEUP)
> > > > > > > +                       update_overutilized_status(rq);
> > > > > > > +
> > > > > > > +       }
> > > > > > >
> > > > > > >         hrtick_update(rq);
> > > > > > >  }
> > > > > >
> > > > > > That is a somewhat dodgy hack. There is no guarantee what so ever that
> > > > > > when the task wakes next its history is any better. The comment doesn't
> > > > > > reflect this I feel.
> > > > >
> > > > > AFAICT the main use-case here is to avoid re-enabling the load balance
> > > > > and ruining all the task placement because of a tiny task. I don't
> > > > > really see how we can do that differently ...
> > > >
> > > > Sure I realize that.. but it doesn't completely avoid it. Suppose this
> > > > new task instantly blocks and wakes up again. Then its util signal will
> > > > be exactly what you didn't want but we'll account it and cause the above
> > > > scenario you wanted to avoid.
> > >
> > > That is true. ... I also realize now that this patch was written long
> > > before util_est, and that also has an impact here, especially in the
> > > scenario you described where the task blocks. So any wake-up after the
> > > first enqueue will risk to overutilize the system, even if the task
> > > blocked for ages.
> > >
> > > Hmm ...
> >
> > Does a init value set to 0 for util_avg for newly created task can
> > help in EAS in this case ?
> > Current initial value is computed to prevent packing newly created
> > tasks on same CPUs because it hurts performance of some benches. In
> > fact it somehow assumes that newly created task will use significant
> > part of the remaining capacity of a CPU and want to spread tasks. In
> > EAS case, it seems that it prefer to assume that  newly created task
> > are small and we can pack them and wait a bit to make sure the new
> > task will be a big task and will overload the CPU
>
> Good point, setting the util_avg to 0 for new tasks should help
> filtering out those tiny tasks too. And that would match with the idea
> of letting tasks build their history before looking at their util_avg ...
>
> But there is one difference w.r.t frequency selection. The current code
> won't mark the system overutilized, but will let sugov raise the
> frequency when a new task is enqueued. So in case of a fork bomb, we

If the initial value of util_avg is 0, there should not be any impact
on the util_avg of the cfs rq to which the task is attached, should
there? So this should impact neither the over-utilization state nor the
frequency selected by sugov, or am I missing something?
Also, select_task_rq_fair() is called for a new task while util_avg is
still 0 in the current code, so you would have a consistent util_avg for
the new task before and after calling find_energy_efficient_cpu().

> sort of fallback on the existing mainline strategy for both task
> placement (because forkees don't go in find_energy_efficient_cpu) and
> frequency selection. And I would argue this is the right thing to do
> since EAS can't really help in this case.
>
> Thoughts ?
>
> Thanks,
> Quentin
Quentin Perret Aug. 2, 2018, 3:30 p.m. UTC | #8
On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > Good point, setting the util_avg to 0 for new tasks should help
> > filtering out those tiny tasks too. And that would match with the idea
> > of letting tasks build their history before looking at their util_avg ...
> >
> > But there is one difference w.r.t frequency selection. The current code
> > won't mark the system overutilized, but will let sugov raise the
> > frequency when a new task is enqueued. So in case of a fork bomb, we
> 
> If the initial value of util_avg is 0, we should not have any impact
> on the util_avg of the cfs rq on which the task is attached, isn't it
> ? so this should not impact both the over utilization state and the
> frequency selected by sugov or I'm missing something ?

What I tried to say is that setting util_avg to 0 for new tasks will
prevent schedutil from raising the frequency in case of a fork bomb, and
I think that could be an issue. And I think this isn't an issue with the
patch as-is ...

Sorry if that wasn't clear

> Then, select_task_rq_fair is called for a new task but util_avg is
> still 0 at that time in the current code so you will have consistent
> util_avg of the new task before and after calling
> find_energy_efficient_cpu

New tasks don't go in find_energy_efficient_cpu(), because, as you said,
they have no consistent util_avg yet when select_task_rq_fair() is called
for the first time.

Thanks,
Quentin
Vincent Guittot Aug. 2, 2018, 3:55 p.m. UTC | #9
On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > Good point, setting the util_avg to 0 for new tasks should help
> > > filtering out those tiny tasks too. And that would match with the idea
> > > of letting tasks build their history before looking at their util_avg ...
> > >
> > > But there is one difference w.r.t frequency selection. The current code
> > > won't mark the system overutilized, but will let sugov raise the
> > > frequency when a new task is enqueued. So in case of a fork bomb, we
> >
> > If the initial value of util_avg is 0, we should not have any impact
> > on the util_avg of the cfs rq on which the task is attached, isn't it
> > ? so this should not impact both the over utilization state and the
> > frequency selected by sugov or I'm missing something ?
>
> What I tried to say is that setting util_avg to 0 for new tasks will
> prevent schedutil from raising the frequency in case of a fork bomb, and
> I think that could be an issue. And I think this isn't an issue with the
> patch as-is ...

OK, so you also want to deal with a fork bomb.
I'm not sure the current proposal doesn't have some problems too,
because select_task_rq_fair() will always return prev_cpu, since
util_avg and util_est are 0 at that time.

>
> Sorry if that wasn't clear
>
> > Then, select_task_rq_fair is called for a new task but util_avg is
> > still 0 at that time in the current code so you will have consistent
> > util_avg of the new task before and after calling
> > find_energy_efficient_cpu
>
> New tasks don't go in find_energy_efficient_cpu(), because, as you said,
> they have no consistent util_avg yet when select_task_rq_fair() is called
> for the first time.
>
> Thanks,
> Quentin
Quentin Perret Aug. 2, 2018, 4 p.m. UTC | #10
On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > filtering out those tiny tasks too. And that would match with the idea
> > > > of letting tasks build their history before looking at their util_avg ...
> > > >
> > > > But there is one difference w.r.t frequency selection. The current code
> > > > won't mark the system overutilized, but will let sugov raise the
> > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > >
> > > If the initial value of util_avg is 0, we should not have any impact
> > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > ? so this should not impact both the over utilization state and the
> > > frequency selected by sugov or I'm missing something ?
> >
> > What I tried to say is that setting util_avg to 0 for new tasks will
> > prevent schedutil from raising the frequency in case of a fork bomb, and
> > I think that could be an issue. And I think this isn't an issue with the
> > patch as-is ...
> 
> ok. So you also want to deal with fork bomb
> Not sure that you don't have some problem with current proposal too
> because select_task_rq_fair will always return prev_cpu because
> util_avg and util_est are 0 at that time

But find_idlest_cpu() should select a CPU using load in the case of a
forkee, no?

Thanks,
Quentin
Vincent Guittot Aug. 2, 2018, 4:07 p.m. UTC | #11
On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > of letting tasks build their history before looking at their util_avg ...
> > > > >
> > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > >
> > > > If the initial value of util_avg is 0, we should not have any impact
> > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > ? so this should not impact both the over utilization state and the
> > > > frequency selected by sugov or I'm missing something ?
> > >
> > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > I think that could be an issue. And I think this isn't an issue with the
> > > patch as-is ...
> >
> > ok. So you also want to deal with fork bomb
> > Not sure that you don't have some problem with current proposal too
> > because select_task_rq_fair will always return prev_cpu because
> > util_avg and util_est are 0 at that time
>
> But find_idlest_cpu() should select a CPU using load in case of a forkee
> no ?

So you have to wait for the next tick, which will set the overutilized
flag and disable want_energy. Until that point, all new tasks will be
put on the current cpu.

>
> Thanks,
> Quentin
Quentin Perret Aug. 2, 2018, 4:10 p.m. UTC | #12
On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > >
> > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > >
> > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > >
> > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > ? so this should not impact both the over utilization state and the
> > > > > frequency selected by sugov or I'm missing something ?
> > > >
> > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > I think that could be an issue. And I think this isn't an issue with the
> > > > patch as-is ...
> > >
> > > ok. So you also want to deal with fork bomb
> > > Not sure that you don't have some problem with current proposal too
> > > because select_task_rq_fair will always return prev_cpu because
> > > util_avg and util_est are 0 at that time
> >
> > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > no ?
> 
> So you have to wait for the next tick that will set the overutilized
> and disable the want_energy. Until this point, all new tasks will be
> put on the current cpu

want_energy should always be false for forkees, because we set it only
for SD_BALANCE_WAKE.
Vincent Guittot Aug. 2, 2018, 4:38 p.m. UTC | #13
On Thu, 2 Aug 2018 at 18:10, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > >
> > > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > > >
> > > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > > >
> > > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > > ? so this should not impact both the over utilization state and the
> > > > > > frequency selected by sugov or I'm missing something ?
> > > > >
> > > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > > I think that could be an issue. And I think this isn't an issue with the
> > > > > patch as-is ...
> > > >
> > > > ok. So you also want to deal with fork bomb
> > > > Not sure that you don't have some problem with current proposal too
> > > > because select_task_rq_fair will always return prev_cpu because
> > > > util_avg and util_est are 0 at that time
> > >
> > > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > > no ?
> >
> > So you have to wait for the next tick that will set the overutilized
> > and disable the want_energy. Until this point, all new tasks will be
> > put on the current cpu
>
> want_energy should always be false for forkees, because we set it only
> for SD_BALANCE_WAKE.

Ah yes, I forgot that point.
But doesn't this break the EAS policy? I mean, each time a new task is
created, we use load to select the best CPU.
Quentin Perret Aug. 2, 2018, 4:59 p.m. UTC | #14
On Thursday 02 Aug 2018 at 18:38:01 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 18:10, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> > > >
> > > > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > >
> > > > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > > > >
> > > > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > > > >
> > > > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > > > ? so this should not impact both the over utilization state and the
> > > > > > > frequency selected by sugov or I'm missing something ?
> > > > > >
> > > > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > > > I think that could be an issue. And I think this isn't an issue with the
> > > > > > patch as-is ...
> > > > >
> > > > > ok. So you also want to deal with fork bomb
> > > > > Not sure that you don't have some problem with current proposal too
> > > > > because select_task_rq_fair will always return prev_cpu because
> > > > > util_avg and util_est are 0 at that time
> > > >
> > > > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > > > no ?
> > >
> > > So you have to wait for the next tick that will set the overutilized
> > > and disable the want_energy. Until this point, all new tasks will be
> > > put on the current cpu
> >
> > want_energy should always be false for forkees, because we set it only
> > for SD_BALANCE_WAKE.
> 
> Ah yes I forgot that point.
> But doesn't this break the EAS policy ? I mean each time a new task is
> created, we use the load to select the best CPU

If you really keep spawning new tasks all the time, yes EAS won't help
you, but there isn't a lot we can do :/. We need to have an idea of how
big a task is for EAS, and we obviously don't know that for new tasks, so
it's hard/dangerous to make assumptions.

So the proposal here is that if you only have forkees once in a while,
then those new tasks (and those new tasks only) will be placed using load
the first time, and then they'll fall under EAS control as soon as they
have at least a little bit of history. This _should_ happen without
re-enabling load balance spuriously too often, and that _should_ prevent
it from ruining the placement of existing tasks ...

As Peter already mentioned, a better way of solving this issue would be
to try to find the moment when the utilization signal has converged to
something stable (assuming that it converges), but that, I think, isn't
straightforward at all ...

Does that make any sense ?

Thanks,
Quentin
Vincent Guittot Aug. 3, 2018, 7:48 a.m. UTC | #15
On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 18:38:01 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 18:10, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > >
> > > > > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > > > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > >
> > > > > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > > > > >
> > > > > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > > > > >
> > > > > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > > > > ? so this should not impact both the over utilization state and the
> > > > > > > > frequency selected by sugov or I'm missing something ?
> > > > > > >
> > > > > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > > > > I think that could be an issue. And I think this isn't an issue with the
> > > > > > > patch as-is ...
> > > > > >
> > > > > > ok. So you also want to deal with fork bomb
> > > > > > Not sure that you don't have some problem with current proposal too
> > > > > > because select_task_rq_fair will always return prev_cpu because
> > > > > > util_avg and util_est are 0 at that time
> > > > >
> > > > > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > > > > no ?
> > > >
> > > > So you have to wait for the next tick that will set the overutilized
> > > > and disable the want_energy. Until this point, all new tasks will be
> > > > put on the current cpu
> > >
> > > want_energy should always be false for forkees, because we set it only
> > > for SD_BALANCE_WAKE.
> >
> > Ah yes I forgot that point.
> > But doesn't this break the EAS policy ? I mean each time a new task is
> > created, we use the load to select the best CPU
>
> If you really keep spawning new tasks all the time, yes EAS won't help
> you, but there isn't a lot we can do :/. We need to have an idea of how

My point was more that this also happens for every single new task,
not only with a fork bomb.

> big a task is for EAS, and we obviously don't know that for new tasks, so
> it's hard/dangerous to make assumptions.

But by not making any assumption, new tasks are placed outside EAS
control and can easily break what EAS tries to achieve, because the
fallback looks for the idlest cpu, which is most probably a CPU that
EAS doesn't want to use.

>
> So the proposal here is that if you only have forkees once in a while,
> then those new tasks (and those new tasks only) will be placed using load
> the first time, and then they'll fall under EAS control has soon as they
> have at least a little bit of history. This _should_ happen without
> re-enabling load balance spuriously too often, and that _should_ prevent

I'm not really concerned about re-enabling load balance, but more that
the effort of packing tasks on a few cpus/clusters that EAS tries to
make can be broken by every new task.
So I wonder what is better for EAS: making sure to efficiently spread
newly created tasks in case of a fork bomb, or trying not to break EAS
task placement with every newly created task.

Vincent

> it from ruining the placement of existing tasks ...
>
> As Peter already mentioned, a better way of solving this issue would be
> to try to find the moment when the utilization signal has converged to
> something stable (assuming that it converges), but that, I think, isn't
> straightforward at all ...
>
> Does that make any sense ?
>
> Thanks,
> Quentin
Quentin Perret Aug. 3, 2018, 8:18 a.m. UTC | #16
On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> I'm not really concerned about re-enabling load balance but more that
> the effort of packing of tasks in few cpus/clusters that EAS tries to
> do can be broken for every new task.

Well, re-enabling load balance immediately would break the nice placement
that EAS found, because it would shuffle all tasks around and break the
packing strategy. Letting that sole new task go in find_idlest_cpu()
shouldn't impact the placement of existing tasks. That might have an energy
cost for that one task, yes, but it's really hard to do anything smarter
with new tasks IMO ... EAS simply can't work without a utilization value.

> So I wonder what is better for EAS : Make sure to efficiently spread
> newly created tasks in cas of fork bomb or  try to not break EAS task
> placement with every newly created tasks

That shouldn't break the placement per se, we're just making one
temporary exception for new tasks. What do you think 'the right thing'
to do is ? To just put new tasks on prev_cpu or something like that ?

That might help some use-cases I suppose, but will probably harm others ...
I'm just not too keen on making assumptions about the size of new tasks,
that's all. But I'm definitely open to ideas if there is something
smarter we can do.

Thanks,
Quentin
Vincent Guittot Aug. 3, 2018, 1:49 p.m. UTC | #17
On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > I'm not really concerned about re-enabling load balance but more that
> > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > do can be broken for every new task.
>
> Well, re-enabling load balance immediately would break the nice placement
> that EAS found, because it would shuffle all tasks around and break the
> packing strategy. Letting that sole new task go in find_idlest_cpu()

Sorry, I was not clear in my explanation. Re-enabling load balance
would be a problem, of course. I wanted to say that there is little
chance that this will re-enable load balance immediately and break EAS,
so I'm not worried by that case. I'm only concerned about new tasks
being placed outside the EAS policy.

For example, if you run on hikey960 the simple script below, which
can't really be seen as a fork bomb IMHO, you will see threads
scheduled on big cores every 0.5 seconds whereas everything should be
packed on the little cores:
for i in {0..10}; do
echo "t"$i
sleep 0.5
done

> shouldn't impact the placement of existing tasks. That might have an energy
> cost for that one task, yes, but it's really hard to do anything smarter
> with new tasks IMO ... EAS simply can't work without a utilization value.
>
> > So I wonder what is better for EAS : Make sure to efficiently spread
> > newly created tasks in cas of fork bomb or  try to not break EAS task
> > placement with every newly created tasks
>
> That shouldn't break the placement per se, we're just making one
> temporary exception for new tasks. What do you think 'the right thing'
> to do is ? To just put new tasks on prev_cpu or something like that ?

I think that EAS, which is about saving power, could be a bit more
power-friendly when it has to make assumptions about new tasks.

>
> That might help some use-cases I suppose, but will probably harm others ...
> I'm just not too keen on making assumptions about the size of new tasks,

But you are already making some assumptions by letting the default
mode, which uses load_avg, select the task for you. The comment of
the init function of load_avg states:

void init_entity_runnable_average()
{
...
/*
* Tasks are intialized with full load to be seen as heavy tasks until
* they get a chance to stabilize to their real load level.
* Group entities are intialized with zero load to reflect the fact that
* nothing has been attached to the task group yet.
*/

So it means that EAS makes the assumption that new tasks are heavy
tasks until they get a chance to stabilize.

Regards,
Vincent

> that's all. But I'm definitely open to ideas if there is something
> smarter we can do.
>
> Thanks,
> Quentin
Vincent Guittot Aug. 3, 2018, 2:21 p.m. UTC | #18
On Fri, 3 Aug 2018 at 15:49, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > > I'm not really concerned about re-enabling load balance but more that
> > > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > > do can be broken for every new task.
> >
> > Well, re-enabling load balance immediately would break the nice placement
> > that EAS found, because it would shuffle all tasks around and break the
> > packing strategy. Letting that sole new task go in find_idlest_cpu()
>
> Sorry I was not clear in my explanation. Re enabling load balance
> would be a problem of course. I wanted to say that there is few chance
> that this will re-enable the load balance immediately and break EAS
> and I'm not worried by this case. But i'm only concerned by the new
> task being put outside EAS policy.
>
> For example, if you run on hikey960 the simple script below, which
> can't really be seen as a fork bomb IMHO, you will see threads
> scheduled on big cores every 0.5 seconds whereas everything should be
> packed on little core  :
> for i in {0..10}; do
> echo "t"$i
> sleep 0.5
> done
>
> > shouldn't impact the placement of existing tasks. That might have an energy
> > cost for that one task, yes, but it's really hard to do anything smarter
> > with new tasks IMO ... EAS simply can't work without a utilization value.
> >
> > > So I wonder what is better for EAS : Make sure to efficiently spread
> > > newly created tasks in cas of fork bomb or  try to not break EAS task
> > > placement with every newly created tasks
> >
> > That shouldn't break the placement per se, we're just making one
> > temporary exception for new tasks. What do you think 'the right thing'
> > to do is ? To just put new tasks on prev_cpu or something like that ?
>
> I think that EAS, which is about saving power, could be a bit power
> friendly when it has to make some assumptions about new task.
>
> >
> > That might help some use-cases I suppose, but will probably harm others ...
> > I'm just not too keen on making assumptions about the size of new tasks,
>
> But you are already doing some assumptions by letting the default
> mode, which use load_avg, selecting the task for you. The comment of

s/selecting the task/selecting the cpu/

> the init function of load_avg states:
>
> void init_entity_runnable_average()
> {
> ...
> /*
> * Tasks are intialized with full load to be seen as heavy tasks until
> * they get a chance to stabilize to their real load level.
> * Group entities are intialized with zero load to reflect the fact that
> * nothing has been attached to the task group yet.
> */
>
> So it means that EAS makes the assumption that new task are heavy
> tasks until they get a chance to stabilize
>
> Regards,
> Vincent
>
> > that's all. But I'm definitely open to ideas if there is something
> > smarter we can do.
> >
> > Thanks,
> > Quentin
Quentin Perret Aug. 3, 2018, 3:55 p.m. UTC | #19
On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > > I'm not really concerned about re-enabling load balance but more that
> > > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > > do can be broken for every new task.
> >
> > Well, re-enabling load balance immediately would break the nice placement
> > that EAS found, because it would shuffle all tasks around and break the
> > packing strategy. Letting that sole new task go in find_idlest_cpu()
> 
> Sorry I was not clear in my explanation. Re enabling load balance
> would be a problem of course. I wanted to say that there is few chance
> that this will re-enable the load balance immediately and break EAS
> and I'm not worried by this case. But i'm only concerned by the new
> task being put outside EAS policy.
> 
> For example, if you run on hikey960 the simple script below, which
> can't really be seen as a fork bomb IMHO, you will see threads
> scheduled on big cores every 0.5 seconds whereas everything should be
> packed on little core

I guess that also depends on what's running on the little cores, but I
see your point.

I think we're discussing two different things right now:
    1. Should forkees go in find_energy_efficient_cpu() ?
    2. Should forkees have an initial util_avg of 0 when EAS is enabled ?

For 1, that would mean all forkees go on prev_cpu. I can see how that
can be more energy-efficient in some use-cases (the one you described
for example), but that also has drawbacks. Placing the task on a big
CPU can have an energy cost, but that should also help the task build
its utilization faster, which is what we want in order to make smart decisions
with EAS. Also, it isn't always true that going on the little CPUs is
more energy efficient, only the Energy Model can tell. There is just no
perfect solution, so I'm still not fully decided on that one ...

For 2, I'm a little bit more reluctant, because that has more
implications ... That could probably harm some fairly standard use
cases (a simple app-launch for example). Enqueueing something new on a
CPU would go unnoticed, which might be fine for a very small task, but
probably a major issue if the task is actually big. I'd be more
comfortable with 2 only if we also speed up the PELT half-life, TBH ...

Is there a 3 that I missed ?

Thanks,
Quentin
Vincent Guittot Aug. 6, 2018, 8:40 a.m. UTC | #20
On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> > On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > I'm not really concerned about re-enabling load balance but more that
> > > > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > > > do can be broken for every new task.
> > >
> > > Well, re-enabling load balance immediately would break the nice placement
> > > that EAS found, because it would shuffle all tasks around and break the
> > > packing strategy. Letting that sole new task go in find_idlest_cpu()
> >
> > Sorry I was not clear in my explanation. Re enabling load balance
> > would be a problem of course. I wanted to say that there is few chance
> > that this will re-enable the load balance immediately and break EAS
> > and I'm not worried by this case. But i'm only concerned by the new
> > task being put outside EAS policy.
> >
> > For example, if you run on hikey960 the simple script below, which
> > can't really be seen as a fork bomb IMHO, you will see threads
> > scheduled on big cores every 0.5 seconds whereas everything should be
> > packed on little core
>
> I guess that also depends on what's running on the little cores, but I
> see your point.

In my case the system was idle and nothing other than this script was running.

>
> I think we're discussing two different things right now:
>     1. Should forkees go in find_energy_efficient_cpu() ?
>     2. Should forkees have 0 of initial util_avg when EAS is enabled ?

It's the same topic: how should EAS consider a newly created task?

For now, we let the "performance" mode select a CPU. This CPU will
most probably be the worst CPU from an EAS pov, because it's the idlest
CPU in the idlest group, which is the opposite of what EAS tries to do.

The current behavior is:
For every new task, the cpu selection is done assuming it's a heavy
task with the maximum possible load_avg, and it looks for the idlest cpu.
This means that if the system is lightly loaded, the scheduler will most
probably select an idle big core.
The utilization of this new task is then set to half of the remaining
capacity of the selected CPU, which means that the idler the CPU, the
bigger the task's initial utilization. This can easily be half of a big
core, which can be bigger than the max capacity of a little core like on
hikey960. Then util_est will keep track of this value for a while,
which will make this task look like a big one.
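
(To make the above concrete, a simplified sketch of that initialization,
modelled on mainline post_init_entity_util_avg() of that era; the real
code also seeds util_est and handles a few corner cases, so treat this
as approximate:)

	void post_init_entity_util_avg(struct sched_entity *se)
	{
		struct cfs_rq *cfs_rq = cfs_rq_of(se);
		struct sched_avg *sa = &se->avg;
		/* half of the capacity not already used by the cfs rq */
		long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;

		if (cap > 0) {
			if (cfs_rq->avg.util_avg != 0) {
				/* scale by the weight share of the new entity */
				sa->util_avg  = cfs_rq->avg.util_avg * se->load.weight;
				sa->util_avg /= (cfs_rq->avg.load_avg + 1);
				if (sa->util_avg > cap)
					sa->util_avg = cap;
			} else {
				sa->util_avg = cap;
			}
		}
	}

So the idler the target cpu, the larger the initial util_avg the forkee
is given, which is the behaviour being discussed here.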

>
> For 1, that would mean all forkees go on prev_cpu. I can see how that
> can be more energy-efficient in some use-cases (the one you described
> for example), but that also has drawbacks. Placing the task on a big
> CPU can have an energy cost, but that should also help the task build
> it's utilization faster, which is what we want to make smart decisions

With the current behavior, little tasks are seen as big for a long
time, which does not really help the task build its utilization faster
IMHO.

> with EAS. Also, it isn't always true that going on the little CPUs is
> more energy efficient, only the Energy Model can tell. There is just no

Selecting big or little is not the problem here. The problem is that
we don't use the Energy Model, so we will most probably make the wrong
choice. Nevertheless, putting a task on a big core is clearly the wrong
choice in the case I mentioned above ("shell script on hikey960").

> perfect solution, so I'm still not fully decided on that one ...
>
> For 2, I'm a little bit more reluctant, because that has more
> implications ... That could probably harm some fairly standard use
> cases (an simple app-launch for example). Enqueueing something new on a
> CPU would go unnoticed, which might be fine for a very small task, but
> probably a major issue if the task is actually big. I'd be more
> comfortable with 2 only if we also speed-up the PELT half-life TBH ...
>
> Is there a 3 that I missed ?

Having something in the middle, like taking into account the load and/or
utilization of the parent, in order to mitigate big tasks starting with
a small utilization and small tasks starting with a big utilization.
It's probably not perfect, because big tasks can create small ones and
the opposite, but if big tasks are already running, assuming that the
new one is also a big one should have less power impact, as we are
already consuming power for the current bigs. Conversely, if only little
tasks are running, assuming that the new task is little will not harm
the power consumption unnecessarily.

My main concern is that by making no choice, you clearly make the most
power-consuming choice, which is a bit awkward for a policy that wants
to minimize power consumption.

Regards,
Vincent
>
> Thanks,
> Quentin
Quentin Perret Aug. 6, 2018, 9:43 a.m. UTC | #21
On Monday 06 Aug 2018 at 10:40:46 (+0200), Vincent Guittot wrote:
> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> For every new task, the cpu selection is done assuming it's a heavy
> task with the max possible load_avg,  and it looks for the idlest cpu.
> This means that if the system is lightly loaded, scheduler will select
> most probably a idle big core.

Agreed, that is what should happen if the system is lightly loaded.
However, I'm still not totally convinced this is wrong. It's
definitely not _always_ wrong, at least. Just like starting new tasks
on little CPUs isn't always wrong either.

> selecting big or Little is not the problem here. The problem is that
> we don't use Energy Model so we will most probably do the wrong
> choice. Nevertheless,  putting a task on big is clearly the wrong
> choice  in the case I mentioned above " shell script on hikey960".

_You_ can say it's wrong because _you_ know the task composition. The
scheduler has no way to tell. You could come up with a script that
spawns heavy tasks every once in a while, and in this case putting
those on big cores would be beneficial ...

> Having something in the middle like taking into account load and/org
> utilization of the parent in order to mitigate big task starting with
> small utilization and small task starting with big utilization.
> It's probably not perfect because big tasks can create small ones and
> the opposite but if there are already big tasks, assuming that the new
> one is also a big one should have less power impact as we are already
> consuming power for the current bigs. At the opposite, if little are
> running, assuming that new task is little will not harm the power
> consumption unnecessarily.

Right, we can definitely come up with something more conservative than
what I'm currently proposing. I had a quick chat with Morten about this
the other day and one suggestion he had was to pick the CPU with the max
spare cap in the frequency domain in which the parent task is running ...

In any case, I really feel like there isn't an obvious right decision
here, so I'd prefer to keep things simple for now. This patch-set is a
first step, and fine-grained tuning for new tasks is probably something
that can be done later, if need be. What do you think ?

Thanks,
Quentin
Dietmar Eggemann Aug. 6, 2018, 10:08 a.m. UTC | #22
On 08/06/2018 10:40 AM, Vincent Guittot wrote:
> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
>>
>> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
>>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>
>>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
>>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:

[...]

>> I think we're discussing two different things right now:
>>      1. Should forkees go in find_energy_efficient_cpu() ?
>>      2. Should forkees have 0 of initial util_avg when EAS is enabled ?
> 
> It's the same topic: How EAS should consider a newly created task ?
> 
> For now, we let the "performance" mode selects a CPU. This CPU will
> most probably be worst CPU from a EAS pov because it's the idlest CPU
> in the idlest group which is the opposite of what EAS tries to do
> 
> The current behavior is :
> For every new task, the cpu selection is done assuming it's a heavy
> task with the max possible load_avg,  and it looks for the idlest cpu.
> This means that if the system is lightly loaded, scheduler will select
> most probably a idle big core.

AFAICS, task load doesn't seem to be used for find_idlest_cpu()
(find_idlest_group() and find_idlest_group_cpu()). So the forkee
(SD_BALANCE_FORK) is placed independently of its task load.
Task load (task_h_load(p)) is used in
wake_affine()->wake_affine_weight(), but for this to be called it has
to be a wakeup (SD_BALANCE_WAKE).

> The utilization of this new task is then set to half of the remaining
> capacity of the selected CPU which means that the idlest you are, the
> biggest the task will be initialized to. This can easily be half a big
> core which can be bigger than the max capacity of a little like on
> hikey960. Then, util_est  will keep track of this value for a while
> which will make this task like a big one.
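
As a rough standalone model of the rule described in that quote (the
real logic lives in post_init_entity_util_avg() and also looks at the
cfs_rq's existing util and load, which is glossed over here):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

/*
 * Approximation only: a new task's util_avg starts at half of the
 * utilization headroom left on the CPU it was placed on. An idle big
 * CPU therefore hands out ~512, which can exceed a little CPU's
 * entire capacity (e.g. ~462 on hikey960).
 */
static unsigned long initial_util(unsigned long cpu_util_now)
{
	if (cpu_util_now >= SCHED_CAPACITY_SCALE)
		return 0;

	return (SCHED_CAPACITY_SCALE - cpu_util_now) / 2;
}

int main(void)
{
	printf("idle CPU      -> init util %lu\n", initial_util(0));
	printf("half-busy CPU -> init util %lu\n", initial_util(512));
	printf("busy CPU      -> init util %lu\n", initial_util(900));
	return 0;
}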

[...]
Vincent Guittot Aug. 6, 2018, 10:33 a.m. UTC | #23
On Mon, 6 Aug 2018 at 12:08, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/06/2018 10:40 AM, Vincent Guittot wrote:
> > On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> >>
> >> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> >>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >>>>
> >>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> >>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
>
> [...]
>
> >> I think we're discussing two different things right now:
> >>      1. Should forkees go in find_energy_efficient_cpu() ?
> >>      2. Should forkees have 0 of initial util_avg when EAS is enabled ?
> >
> > It's the same topic: How EAS should consider a newly created task ?
> >
> > For now, we let the "performance" mode selects a CPU. This CPU will
> > most probably be worst CPU from a EAS pov because it's the idlest CPU
> > in the idlest group which is the opposite of what EAS tries to do
> >
> > The current behavior is :
> > For every new task, the cpu selection is done assuming it's a heavy
> > task with the max possible load_avg,  and it looks for the idlest cpu.
> > This means that if the system is lightly loaded, scheduler will select
> > most probably a idle big core.
>
> AFAICS, task load doesn't seem to be used for find_idlest_cpu() (
> find_idlest_group() and find_idlest_group_cpu()). So the forkee
> (SD_BALANCE_FORK) is placed independently of his task load.

Hmm ... so what is used if load or runnable load are not used?
find_idlest_group() uses load and runnable load but skips spare
capacity in the fork case.

> Task load (task_h_load(p)) is used in
> wake_affine()->wake_affine_weight() but for this to be called it has to
> be a wakeup (SD_BALANCE_WAKE).
>
> > The utilization of this new task is then set to half of the remaining
> > capacity of the selected CPU which means that the idlest you are, the
> > biggest the task will be initialized to. This can easily be half a big
> > core which can be bigger than the max capacity of a little like on
> > hikey960. Then, util_est  will keep track of this value for a while
> > which will make this task like a big one.
>
> [...]
>
Vincent Guittot Aug. 6, 2018, 10:45 a.m. UTC | #24
On Mon, 6 Aug 2018 at 11:43, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Monday 06 Aug 2018 at 10:40:46 (+0200), Vincent Guittot wrote:
> > On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> > For every new task, the cpu selection is done assuming it's a heavy
> > task with the max possible load_avg,  and it looks for the idlest cpu.
> > This means that if the system is lightly loaded, scheduler will select
> > most probably a idle big core.
>
> Agreed, that is what should happen if the system is lightly loaded.
> However, I'm still not totally convinced this is wrong. It's
> definitely not _always_ wrong, at least. Just like starting new tasks
> on little CPUs isn't always wrong either.

As explained before, IMHO, this is not wrong if you are looking for
performance, but it is wrong if you are looking for power saving.

>
> > selecting big or Little is not the problem here. The problem is that
> > we don't use Energy Model so we will most probably do the wrong
> > choice. Nevertheless,  putting a task on big is clearly the wrong
> > choice  in the case I mentioned above " shell script on hikey960".
>
> _You_ can say it's wrong because _you_ know the task composition. The
> scheduler has no way to tell. You could come up with a script that
> spawns heavy tasks every once in a while, and in this case putting
> those on big cores would be beneficial ...
>
> > Having something in the middle like taking into account load and/org
> > utilization of the parent in order to mitigate big task starting with
> > small utilization and small task starting with big utilization.
> > It's probably not perfect because big tasks can create small ones and
> > the opposite but if there are already big tasks, assuming that the new
> > one is also a big one should have less power impact as we are already
> > consuming power for the current bigs. At the opposite, if little are
> > running, assuming that new task is little will not harm the power
> > consumption unnecessarily.
>
> Right, we can definitely come up with something more conservative than
> what I'm currently proposing. I had a quick chat with Morten about this
> the other day and one suggestion he had was to pick the CPU with the max
> spare cap in the frequency domain in which the parent task is running ...
>
> In any case, I really feel like there isn't an obvious right decision
> here, so I'd prefer to keep things simple for now. This patch-set is a
> first step, and fine-grained tuning for new tasks is probably something
> that can be done later, if need be. What do you think ?

I would have preferred a full power policy for all tasks when EAS is
in use by default, and then see if there is any performance problem,
instead of leaving some use cases unclear, but that's a personal
opinion.
So IMO, the minimum is to add a comment in the code that describes
this behavior for forked tasks, so people will understand why EAS
puts newly created tasks on non-"EAS friendly" CPUs when they look at
the code trying to understand the behavior.

>
> Thanks,
> Quentin
Quentin Perret Aug. 6, 2018, 11:02 a.m. UTC | #25
On Monday 06 Aug 2018 at 12:45:44 (+0200), Vincent Guittot wrote:
> On Mon, 6 Aug 2018 at 11:43, Quentin Perret <quentin.perret@arm.com> wrote:
> I would have preferred to have a full power policy for all task when
> EAS is in used by default and then see if there is any performance
> problem instead of letting some UC unclear but that's a personal
> opinion.

Understood. I'd say let's keep things simple for now unless there is a
consensus that this is a must-have from the start.

> so IMO, the minimum is to add a comment in the code that describes
> this behavior for fork tasks so people will understand why EAS puts
> newly created task on not "EAS friendly" cpus when they will look at
> the code trying to understand the behavior

Agreed, that really needs to be documented. I'll add a comment somewhere
in v6.

Thanks,
Quentin
Dietmar Eggemann Aug. 6, 2018, 12:29 p.m. UTC | #26
On 08/06/2018 12:33 PM, Vincent Guittot wrote:
> On Mon, 6 Aug 2018 at 12:08, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 08/06/2018 10:40 AM, Vincent Guittot wrote:
>>> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>
>>>> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
>>>>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>>>
>>>>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
>>>>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
>>
>> [...]
>>
>>>> I think we're discussing two different things right now:
>>>>       1. Should forkees go in find_energy_efficient_cpu() ?
>>>>       2. Should forkees have 0 of initial util_avg when EAS is enabled ?
>>>
>>> It's the same topic: How EAS should consider a newly created task ?
>>>
>>> For now, we let the "performance" mode selects a CPU. This CPU will
>>> most probably be worst CPU from a EAS pov because it's the idlest CPU
>>> in the idlest group which is the opposite of what EAS tries to do
>>>
>>> The current behavior is :
>>> For every new task, the cpu selection is done assuming it's a heavy
>>> task with the max possible load_avg,  and it looks for the idlest cpu.
>>> This means that if the system is lightly loaded, scheduler will select
>>> most probably a idle big core.
>>
>> AFAICS, task load doesn't seem to be used for find_idlest_cpu() (
>> find_idlest_group() and find_idlest_group_cpu()). So the forkee
>> (SD_BALANCE_FORK) is placed independently of his task load.
> 
> hmm ... so what is used if load or runnable load are not used ?
> find_idlest_group() uses load and runnable load but skip spare
> capacity in case of fork

Yes, runnable load and load are used, but from the cpus, not from the task.
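
In other words, fork-time placement compares only CPU-side numbers; a
standalone sketch of that idea (load values invented, group boundaries
and the real find_idlest_group() heuristics elided):

#include <stdio.h>

struct cpu_sketch {
	int id;
	unsigned long runnable_load;	/* sum of runnable load on this CPU */
};

/*
 * The new task contributes nothing to the decision: only per-CPU
 * load is compared, so a lightly loaded system steers the forkee
 * to whatever CPU happens to be idlest.
 */
static int pick_idlest_cpu(const struct cpu_sketch *cpus, int nr)
{
	int best = 0, i;

	for (i = 1; i < nr; i++)
		if (cpus[i].runnable_load < cpus[best].runnable_load)
			best = i;
	return cpus[best].id;
}

int main(void)
{
	struct cpu_sketch cpus[] = {
		{ .id = 0, .runnable_load = 350 },
		{ .id = 1, .runnable_load =  20 },	/* idlest, e.g. an idle big core */
		{ .id = 2, .runnable_load = 120 },
		{ .id = 3, .runnable_load = 500 },
	};

	printf("forkee goes to CPU%d\n", pick_idlest_cpu(cpus, 4));
	return 0;
}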

[...]
Vincent Guittot Aug. 6, 2018, 12:37 p.m. UTC | #27
On Mon, 6 Aug 2018 at 14:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/06/2018 12:33 PM, Vincent Guittot wrote:
> > On Mon, 6 Aug 2018 at 12:08, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 08/06/2018 10:40 AM, Vincent Guittot wrote:
> >>> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> >>>>
> >>>> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> >>>>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >>>>>>
> >>>>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> >>>>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> >>
> >> [...]
> >>
> >>>> I think we're discussing two different things right now:
> >>>>       1. Should forkees go in find_energy_efficient_cpu() ?
> >>>>       2. Should forkees have 0 of initial util_avg when EAS is enabled ?
> >>>
> >>> It's the same topic: How EAS should consider a newly created task ?
> >>>
> >>> For now, we let the "performance" mode selects a CPU. This CPU will
> >>> most probably be worst CPU from a EAS pov because it's the idlest CPU
> >>> in the idlest group which is the opposite of what EAS tries to do
> >>>
> >>> The current behavior is :
> >>> For every new task, the cpu selection is done assuming it's a heavy
> >>> task with the max possible load_avg,  and it looks for the idlest cpu.
> >>> This means that if the system is lightly loaded, scheduler will select
> >>> most probably a idle big core.
> >>
> >> AFAICS, task load doesn't seem to be used for find_idlest_cpu() (
> >> find_idlest_group() and find_idlest_group_cpu()). So the forkee
> >> (SD_BALANCE_FORK) is placed independently of his task load.
> >
> > hmm ... so what is used if load or runnable load are not used ?
> > find_idlest_group() uses load and runnable load but skip spare
> > capacity in case of fork
>
> Yes, runnable load and load are used, but from the cpus, not from the task.

Yes, that's right, I skipped the "task" word when reading.
So the scheduler looks for the idlest CPU taking into account only CPU
loads. Then the task's load starts at the highest value until it gets
a chance to decay and stabilize to its final value.
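
A toy model of that decay, with made-up constants (y ~= 0.97857
approximates PELT's ~32ms half-life with 1ms steps; the real kernel
arithmetic differs in detail):

#include <stdio.h>

int main(void)
{
	double load_avg = 1024.0;	/* worst-case initial seed */
	double true_load = 100.0;	/* the task's actual steady-state load */
	const double y = 0.97857;	/* per-ms decay factor */
	int ms;

	printf("t=  0ms load_avg=%.0f\n", load_avg);
	for (ms = 1; ms <= 128; ms++) {
		/* Geometric blend toward the task's true contribution. */
		load_avg = load_avg * y + true_load * (1.0 - y);
		if (ms % 32 == 0)
			printf("t=%3dms load_avg=%.0f\n", ms, load_avg);
	}
	return 0;
}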

>
> [...]
Dietmar Eggemann Aug. 6, 2018, 1:20 p.m. UTC | #28
On 08/06/2018 02:37 PM, Vincent Guittot wrote:
> On Mon, 6 Aug 2018 at 14:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 08/06/2018 12:33 PM, Vincent Guittot wrote:
>>> On Mon, 6 Aug 2018 at 12:08, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 08/06/2018 10:40 AM, Vincent Guittot wrote:
>>>>> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>>>
>>>>>> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
>>>>>>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>>>>>
>>>>>>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
>>>>>>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>
>>>> [...]
>>>>
>>>>>> I think we're discussing two different things right now:
>>>>>>        1. Should forkees go in find_energy_efficient_cpu() ?
>>>>>>        2. Should forkees have 0 of initial util_avg when EAS is enabled ?
>>>>>
>>>>> It's the same topic: How EAS should consider a newly created task ?
>>>>>
>>>>> For now, we let the "performance" mode selects a CPU. This CPU will
>>>>> most probably be worst CPU from a EAS pov because it's the idlest CPU
>>>>> in the idlest group which is the opposite of what EAS tries to do
>>>>>
>>>>> The current behavior is :
>>>>> For every new task, the cpu selection is done assuming it's a heavy
>>>>> task with the max possible load_avg,  and it looks for the idlest cpu.
>>>>> This means that if the system is lightly loaded, scheduler will select
>>>>> most probably a idle big core.
>>>>
>>>> AFAICS, task load doesn't seem to be used for find_idlest_cpu() (
>>>> find_idlest_group() and find_idlest_group_cpu()). So the forkee
>>>> (SD_BALANCE_FORK) is placed independently of his task load.
>>>
>>> hmm ... so what is used if load or runnable load are not used ?
>>> find_idlest_group() uses load and runnable load but skip spare
>>> capacity in case of fork
>>
>> Yes, runnable load and load are used, but from the cpus, not from the task.
> 
> yes that's right, I have skipped the "task" word when reading.
> So scheduler looks for the idlest CPU taking into account only CPU
> loads. Then the task load starts to highest value until it get a
> chance to reduce and stabilize to its final value

This could potentially allow us to find a better init value for 
sa->[runnable]_load_avg. At least we could use the information from
the initial task rq.

> 
>>
>> [...]
Vincent Guittot Aug. 9, 2018, 9:30 a.m. UTC | #29
On Tue, 24 Jul 2018 at 14:26, Quentin Perret <quentin.perret@arm.com> wrote:
>
> From: Morten Rasmussen <morten.rasmussen@arm.com>
>
> Energy-aware scheduling is only meant to be active while the system is
> _not_ over-utilized. That is, there are spare cycles available to shift
> tasks around based on their actual utilization to get a more
> energy-efficient task distribution without depriving any tasks. When
> above the tipping point task placement is done the traditional way based
> on load_avg, spreading the tasks across as many cpus as possible based
> on priority scaled load to preserve smp_nice. Below the tipping point we
> want to use util_avg instead. We need to define a criteria for when we
> make the switch.
>
> The util_avg for each cpu converges towards 100% (1024) regardless of

remove the "(1024)" because util_avg converges to the max CPU
capacity, which can be different from 1024
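
That distinction is what the per-CPU check in the patch below relies
on: with capacity_margin at its usual value of 1280 (roughly an 80%
threshold), a CPU is flagged relative to its own capacity, not to
1024. A quick standalone illustration with example capacities:

#include <stdio.h>
#include <stdbool.h>

static const unsigned long capacity_margin = 1280;	/* ~20%, as in fair.c */

/* Same shape as the cpu_overutilized() check in the patch below. */
static bool cpu_overutilized(unsigned long capacity, unsigned long util)
{
	return capacity * 1024 < util * capacity_margin;
}

int main(void)
{
	/* Example: a little CPU of capacity ~462 and a big CPU of 1024. */
	printf("little, util 400: %d\n", cpu_overutilized(462, 400));	/* over */
	printf("little, util 350: %d\n", cpu_overutilized(462, 350));	/* not */
	printf("big,    util 400: %d\n", cpu_overutilized(1024, 400));	/* not */
	printf("big,    util 850: %d\n", cpu_overutilized(1024, 850));	/* over */
	return 0;
}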

> how many task additional task we may put on it. If we define
> over-utilized as:
>
> sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)
>
> some individual cpus may be over-utilized running multiple tasks even
> when the above condition is false. That should be okay as long as we try
> to spread the tasks out to avoid per-cpu over-utilization as much as
> possible and if all tasks have the _same_ priority. If the latter isn't
> true, we have to consider priority to preserve smp_nice.
>
Quentin Perret Aug. 9, 2018, 9:38 a.m. UTC | #30
On Thursday 09 Aug 2018 at 11:30:57 (+0200), Vincent Guittot wrote:
> On Tue, 24 Jul 2018 at 14:26, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > From: Morten Rasmussen <morten.rasmussen@arm.com>
> >
> > Energy-aware scheduling is only meant to be active while the system is
> > _not_ over-utilized. That is, there are spare cycles available to shift
> > tasks around based on their actual utilization to get a more
> > energy-efficient task distribution without depriving any tasks. When
> > above the tipping point task placement is done the traditional way based
> > on load_avg, spreading the tasks across as many cpus as possible based
> > on priority scaled load to preserve smp_nice. Below the tipping point we
> > want to use util_avg instead. We need to define a criteria for when we
> > make the switch.
> >
> > The util_avg for each cpu converges towards 100% (1024) regardless of
> 
> remove the "(1024)" because util_avg converges to max cpu capacity
> which can be different from 1024

Good point, will be fixed in v6.

Thanks,
Quentin
diff mbox series

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcc97fa3be94..4aaa9132e840 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5043,6 +5043,24 @@  static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+#ifdef CONFIG_SMP
+static inline unsigned long cpu_util(int cpu);
+static unsigned long capacity_of(int cpu);
+
+static inline bool cpu_overutilized(int cpu)
+{
+	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) { }
+#endif
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5100,8 +5118,17 @@  enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_group(se);
 	}
 
-	if (!se)
+	if (!se) {
 		add_nr_running(rq, 1);
+		/*
+		 * The utilization of a new task is 'wrong' so wait for it
+		 * to build some utilization history before trying to detect
+		 * the overutilized flag.
+		 */
+		if (flags & ENQUEUE_WAKEUP)
+			update_overutilized_status(rq);
+
+	}
 
 	hrtick_update(rq);
 }
@@ -7837,6 +7864,9 @@  static inline void update_sg_lb_stats(struct lb_env *env,
 		if (nr_running > 1)
 			*sg_status |= SG_OVERLOAD;
 
+		if (cpu_overutilized(i))
+			*sg_status |= SG_OVERUTILIZED;
+
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
 		sgs->nr_preferred_running += rq->nr_preferred_running;
@@ -8046,8 +8076,15 @@  static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 
 	if (!env->sd->parent) {
+		struct root_domain *rd = env->dst_rq->rd;
+
 		/* update overload indicator if we are at root domain */
-		env->dst_rq->rd->overload = sg_status & SG_OVERLOAD;
+		rd->overload = sg_status & SG_OVERLOAD;
+
+		/* Update over-utilization (tipping point, U >= 0) indicator */
+		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
+	} else if (sg_status & SG_OVERUTILIZED) {
+		WRITE_ONCE(env->dst_rq->rd->overutilized, SG_OVERUTILIZED);
 	}
 }
 
@@ -8267,6 +8304,11 @@  static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * this level.
 	 */
 	update_sd_lb_stats(env, &sds);
+
+	if (rd_freq_domain(env->dst_rq->rd) &&
+				!READ_ONCE(env->dst_rq->rd->overutilized))
+		goto out_balanced;
+
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
@@ -9624,6 +9666,8 @@  static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
+
+	update_overutilized_status(task_rq(curr));
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 99b5bf245eab..6d08ccd1e7a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -709,6 +709,7 @@  struct freq_domain {
 
 /* Scheduling group status flags */
 #define SG_OVERLOAD		0x1 /* More than one runnable task on a CPU. */
+#define SG_OVERUTILIZED		0x2 /* One or more CPUs are over-utilized. */
 
 /*
  * We add the notion of a root-domain which will be used to define per-domain
@@ -728,6 +729,9 @@  struct root_domain {
 	/* Indicate more than one runnable task for any CPU */
 	int			overload;
 
+	/* Indicate one or more cpus over-utilized (tipping point) */
+	int			overutilized;
+
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).