
[v4] sched/freq: move call to cpufreq_update_util

Message ID 1573751251-3505-1-git-send-email-vincent.guittot@linaro.org (mailing list archive)
State Not Applicable, archived
Headers show
Series [v4] sched/freq: move call to cpufreq_update_util

Commit Message

Vincent Guittot Nov. 14, 2019, 5:07 p.m. UTC
update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
which can be inefficient when the cpufreq driver has a rate limit.

When a task is attached to a CPU, we have the following call path:

update_load_avg()
  update_cfs_rq_load_avg()
    cfs_rq_util_change() --> triggers a frequency update
  attach_entity_load_avg()
    cfs_rq_util_change() --> triggers a frequency update

The 1st frequency update does not take into account the utilization of the
newly attached task, and the 2nd one might be discarded because of the
rate limit of the cpufreq driver.
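
For illustration, a simplified sketch of the pre-patch attach path (not the
verbatim kernel code):

static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
        u64 now = cfs_rq_clock_pelt(cfs_rq);

        /*
         * 1st update: decays the cfs_rq signal, still without the new task;
         * calls cfs_rq_util_change() internally if it decayed.
         */
        update_cfs_rq_load_avg(now, cfs_rq);

        if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
                /* 2nd update: the task's utilization is added here ... */
                attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
                /* ... and attach_entity_load_avg() calls cfs_rq_util_change() again */
        }
}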

update_cfs_rq_load_avg() is only called by update_blocked_averages()
and update_load_avg(), so we can move the call to
cfs_rq_util_change()/cpufreq_update_util() into these 2 functions. It is
also worth noting that update_load_avg() already calls cfs_rq_util_change()
directly in the !SMP case.

This change also ensures that cpufreq_update_util() is called even when
there is no CFS rq left in the leaf_cfs_rq_list to update but only irq,
rt or dl PELT signals.

Reported-by: Doug Smythies <dsmythies@telus.net>
Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---

this patch applies on tip/sched/urgent as there is a dependency with
commit b90f7c9d2198 ("sched/pelt: Fix update of blocked PELT ordering")

Changes for v4:
- updated comments
- added Reviewed-by and Acked-by 

 kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++-------------------
 1 file changed, 28 insertions(+), 19 deletions(-)

Comments

Peter Zijlstra Nov. 15, 2019, 9:54 a.m. UTC | #1
On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> which might be inefficient when cpufreq driver has rate limitation.
> 
> When a task is attached on a CPU, we have call path:
> 
> update_load_avg()
>   update_cfs_rq_load_avg()
>     cfs_rq_util_change -- > trig frequency update
>   attach_entity_load_avg()
>     cfs_rq_util_change -- > trig frequency update
> 
> The 1st frequency update will not take into account the utilization of the
> newly attached task and the 2nd one might be discard because of rate
> limitation of the cpufreq driver.

Doesn't this just show that a dumb rate limit in the driver is broken?

> update_cfs_rq_load_avg() is only called by update_blocked_averages()
> and update_load_avg() so we can move the call to
> cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> interesting to notice that update_load_avg() already calls directly
> cfs_rq_util_change() for !SMP case.
> 
> This changes will also ensure that cpufreq_update_util() is called even
> when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> irq, rt or dl pelt signals.

I don't think it does that; that is, iirc the return value of
___update_load_sum() is 1 every time a period lapses. So even if the avg
is 0 and doesn't change, it'll still return 1 on every period.
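
(Roughly, a simplified sketch of that return path, not the verbatim PELT code:)

        /* ___update_load_sum(), heavily simplified */
        delta = now - sa->last_update_time;
        delta >>= 10;                           /* ns -> ~1us units */
        if (!delta)
                return 0;

        sa->last_update_time += delta << 10;

        if (!accumulate_sum(delta, sa, load, runnable, running))
                return 0;               /* no 1024us period boundary crossed */

        return 1;                       /* a period lapsed, even if the avg is 0 and unchanged */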

Which is what that dumb rate-limit thing wants of course. But I'm still
thinking that it's stupid to do. If nothing changes, don't generate
events.

If anything, update_blocked_averages() should look at
@done/others_have_blocked() to emit events for rt, dl, irq.

So why are we making the scheduler code more ugly instead of fixing that
driver?
Rafael J. Wysocki Nov. 15, 2019, 10:03 a.m. UTC | #2
On Fri, Nov 15, 2019 at 10:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > which might be inefficient when cpufreq driver has rate limitation.
> >
> > When a task is attached on a CPU, we have call path:
> >
> > update_load_avg()
> >   update_cfs_rq_load_avg()
> >     cfs_rq_util_change -- > trig frequency update
> >   attach_entity_load_avg()
> >     cfs_rq_util_change -- > trig frequency update
> >
> > The 1st frequency update will not take into account the utilization of the
> > newly attached task and the 2nd one might be discard because of rate
> > limitation of the cpufreq driver.
>
> Doesn't this just show that a dumb rate limit in the driver is broken?
>
> > update_cfs_rq_load_avg() is only called by update_blocked_averages()
> > and update_load_avg() so we can move the call to
> > cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> > interesting to notice that update_load_avg() already calls directly
> > cfs_rq_util_change() for !SMP case.
> >
> > This changes will also ensure that cpufreq_update_util() is called even
> > when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> > irq, rt or dl pelt signals.
>
> I don't think it does that; that is, iirc the return value of
> ___update_load_sum() is 1 every time a period lapses. So even if the avg
> is 0 and doesn't change, it'll still return 1 on every period.
>
> Which is what that dumb rate-limit thing wants of course. But I'm still
> thinking that it's stupid to do. If nothing changes, don't generate
> events.
>
> If anything, update_blocked_avgerages() should look at
> @done/others_have_blocked() to emit events for rt,dl,irq.
>
> So why are we making the scheduler code more ugly instead of fixing that
> driver?

I guess we could "fix" the driver by making it rate limit MSR writes
only, but I'm not sure if that would help.
Vincent Guittot Nov. 15, 2019, 10:18 a.m. UTC | #3
On Fri, 15 Nov 2019 at 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > which might be inefficient when cpufreq driver has rate limitation.
> >
> > When a task is attached on a CPU, we have call path:
> >
> > update_load_avg()
> >   update_cfs_rq_load_avg()
> >     cfs_rq_util_change -- > trig frequency update
> >   attach_entity_load_avg()
> >     cfs_rq_util_change -- > trig frequency update
> >
> > The 1st frequency update will not take into account the utilization of the
> > newly attached task and the 2nd one might be discard because of rate
> > limitation of the cpufreq driver.
>
> Doesn't this just show that a dumb rate limit in the driver is broken?

But the rate limit may come from HW constraints that force us to wait,
let's say, 4ms or even 10ms between each frequency update.
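
As a sketch of the kind of rate limiting meant here (hypothetical driver
code, not any real driver):

static void hw_limited_update(struct update_util_data *data, u64 time, unsigned int flags)
{
        struct my_policy *p = container_of(data, struct my_policy, update_util);

        /* hypothetical HW constraint: one frequency write per min_interval_ns */
        if (time - p->last_hw_write_ns < p->min_interval_ns)
                return;                 /* too early: this request is simply dropped */

        p->last_hw_write_ns = time;
        hw_write_target(p);             /* hypothetical helper programming the new frequency */
}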

>
> > update_cfs_rq_load_avg() is only called by update_blocked_averages()
> > and update_load_avg() so we can move the call to
> > cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> > interesting to notice that update_load_avg() already calls directly
> > cfs_rq_util_change() for !SMP case.
> >
> > This changes will also ensure that cpufreq_update_util() is called even
> > when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> > irq, rt or dl pelt signals.
>
> I don't think it does that; that is, iirc the return value of
> ___update_load_sum() is 1 every time a period lapses. So even if the avg
> is 0 and doesn't change, it'll still return 1 on every period.
>
> Which is what that dumb rate-limit thing wants of course. But I'm still
> thinking that it's stupid to do. If nothing changes, don't generate
> events.

When everything (irq, dl, rt, cfs) is 0, we don't generate events
because update_blocked_averages() is no longer called, since
rq->has_blocked_load is clear.

With the current implementation, if cfs is 0 but irq, dl or rt is not,
we don't call cpufreq_update_util() because it is only called through cfs.
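
(Roughly, the gate that stops the event stream once everything has decayed;
a simplified sketch of the nohz stats path:)

static bool update_nohz_stats(struct rq *rq, bool force)
{
        if (!rq->has_blocked_load)      /* cleared once cfs, rt, dl and irq are all decayed */
                return false;           /* -> update_blocked_averages() is not called anymore */

        update_blocked_averages(rq->cpu);

        return rq->has_blocked_load;
}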

>
> If anything, update_blocked_avgerages() should look at
> @done/others_have_blocked() to emit events for rt,dl,irq.

others_have_blocked() can be set even though no decay happened during the
update, in which case we don't need to call cpufreq_update_util().

>
> So why are we making the scheduler code more ugly instead of fixing that
> driver?
Vincent Guittot Nov. 15, 2019, 10:29 a.m. UTC | #4
On Fri, 15 Nov 2019 at 11:18, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Fri, 15 Nov 2019 at 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > > which might be inefficient when cpufreq driver has rate limitation.
> > >
> > > When a task is attached on a CPU, we have call path:
> > >
> > > update_load_avg()
> > >   update_cfs_rq_load_avg()
> > >     cfs_rq_util_change -- > trig frequency update
> > >   attach_entity_load_avg()
> > >     cfs_rq_util_change -- > trig frequency update
> > >
> > > The 1st frequency update will not take into account the utilization of the
> > > newly attached task and the 2nd one might be discard because of rate
> > > limitation of the cpufreq driver.
> >
> > Doesn't this just show that a dumb rate limit in the driver is broken?
>
> But the rate limit may come from HW constraints that forces to wait
> let say 4ms or even 10ms between each frequency update.
>
> >
> > > update_cfs_rq_load_avg() is only called by update_blocked_averages()
> > > and update_load_avg() so we can move the call to
> > > cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> > > interesting to notice that update_load_avg() already calls directly
> > > cfs_rq_util_change() for !SMP case.
> > >
> > > This changes will also ensure that cpufreq_update_util() is called even
> > > when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> > > irq, rt or dl pelt signals.
> >
> > I don't think it does that; that is, iirc the return value of
> > ___update_load_sum() is 1 every time a period lapses. So even if the avg
> > is 0 and doesn't change, it'll still return 1 on every period.
> >
> > Which is what that dumb rate-limit thing wants of course. But I'm still
> > thinking that it's stupid to do. If nothing changes, don't generate
> > events.
>
> When everything (irq, dl, rt, cfs) is 0, we don't generate events
> because update_blocked_averages is no more called because
> rq->has_blocked_load is clear
>
> With current implementation, if cfs is 0 but not irq, dl or rt, we
> don't call cpufreq_update_util because it is only called through cfs
>
> >
> > If anything, update_blocked_avgerages() should look at
> > @done/others_have_blocked() to emit events for rt,dl,irq.
>
> other_have_blocked can be set but no decay happened during the update
> and we don't need to call cpufreq_update_util
>
> >
> > So why are we making the scheduler code more ugly instead of fixing that
> > driver?

Also, I think that calling cfs_rq_util_change() in
attach_entity_load_avg() is not optimal, because the attach can happen at
a child level before it has been propagated down to the root.
So I'm working on removing it from attach_entity_load_avg() and keeping
it only in update_load_avg(). This would help clean up the ugly:

-       } else if (decayed && (flags & UPDATE_TG))
-               update_tg_load_avg(cfs_rq, 0);
+       } else if (decayed) {
+               cfs_rq_util_change(cfs_rq, 0);
+
+               if (flags & UPDATE_TG)
+                       update_tg_load_avg(cfs_rq, 0);
+       }
 }
Peter Zijlstra Nov. 15, 2019, 10:37 a.m. UTC | #5
On Fri, Nov 15, 2019 at 11:18:00AM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > > which might be inefficient when cpufreq driver has rate limitation.
> > >
> > > When a task is attached on a CPU, we have call path:
> > >
> > > update_load_avg()
> > >   update_cfs_rq_load_avg()
> > >     cfs_rq_util_change -- > trig frequency update
> > >   attach_entity_load_avg()
> > >     cfs_rq_util_change -- > trig frequency update
> > >
> > > The 1st frequency update will not take into account the utilization of the
> > > newly attached task and the 2nd one might be discard because of rate
> > > limitation of the cpufreq driver.
> >
> > Doesn't this just show that a dumb rate limit in the driver is broken?
> 
> But the rate limit may come from HW constraints that forces to wait
> let say 4ms or even 10ms between each frequency update.

Sure, but then it can still remember the value passed in last and use
that state later.

It doesn't _have_ to completely discard values.
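
(As a sketch, building on the hypothetical rate-limited hook above, it could
instead do something like:)

        if (time - p->last_hw_write_ns < p->min_interval_ns) {
                p->pending_update = true;       /* remember that a fresher request exists ... */
                return;                         /* ... and apply it at the next opportunity */
        }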

> > > update_cfs_rq_load_avg() is only called by update_blocked_averages()
> > > and update_load_avg() so we can move the call to
> > > cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> > > interesting to notice that update_load_avg() already calls directly
> > > cfs_rq_util_change() for !SMP case.
> > >
> > > This changes will also ensure that cpufreq_update_util() is called even
> > > when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> > > irq, rt or dl pelt signals.
> >
> > I don't think it does that; that is, iirc the return value of
> > ___update_load_sum() is 1 every time a period lapses. So even if the avg
> > is 0 and doesn't change, it'll still return 1 on every period.
> >
> > Which is what that dumb rate-limit thing wants of course. But I'm still
> > thinking that it's stupid to do. If nothing changes, don't generate
> > events.
> 
> When everything (irq, dl, rt, cfs) is 0, we don't generate events
> because update_blocked_averages is no more called because
> rq->has_blocked_load is clear

Aah.. Ok, let me look at this thing again.

> > If anything, update_blocked_avgerages() should look at
> > @done/others_have_blocked() to emit events for rt,dl,irq.
> 
> other_have_blocked can be set but no decay happened during the update
> and we don't need to call cpufreq_update_util

True.
Peter Zijlstra Nov. 15, 2019, 10:40 a.m. UTC | #6
On Fri, Nov 15, 2019 at 11:03:20AM +0100, Rafael J. Wysocki wrote:
> On Fri, Nov 15, 2019 at 10:55 AM Peter Zijlstra <peterz@infradead.org> wrote:

> > So why are we making the scheduler code more ugly instead of fixing that
> > driver?
> 
> I guess we could "fix" the driver by making it rate limit MSR writes
> only, but I'm not sure if that would help.

So it is not clear to me what exactly intel_pstate needs and why. Like I
wrote in my reply to Vincent just now, it can still store the last
value, even if it doesn't act on it right away.

And it can then act on that stored value at a later event, whatever is
appropriate.

I'm just saying that generating superfluous events is silly. But
possibly I read the patch wrong.
Vincent Guittot Nov. 15, 2019, 10:46 a.m. UTC | #7
On Fri, 15 Nov 2019 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 15, 2019 at 11:18:00AM +0100, Vincent Guittot wrote:
> > On Fri, 15 Nov 2019 at 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > > > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > > > which might be inefficient when cpufreq driver has rate limitation.
> > > >
> > > > When a task is attached on a CPU, we have call path:
> > > >
> > > > update_load_avg()
> > > >   update_cfs_rq_load_avg()
> > > >     cfs_rq_util_change -- > trig frequency update
> > > >   attach_entity_load_avg()
> > > >     cfs_rq_util_change -- > trig frequency update
> > > >
> > > > The 1st frequency update will not take into account the utilization of the
> > > > newly attached task and the 2nd one might be discard because of rate
> > > > limitation of the cpufreq driver.
> > >
> > > Doesn't this just show that a dumb rate limit in the driver is broken?
> >
> > But the rate limit may come from HW constraints that forces to wait
> > let say 4ms or even 10ms between each frequency update.
>
> Sure, but then it can still remember the value passed in last and use
> that state later.
>
> It doesn't _have_ to completely discard values.

Yes, but it means that we run at the "wrong" frequency during this
period, and also that cpufreq must in this case set a kind of timer to
resubmit a new frequency change outside of scheduler events.

>
> > > > update_cfs_rq_load_avg() is only called by update_blocked_averages()
> > > > and update_load_avg() so we can move the call to
> > > > cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> > > > interesting to notice that update_load_avg() already calls directly
> > > > cfs_rq_util_change() for !SMP case.
> > > >
> > > > This changes will also ensure that cpufreq_update_util() is called even
> > > > when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> > > > irq, rt or dl pelt signals.
> > >
> > > I don't think it does that; that is, iirc the return value of
> > > ___update_load_sum() is 1 every time a period lapses. So even if the avg
> > > is 0 and doesn't change, it'll still return 1 on every period.
> > >
> > > Which is what that dumb rate-limit thing wants of course. But I'm still
> > > thinking that it's stupid to do. If nothing changes, don't generate
> > > events.
> >
> > When everything (irq, dl, rt, cfs) is 0, we don't generate events
> > because update_blocked_averages is no more called because
> > rq->has_blocked_load is clear
>
> Aah.. Ok, let me look at this thing again.
>
> > > If anything, update_blocked_avgerages() should look at
> > > @done/others_have_blocked() to emit events for rt,dl,irq.
> >
> > other_have_blocked can be set but no decay happened during the update
> > and we don't need to call cpufreq_update_util
>
> True.
Peter Zijlstra Nov. 15, 2019, 10:51 a.m. UTC | #8
On Fri, Nov 15, 2019 at 11:46:01AM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:

> > Sure, but then it can still remember the value passed in last and use
> > that state later.
> >
> > It doesn't _have_ to completely discard values.
> 
> yes but it means that we run at the "wrong" frequency during this
> period and also that the cpufreq must in this case set a kind of timer
> to resubmit a new frequency change out of scheduler event

But if, as you say, we're completely shutting down the event stream
when everything has decayed, that's still true, right?
Rafael J. Wysocki Nov. 15, 2019, 10:51 a.m. UTC | #9
On Fri, Nov 15, 2019 at 11:46 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Fri, 15 Nov 2019 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Nov 15, 2019 at 11:18:00AM +0100, Vincent Guittot wrote:
> > > On Fri, 15 Nov 2019 at 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > > > > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > > > > which might be inefficient when cpufreq driver has rate limitation.
> > > > >
> > > > > When a task is attached on a CPU, we have call path:
> > > > >
> > > > > update_load_avg()
> > > > >   update_cfs_rq_load_avg()
> > > > >     cfs_rq_util_change -- > trig frequency update
> > > > >   attach_entity_load_avg()
> > > > >     cfs_rq_util_change -- > trig frequency update
> > > > >
> > > > > The 1st frequency update will not take into account the utilization of the
> > > > > newly attached task and the 2nd one might be discard because of rate
> > > > > limitation of the cpufreq driver.
> > > >
> > > > Doesn't this just show that a dumb rate limit in the driver is broken?
> > >
> > > But the rate limit may come from HW constraints that forces to wait
> > > let say 4ms or even 10ms between each frequency update.
> >
> > Sure, but then it can still remember the value passed in last and use
> > that state later.
> >
> > It doesn't _have_ to completely discard values.
>
> yes but it means that we run at the "wrong" frequency during this
> period and also that the cpufreq must in this case set a kind of timer
> to resubmit a new frequency change out of scheduler event

The driver would need to do that, because from the cpufreq core
perspective it is in-band.

Which would kind of defeat the purpose of driving it from the
scheduler, wouldn't it?
Vincent Guittot Nov. 15, 2019, 11:03 a.m. UTC | #10
On Fri, 15 Nov 2019 at 11:51, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 15, 2019 at 11:46:01AM +0100, Vincent Guittot wrote:
> > On Fri, 15 Nov 2019 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Sure, but then it can still remember the value passed in last and use
> > > that state later.
> > >
> > > It doesn't _have_ to completely discard values.
> >
> > yes but it means that we run at the "wrong" frequency during this
> > period and also that the cpufreq must in this case set a kind of timer
> > to resubmit a new frequency change out of scheduler event
>
> But if, as you say, we're completely shutting down the event stream
> when everything has decayed, that's still true, right?

But it doesn't, because there is nothing else to do.

This patch does 2 things:
- fix the spurious call to cpufreq just before attaching a task
- make sure cpufreq is still called when cfs is 0 but not irq/rt or dl

They are somewhat related, but not fully.
Vincent Guittot Nov. 15, 2019, 11:05 a.m. UTC | #11
On Fri, 15 Nov 2019 at 11:41, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 15, 2019 at 11:03:20AM +0100, Rafael J. Wysocki wrote:
> > On Fri, Nov 15, 2019 at 10:55 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > So why are we making the scheduler code more ugly instead of fixing that
> > > driver?
> >
> > I guess we could "fix" the driver by making it rate limit MSR writes
> > only, but I'm not sure if that would help.
>
> So it is not clear to me what exactly intel_pstate needs and why. Like I
> wrote in my reply to Vincent just now, it can still store the last
> value, even if it doesn't act on it right away.
>
> And it can then act on that stored value at a later event, whatever is
> appropriate.
>
> I'm just saying that generating superfluous events is silly. But
> possibly I read the patch wrong.

This is not the intent of the patch.

Before 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load
balancing path"), the call to cpufreq was made via
update_cfs_rq_load_avg() even if cfs was already null but irq/rt
or dl was not.
After that patch, cpufreq was not called anymore.

This patch fixes that, to make sure that cpufreq is still called while
irq/rt or dl is not null.

Then, it also removes a spurious call to cpufreq just before attaching the task.
Rafael J. Wysocki Nov. 15, 2019, 11:46 a.m. UTC | #12
On Thu, Nov 14, 2019 at 6:07 PM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> which might be inefficient when cpufreq driver has rate limitation.
>
> When a task is attached on a CPU, we have call path:
>
> update_load_avg()
>   update_cfs_rq_load_avg()
>     cfs_rq_util_change -- > trig frequency update
>   attach_entity_load_avg()
>     cfs_rq_util_change -- > trig frequency update
>
> The 1st frequency update will not take into account the utilization of the
> newly attached task and the 2nd one might be discard because of rate
> limitation of the cpufreq driver.

On second thought, it shouldn't matter for governors other than
schedutil that the new task's utilization is not taken into account
by the first update, because they measure utilization by themselves.

> update_cfs_rq_load_avg() is only called by update_blocked_averages()
> and update_load_avg() so we can move the call to
> cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> interesting to notice that update_load_avg() already calls directly
> cfs_rq_util_change() for !SMP case.
>
> This changes will also ensure that cpufreq_update_util() is called even
> when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> irq, rt or dl pelt signals.

So this change appears to be the relevant one for non-schedutil governors.

Now, there is a rate limit in schedutil too, see
sugov_should_update_freq(), but I'm not sure if that matters in the
context of this patch.
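
(That check is roughly the following; simplified, since the real function
also handles remote updates and pending work:)

static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
        s64 delta_ns = time - sg_policy->last_freq_update_time;

        return delta_ns >= sg_policy->freq_update_delay_ns;
}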
Peter Zijlstra Nov. 15, 2019, 11:56 a.m. UTC | #13
On Fri, Nov 15, 2019 at 12:03:31PM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 11:51, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Nov 15, 2019 at 11:46:01AM +0100, Vincent Guittot wrote:
> > > On Fri, 15 Nov 2019 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > Sure, but then it can still remember the value passed in last and use
> > > > that state later.
> > > >
> > > > It doesn't _have_ to completely discard values.
> > >
> > > yes but it means that we run at the "wrong" frequency during this
> > > period and also that the cpufreq must in this case set a kind of timer
> > > to resubmit a new frequency change out of scheduler event
> >
> > But if, as you say, we're completely shutting down the event stream
> > when everything has decayed, that's still true, right?
> 
> But It doesn't because there is nothing else to do.
> 
> This patch does 2 things:
> - fix the spurious call to cpufreq just before attaching a task
> - make sure cpufreq is still called when cfs is 0 but not irq/rt or dl
> 
> There are somehow related but not fully

Right, but when everything is 0, we stop generating events because we
stop calling update_blocked_averages(), so there will be a last event
when we hit all 0s and then nothing.

So no superfluous events.

And if that last event is thrown out because of rate-limiting ....
Vincent Guittot Nov. 15, 2019, 11:59 a.m. UTC | #14
On Friday 15 Nov 2019 at 11:29:03 (+0100), Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 11:18, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Fri, 15 Nov 2019 at 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > > > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > > > which might be inefficient when cpufreq driver has rate limitation.
> > > >
> > > > When a task is attached on a CPU, we have call path:
> > > >
> > > > update_load_avg()
> > > >   update_cfs_rq_load_avg()
> > > >     cfs_rq_util_change -- > trig frequency update
> > > >   attach_entity_load_avg()
> > > >     cfs_rq_util_change -- > trig frequency update
> > > >
> > > > The 1st frequency update will not take into account the utilization of the
> > > > newly attached task and the 2nd one might be discard because of rate
> > > > limitation of the cpufreq driver.
> > >
> > > Doesn't this just show that a dumb rate limit in the driver is broken?
> >
> > But the rate limit may come from HW constraints that forces to wait
> > let say 4ms or even 10ms between each frequency update.
> >
> > >
> > > > update_cfs_rq_load_avg() is only called by update_blocked_averages()
> > > > and update_load_avg() so we can move the call to
> > > > cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> > > > interesting to notice that update_load_avg() already calls directly
> > > > cfs_rq_util_change() for !SMP case.
> > > >
> > > > This changes will also ensure that cpufreq_update_util() is called even
> > > > when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> > > > irq, rt or dl pelt signals.
> > >
> > > I don't think it does that; that is, iirc the return value of
> > > ___update_load_sum() is 1 every time a period lapses. So even if the avg
> > > is 0 and doesn't change, it'll still return 1 on every period.
> > >
> > > Which is what that dumb rate-limit thing wants of course. But I'm still
> > > thinking that it's stupid to do. If nothing changes, don't generate
> > > events.
> >
> > When everything (irq, dl, rt, cfs) is 0, we don't generate events
> > because update_blocked_averages is no more called because
> > rq->has_blocked_load is clear
> >
> > With current implementation, if cfs is 0 but not irq, dl or rt, we
> > don't call cpufreq_update_util because it is only called through cfs
> >
> > >
> > > If anything, update_blocked_avgerages() should look at
> > > @done/others_have_blocked() to emit events for rt,dl,irq.
> >
> > other_have_blocked can be set but no decay happened during the update
> > and we don't need to call cpufreq_update_util
> >
> > >
> > > So why are we making the scheduler code more ugly instead of fixing that
> > > driver?
> 
> Also, I think that calling cfs_rq_util_change in
> attach_entity_load_avg is not optimal because the attach can happen at
> a child level before it has been propagated down to root
> So I'm working on trying to remove it from attach_entity_load_avg and
> keep it in update_load_avg. this would help cleaning the ugly
> 
> -       } else if (decayed && (flags & UPDATE_TG))
> -               update_tg_load_avg(cfs_rq, 0);
> +       } else if (decayed) {
> +               cfs_rq_util_change(cfs_rq, 0);
> +
> +               if (flags & UPDATE_TG)
> +                       update_tg_load_avg(cfs_rq, 0);
> +       }
>  }
>

We can also do this instead:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d377a3f..550b6bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3614,15 +3614,15 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
                 *
                 * IOW we're enqueueing a task on a new CPU.
                 */
-               attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
+               attach_entity_load_avg(cfs_rq, se, 0);
                update_tg_load_avg(cfs_rq, 0);
+               decayed = 1;
 
-       } else if (decayed) {
-               cfs_rq_util_change(cfs_rq, 0);
+       } else if (decayed && (flags & UPDATE_TG))
+               update_tg_load_avg(cfs_rq, 0);
 
-               if (flags & UPDATE_TG)
-                       update_tg_load_avg(cfs_rq, 0);
-       }
+       if (decayed)
+               cfs_rq_util_change(cfs_rq, 0);
 }



>
Peter Zijlstra Nov. 15, 2019, 12:17 p.m. UTC | #15
On Fri, Nov 15, 2019 at 11:46:01AM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 11:37, Peter Zijlstra <peterz@infradead.org> wrote:

> > Sure, but then it can still remember the value passed in last and use
> > that state later.
> >
> > It doesn't _have_ to completely discard values.
> 
> yes but it means that we run at the "wrong" frequency during this
> period and also that the cpufreq must in this case set a kind of timer
> to resubmit a new frequency change out of scheduler event

It always runs at the wrong frequency. Almost per definition. We're
doing near future predictions based on recent past, and if we can only
set the hardware once every N [ms] then there's really nothing better we
can do. We'll _have_ to live with the value we 'randomly' pick at the
start of those N [ms] for the whole period.

And I'm not sure it needs to set a timer, it can simply probe the value
when we go idle, if this is the kind of system that cares about OPP on
idle.
Vincent Guittot Nov. 15, 2019, 12:25 p.m. UTC | #16
On Fri, 15 Nov 2019 at 12:59, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Friday 15 Nov 2019 at 11:29:03 (+0100), Vincent Guittot wrote:
> > On Fri, 15 Nov 2019 at 11:18, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Fri, 15 Nov 2019 at 10:55, Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > > > > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > > > > which might be inefficient when cpufreq driver has rate limitation.
> > > > >
> > > > > When a task is attached on a CPU, we have call path:
> > > > >
> > > > > update_load_avg()
> > > > >   update_cfs_rq_load_avg()
> > > > >     cfs_rq_util_change -- > trig frequency update
> > > > >   attach_entity_load_avg()
> > > > >     cfs_rq_util_change -- > trig frequency update
> > > > >
> > > > > The 1st frequency update will not take into account the utilization of the
> > > > > newly attached task and the 2nd one might be discard because of rate
> > > > > limitation of the cpufreq driver.
> > > >
> > > > Doesn't this just show that a dumb rate limit in the driver is broken?
> > >
> > > But the rate limit may come from HW constraints that forces to wait
> > > let say 4ms or even 10ms between each frequency update.
> > >
> > > >
> > > > > update_cfs_rq_load_avg() is only called by update_blocked_averages()
> > > > > and update_load_avg() so we can move the call to
> > > > > cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> > > > > interesting to notice that update_load_avg() already calls directly
> > > > > cfs_rq_util_change() for !SMP case.
> > > > >
> > > > > This changes will also ensure that cpufreq_update_util() is called even
> > > > > when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> > > > > irq, rt or dl pelt signals.
> > > >
> > > > I don't think it does that; that is, iirc the return value of
> > > > ___update_load_sum() is 1 every time a period lapses. So even if the avg
> > > > is 0 and doesn't change, it'll still return 1 on every period.
> > > >
> > > > Which is what that dumb rate-limit thing wants of course. But I'm still
> > > > thinking that it's stupid to do. If nothing changes, don't generate
> > > > events.
> > >
> > > When everything (irq, dl, rt, cfs) is 0, we don't generate events
> > > because update_blocked_averages is no more called because
> > > rq->has_blocked_load is clear
> > >
> > > With current implementation, if cfs is 0 but not irq, dl or rt, we
> > > don't call cpufreq_update_util because it is only called through cfs
> > >
> > > >
> > > > If anything, update_blocked_avgerages() should look at
> > > > @done/others_have_blocked() to emit events for rt,dl,irq.
> > >
> > > other_have_blocked can be set but no decay happened during the update
> > > and we don't need to call cpufreq_update_util
> > >
> > > >
> > > > So why are we making the scheduler code more ugly instead of fixing that
> > > > driver?
> >
> > Also, I think that calling cfs_rq_util_change in
> > attach_entity_load_avg is not optimal because the attach can happen at
> > a child level before it has been propagated down to root
> > So I'm working on trying to remove it from attach_entity_load_avg and
> > keep it in update_load_avg. this would help cleaning the ugly
> >
> > -       } else if (decayed && (flags & UPDATE_TG))
> > -               update_tg_load_avg(cfs_rq, 0);
> > +       } else if (decayed) {
> > +               cfs_rq_util_change(cfs_rq, 0);
> > +
> > +               if (flags & UPDATE_TG)
> > +                       update_tg_load_avg(cfs_rq, 0);
> > +       }
> >  }
> >
>
> we can also do this instead :
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d377a3f..550b6bc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3614,15 +3614,15 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                  *
>                  * IOW we're enqueueing a task on a new CPU.
>                  */
> -               attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
> +               attach_entity_load_avg(cfs_rq, se, 0);
>                 update_tg_load_avg(cfs_rq, 0);
> +               decayed = 1;
>
> -       } else if (decayed) {
> -               cfs_rq_util_change(cfs_rq, 0);
> +       } else if (decayed && (flags & UPDATE_TG))
> +               update_tg_load_avg(cfs_rq, 0);
>
> -               if (flags & UPDATE_TG)
> -                       update_tg_load_avg(cfs_rq, 0);
> -       }
> +       if (decayed)
> +               cfs_rq_util_change(cfs_rq, 0);
>  }

Forget this ... It's not enough and will continue to generate spurious
calls when the task is at the root domain.

>
>
>
> >
Peter Zijlstra Nov. 15, 2019, 1:01 p.m. UTC | #17
On Fri, Nov 15, 2019 at 12:03:31PM +0100, Vincent Guittot wrote:

> This patch does 2 things:
> - fix the spurious call to cpufreq just before attaching a task

Right, so that one doesn't concern me too much.

> - make sure cpufreq is still called when cfs is 0 but not irq/rt or dl

But per the rq->has_blocked_load logic we would mostly stop sending
events once we reach all 0s.

Now, most of those updates will be through _nohz_idle_balance() ->
update_nohz_stats(), which are remote, which means intel_pstate is
ignoring them anyway.

Now the _nohz_idle_balance() -> update_blocked_averages() thing runs
local, and that will update the one random idle CPU we picked to run
nohz balance, but all others will be left where they were.

So why does intel_pstate care... Esp. on SKL+ with per-core P state this
is of dubious value.

Also, and maybe I should go read back, why do we care what the P state
is when we're mostly in C states anyway? These are all idle CPUs,
otherwise we wouldn't be running update_blocked_averages() on them
anyway.


Much confusion..
Peter Zijlstra Nov. 15, 2019, 1:25 p.m. UTC | #18
On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 69a81a5..3be44e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3504,9 +3504,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  	cfs_rq->load_last_update_time_copy = sa->last_update_time;
>  #endif
>  
> -	if (decayed)
> -		cfs_rq_util_change(cfs_rq, 0);
> -
>  	return decayed;
>  }

This removes the call from the for_each_leaf_cfs_rq_safe() loop.

> @@ -7543,18 +7544,19 @@ static void update_blocked_averages(int cpu)
>  	const struct sched_class *curr_class;
>  	struct rq_flags rf;
>  	bool done = true;
> +	int decayed;
>  
>  	rq_lock_irqsave(rq, &rf);
>  	update_rq_clock(rq);
>  
>  	/*
> -	 * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
> -	 * that RT, DL and IRQ signals have been updated before updating CFS.
> +	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> +	 * DL and IRQ signals have been updated before updating CFS.
>  	 */
>  	curr_class = rq->curr->sched_class;
> -	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> -	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> -	update_irq_load_avg(rq, 0);
> +	decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> +	decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> +	decayed |= update_irq_load_avg(rq, 0);

Should not all 3 have their windows aligned and thus always return the
exact same value?

>  
>  	/* Don't need periodic decay once load/util_avg are null */
>  	if (others_have_blocked(rq))
> @@ -7567,9 +7569,13 @@ static void update_blocked_averages(int cpu)
>  	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
>  		struct sched_entity *se;
>  
> -		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
> +		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
>  			update_tg_load_avg(cfs_rq, 0);
>  
> +			if (cfs_rq == &rq->cfs)
> +				decayed = 1;

And that restores it.

But should not also rq->cfs's window be aligned with the above 3?
Meaning that this one, with the exception of the list_del, covers all 4.

> +		}
> +
>  		/* Propagate pending load changes to the parent, if any: */
>  		se = cfs_rq->tg->se[cpu];
>  		if (se && !skip_blocked_update(se))
> @@ -7588,6 +7594,9 @@ static void update_blocked_averages(int cpu)
>  	}
>  
>  	update_blocked_load_status(rq, !done);
> +
> +	if (decayed)
> +		cpufreq_update_util(rq, 0);
>  	rq_unlock_irqrestore(rq, &rf);
>  }
>  
> @@ -7644,22 +7653,22 @@ static inline void update_blocked_averages(int cpu)
>  	struct cfs_rq *cfs_rq = &rq->cfs;
>  	const struct sched_class *curr_class;
>  	struct rq_flags rf;
> +	int decayed;
>  
>  	rq_lock_irqsave(rq, &rf);
>  	update_rq_clock(rq);
>  
> -	/*
> -	 * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
> -	 * that RT, DL and IRQ signals have been updated before updating CFS.
> -	 */
>  	curr_class = rq->curr->sched_class;
> -	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> -	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> -	update_irq_load_avg(rq, 0);
> +	decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> +	decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> +	decayed |= update_irq_load_avg(rq, 0);
>  
> -	update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
> +	decayed |= update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);

And thus this one makes all 3 above redundant.

>  
>  	update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
> +
> +	if (decayed)
> +		cpufreq_update_util(rq, 0);
>  	rq_unlock_irqrestore(rq, &rf);
>  }

That is, I'm almost tempted to prefer a variant of your initial hack
that refuses to remove rq->cfs from the list.

That avoids having to care about the rt, dl, irq decays (their windows
align with rq->cfs) and makes SMP/UP similar.


I still don't actually understand how any of this makes intel_pstate
happy though.
Vincent Guittot Nov. 15, 2019, 1:30 p.m. UTC | #19
On Fri, 15 Nov 2019 at 14:02, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 15, 2019 at 12:03:31PM +0100, Vincent Guittot wrote:
>
> > This patch does 2 things:
> > - fix the spurious call to cpufreq just before attaching a task
>
> Right, so that one doesn't concern me too much.
>
> > - make sure cpufreq is still called when cfs is 0 but not irq/rt or dl
>
> But per the rq->has_blocked_load logic we would mostly stop sending
> events once we reach all 0s.
>
> Now, most of those updates will be through _nohz_idle_balance() ->
> update_nohz_stats(), which are remote, which means intel_pstate is
> ignoring them anyway.
>
> Now the _nohz_idle_balance() -> update_blocked_averages() thing runs
> local, and that will update the one random idle CPU we picked to run
> nohz balance, but all others will be left where they were.
>
> So why does intel_pstate care... Esp. on SKL+ with per-core P state this
> is of dubious value.

Doug mentioned some periodic timers that were running on the CPUs.

>
> Also, and maybe I should go read back, why do we care what the P state
> is when we're mostly in C states anyway? These are all idle CPUs,
> otherwise we wouldkn't be running update_blocked_averages() on them
> anyway.

AFAIU, they are not 100% idle, but they have periodic timers that will
fire and run at a higher P-state.


>
>
> Much confusion..
Vincent Guittot Nov. 15, 2019, 1:37 p.m. UTC | #20
On Fri, 15 Nov 2019 at 14:25, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 69a81a5..3be44e1 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3504,9 +3504,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >       cfs_rq->load_last_update_time_copy = sa->last_update_time;
> >  #endif
> >
> > -     if (decayed)
> > -             cfs_rq_util_change(cfs_rq, 0);
> > -
> >       return decayed;
> >  }
>
> This removes the call from the for_each_leaf_cfs_rq_safe() loop.
>
> > @@ -7543,18 +7544,19 @@ static void update_blocked_averages(int cpu)
> >       const struct sched_class *curr_class;
> >       struct rq_flags rf;
> >       bool done = true;
> > +     int decayed;
> >
> >       rq_lock_irqsave(rq, &rf);
> >       update_rq_clock(rq);
> >
> >       /*
> > -      * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
> > -      * that RT, DL and IRQ signals have been updated before updating CFS.
> > +      * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> > +      * DL and IRQ signals have been updated before updating CFS.
> >        */
> >       curr_class = rq->curr->sched_class;
> > -     update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> > -     update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> > -     update_irq_load_avg(rq, 0);
> > +     decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> > +     decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> > +     decayed |= update_irq_load_avg(rq, 0);
>
> Should not all 3 have their windows aligned and thus alway return the
> exact same value?

rt and dl yes, but not irq.

But having aligned windows doesn't mean that they will all decay.
One can have been updated just before (during a dequeue, for example),
or at least less than 1ms before.
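
For example: if rt was last updated at a dequeue a few hundred microseconds
ago and no new 1024us PELT period boundary has been crossed since,
update_rt_rq_load_avg() returns 0 here, while a cfs_rq whose last update is
several periods old returns 1.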

>
> >
> >       /* Don't need periodic decay once load/util_avg are null */
> >       if (others_have_blocked(rq))
> > @@ -7567,9 +7569,13 @@ static void update_blocked_averages(int cpu)
> >       for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> >               struct sched_entity *se;
> >
> > -             if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
> > +             if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
> >                       update_tg_load_avg(cfs_rq, 0);
> >
> > +                     if (cfs_rq == &rq->cfs)
> > +                             decayed = 1;
>
> And that restores it.
>
> But should not also rq->cfs's window be aligned with the above 3?
> Meaning that this one, with exception of the list_del, covers all 4.
>
> > +             }
> > +
> >               /* Propagate pending load changes to the parent, if any: */
> >               se = cfs_rq->tg->se[cpu];
> >               if (se && !skip_blocked_update(se))
> > @@ -7588,6 +7594,9 @@ static void update_blocked_averages(int cpu)
> >       }
> >
> >       update_blocked_load_status(rq, !done);
> > +
> > +     if (decayed)
> > +             cpufreq_update_util(rq, 0);
> >       rq_unlock_irqrestore(rq, &rf);
> >  }
> >
> > @@ -7644,22 +7653,22 @@ static inline void update_blocked_averages(int cpu)
> >       struct cfs_rq *cfs_rq = &rq->cfs;
> >       const struct sched_class *curr_class;
> >       struct rq_flags rf;
> > +     int decayed;
> >
> >       rq_lock_irqsave(rq, &rf);
> >       update_rq_clock(rq);
> >
> > -     /*
> > -      * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
> > -      * that RT, DL and IRQ signals have been updated before updating CFS.
> > -      */
> >       curr_class = rq->curr->sched_class;
> > -     update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> > -     update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> > -     update_irq_load_avg(rq, 0);
> > +     decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> > +     decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> > +     decayed |= update_irq_load_avg(rq, 0);
> >
> > -     update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
> > +     decayed |= update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
>
> And that thus this one makes all 3 above redundant.
>
> >
> >       update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
> > +
> > +     if (decayed)
> > +             cpufreq_update_util(rq, 0);
> >       rq_unlock_irqrestore(rq, &rf);
> >  }
>
> That is, I'm almost tempted to prefer a variant of your initial hack,
> that refuses to remove rq->cfs from the list.
>
> That avoids having to care about the rt,dl,irq decays (their windows
> align with rq->cfs) and makes smp/up similar.
>
>
> I still don't actually understand how any of this makes intel_pstate
> happy though.
Peter Zijlstra Nov. 15, 2019, 1:46 p.m. UTC | #21
On Fri, Nov 15, 2019 at 02:30:58PM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 14:02, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Nov 15, 2019 at 12:03:31PM +0100, Vincent Guittot wrote:
> >
> > > This patch does 2 things:
> > > - fix the spurious call to cpufreq just before attaching a task
> >
> > Right, so that one doesn't concern me too much.
> >
> > > - make sure cpufreq is still called when cfs is 0 but not irq/rt or dl
> >
> > But per the rq->has_blocked_load logic we would mostly stop sending
> > events once we reach all 0s.
> >
> > Now, most of those updates will be through _nohz_idle_balance() ->
> > update_nohz_stats(), which are remote, which means intel_pstate is
> > ignoring them anyway.
> >
> > Now the _nohz_idle_balance() -> update_blocked_averages() thing runs
> > local, and that will update the one random idle CPU we picked to run
> > nohz balance, but all others will be left where they were.
> >
> > So why does intel_pstate care... Esp. on SKL+ with per-core P state this
> > is of dubious value.
> 
> Doug mentioned some periodic timers that were running on the CPUs
> 
> >
> > Also, and maybe I should go read back, why do we care what the P state
> > is when we're mostly in C states anyway? These are all idle CPUs,
> > otherwise we wouldkn't be running update_blocked_averages() on them
> > anyway.
> 
> AFAIU, there is not 100% idle but they have periodic timers that will
> fire and run at higher P state

If it is pure timers, I don't see how those CPUs end up calling
cpufreq_update_util().

Per the above argument, only the CPU that ran nohz balance gets an
update call, all the other CPUs that remain idle (or only serve IRQs)
never get the call.
Peter Zijlstra Nov. 15, 2019, 2:05 p.m. UTC | #22
On Fri, Nov 15, 2019 at 02:37:27PM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 14:25, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:

> > > +     decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> > > +     decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> > > +     decayed |= update_irq_load_avg(rq, 0);
> >
> > Should not all 3 have their windows aligned and thus alway return the
> > exact same value?
> 
> rt and dl yes but not irq

Any reason for IRQ not to be aligned?

> But having aligned window doesn't mean that they will all decay.
> One can have been updated just before (during a dequeue as an example)
> or at least less than 1ms before

Bah... true.
Vincent Guittot Nov. 15, 2019, 2:12 p.m. UTC | #23
On Fri, 15 Nov 2019 at 15:06, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 15, 2019 at 02:37:27PM +0100, Vincent Guittot wrote:
> > On Fri, 15 Nov 2019 at 14:25, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
>
> > > > +     decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> > > > +     decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> > > > +     decayed |= update_irq_load_avg(rq, 0);
> > >
> > > Should not all 3 have their windows aligned and thus alway return the
> > > exact same value?
> >
> > rt and dl yes but not irq
>
> Any reason for IRQ not to be aligned?

irq time is not accounted in task time, so irq_avg uses rq->clock
whereas the others use rq->clock_task.
But irq is also not summed up with the others.

>
> > But having aligned window doesn't mean that they will all decay.
> > One can have been updated just before (during a dequeue as an example)
> > or at least less than 1ms before
>
> Bah... true.
Peter Zijlstra Nov. 15, 2019, 3:12 p.m. UTC | #24
On Fri, Nov 15, 2019 at 02:37:27PM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 14:25, Peter Zijlstra <peterz@infradead.org> wrote:

> > Should not all 3 have their windows aligned and thus alway return the
> > exact same value?
> 
> rt and dl yes but not irq
> 
> But having aligned window doesn't mean that they will all decay.
> One can have been updated just before (during a dequeue as an example)
> or at least less than 1ms before

Now, the thing is, if that update happened in sched/rt, then it wouldn't
have called cpufreq anyway. And once we're idle longer than a period,
they'll all decay at once.

Except indeed that IRQ stuff, which runs out of sync.

That is, I'm just not convinced it matters much if we keep rq->cfs
on the list forever (like UP). Because we'll only stop calling
update_blocked_averages() when everything hits 0, and up until that
point, we'll get one update per period from rq->cfs.

For good measure we can force an update when @done; at that point we
know it's all 0s.

How is something like this?

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 545bcb90b4de..a99ac2aa4a23 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3508,9 +3508,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	if (decayed)
-		cfs_rq_util_change(cfs_rq, 0);
-
 	return decayed;
 }
 
@@ -3620,8 +3617,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
 		update_tg_load_avg(cfs_rq, 0);
 
-	} else if (decayed && (flags & UPDATE_TG))
-		update_tg_load_avg(cfs_rq, 0);
+	} else if (decayed) {
+		cfs_rq_util_change(cfs_rq, 0);
+
+		if (flags & UPDATE_TG)
+			update_tg_load_avg(cfs_rq, 0);
+	}
 }
 
 #ifndef CONFIG_64BIT
@@ -7453,7 +7454,7 @@ static void update_blocked_averages(int cpu)
 	struct cfs_rq *cfs_rq, *pos;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
-	bool done = true;
+	bool done = true, decayed = false;
 
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
@@ -7476,10 +7477,14 @@ static void update_blocked_averages(int cpu)
 	 * list_add_leaf_cfs_rq() for details.
 	 */
 	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
+		bool last = cfs_rq == &rq->cfs;
 		struct sched_entity *se;
 
-		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
+		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
 			update_tg_load_avg(cfs_rq, 0);
+			if (last)
+				decayed = true;
+		}
 
 		/* Propagate pending load changes to the parent, if any: */
 		se = cfs_rq->tg->se[cpu];
@@ -7490,7 +7495,7 @@ static void update_blocked_averages(int cpu)
 		 * There can be a lot of idle CPU cgroups.  Don't let fully
 		 * decayed cfs_rqs linger on the list.
 		 */
-		if (cfs_rq_is_decayed(cfs_rq))
+		if (!last && cfs_rq_is_decayed(cfs_rq))
 			list_del_leaf_cfs_rq(cfs_rq);
 
 		/* Don't need periodic decay once load/util_avg are null */
@@ -7498,6 +7503,9 @@ static void update_blocked_averages(int cpu)
 			done = false;
 	}
 
+	if (decayed || done)
+		cpufreq_update_util(rq, 0);
+
 	update_blocked_load_status(rq, !done);
 	rq_unlock_irqrestore(rq, &rf);
 }
@@ -7555,6 +7563,7 @@ static inline void update_blocked_averages(int cpu)
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
+	bool done, decayed;
 
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
@@ -7568,9 +7577,13 @@ static inline void update_blocked_averages(int cpu)
 	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
 	update_irq_load_avg(rq, 0);
 
-	update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+	decayed = update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+	done = !(cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
 
-	update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
+	if (decayed || done)
+		cpufreq_update_util(rq, 0);
+
+	update_blocked_load_status(rq, !done);
 	rq_unlock_irqrestore(rq, &rf);
 }
Doug Smythies Nov. 15, 2019, 3:30 p.m. UTC | #25
Hi Peter,

On 2019.11.15 05:02 Peter Zijlstra wrote:
> On Fri, Nov 15, 2019 at 12:03:31PM +0100, Vincent Guittot wrote:
>
>> This patch does 2 things:
>> - fix the spurious call to cpufreq just before attaching a task
>
> Right, so that one doesn't concern me too much.
>
>> - make sure cpufreq is still called when cfs is 0 but not irq/rt or dl
>
> But per the rq->has_blocked_load logic we would mostly stop sending
> events once we reach all 0s.
>
> Now, most of those updates will be through _nohz_idle_balance() ->
> update_nohz_stats(), which are remote, which means intel_pstate is
> ignoring them anyway.
>
> Now the _nohz_idle_balance() -> update_blocked_averages() thing runs
> local, and that will update the one random idle CPU we picked to run
> nohz balance, but all others will be left where they were.
>
> So why does intel_pstate care... Esp. on SKL+ with per-core P state this
> is of dubious value.
>
> Also, and maybe I should go read back, why do we care what the P state
> is when we're mostly in C states anyway? These are all idle CPUs,
> otherwise we wouldn't be running update_blocked_averages() on them
> anyway.
>
> Much confusion..

Background:

It is true that this is very likely a rare use case.
Apparently, I can make my test system considerably more "idle"
than most.

For many years now, I have never seen the time between calls,
per CPU, to the intel_pstate driver exceed 4 seconds.

Then, as of:
sched/fair: Fix O(nr_cgroups) in load balance path
the time between calls on an idle system could
be as much as a few hundred seconds. Not knowing much
(anything) about scheduler details, I found this odd,
and so investigated.

And yes, so who cares if we are in deep C states anyhow?
If, for whatever reason, the system is running with
"intel_idle.max_cstate=1", my findings were that
the processor could end up consuming a lot more energy
for a long, long time. Why? Because, at least for my
processor, an older i7-2600K (no HWP), in idle state 1 the
CPU does not relinquish its vote to the PLL, and with
no calls to the driver the requested P-state doesn't decay.

Not previously mentioned: the situation is considerably
exacerbated by this piece of boost code within the intel_pstate
driver (a small worked example follows the snippet):

        /*
         * If the average P-state during the previous cycle was higher than the
         * current target, add 50% of the difference to the target to reduce
         * possible performance oscillations and offset possible performance
         * loss related to moving the workload from one CPU to another within
         * a package/module.
         */
        avg_pstate = get_avg_pstate(cpu);
        if (avg_pstate > target)
                target += (avg_pstate - target) >> 1;
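
To make the effect concrete, here is a minimal standalone sketch of that
arithmetic with made-up numbers (a hypothetical illustration, not taken
from the driver or from any trace). The point is that, once boosted and
with no further calls into the driver, this is the P-state that keeps
being requested:

        #include <stdio.h>

        int main(void)
        {
                int target = 16;     /* assumed requested P-state */
                int avg_pstate = 38; /* assumed average P-state of the previous cycle */

                /* same arithmetic as the driver snippet quoted above */
                if (avg_pstate > target)
                        target += (avg_pstate - target) >> 1;

                /* prints "boosted target: 27" */
                printf("boosted target: %d\n", target);
                return 0;
        }

With no further calls into the driver for hundreds of seconds, that
boosted request just sits there.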

Hope this helps.

... Doug
Vincent Guittot Nov. 15, 2019, 3:31 p.m. UTC | #26
On Fri, 15 Nov 2019 at 16:12, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 15, 2019 at 02:37:27PM +0100, Vincent Guittot wrote:
> > On Fri, 15 Nov 2019 at 14:25, Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Should not all 3 have their windows aligned and thus alway return the
> > > exact same value?
> >
> > rt and dl yes but not irq
> >
> > But having aligned window doesn't mean that they will all decay.
> > One can have been updated just before (during a dequeue as an example)
> > or at least less than 1ms before
>
> Now, the thing is, if that update happened in sched/rt, then it wouldn't
> have called cpufreq anyway. And once we're idle longer than a period,
> they'll all decay at once.
>
> Except indeed that IRQ stuff, which runs out of sync.
>
> That is, I'm just not convinced it matters much if we keep rq->cfs
> on the list forever (like UP). Because we'll only stop calling
> update_blocked_averages() when everything hits 0, and up until that
> point, we'll get one update per period from rq->cfs.
>
> For good measure we can force an update when @done, at that point we
> know all 0s.
>
> How is something like this?
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 545bcb90b4de..a99ac2aa4a23 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3508,9 +3508,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>         cfs_rq->load_last_update_time_copy = sa->last_update_time;
>  #endif
>
> -       if (decayed)
> -               cfs_rq_util_change(cfs_rq, 0);
> -
>         return decayed;
>  }
>
> @@ -3620,8 +3617,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                 attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
>                 update_tg_load_avg(cfs_rq, 0);
>
> -       } else if (decayed && (flags & UPDATE_TG))
> -               update_tg_load_avg(cfs_rq, 0);
> +       } else if (decayed) {
> +               cfs_rq_util_change(cfs_rq, 0);
> +
> +               if (flags & UPDATE_TG)
> +                       update_tg_load_avg(cfs_rq, 0);
> +       }
>  }
>
>  #ifndef CONFIG_64BIT
> @@ -7453,7 +7454,7 @@ static void update_blocked_averages(int cpu)
>         struct cfs_rq *cfs_rq, *pos;
>         const struct sched_class *curr_class;
>         struct rq_flags rf;
> -       bool done = true;
> +       bool done = true, decayed = false;
>
>         rq_lock_irqsave(rq, &rf);
>         update_rq_clock(rq);
> @@ -7476,10 +7477,14 @@ static void update_blocked_averages(int cpu)
>          * list_add_leaf_cfs_rq() for details.
>          */
>         for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> +               bool last = cfs_rq == &rq->cfs;
>                 struct sched_entity *se;
>
> -               if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
> +               if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
>                         update_tg_load_avg(cfs_rq, 0);
> +                       if (last)

using this "last" makes the code more readable

> +                               decayed = true;
> +               }
>
>                 /* Propagate pending load changes to the parent, if any: */
>                 se = cfs_rq->tg->se[cpu];
> @@ -7490,7 +7495,7 @@ static void update_blocked_averages(int cpu)
>                  * There can be a lot of idle CPU cgroups.  Don't let fully
>                  * decayed cfs_rqs linger on the list.
>                  */
> -               if (cfs_rq_is_decayed(cfs_rq))
> +               if (!last && cfs_rq_is_decayed(cfs_rq))
>                         list_del_leaf_cfs_rq(cfs_rq);

Keeping the root cfs_rq in the list will not change anything now that
cfs_rq_util_change() is in update_load_avg():
cfs_rq_util_change() will not be called from there anyway

>
>                 /* Don't need periodic decay once load/util_avg are null */
> @@ -7498,6 +7503,9 @@ static void update_blocked_averages(int cpu)
>                         done = false;
>         }
>
> +       if (decayed || done)

I'm not sure I get why you want to call cpufreq when done is true,
which means that everything has reached 0.
Why do you prefer to use done instead of ORing the decay of rt, dl,
irq and cfs?

> +               cpufreq_update_util(rq, 0);
> +
>         update_blocked_load_status(rq, !done);
>         rq_unlock_irqrestore(rq, &rf);
>  }
> @@ -7555,6 +7563,7 @@ static inline void update_blocked_averages(int cpu)
>         struct cfs_rq *cfs_rq = &rq->cfs;
>         const struct sched_class *curr_class;
>         struct rq_flags rf;
> +       bool done, decayed;
>
>         rq_lock_irqsave(rq, &rf);
>         update_rq_clock(rq);
> @@ -7568,9 +7577,13 @@ static inline void update_blocked_averages(int cpu)
>         update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
>         update_irq_load_avg(rq, 0);
>
> -       update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
> +       decayed = update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
> +       done = !(cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
>
> -       update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
> +       if (decayed || done)
> +               cpufreq_update_util(rq, 0);
> +
> +       update_blocked_load_status(rq, !done);
>         rq_unlock_irqrestore(rq, &rf);
>  }
>
Peter Zijlstra Nov. 15, 2019, 5:43 p.m. UTC | #27
On Fri, Nov 15, 2019 at 04:31:35PM +0100, Vincent Guittot wrote:

> > @@ -7476,10 +7477,14 @@ static void update_blocked_averages(int cpu)
> >          * list_add_leaf_cfs_rq() for details.
> >          */
> >         for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> > +               bool last = cfs_rq == &rq->cfs;
> >                 struct sched_entity *se;
> >
> > -               if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
> > +               if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
> >                         update_tg_load_avg(cfs_rq, 0);
> > +                       if (last)
> 
> using this last make code more readable
> 
> > +                               decayed = true;
> > +               }
> >
> >                 /* Propagate pending load changes to the parent, if any: */
> >                 se = cfs_rq->tg->se[cpu];
> > @@ -7490,7 +7495,7 @@ static void update_blocked_averages(int cpu)
> >                  * There can be a lot of idle CPU cgroups.  Don't let fully
> >                  * decayed cfs_rqs linger on the list.
> >                  */
> > -               if (cfs_rq_is_decayed(cfs_rq))
> > +               if (!last && cfs_rq_is_decayed(cfs_rq))
> >                         list_del_leaf_cfs_rq(cfs_rq);
> 
> Keeping root cfs in the list will not change anything now that
> cfs_rq_util_change is in update_load_avg()
> cfs_rq_util_change will not be called

Oh but it does, since it will then keep triggering that hunk above on
every period.

> >
> >                 /* Don't need periodic decay once load/util_avg are null */
> > @@ -7498,6 +7503,9 @@ static void update_blocked_averages(int cpu)
> >                         done = false;
> >         }
> >
> > +       if (decayed || done)
> 
> I'm not sure to get why you want to call cpufreq when done is true
> which means that everything reaches 0
> Why do you prefer to use done instead of ORing the decay of  rt, dl,
> irq and cfs ?
> 
> > +               cpufreq_update_util(rq, 0);

Because we don't care about the rt,dl,irq decay anywhere else either. We
only call cpufreq_update_util() for rq->cfs changes.

Also, as I argued, realistically rt,dl and cfs decay on the same edge,
so aside from some fuzz on the first period, they're all the same. But
even if they were not, why would we care about their exact edges here
when we don't anywhere else.

Not caring reduces the number of cpufreq_update_util() calls to one per
period, instead of potentially many more.

Doing the || done ensures we never miss the all-0 case.
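
To spell that out, a minimal standalone sketch in plain C (not kernel
code; the two booleans are assumed inputs standing in for "a PELT period
lapsed on rq->cfs" and "all signals are now 0"):

        #include <stdbool.h>
        #include <stdio.h>

        /* mirrors the "decayed || done" condition discussed above */
        static bool fire_cpufreq_update(bool decayed, bool done)
        {
                return decayed || done;
        }

        int main(void)
        {
                /* a period lapsed, blocked load remains: update (prints 1) */
                printf("%d\n", fire_cpufreq_update(true, false));
                /* nothing lapsed this pass, blocked load remains: skip (prints 0) */
                printf("%d\n", fire_cpufreq_update(false, false));
                /* everything has reached 0: one final update (prints 1) */
                printf("%d\n", fire_cpufreq_update(false, true));
                return 0;
        }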
Vincent Guittot Nov. 15, 2019, 6 p.m. UTC | #28
On Fri, 15 Nov 2019 at 18:44, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 15, 2019 at 04:31:35PM +0100, Vincent Guittot wrote:
>
> > > @@ -7476,10 +7477,14 @@ static void update_blocked_averages(int cpu)
> > >          * list_add_leaf_cfs_rq() for details.
> > >          */
> > >         for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
> > > +               bool last = cfs_rq == &rq->cfs;
> > >                 struct sched_entity *se;
> > >
> > > -               if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
> > > +               if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
> > >                         update_tg_load_avg(cfs_rq, 0);
> > > +                       if (last)
> >
> > using this last make code more readable
> >
> > > +                               decayed = true;
> > > +               }
> > >
> > >                 /* Propagate pending load changes to the parent, if any: */
> > >                 se = cfs_rq->tg->se[cpu];
> > > @@ -7490,7 +7495,7 @@ static void update_blocked_averages(int cpu)
> > >                  * There can be a lot of idle CPU cgroups.  Don't let fully
> > >                  * decayed cfs_rqs linger on the list.
> > >                  */
> > > -               if (cfs_rq_is_decayed(cfs_rq))
> > > +               if (!last && cfs_rq_is_decayed(cfs_rq))
> > >                         list_del_leaf_cfs_rq(cfs_rq);
> >
> > Keeping root cfs in the list will not change anything now that
> > cfs_rq_util_change is in update_load_avg()
> > cfs_rq_util_change will not be called
>
> Oh but it does, since it will then keep triggering that hunk above on
> every period.

indeed

>
> > >
> > >                 /* Don't need periodic decay once load/util_avg are null */
> > > @@ -7498,6 +7503,9 @@ static void update_blocked_averages(int cpu)
> > >                         done = false;
> > >         }
> > >
> > > +       if (decayed || done)
> >
> > I'm not sure to get why you want to call cpufreq when done is true
> > which means that everything reaches 0
> > Why do you prefer to use done instead of ORing the decay of  rt, dl,
> > irq and cfs ?
> >
> > > +               cpufreq_update_util(rq, 0);
>
> Because we don't care about the rt,dl,irq decay anywhere else either. We
> only call cpufreq_update_util() for rq->cfs changes.

cpufreq_update_util is called for each enqueue/dequeue of rt/dl tasks

>
> Also, as I argued, realistically rt,dl and cfs decay on the same edge,
> so aside from some fuzz on the first period, they're all the same. But

But the 1st period can be the only one for the next 4 seconds

> even if they were not, why would we care about their exact edges here
> when we do no anywhere else.
>
> Not caring reduces the number of cpufreq_update_util() calls to one per
> period, instead of potentially many more.
>
> Doing the || done ensures never miss the all 0 case.

How can we miss it, according to your explanation above?
Peter Zijlstra Nov. 15, 2019, 8:12 p.m. UTC | #29
On Fri, Nov 15, 2019 at 07:00:45PM +0100, Vincent Guittot wrote:
> On Fri, 15 Nov 2019 at 18:44, Peter Zijlstra <peterz@infradead.org> wrote:

> > Because we don't care about the rt,dl,irq decay anywhere else either. We
> > only call cpufreq_update_util() for rq->cfs changes.
> 
> cpufreq_update_util is called for each enqueue/dequeue of rt/dl tasks

Oh indeed.. OK, so then I suppose rt,dl,cfs makes some sense.

Let me sleep on it.
Peter Zijlstra Nov. 15, 2019, 9:52 p.m. UTC | #30
On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> which might be inefficient when cpufreq driver has rate limitation.
> 
> When a task is attached on a CPU, we have call path:
> 
> update_load_avg()
>   update_cfs_rq_load_avg()
>     cfs_rq_util_change -- > trig frequency update
>   attach_entity_load_avg()
>     cfs_rq_util_change -- > trig frequency update
> 
> The 1st frequency update will not take into account the utilization of the
> newly attached task and the 2nd one might be discard because of rate
> limitation of the cpufreq driver.
> 
> update_cfs_rq_load_avg() is only called by update_blocked_averages()
> and update_load_avg() so we can move the call to
> cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> interesting to notice that update_load_avg() already calls directly
> cfs_rq_util_change() for !SMP case.
> 
> This changes will also ensure that cpufreq_update_util() is called even
> when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> irq, rt or dl pelt signals.
> 
> Reported-by: Doug Smythies <dsmythies@telus.net>
> Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---

OK, but shall we write it like so instead?

---
 kernel/sched/fair.c | 111 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 62 insertions(+), 49 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 545bcb90b4de..7a762266c335 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3508,9 +3508,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif

-	if (decayed)
-		cfs_rq_util_change(cfs_rq, 0);
-
 	return decayed;
 }

@@ -3620,8 +3617,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
 		update_tg_load_avg(cfs_rq, 0);

-	} else if (decayed && (flags & UPDATE_TG))
-		update_tg_load_avg(cfs_rq, 0);
+	} else if (decayed) {
+		cfs_rq_util_change(cfs_rq, 0);
+
+		if (flags & UPDATE_TG)
+			update_tg_load_avg(cfs_rq, 0);
+	}
 }

 #ifndef CONFIG_64BIT
@@ -7428,6 +7429,28 @@ static inline bool others_have_blocked(struct rq *rq) { return false; }
 static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {}
 #endif

+static bool __update_blocked_others(struct rq *rq, bool *done)
+{
+	const struct sched_class *curr_class;
+	u64 now = rq_clock_pelt(rq);
+	bool decayed;
+
+	/*
+	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
+	 * DL and IRQ signals have been updated before updating CFS.
+	 */
+	curr_class = rq->curr->sched_class;
+
+	decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
+		  update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
+		  update_irq_load_avg(rq, 0);
+
+	if (others_have_blocked(rq))
+		*done = false;
+
+	return decayed;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED

 static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
@@ -7447,29 +7470,11 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	return true;
 }

-static void update_blocked_averages(int cpu)
+static bool __update_blocked_fair(struct rq *rq, bool *done)
 {
-	struct rq *rq = cpu_rq(cpu);
 	struct cfs_rq *cfs_rq, *pos;
-	const struct sched_class *curr_class;
-	struct rq_flags rf;
-	bool done = true;
-
-	rq_lock_irqsave(rq, &rf);
-	update_rq_clock(rq);
-
-	/*
-	 * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
-	 * that RT, DL and IRQ signals have been updated before updating CFS.
-	 */
-	curr_class = rq->curr->sched_class;
-	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
-	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
-	update_irq_load_avg(rq, 0);
-
-	/* Don't need periodic decay once load/util_avg are null */
-	if (others_have_blocked(rq))
-		done = false;
+	bool decayed = false;
+	int cpu = cpu_of(rq);

 	/*
 	 * Iterates the task_group tree in a bottom up fashion, see
@@ -7478,9 +7483,13 @@ static void update_blocked_averages(int cpu)
 	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		struct sched_entity *se;

-		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
+		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
 			update_tg_load_avg(cfs_rq, 0);

+			if (cfs_rq == &rq->cfs)
+				decayed = true;
+		}
+
 		/* Propagate pending load changes to the parent, if any: */
 		se = cfs_rq->tg->se[cpu];
 		if (se && !skip_blocked_update(se))
@@ -7495,11 +7504,10 @@ static void update_blocked_averages(int cpu)

 		/* Don't need periodic decay once load/util_avg are null */
 		if (cfs_rq_has_blocked(cfs_rq))
-			done = false;
+			*done = false;
 	}

-	update_blocked_load_status(rq, !done);
-	rq_unlock_irqrestore(rq, &rf);
+	return decayed;
 }

 /*
@@ -7549,29 +7557,16 @@ static unsigned long task_h_load(struct task_struct *p)
 			cfs_rq_load_avg(cfs_rq) + 1);
 }
 #else
-static inline void update_blocked_averages(int cpu)
+static bool __update_blocked_fair(struct rq *rq, bool *done)
 {
-	struct rq *rq = cpu_rq(cpu);
 	struct cfs_rq *cfs_rq = &rq->cfs;
-	const struct sched_class *curr_class;
-	struct rq_flags rf;
-
-	rq_lock_irqsave(rq, &rf);
-	update_rq_clock(rq);
-
-	/*
-	 * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
-	 * that RT, DL and IRQ signals have been updated before updating CFS.
-	 */
-	curr_class = rq->curr->sched_class;
-	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
-	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
-	update_irq_load_avg(rq, 0);
+	bool decayed;

-	update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+	decayed = update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+	if (cfs_rq_has_blocked(cfs_rq))
+		*done = false;

-	update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
-	rq_unlock_irqrestore(rq, &rf);
+	return decayed;
 }

 static unsigned long task_h_load(struct task_struct *p)
@@ -7580,6 +7575,24 @@ static unsigned long task_h_load(struct task_struct *p)
 }
 #endif

+static void update_blocked_averages(int cpu)
+{
+	bool decayed = false, done = true;
+	struct rq *rq = cpu_rq(cpu);
+	struct rq_flags rf;
+
+	rq_lock_irqsave(rq, &rf);
+	update_rq_clock(rq);
+
+	decayed |= __update_blocked_others(rq, &done);
+	decayed |= __update_blocked_fair(rq, &done);
+
+	update_blocked_load_status(rq, !done);
+	if (decayed)
+		cpufreq_update_util(rq, 0);
+	rq_unlock_irqrestore(rq, &rf);
+}
+
 /********** Helpers for find_busiest_group ************************/

 /*
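
One detail worth calling out in __update_blocked_others() above: the
bitwise | means all three update_*_load_avg() calls run for their side
effects even when an earlier one already reported a decay, whereas ||
would short-circuit. A minimal standalone sketch of that difference in
plain C (hypothetical helpers, not kernel code):

        #include <stdbool.h>
        #include <stdio.h>

        static bool update_a(void) { puts("update_a ran"); return true; }
        static bool update_b(void) { puts("update_b ran"); return false; }

        int main(void)
        {
                /* || short-circuits: update_b() is never called here */
                bool lazy = update_a() || update_b();

                /* | evaluates both operands: both updates always run */
                bool eager = update_a() | update_b();

                printf("lazy=%d eager=%d\n", lazy, eager);
                return 0;
        }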
Vincent Guittot Nov. 16, 2019, 8:47 a.m. UTC | #31
On Fri, 15 Nov 2019 at 22:52, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 14, 2019 at 06:07:31PM +0100, Vincent Guittot wrote:
> > update_cfs_rq_load_avg() calls cfs_rq_util_change() everytime pelt decays,
> > which might be inefficient when cpufreq driver has rate limitation.
> >
> > When a task is attached on a CPU, we have call path:
> >
> > update_load_avg()
> >   update_cfs_rq_load_avg()
> >     cfs_rq_util_change -- > trig frequency update
> >   attach_entity_load_avg()
> >     cfs_rq_util_change -- > trig frequency update
> >
> > The 1st frequency update will not take into account the utilization of the
> > newly attached task and the 2nd one might be discard because of rate
> > limitation of the cpufreq driver.
> >
> > update_cfs_rq_load_avg() is only called by update_blocked_averages()
> > and update_load_avg() so we can move the call to
> > cfs_rq_util_change/cpufreq_update_util() into these 2 functions. It's also
> > interesting to notice that update_load_avg() already calls directly
> > cfs_rq_util_change() for !SMP case.
> >
> > This changes will also ensure that cpufreq_update_util() is called even
> > when there is no more CFS rq in the leaf_cfs_rq_list to update but only
> > irq, rt or dl pelt signals.
> >
> > Reported-by: Doug Smythies <dsmythies@telus.net>
> > Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > ---
>
> OK, but shall we write it like so instead?

Yes. Looks good to me

>
> ---
>  kernel/sched/fair.c | 111 +++++++++++++++++++++++++++++-----------------------
>  1 file changed, 62 insertions(+), 49 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 545bcb90b4de..7a762266c335 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3508,9 +3508,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>         cfs_rq->load_last_update_time_copy = sa->last_update_time;
>  #endif
>
> -       if (decayed)
> -               cfs_rq_util_change(cfs_rq, 0);
> -
>         return decayed;
>  }
>
> @@ -3620,8 +3617,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                 attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
>                 update_tg_load_avg(cfs_rq, 0);
>
> -       } else if (decayed && (flags & UPDATE_TG))
> -               update_tg_load_avg(cfs_rq, 0);
> +       } else if (decayed) {
> +               cfs_rq_util_change(cfs_rq, 0);
> +
> +               if (flags & UPDATE_TG)
> +                       update_tg_load_avg(cfs_rq, 0);
> +       }
>  }
>
>  #ifndef CONFIG_64BIT
> @@ -7428,6 +7429,28 @@ static inline bool others_have_blocked(struct rq *rq) { return false; }
>  static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {}
>  #endif
>
> +static bool __update_blocked_others(struct rq *rq, bool *done)
> +{
> +       const struct sched_class *curr_class;
> +       u64 now = rq_clock_pelt(rq);
> +       bool decayed;
> +
> +       /*
> +        * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> +        * DL and IRQ signals have been updated before updating CFS.
> +        */
> +       curr_class = rq->curr->sched_class;
> +
> +       decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> +                 update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> +                 update_irq_load_avg(rq, 0);
> +
> +       if (others_have_blocked(rq))
> +               *done = false;
> +
> +       return decayed;
> +}
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>
>  static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> @@ -7447,29 +7470,11 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>         return true;
>  }
>
> -static void update_blocked_averages(int cpu)
> +static bool __update_blocked_fair(struct rq *rq, bool *done)
>  {
> -       struct rq *rq = cpu_rq(cpu);
>         struct cfs_rq *cfs_rq, *pos;
> -       const struct sched_class *curr_class;
> -       struct rq_flags rf;
> -       bool done = true;
> -
> -       rq_lock_irqsave(rq, &rf);
> -       update_rq_clock(rq);
> -
> -       /*
> -        * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
> -        * that RT, DL and IRQ signals have been updated before updating CFS.
> -        */
> -       curr_class = rq->curr->sched_class;
> -       update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> -       update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> -       update_irq_load_avg(rq, 0);
> -
> -       /* Don't need periodic decay once load/util_avg are null */
> -       if (others_have_blocked(rq))
> -               done = false;
> +       bool decayed = false;
> +       int cpu = cpu_of(rq);
>
>         /*
>          * Iterates the task_group tree in a bottom up fashion, see
> @@ -7478,9 +7483,13 @@ static void update_blocked_averages(int cpu)
>         for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
>                 struct sched_entity *se;
>
> -               if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
> +               if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
>                         update_tg_load_avg(cfs_rq, 0);
>
> +                       if (cfs_rq == &rq->cfs)
> +                               decayed = true;
> +               }
> +
>                 /* Propagate pending load changes to the parent, if any: */
>                 se = cfs_rq->tg->se[cpu];
>                 if (se && !skip_blocked_update(se))
> @@ -7495,11 +7504,10 @@ static void update_blocked_averages(int cpu)
>
>                 /* Don't need periodic decay once load/util_avg are null */
>                 if (cfs_rq_has_blocked(cfs_rq))
> -                       done = false;
> +                       *done = false;
>         }
>
> -       update_blocked_load_status(rq, !done);
> -       rq_unlock_irqrestore(rq, &rf);
> +       return decayed;
>  }
>
>  /*
> @@ -7549,29 +7557,16 @@ static unsigned long task_h_load(struct task_struct *p)
>                         cfs_rq_load_avg(cfs_rq) + 1);
>  }
>  #else
> -static inline void update_blocked_averages(int cpu)
> +static bool __update_blocked_fair(struct rq *rq, bool *done)
>  {
> -       struct rq *rq = cpu_rq(cpu);
>         struct cfs_rq *cfs_rq = &rq->cfs;
> -       const struct sched_class *curr_class;
> -       struct rq_flags rf;
> -
> -       rq_lock_irqsave(rq, &rf);
> -       update_rq_clock(rq);
> -
> -       /*
> -        * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
> -        * that RT, DL and IRQ signals have been updated before updating CFS.
> -        */
> -       curr_class = rq->curr->sched_class;
> -       update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
> -       update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
> -       update_irq_load_avg(rq, 0);
> +       bool decayed;
>
> -       update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
> +       decayed = update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
> +       if (cfs_rq_has_blocked(cfs_rq))
> +               *done = false;
>
> -       update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
> -       rq_unlock_irqrestore(rq, &rf);
> +       return decayed;
>  }
>
>  static unsigned long task_h_load(struct task_struct *p)
> @@ -7580,6 +7575,24 @@ static unsigned long task_h_load(struct task_struct *p)
>  }
>  #endif
>
> +static void update_blocked_averages(int cpu)
> +{
> +       bool decayed = false, done = true;
> +       struct rq *rq = cpu_rq(cpu);
> +       struct rq_flags rf;
> +
> +       rq_lock_irqsave(rq, &rf);
> +       update_rq_clock(rq);
> +
> +       decayed |= __update_blocked_others(rq, &done);
> +       decayed |= __update_blocked_fair(rq, &done);
> +
> +       update_blocked_load_status(rq, !done);
> +       if (decayed)
> +               cpufreq_update_util(rq, 0);
> +       rq_unlock_irqrestore(rq, &rf);
> +}
> +
>  /********** Helpers for find_busiest_group ************************/
>
>  /*
>
Doug Smythies Nov. 16, 2019, 3:07 p.m. UTC | #32
Hi,

I tested both Vincent's V4 and Peter's version, which I called V1.

8000 seconds of intel_pstate_tracer on a very idle system.
Everything was fine.

Vincent-V4:

Maximum time between calls to the intel_pstate driver,
per CPU: 4.00813 seconds: GOOD/PASS

Total entries/exits to/from driver: 90,730:
Consistent with previous tests / expectations.

228,600 aperf clocks / second / cpu.  
Consistent with previous tests / expectations.

Peter-V1:

Maximum time between calls to the intel_pstate driver,
per CPU: 4.00407 seconds: GOOD/PASS

Total entries/exits to/from driver: 96,440:
Consistent with previous tests / expectations.

241,310 aperf clocks / second / cpu.  
Consistent with previous tests / expectations.

Baseline reference:

Maximum time between calls to the intel_pstate driver,
per CPU: 225.03 seconds: BAD/FAIL

Number of durations over an arbitrary threshold of
10 seconds: 379.

Total entries/exits to/from driver: 75,969:
Consistent with previous tests / expectations,
when the issue is present.

226,963 aperf clocks / second / cpu.
Consistent with previous tests / expectations.

I did not do the load/no-load energy test.
It is not possible for it to be a problem
if the long-duration issue is solved.
I'll do the test if someone wants the proof.

Kernels were 5.4-rc7 + linux next.
Idle governor was TEO, but this isn't actually relevant.

Tested-by: Doug Smythies <dsmythies@telus.net>

... Doug
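
For reference: the per-CPU "maximum time between calls" figure above
boils down to a max-of-gaps pass over the per-CPU call timestamps; a
minimal standalone sketch with made-up numbers (a hypothetical
illustration, not the actual intel_pstate_tracer post-processing):

        #include <stdio.h>

        int main(void)
        {
                /* hypothetical call timestamps for one CPU, in seconds */
                double t[] = { 0.0, 3.98, 7.97, 12.01, 16.02 };
                int i, n = sizeof(t) / sizeof(t[0]);
                double max_gap = 0.0;

                for (i = 1; i < n; i++) {
                        double gap = t[i] - t[i - 1];
                        if (gap > max_gap)
                                max_gap = gap;
                }

                /* prints "max gap: 4.04 s" */
                printf("max gap: %.2f s\n", max_gap);
                return 0;
        }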
diff mbox series

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69a81a5..3be44e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3504,9 +3504,6 @@  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 	cfs_rq->load_last_update_time_copy = sa->last_update_time;
 #endif
 
-	if (decayed)
-		cfs_rq_util_change(cfs_rq, 0);
-
 	return decayed;
 }
 
@@ -3616,8 +3613,12 @@  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
 		update_tg_load_avg(cfs_rq, 0);
 
-	} else if (decayed && (flags & UPDATE_TG))
-		update_tg_load_avg(cfs_rq, 0);
+	} else if (decayed) {
+		cfs_rq_util_change(cfs_rq, 0);
+
+		if (flags & UPDATE_TG)
+			update_tg_load_avg(cfs_rq, 0);
+	}
 }
 
 #ifndef CONFIG_64BIT
@@ -7543,18 +7544,19 @@  static void update_blocked_averages(int cpu)
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
 	bool done = true;
+	int decayed;
 
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
 	/*
-	 * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
-	 * that RT, DL and IRQ signals have been updated before updating CFS.
+	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
+	 * DL and IRQ signals have been updated before updating CFS.
 	 */
 	curr_class = rq->curr->sched_class;
-	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
-	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
-	update_irq_load_avg(rq, 0);
+	decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+	decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
+	decayed |= update_irq_load_avg(rq, 0);
 
 	/* Don't need periodic decay once load/util_avg are null */
 	if (others_have_blocked(rq))
@@ -7567,9 +7569,13 @@  static void update_blocked_averages(int cpu)
 	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
 		struct sched_entity *se;
 
-		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
+		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
 			update_tg_load_avg(cfs_rq, 0);
 
+			if (cfs_rq == &rq->cfs)
+				decayed = 1;
+		}
+
 		/* Propagate pending load changes to the parent, if any: */
 		se = cfs_rq->tg->se[cpu];
 		if (se && !skip_blocked_update(se))
@@ -7588,6 +7594,9 @@  static void update_blocked_averages(int cpu)
 	}
 
 	update_blocked_load_status(rq, !done);
+
+	if (decayed)
+		cpufreq_update_util(rq, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }
 
@@ -7644,22 +7653,22 @@  static inline void update_blocked_averages(int cpu)
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	const struct sched_class *curr_class;
 	struct rq_flags rf;
+	int decayed;
 
 	rq_lock_irqsave(rq, &rf);
 	update_rq_clock(rq);
 
-	/*
-	 * update_cfs_rq_load_avg() can call cpufreq_update_util(). Make sure
-	 * that RT, DL and IRQ signals have been updated before updating CFS.
-	 */
 	curr_class = rq->curr->sched_class;
-	update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
-	update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
-	update_irq_load_avg(rq, 0);
+	decayed = update_rt_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &rt_sched_class);
+	decayed |= update_dl_rq_load_avg(rq_clock_pelt(rq), rq, curr_class == &dl_sched_class);
+	decayed |= update_irq_load_avg(rq, 0);
 
-	update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
+	decayed |= update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq);
 
 	update_blocked_load_status(rq, cfs_rq_has_blocked(cfs_rq) || others_have_blocked(rq));
+
+	if (decayed)
+		cpufreq_update_util(rq, 0);
 	rq_unlock_irqrestore(rq, &rf);
 }