diff mbox

[1/2] sched/fair: move cpufreq hook to update_cfs_rq_load_avg()

Message ID 20160413000824.GB22643@graphite.smuckle.net (mailing list archive)
State Not Applicable, archived
Headers show

Commit Message

Steve Muckle April 13, 2016, 12:08 a.m. UTC
On Tue, Apr 12, 2016 at 04:29:06PM +0200, Rafael J. Wysocki wrote:
> This is rather fundamental.
> 
> For example, if you look at cpufreq_update_util(), it does this:
> 
> data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
> 
> meaning that it will run the current CPU's utilization update
> callback.  Of course, that won't work cross-CPU, because in principle
> different CPUs may use different governors and therefore different
> util update callbacks.

Will something like the attached (unfinished patches) work? It seems
to for me, but I haven't tested it much beyond confirming the hook is
working on remote wakeups.

I'm relying on the previous comment that it's up to cpufreq drivers to
run stuff on the target policy's CPUs if the driver needs that.

There's still some more work, fixing up some more smp_processor_id()
usage in schedutil, but it should be easy (trace, slow path irq_work
target).

> If you want to do remote updates, I guess that will require an
> irq_work to run the update on the target CPU, but then you'll probably
> want to neglect the rate limit on it as well, so it looks like a
> "need_update" flag in struct update_util_data will be useful for that.

Why is it required to run the update on the target CPU?

thanks,
Steve

Comments

Rafael J. Wysocki April 13, 2016, 4:48 a.m. UTC | #1
On Wed, Apr 13, 2016 at 2:08 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On Tue, Apr 12, 2016 at 04:29:06PM +0200, Rafael J. Wysocki wrote:
>> This is rather fundamental.
>>
>> For example, if you look at cpufreq_update_util(), it does this:
>>
>> data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
>>
>> meaning that it will run the current CPU's utilization update
>> callback.  Of course, that won't work cross-CPU, because in principle
>> different CPUs may use different governors and therefore different
>> util update callbacks.
>
> Will something like the attached (unfinished patches) work? It seems
> to for me, but I haven't tested it much beyond confirming the hook is
> working on remote wakeups.

No, they are not sufficient.

First of all, you need to take all of the governors into account and
they all make assumptions about updates being run on the CPU being
updated.

That should be easy to take into account for ondemand/conservative,
but intel_pstate is a different story.

> I'm relying on the previous comment that it's up to cpufreq drivers to
> run stuff on the target policy's CPUs if the driver needs that.

That's not the case for the fast frequency switching though, which has
to happen on the CPU running the code.

> There's still some more work, fixing up some more smp_processor_id()
> usage in schedutil, but it should be easy (trace, slow path irq_work
> target).
>
>> If you want to do remote updates, I guess that will require an
>> irq_work to run the update on the target CPU, but then you'll probably
>> want to neglect the rate limit on it as well, so it looks like a
>> "need_update" flag in struct update_util_data will be useful for that.
>
> Why is it required to run the update on the target CPU?

The fast switching and intel_pstate are the main reason.

They both have to write to registers of the target CPU and the code to
do that needs to run on that CPU.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki April 13, 2016, 4:05 p.m. UTC | #2
On Wed, Apr 13, 2016 at 6:48 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Apr 13, 2016 at 2:08 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
>> On Tue, Apr 12, 2016 at 04:29:06PM +0200, Rafael J. Wysocki wrote:
>>> This is rather fundamental.
>>>
>>> For example, if you look at cpufreq_update_util(), it does this:
>>>
>>> data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
>>>
>>> meaning that it will run the current CPU's utilization update
>>> callback.  Of course, that won't work cross-CPU, because in principle
>>> different CPUs may use different governors and therefore different
>>> util update callbacks.
>>
>> Will something like the attached (unfinished patches) work? It seems
>> to for me, but I haven't tested it much beyond confirming the hook is
>> working on remote wakeups.
>
> No, they are not sufficient.
>
> First of all, you need to take all of the governors into account and
> they all make assumptions about updates being run on the CPU being
> updated.
>
> That should be easy to take into account for ondemand/conservative,
> but intel_pstate is a different story.
>
>> I'm relying on the previous comment that it's up to cpufreq drivers to
>> run stuff on the target policy's CPUs if the driver needs that.
>
> That's not the case for the fast frequency switching though, which has
> to happen on the CPU running the code.
>
>> There's still some more work, fixing up some more smp_processor_id()
>> usage in schedutil, but it should be easy (trace, slow path irq_work
>> target).
>>
>>> If you want to do remote updates, I guess that will require an
>>> irq_work to run the update on the target CPU, but then you'll probably
>>> want to neglect the rate limit on it as well, so it looks like a
>>> "need_update" flag in struct update_util_data will be useful for that.
>>
>> Why is it required to run the update on the target CPU?
>
> The fast switching and intel_pstate are the main reason.
>
> They both have to write to registers of the target CPU and the code to
> do that needs to run on that CPU.

And these two seem to be the only interesting cases for you, because
if you need to work for the worker thread to schedule to eventually
change the CPU frequency for you, that will defeat the whole purpose
here.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki April 13, 2016, 4:07 p.m. UTC | #3
On Wed, Apr 13, 2016 at 6:05 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Apr 13, 2016 at 6:48 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Wed, Apr 13, 2016 at 2:08 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
>>> On Tue, Apr 12, 2016 at 04:29:06PM +0200, Rafael J. Wysocki wrote:
>>>> This is rather fundamental.
>>>>
>>>> For example, if you look at cpufreq_update_util(), it does this:
>>>>
>>>> data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
>>>>
>>>> meaning that it will run the current CPU's utilization update
>>>> callback.  Of course, that won't work cross-CPU, because in principle
>>>> different CPUs may use different governors and therefore different
>>>> util update callbacks.
>>>
>>> Will something like the attached (unfinished patches) work? It seems
>>> to for me, but I haven't tested it much beyond confirming the hook is
>>> working on remote wakeups.
>>
>> No, they are not sufficient.
>>
>> First of all, you need to take all of the governors into account and
>> they all make assumptions about updates being run on the CPU being
>> updated.
>>
>> That should be easy to take into account for ondemand/conservative,
>> but intel_pstate is a different story.
>>
>>> I'm relying on the previous comment that it's up to cpufreq drivers to
>>> run stuff on the target policy's CPUs if the driver needs that.
>>
>> That's not the case for the fast frequency switching though, which has
>> to happen on the CPU running the code.
>>
>>> There's still some more work, fixing up some more smp_processor_id()
>>> usage in schedutil, but it should be easy (trace, slow path irq_work
>>> target).
>>>
>>>> If you want to do remote updates, I guess that will require an
>>>> irq_work to run the update on the target CPU, but then you'll probably
>>>> want to neglect the rate limit on it as well, so it looks like a
>>>> "need_update" flag in struct update_util_data will be useful for that.
>>>
>>> Why is it required to run the update on the target CPU?
>>
>> The fast switching and intel_pstate are the main reason.
>>
>> They both have to write to registers of the target CPU and the code to
>> do that needs to run on that CPU.
>
> And these two seem to be the only interesting cases for you, because
> if you need to work for the worker thread to schedule to eventually

s/work/wait/ (sorry)

> change the CPU frequency for you, that will defeat the whole purpose
> here.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Muckle April 13, 2016, 6:06 p.m. UTC | #4
On 04/13/2016 09:07 AM, Rafael J. Wysocki wrote:
>>>>> If you want to do remote updates, I guess that will require an
>>>>> irq_work to run the update on the target CPU, but then you'll probably
>>>>> want to neglect the rate limit on it as well, so it looks like a
>>>>> "need_update" flag in struct update_util_data will be useful for that.

Have you added rate limiting at the hook level that I missed? I thought
it was just inside schedutil.

>>>>
>>>> Why is it required to run the update on the target CPU?
>>>
>>> The fast switching and intel_pstate are the main reason.
>>>
>>> They both have to write to registers of the target CPU and the code to
>>> do that needs to run on that CPU.

Ok thanks, I'll take another look at this.

I was thinking it might be nice to be able to push the decision on
whether to send the IPI in to the governor/hook client. For example in
the schedutil case, you don't need to IPI if sugov_should_update_freq()
= false (outside the slight chance it might be true when it runs on the
target). Beyond that perhaps for policy reasons it's desired to not send
the IPI if next_freq <= cur_freq, etc.

>> And these two seem to be the only interesting cases for you, because
>> if you need to work for the worker thread to schedule to eventually
> 
> s/work/wait/ (sorry)
> 
>> change the CPU frequency for you, that will defeat the whole purpose
>> here.

I was hoping to submit at some point a patch to change the context for
slow path frequency changes to RT or DL context, so this would benefit
that case as well.

thanks,
steve
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki April 13, 2016, 7:50 p.m. UTC | #5
On Wed, Apr 13, 2016 at 8:06 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 04/13/2016 09:07 AM, Rafael J. Wysocki wrote:
>>>>>> If you want to do remote updates, I guess that will require an
>>>>>> irq_work to run the update on the target CPU, but then you'll probably
>>>>>> want to neglect the rate limit on it as well, so it looks like a
>>>>>> "need_update" flag in struct update_util_data will be useful for that.
>
> Have you added rate limiting at the hook level that I missed? I thought
> it was just inside schedutil.

It is in schedutil (and other governors), but if you do a cross-CPU
update, you probably want that rate limit to be ignored in that case.
Now, if the local and target CPUs happen to use different governors
(eg. the local CPU uses ondemand and the target one uses schedutil) or
they just don't belong to the same policy, you need to set the "need
update" flag for the target CPU, so the local one needs access to it.
It is better for that flag to be located in the per-CPU data of the
target CPU for that.

>>>>>
>>>>> Why is it required to run the update on the target CPU?
>>>>
>>>> The fast switching and intel_pstate are the main reason.
>>>>
>>>> They both have to write to registers of the target CPU and the code to
>>>> do that needs to run on that CPU.
>
> Ok thanks, I'll take another look at this.
>
> I was thinking it might be nice to be able to push the decision on
> whether to send the IPI in to the governor/hook client. For example in
> the schedutil case, you don't need to IPI if sugov_should_update_freq()
> = false (outside the slight chance it might be true when it runs on the
> target). Beyond that perhaps for policy reasons it's desired to not send
> the IPI if next_freq <= cur_freq, etc.

Yes, that is an option, but then your governor code gets more
complicated.  Since every governor would need that complexity, you'd
end up having it in multiple places.  To me, it seems more efficient
to just have it in one place (the code that triggers a cross-CPU
update).

And as I said, the rate limit would need to be overridden in the
cross-CPU update case anyway, because it may just prevent you from
getting what you want otherwise.

>>> And these two seem to be the only interesting cases for you, because
>>> if you need to work for the worker thread to schedule to eventually
>>
>> s/work/wait/ (sorry)
>>
>>> change the CPU frequency for you, that will defeat the whole purpose
>>> here.
>
> I was hoping to submit at some point a patch to change the context for
> slow path frequency changes to RT or DL context, so this would benefit
> that case as well.

But it still would require the worker thread to schedule, although it
might just occur a bit earlier if that's DL/RT.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Steve Muckle April 20, 2016, 2:22 a.m. UTC | #6
On Wed, Apr 13, 2016 at 09:50:19PM +0200, Rafael J. Wysocki wrote:
> On Wed, Apr 13, 2016 at 8:06 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
> > On 04/13/2016 09:07 AM, Rafael J. Wysocki wrote:
> >>>>>> If you want to do remote updates, I guess that will require an
> >>>>>> irq_work to run the update on the target CPU, but then you'll probably
> >>>>>> want to neglect the rate limit on it as well, so it looks like a
> >>>>>> "need_update" flag in struct update_util_data will be useful for that.
> >
> > Have you added rate limiting at the hook level that I missed? I thought
> > it was just inside schedutil.
> 
> It is in schedutil (and other governors), but if you do a cross-CPU
> update, you probably want that rate limit to be ignored in that case.
> Now, if the local and target CPUs happen to use different governors
> (eg. the local CPU uses ondemand and the target one uses schedutil) or
> they just don't belong to the same policy, you need to set the "need
> update" flag for the target CPU, so the local one needs access to it.
> It is better for that flag to be located in the per-CPU data of the
> target CPU for that.

It's not clear to me whether remote updates are necessarily more
important than updates triggered locally, such that they would
override the rate limit while local ones do not. My guess is this
special treatment could be left out at least initially.

> > Ok thanks, I'll take another look at this.
> >
> > I was thinking it might be nice to be able to push the decision on
> > whether to send the IPI in to the governor/hook client. For example in
> > the schedutil case, you don't need to IPI if sugov_should_update_freq()
> > = false (outside the slight chance it might be true when it runs on the
> > target). Beyond that perhaps for policy reasons it's desired to not send
> > the IPI if next_freq <= cur_freq, etc.
> 
> Yes, that is an option, but then your governor code gets more
> complicated.  Since every governor would need that complexity, you'd
> end up having it in multiple places.  To me, it seems more efficient
> to just have it in one place (the code that triggers a cross-CPU
> update).

More efficient in terms of lines of code perhaps but it'd be really
nice to save on IPIs by not sending unnecessary ones. I went ahead and
tried to address the issues in the governors so we could see what it
would look like. I'll send that as a separate RFC.

It also occurred to me that even when the governors filter out the
needless IPIs, it may still end up being unnecessary if the scheduler
is going to send a resched IPI to the target CPU. Need to think about
that some more.

...
> >>> And these two seem to be the only interesting cases for you, because
> >>> if you need to work for the worker thread to schedule to eventually
> >>
> >> s/work/wait/ (sorry)
> >>
> >>> change the CPU frequency for you, that will defeat the whole purpose
> >>> here.
> >
> > I was hoping to submit at some point a patch to change the context for
> > slow path frequency changes to RT or DL context, so this would benefit
> > that case as well.
> 
> But it still would require the worker thread to schedule, although it
> might just occur a bit earlier if that's DL/RT.

Scheduling latency should be very short for DL/RT tasks, I would
expect 10s of usec or less? Still much better than up to a tick.

thanks,
Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

From ae6eb75162f11674cbdc569466e3def4e0eed077 Mon Sep 17 00:00:00 2001
From: Steve Muckle <smuckle@linaro.org>
Date: Tue, 12 Apr 2016 15:19:34 -0700
Subject: [PATCH 2/2] sched/fair: call cpufreq hook for remote wakeups

Without calling the cpufreq hook for a remote wakeup it is possible
for such a wakeup to go unnoticed by cpufreq on the target CPU for up
to a full tick.

Signed-off-by: Steve Muckle <smuckle@linaro.org>
---
 kernel/sched/fair.c  | 8 +++-----
 kernel/sched/sched.h | 7 ++++---
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b06c1e938cb9..d21a80a44b6e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2826,15 +2826,13 @@  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 	struct rq *rq = rq_of(cfs_rq);
 	int cpu = cpu_of(rq);
 
-	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+	if (&rq->cfs == cfs_rq) {
 		unsigned long max = rq->cpu_capacity_orig;
 
 		/*
 		 * There are a few boundary cases this might miss but it should
 		 * get called often enough that that should (hopefully) not be
-		 * a real problem -- added to that it only calls on the local
-		 * CPU, so if we enqueue remotely we'll miss an update, but
-		 * the next tick/schedule should update.
+		 * a real problem.
 		 *
 		 * It will not get called when we go idle, because the idle
 		 * thread is a different class (!fair), nor will the utilization
@@ -2845,7 +2843,7 @@  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
 		 *
 		 * See cpu_util().
 		 */
-		cpufreq_update_util(rq_clock(rq),
+		cpufreq_update_util(cpu, rq_clock(rq),
 				    min(cfs_rq->avg.util_avg, max), max);
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 921d6e5d33b7..01faa0781099 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1808,11 +1808,12 @@  DECLARE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
  *
  * It can only be called from RCU-sched read-side critical sections.
  */
-static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+static inline void cpufreq_update_util(int cpu, u64 time, unsigned long util,
+				       unsigned long max)
 {
        struct update_util_data *data;
 
-       data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
+       data = rcu_dereference_sched(per_cpu(cpufreq_update_util_data, cpu));
        if (data)
                data->func(data, time, util, max);
 }
@@ -1835,7 +1836,7 @@  static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned lo
  */
 static inline void cpufreq_trigger_update(u64 time)
 {
-	cpufreq_update_util(time, ULONG_MAX, 0);
+	cpufreq_update_util(smp_processor_id(), time, ULONG_MAX, 0);
 }
 #else
 static inline void cpufreq_update_util(u64 time, unsigned long util, unsigned long max) {}
-- 
2.4.10