
[0/3] cpufreq: Replace timers with utilization update callbacks

Message ID 5387313.xAhVpzgZCg@vostro.rjw.lan (mailing list archive)
State Not Applicable, archived

Commit Message

Rafael J. Wysocki Feb. 9, 2016, 8:05 p.m. UTC
On Tuesday, February 09, 2016 02:01:39 AM Rafael J. Wysocki wrote:
> On Tue, Feb 9, 2016 at 1:39 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> > Hi Rafael,
> >
> > On 02/08/2016 03:06 PM, Rafael J. Wysocki wrote:
> >> Now that all review comments have been addressed in patch [3/3], I'm going to
> >> put this series into linux-next.
> >>
> >> There already is 20+ patches on top of it in the queue including fixes for
> >> bugs that have haunted us for quite some time (and that functionally depend on
> >> this set) and I'd really like all that to get enough linux-next coverage, so
> >> there really isn't more time to wait.
> >
> > Sorry for the late reply. As Juri mentioned I was OOO last week and
> > really just got to look at this today.
> >
> > One concern I had was, given that the lone scheduler update hook is in
> > CFS, is it possible for governor updates to be stalled due to RT or DL
> > task activity?
> 
> I don't think they may be completely stalled, but I'd prefer Peter to
> answer that as he suggested to do it this way.

In any case, if that concern turns out to be significant in practice, it may
be addressed like in the appended modification of patch [1/3] from the $subject
series.

With that things look like before from the cpufreq side, but the other sched
classes also get a chance to trigger a cpufreq update.  The drawback is the
cpu_clock() call instead of passing the time value from update_load_avg(), but
I guess we can live with that if necessary.

FWIW, this modification doesn't seem to break things on my test machine.

Thanks,
Rafael


Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/cpufreq.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/cpufreq.h   |    7 +++++++
 include/linux/sched.h     |    7 +++++++
 kernel/sched/deadline.c   |    3 +++
 kernel/sched/fair.c       |   29 ++++++++++++++++++++++++++++-
 kernel/sched/rt.c         |    3 +++
 6 files changed, 92 insertions(+), 1 deletion(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Steve Muckle Feb. 10, 2016, 1:02 a.m. UTC | #1
On 02/09/2016 12:05 PM, Rafael J. Wysocki wrote:
>>> One concern I had was, given that the lone scheduler update hook is in
>>> CFS, is it possible for governor updates to be stalled due to RT or DL
>>> task activity?
>>
>> I don't think they may be completely stalled, but I'd prefer Peter to
>> answer that as he suggested to do it this way.
> 
> In any case, if that concern turns out to be significant in practice, it may
> be addressed like in the appended modification of patch [1/3] from the $subject
> series.
> 
> With that things look like before from the cpufreq side, but the other sched
> classes also get a chance to trigger a cpufreq update.  The drawback is the
> cpu_clock() call instead of passing the time value from update_load_avg(), but
> I guess we can live with that if necessary.
> 
> FWIW, this modification doesn't seem to break things on my test machine.
> 
...
> Index: linux-pm/kernel/sched/rt.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/rt.c
> +++ linux-pm/kernel/sched/rt.c
> @@ -2212,6 +2212,9 @@ static void task_tick_rt(struct rq *rq,
>  
>  	update_curr_rt(rq);
>  
> +	/* Kick cpufreq to prevent it from stalling. */
> +	cpufreq_kick();
> +
>  	watchdog(rq, p);
>  
>  	/*
> Index: linux-pm/kernel/sched/deadline.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/deadline.c
> +++ linux-pm/kernel/sched/deadline.c
> @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
>  {
>  	update_curr_dl(rq);
>  
> +	/* Kick cpufreq to prevent it from stalling. */
> +	cpufreq_kick();
> +
>  	/*
>  	 * Even when we have runtime, update_curr_dl() might have resulted in us
>  	 * not being the leftmost task anymore. In that case NEED_RESCHED will

I think additional hooks such as enqueue/dequeue would be needed in
RT/DL. The task tick callbacks will only run if a task in that class is
executing at the time of the tick. There could be intermittent RT/DL
task activity in a frequency domain (the only task activity there, no
CFS tasks) that doesn't happen to overlap the tick. Worst case the task
activity could be periodic in such a way that it never overlaps the tick
and the update is never made.

thanks,
steve
Rafael J. Wysocki Feb. 10, 2016, 1:57 a.m. UTC | #2
On Wed, Feb 10, 2016 at 2:02 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 02/09/2016 12:05 PM, Rafael J. Wysocki wrote:
>>>> One concern I had was, given that the lone scheduler update hook is in
>>>> CFS, is it possible for governor updates to be stalled due to RT or DL
>>>> task activity?
>>>
>>> I don't think they may be completely stalled, but I'd prefer Peter to
>>> answer that as he suggested to do it this way.
>>
>> In any case, if that concern turns out to be significant in practice, it may
>> be addressed like in the appended modification of patch [1/3] from the $subject
>> series.
>>
>> With that things look like before from the cpufreq side, but the other sched
>> classes also get a chance to trigger a cpufreq update.  The drawback is the
>> cpu_clock() call instead of passing the time value from update_load_avg(), but
>> I guess we can live with that if necessary.
>>
>> FWIW, this modification doesn't seem to break things on my test machine.
>>
> ...
>> Index: linux-pm/kernel/sched/rt.c
>> ===================================================================
>> --- linux-pm.orig/kernel/sched/rt.c
>> +++ linux-pm/kernel/sched/rt.c
>> @@ -2212,6 +2212,9 @@ static void task_tick_rt(struct rq *rq,
>>
>>       update_curr_rt(rq);
>>
>> +     /* Kick cpufreq to prevent it from stalling. */
>> +     cpufreq_kick();
>> +
>>       watchdog(rq, p);
>>
>>       /*
>> Index: linux-pm/kernel/sched/deadline.c
>> ===================================================================
>> --- linux-pm.orig/kernel/sched/deadline.c
>> +++ linux-pm/kernel/sched/deadline.c
>> @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
>>  {
>>       update_curr_dl(rq);
>>
>> +     /* Kick cpufreq to prevent it from stalling. */
>> +     cpufreq_kick();
>> +
>>       /*
>>        * Even when we have runtime, update_curr_dl() might have resulted in us
>>        * not being the leftmost task anymore. In that case NEED_RESCHED will
>
> I think additional hooks such as enqueue/dequeue would be needed in
> RT/DL. The task tick callbacks will only run if a task in that class is
> executing at the time of the tick. There could be intermittent RT/DL
> task activity in a frequency domain (the only task activity there, no
> CFS tasks) that doesn't happen to overlap the tick. Worst case the task
> activity could be periodic in such a way that it never overlaps the tick
> and the update is never made.

So if I'm reading this correctly, it would be better to put the hooks
into update_curr_rt/dl()?

Thanks,
Rafael
Rafael J. Wysocki Feb. 10, 2016, 3:09 a.m. UTC | #3
On Wed, Feb 10, 2016 at 2:57 AM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Feb 10, 2016 at 2:02 AM, Steve Muckle <steve.muckle@linaro.org> wrote:
>> On 02/09/2016 12:05 PM, Rafael J. Wysocki wrote:
>>>>> One concern I had was, given that the lone scheduler update hook is in
>>>>> CFS, is it possible for governor updates to be stalled due to RT or DL
>>>>> task activity?
>>>>
>>>> I don't think they may be completely stalled, but I'd prefer Peter to
>>>> answer that as he suggested to do it this way.
>>>
>>> In any case, if that concern turns out to be significant in practice, it may
>>> be addressed like in the appended modification of patch [1/3] from the $subject
>>> series.
>>>
>>> With that things look like before from the cpufreq side, but the other sched
>>> classes also get a chance to trigger a cpufreq update.  The drawback is the
>>> cpu_clock() call instead of passing the time value from update_load_avg(), but
>>> I guess we can live with that if necessary.
>>>
>>> FWIW, this modification doesn't seem to break things on my test machine.
>>>
>> ...
>>> Index: linux-pm/kernel/sched/rt.c
>>> ===================================================================
>>> --- linux-pm.orig/kernel/sched/rt.c
>>> +++ linux-pm/kernel/sched/rt.c
>>> @@ -2212,6 +2212,9 @@ static void task_tick_rt(struct rq *rq,
>>>
>>>       update_curr_rt(rq);
>>>
>>> +     /* Kick cpufreq to prevent it from stalling. */
>>> +     cpufreq_kick();
>>> +
>>>       watchdog(rq, p);
>>>
>>>       /*
>>> Index: linux-pm/kernel/sched/deadline.c
>>> ===================================================================
>>> --- linux-pm.orig/kernel/sched/deadline.c
>>> +++ linux-pm/kernel/sched/deadline.c
>>> @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
>>>  {
>>>       update_curr_dl(rq);
>>>
>>> +     /* Kick cpufreq to prevent it from stalling. */
>>> +     cpufreq_kick();
>>> +
>>>       /*
>>>        * Even when we have runtime, update_curr_dl() might have resulted in us
>>>        * not being the leftmost task anymore. In that case NEED_RESCHED will
>>
>> I think additional hooks such as enqueue/dequeue would be needed in
>> RT/DL. The task tick callbacks will only run if a task in that class is
>> executing at the time of the tick. There could be intermittent RT/DL
>> task activity in a frequency domain (the only task activity there, no
>> CFS tasks) that doesn't happen to overlap the tick. Worst case the task
>> activity could be periodic in such a way that it never overlaps the tick
>> and the update is never made.
>
> So if I'm reading this correctly, it would be better to put the hooks
> into update_curr_rt/dl()?

If done this way, I guess we may pass rq_clock_task(rq) as the time
arg to cpufreq_update_util() from there and then the cpu_clock() call
I've added to this prototype won't be necessary any more.

Thanks,
Rafael
Juri Lelli Feb. 10, 2016, 12:33 p.m. UTC | #4
Hi Rafael,

On 09/02/16 21:05, Rafael J. Wysocki wrote:

[...]

> +/**
> + * cpufreq_update_util - Take a note about CPU utilization changes.
> + * @util: Current utilization.
> + * @max: Utilization ceiling.
> + *
> + * This function is called by the scheduler on every invocation of
> + * update_load_avg() on the CPU whose utilization is being updated.
> + */
> +void cpufreq_update_util(unsigned long util, unsigned long max)
> +{
> +	struct update_util_data *data;
> +
> +	rcu_read_lock();
> +
> +	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
> +	if (data && data->func)
> +		data->func(data, cpu_clock(smp_processor_id()), util, max);

Are util and max used anywhere? It seems to me that cpu_clock is used by
the callbacks to check if the sampling period is elapsed, but I couldn't
yet find who is using util and max.

Thanks,

- Juri
Rafael J. Wysocki Feb. 10, 2016, 1:23 p.m. UTC | #5
On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> Hi Rafael,
>
> On 09/02/16 21:05, Rafael J. Wysocki wrote:
>
> [...]
>
>> +/**
>> + * cpufreq_update_util - Take a note about CPU utilization changes.
>> + * @util: Current utilization.
>> + * @max: Utilization ceiling.
>> + *
>> + * This function is called by the scheduler on every invocation of
>> + * update_load_avg() on the CPU whose utilization is being updated.
>> + */
>> +void cpufreq_update_util(unsigned long util, unsigned long max)
>> +{
>> +     struct update_util_data *data;
>> +
>> +     rcu_read_lock();
>> +
>> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
>> +     if (data && data->func)
>> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
>
> Are util and max used anywhere?

They aren't yet, but they will be.

Maybe not in this cycle (if it takes too much time to integrate the
preliminary changes), but we definitely are going to use those
numbers.

Thanks,
Rafael
Juri Lelli Feb. 10, 2016, 2:03 p.m. UTC | #6
On 10/02/16 14:23, Rafael J. Wysocki wrote:
> On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> > Hi Rafael,
> >
> > On 09/02/16 21:05, Rafael J. Wysocki wrote:
> >
> > [...]
> >
> >> +/**
> >> + * cpufreq_update_util - Take a note about CPU utilization changes.
> >> + * @util: Current utilization.
> >> + * @max: Utilization ceiling.
> >> + *
> >> + * This function is called by the scheduler on every invocation of
> >> + * update_load_avg() on the CPU whose utilization is being updated.
> >> + */
> >> +void cpufreq_update_util(unsigned long util, unsigned long max)
> >> +{
> >> +     struct update_util_data *data;
> >> +
> >> +     rcu_read_lock();
> >> +
> >> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
> >> +     if (data && data->func)
> >> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
> >
> > Are util and max used anywhere?
> 
> They aren't yet, but they will be.
> 
> Maybe not in this cycle (it it takes too much time to integrate the
> preliminary changes), but we definitely are going to use those
> numbers.
> 

Oh OK. However, I was under the impression that this set was only
proposing a way to get rid of timers and use the scheduler as heartbeat
for cpufreq governors. The governors' sample-based approach wouldn't
change, though. Am I wrong in assuming this?

Also, is linux-pm/bleeding-edge the one I want to fetch to try this set
out?

Thanks,

- Juri
Rafael J. Wysocki Feb. 10, 2016, 2:26 p.m. UTC | #7
On Wed, Feb 10, 2016 at 3:03 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> On 10/02/16 14:23, Rafael J. Wysocki wrote:
>> On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
>> > Hi Rafael,
>> >
>> > On 09/02/16 21:05, Rafael J. Wysocki wrote:
>> >
>> > [...]
>> >
>> >> +/**
>> >> + * cpufreq_update_util - Take a note about CPU utilization changes.
>> >> + * @util: Current utilization.
>> >> + * @max: Utilization ceiling.
>> >> + *
>> >> + * This function is called by the scheduler on every invocation of
>> >> + * update_load_avg() on the CPU whose utilization is being updated.
>> >> + */
>> >> +void cpufreq_update_util(unsigned long util, unsigned long max)
>> >> +{
>> >> +     struct update_util_data *data;
>> >> +
>> >> +     rcu_read_lock();
>> >> +
>> >> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
>> >> +     if (data && data->func)
>> >> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
>> >
>> > Are util and max used anywhere?
>>
>> They aren't yet, but they will be.
>>
>> Maybe not in this cycle (if it takes too much time to integrate the
>> preliminary changes), but we definitely are going to use those
>> numbers.
>>
>
> Oh OK. However, I was under the impression that this set was only
> proposing a way to get rid of timers and use the scheduler as heartbeat
> for cpufreq governors. The governors' sample based approach wouldn't
> change, though. Am I wrong in assuming this?

Your assumption is correct.

The sample-based approach doesn't change at this time, simply to avoid
making too many changes in one go.

The next step, as I'm seeing it, would be to use the
scheduler-provided utilization in the governor computations instead of
the load estimation made by governors themselves.

> Also, is linux-pm/bleeding-edge the one I want to fetch to try this set out?

You can get it from there, but possibly with some changes unrelated to cpufreq.

You can also pull from the pm-cpufreq-test branch to get the cpufreq
changes only.

Apart from that, I'm going to resend the $subject set with updated patch
[1/3] for completeness.

Thanks,
Rafael
Juri Lelli Feb. 10, 2016, 2:46 p.m. UTC | #8
On 10/02/16 15:26, Rafael J. Wysocki wrote:
> On Wed, Feb 10, 2016 at 3:03 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> > On 10/02/16 14:23, Rafael J. Wysocki wrote:
> >> On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> >> > Hi Rafael,
> >> >
> >> > On 09/02/16 21:05, Rafael J. Wysocki wrote:
> >> >
> >> > [...]
> >> >
> >> >> +/**
> >> >> + * cpufreq_update_util - Take a note about CPU utilization changes.
> >> >> + * @util: Current utilization.
> >> >> + * @max: Utilization ceiling.
> >> >> + *
> >> >> + * This function is called by the scheduler on every invocation of
> >> >> + * update_load_avg() on the CPU whose utilization is being updated.
> >> >> + */
> >> >> +void cpufreq_update_util(unsigned long util, unsigned long max)
> >> >> +{
> >> >> +     struct update_util_data *data;
> >> >> +
> >> >> +     rcu_read_lock();
> >> >> +
> >> >> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
> >> >> +     if (data && data->func)
> >> >> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
> >> >
> >> > Are util and max used anywhere?
> >>
> >> They aren't yet, but they will be.
> >>
> >> Maybe not in this cycle (if it takes too much time to integrate the
> >> preliminary changes), but we definitely are going to use those
> >> numbers.
> >>
> >
> > Oh OK. However, I was under the impression that this set was only
> > proposing a way to get rid of timers and use the scheduler as heartbeat
> > for cpufreq governors. The governors' sample based approach wouldn't
> > change, though. Am I wrong in assuming this?
> 
> Your assumption is correct.
> 

In this case, wouldn't it be possible to simply put the kicks in
sched/core.c? scheduler_tick() seems a good candidate for that, and you
could complement that with enqueue/dequeue/etc., if needed.

I'm actually wondering if a slow CONFIG_HZ might affect governors'
sampling rate. We might have scheduler tick firing every 40ms and
sampling rate set to 10 or 20ms, couldn't we?

> The sample-based approach doesn't change at this time, simply to avoid
> making too many changes in one go.
> 
> The next step, as I'm seeing it, would be to use the
> scheduler-provided utilization in the governor computations instead of
> the load estimation made by governors themselves.
> 

OK. But I'm not sure what this buys us. If the end goal is still to
do sampling, aren't we better off using the (1 - idle) estimation as
today?

> > Also, is linux-pm/bleeding-edge the one I want to fetch to try this set out?
> 
> You can get it from there, but possibly with some changes unrelated to cpufreq.
> 
> You can also pull from the pm-cpufreq-test branch to get the cpufreq
> changes only.
> 
> Apart from that, I'm going to resend the $subject set with updated patch
> [1/3] for completeness.
> 

Great, thanks! Let's see if I can finally find time to run some tests
this time :).

Best,

- Juri
Rafael J. Wysocki Feb. 10, 2016, 3:46 p.m. UTC | #9
On Wed, Feb 10, 2016 at 3:46 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> On 10/02/16 15:26, Rafael J. Wysocki wrote:
>> On Wed, Feb 10, 2016 at 3:03 PM, Juri Lelli <juri.lelli@arm.com> wrote:
>> > On 10/02/16 14:23, Rafael J. Wysocki wrote:
>> >> On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
>> >> > Hi Rafael,
>> >> >
>> >> > On 09/02/16 21:05, Rafael J. Wysocki wrote:
>> >> >
>> >> > [...]
>> >> >
>> >> >> +/**
>> >> >> + * cpufreq_update_util - Take a note about CPU utilization changes.
>> >> >> + * @util: Current utilization.
>> >> >> + * @max: Utilization ceiling.
>> >> >> + *
>> >> >> + * This function is called by the scheduler on every invocation of
>> >> >> + * update_load_avg() on the CPU whose utilization is being updated.
>> >> >> + */
>> >> >> +void cpufreq_update_util(unsigned long util, unsigned long max)
>> >> >> +{
>> >> >> +     struct update_util_data *data;
>> >> >> +
>> >> >> +     rcu_read_lock();
>> >> >> +
>> >> >> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
>> >> >> +     if (data && data->func)
>> >> >> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
>> >> >
>> >> > Are util and max used anywhere?
>> >>
>> >> They aren't yet, but they will be.
>> >>
>> >> Maybe not in this cycle (if it takes too much time to integrate the
>> >> preliminary changes), but we definitely are going to use those
>> >> numbers.
>> >>
>> >
>> > Oh OK. However, I was under the impression that this set was only
>> > proposing a way to get rid of timers and use the scheduler as heartbeat
>> > for cpufreq governors. The governors' sample based approach wouldn't
>> > change, though. Am I wrong in assuming this?
>>
>> Your assumption is correct.
>>
>
> > In this case, wouldn't it be possible to simply put the kicks in
> sched/core.c? scheduler_tick() seems a good candidate for that, and you
> could complement that with enqueue/dequeue/etc., if needed.

That can be done, but they are not needed for things like idle and
stop, are they?

> I'm actually wondering if a slow CONFIG_HZ might affect governors'
> sampling rate. We might have scheduler tick firing every 40ms and
> sampling rate set to 10 or 20ms, couldn't we?

The smallest HZ you can get from the standard config is 100.  That
would translate to an update every 10ms roughly if my understanding of
things is correct.

Also I think that the scheduler and cpufreq should really work at the
same pace as they affect each other in any case.

>> The sample-based approach doesn't change at this time, simply to avoid
>> making too many changes in one go.
>>
>> The next step, as I'm seeing it, would be to use the
>> scheduler-provided utilization in the governor computations instead of
>> the load estimation made by governors themselves.
>>
>
> OK. But, I'm not sure what does this buy us. If the end goal is still to
> do sampling, aren't we better off using the (1 - idle) estimation as
> today?

First of all, we can avoid the need to compute this number entirely if
we use the scheduler-provided one.

Second, what if we come up with a different idea about the CPU
utilization than the scheduler has?  Who's right then?

Finally, the way this number is currently computed by cpufreq is based
on some questionable heuristics (and not just in one place), so maybe
it's better to stop doing that?

Also I didn't say that the *final* goal would be to do sampling.  I
was talking about the next step. :-)

Thanks,
Rafael
Juri Lelli Feb. 10, 2016, 4:05 p.m. UTC | #10
On 10/02/16 16:46, Rafael J. Wysocki wrote:
> On Wed, Feb 10, 2016 at 3:46 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> > On 10/02/16 15:26, Rafael J. Wysocki wrote:
> >> On Wed, Feb 10, 2016 at 3:03 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> >> > On 10/02/16 14:23, Rafael J. Wysocki wrote:
> >> >> On Wed, Feb 10, 2016 at 1:33 PM, Juri Lelli <juri.lelli@arm.com> wrote:
> >> >> > Hi Rafael,
> >> >> >
> >> >> > On 09/02/16 21:05, Rafael J. Wysocki wrote:
> >> >> >
> >> >> > [...]
> >> >> >
> >> >> >> +/**
> >> >> >> + * cpufreq_update_util - Take a note about CPU utilization changes.
> >> >> >> + * @util: Current utilization.
> >> >> >> + * @max: Utilization ceiling.
> >> >> >> + *
> >> >> >> + * This function is called by the scheduler on every invocation of
> >> >> >> + * update_load_avg() on the CPU whose utilization is being updated.
> >> >> >> + */
> >> >> >> +void cpufreq_update_util(unsigned long util, unsigned long max)
> >> >> >> +{
> >> >> >> +     struct update_util_data *data;
> >> >> >> +
> >> >> >> +     rcu_read_lock();
> >> >> >> +
> >> >> >> +     data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
> >> >> >> +     if (data && data->func)
> >> >> >> +             data->func(data, cpu_clock(smp_processor_id()), util, max);
> >> >> >
> >> >> > Are util and max used anywhere?
> >> >>
> >> >> They aren't yet, but they will be.
> >> >>
> >> >> Maybe not in this cycle (if it takes too much time to integrate the
> >> >> preliminary changes), but we definitely are going to use those
> >> >> numbers.
> >> >>
> >> >
> >> > Oh OK. However, I was under the impression that this set was only
> >> > proposing a way to get rid of timers and use the scheduler as heartbeat
> >> > for cpufreq governors. The governors' sample based approach wouldn't
> >> > change, though. Am I wrong in assuming this?
> >>
> >> Your assumption is correct.
> >>
> >
> > In this case, wouldn't it be possible to simply put the kicks in
> > sched/core.c? scheduler_tick() seems a good candidate for that, and you
> > could complement that with enqueue/dequeue/etc., if needed.
> 
> That can be done, but they are not needed for things like idle and
> stop, are they?
> 

Sorry, I'm not sure I understand you here. In a NO_HZ system the tick will
be stopped when idle.

> > I'm actually wondering if a slow CONFIG_HZ might affect governors'
> > sampling rate. We might have scheduler tick firing every 40ms and
> > sampling rate set to 10 or 20ms, couldn't we?
> 
> The smallest HZ you can get from the standard config is 100.  That
> would translate to an update every 10ms roughly if my understanding of
> things is correct.
> 

Right. Please, forget my question above :).

> Also I think that the scheduler and cpufreq should really work at the
> same pace as they affect each other in any case.
> 

Makes sense yes.

> >> The sample-based approach doesn't change at this time, simply to avoid
> >> making too many changes in one go.
> >>
> >> The next step, as I'm seeing it, would be to use the
> >> scheduler-provided utilization in the governor computations instead of
> >> the load estimation made by governors themselves.
> >>
> >
> > OK. But I'm not sure what this buys us. If the end goal is still to
> > do sampling, aren't we better off using the (1 - idle) estimation as
> > today?
> 
> First of all, we can avoid the need to compute this number entirely if
> we use the scheduler-provided one.
> 
> Second, what if we come up with a different idea about the CPU
> utilization than the scheduler has?  Who's right then?
> 
> Finally, the way this number is currently computed by cpufreq is based
> on some questionable heuristics (and not just in one place), so maybe
> it's better to stop doing that?
> 
> Also I didn't say that the *final* goal would be to do sampling.  I
> was talking about the next step. :-)
> 

Oh, this changes things indeed. :)

Thanks,

- Juri
Steve Muckle Feb. 10, 2016, 7:47 p.m. UTC | #11
On 02/09/2016 07:09 PM, Rafael J. Wysocki wrote:
>>> >> I think additional hooks such as enqueue/dequeue would be needed in
>>> >> RT/DL. The task tick callbacks will only run if a task in that class is
>>> >> executing at the time of the tick. There could be intermittent RT/DL
>>> >> task activity in a frequency domain (the only task activity there, no
>>> >> CFS tasks) that doesn't happen to overlap the tick. Worst case the task
>>> >> activity could be periodic in such a way that it never overlaps the tick
>>> >> and the update is never made.
>> >
>> > So if I'm reading this correctly, it would be better to put the hooks
>> > into update_curr_rt/dl()?

That should AFAICS be sufficient to avoid stalling. It may be more than
is required as that covers more than just enqueue/dequeue but I'm not
sure offhand.

>
> If done this way, I guess we may pass rq_clock_task(rq) as the time
>> > arg to cpufreq_update_util() from there and then the cpu_clock() call
> I've added to this prototype won't be necessary any more.

Is it rq_clock_task() or rq_clock()? The former can omit irq time so may
gradually fall behind wall clock time, delaying callbacks in cpufreq.

thanks,
Steve
Rafael J. Wysocki Feb. 10, 2016, 9:49 p.m. UTC | #12
On Wed, Feb 10, 2016 at 8:47 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 02/09/2016 07:09 PM, Rafael J. Wysocki wrote:
>>>> >> I think additional hooks such as enqueue/dequeue would be needed in
>>>> >> RT/DL. The task tick callbacks will only run if a task in that class is
>>>> >> executing at the time of the tick. There could be intermittent RT/DL
>>>> >> task activity in a frequency domain (the only task activity there, no
>>>> >> CFS tasks) that doesn't happen to overlap the tick. Worst case the task
>>>> >> activity could be periodic in such a way that it never overlaps the tick
>>>> >> and the update is never made.
>>> >
>>> > So if I'm reading this correctly, it would be better to put the hooks
>>> > into update_curr_rt/dl()?
>
> That should AFAICS be sufficient to avoid stalling. It may be more than
> is required as that covers more than just enqueue/dequeue but I'm not
> sure offhand.
>
>>
>> If done this way, I guess we may pass rq_clock_task(rq) as the time
>> arg to cpufreq_update_util() from there and then the cpu_clock() call
>> I've added to this prototype won't be necessary any more.
>
> Is it rq_clock_task() or rq_clock()? The former can omit irq time so may
> gradually fall behind wall clock time, delaying callbacks in cpufreq.

What matters to us is the difference between the current time and the
time we previously took a sample and there shouldn't be too much
difference between the two in that respect.

Both are good enough IMO, but I can update the patch to use rq_clock()
if that's preferred.

Thanks,
Rafael
Steve Muckle Feb. 10, 2016, 10:07 p.m. UTC | #13
On 02/10/2016 01:49 PM, Rafael J. Wysocki wrote:
>>> If done this way, I guess we may pass rq_clock_task(rq) as the time
>>> >> arg to cpufreq_update_util() from there and then the cpu_clock() call
>>> >> I've added to this prototype won't be necessary any more.
>> >
>> > Is it rq_clock_task() or rq_clock()? The former can omit irq time so may
>> > gradually fall behind wall clock time, delaying callbacks in cpufreq.
>
> What matters to us is the difference between the current time and the
> time we previously took a sample and there shouldn't be too much
> difference between the two in that respect.

Sorry, the reference to wall clock time was unnecessary. I just meant it
can lose time, which could cause cpufreq updates to be delayed during
irq-heavy periods.

> Both are good enough IMO, but I can update the patch to use rq_clock()
> if that's preferred.

I do believe rq_clock should be used as workloads such as heavy
networking could spend a significant portion of time in interrupts,
skewing rq_clock_task significantly, assuming I understand it correctly.

thanks,
Steve
Rafael J. Wysocki Feb. 10, 2016, 10:12 p.m. UTC | #14
On Wed, Feb 10, 2016 at 11:07 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 02/10/2016 01:49 PM, Rafael J. Wysocki wrote:
>>>> If done this way, I guess we may pass rq_clock_task(rq) as the time
>>>> >> arg to cpufreq_update_util() from there and then the cpu_clock() call
>>>> >> I've added to this prototype won't be necessary any more.
>>> >
>>> > Is it rq_clock_task() or rq_clock()? The former can omit irq time so may
>>> > gradually fall behind wall clock time, delaying callbacks in cpufreq.
>>
>> What matters to us is the difference between the current time and the
>> time we previously took a sample and there shouldn't be too much
>> difference between the two in that respect.
>
> Sorry, the reference to wall clock time was unnecessary. I just meant it
> can lose time, which could cause cpufreq updates to be delayed during
> irq heavy periods.
>
>> Both are good enough IMO, but I can update the patch to use rq_clock()
>> if that's preferred.
>
> I do believe rq_clock should be used as workloads such as heavy
> networking could spend a significant portion of time in interrupts,
> skewing rq_clock_task significantly, assuming I understand it correctly.

OK, I'll send an update, then.

Thanks,
Rafael
Peter Zijlstra Feb. 11, 2016, 11:51 a.m. UTC | #15
On Tue, Feb 09, 2016 at 09:05:05PM +0100, Rafael J. Wysocki wrote:
> > > One concern I had was, given that the lone scheduler update hook is in
> > > CFS, is it possible for governor updates to be stalled due to RT or DL
> > > task activity?
> > 
> > I don't think they may be completely stalled, but I'd prefer Peter to
> > answer that as he suggested to do it this way.
> 
> In any case, if that concern turns out to be significant in practice, it may
> be addressed like in the appended modification of patch [1/3] from the $subject
> series.
> 
> With that things look like before from the cpufreq side, but the other sched
> classes also get a chance to trigger a cpufreq update.  The drawback is the
> cpu_clock() call instead of passing the time value from update_load_avg(), but
> I guess we can live with that if necessary.
> 
> FWIW, this modification doesn't seem to break things on my test machine.

Not really pretty though. It blows a bit that you require this callback
to be periodic (in order to replace a timer).

Ideally we'd not have to call this if state doesn't change.


> +++ linux-pm/include/linux/sched.h
> @@ -3207,4 +3207,11 @@ static inline unsigned long rlimit_max(u
>  	return task_rlimit_max(current, limit);
>  }
>  
> +void cpufreq_update_util(unsigned long util, unsigned long max);

Didn't you have a timestamp in there?

> +
> +static inline void cpufreq_kick(void)
> +{
> +	cpufreq_update_util(ULONG_MAX, ULONG_MAX);
> +}
> +
>  #endif
Peter Zijlstra Feb. 11, 2016, 11:59 a.m. UTC | #16
On Tue, Feb 09, 2016 at 05:02:33PM -0800, Steve Muckle wrote:
> > Index: linux-pm/kernel/sched/deadline.c
> > ===================================================================
> > --- linux-pm.orig/kernel/sched/deadline.c
> > +++ linux-pm/kernel/sched/deadline.c
> > @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
> >  {
> >  	update_curr_dl(rq);
> >  
> > +	/* Kick cpufreq to prevent it from stalling. */
> > +	cpufreq_kick();
> > +
> >  	/*
> >  	 * Even when we have runtime, update_curr_dl() might have resulted in us
> >  	 * not being the leftmost task anymore. In that case NEED_RESCHED will
> 
> I think additional hooks such as enqueue/dequeue would be needed in
> RT/DL. The task tick callbacks will only run if a task in that class is
> executing at the time of the tick. There could be intermittent RT/DL
> task activity in a frequency domain (the only task activity there, no
> CFS tasks) that doesn't happen to overlap the tick. Worst case the task
> activity could be periodic in such a way that it never overlaps the tick
> and the update is never made.

No, for RT (RR/FIFO) we do not have enough information to do anything
useful. Basically RR/FIFO should result in running 100% whenever we
schedule such a task.

That means RR/FIFO want a hook in pick_next_task_rt() to bump the freq
to 100% and leave it there until something else gets to run.

For DL it basically wants to set a minimum freq based on reserved
utilization, so that is __setparam_dl() or somewhere around there.

And we should either use CPPC hints for min freq or manually ensure that
the CFS callback will not select something less than this.
Rafael J. Wysocki Feb. 11, 2016, 12:08 p.m. UTC | #17
On Thu, Feb 11, 2016 at 12:51 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Feb 09, 2016 at 09:05:05PM +0100, Rafael J. Wysocki wrote:
>> > > One concern I had was, given that the lone scheduler update hook is in
>> > > CFS, is it possible for governor updates to be stalled due to RT or DL
>> > > task activity?
>> >
>> > I don't think they may be completely stalled, but I'd prefer Peter to
>> > answer that as he suggested to do it this way.
>>
>> In any case, if that concern turns out to be significant in practice, it may
>> be addressed like in the appended modification of patch [1/3] from the $subject
>> series.
>>
>> With that things look like before from the cpufreq side, but the other sched
>> classes also get a chance to trigger a cpufreq update.  The drawback is the
>> cpu_clock() call instead of passing the time value from update_load_avg(), but
>> I guess we can live with that if necessary.
>>
>> FWIW, this modification doesn't seem to break things on my test machine.
>
> Not really pretty though. It blows a bit that you require this callback
> to be periodic (in order to replace a timer).

We need it for now, but that's because of how things work on the cpufreq side.

> Ideally we'd not have to call this if state doesn't change.

When cpufreq starts to use the util numbers, things will work like
that pretty much automatically.

We'll need to avoid thrashing if there are too many state changes over
a short time, but that's a different problem.

>> +++ linux-pm/include/linux/sched.h
>> @@ -3207,4 +3207,11 @@ static inline unsigned long rlimit_max(u
>>       return task_rlimit_max(current, limit);
>>  }
>>
>> +void cpufreq_update_util(unsigned long util, unsigned long max);
>
> Didn't you have a timestamp in there?

I did and I still do in fact.

The last version is here:

https://patchwork.kernel.org/patch/8275271/

but it has the additional hooks for RT/DL which you seem to be
thinking are a mistake.

Thanks,
Rafael
Juri Lelli Feb. 11, 2016, 12:24 p.m. UTC | #18
Hi Peter,

On 11/02/16 12:59, Peter Zijlstra wrote:
> On Tue, Feb 09, 2016 at 05:02:33PM -0800, Steve Muckle wrote:
> > > Index: linux-pm/kernel/sched/deadline.c
> > > ===================================================================
> > > --- linux-pm.orig/kernel/sched/deadline.c
> > > +++ linux-pm/kernel/sched/deadline.c
> > > @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
> > >  {
> > >  	update_curr_dl(rq);
> > >  
> > > +	/* Kick cpufreq to prevent it from stalling. */
> > > +	cpufreq_kick();
> > > +
> > >  	/*
> > >  	 * Even when we have runtime, update_curr_dl() might have resulted in us
> > >  	 * not being the leftmost task anymore. In that case NEED_RESCHED will
> > 
> > I think additional hooks such as enqueue/dequeue would be needed in
> > RT/DL. The task tick callbacks will only run if a task in that class is
> > executing at the time of the tick. There could be intermittent RT/DL
> > task activity in a frequency domain (the only task activity there, no
> > CFS tasks) that doesn't happen to overlap the tick. Worst case the task
> > activity could be periodic in such a way that it never overlaps the tick
> > and the update is never made.
> 
> No, for RT (RR/FIFO) we do not have enough information to do anything
> useful. Basically RR/FIFO should result in running 100% whenever we
> schedule such a task.
> 
> That means RR/FIFO want a hook in pick_next_task_rt() to bump the freq
> to 100% and leave it there until something else gets to run.
> 

Vincent is trying to play with rt_avg (in the last sched-freq thread) to
see if we can get some information about RT as well. I understand that
from a theoretical perspective there's not much we can say about such tasks,
and bumping to max can be the only sensible thing to do, but there are
users of RT (ehm, Android) that will probably see differences in energy
consumption if we do so. Yeah, maybe they should use a different policy,
yes.

> For DL it basically wants to set a minimum freq based on reserved
> utilization, so that is __setparam_dl() or somewhere around there.
> 

I think we could do better than this once Luca's reclaiming stuff gets
in. The reserved bw is usually somewhat pessimistic. But this is a
different discussion, maybe.

Best,

- Juri
Peter Zijlstra Feb. 11, 2016, 3:26 p.m. UTC | #19
On Thu, Feb 11, 2016 at 12:24:29PM +0000, Juri Lelli wrote:
> Hi Peter,
> 
> On 11/02/16 12:59, Peter Zijlstra wrote:
> > On Tue, Feb 09, 2016 at 05:02:33PM -0800, Steve Muckle wrote:
> > > > Index: linux-pm/kernel/sched/deadline.c
> > > > ===================================================================
> > > > --- linux-pm.orig/kernel/sched/deadline.c
> > > > +++ linux-pm/kernel/sched/deadline.c
> > > > @@ -1197,6 +1197,9 @@ static void task_tick_dl(struct rq *rq,
> > > >  {
> > > >  	update_curr_dl(rq);
> > > >  
> > > > +	/* Kick cpufreq to prevent it from stalling. */
> > > > +	cpufreq_kick();
> > > > +
> > > >  	/*
> > > >  	 * Even when we have runtime, update_curr_dl() might have resulted in us
> > > >  	 * not being the leftmost task anymore. In that case NEED_RESCHED will
> > > 
> > > I think additional hooks such as enqueue/dequeue would be needed in
> > > RT/DL. The task tick callbacks will only run if a task in that class is
> > > executing at the time of the tick. There could be intermittent RT/DL
> > > task activity in a frequency domain (the only task activity there, no
> > > CFS tasks) that doesn't happen to overlap the tick. Worst case the task
> > > activity could be periodic in such a way that it never overlaps the tick
> > > and the update is never made.
> > 
> > No, for RT (RR/FIFO) we do not have enough information to do anything
> > useful. Basically RR/FIFO should result in running 100% whenever we
> > schedule such a task.
> > 
> > That means RR/FIFO want a hook in pick_next_task_rt() to bump the freq
> > to 100% and leave it there until something else gets to run.
> > 
> 
> Vincent is trying to play with rt_avg (in the last sched-freq thread) to
> see if we can get some information about RT as well. I understand that
> from a theoretical perspective that's not much we can say of such tasks,
> and bumping to max can be the only sensible thing to do, but there are
> users of RT (ehm, Android) that will probably see differences in energy
> consumption if we do so. Yeah, maybe the should use a different policy,
> yes.

Can't we just let broken people get broken results? Trying to use
rt_avg for this is just insane. We should ensure that people using this
thing correctly get correct results, the rest can take a hike.

Using rt_avg gets us to the place where people who want to do the right
thing cannot, and that is bad.

> > For DL it basically wants to set a minimum freq based on reserved
> > utilization, so that is __setparam_dl() or somewhere around there.
> > 
> 
> I think we could do better than this once Luca's reclaiming stuff gets
> in. The reserved bw is usually somewhat pessimistic. But this is a
> different discussion, maybe.

Sure, there's cleverer things that can be done. But a simple one would
indeed be the min guarantee based on accepted bandwidth.
Peter Zijlstra Feb. 11, 2016, 3:29 p.m. UTC | #20
On Thu, Feb 11, 2016 at 01:08:28PM +0100, Rafael J. Wysocki wrote:
> > Not really pretty though. It blows a bit that you require this callback
> > to be periodic (in order to replace a timer).
> 
> We need it for now, but that's because of how things work on the cpufreq side.

Right, maybe stick a big comment on cpufreq_trigger_update() noting its
a big ugly hack and will go away 'soon'.

> The last version is here:
> 
> https://patchwork.kernel.org/patch/8275271/
> 
> but it has the additional hooks for RT/DL which you seem to be
> thinking are a mistake.

As long as we make sure everybody knows they're a band-aid and will be
taken out back and shot, that should be fine for a little while, I
suppose.
Rafael J. Wysocki Feb. 11, 2016, 3:58 p.m. UTC | #21
On Thu, Feb 11, 2016 at 4:29 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 11, 2016 at 01:08:28PM +0100, Rafael J. Wysocki wrote:
>> > Not really pretty though. It blows a bit that you require this callback
>> > to be periodic (in order to replace a timer).
>>
>> We need it for now, but that's because of how things work on the cpufreq side.
>
> Right, maybe stick a big comment on cpufreq_trigger_update() noting its
> a big ugly hack and will go away 'soon'.

I will.

>> The last version is here:
>>
>> https://patchwork.kernel.org/patch/8275271/
>>
>> but it has the additional hooks for RT/DL which you seem to be
>> thinking are a mistake.
>
> As long as we make sure everybody knows they're a band-aid and will be
> taken out back and shot, that should be fine for a little while, I
> suppose.

Great, thanks!

Yes, I'm treating those as a band-aid, to be replaced later.

Let me update the patch with a comment to explain that.

Thanks,
Rafael
Steve Muckle Feb. 11, 2016, 5:06 p.m. UTC | #22
Hi Peter,

On 02/11/2016 03:59 AM, Peter Zijlstra wrote:
>> I think additional hooks such as enqueue/dequeue would be needed in
>> > RT/DL. The task tick callbacks will only run if a task in that class is
>> > executing at the time of the tick. There could be intermittent RT/DL
>> > task activity in a frequency domain (the only task activity there, no
>> > CFS tasks) that doesn't happen to overlap the tick. Worst case the task
>> > activity could be periodic in such a way that it never overlaps the tick
>> > and the update is never made.
>
> No, for RT (RR/FIFO) we do not have enough information to do anything
> useful. Basically RR/FIFO should result in running 100% whenever we
> schedule such a task.
> 
> That means RR/FIFO want a hook in pick_next_task_rt() to bump the freq
> to 100% and leave it there until something else gets to run.
>
> For DL it basically wants to set a minimum freq based on reserved
> utilization, so that is __setparam_dl() or somewhere around there.
> 
> And we should either use CPPC hints for min freq or manually ensure that
> the CFS callback will not select something less than this.

Rafael's changes aren't specifying particular frequencies/capacities in
the scheduler hooks. They're just pokes to get cpufreq to run, in order
to eliminate cpufreq's timers.

My concern above is whether pokes are guaranteed to keep occurring when
there is only RT or DL activity, so that nothing breaks.

thanks,
Steve
Peter Zijlstra Feb. 11, 2016, 5:30 p.m. UTC | #23
On Thu, Feb 11, 2016 at 09:06:04AM -0800, Steve Muckle wrote:
> Hi Peter,
> 
> >> > I think additional hooks such as enqueue/dequeue would be needed in
> >> > RT/DL.

That is what I reacted to mostly. Enqueue/dequeue hooks don't really
make much sense for RT / DL.

> Rafael's changes aren't specifying particular frequencies/capacities in
> the scheduler hooks. They're just pokes to get cpufreq to run, in order
> to eliminate cpufreq's timers.
> 
> My concern above is that pokes are guaranteed to keep occurring when
> there is only RT or DL activity so nothing breaks.

The hook in their respective tick handler should ensure stuff is called
sporadically and isn't stalled.
Rafael J. Wysocki Feb. 11, 2016, 5:34 p.m. UTC | #24
On Thu, Feb 11, 2016 at 6:30 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 11, 2016 at 09:06:04AM -0800, Steve Muckle wrote:
>> Hi Peter,
>>
>> >> > I think additional hooks such as enqueue/dequeue would be needed in
>> >> > RT/DL.
>
> That is what I reacted to mostly. Enqueue/dequeue hooks don't really
> make much sense for RT / DL.
>
>> Rafael's changes aren't specifying particular frequencies/capacities in
>> the scheduler hooks. They're just pokes to get cpufreq to run, in order
>> to eliminate cpufreq's timers.
>>
>> My concern above is that pokes are guaranteed to keep occurring when
>> there is only RT or DL activity so nothing breaks.
>
> The hook in their respective tick handler should ensure stuff is called
> sporadically and isn't stalled.

I've updated the patch in the meantime
(https://patchwork.kernel.org/patch/8283431/).

Should I move the RT/DL hooks to task_tick_rt/dl(), respectively?

Thanks,
Rafael
Peter Zijlstra Feb. 11, 2016, 5:38 p.m. UTC | #25
On Thu, Feb 11, 2016 at 06:34:05PM +0100, Rafael J. Wysocki wrote:
> I've updated the patch in the meantime
> (https://patchwork.kernel.org/patch/8283431/).
> 
> Should I move the RT/DL hooks to task_tick_rt/dl(), respectively?

Probably, this really is about kicking cpufreq to do something, right?
update_curr_*() seems overkill for that.
Vincent Guittot Feb. 11, 2016, 6:23 p.m. UTC | #26
On 11 February 2016 at 16:26, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 11, 2016 at 12:24:29PM +0000, Juri Lelli wrote:
>> Hi Peter,
>>
>> On 11/02/16 12:59, Peter Zijlstra wrote:
>> >
>> > No, for RT (RR/FIFO) we do not have enough information to do anything
>> > useful. Basically RR/FIFO should result in running 100% whenever we
>> > schedule such a task.
>> >
>> > That means RR/FIFO want a hook in pick_next_task_rt() to bump the freq
>> > to 100% and leave it there until something else gets to run.
>> >
>>
>> Vincent is trying to play with rt_avg (in the last sched-freq thread) to
>> see if we can get some information about RT as well. I understand that
>> from a theoretical perspective there's not much we can say about such tasks,
>> and bumping to max can be the only sensible thing to do, but there are
>> users of RT (ehm, Android) that will probably see differences in energy
>> consumption if we do so. Yeah, maybe they should use a different policy,
>> yes.
>
> Can't we just let broken people get broken results? Trying to use
> rt_avg for this is just insane. We should ensure that people using this
> thing correctly get correct results, the rest can take a hike.
>
> Using rt_avg gets us to the place where people who want to do the right
> thing cannot, and that is bad.

I agree that using rt_avg is not the best choice to evaluate the
capacity that is used by RT tasks, but it has the advantage of already
being there. Do you mean that we should use another way to compute
the capacity that is used by RT tasks to then select the frequency?
Or do you mean that we can't do anything other than ask for the max
frequency?

Trying to set the max frequency just before scheduling an RT task is not
really doable on a lot of platforms because the sequence that changes
the frequency can sleep and can take more time than the run time of the
task. In the end, we will have set the max frequency only once the task
has finished running. There is no other solution than increasing the
min_freq of cpufreq to a level that will ensure enough compute
capacity for RT tasks with such high constraints that cpufreq can't
react. For other RT tasks, we can probably find a way to set a
frequency that fits both the RT constraints and power consumption.

>
>> > For DL it basically wants to set a minimum freq based on reserved
>> > utilization, so that is __setparam_dl() or somewhere around there.
>> >
>>
>> I think we could do better than this once Luca's reclaiming stuff gets
>> in. The reserved bw is usually somewhat pessimistic. But this is a
>> different discussion, maybe.
>
> Sure, there's cleverer things that can be done. But a simple one would
> indeed be the min guarantee based on accepted bandwidth.
Steve Muckle Feb. 11, 2016, 6:52 p.m. UTC | #27
On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>> My concern above is that pokes are guaranteed to keep occurring when
>> > there is only RT or DL activity so nothing breaks.
>
> The hook in their respective tick handler should ensure stuff is called
> sporadically and isn't stalled.

But that's only true if the RT/DL tasks happen to be running when the
tick arrives right?

Couldn't we have RT/DL activity which doesn't overlap with the tick? And
if no CFS tasks happen to be executing on that CPU, we'll never trigger
the cpufreq update. This could go on for an arbitrarily long time
depending on the periodicity of the work.
Rafael J. Wysocki Feb. 11, 2016, 7:04 p.m. UTC | #28
On Thu, Feb 11, 2016 at 7:52 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>> My concern above is that pokes are guaranteed to keep occurring when
>>> > there is only RT or DL activity so nothing breaks.
>>
>> The hook in their respective tick handler should ensure stuff is called
>> sporadically and isn't stalled.
>
> But that's only true if the RT/DL tasks happen to be running when the
> tick arrives right?
>
> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
> if no CFS tasks happen to be executing on that CPU, we'll never trigger
> the cpufreq update. This could go on for an arbitrarily long time
> depending on the periodicity of the work.

I'm thinking that two additional hooks in enqueue_task_rt/dl() might
help here.  Then, we will hit either the tick or enqueue and that
should do the trick.
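The stall Steve describes, and why an enqueue-side kick closes it, can be shown with a toy simulation (all names hypothetical, no kernel code involved): a periodic task that wakes 1 ms after each tick and runs for 2 ms never overlaps a tick, so a tick-only hook never fires, while an enqueue-time hook fires once per activation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the worst case: RT/DL run windows that always fall
 * between ticks.  With a tick-only hook cpufreq is never poked;
 * a kick on enqueue guarantees one poke per activation. */
#define TICK_PERIOD_MS 4	/* tick fires at t = 0, 4, 8, ... */

static int tick_kicks, enqueue_kicks;

static void simulate(int activations, bool hook_enqueue)
{
	tick_kicks = enqueue_kicks = 0;
	for (int i = 0; i < activations; i++) {
		int start = i * TICK_PERIOD_MS + 1;	/* wakes just after a tick */
		int len = 2;				/* sleeps again before the next one */

		if (hook_enqueue)
			enqueue_kicks++;	/* kick from enqueue_task_rt/dl() */
		for (int t = start; t < start + len; t++)
			if (t % TICK_PERIOD_MS == 0)
				tick_kicks++;	/* kick from task_tick_rt/dl() */
	}
}

int main(void)
{
	simulate(1000, false);
	assert(tick_kicks == 0 && enqueue_kicks == 0);	/* stalled indefinitely */

	simulate(1000, true);
	assert(enqueue_kicks == 1000);	/* one poke per activation */
	return 0;
}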

Peter, what do you think?

Rafael
Rafael J. Wysocki Feb. 11, 2016, 8:47 p.m. UTC | #29
On Thu, Feb 11, 2016 at 1:08 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Thu, Feb 11, 2016 at 12:51 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, Feb 09, 2016 at 09:05:05PM +0100, Rafael J. Wysocki wrote:
>>> > > One concern I had was, given that the lone scheduler update hook is in
>>> > > CFS, is it possible for governor updates to be stalled due to RT or DL
>>> > > task activity?
>>> >
>>> > I don't think they may be completely stalled, but I'd prefer Peter to
>>> > answer that as he suggested to do it this way.
>>>
>>> In any case, if that concern turns out to be significant in practice, it may
>>> be addressed like in the appended modification of patch [1/3] from the $subject
>>> series.
>>>
>>> With that things look like before from the cpufreq side, but the other sched
>>> classes also get a chance to trigger a cpufreq update.  The drawback is the
>>> cpu_clock() call instead of passing the time value from update_load_avg(), but
>>> I guess we can live with that if necessary.
>>>
>>> FWIW, this modification doesn't seem to break things on my test machine.
>>
>> Not really pretty though. It blows a bit that you require this callback
>> to be periodic (in order to replace a timer).
>
> We need it for now, but that's because of how things work on the cpufreq side.

In fact, I don't need the new callback to be invoked periodically.  I
only need it to be called often enough, where "enough" means at least
once in every sampling interval (for lack of a better name), roughly
on average.  Less often than that may be kind of OK too, depending
on the case.

I guess I need to explain that in more detail, though, at least for
the record if not anything else, so let me do that.

To start with let me note that things in cpufreq don't happen
periodically even today with timers, because all of those timers are
deferrable, so you never know when you'll get the next update
realistically.  We try to compensate for that in a kind of poor man's
way (which may be a source of problems by itself, as mentioned by
Doug), but that's rather a band-aid.

With that in mind, there are two cases, the intel_pstate case and the
ondemand/conservative governor case.

intel_pstate is simpler, because it can do everything it needs in the
new callback (or in a timer function previously).  Periodicity might
matter to it, but it only uses two last points in its computations,
the current one and the previous one.  Thus it is not that important
how long the particular interval is.  Of course, if it is way too
long, we may miss some intermediate peaks and valleys and if the peaks
are intermittent enough, people may see poor performance.  In
practice, though, it turns out that the new callback is invoked (even
from CFS alone) much more frequently than we need on the average, so
we apply a "sample delay" rate limit to it.

In turn, the ondemand/conservative governor case is outright
ridiculous, because they don't even compute anything in the callback
(or a timer function previously).  They simply use it to spawn a work
item in process context that will estimate the "utilization" and
possibly change the P-state.  That may be delayed by the scheduling
interval, then pushed back by RT tasks and so on, so the time between
the moment they decide to take a "sample" and the moment that actually
happens may be, well, arbitrary.  So really timers are used here to
poke at things on a regular basis rather than for any actually
periodic stuff.

That may be improved in two ways in principle.  First, by moving as
much as we can into the utilization update callback without adding too
much overhead to the scheduler path.  Governor computations are the
primary candidate for that.  They need to take all of the tunables
accessible from user space into account, but that shouldn't be a big
problem.  We may be able to call at least some drivers from there too
(even the ACPI driver may be able to switch P-states via register
writes in some cases).  The second way would be to use the utilization
numbers provided by the scheduler for making governor decisions.

If we can do both, we should be much better off than we are today
already, even without the EAS stuff.

Thanks,
Rafael
Rafael J. Wysocki Feb. 12, 2016, 1:43 p.m. UTC | #30
On Thu, Feb 11, 2016 at 8:04 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Thu, Feb 11, 2016 at 7:52 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>>> My concern above is that pokes are guaranteed to keep occurring when
>>>> > there is only RT or DL activity so nothing breaks.
>>>
>>> The hook in their respective tick handler should ensure stuff is called
>>> sporadically and isn't stalled.
>>
>> But that's only true if the RT/DL tasks happen to be running when the
>> tick arrives right?
>>
>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>> the cpufreq update. This could go on for an arbitrarily long time
>> depending on the periodicity of the work.
>
> I'm thinking that two additional hooks in enqueue_task_rt/dl() might
> help here.  Then, we will hit either the tick or enqueue and that
> should do the trick.
>
> Peter, what do you think?

In any case I posted a v9 with those changes
(https://patchwork.kernel.org/patch/8290791/).

Again, it doesn't appear to break things.

If the enqueue hooks are bad (unwanted at all or in wrong places),
please let me know.

Thanks,
Rafael
Peter Zijlstra Feb. 12, 2016, 2:04 p.m. UTC | #31
On Thu, Feb 11, 2016 at 07:23:55PM +0100, Vincent Guittot wrote:
> I agree that using rt_avg is not the best choice to evaluate the
> capacity that is used by RT tasks, but it has the advantage of already
> being there. Do you mean that we should use another way to compute
> the capacity that is used by RT tasks to then select the frequency?

Nope, RR/FIFO simply do not contain enough information to compute
anything from.

> Or do you mean that we can't do anything other than ask for the max
> frequency?

Yep.

> Trying to set max frequency just before scheduling RT task is not
> really doable on a lot of platform because the sequence that changes
> the frequency can sleep and takes more time than the run time of the
> task.

So what people do today is shoot cpufreq in the head and not use it,
maybe that's the 'right' thing on these platforms.

> At the end, we will have set max frequency once the task has
> finished to run. There is no other solution than increasing the
> min_freq of cpufreq to a level that will ensure enough compute
> capacity for RT task with such high constraints that cpufreq can't
> react.

But you cannot a priori tell how much time RR/FIFO tasks will require,
that's the entire problem with them. We can compute a hysterical
average, but that _will_ mispredict the future and get you
underruns/deadline misses.

> For other RT tasks, we can probably found a way to set a
> frequency that can fit both RT constraints and power consumption.

You cannot, not without adding a lot more information about what these
tasks are doing, and that is not captured in the task model.
Peter Zijlstra Feb. 12, 2016, 2:10 p.m. UTC | #32
On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
> >> My concern above is that pokes are guaranteed to keep occurring when
> >> > there is only RT or DL activity so nothing breaks.
> >
> > The hook in their respective tick handler should ensure stuff is called
> > sporadically and isn't stalled.
> 
> But that's only true if the RT/DL tasks happen to be running when the
> tick arrives right?
> 
> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
> if no CFS tasks happen to be executing on that CPU, we'll never trigger
> the cpufreq update. This could go on for an arbitrarily long time
> depending on the periodicity of the work.

Possible yes, but why do we care? Such a CPU would be so much idle that
cpufreq doesn't matter one way or another, right?
Vincent Guittot Feb. 12, 2016, 2:48 p.m. UTC | #33
On 12 February 2016 at 15:04, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 11, 2016 at 07:23:55PM +0100, Vincent Guittot wrote:
>> I agree that using rt_avg is not the best choice to evaluate the
>> capacity that is used by RT tasks but it has the advantage of been
>> already there. Do you mean that we should use another way to compute
>> the capacity that is used by rt tasks to then select the frequency  ?
>
> Nope, RR/FIFO simply do not contain enough information to compute
> anything from.
>
>> Or do you mean that we can't do anything else than asking for max
>> frequency ?
>
> Yep.
>
>> Trying to set max frequency just before scheduling RT task is not
>> really doable on a lot of platform because the sequence that changes
>> the frequency can sleep and takes more time than the run time of the
>> task.
>
> So what people do today is shoot cpufreq in the head and not use it,
> maybe that's the 'right' thing on these platforms.
>
>> At the end, we will have set max frequency once the task has
>> finished to run. There is no other solution than increasing the
>> min_freq of cpufreq to a level that will ensure enough compute
>> capacity for RT task with such high constraints that cpufreq can't
>> react.
>
> But you cannot a priori tell how much time RR/FIFO tasks will require,
> that's the entire problem with them. We can compute a hysterical
> average, but that _will_ mis predict the future and get you
> underruns/deadline misses.
>
>> For other RT tasks, we can probably found a way to set a
>> frequency that can fit both RT constraints and power consumption.
>
> You cannot, not without adding a lot more information about what these
> tasks are doing, and that is not captured in the task model.

Another point to take into account is that the RT tasks will "steal"
the compute capacity that has been requested by the CFS tasks.

Take the example of a CPU with 3 OPPs on which 2 RT tasks A and B and
1 CFS task C run.
Assume that the real-time constraint of RT task A is too aggressive
for the lowest OPP0 and that changing the frequency of the core is too
slow compared to this constraint, but that the real-time constraint of
RT task B can be handled at any OPP. The system has no other choice
than setting the cpufreq min freq to OPP1 to be sure that the
constraint of task A will be covered at any time. Then, we still have
2 possible OPPs. The CFS task asks for compute capacity that fits in
OPP1, but part of this capacity will be stolen by the RT tasks. If we
monitor the load of the RT tasks and request capacity for them
according to their current utilization, we can decide to switch to the
highest OPP2 to ensure that task C will have enough remaining
capacity. A lot of embedded platforms face this kind of use case.
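The OPP selection described in this example can be sketched in a few
lines of self-contained C (the capacity numbers and the function name
are made up for illustration; this is not kernel code):

```c
#include <assert.h>

/* Illustrative capacities of the three OPPs from the example. */
static const unsigned int opp_cap[3] = { 400, 700, 1024 };

/* Pick the lowest OPP that (a) respects the minimum OPP forced by
 * task A's latency constraint and (b) still leaves the CFS request
 * intact once the capacity "stolen" by RT tasks is accounted for.
 */
static int pick_opp(int min_opp, unsigned int cfs_util, unsigned int rt_util)
{
	int i;

	for (i = min_opp; i < 3; i++)
		if (opp_cap[i] >= cfs_util + rt_util)
			return i;
	return 2;	/* fall back to the highest OPP */
}
```

With these numbers, a CFS request of 600 fits in OPP1 on its own, but
once 300 units of monitored RT utilization are added, the selector
moves to OPP2, which is the behaviour the example argues for.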
Rafael J. Wysocki Feb. 12, 2016, 4:01 p.m. UTC | #34
On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>> >> My concern above is that pokes are guaranteed to keep occurring when
>> >> > there is only RT or DL activity so nothing breaks.
>> >
>> > The hook in their respective tick handler should ensure stuff is called
>> > sporadically and isn't stalled.
>>
>> But that's only true if the RT/DL tasks happen to be running when the
>> tick arrives right?
>>
>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>> the cpufreq update. This could go on for an arbitrarily long time
>> depending on the periodicity of the work.
>
> Possible yes, but why do we care? Such a CPU would be so much idle that
> cpufreq doesn't matter one way or another, right?

Well, in theory you can get 50% or so of the time active in bursts
that happen to fit between ticks.  If we happen to do those in the
lowest P-state, we may burn more energy than necessary on platforms
where more idle is preferred.
Rafael J. Wysocki Feb. 12, 2016, 4:15 p.m. UTC | #35
On Fri, Feb 12, 2016 at 5:01 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>> >> My concern above is that pokes are guaranteed to keep occurring when
>>> >> > there is only RT or DL activity so nothing breaks.
>>> >
>>> > The hook in their respective tick handler should ensure stuff is called
>>> > sporadically and isn't stalled.
>>>
>>> But that's only true if the RT/DL tasks happen to be running when the
>>> tick arrives right?
>>>
>>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>>> the cpufreq update. This could go on for an arbitrarily long time
>>> depending on the periodicity of the work.
>>
>> Possible yes, but why do we care? Such a CPU would be so much idle that
>> cpufreq doesn't matter one way or another, right?
>
> Well, in theory you can get 50% or so of the time active in bursts
> that happen to fit between ticks.  If we happen to do those in the
> lowest P-state, we may burn more energy than necessary on platforms
> where more idle is preferred.

At least intel_pstate should be able to figure out which P-state to
use then on the APERF/MPERF basis.
Ashwin Chaugule Feb. 12, 2016, 4:53 p.m. UTC | #36
On 12 February 2016 at 11:15, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Feb 12, 2016 at 5:01 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>>>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>>> >> My concern above is that pokes are guaranteed to keep occurring when
>>>> >> > there is only RT or DL activity so nothing breaks.
>>>> >
>>>> > The hook in their respective tick handler should ensure stuff is called
>>>> > sporadically and isn't stalled.
>>>>
>>>> But that's only true if the RT/DL tasks happen to be running when the
>>>> tick arrives right?
>>>>
>>>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>>>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>>>> the cpufreq update. This could go on for an arbitrarily long time
>>>> depending on the periodicity of the work.
>>>
>>> Possible yes, but why do we care? Such a CPU would be so much idle that
>>> cpufreq doesn't matter one way or another, right?
>>
>> Well, in theory you can get 50% or so of the time active in bursts
>> that happen to fit between ticks.  If we happen to do those in the
>> lowest P-state, we may burn more energy than necessary on platforms
>> where more idle is preferred.
>
> At least intel_pstate should be able to figure out which P-state to
> use then on the APERF/MPERF basis.

Speaking for the generic case, it would be great to make use of such
feedback counters for selecting the next freq request. Use (num of
cycles used/total cycles) to figure out %ON time for the CPU. I
understand its not the goal for this patch series, but in the future
if we can do this in your callbacks where possible, then I think we
will do better than Ondemand.

Regards,
Ashwin.
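The counter-feedback idea can be sketched as an ondemand-style target
computation (a minimal sketch; the function name and scaling are
assumptions, and which counters provide the deltas is
platform-specific, e.g. MPERF/TSC on x86):

```c
#include <assert.h>

/* Scale the maximum frequency by the fraction of cycles the CPU was
 * actually busy over the sample window, derived from feedback
 * counters (busy cycles / total cycles).  Illustrative only.
 */
static unsigned int next_freq(unsigned int max_freq,
			      unsigned long long busy_cycles,
			      unsigned long long total_cycles)
{
	if (!total_cycles)
		return max_freq;	/* no data: stay conservative */
	return (unsigned int)(max_freq * busy_cycles / total_cycles);
}
```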

Doug Smythies Feb. 12, 2016, 5:02 p.m. UTC | #37
On 2016.02.12 08:01 Rafael J. Wysocki wrote:
> On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>>>> My concern above is that pokes are guaranteed to keep occurring when
>>>>> there is only RT or DL activity so nothing breaks.
>>>>
>>>> The hook in their respective tick handler should ensure stuff is called
>>>> sporadically and isn't stalled.
>>>
>>> But that's only true if the RT/DL tasks happen to be running when the
>>> tick arrives right?
>>>
>>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>>> the cpufreq update. This could go on for an arbitrarily long time
>>> depending on the periodicity of the work.
>>
>> Possible yes, but why do we care? Such a CPU would be so much idle that
>> cpufreq doesn't matter one way or another, right?

> Well, in theory you can get 50% or so of the time active in bursts
> that happen to fit between ticks.  If we happen to do those in the
> lowest P-state, we may burn more energy than necessary on platforms
> where more idle is preferred.

I believe this happens considerably more often than is commonly
thought, and it is the exact reason I was opposed to the introduction
of the "duration" method into the intel_pstate driver in the first
place. The probability of occurrence (of a relatively busy CPU being
idle on jiffy boundaries) is very use-case dependent, occurring more
on desktops than servers, and sometimes more with video-frame-rate
based tasks. The data supporting my claim is a couple of years old and
not very complete, but I see the issue often in trace data acquired
from desktop users on bugzilla reports.

Disclaimer: I fully admit that my related tests on the other thread have
been rigged to exaggerate the issue.


Rafael J. Wysocki Feb. 12, 2016, 11:14 p.m. UTC | #38
On Fri, Feb 12, 2016 at 5:53 PM, Ashwin Chaugule
<ashwin.chaugule@linaro.org> wrote:
> On 12 February 2016 at 11:15, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Fri, Feb 12, 2016 at 5:01 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>>> On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>>> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>>>>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>>>> >> My concern above is that pokes are guaranteed to keep occurring when
>>>>> >> > there is only RT or DL activity so nothing breaks.
>>>>> >
>>>>> > The hook in their respective tick handler should ensure stuff is called
>>>>> > sporadically and isn't stalled.
>>>>>
>>>>> But that's only true if the RT/DL tasks happen to be running when the
>>>>> tick arrives right?
>>>>>
>>>>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>>>>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>>>>> the cpufreq update. This could go on for an arbitrarily long time
>>>>> depending on the periodicity of the work.
>>>>
>>>> Possible yes, but why do we care? Such a CPU would be so much idle that
>>>> cpufreq doesn't matter one way or another, right?
>>>
>>> Well, in theory you can get 50% or so of the time active in bursts
>>> that happen to fit between ticks.  If we happen to do those in the
>>> lowest P-state, we may burn more energy than necessary on platforms
>>> where more idle is preferred.
>>
>> At least intel_pstate should be able to figure out which P-state to
>> use then on the APERF/MPERF basis.
>
> Speaking for the generic case, it would be great to make use of such
> feedback counters for selecting the next freq request. Use (num of
> cycles used/total cycles) to figure out %ON time for the CPU. I
> understand its not the goal for this patch series, but in the future
> if we can do this in your callbacks where possible, then I think we
> will do better than Ondemand.

Yes, we can do that at least in principle.  intel_pstate is a proof of that.

Thanks,
Rafael
Rafael J. Wysocki Feb. 12, 2016, 11:17 p.m. UTC | #39
On Fri, Feb 12, 2016 at 6:02 PM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.02.12 08:01 Rafael J. Wysocki wrote:
>> On Fri, Feb 12, 2016 at 3:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Thu, Feb 11, 2016 at 10:52:20AM -0800, Steve Muckle wrote:
>>>> On 02/11/2016 09:30 AM, Peter Zijlstra wrote:
>>>>>> My concern above is that pokes are guaranteed to keep occurring when
>>>>>> there is only RT or DL activity so nothing breaks.
>>>>>
>>>>> The hook in their respective tick handler should ensure stuff is called
>>>>> sporadically and isn't stalled.
>>>>
>>>> But that's only true if the RT/DL tasks happen to be running when the
>>>> tick arrives right?
>>>>
>>>> Couldn't we have RT/DL activity which doesn't overlap with the tick? And
>>>> if no CFS tasks happen to be executing on that CPU, we'll never trigger
>>>> the cpufreq update. This could go on for an arbitrarily long time
>>>> depending on the periodicity of the work.
>>>
>>> Possible yes, but why do we care? Such a CPU would be so much idle that
>>> cpufreq doesn't matter one way or another, right?
>
>> Well, in theory you can get 50% or so of the time active in bursts
>> that happen to fit between ticks.  If we happen to do those in the
>> lowest P-state, we may burn more energy than necessary on platforms
>> where more idle is preferred.
>
> I believe this happens considerably more often than is commonly thought,
> and is the exact reason I was opposed to the introduction of the
> "duration" method into the intel_pstate driver in the first
> place. The probability of occurrence (of a relatively busy CPU being idle
> on jiffy boundaries) is very use dependant, occurring more on desktops than
> servers, and sometime more with video frame rate based tasks. Data to support
> my claim is a couple of years old and not very complete, but I see the issue
> often on trace data acquired from desktop users on bugzilla reports.

The approach with update callbacks from the scheduler should not be
affected by this, because it takes updates not only at the tick time,
but also on other scheduler events.

Thanks,
Rafael
Peter Zijlstra March 1, 2016, 1:58 p.m. UTC | #40
On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:

> Another point to take into account is that the RT tasks will "steal"
> the compute capacity that has been requested by the cfs tasks.
> 
> Let takes the example of a CPU with 3 OPP on which run 2 rt tasks A
> and B and 1 cfs task C.

> Let assume that the real time constraint of RT task A is too agressive
> for the lowest OPP0 and that the change of the frequency of the core
> is too slow compare to this constraint but the real time constraint of
> RT task B can be handle whatever the OPP. System don't have other
> choice than setting the cpufreq min freq to OPP1 to be sure that
> constraint of task A will be covered at anytime.

> Then, we still have 2
> possible OPPs. The CFS task asks for compute capacity that fits in
> OPP1 but a part of this capacity will be stolen by RT tasks. If we
> monitor the load of RT tasks and request capacity for these RT tasks
> according to their current utilization, we can decide to switch to
> highest OPP2 to ensure that task C will have enough remaining
> capacity. A lot of embedded platform faces such kind of use cases

Still doesn't make sense. How would you know the constraint of RT task
A, and that it cannot be satisfied by OPP0 ? The only information you
have in the task model is a static priority.

The only possible choice the kernel has at this point is max OPP. It
doesn't have enough (_any_) information about worst case execution of
that task.

Juri Lelli March 1, 2016, 2:17 p.m. UTC | #41
On 01/03/16 14:58, Peter Zijlstra wrote:
> On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:
> 
> > Another point to take into account is that the RT tasks will "steal"
> > the compute capacity that has been requested by the cfs tasks.
> > 
> > Let takes the example of a CPU with 3 OPP on which run 2 rt tasks A
> > and B and 1 cfs task C.
> 
> > Let assume that the real time constraint of RT task A is too agressive
> > for the lowest OPP0 and that the change of the frequency of the core
> > is too slow compare to this constraint but the real time constraint of
> > RT task B can be handle whatever the OPP. System don't have other
> > choice than setting the cpufreq min freq to OPP1 to be sure that
> > constraint of task A will be covered at anytime.
> 
> > Then, we still have 2
> > possible OPPs. The CFS task asks for compute capacity that fits in
> > OPP1 but a part of this capacity will be stolen by RT tasks. If we
> > monitor the load of RT tasks and request capacity for these RT tasks
> > according to their current utilization, we can decide to switch to
> > highest OPP2 to ensure that task C will have enough remaining
> > capacity. A lot of embedded platform faces such kind of use cases
> 
> Still doesn't make sense. How would you know the constraint of RT task
> A, and that it cannot be satisfied by OPP0 ? The only information you
> have in the task model is a static priority.
> 

But, can't we have the problem Vincent describes if we s/RT/DL/ ?

Thanks,

- Juri
Peter Zijlstra March 1, 2016, 2:24 p.m. UTC | #42
On Tue, Mar 01, 2016 at 02:17:06PM +0000, Juri Lelli wrote:
> On 01/03/16 14:58, Peter Zijlstra wrote:
> > On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:
> > 
> > > Another point to take into account is that the RT tasks will "steal"
> > > the compute capacity that has been requested by the cfs tasks.
> > > 
> > > Let takes the example of a CPU with 3 OPP on which run 2 rt tasks A
> > > and B and 1 cfs task C.
> > 
> > > Let assume that the real time constraint of RT task A is too agressive
> > > for the lowest OPP0 and that the change of the frequency of the core
> > > is too slow compare to this constraint but the real time constraint of
> > > RT task B can be handle whatever the OPP. System don't have other
> > > choice than setting the cpufreq min freq to OPP1 to be sure that
> > > constraint of task A will be covered at anytime.
> > 
> > > Then, we still have 2
> > > possible OPPs. The CFS task asks for compute capacity that fits in
> > > OPP1 but a part of this capacity will be stolen by RT tasks. If we
> > > monitor the load of RT tasks and request capacity for these RT tasks
> > > according to their current utilization, we can decide to switch to
> > > highest OPP2 to ensure that task C will have enough remaining
> > > capacity. A lot of embedded platform faces such kind of use cases
> > 
> > Still doesn't make sense. How would you know the constraint of RT task
> > A, and that it cannot be satisfied by OPP0 ? The only information you
> > have in the task model is a static priority.
> > 
> 
> But, can't we have the problem Vincent describes if we s/RT/DL/ ?

Still not sure I actually see a problem. With DL you have a minimal OPP
required to guarantee correct execution of the DL tasks. For CFS you
have an average util reflecting its workload.

Add the two and you've got an effective OPP request. Or in CPPC terms:
we request a min freq of the DL and a max freq of DL+avg_CFS.

We could probably improve upon that by also tracking an avg DL and
lowering the max freq request to: min(DL, avg_DL + avg_CFS). The
consequence is that when the DL tasks hit peaks (over their avg) the CFS
tasks get a little more delay. But this might be a worthwhile trade-off.
Peter Zijlstra March 1, 2016, 2:26 p.m. UTC | #43
On Tue, Mar 01, 2016 at 03:24:59PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 01, 2016 at 02:17:06PM +0000, Juri Lelli wrote:
> > On 01/03/16 14:58, Peter Zijlstra wrote:
> > > On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:
> > > 
> > > > Another point to take into account is that the RT tasks will "steal"
> > > > the compute capacity that has been requested by the cfs tasks.
> > > > 
> > > > Let takes the example of a CPU with 3 OPP on which run 2 rt tasks A
> > > > and B and 1 cfs task C.
> > > 
> > > > Let assume that the real time constraint of RT task A is too agressive
> > > > for the lowest OPP0 and that the change of the frequency of the core
> > > > is too slow compare to this constraint but the real time constraint of
> > > > RT task B can be handle whatever the OPP. System don't have other
> > > > choice than setting the cpufreq min freq to OPP1 to be sure that
> > > > constraint of task A will be covered at anytime.
> > > 
> > > > Then, we still have 2
> > > > possible OPPs. The CFS task asks for compute capacity that fits in
> > > > OPP1 but a part of this capacity will be stolen by RT tasks. If we
> > > > monitor the load of RT tasks and request capacity for these RT tasks
> > > > according to their current utilization, we can decide to switch to
> > > > highest OPP2 to ensure that task C will have enough remaining
> > > > capacity. A lot of embedded platform faces such kind of use cases
> > > 
> > > Still doesn't make sense. How would you know the constraint of RT task
> > > A, and that it cannot be satisfied by OPP0 ? The only information you
> > > have in the task model is a static priority.
> > > 
> > 
> > But, can't we have the problem Vincent describes if we s/RT/DL/ ?
> 
> Still not sure I actually see a problem. With DL you have a minimal OPP
> required to guarantee correct execution of the DL tasks. For CFS you
> have an average util reflecting its workload.
> 
> Add the two and you've got an effective OPP request. Or in CPPC terms:
> we request a min freq of the DL and a max freq of DL+avg_CFS.
> 
> We could probably improve upon that by also tracking an avg DL and
> lowering the max freq request to: min(DL, avg_DL + avg_CFS). The

max(DL, avg_DL + avg_CFS) obviously! ;-)

> consequence is that when the DL tasks hit peaks (over their avg) the CFS
> tasks get a little more delay. But this might be a worthwhile trade-off.
Juri Lelli March 1, 2016, 2:42 p.m. UTC | #44
On 01/03/16 15:26, Peter Zijlstra wrote:
> On Tue, Mar 01, 2016 at 03:24:59PM +0100, Peter Zijlstra wrote:
> > On Tue, Mar 01, 2016 at 02:17:06PM +0000, Juri Lelli wrote:
> > > On 01/03/16 14:58, Peter Zijlstra wrote:
> > > > On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:
> > > > 
> > > > > Another point to take into account is that the RT tasks will "steal"
> > > > > the compute capacity that has been requested by the cfs tasks.
> > > > > 
> > > > > Let takes the example of a CPU with 3 OPP on which run 2 rt tasks A
> > > > > and B and 1 cfs task C.
> > > > 
> > > > > Let assume that the real time constraint of RT task A is too agressive
> > > > > for the lowest OPP0 and that the change of the frequency of the core
> > > > > is too slow compare to this constraint but the real time constraint of
> > > > > RT task B can be handle whatever the OPP. System don't have other
> > > > > choice than setting the cpufreq min freq to OPP1 to be sure that
> > > > > constraint of task A will be covered at anytime.
> > > > 
> > > > > Then, we still have 2
> > > > > possible OPPs. The CFS task asks for compute capacity that fits in
> > > > > OPP1 but a part of this capacity will be stolen by RT tasks. If we
> > > > > monitor the load of RT tasks and request capacity for these RT tasks
> > > > > according to their current utilization, we can decide to switch to
> > > > > highest OPP2 to ensure that task C will have enough remaining
> > > > > capacity. A lot of embedded platform faces such kind of use cases
> > > > 
> > > > Still doesn't make sense. How would you know the constraint of RT task
> > > > A, and that it cannot be satisfied by OPP0 ? The only information you
> > > > have in the task model is a static priority.
> > > > 
> > > 
> > > But, can't we have the problem Vincent describes if we s/RT/DL/ ?
> > 
> > Still not sure I actually see a problem. With DL you have a minimal OPP
> > required to guarantee correct execution of the DL tasks. For CFS you
> > have an average util reflecting its workload.
> > 
> > Add the two and you've got an effective OPP request. Or in CPPC terms:
> > we request a min freq of the DL and a max freq of DL+avg_CFS.
> > 
> > We could probably improve upon that by also tracking an avg DL and
> > lowering the max freq request to: min(DL, avg_DL + avg_CFS). The
> 
> max(DL, avg_DL + avg_CFS) obviously! ;-)
> 
> > consequence is that when the DL tasks hit peaks (over their avg) the CFS
> > tasks get a little more delay. But this might be a worthwhile trade-off.
> 

Agree. My point was actually more about Rafael's schedutil RFC (I should
probably have posted this there, but I thought it fitted well with this
example). I realize that Rafael is starting simple, but I fear that some
aggregation of util coming from the different classes will be needed in
the end; schedfreq already has something along these lines.

IMHO, the general approach would be that every scheduling class has an
interface to communicate its util requirement. Then RT will probably
have to ask for max, but CFS and DL will do better.

Thanks,

- Juri
Vincent Guittot March 1, 2016, 2:58 p.m. UTC | #45
On 1 March 2016 at 14:58, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Feb 12, 2016 at 03:48:54PM +0100, Vincent Guittot wrote:
>
>> Another point to take into account is that the RT tasks will "steal"
>> the compute capacity that has been requested by the cfs tasks.
>>
>> Let takes the example of a CPU with 3 OPP on which run 2 rt tasks A
>> and B and 1 cfs task C.
>
>> Let assume that the real time constraint of RT task A is too agressive
>> for the lowest OPP0 and that the change of the frequency of the core
>> is too slow compare to this constraint but the real time constraint of
>> RT task B can be handle whatever the OPP. System don't have other
>> choice than setting the cpufreq min freq to OPP1 to be sure that
>> constraint of task A will be covered at anytime.
>
>> Then, we still have 2
>> possible OPPs. The CFS task asks for compute capacity that fits in
>> OPP1 but a part of this capacity will be stolen by RT tasks. If we
>> monitor the load of RT tasks and request capacity for these RT tasks
>> according to their current utilization, we can decide to switch to
>> highest OPP2 to ensure that task C will have enough remaining
>> capacity. A lot of embedded platform faces such kind of use cases
>
> Still doesn't make sense. How would you know the constraint of RT task
> A, and that it cannot be satisfied by OPP0 ? The only information you
> have in the task model is a static priority.

The kernel doesn't have this information, which is why the sysfs
cpufreq/scaling_min_freq has to be used to prevent the kernel (and
cpufreq in particular) from using OPP0.
From a kernel/sched/cpufreq PoV, we assume that all OPPs above
cpufreq/scaling_min can be used with the RT tasks of the system. And
the performance governor is used if only the highest OPP can be used.

>
> The only possible choice the kernel has at this point is max OPP. It
> doesn't have enough (_any_) information about worst case execution of
> that task.
>
Peter Zijlstra March 1, 2016, 3:04 p.m. UTC | #46
On Tue, Mar 01, 2016 at 02:42:10PM +0000, Juri Lelli wrote:
> Agree. My point was actually more about Rafael's schedutil RFC (I should
> probably have posted this there, but I thought it fitted well with this
> example). I realize that Rafael is starting simple, but I fear that some
> aggregation of util coming from the different classes will be needed in
> the end; schedfreq has already something along this line.

Right, but I'm not sure that's a hard thing to add. But yes, it needs
doing.

It also very much has a bearing on the OPP state selection. As already
pointed out, the nearest OPP thing Rafael did is just wrong for DL.

It probably makes sense to pass a CPPC like form into the (software) OPP
selector.

> IMHO, the general approach would be that every scheduling class has an
> interface to communicate its util requirement. Then RT will probably
> have to ask for max, but CFS and DL will do better.

Right, so on IRC you mentioned that we could also use the global (or
cgroup) RT throttle to lower the RT util/OPP.
Rafael J. Wysocki March 1, 2016, 7:49 p.m. UTC | #47
On Tue, Mar 1, 2016 at 4:04 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 01, 2016 at 02:42:10PM +0000, Juri Lelli wrote:
>> Agree. My point was actually more about Rafael's schedutil RFC (I should
>> probably have posted this there, but I thought it fitted well with this
>> example). I realize that Rafael is starting simple, but I fear that some
>> aggregation of util coming from the different classes will be needed in
>> the end; schedfreq has already something along this line.
>
> Right, but I'm not sure that's a hard thing to add. But yes, it needs
> doing.
>
> It also very much has a bearing on the OPP state selection. As already
> pointed out, the nearest OPP thing Rafael did is just wrong for DL.
>
> It probably makes sense to pass a CPPC like form into the (software) OPP
> selector.
>
>> IMHO, the general approach would be that every scheduling class has an
>> interface to communicate its util requirement. Then RT will probably
>> have to ask for max, but CFS and DL will do better.
>
> Right, so on IRC you mentioned that we could also use the global (or
> cgroup) RT throttle to lower the RT util/OPP.

The current code simply treats RT/DL as "unknown" and will always ask
for the max for them.  That should work, although it's suboptimal for
DL at least.  However, I'd prefer to add something more sophisticated
on top of it just to keep things simple to start with.
diff mbox

Patch

Index: linux-pm/include/linux/sched.h
===================================================================
--- linux-pm.orig/include/linux/sched.h
+++ linux-pm/include/linux/sched.h
@@ -3207,4 +3207,11 @@  static inline unsigned long rlimit_max(u
 	return task_rlimit_max(current, limit);
 }
 
+void cpufreq_update_util(unsigned long util, unsigned long max);
+
+static inline void cpufreq_kick(void)
+{
+	cpufreq_update_util(ULONG_MAX, ULONG_MAX);
+}
+
 #endif
Index: linux-pm/kernel/sched/fair.c
===================================================================
--- linux-pm.orig/kernel/sched/fair.c
+++ linux-pm/kernel/sched/fair.c
@@ -2819,12 +2819,17 @@  static inline int update_cfs_rq_load_avg
 	return decayed || removed;
 }
 
+__weak void cpufreq_update_util(unsigned long util, unsigned long max)
+{
+}
+
 /* Update task and its cfs_rq load average */
 static inline void update_load_avg(struct sched_entity *se, int update_tg)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	u64 now = cfs_rq_clock_task(cfs_rq);
-	int cpu = cpu_of(rq_of(cfs_rq));
+	struct rq *rq = rq_of(cfs_rq);
+	int cpu = cpu_of(rq);
 
 	/*
 	 * Track task load average for carrying it to new CPU after migrated, and
@@ -2836,6 +2841,28 @@  static inline void update_load_avg(struc
 
 	if (update_cfs_rq_load_avg(now, cfs_rq) && update_tg)
 		update_tg_load_avg(cfs_rq, 0);
+
+	if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
+		unsigned long max = rq->cpu_capacity_orig;
+
+		/*
+		 * There are a few boundary cases this might miss but it should
+		 * get called often enough that that should (hopefully) not be
+		 * a real problem -- added to that it only calls on the local
+	 * CPU, so if we enqueue remotely we'll lose an update, but
+		 * the next tick/schedule should update.
+		 *
+		 * It will not get called when we go idle, because the idle
+		 * thread is a different class (!fair), nor will the utilization
+		 * number include things like RT tasks.
+		 *
+		 * As is, the util number is not freq invariant (we'd have to
+		 * implement arch_scale_freq_capacity() for that).
+		 *
+		 * See cpu_util().
+		 */
+		cpufreq_update_util(min(cfs_rq->avg.util_avg, max), max);
+	}
 }
 
 static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
Index: linux-pm/drivers/cpufreq/cpufreq.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/cpufreq.c
+++ linux-pm/drivers/cpufreq/cpufreq.c
@@ -102,6 +102,50 @@  static LIST_HEAD(cpufreq_governor_list);
 static struct cpufreq_driver *cpufreq_driver;
 static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
 static DEFINE_RWLOCK(cpufreq_driver_lock);
+
+static DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);
+
+/**
+ * cpufreq_set_update_util_data - Populate the CPU's update_util_data pointer.
+ * @cpu: The CPU to set the pointer for.
+ * @data: New pointer value.
+ *
+ * Set and publish the update_util_data pointer for the given CPU.  That pointer
+ * points to a struct update_util_data object containing a callback function
+ * to call from cpufreq_update_util().  That function will be called from an RCU
+ * read-side critical section, so it must not sleep.
+ *
+ * Callers must use RCU callbacks to free any memory that might be accessed
+ * via the old update_util_data pointer or invoke synchronize_rcu() right after
+ * this function to avoid use-after-free.
+ */
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
+{
+	rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);
+
+/**
+ * cpufreq_update_util - Take a note about CPU utilization changes.
+ * @util: Current utilization.
+ * @max: Utilization ceiling.
+ *
+ * This function is called by the scheduler on every invocation of
+ * update_load_avg() on the CPU whose utilization is being updated.
+ */
+void cpufreq_update_util(unsigned long util, unsigned long max)
+{
+	struct update_util_data *data;
+
+	rcu_read_lock();
+
+	data = rcu_dereference(*this_cpu_ptr(&cpufreq_update_util_data));
+	if (data && data->func)
+		data->func(data, cpu_clock(smp_processor_id()), util, max);
+
+	rcu_read_unlock();
+}
+
 DEFINE_MUTEX(cpufreq_governor_lock);
 
 /* Flag to suspend/resume CPUFreq governors */
Index: linux-pm/include/linux/cpufreq.h
===================================================================
--- linux-pm.orig/include/linux/cpufreq.h
+++ linux-pm/include/linux/cpufreq.h
@@ -322,6 +322,13 @@  int cpufreq_unregister_driver(struct cpu
 const char *cpufreq_get_current_driver(void);
 void *cpufreq_get_driver_data(void);
 
+struct update_util_data {
+	void (*func)(struct update_util_data *data,
+		     u64 time, unsigned long util, unsigned long max);
+};
+
+void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
+
 static inline void cpufreq_verify_within_limits(struct cpufreq_policy *policy,
 		unsigned int min, unsigned int max)
 {
Index: linux-pm/kernel/sched/rt.c
===================================================================
--- linux-pm.orig/kernel/sched/rt.c
+++ linux-pm/kernel/sched/rt.c
@@ -2212,6 +2212,9 @@  static void task_tick_rt(struct rq *rq,
 
 	update_curr_rt(rq);
 
+	/* Kick cpufreq to prevent it from stalling. */
+	cpufreq_kick();
+
 	watchdog(rq, p);
 
 	/*
Index: linux-pm/kernel/sched/deadline.c
===================================================================
--- linux-pm.orig/kernel/sched/deadline.c
+++ linux-pm/kernel/sched/deadline.c
@@ -1197,6 +1197,9 @@  static void task_tick_dl(struct rq *rq,
 {
 	update_curr_dl(rq);
 
+	/* Kick cpufreq to prevent it from stalling. */
+	cpufreq_kick();
+
 	/*
 	 * Even when we have runtime, update_curr_dl() might have resulted in us
 	 * not being the leftmost task anymore. In that case NEED_RESCHED will