
[6/6] cpufreq: schedutil: New governor based on scheduler utilization data

Message ID 1842158.0Xhak3Uaac@vostro.rjw.lan (mailing list archive)
State Superseded, archived
Delegated to: Rafael Wysocki

Commit Message

Rafael J. Wysocki March 2, 2016, 2:27 a.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Add a new cpufreq scaling governor, called "schedutil", that uses
scheduler-provided CPU utilization information as input for making
its decisions.

Doing that is possible after commit fe7034338ba0 (cpufreq: Add
mechanism for registering utilization update callbacks) that
introduced cpufreq_update_util() called by the scheduler on
utilization changes (from CFS) and RT/DL task status updates.
In particular, CPU frequency scaling decisions may be based on
the utilization data passed to cpufreq_update_util() by CFS.

The new governor is relatively simple.

The frequency selection formula used by it is essentially the same
as the one used by the "ondemand" governor, although it doesn't use
the additional up_threshold parameter.  However, instead of computing
the load as the ratio of "non-idle CPU time" to "total CPU time", it
takes the utilization data provided by CFS as input.  More
specifically, it represents "load" as the util/max ratio, where util
and max are the utilization and CPU capacity coming from CFS.
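
For illustration, that mapping boils down to just a few lines of C (a
simplified sketch of the logic described above; the helper name is
made up and the policy->min/max clamping applied by the governor is
omitted):

static unsigned int select_freq(unsigned long util, unsigned long max,
				unsigned int min_f, unsigned int max_f)
{
	/* "load" is util/max; map the [0, 1] interval onto [min_f, max_f] */
	if (util > max)
		return max_f;
	return min_f + util * (max_f - min_f) / max;
}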

All of the computations are carried out in the utilization update
handlers provided by the new governor.  One of those handlers is
used for cpufreq policies shared between multiple CPUs and the other
one is for policies with one CPU only (and therefore it doesn't need
to use any extra synchronization means).

The governor supports fast frequency switching if that is supported
by the cpufreq driver in use and possible for the given policy.
In the fast switching case, all operations of the governor take
place in its utilization update handlers.  If fast switching cannot
be used, the frequency switch operations are carried out with the
help of a work item which only calls __cpufreq_driver_target()
(under a mutex) to trigger a frequency update (to a value already
computed beforehand in one of the utilization update handlers).
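
The slow path then looks roughly like this (a sketch based on the
description above; the work_lock name and the exact wiring from the
irq_work handler are assumptions, not necessarily the actual patch
code):

static void sugov_work(struct work_struct *work)
{
	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy,
						      work);

	/*
	 * Runs in process context; the utilization update handler only
	 * raised irq_work after storing next_freq and relation.
	 */
	mutex_lock(&sg_policy->work_lock);
	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
				sg_policy->relation);
	sg_policy->work_in_progress = false;
	mutex_unlock(&sg_policy->work_lock);
}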

Currently, the governor treats all of the RT and DL tasks as
"unknown utilization" and sets the frequency to the allowed
maximum when updated from the RT or DL sched classes.  That
heavy-handed approach should be replaced with something more
subtle and specifically targeted at RT and DL tasks.

The governor shares some tunables management code with the
"ondemand" and "conservative" governors and uses some common
definitions from cpufreq_governor.h, but apart from that it
is stand-alone.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

In addition to the changes mentioned in the intro message [0/6] this also
tweaks the frequency selection formula in a couple of ways.

First off, it uses min and max frequencies correctly (the formula from
"ondemand" is applied to cpuinfo.min/max_freq like the original and
policy->min/max are applied to the result later).

Second, RELATION_L is used most of the time except for the bottom 1/4
of the available frequency range (but also note that DL tasks are
treated in the same way as RT ones, meaning f_max is always used for
them).
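
(For example, with a 500/750/1000 table and a computed target of 600,
RELATION_L picks 750 whereas RELATION_C picks the closest available
frequency, 500, so the bottom of the range favors the minimum
frequency more often.)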

Finally, the condition for discarding idle policy CPUs was modified
to also work if the rate limit is below the scheduling rate.

The code in sugov_init/exit/stop() and in the irq_work handler looks
very similar to the analogous code in cpufreq_governor.c, but it
is different enough that trying to avoid that duplication was not
practical.

Thanks,
Rafael

---
 drivers/cpufreq/Kconfig             |   26 +
 drivers/cpufreq/Makefile            |    1 
 drivers/cpufreq/cpufreq_schedutil.c |  501 ++++++++++++++++++++++++++++++++++++
 3 files changed, 528 insertions(+)



Comments

Vincent Guittot March 2, 2016, 5:10 p.m. UTC | #1
Hi Rafael,


On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Add a new cpufreq scaling governor, called "schedutil", that uses
> scheduler-provided CPU utilization information as input for making
> its decisions.
>
> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
> mechanism for registering utilization update callbacks) that
> introduced cpufreq_update_util() called by the scheduler on
> utilization changes (from CFS) and RT/DL task status updates.
> In particular, CPU frequency scaling decisions may be based on
> the utilization data passed to cpufreq_update_util() by CFS.
>
> The new governor is relatively simple.
>
> The frequency selection formula used by it is essentially the same
> as the one used by the "ondemand" governor, although it doesn't use
> the additional up_threshold parameter.  However, instead of computing
> the load as the ratio of "non-idle CPU time" to "total CPU time", it
> takes the utilization data provided by CFS as input.  More
> specifically, it represents "load" as the util/max ratio, where util
> and max are the utilization and CPU capacity coming from CFS.
>

[snip]

> +
> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> +                               unsigned long util, unsigned long max,
> +                               unsigned int next_freq)
> +{
> +       struct cpufreq_policy *policy = sg_policy->policy;
> +       unsigned int rel;
> +
> +       if (next_freq > policy->max)
> +               next_freq = policy->max;
> +       else if (next_freq < policy->min)
> +               next_freq = policy->min;
> +
> +       sg_policy->last_freq_update_time = time;
> +       if (sg_policy->next_freq == next_freq)
> +               return;
> +
> +       sg_policy->next_freq = next_freq;
> +       /*
> +        * If utilization is less than max / 4, use RELATION_C to allow the
> +        * minimum frequency to be selected more often in case the distance from
> +        * it to the next available frequency in the table is significant.
> +        */
> +       rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
> +       if (policy->fast_switch_possible) {
> +               cpufreq_driver_fast_switch(policy, next_freq, rel);
> +       } else {
> +               sg_policy->relation = rel;
> +               sg_policy->work_in_progress = true;
> +               irq_work_queue(&sg_policy->irq_work);
> +       }
> +}
> +
> +static void sugov_update_single(struct update_util_data *data, u64 time,
> +                               unsigned long util, unsigned long max)
> +{
> +       struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
> +       struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> +       unsigned int min_f, max_f, next_f;
> +
> +       if (!sugov_should_update_freq(sg_policy, time))
> +               return;
> +
> +       min_f = sg_policy->policy->cpuinfo.min_freq;
> +       max_f = sg_policy->policy->cpuinfo.max_freq;
> +       next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;

I think it has been pointed out in another email thread, but you
should change the way next_f is computed. util reflects the
utilization of a CPU from 0 to its max compute capacity, whereas
ondemand was using the load at the current frequency during the last
time window. I understand that you want to keep the same formula as
ondemand as a starting point, but you use a different input to
calculate the next frequency, so I don't see the rationale for
keeping this formula. That said, even the simple formula next_f = util > max
? max_f : util * (max_f) / max will not work properly if frequency
invariance is enabled, because the utilization becomes capped by the
current compute capacity, so next_f will never be higher than the
current frequency (unless a task moves onto the rq).  That was one
reason for using a threshold in the sched-freq proposal (and there
is ongoing development to try to solve this limitation).
IIUC, frequency invariance is not enabled on your platform, so you
have not seen the problem, but you have probably seen that the
selection of next_f was not really stable. Without frequency
invariance, the utilization will be overestimated when running at a
lower frequency, so the governor will probably select a frequency
that is higher than necessary, but then the utilization will
decrease at this higher frequency, so the governor will probably
decrease the frequency, and so on until you find the right frequency
that generates the right utilization value.

Regards,
Vincent

> +
> +       sugov_update_commit(sg_policy, time, util, max, next_f);
> +}
> +
> +static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
> +                                   unsigned long util, unsigned long max)
> +{
> +       struct cpufreq_policy *policy = sg_policy->policy;
> +       unsigned int min_f = policy->cpuinfo.min_freq;
> +       unsigned int max_f = policy->cpuinfo.max_freq;
> +       u64 last_freq_update_time = sg_policy->last_freq_update_time;
> +       unsigned int j;
> +
> +       if (util > max)
> +               return max_f;
> +
> +       for_each_cpu(j, policy->cpus) {
> +               struct sugov_cpu *j_sg_cpu;
> +               unsigned long j_util, j_max;
> +               u64 delta_ns;
> +
> +               if (j == smp_processor_id())
> +                       continue;
> +
> +               j_sg_cpu = &per_cpu(sugov_cpu, j);
> +               /*
> +                * If the CPU utilization was last updated before the previous
> +                * frequency update and the time elapsed between the last update
> +                * of the CPU utilization and the last frequency update is long
> +                * enough, don't take the CPU into account as it probably is
> +                * idle now.
> +                */
> +               delta_ns = last_freq_update_time - j_sg_cpu->last_update;
> +               if ((s64)delta_ns > NSEC_PER_SEC / HZ)
> +                       continue;
> +
> +               j_util = j_sg_cpu->util;
> +               j_max = j_sg_cpu->max;
> +               if (j_util > j_max)
> +                       return max_f;
> +
> +               if (j_util * max > j_max * util) {
> +                       util = j_util;
> +                       max = j_max;
> +               }
> +       }
> +
> +       return min_f + util * (max_f - min_f) / max;
> +}
> +

[snip]
Rafael J. Wysocki March 2, 2016, 5:58 p.m. UTC | #2
On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> Hi Rafael,
>
>
> On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Add a new cpufreq scaling governor, called "schedutil", that uses
>> scheduler-provided CPU utilization information as input for making
>> its decisions.
>>
>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
>> mechanism for registering utilization update callbacks) that
>> introduced cpufreq_update_util() called by the scheduler on
>> utilization changes (from CFS) and RT/DL task status updates.
>> In particular, CPU frequency scaling decisions may be based on
>> the utilization data passed to cpufreq_update_util() by CFS.
>>
>> The new governor is relatively simple.
>>
>> The frequency selection formula used by it is essentially the same
>> as the one used by the "ondemand" governor, although it doesn't use
>> the additional up_threshold parameter.  However, instead of computing
>> the load as the ratio of "non-idle CPU time" to "total CPU time", it
>> takes the utilization data provided by CFS as input.  More
>> specifically, it represents "load" as the util/max ratio, where util
>> and max are the utilization and CPU capacity coming from CFS.
>>
>
> [snip]
>
>> +
>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>> +                               unsigned long util, unsigned long max,
>> +                               unsigned int next_freq)
>> +{
>> +       struct cpufreq_policy *policy = sg_policy->policy;
>> +       unsigned int rel;
>> +
>> +       if (next_freq > policy->max)
>> +               next_freq = policy->max;
>> +       else if (next_freq < policy->min)
>> +               next_freq = policy->min;
>> +
>> +       sg_policy->last_freq_update_time = time;
>> +       if (sg_policy->next_freq == next_freq)
>> +               return;
>> +
>> +       sg_policy->next_freq = next_freq;
>> +       /*
>> +        * If utilization is less than max / 4, use RELATION_C to allow the
>> +        * minimum frequency to be selected more often in case the distance from
>> +        * it to the next available frequency in the table is significant.
>> +        */
>> +       rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
>> +       if (policy->fast_switch_possible) {
>> +               cpufreq_driver_fast_switch(policy, next_freq, rel);
>> +       } else {
>> +               sg_policy->relation = rel;
>> +               sg_policy->work_in_progress = true;
>> +               irq_work_queue(&sg_policy->irq_work);
>> +       }
>> +}
>> +
>> +static void sugov_update_single(struct update_util_data *data, u64 time,
>> +                               unsigned long util, unsigned long max)
>> +{
>> +       struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
>> +       struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>> +       unsigned int min_f, max_f, next_f;
>> +
>> +       if (!sugov_should_update_freq(sg_policy, time))
>> +               return;
>> +
>> +       min_f = sg_policy->policy->cpuinfo.min_freq;
>> +       max_f = sg_policy->policy->cpuinfo.max_freq;
>> +       next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>
> I think it has been pointed out in another email thread, but you
> should change the way next_f is computed. util reflects the
> utilization of a CPU from 0 to its max compute capacity, whereas
> ondemand was using the load at the current frequency during the last
> time window. I understand that you want to keep the same formula as
> ondemand as a starting point, but you use a different input to
> calculate the next frequency, so I don't see the rationale for
> keeping this formula.

It is a formula that causes the entire available frequency range to be
utilized proportionally to the utilization as reported by the
scheduler (modulo the policy->min/max limits).  Its (significant IMO)
advantage is that it doesn't require any additional factors that would
need to be determined somehow.

> That said, even the simple formula next_f = util > max
> ? max_f : util * (max_f) / max will not work properly if frequency
> invariance is enabled, because the utilization becomes capped by the
> current compute capacity, so next_f will never be higher than the
> current frequency (unless a task moves onto the rq).  That was one
> reason for using a threshold in the sched-freq proposal (and there
> is ongoing development to try to solve this limitation).

Well, a different formula will have to be used along with frequency
invariance, then.

> IIUC, frequency invariance is not enabled on your platform, so you
> have not seen the problem, but you have probably seen that the
> selection of next_f was not really stable. Without frequency
> invariance, the utilization will be overestimated when running at a
> lower frequency, so the governor will probably select a frequency
> that is higher than necessary, but then the utilization will
> decrease at this higher frequency, so the governor will probably
> decrease the frequency, and so on until you find the right frequency
> that generates the right utilization value.

I don't have any problems with that to be honest and if you aim at
selecting the perfect frequency at the first attempt, then good luck
with that anyway.

Now, I'm not saying that the formula used in this patch cannot be
improved or similar.  It very well may be possible to improve it.  I'm
only saying that it is good enough to start with, because of the
reasons mentioned above.

Still, if you can suggest to me what other formula specifically should
be used here, I'll consider using it.  Which will probably mean
comparing the two and seeing which one leads to better results.

Thanks,
Rafael
Rafael J. Wysocki March 2, 2016, 10:49 p.m. UTC | #3
On Wed, Mar 2, 2016 at 6:58 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
>> Hi Rafael,
>>
>>
>> On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Add a new cpufreq scaling governor, called "schedutil", that uses
>>> scheduler-provided CPU utilization information as input for making
>>> its decisions.
>>>
>>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
>>> mechanism for registering utilization update callbacks) that
>>> introduced cpufreq_update_util() called by the scheduler on
>>> utilization changes (from CFS) and RT/DL task status updates.
>>> In particular, CPU frequency scaling decisions may be based on
>>> the utilization data passed to cpufreq_update_util() by CFS.
>>>
>>> The new governor is relatively simple.
>>>
>>> The frequency selection formula used by it is essentially the same
>>> as the one used by the "ondemand" governor, although it doesn't use
>>> the additional up_threshold parameter.  However, instead of computing
>>> the load as the ratio of "non-idle CPU time" to "total CPU time", it
>>> takes the utilization data provided by CFS as input.  More
>>> specifically, it represents "load" as the util/max ratio, where util
>>> and max are the utilization and CPU capacity coming from CFS.
>>>
>>
>> [snip]
>>
>>> +
>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>> +                               unsigned long util, unsigned long max,
>>> +                               unsigned int next_freq)
>>> +{
>>> +       struct cpufreq_policy *policy = sg_policy->policy;
>>> +       unsigned int rel;
>>> +
>>> +       if (next_freq > policy->max)
>>> +               next_freq = policy->max;
>>> +       else if (next_freq < policy->min)
>>> +               next_freq = policy->min;
>>> +
>>> +       sg_policy->last_freq_update_time = time;
>>> +       if (sg_policy->next_freq == next_freq)
>>> +               return;
>>> +
>>> +       sg_policy->next_freq = next_freq;
>>> +       /*
>>> +        * If utilization is less than max / 4, use RELATION_C to allow the
>>> +        * minimum frequency to be selected more often in case the distance from
>>> +        * it to the next available frequency in the table is significant.
>>> +        */
>>> +       rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
>>> +       if (policy->fast_switch_possible) {
>>> +               cpufreq_driver_fast_switch(policy, next_freq, rel);
>>> +       } else {
>>> +               sg_policy->relation = rel;
>>> +               sg_policy->work_in_progress = true;
>>> +               irq_work_queue(&sg_policy->irq_work);
>>> +       }
>>> +}
>>> +
>>> +static void sugov_update_single(struct update_util_data *data, u64 time,
>>> +                               unsigned long util, unsigned long max)
>>> +{
>>> +       struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
>>> +       struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>>> +       unsigned int min_f, max_f, next_f;
>>> +
>>> +       if (!sugov_should_update_freq(sg_policy, time))
>>> +               return;
>>> +
>>> +       min_f = sg_policy->policy->cpuinfo.min_freq;
>>> +       max_f = sg_policy->policy->cpuinfo.max_freq;
>>> +       next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>>
>> I think it has been pointed out in another email thread, but you
>> should change the way next_f is computed. util reflects the
>> utilization of a CPU from 0 to its max compute capacity, whereas
>> ondemand was using the load at the current frequency during the last
>> time window. I understand that you want to keep the same formula as
>> ondemand as a starting point, but you use a different input to
>> calculate the next frequency, so I don't see the rationale for
>> keeping this formula.
>
> It is a formula that causes the entire available frequency range to be
> utilized proportionally to the utilization as reported by the
> scheduler (modulo the policy->min/max limits).  Its (significant IMO)
> advantage is that it doesn't require any additional factors that would
> need to be determined somehow.

In case a more formal derivation of this formula is needed, it is
based on the following 3 assumptions:

(1) Performance is a linear function of frequency.
(2) Required performance is a linear function of the utilization ratio
x = util/max as provided by the scheduler (0 <= x <= 1).
(3) The minimum possible frequency (min_freq) corresponds to x = 0 and
the maximum possible frequency (max_freq) corresponds to x = 1.

(1) and (2) combined imply that

f = a * x + b

(f - frequency, a, b - constants to be determined) and then (3) quite
trivially leads to b = min_freq and a = max_freq - min_freq.
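
(For example, with min_freq = 500, max_freq = 1000 and x = 0.4 this
gives f = 500 + 0.4 * (1000 - 500) = 700.)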

Now, of course, the linearity assumptions may be questioned, but then
it's just the first approximation.  If you go any further, though, you
end up with an expansion series like this:

f(x) = c_0 + c_1 * x + c_2 * x^2 + c_3 * x^3 + ...

where all of the c_j need to be determined in principle.  With luck,
if you can guess what kind of a function f(x) may be, it may be
possible to reduce the number of coefficients to determine, but the
question is whether or not that is going to work universally for all
systems.

Thanks,
Rafael
Peter Zijlstra March 3, 2016, 12:20 p.m. UTC | #4
On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
> >>> +       min_f = sg_policy->policy->cpuinfo.min_freq;
> >>> +       max_f = sg_policy->policy->cpuinfo.max_freq;
> >>> +       next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;

> In case a more formal derivation of this formula is needed, it is
> based on the following 3 assumptions:
> 
> (1) Performance is a linear function of frequency.
> (2) Required performance is a linear function of the utilization ratio
> x = util/max as provided by the scheduler (0 <= x <= 1).

> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> the maximum possible frequency (max_freq) corresponds to x = 1.
> 
> (1) and (2) combined imply that
> 
> f = a * x + b
> 
> (f - frequency, a, b - constants to be determined) and then (3) quite
> trivially leads to b = min_freq and a = max_freq - min_freq.

3 is the problem, that just doesn't make sense and is probably the
reason why you see very little selection of the min freq.

Suppose a machine with the following frequencies:

	500, 750, 1000

And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
700 make any sense? Per your point 1, it should be asking for
0.4 * 1000 = 400.

Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
only need 0.4 times as much. Therefore 500 is more than sufficient.
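
A trivial user-space sketch (hypothetical, just to put the two
mappings side by side across the range):

#include <stdio.h>

int main(void)
{
	unsigned int min_f = 500, max_f = 1000, x;

	/* x is the utilization ratio expressed in percent */
	for (x = 0; x <= 100; x += 20) {
		unsigned int f_patch = min_f + x * (max_f - min_f) / 100;
		unsigned int f_prop = x * max_f / 100;

		printf("util %3u%%: patch formula %4u, proportional %4u\n",
		       x, f_patch, f_prop);
	}
	return 0;
}

At 40% this prints 700 for the patch formula and 400 for the
proportional mapping, which RELATION_L would satisfy with 500.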



Note: we all know that (1) is a 'broken' assumption, but lacking anything
better I think it's a reasonable one to make.
Juri Lelli March 3, 2016, 12:32 p.m. UTC | #5
On 03/03/16 13:20, Peter Zijlstra wrote:
> On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
> > >>> +       min_f = sg_policy->policy->cpuinfo.min_freq;
> > >>> +       max_f = sg_policy->policy->cpuinfo.max_freq;
> > >>> +       next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
> 
> > In case a more formal derivation of this formula is needed, it is
> > based on the following 3 assumptions:
> > 
> > (1) Performance is a linear function of frequency.
> > (2) Required performance is a linear function of the utilization ratio
> > x = util/max as provided by the scheduler (0 <= x <= 1).
> 
> > (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> > the maximum possible frequency (max_freq) corresponds to x = 1.
> > 
> > (1) and (2) combined imply that
> > 
> > f = a * x + b
> > 
> > (f - frequency, a, b - constants to be determined) and then (3) quite
> > trivially leads to b = min_freq and a = max_freq - min_freq.
> 
> 3 is the problem, that just doesn't make sense and is probably the
> reason why you see very little selection of the min freq.
> 
> Suppose a machine with the following frequencies:
> 
> 	500, 750, 1000
> 
> And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
> 700 make any sense? Per your point 1, it should be asking for
> 0.4 * 1000 = 400.
> 
> Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
> only need 0.4 times as much. Therefore 500 is more than sufficient.
> 

Oh, and that is probably also why the governor can reach max OPP with
freq invariance enabled (the point Vincent was making). When we run at
500 the util signal is capped at that capacity, but the formula makes us
request more, so we can jump to the next step, and so on.
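
For instance, with the 500/750/1000 example and a task that keeps the
CPU busy: at 500 the invariant util/max is capped at 0.5, so the
formula requests 500 + 0.5 * 500 = 750; once there, the cap becomes
0.75 and the request 875, which RELATION_L rounds up to 1000.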
Vincent Guittot March 3, 2016, 1:07 p.m. UTC | #6
On 2 March 2016 at 18:58, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
>> Hi Rafael,
>>
>>
>> On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Add a new cpufreq scaling governor, called "schedutil", that uses
>>> scheduler-provided CPU utilization information as input for making
>>> its decisions.
>>>
>>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
>>> mechanism for registering utilization update callbacks) that
>>> introduced cpufreq_update_util() called by the scheduler on
>>> utilization changes (from CFS) and RT/DL task status updates.
>>> In particular, CPU frequency scaling decisions may be based on
>>> the utilization data passed to cpufreq_update_util() by CFS.
>>>
>>> The new governor is relatively simple.
>>>
>>> The frequency selection formula used by it is essentially the same
>>> as the one used by the "ondemand" governor, although it doesn't use
>>> the additional up_threshold parameter.  However, instead of computing
>>> the load as the ratio of "non-idle CPU time" to "total CPU time", it
>>> takes the utilization data provided by CFS as input.  More
>>> specifically, it represents "load" as the util/max ratio, where util
>>> and max are the utilization and CPU capacity coming from CFS.
>>>
>>
>> [snip]
>>
>>> +
>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>> +                               unsigned long util, unsigned long max,
>>> +                               unsigned int next_freq)
>>> +{
>>> +       struct cpufreq_policy *policy = sg_policy->policy;
>>> +       unsigned int rel;
>>> +
>>> +       if (next_freq > policy->max)
>>> +               next_freq = policy->max;
>>> +       else if (next_freq < policy->min)
>>> +               next_freq = policy->min;
>>> +
>>> +       sg_policy->last_freq_update_time = time;
>>> +       if (sg_policy->next_freq == next_freq)
>>> +               return;
>>> +
>>> +       sg_policy->next_freq = next_freq;
>>> +       /*
>>> +        * If utilization is less than max / 4, use RELATION_C to allow the
>>> +        * minimum frequency to be selected more often in case the distance from
>>> +        * it to the next available frequency in the table is significant.
>>> +        */
>>> +       rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
>>> +       if (policy->fast_switch_possible) {
>>> +               cpufreq_driver_fast_switch(policy, next_freq, rel);
>>> +       } else {
>>> +               sg_policy->relation = rel;
>>> +               sg_policy->work_in_progress = true;
>>> +               irq_work_queue(&sg_policy->irq_work);
>>> +       }
>>> +}
>>> +
>>> +static void sugov_update_single(struct update_util_data *data, u64 time,
>>> +                               unsigned long util, unsigned long max)
>>> +{
>>> +       struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
>>> +       struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>>> +       unsigned int min_f, max_f, next_f;
>>> +
>>> +       if (!sugov_should_update_freq(sg_policy, time))
>>> +               return;
>>> +
>>> +       min_f = sg_policy->policy->cpuinfo.min_freq;
>>> +       max_f = sg_policy->policy->cpuinfo.max_freq;
>>> +       next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>>
>> I think it has been pointed out in another email thread, but you
>> should change the way next_f is computed. util reflects the
>> utilization of a CPU from 0 to its max compute capacity, whereas
>> ondemand was using the load at the current frequency during the last
>> time window. I understand that you want to keep the same formula as
>> ondemand as a starting point, but you use a different input to
>> calculate the next frequency, so I don't see the rationale for
>> keeping this formula.
>
> It is a formula that causes the entire available frequency range to be
> utilized proportionally to the utilization as reported by the
> scheduler (modulo the policy->min/max limits).  Its (significant IMO)
> advantage is that it doesn't require any additional factors that would
> need to be determined somehow.
>
>> That said, even the simple formula next_f = util > max
>> ? max_f : util * (max_f) / max will not work properly if frequency
>> invariance is enabled, because the utilization becomes capped by the
>> current compute capacity, so next_f will never be higher than the
>> current frequency (unless a task moves onto the rq).  That was one
>> reason for using a threshold in the sched-freq proposal (and there
>> is ongoing development to try to solve this limitation).
>
> Well, a different formula will have to be used along with frequency
> invariance, then.
>
>> IIUC, frequency invariance is not enabled on your platform, so you
>> have not seen the problem, but you have probably seen that the
>> selection of next_f was not really stable. Without frequency
>> invariance, the utilization will be overestimated when running at a
>> lower frequency, so the governor will probably select a frequency
>> that is higher than necessary, but then the utilization will
>> decrease at this higher frequency, so the governor will probably
>> decrease the frequency, and so on until you find the right frequency
>> that generates the right utilization value.
>
> I don't have any problems with that to be honest and if you aim at
> selecting the perfect frequency at the first attempt, then good luck
> with that anyway.

I mainly want to prevent any useless and periodic frequency switches
caused by a utilization that changes with the current frequency (if
frequency invariance is not used) and that can make the formula
select a frequency other than the current one. That is what I can see
when testing it.

Sorry for the late reply; I was trying to do some tests on my board
but was facing a crash issue (not linked to your patchset). I have
now done some tests and I can see such unstable behavior. I generated
a load of 33% at max frequency (3ms runs every 9ms) and I can see
the frequency toggling without any good reason. That said, I can see
a similar thing with ondemand.

Vincent

>
> Now, I'm not saying that the formula used in this patch cannot be
> improved or similar.  It very well may be possible to improve it.  I'm
> only saying that it is good enough to start with, because of the
> reasons mentioned above.
>
> Still, if you can suggest to me what other formula specifically should
> be used here, I'll consider using it.  Which will probably mean
> comparing the two and seeing which one leads to better results.
>
> Thanks,
> Rafael
Vincent Guittot March 3, 2016, 2:01 p.m. UTC | #7
On 2 March 2016 at 23:49, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Mar 2, 2016 at 6:58 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
>> On Wed, Mar 2, 2016 at 6:10 PM, Vincent Guittot
>> <vincent.guittot@linaro.org> wrote:
>>> Hi Rafael,
>>>
>>>
>>> On 2 March 2016 at 03:27, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>
>>>> Add a new cpufreq scaling governor, called "schedutil", that uses
>>>> scheduler-provided CPU utilization information as input for making
>>>> its decisions.
>>>>
>>>> Doing that is possible after commit fe7034338ba0 (cpufreq: Add
>>>> mechanism for registering utilization update callbacks) that
>>>> introduced cpufreq_update_util() called by the scheduler on
>>>> utilization changes (from CFS) and RT/DL task status updates.
>>>> In particular, CPU frequency scaling decisions may be based on
>>>> the utilization data passed to cpufreq_update_util() by CFS.
>>>>
>>>> The new governor is relatively simple.
>>>>
>>>> The frequency selection formula used by it is essentially the same
>>>> as the one used by the "ondemand" governor, although it doesn't use
>>>> the additional up_threshold parameter.  However, instead of computing
>>>> the load as the ratio of "non-idle CPU time" to "total CPU time", it
>>>> takes the utilization data provided by CFS as input.  More
>>>> specifically, it represents "load" as the util/max ratio, where util
>>>> and max are the utilization and CPU capacity coming from CFS.
>>>>
>>>
>>> [snip]
>>>
>>>> +
>>>> +static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
>>>> +                               unsigned long util, unsigned long max,
>>>> +                               unsigned int next_freq)
>>>> +{
>>>> +       struct cpufreq_policy *policy = sg_policy->policy;
>>>> +       unsigned int rel;
>>>> +
>>>> +       if (next_freq > policy->max)
>>>> +               next_freq = policy->max;
>>>> +       else if (next_freq < policy->min)
>>>> +               next_freq = policy->min;
>>>> +
>>>> +       sg_policy->last_freq_update_time = time;
>>>> +       if (sg_policy->next_freq == next_freq)
>>>> +               return;
>>>> +
>>>> +       sg_policy->next_freq = next_freq;
>>>> +       /*
>>>> +        * If utilization is less than max / 4, use RELATION_C to allow the
>>>> +        * minimum frequency to be selected more often in case the distance from
>>>> +        * it to the next available frequency in the table is significant.
>>>> +        */
>>>> +       rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
>>>> +       if (policy->fast_switch_possible) {
>>>> +               cpufreq_driver_fast_switch(policy, next_freq, rel);
>>>> +       } else {
>>>> +               sg_policy->relation = rel;
>>>> +               sg_policy->work_in_progress = true;
>>>> +               irq_work_queue(&sg_policy->irq_work);
>>>> +       }
>>>> +}
>>>> +
>>>> +static void sugov_update_single(struct update_util_data *data, u64 time,
>>>> +                               unsigned long util, unsigned long max)
>>>> +{
>>>> +       struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
>>>> +       struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>>>> +       unsigned int min_f, max_f, next_f;
>>>> +
>>>> +       if (!sugov_should_update_freq(sg_policy, time))
>>>> +               return;
>>>> +
>>>> +       min_f = sg_policy->policy->cpuinfo.min_freq;
>>>> +       max_f = sg_policy->policy->cpuinfo.max_freq;
>>>> +       next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>>>
>>> I think it has been pointed out in another email thread, but you
>>> should change the way next_f is computed. util reflects the
>>> utilization of a CPU from 0 to its max compute capacity, whereas
>>> ondemand was using the load at the current frequency during the last
>>> time window. I understand that you want to keep the same formula as
>>> ondemand as a starting point, but you use a different input to
>>> calculate the next frequency, so I don't see the rationale for
>>> keeping this formula.
>>
>> It is a formula that causes the entire available frequency range to be
>> utilized proportionally to the utilization as reported by the
>> scheduler (modulo the policy->min/max limits).  Its (significant IMO)
>> advantage is that it doesn't require any additional factors that would
>> need to be determined somehow.
>
> In case a more formal derivation of this formula is needed, it is
> based on the following 3 assumptions:
>
> (1) Performance is a linear function of frequency.
> (2) Required performance is a linear function of the utilization ratio
> x = util/max as provided by the scheduler (0 <= x <= 1).

Just to mention that the utilization you are using varies with
the frequency, which adds another variable to your equation.

> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> the maximum possible frequency (max_freq) corresponds to x = 1.
>
> (1) and (2) combined imply that
>
> f = a * x + b
>
> (f - frequency, a, b - constants to be determined) and then (3) quite
> trivially leads to b = min_freq and a = max_freq - min_freq.
>
> Now, of course, the linearity assumptions may be questioned, but then
> it's just the first approximation.  If you go any further, though, you
> end up with an expansion series like this:
>
> f(x) = c_0 + c_1 * x + c_2 * x^2 + c_3 * x^3 + ...
>
> where all of the c_j need to be determined in principle.  With luck,
> if you can guess what kind of a function f(x) may be, it may be
> possible to reduce the number of coefficients to determine, but the
> question is whether or not that is going to work universally for all
> systems.
>
> Thanks,
> Rafael
Peter Zijlstra March 3, 2016, 3:38 p.m. UTC | #8
On Thu, Mar 03, 2016 at 03:01:15PM +0100, Vincent Guittot wrote:
> > In case a more formal derivation of this formula is needed, it is
> > based on the following 3 assumptions:
> >
> > (1) Performance is a linear function of frequency.
> > (2) Required performance is a linear function of the utilization ratio
> > x = util/max as provided by the scheduler (0 <= x <= 1).
> 
> Just to mention that the utilization you are using varies with
> the frequency, which adds another variable to your equation.

Right, x86 hasn't implemented arch_scale_freq_capacity(), so the
utilization values we use are all over the map. If we lower freq, the
util will go up, which would result in us bumping the freq again, etc.
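
Conceptually, the missing piece is just a scaling of the raw signal by
the current frequency (an illustrative helper only; the real hook is
arch_scale_freq_capacity(), and this is not the actual kernel code):

/*
 * Illustration only: scale the raw utilization by the ratio of the
 * current frequency to the maximum one, so that the same amount of
 * work reports the same utilization regardless of the frequency it
 * ran at.
 */
static unsigned long freq_invariant_util(unsigned long raw_util,
					 unsigned int curr_freq,
					 unsigned int max_freq)
{
	return raw_util * curr_freq / max_freq;
}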

Rafael J. Wysocki March 3, 2016, 4:24 p.m. UTC | #9
On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
>> >>> +       min_f = sg_policy->policy->cpuinfo.min_freq;
>> >>> +       max_f = sg_policy->policy->cpuinfo.max_freq;
>> >>> +       next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
>
>> In case a more formal derivation of this formula is needed, it is
>> based on the following 3 assumptions:
>>
>> (1) Performance is a linear function of frequency.
>> (2) Required performance is a linear function of the utilization ratio
>> x = util/max as provided by the scheduler (0 <= x <= 1).
>
>> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
>> the maximum possible frequency (max_freq) corresponds to x = 1.
>>
>> (1) and (2) combined imply that
>>
>> f = a * x + b
>>
>> (f - frequency, a, b - constants to be determined) and then (3) quite
>> trivially leads to b = min_freq and a = max_freq - min_freq.
>
> 3 is the problem, that just doesn't make sense and is probably the
> reason why you see very little selection of the min freq.

It is about mapping the entire [0,1] interval to the available frequency range.

It will overprovision things (the smaller x, the more), but then it may
help the race-to-idle a bit in theory.

> Suppose a machine with the following frequencies:
>
>         500, 750, 1000
>
> And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
> 700 make any sense? Per your point 1, it should be asking for
> 0.4 * 1000 = 400.
>
> Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
> only need 0.4 times as much. Therefore 500 is more than sufficient.

OK, but then I don't see why this reasoning only applies to the lower
bound of the frequency range.  Is there any reason why x = 1 should be
the only point mapping to max_freq?

If not, then I think it's reasonable to map the middle of the
available frequency range to x = 0.5 and then we have b = 0 and a =
max_freq + min_freq.

I'll try that and see how it goes.

> Note: we all know that (1) is a 'broken' assumption, but lacking anything
> better I think it's a reasonable one to make.

Right.
Peter Zijlstra March 3, 2016, 4:37 p.m. UTC | #10
On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote:
> On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
> >> >>> +       min_f = sg_policy->policy->cpuinfo.min_freq;
> >> >>> +       max_f = sg_policy->policy->cpuinfo.max_freq;
> >> >>> +       next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
> >
> >> In case a more formal derivation of this formula is needed, it is
> >> based on the following 3 assumptions:
> >>
> >> (1) Performance is a linear function of frequency.
> >> (2) Required performance is a linear function of the utilization ratio
> >> x = util/max as provided by the scheduler (0 <= x <= 1).
> >
> >> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> >> the maximum possible frequency (max_freq) corresponds to x = 1.
> >>
> >> (1) and (2) combined imply that
> >>
> >> f = a * x + b
> >>
> >> (f - frequency, a, b - constants to be determined) and then (3) quite
> >> trivially leads to b = min_freq and a = max_freq - min_freq.
> >
> > 3 is the problem, that just doesn't make sense and is probably the
> > reason why you see very little selection of the min freq.
> 
> It is about mapping the entire [0,1] interval to the available frequency range.

Yeah, but I don't see why that makes sense..

> It will overprovision things (the smaller x, the more), but then it may
> help the race-to-idle a bit in theory.

So, since we also have the cpuidle information, could we not make a
better guess at race-to-idle?

> > Suppose a machine with the following frequencies:
> >
> >         500, 750, 1000
> >
> > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
> > 700 make any sense? Per your point 1, it should be asking for
> > 0.4 * 1000 = 400.
> >
> > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
> > only need 0.4 times as much. Therefore 500 is more than sufficient.
> 
> OK, but then I don't see why this reasoning only applies to the lower
> bound of the frequency range.  Is there any reason why x = 1 should be
> the only point mapping to max_freq?

Well, everything that goes over the second to last freq would end up at
the last (max) freq.

Take again the 500,750,1000 example, everything that's >750 would end up
at 1000 (for relation_l, >875 for _c).

But given the platform's cpuidle information, maybe coupled with an avg
idle est, we can compute the benefit of race-to-idle and overprovision
based on that, right?

> If not, then I think it's reasonable to map the middle of the
> available frequency range to x = 0.5 and then we have b = 0 and a =
> > max_freq + min_freq.

So I really think that approach falls apart on the low util bits; you
effectively always run above min speed, even if min is already vastly
overprovisioned.

Peter Zijlstra March 3, 2016, 4:47 p.m. UTC | #11
On Thu, Mar 03, 2016 at 05:37:35PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote:
> > >> f = a * x + b

> > If not, then I think it's reasonable to map the middle of the
> > available frequency range to x = 0.5 and then we have b = 0 and a =
> > max_freq + min_freq.
> 
> So I really think that approach falls apart on the low util bits; you
> effectively always run above min speed, even if min is already vastly
> overprovisioned.

Ah, never mind, I cannot read. Yes, that is worth trying, I suppose. But
the b=0, a=1 thing seems more natural still.
Juri Lelli March 3, 2016, 4:55 p.m. UTC | #12
On 03/03/16 17:37, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote:
> > On Thu, Mar 3, 2016 at 1:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Wed, Mar 02, 2016 at 11:49:48PM +0100, Rafael J. Wysocki wrote:
> > >> >>> +       min_f = sg_policy->policy->cpuinfo.min_freq;
> > >> >>> +       max_f = sg_policy->policy->cpuinfo.max_freq;
> > >> >>> +       next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
> > >
> > >> In case a more formal derivation of this formula is needed, it is
> > >> based on the following 3 assumptions:
> > >>
> > >> (1) Performance is a linear function of frequency.
> > >> (2) Required performance is a linear function of the utilization ratio
> > >> x = util/max as provided by the scheduler (0 <= x <= 1).
> > >
> > >> (3) The minimum possible frequency (min_freq) corresponds to x = 0 and
> > >> the maximum possible frequency (max_freq) corresponds to x = 1.
> > >>
> > >> (1) and (2) combined imply that
> > >>
> > >> f = a * x + b
> > >>
> > >> (f - frequency, a, b - constants to be determined) and then (3) quite
> > >> trivially leads to b = min_freq and a = max_freq - min_freq.
> > >
> > > 3 is the problem, that just doesn't make sense and is probably the
> > > reason why you see very little selection of the min freq.
> > 
> > It is about mapping the entire [0,1] interval to the available frequency range.
> 
> Yeah, but I don't see why that makes sense..
> 
> > It will overprovision things (the smaller x, the more), but then it may
> > help the race-to-idle a bit in theory.
> 
> So, since we also have the cpuidle information, could we not make a
> better guess at race-to-idle?
> 
> > > Suppose a machine with the following frequencies:
> > >
> > >         500, 750, 1000
> > >
> > > And a utilization of 0.4, how does asking for 500 + 0.4 * (1000-500) =
> > > 700 make any sense? Per your point 1, it should be asking for
> > > 0.4 * 1000 = 400.
> > >
> > > Because, per 1, at 500 it runs exactly half as fast as at 1000, and we
> > > only need 0.4 times as much. Therefore 500 is more than sufficient.
> > 
> > OK, but then I don't see why this reasoning only applies to the lower
> > bound of the frequency range.  Is there any reason why x = 1 should be
> > the only point mapping to max_freq?
> 
> Well, everything that goes over the second to last freq would end up at
> the last (max) freq.
> 
> Take again the 500,750,1000 example, everything that's >750 would end up
> at 1000 (for relation_l, >875 for _c).
> 
> But given the platform's cpuidle information, maybe coupled with an avg
> idle est, we can compute the benefit of race-to-idle and overprovision
> based on that, right?
> 

Shouldn't this kind of consideration be a scheduler thing? I'm not
really getting why we want to put more "intelligence" in a new governor.
Also, if I understand Ingo's point correctly, I think we want to make
this kind of policy decision inside the scheduler.
Peter Zijlstra March 3, 2016, 4:56 p.m. UTC | #13
On Thu, Mar 03, 2016 at 04:55:44PM +0000, Juri Lelli wrote:
> On 03/03/16 17:37, Peter Zijlstra wrote:
> > But given the platform's cpuidle information, maybe coupled with an avg
> > idle est, we can compute the benefit of race-to-idle and overprovision
> > based on that, right?
> > 
> 
> Shouldn't this kind of consideration be a scheduler thing? I'm not
> really getting why we want to put more "intelligence" in a new governor.
> Also, if I understand Ingo's point correctly, I think we want to make
> this kind of policy decision inside the scheduler.

Well sure, put it in kernel/sched/cpufreq.c or wherever. My point was
more that we don't have to guess/hardcode race-to-idle assumptions but
can actually calculate some of that.


Juri Lelli March 3, 2016, 5:14 p.m. UTC | #14
On 03/03/16 17:56, Peter Zijlstra wrote:
> On Thu, Mar 03, 2016 at 04:55:44PM +0000, Juri Lelli wrote:
> > On 03/03/16 17:37, Peter Zijlstra wrote:
> > > But given the platform's cpuidle information, maybe coupled with an avg
> > > idle est, we can compute the benefit of race-to-idle and over provision
> > > based on that, right?
> > > 
> > 
> > Shouldn't this kind of considerations be a scheduler thing? I'm not
> > really getting why we want to put more "intelligence" in a new governor.
> > Also, if I understand Ingo's point correctly, I think we want to make
> > this kind of policy decisions inside the scheduler.
> 
> Well sure, put it in kernel/sched/cpufreq.c or wherever. My point was
> more that we don't have to guess/hardcode race-to-idle assumptions but
> can actually calculate some of that.
> 

Right, thanks for clarifying!
Steve Muckle March 3, 2016, 8:06 p.m. UTC | #15
On 03/03/2016 05:07 AM, Vincent Guittot wrote:
> I mainly want to prevent any useless and periodic frequency switches
> caused by a utilization that changes with the current frequency (if
> frequency invariance is not used) and that can make the formula
> select a frequency other than the current one. That is what I can see
> when testing it.
>
> Sorry for the late reply; I was trying to do some tests on my board
> but was facing a crash issue (not linked to your patchset). I have
> now done some tests and I can see such unstable behavior. I generated
> a load of 33% at max frequency (3ms runs every 9ms) and I can see
> the frequency toggling without any good reason. That said, I can see
> a similar thing with ondemand.

FWIW I ran some performance numbers on my Chromebook 2. Initially I
forgot to bring in the frequency invariance support but that yielded an
opportunity to see the impact of it.

The tests below consist of a periodic workload. The OH (overhead)
numbers show how close the workload got to running as slow as fmin (100%
= as slow as powersave gov, 0% = as fast as perf gov). The OR (overrun)
number is the count of instances where the busy work exceeded the period.
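
The workload is essentially a busy loop for "run" ms followed by
sleeping for the rest of the period. A minimal sketch along those
lines (assumed; the actual test harness is likely more elaborate):

#include <time.h>

static void busy_ms(long ms)
{
	struct timespec start, now;
	long elapsed;

	/* Spin until 'ms' milliseconds of wall time have passed. */
	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
		elapsed = (now.tv_sec - start.tv_sec) * 1000L +
			  (now.tv_nsec - start.tv_nsec) / 1000000L;
	} while (elapsed < ms);
}

int main(void)
{
	long run = 10, period = 100;	/* e.g. the 10% busy rows below */
	struct timespec idle = {
		.tv_sec = (period - run) / 1000,
		.tv_nsec = ((period - run) % 1000) * 1000000L,
	};

	for (;;) {
		busy_ms(run);		/* busy for 'run' ms ... */
		nanosleep(&idle, NULL);	/* ... idle for the rest */
	}
}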

First a comparison of schedutil with and without frequency invariance.
Run and period are in milliseconds.

			scu (no inv)	scu (w/inv)	
run	period	busy %	OR	OH	OR	OH
1	100	1.00%	0	79.72%	0	95.86%
10	1000	1.00%	0	24.52%	0	71.61%
1	10	10.00%	0	21.25%	0	41.78%
10	100	10.00%	0	26.06%	0	47.96%
100	1000	10.00%	0	6.36%	0	26.03%
6	33	18.18%	0	15.67%	0	31.61%
66	333	19.82%	0	8.94%	0	29.46%
4	10	40.00%	0	6.26%	0	12.93%
40	100	40.00%	0	6.93%	2	14.08%
400	1000	40.00%	0	1.65%	0	11.58%
5	9	55.56%	0	3.70%	0	7.70%
50	90	55.56%	1	4.19%	6	8.06%
500	900	55.56%	0	1.35%	5	6.94%
9	12	75.00%	0	1.60%	56	3.59%
90	120	75.00%	0	1.88%	21	3.94%
900	1200	75.00%	0	0.73%	4	4.41%

Frequency invariance causes schedutil overhead to increase noticeably. I
haven't dug into traces or anything. Perhaps this is due to the
algorithm overshooting and then overcorrecting; I do not yet know.

Here is a comparison, with frequency invariance, of ondemand and
interactive with schedfreq and schedutil. The first two columns (run and
period) are omitted so the table will fit.

	ondemand	interactive	schedfreq	schedutil	
busy %	OR	OH	OR	OH	OR	OH	OR	OH
1.00%	0	68.96%	0	100.04%	0	78.49%	0	95.86%
1.00%	0	25.04%	0	22.59%	0	72.56%	0	71.61%
10.00%	0	21.75%	0	63.08%	0	52.40%	0	41.78%
10.00%	0	12.17%	0	14.41%	0	17.33%	0	47.96%
10.00%	0	2.57%	0	2.17%	0	0.29%	0	26.03%
18.18%	0	12.39%	0	9.39%	0	17.34%	0	31.61%
19.82%	0	3.74%	0	3.42%	0	12.26%	0	29.46%
40.00%	2	6.26%	1	12.23%	0	6.15%	0	12.93%
40.00%	0	0.47%	0	0.05%	0	2.68%	2	14.08%
40.00%	0	0.60%	0	0.50%	0	1.22%	0	11.58%
55.56%	2	4.25%	5	5.97%	0	2.51%	0	7.70%
55.56%	0	1.89%	0	0.04%	0	1.71%	6	8.06%
55.56%	0	0.50%	0	0.47%	0	1.82%	5	6.94%
75.00%	2	1.65%	1	0.46%	0	0.26%	56	3.59%
75.00%	0	1.68%	0	0.05%	0	0.49%	21	3.94%
75.00%	0	0.28%	0	0.23%	0	0.62%	4	4.41%

Aside from the 2nd and 3rd tests, schedutil shows decreased
performance across the board. The fifth test is particularly bad.

The catch is that I do not have power numbers to go with this data, as
I'm not currently equipped to gather them. So more analysis is
definitely needed to capture the full story.

thanks,
Steve
Rafael J. Wysocki March 3, 2016, 8:20 p.m. UTC | #16
On Thu, Mar 3, 2016 at 9:06 PM, Steve Muckle <steve.muckle@linaro.org> wrote:
> On 03/03/2016 05:07 AM, Vincent Guittot wrote:
>> I mainly want to prevent any useless and periodic frequency switches
>> caused by a utilization that changes with the current frequency (if
>> frequency invariance is not used) and that can make the formula
>> select a frequency other than the current one. That is what I can see
>> when testing it.
>>
>> Sorry for the late reply; I was trying to do some tests on my board
>> but was facing a crash issue (not linked to your patchset). I have
>> now done some tests and I can see such unstable behavior. I generated
>> a load of 33% at max frequency (3ms runs every 9ms) and I can see
>> the frequency toggling without any good reason. That said, I can see
>> a similar thing with ondemand.
>
> FWIW I ran some performance numbers on my Chromebook 2. Initially I
> forgot to bring in the frequency invariance support but that yielded an
> opportunity to see the impact of it.
>
> The tests below consist of a periodic workload. The OH (overhead)
> numbers show how close the workload got to running as slow as fmin (100%
> = as slow as powersave gov, 0% = as fast as perf gov). The OR (overrun)
> number is the count of instances where the busy work exceeded the period.
>
> First a comparison of schedutil with and without frequency invariance.
> Run and period are in milliseconds.
>
>                         scu (no inv)    scu (w/inv)
> run     period  busy %  OR      OH      OR      OH
> 1       100     1.00%   0       79.72%  0       95.86%
> 10      1000    1.00%   0       24.52%  0       71.61%
> 1       10      10.00%  0       21.25%  0       41.78%
> 10      100     10.00%  0       26.06%  0       47.96%
> 100     1000    10.00%  0       6.36%   0       26.03%
> 6       33      18.18%  0       15.67%  0       31.61%
> 66      333     19.82%  0       8.94%   0       29.46%
> 4       10      40.00%  0       6.26%   0       12.93%
> 40      100     40.00%  0       6.93%   2       14.08%
> 400     1000    40.00%  0       1.65%   0       11.58%
> 5       9       55.56%  0       3.70%   0       7.70%
> 50      90      55.56%  1       4.19%   6       8.06%
> 500     900     55.56%  0       1.35%   5       6.94%
> 9       12      75.00%  0       1.60%   56      3.59%
> 90      120     75.00%  0       1.88%   21      3.94%
> 900     1200    75.00%  0       0.73%   4       4.41%
>
> Frequency invariance causes schedutil overhead to increase noticeably. I
> haven't dug into traces or anything yet. Perhaps this is due to the
> algorithm overshooting and then overcorrecting, but I do not yet know.

So as I said, the formula I used didn't take invariance into account,
so that's quite expected.

> Here is a comparison, with frequency invariance, of ondemand and
> interactive with schedfreq and schedutil. The first two columns (run and
> period) are omitted so the table will fit.
>
>         ondemand        interactive     schedfreq       schedutil
> busy %  OR      OH      OR      OH      OR      OH      OR      OH
> 1.00%   0       68.96%  0       100.04% 0       78.49%  0       95.86%
> 1.00%   0       25.04%  0       22.59%  0       72.56%  0       71.61%
> 10.00%  0       21.75%  0       63.08%  0       52.40%  0       41.78%
> 10.00%  0       12.17%  0       14.41%  0       17.33%  0       47.96%
> 10.00%  0       2.57%   0       2.17%   0       0.29%   0       26.03%
> 18.18%  0       12.39%  0       9.39%   0       17.34%  0       31.61%
> 19.82%  0       3.74%   0       3.42%   0       12.26%  0       29.46%
> 40.00%  2       6.26%   1       12.23%  0       6.15%   0       12.93%
> 40.00%  0       0.47%   0       0.05%   0       2.68%   2       14.08%
> 40.00%  0       0.60%   0       0.50%   0       1.22%   0       11.58%
> 55.56%  2       4.25%   5       5.97%   0       2.51%   0       7.70%
> 55.56%  0       1.89%   0       0.04%   0       1.71%   6       8.06%
> 55.56%  0       0.50%   0       0.47%   0       1.82%   5       6.94%
> 75.00%  2       1.65%   1       0.46%   0       0.26%   56      3.59%
> 75.00%  0       1.68%   0       0.05%   0       0.49%   21      3.94%
> 75.00%  0       0.28%   0       0.23%   0       0.62%   4       4.41%
>
> Aside from the 2nd and 3rd tests schedutil is showing decreased
> performance across the board. The fifth test is particularly bad.

I guess you mean performance in terms of the overhead?

Thanks,
Rafael
Steve Muckle March 3, 2016, 9:37 p.m. UTC | #17
On 03/03/2016 12:20 PM, Rafael J. Wysocki wrote:
>> Here is a comparison, with frequency invariance, of ondemand and
>> interactive with schedfreq and schedutil. The first two columns (run and
>> period) are omitted so the table will fit.
>>
>>         ondemand        interactive     schedfreq       schedutil
>> busy %  OR      OH      OR      OH      OR      OH      OR      OH
>> 1.00%   0       68.96%  0       100.04% 0       78.49%  0       95.86%
>> 1.00%   0       25.04%  0       22.59%  0       72.56%  0       71.61%
>> 10.00%  0       21.75%  0       63.08%  0       52.40%  0       41.78%
>> 10.00%  0       12.17%  0       14.41%  0       17.33%  0       47.96%
>> 10.00%  0       2.57%   0       2.17%   0       0.29%   0       26.03%
>> 18.18%  0       12.39%  0       9.39%   0       17.34%  0       31.61%
>> 19.82%  0       3.74%   0       3.42%   0       12.26%  0       29.46%
>> 40.00%  2       6.26%   1       12.23%  0       6.15%   0       12.93%
>> 40.00%  0       0.47%   0       0.05%   0       2.68%   2       14.08%
>> 40.00%  0       0.60%   0       0.50%   0       1.22%   0       11.58%
>> 55.56%  2       4.25%   5       5.97%   0       2.51%   0       7.70%
>> 55.56%  0       1.89%   0       0.04%   0       1.71%   6       8.06%
>> 55.56%  0       0.50%   0       0.47%   0       1.82%   5       6.94%
>> 75.00%  2       1.65%   1       0.46%   0       0.26%   56      3.59%
>> 75.00%  0       1.68%   0       0.05%   0       0.49%   21      3.94%
>> 75.00%  0       0.28%   0       0.23%   0       0.62%   4       4.41%
>>
>> Aside from the 2nd and 3rd tests schedutil is showing decreased
>> performance across the board. The fifth test is particularly bad.
> 
> I guess you mean performance in terms of the overhead?

Correct. This overhead metric describes how fast the workload completes,
with 0% equaling the perf governor and 100% equaling the powersave
governor. So it's a reflection of general performance using the
governor. It's called "overhead" I imagine (the metric predates my
involvement) as it is something introduced/caused by the policy of the
governor.
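
To make that concrete, the metric can be read as a linear interpolation
between the two reference governors.  A minimal sketch (purely
illustrative, not the actual test harness code):

static double overhead_pct(double t_gov, double t_perf, double t_psave)
{
	/*
	 * 0% when the workload completes as fast as under the
	 * performance governor, 100% when it completes as slowly
	 * as under the powersave governor.
	 */
	return 100.0 * (t_gov - t_perf) / (t_psave - t_perf);
}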

thanks,
Steve
Rafael J. Wysocki March 4, 2016, 1:14 a.m. UTC | #18
On Thu, Mar 3, 2016 at 5:47 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Mar 03, 2016 at 05:37:35PM +0100, Peter Zijlstra wrote:
>> On Thu, Mar 03, 2016 at 05:24:32PM +0100, Rafael J. Wysocki wrote:
>> > >> f = a * x + b
>
>> > If not, then I think it's reasonable to map the middle of the
>> > available frequency range to x = 0.5 and then we have b = 0 and a =
>> > (max_freq + min_freq) / 2.

That actually should be a = max_freq + min_freq, because I want
(max_freq + min_freq) / 2 = a / 2.

>> So I really think that approach falls apart on the low util bits; you
>> effectively always run above min speed, even if min is already vastly
>> over-provisioned.
>
> Ah nevermind, I cannot read. Yes that is worth trying I suppose. But the
> b=0,a=1 thing seems more natural still.

It is somewhat imbalanced, though.  If all of the values of x are
equally probable (or equally frequent), the probability of running
above the middle frequency is lower than the probability of running
below it.
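
A quick worked example of that imbalance, with made-up numbers: take
min_freq = 500 MHz and max_freq = 1500 MHz, so the middle of the range
is 1000 MHz.  Reading a = 1, b = 0 as f = x * max_freq, f exceeds
1000 MHz only for x > 2/3, so with uniformly distributed x the
frequency lands above the middle of the range one third of the time
and below it two thirds of the time.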
Rafael J. Wysocki March 7, 2016, 2:41 a.m. UTC | #19
On Thursday, March 03, 2016 01:37:59 PM Steve Muckle wrote:
> On 03/03/2016 12:20 PM, Rafael J. Wysocki wrote:
> >> Here is a comparison, with frequency invariance, of ondemand and
> >> interactive with schedfreq and schedutil. The first two columns (run and
> >> period) are omitted so the table will fit.
> >>
> >>         ondemand        interactive     schedfreq       schedutil
> >> busy %  OR      OH      OR      OH      OR      OH      OR      OH
> >> 1.00%   0       68.96%  0       100.04% 0       78.49%  0       95.86%
> >> 1.00%   0       25.04%  0       22.59%  0       72.56%  0       71.61%
> >> 10.00%  0       21.75%  0       63.08%  0       52.40%  0       41.78%
> >> 10.00%  0       12.17%  0       14.41%  0       17.33%  0       47.96%
> >> 10.00%  0       2.57%   0       2.17%   0       0.29%   0       26.03%
> >> 18.18%  0       12.39%  0       9.39%   0       17.34%  0       31.61%
> >> 19.82%  0       3.74%   0       3.42%   0       12.26%  0       29.46%
> >> 40.00%  2       6.26%   1       12.23%  0       6.15%   0       12.93%
> >> 40.00%  0       0.47%   0       0.05%   0       2.68%   2       14.08%
> >> 40.00%  0       0.60%   0       0.50%   0       1.22%   0       11.58%
> >> 55.56%  2       4.25%   5       5.97%   0       2.51%   0       7.70%
> >> 55.56%  0       1.89%   0       0.04%   0       1.71%   6       8.06%
> >> 55.56%  0       0.50%   0       0.47%   0       1.82%   5       6.94%
> >> 75.00%  2       1.65%   1       0.46%   0       0.26%   56      3.59%
> >> 75.00%  0       1.68%   0       0.05%   0       0.49%   21      3.94%
> >> 75.00%  0       0.28%   0       0.23%   0       0.62%   4       4.41%
> >>
> >> Aside from the 2nd and 3rd tests schedutil is showing decreased
> >> performance across the board. The fifth test is particularly bad.
> > 
> > I guess you mean performance in terms of the overhead?
> 
> Correct. This overhead metric describes how fast the workload completes,
> with 0% equaling the perf governor and 100% equaling the powersave
> governor. So it's a reflection of general performance using the
> governor. It's called "overhead" I imagine (the metric predates my
> involvement) as it is something introduced/caused by the policy of the
> governor.

If my understanding of the frequency invariant utilization idea is correct,
it is about re-scaling utilization so it is always relative to the capacity
at the max frequency.  If that's the case, then instead of using x = util_raw / max
we will use something like y = (util_raw / max) * (f / max_freq) (f - current
frequency).  This means that

(1) x = y * max_freq / f

Now, say we have an agreed-on (linear) formula for f depending on x:

f = a * x + b

and if you say "Look, if I substitute y for x in this formula, it doesn't
produce correct results", then I can only say "It doesn't, because it can't".

It *obviously* won't work, because instead of substituting y for x, you
need to substitute the right-hand side of (1) for it.  Then you'll get

f = a * y * max_freq / f + b

which is obviously nonlinear, so there's no hope that the same formula
will ever work for both "raw" and "frequency invariant" utilization.

To me this means that looking for a formula that will work for both is
just pointless and there are 3 possibilities:

(a) Look for a good enough formula to apply to "raw" utilization and then
    switch over when all architectures start to use "frequency invariant"
    utilization.
(b) Make all architectures use "frequency invariant" and then look for a
    working formula (seems rather less than realistic to me to be honest).
(c) Code for using either "raw" or "frequency invariant" depending on
    a callback flag or something like that.

I, personally, would go for (a) at this point, because that's the easiest
one, but (c) would be doable too IMO, so I don't care that much as long
as it is not (b).

Thanks,
Rafael

Peter Zijlstra March 8, 2016, 11:27 a.m. UTC | #20
On Mon, Mar 07, 2016 at 03:41:15AM +0100, Rafael J. Wysocki wrote:

> If my understanding of the frequency invariant utilization idea is correct,
> it is about re-scaling utilization so it is always relative to the capacity
> at the max frequency.

Right. So if a workload runs for 5ms @1GHz and 10ms @500MHz, it would
still result in the exact same utilization.

> If that's the case, then instead of using
>   x = util_raw / max
> we will use something like
>   y = (util_raw / max) * (f / max_freq) (f - current frequency).

I don't get the last term. Assuming fixed frequency hardware (we can't
really assume anything else) I get to:

  util = util_raw * (current_freq / max_freq)		(1)
  x = util / max					(2)

> so there's no hope that the same formula will ever work for both "raw"
> and "frequency invariant" utilization.

Here I agree; however, the above (current_freq / max_freq) term is easily
computable, and really the only thing we can assume if the arch doesn't
implement freq invariant accounting.

> (c) Code for using either "raw" or "frequency invariant" depending on
>     a callback flag or something like that.

Seeing how frequency invariance is an arch feature, and cpufreq drivers
are also typically arch specific, do we really need a flag at this
level?

In any case, I think the only difference between the two formulas should
be the addition of (1) for the platforms that do not already implement
frequency invariance.

That is actually correct for platforms which do as told with their DVFS
bits. And there's really not much else we can do short of implementing
the scheduler arch hook to do better.

> (b) Make all architectures use "frequency invariant" and then look for a
>     working formula (seems rather less than realistic to me to be honest).

There was a proposal to implement arch_scale_freq_capacity() as a weak
function and have it serve the cpufreq selected frequency for (1) so
that everything would default to that.

We didn't do that because that makes the function call and
multiplications unconditional. It's cheaper to add (1) to the cpufreq
side when selecting a freq rather than at every single time we update
the util statistics.
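
As a sketch of what adding (1) on the cpufreq side could look like
(function and parameter names are made up for illustration):

static unsigned long scale_util(unsigned long util_raw,
				unsigned int cur_freq,
				unsigned int max_freq)
{
	/* (1): util = util_raw * current_freq / max_freq */
	return util_raw * cur_freq / max_freq;
}

This runs once per frequency selection instead of on every single
utilization update.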
Rafael J. Wysocki March 8, 2016, 6 p.m. UTC | #21
On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Mar 07, 2016 at 03:41:15AM +0100, Rafael J. Wysocki wrote:
>
>> If my understanding of the frequency invariant utilization idea is correct,
>> it is about re-scaling utilization so it is always relative to the capacity
>> at the max frequency.
>
> Right. So if a workload runs for 5ms @1GHz and 10ms @500MHz, it would
> still result in the exact same utilization.
>
>> If that's the case, then instead of using
>>   x = util_raw / max
>> we will use something like
>>   y = (util_raw / max) * (f / max_freq) (f - current frequency).
>
> I don't get the last term.

The "(f - current frequency)" thing?  It doesn't belong to the
formula, sorry for the confusion.

So it is almost the same as your (1) below (except for the max in the
denominator), so my y is your x. :-)

> Assuming fixed frequency hardware (we can't
> really assume anything else) I get to:
>
>   util = util_raw * (current_freq / max_freq)           (1)
>   x = util / max                                        (2)
>
>> so there's no hope that the same formula will ever work for both "raw"
>> and "frequency invariant" utilization.
>
> Here I agree, however the above (current_freq / max_freq) term is easily
> computable, and really the only thing we can assume if the arch doesn't
> implement freq invariant accounting.

Right.

>> (c) Code for using either "raw" or "frequency invariant" depending on
>>     a callback flag or something like that.
>
> Seeing how frequency invariance is an arch feature, and cpufreq drivers
> are also typically arch specific, do we really need a flag at this
> level?

The next frequency is selected by the governor and that's why.  The
driver gets a frequency to set only.

Now, the governor needs to work with different platforms, so it needs
to know how to deal with the given one.

> In any case, I think the only difference between the two formulas should
> be the addition of (1) for the platforms that do not already implement
> frequency invariance.

OK

So I'm reading this as a statement that linear is a better
approximation for frequency invariant utilization.

This means that on platforms where the utilization is frequency
invariant we should use

  next_freq = a * x

(where x is given by (2) above) and for platforms where the
utilization is not frequency invariant

  next_freq = a * x * current_freq / max_freq

and all boils down to finding a.

Now, it seems reasonable for a to be something like (1 + 1/n) *
max_freq, so for non-frequency invariant we get

  next_freq = (1 + 1/n) * current_freq * x
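
(As an integer-math sketch of that last formula; n is just a made-up
tunable here and overflow handling is omitted:)

static unsigned int next_freq_raw(unsigned int cur_freq,
				  unsigned long util_raw,
				  unsigned long max, unsigned int n)
{
	/* next_freq = (1 + 1/n) * current_freq * util_raw / max */
	return (cur_freq + cur_freq / n) * util_raw / max;
}
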

> That is actually correct for platforms which do as told with their DVFS
> bits. And there's really not much else we can do short of implementing
> the scheduler arch hook to do better.
>
>> (b) Make all architectures use "frequency invariant" and then look for a
>>     working formula (seems rather less than realistic to me to be honest).
>
> There was a proposal to implement arch_scale_freq_capacity() as a weak
> function and have it serve the cpufreq selected frequency for (1) so
> that everything would default to that.
>
> We didn't do that because that makes the function call and
> multiplications unconditional. It's cheaper to add (1) to the cpufreq
> side when selecting a freq rather than at every single time we update
> the util statistics.

That's fine by me.

My point was that we need different formulas for frequency invariant
and the other basically.
Peter Zijlstra March 8, 2016, 7:26 p.m. UTC | #22
On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> > Seeing how frequency invariance is an arch feature, and cpufreq drivers
> > are also typically arch specific, do we really need a flag at this
> > level?
> 
> The next frequency is selected by the governor and that's why.  The
> driver gets a frequency to set only.
> 
> Now, the governor needs to work with different platforms, so it needs
> to know how to deal with the given one.

Ah, indeed. In any case, the availability of arch_sched_scale_freq() is
a compile time thingy, so we can, at compile time, know what to use.

> > In any case, I think the only difference between the two formulas should
> > be the addition of (1) for the platforms that do not already implement
> > frequency invariance.
> 
> OK
> 
> So I'm reading this as a statement that linear is a better
> approximation for frequency invariant utilization.

Well, (1) is what the scheduler does with frequency invariance, except
that it allows a more flexible definition of 'current frequency' by asking
for it every time we update the util stats.

But if a platform doesn't need this, i.e. it has a fixed frequency, or
simply doesn't provide anything like this, assuming we run at the
frequency we asked for is a reasonable assumption, no?

> This means that on platforms where the utilization is frequency
> invariant we should use
> 
>   next_freq = a * x
> 
> (where x is given by (2) above) and for platforms where the
> utilization is not frequency invariant
> 
>   next_freq = a * x * current_freq / max_freq
> 
> and all boils down to finding a.

Right.

> Now, it seems reasonable for a to be something like (1 + 1/n) *
> max_freq, so for non-frequency invariant we get
> 
>   next_freq = (1 + 1/n) * current_freq * x

This seems like a big leap; where does:

  (1 + 1/n) * max_freq

come from? And what is 'n'?
Rafael J. Wysocki March 8, 2016, 8:05 p.m. UTC | #23
On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
>> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> > Seeing how frequency invariance is an arch feature, and cpufreq drivers
>> > are also typically arch specific, do we really need a flag at this
>> > level?
>>
>> The next frequency is selected by the governor and that's why.  The
>> driver gets a frequency to set only.
>>
>> Now, the governor needs to work with different platforms, so it needs
>> to know how to deal with the given one.
>
> Ah, indeed. In any case, the availability of arch_sched_scale_freq() is
> a compile time thingy, so we can, at compile time, know what to use.
>
>> > In any case, I think the only difference between the two formulas should
>> > be the addition of (1) for the platforms that do not already implement
>> > frequency invariance.
>>
>> OK
>>
>> So I'm reading this as a statement that linear is a better
>> approximation for frequency invariant utilization.
>
> Well, (1) is what the scheduler does with frequency invariance, except
> that it allows a more flexible definition of 'current frequency' by asking
> for it every time we update the util stats.
>
> But if a platform doesn't need this, i.e. it has a fixed frequency, or
> simply doesn't provide anything like this, assuming we run at the
> frequency we asked for is a reasonable assumption, no?
>
>> This means that on platforms where the utilization is frequency
>> invariant we should use
>>
>>   next_freq = a * x
>>
>> (where x is given by (2) above) and for platforms where the
>> utilization is not frequency invariant
>>
>>   next_freq = a * x * current_freq / max_freq
>>
>> and all boils down to finding a.
>
> Right.

However, that doesn't seem to be in agreement with Steve's results
posted earlier in this thread.

Also theoretically, with frequency invariant, the only way you can get
to 100% utilization is by running at the max frequency, so the closer
to 100% you get, the faster you need to run to get any further.  That
indicates nonlinear to me.

>> Now, it seems reasonable for a to be something like (1 + 1/n) *
>> max_freq, so for non-frequency invariant we get
>>
>>   next_freq = (1 + 1/n) * current_freq * x
>
> This seems like a big leap; where does:
>
>   (1 + 1/n) * max_freq
>
> come from? And what is 'n'?

a = max_freq gives next_freq = max_freq for x = 1, but with that
choice of a you may never get to x = 1 with frequency invariant
because of the feedback effect mentioned above, so the 1/n produces
the extra boost needed for that (n is a positive integer).

Quite frankly, to me it looks like linear really is a better
approximation for "raw" utilization.  That is, for frequency invariant
x we should take:

  next_freq = a * x * max_freq / current_freq

(and if x is not frequency invariant, the right-hand side becomes a *
x).  Then, the extra boost needed to get to x = 1 for frequency
invariant is produced by the (max_freq / current_freq) factor that is
greater than 1 as long as we are not running at max_freq and a can be
chosen as max_freq.
Juri Lelli March 9, 2016, 10:15 a.m. UTC | #24
Hi,

sorry if I didn't reply yet. Trying to cope with jetlag and
talks/meetings these days :-). Let me see if I'm getting what you are
discussing, though.

On 08/03/16 21:05, Rafael J. Wysocki wrote:
> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:

[...]

> a = max_freq gives next_freq = max_freq for x = 1, but with that
> choice of a you may never get to x = 1 with frequency invariant
> because of the feedback effect mentioned above, so the 1/n produces
> the extra boost needed for that (n is a positive integer).
> 
> Quite frankly, to me it looks like linear really is a better
> approximation for "raw" utilization.  That is, for frequency invariant
> x we should take:
> 
>   next_freq = a * x * max_freq / current_freq
> 
> (and if x is not frequency invariant, the right-hand side becomes a *
> x).  Then, the extra boost needed to get to x = 1 for frequency
> invariant is produced by the (max_freq / current_freq) factor that is
> greater than 1 as long as we are not running at max_freq and a can be
> chosen as max_freq.
> 

Expanding terms again, your original formula (without the 1.1 factor of
the last version) was:

 next_freq = util / max_cap * max_freq

and this doesn't work when we have freq invariance since util won't go
over curr_cap.

What you propose above is to add another factor, so that we have:

 next_freq = util / max_cap * max_freq / curr_freq * max_freq

which should give us the opportunity to reach max_freq also with freq
invariance.

This should actually be the same of doing:

 next_freq = util / max_cap * max_cap / curr_cap * max_freq

We are basically scaling how much the cpu is busy at curr_cap back to
the 0..1024 scale. And we use this to select next_freq. Also, we can
simplify this to:

 next_freq = util / curr_cap * max_freq

and we save some ops.

However, if that is correct, I think we might have a problem, as we are
skewing OPP selection towards higher frequencies. Let's suppose we have
a platform with 3 OPPs:

  freq     cap
  1200     1024
  900      768
  600      512

As soon as a task reaches a utilization of 257 we will be selecting the
second OPP, as

 next_freq = 257 / 512 * 1200 ~ 602

While the cpu is only 50% busy in this case. And we will go at max OPP
when reaching ~492 (~64% of 768).
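
(Checking the arithmetic: 257 / 512 * 1200 = 602.3, and with
RELATION_L-style selection the lowest OPP at or above 602 is 900.)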

That said, I guess this might work as a first solution, but we will
probably need something better in the future. I understand Rafael's
concerns regarding margins, but it seems to me that some kind of
additional parameter will probably be needed anyway to fix this.
Just to say again how we handle this in schedfreq, with a -20% margin
applied to the lowest OPP we will get to the next one when utilization
reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
which is less aggressive and might be better IMHO.

Best,

- Juri
Peter Zijlstra March 9, 2016, 4:39 p.m. UTC | #25
On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote:
> >> This means that on platforms where the utilization is frequency
> >> invariant we should use
> >>
> >>   next_freq = a * x
> >>
> >> (where x is given by (2) above) and for platforms where the
> >> utilization is not frequency invariant
> >>
> >>   next_freq = a * x * current_freq / max_freq
> >>
> >> and all boils down to finding a.
> >
> > Right.
> 
>> However, that doesn't seem to be in agreement with Steve's results
> posted earlier in this thread.

I could not make anything of those numbers.

> Also theoretically, with frequency invariant, the only way you can get
> to 100% utilization is by running at the max frequency, so the closer
> to 100% you get, the faster you need to run to get any further.  That
> indicates nonlinear to me.

I'm not seeing that, you get that by using a > 1. No need for
non-linear.

> >> Now, it seems reasonable for a to be something like (1 + 1/n) *
> >> max_freq, so for non-frequency invariant we get
> >>
> >>   next_freq = (1 + 1/n) * current_freq * x
> >
> > This seems like a big leap; where does:
> >
> >   (1 + 1/n) * max_freq
> >
> > come from? And what is 'n'?

> a = max_freq gives next_freq = max_freq for x = 1,

next_freq = a        * x * current_freq / max_freq

  [ a := max_freq, x := 1 ] ->

          = max_freq * 1 * current_freq / max_freq
	  = current_freq

	  != max_freq

But I think I see what you're saying; because at x = 1,
current_frequency must be max_frequency. Per your earlier point.

> but with that choice of a you may never get to x = 1 with frequency
> invariant because of the feedback effect mentioned above, so the 1/n
> produces the extra boost needed for that (n is a positive integer).

OK, so that gets us:

	a = (1 + 1/n) ; n > 0

[ I would not have chosen (1 + 1/n), but let's stick to that ]

So for n = 4 that gets you: a = 1.25, which effectively gets you an 80%
utilization tipping point. That is, 1.25 * .8 = 1, IOW you'll pick the
next frequency (assuming RELATION_L like selection).

Together this gets you:

	next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq
	          = (1 + 1/n) * x * current_freq

Again, with n = 4, x > .8 will result in a next_freq > current_freq, and
hence (RELATION_L) pick a higher one.

> Quite frankly, to me it looks like linear really is a better
> approximation for "raw" utilization.  That is, for frequency invariant
> x we should take:
> 
>   next_freq = a * x * max_freq / current_freq

(it's very confusing how you use 'x' for both invariant and
non-invariant).

That doesn't make sense, remember:

	util = \Sum_i u_i * freq_i / max_freq		(1)

Which for systems where freq_i is constant reduces to:

	util = util_raw * current_freq / max_freq	(2)

But you cannot reverse this. IOW you cannot try and divide out
current_freq on a frequency invariant metric.

So going by:

	next_freq = (1 + 1/n) * max_freq * util		(3)

if we substitute (2) into (3) we get:

	          = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq
		  = (1 + 1/n) * current_freq * util_raw	(4)

Which gets you two formulas with the same general behaviour. As (2) is
the only approximation of (1) we can make.
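
A quick numeric check with made-up numbers: take n = 4, max_freq =
2000 MHz, current_freq = 1000 MHz and util_raw = 0.6.  Then (2) gives
util = 0.6 * 1000 / 2000 = 0.3, so (3) yields 1.25 * 2000 * 0.3 =
750 MHz and (4) yields 1.25 * 1000 * 0.6 = 750 MHz, i.e. the two
forms agree as long as (2) holds.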


Rafael J. Wysocki March 9, 2016, 11:28 p.m. UTC | #26
On Wed, Mar 9, 2016 at 5:39 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote:
>> >> This means that on platforms where the utilization is frequency
>> >> invariant we should use
>> >>
>> >>   next_freq = a * x
>> >>
>> >> (where x is given by (2) above) and for platforms where the
>> >> utilization is not frequency invariant
>> >>
>> >>   next_freq = a * x * current_freq / max_freq
>> >>
>> >> and all boils down to finding a.
>> >
>> > Right.
>>
>> However, that doesn't seem to be in agreement with Steve's results
>> posted earlier in this thread.
>
> I could not make anything of those numbers.
>
>> Also theoretically, with frequency invariant, the only way you can get
>> to 100% utilization is by running at the max frequency, so the closer
>> to 100% you get, the faster you need to run to get any further.  That
>> indicates nonlinear to me.
>
> I'm not seeing that, you get that by using a > 1. No need for
> non-linear.

OK

>> >> Now, it seems reasonable for a to be something like (1 + 1/n) *
>> >> max_freq, so for non-frequency invariant we get
>> >>
>> >>   next_freq = (1 + 1/n) * current_freq * x

(*) (see below)

>> > This seems like a big leap; where does:
>> >
>> >   (1 + 1/n) * max_freq
>> >
>> > come from? And what is 'n'?
>
>> a = max_freq gives next_freq = max_freq for x = 1,
>
> next_freq = a        * x * current_freq / max_freq
>
>   [ a := max_freq, x := 1 ] ->
>
>           = max_freq * 1 * current_freq / max_freq
>           = current_freq
>
>           != max_freq
>
> But I think I see what you're saying; because at x = 1,
> current_frequency must be max_frequency. Per your earlier point.

Correct.

>> but with that choice of a you may never get to x = 1 with frequency
>> invariant because of the feedback effect mentioned above, so the 1/n
>> produces the extra boost needed for that (n is a positive integer).
>
> OK, so that gets us:
>
>         a = (1 + 1/n) ; n > 0
>
> [ I would not have chosen (1 + 1/n), but let's stick to that ]

Well, what would you choose then? :-)

> So for n = 4 that gets you: a = 1.25, which effectively gets you an 80%
> utilization tipping point. That is, 1.25 * .8 = 1, IOW you'll pick the
> next frequency (assuming RELATION_L like selection).
>
> Together this gets you:
>
>         next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq
>                   = (1 + 1/n) * x * current_freq

That seems to be what I said above (*), isn't it?

> Again, with n = 4, x > .8 will result in a next_freq > current_freq, and
> hence (RELATION_L) pick a higher one.

OK

>> Quite frankly, to me it looks like linear really is a better
>> approximation for "raw" utilization.  That is, for frequency invariant
>> x we should take:
>>
>>   next_freq = a * x * max_freq / current_freq
>
> (it's very confusing how you use 'x' for both invariant and
> non-invariant).
>
> That doesn't make sense, remember:
>
>         util = \Sum_i u_i * freq_i / max_freq           (1)
>
> Which for systems where freq_i is constant reduces to:
>
>         util = util_raw * current_freq / max_freq       (2)
>
> But you cannot reverse this. IOW you cannot try and divide out
> current_freq on a frequency invariant metric.

I see.

> So going by:
>
>         next_freq = (1 + 1/n) * max_freq * util         (3)

I think that should be

  next_freq = (1 + 1/n) * max_freq * util / max

(where max is the second argument of cpufreq_update_util) or the
dimensions on both sides don't match.

> if we substitute (2) into (3) we get:
>
>                   = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq
>                   = (1 + 1/n) * current_freq * util_raw (4)
>
> Which gets you two formulas with the same general behaviour. As (2) is
> the only approximation of (1) we can make.

OK

So since utilization is not frequency invariant in the current
mainline (or linux-next for that matter) AFAIC, I'm going to use the
following in the next version of the schedutil patch series:

  next_freq = 1.25 * current_freq * util_raw / max

where util_raw and max are what I get from cpufreq_update_util().

1.25 is for the 80% tipping point which I think is reasonable.
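
As a minimal integer-math sketch of that formula (illustration only,
not the actual patch):

static unsigned int get_next_freq(unsigned int cur_freq,
				  unsigned long util_raw,
				  unsigned long max)
{
	/* next_freq = 1.25 * current_freq * util_raw / max */
	return (cur_freq + (cur_freq >> 2)) * util_raw / max;
}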
Rafael J. Wysocki March 9, 2016, 11:41 p.m. UTC | #27
On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> Hi,
>
> sorry if I didn't reply yet. Trying to cope with jetlag and
> talks/meetings these days :-). Let me see if I'm getting what you are
> discussing, though.
>
> On 08/03/16 21:05, Rafael J. Wysocki wrote:
>> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
>> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> [...]
>
>> a = max_freq gives next_freq = max_freq for x = 1, but with that
>> choice of a you may never get to x = 1 with frequency invariant
>> because of the feedback effect mentioned above, so the 1/n produces
>> the extra boost needed for that (n is a positive integer).
>>
>> Quite frankly, to me it looks like linear really is a better
>> approximation for "raw" utilization.  That is, for frequency invariant
>> x we should take:
>>
>>   next_freq = a * x * max_freq / current_freq
>>
>> (and if x is not frequency invariant, the right-hand side becomes a *
>> x).  Then, the extra boost needed to get to x = 1 for frequency
>> invariant is produced by the (max_freq / current_freq) factor that is
>> greater than 1 as long as we are not running at max_freq and a can be
>> chosen as max_freq.
>>
>
> Expanding terms again, your original formula (without the 1.1 factor of
> the last version) was:
>
>  next_freq = util / max_cap * max_freq
>
> and this doesn't work when we have freq invariance since util won't go
> over curr_cap.

Can you please remind me what curr_cap is?

> What you propose above is to add another factor, so that we have:
>
>  next_freq = util / max_cap * max_freq / curr_freq * max_freq
>
> which should give us the opportunity to reach max_freq also with freq
> invariance.
>
> This should actually be the same of doing:
>
>  next_freq = util / max_cap * max_cap / curr_cap * max_freq
>
> We are basically scaling how much the cpu is busy at curr_cap back to
> the 0..1024 scale. And we use this to select next_freq. Also, we can
> simplify this to:
>
>  next_freq = util / curr_cap * max_freq
>
> and we save some ops.
>
> However, if that is correct, I think we might have a problem, as we are
> skewing OPP selection towards higher frequencies. Let's suppose we have
> a platform with 3 OPPs:
>
>   freq     cap
>   1200     1024
>   900      768
>   600      512
>
> As soon as a task reaches a utilization of 257 we will be selecting the
> second OPP, as
>
>  next_freq = 257 / 512 * 1200 ~ 602
>
> While the cpu is only 50% busy in this case. And we will go at max OPP
> when reaching ~492 (~64% of 768).
>
> That said, I guess this might work as a first solution, but we will
> probably need something better in the future. I understand Rafael's
> concerns regarding margins, but it seems to me that some kind of
> additional parameter will probably be needed anyway to fix this.
> Just to say again how we handle this in schedfreq, with a -20% margin
> applied to the lowest OPP we will get to the next one when utilization
> reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
> which is less aggressive and might be better IMHO.

Well, Peter says that my idea is incorrect, so I'll go for

  next_freq = C * current_freq * util_raw / max

where C > 1 (and likely C < 1.5) instead.

That means C has to be determined somehow or guessed.  The 80% tipping
point condition seems reasonable to me, though, which leads to C =
1.25.
Vincent Guittot March 10, 2016, 3:44 a.m. UTC | #28
On 10 March 2016 at 06:28, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Mar 9, 2016 at 5:39 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, Mar 08, 2016 at 09:05:50PM +0100, Rafael J. Wysocki wrote:
>>> >> This means that on platforms where the utilization is frequency
>>> >> invariant we should use
>>> >>
>>> >>   next_freq = a * x
>>> >>
>>> >> (where x is given by (2) above) and for platforms where the
>>> >> utilization is not frequency invariant
>>> >>
>>> >>   next_freq = a * x * current_freq / max_freq
>>> >>
>>> >> and all boils down to finding a.
>>> >
>>> > Right.
>>>
>>> However, that doesn't seem to be in agreement with Steve's results
>>> posted earlier in this thread.
>>
>> I could not make anything of those numbers.
>>
>>> Also theoretically, with frequency invariant, the only way you can get
>>> to 100% utilization is by running at the max frequency, so the closer
>>> to 100% you get, the faster you need to run to get any further.  That
>>> indicates nonlinear to me.
>>
>> I'm not seeing that, you get that by using a > 1. No need for
>> non-linear.
>
> OK
>
>>> >> Now, it seems reasonable for a to be something like (1 + 1/n) *
>>> >> max_freq, so for non-frequency invariant we get
>>> >>
>>> >>   next_freq = (1 + 1/n) * current_freq * x
>
> (*) (see below)
>
>>> > This seems like a big leap; where does:
>>> >
>>> >   (1 + 1/n) * max_freq
>>> >
>>> > come from? And what is 'n'?
>>
>>> a = max_freq gives next_freq = max_freq for x = 1,
>>
>> next_freq = a        * x * current_freq / max_freq
>>
>>   [ a := max_freq, x := 1 ] ->
>>
>>           = max_freq * 1 * current_freq / max_freq
>>           = current_freq
>>
>>           != max_freq
>>
>> But I think I see what you're saying; because at x = 1,
>> current_frequency must be max_frequency. Per your earlier point.
>
> Correct.
>
>>> but with that choice of a you may never get to x = 1 with frequency
>>> invariant because of the feedback effect mentioned above, so the 1/n
>>> produces the extra boost needed for that (n is a positive integer).
>>
>> OK, so that gets us:
>>
>>         a = (1 + 1/n) ; n > 0
>>
>> [ I would not have chosen (1 + 1/n), but let's stick to that ]
>
> Well, what would you choose then? :-)
>
>> So for n = 4 that gets you: a = 1.25, which effectively gets you an 80%
>> utilization tipping point. That is, 1.25 * .8 = 1, IOW you'll pick the
>> next frequency (assuming RELATION_L like selection).
>>
>> Together this gets you:
>>
>>         next_freq = (1 + 1/n) * max_freq * x * current_freq / max_freq
>>                   = (1 + 1/n) * x * current_freq
>
> That seems to be what I said above (*), isn't it?
>
>> Again, with n = 4, x > .8 will result in a next_freq > current_freq, and
>> hence (RELATION_L) pick a higher one.
>
> OK
>
>>> Quite frankly, to me it looks like linear really is a better
>>> approximation for "raw" utilization.  That is, for frequency invariant
>>> x we should take:
>>>
>>>   next_freq = a * x * max_freq / current_freq
>>
>> (it's very confusing how you use 'x' for both invariant and
>> non-invariant).
>>
>> That doesn't make sense, remember:
>>
>>         util = \Sum_i u_i * freq_i / max_freq           (1)
>>
>> Which for systems where freq_i is constant reduces to:
>>
>>         util = util_raw * current_freq / max_freq       (2)
>>
>> But you cannot reverse this. IOW you cannot try and divide out
>> current_freq on a frequency invariant metric.
>
> I see.
>
>> So going by:
>>
>>         next_freq = (1 + 1/n) * max_freq * util         (3)
>
> I think that should be
>
>   next_freq = (1 + 1/n) * max_freq * util / max
>
> (where max is the second argument of cpufreq_update_util) or the
> dimensions on both sides don't match.
>
>> if we substitute (2) into (3) we get:
>>
>>                   = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq
>>                   = (1 + 1/n) * current_freq * util_raw (4)
>>
>> Which gets you two formulas with the same general behaviour. As (2) is
>> the only approximation of (1) we can make.
>
> OK
>
> So since utilization is not frequency invariant in the current
> mainline (or linux-next for that matter) AFAIC, I'm going to use the
> following in the next version of the schedutil patch series:
>
>   next_freq = 1.25 * current_freq * util_raw / max
>
> where util_raw and max are what I get from cpufreq_update_util().
>
> 1.25 is for the 80% tipping point which I think is reasonable.


We have the arch_scale_freq_capacity function that is arch dependent
and can be used to merge the 2 formulas that were described by Peter
above.
By default, arch_scale_freq_capacity returns SCHED_CAPACITY_SCALE, which
is the max capacity,
but when arch_scale_freq_capacity is defined by an architecture,
arch_scale_freq_capacity returns current_freq * max_capacity / max_freq

So can't we use arch_scale_freq in your formula? Taking your formula
above, it becomes:
next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity()

Without the invariance feature, we have the same formula as above:
next_freq = 1.25 * current_freq * util_raw / max, because
SCHED_CAPACITY_SCALE is the max capacity.

With the invariance feature, we have next_freq = 1.25 * current_freq *
util / (current_freq * max_capacity / max_freq) = 1.25 * util * max_freq /
max, which is the formula that has to be used with frequency invariant
utilization.

So we have one formula that works for both configurations (this is not
really optimized for the invariant case, because we multiply and then
divide by current_freq in two different places, but it's better than a
wrong formula).
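
For illustration only, a sketch of that merged formula in code,
assuming arch_scale_freq_capacity() (with its sched_domain/cpu
arguments) were visible to the governor, which, as noted below, it
currently is not:

static unsigned long next_freq_merged(int cpu, unsigned long cur_freq,
				      unsigned long util)
{
	/* next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity() */
	return (cur_freq + (cur_freq >> 2)) * util /
		arch_scale_freq_capacity(NULL, cpu);
}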

Now, arch_scale_freq_capacity is available in the kernel/sched/sched.h
header file, which can only be accessed by scheduler code...

Maybe we can pass the arch_scale_freq_capacity value instead of the max
one as a parameter of the update_util function prototype.

Vincent
Juri Lelli March 10, 2016, 4:30 a.m. UTC | #29
On 10/03/16 00:41, Rafael J. Wysocki wrote:
> On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > Hi,
> >
> > sorry if I didn't reply yet. Trying to cope with jetlag and
> > talks/meetings these days :-). Let me see if I'm getting what you are
> > discussing, though.
> >
> > On 08/03/16 21:05, Rafael J. Wysocki wrote:
> >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
> >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > [...]
> >
> >> a = max_freq gives next_freq = max_freq for x = 1, but with that
> >> choice of a you may never get to x = 1 with frequency invariant
> >> because of the feedback effect mentioned above, so the 1/n produces
> >> the extra boost needed for that (n is a positive integer).
> >>
> >> Quite frankly, to me it looks like linear really is a better
> >> approximation for "raw" utilization.  That is, for frequency invariant
> >> x we should take:
> >>
> >>   next_freq = a * x * max_freq / current_freq
> >>
> >> (and if x is not frequency invariant, the right-hand side becomes a *
> >> x).  Then, the extra boost needed to get to x = 1 for frequency
> >> invariant is produced by the (max_freq / current_freq) factor that is
> >> greater than 1 as long as we are not running at max_freq and a can be
> >> chosen as max_freq.
> >>
> >
> > Expanding terms again, your original formula (without the 1.1 factor of
> > the last version) was:
> >
> >  next_freq = util / max_cap * max_freq
> >
> > and this doesn't work when we have freq invariance since util won't go
> > over curr_cap.
> 
> Can you please remind me what curr_cap is?
> 

The capacity at current frequency.

> > What you propose above is to add another factor, so that we have:
> >
> >  next_freq = util / max_cap * max_freq / curr_freq * max_freq
> >
> > which should give us the opportunity to reach max_freq also with freq
> > invariance.
> >
> > This should actually be the same of doing:
> >
> >  next_freq = util / max_cap * max_cap / curr_cap * max_freq
> >
> > We are basically scaling how much the cpu is busy at curr_cap back to
> > the 0..1024 scale. And we use this to select next_freq. Also, we can
> > simplify this to:
> >
> >  next_freq = util / curr_cap * max_freq
> >
> > and we save some ops.
> >
> > However, if that is correct, I think we might have a problem, as we are
> > skewing OPP selection towards higher frequencies. Let's suppose we have
> > a platform with 3 OPPs:
> >
> >   freq     cap
> >   1200     1024
> >   900      768
> >   600      512
> >
> > As soon as a task reaches a utilization of 257 we will be selecting the
> > second OPP, as
> >
> >  next_freq = 257 / 512 * 1200 ~ 602
> >
> > While the cpu is only 50% busy in this case. And we will go at max OPP
> > when reaching ~492 (~64% of 768).
> >
> > That said, I guess this might work as a first solution, but we will
> > probably need something better in the future. I understand Rafael's
> > concerns regarding margins, but it seems to me that some kind of
> > additional parameter will probably be needed anyway to fix this.
> > Just to say again how we handle this in schedfreq, with a -20% margin
> > applied to the lowest OPP we will get to the next one when utilization
> > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
> > which is less aggressive and might be better IMHO.
> 
> Well, Peter says that my idea is incorrect, so I'll go for
> 
>   next_freq = C * current_freq * util_raw / max
> 
> where C > 1 (and likely C < 1.5) instead.
> 
> That means C has to be determined somehow or guessed.  The 80% tipping
> point condition seems reasonable to me, though, which leads to C =
> 1.25.
> 

Right. So, when using freq. invariant util we have:

 next_freq = C * curr_freq * util / curr_cap

as

 util_raw = util * max / curr_cap

What Vincent is saying makes sense, though. If we use
arch_scale_freq_capacity() as denominator instead of max, we can use a
single formula for both cases.

Best,

- Juri
Peter Zijlstra March 10, 2016, 8:43 a.m. UTC | #30
On Thu, Mar 10, 2016 at 12:28:52AM +0100, Rafael J. Wysocki wrote:
> > [ I would not have chosen (1 + 1/n), but let's stick to that ]
> 
> Well, what would you choose then? :-)

1/p ; 0 < p < 1

or so. Where p then represents the percentile threshold where you want
to bump to the next freq.
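
For example, p = 0.8 (bump at 80% utilization) gives a = 1/0.8 = 1.25,
the same value as (1 + 1/n) with n = 4.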

> I think that should be
> 
>   next_freq = (1 + 1/n) * max_freq * util / max
> 
> (where max is the second argument of cpufreq_update_util) or the
> dimensions on both sides don't match.

Well yes, but so far we were treating util (and util_raw) as 0 < u < 1,
values, so already normalized against max.

But yes..

> > if we substitute (2) into (3) we get:
> >
> >                   = (1 + 1/n) * max_freq * util_raw * current_freq / max_freq
> >                   = (1 + 1/n) * current_freq * util_raw (4)
> >
> > Which gets you two formulas with the same general behaviour. As (2) is
> > the only approximation of (1) we can make.
> 
> OK
> 
> So since utilization is not frequency invariant in the current
> mainline (or linux-next for that matter) AFAIC, I'm going to use the
> following in the next version of the schedutil patch series:
> 
>   next_freq = 1.25 * current_freq * util_raw / max
> 
> where util_raw and max are what I get from cpufreq_update_util().
> 
> 1.25 is for the 80% tipping point which I think is reasonable.

OK.
Peter Zijlstra March 10, 2016, 10:07 a.m. UTC | #31
On Thu, Mar 10, 2016 at 10:44:21AM +0700, Vincent Guittot wrote:
> We have the arch_scale_freq_capacity function that is arch dependent
> and can be used to merge the 2 formulas that were described by Peter
> above.
> By default, arch_scale_freq_capacity returns SCHED_CAPACITY_SCALE, which
> is the max capacity,
> but when arch_scale_freq_capacity is defined by an architecture,

> arch_scale_freq_capacity returns current_freq * max_capacity / max_freq

However, current_freq is a very fluid thing, it might (and will) change
very rapidly on some platforms.

This is the same point I made earlier, you cannot try and divide out
current_freq from the invariant measure.

> So can't we use arch_scale_freq in your formula? Taking your formula
> above, it becomes:
> next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity()

No, that cannot work, nor makes any sense, per the above.

> With invariance feature, we have:
>
>   next_freq = 1.25 * current_freq * util / (current_freq*max_capacity/max_freq)
>             = 1.25 * util * max_freq / max
>
> which is the formula that has to be used with frequency invariant
> utilization.

Wrong, you cannot talk about current_freq in the invariant case.

> Maybe we can pass the arch_scale_freq_capacity value instead of the max
> one as a parameter of the update_util function prototype.

No, since it's a compile time thing, we can simply do:

#ifdef arch_scale_freq_capacity
	next_freq = (1 + 1/n) * max_freq * (util / max)
#else
	next_freq = (1 + 1/n) * current_freq * (util_raw / max)
#endif
Vincent Guittot March 10, 2016, 10:26 a.m. UTC | #32
On 10 March 2016 at 17:07, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Mar 10, 2016 at 10:44:21AM +0700, Vincent Guittot wrote:
>> We have the arch_scale_freq_capacity function that is arch dependent
>> and can be used to merge the 2 formulas that were described by Peter
>> above.
>> By default, arch_scale_freq_capacity returns SCHED_CAPACITY_SCALE, which
>> is the max capacity,
>> but when arch_scale_freq_capacity is defined by an architecture,
>
>> arch_scale_freq_capacity returns current_freq * max_capacity / max_freq
>
> However, current_freq is a very fluid thing, it might (and will) change
> very rapidly on some platforms.
>
> This is the same point I made earlier, you cannot try and divide out
> current_freq from the invariant measure.
>
>> So can't we use arch_scale_freq in your formula? Taking your formula
>> above, it becomes:
>> next_freq = 1.25 * current_freq * util / arch_scale_freq_capacity()
>
> No, that cannot work, nor makes any sense, per the above.
>
>> With invariance feature, we have:
>>
>>   next_freq = 1.25 * current_freq * util / (current_freq*max_capacity/max_freq)
>>             = 1.25 * util * max_freq / max
>>
>> which is the formula that has to be used with frequency invariant
>> utilization.
>
> Wrong, you cannot talk about current_freq in the invariant case.
>
>> Maybe we can pass the arch_scale_freq_capacity value instead of the max
>> one as a parameter of the update_util function prototype.
>
> No, since it's a compile time thing, we can simply do:
>
> #ifdef arch_scale_freq_capacity
>         next_freq = (1 + 1/n) * max_freq * (util / max)
> #else
>         next_freq = (1 + 1/n) * current_freq * (util_raw / max)
> #endif

Selecting the formula at compile time is clearly better. I wrongly
thought that it couldn't be accepted as a solution.
Peter Zijlstra March 10, 2016, 10:30 a.m. UTC | #33
On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote:

> > No, since it's a compile time thing, we can simply do:
> >
> > #ifdef arch_scale_freq_capacity
> >         next_freq = (1 + 1/n) * max_freq * (util / max)
> > #else
> >         next_freq = (1 + 1/n) * current_freq * (util_raw / max)
> > #endif
> 
> Selecting the formula at compile time is clearly better. I wrongly thought
> that it couldn't be accepted as a solution.

Well, it's bound to get more 'interesting', since I foresee implementations
not always actually doing the invariant thing.

Take for example the thing I sent:

  lkml.kernel.org/r/20160303162829.GB6375@twins.programming.kicks-ass.net

it shows both why you cannot talk about current_freq and that the
above needs a little more help (for the !X86_FEATURE_APERFMPERF case).

But the !arch_scale_freq_capacity case should indeed be that simple.
Peter Zijlstra March 10, 2016, 10:56 a.m. UTC | #34
On Thu, Mar 10, 2016 at 11:30:08AM +0100, Peter Zijlstra wrote:
> On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote:
> 
> > > No, since it's a compile time thing, we can simply do:
> > >
> > > #ifdef arch_scale_freq_capacity
> > >         next_freq = (1 + 1/n) * max_freq * (util / max)
> > > #else
> > >         next_freq = (1 + 1/n) * current_freq * (util_raw / max)
> > > #endif
> > 
> > selecting formula at compilation is clearly better. I wrongly thought that
> > it can't be accepted as a solution.
> 
> Well, it's bound to get more 'interesting', since I foresee implementations
> not always actually doing the invariant thing.
> 
> Take for example the thing I sent:
> 
>   lkml.kernel.org/r/20160303162829.GB6375@twins.programming.kicks-ass.net
> 
> it shows both why you cannot talk about current_freq and that the
> above needs a little more help (for the !X86_FEATURE_APERFMPERF case).
> 
> But the !arch_scale_freq_capacity case should indeed be that simple.

Maybe something like:

#ifdef arch_scale_freq_capacity
#ifndef arch_scale_freq_invariant
#define arch_scale_freq_invariant()	(true)
#endif
#else /* arch_scale_freq_capacity */
#define arch_scale_freq_invariant()	(false)
#endif

	if (arch_scale_freq_invariant())

And have archs with a conditional arch_scale_freq_capacity()
implementation provide an arch_scale_freq_invariant() implementation.
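
The governor side could then be a sketch as simple as (variable names
made up):

	if (arch_scale_freq_invariant())
		next_f = (max_f + (max_f >> 2)) * util / max;	/* invariant util */
	else
		next_f = (cur_f + (cur_f >> 2)) * util_raw / max;	/* raw util */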
Rafael J. Wysocki March 10, 2016, 9:01 p.m. UTC | #35
On Thursday, March 10, 2016 11:30:34 AM Juri Lelli wrote:
> On 10/03/16 00:41, Rafael J. Wysocki wrote:
> > On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > > Hi,
> > >
> > > sorry if I didn't reply yet. Trying to cope with jetlag and
> > > talks/meetings these days :-). Let me see if I'm getting what you are
> > > discussing, though.
> > >
> > > On 08/03/16 21:05, Rafael J. Wysocki wrote:
> > >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
> > >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > [...]
> > >
> > >> a = max_freq gives next_freq = max_freq for x = 1, but with that
> > >> choice of a you may never get to x = 1 with frequency invariance
> > >> because of the feedback effect mentioned above, so the 1/n produces
> > >> the extra boost needed for that (n is a positive integer).
> > >>
> > >> Quite frankly, to me it looks like linear really is a better
> > >> approximation for "raw" utilization.  That is, for frequency-invariant
> > >> x we should take:
> > >>
> > >>   next_freq = a * x * max_freq / current_freq
> > >>
> > >> (and if x is not frequency invariant, the right-hand side becomes a *
> > >> x).  Then, the extra boost needed to get to x = 1 in the frequency-
> > >> invariant case is produced by the (max_freq / current_freq) factor that is
> > >> greater than 1 as long as we are not running at max_freq and a can be
> > >> chosen as max_freq.
> > >>
> > >
> > > Expanding terms again, your original formula (without the 1.1 factor of
> > > the last version) was:
> > >
> > >  next_freq = util / max_cap * max_freq
> > >
> > > and this doesn't work when we have freq invariance since util won't go
> > > over curr_cap.
> > 
> > Can you please remind me what curr_cap is?
> > 
> 
> The capacity at current frequency.

I see, thanks!

> > > What you propose above is to add another factor, so that we have:
> > >
> > >  next_freq = util / max_cap * max_freq / curr_freq * max_freq
> > >
> > > which should give us the opportunity to reach max_freq also with freq
> > > invariance.
> > >
> > > This should actually be the same as doing:
> > >
> > >  next_freq = util / max_cap * max_cap / curr_cap * max_freq
> > >
> > > We are basically scaling how much the cpu is busy at curr_cap back to
> > > the 0..1024 scale. And we use this to select next_freq. Also, we can
> > > simplify this to:
> > >
> > >  next_freq = util / curr_cap * max_freq
> > >
> > > and we save some ops.
> > >
> > > However, if that is correct, I think we might have a problem, as we are
> > > skewing OPP selection towards higher frequencies. Let's suppose we have
> > > a platform with 3 OPPs:
> > >
> > >   freq     cap
> > >   1200     1024
> > >   900      768
> > >   600      512
> > >
> > > As soon as a task reaches a utilization of 257 we will be selecting the
> > > second OPP, as
> > >
> > >  next_freq = 257 / 512 * 1200 ~ 602
> > >
> > > while the cpu is only 50% busy in this case. And we will go to the max
> > > OPP when reaching ~492 (~64% of 768).
> > >
> > > That said, I guess this might work as a first solution, but we will
> > > probably need something better in the future. I understand Rafael's
> > > concerns regarding margins, but it seems to me that some kind of
> > > additional parameter will probably be needed anyway to fix this.
> > > Just to say again how we handle this in schedfreq: with a -20% margin
> > > applied to the lowest OPP, we will get to the next one when utilization
> > > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
> > > which is less aggressive and might be better IMHO.
> > 
> > Well, Peter says that my idea is incorrect, so I'll go for
> > 
> >   next_freq = C * current_freq * util_raw / max
> > 
> > where C > 1 (and likely C < 1.5) instead.
> > 
> > That means C has to be determined somehow or guessed.  The 80% tipping
> > point condition seems reasonable to me, though, which leads to C =
> > 1.25.
> > 
> 
> Right. So, when using freq. invariant util we have:
> 
>  next_freq = C * curr_freq * util / curr_cap
> 
> as
> 
>  util_raw = util * max / curr_cap
> 
> What Vincent is saying makes sense, though. If we use
> arch_scale_freq_capacity() as the denominator instead of max, we can use a
> single formula for both cases.

I'm not convinced about that yet, but let me think about it some more. :-)

Thanks,
Rafael
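
For completeness, the C = 1.25 variant maps directly onto integer
arithmetic (a sketch only, not taken from the posted patch; the helper
name is made up):

static unsigned int next_freq_raw(unsigned int curr_freq,
				  unsigned long util_raw, unsigned long max)
{
	/* C = 1.25, computed as f + f/4 */
	return (curr_freq + (curr_freq >> 2)) * util_raw / max;
}

With Juri's example table and max = 1024, this puts the tipping point
at 80% busy as intended: at curr_freq = 600, util_raw above ~819 makes
the result exceed 600, so RELATION_L selects the 900 MHz OPP.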

Rafael J. Wysocki March 10, 2016, 10:28 p.m. UTC | #36
On Thursday, March 10, 2016 11:56:14 AM Peter Zijlstra wrote:
> On Thu, Mar 10, 2016 at 11:30:08AM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 10, 2016 at 05:23:54PM +0700, Vincent Guittot wrote:
> > 
> > > > No, since it's a compile-time thing, we can simply do:
> > > >
> > > > #ifdef arch_scale_freq_capacity
> > > >         next_freq = (1 + 1/n) * max_freq * (util / max)
> > > > #else
> > > >         next_freq = (1 + 1/n) * current_freq * (util_raw / max)
> > > > #endif
> > > 
> > > Selecting the formula at compile time is clearly better. I wrongly thought
> > > that it couldn't be accepted as a solution.
> > 
> > Well, it's bound to get more 'interesting', since I foresee implementations
> > not always actually doing the invariant thing.
> > 
> > Take for example the thing I sent:
> > 
> >   lkml.kernel.org/r/20160303162829.GB6375@twins.programming.kicks-ass.net
> > 
> > It shows both why you cannot talk about current_freq and that the
> > above needs a little more help (for the !X86_FEATURE_APERFMPERF case).
> > 
> > But the !arch_scale_freq_capacity case should indeed be that simple.
> 
> Maybe something like:
> 
> #ifdef arch_scale_freq_capacity
> #ifndef arch_scale_freq_invariant
> #define arch_scale_freq_invariant()	(true)
> #endif
> #else /* arch_scale_freq_capacity */
> #define arch_scale_freq_invariant()	(false)
> #endif
> 
> 	if (arch_scale_freq_invariant())
> 
> And have archs with a conditional arch_scale_freq_capacity()
> implementation provide an arch_scale_freq_invariant() implementation.

Yeah, looks workable to me.

Michael Turquette March 10, 2016, 11:19 p.m. UTC | #37
Quoting Rafael J. Wysocki (2016-03-09 15:41:34)
> On Wed, Mar 9, 2016 at 11:15 AM, Juri Lelli <juri.lelli@arm.com> wrote:
> > Hi,
> >
> > sorry I haven't replied yet. Trying to cope with jetlag and
> > talks/meetings these days :-). Let me see if I'm getting what you are
> > discussing, though.
> >
> > On 08/03/16 21:05, Rafael J. Wysocki wrote:
> >> On Tue, Mar 8, 2016 at 8:26 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Tue, Mar 08, 2016 at 07:00:57PM +0100, Rafael J. Wysocki wrote:
> >> >> On Tue, Mar 8, 2016 at 12:27 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > [...]
> >
> >> a = max_freq gives next_freq = max_freq for x = 1, but with that
> >> choice of a you may never get to x = 1 with frequency invariance
> >> because of the feedback effect mentioned above, so the 1/n produces
> >> the extra boost needed for that (n is a positive integer).
> >>
> >> Quite frankly, to me it looks like linear really is a better
> >> approximation for "raw" utilization.  That is, for frequency-invariant
> >> x we should take:
> >>
> >>   next_freq = a * x * max_freq / current_freq
> >>
> >> (and if x is not frequency invariant, the right-hand side becomes a *
> >> x).  Then, the extra boost needed to get to x = 1 in the frequency-
> >> invariant case is produced by the (max_freq / current_freq) factor that is
> >> greater than 1 as long as we are not running at max_freq and a can be
> >> chosen as max_freq.
> >>
> >
> > Expanding terms again, your original formula (without the 1.1 factor of
> > the last version) was:
> >
> >  next_freq = util / max_cap * max_freq
> >
> > and this doesn't work when we have freq invariance since util won't go
> > over curr_cap.
> 
> Can you please remind me what curr_cap is?
> 
> > What you propose above is to add another factor, so that we have:
> >
> >  next_freq = util / max_cap * max_freq / curr_freq * max_freq
> >
> > which should give us the opportunity to reach max_freq also with freq
> > invariance.
> >
> > This should actually be the same as doing:
> >
> >  next_freq = util / max_cap * max_cap / curr_cap * max_freq
> >
> > We are basically scaling how much the cpu is busy at curr_cap back to
> > the 0..1024 scale. And we use this to select next_freq. Also, we can
> > simplify this to:
> >
> >  next_freq = util / curr_cap * max_freq
> >
> > and we save some ops.
> >
> > However, if that is correct, I think we might have a problem, as we are
> > skewing OPP selection towards higher frequencies. Let's suppose we have
> > a platform with 3 OPPs:
> >
> >   freq     cap
> >   1200     1024
> >   900      768
> >   600      512
> >
> > As soon as a task reaches a utilization of 257 we will be selecting the
> > second OPP, as
> >
> >  next_freq = 257 / 512 * 1200 ~ 602
> >
> > while the cpu is only 50% busy in this case. And we will go to the max
> > OPP when reaching ~492 (~64% of 768).
> >
> > That said, I guess this might work as a first solution, but we will
> > probably need something better in the future. I understand Rafael's
> > concerns regarding margins, but it seems to me that some kind of
> > additional parameter will probably be needed anyway to fix this.
> > Just to say again how we handle this in schedfreq: with a -20% margin
> > applied to the lowest OPP, we will get to the next one when utilization
> > reaches ~410 (80% busy at curr OPP), and so on for the subsequent ones,
> > which is less aggressive and might be better IMHO.
> 
> Well, Peter says that my idea is incorrect, so I'll go for
> 
>   next_freq = C * current_freq * util_raw / max
> 
> where C > 1 (and likely C < 1.5) instead.
> 
> That means C has to be determined somehow or guessed.  The 80% tipping
> point condition seems reasonable to me, though, which leads to C =
> 1.25.

Right, that is the same value used in the schedfreq series:

+/*
+ * Capacity margin added to CFS and RT capacity requests to provide
+ * some head room if task utilization further increases.
+ */
+unsigned int capacity_margin = 1280;

Regards,
Mike
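
That margin is expressed against SCHED_CAPACITY_SCALE, so it is the
same constant: 1280 / 1024 = 1.25. A capacity request would then be
bumped along these lines (illustrative only, not the schedfreq code
itself):

	req = util * capacity_margin >> SCHED_CAPACITY_SHIFT;	/* util * 1.25 */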

Patch

Index: linux-pm/drivers/cpufreq/cpufreq_schedutil.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpufreq/cpufreq_schedutil.c
@@ -0,0 +1,501 @@ 
+/*
+ * CPUFreq governor based on scheduler-provided CPU utilization data.
+ *
+ * Copyright (C) 2016, Intel Corporation
+ * Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/percpu-defs.h>
+#include <linux/slab.h>
+
+#include "cpufreq_governor.h"
+
+struct sugov_tunables {
+	struct gov_tunables gt;
+	unsigned int rate_limit_us;
+};
+
+struct sugov_policy {
+	struct cpufreq_policy *policy;
+
+	struct sugov_tunables *tunables;
+	struct list_head tunables_hook;
+
+	raw_spinlock_t update_lock;  /* For shared policies */
+	u64 last_freq_update_time;
+	s64 freq_update_delay_ns;
+	unsigned int next_freq;
+
+	/* The next fields are only needed if fast switch cannot be used. */
+	unsigned int relation;
+	struct irq_work irq_work;
+	struct work_struct work;
+	struct mutex work_lock;
+	bool work_in_progress;
+
+	bool need_freq_update;
+};
+
+struct sugov_cpu {
+	struct update_util_data update_util;
+	struct sugov_policy *sg_policy;
+
+	/* The fields below are only needed when sharing a policy. */
+	unsigned long util;
+	unsigned long max;
+	u64 last_update;
+};
+
+static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
+
+/************************ Governor internals ***********************/
+
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
+	u64 delta_ns;
+
+	if (sg_policy->work_in_progress)
+		return false;
+
+	if (unlikely(sg_policy->need_freq_update)) {
+		sg_policy->need_freq_update = false;
+		return true;
+	}
+
+	delta_ns = time - sg_policy->last_freq_update_time;
+	return (s64)delta_ns >= sg_policy->freq_update_delay_ns;
+}
+
+static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+				unsigned long util, unsigned long max,
+				unsigned int next_freq)
+{
+	struct cpufreq_policy *policy = sg_policy->policy;
+	unsigned int rel;
+
+	if (next_freq > policy->max)
+		next_freq = policy->max;
+	else if (next_freq < policy->min)
+		next_freq = policy->min;
+
+	sg_policy->last_freq_update_time = time;
+	if (sg_policy->next_freq == next_freq)
+		return;
+
+	sg_policy->next_freq = next_freq;
+	/*
+	 * If utilization is less than max / 4, use RELATION_C to allow the
+	 * minimum frequency to be selected more often in case the distance from
+	 * it to the next available frequency in the table is significant.
+	 */
+	rel = util < (max >> 2) ? CPUFREQ_RELATION_C : CPUFREQ_RELATION_L;
+	if (policy->fast_switch_possible) {
+		cpufreq_driver_fast_switch(policy, next_freq, rel);
+	} else {
+		sg_policy->relation = rel;
+		sg_policy->work_in_progress = true;
+		irq_work_queue(&sg_policy->irq_work);
+	}
+}
+
+static void sugov_update_single(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+	unsigned int min_f, max_f, next_f;
+
+	if (!sugov_should_update_freq(sg_policy, time))
+		return;
+
+	min_f = sg_policy->policy->cpuinfo.min_freq;
+	max_f = sg_policy->policy->cpuinfo.max_freq;
+	next_f = util > max ? max_f : min_f + util * (max_f - min_f) / max;
+
+	sugov_update_commit(sg_policy, time, util, max, next_f);
+}
+
+static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
+				    unsigned long util, unsigned long max)
+{
+	struct cpufreq_policy *policy = sg_policy->policy;
+	unsigned int min_f = policy->cpuinfo.min_freq;
+	unsigned int max_f = policy->cpuinfo.max_freq;
+	u64 last_freq_update_time = sg_policy->last_freq_update_time;
+	unsigned int j;
+
+	if (util > max)
+		return max_f;
+
+	for_each_cpu(j, policy->cpus) {
+		struct sugov_cpu *j_sg_cpu;
+		unsigned long j_util, j_max;
+		u64 delta_ns;
+
+		if (j == smp_processor_id())
+			continue;
+
+		j_sg_cpu = &per_cpu(sugov_cpu, j);
+		/*
+		 * If the CPU utilization was last updated before the previous
+		 * frequency update and the time elapsed between the last update
+		 * of the CPU utilization and the last frequency update is long
+		 * enough, don't take the CPU into account as it probably is
+		 * idle now.
+		 */
+		delta_ns = last_freq_update_time - j_sg_cpu->last_update;
+		if ((s64)delta_ns > NSEC_PER_SEC / HZ)
+			continue;
+
+		j_util = j_sg_cpu->util;
+		j_max = j_sg_cpu->max;
+		if (j_util > j_max)
+			return max_f;
+
+		if (j_util * max > j_max * util) {
+			util = j_util;
+			max = j_max;
+		}
+	}
+
+	return min_f + util * (max_f - min_f) / max;
+}
+
+static void sugov_update_shared(struct update_util_data *data, u64 time,
+				unsigned long util, unsigned long max)
+{
+	struct sugov_cpu *sg_cpu = container_of(data, struct sugov_cpu, update_util);
+	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+	unsigned int next_f;
+
+	raw_spin_lock(&sg_policy->update_lock);
+
+	sg_cpu->util = util;
+	sg_cpu->max = max;
+	sg_cpu->last_update = time;
+
+	if (sugov_should_update_freq(sg_policy, time)) {
+		next_f = sugov_next_freq(sg_policy, util, max);
+		sugov_update_commit(sg_policy, time, util, max, next_f);
+	}
+
+	raw_spin_unlock(&sg_policy->update_lock);
+}
+
+static void sugov_work(struct work_struct *work)
+{
+	struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work);
+
+	mutex_lock(&sg_policy->work_lock);
+	__cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+				sg_policy->relation);
+	mutex_unlock(&sg_policy->work_lock);
+
+	sg_policy->work_in_progress = false;
+}
+
+static void sugov_irq_work(struct irq_work *irq_work)
+{
+	struct sugov_policy *sg_policy;
+
+	sg_policy = container_of(irq_work, struct sugov_policy, irq_work);
+	schedule_work(&sg_policy->work);
+}
+
+/************************** sysfs interface ************************/
+
+static struct sugov_tunables *global_tunables;
+static DEFINE_MUTEX(global_tunables_lock);
+
+static inline struct sugov_tunables *to_sugov_tunables(struct gov_tunables *gt)
+{
+	return container_of(gt, struct sugov_tunables, gt);
+}
+
+static ssize_t rate_limit_us_show(struct gov_tunables *gt, char *buf)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(gt);
+
+	return sprintf(buf, "%u\n", tunables->rate_limit_us);
+}
+
+static ssize_t rate_limit_us_store(struct gov_tunables *gt, const char *buf,
+				   size_t count)
+{
+	struct sugov_tunables *tunables = to_sugov_tunables(gt);
+	struct sugov_policy *sg_policy;
+	unsigned int rate_limit_us;
+	int ret;
+
+	ret = sscanf(buf, "%u", &rate_limit_us);
+	if (ret != 1)
+		return -EINVAL;
+
+	tunables->rate_limit_us = rate_limit_us;
+
+	list_for_each_entry(sg_policy, &gt->policy_list, tunables_hook)
+		sg_policy->freq_update_delay_ns = rate_limit_us * NSEC_PER_USEC;
+
+	return count;
+}
+
+static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);
+
+static struct attribute *sugov_attributes[] = {
+	&rate_limit_us.attr,
+	NULL
+};
+
+static struct kobj_type sugov_tunables_ktype = {
+	.default_attrs = sugov_attributes,
+	.sysfs_ops = &governor_sysfs_ops,
+};
+
+/********************** cpufreq governor interface *********************/
+
+static struct cpufreq_governor schedutil_gov;
+
+static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy;
+
+	sg_policy = kzalloc(sizeof(*sg_policy), GFP_KERNEL);
+	if (!sg_policy)
+		return NULL;
+
+	sg_policy->policy = policy;
+	init_irq_work(&sg_policy->irq_work, sugov_irq_work);
+	INIT_WORK(&sg_policy->work, sugov_work);
+	mutex_init(&sg_policy->work_lock);
+	raw_spin_lock_init(&sg_policy->update_lock);
+	return sg_policy;
+}
+
+static void sugov_policy_free(struct sugov_policy *sg_policy)
+{
+	mutex_destroy(&sg_policy->work_lock);
+	kfree(sg_policy);
+}
+
+static struct sugov_tunables *sugov_tunables_alloc(struct sugov_policy *sg_policy)
+{
+	struct sugov_tunables *tunables;
+
+	tunables = kzalloc(sizeof(*tunables), GFP_KERNEL);
+	if (tunables)
+		gov_tunables_init(&tunables->gt, &sg_policy->tunables_hook);
+
+	return tunables;
+}
+
+static void sugov_tunables_free(struct sugov_tunables *tunables)
+{
+	if (!have_governor_per_policy())
+		global_tunables = NULL;
+
+	kfree(tunables);
+}
+
+static int sugov_init(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy;
+	struct sugov_tunables *tunables;
+	unsigned int lat;
+	int ret = 0;
+
+	/* State should be equivalent to EXIT */
+	if (policy->governor_data)
+		return -EBUSY;
+
+	sg_policy = sugov_policy_alloc(policy);
+	if (!sg_policy)
+		return -ENOMEM;
+
+	mutex_lock(&global_tunables_lock);
+
+	if (global_tunables) {
+		if (WARN_ON(have_governor_per_policy())) {
+			ret = -EINVAL;
+			goto free_sg_policy;
+		}
+		policy->governor_data = sg_policy;
+		sg_policy->tunables = global_tunables;
+
+		gov_tunables_get(&global_tunables->gt, &sg_policy->tunables_hook);
+		goto out;
+	}
+
+	tunables = sugov_tunables_alloc(sg_policy);
+	if (!tunables) {
+		ret = -ENOMEM;
+		goto free_sg_policy;
+	}
+
+	tunables->rate_limit_us = LATENCY_MULTIPLIER;
+	lat = policy->cpuinfo.transition_latency / NSEC_PER_USEC;
+	if (lat)
+		tunables->rate_limit_us *= lat;
+
+	if (!have_governor_per_policy())
+		global_tunables = tunables;
+
+	policy->governor_data = sg_policy;
+	sg_policy->tunables = tunables;
+
+	ret = kobject_init_and_add(&tunables->gt.kobj, &sugov_tunables_ktype,
+				   get_governor_parent_kobj(policy), "%s",
+				   schedutil_gov.name);
+	if (!ret)
+		goto out;
+
+	/* Failure, so roll back. */
+	policy->governor_data = NULL;
+	sugov_tunables_free(tunables);
+
+ free_sg_policy:
+	pr_err("cpufreq: schedutil governor initialization failed (error %d)\n", ret);
+	sugov_policy_free(sg_policy);
+
+ out:
+	mutex_unlock(&global_tunables_lock);
+	return ret;
+}
+
+static int sugov_exit(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+	struct sugov_tunables *tunables = sg_policy->tunables;
+	unsigned int count;
+
+	mutex_lock(&global_tunables_lock);
+
+	count = gov_tunables_put(&tunables->gt, &sg_policy->tunables_hook);
+	policy->governor_data = NULL;
+	if (!count)
+		sugov_tunables_free(tunables);
+
+	mutex_unlock(&global_tunables_lock);
+
+	sugov_policy_free(sg_policy);
+	return 0;
+}
+
+static int sugov_start(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+	unsigned int cpu;
+
+	sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
+	sg_policy->last_freq_update_time = 0;
+	sg_policy->next_freq = UINT_MAX;
+	sg_policy->work_in_progress = false;
+	sg_policy->need_freq_update = false;
+
+	for_each_cpu(cpu, policy->cpus) {
+		struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);
+
+		sg_cpu->sg_policy = sg_policy;
+		if (policy_is_shared(policy)) {
+			sg_cpu->util = ULONG_MAX;
+			sg_cpu->max = 0;
+			sg_cpu->last_update = 0;
+			sg_cpu->update_util.func = sugov_update_shared;
+		} else {
+			sg_cpu->update_util.func = sugov_update_single;
+		}
+		cpufreq_set_update_util_data(cpu, &sg_cpu->update_util);
+	}
+	return 0;
+}
+
+static int sugov_stop(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+	unsigned int cpu;
+
+	for_each_cpu(cpu, policy->cpus)
+		cpufreq_set_update_util_data(cpu, NULL);
+
+	synchronize_sched();
+
+	irq_work_sync(&sg_policy->irq_work);
+	cancel_work_sync(&sg_policy->work);
+	return 0;
+}
+
+static int sugov_limits(struct cpufreq_policy *policy)
+{
+	struct sugov_policy *sg_policy = policy->governor_data;
+
+	if (!policy->fast_switch_possible) {
+		mutex_lock(&sg_policy->work_lock);
+
+		if (policy->max < policy->cur)
+			__cpufreq_driver_target(policy, policy->max,
+						CPUFREQ_RELATION_H);
+		else if (policy->min > policy->cur)
+			__cpufreq_driver_target(policy, policy->min,
+						CPUFREQ_RELATION_L);
+
+		mutex_unlock(&sg_policy->work_lock);
+	}
+
+	sg_policy->need_freq_update = true;
+	return 0;
+}
+
+int sugov_governor(struct cpufreq_policy *policy, unsigned int event)
+{
+	if (event == CPUFREQ_GOV_POLICY_INIT) {
+		return sugov_init(policy);
+	} else if (policy->governor_data) {
+		switch (event) {
+		case CPUFREQ_GOV_POLICY_EXIT:
+			return sugov_exit(policy);
+		case CPUFREQ_GOV_START:
+			return sugov_start(policy);
+		case CPUFREQ_GOV_STOP:
+			return sugov_stop(policy);
+		case CPUFREQ_GOV_LIMITS:
+			return sugov_limits(policy);
+		}
+	}
+	return -EINVAL;
+}
+
+static struct cpufreq_governor schedutil_gov = {
+	.name = "schedutil",
+	.governor = sugov_governor,
+	.max_transition_latency	= TRANSITION_LATENCY_LIMIT,
+	.owner = THIS_MODULE,
+};
+
+static int __init sugov_module_init(void)
+{
+	return cpufreq_register_governor(&schedutil_gov);
+}
+
+static void __exit sugov_module_exit(void)
+{
+	cpufreq_unregister_governor(&schedutil_gov);
+}
+
+MODULE_AUTHOR("Rafael J. Wysocki <rafael.j.wysocki@intel.com>");
+MODULE_DESCRIPTION("Utilization-based CPU frequency selection");
+MODULE_LICENSE("GPL");
+
+#ifdef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+struct cpufreq_governor *cpufreq_default_governor(void)
+{
+	return &schedutil_gov;
+}
+
+fs_initcall(sugov_module_init);
+#else
+module_init(sugov_module_init);
+#endif
+module_exit(sugov_module_exit);
Index: linux-pm/drivers/cpufreq/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpufreq/Kconfig
+++ linux-pm/drivers/cpufreq/Kconfig
@@ -107,6 +107,16 @@  config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the
 	  driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHEDUTIL
+	bool "schedutil"
+	select CPU_FREQ_GOV_SCHEDUTIL
+	select CPU_FREQ_GOV_PERFORMANCE
+	help
+	  Use the 'schedutil' CPUFreq governor by default. If unsure,
+	  have a look at the help section of that governor. The fallback
+	  governor will be 'performance'.
+
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -188,6 +198,22 @@  config CPU_FREQ_GOV_CONSERVATIVE
 
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_SCHEDUTIL
+	tristate "'schedutil' cpufreq policy governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_TUNABLES
+	select IRQ_WORK
+	help
+	  The frequency selection formula used by this governor is analogous
+	  to the one used by 'ondemand', but instead of computing CPU load
+	  as the "non-idle CPU time" to "total CPU time" ratio, it uses CPU
+	  utilization data provided by the scheduler as input.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called cpufreq_schedutil.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 
 config CPUFREQ_DT
Index: linux-pm/drivers/cpufreq/Makefile
===================================================================
--- linux-pm.orig/drivers/cpufreq/Makefile
+++ linux-pm/drivers/cpufreq/Makefile
@@ -10,6 +10,7 @@  obj-$(CONFIG_CPU_FREQ_GOV_POWERSAVE)	+=
 obj-$(CONFIG_CPU_FREQ_GOV_USERSPACE)	+= cpufreq_userspace.o
 obj-$(CONFIG_CPU_FREQ_GOV_ONDEMAND)	+= cpufreq_ondemand.o
 obj-$(CONFIG_CPU_FREQ_GOV_CONSERVATIVE)	+= cpufreq_conservative.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)	+= cpufreq_schedutil.o
 obj-$(CONFIG_CPU_FREQ_GOV_COMMON)		+= cpufreq_governor.o
 obj-$(CONFIG_CPU_FREQ_GOV_TUNABLES)	+= cpufreq_governor_tunables.o
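
With the Kconfig and Makefile hooks above, the governor can be
exercised in the usual way once built in; a quick smoke test could be
(sysfs paths are indicative; the rate_limit_us tunable sits under the
policy directory only for drivers with per-policy governors, and under
/sys/devices/system/cpu/cpufreq/schedutil otherwise):

	# echo schedutil > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
	# cat /sys/devices/system/cpu/cpu0/cpufreq/schedutil/rate_limit_us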