diff mbox

[RFC,v3,2/2] cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

Message ID 2997922.DidfPadJuT@aspire.rjw.lan (mailing list archive)
State Mainlined
Delegated to: Rafael Wysocki
Headers show

Commit Message

Rafael J. Wysocki March 21, 2017, 11:08 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

The way the schedutil governor uses the PELT metric causes it to
underestimate the CPU utilization in some cases.

That can be easily demonstrated by running kernel compilation on
a Sandy Bridge Intel processor, running turbostat in parallel with
it and looking at the values written to the MSR_IA32_PERF_CTL
register.  Namely, the expected result would be that when all CPUs
were 100% busy, all of them would be requested to run in the maximum
P-state, but observation shows that this clearly isn't the case.
The CPUs run in the maximum P-state for a while and then are
requested to run slower and go back to the maximum P-state after
a while again.  That causes the actual frequency of the processor to
visibly oscillate below the sustainable maximum in a jittery fashion
which clearly is not desirable.

That has been attributed to CPU utilization metric updates on task
migration that cause the total utilization value for the CPU to be
reduced by the utilization of the migrated task.  If that happens,
the schedutil governor may see a CPU utilization reduction and will
attempt to reduce the CPU frequency accordingly right away.  That
may be premature, though, for example if the system is generally
busy and there are other runnable tasks waiting to be run on that
CPU already.

This is unlikely to be an issue on systems where cpufreq policies are
shared between multiple CPUs, because in those cases the policy
utilization is computed as the maximum of the CPU utilization values
over the whole policy and if that turns out to be low, reducing the
frequency for the policy most likely is a good idea anyway.  On
systems with one CPU per policy, however, it may affect performance
adversely and even lead to increased energy consumption in some cases.

On those systems it may be addressed by taking another utilization
metric into consideration, like whether or not the CPU whose
frequency is about to be reduced has been idle recently, because if
that's not the case, the CPU is likely to be busy in the near future
and its frequency should not be reduced.

To that end, use the counter of idle calls in the timekeeping code.
Namely, make the schedutil governor look at that counter for the
current CPU every time before its frequency is about to be reduced.
If the counter has not changed since the previous iteration of the
governor computations for that CPU, the CPU has been busy for all
that time and its frequency should not be decreased, so if the new
frequency would be lower than the one set previously, the governor
will skip the frequency update.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 include/linux/tick.h             |    1 +
 kernel/sched/cpufreq_schedutil.c |   27 +++++++++++++++++++++++++++
 kernel/time/tick-sched.c         |   12 ++++++++++++
 3 files changed, 40 insertions(+)

Comments

Peter Zijlstra March 22, 2017, 9:26 a.m. UTC | #1
On Wed, Mar 22, 2017 at 12:08:50AM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
> 
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values
> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.  On
> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
> 
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
> 
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Right; this makes sense to me. Of course it would be good to have some
more measurements on this, but in principle:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Viresh Kumar March 22, 2017, 9:54 a.m. UTC | #2
On 22-03-17, 00:08, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
> 
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values
> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.  On
> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
> 
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
> 
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  include/linux/tick.h             |    1 +
>  kernel/sched/cpufreq_schedutil.c |   27 +++++++++++++++++++++++++++
>  kernel/time/tick-sched.c         |   12 ++++++++++++
>  3 files changed, 40 insertions(+)

Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Joel Fernandes March 23, 2017, 1:04 a.m. UTC | #3
On Tue, Mar 21, 2017 at 4:08 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
>
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
>
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
>
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values
> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.  On
> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
>
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
>
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Makes sense,

Reviewed-by: Joel Fernandes <joelaf@google.com>

Thanks,
Joel
Sai March 23, 2017, 7:26 p.m. UTC | #4
Hi Rafael,

On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

<snip>

> 
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
> 
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values
> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.  On

I have observed this issue even in the shared policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates:

1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet)
2. Do wakeup on dst_cpu, add load to dst_cpu

Now if src_cpu manages to do a PELT update before 2. happens, ex: say a small periodic task woke up on src_cpu, it'll end up subtracting the removed_load from its utilization and issue a frequency update before 2. happens.

This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu freq. update from step 2. above gets rate limited in this scenario.


> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
> 
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
> 
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  include/linux/tick.h             |    1 +
>  kernel/sched/cpufreq_schedutil.c |   27 +++++++++++++++++++++++++++
>  kernel/time/tick-sched.c         |   12 ++++++++++++
>  3 files changed, 40 insertions(+)
> 
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -61,6 +61,11 @@ struct sugov_cpu {
>  	unsigned long util;
>  	unsigned long max;
>  	unsigned int flags;
> +
> +	/* The field below is for single-CPU policies only. */
> +#ifdef CONFIG_NO_HZ_COMMON
> +	unsigned long saved_idle_calls;
> +#endif
>  };
>  
>  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> @@ -192,6 +197,19 @@ static void sugov_iowait_boost(struct su
>  	sg_cpu->iowait_boost >>= 1;
>  }
>  
> +#ifdef CONFIG_NO_HZ_COMMON
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> +	unsigned long idle_calls = tick_nohz_get_idle_calls();
> +	bool ret = idle_calls == sg_cpu->saved_idle_calls;
> +
> +	sg_cpu->saved_idle_calls = idle_calls;
> +	return ret;
> +}

Hm, sorry I am a bit confused perhaps you could help me understand the problem/solution better :)

Say we have the this simple case of only a single periodic task running on one CPU, wouldn't the PELT update on wakeup cause a frequency update which updates the sg_cpu->saved_idle_calls value here? That would then cause the frequency update on idle entry to always skip dropping frequency right?

If I am reading this correctly, the PELT update on the dequeue for the periodic task (in the scenario above) happens _before_ the idle_calls++ which is in tick_nohz_idle_enter.

Thanks!
-Sai
Sai March 23, 2017, 8:48 p.m. UTC | #5
On 03/23/2017 12:26 PM, Sai Gurrappadi wrote:

> 
> Hm, sorry I am a bit confused perhaps you could help me understand the problem/solution better :)
> 
> Say we have the this simple case of only a single periodic task running on one CPU, wouldn't the PELT update on wakeup cause a frequency update which updates the sg_cpu->saved_idle_calls value here? That would then cause the frequency update on idle entry to always skip dropping frequency right?
 
Apologies, this example here is flawed because on task dequeue, its utilization isn't removed. There is no problem in this case...


-Sai
Rafael J. Wysocki March 24, 2017, 1:39 a.m. UTC | #6
On Thu, Mar 23, 2017 at 8:26 PM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
> Hi Rafael,

Hi,

> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> <snip>
>
>>
>> That has been attributed to CPU utilization metric updates on task
>> migration that cause the total utilization value for the CPU to be
>> reduced by the utilization of the migrated task.  If that happens,
>> the schedutil governor may see a CPU utilization reduction and will
>> attempt to reduce the CPU frequency accordingly right away.  That
>> may be premature, though, for example if the system is generally
>> busy and there are other runnable tasks waiting to be run on that
>> CPU already.
>>
>> This is unlikely to be an issue on systems where cpufreq policies are
>> shared between multiple CPUs, because in those cases the policy
>> utilization is computed as the maximum of the CPU utilization values
>> over the whole policy and if that turns out to be low, reducing the
>> frequency for the policy most likely is a good idea anyway.  On
>
> I have observed this issue even in the shared policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates:
>
> 1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet)
> 2. Do wakeup on dst_cpu, add load to dst_cpu
>
> Now if src_cpu manages to do a PELT update before 2. happens, ex: say a small periodic task woke up on src_cpu, it'll end up subtracting the removed_load from its utilization and issue a frequency update before 2. happens.
>
> This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu freq. update from step 2. above gets rate limited in this scenario.

Interesting, and this seems to be related to last_freq_update_time
being per-policy (which it has to be, because frequency updates are
per-policy too and that's what we need to rate-limit).

Does this happen often enough to be a real concern in practice on
those configurations, though?

The other CPUs in the policy need to be either idle (so schedutil
doesn't take them into account at all) or lightly utilized for that to
happen, so that would affect workloads with one CPU hog type of task
that is migrated from one CPU to another within a policy and that
doesn't happen too often AFAICS.

Thanks,
Rafael
Sai March 24, 2017, 7:08 p.m. UTC | #7
On 03/23/2017 06:39 PM, Rafael J. Wysocki wrote:
> On Thu, Mar 23, 2017 at 8:26 PM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
>> Hi Rafael,
> 
> Hi,
> 
>> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> <snip>
>>
>>>
>>> That has been attributed to CPU utilization metric updates on task
>>> migration that cause the total utilization value for the CPU to be
>>> reduced by the utilization of the migrated task.  If that happens,
>>> the schedutil governor may see a CPU utilization reduction and will
>>> attempt to reduce the CPU frequency accordingly right away.  That
>>> may be premature, though, for example if the system is generally
>>> busy and there are other runnable tasks waiting to be run on that
>>> CPU already.
>>>
>>> This is unlikely to be an issue on systems where cpufreq policies are
>>> shared between multiple CPUs, because in those cases the policy
>>> utilization is computed as the maximum of the CPU utilization values
>>> over the whole policy and if that turns out to be low, reducing the
>>> frequency for the policy most likely is a good idea anyway.  On
>>
>> I have observed this issue even in the shared policy case (one clock domain for many CPUs). On migrate, the actual load update is split into two updates:
>>
>> 1. Add to removed_load on src_cpu (cpu_util(src_cpu) not updated yet)
>> 2. Do wakeup on dst_cpu, add load to dst_cpu
>>
>> Now if src_cpu manages to do a PELT update before 2. happens, ex: say a small periodic task woke up on src_cpu, it'll end up subtracting the removed_load from its utilization and issue a frequency update before 2. happens.
>>
>> This causes a premature dip in frequency which doesn't get corrected until the next util update that fires after rate_limit_us. The dst_cpu freq. update from step 2. above gets rate limited in this scenario.
> 
> Interesting, and this seems to be related to last_freq_update_time
> being per-policy (which it has to be, because frequency updates are
> per-policy too and that's what we need to rate-limit).
> 

Correct.

> Does this happen often enough to be a real concern in practice on
> those configurations, though?
> 
> The other CPUs in the policy need to be either idle (so schedutil
> doesn't take them into account at all) or lightly utilized for that to
> happen, so that would affect workloads with one CPU hog type of task
> that is migrated from one CPU to another within a policy and that
> doesn't happen too often AFAICS.

So it is possible, even likely in some cases for a heavy CPU task to migrate on wakeup between the policy->cpus via select_idle_sibling() if the prev_cpu it was on was !idle on wakeup.

This style of heavy thread + lots of light work is a common pattern on Android (games, browsing, etc.) given how Android does its threading for ipc (Binder stuff) + its rendering/audio pipelines. 

I unfortunately don't have any numbers atm though.

-Sai
Sai March 25, 2017, 1:14 a.m. UTC | #8
Hi Rafael,

On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
> 
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
> 
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
> 

Thinking out loud a bit, I wonder if what you really want to do is basically:

schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);

Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs.

Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the lesser the amount of 'overlap' / overloading.

Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so:

schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg);

Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases.

Thoughts?

Thanks,
-Sai
Rafael J. Wysocki March 25, 2017, 1:39 a.m. UTC | #9
On Sat, Mar 25, 2017 at 2:14 AM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
> Hi Rafael,
>
> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> The way the schedutil governor uses the PELT metric causes it to
>> underestimate the CPU utilization in some cases.
>>
>> That can be easily demonstrated by running kernel compilation on
>> a Sandy Bridge Intel processor, running turbostat in parallel with
>> it and looking at the values written to the MSR_IA32_PERF_CTL
>> register.  Namely, the expected result would be that when all CPUs
>> were 100% busy, all of them would be requested to run in the maximum
>> P-state, but observation shows that this clearly isn't the case.
>> The CPUs run in the maximum P-state for a while and then are
>> requested to run slower and go back to the maximum P-state after
>> a while again.  That causes the actual frequency of the processor to
>> visibly oscillate below the sustainable maximum in a jittery fashion
>> which clearly is not desirable.
>>
>> That has been attributed to CPU utilization metric updates on task
>> migration that cause the total utilization value for the CPU to be
>> reduced by the utilization of the migrated task.  If that happens,
>> the schedutil governor may see a CPU utilization reduction and will
>> attempt to reduce the CPU frequency accordingly right away.  That
>> may be premature, though, for example if the system is generally
>> busy and there are other runnable tasks waiting to be run on that
>> CPU already.
>>
>
> Thinking out loud a bit, I wonder if what you really want to do is basically:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>
> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs.
>
> Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the lesser the amount of 'overlap' / overloading.
>
> Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg);
>
> Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases.
>
> Thoughts?

Well, is the total_cpu_util_avg metric readily available?

Thanks,
Rafael
Vincent Guittot March 27, 2017, 7:04 a.m. UTC | #10
On 25 March 2017 at 02:14, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
> Hi Rafael,
>
> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> The way the schedutil governor uses the PELT metric causes it to
>> underestimate the CPU utilization in some cases.
>>
>> That can be easily demonstrated by running kernel compilation on
>> a Sandy Bridge Intel processor, running turbostat in parallel with
>> it and looking at the values written to the MSR_IA32_PERF_CTL
>> register.  Namely, the expected result would be that when all CPUs
>> were 100% busy, all of them would be requested to run in the maximum
>> P-state, but observation shows that this clearly isn't the case.
>> The CPUs run in the maximum P-state for a while and then are
>> requested to run slower and go back to the maximum P-state after
>> a while again.  That causes the actual frequency of the processor to
>> visibly oscillate below the sustainable maximum in a jittery fashion
>> which clearly is not desirable.
>>
>> That has been attributed to CPU utilization metric updates on task
>> migration that cause the total utilization value for the CPU to be
>> reduced by the utilization of the migrated task.  If that happens,
>> the schedutil governor may see a CPU utilization reduction and will
>> attempt to reduce the CPU frequency accordingly right away.  That
>> may be premature, though, for example if the system is generally
>> busy and there are other runnable tasks waiting to be run on that
>> CPU already.
>>
>
> Thinking out loud a bit, I wonder if what you really want to do is basically:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>
> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs.

But we loose the interest of immediate decrease when tasks migrate.
Instead of total_cpu_util_avg we should better track RT utilization in
the same manner so with ongoing work for deadline we will have :
total_utilization = cfs.util_avg + rt's util_avg + deadline's util avg
and we still take advantage of task migration effect


>
> Over time, total_cpu_util_avg and cfs_rq(cpu)->util_avg will tend towards each other the lesser the amount of 'overlap' / overloading.
>
> Yes, the above metric would 'overestimate' in case all tasks have migrated away and we are left with an idle CPU. A fix for that could be to just use the PELT value like so:
>
> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, idle_cpu(cpu) ? 0 : total_cpu_util_avg);
>
> Note that the problem described here in the commit message doesn't need fully runnable threads, it just needs two threads to execute in parallel on the same CPU for a period of time. I don't think looking at just idle_calls necessarily covers all cases.
>
> Thoughts?
>
> Thanks,
> -Sai
Sai March 27, 2017, 9:01 p.m. UTC | #11
Hi Vincent,

On 03/27/2017 12:04 AM, Vincent Guittot wrote:
> On 25 March 2017 at 02:14, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
>> Hi Rafael,
>>
>> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> The way the schedutil governor uses the PELT metric causes it to
>>> underestimate the CPU utilization in some cases.
>>>
>>> That can be easily demonstrated by running kernel compilation on
>>> a Sandy Bridge Intel processor, running turbostat in parallel with
>>> it and looking at the values written to the MSR_IA32_PERF_CTL
>>> register.  Namely, the expected result would be that when all CPUs
>>> were 100% busy, all of them would be requested to run in the maximum
>>> P-state, but observation shows that this clearly isn't the case.
>>> The CPUs run in the maximum P-state for a while and then are
>>> requested to run slower and go back to the maximum P-state after
>>> a while again.  That causes the actual frequency of the processor to
>>> visibly oscillate below the sustainable maximum in a jittery fashion
>>> which clearly is not desirable.
>>>
>>> That has been attributed to CPU utilization metric updates on task
>>> migration that cause the total utilization value for the CPU to be
>>> reduced by the utilization of the migrated task.  If that happens,
>>> the schedutil governor may see a CPU utilization reduction and will
>>> attempt to reduce the CPU frequency accordingly right away.  That
>>> may be premature, though, for example if the system is generally
>>> busy and there are other runnable tasks waiting to be run on that
>>> CPU already.
>>>
>>
>> Thinking out loud a bit, I wonder if what you really want to do is basically:
>>
>> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>>
>> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs.
> 
> But we loose the interest of immediate decrease when tasks migrate.

Indeed, this is not ideal.

> Instead of total_cpu_util_avg we should better track RT utilization in
> the same manner so with ongoing work for deadline we will have :
> total_utilization = cfs.util_avg + rt's util_avg + deadline's util avg
> and we still take advantage of task migration effect

I agree that we need better tracking for RT and DL tasks but that doesn't solve the overloaded case with more than one CFS thread sharing a CPU.

In the overloaded case, we care not just about the instant where the migrate happens but also subsequent windows where the PELT metric is slowly ramping up to reflect the real utilization of a task now that it has a CPU to itself.

Maybe there are better ways to solve that though :-)

Thanks,
-Sai
Rafael J. Wysocki March 27, 2017, 9:11 p.m. UTC | #12
On Mon, Mar 27, 2017 at 11:01 PM, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
> Hi Vincent,
>
> On 03/27/2017 12:04 AM, Vincent Guittot wrote:
>> On 25 March 2017 at 02:14, Sai Gurrappadi <sgurrappadi@nvidia.com> wrote:
>>> Hi Rafael,
>>>
>>> On 03/21/2017 04:08 PM, Rafael J. Wysocki wrote:
>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>
>>>> The way the schedutil governor uses the PELT metric causes it to
>>>> underestimate the CPU utilization in some cases.
>>>>
>>>> That can be easily demonstrated by running kernel compilation on
>>>> a Sandy Bridge Intel processor, running turbostat in parallel with
>>>> it and looking at the values written to the MSR_IA32_PERF_CTL
>>>> register.  Namely, the expected result would be that when all CPUs
>>>> were 100% busy, all of them would be requested to run in the maximum
>>>> P-state, but observation shows that this clearly isn't the case.
>>>> The CPUs run in the maximum P-state for a while and then are
>>>> requested to run slower and go back to the maximum P-state after
>>>> a while again.  That causes the actual frequency of the processor to
>>>> visibly oscillate below the sustainable maximum in a jittery fashion
>>>> which clearly is not desirable.
>>>>
>>>> That has been attributed to CPU utilization metric updates on task
>>>> migration that cause the total utilization value for the CPU to be
>>>> reduced by the utilization of the migrated task.  If that happens,
>>>> the schedutil governor may see a CPU utilization reduction and will
>>>> attempt to reduce the CPU frequency accordingly right away.  That
>>>> may be premature, though, for example if the system is generally
>>>> busy and there are other runnable tasks waiting to be run on that
>>>> CPU already.
>>>>
>>>
>>> Thinking out loud a bit, I wonder if what you really want to do is basically:
>>>
>>> schedutil_cpu_util(cpu) = max(cpu_rq(cpu)->cfs.util_avg, total_cpu_util_avg);
>>>
>>> Where total_cpu_util_avg tracks the average utilization of the CPU itself over time (% of time the CPU was busy) in the same PELT like manner. The difference here is that it doesn't change instantaneously as tasks migrate in/out but it decays/accumulates just like the per-entity util_avgs.
>>
>> But we loose the interest of immediate decrease when tasks migrate.
>
> Indeed, this is not ideal.
>
>> Instead of total_cpu_util_avg we should better track RT utilization in
>> the same manner so with ongoing work for deadline we will have :
>> total_utilization = cfs.util_avg + rt's util_avg + deadline's util avg
>> and we still take advantage of task migration effect
>
> I agree that we need better tracking for RT and DL tasks but that doesn't solve the overloaded case with more than one CFS thread sharing a CPU.
>
> In the overloaded case, we care not just about the instant where the migrate happens but also subsequent windows where the PELT metric is slowly ramping up to reflect the real utilization of a task now that it has a CPU to itself.
>
> Maybe there are better ways to solve that though :-)

I wonder if it's viable to postpone the utilization update on both the
source and target runqueues until the task has been fully migrated?

That would make the artificial utilization reductions go away.

Thanks,
Rafael
Wanpeng Li May 8, 2017, 3:49 a.m. UTC | #13
Hi Rafael,
2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> The way the schedutil governor uses the PELT metric causes it to
> underestimate the CPU utilization in some cases.
>
> That can be easily demonstrated by running kernel compilation on
> a Sandy Bridge Intel processor, running turbostat in parallel with
> it and looking at the values written to the MSR_IA32_PERF_CTL
> register.  Namely, the expected result would be that when all CPUs
> were 100% busy, all of them would be requested to run in the maximum
> P-state, but observation shows that this clearly isn't the case.
> The CPUs run in the maximum P-state for a while and then are
> requested to run slower and go back to the maximum P-state after
> a while again.  That causes the actual frequency of the processor to
> visibly oscillate below the sustainable maximum in a jittery fashion
> which clearly is not desirable.
>
> That has been attributed to CPU utilization metric updates on task
> migration that cause the total utilization value for the CPU to be
> reduced by the utilization of the migrated task.  If that happens,
> the schedutil governor may see a CPU utilization reduction and will
> attempt to reduce the CPU frequency accordingly right away.  That
> may be premature, though, for example if the system is generally
> busy and there are other runnable tasks waiting to be run on that
> CPU already.
>
> This is unlikely to be an issue on systems where cpufreq policies are
> shared between multiple CPUs, because in those cases the policy
> utilization is computed as the maximum of the CPU utilization values

Sorry for one question maybe not associated with this patch. If the
cpufreq policy is shared between multiple CPUs, the function
intel_cpufreq_target()  just updates IA32_PERF_CTL MSR of the cpu
which is managing this policy, I wonder whether other cpus which are
affected should also update their per-logical cpu's IA32_PERF_CTL MSR?

Regards,
Wanpeng Li

> over the whole policy and if that turns out to be low, reducing the
> frequency for the policy most likely is a good idea anyway.  On
> systems with one CPU per policy, however, it may affect performance
> adversely and even lead to increased energy consumption in some cases.
>
> On those systems it may be addressed by taking another utilization
> metric into consideration, like whether or not the CPU whose
> frequency is about to be reduced has been idle recently, because if
> that's not the case, the CPU is likely to be busy in the near future
> and its frequency should not be reduced.
>
> To that end, use the counter of idle calls in the timekeeping code.
> Namely, make the schedutil governor look at that counter for the
> current CPU every time before its frequency is about to be reduced.
> If the counter has not changed since the previous iteration of the
> governor computations for that CPU, the CPU has been busy for all
> that time and its frequency should not be decreased, so if the new
> frequency would be lower than the one set previously, the governor
> will skip the frequency update.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  include/linux/tick.h             |    1 +
>  kernel/sched/cpufreq_schedutil.c |   27 +++++++++++++++++++++++++++
>  kernel/time/tick-sched.c         |   12 ++++++++++++
>  3 files changed, 40 insertions(+)
>
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===================================================================
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -61,6 +61,11 @@ struct sugov_cpu {
>         unsigned long util;
>         unsigned long max;
>         unsigned int flags;
> +
> +       /* The field below is for single-CPU policies only. */
> +#ifdef CONFIG_NO_HZ_COMMON
> +       unsigned long saved_idle_calls;
> +#endif
>  };
>
>  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
> @@ -192,6 +197,19 @@ static void sugov_iowait_boost(struct su
>         sg_cpu->iowait_boost >>= 1;
>  }
>
> +#ifdef CONFIG_NO_HZ_COMMON
> +static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
> +{
> +       unsigned long idle_calls = tick_nohz_get_idle_calls();
> +       bool ret = idle_calls == sg_cpu->saved_idle_calls;
> +
> +       sg_cpu->saved_idle_calls = idle_calls;
> +       return ret;
> +}
> +#else
> +static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> +#endif /* CONFIG_NO_HZ_COMMON */
> +
>  static void sugov_update_single(struct update_util_data *hook, u64 time,
>                                 unsigned int flags)
>  {
> @@ -200,6 +218,7 @@ static void sugov_update_single(struct u
>         struct cpufreq_policy *policy = sg_policy->policy;
>         unsigned long util, max;
>         unsigned int next_f;
> +       bool busy;
>
>         sugov_set_iowait_boost(sg_cpu, time, flags);
>         sg_cpu->last_update = time;
> @@ -207,12 +226,20 @@ static void sugov_update_single(struct u
>         if (!sugov_should_update_freq(sg_policy, time))
>                 return;
>
> +       busy = sugov_cpu_is_busy(sg_cpu);
> +
>         if (flags & SCHED_CPUFREQ_RT_DL) {
>                 next_f = policy->cpuinfo.max_freq;
>         } else {
>                 sugov_get_util(&util, &max);
>                 sugov_iowait_boost(sg_cpu, &util, &max);
>                 next_f = get_next_freq(sg_policy, util, max);
> +               /*
> +                * Do not reduce the frequency if the CPU has not been idle
> +                * recently, as the reduction is likely to be premature then.
> +                */
> +               if (busy && next_f < sg_policy->next_freq)
> +                       next_f = sg_policy->next_freq;
>         }
>         sugov_update_commit(sg_policy, time, next_f);
>  }
> Index: linux-pm/include/linux/tick.h
> ===================================================================
> --- linux-pm.orig/include/linux/tick.h
> +++ linux-pm/include/linux/tick.h
> @@ -117,6 +117,7 @@ extern void tick_nohz_idle_enter(void);
>  extern void tick_nohz_idle_exit(void);
>  extern void tick_nohz_irq_exit(void);
>  extern ktime_t tick_nohz_get_sleep_length(void);
> +extern unsigned long tick_nohz_get_idle_calls(void);
>  extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
>  extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
>  #else /* !CONFIG_NO_HZ_COMMON */
> Index: linux-pm/kernel/time/tick-sched.c
> ===================================================================
> --- linux-pm.orig/kernel/time/tick-sched.c
> +++ linux-pm/kernel/time/tick-sched.c
> @@ -993,6 +993,18 @@ ktime_t tick_nohz_get_sleep_length(void)
>         return ts->sleep_length;
>  }
>
> +/**
> + * tick_nohz_get_idle_calls - return the current idle calls counter value
> + *
> + * Called from the schedutil frequency scaling governor in scheduler context.
> + */
> +unsigned long tick_nohz_get_idle_calls(void)
> +{
> +       struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
> +
> +       return ts->idle_calls;
> +}
> +
>  static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
>  {
>  #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
>
Viresh Kumar May 8, 2017, 4:01 a.m. UTC | #14
On 08-05-17, 11:49, Wanpeng Li wrote:
> Hi Rafael,
> 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > The way the schedutil governor uses the PELT metric causes it to
> > underestimate the CPU utilization in some cases.
> >
> > That can be easily demonstrated by running kernel compilation on
> > a Sandy Bridge Intel processor, running turbostat in parallel with
> > it and looking at the values written to the MSR_IA32_PERF_CTL
> > register.  Namely, the expected result would be that when all CPUs
> > were 100% busy, all of them would be requested to run in the maximum
> > P-state, but observation shows that this clearly isn't the case.
> > The CPUs run in the maximum P-state for a while and then are
> > requested to run slower and go back to the maximum P-state after
> > a while again.  That causes the actual frequency of the processor to
> > visibly oscillate below the sustainable maximum in a jittery fashion
> > which clearly is not desirable.
> >
> > That has been attributed to CPU utilization metric updates on task
> > migration that cause the total utilization value for the CPU to be
> > reduced by the utilization of the migrated task.  If that happens,
> > the schedutil governor may see a CPU utilization reduction and will
> > attempt to reduce the CPU frequency accordingly right away.  That
> > may be premature, though, for example if the system is generally
> > busy and there are other runnable tasks waiting to be run on that
> > CPU already.
> >
> > This is unlikely to be an issue on systems where cpufreq policies are
> > shared between multiple CPUs, because in those cases the policy
> > utilization is computed as the maximum of the CPU utilization values
> 
> Sorry for one question maybe not associated with this patch. If the
> cpufreq policy is shared between multiple CPUs, the function
> intel_cpufreq_target()  just updates IA32_PERF_CTL MSR of the cpu
> which is managing this policy, I wonder whether other cpus which are
> affected should also update their per-logical cpu's IA32_PERF_CTL MSR?

The CPUs share the policy when they share their freq/voltage rails and so
changing perf state of one CPU should result in that changing for all the CPUs
in that policy. Otherwise, they can't be considered to be part of the same
policy.

That's why this code is changing it only for policy->cpu alone.
Wanpeng Li May 8, 2017, 5:15 a.m. UTC | #15
2017-05-08 12:01 GMT+08:00 Viresh Kumar <viresh.kumar@linaro.org>:
> On 08-05-17, 11:49, Wanpeng Li wrote:
>> Hi Rafael,
>> 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> >
>> > The way the schedutil governor uses the PELT metric causes it to
>> > underestimate the CPU utilization in some cases.
>> >
>> > That can be easily demonstrated by running kernel compilation on
>> > a Sandy Bridge Intel processor, running turbostat in parallel with
>> > it and looking at the values written to the MSR_IA32_PERF_CTL
>> > register.  Namely, the expected result would be that when all CPUs
>> > were 100% busy, all of them would be requested to run in the maximum
>> > P-state, but observation shows that this clearly isn't the case.
>> > The CPUs run in the maximum P-state for a while and then are
>> > requested to run slower and go back to the maximum P-state after
>> > a while again.  That causes the actual frequency of the processor to
>> > visibly oscillate below the sustainable maximum in a jittery fashion
>> > which clearly is not desirable.
>> >
>> > That has been attributed to CPU utilization metric updates on task
>> > migration that cause the total utilization value for the CPU to be
>> > reduced by the utilization of the migrated task.  If that happens,
>> > the schedutil governor may see a CPU utilization reduction and will
>> > attempt to reduce the CPU frequency accordingly right away.  That
>> > may be premature, though, for example if the system is generally
>> > busy and there are other runnable tasks waiting to be run on that
>> > CPU already.
>> >
>> > This is unlikely to be an issue on systems where cpufreq policies are
>> > shared between multiple CPUs, because in those cases the policy
>> > utilization is computed as the maximum of the CPU utilization values
>>
>> Sorry for one question maybe not associated with this patch. If the
>> cpufreq policy is shared between multiple CPUs, the function
>> intel_cpufreq_target()  just updates IA32_PERF_CTL MSR of the cpu
>> which is managing this policy, I wonder whether other cpus which are
>> affected should also update their per-logical cpu's IA32_PERF_CTL MSR?
>
> The CPUs share the policy when they share their freq/voltage rails and so
> changing perf state of one CPU should result in that changing for all the CPUs
> in that policy. Otherwise, they can't be considered to be part of the same
> policy.
>
> That's why this code is changing it only for policy->cpu alone.

I see, thanks for the explanation.

Regards,
Wanpeng Li
Rafael J. Wysocki May 8, 2017, 10:16 p.m. UTC | #16
On Monday, May 08, 2017 09:31:19 AM Viresh Kumar wrote:
> On 08-05-17, 11:49, Wanpeng Li wrote:
> > Hi Rafael,
> > 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > The way the schedutil governor uses the PELT metric causes it to
> > > underestimate the CPU utilization in some cases.
> > >
> > > That can be easily demonstrated by running kernel compilation on
> > > a Sandy Bridge Intel processor, running turbostat in parallel with
> > > it and looking at the values written to the MSR_IA32_PERF_CTL
> > > register.  Namely, the expected result would be that when all CPUs
> > > were 100% busy, all of them would be requested to run in the maximum
> > > P-state, but observation shows that this clearly isn't the case.
> > > The CPUs run in the maximum P-state for a while and then are
> > > requested to run slower and go back to the maximum P-state after
> > > a while again.  That causes the actual frequency of the processor to
> > > visibly oscillate below the sustainable maximum in a jittery fashion
> > > which clearly is not desirable.
> > >
> > > That has been attributed to CPU utilization metric updates on task
> > > migration that cause the total utilization value for the CPU to be
> > > reduced by the utilization of the migrated task.  If that happens,
> > > the schedutil governor may see a CPU utilization reduction and will
> > > attempt to reduce the CPU frequency accordingly right away.  That
> > > may be premature, though, for example if the system is generally
> > > busy and there are other runnable tasks waiting to be run on that
> > > CPU already.
> > >
> > > This is unlikely to be an issue on systems where cpufreq policies are
> > > shared between multiple CPUs, because in those cases the policy
> > > utilization is computed as the maximum of the CPU utilization values
> > 
> > Sorry for one question maybe not associated with this patch. If the
> > cpufreq policy is shared between multiple CPUs, the function
> > intel_cpufreq_target()  just updates IA32_PERF_CTL MSR of the cpu
> > which is managing this policy, I wonder whether other cpus which are
> > affected should also update their per-logical cpu's IA32_PERF_CTL MSR?
> 
> The CPUs share the policy when they share their freq/voltage rails and so
> changing perf state of one CPU should result in that changing for all the CPUs
> in that policy. Otherwise, they can't be considered to be part of the same
> policy.

To be entirely precise, this depends on the granularity of the HW interface.

If the interface is per-logical-CPU, we will use it this way for efficiency
reasons and even if there is some coordination on the HW side, the information
on how exactly it works usually is limited.

Thanks,
Rafael
Wanpeng Li May 8, 2017, 10:36 p.m. UTC | #17
2017-05-09 6:16 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> On Monday, May 08, 2017 09:31:19 AM Viresh Kumar wrote:
>> On 08-05-17, 11:49, Wanpeng Li wrote:
>> > Hi Rafael,
>> > 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> > >
>> > > The way the schedutil governor uses the PELT metric causes it to
>> > > underestimate the CPU utilization in some cases.
>> > >
>> > > That can be easily demonstrated by running kernel compilation on
>> > > a Sandy Bridge Intel processor, running turbostat in parallel with
>> > > it and looking at the values written to the MSR_IA32_PERF_CTL
>> > > register.  Namely, the expected result would be that when all CPUs
>> > > were 100% busy, all of them would be requested to run in the maximum
>> > > P-state, but observation shows that this clearly isn't the case.
>> > > The CPUs run in the maximum P-state for a while and then are
>> > > requested to run slower and go back to the maximum P-state after
>> > > a while again.  That causes the actual frequency of the processor to
>> > > visibly oscillate below the sustainable maximum in a jittery fashion
>> > > which clearly is not desirable.
>> > >
>> > > That has been attributed to CPU utilization metric updates on task
>> > > migration that cause the total utilization value for the CPU to be
>> > > reduced by the utilization of the migrated task.  If that happens,
>> > > the schedutil governor may see a CPU utilization reduction and will
>> > > attempt to reduce the CPU frequency accordingly right away.  That
>> > > may be premature, though, for example if the system is generally
>> > > busy and there are other runnable tasks waiting to be run on that
>> > > CPU already.
>> > >
>> > > This is unlikely to be an issue on systems where cpufreq policies are
>> > > shared between multiple CPUs, because in those cases the policy
>> > > utilization is computed as the maximum of the CPU utilization values
>> >
>> > Sorry for one question maybe not associated with this patch. If the
>> > cpufreq policy is shared between multiple CPUs, the function
>> > intel_cpufreq_target()  just updates IA32_PERF_CTL MSR of the cpu
>> > which is managing this policy, I wonder whether other cpus which are
>> > affected should also update their per-logical cpu's IA32_PERF_CTL MSR?
>>
>> The CPUs share the policy when they share their freq/voltage rails and so
>> changing perf state of one CPU should result in that changing for all the CPUs
>> in that policy. Otherwise, they can't be considered to be part of the same
>> policy.
>
> To be entirely precise, this depends on the granularity of the HW interface.
>
> If the interface is per-logical-CPU, we will use it this way for efficiency
> reasons and even if there is some coordination on the HW side, the information
> on how exactly it works usually is limited.

I check it on several Xeon servers on hand, however, I didn't find
/sys/devices/system/cpu/cpufreq/policyx/affected_cpus can affect more
than one logical cpu, so I guess most of Xeon servers are not support
shared cpufreq policy, then which kind of boxes support that?

Regards,
Wanpeng Li
Rafael J. Wysocki May 8, 2017, 11:01 p.m. UTC | #18
On Tuesday, May 09, 2017 06:36:14 AM Wanpeng Li wrote:
> 2017-05-09 6:16 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > On Monday, May 08, 2017 09:31:19 AM Viresh Kumar wrote:
> >> On 08-05-17, 11:49, Wanpeng Li wrote:
> >> > Hi Rafael,
> >> > 2017-03-22 7:08 GMT+08:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> >> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >> > >
> >> > > The way the schedutil governor uses the PELT metric causes it to
> >> > > underestimate the CPU utilization in some cases.
> >> > >
> >> > > That can be easily demonstrated by running kernel compilation on
> >> > > a Sandy Bridge Intel processor, running turbostat in parallel with
> >> > > it and looking at the values written to the MSR_IA32_PERF_CTL
> >> > > register.  Namely, the expected result would be that when all CPUs
> >> > > were 100% busy, all of them would be requested to run in the maximum
> >> > > P-state, but observation shows that this clearly isn't the case.
> >> > > The CPUs run in the maximum P-state for a while and then are
> >> > > requested to run slower and go back to the maximum P-state after
> >> > > a while again.  That causes the actual frequency of the processor to
> >> > > visibly oscillate below the sustainable maximum in a jittery fashion
> >> > > which clearly is not desirable.
> >> > >
> >> > > That has been attributed to CPU utilization metric updates on task
> >> > > migration that cause the total utilization value for the CPU to be
> >> > > reduced by the utilization of the migrated task.  If that happens,
> >> > > the schedutil governor may see a CPU utilization reduction and will
> >> > > attempt to reduce the CPU frequency accordingly right away.  That
> >> > > may be premature, though, for example if the system is generally
> >> > > busy and there are other runnable tasks waiting to be run on that
> >> > > CPU already.
> >> > >
> >> > > This is unlikely to be an issue on systems where cpufreq policies are
> >> > > shared between multiple CPUs, because in those cases the policy
> >> > > utilization is computed as the maximum of the CPU utilization values
> >> >
> >> > Sorry for one question maybe not associated with this patch. If the
> >> > cpufreq policy is shared between multiple CPUs, the function
> >> > intel_cpufreq_target()  just updates IA32_PERF_CTL MSR of the cpu
> >> > which is managing this policy, I wonder whether other cpus which are
> >> > affected should also update their per-logical cpu's IA32_PERF_CTL MSR?
> >>
> >> The CPUs share the policy when they share their freq/voltage rails and so
> >> changing perf state of one CPU should result in that changing for all the CPUs
> >> in that policy. Otherwise, they can't be considered to be part of the same
> >> policy.
> >
> > To be entirely precise, this depends on the granularity of the HW interface.
> >
> > If the interface is per-logical-CPU, we will use it this way for efficiency
> > reasons and even if there is some coordination on the HW side, the information
> > on how exactly it works usually is limited.
> 
> I check it on several Xeon servers on hand, however, I didn't find
> /sys/devices/system/cpu/cpufreq/policyx/affected_cpus can affect more
> than one logical cpu, so I guess most of Xeon servers are not support
> shared cpufreq policy, then which kind of boxes support that?

On Intel the interface for performance scaling is per-logical-CPU in general.

Thanks,
Rafael
diff mbox

Patch

Index: linux-pm/kernel/sched/cpufreq_schedutil.c
===================================================================
--- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
+++ linux-pm/kernel/sched/cpufreq_schedutil.c
@@ -61,6 +61,11 @@  struct sugov_cpu {
 	unsigned long util;
 	unsigned long max;
 	unsigned int flags;
+
+	/* The field below is for single-CPU policies only. */
+#ifdef CONFIG_NO_HZ_COMMON
+	unsigned long saved_idle_calls;
+#endif
 };
 
 static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
@@ -192,6 +197,19 @@  static void sugov_iowait_boost(struct su
 	sg_cpu->iowait_boost >>= 1;
 }
 
+#ifdef CONFIG_NO_HZ_COMMON
+static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
+{
+	unsigned long idle_calls = tick_nohz_get_idle_calls();
+	bool ret = idle_calls == sg_cpu->saved_idle_calls;
+
+	sg_cpu->saved_idle_calls = idle_calls;
+	return ret;
+}
+#else
+static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
+#endif /* CONFIG_NO_HZ_COMMON */
+
 static void sugov_update_single(struct update_util_data *hook, u64 time,
 				unsigned int flags)
 {
@@ -200,6 +218,7 @@  static void sugov_update_single(struct u
 	struct cpufreq_policy *policy = sg_policy->policy;
 	unsigned long util, max;
 	unsigned int next_f;
+	bool busy;
 
 	sugov_set_iowait_boost(sg_cpu, time, flags);
 	sg_cpu->last_update = time;
@@ -207,12 +226,20 @@  static void sugov_update_single(struct u
 	if (!sugov_should_update_freq(sg_policy, time))
 		return;
 
+	busy = sugov_cpu_is_busy(sg_cpu);
+
 	if (flags & SCHED_CPUFREQ_RT_DL) {
 		next_f = policy->cpuinfo.max_freq;
 	} else {
 		sugov_get_util(&util, &max);
 		sugov_iowait_boost(sg_cpu, &util, &max);
 		next_f = get_next_freq(sg_policy, util, max);
+		/*
+		 * Do not reduce the frequency if the CPU has not been idle
+		 * recently, as the reduction is likely to be premature then.
+		 */
+		if (busy && next_f < sg_policy->next_freq)
+			next_f = sg_policy->next_freq;
 	}
 	sugov_update_commit(sg_policy, time, next_f);
 }
Index: linux-pm/include/linux/tick.h
===================================================================
--- linux-pm.orig/include/linux/tick.h
+++ linux-pm/include/linux/tick.h
@@ -117,6 +117,7 @@  extern void tick_nohz_idle_enter(void);
 extern void tick_nohz_idle_exit(void);
 extern void tick_nohz_irq_exit(void);
 extern ktime_t tick_nohz_get_sleep_length(void);
+extern unsigned long tick_nohz_get_idle_calls(void);
 extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
 extern u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time);
 #else /* !CONFIG_NO_HZ_COMMON */
Index: linux-pm/kernel/time/tick-sched.c
===================================================================
--- linux-pm.orig/kernel/time/tick-sched.c
+++ linux-pm/kernel/time/tick-sched.c
@@ -993,6 +993,18 @@  ktime_t tick_nohz_get_sleep_length(void)
 	return ts->sleep_length;
 }
 
+/**
+ * tick_nohz_get_idle_calls - return the current idle calls counter value
+ *
+ * Called from the schedutil frequency scaling governor in scheduler context.
+ */
+unsigned long tick_nohz_get_idle_calls(void)
+{
+	struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched);
+
+	return ts->idle_calls;
+}
+
 static void tick_nohz_account_idle_ticks(struct tick_sched *ts)
 {
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE