diff mbox

[intel-pstate,driver,regression] processor frequency very high even if in idle

Message ID 3623107.tlAuqH4F7s@vostro.rjw.lan (mailing list archive)
State Accepted, archived
Delegated to: Rafael Wysocki
Headers show

Commit Message

Rafael J. Wysocki April 1, 2016, 12:40 p.m. UTC
On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
> >> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> >> > On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
> >> >
> >> > [cut]
> >> >
> >> >> >
> >> >>
> >> >> Yes, works for me.
> >> >>
> >> >> CPUID(7): No-SGX
> >> >>      CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz
> >> >>        -      11    0.66 1682 2494
> >> >>        0      11    0.60 1856 2494
> >> >>        1       6    0.34    1898    2494
> >> >>        2      13    0.82    1628    2494
> >> >>        3      13    0.87    1528    2494
> >> >>      CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz
> >> >>        -       6    0.58     963    2494
> >> >>        0       8    0.83     957    2494
> >> >>        1       1    0.08     984    2494
> >> >>        2      10    1.04     975    2494
> >> >>        3       3    0.35     934    2494
> >> >>
> >> >
> >
> > [cut]
> >
> >> >
> >>
> >> No, this patch doesn't help.
> >
> > Well, more work to do then.
> >
> > I've just noticed a bug in this patch, which is not relevant for the results,
> > but below is a new version.
> >
> >> CPUID(7): No-SGX
> >>       CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz
> >>        -       8    0.32    2507    2495
> >>        0      13    0.53    2505    2495
> >>        1       3    0.11    2523    2495
> >>        2       1    0.06    2555    2495
> >>        3      15    0.59    2500    2495
> >>      CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz
> >>        -       8    0.33    2486    2495
> >>        0      12    0.50    2482    2495
> >>        1       5    0.22    2489    2495
> >>        2       1    0.04    2492    2495
> >>        3      15    0.59    2487    2495
> >

[cut]

> 
> here they are.
> 

Thanks!

First of all, the sampling mechanics works as expected in the failing case,
which is the most important thing I wanted to know.  However, there are anomalies
in the failing case trace.  The core_busy column is clearly suspicious and it
looks like CPUs 2 and 3 never really go idle.  I guess we'll need to find out
why they don't go idle to get to the bottom of this, but it firmly falls into
the weird stuff territory already.

In the meantime, below is one more patch to test, on top of the previous one
(that is, https://patchwork.kernel.org/patch/8714401/).

Again, this is a change I'd like to make regardless, so it would be good to
know if anything more has to be done before we go further.

---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] intel_pstate: Avoid extra invocation of intel_pstate_sample()

The initialization of intel_pstate for a given CPU involves populating
the fields of its struct cpudata that represent the previous sample,
but currently that is done in a problematic way.

Namely, intel_pstate_init_cpu() makes an extra call to
intel_pstate_sample() so it reads the current register values that
will be used to populate the "previous sample" record during the
next invocation of intel_pstate_sample().  However, after commit
a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization
update callbacks) that doesn't work for last_sample_time, because
the time value is passed to intel_pstate_sample() as an argument now.
Passing 0 to it from intel_pstate_init_cpu() is problematic, because
that causes cpu->last_sample_time == 0 to be visible in
get_target_pstate_use_performance() (and hence the extra
cpu->last_sample_time > 0 check in there) and effectively allows
the first invocation of intel_pstate_sample() from
intel_pstate_update_util() to happen immediately after the
initialization which may lead to a significant "turn on"
effect in the governor algorithm.

To mitigate that issue, rework the initialization to avoid the
extra intel_pstate_sample() call from intel_pstate_init_cpu().
Instead, make intel_pstate_sample() return false if it has been
called with cpu->sample.time equal to zero, which will make
intel_pstate_update_util() skip the sample in that case, and
reset cpu->sample.time from intel_pstate_set_update_util_hook()
to make the algorithm start properly every time the hook is set.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/cpufreq/intel_pstate.c |   21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jörg Otte April 1, 2016, 2:06 p.m. UTC | #1
2016-04-01 14:40 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>> > On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>> >> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>> >> > On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
>> >> >
>> >> > [cut]
>> >> >
>> >> >> >
>> >> >>
>> >> >> Yes, works for me.
>> >> >>
>> >> >> CPUID(7): No-SGX
>> >> >>      CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz
>> >> >>        -      11    0.66 1682 2494
>> >> >>        0      11    0.60 1856 2494
>> >> >>        1       6    0.34    1898    2494
>> >> >>        2      13    0.82    1628    2494
>> >> >>        3      13    0.87    1528    2494
>> >> >>      CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz
>> >> >>        -       6    0.58     963    2494
>> >> >>        0       8    0.83     957    2494
>> >> >>        1       1    0.08     984    2494
>> >> >>        2      10    1.04     975    2494
>> >> >>        3       3    0.35     934    2494
>> >> >>
>> >> >
>> >
>> > [cut]
>> >
>> >> >
>> >>
>> >> No, this patch doesn't help.
>> >
>> > Well, more work to do then.
>> >
>> > I've just noticed a bug in this patch, which is not relevant for the results,
>> > but below is a new version.
>> >
>> >> CPUID(7): No-SGX
>> >>       CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz
>> >>        -       8    0.32    2507    2495
>> >>        0      13    0.53    2505    2495
>> >>        1       3    0.11    2523    2495
>> >>        2       1    0.06    2555    2495
>> >>        3      15    0.59    2500    2495
>> >>      CPU Avg_MHz   Busy% Bzy_MHz TSC_MHz
>> >>        -       8    0.33    2486    2495
>> >>        0      12    0.50    2482    2495
>> >>        1       5    0.22    2489    2495
>> >>        2       1    0.04    2492    2495
>> >>        3      15    0.59    2487    2495
>> >
>
> [cut]
>
>>
>> here they are.
>>
>
> Thanks!
>
> First of all, the sampling mechanics works as expected in the failing case,
> which is the most important thing I wanted to know.  However, there are anomalies
> in the failing case trace.  The core_busy column is clearly suspicious and it
> looks like CPUs 2 and 3 never really go idle.  I guess we'll need to find out
> why they don't go idle to get to the bottom of this, but it firmly falls into
> the weird stuff territory already.
>
> In the meantime, below is one more patch to test, on top of the previous one
> (that is, https://patchwork.kernel.org/patch/8714401/).
>
> Again, this is a change I'd like to make regardless, so it would be good to
> know if anything more has to be done before we go further.
>
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH] intel_pstate: Avoid extra invocation of intel_pstate_sample()
>
> The initialization of intel_pstate for a given CPU involves populating
> the fields of its struct cpudata that represent the previous sample,
> but currently that is done in a problematic way.
>
> Namely, intel_pstate_init_cpu() makes an extra call to
> intel_pstate_sample() so it reads the current register values that
> will be used to populate the "previous sample" record during the
> next invocation of intel_pstate_sample().  However, after commit
> a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization
> update callbacks) that doesn't work for last_sample_time, because
> the time value is passed to intel_pstate_sample() as an argument now.
> Passing 0 to it from intel_pstate_init_cpu() is problematic, because
> that causes cpu->last_sample_time == 0 to be visible in
> get_target_pstate_use_performance() (and hence the extra
> cpu->last_sample_time > 0 check in there) and effectively allows
> the first invocation of intel_pstate_sample() from
> intel_pstate_update_util() to happen immediately after the
> initialization which may lead to a significant "turn on"
> effect in the governor algorithm.
>
> To mitigate that issue, rework the initialization to avoid the
> extra intel_pstate_sample() call from intel_pstate_init_cpu().
> Instead, make intel_pstate_sample() return false if it has been
> called with cpu->sample.time equal to zero, which will make
> intel_pstate_update_util() skip the sample in that case, and
> reset cpu->sample.time from intel_pstate_set_update_util_hook()
> to make the algorithm start properly every time the hook is set.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpufreq/intel_pstate.c |   21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -910,7 +910,14 @@ static inline bool intel_pstate_sample(s
>         cpu->prev_aperf = aperf;
>         cpu->prev_mperf = mperf;
>         cpu->prev_tsc = tsc;
> -       return true;
> +       /*
> +        * First time this function is invoked in a given cycle, all of the
> +        * previous sample data fields are equal to zero or stale and they must
> +        * be populated with meaningful numbers for things to work, so assume
> +        * that sample.time will always be reset before setting the utilization
> +        * update hook and make the caller skip the sample then.
> +        */
> +       return !!cpu->last_sample_time;
>  }
>
>  static inline int32_t get_avg_frequency(struct cpudata *cpu)
> @@ -984,8 +991,7 @@ static inline int32_t get_target_pstate_
>          * enough period of time to adjust our busyness.
>          */
>         duration_ns = cpu->sample.time - cpu->last_sample_time;
> -       if ((s64)duration_ns > pid_params.sample_rate_ns * 3
> -           && cpu->last_sample_time > 0) {
> +       if ((s64)duration_ns > pid_params.sample_rate_ns * 3) {
>                 sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
>                                       int_tofp(duration_ns));
>                 core_busy = mul_fp(core_busy, sample_ratio);
> @@ -1100,7 +1106,6 @@ static int intel_pstate_init_cpu(unsigne
>         intel_pstate_get_cpu_pstates(cpu);
>
>         intel_pstate_busy_pid_reset(cpu);
> -       intel_pstate_sample(cpu, 0);
>
>         cpu->update_util.func = intel_pstate_update_util;
>
> @@ -1121,9 +1126,13 @@ static unsigned int intel_pstate_get(uns
>         return get_avg_frequency(cpu);
>  }
>
> -static void intel_pstate_set_update_util_hook(unsigned int cpu)
> +static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
>  {
> -       cpufreq_set_update_util_data(cpu, &all_cpu_data[cpu]->update_util);
> +       struct cpudata *cpu = all_cpu_data[cpu_num];
> +
> +       /* Prevent intel_pstate_update_util() from using stale data. */
> +       cpu->sample.time = 0;
> +       cpufreq_set_update_util_data(cpu_num, &cpu->update_util);
>  }
>
>  static void intel_pstate_clear_update_util_hook(unsigned int cpu)
>

Done. Attached the tracer.
For me it looks like the previous one of the failing case.

Thanks, Jörg
Doug Smythies April 1, 2016, 3:20 p.m. UTC | #2
On 2016.04.01 05:40 Rafael J. Wysocki wrote:
> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>> On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>>>> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>>> On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:

>> 
>> here they are.
>> 

> First of all, the sampling mechanics works as expected
> in the failing case, which is the most important thing
> I wanted to know.

Yes, but that might be part of the problem, as for some CPUs
there is never a long duration, and thus the long duration
check never kicks in driving the target pstate down.

> The core_busy column is clearly suspicious and it
> looks like CPUs 2 and 3 never really go idle.

This has been observed several times before [1].
Due to beat frequencies between desktop type frame rates
and such, the worst manifestation of the issue seems to be
for 300 Hz kernels, but Ubuntu uses uses 250 Hz.

Oh look, Jörg is using 300 Hz!!

$ grep CONFIG_HZ .config_jorg
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300

> I guess we'll need to find out
> why they don't go idle to get to the bottom of this, but it firmly falls into
> the weird stuff territory already.

I'm compiling a 300 Hz kernel now, also with "# CONFIG_NO_HZ is not set",
but I have never been able to re-create these type of findings before.

I have also tried several other things in an attempt re-create Jörg's
Case, so far without success.

References:
[1] https://bugzilla.kernel.org/show_bug.cgi?id=93521
In particular:
https://bugzilla.kernel.org/show_bug.cgi?id=93521#c35
https://bugzilla.kernel.org/show_bug.cgi?id=93521#c42
https://bugzilla.kernel.org/show_bug.cgi?id=93521#c77

... Doug


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jörg Otte April 1, 2016, 4:46 p.m. UTC | #3
2016-04-01 17:20 GMT+02:00 Doug Smythies <dsmythies@telus.net>:
> On 2016.04.01 05:40 Rafael J. Wysocki wrote:
>> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>> On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>>>>> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>>>> On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
>
>>>
>>> here they are.
>>>
>
>> First of all, the sampling mechanics works as expected
>> in the failing case, which is the most important thing
>> I wanted to know.
>
> Yes, but that might be part of the problem, as for some CPUs
> there is never a long duration, and thus the long duration
> check never kicks in driving the target pstate down.
>
>> The core_busy column is clearly suspicious and it
>> looks like CPUs 2 and 3 never really go idle.
>
> This has been observed several times before [1].
> Due to beat frequencies between desktop type frame rates
> and such, the worst manifestation of the issue seems to be
> for 300 Hz kernels, but Ubuntu uses uses 250 Hz.
>
> Oh look, Jörg is using 300 Hz!!
>
> $ grep CONFIG_HZ .config_jorg
> # CONFIG_HZ_PERIODIC is not set
> # CONFIG_HZ_100 is not set
> # CONFIG_HZ_250 is not set
> CONFIG_HZ_300=y
> # CONFIG_HZ_1000 is not set
> CONFIG_HZ=300
>

I use 300Hz because of:
"250 Hz is a good compromise choice allowing server performance
while also showing good interactive responsiveness even
on SMP and NUMA systems. If you are going to be using NTSC video
or multimedia, selected 300Hz instead." (from KBuild helptext)

-> I often use multimedia so according this text 300 Hz is the better
choice.

>> I guess we'll need to find out
>> why they don't go idle to get to the bottom of this, but it firmly falls into
>> the weird stuff territory already.

> I'm compiling a 300 Hz kernel now, also with "# CONFIG_NO_HZ is not set",

Again from KBuild helptext:
"CONFIG_NO_HZ:
This is the old config entry that enables dynticks idle.
We keep it around for a little while to enforce backward
compatibility with older config files."

-> NO_HZ outdated.

> but I have never been able to re-create these type of findings before.
>
> I have also tried several other things in an attempt re-create Jörg's
> Case, so far without success.
>
> References:
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=93521
> In particular:
> https://bugzilla.kernel.org/show_bug.cgi?id=93521#c35
> https://bugzilla.kernel.org/show_bug.cgi?id=93521#c42
> https://bugzilla.kernel.org/show_bug.cgi?id=93521#c77
>
> ... Doug


Nevertheless, I'll try setting 250Hz + NO_HZ

Thanks, Jörg
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jörg Otte April 1, 2016, 5:34 p.m. UTC | #4
2016-04-01 18:46 GMT+02:00 Jörg Otte <jrg.otte@gmail.com>:
> 2016-04-01 17:20 GMT+02:00 Doug Smythies <dsmythies@telus.net>:
>> On 2016.04.01 05:40 Rafael J. Wysocki wrote:
>>> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>>>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>>> On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>>>>>> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>>>>> On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
>>
>>>>
>>>> here they are.
>>>>
>>
>>> First of all, the sampling mechanics works as expected
>>> in the failing case, which is the most important thing
>>> I wanted to know.
>>
>> Yes, but that might be part of the problem, as for some CPUs
>> there is never a long duration, and thus the long duration
>> check never kicks in driving the target pstate down.
>>
>>> The core_busy column is clearly suspicious and it
>>> looks like CPUs 2 and 3 never really go idle.
>>
>> This has been observed several times before [1].
>> Due to beat frequencies between desktop type frame rates
>> and such, the worst manifestation of the issue seems to be
>> for 300 Hz kernels, but Ubuntu uses uses 250 Hz.
>>
>> Oh look, Jörg is using 300 Hz!!
>>
>> $ grep CONFIG_HZ .config_jorg
>> # CONFIG_HZ_PERIODIC is not set
>> # CONFIG_HZ_100 is not set
>> # CONFIG_HZ_250 is not set
>> CONFIG_HZ_300=y
>> # CONFIG_HZ_1000 is not set
>> CONFIG_HZ=300
>>
>
> I use 300Hz because of:
> "250 Hz is a good compromise choice allowing server performance
> while also showing good interactive responsiveness even
> on SMP and NUMA systems. If you are going to be using NTSC video
> or multimedia, selected 300Hz instead." (from KBuild helptext)
>
> -> I often use multimedia so according this text 300 Hz is the better
> choice.
>
>>> I guess we'll need to find out
>>> why they don't go idle to get to the bottom of this, but it firmly falls into
>>> the weird stuff territory already.
>
>> I'm compiling a 300 Hz kernel now, also with "# CONFIG_NO_HZ is not set",
>
> Again from KBuild helptext:
> "CONFIG_NO_HZ:
> This is the old config entry that enables dynticks idle.
> We keep it around for a little while to enforce backward
> compatibility with older config files."
>
> -> NO_HZ outdated.
>
>> but I have never been able to re-create these type of findings before.
>>
>> I have also tried several other things in an attempt re-create Jörg's
>> Case, so far without success.
>>
>> References:
>> [1] https://bugzilla.kernel.org/show_bug.cgi?id=93521
>> In particular:
>> https://bugzilla.kernel.org/show_bug.cgi?id=93521#c35
>> https://bugzilla.kernel.org/show_bug.cgi?id=93521#c42
>> https://bugzilla.kernel.org/show_bug.cgi?id=93521#c77
>>
>> ... Doug
>
>
> Nevertheless, I'll try setting 250Hz + NO_HZ
>
For me no improvements.
Neither 300->250Hz  nor  NO_HZ_IDLE + NO_HZ

Thanks, Jörg
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srinivas Pandruvada April 1, 2016, 5:44 p.m. UTC | #5
On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
> 2016-04-01 14:40 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > 
> > On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
> > > 
> > > 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > > > 
> > > > On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
> > > > > 
> > > > > 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.n
> > > > > et>:
> > > > > > 
> > > > > > On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
> > > > > > 
[Cut]
> > > > > > 
> Done. Attached the tracer.
> For me it looks like the previous one of the failing case.
> 
The traces show that idle task is constantly running without sleep. The
driver is processing samples for idle task for every 10ms and
aperf/mperf are showing that we are always in turbo mode for idle task.

Need to find out why idle task is not sleeping.

Thanks,
Srinivas



> Thanks, Jörg
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Smythies April 1, 2016, 6:31 p.m. UTC | #6
On 2106.034.01 10:45 Srinivas Pandruvada wrote:
> On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
> > > > > > 
>> Done. Attached the tracer.
>> For me it looks like the previous one of the failing case.
> 
> The traces show that idle task is constantly running without sleep.

No, they (at least the first one, I didn't look at the next one yet)
show that CPUs 2 and 3 are spending around 99% of their time not in state
C0. That the sample rate is ending up at ~10 Milliseconds, indicates some
high frequency (>= 100Hz) events on those CPUs. Those events, apparently,
take very little CPU time to complete, hence a load of about 1% on average.

By the way, I can recreate the high sample rate with virtually no load
on my system easy, but so far have been unable to get the high CPU
frequencies observed by Jörg. I can get my system to about a target pstate of
20 where it should have remained at 16, but that is about it.

> The driver is processing samples for idle task for every 10ms and
> aperf/mperf are showing that we are always in turbo mode for idle task.

That column pretty much always says "idle" (or swapper for my way of doing
things). I have not found it to very useful as an indicator, and considerably
more so since the utilization changes.

> 
> Need to find out why idle task is not sleeping.

I contend that is it.
I don't have enough reverted data, but so far this seems very much like
that bug report I referenced earlier.

... Doug


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Srinivas Pandruvada April 1, 2016, 6:41 p.m. UTC | #7
On Fri, 2016-04-01 at 11:31 -0700, Doug Smythies wrote:
> On 2106.034.01 10:45 Srinivas Pandruvada wrote:
> > 
> > On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > Done. Attached the tracer.
> > > For me it looks like the previous one of the failing case.
> > The traces show that idle task is constantly running without sleep.
> No,
I mean atleast on once CPU. I am not saying they are looping, but the
wakeup time is less than 10ms, so the driver will not scale the load
for idle time.

>  they (at least the first one, I didn't look at the next one yet)
> show that CPUs 2 and 3 are spending around 99% of their time not in
> state
> C0. That the sample rate is ending up at ~10 Milliseconds, indicates
> some
> high frequency (>= 100Hz) events on those CPUs. Those events,
> apparently,
> take very little CPU time to complete, hence a load of about 1% on
> average.
> 
> By the way, I can recreate the high sample rate with virtually no
> load
> on my system easy, but so far have been unable to get the high CPU
> frequencies observed by Jörg. I can get my system to about a target
> pstate of
> 20 where it should have remained at 16, but that is about it.
> 
> > 
> > The driver is processing samples for idle task for every 10ms and
> > aperf/mperf are showing that we are always in turbo mode for idle
> > task.
> That column pretty much always says "idle" (or swapper for my way of
> doing
> things). I have not found it to very useful as an indicator, and
> considerably
> more so since the utilization changes.
> 
> > 
> > 
> > Need to find out why idle task is not sleeping.
> I contend that is it.
> I don't have enough reverted data, but so far this seems very much
> like
> that bug report I referenced earlier.
> 
> ... Doug
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm"
> in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki April 1, 2016, 7:54 p.m. UTC | #8
On Fri, Apr 1, 2016 at 8:31 PM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2106.034.01 10:45 Srinivas Pandruvada wrote:
>> On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
>> > > > > >
>>> Done. Attached the tracer.
>>> For me it looks like the previous one of the failing case.
>>
>> The traces show that idle task is constantly running without sleep.
>
> No, they (at least the first one, I didn't look at the next one yet)
> show that CPUs 2 and 3 are spending around 99% of their time not in state
> C0.

How do you figure that out if I may ask?  It is not so obvious to me
to be honest.

> That the sample rate is ending up at ~10 Milliseconds, indicates some
> high frequency (>= 100Hz) events on those CPUs. Those events, apparently,
> take very little CPU time to complete, hence a load of about 1% on average.
>
> By the way, I can recreate the high sample rate with virtually no load
> on my system easy, but so far have been unable to get the high CPU
> frequencies observed by Jörg. I can get my system to about a target pstate of
> 20 where it should have remained at 16, but that is about it.
>
>> The driver is processing samples for idle task for every 10ms and
>> aperf/mperf are showing that we are always in turbo mode for idle task.
>
> That column pretty much always says "idle" (or swapper for my way of doing
> things). I have not found it to very useful as an indicator, and considerably
> more so since the utilization changes.
>
>>
>> Need to find out why idle task is not sleeping.
>
> I contend that is it.

Why?

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki April 1, 2016, 8:28 p.m. UTC | #9
On Fri, Apr 1, 2016 at 4:06 PM, Jörg Otte <jrg.otte@gmail.com> wrote:
> 2016-04-01 14:40 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>> > On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>>> >> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>> >> > On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
>>> >> >

[cut]

>
> Done. Attached the tracer.
> For me it looks like the previous one of the failing case.
>

Yes, it does, so no improvement.

Well, that will require further investigation, so I have created a
kernel Bugzilla entry for the tracking of this issue:

https://bugzilla.kernel.org/show_bug.cgi?id=115771

I would like to communicate through it going forward if that's not a problem.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Doug Smythies April 1, 2016, 11:36 p.m. UTC | #10
On 2016.04.01 12:54 Rafael J. Wysocki wrote:
>On Fri, Apr 1, 2016 at 8:31 PM, Doug Smythies <dsmythies@telus.net> wrote:
>> On 2106.034.01 10:45 Srinivas Pandruvada wrote:
>>> On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
>> > > > > >
>>>> Done. Attached the tracer.
>>>> For me it looks like the previous one of the failing case.
>>>
>>> The traces show that idle task is constantly running without sleep.
>>
>> No, they (at least the first one, I didn't look at the next one yet)
>> show that CPUs 2 and 3 are spending around 99% of their time not in state
>> C0.

> How do you figure that out if I may ask?  It is not so obvious to me
> to be honest.

The trace was not in the form for the post processing tools, so I had
to manually import the trace into a spreadsheet and manually add new columns
calculated from the others.

Load = mperf / tsc * 100 % = C0 time.
Duration (mS) = tsc / 2.5e9 * 1000 
Note: I do not recall seeing an exact tsc for Jörg's computer, so I used
The 2.5 GHz from the device spec from some earlier e-mail.

Example (formatting will likely not send O.K.):

		CPU#	time		core_busy	scaled	from	to	mperf		aperf		tsc		freq		load		duration (ms)
<idle>-0	[002]	465.879451:	100		96		26	26	1826656	1826710	25062693	2500073	7.288%	10.025
<idle>-0	[003]	465.879484:	99		96		26	26	305796	305781	25147993	2499877	1.216%	10.059
<idle>-0	[000]	465.885794:	100		96		26	26	975908	975951	32434672	2500110	3.009%	12.974
<idle>-0	[001]	465.886898:	100		250		10	31	327356	327364	26673840	2500061	1.227%	10.670
<idle>-0	[002]	465.889527:	100		96		26	26	205336	205365	25133396	2500353	0.817%	10.053
<idle>-0	[003]	465.889555:	99		95		26	26	62544		62341		25117916	2491885	0.249%	10.047

> That the sample rate is ending up at ~10 Milliseconds, indicates some
> high frequency (>= 100Hz) events on those CPUs. Those events, apparently,
> take very little CPU time to complete, hence a load of about 1% on average.
>
> By the way, I can recreate the high sample rate with virtually no load
> on my system easy, but so far have been unable to get the high CPU
> frequencies observed by Jörg. I can get my system to about a target pstate of
> 20 where it should have remained at 16, but that is about it.
>
>> The driver is processing samples for idle task for every 10ms and
>> aperf/mperf are showing that we are always in turbo mode for idle task.
>
> That column pretty much always says "idle" (or swapper for my way of doing
> things). I have not found it to very useful as an indicator, and considerably
> more so since the utilization changes.
>
>>
>> Need to find out why idle task is not sleeping.
>
> I contend that is it.

Why?

Unless I misunderstood, because the trace data indicates that the those CPUs
are going into some deeper C stsate than C0 for most of their time.

... Doug


--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki April 2, 2016, 12:28 a.m. UTC | #11
On Sat, Apr 2, 2016 at 1:36 AM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.04.01 12:54 Rafael J. Wysocki wrote:
>>On Fri, Apr 1, 2016 at 8:31 PM, Doug Smythies <dsmythies@telus.net> wrote:
>>> On 2106.034.01 10:45 Srinivas Pandruvada wrote:
>>>> On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
>>> > > > > >
>>>>> Done. Attached the tracer.
>>>>> For me it looks like the previous one of the failing case.
>>>>
>>>> The traces show that idle task is constantly running without sleep.
>>>
>>> No, they (at least the first one, I didn't look at the next one yet)
>>> show that CPUs 2 and 3 are spending around 99% of their time not in state
>>> C0.
>
>> How do you figure that out if I may ask?  It is not so obvious to me
>> to be honest.
>
> The trace was not in the form for the post processing tools, so I had
> to manually import the trace into a spreadsheet and manually add new columns
> calculated from the others.
>
> Load = mperf / tsc * 100 % = C0 time.
> Duration (mS) = tsc / 2.5e9 * 1000
> Note: I do not recall seeing an exact tsc for Jörg's computer, so I used
> The 2.5 GHz from the device spec from some earlier e-mail.
>
> Example (formatting will likely not send O.K.):
>
>                 CPU#    time            core_busy       scaled  from    to      mperf           aperf           tsc             freq            load            duration (ms)
> <idle>-0        [002]   465.879451:     100             96              26      26      1826656 1826710 25062693        2500073 7.288%  10.025
> <idle>-0        [003]   465.879484:     99              96              26      26      305796  305781  25147993        2499877 1.216%  10.059
> <idle>-0        [000]   465.885794:     100             96              26      26      975908  975951  32434672        2500110 3.009%  12.974
> <idle>-0        [001]   465.886898:     100             250             10      31      327356  327364  26673840        2500061 1.227%  10.670
> <idle>-0        [002]   465.889527:     100             96              26      26      205336  205365  25133396        2500353 0.817%  10.053
> <idle>-0        [003]   465.889555:     99              95              26      26      62544           62341           25117916        2491885 0.249%  10.047

OK

It could be C1 with relatively short periods spent in it.

>> That the sample rate is ending up at ~10 Milliseconds, indicates some
>> high frequency (>= 100Hz) events on those CPUs. Those events, apparently,
>> take very little CPU time to complete, hence a load of about 1% on average.
>>
>> By the way, I can recreate the high sample rate with virtually no load
>> on my system easy, but so far have been unable to get the high CPU
>> frequencies observed by Jörg. I can get my system to about a target pstate of
>> 20 where it should have remained at 16, but that is about it.
>>
>>> The driver is processing samples for idle task for every 10ms and
>>> aperf/mperf are showing that we are always in turbo mode for idle task.
>>
>> That column pretty much always says "idle" (or swapper for my way of doing
>> things). I have not found it to very useful as an indicator, and considerably
>> more so since the utilization changes.
>>
>>>
>>> Need to find out why idle task is not sleeping.
>>
>> I contend that is it.
>
> Why?
>
> Unless I misunderstood, because the trace data indicates that the those CPUs
> are going into some deeper C stsate than C0 for most of their time.

But how long do they stay in those states every time?

Average residencies need to be well below 10 ms for the trace to be
produced every 10 ms, so the question seems to be what kicks the CPUs
out of idle states so often.  On a completely idle system, that's
highly suspicious.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rafael J. Wysocki April 2, 2016, 10:50 p.m. UTC | #12
On Fri, Apr 1, 2016 at 10:28 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Apr 1, 2016 at 4:06 PM, Jörg Otte <jrg.otte@gmail.com> wrote:
>> 2016-04-01 14:40 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>>>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>> > On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>>>> >> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>> >> > On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
>>>> >> >
>
> [cut]
>
>>
>> Done. Attached the tracer.
>> For me it looks like the previous one of the failing case.
>>
>
> Yes, it does, so no improvement.
>
> Well, that will require further investigation, so I have created a
> kernel Bugzilla entry for the tracking of this issue:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=115771
>
> I would like to communicate through it going forward if that's not a problem.

Please have a look at this comment:

https://bugzilla.kernel.org/show_bug.cgi?id=115771#c6

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -910,7 +910,14 @@  static inline bool intel_pstate_sample(s
 	cpu->prev_aperf = aperf;
 	cpu->prev_mperf = mperf;
 	cpu->prev_tsc = tsc;
-	return true;
+	/*
+	 * First time this function is invoked in a given cycle, all of the
+	 * previous sample data fields are equal to zero or stale and they must
+	 * be populated with meaningful numbers for things to work, so assume
+	 * that sample.time will always be reset before setting the utilization
+	 * update hook and make the caller skip the sample then.
+	 */
+	return !!cpu->last_sample_time;
 }
 
 static inline int32_t get_avg_frequency(struct cpudata *cpu)
@@ -984,8 +991,7 @@  static inline int32_t get_target_pstate_
 	 * enough period of time to adjust our busyness.
 	 */
 	duration_ns = cpu->sample.time - cpu->last_sample_time;
-	if ((s64)duration_ns > pid_params.sample_rate_ns * 3
-	    && cpu->last_sample_time > 0) {
+	if ((s64)duration_ns > pid_params.sample_rate_ns * 3) {
 		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
 				      int_tofp(duration_ns));
 		core_busy = mul_fp(core_busy, sample_ratio);
@@ -1100,7 +1106,6 @@  static int intel_pstate_init_cpu(unsigne
 	intel_pstate_get_cpu_pstates(cpu);
 
 	intel_pstate_busy_pid_reset(cpu);
-	intel_pstate_sample(cpu, 0);
 
 	cpu->update_util.func = intel_pstate_update_util;
 
@@ -1121,9 +1126,13 @@  static unsigned int intel_pstate_get(uns
 	return get_avg_frequency(cpu);
 }
 
-static void intel_pstate_set_update_util_hook(unsigned int cpu)
+static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 {
-	cpufreq_set_update_util_data(cpu, &all_cpu_data[cpu]->update_util);
+	struct cpudata *cpu = all_cpu_data[cpu_num];
+
+	/* Prevent intel_pstate_update_util() from using stale data. */
+	cpu->sample.time = 0;
+	cpufreq_set_update_util_data(cpu_num, &cpu->update_util);
 }
 
 static void intel_pstate_clear_update_util_hook(unsigned int cpu)