| Message ID | 3623107.tlAuqH4F7s@vostro.rjw.lan (mailing list archive) |
|---|---|
| State | Accepted, archived |
| Delegated to | Rafael Wysocki |
2016-04-01 14:40 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>> > On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>> >> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>> >> > On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
>> >> >
>> >> > [cut]
>> >> >
>> >> >>
>> >> >> Yes, works for me.
>> >> >>
>> >> >> CPUID(7): No-SGX
>> >> >> CPU Avg_MHz Busy% Bzy_MHz TSC_MHz
>> >> >> -   11      0.66  1682    2494
>> >> >> 0   11      0.60  1856    2494
>> >> >> 1    6      0.34  1898    2494
>> >> >> 2   13      0.82  1628    2494
>> >> >> 3   13      0.87  1528    2494
>> >> >> CPU Avg_MHz Busy% Bzy_MHz TSC_MHz
>> >> >> -    6      0.58   963    2494
>> >> >> 0    8      0.83   957    2494
>> >> >> 1    1      0.08   984    2494
>> >> >> 2   10      1.04   975    2494
>> >> >> 3    3      0.35   934    2494
>> >> >>
>> >
>> > [cut]
>> >
>> >>
>> >> No, this patch doesn't help.
>> >
>> > Well, more work to do then.
>> >
>> > I've just noticed a bug in this patch, which is not relevant for the results,
>> > but below is a new version.
>> >
>> >> CPUID(7): No-SGX
>> >> CPU Avg_MHz Busy% Bzy_MHz TSC_MHz
>> >> -    8      0.32  2507    2495
>> >> 0   13      0.53  2505    2495
>> >> 1    3      0.11  2523    2495
>> >> 2    1      0.06  2555    2495
>> >> 3   15      0.59  2500    2495
>> >> CPU Avg_MHz Busy% Bzy_MHz TSC_MHz
>> >> -    8      0.33  2486    2495
>> >> 0   12      0.50  2482    2495
>> >> 1    5      0.22  2489    2495
>> >> 2    1      0.04  2492    2495
>> >> 3   15      0.59  2487    2495
>
> [cut]
>
>> here they are.
>>
> Thanks!
>
> First of all, the sampling mechanics works as expected in the failing case,
> which is the most important thing I wanted to know. However, there are anomalies
> in the failing case trace. The core_busy column is clearly suspicious and it
> looks like CPUs 2 and 3 never really go idle. I guess we'll need to find out
> why they don't go idle to get to the bottom of this, but it firmly falls into
> the weird stuff territory already.
>
> In the meantime, below is one more patch to test, on top of the previous one
> (that is, https://patchwork.kernel.org/patch/8714401/).
>
> Again, this is a change I'd like to make regardless, so it would be good to
> know if anything more has to be done before we go further.
>
> ---
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: [PATCH] intel_pstate: Avoid extra invocation of intel_pstate_sample()
>
> The initialization of intel_pstate for a given CPU involves populating
> the fields of its struct cpudata that represent the previous sample,
> but currently that is done in a problematic way.
>
> Namely, intel_pstate_init_cpu() makes an extra call to
> intel_pstate_sample() so it reads the current register values that
> will be used to populate the "previous sample" record during the
> next invocation of intel_pstate_sample(). However, after commit
> a4675fbc4a7a (cpufreq: intel_pstate: Replace timers with utilization
> update callbacks) that doesn't work for last_sample_time, because
> the time value is passed to intel_pstate_sample() as an argument now.
> Passing 0 to it from intel_pstate_init_cpu() is problematic, because
> that causes cpu->last_sample_time == 0 to be visible in
> get_target_pstate_use_performance() (and hence the extra
> cpu->last_sample_time > 0 check in there) and effectively allows
> the first invocation of intel_pstate_sample() from
> intel_pstate_update_util() to happen immediately after the
> initialization which may lead to a significant "turn on"
> effect in the governor algorithm.
>
> To mitigate that issue, rework the initialization to avoid the
> extra intel_pstate_sample() call from intel_pstate_init_cpu().
> Instead, make intel_pstate_sample() return false if it has been
> called with cpu->sample.time equal to zero, which will make
> intel_pstate_update_util() skip the sample in that case, and
> reset cpu->sample.time from intel_pstate_set_update_util_hook()
> to make the algorithm start properly every time the hook is set.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/cpufreq/intel_pstate.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===================================================================
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -910,7 +910,14 @@ static inline bool intel_pstate_sample(s
>  	cpu->prev_aperf = aperf;
>  	cpu->prev_mperf = mperf;
>  	cpu->prev_tsc = tsc;
> -	return true;
> +	/*
> +	 * First time this function is invoked in a given cycle, all of the
> +	 * previous sample data fields are equal to zero or stale and they must
> +	 * be populated with meaningful numbers for things to work, so assume
> +	 * that sample.time will always be reset before setting the utilization
> +	 * update hook and make the caller skip the sample then.
> +	 */
> +	return !!cpu->last_sample_time;
>  }
>
>  static inline int32_t get_avg_frequency(struct cpudata *cpu)
> @@ -984,8 +991,7 @@ static inline int32_t get_target_pstate_
>  	 * enough period of time to adjust our busyness.
>  	 */
>  	duration_ns = cpu->sample.time - cpu->last_sample_time;
> -	if ((s64)duration_ns > pid_params.sample_rate_ns * 3
> -	    && cpu->last_sample_time > 0) {
> +	if ((s64)duration_ns > pid_params.sample_rate_ns * 3) {
>  		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
>  				      int_tofp(duration_ns));
>  		core_busy = mul_fp(core_busy, sample_ratio);
> @@ -1100,7 +1106,6 @@ static int intel_pstate_init_cpu(unsigne
>  	intel_pstate_get_cpu_pstates(cpu);
>
>  	intel_pstate_busy_pid_reset(cpu);
> -	intel_pstate_sample(cpu, 0);
>
>  	cpu->update_util.func = intel_pstate_update_util;
>
> @@ -1121,9 +1126,13 @@ static unsigned int intel_pstate_get(uns
>  	return get_avg_frequency(cpu);
>  }
>
> -static void intel_pstate_set_update_util_hook(unsigned int cpu)
> +static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
>  {
> -	cpufreq_set_update_util_data(cpu, &all_cpu_data[cpu]->update_util);
> +	struct cpudata *cpu = all_cpu_data[cpu_num];
> +
> +	/* Prevent intel_pstate_update_util() from using stale data. */
> +	cpu->sample.time = 0;
> +	cpufreq_set_update_util_data(cpu_num, &cpu->update_util);
>  }
>
>  static void intel_pstate_clear_update_util_hook(unsigned int cpu)
>

Done. Attached the tracer.
For me it looks like the previous one of the failing case.

Thanks, Jörg
On 2016.04.01 05:40 Rafael J. Wysocki wrote:
> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>> On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>>>> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>>> On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
>>
>> here they are.
>>
> First of all, the sampling mechanics works as expected
> in the failing case, which is the most important thing
> I wanted to know.

Yes, but that might be part of the problem, as for some CPUs
there is never a long duration, and thus the long duration
check never kicks in driving the target pstate down.

> The core_busy column is clearly suspicious and it
> looks like CPUs 2 and 3 never really go idle.

This has been observed several times before [1].
Due to beat frequencies between desktop type frame rates
and such, the worst manifestation of the issue seems to be
for 300 Hz kernels, but Ubuntu uses 250 Hz.

Oh look, Jörg is using 300 Hz!!

$ grep CONFIG_HZ .config_jorg
# CONFIG_HZ_PERIODIC is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300

> I guess we'll need to find out
> why they don't go idle to get to the bottom of this, but it firmly falls into
> the weird stuff territory already.

I'm compiling a 300 Hz kernel now, also with "# CONFIG_NO_HZ is not set",
but I have never been able to re-create these types of findings before.

I have also tried several other things in an attempt to re-create Jörg's
case, so far without success.

References:
[1] https://bugzilla.kernel.org/show_bug.cgi?id=93521
In particular:
https://bugzilla.kernel.org/show_bug.cgi?id=93521#c35
https://bugzilla.kernel.org/show_bug.cgi?id=93521#c42
https://bugzilla.kernel.org/show_bug.cgi?id=93521#c77

...
Doug

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
2016-04-01 17:20 GMT+02:00 Doug Smythies <dsmythies@telus.net>:
> On 2016.04.01 05:40 Rafael J. Wysocki wrote:
>> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>> On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>>>>> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>>>> On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
>
>>> here they are.
>>>
>> First of all, the sampling mechanics works as expected
>> in the failing case, which is the most important thing
>> I wanted to know.
>
> Yes, but that might be part of the problem, as for some CPUs
> there is never a long duration, and thus the long duration
> check never kicks in driving the target pstate down.
>
>> The core_busy column is clearly suspicious and it
>> looks like CPUs 2 and 3 never really go idle.
>
> This has been observed several times before [1].
> Due to beat frequencies between desktop type frame rates
> and such, the worst manifestation of the issue seems to be
> for 300 Hz kernels, but Ubuntu uses 250 Hz.
>
> Oh look, Jörg is using 300 Hz!!
>
> $ grep CONFIG_HZ .config_jorg
> # CONFIG_HZ_PERIODIC is not set
> # CONFIG_HZ_100 is not set
> # CONFIG_HZ_250 is not set
> CONFIG_HZ_300=y
> # CONFIG_HZ_1000 is not set
> CONFIG_HZ=300

I use 300 Hz because of:
"250 Hz is a good compromise choice allowing server performance
while also showing good interactive responsiveness even
on SMP and NUMA systems. If you are going to be using NTSC video
or multimedia, selected 300Hz instead." (from the Kbuild help text)

-> I often use multimedia, so according to this text 300 Hz is the
better choice.

>> I guess we'll need to find out
>> why they don't go idle to get to the bottom of this, but it firmly falls into
>> the weird stuff territory already.

> I'm compiling a 300 Hz kernel now, also with "# CONFIG_NO_HZ is not set",

Again from the Kbuild help text:
"CONFIG_NO_HZ:
This is the old config entry that enables dynticks idle.
We keep it around for a little while to enforce backward
compatibility with older config files."

-> NO_HZ is outdated.

> but I have never been able to re-create these types of findings before.
>
> I have also tried several other things in an attempt to re-create Jörg's
> case, so far without success.
>
> References:
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=93521
> In particular:
> https://bugzilla.kernel.org/show_bug.cgi?id=93521#c35
> https://bugzilla.kernel.org/show_bug.cgi?id=93521#c42
> https://bugzilla.kernel.org/show_bug.cgi?id=93521#c77
>
> ... Doug

Nevertheless, I'll try setting 250 Hz + NO_HZ.

Thanks, Jörg
2016-04-01 18:46 GMT+02:00 Jörg Otte <jrg.otte@gmail.com>:
> 2016-04-01 17:20 GMT+02:00 Doug Smythies <dsmythies@telus.net>:
>> On 2016.04.01 05:40 Rafael J. Wysocki wrote:

[cut]

> Nevertheless, I'll try setting 250 Hz + NO_HZ.

For me no improvement. Neither 300 -> 250 Hz nor NO_HZ_IDLE + NO_HZ helped.

Thanks, Jörg
On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
> 2016-04-01 14:40 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
> > > 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > > > On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
> > > > > 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
> > > > > > On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:

[Cut]

> Done. Attached the tracer.
> For me it looks like the previous one of the failing case.

The traces show that the idle task is constantly running without sleeping.
The driver is processing samples for the idle task every 10 ms, and
aperf/mperf are showing that we are always in turbo mode for the idle task.

Need to find out why the idle task is not sleeping.

Thanks,
Srinivas

> Thanks, Jörg
On 2016.04.01 10:45 Srinivas Pandruvada wrote:
> On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
>
>> Done. Attached the tracer.
>> For me it looks like the previous one of the failing case.
>
> The traces show that idle task is constantly running without sleep.

No, they (at least the first one, I didn't look at the next one yet)
show that CPUs 2 and 3 are spending around 99% of their time not in state
C0. That the sample rate is ending up at ~10 milliseconds indicates some
high frequency (>= 100 Hz) events on those CPUs. Those events, apparently,
take very little CPU time to complete, hence a load of about 1% on average.

By the way, I can recreate the high sample rate with virtually no load
on my system easily, but so far have been unable to get the high CPU
frequencies observed by Jörg. I can get my system to about a target pstate of
20 where it should have remained at 16, but that is about it.

> The driver is processing samples for idle task for every 10ms and
> aperf/mperf are showing that we are always in turbo mode for idle task.

That column pretty much always says "idle" (or swapper for my way of doing
things). I have not found it to be very useful as an indicator, and considerably
more so since the utilization changes.

> Need to find out why idle task is not sleeping.

I contend that it is.

I don't have enough reverted data, but so far this seems very much like
that bug report I referenced earlier.

... Doug
On Fri, 2016-04-01 at 11:31 -0700, Doug Smythies wrote:
> On 2016.04.01 10:45 Srinivas Pandruvada wrote:
> > On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
> > > Done. Attached the tracer.
> > > For me it looks like the previous one of the failing case.
> > The traces show that idle task is constantly running without sleep.
> No,

I mean at least on one CPU. I am not saying they are looping, but the
wakeup time is less than 10 ms, so the driver will not scale the load
for idle time.

> they (at least the first one, I didn't look at the next one yet)
> show that CPUs 2 and 3 are spending around 99% of their time not in state
> C0. That the sample rate is ending up at ~10 Milliseconds, indicates some
> high frequency (>= 100Hz) events on those CPUs. Those events, apparently,
> take very little CPU time to complete, hence a load of about 1% on average.
>
> By the way, I can recreate the high sample rate with virtually no load
> on my system easy, but so far have been unable to get the high CPU
> frequencies observed by Jörg. I can get my system to about a target pstate of
> 20 where it should have remained at 16, but that is about it.
>
> > The driver is processing samples for idle task for every 10ms and
> > aperf/mperf are showing that we are always in turbo mode for idle task.
> That column pretty much always says "idle" (or swapper for my way of doing
> things). I have not found it to be very useful as an indicator, and
> considerably more so since the utilization changes.
>
> > Need to find out why idle task is not sleeping.
> I contend that it is.
> I don't have enough reverted data, but so far this seems very much like
> that bug report I referenced earlier.
>
> ... Doug
On Fri, Apr 1, 2016 at 8:31 PM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.04.01 10:45 Srinivas Pandruvada wrote:
>> On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
>>
>>> Done. Attached the tracer.
>>> For me it looks like the previous one of the failing case.
>>
>> The traces show that idle task is constantly running without sleep.
>
> No, they (at least the first one, I didn't look at the next one yet)
> show that CPUs 2 and 3 are spending around 99% of their time not in state
> C0.

How do you figure that out, if I may ask? It is not so obvious to me,
to be honest.

> That the sample rate is ending up at ~10 Milliseconds, indicates some
> high frequency (>= 100Hz) events on those CPUs. Those events, apparently,
> take very little CPU time to complete, hence a load of about 1% on average.
>
> By the way, I can recreate the high sample rate with virtually no load
> on my system easy, but so far have been unable to get the high CPU
> frequencies observed by Jörg. I can get my system to about a target pstate of
> 20 where it should have remained at 16, but that is about it.
>
>> The driver is processing samples for idle task for every 10ms and
>> aperf/mperf are showing that we are always in turbo mode for idle task.
>
> That column pretty much always says "idle" (or swapper for my way of doing
> things). I have not found it to be very useful as an indicator, and considerably
> more so since the utilization changes.
>
>> Need to find out why idle task is not sleeping.
>
> I contend that it is.

Why?

Thanks,
Rafael
On Fri, Apr 1, 2016 at 4:06 PM, Jörg Otte <jrg.otte@gmail.com> wrote:
> 2016-04-01 14:40 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>> > On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>>> >> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>> >> > On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
>>> >> >
> [cut]
>
> Done. Attached the tracer.
> For me it looks like the previous one of the failing case.

Yes, it does, so no improvement.

Well, that will require further investigation, so I have created a
kernel Bugzilla entry for the tracking of this issue:

https://bugzilla.kernel.org/show_bug.cgi?id=115771

I would like to communicate through it going forward if that's not a problem.

Thanks,
Rafael
On 2016.04.01 12:54 Rafael J. Wysocki wrote:
> On Fri, Apr 1, 2016 at 8:31 PM, Doug Smythies <dsmythies@telus.net> wrote:
>> On 2016.04.01 10:45 Srinivas Pandruvada wrote:
>>> On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
>>>> Done. Attached the tracer.
>>>> For me it looks like the previous one of the failing case.
>>>
>>> The traces show that idle task is constantly running without sleep.
>>
>> No, they (at least the first one, I didn't look at the next one yet)
>> show that CPUs 2 and 3 are spending around 99% of their time not in state
>> C0.
>
> How do you figure that out if I may ask? It is not so obvious to me
> to be honest.

The trace was not in the form for the post processing tools, so I had
to manually import the trace into a spreadsheet and manually add new
columns calculated from the others.

Load = mperf / tsc * 100 % = C0 time.
Duration (ms) = tsc / 2.5e9 * 1000
Note: I do not recall seeing an exact tsc for Jörg's computer, so I used
the 2.5 GHz from the device spec from some earlier e-mail.

Example (formatting will likely not send O.K.):

CPU#     time          core_busy scaled from to   mperf   aperf      tsc    freq   load  duration (ms)
<idle>-0 [002] 465.879451:  100     96   26 26 1826656 1826710 25062693 2500073 7.288% 10.025
<idle>-0 [003] 465.879484:   99     96   26 26  305796  305781 25147993 2499877 1.216% 10.059
<idle>-0 [000] 465.885794:  100     96   26 26  975908  975951 32434672 2500110 3.009% 12.974
<idle>-0 [001] 465.886898:  100    250   10 31  327356  327364 26673840 2500061 1.227% 10.670
<idle>-0 [002] 465.889527:  100     96   26 26  205336  205365 25133396 2500353 0.817% 10.053
<idle>-0 [003] 465.889555:   99     95   26 26   62544   62341 25117916 2491885 0.249% 10.047

>> That the sample rate is ending up at ~10 Milliseconds, indicates some
>> high frequency (>= 100Hz) events on those CPUs. Those events, apparently,
>> take very little CPU time to complete, hence a load of about 1% on average.
>>
>> By the way, I can recreate the high sample rate with virtually no load
>> on my system easy, but so far have been unable to get the high CPU
>> frequencies observed by Jörg. I can get my system to about a target pstate of
>> 20 where it should have remained at 16, but that is about it.
>>
>>> The driver is processing samples for idle task for every 10ms and
>>> aperf/mperf are showing that we are always in turbo mode for idle task.
>>
>> That column pretty much always says "idle" (or swapper for my way of doing
>> things). I have not found it to be very useful as an indicator, and
>> considerably more so since the utilization changes.
>>
>>> Need to find out why idle task is not sleeping.
>>
>> I contend that it is.
>
> Why?

Unless I misunderstood, because the trace data indicates that those CPUs
are going into some deeper C state than C0 for most of their time.

... Doug
On Sat, Apr 2, 2016 at 1:36 AM, Doug Smythies <dsmythies@telus.net> wrote:
> On 2016.04.01 12:54 Rafael J. Wysocki wrote:
>> On Fri, Apr 1, 2016 at 8:31 PM, Doug Smythies <dsmythies@telus.net> wrote:
>>> On 2016.04.01 10:45 Srinivas Pandruvada wrote:
>>>> On Fri, 2016-04-01 at 16:06 +0200, Jörg Otte wrote:
>>>>> Done. Attached the tracer.
>>>>> For me it looks like the previous one of the failing case.
>>>>
>>>> The traces show that idle task is constantly running without sleep.
>>>
>>> No, they (at least the first one, I didn't look at the next one yet)
>>> show that CPUs 2 and 3 are spending around 99% of their time not in state
>>> C0.
>>
>> How do you figure that out if I may ask? It is not so obvious to me
>> to be honest.
>
> The trace was not in the form for the post processing tools, so I had
> to manually import the trace into a spreadsheet and manually add new
> columns calculated from the others.
>
> Load = mperf / tsc * 100 % = C0 time.
> Duration (ms) = tsc / 2.5e9 * 1000
> Note: I do not recall seeing an exact tsc for Jörg's computer, so I used
> the 2.5 GHz from the device spec from some earlier e-mail.
>
> Example (formatting will likely not send O.K.):
>
> CPU#     time          core_busy scaled from to   mperf   aperf      tsc    freq   load  duration (ms)
> <idle>-0 [002] 465.879451:  100     96   26 26 1826656 1826710 25062693 2500073 7.288% 10.025
> <idle>-0 [003] 465.879484:   99     96   26 26  305796  305781 25147993 2499877 1.216% 10.059
> <idle>-0 [000] 465.885794:  100     96   26 26  975908  975951 32434672 2500110 3.009% 12.974
> <idle>-0 [001] 465.886898:  100    250   10 31  327356  327364 26673840 2500061 1.227% 10.670
> <idle>-0 [002] 465.889527:  100     96   26 26  205336  205365 25133396 2500353 0.817% 10.053
> <idle>-0 [003] 465.889555:   99     95   26 26   62544   62341 25117916 2491885 0.249% 10.047

OK. It could be C1 with relatively short periods spent in it.

>> That the sample rate is ending up at ~10 Milliseconds, indicates some
>> high frequency (>= 100Hz) events on those CPUs. Those events, apparently,
>> take very little CPU time to complete, hence a load of about 1% on average.
>>
>> By the way, I can recreate the high sample rate with virtually no load
>> on my system easy, but so far have been unable to get the high CPU
>> frequencies observed by Jörg. I can get my system to about a target pstate of
>> 20 where it should have remained at 16, but that is about it.
>>
>>> The driver is processing samples for idle task for every 10ms and
>>> aperf/mperf are showing that we are always in turbo mode for idle task.
>>
>> That column pretty much always says "idle" (or swapper for my way of doing
>> things). I have not found it to be very useful as an indicator, and
>> considerably more so since the utilization changes.
>>
>>> Need to find out why idle task is not sleeping.
>>
>> I contend that it is.
>
> Why?
>
> Unless I misunderstood, because the trace data indicates that those CPUs
> are going into some deeper C state than C0 for most of their time.

But how long do they stay in those states every time? Average residencies
need to be well below 10 ms for the trace to be produced every 10 ms, so
the question seems to be what kicks the CPUs out of idle states so often.
On a completely idle system, that's highly suspicious.
On Fri, Apr 1, 2016 at 10:28 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Fri, Apr 1, 2016 at 4:06 PM, Jörg Otte <jrg.otte@gmail.com> wrote:
>> 2016-04-01 14:40 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>> On Friday, April 01, 2016 11:20:42 AM Jörg Otte wrote:
>>>> 2016-03-31 17:43 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>> > On Thursday, March 31, 2016 05:25:18 PM Jörg Otte wrote:
>>>> >> 2016-03-31 13:42 GMT+02:00 Rafael J. Wysocki <rjw@rjwysocki.net>:
>>>> >> > On Thursday, March 31, 2016 11:05:56 AM Jörg Otte wrote:
>
> [cut]
>
>> Done. Attached the tracer.
>> For me it looks like the previous one of the failing case.
>
> Yes, it does, so no improvement.
>
> Well, that will require further investigation, so I have created a
> kernel Bugzilla entry for the tracking of this issue:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=115771
>
> I would like to communicate through it going forward if that's not a problem.

Please have a look at this comment:

https://bugzilla.kernel.org/show_bug.cgi?id=115771#c6

Thanks,
Rafael
Index: linux-pm/drivers/cpufreq/intel_pstate.c
===================================================================
--- linux-pm.orig/drivers/cpufreq/intel_pstate.c
+++ linux-pm/drivers/cpufreq/intel_pstate.c
@@ -910,7 +910,14 @@ static inline bool intel_pstate_sample(s
 	cpu->prev_aperf = aperf;
 	cpu->prev_mperf = mperf;
 	cpu->prev_tsc = tsc;
-	return true;
+	/*
+	 * First time this function is invoked in a given cycle, all of the
+	 * previous sample data fields are equal to zero or stale and they must
+	 * be populated with meaningful numbers for things to work, so assume
+	 * that sample.time will always be reset before setting the utilization
+	 * update hook and make the caller skip the sample then.
+	 */
+	return !!cpu->last_sample_time;
 }
 
 static inline int32_t get_avg_frequency(struct cpudata *cpu)
@@ -984,8 +991,7 @@ static inline int32_t get_target_pstate_
 	 * enough period of time to adjust our busyness.
 	 */
 	duration_ns = cpu->sample.time - cpu->last_sample_time;
-	if ((s64)duration_ns > pid_params.sample_rate_ns * 3
-	    && cpu->last_sample_time > 0) {
+	if ((s64)duration_ns > pid_params.sample_rate_ns * 3) {
 		sample_ratio = div_fp(int_tofp(pid_params.sample_rate_ns),
 				      int_tofp(duration_ns));
 		core_busy = mul_fp(core_busy, sample_ratio);
@@ -1100,7 +1106,6 @@ static int intel_pstate_init_cpu(unsigne
 	intel_pstate_get_cpu_pstates(cpu);
 
 	intel_pstate_busy_pid_reset(cpu);
-	intel_pstate_sample(cpu, 0);
 
 	cpu->update_util.func = intel_pstate_update_util;
 
@@ -1121,9 +1126,13 @@ static unsigned int intel_pstate_get(uns
 	return get_avg_frequency(cpu);
 }
 
-static void intel_pstate_set_update_util_hook(unsigned int cpu)
+static void intel_pstate_set_update_util_hook(unsigned int cpu_num)
 {
-	cpufreq_set_update_util_data(cpu, &all_cpu_data[cpu]->update_util);
+	struct cpudata *cpu = all_cpu_data[cpu_num];
+
+	/* Prevent intel_pstate_update_util() from using stale data. */
+	cpu->sample.time = 0;
+	cpufreq_set_update_util_data(cpu_num, &cpu->update_util);
 }
 
 static void intel_pstate_clear_update_util_hook(unsigned int cpu)