Message ID | 20191113124654.18122-2-ggherdovich@suse.cz (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
Series | Add support for frequency invariance for (some) x86 | expand |
Hi all, The address list here is likely incorrect, and this e-mail is really about a kernel 5.4 bisected regression. It had been since mid September, and kernel 5.3-rc8 since I had tried this, so I wanted to try it again. Call it due diligence. I focused on my own version of the "gitsource" test. Kernel 5.4-rc8 (as a baseline reference). My results were extremely surprising. As it turns out, at least on my test computer, both the acpi-cpufreq and intel_cpufreq CPU frequency scaling drivers using the schedutil governor are broken. For the tests that I ran, there is negligible difference between them and the performance governor. So, one might argue that they are not broken, but rather working incredibly well, which if true then this patch is no longer needed. I bisected the kernel and got: first bad commit: [04cbfba6208592999d7bfe6609ec01dc3fde73f5] Merge tag 'dmaengine-5.4-rc1' of git://git.infradead.org/users/vkoul/slave-dma Which did not make any sense at all. I don't even know how this is being pulled into my kernel compile. O.K., I often (usually) make a mistake during bisection, so I did it again, and got the same result. Relevant excerpt from the commit: diff --cc drivers/dma/Kconfig index 413efef,03fa0c5..7c511e3 --- a/drivers/dma/Kconfig +++ b/drivers/dma/Kconfig @@@ -294,8 -294,8 +294,8 @@@ config INTEL_IOATDM If unsure, say N. config INTEL_IOP_ADMA - tristate "Intel IOP ADMA support" - depends on ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX + tristate "Intel IOP32x ADMA support" - depends on ARCH_IOP32X ++ depends on ARCH_IOP32X || COMPILE_TEST select DMA_ENGINE select ASYNC_TX_ENABLE_CHANNEL_SWITCH help If I revert the above, manually, then everything behaves as expected (minimally tested only, so far). Are others seeing the schedutil governors not working as expected with any of kernels 5.4-rc1 - 5.4-rc8? I do have a pretty graph of my method of doing the "gitsource" test, but am not ready to post it yet. Here is some gitsource test data, 6 runs of "make test", the first run is discarded: "gg 6" means this 6 patch set. Kernel 5.4-rc8 + revert, intel_cpufreq/schedutil: 3899 seconds Kernel 5.4-rc8 + gg 6 + revert, intel_cpufreq/schedutil: 2740.7 seconds Ratio: 0.70 (as expected) Kernel 5.4-rc8, intel_cpufreq/schedutil: 2334.7 seconds (faster than expected) Kernel 5.4-rc8 + gg 6 patch set, intel_cpufreq/schedutil: 2275.0 seconds (faster than expected) Ratio: 0.97 (not as expected) Kernel 5.4-rc8, intel_cpufreq/performance: 2215.3 seconds Kernel 5.4-rc8, intel_cpufreq/ondemand: 3286.3 seconds Re-stated from previous e-mail: Kernel 5.3-rc8, intel_cpufreq/schedutil: ratio: 0.69 (I don't have the original times) ... Doug
On 2019.11.23 23:50 Doug Smythies wrote: > Hi all, > > The address list here is likely incorrect, > and this e-mail is really about a kernel 5.4 > bisected regression. > > It had been since mid September, and kernel 5.3-rc8 since > I had tried this, so I wanted to try it again. Call it due diligence. > I focused on my own version of the "gitsource" test. > > Kernel 5.4-rc8 (as a baseline reference). > > My results were extremely surprising. > > As it turns out, at least on my test computer, both the > acpi-cpufreq and intel_cpufreq CPU frequency scaling drivers > using the schedutil governor are broken. For the tests that > I ran, there is negligible difference between them and the > performance governor. So, one might argue that they are not > broken, but rather working incredibly well, which if true > then this patch is no longer needed. Should be able to gain better insight here with the intel_pstate_tracer.py utility, watching for differences in rates of rotation between CPUs. Too late tonight. > > I bisected the kernel and got: > > first bad commit: [04cbfba6208592999d7bfe6609ec01dc3fde73f5] > Merge tag 'dmaengine-5.4-rc1' of git://git.infradead.org/users/vkoul/slave-dma > > Which did not make any sense at all. I don't even know how > this is being pulled into my kernel compile. > O.K., I often (usually) make a mistake > during bisection, so I did it again, and got the same result. > > Relevant excerpt from the commit: > > diff --cc drivers/dma/Kconfig > index 413efef,03fa0c5..7c511e3 > --- a/drivers/dma/Kconfig > +++ b/drivers/dma/Kconfig > @@@ -294,8 -294,8 +294,8 @@@ config INTEL_IOATDM > If unsure, say N. > > config INTEL_IOP_ADMA > - tristate "Intel IOP ADMA support" > - depends on ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX > + tristate "Intel IOP32x ADMA support" > - depends on ARCH_IOP32X > ++ depends on ARCH_IOP32X || COMPILE_TEST > select DMA_ENGINE > select ASYNC_TX_ENABLE_CHANNEL_SWITCH > help > > If I revert the above, manually, then everything behaves > as expected (minimally tested only, so far). > > Are others seeing the schedutil governors not working as > expected with any of kernels 5.4-rc1 - 5.4-rc8? > > I do have a pretty graph of my method of doing the > "gitsource" test, but am not ready to post it yet. Graphs and write up (mostly the same as herein) are now here: http://www.smythies.com/~doug/linux/single-threaded/k54regression/index.html The graphs are rather crowded, sorry. > Here is some gitsource test data, 6 runs of "make test", > the first run is discarded: > > "gg 6" means this 6 patch set. > > Kernel 5.4-rc8 + revert, intel_cpufreq/schedutil: 3899 seconds > Kernel 5.4-rc8 + gg 6 + revert, intel_cpufreq/schedutil: 2740.7 seconds > Ratio: 0.70 (as expected) Kernel 5.4-rc8 + gg 6 + revert, forced CPU affinity performance: 2106.5 seconds > Kernel 5.4-rc8, intel_cpufreq/schedutil: 2334.7 seconds (faster than expected) > Kernel 5.4-rc8 + gg 6 patch set, intel_cpufreq/schedutil: 2275.0 seconds (faster than expected) > Ratio: 0.97 (not as expected) > Kernel 5.4-rc8, intel_cpufreq/performance: 2215.3 seconds > Kernel 5.4-rc8, intel_cpufreq/ondemand: 3286.3 seconds > Re-stated from previous e-mail: > Kernel 5.3-rc8, intel_cpufreq/schedutil: ratio: 0.69 (I don't have the original times) ... Doug
On Sat, Nov 23, 2019 at 11:49:57PM -0800, Doug Smythies wrote: > Hi all, > > The address list here is likely incorrect, > and this e-mail is really about a kernel 5.4 > bisected regression. > > It had been since mid September, and kernel 5.3-rc8 since > I had tried this, so I wanted to try it again. Call it due diligence. > I focused on my own version of the "gitsource" test. > > Kernel 5.4-rc8 (as a baseline reference). > > My results were extremely surprising. > > As it turns out, at least on my test computer, both the > acpi-cpufreq and intel_cpufreq CPU frequency scaling drivers > using the schedutil governor are broken. For the tests that > I ran, there is negligible difference between them and the > performance governor. So, one might argue that they are not > broken, but rather working incredibly well, which if true > then this patch is no longer needed. > I don't think that is necessarily fair. It still makes sense that the "size" of a task is independent of the frequency the CPU is running at for load balancing decisions and should be merged on that basis alone. If the scheduler made perfect decisions on task location then there would be negligible difference between ondemand (or powersave if intel_pstate) and performance governor too. cpufreq tends to be more noticable under the current implementation as tasks are a lot more mobile than they need to be for basic workloads.
On Sat, 2019-11-23 at 23:49 -0800, Doug Smythies wrote: > ... > Kernel 5.4-rc8 + revert, intel_cpufreq/schedutil: 3899 seconds > Kernel 5.4-rc8 + gg 6 + revert, intel_cpufreq/schedutil: 2740.7 seconds > Ratio: 0.70 (as expected) > Kernel 5.4-rc8, intel_cpufreq/schedutil: 2334.7 seconds (faster than expected) > Kernel 5.4-rc8 + gg 6 patch set, intel_cpufreq/schedutil: 2275.0 seconds (faster than expected) > Ratio: 0.97 (not as expected) > Kernel 5.4-rc8, intel_cpufreq/performance: 2215.3 seconds > Kernel 5.4-rc8, intel_cpufreq/ondemand: 3286.3 seconds > Re-stated from previous e-mail: > Kernel 5.3-rc8, intel_cpufreq/schedutil: ratio: 0.69 (I don't have the original times) Hello Doug, schedutil in 5.4 going a lot faster than in 5.3 would be a surprise. I'm running that same test too to check if I can see it as well. Besides, as it's already been said this patchset adds frequency scale-invariance to scheduler metrics such as load and utilization and that's useful also in areas other than frequency scaling (most notably the scheduler load balancer). Thanks, Giovanni
On 2019.11.25 08:06 Giovanni Gherdovich wrote: > On Sat, 2019-11-23 at 23:49 -0800, Doug Smythies wrote: >> ... >> Kernel 5.4-rc8 + revert, intel_cpufreq/schedutil: 3899 seconds >> Kernel 5.4-rc8 + gg 6 + revert, intel_cpufreq/schedutil: 2740.7 seconds >> Ratio: 0.70 (as expected) >> Kernel 5.4-rc8, intel_cpufreq/schedutil: 2334.7 seconds (faster than expected) >> Kernel 5.4-rc8 + gg 6 patch set, intel_cpufreq/schedutil: 2275.0 seconds (faster than expected) >> Ratio: 0.97 (not as expected) >> Kernel 5.4-rc8, intel_cpufreq/performance: 2215.3 seconds >> Kernel 5.4-rc8, intel_cpufreq/ondemand: 3286.3 seconds >> Re-stated from previous e-mail: >> Kernel 5.3-rc8, intel_cpufreq/schedutil: ratio: 0.69 (I don't have the original times) > > Hello Doug, > > schedutil in 5.4 going a lot faster than in 5.3 would be a surprise. I'm > running that same test too to check if I can see it as well. Great, thanks. But see below. > > Besides, as it's already been said this patchset adds frequency > scale-invariance to scheduler metrics such as load and utilization and that's > useful also in areas other than frequency scaling (most notably the scheduler > load balancer). The issue with the schedutil governor not working properly in the 5.4 RC series appears to be hardware dependant. My test computer is Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz., Sandy Bridge. On a temporary basis, I acquired a computer with an Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, Haswell, and schedutil governor behaviour with the exact same kernels is fine: That "gitsource" test, "make test" 6 times, first run thrown out: Kernel 5.4 intel_cpufreq/schedutil: 3411.8 seconds Kernel 5.4 + gg 6 intel_cpufreq/schedutil: 1696.7 seconds Ratio: 0.49 Recall you got a ratio of 0.49 with 5th generation, Broadwell. ... Doug
On Mon, 2019-11-25 at 21:59 -0800, Doug Smythies wrote: > [...] > The issue with the schedutil governor not working properly in the 5.4 RC series > appears to be hardware dependant. > > My test computer is Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz., Sandy Bridge. > On a temporary basis, I acquired a computer with an > Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, Haswell, > and schedutil governor behaviour with the exact same kernels is fine: > > That "gitsource" test, "make test" 6 times, first run thrown out: > > Kernel 5.4 intel_cpufreq/schedutil: 3411.8 seconds > Kernel 5.4 + gg 6 intel_cpufreq/schedutil: 1696.7 seconds > Ratio: 0.49 > Recall you got a ratio of 0.49 with 5th generation, Broadwell. It's good to hear that we're getting the same performance numbers for this patchset on all hardware that is not a Sandy Bridge. Thanks for double checking, independent verification is always valuable. Now, regarding the 5.4 regression for schedutil you see on Sandy Bridge: can we move this to the kernel bugzilla? Would you care to open a bug there and CC me to it? If it's reproducible we should assess it and see what can be done. I've tried gitsource on 5.3 versus 5.4, using intel_cpufreq + schedutil; I don't see the drop you're observing, but I don't have a Sandy Bridge readily available. This is what I see: Arithmetic mean of elapsed time for gitsource over 5 iterations (seconds): microarch v5.3 (baseline) v5.4 ------------------------------------------------------------------ Haswell 1337.84 +- 0.11% 1336.35 +- 0.12% ( 0.11%) Broadwell 1335.42 +- 0.08% 1352.54 +- 0.03% ( -1.28%) Skylake 887.03 +- 1.02% 870.90 +- 1.19% ( 1.82%) I'm looking around for a Sandy Bridge but I can make no promises at the moment. Thanks, Giovanni
On 2019.11.26 07:20 Giovanni Gherdovich wrote: > On Mon, 2019-11-25 at 21:59 -0800, Doug Smythies wrote: >> [...] >> The issue with the schedutil governor not working properly in the 5.4 RC series >> appears to be hardware dependant. >> >> My test computer is Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz., Sandy Bridge. >> On a temporary basis, I acquired a computer with an >> Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, Haswell, >> and schedutil governor behaviour with the exact same kernels is fine: >> >> That "gitsource" test, "make test" 6 times, first run thrown out: >> >> Kernel 5.4 intel_cpufreq/schedutil: 3411.8 seconds >> Kernel 5.4 + gg 6 intel_cpufreq/schedutil: 1696.7 seconds >> Ratio: 0.49 >> Recall you got a ratio of 0.49 with 5th generation, Broadwell. > > It's good to hear that we're getting the same performance numbers for this > patchset on all hardware that is not a Sandy Bridge. Thanks for double > checking, independent verification is always valuable. > > Now, regarding the 5.4 regression for schedutil you see on Sandy Bridge: can > we move this to the kernel bugzilla? Would you care to open a bug there and CC > me to it? O.K., I'll need another day or two to isolate further, then I'll open a bug. I now understand considerably more, and why my bisection ended up at a strange spot. > If it's reproducible we should assess it and see what can be done. On my Sandy Bridge system if the kernel configuration contains: CONFIG_UCLAMP_TASK_GROUP=y Then the intel_cpufreq/schedutil will respond much like the performance governor. If the kernel configuration contains: # CONFIG_UCLAMP_TASK_GROUP is not set Then the intel_cpufreq/schedutil will respond much like it used to. On the Haswell computer, it doesn't seem to matter, and your tests seem to confirm this. Note: I steal my kernel configuration from the Ubuntu mainline builds, and they changed this parameter during the 5.4-rc series. ... Doug
Summary: There never was an issue here. Sorry for the noise of this thread, and the resulting waste of time. On 2019.11.26 23:33 Doug Smythies wrote: > On 2019.11.26 07:20 Giovanni Gherdovich wrote: >> On Mon, 2019-11-25 at 21:59 -0800, Doug Smythies wrote: >>> [...] >>> The issue with the schedutil governor not working properly in the 5.4 RC series >>> appears to be hardware dependant. No it 's not. Issues with my Sandy Bridge, i7-2600K, test computer and kernel 5.4 seem to be because it is running an older Ubuntu server version, apparently somewhat dependant on cgroup V1 and their cgmanager package. I am unable to remove the package to test further because I do use VMs that seem to depend on it. In the kernel configuration when CONFIG_UCLAMP_TASK_GROUP=y the computer behaves as though the new parameter "cpu.uclamp.min" is set to max rather than 0, but I can not prove it. >>> My test computer is Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz., Sandy Bridge. >>> On a temporary basis, I acquired a computer with an >>> Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz, Haswell, >>> and schedutil governor behaviour with the exact same kernels is fine: I failed to mention that was brand new Ubuntu server 20.04 development version installation, which turns out to have been an important omission. >>> >>> That "gitsource" test, "make test" 6 times, first run thrown out: >>> >>> Kernel 5.4 intel_cpufreq/schedutil: 3411.8 seconds >>> Kernel 5.4 + gg 6 intel_cpufreq/schedutil: 1696.7 seconds >>> Ratio: 0.49 >>> Recall you got a ratio of 0.49 with 5th generation, Broadwell. I took the disk from the Haswell computer and put it in the Sandy Bridge computer, and then everything was fine: Kernel 5.4 intel_cpufreq/schedutil: 3636 seconds Kernel 5.4 + gg 6 intel_cpufreq/schedutil: 2427.6 seconds Ratio: 0.67 which is consistent with previous Sandy Bridge results ... Those above kernels included: CONFIG_UCLAMP_TASK_GROUP=y Which cause my older Ubuntu server edition using the intel_cpufreq/schedutil to respond much like the performance governor, as reported the other day and above. I also booted the Sandy Bridge computer, with its original disk, with the kernel command line containing: cgroup_no_v1=all and, while it complained during boot, it did work as expected using the intel_cpufreq/schedutil driver/governor. ... Doug
Hi Giovanni, On Wednesday 13 Nov 2019 at 13:46:49 (+0100), Giovanni Gherdovich wrote: [...] > --- > arch/x86/include/asm/topology.h | 23 ++++++ > arch/x86/kernel/smpboot.c | 176 +++++++++++++++++++++++++++++++++++++++- > kernel/sched/core.c | 1 + > kernel/sched/sched.h | 7 ++ > 4 files changed, 206 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h > index 4b14d2318251..9b3aca463c8f 100644 > --- a/arch/x86/include/asm/topology.h > +++ b/arch/x86/include/asm/topology.h > @@ -193,4 +193,27 @@ static inline void sched_clear_itmt_support(void) > } > #endif /* CONFIG_SCHED_MC_PRIO */ > > +#ifdef CONFIG_SMP > +#include <asm/cpufeature.h> > + > +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key); > + > +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key) > + > +DECLARE_PER_CPU(unsigned long, arch_cpu_freq); > + > +static inline long arch_scale_freq_capacity(int cpu) > +{ > + if (arch_scale_freq_invariant()) > + return per_cpu(arch_cpu_freq, cpu); > + I see further down in the code that you gate the setting of arch_cpu_freq by arch_scale_freq_invariant() as well, so it might be cleaner to remove the condition here and just return the value of the per_cpu variable. That variable should also have an initial value of SCHED_FREQ_CAPACITY_SCALE (1024) and if it happens that frequency invariance is not enabled, then 1024 will always be returned as no code would have set it to anything else. Also, arm64 names this cpu variable freq_scale instead of arch_cpu_freq. It would be nice to have the same name here, to easily understand similarities in this functionality on both sides. If arch_cpu_freq seems more complete, you might want to rename it to arch_cpu_freq_scale, although longer, to clearly state that this is a scale value and not an absolute frequency value. > + return 1024 /* SCHED_CAPACITY_SCALE */; > +} > +#define arch_scale_freq_capacity arch_scale_freq_capacity > + > +extern void arch_scale_freq_tick(void); > +#define arch_scale_freq_tick arch_scale_freq_tick > + > +#endif > + > #endif /* _ASM_X86_TOPOLOGY_H */ > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index 69881b2d446c..814d7900779d 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c [...] > + > +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key); > + > +static DEFINE_PER_CPU(u64, arch_prev_aperf); > +static DEFINE_PER_CPU(u64, arch_prev_mperf); > +static u64 arch_max_freq = SCHED_CAPACITY_SCALE; > + Same here: the scale suffix would make the math below clearer. [...] > +static void intel_set_cpu_max_freq(void) > +{ > + /* > + * TODO: add support for: > + * > + * - Xeon Gold/Platinum > + * - Xeon Phi (KNM, KNL) > + * - Atom Goldmont > + * - Atom Silvermont > + * > + * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE > + */ > + > + static_branch_enable(&arch_scale_freq_key); > + > + if (turbo_disabled() || > + x86_match_cpu(has_skx_turbo_ratio_limits) || > + x86_match_cpu(has_knl_turbo_ratio_limits) || > + x86_match_cpu(has_glm_turbo_ratio_limits)) > + return; > + > + core_set_cpu_max_freq(); > +} > + > +static void init_scale_freq(void *arg) This function does not initialise the frequency scale factor so the name is confusing to me. How about init_counters_refs or init_fie_counters_refs (fie = frequency invariance engine)? > +{ > + u64 aperf, mperf; > + > + rdmsrl(MSR_IA32_APERF, aperf); > + rdmsrl(MSR_IA32_MPERF, mperf); > + > + this_cpu_write(arch_prev_aperf, aperf); > + this_cpu_write(arch_prev_mperf, mperf); > +} > + > +static void set_cpu_max_freq(void) Similarly for the name of this function: it seems to both set the max frequency ratio and initialise the references to the aperf and mperf counters. Also, in the process it enables frequency invariance. So this function seems to do all the preparation work for frequency invariance so a more generic name (init_fie/init_frequency_invariance) would work better in my opinion. > +{ > + if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF)) > + return; > + > + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) > + intel_set_cpu_max_freq(); I see above that you enable the static key (and therefore frequency invariance before setting the max frequency ratio (if possible) and before you initialise the counter references. Is there any reason for doing this? In my mind the more clear process is: - Obtain and set max frequency ratio - Initialise counter references - If all above goes well enable the static key (and frequency invariance) Thanks, Ionela.
Hello Ionela, thanks for your review. Comments below. > On Wednesday 13 Nov 2019 at 13:46:49 (+0100), Giovanni Gherdovich > wrote: > [...] > > --- > > arch/x86/include/asm/topology.h | 23 ++++++ > > arch/x86/kernel/smpboot.c | 176 +++++++++++++++++++++++++++++++++++++++- > > kernel/sched/core.c | 1 + > > kernel/sched/sched.h | 7 ++ > > 4 files changed, 206 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h > > index 4b14d2318251..9b3aca463c8f 100644 > > --- a/arch/x86/include/asm/topology.h > > +++ b/arch/x86/include/asm/topology.h > > @@ -193,4 +193,27 @@ static inline void sched_clear_itmt_support(void) > > } > > #endif /* CONFIG_SCHED_MC_PRIO */ > > > > +#ifdef CONFIG_SMP > > +#include <asm/cpufeature.h> > > + > > +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key); > > + > > +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key) > > + > > +DECLARE_PER_CPU(unsigned long, arch_cpu_freq); > > + > > +static inline long arch_scale_freq_capacity(int cpu) > > +{ > > + if (arch_scale_freq_invariant()) > > + return per_cpu(arch_cpu_freq, cpu); > > + > > I see further down in the code that you gate the setting of > arch_cpu_freq by arch_scale_freq_invariant() as well, so it might be > cleaner to remove the condition here and just return the value of the > per_cpu variable. That variable should also have an initial value of > SCHED_FREQ_CAPACITY_SCALE (1024) and if it happens that frequency > invariance is not enabled, then 1024 will always be returned as no code > would have set it to anything else. You're correct. Currently I'm not initializing arch_cpu_freq explicitely, but if I set it to 1024 than I can remove the check for arch_scale_freq_invariant() in the function above (arch_scale_freq_capacity) and always return arch_cpu_freq from there no matter what. Will do that in v5. > > Also, arm64 names this cpu variable freq_scale instead of > arch_cpu_freq. > It would be nice to have the same name here, to easily understand > similarities in this functionality on both sides. > > If arch_cpu_freq seems more complete, you might want to rename it to > arch_cpu_freq_scale, although longer, to clearly state that this is a > scale value and not an absolute frequency value. > > > + return 1024 /* SCHED_CAPACITY_SCALE */; > > +} > > +#define arch_scale_freq_capacity arch_scale_freq_capacity > > + > > +extern void arch_scale_freq_tick(void); > > +#define arch_scale_freq_tick arch_scale_freq_tick > > + > > +#endif > > + > > #endif /* _ASM_X86_TOPOLOGY_H */ > > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > > index 69881b2d446c..814d7900779d 100644 > > --- a/arch/x86/kernel/smpboot.c > > +++ b/arch/x86/kernel/smpboot.c > [...] > > + > > +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key); > > + > > +static DEFINE_PER_CPU(u64, arch_prev_aperf); > > +static DEFINE_PER_CPU(u64, arch_prev_mperf); > > +static u64 arch_max_freq = SCHED_CAPACITY_SCALE; > > + > > Same here: the scale suffix would make the math below clearer. Agreed, in v5 will rename the variabled arch_cpu_freq and arch_max_freq to include "scale". > > [...] > > +static void intel_set_cpu_max_freq(void) > > +{ > > + /* > > + * TODO: add support for: > > + * > > + * - Xeon Gold/Platinum > > + * - Xeon Phi (KNM, KNL) > > + * - Atom Goldmont > > + * - Atom Silvermont > > + * > > + * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE > > + */ > > + > > + static_branch_enable(&arch_scale_freq_key); > > + > > + if (turbo_disabled() || > > + x86_match_cpu(has_skx_turbo_ratio_limits) || > > + x86_match_cpu(has_knl_turbo_ratio_limits) || > > + x86_match_cpu(has_glm_turbo_ratio_limits)) > > + return; > > + > > + core_set_cpu_max_freq(); > > +} > > + > > +static void init_scale_freq(void *arg) > > This function does not initialise the frequency scale factor so the > name > is confusing to me. How about init_counters_refs or > init_fie_counters_refs > (fie = frequency invariance engine)? Good point too, I'll rename init_scale_freq() following your suggestion. > > > +{ > > + u64 aperf, mperf; > > + > > + rdmsrl(MSR_IA32_APERF, aperf); > > + rdmsrl(MSR_IA32_MPERF, mperf); > > + > > + this_cpu_write(arch_prev_aperf, aperf); > > + this_cpu_write(arch_prev_mperf, mperf); > > +} > > + > > +static void set_cpu_max_freq(void) > > Similarly for the name of this function: it seems to both set the max > frequency ratio and initialise the references to the aperf and mperf > counters. Also, in the process it enables frequency invariance. > So this function seems to do all the preparation work for frequency > invariance so a more generic name (init_fie/init_frequency_invariance) > would work better in my opinion. Yeah I agree, names all around here are not great, will put more thought into it for v5. Thanks for the names suggestions. > > > +{ > > + if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF)) > > + return; > > + > > + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) > > + intel_set_cpu_max_freq(); > > I see above that you enable the static key (and therefore frequency > invariance before setting the max frequency ratio (if possible) and > before you initialise the counter references. Is there any reason for > doing this? > > In my mind the more clear process is: > - Obtain and set max frequency ratio > - Initialise counter references > - If all above goes well enable the static key (and frequency > invariance) This is a fair point; mine was a deliberate choice but you're the second person making this remark (Peter Zijlstra also suggested I find the max frequency before I set the static key), so it appears this design is unpopular enough to warrant a change in v5. The initialization of counter references doesn't worry me: once you check the presence of the [AM]PERF registers with boot_cpu_has(X86_FEATURE_APERFMPERF), there isn't much that can go wrong when reading their value. Reading and setting the max frequency ratio before or after enabling frequency invariance is more opinionated. My view is that if you fail to obtain the max frequency ratio from the various MSRs, that means the processor is not among the list of CPUs I've enumerated in the code, but some other model that has a different method (MSR semantics) for gathering the max freq value. Eg: if Intel next year makes a new processor where the max freq is written in some different register (happened plenty of times in the past), and I'm not quick to send a patch for that, that part of the initialization would fail. What to do at that point? Bail out entirely from frequency invariance? In my opinion that would be too conservative, I prefer to give that machine a default max freq ratio of 1024 (max frequency == base frequency, i.e. pretend it doesn't have any turbo). In other words I think that a little frequency invariance is better than no frequency invariance, and the only really essential ingredient is X86_FEATURE_APERFMPERF to read the current frequency (which I check before setting the static key). But again: you and one other person have already suggested to do it the other way, and mine is only a mild preference, so I'll change that part in v5. Thanks, Giovanni
On Fri, Dec 06, 2019 at 12:57:46PM +0100, Giovanni Gherdovich wrote: > > On Wednesday 13 Nov 2019 at 13:46:49 (+0100), Giovanni Gherdovich wrote: > > I see above that you enable the static key (and therefore frequency > > invariance before setting the max frequency ratio (if possible) and > > before you initialise the counter references. Is there any reason for > > doing this? > This is a fair point; mine was a deliberate choice but you're the second > person making this remark (Peter Zijlstra also suggested I find the max > frequency before I set the static key), so it appears this design is > unpopular > enough to warrant a change in v5. You actually 'fix' this in the next patch. I thought it was a patch management 'fail' that it didn't end up in this patch.
Hi Doug On 11/28/19 14:48, Doug Smythies wrote: > Summary: There never was an issue here. > > Sorry for the noise of this thread, and the resulting waste of time. > > On 2019.11.26 23:33 Doug Smythies wrote: > > On 2019.11.26 07:20 Giovanni Gherdovich wrote: > >> On Mon, 2019-11-25 at 21:59 -0800, Doug Smythies wrote: > >>> [...] > >>> The issue with the schedutil governor not working properly in the 5.4 RC series > >>> appears to be hardware dependant. > > No it 's not. > > Issues with my Sandy Bridge, i7-2600K, test computer and kernel 5.4 > seem to be because it is running an older Ubuntu server version, > apparently somewhat dependant on cgroup V1 and their cgmanager package. > I am unable to remove the package to test further because I do use VMs > that seem to depend on it. > > In the kernel configuration when CONFIG_UCLAMP_TASK_GROUP=y > the computer behaves as though the new parameter "cpu.uclamp.min" > is set to max rather than 0, but I can not prove it. I just noticed this. This option shouldn't cause any problem, if it does there might be a bug that we need to fix. So cpu.uclamp.min reads 0 but you think it's not taking effect, correct? In the quotes above I see 5.4 RC, if you haven't tried this against the final 5.4 release, do you mind trying to see if you can reproduce? Trying 5.5-rc2 would be helpful too if 5.4 fails. Thanks -- Qais Yousef
On Wed, 2019-12-18 at 20:34 +0100, Peter Zijlstra wrote: > On Fri, Dec 06, 2019 at 12:57:46PM +0100, Giovanni Gherdovich wrote: > > > On Wednesday 13 Nov 2019 at 13:46:49 (+0100), Giovanni Gherdovich wrote: > > > I see above that you enable the static key (and therefore frequency > > > invariance before setting the max frequency ratio (if possible) and > > > before you initialise the counter references. Is there any reason for > > > doing this? > > This is a fair point; mine was a deliberate choice but you're the second > > person making this remark (Peter Zijlstra also suggested I find the max > > frequency before I set the static key), so it appears this design is > > unpopular > > enough to warrant a change in v5. > > You actually 'fix' this in the next patch. I thought it was a patch > management 'fail' that it didn't end up in this patch. Uhm. I'm not sure I agree; let me paste the function intel_set_cpu_max_freq after the entire series is applied: > static void intel_set_cpu_max_freq(void) > { > u64 ratio = 1, turbo_ratio = 1; > > if (slv_set_cpu_max_freq(&ratio, &turbo_ratio)) > goto set_value; > > if (glm_set_cpu_max_freq(&ratio, &turbo_ratio)) > goto set_value; > > if (knl_set_cpu_max_freq(&ratio, &turbo_ratio)) > goto set_value; > > if (skx_set_cpu_max_freq(&ratio, &turbo_ratio)) > goto set_value; > > core_set_cpu_max_freq(&ratio, &turbo_ratio); > let's say that all functions return false; as I don't check the return value of the last one, you can very well end up here with 'ratio' and 'turbo_ratio' that are still untouched, =1 since their initialization, and I would go on and set the static key anyway (because I previously checked X86_FEATURE_APERFMPERF). Right? > set_value: > arch_max_turbo_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, > ratio); > set_arch_max_freq(turbo_disabled()); > static_branch_enable(&arch_scale_freq_key); > } But again, not only people disagree with this behavior, it's probably a little misleading too in how it's written. Changing in v5. Giovanni
Hi Qais, Thank you for your follow up. On 2019.12.19 02:48 Qais Yousef wrote: > On 11/28/19 14:48, Doug Smythies wrote: >> Summary: There never was an issue here. >> >> Sorry for the noise of this thread, and the resulting waste of time. >> >> On 2019.11.26 23:33 Doug Smythies wrote: >>> On 2019.11.26 07:20 Giovanni Gherdovich wrote: >>>> On Mon, 2019-11-25 at 21:59 -0800, Doug Smythies wrote: >>>>> [...] >>>>> The issue with the schedutil governor not working properly in the 5.4 RC series >>>>> appears to be hardware dependant. >> >> No it 's not. >> >> Issues with my Sandy Bridge, i7-2600K, test computer and kernel 5.4 >> seem to be because it is running an older Ubuntu server version, >> apparently somewhat dependant on cgroup V1 and their cgmanager package. >> I am unable to remove the package to test further because I do use VMs >> that seem to depend on it. >> >> In the kernel configuration when CONFIG_UCLAMP_TASK_GROUP=y >> the computer behaves as though the new parameter "cpu.uclamp.min" >> is set to max rather than 0, but I can not prove it. > > I just noticed this. This option shouldn't cause any problem, if it does there > might be a bug that we need to fix. > > So cpu.uclamp.min reads 0 but you think it's not taking effect, correct? Actually, on the i7-2600K older distro test computer, I couldn't find cpu.uclamp.min to read its setting. However, yes the behaviour of the governor was as though that value was set to maximum (read on). > > In the quotes above I see 5.4 RC, if you haven't tried this against the final > 5.4 release, do you mind trying to see if you can reproduce? Trying 5.5-rc2 > would be helpful too if 5.4 fails. My test setup and baseline distribution versions have changed since November, when I did those tests. However, I was able to rig up a bootable old ssd and was able to reproduce the issue with kernel 5.5-rc2. More importantly, I was to reproduce the issue with the current i7-2600K test computer (Ubuntu server 20.04 development, upgraded version) and kernel 5.5-rc2. Note that I have access to another i5-9600K based test computer (Ubuntu server 20.04 development, fresh install), that does not show this issue. Detail: If formatting gets messed up in this e-mail, then the content, and links to more details, is also here: http://www.smythies.com/~doug/linux/single-threaded/k54regression/qais.html CPU frequency scaling driver: intel_pstate, in passive (intel-cpufreq) mode. CPU frequency scaling governor: various. CPU Idle driver: intel_idle; Governor: teo. kernels ("stock", "notset" and "nocgv1"): stock: CONFIG_UCLAMP_TASK_GROUP=y notset: # CONFIG_UCLAMP_TASK_GROUP is not set nocgv1: is "stock" booted with "cgroup_no_v1=all" on the grub kernel command line. Linux s15 5.5.0-rc2-stock #768 SMP PREEMPT Fri Dec 20 16:19:44 PST 2019 x86_64 x86_64 x86_64 GNU/Linux Linux s18 5.5.0-rc2-notset #769 SMP PREEMPT Fri Dec 20 18:43:59 PST 2019 x86_64 x86_64 x86_64 GNU/Linux kernel configuration differences: doug@s15:~/temp-k-git/linux$ scripts/diffconfig /boot/config-5.5.0-rc2-stock /boot/config-5.5.0-rc2-notset UCLAMP_TASK_GROUP y -> n doug@s15:~/temp-k-git/linux$ Test methods used herein are greatly sped up, by switching to just a couple of PID per seconds samples, instead of a great many. Also disk I/O is not used, eliminating any access time related non-repeatability, and saving thrashing my SSD. Note that several governors had CPU frequency variations with time, resulting in variability in the PIDs per second number. There are two tests, the performance metric being the number of PIDs per second consumed: test 1: Dountil terminated: launch a null program (uses a new PID per call). Wait for it to finish Enduntil test 2: Dountil terminated: launch a program with a package of work to do (uses a new PID per call). Wait for it to finish Enduntil The assumed fastest and master reference test run is using the performance governor and forcing CPU affinity. All other calculations are relative to this result. Results: i7-2600K computer booted with Ubuntu server 16.04.6, test 1 only: Governor kernel notset stock notset PID/S ratio PID/S ratio PID/S ratio schedutil 1650 2.4 3935 1.0 FAIL 1645 2.4 ondemand 2787 1.4 2787 1.4 2787 1.4 performance 3925 1.0 3940 1.0 3940 1.0 conservative2545 1.5 2540 1.5 2530 1.6 powersave 1645 2.4 1655 2.4 1650 2.4 reference 3934 1.0 3917 1.0 3950 1.0 i7-2600K computer booted with Ubuntu server 20.04 dev, test 1: Governor kernel stock notset stock notset nocgv1 PID/S ratio PID/S ratio PID/S ratio PID/S ratio PID/S ratio schedutil 3310 1.1 FAIL 1455 2.4 3250 1.1 FAIL 1465 2.4 3220 1.1 FAIL ondemand 2510 1.4 2485 1.4 2495 1.4 2490 1.4 2460 1.4 performance 3333 1.1 3254 1.1 3250 1.1 3360 1.0 3220 1.1 conservative2230 1.6 2260 1.5 2280 1.5 2220 1.6 2230 1.6 powersave 1470 2.4 1455 2.4 1460 2.4 1470 2.4 1450 2.4 reference 3521 1.0 3500 1.0 3526 1.0 3500 1.0 3500 1.0 i7-2600K computer booted with Ubuntu server 20.04 dev, test 2: Governor kernel stock notset nocgv1 PID/S ratio PID/S ratio PID/S ratio schedutil 405 1.1 FAIL 177 2.4 405 1.1 FAIL ondemand 371 1.1 371 1.1 371 1.1 performance 408 1.0 405 1.0 405 1.0 conservative362 1.2 365 1.2 365 1.2 powersave 177 2.4 177 2.4 177 2.4 reference 423 1.0 423 1.0 423 1.0 The "nocgv1" (cgroup_no_v1=all) kernel is of particular interest because now uclamp variables are available: root@s15:/sys/fs/cgroup/user.slice# echo "+cpu" > cgroup.subtree_control root@s15:/sys/fs/cgroup/user.slice# cat cgroup.subtree_control cpu memory pids root@s15:/sys/fs/cgroup/user.slice# grep . cpu\.uclamp* cpu.uclamp.max:max cpu.uclamp.min:0.00 This is repeatable: To make the schedutil governor respond as expected thereafter and until the next re-boot, do this: # echo 0 > cpu.uclamp.min Attempts to kick the schedutil governor response via /sys/devices/system/cpu/intel_pstate/max_perf_pct and /sys/devices/system/cpu/intel_pstate/min_perf_pct didn't. Other modifications of the cpu.uclamp.min and max variables also kick the schedutil governor out of whatever state it was in. This test was done 5 times: Re-boot to the nocgv1 (stock + cgroup_no_v1=all) kernel. set the schedutil governor. launch test 2 and related monitoring tools. verify performance governor like behavior. echo 0 > /sys/fs/cgroup/user.slice/cpu.uclamp.min verify schedutil governor like behaviour. ... Doug
On 12/22/19 23:47, Doug Smythies wrote: > Hi Qais, > > Thank you for your follow up. Thanks for the detailed info! > > On 2019.12.19 02:48 Qais Yousef wrote: > > On 11/28/19 14:48, Doug Smythies wrote: > >> Summary: There never was an issue here. > >> > >> Sorry for the noise of this thread, and the resulting waste of time. > >> > >> On 2019.11.26 23:33 Doug Smythies wrote: > >>> On 2019.11.26 07:20 Giovanni Gherdovich wrote: > >>>> On Mon, 2019-11-25 at 21:59 -0800, Doug Smythies wrote: > >>>>> [...] > >>>>> The issue with the schedutil governor not working properly in the 5.4 RC series > >>>>> appears to be hardware dependant. > >> > >> No it 's not. > >> > >> Issues with my Sandy Bridge, i7-2600K, test computer and kernel 5.4 > >> seem to be because it is running an older Ubuntu server version, > >> apparently somewhat dependant on cgroup V1 and their cgmanager package. > >> I am unable to remove the package to test further because I do use VMs > >> that seem to depend on it. > >> > >> In the kernel configuration when CONFIG_UCLAMP_TASK_GROUP=y > >> the computer behaves as though the new parameter "cpu.uclamp.min" > >> is set to max rather than 0, but I can not prove it. > > > > I just noticed this. This option shouldn't cause any problem, if it does there > > might be a bug that we need to fix. > > > > So cpu.uclamp.min reads 0 but you think it's not taking effect, correct? > > Actually, on the i7-2600K older distro test computer, I couldn't find > cpu.uclamp.min to read its setting. However, yes the behaviour of the governor > was as though that value was set to maximum (read on). Sorry, silly question. Did you verify both cgroup v1 and v2 mount points? If the cpu controller is mounted in v1, it can't be seen in v2. And vice versa. > > > > > In the quotes above I see 5.4 RC, if you haven't tried this against the final > > 5.4 release, do you mind trying to see if you can reproduce? Trying 5.5-rc2 > > would be helpful too if 5.4 fails. > > My test setup and baseline distribution versions have changed since November, > when I did those tests. However, I was able to rig up a bootable old ssd > and was able to reproduce the issue with kernel 5.5-rc2. More importantly, > I was to reproduce the issue with the current i7-2600K test computer > (Ubuntu server 20.04 development, upgraded version) and kernel 5.5-rc2. > Note that I have access to another i5-9600K based test computer (Ubuntu > server 20.04 development, fresh install), that does not show this issue. > > Detail: > > If formatting gets messed up in this e-mail, then the content, > and links to more details, is also here: > http://www.smythies.com/~doug/linux/single-threaded/k54regression/qais.html I see here: "kernel configuration differences between Ubuntu mainline PPA and local" Does this mean you're using Ubuntu's mainline tree not Linus'? > > CPU frequency scaling driver: intel_pstate, in passive (intel-cpufreq) mode. > CPU frequency scaling governor: various. > CPU Idle driver: intel_idle; Governor: teo. > > kernels ("stock", "notset" and "nocgv1"): > stock: CONFIG_UCLAMP_TASK_GROUP=y > notset: # CONFIG_UCLAMP_TASK_GROUP is not set > nocgv1: is "stock" booted with "cgroup_no_v1=all" on the grub kernel command line. So I looked at the code and I can't see how to connect the dots. By default, all CFS tasks has a uclamp_min value of 0, and uclamp_max value of 1024. RT tasks by default are boosted to max only, from uclamp+schedutil perspective. Similarly all child group should have cpu.uclamp.{min, max} = (0, 1024) by default unless changed. I'm wondering if forcing cgroup_v1 off is just has the side effect of allowing you to mount and use the cpu controller in v2. > > Linux s15 5.5.0-rc2-stock #768 SMP PREEMPT Fri Dec 20 16:19:44 PST 2019 x86_64 x86_64 x86_64 GNU/Linux > Linux s18 5.5.0-rc2-notset #769 SMP PREEMPT Fri Dec 20 18:43:59 PST 2019 x86_64 x86_64 x86_64 GNU/Linux > > kernel configuration differences: > > doug@s15:~/temp-k-git/linux$ scripts/diffconfig /boot/config-5.5.0-rc2-stock /boot/config-5.5.0-rc2-notset > UCLAMP_TASK_GROUP y -> n > doug@s15:~/temp-k-git/linux$ > > Test methods used herein are greatly sped up, by switching > to just a couple of PID per seconds samples, instead of > a great many. Also disk I/O is not used, eliminating any > access time related non-repeatability, and saving thrashing > my SSD. Note that several governors had CPU frequency variations > with time, resulting in variability in the PIDs per second number. > > There are two tests, the performance metric being > the number of PIDs per second consumed: > [...] > i7-2600K computer booted with Ubuntu server 20.04 dev, test 2: > > Governor kernel > stock notset nocgv1 > PID/S ratio PID/S ratio PID/S ratio > schedutil 405 1.1 FAIL 177 2.4 405 1.1 FAIL > ondemand 371 1.1 371 1.1 371 1.1 > performance 408 1.0 405 1.0 405 1.0 > conservative362 1.2 365 1.2 365 1.2 > powersave 177 2.4 177 2.4 177 2.4 > reference 423 1.0 423 1.0 423 1.0 > > The "nocgv1" (cgroup_no_v1=all) kernel is of particular interest because > now uclamp variables are available: > > root@s15:/sys/fs/cgroup/user.slice# echo "+cpu" > cgroup.subtree_control This is v2 syntax. It's worth checking the mount point of the cpu controller in cgroup v1. > root@s15:/sys/fs/cgroup/user.slice# cat cgroup.subtree_control > cpu memory pids > root@s15:/sys/fs/cgroup/user.slice# grep . cpu\.uclamp* > cpu.uclamp.max:max > cpu.uclamp.min:0.00 > > This is repeatable: > To make the schedutil governor respond as expected thereafter > and until the next re-boot, do this: > > # echo 0 > cpu.uclamp.min > > Attempts to kick the schedutil governor response via > /sys/devices/system/cpu/intel_pstate/max_perf_pct and > /sys/devices/system/cpu/intel_pstate/min_perf_pct didn't. > Other modifications of the cpu.uclamp.min and max variables also > kick the schedutil governor out of whatever state it was in. > > This test was done 5 times: > > Re-boot to the nocgv1 (stock + cgroup_no_v1=all) kernel. > set the schedutil governor. > launch test 2 and related monitoring tools. > verify performance governor like behavior. So as stated above, by default uclamp_{min, max} = (0, 1024). So it wouldn't act as performance governor by default unless you explicitly write 1024 to uclamp.min. Let me go find Ubuntu mainline tree to see if they applied anything extra in there. If they modified the default behavior that could explain what you see. Thanks! -- Qais Yousef > echo 0 > /sys/fs/cgroup/user.slice/cpu.uclamp.min > verify schedutil governor like behaviour. > > ... Doug > >
On 12/23/19 14:07, Qais Yousef wrote: > > Re-boot to the nocgv1 (stock + cgroup_no_v1=all) kernel. > > set the schedutil governor. > > launch test 2 and related monitoring tools. > > verify performance governor like behavior. > > So as stated above, by default uclamp_{min, max} = (0, 1024). So it wouldn't > act as performance governor by default unless you explicitly write 1024 to > uclamp.min. > > Let me go find Ubuntu mainline tree to see if they applied anything extra in > there. If they modified the default behavior that could explain what you see. Actually I see what you were saying now that you copy the config. So I think I misunderstood and you are running Linus' 5.5-rc2 + Ubuntu PPA config. I'm trying to to reproduce the issue at my end. Cheers -- Qais Yousef
On 2019.12.23 06:41 Qais Yousef wrote: > On 12/23/19 14:07, Qais Yousef wrote: >>> Re-boot to the nocgv1 (stock + cgroup_no_v1=all) kernel. >>> set the schedutil governor. >>> launch test 2 and related monitoring tools. >>> verify performance governor like behavior. >> >> So as stated above, by default uclamp_{min, max} = (0, 1024). So it wouldn't >> act as performance governor by default unless you explicitly write 1024 to >> uclamp.min. >> >> Let me go find Ubuntu mainline tree to see if they applied anything extra in >> there. If they modified the default behavior that could explain what you see. > > Actually I see what you were saying now that you copy the config. So I think > I misunderstood and you are running Linus' 5.5-rc2 + Ubuntu PPA config. Yes, exactly. I have a clone of the main Linus git branch, but steal the kernel configuration from the Ubuntu mainline PPA. > > I'm trying to to reproduce the issue at my end. ... Doug
On 12/23/19 08:34, Doug Smythies wrote: > On 2019.12.23 06:41 Qais Yousef wrote: > > On 12/23/19 14:07, Qais Yousef wrote: > >>> Re-boot to the nocgv1 (stock + cgroup_no_v1=all) kernel. > >>> set the schedutil governor. > >>> launch test 2 and related monitoring tools. > >>> verify performance governor like behavior. > >> > >> So as stated above, by default uclamp_{min, max} = (0, 1024). So it wouldn't > >> act as performance governor by default unless you explicitly write 1024 to > >> uclamp.min. > >> > >> Let me go find Ubuntu mainline tree to see if they applied anything extra in > >> there. If they modified the default behavior that could explain what you see. > > > > Actually I see what you were saying now that you copy the config. So I think > > I misunderstood and you are running Linus' 5.5-rc2 + Ubuntu PPA config. > > Yes, exactly. > I have a clone of the main Linus git branch, but steal the kernel > configuration from the Ubuntu mainline PPA. I think I managed to reproduce it. The below seems to fix it for me, can you try it out please? I think what's going on is that the child group default util_min value isn't propagated correctly after it's created, so we end up using the root_task_group values which is 1024 for both min and max. --- diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 90e4b00ace89..cf9d9106d1b5 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7100,6 +7100,11 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css) if (parent) sched_online_group(tg, parent); + +#ifdef CONFIG_UCLAMP_TASK_GROUP + cpu_util_update_eff(css); +#endif + return 0; }
On 2019.12.23 11:10 Qais Yousef wrote: > On 12/23/19 08:34, Doug Smythies wrote: >> On 2019.12.23 06:41 Qais Yousef wrote: >>> On 12/23/19 14:07, Qais Yousef wrote: >>>>> Re-boot to the nocgv1 (stock + cgroup_no_v1=all) kernel. >>>>> set the schedutil governor. >>>>> launch test 2 and related monitoring tools. >>>>> verify performance governor like behavior. >>>> >>>> So as stated above, by default uclamp_{min, max} = (0, 1024). So it wouldn't >>>> act as performance governor by default unless you explicitly write 1024 to >>>> uclamp.min. >>>> >>>> Let me go find Ubuntu mainline tree to see if they applied anything extra in >>>> there. If they modified the default behavior that could explain what you see. >>> >>> Actually I see what you were saying now that you copy the config. So I think >>> I misunderstood and you are running Linus' 5.5-rc2 + Ubuntu PPA config. >> >> Yes, exactly. >> I have a clone of the main Linus git branch, but steal the kernel >> configuration from the Ubuntu mainline PPA. > > I think I managed to reproduce it. The below seems to fix it for me, can you > try it out please? Yes, it fixes the schedutil governor behaving like the performance governor problem on my i7-2600K test system. I re-ran the tests several times, and re-booted back to the stock (problem) kernel to verify incorrect schedutil governor performance (i.e. I toggled back and forth, 2 times for each of 2 kernels, tests 1 and 2, total 8 tests). Kernel 5.5-rc2: 4 tests FAILED (as expected). Kernel 5.5-rc2 + this patch: 4 tests PASSED. Accidentally tested: Kernel 5.5-rc2 + this patch + command line "cgroup_no_v1=all": 1 test PASS. > I think what's going on is that the child group default util_min value isn't > propagated correctly after it's created, so we end up using the root_task_group > values which is 1024 for both min and max. O.K. thanks. > --- > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 90e4b00ace89..cf9d9106d1b5 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -7100,6 +7100,11 @@ static int cpu_cgroup_css_online(struct cgroup_subsys_state *css) > > if (parent) > sched_online_group(tg, parent); > + > +#ifdef CONFIG_UCLAMP_TASK_GROUP > + cpu_util_update_eff(css); > +#endif > + > return 0; > } ... Doug
On 12/23/19 17:16, Doug Smythies wrote: > Yes, it fixes the schedutil governor behaving like the performance governor > problem on my i7-2600K test system. > > I re-ran the tests several times, and re-booted back to the stock (problem) > kernel to verify incorrect schedutil governor performance (i.e. I toggled > back and forth, 2 times for each of 2 kernels, tests 1 and 2, total 8 tests). > Kernel 5.5-rc2: 4 tests FAILED (as expected). > Kernel 5.5-rc2 + this patch: 4 tests PASSED. Great! Thanks for testing it, I'll send a proper patch shortly. > > Accidentally tested: > Kernel 5.5-rc2 + this patch + command line "cgroup_no_v1=all": 1 test PASS. I think this cgroup_no_v1 is a happy accident. It has nothing to do with the fault, but for your case maybe helped observing things in a better way. FWIW, I reproduced the issue on Juno Arm64 using Debian and Buildroot rootfs. What is actually required to trigger the bug is to create a cpu controller and add all system tasks to it to create some noise - which Ubuntu and Debian do by default at boot time. In Buildroot when I did that manually I could see the frequency going to max most of the time. I'll add away to test this scenario. Thanks for your detailed report. Happy holidays! -- Qais Yousef
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h index 4b14d2318251..9b3aca463c8f 100644 --- a/arch/x86/include/asm/topology.h +++ b/arch/x86/include/asm/topology.h @@ -193,4 +193,27 @@ static inline void sched_clear_itmt_support(void) } #endif /* CONFIG_SCHED_MC_PRIO */ +#ifdef CONFIG_SMP +#include <asm/cpufeature.h> + +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key); + +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key) + +DECLARE_PER_CPU(unsigned long, arch_cpu_freq); + +static inline long arch_scale_freq_capacity(int cpu) +{ + if (arch_scale_freq_invariant()) + return per_cpu(arch_cpu_freq, cpu); + + return 1024 /* SCHED_CAPACITY_SCALE */; +} +#define arch_scale_freq_capacity arch_scale_freq_capacity + +extern void arch_scale_freq_tick(void); +#define arch_scale_freq_tick arch_scale_freq_tick + +#endif + #endif /* _ASM_X86_TOPOLOGY_H */ diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 69881b2d446c..814d7900779d 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void) *((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0; } +static void set_cpu_max_freq(void); + /* * Report back to the Boot Processor during boot time or to the caller processor * during CPU online. @@ -183,6 +185,8 @@ static void smp_callin(void) */ set_cpu_sibling_map(raw_smp_processor_id()); + set_cpu_max_freq(); + /* * Get our bogomips. * Update loops_per_jiffy in cpu_data. Previous call to @@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus) set_sched_topology(x86_topology); set_cpu_sibling_map(0); - + set_cpu_max_freq(); smp_sanity_check(); switch (apic_intr_mode) { @@ -1764,3 +1768,173 @@ void native_play_dead(void) } #endif + +/* + * APERF/MPERF frequency ratio computation. + * + * The scheduler wants to do frequency invariant accounting and needs a <1 + * ratio to account for the 'current' frequency, corresponding to + * freq_curr / freq_max. + * + * Since the frequency freq_curr on x86 is controlled by micro-controller and + * our P-state setting is little more than a request/hint, we need to observe + * the effective frequency 'BusyMHz', i.e. the average frequency over a time + * interval after discarding idle time. This is given by: + * + * BusyMHz = delta_APERF / delta_MPERF * freq_base + * + * where freq_base is the max non-turbo P-state. + * + * The freq_max term has to be set to a somewhat arbitrary value, because we + * can't know which turbo states will be available at a given point in time: + * it all depends on the thermal headroom of the entire package. We set it to + * the turbo level with 4 cores active. + * + * Benchmarks show that's a good compromise between the 1C turbo ratio + * (freq_curr/freq_max would rarely reach 1) and something close to freq_base, + * which would ignore the entire turbo range (a conspicuous part, making + * freq_curr/freq_max always maxed out). + * + * Setting freq_max to anything less than the 1C turbo ratio makes the ratio + * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1. + */ + +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key); + +static DEFINE_PER_CPU(u64, arch_prev_aperf); +static DEFINE_PER_CPU(u64, arch_prev_mperf); +static u64 arch_max_freq = SCHED_CAPACITY_SCALE; + +static bool turbo_disabled(void) +{ + u64 misc_en; + int err; + + err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en); + if (err) + return false; + + return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); +} + +#include <asm/cpu_device_id.h> +#include <asm/intel-family.h> + +#define ICPU(model) \ + {X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} + +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = { + ICPU(INTEL_FAM6_XEON_PHI_KNL), + ICPU(INTEL_FAM6_XEON_PHI_KNM), + {} +}; + +static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = { + ICPU(INTEL_FAM6_SKYLAKE_X), + {} +}; + +static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = { + ICPU(INTEL_FAM6_ATOM_GOLDMONT), + ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), + ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), + {} +}; + +static void core_set_cpu_max_freq(void) +{ + u64 ratio, turbo_ratio; + int err; + + err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio); + if (err) + return; + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio); + if (err) + return; + + ratio = (ratio >> 8) & 0xFF; /* max P state ratio */ + turbo_ratio = (turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio */ + + arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio); +} + +static void intel_set_cpu_max_freq(void) +{ + /* + * TODO: add support for: + * + * - Xeon Gold/Platinum + * - Xeon Phi (KNM, KNL) + * - Atom Goldmont + * - Atom Silvermont + * + * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE + */ + + static_branch_enable(&arch_scale_freq_key); + + if (turbo_disabled() || + x86_match_cpu(has_skx_turbo_ratio_limits) || + x86_match_cpu(has_knl_turbo_ratio_limits) || + x86_match_cpu(has_glm_turbo_ratio_limits)) + return; + + core_set_cpu_max_freq(); +} + +static void init_scale_freq(void *arg) +{ + u64 aperf, mperf; + + rdmsrl(MSR_IA32_APERF, aperf); + rdmsrl(MSR_IA32_MPERF, mperf); + + this_cpu_write(arch_prev_aperf, aperf); + this_cpu_write(arch_prev_mperf, mperf); +} + +static void set_cpu_max_freq(void) +{ + if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF)) + return; + + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + intel_set_cpu_max_freq(); + + init_scale_freq(NULL); +} + +DEFINE_PER_CPU(unsigned long, arch_cpu_freq); + +void arch_scale_freq_tick(void) +{ + u64 freq; + u64 aperf, mperf; + u64 acnt, mcnt; + + if (!arch_scale_freq_invariant()) + return; + + rdmsrl(MSR_IA32_APERF, aperf); + rdmsrl(MSR_IA32_MPERF, mperf); + + acnt = aperf - this_cpu_read(arch_prev_aperf); + mcnt = mperf - this_cpu_read(arch_prev_mperf); + if (!mcnt) + return; + + this_cpu_write(arch_prev_aperf, aperf); + this_cpu_write(arch_prev_mperf, mperf); + + acnt <<= 2*SCHED_CAPACITY_SHIFT; + mcnt *= arch_max_freq; + + freq = div64_u64(acnt, mcnt); + + if (freq > SCHED_CAPACITY_SCALE) + freq = SCHED_CAPACITY_SCALE; + + this_cpu_write(arch_cpu_freq, freq); +} diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0f2eb3629070..1d0f5df8020e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3593,6 +3593,7 @@ void scheduler_tick(void) struct task_struct *curr = rq->curr; struct rq_flags rf; + arch_scale_freq_tick(); sched_clock_tick(); rq_lock(rq, &rf); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c8870c5bd7df..4c4f5f197e10 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1977,6 +1977,13 @@ static inline int hrtick_enabled(struct rq *rq) #endif /* CONFIG_SCHED_HRTICK */ +#ifndef arch_scale_freq_tick +static __always_inline +void arch_scale_freq_tick(void) +{ +} +#endif + #ifndef arch_scale_freq_capacity static __always_inline unsigned long arch_scale_freq_capacity(int cpu)