Message ID | 20220415133356.179706384@linutronix.de (mailing list archive) |
---|---|
Headers | show |
Series | x86/cpu: Consolidate APERF/MPERF code | expand |
On Fri, Apr 15, 2022 at 12:19 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > APERF/MPERF is utilized in two ways: > > 1) Ad hoc readout of CPU frequency which requires IPIs > > 2) Frequency scale calculation for frequency invariant scheduling which > reads APERF/MPERF on every tick. > > These are completely independent code parts. Eric observed long latencies > when reading /proc/cpuinfo which reads out CPU frequency via #1 and > proposed to replace the per CPU single IPI with a broadcast IPI. > > While this makes the latency smaller, it is not necessary at all because #2 > samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs > which are excluded from IPI already. > > It could be argued that not all APERF/MPERF capable systems have the > required BIOS information to enable frequency invariance support, but in > practice most of them do. So the APERF/MPERF sampling can be made > unconditional and just the frequency scale calculation for the scheduler > excluded. > > The following series consolidates that. > Thanks a lot for working on that Thomas. I am not sure I will be able to backport this to a Google prodkernel, as I guess there will be many merge conflicts. Do you have by any chance this work available in a git branch ? Thanks. > Thanks, > > tglx > --- > arch/x86/include/asm/cpu.h | 2 > arch/x86/include/asm/topology.h | 17 - > arch/x86/kernel/acpi/cppc.c | 28 -- > arch/x86/kernel/cpu/aperfmperf.c | 474 +++++++++++++++++++++++++++++++-------- > arch/x86/kernel/cpu/proc.c | 2 > arch/x86/kernel/smpboot.c | 358 ----------------------------- > fs/proc/cpuinfo.c | 6 > include/linux/cpufreq.h | 1 > 8 files changed, 405 insertions(+), 483 deletions(-) > >
On Fri, Apr 15, 2022 at 09:19:48PM +0200, Thomas Gleixner wrote: > --- > arch/x86/include/asm/cpu.h | 2 > arch/x86/include/asm/topology.h | 17 - > arch/x86/kernel/acpi/cppc.c | 28 -- > arch/x86/kernel/cpu/aperfmperf.c | 474 +++++++++++++++++++++++++++++++-------- > arch/x86/kernel/cpu/proc.c | 2 > arch/x86/kernel/smpboot.c | 358 ----------------------------- > fs/proc/cpuinfo.c | 6 > include/linux/cpufreq.h | 1 > 8 files changed, 405 insertions(+), 483 deletions(-) Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Hi Thomas, On 2022.04.15 12:20 Thomas Gleixner wrote: > APERF/MPERF is utilized in two ways: > > 1) Ad hoc readout of CPU frequency which requires IPIs > > 2) Frequency scale calculation for frequency invariant scheduling which > reads APERF/MPERF on every tick. > > These are completely independent code parts. Eric observed long latencies > when reading /proc/cpuinfo which reads out CPU frequency via #1 and > proposed to replace the per CPU single IPI with a broadcast IPI. > > While this makes the latency smaller, it is not necessary at all because #2 > samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs > which are excluded from IPI already. > > It could be argued that not all APERF/MPERF capable systems have the > required BIOS information to enable frequency invariance support, but in > practice most of them do. So the APERF/MPERF sampling can be made > unconditional and just the frequency scale calculation for the scheduler > excluded. > > The following series consolidates that. I have used this patch set with the acpi-cpufreq, intel_cpufreq (passive), and intel_pstate (active) CPU frequency scaling drivers and various governors. Additionally, with HWP both enabled and disabled. For intel_pstate (active), both HWP enabled or disabled, the behaviour of scaling_cur_freq is inconsistent with prior to this patch set and other scaling driver governor combinations. Note there is no issue with " grep MHz /proc/cpuinfo" for any combination. Examples: No-HWP: active/powersave: doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 active/performance: doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 HWP: active/powersave: doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:799993 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800069 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800131 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:799844 active/performance: doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:4800186 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:4800016 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 Other configurations: intel_cpufreq /schedutil (no HWP), for example: doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:1067573 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800011 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:800109 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800000 Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz > Thanks, > > tglx > --- > arch/x86/include/asm/cpu.h | 2 > arch/x86/include/asm/topology.h | 17 - > arch/x86/kernel/acpi/cppc.c | 28 -- > arch/x86/kernel/cpu/aperfmperf.c | 474 +++++++++++++++++++++++++++++++-------- > arch/x86/kernel/cpu/proc.c | 2 > arch/x86/kernel/smpboot.c | 358 ----------------------------- > fs/proc/cpuinfo.c | 6 > include/linux/cpufreq.h | 1 > 8 files changed, 405 insertions(+), 483 deletions(-)
On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <dsmythies@telus.net> wrote: > > Hi Thomas, > > On 2022.04.15 12:20 Thomas Gleixner wrote: > > > APERF/MPERF is utilized in two ways: > > > > 1) Ad hoc readout of CPU frequency which requires IPIs > > > > 2) Frequency scale calculation for frequency invariant scheduling which > > reads APERF/MPERF on every tick. > > > > These are completely independent code parts. Eric observed long latencies > > when reading /proc/cpuinfo which reads out CPU frequency via #1 and > > proposed to replace the per CPU single IPI with a broadcast IPI. > > > > While this makes the latency smaller, it is not necessary at all because #2 > > samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs > > which are excluded from IPI already. > > > > It could be argued that not all APERF/MPERF capable systems have the > > required BIOS information to enable frequency invariance support, but in > > practice most of them do. So the APERF/MPERF sampling can be made > > unconditional and just the frequency scale calculation for the scheduler > > excluded. > > > > The following series consolidates that. > > I have used this patch set with the acpi-cpufreq, intel_cpufreq (passive), > and intel_pstate (active) CPU frequency scaling drivers and various > governors. Additionally, with HWP both enabled and disabled. > > For intel_pstate (active), both HWP enabled or disabled, the behaviour > of scaling_cur_freq is inconsistent with prior to this patch set and other > scaling driver governor combinations. > > Note there is no issue with " grep MHz /proc/cpuinfo" for any > combination. > > Examples: > > No-HWP: > > active/powersave: > doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq > /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418 > /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 > /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 > /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 > /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 > /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 > /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 > /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 > /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 > /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006 > /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005 > /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 That's because after the changes in this series scaling_cur_freq returns 0 if the given CPU is idle. I guess it could return the last known result, but that wouldn't be more meaningful.
Eric, On Tue, Apr 19 2022 at 08:51, Eric Dumazet wrote: > On Fri, Apr 15, 2022 at 12:19 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> It could be argued that not all APERF/MPERF capable systems have the >> required BIOS information to enable frequency invariance support, but in >> practice most of them do. So the APERF/MPERF sampling can be made >> unconditional and just the frequency scale calculation for the scheduler >> excluded. >> >> The following series consolidates that. >> > > Thanks a lot for working on that Thomas. > > I am not sure I will be able to backport this to a Google prodkernel, > as I guess there will be many merge conflicts. :) > Do you have by any chance this work available in a git branch ? git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/amperf Thanks, tglx
On Tue, Apr 19 2022 at 20:49, Rafael J. Wysocki wrote: > On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <dsmythies@telus.net> wrote: >> For intel_pstate (active), both HWP enabled or disabled, the behaviour >> of scaling_cur_freq is inconsistent with prior to this patch set and other >> scaling driver governor combinations. >> >> Note there is no issue with " grep MHz /proc/cpuinfo" for any >> combination. >> >> Examples: >> >> No-HWP: >> >> active/powersave: >> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq >> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418 >> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 >> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 >> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 >> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 >> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 >> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 >> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 >> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 >> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006 >> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005 >> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 > > That's because after the changes in this series scaling_cur_freq > returns 0 if the given CPU is idle. Which is sensible IMO as there is really no point in waking an idle CPU just to read those MSRs, then wait 20ms wake it up again to read those MSRs again. > I guess it could return the last known result, but that wouldn't be > more meaningful. Right. Thanks, tglx
On Tue, Apr 19, 2022 at 1:39 PM Thomas Gleixner <tglx@linutronix.de> wrote: > > > git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/amperf > Excellent, things look fine to me. Before: # grep MHz /proc/cpuinfo | sort | uniq -c 255 cpu MHz : 2249.998 1 cpu MHz : 3297.719 # grep MHz /proc/cpuinfo | sort|uniq -c 1 cpu MHz : 1590.400 1 cpu MHz : 1684.772 1 cpu MHz : 1693.890 1 cpu MHz : 1780.072 1 cpu MHz : 1784.513 1 cpu MHz : 1831.106 1 cpu MHz : 1880.344 1 cpu MHz : 1953.481 1 cpu MHz : 1980.636 1 cpu MHz : 2013.620 1 cpu MHz : 2219.617 240 cpu MHz : 2250.173 1 cpu MHz : 3292.206 1 cpu MHz : 3294.956 1 cpu MHz : 3297.653 1 cpu MHz : 3298.385 1 cpu MHz : 3300.197 Tested-by: Eric Dumazet <edumazet@google.com>
On Fri, Apr 15, 2022 at 09:19:48PM +0200, Thomas Gleixner wrote: > APERF/MPERF is utilized in two ways: > > 1) Ad hoc readout of CPU frequency which requires IPIs > > 2) Frequency scale calculation for frequency invariant scheduling which > reads APERF/MPERF on every tick. > > These are completely independent code parts. Eric observed long latencies > when reading /proc/cpuinfo which reads out CPU frequency via #1 and > proposed to replace the per CPU single IPI with a broadcast IPI. > > While this makes the latency smaller, it is not necessary at all because #2 > samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs > which are excluded from IPI already. > > It could be argued that not all APERF/MPERF capable systems have the > required BIOS information to enable frequency invariance support, but in > practice most of them do. So the APERF/MPERF sampling can be made > unconditional and just the frequency scale calculation for the scheduler > excluded. > > The following series consolidates that. Acked-by: Paul E. McKenney <paulmck@kernel.org> > Thanks, > > tglx > --- > arch/x86/include/asm/cpu.h | 2 > arch/x86/include/asm/topology.h | 17 - > arch/x86/kernel/acpi/cppc.c | 28 -- > arch/x86/kernel/cpu/aperfmperf.c | 474 +++++++++++++++++++++++++++++++-------- > arch/x86/kernel/cpu/proc.c | 2 > arch/x86/kernel/smpboot.c | 358 ----------------------------- > fs/proc/cpuinfo.c | 6 > include/linux/cpufreq.h | 1 > 8 files changed, 405 insertions(+), 483 deletions(-) > >
Hi Thomas, Rafael, Thank you for your replies. On 2022.04.19 14:11 Thomas Gleixner wrote: > On Tue, Apr 19 2022 at 20:49, Rafael J. Wysocki wrote: >> On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <dsmythies@telus.net> wrote: >>> For intel_pstate (active), both HWP enabled or disabled, the behaviour >>> of scaling_cur_freq is inconsistent with prior to this patch set and other >>> scaling driver governor combinations. >>> >>> Note there is no issue with " grep MHz /proc/cpuinfo" for any >>> combination. >>> >>> Examples: >>> >>> No-HWP: >>> >>> active/powersave: >>> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq >>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418 >>> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 >>> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006 >>> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005 >>> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 >> >> That's because after the changes in this series scaling_cur_freq >> returns 0 if the given CPU is idle. > > Which is sensible IMO as there is really no point in waking an idle CPU > just to read those MSRs, then wait 20ms wake it up again to read those > MSRs again. I totally agree. It is the inconsistency for what is displayed as a function of driver/governor that is my concern. > >> I guess it could return the last known result, but that wouldn't be >> more meaningful. > > Right. How about something like this, which I realize might break something else, but just to demonstrate: doug@s19:~/kernel/linux$ git diff diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 80f535cc8a75..a161e75794cd 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -710,7 +710,7 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf) else if (cpufreq_driver->setpolicy && cpufreq_driver->get) ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu)); else - ret = sprintf(buf, "%u\n", policy->cur); + ret = sprintf(buf, "%u\n", freq); return ret; } Note: I left the other 0 return condition, because I do not know what uses it. Which gives: acpi-cpufreq/schedutil doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:4100723 intel_pstate/powersave (no-HWP) doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800295 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800015 intel_cpufreq/schedutil (no-HWP) doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:1971265 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2785446 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 Which I suggest is more consistent. Note: because it was deleted from this thread, and just for reference, I'll repost the previous intel_cpufreq/schedutil (no-HWP) output: doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:1067573 /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800011 /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:800109 /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800000 /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800000 ... Doug
On Wed, Apr 20 2022 at 15:08, Doug Smythies wrote: > On 2022.04.19 14:11 Thomas Gleixner wrote: >>> That's because after the changes in this series scaling_cur_freq >>> returns 0 if the given CPU is idle. >> >> Which is sensible IMO as there is really no point in waking an idle CPU >> just to read those MSRs, then wait 20ms wake it up again to read those >> MSRs again. > > I totally agree. > It is the inconsistency for what is displayed as a function of driver/governor > that is my concern. Raphael suggested to move the show_cpuinfo() logic into the a/mperf code. See below. Thanks, tglx --- Subject: x86/aperfmperf: Integrate the fallback code from show_cpuinfo() From: Thomas Gleixner <tglx@linutronix.de> Date: Mon, 25 Apr 2022 15:19:29 +0200 Due to the avoidance of IPIs to idle CPUs arch_freq_get_on_cpu() can return 0 when the last sample was too long ago. show_cpuinfo() has a fallback to cpufreq_quick_get() and if that fails to return cpu_khz, but the readout code for the per CPU scaling frequency in sysfs does not. Move that fallback into arch_freq_get_on_cpu() so the behaviour is the same when reading /proc/cpuinfo and /sys/..../cur_scaling_freq. Suggested-by: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/kernel/cpu/aperfmperf.c | 10 +++++++--- arch/x86/kernel/cpu/proc.c | 7 +------ 2 files changed, 8 insertions(+), 9 deletions(-) --- a/arch/x86/kernel/cpu/aperfmperf.c +++ b/arch/x86/kernel/cpu/aperfmperf.c @@ -405,12 +405,12 @@ void arch_scale_freq_tick(void) unsigned int arch_freq_get_on_cpu(int cpu) { struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu); + unsigned int seq, freq; unsigned long last; - unsigned int seq; u64 acnt, mcnt; if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF)) - return 0; + goto fallback; do { seq = raw_read_seqcount_begin(&s->seq); @@ -424,9 +424,13 @@ unsigned int arch_freq_get_on_cpu(int cp * which covers idle and NOHZ full CPUs. */ if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE) - return 0; + goto fallback; return div64_u64((cpu_khz * acnt), mcnt); + +fallback: + freq = cpufreq_quick_get(cpu); + return freq ? freq : cpu_khz; } static int __init bp_init_aperfmperf(void) --- a/arch/x86/kernel/cpu/proc.c +++ b/arch/x86/kernel/cpu/proc.c @@ -86,12 +86,7 @@ static int show_cpuinfo(struct seq_file if (cpu_has(c, X86_FEATURE_TSC)) { unsigned int freq = arch_freq_get_on_cpu(cpu); - if (!freq) - freq = cpufreq_quick_get(cpu); - if (!freq) - freq = cpu_khz; - seq_printf(m, "cpu MHz\t\t: %u.%03u\n", - freq / 1000, (freq % 1000)); + seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000)); } /* Cache size */
On Mon, Apr 25, 2022 at 8:45 AM Thomas Gleixner <tglx@linutronix.de> wrote: > On Wed, Apr 20 2022 at 15:08, Doug Smythies wrote: >> On 2022.04.19 14:11 Thomas Gleixner wrote: >>>> That's because after the changes in this series scaling_cur_freq >>>> returns 0 if the given CPU is idle. >>> >>> Which is sensible IMO as there is really no point in waking an idle CPU >>> just to read those MSRs, then wait 20ms wake it up again to read those >>> MSRs again. >> >> I totally agree. >> It is the inconsistency for what is displayed as a function of driver/governor >> that is my concern. > > Raphael suggested to move the show_cpuinfo() logic into the a/mperf > code. See below. Hi Thomas, I tested the patch on top of your 10 patch set on kernel 5.18-rc3. It addresses my consistency concerns. Thank you ... Doug > --- > Subject: x86/aperfmperf: Integrate the fallback code from show_cpuinfo() > From: Thomas Gleixner <tglx@linutronix.de> > Date: Mon, 25 Apr 2022 15:19:29 +0200 > ...