mbox series

[00/10] x86/cpu: Consolidate APERF/MPERF code

Message ID 20220415133356.179706384@linutronix.de (mailing list archive)
Headers show
Series x86/cpu: Consolidate APERF/MPERF code | expand

Message

Thomas Gleixner April 15, 2022, 7:19 p.m. UTC
APERF/MPERF is utilized in two ways:

  1) Ad hoc readout of CPU frequency which requires IPIs

  2) Frequency scale calculation for frequency invariant scheduling which
     reads APERF/MPERF on every tick.

These are completely independent code parts. Eric observed long latencies
when reading /proc/cpuinfo which reads out CPU frequency via #1 and
proposed to replace the per CPU single IPI with a broadcast IPI.

While this makes the latency smaller, it is not necessary at all because #2
samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
which are excluded from IPI already.

It could be argued that not all APERF/MPERF capable systems have the
required BIOS information to enable frequency invariance support, but in
practice most of them do. So the APERF/MPERF sampling can be made
unconditional and just the frequency scale calculation for the scheduler
excluded.

The following series consolidates that.

Thanks,

	tglx
---
 arch/x86/include/asm/cpu.h       |    2 
 arch/x86/include/asm/topology.h  |   17 -
 arch/x86/kernel/acpi/cppc.c      |   28 --
 arch/x86/kernel/cpu/aperfmperf.c |  474 +++++++++++++++++++++++++++++++--------
 arch/x86/kernel/cpu/proc.c       |    2 
 arch/x86/kernel/smpboot.c        |  358 -----------------------------
 fs/proc/cpuinfo.c                |    6 
 include/linux/cpufreq.h          |    1 
 8 files changed, 405 insertions(+), 483 deletions(-)

Comments

Eric Dumazet April 19, 2022, 3:51 p.m. UTC | #1
On Fri, Apr 15, 2022 at 12:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> APERF/MPERF is utilized in two ways:
>
>   1) Ad hoc readout of CPU frequency which requires IPIs
>
>   2) Frequency scale calculation for frequency invariant scheduling which
>      reads APERF/MPERF on every tick.
>
> These are completely independent code parts. Eric observed long latencies
> when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> proposed to replace the per CPU single IPI with a broadcast IPI.
>
> While this makes the latency smaller, it is not necessary at all because #2
> samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> which are excluded from IPI already.
>
> It could be argued that not all APERF/MPERF capable systems have the
> required BIOS information to enable frequency invariance support, but in
> practice most of them do. So the APERF/MPERF sampling can be made
> unconditional and just the frequency scale calculation for the scheduler
> excluded.
>
> The following series consolidates that.
>

Thanks a lot for working on that Thomas.

I am not sure I will be able to backport this to a Google prodkernel,
as I guess there will be many merge conflicts.

Do you have by any chance this work available in a git branch ?

Thanks.



> Thanks,
>
>         tglx
> ---
>  arch/x86/include/asm/cpu.h       |    2
>  arch/x86/include/asm/topology.h  |   17 -
>  arch/x86/kernel/acpi/cppc.c      |   28 --
>  arch/x86/kernel/cpu/aperfmperf.c |  474 +++++++++++++++++++++++++++++++--------
>  arch/x86/kernel/cpu/proc.c       |    2
>  arch/x86/kernel/smpboot.c        |  358 -----------------------------
>  fs/proc/cpuinfo.c                |    6
>  include/linux/cpufreq.h          |    1
>  8 files changed, 405 insertions(+), 483 deletions(-)
>
>
Peter Zijlstra April 19, 2022, 4:41 p.m. UTC | #2
On Fri, Apr 15, 2022 at 09:19:48PM +0200, Thomas Gleixner wrote:
> ---
>  arch/x86/include/asm/cpu.h       |    2 
>  arch/x86/include/asm/topology.h  |   17 -
>  arch/x86/kernel/acpi/cppc.c      |   28 --
>  arch/x86/kernel/cpu/aperfmperf.c |  474 +++++++++++++++++++++++++++++++--------
>  arch/x86/kernel/cpu/proc.c       |    2 
>  arch/x86/kernel/smpboot.c        |  358 -----------------------------
>  fs/proc/cpuinfo.c                |    6 
>  include/linux/cpufreq.h          |    1 
>  8 files changed, 405 insertions(+), 483 deletions(-)


Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Doug Smythies April 19, 2022, 5:32 p.m. UTC | #3
Hi Thomas,

On 2022.04.15 12:20 Thomas Gleixner wrote:

> APERF/MPERF is utilized in two ways:
>
>  1) Ad hoc readout of CPU frequency which requires IPIs
>
>  2) Frequency scale calculation for frequency invariant scheduling which
>     reads APERF/MPERF on every tick.
>
> These are completely independent code parts. Eric observed long latencies
> when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> proposed to replace the per CPU single IPI with a broadcast IPI.
>
> While this makes the latency smaller, it is not necessary at all because #2
> samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> which are excluded from IPI already.
>
> It could be argued that not all APERF/MPERF capable systems have the
> required BIOS information to enable frequency invariance support, but in
> practice most of them do. So the APERF/MPERF sampling can be made
> unconditional and just the frequency scale calculation for the scheduler
> excluded.
>
> The following series consolidates that.

I have used this patch set with the acpi-cpufreq, intel_cpufreq (passive),
and intel_pstate (active) CPU frequency scaling drivers and various
governors. Additionally, with HWP both enabled and disabled.

For intel_pstate (active), both HWP enabled or disabled, the behaviour
of scaling_cur_freq is inconsistent with prior to this patch set and other
scaling driver governor combinations.

Note there is no issue with " grep MHz /proc/cpuinfo" for any
combination.

Examples:

No-HWP:

active/powersave:
doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

active/performance:
doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

HWP:

active/powersave:
doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:799993
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800069
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800131
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:799844

active/performance:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:4800186
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:4800016
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0 

Other configurations:
intel_cpufreq /schedutil (no HWP), for example:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:1067573
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800011
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:800109
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800000

Processor: Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz

> Thanks,
>
>	tglx
> ---
> arch/x86/include/asm/cpu.h       |    2 
> arch/x86/include/asm/topology.h  |   17 -
> arch/x86/kernel/acpi/cppc.c      |   28 --
> arch/x86/kernel/cpu/aperfmperf.c |  474 +++++++++++++++++++++++++++++++--------
> arch/x86/kernel/cpu/proc.c       |    2 
> arch/x86/kernel/smpboot.c        |  358 -----------------------------
> fs/proc/cpuinfo.c                |    6 
> include/linux/cpufreq.h          |    1 
> 8 files changed, 405 insertions(+), 483 deletions(-)
Rafael J. Wysocki April 19, 2022, 6:49 p.m. UTC | #4
On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <dsmythies@telus.net> wrote:
>
> Hi Thomas,
>
> On 2022.04.15 12:20 Thomas Gleixner wrote:
>
> > APERF/MPERF is utilized in two ways:
> >
> >  1) Ad hoc readout of CPU frequency which requires IPIs
> >
> >  2) Frequency scale calculation for frequency invariant scheduling which
> >     reads APERF/MPERF on every tick.
> >
> > These are completely independent code parts. Eric observed long latencies
> > when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> > proposed to replace the per CPU single IPI with a broadcast IPI.
> >
> > While this makes the latency smaller, it is not necessary at all because #2
> > samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> > which are excluded from IPI already.
> >
> > It could be argued that not all APERF/MPERF capable systems have the
> > required BIOS information to enable frequency invariance support, but in
> > practice most of them do. So the APERF/MPERF sampling can be made
> > unconditional and just the frequency scale calculation for the scheduler
> > excluded.
> >
> > The following series consolidates that.
>
> I have used this patch set with the acpi-cpufreq, intel_cpufreq (passive),
> and intel_pstate (active) CPU frequency scaling drivers and various
> governors. Additionally, with HWP both enabled and disabled.
>
> For intel_pstate (active), both HWP enabled or disabled, the behaviour
> of scaling_cur_freq is inconsistent with prior to this patch set and other
> scaling driver governor combinations.
>
> Note there is no issue with " grep MHz /proc/cpuinfo" for any
> combination.
>
> Examples:
>
> No-HWP:
>
> active/powersave:
> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

That's because after the changes in this series scaling_cur_freq
returns 0 if the given CPU is idle.

I guess it could return the last known result, but that wouldn't be
more meaningful.
Thomas Gleixner April 19, 2022, 8:39 p.m. UTC | #5
Eric,

On Tue, Apr 19 2022 at 08:51, Eric Dumazet wrote:
> On Fri, Apr 15, 2022 at 12:19 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> It could be argued that not all APERF/MPERF capable systems have the
>> required BIOS information to enable frequency invariance support, but in
>> practice most of them do. So the APERF/MPERF sampling can be made
>> unconditional and just the frequency scale calculation for the scheduler
>> excluded.
>>
>> The following series consolidates that.
>>
>
> Thanks a lot for working on that Thomas.
>
> I am not sure I will be able to backport this to a Google prodkernel,
> as I guess there will be many merge conflicts.

:)

> Do you have by any chance this work available in a git branch ?

 git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/amperf

Thanks,

        tglx
Thomas Gleixner April 19, 2022, 9:11 p.m. UTC | #6
On Tue, Apr 19 2022 at 20:49, Rafael J. Wysocki wrote:
> On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <dsmythies@telus.net> wrote:
>> For intel_pstate (active), both HWP enabled or disabled, the behaviour
>> of scaling_cur_freq is inconsistent with prior to this patch set and other
>> scaling driver governor combinations.
>>
>> Note there is no issue with " grep MHz /proc/cpuinfo" for any
>> combination.
>>
>> Examples:
>>
>> No-HWP:
>>
>> active/powersave:
>> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
>> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
>> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
>> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
>> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0
>
> That's because after the changes in this series scaling_cur_freq
> returns 0 if the given CPU is idle.

Which is sensible IMO as there is really no point in waking an idle CPU
just to read those MSRs, then wait 20ms wake it up again to read those
MSRs again.

> I guess it could return the last known result, but that wouldn't be
> more meaningful.

Right.

Thanks,

        tglx
Eric Dumazet April 19, 2022, 9:20 p.m. UTC | #7
On Tue, Apr 19, 2022 at 1:39 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
>  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/amperf
>

Excellent, things look fine to me.

Before:
# grep MHz /proc/cpuinfo | sort | uniq -c
    255 cpu MHz : 2249.998
      1 cpu MHz : 3297.719

# grep MHz /proc/cpuinfo | sort|uniq -c
      1 cpu MHz : 1590.400
      1 cpu MHz : 1684.772
      1 cpu MHz : 1693.890
      1 cpu MHz : 1780.072
      1 cpu MHz : 1784.513
      1 cpu MHz : 1831.106
      1 cpu MHz : 1880.344
      1 cpu MHz : 1953.481
      1 cpu MHz : 1980.636
      1 cpu MHz : 2013.620
      1 cpu MHz : 2219.617
    240 cpu MHz : 2250.173
      1 cpu MHz : 3292.206
      1 cpu MHz : 3294.956
      1 cpu MHz : 3297.653
      1 cpu MHz : 3298.385
      1 cpu MHz : 3300.197

Tested-by: Eric Dumazet <edumazet@google.com>
Paul E. McKenney April 19, 2022, 9:56 p.m. UTC | #8
On Fri, Apr 15, 2022 at 09:19:48PM +0200, Thomas Gleixner wrote:
> APERF/MPERF is utilized in two ways:
> 
>   1) Ad hoc readout of CPU frequency which requires IPIs
> 
>   2) Frequency scale calculation for frequency invariant scheduling which
>      reads APERF/MPERF on every tick.
> 
> These are completely independent code parts. Eric observed long latencies
> when reading /proc/cpuinfo which reads out CPU frequency via #1 and
> proposed to replace the per CPU single IPI with a broadcast IPI.
> 
> While this makes the latency smaller, it is not necessary at all because #2
> samples APERF/MPERF periodically, except on idle or isolated NOHZ full CPUs
> which are excluded from IPI already.
> 
> It could be argued that not all APERF/MPERF capable systems have the
> required BIOS information to enable frequency invariance support, but in
> practice most of them do. So the APERF/MPERF sampling can be made
> unconditional and just the frequency scale calculation for the scheduler
> excluded.
> 
> The following series consolidates that.

Acked-by: Paul E. McKenney <paulmck@kernel.org>

> Thanks,
> 
> 	tglx
> ---
>  arch/x86/include/asm/cpu.h       |    2 
>  arch/x86/include/asm/topology.h  |   17 -
>  arch/x86/kernel/acpi/cppc.c      |   28 --
>  arch/x86/kernel/cpu/aperfmperf.c |  474 +++++++++++++++++++++++++++++++--------
>  arch/x86/kernel/cpu/proc.c       |    2 
>  arch/x86/kernel/smpboot.c        |  358 -----------------------------
>  fs/proc/cpuinfo.c                |    6 
>  include/linux/cpufreq.h          |    1 
>  8 files changed, 405 insertions(+), 483 deletions(-)
> 
>
Doug Smythies April 20, 2022, 10:08 p.m. UTC | #9
Hi Thomas, Rafael,

Thank you for your replies.

On 2022.04.19 14:11 Thomas Gleixner wrote:
> On Tue, Apr 19 2022 at 20:49, Rafael J. Wysocki wrote:
>> On Tue, Apr 19, 2022 at 7:32 PM Doug Smythies <dsmythies@telus.net> wrote:
>>> For intel_pstate (active), both HWP enabled or disabled, the behaviour
>>> of scaling_cur_freq is inconsistent with prior to this patch set and other
>>> scaling driver governor combinations.
>>>
>>> Note there is no issue with " grep MHz /proc/cpuinfo" for any
>>> combination.
>>>
>>> Examples:
>>>
>>> No-HWP:
>>>
>>> active/powersave:
>>> doug@s19:~/freq-scalers/trace$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
>>> /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:2300418
>>> /sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
>>> /sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:2300006
>>> /sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2300005
>>> /sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0
>>
>> That's because after the changes in this series scaling_cur_freq
>> returns 0 if the given CPU is idle.
>
> Which is sensible IMO as there is really no point in waking an idle CPU
> just to read those MSRs, then wait 20ms wake it up again to read those
> MSRs again.

I totally agree.
It is the inconsistency for what is displayed as a function of driver/governor
that is my concern.

>
>> I guess it could return the last known result, but that wouldn't be
>> more meaningful.
>
> Right.

How about something like this, which I realize might break something else,
but just to demonstrate:

doug@s19:~/kernel/linux$ git diff
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 80f535cc8a75..a161e75794cd 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -710,7 +710,7 @@ static ssize_t show_scaling_cur_freq(struct cpufreq_policy *policy, char *buf)
        else if (cpufreq_driver->setpolicy && cpufreq_driver->get)
                ret = sprintf(buf, "%u\n", cpufreq_driver->get(policy->cpu));
        else
-               ret = sprintf(buf, "%u\n", policy->cur);
+               ret = sprintf(buf, "%u\n", freq);
        return ret;
 }

Note: I left the other 0 return condition, because I do not know what uses it.

Which gives:

acpi-cpufreq/schedutil
doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:4100723

intel_pstate/powersave (no-HWP)
doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800295
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800015

intel_cpufreq/schedutil (no-HWP)
doug@s19:~/kernel/linux$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:1971265
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:0
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:2785446
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:0

Which I suggest is more consistent.

Note: because it was deleted from this thread, 
and just for reference, I'll repost the previous
intel_cpufreq/schedutil (no-HWP) output:

doug@s19:~$ grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu10/cpufreq/scaling_cur_freq:1067573
/sys/devices/system/cpu/cpu11/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq:800011
/sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu5/cpufreq/scaling_cur_freq:800109
/sys/devices/system/cpu/cpu6/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu7/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu8/cpufreq/scaling_cur_freq:800000
/sys/devices/system/cpu/cpu9/cpufreq/scaling_cur_freq:800000

... Doug
Thomas Gleixner April 25, 2022, 3:45 p.m. UTC | #10
On Wed, Apr 20 2022 at 15:08, Doug Smythies wrote:
> On 2022.04.19 14:11 Thomas Gleixner wrote:
>>> That's because after the changes in this series scaling_cur_freq
>>> returns 0 if the given CPU is idle.
>>
>> Which is sensible IMO as there is really no point in waking an idle CPU
>> just to read those MSRs, then wait 20ms wake it up again to read those
>> MSRs again.
>
> I totally agree.
> It is the inconsistency for what is displayed as a function of driver/governor
> that is my concern.

Raphael suggested to move the show_cpuinfo() logic into the a/mperf
code. See below.

Thanks,

        tglx
---
Subject: x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
From: Thomas Gleixner <tglx@linutronix.de>
Date: Mon, 25 Apr 2022 15:19:29 +0200

Due to the avoidance of IPIs to idle CPUs arch_freq_get_on_cpu() can return
0 when the last sample was too long ago.

show_cpuinfo() has a fallback to cpufreq_quick_get() and if that fails to
return cpu_khz, but the readout code for the per CPU scaling frequency in
sysfs does not.

Move that fallback into arch_freq_get_on_cpu() so the behaviour is the same
when reading /proc/cpuinfo and /sys/..../cur_scaling_freq.

Suggested-by: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/cpu/aperfmperf.c |   10 +++++++---
 arch/x86/kernel/cpu/proc.c       |    7 +------
 2 files changed, 8 insertions(+), 9 deletions(-)

--- a/arch/x86/kernel/cpu/aperfmperf.c
+++ b/arch/x86/kernel/cpu/aperfmperf.c
@@ -405,12 +405,12 @@ void arch_scale_freq_tick(void)
 unsigned int arch_freq_get_on_cpu(int cpu)
 {
 	struct aperfmperf *s = per_cpu_ptr(&cpu_samples, cpu);
+	unsigned int seq, freq;
 	unsigned long last;
-	unsigned int seq;
 	u64 acnt, mcnt;
 
 	if (!cpu_feature_enabled(X86_FEATURE_APERFMPERF))
-		return 0;
+		goto fallback;
 
 	do {
 		seq = raw_read_seqcount_begin(&s->seq);
@@ -424,9 +424,13 @@ unsigned int arch_freq_get_on_cpu(int cp
 	 * which covers idle and NOHZ full CPUs.
 	 */
 	if (!mcnt || (jiffies - last) > MAX_SAMPLE_AGE)
-		return 0;
+		goto fallback;
 
 	return div64_u64((cpu_khz * acnt), mcnt);
+
+fallback:
+	freq = cpufreq_quick_get(cpu);
+	return freq ? freq : cpu_khz;
 }
 
 static int __init bp_init_aperfmperf(void)
--- a/arch/x86/kernel/cpu/proc.c
+++ b/arch/x86/kernel/cpu/proc.c
@@ -86,12 +86,7 @@ static int show_cpuinfo(struct seq_file
 	if (cpu_has(c, X86_FEATURE_TSC)) {
 		unsigned int freq = arch_freq_get_on_cpu(cpu);
 
-		if (!freq)
-			freq = cpufreq_quick_get(cpu);
-		if (!freq)
-			freq = cpu_khz;
-		seq_printf(m, "cpu MHz\t\t: %u.%03u\n",
-			   freq / 1000, (freq % 1000));
+		seq_printf(m, "cpu MHz\t\t: %u.%03u\n", freq / 1000, (freq % 1000));
 	}
 
 	/* Cache size */
Doug Smythies April 25, 2022, 11:20 p.m. UTC | #11
On Mon, Apr 25, 2022 at 8:45 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, Apr 20 2022 at 15:08, Doug Smythies wrote:
>> On 2022.04.19 14:11 Thomas Gleixner wrote:
>>>> That's because after the changes in this series scaling_cur_freq
>>>> returns 0 if the given CPU is idle.
>>>
>>> Which is sensible IMO as there is really no point in waking an idle CPU
>>> just to read those MSRs, then wait 20ms wake it up again to read those
>>> MSRs again.
>>
>> I totally agree.
>> It is the inconsistency for what is displayed as a function of driver/governor
>> that is my concern.
>
> Raphael suggested to move the show_cpuinfo() logic into the a/mperf
> code. See below.

Hi Thomas,

I tested the patch on top of your 10 patch set on kernel 5.18-rc3.
It addresses my consistency concerns.

Thank you

... Doug

> ---
> Subject: x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
> From: Thomas Gleixner <tglx@linutronix.de>
> Date: Mon, 25 Apr 2022 15:19:29 +0200
>
...