[v3,00/21] cpufreq: introduce a new AMD CPU frequency control mechanism

Message ID 20211029130241.1984459-1-ray.huang@amd.com (mailing list archive)

Message

Huang Rui Oct. 29, 2021, 1:02 p.m. UTC
Hi all,

We would like to introduce a new AMD CPU frequency control mechanism, the
"amd-pstate" driver, for modern AMD Zen based CPU series in the Linux kernel.
The new mechanism is based on Collaborative Processor Performance Control
(CPPC), which provides finer-grained frequency management than legacy ACPI
hardware P-States. Current AMD CPU platforms use the ACPI P-states driver to
manage CPU frequency and clocks, switching among only 3 P-states. AMD
P-States is intended to replace the ACPI P-states controls and provides a
flexible, low-latency interface for the Linux kernel to communicate
performance hints directly to the hardware.

"amd-pstate" leverages the Linux kernel governors such as *schedutil*,
*ondemand*, etc. to manage the performance hints provided by the CPPC
hardware functionality. The first version of amd-pstate supports one
of the Zen3 processors, and we will support more in the future after we
verify the hardware and SBIOS functionalities.

There are two types of hardware implementations for amd-pstate: one is full
MSR support and the other is shared memory support. The
X86_FEATURE_AMD_CPPC feature flag can be used to distinguish between the
two types.

Using the new AMD P-States method together with the kernel governors
(*schedutil*, *ondemand*, ...) to manage frequency updates is the most
appropriate bridge between AMD Zen based hardware and the Linux kernel: the
processor is able to adjust to the most efficient frequency according to
the kernel scheduler load.

Performance Per Watt (PPW) Calculation:

The PPW calculation follows the paper below:
https://software.intel.com/content/dam/develop/external/us/en/documents/performance-per-what-paper.pdf

The formula below, taken from that paper, is used to measure the PPW:

(F / t) / P = F * t / (t * E) = F / E,

"F" is the number of frames per second.
"t" is the elapsed time, so that P = E / t.
"P" is power measured in watts.
"E" is energy measured in joules.

We use the RAPL interface with the "perf" tool to get the energy data of
the package.
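
As a rough illustration of the formula and the RAPL measurement described
above, the sketch below extracts the package energy from the text that
"perf stat -e power/energy-pkg/" prints and computes the PPW in both forms.
The helper names and all numbers are made up for illustration; they are not
measurements from this series, and F is treated here simply as the amount
of work completed during the run.

```python
import re

# Matches the energy line that `perf stat -e power/energy-pkg/` prints,
# for example "     63.79 Joules power/energy-pkg/".
ENERGY_LINE = re.compile(
    r"^\s*([\d.,]+)\s+Joules\s+power/energy-pkg/", re.MULTILINE
)


def package_energy_joules(perf_output):
    """Extract E (package energy in joules) from perf stat output."""
    match = ENERGY_LINE.search(perf_output)
    if match is None:
        raise ValueError("no power/energy-pkg/ line found")
    # perf may print thousands separators depending on the locale.
    return float(match.group(1).replace(",", ""))


def ppw_from_power(work, seconds, watts):
    """(F / t) / P: throughput divided by average power."""
    return (work / seconds) / watts


def ppw_from_energy(work, joules):
    """F / E: the same quantity expressed with total energy."""
    return work / joules


# Hypothetical sample output, for illustration only.
SAMPLE_PERF_OUTPUT = """
 Performance counter stats for 'system wide':

          1,500.00 Joules power/energy-pkg/

      60.000000000 seconds time elapsed
"""

work = 6000.0      # F: work completed during the run (hypothetical)
seconds = 60.0     # t: run time in seconds (hypothetical)
joules = package_energy_joules(SAMPLE_PERF_OUTPUT)   # E
watts = joules / seconds                             # P = E / t

# Both forms of the formula give the same value.
print(ppw_from_power(work, seconds, watts), ppw_from_energy(work, joules))
```

Because P = E / t, the run time cancels out, which is why only the work
done and the energy counter are needed in practice.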

The comparisons between the amd-pstate and acpi-cpufreq modules were run
on an AMD Cezanne processor:

1) TBench CPU benchmark:

+---------------------------------------------------------------------+
|                                                                     |
|               TBench (Performance Per Watt)                         |
|                                                    Higher is better |
+-------------------+------------------------+------------------------+
|                   |  Performance Per Watt  |  Performance Per Watt  |
|   Kernel Module   |       (Schedutil)      |       (Ondemand)       |
|                   |  Unit: MB / (s * J)    |  Unit: MB / (s * J)    |
+-------------------+------------------------+------------------------+
|                   |                        |                        |
|    acpi-cpufreq   |         3.022          |        2.969           |
|                   |                        |                        |
+-------------------+------------------------+------------------------+
|                   |                        |                        |
|     amd-pstate    |         3.131          |        3.284           |
|                   |                        |                        |
+-------------------+------------------------+------------------------+

2) Gitsource CPU benchmark:

+---------------------------------------------------------------------+
|                                                                     |
|               Gitsource (Performance Per Watt)                      |
|                                                    Higher is better |
+-------------------+------------------------+------------------------+
|                   |  Performance Per Watt  |  Performance Per Watt  |
|   Kernel Module   |       (Schedutil)      |       (Ondemand)       |
|                   |  Unit: 1 / (s * J)     |  Unit: 1 / (s * J)     |
+-------------------+------------------------+------------------------+
|                   |                        |                        |
|    acpi-cpufreq   |     3.42172E-07        |     2.74508E-07        |
|                   |                        |                        |
+-------------------+------------------------+------------------------+
|                   |                        |                        |
|     amd-pstate    |     4.09141E-07        |     3.47610E-07        |
|                   |                        |                        |
+-------------------+------------------------+------------------------+

3) Speedometer 2.0 CPU benchmark:

+---------------------------------------------------------------------+
|                                                                     |
|               Speedometer 2.0 (Performance Per Watt)                |
|                                                    Higher is better |
+-------------------+------------------------+------------------------+
|                   |  Performance Per Watt  |  Performance Per Watt  |
|   Kernel Module   |       (Schedutil)      |       (Ondemand)       |
|                   |  Unit: 1 / (s * J)     |  Unit: 1 / (s * J)     |
+-------------------+------------------------+------------------------+
|                   |                        |                        |
|    acpi-cpufreq   |      0.116111767       |      0.110321664       |
|                   |                        |                        |
+-------------------+------------------------+------------------------+
|                   |                        |                        |
|     amd-pstate    |      0.115825281       |      0.122024299       |
|                   |                        |                        |
+-------------------+------------------------+------------------------+


According to the average data above, this solution shows better
performance per watt scaling on mobile CPU benchmarks in most cases.

This patch series depends on the "hotplug capable" CPU fix below (only a
few CPU parts with "un-hotplug" cores will encounter the issue, and Mario
is working on the fix):
https://lore.kernel.org/linux-pm/20210813161842.222414-1-mario.limonciello@amd.com/

The patch series is available in the git repo below:
V1: https://git.kernel.org/pub/scm/linux/kernel/git/rui/linux.git/log/?h=amd-pstate-dev-v1
V2: https://git.kernel.org/pub/scm/linux/kernel/git/rui/linux.git/log/?h=amd-pstate-dev-v2
V3: https://git.kernel.org/pub/scm/linux/kernel/git/rui/linux.git/log/?h=amd-pstate-dev-v3

For a detailed introduction, please see patch 19.

Changes from V1 -> V2:
- cpufreq:
- - Add detailed description in the commit log.
- - Clean up the "extension" postfix in the x86 feature flag.
- - Revise cppc_set_enable helper.
- - Add a fix to check online cpus in cppc_acpi.
- - Use static calls to avoid retpolines.
- - Revise the comment style.
- - Remove amd_pstate_boost_supported() function.
- - Revise the return value in sysfs attribute functions.
- cpupower:
- - Refine the commit log for cpupower patches.
- - Expose a function to get the sysfs value from specific table.
- - Move amd-pstate sysfs definitions and functions into amd helper file.
- - Move the boost init function into amd helper file and explain the
  details in the commit log.
- - Remove the amd_pstate_get_data in the lib/cpufreq.c to keep the lib as
  common operations.
- - Move print_speed function into misc helper file.
- - Add amd_pstate_show_perf_and_freq() function in amd helper for
  cpufreq-info print.

Changes from V2 -> V3:
- cpufreq:
- - Add a patch from Steven to add SystemIO register support in the cppc
  lib. (Thanks for verifying the driver on his platform.)
- - Update online cpu mask to present cpu.
- - Enhance cppc_set_enable to cover all valid use cases.
- - Add more description in the Kconfig definition.
- - Clean up some redundant functions and data members.
- - Revise amd-pstate trace event prints.
- - Move the amd-pstate traces into the power trace system and set the
  driver as built-in instead of a module.
- - Clean up the duplicated sysfs with core cpufreq driver.
- - Revise the amd-pstate RST documentation.
- cpupower:
- - Revise the cpupower_amd_pstate_enabled() function to use
  cpufreq_get_driver helper instead of read sysfs.
- - Clean up the amd-pstate max/min frequency APIs, because they are
  actually the same with cpufreq info sysfs.

Thanks,
Ray

Huang Rui (18):
  x86/cpufreatures: add AMD Collaborative Processor Performance Control
    feature flag
  x86/msr: add AMD CPPC MSR definitions
  cpufreq: amd: introduce a new amd pstate driver to support future
    processors
  cpufreq: amd: add fast switch function for amd-pstate
  cpufreq: amd: add acpi cppc function as the backend for legacy
    processors
  cpufreq: amd: add trace for amd-pstate module
  cpufreq: amd: add boost mode support for amd-pstate
  cpufreq: amd: add amd-pstate frequencies attributes
  cpufreq: amd: add amd-pstate performance attributes
  cpupower: add AMD P-state capability flag
  cpupower: add the function to check amd-pstate enabled
  cpupower: initial AMD P-state capability
  cpupower: add the function to get the sysfs value from specific table
  cpupower: add amd-pstate sysfs definition and access helper
  cpupower: enable boost state support for amd-pstate module
  cpupower: move print_speed function into misc helper
  cpupower: print amd-pstate information on cpupower
  Documentation: amd-pstate: add amd-pstate driver introduction

Jinzhou Su (1):
  ACPI: CPPC: add cppc enable register function

Mario Limonciello (1):
  ACPI: CPPC: Check present CPUs for determining _CPC is valid

Steven Noonan (1):
  ACPI: CPPC: implement support for SystemIO registers

 Documentation/admin-guide/pm/amd-pstate.rst   | 373 ++++++++++
 .../admin-guide/pm/working-state.rst          |   1 +
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/msr-index.h              |  17 +
 drivers/acpi/cppc_acpi.c                      |  93 ++-
 drivers/cpufreq/Kconfig.x86                   |  17 +
 drivers/cpufreq/Makefile                      |   1 +
 drivers/cpufreq/amd-pstate.c                  | 663 ++++++++++++++++++
 include/acpi/cppc_acpi.h                      |   5 +
 include/trace/events/power.h                  |  46 ++
 tools/power/cpupower/lib/cpufreq.c            |  21 +-
 tools/power/cpupower/lib/cpufreq.h            |  12 +
 tools/power/cpupower/utils/cpufreq-info.c     |  68 +-
 tools/power/cpupower/utils/helpers/amd.c      |  87 +++
 tools/power/cpupower/utils/helpers/cpuid.c    |  13 +
 tools/power/cpupower/utils/helpers/helpers.h  |  22 +
 tools/power/cpupower/utils/helpers/misc.c     |  62 ++
 17 files changed, 1441 insertions(+), 61 deletions(-)
 create mode 100644 Documentation/admin-guide/pm/amd-pstate.rst
 create mode 100644 drivers/cpufreq/amd-pstate.c

Comments

Giovanni Gherdovich Nov. 4, 2021, 4:40 p.m. UTC | #1
On Fri, 2021-10-29 at 21:02 +0800, Huang Rui wrote:
> Hi all,
> 
> We would like to introduce a new AMD CPU frequency control mechanism as the
> "amd-pstate" driver for modern AMD Zen based CPU series in Linux Kernel.
> 
> ..snip..

Hello,

I've tested this driver and it seems the results are a little underwhelming.
The test machine is a two sockets server with two AMD EPYC 7713,
family:model:stepping 25:1:1, 128 cores/256 threads, 256G of memory and SSD
storage. On this system, the amd-pstate driver works only in "shared memory
support", not in "full MSR support", meaning that frequency switches are
triggered from a workqueue instead of scheduler context (!fast_switch).

Dbench sees some ludicrous improvements in both performance and performance
per watt; likewise netperf sees some modest improvements, but that's about
the only good news. Schedutil/ondemand on tbench and hackbench do worse
with amd-pstate than acpi-cpufreq. I don't have data for
ondemand/amd-pstate on kernbench and gitsource, but schedutil regresses on
both.

Here are the tables, then some questions & discussion points.

Tilde (~) means the result is the same as baseline (that is, the ratio is close to 1).
"Sugov" means "schedutil governor", "perfgov" means "performance governor".

             :        acpi-cpufreq          :        amd-pstate          :
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
             :  ondemand  sugov  perfgov    :  ondemand  sugov  perfgov  :  better if
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                       PERFORMANCE RATIOS
dbench       :  1.00      ~      0.33       :  0.37      0.35   0.36     :  lower
netperf      :  1.00      0.97   ~          :  1.03      1.04   ~        :  higher
tbench       :  1.00      1.04   1.06       :  0.83      0.40   1.05     :  higher
hackbench    :  1.00      ~      1.03       :  1.09      1.42   1.03     :  lower
kernbench    :  1.00      0.96   0.97       :  N/A       1.08   ~        :  lower
gitsource    :  1.00      0.67   0.69       :  N/A       0.79   0.67     :  lower
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
                                  PERFORMANCE-PER-WATT RATIOS
dbench4      :  1.00      ~      3.37       :  2.68      3.12   3.03     :  higher
netperf      :  1.00      0.96   ~          :  1.09      1.06   ~        :  higher
tbench4      :  1.00      1.03   1.06       :  0.76      0.34   1.04     :  higher
hackbench    :  1.00      ~      0.95       :  0.88      0.65   0.96     :  higher
kernbench    :  1.00      1.06   1.05       :  N/A       0.93   1.05     :  higher
gitsource    :  1.00      1.53   1.50       :  N/A       1.33   1.55     :  higher


How to read the table: all numbers are ratios of the results of some
governor/driver combination and ondemand/acpi-cpufreq, which is the
baseline (first column). When the "better if" column says "higher", a ratio
larger than 1 indicates an improvement; otherwise it's a regression.
Example: hackbench with sugov/amd-pstate is 42% slower than with
ondemand/acpi-cpufreq (top table). At the same time, it's also 35% less
efficient (bottom table).
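
The reading rule above can be written down as a small helper. This is only
a sketch; the function name "describe" is made up for illustration, and the
inputs are ratios taken from the tables.

```python
def describe(ratio, better_if):
    """Turn a ratio against the baseline into a percentage statement.

    better_if is the table's last column: "higher" or "lower".
    """
    if better_if not in ("higher", "lower"):
        raise ValueError("better_if must be 'higher' or 'lower'")
    change = round(abs(ratio - 1.0) * 100)
    if change == 0:
        return "same as baseline"
    if better_if == "higher":
        improved = ratio > 1.0
    else:
        improved = ratio < 1.0
    verdict = "better" if improved else "worse"
    return f"{change}% {verdict} than baseline"


# hackbench with sugov/amd-pstate, performance table (lower is better)
print(describe(1.42, "lower"))
# hackbench with sugov/amd-pstate, performance-per-watt table
print(describe(0.65, "higher"))
# dbench with ondemand/amd-pstate, performance table (lower is better)
print(describe(0.37, "lower"))
```

The first two calls reproduce the hackbench example from the paragraph
above: 42% worse on performance and 35% worse on efficiency.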

Now, some questions / possible troubleshooting directions:

- ACPI-CPUFREQ DRIVER: REQUESTS ARE HINTS OR MANDATES?
  When using acpi-cpufreq, and the OS requests some frequency (one of the
  three allowed P-States), does the hardware underneath stick to it? Or
  does it do some ulterior adjustment based on the load?
  This would tell if a machine using acpi-cpufreq is less dumb than it
  seems, and can in principle do fine-grain adjustments all the same.

- PROCESSING CPPC DOORBELL REQUESTS: HOW FAST IS THAT?
  How long does it take the hardware to process the CPPC doorbell
  request to change frequency? What happens to outstanding requests, if
  they're not processed in a timely manner? Is there any queue of requests,
  and if so, how long is it? Could it be that if requests come in too quickly
  the CPU ends up playing catch-up on freq switches that are obsolete or
  redundant?

- LIKE-FOR-LIKE: TRY BENCHMARKING WITH AMD-PSTATE LIMITED TO 3 P-STATES?
  Could it be that to study the performance of the "shared memory support"
  system against acpi-cpufreq a more like-for-like comparison would be to limit
  amd-pstate to only the 3 P-States available to acpi-cpufreq? That would be
  for experimental/benchmarking purposes only. Eg: on my machines acpi-cpufreq
  sees 1.5GHz, 1.7GHz and 2GHz. Given that max boost is 3.72GHz, and the CPPC
  range is the abstract interval 0..255, I could limit amd-pstate to only set
  performance level of 68, 102 and 137, and see what it gives against the old
  driver. What do you think?

- PROCESSING CPPC DOORBELL REQS IS SLOW. BUT /MAKING/ A REQUEST, SLOW TOO?
  Looks to me that with the "shared memory support" the frequency update
  process is doubly asynchronous: first we have the ->target() callback
  deferred to a workqueue, then when it's eventually executed, it calls
  cppc_update_perf() which again just asks the firmware to do work at a
  later time. Are we sure that cppc_update_perf() is actually so slow to
  warrant !fast_switch?

- HOW MANY P-STATES ARE TOO MANY?
  I've always believed the contrary, but what if having too many P-States is
  harmful for both performance and efficiency? Maybe the governor is
  requesting many updates in small increments where fewer (and larger) updates
  would be more appropriate?
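
For the like-for-like item above, the arithmetic of mapping a frequency to
the abstract 0..255 range can be sketched as follows. The linear,
truncating mapping and the function name "perf_level" are assumptions made
for illustration only; the real relation between CPPC performance levels
and frequency is defined by the platform's _CPC data and may differ.

```python
def perf_level(freq_ghz, max_boost_ghz=3.72, max_perf=255):
    """Map a frequency to an abstract perf level, assuming linear scaling."""
    if not 0 <= freq_ghz <= max_boost_ghz:
        raise ValueError("frequency outside the 0..max_boost range")
    return int(freq_ghz / max_boost_ghz * max_perf)


# The three P-State frequencies that acpi-cpufreq sees on the test machine.
levels = [perf_level(f) for f in (1.5, 1.7, 2.0)]
print(levels)
```

Under this assumption the three frequencies map to 102, 116 and 137. Two of
these agree with the levels mentioned above, while 68 would correspond to
roughly 1.0GHz, so the levels to use in such an experiment are worth
double-checking against the platform data.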


Thanks,
Giovanni
Huang Rui Nov. 5, 2021, 4:09 p.m. UTC | #2
On Fri, Nov 05, 2021 at 12:40:18AM +0800, Giovanni Gherdovich wrote:
> On Fri, 2021-10-29 at 21:02 +0800, Huang Rui wrote:
> > Hi all,
> > 
> > We would like to introduce a new AMD CPU frequency control mechanism as the
> > "amd-pstate" driver for modern AMD Zen based CPU series in Linux Kernel.
> > 
> > ..snip..
> 
> Hello,
> 
> I've tested this driver and it seems the results are a little underwhelming.
> The test machine is a two sockets server with two AMD EPYC 7713,
> family:model:stepping 25:1:1, 128 cores/256 threads, 256G of memory and SSD
> storage. On this system, the amd-pstate driver works only in "shared memory
> support", not in "full MSR support", meaning that frequency switches are
> triggered from a workqueue instead of scheduler context (!fast_switch).
> 

Hi Giovanni,

I really appreciate the detailed tests and analysis! Thank you!

The initial driver was developed on a mobile CPU (Cezanne) with 8 cores/16
threads which supports the "full MSR" solution. We spent a lot of time
debugging with the BIOS, SMU firmware, and hardware guys to bring up this
driver on this CPU. The test results we provided were based on that series
of processors.

For the processors with the "shared memory" solution, we brought it up only
recently and in a short time, hoping that more AMD processors could also
support the new driver. :-) Although our CPUs comply with the ACPI standard
in theory, different processors have different SBIOS and SMU firmware (I
assume you know this from the previous mail). In practice, we need to
verify them one by one, because there are some differences in the SBIOS
ACPI _CPC table and the firmware implementation.

Of course, right now, we can start to optimize other processors and "shared
memory solution" in parallel.

Would you mind if we add a module parameter, or filter on the known-good
processors (mobile parts), for loading amd-pstate? Others could then use
the parameter to switch between amd-pstate and acpi-cpufreq manually. After
we address the performance gap, we can switch it back.

> Dbench sees some ludicrous improvements in both performance and performance
> per watt; likewise netperf sees some modest improvements, but that's about
> the only good news. Schedutil/ondemand on tbench and hackbench do worse
> with amd-pstate than acpi-cpufreq. I don't have data for
> ondemand/amd-pstate on kernbench and gitsource, but schedutil regresses on
> both.
> 
> Here the tables, then some questions & discussion points.
> 
> Tilde (~) means the result is the same as baseline (which is, the ratio is close to 1).
> "Sugov" means "schedutil governor", "perfgov" means "performance governor".
> 
>              :        acpi-cpufreq          :        amd-pstate          :
>  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>              :  ondemand  sugov  perfgov    :  ondemand  sugov  perfgov  :  better if
>  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>                                        PERFORMANCE RATIOS
> dbench       :  1.00      ~      0.33       :  0.37      0.35   0.36     :  lower
> netperf      :  1.00      0.97   ~          :  1.03      1.04   ~        :  higher
> tbench       :  1.00      1.04   1.06       :  0.83      0.40   1.05     :  higher
> hackbench    :  1.00      ~      1.03       :  1.09      1.42   1.03     :  lower
> kernbench    :  1.00      0.96   0.97       :  N/A       1.08   ~        :  lower
> gitsource    :  1.00      0.67   0.69       :  N/A       0.79   0.67     :  lower
>  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
>                                   PERFORMANCE-PER-WATT RATIOS
> dbench4      :  1.00      ~      3.37       :  2.68      3.12   3.03     :  higher
> netperf      :  1.00      0.96   ~          :  1.09      1.06   ~        :  higher
> tbench4      :  1.00      1.03   1.06       :  0.76      0.34   1.04     :  higher
> hackbench    :  1.00      ~      0.95       :  0.88      0.65   0.96     :  higher
> kernbench    :  1.00      1.06   1.05       :  N/A       0.93   1.05     :  higher
> gitsource    :  1.00      1.53   1.50       :  N/A       1.33   1.55     :  higher
> 
> 
> How to read the table: all numbers are ratios of the results of some
> governor/driver combination and ondemand/acpi-cpufreq, which is the
> baseline (first column). When the "better if" column says "higher", a ratio
> larger than 1 indicates an improvement; otherwise it's a regression.
> Example: hackbench with sugov/amd-pstate is 42% slower than with
> ondemand/acpi-cpufreq (top table). At the same time, it's also 35% less
> efficient (bottom table).

It seems the issue mainly comes from processors with a large number of
cores and threads. Let's find a similar Threadripper or EPYC processor to
reproduce the test results. I will contact you for details. :-)

> 
> Now, some questions / possible troubleshooting directions:
> 
> - ACPI-CPUFREQ DRIVER: REQUESTS ARE HINTS OR MANDATES?
>   When using acpi-cpufreq, and the OS requests some frequency (one of the
>   three allowed P-States), does the hardware underneath stick to it? Or
>   does it do some ulterior adjustment based on the load?
>   This would tell if a machine using acpi-cpufreq is less dumb than it
>   seems, and can in principle do fine-grain adjustments all the same.
> 

The acpi-cpufreq driver requests the frequency level to go to; however,
the firmware also has a policy to adjust the clock according to hardware
conditions such as voltage, current, and temperature. Legacy ACPI
P-states don't have any transaction with the firmware side. But with
amd-pstate, the firmware can detect the performance goals from the hints
that the driver provides.

> - PROCESSING CPPC DOORBELL REQUESTS: HOW FAST IS THAT?
>   How long does it take the hardware to process the CPPC doorbell
>   request to change frequency? What happens to outstanding requests, if
>   they're not processed in a timely manner? Is there any queue of requests,
>   and if so, how long is it? Could it be that if requests come in too quickly
>   the CPU ends up playing catch-up on freq switches that are obsoletes or
>   redundant?

That's a good question. We need to consult with the firmware and hardware
guys, or find a method to calculate it from the software side.

> 
> - LIKE-FOR-LIKE: TRY BENCHMARKING WITH AMD-PSTATE LIMITED TO 3 P-STATES?
>   Could it be that to study the performance of the "shared memory support"
>   system against acpi-cpufreq a more like-to-like comparison would be to limit
>   amd-pstate to only the 3 P-States available to acpi-cpufreq? That would be
>   for experimental/benchmarking purposes only. Eg: on my machines acpi-cpufreq
>   sees 1.5GHz, 1.7GHz and 2GHz. Given that max boost is 3.72GHz, and the CPPC
>   range is the abstract interval 0..255, I could limit amd-pstate to only set
>   performance level of 68, 102 and 137, and see what it gives against the old
>   driver. What do you think?

That's a good idea. We can run some experiments like this.

> 
> - PROCESSING CPPC DOORBELL REQS IS SLOW. BUT /MAKING/ A REQUEST, SLOW TOO?
>   Looks to me that with the "shared memory support" the frequency update
>   process is doubly asynchronous: first we have the ->target() callback
>   deferred to a workqueue, then when it's eventually executed, it calls
>   cppc_update_perf() which again just asks the firmware to do work at a
>   later time. Are we sure that cppc_update_perf() is actually so slow to
>   warrant !fast_switch?

That's a good question! I think your platform with "shared memory support"
actually reads/writes memory in the Platform Communication Channel
(PCC) to update the performance goals. The acpi-cpufreq driver, however,
uses the MSR registers with cpu_freq_write_amd()/cpu_freq_read_amd().

Is it possible that MSR register access is faster than the memory doorbell
in PCC?

> 
> - HOW MANY P-STATES ARE TOO MANY?
>   I've always believed the contrary, but what if having too many P-States is
>   harmful for both performance and efficiency? Maybe the governor is
>   requesting many updates in small increments where less (and larger) updates
>   would be more appropriate?

I am thinking that, maybe, we can work out a better policy to control the
perf range.


Thanks again for the questions / possible troubleshooting directions. They
are very helpful. As a next step, let us find the root cause of the
performance gap between the acpi-cpufreq and amd-pstate drivers.

Thanks,
Ray
Matt McDonald Nov. 6, 2021, 8:58 a.m. UTC | #3
> > I've tested this driver and it seems the results are a little
> > underwhelming.
> > The test machine is a two sockets server with two AMD EPYC 7713,
> > family:model:stepping 25:1:1, 128 cores/256 threads, 256G of memory
> > and SSD
> > storage. On this system, the amd-pstate driver works only in
> > "shared memory support", not in "full MSR support",
> > meaning that frequency switches are triggered from a workqueue
> > instead of scheduler context (!fast_switch).

Huang, I've also done some detailed testing, and while many synthetic
benchmarks seem to show minimal differences between this new frequency
control mechanism and acpi_cpufreq, the general user experience seems a
bit degraded, but most of all, gaming performance in many instances (if
not all) is cut in half. Fully half. 

I have an RTX 3090 and a Ryzen 9 5900X, with 32GB (4x8) DDR4 3600. In
Control with DLSS and RT enabled, on 5.15.rc5 with acpi_cpufreq, I get
120-130 fps at 1440p. The same exact kernel with v3 of AMD_CPPC gives
me 50 fps. GPU usage is still at 100, but the CPU frequency is being
reported as like 5100MHz*, and other assorted weirdness, but most
importantly the fps is stuck at 50. This is regardless of cpufreq
governor (schedutil, ondemand, userspace or performance).

*My CPU can indeed boost over 5GHz on a single core here and there, but
this was constant and on all cores, so clearly it wasn't accurate.

Also, from the documentation it looks like there's supposed to be a way
to fall back to acpi_cpufreq, but I found no such way to do that. If
AMD_CPPC was built into the kernel, I had to use amd-pstate, there was
no other option. Maybe I misinterpreted and acpi-cpufreq is only able
to be used as a fallback for CPUs that don't support amd-pstate.

I know that gaming on Linux hasn't historically been one of AMD's
priorities with their CPUs, but with the Steam Deck upcoming I would
imagine this is a pretty important use-case, and I've tested multiple
games and they all lose a full 50% performance. I'm happy to test any
revisions or even kernel parameters or whatever else to try and get
this sorted. 



> Would you mind that we add a module param or filter the known good
> processors (mobile parts) to load amd-pstate. And others can use the
> param
> to switch between amd-pstate and acpi-cpufreq manually? After we
> address the
> performance gap, then we can switch it back.


This would be something I would be interested to try.

> 
> It seems the issue mainly from the processors with big number of
> cores and
> threads. Let's find the similiar family threadripper or EYPC
> processors to
> duplicate the test results. Will contact at you for details. :-)

This may be an interesting route of investigation, I could potentially
try running a game with `taskset -c 0-7` or something similar. 
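
A minimal sketch of the same pinning idea done from Python instead of
taskset, in case that is handier for a test script. The helper names are
made up for illustration; os.sched_setaffinity() is Linux-only, and
processes started after the call inherit the CPU mask.

```python
import os


def parse_cpu_list(spec):
    """Parse a taskset-style CPU list such as "0-7" or "0,2,4-5"."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_process(spec):
    """Restrict the calling process (pid 0 means self) to the given CPUs."""
    os.sched_setaffinity(0, parse_cpu_list(spec))


if __name__ == "__main__":
    print(sorted(parse_cpu_list("0-7")))
    # pin_current_process("0-7")  # uncomment on a machine with >= 8 CPUs
```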

Huang Rui Nov. 8, 2021, 9:20 a.m. UTC | #4
On Sat, Nov 06, 2021 at 04:58:35PM +0800, Matt McDonald wrote:
> > > I've tested this driver and it seems the results are a little
> > > underwhelming.
> > > The test machine is a two sockets server with two AMD EPYC 7713,
> > > family:model:stepping 25:1:1, 128 cores/256 threads, 256G of memory
> > > and SSD
> > > storage. On this system, the amd-pstate driver works only in
> > > "shared memory support", not in "full MSR support",
> > > meaning that frequency switches are triggered from a workqueue
> > > instead of scheduler context (!fast_switch).
> 
> Huang, I've also done some detailed testing, and while many synthetic
> benchmarks seem to show minimal differences between this new frequency
> control mechanism and acpi_cpufreq, the general user experience seems a
> bit degraded, but most of all, gaming performance in many instances (if
> not all) is cut in half. Fully half. 
> 
> I have an RTX 3090 and a Ryzen 9 5900X, with 32GB (4x8) DDR4 3600. In

May we know the family/model id of your processors?

> Control with DLSS and RT enabled, on 5.15.rc5 with acpi_cpufreq, I get
> 120-130 fps at 1440p. The same exact kernel with v3 of AMD_CPPC gives
> me 50 fps. GPU usage is still at 100, but the CPU frequency is being
> reported as like 5100Mhz*, and other assorted weirdness, but most
> importantly the fps is stuck at 50. This is regardless of performance
> scheduler (schedutil, ondemand, userspace or performance). 

May we know your SMU version in your SBIOS?

Thanks,
Ray

Du, Xiaojian Nov. 12, 2021, 11:21 a.m. UTC | #5
Hi Matt,

Thanks for your test, we are very happy to receive feedback from you and the community.
We tried to reproduce the issue you reported in our local environment.

Hardware configuration:
CPU: 5900X 12core
MEM: DDR4 8*2GB @2667MHz@2channel
GPU: VEGA20, Radeon VII
Mainboard: B550
Kernel: 5.15-rc, custom kernel, with acpi-cpufreq and amd_pstate driver.

We built two identical systems and installed a clean Ubuntu 20.04.3 OS and Steam on each.
Steam is at its default version.
We use a *USB synchronizer* to control the two systems at the same time.

For "Control" game:
Graphics options: default settings at 1080p, to avoid a GPU performance bottleneck.
GPU driver package is:
https://drivers.amd.com/drivers/linux/amdgpu-pro-21.20-1292797-rhel-8.4.tar.xz
(Installed with command: ./amdgpu-install  --no-dkms)

The only difference between the two systems is the cpufreq driver: one uses acpi-cpufreq, the other amd_pstate.
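
One way to confirm which cpufreq driver each system is actually running is to read the `scaling_driver` attribute from sysfs. The following is a minimal Python sketch and not part of the test procedure described here; the helper name `active_cpufreq_driver` is hypothetical, and the `base` parameter exists only so the function can be pointed at a different directory tree.

```python
from pathlib import Path


def active_cpufreq_driver(cpu=0, base="/sys/devices/system/cpu"):
    """Return the cpufreq driver name for one CPU, or None if it cannot be read."""
    path = Path(base) / f"cpu{cpu}" / "cpufreq" / "scaling_driver"
    try:
        return path.read_text().strip()
    except OSError:
        # No cpufreq driver bound, or not running on Linux.
        return None


if __name__ == "__main__":
    print(active_cpufreq_driver())
```

Running this on both machines before starting the game should show a different driver name on each, which rules out both systems silently using the same driver.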

In our test results we cannot find an obvious performance gap between the two systems; both run "Control" at 100-120 fps.
The results are captured in the following picture and videos, which show the two screens at the same time:

One picture:
https://drive.google.com/file/d/1PvSduykJn9U5MMOhzFWycnbmGmznalM3/view?usp=sharing

Two videos:
https://drive.google.com/file/d/1nQQEteL-v_zQxnOJpyW8JqvRW2FFDN2Z/view?usp=sharing

https://drive.google.com/file/d/1heuPgFG71SQHvGb6wfedrQciBfE2rhnu/view?usp=sharing

We did not test with NVIDIA GPUs, because we have no NVIDIA RTX cards so far.
But we can test again with a Navi 21 GPU (RX 6900 XT / 6800 XT / 6800) if the issue is related to ray tracing.
Would you have a chance to re-test on your system with an AMD GPU?

Anyway, we very much appreciate your feedback; we need more of it to improve our AMD CPU driver.

Thanks,
Xiaojian


-----Original Message-----
From: Huang, Ray <Ray.Huang@amd.com>
Sent: November 8, 2021 17:20
To: Matt McDonald <gardotd426@gmail.com>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>; Rafael J . Wysocki <rafael.j.wysocki@intel.com>; Viresh Kumar <viresh.kumar@linaro.org>; Shuah Khan <skhan@linuxfoundation.org>; Borislav Petkov <bp@suse.de>; Peter Zijlstra <peterz@infradead.org>; Ingo Molnar <mingo@kernel.org>; linux-pm@vger.kernel.org; Sharma, Deepak <Deepak.Sharma@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Limonciello, Mario <Mario.Limonciello@amd.com>; Steven Noonan <steven@valvesoftware.com>; Fontenot, Nathan <Nathan.Fontenot@amd.com>; Su, Jinzhou (Joe) <Jinzhou.Su@amd.com>; Du, Xiaojian <Xiaojian.Du@amd.com>; linux-kernel@vger.kernel.org; x86@kernel.org
Subject: Re: [PATCH v3 00/21] cpufreq: introduce a new AMD CPU frequency control mechanism

On Sat, Nov 06, 2021 at 04:58:35PM +0800, Matt McDonald wrote:
> > > I've tested this driver and it seems the results are a little
> > > underwhelming.
> > > The test machine is a two sockets server with two AMD EPYC 7713,
> > > family:model:stepping 25:1:1, 128 cores/256 threads, 256G of
> > > memory and SSD storage. On this system, the amd-pstate driver
> > > works only in "shared memory support", not in "full MSR support",
> > > meaning that frequency switches are triggered from a workqueue
> > > instead of scheduler context (!fast_switch).
>
> Huang, I've also done some detailed testing, and while many synthetic
> benchmarks seem to show minimal differences between this new frequency
> control mechanism and acpi_cpufreq, the general user experience seems
> a bit degraded, but most of all, gaming performance in many instances
> (if not all) is cut in half. Fully half.
>
> I have an RTX 3090 and a Ryzen 9 5900X, with 32GB (4x8) DDR4 3600. In

May we know the family/model id of your processors?
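
On Linux the family and model IDs can be read from `/proc/cpuinfo`. As a rough illustration only (the function name `cpu_family_model` is hypothetical), a small Python sketch that pulls the `cpu family`, `model` and `stepping` fields of the first listed CPU out of that file's text:

```python
def cpu_family_model(cpuinfo_text):
    """Return (family, model, stepping) of the first CPU in /proc/cpuinfo text.

    Fields that are not present are returned as None.
    """
    wanted = {"cpu family": None, "model": None, "stepping": None}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Exact key match, so "model name" is not mistaken for "model".
        if sep and key in wanted and wanted[key] is None:
            wanted[key] = int(value.strip())
        if all(v is not None for v in wanted.values()):
            break
    return wanted["cpu family"], wanted["model"], wanted["stepping"]
```

Feeding it the contents of `/proc/cpuinfo` gives a tuple in the same family:model:stepping order used elsewhere in this thread (for example 25:1:1 for the EPYC 7713 system).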

> Control with DLSS and RT enabled, on 5.15.rc5 with acpi_cpufreq, I get
> 120-130 fps at 1440p. The same exact kernel with v3 of AMD_CPPC gives
> me 50 fps. GPU usage is still at 100, but the CPU frequency is being
> reported as like 5100Mhz*, and other assorted weirdness, but most
> importantly the fps is stuck at 50. This is regardless of performance
> scheduler (schedutil, ondemand, userspace or performance).

May we know your SMU version in your SBIOS?

Thanks,
Ray

>
> *My CPU can indeed boost over 5GHz on a single core here and there,
> but this was constant and on all cores, so clearly it wasn't accurate.
>
> Also, from the documentation it looks like there's supposed to be a
> way to fall back to acpi_cpufreq, but I found no such way to do that.
> If AMD_CPPC was built into the kernel, I had to use amd-pstate, there
> was no other option. Maybe I misinterpreted and acpi-cpufreq is only
> able to be used as a fallback for CPUs that don't support amd-pstate.
>
> I know that gaming on Linux hasn't historically been one of AMD's
> priorities with their CPUs, but with the Steam Deck upcoming I would
> imagine this is a pretty important use-case, and I've tested multiple
> games and they all lose a full 50% performance. I'm happy to test any
> revisions or even kernel parameters or whatever else to try and get
> this sorted.
>
>
>
> > Would you mind if we add a module param, or filter on the known good
> > processors (mobile parts), to load amd-pstate? Others could then use
> > the param to switch between amd-pstate and acpi-cpufreq manually.
> > After we address the performance gap, we can switch it back.
>
>
> This would be something I would be interested to try.
>
> >
> > It seems the issue comes mainly from processors with a large number
> > of cores and threads. Let's find a similar-family Threadripper or
> > EPYC processor to reproduce the test results. We will contact you
> > for details. :-)
>
> This may be an interesting route of investigation, I could potentially
> try running a game with `taskset -c 0-7` or something similar.
>
> >
>