[1/2] x86,sched: Add support for frequency invariance

Message ID 20190909024216.5942-2-ggherdovich@suse.cz
State Superseded, archived
Series
  • Add support for frequency invariance for (some) x86

Commit Message

Giovanni Gherdovich Sept. 9, 2019, 2:42 a.m. UTC
Implement arch_scale_freq_capacity() for 'modern' x86. This function
is used by the scheduler to correctly account usage in the face of
DVFS.

The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on-par with, the powersave governor
from the intel_pstate driver/framework.

Large performance gains are obtained when the machine is lightly loaded and no
regressions are observed at saturation. The benchmarks with the largest gains
are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.

1. FREQUENCY INVARIANCE: MOTIVATION
   * Without it, a task looks larger if the CPU runs slower

2. PECULIARITIES OF X86
   * freq invariance accounting requires knowing the ratio freq_curr/freq_max
   2.1 CURRENT FREQUENCY
       * Use delta_APERF / delta_MPERF * freq_base (a.k.a "BusyMHz")
   2.2 MAX FREQUENCY
       * It varies with time (turbo). As an approximation, we set it to a
         constant, i.e. 4-cores turbo frequency (or base frequency if nothing
         else is reported by MSRs)

3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
   * The invariant schedutil's formula has no feedback loop and reacts faster
     to utilization changes

4. KNOWN LIMITATIONS
   * In some cases tasks can't reach max util no matter how hard they try

5. PERFORMANCE TESTING
   5.1 MACHINES
       * Skylake, Broadwell, Haswell
   5.2 SETUP
       * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max = 1-2-3-4-8-12
         active cores turbo w/ invariant schedutil, and intel_pstate/powersave
   5.3 BENCHMARK RESULTS
       5.3.1 NEUTRAL BENCHMARKS
             * NAS Parallel Benchmark (HPC), hackbench
       5.3.2 NON-NEUTRAL BENCHMARKS
             * tbench (10-30% better), kernbench (10-15% better),
               shell-intensive-scripts (30-50% better)
             * no regressions
       5.3.3 SELECTION OF DETAILED RESULTS
       5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
             * dbench (5% worse on one machine), kernbench (3% worse),
               tbench (5-10% better), shell-intensive-scripts (10-40% better)

6. MICROARCH'ES ADDRESSED HERE
   * Xeon Core before the Scalable Performance processors line (Xeon Gold/Platinum
     etc. have different MSR semantics for querying turbo levels)

7. REFERENCES
   * MMTests performance testing framework, github.com/gormanm/mmtests

+-------------------------------------------------------------------------+
| 1. FREQUENCY INVARIANCE: MOTIVATION
+-------------------------------------------------------------------------+

For example, suppose a CPU has two frequencies: 500 and 1000 MHz. When
running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*]. In a nutshell, without frequency scale-invariance tasks look
larger just because the CPU is running slower.

[*] (footnote: this assumes a linear frequency/performance relation, which
everybody knows to be false, but given realities it's the best approximation
we can make.)
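
The arithmetic of this example can be sketched in a few lines of Python. The
function names are made up for illustration and the linear
frequency/performance relation of the footnote is assumed; this is not how
the kernel computes the signal, only the proportion it relies on.

```python
# Illustrative only: names are hypothetical, and performance is assumed to
# scale linearly with frequency (see the footnote above).

def apparent_util(util_at_max, freq_curr, freq_max):
    """Fraction of CPU time a task appears to consume when running at freq_curr."""
    return min(1.0, util_at_max * freq_max / freq_curr)

def invariant_util(util_raw, freq_curr, freq_max):
    """Scale the raw (time-based) utilization back to freq_max units."""
    return util_raw * freq_curr / freq_max

# A task that needs 1/3 of a CPU at 1000 MHz, observed while running at 500 MHz.
raw = apparent_util(1 / 3, freq_curr=500, freq_max=1000)
print(f"raw util at 500 MHz: {raw:.3f}")
print(f"invariant util:      {invariant_util(raw, 500, 1000):.3f}")
```

The raw figure is the inflated one; multiplying by freq_curr / freq_max
recovers the size the task would have at full speed.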

+-------------------------------------------------------------------------+
| 2. PECULIARITIES OF X86
+-------------------------------------------------------------------------+

Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.

2.1 CURRENT FREQUENCY
=====================

Since modern x86 has hardware control over the actual frequency we run
at (because of, amongst other things, Turbo-Mode), we cannot simply use
the frequency as requested through cpufreq.

Instead we use the APERF/MPERF MSRs to compute the effective frequency
over the recent past. Also, because reading MSRs is expensive, we don't
do so every time we need the value, but amortize the cost by doing it
every tick.
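
As a rough illustration of the "BusyMHz" computation (delta_APERF /
delta_MPERF * freq_base), here is a minimal Python sketch. The function and
parameter names are hypothetical, and it assumes the two counters are sampled
as 64-bit values at two consecutive ticks; the in-kernel code works in fixed
point and is not reproduced here.

```python
# Illustrative only: function and parameter names are hypothetical.
COUNTER_MASK = (1 << 64) - 1  # assumes APERF/MPERF are read as 64-bit values

def busy_mhz(aperf_prev, mperf_prev, aperf_now, mperf_now, base_mhz):
    """Effective frequency over the interval between two counter snapshots."""
    # Masking keeps the deltas correct if a counter wrapped between snapshots.
    delta_aperf = (aperf_now - aperf_prev) & COUNTER_MASK
    delta_mperf = (mperf_now - mperf_prev) & COUNTER_MASK
    if delta_mperf == 0:
        return None  # MPERF did not advance, so there is nothing to measure
    return delta_aperf * base_mhz / delta_mperf

# APERF advanced faster than MPERF, so the core ran above base frequency.
print(busy_mhz(0, 0, 3_700_000, 3_500_000, base_mhz=3500))
```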

2.2 MAX FREQUENCY
=================

Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (eg: Turbo Boost). This means that the maximum frequency
available to a given core changes with time.

The approach taken in this change is to arbitrarily set freq_max to a constant
value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
microarchitectures, after evaluating the following candidates:

    * 1-core (1C) turbo frequency (the fastest turbo state available)
    * around base frequency (a.k.a. max P-state)
    * something in between, such as 4C turbo

To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is); vice versa, with a smaller denominator PELT
signals will tend to overshoot. Given that PELT drives frequency selection
in the schedutil governor, we will have:

    freq_max set to     | effect on DVFS
    --------------------+------------------
    1C turbo            | power efficiency (lower freq choices)
    base freq           | performance (higher util_avg, higher freq requests)
    4C turbo            | a bit of both

4C turbo proves to be a good compromise in a number of benchmarks (see
below). Note that when the function intel_set_cpu_max_freq() fails to query
the various MSRs for the 4C turbo value, the variable arch_max_freq retains
its default value of SCHED_CAPACITY_SCALE (1024) that corresponds to setting
freq_max to base frequency wrt the table above.
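
To make the undershoot/overshoot trade-off concrete, the sketch below
evaluates freq_curr / freq_max in SCHED_CAPACITY_SCALE units for the three
candidate denominators, using the turbo levels of the 8x-SKYLAKE-UMA machine
listed in section 5.1. The helper name is hypothetical and no clamping is
applied, which is an assumption of this sketch and not necessarily what the
patch does.

```python
SCHED_CAPACITY_SCALE = 1024

def freq_scale(freq_curr_mhz, freq_max_mhz):
    """freq_curr / freq_max in SCHED_CAPACITY_SCALE fixed point (no clamping)."""
    return freq_curr_mhz * SCHED_CAPACITY_SCALE // freq_max_mhz

# Candidate denominators, in MHz, for the 8x-SKYLAKE-UMA machine.
candidates = {"1C turbo": 3900, "4C turbo": 3700, "base freq": 3500}
freq_curr = 3700  # a core currently running at the 4C turbo frequency
for name, freq_max in candidates.items():
    print(f"{name:10s} scale = {freq_scale(freq_curr, freq_max)}")
```

With the largest denominator the scale factor stays below 1024 even though
the core is running in the turbo range; with base frequency as denominator it
exceeds 1024.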

+-------------------------------------------------------------------------+
| 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
+-------------------------------------------------------------------------+

Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula from

    freq_next = 1.25 * freq_curr * util            [non-invariant util signal]

to

    freq_next = 1.25 * freq_max * util             [invariant util signal]

where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.

Compare it to the update formula of intel_pstate/powersave:

    freq_next = 1.25 * freq_max * Busy%

where again freq_max is 1C turbo and Busy% is the percentage of time not spent
idling (calculated with delta_MPERF / delta_TSC); essentially the same as
invariant schedutil, and largely responsible for intel_pstate/powersave's good
reputation. The non-invariant schedutil formula is derived from the invariant
one by approximating util_inv with util_raw * freq_curr / freq_max, but this
has limitations.
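
The difference between the two schedutil formulas can be illustrated with a
toy model. Everything in it is an assumption made for illustration: a single
task with constant demand, raw utilization that saturates at 1.0, requests
clamped to [freq_min, freq_max], and instantaneous utilization updates (real
PELT signals ramp up gradually). The function names are hypothetical.

```python
def noninvariant_steps(demand, freq_start, freq_min, freq_max, n_steps):
    """Iterate freq_next = 1.25 * freq_curr * util_raw, returning each request."""
    freq, history = freq_start, []
    for _ in range(n_steps):
        # Raw utilization grows as the CPU slows down and saturates at 1.0.
        util_raw = min(1.0, demand * freq_max / freq)
        freq = max(freq_min, min(freq_max, 1.25 * freq * util_raw))
        history.append(freq)
    return history

def invariant_step(demand, freq_min, freq_max):
    """freq_next = 1.25 * freq_max * util, independent of the current frequency."""
    return max(freq_min, min(freq_max, 1.25 * freq_max * demand))

# A task needing half a CPU at freq_max; frequencies in MHz (Skylake box, 5.1).
print(noninvariant_steps(0.5, freq_start=800, freq_min=800, freq_max=3900, n_steps=6))
print(invariant_step(0.5, freq_min=800, freq_max=3900))
```

In this model the non-invariant formula can raise the request by at most 25%
per update while the raw utilization is saturated, so it climbs from 800 MHz
in several steps, whereas the invariant formula reaches the same target in a
single update.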

Testing shows improved performance due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.

+-------------------------------------------------------------------------+
| 4. KNOWN LIMITATIONS
+-------------------------------------------------------------------------+

It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]).

If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
cores in a package doing some busywork, and observe that none of those tasks
will ever reach max util (1024) because they're all running at less than the
4C turbo frequency.

While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweighs the cost of this limitation.

[Lelli-2018]
https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/
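
The size of this limitation can be estimated with a short sketch. It assumes
that a task running 100% of the time at frequency f saturates at roughly
1024 * f / freq_max, and that cores actually run at the turbo level listed in
section 5.1 for the given number of active cores. The helper name and the
clamp at 1024 are assumptions of the sketch.

```python
SCHED_CAPACITY_SCALE = 1024

def max_reachable_util(freq_running_mhz, freq_max_mhz):
    """Upper bound on util for a task that runs 100% of the time at freq_running."""
    return min(SCHED_CAPACITY_SCALE,
               freq_running_mhz * SCHED_CAPACITY_SCALE // freq_max_mhz)

FREQ_MAX_4C = 3300  # 4C turbo on 80x-BROADWELL-NUMA (section 5.1), in MHz
for label, mhz in (("4C", 3300), ("5C", 3200), ("8C", 2900)):
    print(label, max_reachable_util(mhz, FREQ_MAX_4C))
```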

+-------------------------------------------------------------------------+
| 5. PERFORMANCE TESTING
+-------------------------------------------------------------------------+

5.1 MACHINES
============

We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.

* 8x-SKYLAKE-UMA:
  Single socket E3-1240 v5, Skylake 4 cores/8 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC    800 |********
    BASE    3500 |***********************************
    4C      3700 |*************************************
    3C      3800 |**************************************
    2C      3900 |***************************************
    1C      3900 |***************************************

* 80x-BROADWELL-NUMA:
  Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2200 |**********************
    8C      2900 |*****************************
    7C      3000 |******************************
    6C      3100 |*******************************
    5C      3200 |********************************
    4C      3300 |*********************************
    3C      3400 |**********************************
    2C      3600 |************************************
    1C      3600 |************************************

* 48x-HASWELL-NUMA
  Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2300 |***********************
    12C     2600 |**************************
    11C     2600 |**************************
    10C     2600 |**************************
    9C      2600 |**************************
    8C      2600 |**************************
    7C      2600 |**************************
    6C      2600 |**************************
    5C      2700 |***************************
    4C      2800 |****************************
    3C      2900 |*****************************
    2C      3100 |*******************************
    1C      3100 |*******************************

5.2 SETUP
=========

* The baseline is Linux v5.2 with schedutil (non-invariant) and the intel_pstate
  driver in passive mode.
* The rationale for choosing the various freq_max values to test has been to
  try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
  on all machines), plus one more value closer to base_freq but still in the
  turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
* In addition we've run all tests with intel_pstate/powersave for comparison.
* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
  with active intel_pstate on this machine use that.

This gives, in terms of combinations tested on each machine:

* 8x-SKYLAKE-UMA
  * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
  * intel_pstate active + powersave + HWP
  * invariant schedutil, freq_max = 1C turbo
  * invariant schedutil, freq_max = 3C turbo
  * invariant schedutil, freq_max = 4C turbo

* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
  * [same as 8x-SKYLAKE-UMA, but not HWP capable]
  * invariant schedutil, freq_max = 8C turbo
  * (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")

5.3 BENCHMARK RESULTS
=====================

5.3.1 NEUTRAL BENCHMARKS
------------------------

Tests that didn't show any measurable difference in performance on any of the
test machines between non-invariant schedutil and our patch are:

* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
  computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)

5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------

What follows are summary tables where each benchmark result is given a score.

* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a tilde
  means a score of 1.00.
* The results in the score ratio are the geometric means of results running
  the benchmark with different parameters (eg: for kernbench: using 1, 2, 4,
  ... number of processes; for pgbench: varying the number of clients, and so
  on).
* The first three tables show higher-is-better kind of tests (i.e. measured in
  operations/second), the subsequent three show lower-is-better kind of tests
  (i.e. the workload is fixed and we measure elapsed time, think kernbench).
* "gitsource" is a name we made up for the test consisting in running the
  entire unit tests suite of the Git SCM and measuring how long it takes. We
  take it as a typical example of shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave. Other
  columns show invariant schedutil for different values of freq_max. 4C turbo
  is circled as it's the value we've chosen for the final implementation.

80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
pgbench-ro           1.14   ~      ~     | 1.11 |  1.14
pgbench-rw           ~      ~      ~     | ~    |  ~
netperf-udp          1.06   ~      1.06  | 1.05 |  1.07
netperf-tcp          ~      1.03   ~     | 1.01 |  1.02
tbench4              1.57   1.18   1.22  | 1.30 |  1.56
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                         +------+
             I_PSTATE/HWP   1C     3C    | 4C   |
pgbench-ro           ~      ~      ~     | ~    |
pgbench-rw           ~      ~      ~     | ~    |
netperf-udp          ~      ~      ~     | ~    |
netperf-tcp          ~      ~      ~     | ~    |
tbench4              1.30   1.14   1.14  | 1.16 |
                                         +------+

48x-HASWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  12C
pgbench-ro           1.15   ~      ~     | 1.06 |  1.16
pgbench-rw           ~      ~      ~     | ~    |  ~
netperf-udp          1.05   0.97   1.04  | 1.04 |  1.02
netperf-tcp          0.96   1.01   1.01  | 1.01 |  1.01
tbench4              1.50   1.05   1.13  | 1.13 |  1.25
                                         +------+

In the tables above we see that active intel_pstate is slightly better than our
4C-turbo patch (both in reference to the baseline non-invariant schedutil) on
read-only pgbench and much better on tbench. Both cases are notable in that
they show that lowering our freq_max (to 8C-turbo and 12C-turbo on
80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
schedutil to get closer.

If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.

80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
dbench4              1.23   0.95   0.95  | 0.95 |  0.95
kernbench            0.93   0.83   0.83  | 0.83 |  0.82
gitsource            0.98   0.49   0.49  | 0.49 |  0.48
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                         +------+
             I_PSTATE/HWP   1C     3C    | 4C   |
dbench4              ~      ~      ~     | ~    |
kernbench            ~      ~      ~     | ~    |
gitsource            0.92   0.55   0.55  | 0.55 |
                                         +------+

48x-HASWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
dbench4              ~      ~      ~     | ~    |  ~
kernbench            0.94   0.90   0.89  | 0.90 |  0.90
gitsource            0.97   0.69   0.69  | 0.69 |  0.69
                                         +------+

dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
non-invariant schedutil. We repeated that run getting consistent results. This
is out of scope for the patch at hand, but deserves future investigation. Other
than that, we previously ran this campaign with Linux v5.0 and saw the patch
doing better on dbench at the time. We haven't checked closely and can only
speculate at this point.

On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
the detailed tables that the gains concentrate on low process counts (lightly
loaded machines).

The test we call "gitsource" (running the git unit test suite, a long-running
single-threaded shell script) appears rather spectacular in this table (gains
of 30-50% depending on the machine). It is to be noted, however, that
gitsource has no adjustable parameters (such as the number of jobs in
kernbench, which we average over in order to get a single-number summary
score) and is exactly the kind of low-parallelism workload that benefits the
most from this patch. When looking at the detailed tables of kernbench or
tbench4, at low process or client counts one can see similar numbers.

5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------

Machine            : 48x-HASWELL-NUMA
Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

                   5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        126.73  +- 0.31% (        )      315.91  +- 0.66% ( 149.28%)      125.03  +- 0.76% (  -1.34%)
Hmean  2        258.04  +- 0.62% (        )      614.16  +- 0.51% ( 138.01%)      269.58  +- 1.45% (   4.47%)
Hmean  4        514.30  +- 0.67% (        )     1146.58  +- 0.54% ( 122.94%)      533.84  +- 1.99% (   3.80%)
Hmean  8       1111.38  +- 2.52% (        )     2159.78  +- 0.38% (  94.33%)     1359.92  +- 1.56% (  22.36%)
Hmean  16      2286.47  +- 1.36% (        )     3338.29  +- 0.21% (  46.00%)     2720.20  +- 0.52% (  18.97%)
Hmean  32      4704.84  +- 0.35% (        )     4759.03  +- 0.43% (   1.15%)     4774.48  +- 0.30% (   1.48%)
Hmean  64      7578.04  +- 0.27% (        )     7533.70  +- 0.43% (  -0.59%)     7462.17  +- 0.65% (  -1.53%)
Hmean  128     6998.52  +- 0.16% (        )     6987.59  +- 0.12% (  -0.16%)     6909.17  +- 0.14% (  -1.28%)
Hmean  192     6901.35  +- 0.25% (        )     6913.16  +- 0.10% (   0.17%)     6855.47  +- 0.21% (  -0.66%)

                             5.2.0 3C-turbo                   5.2.0 4C-turbo                  5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        128.43  +- 0.28% (   1.34%)      130.64  +- 3.81% (   3.09%)      153.71  +- 5.89% (  21.30%)
Hmean  2        311.70  +- 6.15% (  20.79%)      281.66  +- 3.40% (   9.15%)      305.08  +- 5.70% (  18.23%)
Hmean  4        641.98  +- 2.32% (  24.83%)      623.88  +- 5.28% (  21.31%)      906.84  +- 4.65% (  76.32%)
Hmean  8       1633.31  +- 1.56% (  46.96%)     1714.16  +- 0.93% (  54.24%)     2095.74  +- 0.47% (  88.57%)
Hmean  16      3047.24  +- 0.42% (  33.27%)     3155.02  +- 0.30% (  37.99%)     3634.58  +- 0.15% (  58.96%)
Hmean  32      4734.31  +- 0.60% (   0.63%)     4804.38  +- 0.23% (   2.12%)     4674.62  +- 0.27% (  -0.64%)
Hmean  64      7699.74  +- 0.35% (   1.61%)     7499.72  +- 0.34% (  -1.03%)     7659.03  +- 0.25% (   1.07%)
Hmean  128     6935.18  +- 0.15% (  -0.91%)     6942.54  +- 0.10% (  -0.80%)     7004.85  +- 0.12% (   0.09%)
Hmean  192     6901.62  +- 0.12% (   0.00%)     6856.93  +- 0.10% (  -0.64%)     6978.74  +- 0.10% (   1.12%)

This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains are
visible up to 16 clients and the saturated scenario is the same as baseline.

The scores in the summary table from the previous sections are ratios of
geometric means of the results over different clients, as seen in this table.
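
As a cross-check of how a summary score relates to the detailed table, the
sketch below recomputes the tbench4 score for 4C-turbo on 48x-HASWELL-NUMA
from the Hmean values above. It only assumes that the score is the ratio of
the geometric means over all client counts, as stated; the result should come
out close to the 1.13 reported in the summary table.

```python
import math

def geometric_mean(values):
    """Geometric mean computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# tbench4 on 48x-HASWELL-NUMA, MB/sec for 1, 2, 4, ..., 192 clients.
baseline = [126.73, 258.04, 514.30, 1111.38, 2286.47,
            4704.84, 7578.04, 6998.52, 6901.35]
turbo_4c = [130.64, 281.66, 623.88, 1714.16, 3155.02,
            4804.38, 7499.72, 6942.54, 6856.93]

score = geometric_mean(turbo_4c) / geometric_mean(baseline)
print(f"4C-turbo vs baseline: {score:.2f}")
```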

Machine            : 80x-BROADWELL-NUMA
Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

                   5.2.0 vanilla (BASELINE)               5.2.0 intel_pstate                   5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        379.68  +- 0.06% (        )      330.20  +- 0.43% (  13.03%)      285.93  +- 0.07% (  24.69%)
Amean  4        200.15  +- 0.24% (        )      175.89  +- 0.22% (  12.12%)      153.78  +- 0.25% (  23.17%)
Amean  8        106.20  +- 0.31% (        )       95.54  +- 0.23% (  10.03%)       86.74  +- 0.10% (  18.32%)
Amean  16        56.96  +- 1.31% (        )       53.25  +- 1.22% (   6.50%)       48.34  +- 1.73% (  15.13%)
Amean  32        34.80  +- 2.46% (        )       33.81  +- 0.77% (   2.83%)       30.28  +- 1.59% (  12.99%)
Amean  64        26.11  +- 1.63% (        )       25.04  +- 1.07% (   4.10%)       22.41  +- 2.37% (  14.16%)
Amean  128       24.80  +- 1.36% (        )       23.57  +- 1.23% (   4.93%)       21.44  +- 1.37% (  13.55%)
Amean  160       24.85  +- 0.56% (        )       23.85  +- 1.17% (   4.06%)       21.25  +- 1.12% (  14.49%)

                             5.2.0 3C-turbo                   5.2.0 4C-turbo                   5.2.0 8C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2        284.08  +- 0.13% (  25.18%)      283.96  +- 0.51% (  25.21%)      285.05  +- 0.21% (  24.92%)
Amean  4        153.18  +- 0.22% (  23.47%)      154.70  +- 1.64% (  22.71%)      153.64  +- 0.30% (  23.24%)
Amean  8         87.06  +- 0.28% (  18.02%)       86.77  +- 0.46% (  18.29%)       86.78  +- 0.22% (  18.28%)
Amean  16        48.03  +- 0.93% (  15.68%)       47.75  +- 1.99% (  16.17%)       47.52  +- 1.61% (  16.57%)
Amean  32        30.23  +- 1.20% (  13.14%)       30.08  +- 1.67% (  13.57%)       30.07  +- 1.67% (  13.60%)
Amean  64        22.59  +- 2.02% (  13.50%)       22.63  +- 0.81% (  13.32%)       22.42  +- 0.76% (  14.12%)
Amean  128       21.37  +- 0.67% (  13.82%)       21.31  +- 1.15% (  14.07%)       21.17  +- 1.93% (  14.63%)
Amean  160       21.68  +- 0.57% (  12.76%)       21.18  +- 1.74% (  14.77%)       21.22  +- 1.00% (  14.61%)

The patch outperforms active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.

Machine            : 8x-SKYLAKE-UMA
Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

                            5.2.0 vanilla           5.2.0 intel_pstate/hwp         5.2.0 1C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         858.85  +- 1.16% (        )      791.94  +- 0.21% (   7.79%)      474.95 (  44.70%)

                           5.2.0 3C-turbo                   5.2.0 4C-turbo
- - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean         475.26  +- 0.20% (  44.66%)      474.34  +- 0.13% (  44.77%)

In this test, which is of interest as representing shell-intensive
(i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.

5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------

The following tables show average power consumption in watts for each
benchmark. Data comes from turbostat (package average), which in turn is read
from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
it's reasonable to ignore other power consumers (such as memory or I/O). Also,
we don't have a power meter available in the lab so RAPL is the best we have.

turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. we don't
have detail on a per-parameter granularity, only on whole benchmarks).

80x-BROADWELL-NUMA (power consumption, watts)
                                                    +--------+
               BASELINE I_PSTATE       1C       3C  |     4C |      8C
pgbench-ro       130.01   142.77   131.11   132.45  | 134.65 |  136.84
pgbench-rw        68.30    60.83    71.45    71.70  |  71.65 |   72.54
dbench4           90.25    59.06   101.43    99.89  | 101.10 |  102.94
netperf-udp       65.70    69.81    66.02    68.03  |  68.27 |   68.95
netperf-tcp       88.08    87.96    88.97    88.89  |  88.85 |   88.20
tbench4          142.32   176.73   153.02   163.91  | 165.58 |  176.07
kernbench         92.94   101.95   114.91   115.47  | 115.52 |  115.10
gitsource         40.92    41.87    75.14    75.20  |  75.40 |   75.70
                                                    +--------+
8x-SKYLAKE-UMA (power consumption, watts)
                                                    +--------+
              BASELINE I_PSTATE/HWP    1C       3C  |     4C |
pgbench-ro        46.49    46.68    46.56    46.59  |  46.52 |
pgbench-rw        29.34    31.38    30.98    31.00  |  31.00 |
dbench4           27.28    27.37    27.49    27.41  |  27.38 |
netperf-udp       22.33    22.41    22.36    22.35  |  22.36 |
netperf-tcp       27.29    27.29    27.30    27.31  |  27.33 |
tbench4           41.13    45.61    43.10    43.33  |  43.56 |
kernbench         42.56    42.63    43.01    43.01  |  43.01 |
gitsource         13.32    13.69    17.33    17.30  |  17.35 |
                                                    +--------+
48x-HASWELL-NUMA (power consumption, watts)
                                                    +--------+
               BASELINE I_PSTATE       1C       3C  |     4C |     12C
pgbench-ro       128.84   136.04   129.87   132.43  | 132.30 |  134.86
pgbench-rw        37.68    37.92    37.17    37.74  |  37.73 |   37.31
dbench4           28.56    28.73    28.60    28.73  |  28.70 |   28.79
netperf-udp       56.70    60.44    56.79    57.42  |  57.54 |   57.52
netperf-tcp       75.49    75.27    75.87    76.02  |  76.01 |   75.95
tbench4          115.44   139.51   119.53   123.07  | 123.97 |  130.22
kernbench         83.23    91.55    95.58    95.69  |  95.72 |   96.04
gitsource         36.79    36.99    39.99    40.34  |  40.35 |   40.23
                                                    +--------+

A lower power consumption isn't necessarily better; it depends on what is done
with that energy. Here are tables with the ratio of performance-per-watt on
each machine and benchmark. Higher is always better; a tilde (~) means a
neutral ratio (i.e. 1.00).

80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
                                     +------+
             I_PSTATE     1C     3C  |   4C |    8C
pgbench-ro       1.04   1.06   0.94  | 1.07 |  1.08
pgbench-rw       1.10   0.97   0.96  | 0.96 |  0.97
dbench4          1.24   0.94   0.95  | 0.94 |  0.92
netperf-udp      ~      1.02   1.02  | ~    |  1.02
netperf-tcp      ~      1.02   ~     | ~    |  1.02
tbench4          1.26   1.10   1.06  | 1.12 |  1.26
kernbench        0.98   0.97   0.97  | 0.97 |  0.98
gitsource        ~      1.11   1.11  | 1.11 |  1.13
                                     +------+

8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
                                     +------+
         I_PSTATE/HWP     1C     3C  |   4C |
pgbench-ro       ~      ~      ~     | ~    |
pgbench-rw       0.95   0.97   0.96  | 0.96 |
dbench4          ~      ~      ~     | ~    |
netperf-udp      ~      ~      ~     | ~    |
netperf-tcp      ~      ~      ~     | ~    |
tbench4          1.17   1.09   1.08  | 1.10 |
kernbench        ~      ~      ~     | ~    |
gitsource        1.06   1.40   1.40  | 1.40 |
                                     +------+

48x-HASWELL-NUMA  (performance-per-watt ratios; higher is better)
                                     +------+
             I_PSTATE     1C     3C  |   4C |   12C
pgbench-ro       1.09   ~      1.09  | 1.03 |  1.11
pgbench-rw       ~      0.86   ~     | ~    |  0.86
dbench4          ~      1.02   1.02  | 1.02 |  ~
netperf-udp      ~      0.97   1.03  | 1.02 |  ~
netperf-tcp      0.96   ~      ~     | ~    |  ~
tbench4          1.24   ~      1.06  | 1.05 |  1.11
kernbench        0.97   0.97   0.98  | 0.97 |  0.96
gitsource        1.03   1.33   1.32  | 1.32 |  1.33
                                     +------+

These results are overall pleasing: in plenty of cases we observe
performance-per-watt improvements. The few regressions (read/write pgbench and
dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
percentage points (it has a 10-15% performance improvement, but apparently the
increase in power consumption is larger than that). tbench4 and gitsource, which
benefit the most from the patch, keep a positive score in this table which is
a welcome surprise; that suggests that in those particular workloads the
non-invariant schedutil (and active intel_pstate, too) makes some rather
suboptimal frequency selections.
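
The performance-per-watt ratios can be approximately reproduced from the
performance scores and the power tables. The formula below (performance
ratio divided by power ratio, with time-based scores inverted first) is an
assumption inferred from the numbers, not something spelled out in the text;
for the 80x-BROADWELL-NUMA rows used here it agrees with the table to within
rounding.

```python
def perf_per_watt_ratio(perf_ratio, watts_new, watts_baseline):
    """(performance gain) / (power increase), both relative to baseline."""
    return perf_ratio / (watts_new / watts_baseline)

# 80x-BROADWELL-NUMA, 4C-turbo column. Time-based scores (lower is better)
# are inverted so that higher always means better.
gitsource = perf_per_watt_ratio(1 / 0.49, 75.40, 40.92)
kernbench = perf_per_watt_ratio(1 / 0.83, 115.52, 92.94)
tbench4 = perf_per_watt_ratio(1.30, 165.58, 142.32)
print(f"gitsource {gitsource:.2f}  kernbench {kernbench:.2f}  tbench4 {tbench4:.2f}")
```

The kernbench row shows how a sizeable speedup can still yield a ratio below
1.00: the average power draw grows by a larger factor than the performance.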

+-------------------------------------------------------------------------+
| 6. MICROARCH'ES ADDRESSED HERE
+-------------------------------------------------------------------------+

The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
respectively. This excludes the recent Xeon Scalable Performance processors
line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.

Subsequent patches will address:

* Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
* Xeon Phi (Knights Landing, Knights Mill)
* Atom Silvermont
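
For the processors this patch does address, decoding the two MSRs might look
like the sketch below. The bit layout is an assumption based on how these
registers are commonly documented for pre-Scalable Core parts (max non-turbo
ratio in bits 15:8 of MSR_PLATFORM_INFO, one byte per active-core count in
MSR_TURBO_RATIO_LIMIT starting with 1C in bits 7:0), as is the 100 MHz bus
clock. The function names are hypothetical and the register values are
synthetic, built from the 8x-SKYLAKE-UMA table in section 5.1.

```python
SCHED_CAPACITY_SCALE = 1024
BUS_CLOCK_MHZ = 100  # assumed bus clock for these parts

def base_ratio(platform_info):
    """Max non-turbo ratio: bits 15:8 of MSR_PLATFORM_INFO (assumed layout)."""
    return (platform_info >> 8) & 0xFF

def turbo_ratio(turbo_ratio_limit, active_cores):
    """Turbo ratio with N active cores: one byte per core count, 1C in bits 7:0."""
    return (turbo_ratio_limit >> (8 * (active_cores - 1))) & 0xFF

def max_freq_scale(platform_info, turbo_ratio_limit, active_cores=4):
    """turbo / base in SCHED_CAPACITY_SCALE units; 1024 if the MSRs look unusable."""
    base = base_ratio(platform_info)
    turbo = turbo_ratio(turbo_ratio_limit, active_cores)
    if base == 0 or turbo == 0:
        return SCHED_CAPACITY_SCALE
    return turbo * SCHED_CAPACITY_SCALE // base

# Synthetic values (not a real register dump): base 3500 MHz and
# 1C/2C/3C/4C turbo = 3900/3900/3800/3700 MHz.
platform_info = 35 << 8
turbo_ratio_limit = 39 | (39 << 8) | (38 << 16) | (37 << 24)
print(turbo_ratio(turbo_ratio_limit, 4) * BUS_CLOCK_MHZ)  # 4C turbo in MHz
print(max_freq_scale(platform_info, turbo_ratio_limit))
```

Falling back to 1024 when a ratio reads as zero mirrors the default described
in section 2.2, where freq_max is effectively the base frequency.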

+-------------------------------------------------------------------------+
| 7. REFERENCES
+-------------------------------------------------------------------------+

Tests have been run with the help of the MMTests performance testing
framework, see github.com/gormanm/mmtests. The configuration file names for
the benchmarks used are:

    db-pgbench-timed-ro-small-xfs
    db-pgbench-timed-rw-small-xfs
    io-dbench4-async-xfs
    network-netperf-unbound
    network-tbench
    scheduler-unbound
    workload-kerndevel-xfs
    workload-shellscripts-xfs
    hpc-nas-c-class-mpi-full-xfs
    hpc-nas-c-class-omp-full

All those benchmarks are generally available on the web:

pgbench: https://www.postgresql.org/docs/10/pgbench.html
netperf: https://hewlettpackard.github.io/netperf/
dbench/tbench: https://dbench.samba.org/
gitsource: git unit test suite, github.com/git/git
NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 arch/x86/include/asm/topology.h |  29 +++++++
 arch/x86/kernel/smpboot.c       | 180 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c             |   1 +
 kernel/sched/sched.h            |   7 ++
 4 files changed, 216 insertions(+), 1 deletion(-)

Comments

Doug Smythies Sept. 11, 2019, 3:28 p.m. UTC | #1
Hi Giovanni,

Thank you for the great detail and test results you provided.

On 2019.09.08.07:42 Giovanni Gherdovich wrote:

... [snip]...

> The test we call "gitsource" (running the git unit test suite, a long-running
> single-threaded shell script) appears rather spectacular in this table (gains
> of 30-50% depending on the machine). It is to be noted, however, that
> gitsource has no adjustable parameters (such as the number of jobs in
> kernbench, which we average over in order to get a single-number summary
> score) and is exactly the kind of low-parallelism workload that benefits the
> most from this patch. When looking at the detailed tables of kernbench or
> tbench4, at low process or client counts one can see similar numbers.

I think the "gitsource" test is the one I learned about here two years
ago [1]. It is an extremely good (best I know of) example of single-threaded,
high PID consumption (about 400 / second average on my computer [3])
performance issues on a multi-CPU computer, i.e., this:

Dountil the list of tasks is finished:
   Start the next task in the list of stuff to do.
   Wait for it to finish
Enduntil
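
A minimal C sketch of that pattern, for illustration only: the function name
run_serialized is made up, and each child simply exits where a real test
would exec one step of the workload.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run `ntasks` short-lived children strictly one after another and
 * return how many of them exited with status 0.
 */
int run_serialized(int ntasks)
{
    int ok = 0;

    assert(ntasks >= 0);
    for (int i = 0; i < ntasks; i++) {
        pid_t pid = fork();             /* start the next task */

        if (pid < 0)
            break;                      /* fork failed: stop early */
        if (pid == 0)
            _exit(0);                   /* child: placeholder for real work */

        int status;
        /* wait for it to finish before starting another one */
        if (waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return ok;
}
```

At any instant at most one child is runnable, so the scheduler has to pick
a CPU for a brand-new task on every iteration.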

The problem with the test is its run-to-run variability, which came from
all the disk I/O, as far as I could determine. At the time,
I studied this to death [2], and made a more repeatable test, without
any disk I/O.

While the challenges with this workflow have tended to be focused
on the CPU frequency scaling driver, I have always considered
the root issue here to be a scheduling issue. Excerpt from my notes
[2]:

> The issue is that performance is much much better if the system is
> forced to use only 1 CPU rather than relying on the defaults where
> the CPU scheduler decides what to do.
> The scheduler seems to not realize that the current CPU has just
> become free, and assigns the new task to a new CPU. Thus the load
> on any one CPU is so low that it doesn't ramp up the CPU frequency.
> It would be better if somehow the scheduler knew that the current
> active CPU was now able to take on the new task, overall resulting
> on one fully loaded CPU at the highest CPU frequency.

I do not know if such is practical, and I didn't re-visit the issue.

Anyway these are my results:

Kernel: 5.3-rc8 + these patches
Processor: i7-2600K

This is important, at least for the performance governor numbers:

cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526
35 * 100.0 = 3500.0 MHz max turbo 4 active cores
36 * 100.0 = 3600.0 MHz max turbo 3 active cores
37 * 100.0 = 3700.0 MHz max turbo 2 active cores
38 * 100.0 = 3800.0 MHz max turbo 1 active cores
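
The decoding above can be sketched in C as follows. This is only a sketch
under the assumption that the MSR holds one 8-bit ratio per active-core
count, lowest byte first, which is what the turbostat lines above show for
this CPU; other microarchitectures lay the register out differently. The
function name turbo_mhz and the bus clock parameter are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Max turbo frequency in MHz with `active_cores` cores active.
 * Assumes byte N-1 of the MSR value is the ratio for N active cores
 * and that frequency = ratio * bus clock.
 */
unsigned int turbo_mhz(uint64_t msr, unsigned int active_cores,
                       unsigned int bclk_mhz)
{
    assert(active_cores >= 1 && active_cores <= 8);

    unsigned int ratio = (msr >> (8 * (active_cores - 1))) & 0xff;

    return ratio * bclk_mhz;
}
```

With the value above and a 100 MHz bus clock, turbo_mhz(0x23242526, 4, 100)
is 3500 and turbo_mhz(0x23242526, 1, 100) is 3800.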

The reference against which all other results are compared
is the forced CPU affinity test run, i.e.:

taskset -c 3 test_script

Mode          Governor                degradation     Power           Bzy_MHz
Reference     perf 1 CPU              1.00            reference       3798
-             performance             1.2             6% worse        3618
passive       ondemand                2.3
active        powersave               2.6
passive       schedutil               2.7                             1600
passive       schedutil-4C            1.68                            2515

Where degradation ratio is the time to execute / the reference time for
the same conditions. The test runs over a wide range of processes per
second, and the worst ratio has been selected for the above table.
I have yet to write up this experiment, but the graphs that will
eventually be used are at [4] and [5] (same data presented two
different ways).

The energy for the performance cases is worth more detail, as it
is being wasted with CPUs waking up and going to sleep, and can be
observed in the IRQ column of turbostat output:

$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 60
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
12.52   3798    81407   49      22.17   0.12 <<< Forced to CPU 3 only
12.52   3798    81139   51      22.18   0.12
12.52   3798    81036   51      22.20   0.12
11.43   3704    267644  48      21.16   0.12 <<< Change over
12.56   3618    490994  48      23.43   0.12 <<< Let the scheduler decide
12.56   3620    491336  47      23.50   0.12
12.56   3619    491607  47      23.50   0.12
12.56   3619    491512  48      23.52   0.12
12.56   3619    490806  47      23.51   0.12
12.56   3618    491356  49      23.48   0.12
12.56   3618    491035  48      23.51   0.12
12.56   3618    491121  48      23.46   0.12

Note also the busy megahertz column, where other active cores
(constantly waking and sleeping as we rotate through which
CPUs are used) are limiting the highest frequency.
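
For readers unfamiliar with those two columns, here is a hedged C sketch of
how they relate to the hardware counters, using the formulas from the patch
changelog: Bzy_MHz is delta_APERF / delta_MPERF * freq_base and Busy% is
delta_MPERF / delta_TSC. The function names are made up and the code
ignores overflow for very long sampling intervals.

```c
#include <assert.h>
#include <stdint.h>

/* Average frequency while not idle, in MHz. */
uint64_t busy_mhz(uint64_t d_aperf, uint64_t d_mperf, uint64_t base_mhz)
{
    assert(d_mperf > 0);
    return d_aperf * base_mhz / d_mperf;
}

/* Share of the interval spent not idle, in hundredths of a percent. */
uint64_t busy_centipercent(uint64_t d_mperf, uint64_t d_tsc)
{
    assert(d_tsc > 0);
    return d_mperf * 10000 / d_tsc;
}
```

With made-up deltas, busy_mhz(1100, 1000, 3400) is 3740 and
busy_centipercent(1252, 10000) is 1252, i.e. 12.52%.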

... Doug

[1] https://marc.info/?l=linux-kernel&m=149181369622980&w=2
[2] http://www.smythies.com/~doug/linux/single-threaded/index.html
[3] http://www.smythies.com/~doug/linux/single-threaded/pids_per_second2.png
[4] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps.png
[5] http://www.smythies.com/~doug/linux/single-threaded/gg-loops.png
Doug Smythies Sept. 13, 2019, 8:58 p.m. UTC | #2
On 2019.09.11 08:28 Doug Smythies wrote:

> Hi Giovanni,
>
> Thank you for the great detail and test results you provided.
>
> On 2019.09.08.07:42 Giovanni Gherdovich wrote:
>
> ... [snip]...
>
>> The test we call "gitsource" (running the git unit test suite, a long-running
>> single-threaded shell script) appears rather spectacular in this table (gains
>> of 30-50% depending on the machine). It is to be noted, however, that
>> gitsource has no adjustable parameters (such as the number of jobs in
>> kernbench, which we average over in order to get a single-number summary
>> score) and is exactly the kind of low-parallelism workload that benefits the
>> most from this patch. When looking at the detailed tables of kernbench or
>> tbench4, at low process or client counts one can see similar numbers.
>
> I think the "gitsource" test, is the one I learned about here two years
> ago, [1]. It is an extremely good (best I know of) example of single
> threaded, high PID consumption (about 400 / second average, my computer
> [3]), performance issues on a multi CPU computer. I.E., this:
>
> Dountil the list of tasks is finished:
>    Start the next task in the list of stuff to do.
>    Wait for it to finish
> Enduntil
>
> The problem with the test is its run to run variability, which was from
> all the disk I/O, as far as I could determine.

I forgot, also some memory caching. I always toss out the first test,
then do it 5 more times. If I do not do much stuff with my hard disk
in between tests, it is repeatable enough.

I did the "make test" method and, presenting the numbers your way,
got that 4C took 0.69 times as long as the unpatched schedutil.
Your numbers were same or better (copied below, lower is better):
80x-BROADWELL-NUMA:     0.49
8x-SKYLAKE-UMA:         0.55
48x-HASWELL-NUMA:       0.69

> At the time,
> I studied this to death [2], and made a more repeatable test, without
> any disk I/O.
>
> While the challenges with this work flow have tended to be focused
> on the CPU frequency scaling driver, I have always considered
> the root issue here to be a scheduling issue. Excerpt from my notes
> [2]:
>
>> The issue is that performance is much much better if the system is
>> forced to use only 1 CPU rather than relying on the defaults where
>> the CPU scheduler decides what to do.
>> The scheduler seems to not realize that the current CPU has just
>> become free, and assigns the new task to a new CPU. Thus the load
>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
>> It would be better if somehow the scheduler knew that the current
>> active CPU was now able to take on the new task, overall resulting
>> on one fully loaded CPU at the highest CPU frequency.
>
> I do not know if such is practical, and I didn't re-visit the issue.
>
> Anyway these are my results:
>
> Kernel: 5.3-rc8 and + these patches
> Processor: i7-2600K
>
> This is important, at least for the performance governor numbers:
>
> cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526
> 35 * 100.0 = 3500.0 MHz max turbo 4 active cores
> 36 * 100.0 = 3600.0 MHz max turbo 3 active cores
> 37 * 100.0 = 3700.0 MHz max turbo 2 active cores
> 38 * 100.0 = 3800.0 MHz max turbo 1 active cores
>
> For reference against which all other results are compared
> is the forced CPU affinity test run. i.e.:
>
> taskset -c 3 test_script.
>
> Mode		Governor		degradation	Power		Bzy_MHz
> Reference	perf 1 CPU		1.00		reference	3798
> -		performance 	1.2		6% worse	3618
> passive	ondemand		2.3
> active	powersave		2.6
> passive	schedutil		2.7				1600
> passive	schedutil-4C	1.68				2515
>
> Where degradation ratio is the time to execute / the reference time for
> the same conditions. The test runs over a wide range of processes per
> second, and the worst ratio has been selected for the above table.
> I have yet to write up this experiment, but the graphs that will
> eventually be used are at [4] and [5] (same data presented two
> different ways).

The experiment write up is at [6], however I wanted more data
from the lower tasks per second region, and so I re-did it, [7].
In the limit as sequential tasks per second goes to 0, the
differences should diminish and I wanted to clearly observe this.

Excerpt:
> Conclusion: the schedutil governor improves from the worst 
> governor to (mostly) second only to the performance governor
> for unforced CPU affinity execution.

> The energy for the performance cases is worth more detail, as it
> is being wasted with CPUs waking up and going to sleep, and can be
> observed in the IRQ column of turbostat output:
>
> $ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,GFXWatt,IRQ --interval 60
> Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt GFXWatt
> 12.52   3798    81407   49      22.17   0.12 <<< Forced to CPU 3 only
> 12.52   3798    81139   51      22.18   0.12
> 12.52   3798    81036   51      22.20   0.12
> 11.43   3704    267644  48      21.16   0.12 <<< Change over
> 12.56   3618    490994  48      23.43   0.12 <<< Let the scheduler decide
> 12.56   3620    491336  47      23.50   0.12
> 12.56   3619    491607  47      23.50   0.12
> 12.56   3619    491512  48      23.52   0.12
> 12.56   3619    490806  47      23.51   0.12
> 12.56   3618    491356  49      23.48   0.12
> 12.56   3618    491035  48      23.51   0.12
> 12.56   3618    491121  48      23.46   0.12
> 
> Note also the busy megahertz column, where other active cores
> (constantly waking and sleeping as we rotate through which
> CPUs are used) are limiting the highest frequency.

I looked at the power and idle statistics for this forced versus
unforced CPU affinity scenario in more detail [8]
(which isn't really this patch set's concern).

Just an additional note:

>> +-------------------------------------------------------------------------+
>> | 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
>> +-------------------------------------------------------------------------+
>> 
>> Once an architecture implements a frequency scale-invariant utilization (the
>> PELT signal util_avg), schedutil switches its frequency selection formula from
>>
>>    freq_next = 1.25 * freq_curr * util            [non-invariant util signal]
>>
>> to
>>
>>    freq_next = 1.25 * freq_max * util             [invariant util signal]
>>
>> where, in the second formula, freq_max is set to the 1C turbo frequency (max
>> turbo). The advantage of the second formula, whose usage we unlock with this
>> patch, is that freq_next doesn't depend on the current frequency in an
>> iterative fashion, but can jump to any frequency in a single update. This
>> absence of feedback in the formula makes it quicker to react to utilization
>> changes and more robust against pathological instabilities.
>>
>> Compare it to the update formula of intel_pstate/powersave:
>>
>>    freq_next = 1.25 * freq_max * Busy%
>>
>> where again freq_max is 1C turbo and Busy% is the percentage of time not spent
>> idling (calculated with delta_MPERF / delta_TSC);

Note that the delta_MPERF / delta_TSC method includes idle state 0 and the old
method of utilization does not (at least not the last time I investigated, which
was a while ago, and I cannot find my notes).
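
As an aside, the two schedutil formulas quoted above differ only in the
reference frequency. A small C sketch, assuming util is expressed on the
0..1024 scale (SCHED_CAPACITY_SCALE) rather than as a fraction; the
function name next_freq is illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/*
 * freq_next = 1.25 * freq_ref * util, with util in [0, 1024].
 * Pass freq_curr as freq_ref for the non-invariant formula,
 * freq_max for the invariant one.
 */
uint64_t next_freq(uint64_t freq_ref_khz, uint64_t util)
{
    assert(util <= SCHED_CAPACITY_SCALE);
    return (freq_ref_khz + (freq_ref_khz >> 2)) * util / SCHED_CAPACITY_SCALE;
}
```

For a half-utilized CPU (util = 512) sitting at 1600 MHz on a 3800 MHz
part, the non-invariant form asks for next_freq(1600000, 512) = 1000000 kHz,
while the invariant form asks for next_freq(3800000, 512) = 2375000 kHz in
one step.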

... Doug

> [1] https://marc.info/?l=linux-kernel&m=149181369622980&w=2
> [2] http://www.smythies.com/~doug/linux/single-threaded/index.html
> [3] http://www.smythies.com/~doug/linux/single-threaded/pids_per_second2.png
> [4] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps.png
> [5] http://www.smythies.com/~doug/linux/single-threaded/gg-loops.png

[6] http://www.smythies.com/~doug/linux/single-threaded/k53rc8gg.html
[7] http://www.smythies.com/~doug/linux/single-threaded/k53rc8gg2.html
[8] http://www.smythies.com/~doug/linux/single-threaded/idle01/index.html
Srinivas Pandruvada Sept. 13, 2019, 10:52 p.m. UTC | #3
On Mon, 2019-09-09 at 04:42 +0200, Giovanni Gherdovich wrote:

...

> +
> +/*
> + * APERF/MPERF frequency ratio computation.
> + *
> + * The scheduler wants to do frequency invariant accounting and needs a <1
> + * ratio to account for the 'current' frequency, corresponding to
> + * freq_curr / freq_max.

I thought this was no longer a restriction, and that Vincent did some work
to remove it.


Thanks,
Srinivas
Quentin Perret Sept. 14, 2019, 10:57 a.m. UTC | #4
Hi Giovanni

On Monday 09 Sep 2019 at 04:42:15 (+0200), Giovanni Gherdovich wrote:
> +static inline long arch_scale_freq_capacity(int cpu)
> +{
> +	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> +		return per_cpu(arch_cpu_freq, cpu);

So, if this is conditional, perhaps you could also add this check in an
x86-specific implementation of arch_scale_freq_invariant() ? That would
guide sugov in the right path (see get_next_freq()) if APERF/MPERF are
unavailable.

> +	return 1024 /* SCHED_CAPACITY_SCALE */;
> +}

Thanks,
Quentin
Giovanni Gherdovich Sept. 17, 2019, 2:25 p.m. UTC | #5
Hello Doug,

thanks for testing as usual, having some review on the experimental results is
really helpful. Sorry for the late reply as I'm traveling at the moment.

You raise really good points regarding pinning workloads on cpus, my comments
below.

On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
> [...]
>
> I think the "gitsource" test, is the one I learned about here two years
> ago, [1]. It is an extremely good (best I know of) example of single
> threaded, high PID consumption (about 400 / second average, my computer
> [3]), performance issues on a multi CPU computer. I.E., this:
> 
> Dountil the list of tasks is finished:
>    Start the next task in the list of stuff to do.
>    Wait for it to finish
> Enduntil
>

Yes, that's the one.

> The problem with the test is its run to run variability, which was from
> all the disk I/O, as far as I could determine. At the time,
> I studied this to death [2], and made a more repeatable test, without
> any disk I/O.
> 
> While the challenges with this work flow have tended to be focused
> on the CPU frequency scaling driver, I have always considered
> the root issue here to be a scheduling issue. Excerpt from my notes
> [2]:
> 
> > The issue is that performance is much much better if the system is
> > forced to use only 1 CPU rather than relying on the defaults where
> > the CPU scheduler decides what to do.
> > The scheduler seems to not realize that the current CPU has just
> > become free, and assigns the new task to a new CPU. Thus the load
> > on any one CPU is so low that it doesn't ramp up the CPU frequency.
> > It would be better if somehow the scheduler knew that the current
> > active CPU was now able to take on the new task, overall resulting
> > on one fully loaded CPU at the highest CPU frequency.
> 
> I do not know if such is practical, and I didn't re-visit the issue.
>

You're absolutely right, pinning a serialized, fork-intensive workload such as
gitsource gives you as good a performance as you can get, because it takes
the scheduler out of the picture.

So one might be tempted to flag this test as non-representative of a
real-world scenario; the reasons we keep looking at it are:

1. pinning may not always be practical, as you mention
2. it's an adversarial, worst-case sort of test for some scheduler code paths

Experience with enterprise use cases shows that pinning (as with 'taskset') is
done on a case-by-case basis, requires a little more cognitive load (you have
to know the workload in depth, profile it, write ad-hoc scripts to do the
pinning or modify the code of your software etc). In the case of "personal
computing" one hardly bothers about pinning at all.

You definitely want to try those things for the software that runs the core of
your business (say, sometimes the database server), but there is a ton of
ancillary infrastructure out there which is implemented in shell scripts
because it does the job just fine, and it doesn't harm if that goes a little
faster.

The unbound workload (no cpu pinning) will always perform worse than the bound
scenario, simply because the scheduler can't know the future, and it's a good
upper limit to keep in mind when evaluating these results. When a task is
freshly forked the schedutil governor can evaluate its compute need only by
the initialization value of the 'util' PELT signal (see "Per-entity load
tracking" at [LWN-1]); when a task is migrated from a CPU to another its
utilization score is transferred accordingly, so the accrued amount isn't lost
(see again PELT and also "Toward better CPU load estimation" at [LWN-2]).
These are active development areas in the scheduler, and gitsource (as well as
other tests) gives an idea of the progress made so far.

[LWN-1] Per-entity load tracking, https://lwn.net/Articles/531853/
[LWN-2] Toward better CPU load estimation, https://lwn.net/Articles/741171/

> Anyway these are my results:
> 
> Kernel: 5.3-rc8 and + these patches
> Processor: i7-2600K
> 
> This is important, at least for the performance governor numbers:
> 
> cpu6: MSR_TURBO_RATIO_LIMIT: 0x23242526
> 35 * 100.0 = 3500.0 MHz max turbo 4 active cores
> 36 * 100.0 = 3600.0 MHz max turbo 3 active cores
> 37 * 100.0 = 3700.0 MHz max turbo 2 active cores
> 38 * 100.0 = 3800.0 MHz max turbo 1 active cores
> 
> For reference against which all other results are compared
> is the forced CPU affinity test run. i.e.:
> 
> taskset -c 3 test_script.
> 
> Mode          Governor                degradation     Power           Bzy_MHz
> Reference     perf 1 CPU              1.00            reference       3798
> -             performance             1.2             6% worse        3618
> passive       ondemand                2.3
> active        powersave               2.6
> passive       schedutil               2.7                             1600
> passive       schedutil-4C            1.68                            2515
> 
> Where degradation ratio is the time to execute / the reference time for
> the same conditions. The test runs over a wide range of processes per
> second, and the worst ratio has been selected for the above table.
> I have yet to write up this experiment, but the graphs that will
> eventually be used are at [4] and [5] (same data presented two
> different ways).

Your table is interesting; I'd say that the one to beat there (from the
schedutil point of view) is intel_pstate(active)/performance. I'm slightly
surprised that intel_pstate(passive)/ondemand is worse than
intel_pstate(active)/powersave, I'd have guessed the other way around but it's
also true that the latter lost some grip on iowait_boost in one of the recent
dev cycles.

> 
> I did the "make test" method and, presenting the numbers your way,
> got that 4C took 0.69 times as long as the unpatched schedutil.
> Your numbers were same or better (copied below, lower is better):
> 80x-BROADWELL-NUMA:   0.49
> 8x-SKYLAKE-UMA:               0.55
> 48x-HASWELL-NUMA:             0.69
> 

I think your 0.69 and my three values tell the same story: schedutil really
needs to use the frequency invariant formula otherwise it's out of the
race. Enabling scale-invariance gives multiple tens of percentage points of
advantage.

Now, is it 0.69 or 0.49? There are many factors to it; that's why I'm happy I
can test on multiple machines and get a somewhat more varied picture.

Also, didn't you mention you made several runs and selected the worst one for
the final score? I was less adventurous and took the average of 5 runs for my
gitsource executions :) that might contribute to a slightly higher final mark.

> > > 
> > > Compare it to the update formula of intel_pstate/powersave:
> > > 
> > >    freq_next = 1.25 * freq_max * Busy%
> > > 
> > > where again freq_max is 1C turbo and Busy% is the percentage of time not spent
> > > idling (calculated with delta_MPERF / delta_TSC);
> 
> Note that the delta_MPERF / delta_TSC method includes idle state 0 and the old
> method of utilization does not (at least not last time I investigated, which was
> awhile ago (and I can not find my notes)).

I think that depends on whether or not TSC stops at idle. As I understand from
the Intel Software Developer's Manual (SDM), a TSC that stops at idle is called
"invariant TSC", and makes delta_MPERF / delta_TSC interesting. Otherwise the
two counters behave exactly the same and the ratio is always 1, modulo the
delays in actually reading the two values. But all I know comes from
turbostat's man page and the SDM, so don't quote me on that :)


Thanks,
Giovanni
Giovanni Gherdovich Sept. 17, 2019, 2:27 p.m. UTC | #6
Hello Srinivas,

On Fri, 2019-09-13 at 15:52 -0700, Srinivas Pandruvada wrote:
> On Mon, 2019-09-09 at 04:42 +0200, Giovanni Gherdovich wrote:
> 
> ...
> 
> > +
> > +/*
> > + * APERF/MPERF frequency ratio computation.
> > + *
> > + * The scheduler wants to do frequency invariant accounting and needs a <1
> > + * ratio to account for the 'current' frequency, corresponding to
> > + * freq_curr / freq_max.
> 
> I thought this is no longer the restriction and Vincent did some work
> to remove this restriction. 

If you're referring to the patch

  23127296889f "sched/fair: Update scale invariance of PELT"

merged in v5.2, I'm familiar with that and from my understanding you still
want a <1 scaling factor. This is my recollection of the patch:

Vincent was studying some synthetic traces and realized that util_avg reported
by PELT didn't quite match the result you'd get computing the formula with pen
and paper (theoretical value). To address this he changed where the scaling
factor is applied in the PELT formula.

At some point when accumulating the PELT sums, you'll have to measure the time
'delta' since you last updated PELT. What we have after Vincent's change is
that this time length 'delta' gets itself scaled by the freq_curr/freq_max
ratio:

    delta = time since last PELT update
    delta *= freq_percent

In this way time goes at "wall clock speed" only when you're running at max
capacity, and goes "slower" (from the PELT point of view) if we're running at
a lower frequency. I don't think Vincent had in mind a faster-than-wall-clock
PELT time (which you'd get w/ freq_percent>1).
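
To make the scaling concrete, here is a small C sketch of the idea. It is
an illustration, not the kernel's code: the names freq_scale and
scale_delta are made up, the ratio is kept as a fixed-point value out of
1024 (SCHED_CAPACITY_SCALE), and it is clipped at 1024 to mirror the
clipping done by this patch.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1ULL << SCHED_CAPACITY_SHIFT)

/* freq_curr / freq_max as a fixed-point ratio, clipped to 1024. */
uint64_t freq_scale(uint64_t freq_curr, uint64_t freq_max)
{
    assert(freq_max > 0);

    uint64_t scale = (freq_curr << SCHED_CAPACITY_SHIFT) / freq_max;

    return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
}

/* Scale elapsed time so that PELT time runs slower below freq_max. */
uint64_t scale_delta(uint64_t delta_ns, uint64_t scale)
{
    return (delta_ns * scale) >> SCHED_CAPACITY_SHIFT;
}
```

At half the maximum frequency, freq_scale(1900, 3800) is 512 and
scale_delta(1000, 512) is 500: one microsecond of wall clock counts as
half a microsecond of PELT time. With freq_curr above freq_max, as in
freq_scale(3800, 3500), the ratio is clipped to 1024 and PELT time never
runs faster than wall clock.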

Speaking of which, Srinivas, do you have any opinion and/or requirement about
this? I vaguely remember Peter Zijlstra saying (more than a year ago, now)
that you would like an unclipped freq_curr/freq_max ratio, and may not be
happy with this patch clipping it to 1 when freq_curr > 4_cores_turbo. If
that's the case, could you elaborate on this?
Ignore that if it doesn't make sense, I may be mis-remembering.


Thanks,
Giovanni
Giovanni Gherdovich Sept. 17, 2019, 2:27 p.m. UTC | #7
Hello Quentin,

On Sat, 2019-09-14 at 12:57 +0200, Quentin Perret wrote:
> Hi Giovanni
> 
> On Monday 09 Sep 2019 at 04:42:15 (+0200), Giovanni Gherdovich wrote:
> > +static inline long arch_scale_freq_capacity(int cpu)
> > +{
> > +	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> > +		return per_cpu(arch_cpu_freq, cpu);
> 
> So, if this is conditional, perhaps you could also add this check in an
> x86-specific implementation of arch_scale_freq_invariant() ? That would
> guide sugov in the right path (see get_next_freq()) if APERF/MPERF are
> unavailable.
> 
> > +	return 1024 /* SCHED_CAPACITY_SCALE */;
> > +}
>

Good remark. If the cpu doesn't have APERF/MPERF, the choice here is that
freq_curr is constantly equal to freq_max, and the scaling factor is 1 all the
time.

But I'm checking this static_cpu_has() every time I do a frequency update;
arguably schedutil should be smarter and settle such a case once and for all
at boot time.

I'll check what's the cost of static_cpu_has() and if it's non-negligible I'll
do what you suggest (x86-specific version of arch_scale_freq_invariant()).


Giovanni
Quentin Perret Sept. 17, 2019, 2:39 p.m. UTC | #8
On Tuesday 17 Sep 2019 at 16:27:46 (+0200), Giovanni Gherdovich wrote:
> I'll check what's the cost of static_cpu_has() and if it's non-negligible I'll
> do what you suggest (x86-specific version of arch_scale_freq_invariant()).

In case this is indeed expensive to check, you could always add a static
key, set at boot time, to optimize things a bit ... That might be worth
it since this is called in latency-sensitive paths of the scheduler.
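
A rough userspace sketch of the idea, not kernel code: probe the feature
once, cache the answer, and test only the cached flag in the hot path. In
the kernel the cached flag would be a static key (DEFINE_STATIC_KEY_FALSE(),
static_branch_enable(), static_branch_likely()), which patches the branch
in place; the plain bool and every name ending in _sketch below are
stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024L

/* Stand-in for a static key, written once at init time. */
static bool freq_invariance_enabled;
/* Stand-in for the per-CPU arch_cpu_freq value of the patch. */
static long cpu_freq_scale = SCHED_CAPACITY_SCALE;

/* Called once at boot with the result of the APERF/MPERF feature check. */
void init_freq_invariance_sketch(bool has_aperfmperf)
{
    freq_invariance_enabled = has_aperfmperf;
}

/* Called whenever a new freq_curr / freq_max ratio has been computed. */
void set_cpu_freq_scale_sketch(long scale)
{
    assert(scale > 0 && scale <= SCHED_CAPACITY_SCALE);
    cpu_freq_scale = scale;
}

/* Tells the governor whether the invariant formula may be used. */
bool arch_scale_freq_invariant_sketch(void)
{
    return freq_invariance_enabled;
}

/* Hot path: no feature probing, only the cached flag. */
long arch_scale_freq_capacity_sketch(void)
{
    if (freq_invariance_enabled)
        return cpu_freq_scale;
    return SCHED_CAPACITY_SCALE;
}
```

With the flag off, the capacity stays at 1024 and the governor is told the
signal is not invariant, so both sides agree on which formula applies.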

Thanks,
Quentin
Vincent Guittot Sept. 17, 2019, 3:55 p.m. UTC | #9
On Tue, 17 Sep 2019 at 16:21, Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
>
> Hello Srinivas,
>
> On Fri, 2019-09-13 at 15:52 -0700, Srinivas Pandruvada wrote:
> > On Mon, 2019-09-09 at 04:42 +0200, Giovanni Gherdovich wrote:
> >
> > ...
> >
> > > +
> > > +/*
> > > + * APERF/MPERF frequency ratio computation.
> > > + *
> > > + * The scheduler wants to do frequency invariant accounting and needs a <1
> > > + * ratio to account for the 'current' frequency, corresponding to
> > > + * freq_curr / freq_max.
> >
> > I thought this is no longer the restriction and Vincent did some work
> > to remove this restriction.
>
> If you're referring to the patch
>
>   23127296889f "sched/fair: Update scale invariance of PELT"
>
> merged in v5.2, I'm familiar with that and from my understanding you still
> want a <1 scaling factor. This is my recalling of the patch:
>
> Vincent was studying some synthetic traces and realized that util_avg reported
> by PELT didn't quite match the result you'd get computing the formula with pen
> and paper (theoretical value). To address this he changed where the scaling
> factor is applied in the PELT formula.
>
> At some point when accumulating the PELT sums, you'll have to measure the time
> 'delta' since you last updated PELT. What we have after Vincent's change is
> that this time length 'delta' gets itself scaled by the freq_curr/freq_max
> ratio:
>
>     delta = time since last PELT update
>     delta *= freq_percent
>
> In this way time goes at "wall clock speed" only when you're running at max
> capacity, and goes "slower" (from the PELT point of view) if we're running at
> a lower frequency. I don't think Vincent had in mind a faster-than-wall-clock
> PELT time (which you'd get w/ freq_percent>1).

Yes, I haven't really planned for time going faster than wall clock, but I
don't see any problem with the algorithm if that were the case.
There would be a reduced maximum delta update of clock pelt, but that
would still be large enough.

Doug Smythies Sept. 19, 2019, 2:42 p.m. UTC | #10
Hi Giovanni,

Thank you for your detailed reply.

On 2019.09.17 07:25 Giovanni Gherdovich wrote:
>On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
> [...]

>> The problem with the test is its run to run variability, which was from
>> all the disk I/O, as far as I could determine. At the time,
>> I studied this to death [2], and made a more repeatable test, without
>> any disk I/O.
>> 
>> While the challenges with this work flow have tended to be focused
>> on the CPU frequency scaling driver, I have always considered
>> the root issue here to be a scheduling issue. Excerpt from my notes
>> [2]:
>> 
>>> The issue is that performance is much much better if the system is
>>> forced to use only 1 CPU rather than relying on the defaults where
>>> the CPU scheduler decides what to do.
>>> The scheduler seems to not realize that the current CPU has just
>>> become free, and assigns the new task to a new CPU. Thus the load
>>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
>>> It would be better if somehow the scheduler knew that the current
>>> active CPU was now able to take on the new task, overall resulting
>>> on one fully loaded CPU at the highest CPU frequency.
>> 
>> I do not know if such is practical, and I didn't re-visit the issue.
>>
>
> You're absolutely right, pinning a serialized, fork-intensive workload such as
> gitsource gives you as good of a performance as you can get, because it removes
> the scheduler out of the picture.
>
> So one might be tempted to flag this test as non-representative of a
> real-world scenario;

Disagree. I consider this test to be very representative of real-world
scenarios. However, and I do not know for certain, the relatively high
average fork rate of the gitsource "make test" is less common.

> the reasons we keep looking at it are:
> 1. pinning may not always be practical, as you mention
> 2. it's an adversarial, worst-case sort of test for some scheduler code paths

Agree.

>> For reference against which all other results are compared
>> is the forced CPU affinity test run. i.e.:
>> 
>> taskset -c 3 test_script.
>> 
>> Mode          Governor                degradation     Power           Bzy_MHz
>> Reference     perf 1 CPU              1.00            reference       3798
>> -             performance             1.2             6% worse        3618
>> passive       ondemand                2.3
>> active        powersave               2.6
>> passive       schedutil               2.7                             1600
>> passive       schedutil-4C            1.68                            2515
>> 
>> Where degradation ratio is the time to execute / the reference time for
>> the same conditions. The test runs over a wide range of processes per
>> second, and the worst ratio has been selected for the above table.
>> I have yet to write up this experiment, but the graphs that will
>> eventually be used are at [4] and [5] (same data presented two
>> different ways).
>
> Your table is interesting; I'd say that the one to beat there (from the
> schedutil point of view) is intel_pstate(active)/performance. I'm slightly
> surprised that intel_pstate(passive)/ondemand is worse than
> intel_pstate(active)/powersave, I'd have guessed the other way around but it's
> also true that the latter lost some grip on iowait_boost in one of the recent
> dev cycles.

??
intel_pstate(passive)/ondemand is better than intel_pstate(active)/powersave,
not worse, over the entire range of PIDs (forks) per second and by quite a lot.

>> I did the "make test" method and, presenting the numbers your way,
>> got that 4C took 0.69 times as long as the unpatched schedutil.
>> Your numbers were same or better (copied below, lower is better):
>> 80x-BROADWELL-NUMA:   0.49
>> 8x-SKYLAKE-UMA:               0.55
>> 48x-HASWELL-NUMA:             0.69

> I think your 0.69 and my three values tell the same story: schedutil really
> needs to use the frequency invariant formula otherwise it's out of the
> race. Enabling scale-invariance gives multiple tens of percentage points of
> advantage.

Agreed. This frequency invariant addition is great. However, if
schedutil is "out of the race" without it, as you say, then isn't
intel_pstate(passive)/ondemand out of the race also? It performs
just as poorly for this test, until very low PIDs per second.

> Now, is it 0.69 or 0.49? There are many factors to it; that's why I'm happy I
> can test on multiple machines and get a somewhat more varied picture.
>
> Also, didn't you mention you made several runs and selected the worst one for
> the final score? I was less adventurous and took the average of 5 runs for my
> gitsource executions :) that might contribute to a slightly higher final mark.

No, I did the exact same as you for the gitsource "make test" method, except
that I do 6 runs and throw out the first one and average the next 5.

Yes, I said I picked the worst ratio, but that was for my version of this test,
with the disk I/O and its related non-repeatability eliminated, only to provide
something for readers that did not want to go to my web site to look at the
related graph [1]. I'll send you the graph in a separate e-mail, in case you didn't
go to the web site.

>>>> 
>>>> Compare it to the update formula of intel_pstate/powersave:
>>> 
>>>    freq_next = 1.25 * freq_max * Busy%
>>> 
>>> where again freq_max is 1C turbo and Busy% is the percentage of time not spent
>>> idling (calculated with delta_MPERF / delta_TSC);
>> 
>> Note that the delta_MPERF / delta_TSC method includes idle state 0 and the old
>> method of utilization does not (at least not last time I investigated, which was
>> awhile ago (and I can not find my notes)).
>
> I think that depends on whether or not TSC stops at idle. As I understand from
> the Intel Software Developer manual (SDM) a TSC that stops at idle is called
> "invariant TSC", and makes delta_MPERF / delta_TSC interesting. Otherwise the
> two counters behave exactly the same and the ratio is always 1, modulo the
> delays in actually reading the two values. But all I know comes from
> turbostat's man page and the SDM, so don't quote me on that :)

I was only talking about idle state 0 (polling), where TSC does not stop.
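
For reference, the powersave formula quoted above can be sketched in
user-space C as follows. This is only an illustration under stated
assumptions: the helper names are hypothetical, the deltas are taken over one
sampling window with a TSC that keeps counting, and the cap at freq_max is an
assumption of the sketch. Since MPERF counts whenever the CPU is in C0, time
spent polling in idle state 0 ends up in Busy%.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Busy% over one sampling window: MPERF advances only while the CPU is
 * in C0 (polling idle included), TSC is assumed to advance all the time.
 * Hypothetical helper, not taken from intel_pstate.
 */
static unsigned int busy_percent(uint64_t delta_mperf, uint64_t delta_tsc)
{
	assert(delta_tsc != 0);
	return (unsigned int)(delta_mperf * 100 / delta_tsc);
}

/*
 * freq_next = 1.25 * freq_max * Busy%, with the result capped at
 * freq_max (the cap is an assumption of this sketch).
 */
static unsigned int powersave_next_khz(unsigned int freq_max_khz,
				       unsigned int busy_pct)
{
	uint64_t next = (uint64_t)freq_max_khz * busy_pct * 125 / (100 * 100);

	return next > freq_max_khz ? freq_max_khz : (unsigned int)next;
}
```

With these helpers a CPU that is 40% busy gets half of freq_max as its next
request, and anything at or above 80% busy asks for freq_max.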

By the way, I have now done some tests with this patch set and multi-threaded
stuff. Nothing to report, it all looks great.

[1] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps2.png

... Doug
Srinivas Pandruvada Sept. 19, 2019, 11:55 p.m. UTC | #11
On Tue, 2019-09-17 at 16:27 +0200, Giovanni Gherdovich wrote:
> Hello Srinivas,
> 
> On Fri, 2019-09-13 at 15:52 -0700, Srinivas Pandruvada wrote:
> > On Mon, 2019-09-09 at 04:42 +0200, Giovanni Gherdovich wrote:
> > 
> > ...
> > 
> > > +
> > > +/*
> > > + * APERF/MPERF frequency ratio computation.
> > > + *
> > > + * The scheduler wants to do frequency invariant accounting and
> > > needs a <1
> > > + * ratio to account for the 'current' frequency, corresponding
> > > to
> > > + * freq_curr / freq_max.
> > 
> > I thought this is no longer the restriction and Vincent did some
> > work
> > to remove this restriction. 
> 
> If you're referring to the patch
> 
>   23127296889f "sched/fair: Update scale invariance of PELT"
> 
> merged in v5.2, I'm familiar with that and from my understanding you
> still
> want a <1 scaling factor. This is my recollection of the patch:
> 
> Vincent was studying some synthetic traces and realized that util_avg
> reported
> by PELT didn't quite match the result you'd get computing the formula
> with pen
> and paper (theoretical value). To address this he changed where the
> scaling
> factor is applied in the PELT formula.
> 
> At some point when accumulating the PELT sums, you'll have to measure
> the time
> 'delta' since you last updated PELT. What we have after Vincent's
> change is
> that this time length 'delta' gets itself scaled by the
> freq_curr/freq_max
> ratio:
> 
>     delta = time since last PELT update
>     delta *= freq_percent
> 
> In this way time goes at "wall clock speed" only when you're running
> at max
> capacity, and goes "slower" (from the PELT point of view) if we're
> running at
> a lower frequency. I don't think Vincent had in mind a faster-than-
> wall-clock
> PELT time (which you'd get w/ freq_percent>1).
> 
> Speaking of which, Srinivas, do you have any opinion and/or
> requirement about
> this? I vaguely remember Peter Zijlstra saying (more than a year
> ago, now)
> that you would like an unclipped freq_curr/freq_max ratio, and may
> not be
> happy with this patch clipping it to 1 when freq_curr >
> 4_cores_turbo. If
> that's the case, could you elaborate on this?
> Ignore that if it doesn't make sense, I may be mis-remembering.
I was thinking of the power-efficiency use case, particularly for Atom-like
platforms; 1C max, as you observed, is more efficient.

But now the sched deadline code is using arch_scale_freq_capacity() to
calculate dl_se->runtime, where a value closer to deterministic with all
cores active may be better; that would be scaled with base_freq.

Thanks,
Srinivas
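
The time scaling described in the quoted text above (delta *= freq_percent)
can be sketched in user-space C like this. It is a minimal illustration, not
the PELT code: scale_delta() is a hypothetical name, and the ratio
freq_curr/freq_max is assumed to be passed in fixed point with 1024 meaning
1.0.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1UL << SCHED_CAPACITY_SHIFT)

/*
 * Scale the time elapsed since the last update by freq_curr/freq_max.
 * freq_scale is the ratio in 1/1024 units; the assertion encodes the
 * assumption that the ratio never exceeds 1.
 */
static uint64_t scale_delta(uint64_t delta_ns, unsigned long freq_scale)
{
	assert(freq_scale <= SCHED_CAPACITY_SCALE);
	return (delta_ns * freq_scale) >> SCHED_CAPACITY_SHIFT;
}
```

At freq_scale = 1024 the scaled time equals wall-clock time; at 512 it
advances at half that pace.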
Mel Gorman Sept. 24, 2019, 8:06 a.m. UTC | #12
On Thu, Sep 19, 2019 at 07:42:29AM -0700, Doug Smythies wrote:
> On 2019.09.17 07:25 Giovanni Gherdovich wrote:
> >On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
> > [...]
> 
> >> The problem with the test is its run to run variability, which was from
> >> all the disk I/O, as far as I could determine. At the time,
> >> I studied this to death [2], and made a more repeatable test, without
> >> any disk I/O.
> >> 
> >> While the challenges with this work flow have tended to be focused
> >> on the CPU frequency scaling driver, I have always considered
> >> the root issue here to be a scheduling issue. Excerpt from my notes
> >> [2]:
> >> 
> >>> The issue is that performance is much much better if the system is
> >>> forced to use only 1 CPU rather than relying on the defaults where
> >>> the CPU scheduler decides what to do.
> >>> The scheduler seems to not realize that the current CPU has just
> >>> become free, and assigns the new task to a new CPU. Thus the load
> >>> on any one CPU is so low that it doesn't ramp up the CPU frequency.
> >>> It would be better if somehow the scheduler knew that the current
> >>> active CPU was now able to take on the new task, overall resulting
> >>> in one fully loaded CPU at the highest CPU frequency.
> >> 
> >> I do not know if such is practical, and I didn't re-visit the issue.
> >>
> >
> > You're absolutely right, pinning a serialized, fork-intensive workload such as
> > gitsource gives you as good of a performance as you can get, because it removes
> > the scheduler out of the picture.
> >
> > So one might be tempted to flag this test as non-representative of a
> > real-world scenario;
> 
> Disagree. I consider this test to be very representative of real-world
> scenarios. However, and I do not know for certain, the relatively high
> average fork rate of the gitsource "make test" is less common.
> 

I think it's common enough to be interesting. What I would be very cautious
of is considering this patch in the context of the scheduler decisions
made for synchronous tasks. By synchronous, I mean any waker/wakee
pattern where the waker always goes immediately to sleep. In that case,
it is best for the wakee to use the same CPU as the waker. Unfortunately,
the kernel has tried numerous times to accurately detect when a waker
will immediately go to sleep and it has never worked out properly. When
the sync wakeup hint was strictly obeyed, there were too many cases where
the waker did not immediately sleep and there was a latency hit for the
wakee when nearby cores were idle. `perf bench sched pipe` is an excellent example
of a case where stacking the wakee on the same CPU as the waker performs
excellently, but there are too many other realistic workloads where it is
a sub-optimal decision, such as a waker waking multiple wakees before it
goes to sleep, meaning stacking should definitely not happen.

Hence, I think this patchset should be considered on its own merits.
There will always be some guesswork when deciding what factor to use
to account for turbo but the patch is still better than allowing the
estimated utilisation to vary depending on the CPU frequency.

I think the patch is fine and should be merged with the main caveat being
that some CPU families may need to use a different calculation to account
for turbo boost which is a per-arch and per-cpu-family decision. What,
if anything, should change in this patchset before it can be merged? Even
if there is follow-on work that is necessary then it still looks like a
reasonable starting point to me. If the waker/wakee stacking problem was
revisited, it would still be orthogonal to this patch and they would not
be in conflict.

> > I think your 0.69 and my three values tell the same story: schedutil really
> > needs to use the frequency invariant formula otherwise it's out of the
> > race. Enabling scale-invariance gives multiple tens of percentage points of
> > advantage.
> 
> Agreed. This frequency invariant addition is great. However, if
> schedutil is "out of the race" without it, as you say, then isn't
> intel_pstate(passive)/ondemand out of the race also? It performs
> just as poorly for this test, until very low PIDs per second.
> 

In the intel_pstate case, there have been hacks carried out of tree
trying to avoid some of the downsides of it. It also had things like IO
wait boosting in mainline which was partially to handle the case where
history was lost and in some cases to avoid problems when the wakeup on
IO completion moved a task to another CPU.

I think it's a fair assessment to say that schedutil suffers if
frequency invariance is not used regardless of what the other cpufreq
drivers do.

> >>>> Compare it to the update formula of intel_pstate/powersave:
> >>> 
> >>>    freq_next = 1.25 * freq_max * Busy%
> >>> 
> >>> where again freq_max is 1C turbo and Busy% is the percentage of time not spent
> >>> idling (calculated with delta_MPERF / delta_TSC);
> >> 
> >> Note that the delta_MPERF / delta_TSC method includes idle state 0 and the old
> >> method of utilization does not (at least not last time I investigated, which was
> >> awhile ago (and I can not find my notes)).
> >
> > I think that depends on whether or not TSC stops at idle. As I understand from
> > the Intel Software Developer manual (SDM) a TSC that stops at idle is called
> > "invariant TSC", and makes delta_MPERF / delta_TSC interesting. Otherwise the
> > two counters behave exactly the same and the ratio is always 1, modulo the
> > delays in actually reading the two values. But all I know comes from
> > turbostat's man page and the SDM, so don't quote me on that :)
> 
> I was only talking about idle state 0 (polling), where TSC does not stop.
> 
> By the way, I have now done some tests with this patch set and multi-threaded
> stuff. Nothing to report, it all looks great.
> 
> [1] http://www.smythies.com/~doug/linux/single-threaded/gg-pidps2.png
> 

Is that an acked-by?

Thanks.
Peter Zijlstra Sept. 24, 2019, 2:03 p.m. UTC | #13
On Tue, Sep 17, 2019 at 04:27:46PM +0200, Giovanni Gherdovich wrote:
> Hello Quentin,
> 
> On Sat, 2019-09-14 at 12:57 +0200, Quentin Perret wrote:
> > Hi Giovanni
> > 
> > On Monday 09 Sep 2019 at 04:42:15 (+0200), Giovanni Gherdovich wrote:
> > > +static inline long arch_scale_freq_capacity(int cpu)
> > > +{
> > > +	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> > > +		return per_cpu(arch_cpu_freq, cpu);
> > 
> > So, if this is conditional, perhaps you could also add this check in an
> > x86-specific implementation of arch_scale_freq_invariant() ? That would
> > guide sugov in the right path (see get_next_freq()) if APERF/MPERF are
> > unavailable.
> > 
> > > +	return 1024 /* SCHED_CAPACITY_SCALE */;
> > > +}
> >
> 
> Good remark. If the cpu doesn't have APERF/MPERF, the choice here is that
> freq_curr is constantly equal to freq_max, and the scaling factor is 1 all the
> time.
> 
> But I'm checking this static_cpu_has() every time I do a frequency update;
> arguably schedutil should be smarter and settle such a case once and for all
> at boot time.
> 
> I'll check what's the cost of static_cpu_has() and if it's non-negligible I'll
> do what you suggest (x86-specific version of arch_scale_freq_invariant()).

static_cpu_has() is an alternative and ends up being a static branch
(similar to static_key) once the alternative patching runs.
Peter Zijlstra Sept. 24, 2019, 4 p.m. UTC | #14
On Tue, Sep 24, 2019 at 04:03:32PM +0200, Peter Zijlstra wrote:

> > I'll check what's the cost of static_cpu_has() and if it's non-negligible I'll
> > do what you suggest (x86-specific version of arch_scale_freq_invariant()).
> 
> static_cpu_has() is an alternative and ends up being a static branch
> (similar to static_key) once the alternative patching runs.

That said; I think you want a static key anyway, because if we can't
tell the max_freq we don't want to use the invariant stuff.

Something a little like so on top perhaps.

Also, the below fixes that silly tick_disable stuff.

---
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -196,20 +196,24 @@ static inline void sched_clear_itmt_supp
 #ifdef CONFIG_SMP
 #include <asm/cpufeature.h>
 
-#define arch_scale_freq_tick arch_scale_freq_tick
-#define arch_scale_freq_capacity arch_scale_freq_capacity
+DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
 
 DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
 
 static inline long arch_scale_freq_capacity(int cpu)
 {
-	if (static_cpu_has(X86_FEATURE_APERFMPERF))
+	if (arch_scale_freq_invariant())
 		return per_cpu(arch_cpu_freq, cpu);
 
 	return 1024 /* SCHED_CAPACITY_SCALE */;
 }
+#define arch_scale_freq_capacity arch_scale_freq_capacity
 
 extern void arch_scale_freq_tick(void);
+#define arch_scale_freq_tick arch_scale_freq_tick
+
 extern void x86_arch_scale_freq_tick_enable(void);
 extern void x86_arch_scale_freq_tick_disable(void);
 #else
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1799,6 +1799,8 @@ void native_play_dead(void)
  * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
  */
 
+DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
 static DEFINE_PER_CPU(u64, arch_prev_aperf);
 static DEFINE_PER_CPU(u64, arch_prev_mperf);
 static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
@@ -1860,6 +1862,8 @@ static void core_set_cpu_max_freq(void)
 	turbo_ratio = (turbo_ratio >> 24) & 0xFF;   /* 4C turbo ratio */
 
 	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
+
+	static_key_enable(&arch_scale_freq_key);
 }
 
 static void intel_set_cpu_max_freq(void)
@@ -1876,10 +1880,19 @@ static void intel_set_cpu_max_freq(void)
 	core_set_cpu_max_freq();
 }
 
-static void set_cpu_max_freq(void)
+static void init_scale_freq(void *arg)
 {
 	u64 aperf, mperf;
 
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+}
+
+static void set_cpu_max_freq(void)
+{
 	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
 		return;
 
@@ -1891,11 +1904,7 @@ static void set_cpu_max_freq(void)
 		break;
 	}
 
-	rdmsrl(MSR_IA32_APERF, aperf);
-	rdmsrl(MSR_IA32_MPERF, mperf);
-
-	this_cpu_write(arch_prev_aperf, aperf);
-	this_cpu_write(arch_prev_mperf, mperf);
+	init_scale_freq(NULL);
 }
 
 DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
@@ -1908,7 +1917,7 @@ void arch_scale_freq_tick(void)
 	u64 aperf, mperf;
 	u64 acnt, mcnt;
 
-	if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
+	if (!arch_scale_freq_invariant() || tick_disable)
 		return;
 
 	rdmsrl(MSR_IA32_APERF, aperf);
@@ -1940,5 +1949,6 @@ void x86_arch_scale_freq_tick_enable(voi
 
 void x86_arch_scale_freq_tick_disable(void)
 {
+	on_each_cpu(init_scale_freq, NULL, 1);
 	tick_disable = true;
 }
Peter Zijlstra Sept. 24, 2019, 4:04 p.m. UTC | #15
On Mon, Sep 09, 2019 at 04:42:15AM +0200, Giovanni Gherdovich wrote:

> +static void intel_set_cpu_max_freq(void)
> +{
> +	/*
> +	 * TODO: add support for:
> +	 *
> +	 * - Xeon Phi (KNM, KNL)
> +	 * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus
> +	 * - Atom Silvermont

ISTR I had code for Atom.. what happened with that?

> +	 *
> +	 * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE
> +	 */
> +	core_set_cpu_max_freq();
> +}
Peter Zijlstra Sept. 24, 2019, 4:30 p.m. UTC | #16
On Mon, Sep 09, 2019 at 04:42:15AM +0200, Giovanni Gherdovich wrote:
> +static const struct x86_cpu_id has_turbo_ratio_group_limits[] = {
> +	ICPU(INTEL_FAM6_ATOM_GOLDMONT),
> +	ICPU(INTEL_FAM6_ATOM_GOLDMONT_X),

That's GOLDMONT_D in recent tip kernels.

> +	ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS),
> +	ICPU(INTEL_FAM6_SKYLAKE_X),

What about KABYLAKE_X and ICELAKE_X ?

> +	{}
> +};
Doug Smythies Sept. 24, 2019, 5:52 p.m. UTC | #17
On 2019.09.24 01:06 Mel Gorman wrote:
> On Thu, Sep 19, 2019 at 07:42:29AM -0700, Doug Smythies wrote:
>> On 2019.09.17 07:25 Giovanni Gherdovich wrote:
>>>On Wed, 2019-09-11 at 08:28 -0700, Doug Smythies wrote:
>>> [...]
> 

> Hence, I think this patchset should be considered on its own merits.

Agree. 

> I think the patch is fine and should be merged with the main caveat being
> that some CPU families may need to use a different calculation to account
> for turbo boost which is a per-arch and per-cpu-family decision.

Agree.

> What, if anything, should change in this patchset before it can be merged?

Nothing, and apologies for the tangential discussion.

> Is that an acked-by?

Absolutely, if I am worthy of ack'ing then:

Acked-by: Doug Smythies <dsmythies@telus.net>

... Doug
Giovanni Gherdovich Oct. 2, 2019, 12:25 p.m. UTC | #18
Hello Peter,

late replies as I wasn't in the office last week.

On Tue, 2019-09-24 at 18:30 +0200, Peter Zijlstra wrote:
> On Mon, Sep 09, 2019 at 04:42:15AM +0200, Giovanni Gherdovich wrote:
> > +static const struct x86_cpu_id has_turbo_ratio_group_limits[] = {
> > +   ICPU(INTEL_FAM6_ATOM_GOLDMONT),
> > +   ICPU(INTEL_FAM6_ATOM_GOLDMONT_X),
> 
> That's GOLDMONT_D in recent tip kernels.

Right, I saw that now.

> 
> > +   ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS),
> > +   ICPU(INTEL_FAM6_SKYLAKE_X),
> 
> What about KABYLAKE_X and ICELAKE_X ?

KABYLAKE_X: does it exist? I couldn't find it in
arch/x86/include/asm/intel-family.h (the tip tree), I only see KABYLAKE_L and
KABYLAKE.

ICELAKE_X: well, I don't really know. Does this model have the same semantics
for MSR_TURBO_RATIO_LIMIT as SKYLAKE_X (which is family = 0x6, model = 0x55)?
This is for Len B. and Srinivas P. (in CC).

The latest Software Developer's Manual (SDM) from May 2019 (volume 4, section
2.17.3, "MSRs Specific to Intel Xeon Processor Scalable Family") mentions only
"CPUID DisplayFamily_DisplayModel = 06_55H", which is SKYLAKE_X, as having the
semantics I'm looking for here (in addition to Atom Goldmont's).

The semantics I'm referring to are that MSR_TURBO_RATIO_LIMIT doesn't contain
turbo levels for the fixed group sizes 1-2-3-4-... cores; the group sizes are
specified in a different MSR (and could be 2-4-8-12-... for example).
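
To make the difference concrete, here is a small user-space sketch of a
lookup that works for both layouts. It is an illustration only: the function
name, the FIXED_GROUP_SIZES constant and the sample register values used
below are hypothetical, and it assumes one 8-bit field per group in both
MSRs, with group n of the ratio MSR matching group n of the size MSR.

```c
#include <assert.h>
#include <stdint.h>

/* Group sizes implied by the fixed layout: byte n covers n+1 active cores. */
#define FIXED_GROUP_SIZES 0x0807060504030201ULL

/*
 * Return the turbo ratio of the first group that covers at least
 * `cores` active cores, or 0 if no group is large enough.
 * `ratios` is the raw turbo ratio MSR value, `group_sizes` the raw
 * value of the MSR holding the group sizes (or FIXED_GROUP_SIZES).
 */
static unsigned int turbo_ratio_for(uint64_t ratios, uint64_t group_sizes,
				    unsigned int cores)
{
	assert(cores > 0);

	for (unsigned int shift = 0; shift < 64; shift += 8) {
		unsigned int size = (group_sizes >> shift) & 0xFF;
		unsigned int ratio = (ratios >> shift) & 0xFF;

		if (size >= cores)
			return ratio;
	}
	return 0;
}
```

With FIXED_GROUP_SIZES, asking for 4 cores selects bits 31:24, which is the
same field the patch extracts with (turbo_ratio >> 24) & 0xFF. With group
sizes 2-4-8-12, the same request selects the second field instead.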

If the SDM is outdated and ICELAKE_X is also in that category, then the
turbostat source code is outdated too as it has this function to detect this
feature:

    int has_turbo_ratio_group_limits(int family, int model)
    {

            if (!genuine_intel)
                    return 0;

            switch (model) {
            case INTEL_FAM6_ATOM_GOLDMONT:
            case INTEL_FAM6_SKYLAKE_X:
            case INTEL_FAM6_ATOM_GOLDMONT_X:
                    return 1;
            }
            return 0;
    }

(from the tree lenb/linux.git, branch "turbostat", turbostat version 19.08.31
not yet merged into mainline)


Giovanni
Giovanni Gherdovich Oct. 2, 2019, 12:26 p.m. UTC | #19
On Tue, 2019-09-24 at 18:04 +0200, Peter Zijlstra wrote:
> On Mon, Sep 09, 2019 at 04:42:15AM +0200, Giovanni Gherdovich wrote:
> 
> > +static void intel_set_cpu_max_freq(void)
> > +{
> > +	/*
> > +	 * TODO: add support for:
> > +	 *
> > +	 * - Xeon Phi (KNM, KNL)
> > +	 * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus
> > +	 * - Atom Silvermont
> 
> ISTR I had code for Atom.. what happened with that?

I'm being overly zealous and I wanted to get a Silvermont machine to test that
code before sending.

The reason is that your code uses MSR_ATOM_CORE_RATIOS and
MSR_ATOM_CORE_TURBO_RATIOS which are not documented in the SDM. I wanted to
make sure those have the expected content on at least one machine before using
them in my code. I have no doubt you, Srinivas and Len (who uses them in
turbostat) have already checked but you know, more eyeballs.

I've talked to Len and Srinivas at LPC; they agreed that those two MSRs may not
have made it to the SDM but said the turbostat source code is the reference in
this case.


Giovanni
Giovanni Gherdovich Oct. 2, 2019, 12:27 p.m. UTC | #20
On Tue, 2019-09-24 at 18:00 +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 04:03:32PM +0200, Peter Zijlstra wrote:
> 
> > > I'll check what's the cost of static_cpu_has() and if it's non-negligible I'll
> > > do what you suggest (x86-specific version of arch_scale_freq_invariant()).
> > 
> > static_cpu_has() is an alternative and ends up being a static branch
> > (similar to static_key) once the alternative patching runs.
> 
> That said; I think you want a static key anyway, because if we can't
> tell the max_freq we don't want to use the invariant stuff.
> 
> Something a little like so on top perhaps.
> 
> Also, the below fixes that silly tick_disable stuff.

Thanks for this patch, I'll add this change in v2.

Can you elaborate on what you don't like in the tick_disable mechanism?

After reading your comments I realized there is a problem, but I'm not sure it
is the same one you're addressing.

More on this below, under your edit of the function
x86_arch_scale_freq_tick_disable().

> 
> ---
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -196,20 +196,24 @@ static inline void sched_clear_itmt_supp
>  #ifdef CONFIG_SMP
>  #include <asm/cpufeature.h>
>  
> -#define arch_scale_freq_tick arch_scale_freq_tick
> -#define arch_scale_freq_capacity arch_scale_freq_capacity
> +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
> +
> +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)

This confused me for a second but then I realized that this #define comes
before the one in kernel/sched/sched.h where arch_scale_freq_invariant() is
defined again but guarded against previous definitions, so it all falls into
place; code from schedutil will see this one.

>  
>  DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
>  
>  static inline long arch_scale_freq_capacity(int cpu)
>  {
> -	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> +	if (arch_scale_freq_invariant())
>  		return per_cpu(arch_cpu_freq, cpu);
>  
>  	return 1024 /* SCHED_CAPACITY_SCALE */;
>  }
> +#define arch_scale_freq_capacity arch_scale_freq_capacity
>  
>  extern void arch_scale_freq_tick(void);
> +#define arch_scale_freq_tick arch_scale_freq_tick
> +
>  extern void x86_arch_scale_freq_tick_enable(void);
>  extern void x86_arch_scale_freq_tick_disable(void);
>  #else
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1799,6 +1799,8 @@ void native_play_dead(void)
>   * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
>   */
>  
> +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
> +
>  static DEFINE_PER_CPU(u64, arch_prev_aperf);
>  static DEFINE_PER_CPU(u64, arch_prev_mperf);
>  static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
> @@ -1860,6 +1862,8 @@ static void core_set_cpu_max_freq(void)
>  	turbo_ratio = (turbo_ratio >> 24) & 0xFF;   /* 4C turbo ratio */
>  
>  	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
> +
> +	static_key_enable(&arch_scale_freq_key);
>  }
>  
>  static void intel_set_cpu_max_freq(void)
> @@ -1876,10 +1880,19 @@ static void intel_set_cpu_max_freq(void)
>  	core_set_cpu_max_freq();
>  }
>  
> -static void set_cpu_max_freq(void)
> +static void init_scale_freq(void *arg)
>  {
>  	u64 aperf, mperf;
>  
> +	rdmsrl(MSR_IA32_APERF, aperf);
> +	rdmsrl(MSR_IA32_MPERF, mperf);
> +
> +	this_cpu_write(arch_prev_aperf, aperf);
> +	this_cpu_write(arch_prev_mperf, mperf);
> +}
> +
> +static void set_cpu_max_freq(void)
> +{
>  	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
>  		return;
>  
> @@ -1891,11 +1904,7 @@ static void set_cpu_max_freq(void)
>  		break;
>  	}
>  
> -	rdmsrl(MSR_IA32_APERF, aperf);
> -	rdmsrl(MSR_IA32_MPERF, mperf);
> -
> -	this_cpu_write(arch_prev_aperf, aperf);
> -	this_cpu_write(arch_prev_mperf, mperf);
> +	init_scale_freq(NULL);
>  }
>  
>  DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
> @@ -1908,7 +1917,7 @@ void arch_scale_freq_tick(void)
>  	u64 aperf, mperf;
>  	u64 acnt, mcnt;
>  
> -	if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
> +	if (!arch_scale_freq_invariant() || tick_disable)
>  		return;
>  
>  	rdmsrl(MSR_IA32_APERF, aperf);
> @@ -1940,5 +1949,6 @@ void x86_arch_scale_freq_tick_enable(voi
>  
>  void x86_arch_scale_freq_tick_disable(void)
>  {
> +	on_each_cpu(init_scale_freq, NULL, 1);
>  	tick_disable = true;

I don't see why the call to init_scale_freq() is needed here; why would I care
about what's in arch_prev_[am]perf at this point? arch_scale_freq_tick() will
see that tick_disable == true and exit early before reading arch_prev_[am]perf.

The problem IMO emerges in the following configuration, which is a bug in the
patch I sent:

  * arch_scale_freq_invariant() is true (because we have APERF/MPERF)
  * arch_scale_freq_capacity() is non-trivial (reads arch_cpu_freq)
  * tick calculations are disabled

In this case arch_scale_freq_capacity() feeds stale data to the function
update_rq_clock_pelt() in kernel/sched/pelt.h. I initially missed this problem
because I forgot that PELT signals have more users than just the schedutil
governor (load balancer etc).

This is exactly the situation produced by patch 2/2 which disables the tick
calculations for intel_cpufreq (aka intel_pstate=passive).

I think the fix for this is to set arch_cpu_freq (each per-cpu instance of the
variable) to SCHED_CAPACITY_SCALE here in x86_arch_scale_freq_tick_disable().
That would render the scaling factor for invariance moot (always 1), just as
it is w/o scale invariance.
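
A user-space sketch of that idea, for illustration only: the per-cpu variable
is modelled as a plain array, NR_CPUS_SKETCH and the *_sketch function names
are hypothetical, and no locking or cross-CPU synchronization is shown.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4
#define SCHED_CAPACITY_SCALE 1024UL

/* Stand-in for the per-cpu arch_cpu_freq variable. */
static unsigned long arch_cpu_freq[NR_CPUS_SKETCH];
static bool tick_disable;

/*
 * Disable the tick calculations and neutralize the scaling factor on
 * every CPU, so that readers never see a stale ratio.
 */
static void tick_disable_sketch(void)
{
	tick_disable = true;
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		arch_cpu_freq[cpu] = SCHED_CAPACITY_SCALE;
}

/* What a PELT user would read for a given CPU. */
static unsigned long scale_freq_capacity_sketch(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	return arch_cpu_freq[cpu];
}
```

After tick_disable_sketch() every reader gets 1024, i.e. a scaling factor of
1, whatever value the last tick had stored.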

I'm sending v2 with all your amendments except this last one.


Giovanni
Peter Zijlstra Oct. 2, 2019, 6:35 p.m. UTC | #21
On Wed, Oct 02, 2019 at 02:26:44PM +0200, Giovanni Gherdovich wrote:
> On Tue, 2019-09-24 at 18:04 +0200, Peter Zijlstra wrote:
> > On Mon, Sep 09, 2019 at 04:42:15AM +0200, Giovanni Gherdovich wrote:
> > 
> > > +static void intel_set_cpu_max_freq(void)
> > > +{
> > > +	/*
> > > +	 * TODO: add support for:
> > > +	 *
> > > +	 * - Xeon Phi (KNM, KNL)
> > > +	 * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus
> > > +	 * - Atom Silvermont
> > 
> > ISTR I had code for Atom.. what happened with that?
> 
> I'm being overly zealous and I wanted to get a Silvermont machine to test that
> code before sending.
> 
> The reason is that your code uses MSR_ATOM_CORE_RATIOS and
> MSR_ATOM_CORE_TURBO_RATIOS which are not documented in the SDM. I wanted to
> make sure those have the expected content on at least one machine before using
> them in my code. I have no doubt you, Srinivas and Len (who uses them in
> turbostat) have already checked but you know, more eyeballs.
> 
> I've talked to Len and Srinivas at LPC; they agreed that those two MSRs may not
> have made it to the SDM but said the turbostat source code is the reference in
> this case.

Can you at least include the patch as RFC then? Perhaps other people,
who have hardware at hand, can then help test it.
Peter Zijlstra Oct. 2, 2019, 6:45 p.m. UTC | #22
On Wed, Oct 02, 2019 at 02:27:54PM +0200, Giovanni Gherdovich wrote:
> On Tue, 2019-09-24 at 18:00 +0200, Peter Zijlstra wrote:
> > On Tue, Sep 24, 2019 at 04:03:32PM +0200, Peter Zijlstra wrote:
> > 
> > > > I'll check what's the cost of static_cpu_has() and if it's non-negligible I'll
> > > > do what you suggest (x86-specific version of arch_scale_freq_invariant()).
> > > 
> > > static_cpu_has() is an alternative and ends up being a static branch
> > > (similar to static_key) once the alternative patching runs.
> > 
> > That said; I think you want a static key anyway, because if we can't
> > tell the max_freq we don't want to use the invariant stuff.
> > 
> > Something a little like so on top perhaps.
> > 
> > Also, the below fixes that silly tick_disable stuff.
> 
> Thanks for this patch, I'll add this change in v2.
> 
> Can you elaborate on what you don't like in the tick_disable mechanism?

Mostly because I dislike intel_pstate active mode a lot, but also
because it makes PELT behave differently between pstate and !pstate.

> > +static void init_scale_freq(void *arg)
> >  {
> >  	u64 aperf, mperf;
> >  
> > +	rdmsrl(MSR_IA32_APERF, aperf);
> > +	rdmsrl(MSR_IA32_MPERF, mperf);
> > +
> > +	this_cpu_write(arch_prev_aperf, aperf);
> > +	this_cpu_write(arch_prev_mperf, mperf);
> > +}
> > +

> > @@ -1940,5 +1949,6 @@ void x86_arch_scale_freq_tick_enable(voi
> >  
> >  void x86_arch_scale_freq_tick_disable(void)
> >  {
> > +	on_each_cpu(init_scale_freq, NULL, 1);
> >  	tick_disable = true;
> 
> I don't see why the call to init_scale_freq() is needed here; why would I care
> about what's in arch_prev_[am]perf at this point? arch_scale_freq_tick() will
> see that tick_disable == true and exit early before reading arch_prev_[am]perf.

You're right, we should reset the prev values on enable. Otherwise the
first tick after enable will see 'weird' values.
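
A user-space sketch of why stale prev values matter, for illustration only.
It follows the formula from the patch comment (delta_APERF / delta_MPERF
relative to freq_max, clipped to 1), but the struct, the function name and
the sample counter values are hypothetical, and max_ratio_scaled is assumed
to be freq_max/freq_base in 1/1024 units.

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1ULL << SCHED_CAPACITY_SHIFT)

struct perf_snapshot {
	uint64_t aperf;
	uint64_t mperf;
};

/*
 * freq_curr/freq_max in 1/1024 units, computed from the counter deltas
 * between two snapshots and clipped to SCHED_CAPACITY_SCALE.
 */
static uint64_t freq_scale(struct perf_snapshot prev,
			   struct perf_snapshot now,
			   uint64_t max_ratio_scaled)
{
	uint64_t acnt = now.aperf - prev.aperf;
	uint64_t mcnt = now.mperf - prev.mperf;
	uint64_t scale;

	assert(mcnt != 0 && max_ratio_scaled != 0);

	scale = (acnt << (2 * SCHED_CAPACITY_SHIFT)) / (mcnt * max_ratio_scaled);
	return scale > SCHED_CAPACITY_SCALE ? SCHED_CAPACITY_SCALE : scale;
}
```

Take max_ratio_scaled = 1536 (freq_max = 1.5 * freq_base) and a CPU that ran
at half of freq_base while the tick was disabled, then at freq_max for one
tick. With snapshots reset on enable, prev = {5000, 10000} and
now = {6500, 11000} give 1024. With stale prev = {0, 0} the same tick yields
403, because the delta also averages in the whole disabled period.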
Peter Zijlstra Oct. 2, 2019, 6:47 p.m. UTC | #23
On Wed, Oct 02, 2019 at 02:25:52PM +0200, Giovanni Gherdovich wrote:

> > What about KABYLAKE_X and ICELAKE_X ?
> 
> KABYLAKE_X: does it exist? I couldn't find it in
> arch/x86/include/asm/intel-family.h (the tip tree), I only see KABYLAKE_L and
> KABYLAKE.

My bad, I must've been staring cross-eyed at intel-family.h.

> If the SDM is outdated and ICELAKE_X is also in that category, then the
> turbostat source code is outdated too as it has this function to detect this
> feature:

I think you can trust the turbostat code.

Patch

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d2318251..462edd6aefd5 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -193,4 +193,33 @@  static inline void sched_clear_itmt_support(void)
 }
 #endif /* CONFIG_SCHED_MC_PRIO */
 
+#ifdef CONFIG_SMP
+#include <asm/cpufeature.h>
+
+#define arch_scale_freq_tick arch_scale_freq_tick
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static inline long arch_scale_freq_capacity(int cpu)
+{
+	if (static_cpu_has(X86_FEATURE_APERFMPERF))
+		return per_cpu(arch_cpu_freq, cpu);
+
+	return 1024 /* SCHED_CAPACITY_SCALE */;
+}
+
+extern void arch_scale_freq_tick(void);
+extern void x86_arch_scale_freq_tick_enable(void);
+extern void x86_arch_scale_freq_tick_disable(void);
+#else
+static inline void x86_arch_scale_freq_tick_enable(void)
+{
+}
+
+static inline void x86_arch_scale_freq_tick_disable(void)
+{
+}
+#endif
+
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index fdbd47ceb84d..dd6ae8087cc0 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -147,6 +147,8 @@  static inline void smpboot_restore_warm_reset_vector(void)
 	*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
 }
 
+static void set_cpu_max_freq(void);
+
 /*
  * Report back to the Boot Processor during boot time or to the caller processor
  * during CPU online.
@@ -183,6 +185,8 @@  static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
+	set_cpu_max_freq();
+
 	/*
 	 * Get our bogomips.
 	 * Update loops_per_jiffy in cpu_data. Previous call to
@@ -1342,7 +1346,7 @@  void __init native_smp_prepare_cpus(unsigned int max_cpus)
 	set_sched_topology(x86_topology);
 
 	set_cpu_sibling_map(0);
-
+	set_cpu_max_freq();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {
@@ -1764,3 +1768,177 @@  void native_play_dead(void)
 }
 
 #endif
+
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by a micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ *   BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio lets the ratio
+ * freq_curr / freq_max eventually grow >1, in which case we clip it to 1.
+ */
+
+static DEFINE_PER_CPU(u64, arch_prev_aperf);
+static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
+
+static bool turbo_disabled(void)
+{
+	u64 misc_en;
+	int err;
+
+	err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en);
+	if (err)
+		return false;
+
+	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
+}
+
+#include <asm/cpu_device_id.h>
+#include <asm/intel-family.h>
+
+#define ICPU(model) \
+	{ X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0}
+
+static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = {
+	ICPU(INTEL_FAM6_XEON_PHI_KNL),
+	ICPU(INTEL_FAM6_XEON_PHI_KNM),
+	{}
+};
+
+static const struct x86_cpu_id has_turbo_ratio_group_limits[] = {
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT),
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT_X),
+	ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS),
+	ICPU(INTEL_FAM6_SKYLAKE_X),
+	{}
+};
+
+static void core_set_cpu_max_freq(void)
+{
+	u64 ratio, turbo_ratio;
+	int err;
+
+	if (smp_processor_id() != 0)
+		return;
+
+	if (turbo_disabled() ||
+		x86_match_cpu(has_knl_turbo_ratio_limits) ||
+		x86_match_cpu(has_turbo_ratio_group_limits))
+		return;
+
+	err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio);
+	if (err)
+		return;
+
+	err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio);
+	if (err)
+		return;
+
+	ratio = (ratio >> 8) & 0xFF;                /* max non-turbo ratio */
+	turbo_ratio = (turbo_ratio >> 24) & 0xFF;   /* 4C turbo ratio */
+
+	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
+}
+
+static void intel_set_cpu_max_freq(void)
+{
+	/*
+	 * TODO: add support for:
+	 *
+	 * - Xeon Phi (KNM, KNL)
+	 * - Xeon Gold/Platinum, Atom Goldmont/Goldmont Plus
+	 * - Atom Silvermont
+	 *
+	 * which for now all default to arch_max_freq = SCHED_CAPACITY_SCALE
+	 */
+	core_set_cpu_max_freq();
+}
+
+static void set_cpu_max_freq(void)
+{
+	u64 aperf, mperf;
+
+	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
+		return;
+
+	switch (boot_cpu_data.x86_vendor) {
+	case X86_VENDOR_INTEL:
+		intel_set_cpu_max_freq();
+		break;
+	default:
+		break;
+	}
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+}
+
+DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static bool tick_disable;
+
+void arch_scale_freq_tick(void)
+{
+	u64 freq;
+	u64 aperf, mperf;
+	u64 acnt, mcnt;
+
+	if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
+		return;
+
+	rdmsrl(MSR_IA32_APERF, aperf);
+	rdmsrl(MSR_IA32_MPERF, mperf);
+
+	acnt = aperf - this_cpu_read(arch_prev_aperf);
+	mcnt = mperf - this_cpu_read(arch_prev_mperf);
+	if (!mcnt)
+		return;
+
+	this_cpu_write(arch_prev_aperf, aperf);
+	this_cpu_write(arch_prev_mperf, mperf);
+
+	acnt <<= 2*SCHED_CAPACITY_SHIFT;
+	mcnt *= arch_max_freq;
+
+	freq = div64_u64(acnt, mcnt);
+
+	if (freq > SCHED_CAPACITY_SCALE)
+		freq = SCHED_CAPACITY_SCALE;
+
+	this_cpu_write(arch_cpu_freq, freq);
+}
+
+void x86_arch_scale_freq_tick_enable(void)
+{
+	tick_disable = false;
+}
+
+void x86_arch_scale_freq_tick_disable(void)
+{
+	tick_disable = true;
+}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 010d578118d6..9fb4af689dfd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3463,6 +3463,7 @@  void scheduler_tick(void)
 	struct task_struct *curr = rq->curr;
 	struct rq_flags rf;
 
+	arch_scale_freq_tick();
 	sched_clock_tick();
 
 	rq_lock(rq, &rf);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 802b1f3405f2..0b724c06b4d9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1934,6 +1934,13 @@  static inline int hrtick_enabled(struct rq *rq)
 
 #endif /* CONFIG_SCHED_HRTICK */
 
+#ifndef arch_scale_freq_tick
+static __always_inline
+void arch_scale_freq_tick(void)
+{
+}
+#endif
+
 #ifndef arch_scale_freq_capacity
 static __always_inline
 unsigned long arch_scale_freq_capacity(int cpu)