From patchwork Wed Nov 13 12:46:49 2019
X-Patchwork-Submitter: Giovanni Gherdovich
X-Patchwork-Id: 11241831
From: Giovanni Gherdovich
To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
    Borislav Petkov, Len Brown, "Rafael J . Wysocki"
Cc: x86@kernel.org, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
    Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner,
    Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies,
    Giovanni Gherdovich
Subject: [PATCH v4 1/6] x86,sched: Add support for frequency invariance
Date: Wed, 13 Nov 2019 13:46:49 +0100
Message-Id: <20191113124654.18122-2-ggherdovich@suse.cz>
In-Reply-To: <20191113124654.18122-1-ggherdovich@suse.cz>
References: <20191113124654.18122-1-ggherdovich@suse.cz>
X-Mailing-List: linux-pm@vger.kernel.org

Implement arch_scale_freq_capacity() for 'modern' x86. This function is used
by the scheduler to correctly account usage in the face of DVFS.

The present patch addresses Intel processors specifically and has positive
performance and performance-per-watt implications for the schedutil cpufreq
governor, bringing it closer to, if not on-par with, the powersave governor
from the intel_pstate driver/framework.

Large performance gains are obtained when the machine is lightly loaded and
no regressions are observed at saturation. The benchmarks with the largest
gains are kernel compilation, tbench (the networking version of dbench) and
shell-intensive workloads.

1. FREQUENCY INVARIANCE: MOTIVATION
   * Without it, a task looks larger if the CPU runs slower

2. PECULIARITIES OF X86
   * freq invariance accounting requires knowing the ratio freq_curr/freq_max
   2.1 CURRENT FREQUENCY
       * Use delta_APERF / delta_MPERF * freq_base (a.k.a. "BusyMHz")
   2.2 MAX FREQUENCY
       * It varies with time (turbo). As an approximation, we set it to a
         constant, i.e. 4-cores turbo frequency (or base frequency if nothing
         else is reported by MSRs)

3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
   * The invariant schedutil's formula has no feedback loop and reacts faster
     to utilization changes

4. KNOWN LIMITATIONS
   * In some cases tasks can't reach max util despite how hard they try

5. PERFORMANCE TESTING
   5.1 MACHINES
       * Skylake, Broadwell, Haswell
   5.2 SETUP
       * baseline Linux v5.2 w/ non-invariant schedutil. Tested freq_max =
         1-2-3-4-8-12 active cores turbo w/ invariant schedutil, and
         intel_pstate/powersave
   5.3 BENCHMARK RESULTS
       5.3.1 NEUTRAL BENCHMARKS
             * NAS Parallel Benchmark (HPC), hackbench
       5.3.2 NON-NEUTRAL BENCHMARKS
             * tbench (10-30% better), kernbench (10-15% better),
               shell-intensive-scripts (30-50% better)
             * no regressions
       5.3.3 SELECTION OF DETAILED RESULTS
       5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
             * dbench (5% worse on one machine), kernbench (3% worse),
               tbench (5-10% better), shell-intensive-scripts (10-40% better)

6. MICROARCH'ES ADDRESSED HERE
   * Xeon Core before Scalable Performance processors line (Xeon Gold/Platinum
     etc have different MSR semantics for querying turbo levels)

7. REFERENCES
   * MMTests performance testing framework, github.com/gormanm/mmtests

+-------------------------------------------------------------------------+
| 1. FREQUENCY INVARIANCE: MOTIVATION
+-------------------------------------------------------------------------+

For example, suppose a CPU has two frequencies: 500 and 1000 MHz. When
running a task that would consume 1/3 of a CPU at 1000 MHz, it would
appear to consume 2/3 (or 66.6%) when running at 500 MHz, giving the
false impression this CPU is almost at capacity, even though it can go
faster [*]. In a nutshell, without frequency scale-invariance tasks look
larger just because the CPU is running slower.

[*] (footnote: this assumes a linear frequency/performance relation, which
everybody knows to be false, but given realities it's the best approximation
we can make.)

+-------------------------------------------------------------------------+
| 2. PECULIARITIES OF X86
+-------------------------------------------------------------------------+

Accounting for frequency changes in PELT signals requires the computation of
the ratio freq_curr / freq_max. On x86 neither of those terms is readily
available.

2.1 CURRENT FREQUENCY
=====================

Since modern x86 has hardware control over the actual frequency we run at
(because of, amongst other things, Turbo-Mode), we cannot simply use the
frequency as requested through cpufreq.

Instead we use the APERF/MPERF MSRs to compute the effective frequency over
the recent past. Also, because reading MSRs is expensive, we don't do so every
time we need the value, but amortize the cost by doing it every tick.
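
A minimal user-space sketch of the two quantities discussed so far, assuming
plain integer inputs instead of MSR reads. The helper names busy_mhz() and
invariant_util() are made up for illustration and do not appear in the patch;
the second helper uses the approximation util_inv = util_raw * freq_curr /
freq_max and the 1024-based scale that the scheduler uses for utilization.

```c
#include <stdint.h>

/*
 * Illustrative model, not kernel code: average frequency while not idle
 * ("BusyMHz") from two consecutive APERF/MPERF samples.
 */
static uint64_t busy_mhz(uint64_t aperf_prev, uint64_t aperf_now,
			 uint64_t mperf_prev, uint64_t mperf_now,
			 uint64_t base_mhz)
{
	uint64_t delta_aperf = aperf_now - aperf_prev;
	uint64_t delta_mperf = mperf_now - mperf_prev;

	/* Both counters stand still while idle: no delta, no new sample. */
	if (delta_mperf == 0)
		return 0;

	/* Multiply first so the integer division keeps precision. */
	return delta_aperf * base_mhz / delta_mperf;
}

/*
 * Illustrative model: scale a raw utilization (0..1024) by
 * freq_curr / freq_max to obtain a frequency-invariant one.
 */
static uint64_t invariant_util(uint64_t util_raw, uint64_t freq_curr,
			       uint64_t freq_max)
{
	return util_raw * freq_curr / freq_max;
}
```

With the numbers from section 1, a task that appears to use 2/3 of the CPU
(683 out of 1024) at 500 MHz on a 1000 MHz part scales back to about 1/3
(341 out of 1024) with invariant_util(683, 500, 1000).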

2.2 MAX FREQUENCY
=================

Obtaining freq_max is also non-trivial because at any time the hardware can
provide a frequency boost to a selected subset of cores if the package has
enough power to spare (e.g. Turbo Boost). This means that the maximum
frequency available to a given core changes with time.

The approach taken in this change is to arbitrarily set freq_max to a constant
value at boot. The value chosen is the "4-cores (4C) turbo frequency" on most
microarchitectures, after evaluating the following candidates:

* 1-core (1C) turbo frequency (the fastest turbo state available)
* around base frequency (a.k.a. max P-state)
* something in between, such as 4C turbo

To interpret these options, consider that this is the denominator in
freq_curr/freq_max, and that ratio will be used to scale PELT signals such as
util_avg and load_avg. A large denominator will undershoot (util_avg looks a
bit smaller than it really is); vice versa, with a smaller denominator PELT
signals will tend to overshoot. Given that PELT drives frequency selection
in the schedutil governor, we will have:

    freq_max set to     | effect on DVFS
    --------------------+------------------
    1C turbo            | power efficiency (lower freq choices)
    base freq           | performance (higher util_avg, higher freq requests)
    4C turbo            | a bit of both

4C turbo proves to be a good compromise in a number of benchmarks (see below).

Note that when the function intel_set_cpu_max_freq() fails to query the various
MSRs for the 4C turbo value, the variable arch_max_freq retains its default
value of SCHED_CAPACITY_SCALE (1024), which corresponds to setting freq_max to
base frequency wrt the table above.

+-------------------------------------------------------------------------+
| 3. EFFECTS ON THE SCHEDUTIL FREQUENCY GOVERNOR
+-------------------------------------------------------------------------+

Once an architecture implements a frequency scale-invariant utilization (the
PELT signal util_avg), schedutil switches its frequency selection formula from

    freq_next = 1.25 * freq_curr * util            [non-invariant util signal]

to

    freq_next = 1.25 * freq_max * util             [invariant util signal]

where, in the second formula, freq_max is set to the 1C turbo frequency (max
turbo). The advantage of the second formula, whose usage we unlock with this
patch, is that freq_next doesn't depend on the current frequency in an
iterative fashion, but can jump to any frequency in a single update. This
absence of feedback in the formula makes it quicker to react to utilization
changes and more robust against pathological instabilities.

Compare it to the update formula of intel_pstate/powersave:

    freq_next = 1.25 * freq_max * Busy%

where again freq_max is 1C turbo and Busy% is the percentage of time not spent
idling (calculated with delta_MPERF / delta_TSC); essentially the same as
invariant schedutil, and largely responsible for intel_pstate/powersave's good
reputation. The non-invariant schedutil formula is derived from the invariant
one by approximating util_inv with util_raw * freq_curr / freq_max, but this
has limitations.

Testing shows improved performance due to better frequency selections when
the machine is lightly loaded, and essentially no change in behaviour at
saturation / overutilization.

+-------------------------------------------------------------------------+
| 4. KNOWN LIMITATIONS
+-------------------------------------------------------------------------+

It's been shown that it is possible to create pathological scenarios where a
CPU-bound task cannot reach max utilization, if the normalizing factor
freq_max is fixed to a constant value (see [Lelli-2018]).

If freq_max is set to 4C turbo as we do here, one needs to peg at least 5
cores in a package doing some busywork, and observe that none of those tasks
will ever reach max util (1024) because they're all running at less than the
4C turbo frequency.

While this concern still applies, we believe the performance benefit of
frequency scale-invariant PELT signals outweighs the cost of this limitation.

[Lelli-2018]
https://lore.kernel.org/lkml/20180517150418.GF22493@localhost.localdomain/

+-------------------------------------------------------------------------+
| 5. PERFORMANCE TESTING
+-------------------------------------------------------------------------+

5.1 MACHINES
============

We tested the patch on three machines, with Skylake, Broadwell and Haswell
CPUs. The details are below, together with the available turbo ratios as
reported by the appropriate MSRs.

* 8x-SKYLAKE-UMA:
  Single socket E3-1240 v5, Skylake 4 cores/8 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC    800 |********
    BASE    3500 |***********************************
    4C      3700 |*************************************
    3C      3800 |**************************************
    2C      3900 |***************************************
    1C      3900 |***************************************

* 80x-BROADWELL-NUMA:
  Two sockets E5-2698 v4, 2x Broadwell 20 cores/40 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2200 |**********************
    8C      2900 |*****************************
    7C      3000 |******************************
    6C      3100 |*******************************
    5C      3200 |********************************
    4C      3300 |*********************************
    3C      3400 |**********************************
    2C      3600 |************************************
    1C      3600 |************************************

* 48x-HASWELL-NUMA
  Two sockets E5-2670 v3, 2x Haswell 12 cores/24 threads
  Max EFFiciency, BASE frequency and available turbo levels (MHz):

    EFFIC   1200 |************
    BASE    2300 |***********************
    12C     2600 |**************************
    11C     2600 |**************************
    10C     2600 |**************************
    9C      2600 |**************************
    8C      2600 |**************************
    7C      2600 |**************************
    6C      2600 |**************************
    5C      2700 |***************************
    4C      2800 |****************************
    3C      2900 |*****************************
    2C      3100 |*******************************
    1C      3100 |*******************************

5.2 SETUP
=========

* The baseline is Linux v5.2 with schedutil (non-invariant) and the
  intel_pstate driver in passive mode.
* The rationale for choosing the various freq_max values to test has been to
  try all the 1-2-3-4C turbo levels (note that 1C and 2C turbo are identical
  on all machines), plus one more value closer to base_freq but still in the
  turbo range (8C turbo for both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA).
* In addition we've run all tests with intel_pstate/powersave for comparison.
* The filesystem is always XFS, the userspace is openSUSE Leap 15.1.
* 8x-SKYLAKE-UMA is capable of HWP (Hardware-Managed P-States), so the runs
  with active intel_pstate on this machine use that.

This gives, in terms of combinations tested on each machine:

* 8x-SKYLAKE-UMA
  * Baseline: Linux v5.2, non-invariant schedutil, intel_pstate passive
  * intel_pstate active + powersave + HWP
  * invariant schedutil, freq_max = 1C turbo
  * invariant schedutil, freq_max = 3C turbo
  * invariant schedutil, freq_max = 4C turbo

* both 80x-BROADWELL-NUMA and 48x-HASWELL-NUMA
  * [same as 8x-SKYLAKE-UMA, but no HWP capable]
  * invariant schedutil, freq_max = 8C turbo
    (which on 48x-HASWELL-NUMA is the same as 12C turbo, or "all cores turbo")

5.3 BENCHMARK RESULTS
=====================

5.3.1 NEUTRAL BENCHMARKS
------------------------

Tests that didn't show any measurable difference in performance on any of the
test machines between non-invariant schedutil and our patch are:

* NAS Parallel Benchmarks (NPB) using either MPI or openMP for IPC, any
  computational kernel
* flexible I/O (FIO)
* hackbench (using threads or processes, and using pipes or sockets)

5.3.2 NON-NEUTRAL BENCHMARKS
----------------------------

What follows are summary tables where each benchmark result is given a score.

* A tilde (~) means a neutral result, i.e. no difference from baseline.
* Scores are computed with the ratio result_new / result_baseline, so a tilde
  means a score of 1.00.
* The results in the score ratio are the geometric means of results running
  the benchmark with different parameters (e.g. for kernbench: using 1, 2, 4,
  ... number of processes; for pgbench: varying the number of clients, and so
  on).
* The first three tables show higher-is-better kind of tests (i.e. measured in
  operations/second), the subsequent three show lower-is-better kind of tests
  (i.e. the workload is fixed and we measure elapsed time, think kernbench).
* "gitsource" is a name we made up for the test consisting of running the
  entire unit test suite of the Git SCM and measuring how long it takes. We
  take it as a typical example of shell-intensive serialized workload.
* In the "I_PSTATE" column we have the results for intel_pstate/powersave.
  Other columns show invariant schedutil for different values of freq_max.
  4C turbo is circled as it's the value we've chosen for the final
  implementation.

80x-BROADWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
    pgbench-ro   1.14       ~      ~     | 1.11 |  1.14
    pgbench-rw   ~          ~      ~     | ~    |  ~
    netperf-udp  1.06       ~      1.06  | 1.05 |  1.07
    netperf-tcp  ~          1.03   ~     | 1.01 |  1.02
    tbench4      1.57       1.18   1.22  | 1.30 |  1.56
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; higher is better)
                                             +------+
                 I_PSTATE/HWP   1C     3C    | 4C   |
    pgbench-ro   ~              ~      ~     | ~    |
    pgbench-rw   ~              ~      ~     | ~    |
    netperf-udp  ~              ~      ~     | ~    |
    netperf-tcp  ~              ~      ~     | ~    |
    tbench4      1.30           1.14   1.14  | 1.16 |
                                             +------+

48x-HASWELL-NUMA (comparison ratio; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  12C
    pgbench-ro   1.15       ~      ~     | 1.06 |  1.16
    pgbench-rw   ~          ~      ~     | ~    |  ~
    netperf-udp  1.05       0.97   1.04  | 1.04 |  1.02
    netperf-tcp  0.96       1.01   1.01  | 1.01 |  1.01
    tbench4      1.50       1.05   1.13  | 1.13 |  1.25
                                         +------+

In the tables above we see that active intel_pstate is slightly better than
our 4C-turbo patch (both in reference to the baseline non-invariant schedutil)
on read-only pgbench and much better on tbench. Both cases are notable in that
they show that lowering our freq_max (to 8C-turbo and 12C-turbo on
80x-BROADWELL-NUMA and 48x-HASWELL-NUMA respectively) helps invariant
schedutil get closer.

If we ignore active intel_pstate and focus on the comparison with baseline
alone, there are several instances of double-digit performance improvement.

80x-BROADWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
    dbench4      1.23       0.95   0.95  | 0.95 |  0.95
    kernbench    0.93       0.83   0.83  | 0.83 |  0.82
    gitsource    0.98       0.49   0.49  | 0.49 |  0.48
                                         +------+

8x-SKYLAKE-UMA (comparison ratio; lower is better)
                                             +------+
                 I_PSTATE/HWP   1C     3C    | 4C   |
    dbench4      ~              ~      ~     | ~    |
    kernbench    ~              ~      ~     | ~    |
    gitsource    0.92           0.55   0.55  | 0.55 |
                                             +------+

48x-HASWELL-NUMA (comparison ratio; lower is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
    dbench4      ~          ~      ~     | ~    |  ~
    kernbench    0.94       0.90   0.89  | 0.90 |  0.90
    gitsource    0.97       0.69   0.69  | 0.69 |  0.69
                                         +------+

dbench is not very remarkable here, unless we notice how poorly active
intel_pstate is performing on 80x-BROADWELL-NUMA: 23% regression versus
non-invariant schedutil. We repeated that run getting consistent results. Out
of scope for the patch at hand, but deserving future investigation. Other than
that, we previously ran this campaign with Linux v5.0 and saw the patch doing
better on dbench at the time. We haven't checked closely and can only
speculate at this point.

On the NUMA boxes kernbench gets 10-15% improvements on average; we'll see in
the detailed tables that the gains concentrate on low process counts (lightly
loaded machines).

The test we call "gitsource" (running the git unit test suite, a long-running
single-threaded shell script) appears rather spectacular in this table (gains
of 30-50% depending on the machine). It is to be noted, however, that
gitsource has no adjustable parameters (such as the number of jobs in
kernbench, which we average over in order to get a single-number summary
score) and is exactly the kind of low-parallelism workload that benefits the
most from this patch. When looking at the detailed tables of kernbench or
tbench4, at low process or client counts one can see similar numbers.

5.3.3 SELECTION OF DETAILED RESULTS
-----------------------------------

Machine            : 48x-HASWELL-NUMA
Benchmark          : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter  : number of clients
Unit               : MB/sec (higher is better)

            5.2.0 vanilla (BASELINE)        5.2.0 intel_pstate              5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1     126.73 +- 0.31% (        )      315.91 +- 0.66% ( 149.28%)      125.03 +- 0.76% (  -1.34%)
Hmean  2     258.04 +- 0.62% (        )      614.16 +- 0.51% ( 138.01%)      269.58 +- 1.45% (   4.47%)
Hmean  4     514.30 +- 0.67% (        )     1146.58 +- 0.54% ( 122.94%)      533.84 +- 1.99% (   3.80%)
Hmean  8    1111.38 +- 2.52% (        )     2159.78 +- 0.38% (  94.33%)     1359.92 +- 1.56% (  22.36%)
Hmean  16   2286.47 +- 1.36% (        )     3338.29 +- 0.21% (  46.00%)     2720.20 +- 0.52% (  18.97%)
Hmean  32   4704.84 +- 0.35% (        )     4759.03 +- 0.43% (   1.15%)     4774.48 +- 0.30% (   1.48%)
Hmean  64   7578.04 +- 0.27% (        )     7533.70 +- 0.43% (  -0.59%)     7462.17 +- 0.65% (  -1.53%)
Hmean  128  6998.52 +- 0.16% (        )     6987.59 +- 0.12% (  -0.16%)     6909.17 +- 0.14% (  -1.28%)
Hmean  192  6901.35 +- 0.25% (        )     6913.16 +- 0.10% (   0.17%)     6855.47 +- 0.21% (  -0.66%)

            5.2.0 3C-turbo                  5.2.0 4C-turbo                  5.2.0 12C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1     128.43 +- 0.28% (   1.34%)      130.64 +- 3.81% (   3.09%)      153.71 +- 5.89% (  21.30%)
Hmean  2     311.70 +- 6.15% (  20.79%)      281.66 +- 3.40% (   9.15%)      305.08 +- 5.70% (  18.23%)
Hmean  4     641.98 +- 2.32% (  24.83%)      623.88 +- 5.28% (  21.31%)      906.84 +- 4.65% (  76.32%)
Hmean  8    1633.31 +- 1.56% (  46.96%)     1714.16 +- 0.93% (  54.24%)     2095.74 +- 0.47% (  88.57%)
Hmean  16   3047.24 +- 0.42% (  33.27%)     3155.02 +- 0.30% (  37.99%)     3634.58 +- 0.15% (  58.96%)
Hmean  32   4734.31 +- 0.60% (   0.63%)     4804.38 +- 0.23% (   2.12%)     4674.62 +- 0.27% (  -0.64%)
Hmean  64   7699.74 +- 0.35% (   1.61%)     7499.72 +- 0.34% (  -1.03%)     7659.03 +- 0.25% (   1.07%)
Hmean  128  6935.18 +- 0.15% (  -0.91%)     6942.54 +- 0.10% (  -0.80%)     7004.85 +- 0.12% (   0.09%)
Hmean  192  6901.62 +- 0.12% (   0.00%)     6856.93 +- 0.10% (  -0.64%)     6978.74 +- 0.10% (   1.12%)

This is one of the cases where the patch still can't surpass active
intel_pstate, not even when freq_max is as low as 12C-turbo. Otherwise, gains
are visible up to 16 clients and the saturated scenario is the same as
baseline.

The scores in the summary table from the previous sections are ratios of
geometric means of the results over different clients, as seen in this table.

Machine            : 80x-BROADWELL-NUMA
Benchmark          : kernbench (kernel compilation)
Varying parameter  : number of jobs
Unit               : seconds (lower is better)

            5.2.0 vanilla (BASELINE)        5.2.0 intel_pstate              5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2     379.68 +- 0.06% (        )      330.20 +- 0.43% (  13.03%)      285.93 +- 0.07% (  24.69%)
Amean  4     200.15 +- 0.24% (        )      175.89 +- 0.22% (  12.12%)      153.78 +- 0.25% (  23.17%)
Amean  8     106.20 +- 0.31% (        )       95.54 +- 0.23% (  10.03%)       86.74 +- 0.10% (  18.32%)
Amean  16     56.96 +- 1.31% (        )       53.25 +- 1.22% (   6.50%)       48.34 +- 1.73% (  15.13%)
Amean  32     34.80 +- 2.46% (        )       33.81 +- 0.77% (   2.83%)       30.28 +- 1.59% (  12.99%)
Amean  64     26.11 +- 1.63% (        )       25.04 +- 1.07% (   4.10%)       22.41 +- 2.37% (  14.16%)
Amean  128    24.80 +- 1.36% (        )       23.57 +- 1.23% (   4.93%)       21.44 +- 1.37% (  13.55%)
Amean  160    24.85 +- 0.56% (        )       23.85 +- 1.17% (   4.06%)       21.25 +- 1.12% (  14.49%)

            5.2.0 3C-turbo                  5.2.0 4C-turbo                  5.2.0 8C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean  2     284.08 +- 0.13% (  25.18%)      283.96 +- 0.51% (  25.21%)      285.05 +- 0.21% (  24.92%)
Amean  4     153.18 +- 0.22% (  23.47%)      154.70 +- 1.64% (  22.71%)      153.64 +- 0.30% (  23.24%)
Amean  8      87.06 +- 0.28% (  18.02%)       86.77 +- 0.46% (  18.29%)       86.78 +- 0.22% (  18.28%)
Amean  16     48.03 +- 0.93% (  15.68%)       47.75 +- 1.99% (  16.17%)       47.52 +- 1.61% (  16.57%)
Amean  32     30.23 +- 1.20% (  13.14%)       30.08 +- 1.67% (  13.57%)       30.07 +- 1.67% (  13.60%)
Amean  64     22.59 +- 2.02% (  13.50%)       22.63 +- 0.81% (  13.32%)       22.42 +- 0.76% (  14.12%)
Amean  128    21.37 +- 0.67% (  13.82%)       21.31 +- 1.15% (  14.07%)       21.17 +- 1.93% (  14.63%)
Amean  160    21.68 +- 0.57% (  12.76%)       21.18 +- 1.74% (  14.77%)       21.22 +- 1.00% (  14.61%)

The patch outperforms active intel_pstate (and baseline) by a considerable
margin; the summary table from the previous section says 4C turbo and active
intel_pstate are 0.83 and 0.93 against baseline respectively, so 4C turbo is
0.83/0.93=0.89 against intel_pstate (~10% better on average). There is no
noticeable difference with regard to the value of freq_max.

Machine            : 8x-SKYLAKE-UMA
Benchmark          : gitsource (time to run the git unit test suite)
Varying parameter  : none
Unit               : seconds (lower is better)

            5.2.0 vanilla                   5.2.0 intel_pstate/hwp          5.2.0 1C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean        858.85 +- 1.16% (        )      791.94 +- 0.21% (   7.79%)      474.95 (  44.70%)

            5.2.0 3C-turbo                  5.2.0 4C-turbo
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean        475.26 +- 0.20% (  44.66%)      474.34 +- 0.13% (  44.77%)

In this test, which is of interest as representing shell-intensive
(i.e. fork-intensive) serialized workloads, invariant schedutil outperforms
intel_pstate/powersave by a whopping 40% margin.

5.3.4 POWER CONSUMPTION, PERFORMANCE-PER-WATT
---------------------------------------------

The following table shows average power consumption in watts for each
benchmark. Data comes from turbostat (package average), which in turn is read
from the RAPL interface on CPUs. We know the patch affects CPU frequencies so
it's reasonable to ignore other power consumers (such as memory or I/O). Also,
we don't have a power meter available in the lab so RAPL is the best we have.

turbostat sampled average power every 10 seconds for the entire duration of
each benchmark. We took all those values and averaged them (i.e. we don't
have detail on a per-parameter granularity, only on whole benchmarks).

80x-BROADWELL-NUMA (power consumption, watts)
                                                    +--------+
                 BASELINE I_PSTATE     1C       3C  |     4C |      8C
    pgbench-ro     130.01   142.77   131.11   132.45 | 134.65 |  136.84
    pgbench-rw      68.30    60.83    71.45    71.70 |  71.65 |   72.54
    dbench4         90.25    59.06   101.43    99.89 | 101.10 |  102.94
    netperf-udp     65.70    69.81    66.02    68.03 |  68.27 |   68.95
    netperf-tcp     88.08    87.96    88.97    88.89 |  88.85 |   88.20
    tbench4        142.32   176.73   153.02   163.91 | 165.58 |  176.07
    kernbench       92.94   101.95   114.91   115.47 | 115.52 |  115.10
    gitsource       40.92    41.87    75.14    75.20 |  75.40 |   75.70
                                                    +--------+

8x-SKYLAKE-UMA (power consumption, watts)
                                                        +--------+
                 BASELINE I_PSTATE/HWP     1C       3C  |     4C |
    pgbench-ro      46.49        46.68    46.56    46.59 |  46.52 |
    pgbench-rw      29.34        31.38    30.98    31.00 |  31.00 |
    dbench4         27.28        27.37    27.49    27.41 |  27.38 |
    netperf-udp     22.33        22.41    22.36    22.35 |  22.36 |
    netperf-tcp     27.29        27.29    27.30    27.31 |  27.33 |
    tbench4         41.13        45.61    43.10    43.33 |  43.56 |
    kernbench       42.56        42.63    43.01    43.01 |  43.01 |
    gitsource       13.32        13.69    17.33    17.30 |  17.35 |
                                                        +--------+

48x-HASWELL-NUMA (power consumption, watts)
                                                    +--------+
                 BASELINE I_PSTATE     1C       3C  |     4C |     12C
    pgbench-ro     128.84   136.04   129.87   132.43 | 132.30 |  134.86
    pgbench-rw      37.68    37.92    37.17    37.74 |  37.73 |   37.31
    dbench4         28.56    28.73    28.60    28.73 |  28.70 |   28.79
    netperf-udp     56.70    60.44    56.79    57.42 |  57.54 |   57.52
    netperf-tcp     75.49    75.27    75.87    76.02 |  76.01 |   75.95
    tbench4        115.44   139.51   119.53   123.07 | 123.97 |  130.22
    kernbench       83.23    91.55    95.58    95.69 |  95.72 |   96.04
    gitsource       36.79    36.99    39.99    40.34 |  40.35 |   40.23
                                                    +--------+

A lower power consumption isn't necessarily better, it depends on what is done
with that energy. Here are tables with the ratio of performance-per-watt on
each machine and benchmark. Higher is always better; a tilde (~) means a
neutral ratio (i.e. 1.00).
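
The text does not spell out how a performance-per-watt ratio is derived. One
plausible reading, assumed here, is the performance ratio against baseline
divided by the power ratio against baseline, with time-based scores inverted
first. The helper name perf_per_watt_ratio() is hypothetical; the sketch only
illustrates the arithmetic.

```c
/*
 * Assumed formula (not stated in the patch description):
 *   (perf_new / perf_base) / (watts_new / watts_base)
 * 'score' is a comparison ratio from the summary tables. For
 * lower-is-better benchmarks (elapsed time) it is inverted first.
 */
static double perf_per_watt_ratio(double score, int lower_is_better,
				  double watts_new, double watts_base)
{
	double perf = lower_is_better ? 1.0 / score : score;

	return perf / (watts_new / watts_base);
}
```

For 80x-BROADWELL-NUMA with freq_max = 4C turbo, tbench4 (score 1.30, 165.58 W
against 142.32 W) gives roughly 1.12 and gitsource (score 0.49, 75.40 W
against 40.92 W) gives roughly 1.11 under this assumption.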

80x-BROADWELL-NUMA (performance-per-watt ratios; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  8C
    pgbench-ro   1.04       1.06   0.94  | 1.07 |  1.08
    pgbench-rw   1.10       0.97   0.96  | 0.96 |  0.97
    dbench4      1.24       0.94   0.95  | 0.94 |  0.92
    netperf-udp  ~          1.02   1.02  | ~    |  1.02
    netperf-tcp  ~          1.02   ~     | ~    |  1.02
    tbench4      1.26       1.10   1.06  | 1.12 |  1.26
    kernbench    0.98       0.97   0.97  | 0.97 |  0.98
    gitsource    ~          1.11   1.11  | 1.11 |  1.13
                                         +------+

8x-SKYLAKE-UMA (performance-per-watt ratios; higher is better)
                                             +------+
                 I_PSTATE/HWP   1C     3C    | 4C   |
    pgbench-ro   ~              ~      ~     | ~    |
    pgbench-rw   0.95           0.97   0.96  | 0.96 |
    dbench4      ~              ~      ~     | ~    |
    netperf-udp  ~              ~      ~     | ~    |
    netperf-tcp  ~              ~      ~     | ~    |
    tbench4      1.17           1.09   1.08  | 1.10 |
    kernbench    ~              ~      ~     | ~    |
    gitsource    1.06           1.40   1.40  | 1.40 |
                                             +------+

48x-HASWELL-NUMA (performance-per-watt ratios; higher is better)
                                         +------+
                 I_PSTATE   1C     3C    | 4C   |  12C
    pgbench-ro   1.09       ~      1.09  | 1.03 |  1.11
    pgbench-rw   ~          0.86   ~     | ~    |  0.86
    dbench4      ~          1.02   1.02  | 1.02 |  ~
    netperf-udp  ~          0.97   1.03  | 1.02 |  ~
    netperf-tcp  0.96       ~      ~     | ~    |  ~
    tbench4      1.24       ~      1.06  | 1.05 |  1.11
    kernbench    0.97       0.97   0.98  | 0.97 |  0.96
    gitsource    1.03       1.33   1.32  | 1.32 |  1.33
                                         +------+

These results are overall pleasing: in plenty of cases we observe
performance-per-watt improvements. The few regressions (read/write pgbench and
dbench on the Broadwell machine) are of small magnitude. kernbench loses a few
percentage points (it has a 10-15% performance improvement, but apparently the
increase in power consumption is larger than that). tbench4 and gitsource,
which benefit the most from the patch, keep a positive score in this table,
which is a welcome surprise; that suggests that in those particular workloads
the non-invariant schedutil (and active intel_pstate, too) makes some rather
suboptimal frequency selections.

+-------------------------------------------------------------------------+
| 6. MICROARCH'ES ADDRESSED HERE
+-------------------------------------------------------------------------+

The patch addresses Xeon Core processors that use MSR_PLATFORM_INFO and
MSR_TURBO_RATIO_LIMIT to advertise their base frequency and turbo frequencies
respectively. This excludes the recent Xeon Scalable Performance processors
line (Xeon Gold, Platinum etc) whose MSRs have to be parsed differently.

Subsequent patches will address:

* Xeon Scalable Performance processors and Atom Goldmont/Goldmont Plus
* Xeon Phi (Knights Landing, Knights Mill)
* Atom Silvermont

+-------------------------------------------------------------------------+
| 7. REFERENCES
+-------------------------------------------------------------------------+

Tests have been run with the help of the MMTests performance testing
framework, see github.com/gormanm/mmtests. The configuration file names for
the benchmarks used are:

    db-pgbench-timed-ro-small-xfs
    db-pgbench-timed-rw-small-xfs
    io-dbench4-async-xfs
    network-netperf-unbound
    network-tbench
    scheduler-unbound
    workload-kerndevel-xfs
    workload-shellscripts-xfs
    hpc-nas-c-class-mpi-full-xfs
    hpc-nas-c-class-omp-full

All those benchmarks are generally available on the web:

pgbench: https://www.postgresql.org/docs/10/pgbench.html
netperf: https://hewlettpackard.github.io/netperf/
dbench/tbench: https://dbench.samba.org/
gitsource: git unit test suite, github.com/git/git
NAS Parallel Benchmarks: https://www.nas.nasa.gov/publications/npb.html
hackbench: https://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c

Suggested-by: Peter Zijlstra
Signed-off-by: Giovanni Gherdovich
Acked-by: Doug Smythies
---
 arch/x86/include/asm/topology.h |  23 ++++++
 arch/x86/kernel/smpboot.c       | 176 +++++++++++++++++++++++++++++++++++++++-
 kernel/sched/core.c             |   1 +
 kernel/sched/sched.h            |   7 ++
 4 files changed, 206 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 4b14d2318251..9b3aca463c8f 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -193,4 +193,27 @@ static inline void sched_clear_itmt_support(void)
 }
 #endif /* CONFIG_SCHED_MC_PRIO */
 
+#ifdef CONFIG_SMP
+#include
+
+DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
+
+#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)
+
+DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
+
+static inline long arch_scale_freq_capacity(int cpu)
+{
+	if (arch_scale_freq_invariant())
+		return per_cpu(arch_cpu_freq, cpu);
+
+	return 1024 /* SCHED_CAPACITY_SCALE */;
+}
+#define arch_scale_freq_capacity arch_scale_freq_capacity
+
+extern void arch_scale_freq_tick(void);
+#define arch_scale_freq_tick arch_scale_freq_tick
+
+#endif
+
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 69881b2d446c..814d7900779d 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -147,6 +147,8 @@ static inline void smpboot_restore_warm_reset_vector(void)
 	*((volatile u32 *)phys_to_virt(TRAMPOLINE_PHYS_LOW)) = 0;
 }
 
+static void set_cpu_max_freq(void);
+
 /*
  * Report back to the Boot Processor during boot time or to the caller processor
  * during CPU online.
@@ -183,6 +185,8 @@ static void smp_callin(void)
 	 */
 	set_cpu_sibling_map(raw_smp_processor_id());
 
+	set_cpu_max_freq();
+
 	/*
 	 * Get our bogomips.
 	 * Update loops_per_jiffy in cpu_data. Previous call to
@@ -1337,7 +1341,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 	set_sched_topology(x86_topology);
 
 	set_cpu_sibling_map(0);
-
+	set_cpu_max_freq();
 	smp_sanity_check();
 
 	switch (apic_intr_mode) {
@@ -1764,3 +1768,173 @@ void native_play_dead(void)
 }
 
 #endif
+
+/*
+ * APERF/MPERF frequency ratio computation.
+ *
+ * The scheduler wants to do frequency invariant accounting and needs a <1
+ * ratio to account for the 'current' frequency, corresponding to
+ * freq_curr / freq_max.
+ *
+ * Since the frequency freq_curr on x86 is controlled by micro-controller and
+ * our P-state setting is little more than a request/hint, we need to observe
+ * the effective frequency 'BusyMHz', i.e. the average frequency over a time
+ * interval after discarding idle time. This is given by:
+ *
+ *   BusyMHz = delta_APERF / delta_MPERF * freq_base
+ *
+ * where freq_base is the max non-turbo P-state.
+ *
+ * The freq_max term has to be set to a somewhat arbitrary value, because we
+ * can't know which turbo states will be available at a given point in time:
+ * it all depends on the thermal headroom of the entire package. We set it to
+ * the turbo level with 4 cores active.
+ *
+ * Benchmarks show that's a good compromise between the 1C turbo ratio
+ * (freq_curr/freq_max would rarely reach 1) and something close to freq_base,
+ * which would ignore the entire turbo range (a conspicuous part, making
+ * freq_curr/freq_max always maxed out).
+ *
+ * Setting freq_max to anything less than the 1C turbo ratio makes the ratio
+ * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
+ */ + +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key); + +static DEFINE_PER_CPU(u64, arch_prev_aperf); +static DEFINE_PER_CPU(u64, arch_prev_mperf); +static u64 arch_max_freq = SCHED_CAPACITY_SCALE; + +static bool turbo_disabled(void) +{ + u64 misc_en; + int err; + + err = rdmsrl_safe(MSR_IA32_MISC_ENABLE, &misc_en); + if (err) + return false; + + return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE); +} + +#include +#include + +#define ICPU(model) \ + {X86_VENDOR_INTEL, 6, model, X86_FEATURE_APERFMPERF, 0} + +static const struct x86_cpu_id has_knl_turbo_ratio_limits[] = { + ICPU(INTEL_FAM6_XEON_PHI_KNL), + ICPU(INTEL_FAM6_XEON_PHI_KNM), + {} +}; + +static const struct x86_cpu_id has_skx_turbo_ratio_limits[] = { + ICPU(INTEL_FAM6_SKYLAKE_X), + {} +}; + +static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = { + ICPU(INTEL_FAM6_ATOM_GOLDMONT), + ICPU(INTEL_FAM6_ATOM_GOLDMONT_D), + ICPU(INTEL_FAM6_ATOM_GOLDMONT_PLUS), + {} +}; + +static void core_set_cpu_max_freq(void) +{ + u64 ratio, turbo_ratio; + int err; + + err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio); + if (err) + return; + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio); + if (err) + return; + + ratio = (ratio >> 8) & 0xFF; /* max P state ratio */ + turbo_ratio = (turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio */ + + arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio); +} + +static void intel_set_cpu_max_freq(void) +{ + /* + * TODO: add support for: + * + * - Xeon Gold/Platinum + * - Xeon Phi (KNM, KNL) + * - Atom Goldmont + * - Atom Silvermont + * + * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE + */ + + static_branch_enable(&arch_scale_freq_key); + + if (turbo_disabled() || + x86_match_cpu(has_skx_turbo_ratio_limits) || + x86_match_cpu(has_knl_turbo_ratio_limits) || + x86_match_cpu(has_glm_turbo_ratio_limits)) + return; + + core_set_cpu_max_freq(); +} + +static void init_scale_freq(void *arg) +{ + u64 aperf, mperf; + + rdmsrl(MSR_IA32_APERF, aperf); + 
rdmsrl(MSR_IA32_MPERF, mperf); + + this_cpu_write(arch_prev_aperf, aperf); + this_cpu_write(arch_prev_mperf, mperf); +} + +static void set_cpu_max_freq(void) +{ + if (smp_processor_id() != 0 || !boot_cpu_has(X86_FEATURE_APERFMPERF)) + return; + + if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) + intel_set_cpu_max_freq(); + + init_scale_freq(NULL); +} + +DEFINE_PER_CPU(unsigned long, arch_cpu_freq); + +void arch_scale_freq_tick(void) +{ + u64 freq; + u64 aperf, mperf; + u64 acnt, mcnt; + + if (!arch_scale_freq_invariant()) + return; + + rdmsrl(MSR_IA32_APERF, aperf); + rdmsrl(MSR_IA32_MPERF, mperf); + + acnt = aperf - this_cpu_read(arch_prev_aperf); + mcnt = mperf - this_cpu_read(arch_prev_mperf); + if (!mcnt) + return; + + this_cpu_write(arch_prev_aperf, aperf); + this_cpu_write(arch_prev_mperf, mperf); + + acnt <<= 2*SCHED_CAPACITY_SHIFT; + mcnt *= arch_max_freq; + + freq = div64_u64(acnt, mcnt); + + if (freq > SCHED_CAPACITY_SCALE) + freq = SCHED_CAPACITY_SCALE; + + this_cpu_write(arch_cpu_freq, freq); +} diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 0f2eb3629070..1d0f5df8020e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3593,6 +3593,7 @@ void scheduler_tick(void) struct task_struct *curr = rq->curr; struct rq_flags rf; + arch_scale_freq_tick(); sched_clock_tick(); rq_lock(rq, &rf); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c8870c5bd7df..4c4f5f197e10 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1977,6 +1977,13 @@ static inline int hrtick_enabled(struct rq *rq) #endif /* CONFIG_SCHED_HRTICK */ +#ifndef arch_scale_freq_tick +static __always_inline +void arch_scale_freq_tick(void) +{ +} +#endif + #ifndef arch_scale_freq_capacity static __always_inline unsigned long arch_scale_freq_capacity(int cpu) From patchwork Wed Nov 13 12:46:50 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Giovanni Gherdovich X-Patchwork-Id: 
11241835 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B36C717E6 for ; Wed, 13 Nov 2019 12:41:44 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 8E3D121783 for ; Wed, 13 Nov 2019 12:41:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727387AbfKMMlh (ORCPT ); Wed, 13 Nov 2019 07:41:37 -0500 Received: from mx2.suse.de ([195.135.220.15]:39214 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726994AbfKMMlV (ORCPT ); Wed, 13 Nov 2019 07:41:21 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 8FE5CB549; Wed, 13 Nov 2019 12:41:17 +0000 (UTC) From: Giovanni Gherdovich To: Srinivas Pandruvada , Thomas Gleixner , Ingo Molnar , Peter Zijlstra , Borislav Petkov , Len Brown , "Rafael J . Wysocki" Cc: x86@kernel.org, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, Mel Gorman , Matt Fleming , Viresh Kumar , Juri Lelli , Paul Turner , Vincent Guittot , Quentin Perret , Dietmar Eggemann , Doug Smythies , Giovanni Gherdovich Subject: [PATCH v4 2/6] x86,sched: Add support for frequency invariance on SKYLAKE_X Date: Wed, 13 Nov 2019 13:46:50 +0100 Message-Id: <20191113124654.18122-3-ggherdovich@suse.cz> X-Mailer: git-send-email 2.16.4 In-Reply-To: <20191113124654.18122-1-ggherdovich@suse.cz> References: <20191113124654.18122-1-ggherdovich@suse.cz> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org The scheduler needs the ratio freq_curr/freq_max for frequency-invariant accounting. On SKYLAKE_X CPUs set freq_max to the highest frequency that can be sustained by a group of at least 4 cores. 
From the changelog of commit 31e07522be56 ("tools/power turbostat: fix
decoding for GLM, DNV, SKX turbo-ratio limits"):

  > Newer processors do not hard-code the the number of cpus in each bin
  > to {1, 2, 3, 4, 5, 6, 7, 8} Rather, they can specify any number
  > of CPUS in each of the 8 bins:
  >
  > eg.
  >
  > ...
  > 37 * 100.0 = 3600.0 MHz max turbo 4 active cores
  > 38 * 100.0 = 3700.0 MHz max turbo 3 active cores
  > 39 * 100.0 = 3800.0 MHz max turbo 2 active cores
  > 39 * 100.0 = 3900.0 MHz max turbo 1 active cores
  >
  > could now look something like this:
  >
  > ...
  > 37 * 100.0 = 3600.0 MHz max turbo 16 active cores
  > 38 * 100.0 = 3700.0 MHz max turbo 8 active cores
  > 39 * 100.0 = 3800.0 MHz max turbo 4 active cores
  > 39 * 100.0 = 3900.0 MHz max turbo 2 active cores

This encoding of turbo levels applies to both SKYLAKE_X and
GOLDMONT/GOLDMONT_D, but we treat these two classes in separate commits
because their freq_max values need to be different. For SKX we prefer a lower
freq_max in the ratio freq_curr/freq_max, allowing load and utilization to
overshoot and the schedutil governor to be more performance-oriented. Models
from the Atom series (such as GOLDMONT*) are handled in a forthcoming commit
as they have to favor power-efficiency over performance.

Results from a performance evaluation follow.

1. TEST SETUP
2. NEUTRAL BENCHMARKS
3. NON-NEUTRAL BENCHMARKS
4. DETAILED TABLES

1. TEST SETUP
-------------

Test machine:

CPU Model : Intel Xeon Platinum 8260L CPU @ 2.40GHz (a.k.a.
Cascade Lake)
Fam/Mod/Ste : 6:85:6
Topology    : 2 sockets, 24 cores / 48 threads each socket
Memory      : 192G
Storage     : SSD, XFS filesystem

Max EFFICiency, BASE frequency and available turbo levels (MHz):

    EFFIC 1000 |**********
    BASE  2400 |************************
    24C   3100 |*******************************
    20C   3300 |*********************************
    16C   3600 |************************************
    12C   3600 |************************************
    8C    3600 |************************************
    4C    3700 |*************************************
    2C    3900 |***************************************

Tested kernels:

Baseline      : v5.2, intel_pstate passive, schedutil
Comparison #1 : v5.2, intel_pstate active , powersave+HWP
Comparison #2 : v5.2, this patch, intel_pstate passive, schedutil

2. NEUTRAL BENCHMARKS
---------------------

* pgbench read/write
* NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing
* hackbench
* netperf

3. NON-NEUTRAL BENCHMARKS
-------------------------

comparison ratio with baseline; 1.00 means neutral, higher is better:

                        I_PSTATE      FREQ-INV
    ----------------------------------------
    pgbench read-only   1.10          ~
    tbench              1.82          1.14

comparison ratio with baseline; 1.00 means neutral, lower is better:

                        I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench              ~             0.97
    kernbench           0.88          0.78
    gitsource[*]        ~             0.46

[*] "gitsource" consists in running git's unit tests

tilde (~) means 1.00, ie result identical to baseline

Performance per watt:

performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:

                        I_PSTATE      FREQ-INV
    ----------------------------------------
    dbench              0.92          0.91
    tbench              1.26          1.04
    kernbench           0.95          0.96
    gitsource           1.03          1.30

Similarly to earlier Xeons, measurable performance gains over non-invariant
schedutil are observed on dbench, tbench, kernel compilation and running the
git unit tests suite. The detailed tables show that the patch scores the
largest difference when the machine is lightly loaded.
Power efficiency suffers slightly on kernbench and a bit more on dbench, but
largely improves on gitsource (which also runs considerably faster). For
reference, we also report results using active intel_pstate with powersave and
HWP; the largest gap between non-invariant schedutil and intel_pstate+powersave
is still tbench, which runs 82% better and with 26% improved efficiency on the
latter configuration -- this divide isn't closed yet by frequency-invariant
schedutil.

4. DETAILED TABLES
------------------

Benchmark         : tbench4 (i.e. dbench4 over the network, actually loopback)
Varying parameter : number of clients
Unit              : MB/sec (higher is better)

               5.2.0 vanilla (BASELINE)       5.2.0 intel_pstate/HWP                  5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Hmean  1        183.56 +- 0.21% (        )      516.12 +- 0.57% ( 181.18%)      185.59 +- 0.59% (   1.11%)
Hmean  2        365.75 +- 0.25% (        )     1015.14 +- 0.33% ( 177.55%)      402.59 +- 4.48% (  10.07%)
Hmean  4        720.99 +- 0.44% (        )     1951.75 +- 0.28% ( 170.70%)      738.39 +- 1.72% (   2.41%)
Hmean  8       1449.93 +- 0.34% (        )     3830.56 +- 0.24% ( 164.19%)     1750.36 +- 4.65% (  20.72%)
Hmean  16      2874.26 +- 0.57% (        )     7381.62 +- 0.53% ( 156.82%)     4348.35 +- 2.22% (  51.29%)
Hmean  32      6116.17 +- 5.10% (        )    13013.05 +- 0.08% ( 112.76%)     8980.35 +- 0.66% (  46.83%)
Hmean  64     14485.04 +- 3.46% (        )    17835.12 +- 0.35% (  23.13%)    16540.73 +- 0.51% (  14.19%)
Hmean  128    30779.16 +- 3.20% (        )    32796.94 +- 2.13% (   6.56%)    31512.58 +- 0.20% (   2.38%)
Hmean  256    34664.66 +- 0.81% (        )    34604.67 +- 0.46% (  -0.17%)    34943.70 +- 0.25% (   0.80%)
Hmean  384    33957.51 +- 0.11% (        )    34091.50 +- 0.14% (   0.39%)    33921.41 +- 0.09% (  -0.11%)

Benchmark         : kernbench (kernel compilation)
Varying parameter : number of jobs
Unit              : seconds (lower is better)

               5.2.0 vanilla (BASELINE)       5.2.0 intel_pstate/HWP                  5.2.0 freq-inv
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Amean 2 332.94 +- 0.40% ( ) 260.16
+- 0.45% ( 21.86%) 233.56 +- 0.21% ( 29.85%) Amean 4 173.04 +- 0.43% ( ) 138.76 +- 0.03% ( 19.81%) 123.59 +- 0.11% ( 28.58%) Amean 8 89.65 +- 0.20% ( ) 73.54 +- 0.09% ( 17.97%) 65.69 +- 0.10% ( 26.72%) Amean 16 48.08 +- 1.41% ( ) 41.64 +- 1.61% ( 13.40%) 36.00 +- 1.80% ( 25.11%) Amean 32 28.78 +- 0.72% ( ) 26.61 +- 1.99% ( 7.55%) 23.19 +- 1.68% ( 19.43%) Amean 64 20.46 +- 1.85% ( ) 19.76 +- 0.35% ( 3.42%) 17.38 +- 0.92% ( 15.06%) Amean 128 18.69 +- 1.70% ( ) 17.59 +- 1.04% ( 5.90%) 15.73 +- 1.40% ( 15.85%) Amean 192 18.82 +- 1.01% ( ) 17.76 +- 0.77% ( 5.67%) 15.57 +- 1.80% ( 17.28%) Benchmark : gitsource (time to run the git unit test suite) Varying parameter : none Unit : seconds (lower is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate/HWP 5.2.0 freq-inv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Amean 792.49 +- 0.20% ( ) 779.35 +- 0.24% ( 1.66%) 427.14 +- 0.16% ( 46.10%) Signed-off-by: Giovanni Gherdovich --- arch/x86/kernel/smpboot.c | 79 +++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 66 insertions(+), 13 deletions(-) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 814d7900779d..11d57d741584 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1841,23 +1841,71 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = { {} }; -static void core_set_cpu_max_freq(void) +static int get_turbo_ratio_group(u64 *turbo_ratio) +{ + u64 ratio, core_counts; + u32 group_size = 0; + int err, i, found = 0; + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &ratio); + if (err) + return err; + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT1, &core_counts); + if (err) + return err; + + for (i = 0; i < 64; i += 8) { + group_size = (core_counts >> i) & 0xFF; + if (group_size >= 4) { + *turbo_ratio = (ratio >> i) & 0xFF; + found = 1; + break; + } + } + + if (!found) + return 1; + + return 0; +} + +static bool skx_set_cpu_max_freq(u64 *ratio, u64 
*turbo_ratio) { - u64 ratio, turbo_ratio; int err; - err = rdmsrl_safe(MSR_PLATFORM_INFO, &ratio); + if (!x86_match_cpu(has_skx_turbo_ratio_limits)) + return false; + + err = rdmsrl_safe(MSR_PLATFORM_INFO, ratio); if (err) - return; + return false; - err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &turbo_ratio); + err = get_turbo_ratio_group(turbo_ratio); /* 4C (circa) turbo ratio */ if (err) - return; + return false; - ratio = (ratio >> 8) & 0xFF; /* max P state ratio */ - turbo_ratio = (turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio */ + *ratio = (*ratio >> 8) & 0xFF; /* max P state ratio */ - arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio); + return true; +} + +static bool core_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio) +{ + int err; + + err = rdmsrl_safe(MSR_PLATFORM_INFO, ratio); + if (err) + return false; + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, turbo_ratio); + if (err) + return false; + + *ratio = (*ratio >> 8) & 0xFF; /* max P state ratio */ + *turbo_ratio = (*turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio */ + + return true; } static void intel_set_cpu_max_freq(void) @@ -1865,7 +1913,6 @@ static void intel_set_cpu_max_freq(void) /* * TODO: add support for: * - * - Xeon Gold/Platinum * - Xeon Phi (KNM, KNL) * - Atom Goldmont * - Atom Silvermont @@ -1873,15 +1920,21 @@ static void intel_set_cpu_max_freq(void) * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE */ - static_branch_enable(&arch_scale_freq_key); + u64 ratio = 1, turbo_ratio = 1; if (turbo_disabled() || - x86_match_cpu(has_skx_turbo_ratio_limits) || x86_match_cpu(has_knl_turbo_ratio_limits) || x86_match_cpu(has_glm_turbo_ratio_limits)) return; - core_set_cpu_max_freq(); + if (skx_set_cpu_max_freq(&ratio, &turbo_ratio)) + goto set_value; + + core_set_cpu_max_freq(&ratio, &turbo_ratio); + +set_value: + arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio); + static_branch_enable(&arch_scale_freq_key); } static void init_scale_freq(void *arg) From 
patchwork Wed Nov 13 12:46:51 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Giovanni Gherdovich X-Patchwork-Id: 11241837 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3343B1850 for ; Wed, 13 Nov 2019 12:41:45 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 16980222D0 for ; Wed, 13 Nov 2019 12:41:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726350AbfKMMlh (ORCPT ); Wed, 13 Nov 2019 07:41:37 -0500 Received: from mx2.suse.de ([195.135.220.15]:39246 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727064AbfKMMlV (ORCPT ); Wed, 13 Nov 2019 07:41:21 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 5FB2DB554; Wed, 13 Nov 2019 12:41:18 +0000 (UTC) From: Giovanni Gherdovich To: Srinivas Pandruvada , Thomas Gleixner , Ingo Molnar , Peter Zijlstra , Borislav Petkov , Len Brown , "Rafael J . 
Wysocki" Cc: x86@kernel.org, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, Mel Gorman , Matt Fleming , Viresh Kumar , Juri Lelli , Paul Turner , Vincent Guittot , Quentin Perret , Dietmar Eggemann , Doug Smythies , Giovanni Gherdovich Subject: [PATCH v4 3/6] x86,sched: Add support for frequency invariance on XEON_PHI_KNL/KNM Date: Wed, 13 Nov 2019 13:46:51 +0100 Message-Id: <20191113124654.18122-4-ggherdovich@suse.cz> X-Mailer: git-send-email 2.16.4 In-Reply-To: <20191113124654.18122-1-ggherdovich@suse.cz> References: <20191113124654.18122-1-ggherdovich@suse.cz> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org The scheduler needs the ratio freq_curr/freq_max for frequency-invariant accounting. On Xeon Phi CPUs set freq_max to the second-highest frequency reported by the CPU. Xeon Phi CPUs such as Knights Landing and Knights Mill typically have either one or two turbo frequencies; in the former case that's 100 MHz above the base frequency, in the latter case the two levels are 100 MHz and 200 MHz above base frequency. We set freq_max to the second-highest frequency reported by the CPU. This could be the base frequency (if only one turbo level is available) or the first turbo level (if two levels are available). The rationale is to compromise between power efficiency or performance -- going straight to max turbo would favor efficiency and blindly using base freq would favor performance. For reference, this is how MSR_TURBO_RATIO_LIMIT must be parsed on a Xeon Phi to get the available frequencies (taken from a comment in turbostat's sources): [0] -- Reserved [7:1] -- Base value of number of active cores of bucket 1. [15:8] -- Base value of freq ratio of bucket 1. [20:16] -- +ve delta of number of active cores of bucket 2. i.e. active cores of bucket 2 = active cores of bucket 1 + delta [23:21] -- Negative delta of freq ratio of bucket 2. i.e. 
freq ratio of bucket 2 = freq ratio of bucket 1 - delta
[28:24] -- +ve delta of number of active cores of bucket 3.
[31:29] -- -ve delta of freq ratio of bucket 3.
[36:32] -- +ve delta of number of active cores of bucket 4.
[39:37] -- -ve delta of freq ratio of bucket 4.
[44:40] -- +ve delta of number of active cores of bucket 5.
[47:45] -- -ve delta of freq ratio of bucket 5.
[52:48] -- +ve delta of number of active cores of bucket 6.
[55:53] -- -ve delta of freq ratio of bucket 6.
[60:56] -- +ve delta of number of active cores of bucket 7.
[63:61] -- -ve delta of freq ratio of bucket 7.

1. PERFORMANCE EVALUATION: TBENCH +5%
2. NEUTRAL BENCHMARKS (ALL OTHERS)
3. TEST SETUP

1. PERFORMANCE EVALUATION: TBENCH +5%
-------------------------------------

A performance evaluation was conducted on a Knights Mill machine (see "Test
Setup" below), where the frequency-invariance patch (on schedutil) is compared
to both non-invariant schedutil and active intel_pstate with powersave: all
three tested kernels behave the same performance-wise and with regard to power
consumption (performance per watt). The only notable difference is tbench:

comparison ratio of performance with baseline; 1.00 means neutral, higher is better:

                        I_PSTATE      FREQ-INV
    ----------------------------------------
    tbench              1.04          1.05

performance-per-watt ratios with baseline; 1.00 means neutral, higher is better:

                        I_PSTATE      FREQ-INV
    ----------------------------------------
    tbench              1.03          1.04

which essentially means that frequency-invariant schedutil is 5% better than
baseline, the same as intel_pstate+powersave. As the results above are averaged
over the varying parameter, here is the detailed table.
Varying parameter : number of clients Unit : MB/sec (higher is better) 5.2.0 vanilla (BASELINE) 5.2.0 intel_pstate 5.2.0 freq-inv - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Hmean 1 49.06 +- 2.12% ( ) 51.66 +- 1.52% ( 5.30%) 52.87 +- 0.88% ( 7.76%) Hmean 2 93.82 +- 0.45% ( ) 103.24 +- 0.70% ( 10.05%) 105.90 +- 0.70% ( 12.88%) Hmean 4 192.46 +- 1.15% ( ) 215.95 +- 0.60% ( 12.21%) 215.78 +- 1.43% ( 12.12%) Hmean 8 406.74 +- 2.58% ( ) 438.58 +- 0.36% ( 7.83%) 437.61 +- 0.97% ( 7.59%) Hmean 16 857.70 +- 1.22% ( ) 890.26 +- 0.72% ( 3.80%) 889.11 +- 0.73% ( 3.66%) Hmean 32 1760.10 +- 0.92% ( ) 1791.70 +- 0.44% ( 1.79%) 1787.95 +- 0.44% ( 1.58%) Hmean 64 3183.50 +- 0.34% ( ) 3183.19 +- 0.36% ( -0.01%) 3187.53 +- 0.36% ( 0.13%) Hmean 128 4830.96 +- 0.31% ( ) 4846.53 +- 0.30% ( 0.32%) 4855.86 +- 0.30% ( 0.52%) Hmean 256 5467.98 +- 0.38% ( ) 5793.80 +- 0.28% ( 5.96%) 5821.94 +- 0.17% ( 6.47%) Hmean 512 5398.10 +- 0.06% ( ) 5745.56 +- 0.08% ( 6.44%) 5503.68 +- 0.07% ( 1.96%) Hmean 1024 5290.43 +- 0.63% ( ) 5221.07 +- 0.47% ( -1.31%) 5277.22 +- 0.80% ( -0.25%) Hmean 1088 5139.71 +- 0.57% ( ) 5236.02 +- 0.71% ( 1.87%) 5190.57 +- 0.41% ( 0.99%) 2. NEUTRAL BENCHMARKS (ALL OTHERS) ---------------------------------- * pgbench (both read/write and read-only) * NASA Parallel Benchmarks (NPB), MPI or OpenMP for message-passing * hackbench * netperf * dbench * kernbench * gitsource (git unit test suite) 3. TEST SETUP ------------- Test machine: CPU Model : Intel Xeon Phi CPU 7255 @ 1.10GHz (a.k.a. 
Knights Mill) Fam/Mod/Ste : 6:133:0 Topology : 1 socket, 68 cores / 272 threads Memory : 96G Storage : rotary, XFS filesystem Max EFFICiency, BASE frequency and available turbo levels (MHz): EFFIC 1000 |********** BASE 1100 |*********** 68C 1100 |*********** 30C 1200 |************ Tested kernels: Baseline : v5.2, intel_pstate passive, schedutil Comparison #1 : v5.2, intel_pstate active , powersave Comparison #2 : v5.2, this patch, intel_pstate passive, schedutil Signed-off-by: Giovanni Gherdovich --- arch/x86/kernel/smpboot.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 52 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 11d57d741584..0e79dcc03ae4 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1841,6 +1841,55 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = { {} }; +static int get_knl_turbo_ratio(u64 *turbo_ratio) +{ + u64 msr; + u32 ratio, delta_ratio; + int err, i, found = 0; + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, &msr); + if (err) + return err; + + ratio = (msr >> 8) & 0xFF; + + for (i = 16; i < 64; i += 8) { + delta_ratio = (msr >> (i + 5)) & 0x7; + if (delta_ratio) { + *turbo_ratio = ratio - delta_ratio; + found = 1; + break; + } + } + + if (!found) + return 1; + + return 0; +} + +static bool knl_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio) +{ + int err; + + if (!x86_match_cpu(has_knl_turbo_ratio_limits)) + return false; + + err = rdmsrl_safe(MSR_PLATFORM_INFO, ratio); + if (err) + return false; + + /* second highest turbo ratio */ + err = get_knl_turbo_ratio(turbo_ratio); + if (err) + return false; + + /* max P state ratio */ + *ratio = (*ratio >> 8) & 0xFF; + + return true; +} + static int get_turbo_ratio_group(u64 *turbo_ratio) { u64 ratio, core_counts; @@ -1913,7 +1962,6 @@ static void intel_set_cpu_max_freq(void) /* * TODO: add support for: * - * - Xeon Phi (KNM, KNL) * - Atom Goldmont * - Atom Silvermont * @@ -1923,10 +1971,12 
@@ static void intel_set_cpu_max_freq(void) u64 ratio = 1, turbo_ratio = 1; if (turbo_disabled() || - x86_match_cpu(has_knl_turbo_ratio_limits) || x86_match_cpu(has_glm_turbo_ratio_limits)) return; + if (knl_set_cpu_max_freq(&ratio, &turbo_ratio)) + goto set_value; + if (skx_set_cpu_max_freq(&ratio, &turbo_ratio)) goto set_value; From patchwork Wed Nov 13 12:46:52 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Giovanni Gherdovich X-Patchwork-Id: 11241839 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8A23716B1 for ; Wed, 13 Nov 2019 12:41:45 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 763002245C for ; Wed, 13 Nov 2019 12:41:45 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727152AbfKMMlV (ORCPT ); Wed, 13 Nov 2019 07:41:21 -0500 Received: from mx2.suse.de ([195.135.220.15]:39278 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727097AbfKMMlV (ORCPT ); Wed, 13 Nov 2019 07:41:21 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 2CDD3B555; Wed, 13 Nov 2019 12:41:19 +0000 (UTC) From: Giovanni Gherdovich To: Srinivas Pandruvada , Thomas Gleixner , Ingo Molnar , Peter Zijlstra , Borislav Petkov , Len Brown , "Rafael J . 
Wysocki" Cc: x86@kernel.org, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, Mel Gorman , Matt Fleming , Viresh Kumar , Juri Lelli , Paul Turner , Vincent Guittot , Quentin Perret , Dietmar Eggemann , Doug Smythies , Giovanni Gherdovich Subject: [PATCH v4 4/6] x86,sched: Add support for frequency invariance on ATOM_GOLDMONT* Date: Wed, 13 Nov 2019 13:46:52 +0100 Message-Id: <20191113124654.18122-5-ggherdovich@suse.cz> X-Mailer: git-send-email 2.16.4 In-Reply-To: <20191113124654.18122-1-ggherdovich@suse.cz> References: <20191113124654.18122-1-ggherdovich@suse.cz> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org The scheduler needs the ratio freq_curr/freq_max for frequency-invariant accounting. On GOLDMONT (aka Apollo Lake), GOLDMONT_D (aka Denverton) and GOLDMONT_PLUS CPUs (aka Gemini Lake) set freq_max to the highest frequency reported by the CPU. The encoding of turbo ratios for GOLDMONT* is identical to the one for SKYLAKE_X, but we treat the Atom case apart because we want to set freq_max to a higher value, thus the ratio freq_curr/freq_max to be lower, leading to more conservative frequency selections (favoring power efficiency). Signed-off-by: Giovanni Gherdovich --- arch/x86/kernel/smpboot.c | 32 +++++++++++++++++++++++++++++--- 1 file changed, 29 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 0e79dcc03ae4..4d192abf337d 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1795,6 +1795,10 @@ void native_play_dead(void) * which would ignore the entire turbo range (a conspicuous part, making * freq_curr/freq_max always maxed out). * + * An exception to the heuristic above is the Atom uarch, where we choose the + * highest turbo level for freq_max since Atom's are generally oriented towards + * power efficiency. 
+ * * Setting freq_max to anything less than the 1C turbo ratio makes the ratio * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1. */ @@ -1841,6 +1845,27 @@ static const struct x86_cpu_id has_glm_turbo_ratio_limits[] = { {} }; +static bool glm_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio) +{ + int err; + + if (!x86_match_cpu(has_glm_turbo_ratio_limits)) + return false; + + err = rdmsrl_safe(MSR_PLATFORM_INFO, ratio); + if (err) + return false; + + err = rdmsrl_safe(MSR_TURBO_RATIO_LIMIT, turbo_ratio); + if (err) + return false; + + *ratio = (*ratio >> 8) & 0xFF; /* max P state ratio */ + *turbo_ratio = *turbo_ratio & 0xFF; /* highest turbo ratio */ + + return true; +} + static int get_knl_turbo_ratio(u64 *turbo_ratio) { u64 msr; @@ -1962,7 +1987,6 @@ static void intel_set_cpu_max_freq(void) /* * TODO: add support for: * - * - Atom Goldmont * - Atom Silvermont * * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE @@ -1970,10 +1994,12 @@ static void intel_set_cpu_max_freq(void) u64 ratio = 1, turbo_ratio = 1; - if (turbo_disabled() || - x86_match_cpu(has_glm_turbo_ratio_limits)) + if (turbo_disabled()) return; + if (glm_set_cpu_max_freq(&ratio, &turbo_ratio)) + goto set_value; + if (knl_set_cpu_max_freq(&ratio, &turbo_ratio)) goto set_value; From patchwork Wed Nov 13 12:46:53 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Giovanni Gherdovich X-Patchwork-Id: 11241833 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1EFB916B1 for ; Wed, 13 Nov 2019 12:41:37 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 0A6FF222D0 for ; Wed, 13 Nov 2019 12:41:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727192AbfKMMlc (ORCPT ); Wed, 13 Nov 
2019 07:41:32 -0500 Received: from mx2.suse.de ([195.135.220.15]:39312 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726548AbfKMMlV (ORCPT ); Wed, 13 Nov 2019 07:41:21 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id EDF00B558; Wed, 13 Nov 2019 12:41:19 +0000 (UTC) From: Giovanni Gherdovich To: Srinivas Pandruvada , Thomas Gleixner , Ingo Molnar , Peter Zijlstra , Borislav Petkov , Len Brown , "Rafael J . Wysocki" Cc: x86@kernel.org, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, Mel Gorman , Matt Fleming , Viresh Kumar , Juri Lelli , Paul Turner , Vincent Guittot , Quentin Perret , Dietmar Eggemann , Doug Smythies , Giovanni Gherdovich Subject: [PATCH v4 5/6] x86,sched: Add support for frequency invariance on ATOM Date: Wed, 13 Nov 2019 13:46:53 +0100 Message-Id: <20191113124654.18122-6-ggherdovich@suse.cz> X-Mailer: git-send-email 2.16.4 In-Reply-To: <20191113124654.18122-1-ggherdovich@suse.cz> References: <20191113124654.18122-1-ggherdovich@suse.cz> Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org The scheduler needs the ratio freq_curr/freq_max for frequency-invariant accounting. On all ATOM CPUs prior to Goldmont, set freq_max to the 1-core turbo ratio. We intended to perform tests validating that this patch doesn't regress in terms of energy efficiency, given that this is the primary concern on Atom processors. Alas, we found out that turbostat doesn't support reading RAPL interfaces on our test machine (Airmont), and we don't have external equipment to measure power consumption; all we have is the performance results of the benchmarks we ran. 
Test machine:

Platform    : Dell Wyse 3040 Thin Client[1]
CPU Model   : Intel Atom x5-Z8350 (aka Cherry Trail, aka Airmont)
Fam/Mod/Ste : 6:76:4
Topology    : 1 socket, 4 cores / 4 threads
Memory      : 2G
Storage     : onboard flash, XFS filesystem

[1] https://www.dell.com/en-us/work/shop/wyse-endpoints-and-software/wyse-3040-thin-client/spd/wyse-3040-thin-client

Base frequency and available turbo levels (MHz):

    Min Operating Freq   266 |***
    Low Freq Mode        800 |********
    Base Freq           2400 |************************
    4 Cores             2800 |****************************
    3 Cores             2800 |****************************
    2 Cores             3200 |********************************
    1 Core              3200 |********************************

Tested kernels:

Baseline      : v5.4-rc1, intel_pstate passive, schedutil
Comparison #1 : v5.4-rc1, intel_pstate active , powersave
Comparison #2 : v5.4-rc1, this patch, intel_pstate passive, schedutil

tbench, hackbench and kernbench performed the same under all three kernels;
dbench ran faster with intel_pstate/powersave and the git unit tests were a
lot faster with intel_pstate/powersave and invariant schedutil wrt the
baseline. Not that any of this is terribly interesting anyway; one doesn't buy
an Atom system to go fast. Power consumption regressions aren't expected but we
lack the equipment to make that measurement. Turbostat seems to think that
reading RAPL on this machine isn't a good idea and we're trusting that
decision.

comparison ratio of performance with baseline; 1.00 means neutral,
lower is better:

                 I_PSTATE   FREQ-INV
    ----------------------------------------
    dbench         0.90        ~
    kernbench      0.98       0.97
    gitsource      0.63       0.43

Signed-off-by: Giovanni Gherdovich
---
 arch/x86/kernel/smpboot.c | 29 +++++++++++++++++++++--------
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 4d192abf337d..8988177064be 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1821,6 +1821,24 @@ static bool turbo_disabled(void)
 	return (misc_en & MSR_IA32_MISC_ENABLE_TURBO_DISABLE);
 }
 
+static bool slv_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
+{
+	int err;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_RATIOS, ratio);
+	if (err)
+		return false;
+
+	err = rdmsrl_safe(MSR_ATOM_CORE_TURBO_RATIOS, turbo_ratio);
+	if (err)
+		return false;
+
+	*ratio = (*ratio >> 16) & 0x3F;		/* max P state ratio */
+	*turbo_ratio = *turbo_ratio & 0x3F;	/* 1C turbo ratio */
+
+	return true;
+}
+
 #include
 #include
 
@@ -1984,19 +2002,14 @@ static bool core_set_cpu_max_freq(u64 *ratio, u64 *turbo_ratio)
 
 static void intel_set_cpu_max_freq(void)
 {
-	/*
-	 * TODO: add support for:
-	 *
-	 * - Atom Silvermont
-	 *
-	 * which all now get by default arch_max_freq = SCHED_CAPACITY_SCALE
-	 */
-
 	u64 ratio = 1, turbo_ratio = 1;
 
 	if (turbo_disabled())
 		return;
 
+	if (slv_set_cpu_max_freq(&ratio, &turbo_ratio))
+		goto set_value;
+
 	if (glm_set_cpu_max_freq(&ratio, &turbo_ratio))
 		goto set_value;
 

From patchwork Wed Nov 13 12:46:54 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Giovanni Gherdovich
X-Patchwork-Id: 11241829
Return-Path:
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A2C3D17E6 for ; Wed, 13 Nov 2019 12:41:29 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by
mail.kernel.org (Postfix) with ESMTP id 8E04022459 for ; Wed, 13 Nov 2019 12:41:29 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727345AbfKMMlZ (ORCPT ); Wed, 13 Nov 2019 07:41:25 -0500
Received: from mx2.suse.de ([195.135.220.15]:39346 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727192AbfKMMlX (ORCPT ); Wed, 13 Nov 2019 07:41:23 -0500
X-Virus-Scanned: by amavisd-new at test-mx.suse.de
Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id B9AE6B556; Wed, 13 Nov 2019 12:41:20 +0000 (UTC)
From: Giovanni Gherdovich
To: Srinivas Pandruvada, Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Borislav Petkov, Len Brown, "Rafael J . Wysocki"
Cc: x86@kernel.org, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, Mel Gorman, Matt Fleming, Viresh Kumar, Juri Lelli, Paul Turner, Vincent Guittot, Quentin Perret, Dietmar Eggemann, Doug Smythies, Giovanni Gherdovich
Subject: [PATCH v4 6/6] x86: intel_pstate: handle runtime turbo disablement/enablement in freq. invariance
Date: Wed, 13 Nov 2019 13:46:54 +0100
Message-Id: <20191113124654.18122-7-ggherdovich@suse.cz>
X-Mailer: git-send-email 2.16.4
In-Reply-To: <20191113124654.18122-1-ggherdovich@suse.cz>
References: <20191113124654.18122-1-ggherdovich@suse.cz>
Sender: linux-pm-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-pm@vger.kernel.org

On some platforms such as the Dell XPS 13 laptop the firmware disables turbo
when the machine is disconnected from AC and, vice versa, enables it again
when it's reconnected. In these cases a _PPC ACPI notification is issued.

The scheduler needs to know freq_max for frequency-invariant calculations.
To account for turbo availability coming and going, record freq_max at boot as
if turbo was available and store it in a helper variable. Use a setter
function to swap between 1024 and freq_max every time turbo goes off or on.
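
The helper-variable-plus-setter scheme described above can be sketched in plain C as follows. This is an illustrative userspace sketch only: the demo_* names are hypothetical, the sample ratios are made up, and there is no locking or per-CPU handling here.

```c
#include <stdbool.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE 1024ULL

/* Value recorded once at boot, as if turbo were available. */
static uint64_t demo_max_turbo_scale = SCHED_CAPACITY_SCALE;
/* Value currently consumed by frequency-invariant accounting. */
static uint64_t demo_max_scale = SCHED_CAPACITY_SCALE;

/* Swap between 1024 and the recorded turbo value. */
static void demo_set_max_scale(bool turbo_disabled)
{
	demo_max_scale = turbo_disabled ? SCHED_CAPACITY_SCALE
					: demo_max_turbo_scale;
}

/* Boot-time step: remember turbo_ratio * 1024 / base_ratio. */
static void demo_record_boot_scale(uint64_t turbo_ratio, uint64_t base_ratio)
{
	if (base_ratio != 0)	/* hypothetical guard */
		demo_max_turbo_scale =
			turbo_ratio * SCHED_CAPACITY_SCALE / base_ratio;
}
```

Calling demo_record_boot_scale(32, 24) and then toggling demo_set_max_scale() between true and false moves demo_max_scale between 1024 and 1365 without re-reading any ratios, which is the point of keeping the boot-time value in a separate variable.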

Reminder: "freq_max" isn't really the maximum frequency but the ratio
turbo_freq*1024/base_freq, which becomes 1024 when turbo is disabled. The name
could be misleading but makes explanations a lot simpler and is really
convenient in calculations.

Signed-off-by: Giovanni Gherdovich
---
 arch/x86/include/asm/topology.h |  5 +++++
 arch/x86/kernel/smpboot.c       | 14 ++++++++++----
 drivers/cpufreq/intel_pstate.c  |  1 +
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9b3aca463c8f..4834bde53b5a 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -214,6 +214,11 @@ static inline long arch_scale_freq_capacity(int cpu)
 extern void arch_scale_freq_tick(void);
 #define arch_scale_freq_tick arch_scale_freq_tick
 
+extern void set_arch_max_freq(bool turbo_disabled);
+#else
+static inline void set_arch_max_freq(bool turbo_disabled)
+{
+}
 #endif
 
 #endif /* _ASM_X86_TOPOLOGY_H */
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 8988177064be..f94aa1dfc778 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1807,8 +1807,15 @@ DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
 
 static DEFINE_PER_CPU(u64, arch_prev_aperf);
 static DEFINE_PER_CPU(u64, arch_prev_mperf);
+static u64 arch_max_turbo_freq = SCHED_CAPACITY_SCALE;
 static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
 
+void set_arch_max_freq(bool turbo_disabled)
+{
+	arch_max_freq = turbo_disabled ? SCHED_CAPACITY_SCALE :
+					arch_max_turbo_freq;
+}
+
 static bool turbo_disabled(void)
 {
 	u64 misc_en;
@@ -2004,9 +2011,6 @@ static void intel_set_cpu_max_freq(void)
 {
 	u64 ratio = 1, turbo_ratio = 1;
 
-	if (turbo_disabled())
-		return;
-
 	if (slv_set_cpu_max_freq(&ratio, &turbo_ratio))
 		goto set_value;
 
@@ -2022,7 +2026,9 @@ static void intel_set_cpu_max_freq(void)
 	core_set_cpu_max_freq(&ratio, &turbo_ratio);
 
 set_value:
-	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
+	arch_max_turbo_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE,
+					ratio);
+	set_arch_max_freq(turbo_disabled());
 	static_branch_enable(&arch_scale_freq_key);
 }
 
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 8ab31702cf6a..6bf50783bc24 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -922,6 +922,7 @@ static void intel_pstate_update_limits(unsigned int cpu)
 	 */
 	if (global.turbo_disabled_mf != global.turbo_disabled) {
 		global.turbo_disabled_mf = global.turbo_disabled;
+		set_arch_max_freq(global.turbo_disabled);
 		for_each_possible_cpu(cpu)
 			intel_pstate_update_max_freq(cpu);
 	} else {