From patchwork Sat Aug 13 15:59:01 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Doug Smythies X-Patchwork-Id: 9278443 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id EFAAA60231 for ; Sat, 13 Aug 2016 15:59:10 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D63D928A62 for ; Sat, 13 Aug 2016 15:59:10 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id CA9AA28A71; Sat, 13 Aug 2016 15:59:10 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.8 required=2.0 tests=BAYES_00,DKIM_SIGNED, RCVD_IN_DNSWL_HI,T_DKIM_INVALID autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id AD06628A62 for ; Sat, 13 Aug 2016 15:59:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752653AbcHMP7I (ORCPT ); Sat, 13 Aug 2016 11:59:08 -0400 Received: from cmta18.telus.net ([209.171.16.91]:60453 "EHLO cmta18.telus.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752630AbcHMP7H (ORCPT ); Sat, 13 Aug 2016 11:59:07 -0400 Received: from dougxps ([173.180.45.4]) by cmsmtp with SMTP id YbKvbvlpbCk2EYbKwb8lqm; Sat, 13 Aug 2016 09:59:05 -0600 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=telus.net; s=neo; t=1471103945; bh=dFGDPhkccvuEcMY5mpDSzLHl09+u/Bw/HvuKzthD1LY=; h=From:To:Cc:References:In-Reply-To:Subject:Date; b=lCqa6mkannDplPFUoT96OU2FumV1GgsGHP8diTc/wM9idtmdb47kFe/V+teTgplWc pP1uCgP/qN5TCKbXu966475rxAotYrP6tm3QvI+VvJ1VUs1QfWmc6vGQyrV1iEdLuv ghN7z2QbubrwKAI6OUWmMHsznGZl3MbChWce4RDi0JQ15QaW7qKseYU5DHXp1eVMXn 3bT0dHQogJVfFmr7YbTWTZEfp/oDyDys3c+ke66E33naly+hhZYrCxXIpaRczH/cA2 YOXGVBSCXP/77vAMwDfXwHeUHQTFCg5GS+mXxAuO4yHCvucP3DNWHza+yesJgyVZ9P +XniP7PlbeS+w== X-Authority-Analysis: v=2.2 cv=RoIqFGuK c=1 sm=1 tr=0 a=zJWegnE7BH9C0Gl4FFgQyA==:117 a=zJWegnE7BH9C0Gl4FFgQyA==:17 a=Pyq9K9CWowscuQLKlpiwfMBGOR0=:19 a=IkcTkHD0fZMA:10 a=VwQbUJbxAAAA:8 a=Eq8KhUzsnO9lAcgWFGMA:9 a=7Zwj6sZBwVKJAoWSPKxL6X1jA+E=:19 a=hMeN2HAWo8vGzdXS:21 a=djB0HunJChYtv2to:21 a=AjGcO6oz07-iQ99wixmX:22 From: "Doug Smythies" To: "'Rafael J. Wysocki'" Cc: "'Peter Zijlstra'" , "'Srinivas Pandruvada'" , "'Viresh Kumar'" , "'Linux Kernel Mailing List'" , "'Steve Muckle'" , "'Juri Lelli'" , "'Ingo Molnar'" , "'Linux PM list'" References: <3752826.3sXAQIvcIA@vostro.rjw.lan> <1572483.RZjvRFdxPx@vostro.rjw.lan> <001b01d1ee1c$e6a6a170$b3f3e450$@net> <10156236.RI1knH5Wfs@vostro.rjw.lan> In-Reply-To: <10156236.RI1knH5Wfs@vostro.rjw.lan> Subject: RE: [RFC][PATCH 7/7] cpufreq: intel_pstate: Change P-state selection algorithm for Core Date: Sat, 13 Aug 2016 08:59:01 -0700 Message-ID: <000301d1f57b$9e1fed10$da5fc730$@net> MIME-Version: 1.0 X-Mailer: Microsoft Office Outlook 12.0 Thread-Index: AdHwK5hOBzUn4K4SS1O5PfjvcvMAewFRJPoQ Content-Language: en-ca X-CMAE-Envelope: MS4wfO5Fyj8w/w7Npw6YH1yL/xWpbFEixFR4QtRMu3CbNIy++VmPF1kKa+VR6y/8uEEoIq9yHUwcj5JfobeUVfkEiAfYIJoreO2kgE7VWLV27Fl1mb3YfB5t AWukvafuPzw5f8QftYDYz9Ut3QY0pdmqmbKJcPKbrti2jCRt8C4dDMwUYILihTXV7Y571xj9+9qvcCvfB1MRvnfu2sFC5VOH9swh2QnnVs4VBsX61pIILc1u JQDc162z19sbgX6GKQZVHWRdUxPlRk6p6QD2y+HO1h4j1jrNDMhUrgsJhBjrA7+n7ApsKOUx+qzWE30vzxoIDSSIjHn0K5nWEYyDYkhvY2AKl9NYqNSeEVuD jKPvN9yJdl8JLhsgTiM47Km6AFSUMHfd4osa8dU1ev0hGNve5KGR1Ah4WTqL7StFH/j2cmhgQ3nJmyb/vBbZd4BPQEn/oQ== Sender: linux-pm-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pm@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP On 2016.08.05 17:02 Rafael J. Wysocki wrote: >> On 2016.08.03 21:19 Doug Smythies wrote: >>> On 2016.07.31 16:49 Rafael J. Wysocki wrote: >>> >>> The PID-base P-state selection algorithm used by intel_pstate for >>> Core processors is based on very weak foundations. >> >> ...[cut]... >> >>> +static inline int32_t get_target_pstate_default(struct cpudata *cpu) >>> +{ >>> + struct sample *sample = &cpu->sample; >>> + int32_t busy_frac; >>> + int pstate; >>> + >>> + busy_frac = div_fp(sample->mperf, sample->tsc); >>> + sample->busy_scaled = busy_frac * 100; >>> + >>> + if (busy_frac < cpu->iowait_boost) >>> + busy_frac = cpu->iowait_boost; >>> + >>> + cpu->iowait_boost >>= 1; >>> + >>> + pstate = cpu->pstate.turbo_pstate; >>> + return fp_toint((pstate + (pstate >> 2)) * busy_frac); >>> +} >>> + >> My previous replies (and see below) have suggested that some filtering is needed on the target pstate, otherwise, and dependant on the type of workload, it tends to oscillate. I added the IIR (Infinite Impulse Response) filter that I have suggested in the past: The filter introduces a trade-off between step function load response time and the tendency for the target pstate to oscillate. ...[cut]... >> Several tests were done with this patch set. >> The patch set would not apply to kernel 4.7, but did apply fine to a 4.7+ kernel >> (I did as of 7a66ecf) from a few days ago. >> >> Test 1: Phoronix ffmpeg test (less time is better): >> Reason: Because it suffers from rotating amongst CPUs in an odd way, challenging for CPU frequency scaling drivers. >> This test tends to be an indicator of potential troubles with some games. >> Criteria: (Dirk Brandewie): Must match or better acpi_cpufreq - ondemand. >> With patch set: 15.8 Seconds average and 24.51 package watts. >> Without patch set: 11.61 Seconds average and 27.59 watts. >> Conclusion: Significant reduction in performance with proposed patch set. With the filter this become even worse at ~18 seconds. I used to fix this by moving the response curve to the left. I have not tested this: + unfiltered_target = (pstate + (pstate >> 1)) * busy_frac; which moves the response curve left a little, yet. I will test it. ...[cut]... >> Test 9: Doug's_specpower simulator (20% load): >> Time is fixed, less energy is better. >> Reason: During the long >> "[intel-pstate driver regression] processor frequency very high even if in idle" >> and subsequent https://bugzilla.kernel.org/show_bug.cgi?id=115771 >> discussion / thread(s), some sort of test was needed to try to mimic what Srinivas >> was getting on his fancy SpecPower test platform. So far at least, this test does that. >> Only the 20% load case was created, because that was the biggest problem case back then. >> With patch set: 4 tests at an average of 7197 Joules per test, relatively high CPU frequencies. >> Without the patch set: 4 tests at an average of 5956 Joules per test, relatively low CPU frequencies. >> Conclusion: 21% energy regression with the patch set. >> Note: Newer processors might do better than my older i7-2600K. Patch set + above and IIR gain = 10%: 5670 Joules. Conclusion: Energy regression eliminated. Other gains: gain = 5%: 5342 Joules; Busy MHz: 2172 gain = 10%: 5670 Joules; Busy MHz: 2285 gain = 20%: 6126 Joules; Busy MHz: 2560 gain = 30%: 6426 Joules; Busy MHz: 2739 gain = 40%: 6674 Joules; Busy MHz: 2912 gain = 70%: 7109 Joules; Busy MHz: 3199 locked at minimum pstate (reference): 4653 Joules; Busy MHz: 1600 Performance mode (reference): 7808 Joules; Busy MHz: 3647 >> Test 10: measure the frequency response curve, fixed work packet method, >> 75 hertz work / sleep frequency (all CPU, no IOWAIT): >> Reason: To compare to some older data and observe overall. >> png graph NOT attached. >> Conclusions: Tends to oscillate, suggesting some sort of damping is needed. >> However, any filtering tends to increase the step function load rise time >> (see test 11 below, I think there is some wiggle room here). >> See also graph which has: with and without patch set; performance mode (for reference); >> Philippe Longepe's cpu_load method also with setpoint 40 (for reference); one of my previous >> attempts at a load related patch set from quite some time ago (for reference). As expected, the filter damps out the oscillation. New graphs will be sent to Rafael and Srinivas off-list. >> >> Test 11: Look at the step function load response. From no load to 100% on one CPU (CPU load only, no IO). >> While there is a graph, it is not attached: >> Conclusion: The step function response is greatly improved (virtually one sample time max). >> It would probably be O.K. to slow it down a little with a filter so as to reduce the >> tendency to oscillate under periodic load conditions (to a point, at least. A low enough frequency will >> always oscillate) (see the graph for test10). I haven't done this test yet, but from previous work, a gain setting of 10 to 15% gives a load step function response time similar to the current PID based filter. The other tests gave similar results with or without the filter. ... Doug --- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c index c43ef55..262ec5f 100644 --- a/drivers/cpufreq/intel_pstate.c +++ b/drivers/cpufreq/intel_pstate.c @@ -98,6 +98,7 @@ static inline u64 div_ext_fp(u64 x, u64 y) * @tsc: Difference of time stamp counter between last and * current sample * @time: Current time from scheduler + * @target: target pstate filtered. * * This structure is used in the cpudata structure to store performance sample * data for choosing next P State. @@ -108,6 +109,7 @@ struct sample { u64 aperf; u64 mperf; u64 tsc; + u64 target; u64 time; }; @@ -1168,6 +1170,7 @@ static void intel_pstate_get_cpu_pstates(struct cpudata *cpu) pstate_funcs.get_vid(cpu); intel_pstate_set_min_pstate(cpu); + cpu->sample.target = int_tofp(cpu->pstate.min_pstate); } static inline void intel_pstate_calc_avg_perf(struct cpudata *cpu) @@ -1301,8 +1304,10 @@ static inline int32_t get_target_pstate_use_performance(struct cpudata *cpu) static inline int32_t get_target_pstate_default(struct cpudata *cpu) { struct sample *sample = &cpu->sample; + int64_t scaled_gain, unfiltered_target; int32_t busy_frac; int pstate; + u64 duration_ns; busy_frac = div_fp(sample->mperf, sample->tsc); sample->busy_scaled = busy_frac * 100; @@ -1313,7 +1318,74 @@ static inline int32_t get_target_pstate_default(struct cpudata *cpu) cpu->iowait_boost >>= 1; pstate = cpu->pstate.turbo_pstate; - return fp_toint((pstate + (pstate >> 2)) * busy_frac); + /* To Do: I think the above should be: + * + * if (limits.no_turbo || limits.turbo_disabled) + * pstate = cpu->pstate.max_pstate; + * else + * pstate = cpu->pstate.turbo_pstate; + * + * figure it out. + * + * no clamps. Pre-filter clamping was needed in past implementations. + * To Do: Is any pre-filter clamping needed here? */ + + unfiltered_target = (pstate + (pstate >> 2)) * busy_frac; + + /* + * Idle check. + * We have a deferrable timer. Very long durations can be + * either due to long idle (C0 time near 0), + * or due to short idle times that spanned jiffy boundaries + * (C0 time not near zero). + * + * To Do: As of the utilization stuff, I do not think the + * spanning jiffy boundaries thing is true anymore. + * Check, and fix the comment. + * + * The very long durations are 0.4 seconds or more. + * Either way, a very long duration will effectively flush + * the IIR filter, otherwise falling edge load response times + * can be on the order of tens of seconds, because this driver + * runs very rarely. Furthermore, for higher periodic loads that + * just so happen to not be in the C0 state on jiffy boundaries, + * the long ago history should be forgotten. + * For cases of durations that are a few times the set sample + * period, increase the IIR filter gain so as to weight + * the current sample more appropriately. + * + * To Do: sample_time should be forced to be accurate. For + * example if the kernel is a 250 Hz kernel, then a + * sample_rate_ms of 10 should result in a sample_time of 12. + * + * To Do: Check that the IO Boost case is not filtered too much. + * It might be that a filter by-pass is needed for the boost case. + * However, the existing gain = f(duration) might be good enough. + */ + + duration_ns = cpu->sample.time - cpu->last_sample_time; + + scaled_gain = div_u64(int_tofp(duration_ns) * + int_tofp(pid_params.p_gain_pct), int_tofp(pid_params.sample_rate_ns)); + if (scaled_gain > int_tofp(100)) + scaled_gain = int_tofp(100); + /* + * This code should not be required, + * but short duration times have been observed + * To Do: Check if this code is actually still needed. I don't think so. + */ + if (scaled_gain < int_tofp(pid_params.p_gain_pct)) + scaled_gain = int_tofp(pid_params.p_gain_pct); + + /* + * Bandwidth limit the output. For now, re-task p_gain_pct for this purpose. + * Use a smple IIR (Infinite Impulse Response) filter. + */ + cpu->sample.target = div_u64((int_tofp(100) - scaled_gain) * + cpu->sample.target + scaled_gain * + unfiltered_target, int_tofp(100)); + + return fp_toint(cpu->sample.target + (1 << (FRAC_BITS-1))); } static inline void intel_pstate_update_pstate(struct cpudata *cpu, int pstate) @@ -1579,6 +1651,7 @@ static void intel_pstate_stop_cpu(struct cpufreq_policy *policy) return; intel_pstate_set_min_pstate(cpu); + cpu->sample.target = int_tofp(cpu->pstate.min_pstate); } static int intel_pstate_cpu_init(struct cpufreq_policy *policy)