diff mbox series

Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX"

Message ID 1563431200-3042-1-git-send-email-dsmythies@telus.net (mailing list archive)
State Superseded, archived
Headers show
Series Revert "cpufreq: schedutil: Don't set next_freq to UINT_MAX" | expand

Commit Message

Doug Smythies July 18, 2019, 6:26 a.m. UTC
This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.

The commit caused a regression whereby reducing the maximum
CPU clock frequency is ineffective while busy, and the CPU
clock remains unchanged. Once the system has experienced
some idle time, the new limit is implemented.
A consequence is that any thermal throttling monitoring
and control based on max freq limits fail to respond
in a timely manor, if at all, to a thermal temperature
trip on a busy system.

Please review the original e-mail threads, as this
attempt at a revertion would then re-open some of
those questions. One possible alterative to this
revertion would be to revert B7EAF1AAB9F8:

  Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  Date:   Wed Mar 22 00:08:50 2017 +0100

      cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely

References:

https://marc.info/?t=152576189300002&r=1&w=2
https://marc.info/?t=152586219000002&r=1&w=2
https://marc.info/?t=152607169800004&r=1&w=2

https://marc.info/?t=149013813900002&r=1&w=2
---
Test Data:

Typical commands:
watch -n 5 cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
watch -n 5 cat /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq
sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
watch -n 5 cat /sys/devices/system/cpu/cpufreq/policy*/scaling_cur_freq

Driver: intel_cpufreq ; Governor: Schedutil
Kernel 5.2:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test).
- adjust downwards the maximum clock frequency
  echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
- observe the Bzy_MHz does not change on the turbostat output.
- reduce the system load, such that there is some idle time.
- observe the Bzy_MHz now does change.
- increase the load again.
- observe the CPU frequency is now pinned to the new limit.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use pstate control first.
  On my system a lower thermal trip point it used,
  because it doesn't get hot enough.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
  adjusting the max_perf_pct but nothing happens in response.
- Once the lowest limit of max_perf_pct is reached, observe
  thermald fall through to intel-powerclamp, and kidle inject
  starts to be used. Therefore the "busy" condition (see the code)
  is not longer met, and the CPU frequency suddenly goes to minimum.

Test 3: Limit thermald to using max_perf_pct only:
- Adjust the cpu cooling device order to just include the one
  option, thereby excluding help from the fall through partner
  experienced in test 2.
- Observations here tracked test 2 until the lower limit
  of max_perf_pct is reached. The CPUs remain at maximum
  frequency until the load is removed. Processor
  temperature continues to climb.

Driver: intel_cpufreq ; Governor: Schedutil
Kernel 5.2 + this revert patch:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency
  echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
- observe the Bzy_MHz changes on the turbostat output.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use pstate control first.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
  adjusting the max_perf_pct and observe the actual frequency
  adjust in response.

Test 3: Limit thermald to using max_perf_pct only:
- Adjust the cpu cooling device order to just include the one
  option, adjusting max_perf_pct.
- Observations here tracked test 2. (test 2 never fell
  through to intel_powerclamp and kidle inject.

Driver: acpi-cpufreq ; Governor: Schedutil
Kernel 5.2:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency
  for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq; do echo "2200000" > $file; done
- observe the Bzy_MHz does not change on the turbostat output.
- reduce the system load, such that there is some idle time.
- observe the Bzy_MHz now does change.
- increase the load again.
- observe the CPU frequency is now pinned to the new limit.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use intel_powerclamp control first.
- enable and start thermald
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
  adjusting the kidle_inj amounts.
- If the maximum kidle_inj is not enough, then oberve thermald
  start to limit scaling_max_freq, however the actual frequency
  response is immediate. Why? Because of the kidle inject, the
  system is not longer fully "busy", and so that condition is not met.

Test 3: Limit thermald to using scaling_max_freq only:
- Adjust the cpu cooling device order to just include the one
  option, thereby excluding help from previous partner
  experienced in test 2.
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
  reducing scaling_max_freq.
- observe the Bzy_MHz does not change on the turbostat output.
- Eventually, for all CPUs, scaling_max_freq is set to the minimum.
- Processor temperature continues to climb.
- The CPUs remain at maximum frequency until the load is removed.

Driver: acpi-cpufreq ; Governor: Schedutil
Kernel 5.2 + this revert patch:

Test 1: No thermal throttling involved:
- load the system (some sort of CPU stress test)
- adjust downwards the maximum clock frequency
  for file in /sys/devices/system/cpu/cpufreq/policy*/scaling_max_freq; do echo "2200000" > $file; done
- observe the Bzy_MHz changes on the turbostat output.

Test 2: Use thermald to try Thermal throttling:
- By default thermald will use intel_powerclamp control first.
- enable and start thermald
- This test already passes using kernel 5.2, and
  proper operation was observed with this revert.

Test 3: Limit thermald to using scaling_max_freq only:
- Adjust the cpu cooling device order to just include the one
  option, thereby excluding help from previous partner
  experienced in test 2.
- load the system (some sort of CPU stress test)
- Once the thermal trip point is reached, observe thermald
  reducing scaling_max_freq.
- observe the Bzy_MHz now does change on the turbostat output
  and the processor package temperature is now controlled.
---
 kernel/sched/cpufreq_schedutil.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

Comments

Viresh Kumar July 18, 2019, 10:28 a.m. UTC | #1
On 17-07-19, 23:26, Doug Smythies wrote:
> This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.
> 
> The commit caused a regression whereby reducing the maximum
> CPU clock frequency is ineffective while busy, and the CPU
> clock remains unchanged. Once the system has experienced
> some idle time, the new limit is implemented.

Can you explain why this patch caused that issue ? I am sorry but I couldn't
understand it from your email. How are we trying to reduce the frequency? Is
clk_set_rate() getting called with that finally and not working ?

> A consequence is that any thermal throttling monitoring
> and control based on max freq limits fail to respond
> in a timely manor, if at all, to a thermal temperature
> trip on a busy system.
Doug Smythies July 18, 2019, 3:46 p.m. UTC | #2
On 2019.07.18 03:28 Viresh Kumar wrote:
> On 17-07-19, 23:26, Doug Smythies wrote:
>> This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.
>> 
>> The commit caused a regression whereby reducing the maximum
>> CPU clock frequency is ineffective while busy, and the CPU
>> clock remains unchanged. Once the system has experienced
>> some idle time, the new limit is implemented.
>
> Can you explain why this patch caused that issue ? I am sorry but I couldn't
> understand it from your email. How are we trying to reduce the frequency? Is
> clk_set_rate() getting called with that finally and not working ?

The patch eliminates the "flag", UNIT_MAX, and it's related comment,
that was used to indicate if it was a limit change that causes the
subsequent execution of sugov_update_single.

As for "clk_set_rate()", I don't know. I bisected the kernel
and got here. I also looked at reverting B7EAF1AAB9F8, but
it seemed to have some purpose which I don't know of more
important than this one or not.

I'll reinsert the first test below with more detail:

On 2019.07.17 23:27 Doug wrote:

> Driver: intel_cpufreq ; Governor: Schedutil
> Kernel 5.2:

$ uname -a
Linux s15 5.2.0-stock #608 SMP PREEMPT Mon Jul 8 08:21:56 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux

doug@s15:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq
intel_cpufreq
doug@s15:~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
schedutil
schedutil
schedutil
schedutil
schedutil
schedutil
schedutil
schedutil

> Test 1: No thermal throttling involved (I only ever use thermal, if anything):

doug@s15:~$ sudo systemctl status thermald.service
. thermald.service - Thermal Daemon Service
   Loaded: loaded (/lib/systemd/system/thermald.service; disabled; vendor preset: enabled)
   Active: inactive (dead)

> - load the system (some sort of CPU stress test).

Use whatever you want. I use my own, only because I always have and it prints something
every so often which gives an indication of actual clock speed.

doug@s15:~$ ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 &
[1] 3000
[2] 3001
[3] 3002
[4] 3003
[5] 3004
[6] 3005
[7] 3006
[8] 3007
[9] 3008

Watch everything with turobstat:

doug@s15:~$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
0.03    1600    269     25      3.69
0.07    1600    390     25      3.72
0.06    1600    368     25      3.72
0.06    1600    343     25      3.71
0.04    1600    269     26      3.70
74.72   3474    30230   46      43.95  <<< Add load
100.00  3500    40228   50      58.27
100.00  3500    40196   52      58.59
100.00  3500    40215   55      58.91
100.00  3500    40211   56      59.12
100.00  3500    40291   58      59.33  <<< Try to set 60% max
100.00  3500    40278   59      59.55
100.00  3500    40591   61      59.73
100.00  3500    40279   61      60.10
100.00  3500    40292   63      60.35
100.00  3500    40401   64      60.85
99.99   3500    40352   65      61.12
100.00  3500    40230   67      61.32
100.00  3500    40228   67      61.52
100.00  3500    40400   68      61.60  <<< Try to set 42% max
100.00  3500    40279   69      61.74
100.00  3500    40258   68      61.85
100.00  3500    40226   70      62.01
100.00  3500    40220   70      62.10
100.00  3500    40211   71      62.25

> - adjust downwards the maximum clock frequency
>   echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct

doug@s15:~/temp$ echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
60
doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
60
... wait awhile ...
doug@s15:~/temp$ echo 42 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
42
doug@s15:~/temp$ cat /sys/devices/system/cpu/intel_pstate/max_perf_pct
42

> - observe the Bzy_MHz does not change on the turbostat output.

See annotated turbostat output above.

> - reduce the system load, such that there is some idle time.
> - observe the Bzy_MHz now does change.
> - increase the load again.
> - observe the CPU frequency is now pinned to the new limit.

Go back to 60% first:

doug@s15:~/temp$ echo 60 | sudo tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
60
killall test1
... wait awhile ...
doug@s15:~$ ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 & ~/c/test1 &
[1] 3040
[2] 3041
[3] 3042
[4] 3043
[5] 3044
[6] 3045
[7] 3046
[8] 3047
[9] 3048

doug@s15:~$ sudo turbostat --quiet --Summary --show Busy%,Bzy_MHz,PkgTmp,PkgWatt,IRQ --interval 5
Busy%   Bzy_MHz IRQ     PkgTmp  PkgWatt
100.00  3500    40289   80      64.32
100.00  3500    40223   79      64.16
100.00  3500    40212   79      64.14
100.00  3500    40269   79      64.17
100.00  3500    41330   79      64.19  <<< Freq is above the 60% limit
14.10   3489    6260    55      12.63  <<< Load removed, now not "busy"
0.03    1600    263     53      4.13   <<< see sugov_update_single
23.22   2293    9582    65      10.72  <<< Load applied
100.00  2300    40229   66      35.09  <<< 3.8 GHz * 0.6 = 2.3 GHz
100.00  2300    40209   66      35.21
100.00  2300    40211   65      35.20
100.00  2300    40219   65      35.16
100.00  2300    40224   65      35.14

The only procedure changes when using the acpi-cpufreq CPU scaling driver and
the schedutil governor are for setting and monitoring the max freq settings,
as described in the test write-ups.

Hope this helps.

... Doug
Viresh Kumar July 22, 2019, 6:49 a.m. UTC | #3
On 18-07-19, 08:46, Doug Smythies wrote:
> On 2019.07.18 03:28 Viresh Kumar wrote:
> > On 17-07-19, 23:26, Doug Smythies wrote:
> >> This reverts commit ecd2884291261e3fddbc7651ee11a20d596bb514.
> >> 
> >> The commit caused a regression whereby reducing the maximum
> >> CPU clock frequency is ineffective while busy, and the CPU
> >> clock remains unchanged. Once the system has experienced
> >> some idle time, the new limit is implemented.
> >
> > Can you explain why this patch caused that issue ? I am sorry but I couldn't
> > understand it from your email. How are we trying to reduce the frequency? Is
> > clk_set_rate() getting called with that finally and not working ?
> 
> The patch eliminates the "flag", UNIT_MAX, and it's related comment,
> that was used to indicate if it was a limit change that causes the
> subsequent execution of sugov_update_single.

I think I may have understood the root cause. Please try the patch I just sent
as reply to this thread. Thanks.
diff mbox series

Patch

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 962cf34..ea4e04b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -89,8 +89,15 @@  static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 	    !cpufreq_this_cpu_can_update(sg_policy->policy))
 		return false;
 
-	if (unlikely(sg_policy->need_freq_update))
+	if (unlikely(sg_policy->need_freq_update)) {
+		sg_policy->need_freq_update = false;
+		/*
+		 * This happens when limits change, so forget the previous
+		 * next_freq value and force an update.
+		 */
+		sg_policy->next_freq = UINT_MAX;
 		return true;
+	}
 
 	delta_ns = time - sg_policy->last_freq_update_time;
 
@@ -168,10 +175,8 @@  static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 
 	freq = map_util_freq(util, freq, max);
 
-	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
+	if (freq == sg_policy->cached_raw_freq && sg_policy->next_freq != UINT_MAX)
 		return sg_policy->next_freq;
-
-	sg_policy->need_freq_update = false;
 	sg_policy->cached_raw_freq = freq;
 	return cpufreq_driver_resolve_freq(policy, freq);
 }
@@ -457,7 +462,8 @@  static void sugov_update_single(struct update_util_data *hook, u64 time,
 	 * Do not reduce the frequency if the CPU has not been idle
 	 * recently, as the reduction is likely to be premature then.
 	 */
-	if (busy && next_f < sg_policy->next_freq) {
+	if (busy && next_f < sg_policy->next_freq &&
+	    sg_policy->next_freq != UINT_MAX) {
 		next_f = sg_policy->next_freq;
 
 		/* Reset cached freq as next_freq has changed */
@@ -819,7 +825,7 @@  static int sugov_start(struct cpufreq_policy *policy)
 
 	sg_policy->freq_update_delay_ns	= sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
 	sg_policy->last_freq_update_time	= 0;
-	sg_policy->next_freq			= 0;
+	sg_policy->next_freq			= UINT_MAX;
 	sg_policy->work_in_progress		= false;
 	sg_policy->need_freq_update		= false;
 	sg_policy->cached_raw_freq		= 0;