diff mbox series

cpufreq: exclude boost frequencies from valid count if not enabled

Message ID 20210217000013.4063289-1-thara.gopinath@linaro.org (mailing list archive)
State Superseded, archived
Headers show
Series cpufreq: exclude boost frequencies from valid count if not enabled | expand

Commit Message

Thara Gopinath Feb. 17, 2021, midnight UTC
This is a fix for a regression observed on db845 platforms with 5.7-rc11
kernel.  On these platforms running stress tests with 5.11-rc7 kernel
causes big cpus to overheat and ultimately shutdown the system due to
hitting critical temperature (thermal throttling does not happen and
cur_state of cpufreq cooling device for big cpus remain stuck at 0 or max
frequency).

This platform has boost opp defined for big cpus but boost mode itself is
disabled in the cpufreq driver. Hence the initial max frequency request
from cpufreq cooling device(cur_state) for big cpus is for boost
frequency(2803200) where as initial max frequency request from cpufreq
driver itself is for the highest non boost frequency (2649600). qos
framework collates these two requests and puts the max frequency of big
cpus to 2649600 which the thermal framework is unaware of. Now during an
over heat event, with step-wise policy governor, thermal framework tries to
throttle the cpu and places a restriction on max frequency of the cpu to
cur_state - 1 which in this case 2649600. qos framework in turn tells the
cpufreq cooling device that max frequency of the cpu is already at 2649600
and the cooling device driver returns doing nothing(cur_state of the
cooling device remains unchanged). Thus thermal remains stuck in a loop and
never manages to actually throttle the cpu frequency. This ultimately leads
to system shutdown in case of a thermal overheat event on big cpus.

There are multiple possible fixes for this issue. Fundamentally,it is wrong
for cpufreq driver and cpufreq cooling device driver to show different
maximum possible state/frequency for a cpu. Hence fix this issue by
ensuring that the max state of cpufreq cooling device is in sync with the
maximum frequency of the cpu in cpufreq driver.
cpufreq_table_count_valid_entries is used to retrieve max level/max
frequency of a cpu by cpufreq_cooling_device during initialization. Add
check in this api to ignore boost frequencies if boost mode is not enabled
thus keeping the max state of cpufreq cooling device in sync with the
maximum frequency of the cpu in cpufreq driver.
cpufreq_frequency_table_cpuinfo that calculates the maximum frequency of a
cpu for cpufreq driver already has such a check in place.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
---
 include/linux/cpufreq.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Viresh Kumar Feb. 17, 2021, 5:50 a.m. UTC | #1
Hi Thara,

On 16-02-21, 19:00, Thara Gopinath wrote:
> This is a fix for a regression observed on db845 platforms with 5.7-rc11
> kernel.  On these platforms running stress tests with 5.11-rc7 kernel
> causes big cpus to overheat and ultimately shutdown the system due to
> hitting critical temperature (thermal throttling does not happen and
> cur_state of cpufreq cooling device for big cpus remain stuck at 0 or max
> frequency).
> 
> This platform has boost opp defined for big cpus but boost mode itself is
> disabled in the cpufreq driver. Hence the initial max frequency request
> from cpufreq cooling device(cur_state) for big cpus is for boost
> frequency(2803200) where as initial max frequency request from cpufreq
> driver itself is for the highest non boost frequency (2649600).

Okay.

> qos
> framework collates these two requests and puts the max frequency of big
> cpus to 2649600 which the thermal framework is unaware of.

It doesn't need to be aware of that. It sets its max frequency and other
frameworks can put their own requests and the lowest one wins. In this case the
other constraint came from cpufreq-core, which is fine.

> Now during an
> over heat event, with step-wise policy governor, thermal framework tries to
> throttle the cpu and places a restriction on max frequency of the cpu to
> cur_state - 1

Actually it is cur_state + 1 as the values are inversed here, cooling state 0
refers to highest frequency :)

> which in this case 2649600. qos framework in turn tells the
> cpufreq cooling device that max frequency of the cpu is already at 2649600
> and the cooling device driver returns doing nothing(cur_state of the
> cooling device remains unchanged).

And that's where the bug lies, I have sent proper fix for that now.

> Thus thermal remains stuck in a loop and
> never manages to actually throttle the cpu frequency. This ultimately leads
> to system shutdown in case of a thermal overheat event on big cpus.
 
> There are multiple possible fixes for this issue. Fundamentally,it is wrong
> for cpufreq driver and cpufreq cooling device driver to show different
> maximum possible state/frequency for a cpu.

Not actually, cpufreq core changes the max supported frequency at runtime based
on the availability of boost frequencies.

cpufreq_table_count_valid_entries() is used at different places and it is
implemented correctly.
Thara Gopinath Feb. 17, 2021, 3:32 p.m. UTC | #2
On 2/17/21 12:50 AM, Viresh Kumar wrote:
> Hi Thara,
> 
> On 16-02-21, 19:00, Thara Gopinath wrote:
>> This is a fix for a regression observed on db845 platforms with 5.7-rc11
>> kernel.  On these platforms running stress tests with 5.11-rc7 kernel
>> causes big cpus to overheat and ultimately shutdown the system due to
>> hitting critical temperature (thermal throttling does not happen and
>> cur_state of cpufreq cooling device for big cpus remain stuck at 0 or max
>> frequency).
>>
>> This platform has boost opp defined for big cpus but boost mode itself is
>> disabled in the cpufreq driver. Hence the initial max frequency request
>> from cpufreq cooling device(cur_state) for big cpus is for boost
>> frequency(2803200) where as initial max frequency request from cpufreq
>> driver itself is for the highest non boost frequency (2649600).
> 
> Okay.
> 
>> qos
>> framework collates these two requests and puts the max frequency of big
>> cpus to 2649600 which the thermal framework is unaware of.
> 
> It doesn't need to be aware of that. It sets its max frequency and other
> frameworks can put their own requests and the lowest one wins. In this case the
> other constraint came from cpufreq-core, which is fine.

Yes. the qos behavior is correct here.

> 
>> Now during an
>> over heat event, with step-wise policy governor, thermal framework tries to
>> throttle the cpu and places a restriction on max frequency of the cpu to
>> cur_state - 1
> 
> Actually it is cur_state + 1 as the values are inversed here, cooling state 0
> refers to highest frequency :)

yes. it does indeed!

> 
>> which in this case 2649600. qos framework in turn tells the
>> cpufreq cooling device that max frequency of the cpu is already at 2649600
>> and the cooling device driver returns doing nothing(cur_state of the
>> cooling device remains unchanged).
> 
> And that's where the bug lies, I have sent proper fix for that now.

Like I mention below there are multiple possible fixes for this issue!
More on mismatch of frequencies below.
> 
>> Thus thermal remains stuck in a loop and
>> never manages to actually throttle the cpu frequency. This ultimately leads
>> to system shutdown in case of a thermal overheat event on big cpus.
>   
>> There are multiple possible fixes for this issue. Fundamentally,it is wrong
>> for cpufreq driver and cpufreq cooling device driver to show different
>> maximum possible state/frequency for a cpu.
> 
> Not actually, cpufreq core changes the max supported frequency at runtime based
> on the availability of boost frequencies.

First of all, I am still unable to find this setting in the sysfs space.
Irrespective the ideal behavior here will be to change the cpufreq 
cooling dev max state when this happens. I say this for two reasons
1. The cooling device max state will reflect the correct highest 
frequency as reported by cpufreq core. These are interfaces exposed to
user space and they should not be showing two different things.
2. More importantly, thermal will not waste valuable cycles attempting 
to throttle down from an non-existing high frequency. In the case of 
sdm845 we have only one boost opp in the opp table and hence the first 
time thermal tries to throttle via the cpufreq cooling device(with the 
step policy governor), it will return back saying that the state is 
already achieved and then will retry again because overheating has not 
stopped. But let us a platform has 5 such opps in the table and boost 
mode not enabled. cpufreq cooling device will have to attempt 5 times 
before any actual cooling action happens.

> 
> cpufreq_table_count_valid_entries() is used at different places and it is
> implemented correctly.

It is used in one other place which is for statistics count. Boost 
statistics need not be considered if boost mode is not enabled. And like 
I mentioned before as in the case of cpufreq cooling device correct 
behavior will be to reflect this as and when boost is enabled. But then 
again for statistics purpose it is not much of an issue if the entry 
itself is present with the count showing 0 if boost modes are not 
enabled. In this case, we should have another api or cpufreq cooling 
device not use cpufreq_table_count_valid_entries to get the max state.

>
Viresh Kumar Feb. 18, 2021, 8:48 a.m. UTC | #3
On 17-02-21, 10:32, Thara Gopinath wrote:
> First of all, I am still unable to find this setting in the sysfs space.

The driver needs to call cpufreq_enable_boost_support() for that.

> Irrespective the ideal behavior here will be to change the cpufreq cooling
> dev max state when this happens.

Hmm.. recreating it every time boost frequency is enabled is like
inviting trouble and it will be tricky. Maybe it can be done, I don't
know.:)
Thara Gopinath Feb. 18, 2021, 3:03 p.m. UTC | #4
On 2/18/21 3:48 AM, Viresh Kumar wrote:
> On 17-02-21, 10:32, Thara Gopinath wrote:
>> First of all, I am still unable to find this setting in the sysfs space.
> 
> The driver needs to call cpufreq_enable_boost_support() for that.

Ok. that makes sense.

> 
>> Irrespective the ideal behavior here will be to change the cpufreq cooling
>> dev max state when this happens.
> 
> Hmm.. recreating it every time boost frequency is enabled is like
> inviting trouble and it will be tricky. Maybe it can be done, I don't
> know.:)

Scheduling a notifier for max frequency change from the qos framework 
should do the work, right?

>
Viresh Kumar Feb. 18, 2021, 3:45 p.m. UTC | #5
On 18-02-21, 10:03, Thara Gopinath wrote:
> Scheduling a notifier for max frequency change from the qos framework should
> do the work, right?

Not that, but we need to increase/decrease cooling states at run time,
create sysfs files/directories, etc. It isn't worth it.
diff mbox series

Patch

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 9c8b7437b6cd..fe52892e0812 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -1006,8 +1006,11 @@  static inline int cpufreq_table_count_valid_entries(const struct cpufreq_policy
 	if (unlikely(!policy->freq_table))
 		return 0;
 
-	cpufreq_for_each_valid_entry(pos, policy->freq_table)
+	cpufreq_for_each_valid_entry(pos, policy->freq_table) {
+		if (!cpufreq_boost_enabled() && (pos->flags & CPUFREQ_BOOST_FREQ))
+			continue;
 		count++;
+	}
 
 	return count;
 }