diff mbox series

[1/2] thermal: power_allocator: maintain the device statistics from going stale

Message ID 20210331163352.32416-2-lukasz.luba@arm.com (mailing list archive)
State New
Delegated to: Daniel Lezcano
Headers show
Series Improve IPA mechanisms in low temperature state | expand

Commit Message

Lukasz Luba March 31, 2021, 4:33 p.m. UTC
When the temperature is below the first activation trip point the cooling
devices are not checked, so they cannot maintain fresh statistics. It
leads into the situation, when temperature crosses first trip point, the
statistics are stale and show state for very long period. This has impact
on IPA algorithm calculation and wrong decisions. Thus, check the cooling
devices even when the temperature is low, to refresh these statistics.

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
---
 drivers/thermal/gov_power_allocator.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Daniel Lezcano April 2, 2021, 3:54 p.m. UTC | #1
On 31/03/2021 18:33, Lukasz Luba wrote:
> When the temperature is below the first activation trip point the cooling
> devices are not checked, so they cannot maintain fresh statistics. It
> leads into the situation, when temperature crosses first trip point, the
> statistics are stale and show state for very long period. 

Can you elaborate the statistics you are referring to ?

I can understand the pid controller needs temperature but I don't
understand the statistics with the cooling device.


> This has impact
> on IPA algorithm calculation and wrong decisions. Thus, check the cooling
> devices even when the temperature is low, to refresh these statistics.
> 
> Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
> ---
>  drivers/thermal/gov_power_allocator.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/thermal/gov_power_allocator.c b/drivers/thermal/gov_power_allocator.c
> index 2802a0e13c88..0cbd10cab193 100644
> --- a/drivers/thermal/gov_power_allocator.c
> +++ b/drivers/thermal/gov_power_allocator.c
> @@ -575,15 +575,27 @@ static void allow_maximum_power(struct thermal_zone_device *tz)
>  {
>  	struct thermal_instance *instance;
>  	struct power_allocator_params *params = tz->governor_data;
> +	u32 req_power;
>  
>  	mutex_lock(&tz->lock);
>  	list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
> +		struct thermal_cooling_device *cdev = instance->cdev;
> +
>  		if ((instance->trip != params->trip_max_desired_temperature) ||
>  		    (!cdev_is_power_actor(instance->cdev)))
>  			continue;
>  
>  		instance->target = 0;
>  		mutex_lock(&instance->cdev->lock);
> +		/*
> +		 * Call for updating the cooling devices local stats and avoid
> +		 * periods of dozen of seconds when those have not been
> +		 * maintained. The long period would come into the first check
> +		 * when lower threshold is crossed. Thus, limit it to single
> +		 * one longer polling period.
> +		 */
> +		cdev->ops->get_requested_power(cdev, &req_power);
> +
>  		instance->cdev->updated = false;
>  		mutex_unlock(&instance->cdev->lock);
>  		thermal_cdev_update(instance->cdev);
>
Lukasz Luba April 6, 2021, 8:44 a.m. UTC | #2
On 4/2/21 4:54 PM, Daniel Lezcano wrote:
> On 31/03/2021 18:33, Lukasz Luba wrote:
>> When the temperature is below the first activation trip point the cooling
>> devices are not checked, so they cannot maintain fresh statistics. It
>> leads into the situation, when temperature crosses first trip point, the
>> statistics are stale and show state for very long period.
> 
> Can you elaborate the statistics you are referring to ?
> 
> I can understand the pid controller needs temperature but I don't
> understand the statistics with the cooling device.
> 
> 

The allocate_power() calls cooling_ops->get_requested_power(),
which is for CPUs cpufreq_get_requested_power() function.
In that cpufreq implementation for !SMP we still has the
issue of stale statistics. Viresh, when he introduced the usage
of sched_cpu_util(), he fixed that 'long non-meaningful period'
of the broken statistics and it can be found since v5.12-rc1.

The bug is still there for the !SMP. Look at the way how idle time
is calculated in get_load() [1]. It relies on 'idle_time->timestamp'
for calculating the period. But when this function is not called,
the value can be very far away in time, e.g. a few seconds back,
when the last allocate_power() was called.

The bug is there for both SMP and !SMP [2] for older kernels, which can
be used in Android or ChromeOS. I've been considering to put this simple
IPA fix also to some other kernels, because Viresh's change is more
a 'feature' and does not cover both platforms.

Regards,
Lukasz

[1] 
https://elixir.bootlin.com/linux/v5.12-rc5/source/drivers/thermal/cpufreq_cooling.c#L156
[2] 
https://elixir.bootlin.com/linux/v5.11.11/source/drivers/thermal/cpufreq_cooling.c#L143
Daniel Lezcano April 6, 2021, 10:16 a.m. UTC | #3
On 06/04/2021 10:44, Lukasz Luba wrote:
> 
> 
> On 4/2/21 4:54 PM, Daniel Lezcano wrote:
>> On 31/03/2021 18:33, Lukasz Luba wrote:
>>> When the temperature is below the first activation trip point the
>>> cooling
>>> devices are not checked, so they cannot maintain fresh statistics. It
>>> leads into the situation, when temperature crosses first trip point, the
>>> statistics are stale and show state for very long period.
>>
>> Can you elaborate the statistics you are referring to ?
>>
>> I can understand the pid controller needs temperature but I don't
>> understand the statistics with the cooling device.
>>
>>
> 
> The allocate_power() calls cooling_ops->get_requested_power(),
> which is for CPUs cpufreq_get_requested_power() function.
> In that cpufreq implementation for !SMP we still has the
> issue of stale statistics. Viresh, when he introduced the usage
> of sched_cpu_util(), he fixed that 'long non-meaningful period'
> of the broken statistics and it can be found since v5.12-rc1.
> 
> The bug is still there for the !SMP. Look at the way how idle time
> is calculated in get_load() [1]. It relies on 'idle_time->timestamp'
> for calculating the period. But when this function is not called,
> the value can be very far away in time, e.g. a few seconds back,
> when the last allocate_power() was called.
> 
> The bug is there for both SMP and !SMP [2] for older kernels, which can
> be used in Android or ChromeOS. I've been considering to put this simple
> IPA fix also to some other kernels, because Viresh's change is more
> a 'feature' and does not cover both platforms.

Ok, so IIUC, the temperature is needed as well as the power to do the
connection for the pid loop (temp vs power).

I'm trying to figure out how to delegate the mitigation switch on/off to
the core code instead of the governor (and kill tz->passive) but how
things are implemented for the IPA makes this very difficult.

AFAICT, this fix is not correct.

If the temperature is below the 'switch_on_temp' the passive is set to
zero and the throttle function is not called anymore (assuming it is
interrupt mode not polling mode), so the power number is not updated also.

My suggestion is to do the following:

	ret = tz->ops->get_trip_temp(tz, params->trip_switch_on,

					&switch_on_temp);
	if (ret)
		return ret;

	if ((tz->temperature < switch_on_temp)) {

		/* Nothing to do */
		if (tz->last_temperature < switch_on_temp)
			return 0;

		/* Switch off */		
                tz->passive = 0;
                reset_pid_controller(params);
                allow_maximum_power(tz);
                return 0;
        }

	/* Collect the power numbers */
	...


Two birds in a shot.
Lukasz Luba April 6, 2021, 10:39 a.m. UTC | #4
On 4/6/21 11:16 AM, Daniel Lezcano wrote:
> On 06/04/2021 10:44, Lukasz Luba wrote:
>>
>>
>> On 4/2/21 4:54 PM, Daniel Lezcano wrote:
>>> On 31/03/2021 18:33, Lukasz Luba wrote:
>>>> When the temperature is below the first activation trip point the
>>>> cooling
>>>> devices are not checked, so they cannot maintain fresh statistics. It
>>>> leads into the situation, when temperature crosses first trip point, the
>>>> statistics are stale and show state for very long period.
>>>
>>> Can you elaborate the statistics you are referring to ?
>>>
>>> I can understand the pid controller needs temperature but I don't
>>> understand the statistics with the cooling device.
>>>
>>>
>>
>> The allocate_power() calls cooling_ops->get_requested_power(),
>> which is for CPUs cpufreq_get_requested_power() function.
>> In that cpufreq implementation for !SMP we still has the
>> issue of stale statistics. Viresh, when he introduced the usage
>> of sched_cpu_util(), he fixed that 'long non-meaningful period'
>> of the broken statistics and it can be found since v5.12-rc1.
>>
>> The bug is still there for the !SMP. Look at the way how idle time
>> is calculated in get_load() [1]. It relies on 'idle_time->timestamp'
>> for calculating the period. But when this function is not called,
>> the value can be very far away in time, e.g. a few seconds back,
>> when the last allocate_power() was called.
>>
>> The bug is there for both SMP and !SMP [2] for older kernels, which can
>> be used in Android or ChromeOS. I've been considering to put this simple
>> IPA fix also to some other kernels, because Viresh's change is more
>> a 'feature' and does not cover both platforms.
> 
> Ok, so IIUC, the temperature is needed as well as the power to do the
> connection for the pid loop (temp vs power).
> 
> I'm trying to figure out how to delegate the mitigation switch on/off to
> the core code instead of the governor (and kill tz->passive) but how
> things are implemented for the IPA makes this very difficult.
> 
> AFAICT, this fix is not correct.
> 
> If the temperature is below the 'switch_on_temp' the passive is set to
> zero and the throttle function is not called anymore (assuming it is
> interrupt mode not polling mode), so the power number is not updated also.

IPA doesn't work well in asynchronous mode, because it needs this fixed
length for the period. I have been experimenting with tsens IRQs and
also both fixed polling + sporadic asynchronous IRQs, trying to fix it
and have 'predictable' IPA, but without a luck.
IPA needs synchronous polling events like we have for high temp e.g.
100ms and low temp e.g. 1000ms. The asynchronous events are root of
undesirable states (too low or to high) calculated and set for cooling
devices. It's also harder to escape these states with asynchronous
events. I don't recommend using IPA with asynchronous events from IRQs,
for now. It might change in future, though.

The patch 2/2 should calm down the unnecessary updates/notifications so
your request.
The longer polling, which we have for temperature below 'switch_on_temp'
(e.g. every 1sec) shouldn't harm the performance of the system, but
definitely makes IPA more predictable and stable.
Daniel Lezcano April 6, 2021, 11:24 a.m. UTC | #5
On 06/04/2021 12:39, Lukasz Luba wrote:
> 
> 
> On 4/6/21 11:16 AM, Daniel Lezcano wrote:
>> On 06/04/2021 10:44, Lukasz Luba wrote:
>>>
>>>
>>> On 4/2/21 4:54 PM, Daniel Lezcano wrote:
>>>> On 31/03/2021 18:33, Lukasz Luba wrote:
>>>>> When the temperature is below the first activation trip point the
>>>>> cooling
>>>>> devices are not checked, so they cannot maintain fresh statistics. It
>>>>> leads into the situation, when temperature crosses first trip
>>>>> point, the
>>>>> statistics are stale and show state for very long period.
>>>>
>>>> Can you elaborate the statistics you are referring to ?
>>>>
>>>> I can understand the pid controller needs temperature but I don't
>>>> understand the statistics with the cooling device.
>>>>
>>>>
>>>
>>> The allocate_power() calls cooling_ops->get_requested_power(),
>>> which is for CPUs cpufreq_get_requested_power() function.
>>> In that cpufreq implementation for !SMP we still has the
>>> issue of stale statistics. Viresh, when he introduced the usage
>>> of sched_cpu_util(), he fixed that 'long non-meaningful period'
>>> of the broken statistics and it can be found since v5.12-rc1.
>>>
>>> The bug is still there for the !SMP. Look at the way how idle time
>>> is calculated in get_load() [1]. It relies on 'idle_time->timestamp'
>>> for calculating the period. But when this function is not called,
>>> the value can be very far away in time, e.g. a few seconds back,
>>> when the last allocate_power() was called.
>>>
>>> The bug is there for both SMP and !SMP [2] for older kernels, which can
>>> be used in Android or ChromeOS. I've been considering to put this simple
>>> IPA fix also to some other kernels, because Viresh's change is more
>>> a 'feature' and does not cover both platforms.
>>
>> Ok, so IIUC, the temperature is needed as well as the power to do the
>> connection for the pid loop (temp vs power).
>>
>> I'm trying to figure out how to delegate the mitigation switch on/off to
>> the core code instead of the governor (and kill tz->passive) but how
>> things are implemented for the IPA makes this very difficult.
>>
>> AFAICT, this fix is not correct.
>>
>> If the temperature is below the 'switch_on_temp' the passive is set to
>> zero and the throttle function is not called anymore (assuming it is
>> interrupt mode not polling mode), so the power number is not updated
>> also.
> 
> IPA doesn't work well in asynchronous mode, because it needs this fixed
> length for the period. I have been experimenting with tsens IRQs and
> also both fixed polling + sporadic asynchronous IRQs, trying to fix it
> and have 'predictable' IPA, but without a luck.
> IPA needs synchronous polling events like we have for high temp e.g.
> 100ms and low temp e.g. 1000ms. The asynchronous events are root of
> undesirable states (too low or to high) calculated and set for cooling
> devices. It's also harder to escape these states with asynchronous
> events. I don't recommend using IPA with asynchronous events from IRQs,
> for now. It might change in future, though.

I understand that but there is the 'switch_on_temp' trip point which is
supposed to begin to collect the power values ahead of the
'desired_temp' (aka mitigation trip point / sustainable power).


> The patch 2/2 should calm down the unnecessary updates/notifications so
> your request.
> The longer polling, which we have for temperature below 'switch_on_temp'
> (e.g. every 1sec) shouldn't harm the performance of the system, but
> definitely makes IPA more predictable and stable.

The change I proposed is correct then no ? The polling is still effective.

If the IPA needs a sampling, it may be more adequate to separate the
sampling from the polling. So the other governors won't be impacted by
the IPA specific setup, and we do not end up with polling/passive delays
tricks for a governor. The IPA would have its own self-encapsulated
sampling rate and the thermal zone setup won't depend on the governor.

What do you think ?
Lukasz Luba April 6, 2021, 12:25 p.m. UTC | #6
On 4/6/21 12:24 PM, Daniel Lezcano wrote:
> On 06/04/2021 12:39, Lukasz Luba wrote:
>>
>>
>> On 4/6/21 11:16 AM, Daniel Lezcano wrote:
>>> On 06/04/2021 10:44, Lukasz Luba wrote:
>>>>
>>>>
>>>> On 4/2/21 4:54 PM, Daniel Lezcano wrote:
>>>>> On 31/03/2021 18:33, Lukasz Luba wrote:
>>>>>> When the temperature is below the first activation trip point the
>>>>>> cooling
>>>>>> devices are not checked, so they cannot maintain fresh statistics. It
>>>>>> leads into the situation, when temperature crosses first trip
>>>>>> point, the
>>>>>> statistics are stale and show state for very long period.
>>>>>
>>>>> Can you elaborate the statistics you are referring to ?
>>>>>
>>>>> I can understand the pid controller needs temperature but I don't
>>>>> understand the statistics with the cooling device.
>>>>>
>>>>>
>>>>
>>>> The allocate_power() calls cooling_ops->get_requested_power(),
>>>> which is for CPUs cpufreq_get_requested_power() function.
>>>> In that cpufreq implementation for !SMP we still has the
>>>> issue of stale statistics. Viresh, when he introduced the usage
>>>> of sched_cpu_util(), he fixed that 'long non-meaningful period'
>>>> of the broken statistics and it can be found since v5.12-rc1.
>>>>
>>>> The bug is still there for the !SMP. Look at the way how idle time
>>>> is calculated in get_load() [1]. It relies on 'idle_time->timestamp'
>>>> for calculating the period. But when this function is not called,
>>>> the value can be very far away in time, e.g. a few seconds back,
>>>> when the last allocate_power() was called.
>>>>
>>>> The bug is there for both SMP and !SMP [2] for older kernels, which can
>>>> be used in Android or ChromeOS. I've been considering to put this simple
>>>> IPA fix also to some other kernels, because Viresh's change is more
>>>> a 'feature' and does not cover both platforms.
>>>
>>> Ok, so IIUC, the temperature is needed as well as the power to do the
>>> connection for the pid loop (temp vs power).
>>>
>>> I'm trying to figure out how to delegate the mitigation switch on/off to
>>> the core code instead of the governor (and kill tz->passive) but how
>>> things are implemented for the IPA makes this very difficult.
>>>
>>> AFAICT, this fix is not correct.
>>>
>>> If the temperature is below the 'switch_on_temp' the passive is set to
>>> zero and the throttle function is not called anymore (assuming it is
>>> interrupt mode not polling mode), so the power number is not updated
>>> also.
>>
>> IPA doesn't work well in asynchronous mode, because it needs this fixed
>> length for the period. I have been experimenting with tsens IRQs and
>> also both fixed polling + sporadic asynchronous IRQs, trying to fix it
>> and have 'predictable' IPA, but without a luck.
>> IPA needs synchronous polling events like we have for high temp e.g.
>> 100ms and low temp e.g. 1000ms. The asynchronous events are root of
>> undesirable states (too low or to high) calculated and set for cooling
>> devices. It's also harder to escape these states with asynchronous
>> events. I don't recommend using IPA with asynchronous events from IRQs,
>> for now. It might change in future, though.
> 
> I understand that but there is the 'switch_on_temp' trip point which is
> supposed to begin to collect the power values ahead of the
> 'desired_temp' (aka mitigation trip point / sustainable power).
> 
> 
>> The patch 2/2 should calm down the unnecessary updates/notifications so
>> your request.
>> The longer polling, which we have for temperature below 'switch_on_temp'
>> (e.g. every 1sec) shouldn't harm the performance of the system, but
>> definitely makes IPA more predictable and stable.
> 
> The change I proposed is correct then no ? The polling is still effective.

In your proposed code there is 'tz->last_temperature < switch_on_temp'
which then return 0 immediately. So we don't poke the devices.

> 
> If the IPA needs a sampling, it may be more adequate to separate the
> sampling from the polling. So the other governors won't be impacted by
> the IPA specific setup, and we do not end up with polling/passive delays
> tricks for a governor. The IPA would have its own self-encapsulated
> sampling rate and the thermal zone setup won't depend on the governor.
> 
> What do you think ?
> 

IMHO having a private timer in the governor creates another complexity
and confusion.

What we have in thermal now is good enough. We have DT support for both
periods so there is need even to write via sysfs:
polling-delay-passive
polling-delay
The device driver developers can rely on this reliable check in the
thermal framework.

I don't agree that IPA forces any specific setup. If the thermal is
configured to do the polling of the temp sensor, because maybe there
are no HW interrupts, then there is no other way. The Arm SCMI sensors
were one of them, where we had to send a SCMI request. There was no
notifications/IRQs that temp crossed some trip point. Now it should be
better, but still it depends on vendor's FW implementation if there is
IRQ.

The reliable polling is not IPA 'feature request'.
We cannot avoid polling in some configurations. Thermal framework
must support this scenario: polling/checking temp sensor even when
the temp is low.

Thus, since framework must check the temp, calling
the governor->throttle() doesn't harm (as I said every 1sec).
Furthermore, the governor interprets what trip point and temperature to
interpret and how to react.
Daniel Lezcano April 6, 2021, 2:32 p.m. UTC | #7
Hi Lukasz,


On 06/04/2021 14:25, Lukasz Luba wrote:
> 
> 
> On 4/6/21 12:24 PM, Daniel Lezcano wrote:
>> On 06/04/2021 12:39, Lukasz Luba wrote:
>>>
>>>
>>> On 4/6/21 11:16 AM, Daniel Lezcano wrote:
>>>> On 06/04/2021 10:44, Lukasz Luba wrote:
>>>>>
>>>>>
>>>>> On 4/2/21 4:54 PM, Daniel Lezcano wrote:
>>>>>> On 31/03/2021 18:33, Lukasz Luba wrote:
>>>>>>> When the temperature is below the first activation trip point the
>>>>>>> cooling
>>>>>>> devices are not checked, so they cannot maintain fresh
>>>>>>> statistics. It
>>>>>>> leads into the situation, when temperature crosses first trip
>>>>>>> point, the
>>>>>>> statistics are stale and show state for very long period.
>>>>>>
>>>>>> Can you elaborate the statistics you are referring to ?
>>>>>>
>>>>>> I can understand the pid controller needs temperature but I don't
>>>>>> understand the statistics with the cooling device.
>>>>>>
>>>>>>
>>>>>
>>>>> The allocate_power() calls cooling_ops->get_requested_power(),
>>>>> which is for CPUs cpufreq_get_requested_power() function.
>>>>> In that cpufreq implementation for !SMP we still has the
>>>>> issue of stale statistics. Viresh, when he introduced the usage
>>>>> of sched_cpu_util(), he fixed that 'long non-meaningful period'
>>>>> of the broken statistics and it can be found since v5.12-rc1.
>>>>>
>>>>> The bug is still there for the !SMP. Look at the way how idle time
>>>>> is calculated in get_load() [1]. It relies on 'idle_time->timestamp'
>>>>> for calculating the period. But when this function is not called,
>>>>> the value can be very far away in time, e.g. a few seconds back,
>>>>> when the last allocate_power() was called.
>>>>>
>>>>> The bug is there for both SMP and !SMP [2] for older kernels, which
>>>>> can
>>>>> be used in Android or ChromeOS. I've been considering to put this
>>>>> simple
>>>>> IPA fix also to some other kernels, because Viresh's change is more
>>>>> a 'feature' and does not cover both platforms.
>>>>
>>>> Ok, so IIUC, the temperature is needed as well as the power to do the
>>>> connection for the pid loop (temp vs power).
>>>>
>>>> I'm trying to figure out how to delegate the mitigation switch
>>>> on/off to
>>>> the core code instead of the governor (and kill tz->passive) but how
>>>> things are implemented for the IPA makes this very difficult.
>>>>
>>>> AFAICT, this fix is not correct.
>>>>
>>>> If the temperature is below the 'switch_on_temp' the passive is set to
>>>> zero and the throttle function is not called anymore (assuming it is
>>>> interrupt mode not polling mode), so the power number is not updated
>>>> also.
>>>
>>> IPA doesn't work well in asynchronous mode, because it needs this fixed
>>> length for the period. I have been experimenting with tsens IRQs and
>>> also both fixed polling + sporadic asynchronous IRQs, trying to fix it
>>> and have 'predictable' IPA, but without a luck.
>>> IPA needs synchronous polling events like we have for high temp e.g.
>>> 100ms and low temp e.g. 1000ms. The asynchronous events are root of
>>> undesirable states (too low or to high) calculated and set for cooling
>>> devices. It's also harder to escape these states with asynchronous
>>> events. I don't recommend using IPA with asynchronous events from IRQs,
>>> for now. It might change in future, though.
>>
>> I understand that but there is the 'switch_on_temp' trip point which is
>> supposed to begin to collect the power values ahead of the
>> 'desired_temp' (aka mitigation trip point / sustainable power).
>>
>>
>>> The patch 2/2 should calm down the unnecessary updates/notifications so
>>> your request.
>>> The longer polling, which we have for temperature below 'switch_on_temp'
>>> (e.g. every 1sec) shouldn't harm the performance of the system, but
>>> definitely makes IPA more predictable and stable.
>>
>> The change I proposed is correct then no ? The polling is still
>> effective.
> 
> In your proposed code there is 'tz->last_temperature < switch_on_temp'
> which then return 0 immediately. So we don't poke the devices.

Ah yes, I see your point.

Often the device tree is setup with an additional trip point to trigger
a stat collection. So I assumed we don't need to feed the pid loop below
this temperature.

What is the goal of sampling at polling time (not passive) ?

Here are the boards with the extra trip point:

- mt8173.dtsi
- hi6220.dtsi
- hi6220.dtsi
- px30.dts
- rk3328.dtsi
- rk3399-gru-kevin.dts

Those have an interrupt mode but do polling also.

See the configuration for:
- sc7180.dtsi (no polling delay and no pre-trip point)
- r8a77990.dtsi

>> If the IPA needs a sampling, it may be more adequate to separate the
>> sampling from the polling. So the other governors won't be impacted by
>> the IPA specific setup, and we do not end up with polling/passive delays
>> tricks for a governor. The IPA would have its own self-encapsulated
>> sampling rate and the thermal zone setup won't depend on the governor.
>>
>> What do you think ?
>>
> 
> IMHO having a private timer in the governor creates another complexity
> and confusion.

I would say we move the adherence between the thermal core and the IPA
into the governor only :)

Especially, we *have* to call throttle on a governor even if we are not
in the mitigation process.

And from a design POV, it should be the thermal core to be in control of
what is happening, not passively call the different callbacks and expect
them to behave correctly (eg. set tz->passive)


> What we have in thermal now is good enough. We have DT support for both
> periods so there is need even to write via sysfs:
> polling-delay-passive
> polling-delay
> The device driver developers can rely on this reliable check in the
> thermal framework.
> I don't agree that IPA forces any specific setup. 

Yes, it does IMO.

For instance, on the hi3660, the sensor is able to do interrupt mode, so
to be wake up when the first temperature threshold is reached.

But there is still the polling delay because the governor is IPA in this
case? There is also an additional trip-point0 which is not a target for
a cooling device, just put there to ensure the IPA has enough data when
reaching the second trip point which is the target.

If, for any reason, we want to switch to step_wise, where the interrupt
mode is enough to trigger the mitigation (eg. with passive polling), and
ensure the system is not constantly waking up (and AFAICT even 1s
periodic wake up can have a significant impact on battery life), it
won't be possible because of the device tree.

Moreover, some sensors do not use their interrupt mode because of IPA
setup or they use it incorrectly.

See my concern here ? IPA has an impact on the thermal core and the
sensor, while those should be governor agnostic.

> If the thermal is
> configured to do the polling of the temp sensor, because maybe there
> are no HW interrupts, then there is no other way. The Arm SCMI sensors
> were one of them, where we had to send a SCMI request. There was no
> notifications/IRQs that temp crossed some trip point. Now it should be
> better, but still it depends on vendor's FW implementation if there is
> IRQ.

I don't know all the platforms but so far the interrupt mode is largely
supported, to not say in the vast majority.

> The reliable polling is not IPA 'feature request'.
> We cannot avoid polling in some configurations. Thermal framework
> must support this scenario: polling/checking temp sensor even when
> the temp is low.

I agree with that, I'm not questioning about removing the polling.

> Thus, since framework must check the temp, calling
> the governor->throttle() doesn't harm (as I said every 1sec).
> Furthermore, the governor interprets what trip point and temperature to
> interpret and how to react.

Precisely, that is the reason why I disagree. The thermal core should
switch on/off the mitigation (say cool down / warm up) and the governor
applies its recipe. With polling and sampling tied together it is not
possible to create self-encapsulated components. As a result we have bug
like [1] :/

  -- Daniel

[1] https://bugzilla.kernel.org/show_bug.cgi?id=212507
Lukasz Luba April 6, 2021, 6:38 p.m. UTC | #8
On 4/6/21 3:32 PM, Daniel Lezcano wrote:
> 
> Hi Lukasz,
> 
> 
> On 06/04/2021 14:25, Lukasz Luba wrote:
>>
>>
>> On 4/6/21 12:24 PM, Daniel Lezcano wrote:

[snip]

>>
>> In your proposed code there is 'tz->last_temperature < switch_on_temp'
>> which then return 0 immediately. So we don't poke the devices.
> 
> Ah yes, I see your point.
> 
> Often the device tree is setup with an additional trip point to trigger
> a stat collection. So I assumed we don't need to feed the pid loop below
> this temperature.
> 
> What is the goal of sampling at polling time (not passive) ?

True, we don't need to feed the PID loop with this data when the temp
is low. We only have to call ->get_requested_power() for each cooling
device, because they maintain context (in that get_load() function
which I pointed earlier). When we don't call them, then next time
when we call, their context is growing and shows statistics for a period
e.g. 30 sec. Then this 30 sec is almost all idle, but only 1sec was
'running', but it doesn't tell you that it was actually the last
1 sec, which means the avg load is ~100%. Due to 'stale' context
IPA might interpret that the load was 1s/30s = ~3%.
When we have this reliable polling every 1sec, so we poke the cooling
devices even at low temp, they will maintain proper statistics context,
then this ~3% misscalculation would not happen.

> 
> Here are the boards with the extra trip point:
> 
> - mt8173.dtsi
> - hi6220.dtsi
> - hi6220.dtsi
> - px30.dts
> - rk3328.dtsi
> - rk3399-gru-kevin.dts
> 
> Those have an interrupt mode but do polling also.
> 
> See the configuration for:
> - sc7180.dtsi (no polling delay and no pre-trip point)
> - r8a77990.dtsi
> 
>>> If the IPA needs a sampling, it may be more adequate to separate the
>>> sampling from the polling. So the other governors won't be impacted by
>>> the IPA specific setup, and we do not end up with polling/passive delays
>>> tricks for a governor. The IPA would have its own self-encapsulated
>>> sampling rate and the thermal zone setup won't depend on the governor.
>>>
>>> What do you think ?
>>>
>>
>> IMHO having a private timer in the governor creates another complexity
>> and confusion.
> 
> I would say we move the adherence between the thermal core and the IPA
> into the governor only :)
> 
> Especially, we *have* to call throttle on a governor even if we are not
> in the mitigation process.
> 
> And from a design POV, it should be the thermal core to be in control of
> what is happening, not passively call the different callbacks and expect
> them to behave correctly (eg. set tz->passive)

In that design you would like to interpret the temp and trips' values
and decide not to call governor like IPA when the temp is low, below
the 1st trip point. Well

> 
> 
>> What we have in thermal now is good enough. We have DT support for both
>> periods so there is need even to write via sysfs:
>> polling-delay-passive
>> polling-delay
>> The device driver developers can rely on this reliable check in the
>> thermal framework.
>> I don't agree that IPA forces any specific setup.
> 
> Yes, it does IMO.
> 
> For instance, on the hi3660, the sensor is able to do interrupt mode, so
> to be wake up when the first temperature threshold is reached.
> 
> But there is still the polling delay because the governor is IPA in this
> case? There is also an additional trip-point0 which is not a target for
> a cooling device, just put there to ensure the IPA has enough data when
> reaching the second trip point which is the target.

It's just a configuration which was there for years. Some who wants to
use IPA have to be sure that it has this 2 trip points: switch_on and
control. If you are talking about a new design, then it's not for this
patch. The complete re-design of DT thermal zones, sensors, etc is
a huge topic.

> 
> If, for any reason, we want to switch to step_wise, where the interrupt
> mode is enough to trigger the mitigation (eg. with passive polling), and
> ensure the system is not constantly waking up (and AFAICT even 1s
> periodic wake up can have a significant impact on battery life), it
> won't be possible because of the device tree.

What do you mean 'and ensure the system is not constantly waking up'?
Thermal workqueue mechanism doesn't cause system wakeup from suspend
AFAIK, unless something has changed.

I don't think it has 'significant impact on battery life'. The phone is
not woken-up when it's in suspend after power button was hit and screen
is off. There will be no thermal delayed work timer firing.

Then if we are considering a low system utilization, which would allow
to fire the thermal workqueue and run IPA every 1s, IPA does it's
computation in less then 10us on single big core with 1GHz freq for a
thermal zone with 2 cooling devices.
Thus, for a big CPU running 100% at 1GHz power usage is ~500mW. We can
estimate in various ways the cost of this 10us IPA running time.
It's ~5uW or, when we have 1V (in Juno) for that freq, then
amps=500mA * 10us / 1000000us = 5uA

Even when we add firing & scheduling workqueue cost of CPU idle wake-up
if needed, then still the impact is not 'significant'. It's not a
frequent task, this 1sec thermal+IPA check. Even if you double or triple 
this 10us overhead due to idle wake-up.

That's why we have two values for polling, one is longer, the impact is
low and not 'significant'.

There is plenty of other software which might have bigger impact than
this 1sec polling.

> 
> Moreover, some sensors do not use their interrupt mode because of IPA
> setup or they use it incorrectly.

Sensors have two modes, I don't see why only IRQ is OK.
Polling a sensor every 1s is also OK, it's not waking up the whole
device from suspend.

> 
> See my concern here ? IPA has an impact on the thermal core and the
> sensor, while those should be governor agnostic.

I don't see this as argument for blaming that current design of thermal
has 'significant impact' and is wrong. I don't agree that IPA is the
root cause of this design.

> 
>> If the thermal is
>> configured to do the polling of the temp sensor, because maybe there
>> are no HW interrupts, then there is no other way. The Arm SCMI sensors
>> were one of them, where we had to send a SCMI request. There was no
>> notifications/IRQs that temp crossed some trip point. Now it should be
>> better, but still it depends on vendor's FW implementation if there is
>> IRQ.
> 
> I don't know all the platforms but so far the interrupt mode is largely
> supported, to not say in the vast majority.

It's not obligatory to support IRQ for all trip points per sensor.
There are platforms which support a number, e.g. 4 IRQs, then the
rest of trip points must be defined as polling. This 'mix' for
a single sensor is a concern for me.

In other platforms which use SCMI, till recently, there was no IRQs
for sensors protocols (with latest notifications support it
might be possible in kernel, but still FW must support it).

> 
>> The reliable polling is not IPA 'feature request'.
>> We cannot avoid polling in some configurations. Thermal framework
>> must support this scenario: polling/checking temp sensor even when
>> the temp is low.
> 
> I agree with that, I'm not questioning about removing the polling.
> 
>> Thus, since framework must check the temp, calling
>> the governor->throttle() doesn't harm (as I said every 1sec).
>> Furthermore, the governor interprets what trip point and temperature to
>> interpret and how to react.
> 
> Precisely, that is the reason why I disagree. The thermal core should
> switch on/off the mitigation (say cool down / warm up) and the governor
> applies its recipe. With polling and sampling tied together it is not
> possible to create self-encapsulated components. As a result we have bug
> like [1] :/

I wouldn't blame that this design is all wrong. Framework calls governor
update callback for all temperature and that's it.

Switching governors is tricky, especially when they rely not only on
their private data, but data from it's devices private fields, like this
example step_wise.

This step_wise governor doesn't even provide bind_to_tz() and
unbind_from_tz(). Maybe it should setup itself before starting
operating on the thermal zone after binding?
Daniel Lezcano April 6, 2021, 7:45 p.m. UTC | #9
On 06/04/2021 20:38, Lukasz Luba wrote:

[ ... ]

>> But there is still the polling delay because the governor is IPA in this
>> case? There is also an additional trip-point0 which is not a target for
>> a cooling device, just put there to ensure the IPA has enough data when
>> reaching the second trip point which is the target.
> 
> It's just a configuration which was there for years. Some who wants to
> use IPA have to be sure that it has this 2 trip points: switch_on and
> control. If you are talking about a new design, then it's not for this
> patch. The complete re-design of DT thermal zones, sensors, etc is
> a huge topic.

Right it is not for this patch. Just pointing out there is something
wrong from my POV when polling is used for sampling.

[ ... ]
diff mbox series

Patch

diff --git a/drivers/thermal/gov_power_allocator.c b/drivers/thermal/gov_power_allocator.c
index 2802a0e13c88..0cbd10cab193 100644
--- a/drivers/thermal/gov_power_allocator.c
+++ b/drivers/thermal/gov_power_allocator.c
@@ -575,15 +575,27 @@  static void allow_maximum_power(struct thermal_zone_device *tz)
 {
 	struct thermal_instance *instance;
 	struct power_allocator_params *params = tz->governor_data;
+	u32 req_power;
 
 	mutex_lock(&tz->lock);
 	list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
+		struct thermal_cooling_device *cdev = instance->cdev;
+
 		if ((instance->trip != params->trip_max_desired_temperature) ||
 		    (!cdev_is_power_actor(instance->cdev)))
 			continue;
 
 		instance->target = 0;
 		mutex_lock(&instance->cdev->lock);
+		/*
+		 * Call for updating the cooling devices local stats and avoid
+		 * periods of dozen of seconds when those have not been
+		 * maintained. The long period would come into the first check
+		 * when lower threshold is crossed. Thus, limit it to single
+		 * one longer polling period.
+		 */
+		cdev->ops->get_requested_power(cdev, &req_power);
+
 		instance->cdev->updated = false;
 		mutex_unlock(&instance->cdev->lock);
 		thermal_cdev_update(instance->cdev);