Various Exynos targets never return to no cooling

Message ID	20231113130435.500353-1-m.majewski2@samsung.com (mailing list archive)
State	New
Headers	show Return-Path: <linux-samsung-soc-owner@vger.kernel.org> From: Mateusz Majewski <m.majewski2@samsung.com> To: linux-pm@vger.kernel.org, linux-samsung-soc@vger.kernel.org, linux-kernel@vger.kernel.org Cc: Mateusz Majewski <m.majewski2@samsung.com>, "Rafael J . Wysocki" <rafael@kernel.org>, Daniel Lezcano <daniel.lezcano@linaro.org>, Zhang Rui <rui.zhang@intel.com>, Lukasz Luba <lukasz.luba@arm.com>, Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>, Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>, Marek Szyprowski <m.szyprowski@samsung.com> Subject: Various Exynos targets never return to no cooling Date: Mon, 13 Nov 2023 14:04:34 +0100 Message-ID: <20231113130435.500353-1-m.majewski2@samsung.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="utf-8" CMS-TYPE: 201P References: <CGME20231113130451eucas1p293985c1bee8dc71b9c78a013663ce8e6@eucas1p2.samsung.com> Precedence: bulk
Series	Various Exynos targets never return to no cooling \| expand Various Exynos targets never return to no cooling

Message ID

20231113130435.500353-1-m.majewski2@samsung.com (mailing list archive)

State

New

Headers

From: Mateusz Majewski <m.majewski2@samsung.com>
To: linux-pm@vger.kernel.org, linux-samsung-soc@vger.kernel.org,
        linux-kernel@vger.kernel.org
Cc: Mateusz Majewski <m.majewski2@samsung.com>,
        "Rafael J . Wysocki" <rafael@kernel.org>,
        Daniel Lezcano <daniel.lezcano@linaro.org>,
        Zhang Rui <rui.zhang@intel.com>,
        Lukasz Luba <lukasz.luba@arm.com>,
        Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>,
        Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>,
        Marek Szyprowski <m.szyprowski@samsung.com>
Subject: Various Exynos targets never return to no cooling
Date: Mon, 13 Nov 2023 14:04:34 +0100
Message-ID: <20231113130435.500353-1-m.majewski2@samsung.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="utf-8"
CMS-TYPE: 201P
References: 
 <CGME20231113130451eucas1p293985c1bee8dc71b9c78a013663ce8e6@eucas1p2.samsung.com>
Precedence: bulk

Series

Various Exynos targets never return to no cooling | expand

Commit Message

Mateusz Majewski Nov. 13, 2023, 1:04 p.m. UTC

Hi,

While working on some fixes on the Exynos thermal driver, I have found that some
of the Exynos-based boards will never return to no cooling. That is, after
heating the board a bit and letting it cool, we see in the sysfs output similar
to this:

root@target:~# cat /sys/class/thermal/thermal_zone*/temp
30000
29000
32000
31000
30000
root@target:~# cat /sys/class/thermal/cooling_device*/cur_state
1
0
0
0

This is on the Odroid XU4 board, where the lowest trip point is 50 deg. C.
Similar behavior happens on some other boards, for example TM2E. The issue
happens when the stepwise governor is being used and I have not tested the
behavior of the other governors.

I have attempted to fix this myself, but the issue seems somewhat complex and
over my level of understanding. I did some debugging, and here is what I think
is happening:

1. Since there is no temperature polling enabled on the mentioned boards, the
governor will only be called when a trip point is being passed.
2. The board heats up and a couple trip points get passed through. Each time,
the governor is called for each trip point.
3. For the lowest thermal instance, it will find out that the temperature is
higher than the lowest trip point (i.e. throttle is true), and that the trend
is THERMAL_TREND_RAISING. Therefore, it will attempt to increase the target
state each time and the state will be set to the higher limit.
4. Let's now say that the temperature starts falling, which means that the trip
points get passed from the other side. Again, the governor will be called for
each trip point.
5. For the lowest thermal instance, the trend will be THERMAL_TREND_DROPPING. The
temperature will be higher than the lowest trip point all but one time (i.e.
throttle will be true). This will mean that in these cases, nothing will
happen and the state will remain at the higher limit.
6. Finally, when the lowest trip point is passed and the governor is called for
its thermal instance, the trend will still be THERMAL_TREND_DROPPING and the
temperature will be lower than the trip point (i.e. throttle will be false).
Therefore, the governor will reduce the state, but it is unlikely that this
will result in deactivation of the thermal instance, since the state has been
at the higher limit up until this point.
7. Now the governor will never be called anymore, and the state will never
change from this point.

I have found two workarounds, but neither seem satisfactory:

1. The issue doesn't appear when at least two lowest trip points have their
lower state limit equal to the higher state limit. For instance, for TM2E,
the following change is enough for the issue to not appear:

Two trip points need to change and not only one, since the calculation in the
governor is based on the maximum of all states and not only the state of a
single instance. It's not clear if that would be enough in all cases, but
this feels hacky anyway. Though since we only give the governor information
when the trip point is passed, it does make some limited sense to make it
simply set the state to a specific value instead of making decisions.
2. The issue also disappears when polling is enabled, since this means that the
governor is called periodically. However, it would be great to not have to do
so and keep using only interrupts, since we already have them in our SoC.

It seems that in the past, there has been an attempt to handle this case
differently: https://lore.kernel.org/all/1352348786-24255-1-git-send-email-amit.kachhap@linaro.org/
However it seems that the attempt has never been completed, and the remains
have been removed: https://lore.kernel.org/all/20220629151012.3115773-2-daniel.lezcano@linaro.org/

There also might be a race condition possible here, as it might be the case
that after the interrupt, when the thermal framework calls get_temp, the value
will already change to a value that would not trigger the trip point. This
could be problematic when the temperature is raising, as then the governor will
essentially ignore that trip point (trend will be RAISING, but throttle will be
false, so nothing will happen). It is less problematic when the temperature is
falling, as the temperature will be much lower than the trip point due to
hysteresis. However, for the Exynos 5433 SoC, hysteresis is unfortunately set
to 1 deg. C and the temperature values are also rounded to 1 deg. C. This means
that the race condition might also be possible in this direction on this SoC. I
have once managed to get the state stuck at 2 instead of the usual 1 on TM2E. I
have not investigated that further, but it seems that this race condition is a
good explanation of this behavior.

I feel very incompetent to attempt to resolve these issues, as I have only read
the thermal framework code for a bit. What do you think should be done here?

Thank you! :)
Mateusz

Comments

Lukasz Luba Nov. 14, 2023, 10:32 a.m. UTC | #1

Hi Mateusz,

On 11/13/23 13:04, Mateusz Majewski wrote:
> Hi,
> 
> While working on some fixes on the Exynos thermal driver, I have found that some
> of the Exynos-based boards will never return to no cooling. That is, after
> heating the board a bit and letting it cool, we see in the sysfs output similar
> to this:
> 
> root@target:~# cat /sys/class/thermal/thermal_zone*/temp
> 30000
> 29000
> 32000
> 31000
> 30000
> root@target:~# cat /sys/class/thermal/cooling_device*/cur_state
> 1
> 0
> 0
> 0

You can also use this command:
$ grep . /sys/class/thermal/cooling_device*/cur_state

That would print also the names.

> 
> This is on the Odroid XU4 board, where the lowest trip point is 50 deg. C.
> Similar behavior happens on some other boards, for example TM2E. The issue
> happens when the stepwise governor is being used and I have not tested the
> behavior of the other governors.

You won't be able easily test IPA on odroidxu4, but if you like ping me
offline.

IPA won't work with this current DT thermal, so there is no issue.
Also, IPA works in polling mode only, more about why you can find below.

> 
> I have attempted to fix this myself, but the issue seems somewhat complex and
> over my level of understanding. I did some debugging, and here is what I think
> is happening:
> 
> 1. Since there is no temperature polling enabled on the mentioned boards, the
>     governor will only be called when a trip point is being passed.
> 2. The board heats up and a couple trip points get passed through. Each time,
>     the governor is called for each trip point.
> 3. For the lowest thermal instance, it will find out that the temperature is
>     higher than the lowest trip point (i.e. throttle is true), and that the trend
>     is THERMAL_TREND_RAISING. Therefore, it will attempt to increase the target
>     state each time and the state will be set to the higher limit.
> 4. Let's now say that the temperature starts falling, which means that the trip
>     points get passed from the other side. Again, the governor will be called for
>     each trip point.
> 5. For the lowest thermal instance, the trend will be THERMAL_TREND_DROPPING. The
>     temperature will be higher than the lowest trip point all but one time (i.e.
>     throttle will be true). This will mean that in these cases, nothing will
>     happen and the state will remain at the higher limit.

Tricky corner case, but possible for your interrupt only mode. The way
how step_wise is designed with this activation/deactivation of the
passive mode linked to the cooling state returned values is confusing
IMO.

> 6. Finally, when the lowest trip point is passed and the governor is called for
>     its thermal instance, the trend will still be THERMAL_TREND_DROPPING and the
>     temperature will be lower than the trip point (i.e. throttle will be false).
>     Therefore, the governor will reduce the state, but it is unlikely that this
>     will result in deactivation of the thermal instance, since the state has been
>     at the higher limit up until this point.

That's possible. That's why I would separate that control mode, so this
corner case would not happen. That governor unfortunately has quite a
lot of legacy platforms with polling only mode.

> 7. Now the governor will never be called anymore, and the state will never
>     change from this point.
> 
> I have found two workarounds, but neither seem satisfactory:
> 
> 1. The issue doesn't appear when at least two lowest trip points have their
>     lower state limit equal to the higher state limit. For instance, for TM2E,
>     the following change is enough for the issue to not appear:
> 
> diff --git a/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi b/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
> index 81b72393dd0d..145c4c80893a 100644
> --- a/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
> +++ b/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
> @@ -55,14 +55,14 @@ cooling-maps {
>               map0 {
>                   /* Set maximum frequency as 1800MHz  */
>                   trip = <&atlas0_alert_0>;
> -                cooling-device = <&cpu4 1 2>, <&cpu5 1 2>,
> -                         <&cpu6 1 2>, <&cpu7 1 2>;
> +                cooling-device = <&cpu4 1 1>, <&cpu5 1 1>,
> +                         <&cpu6 1 1>, <&cpu7 1 1>;
>               };
>               map1 {
>                   /* Set maximum frequency as 1700MHz  */
>                   trip = <&atlas0_alert_1>;
> -                cooling-device = <&cpu4 2 3>, <&cpu5 2 3>,
> -                         <&cpu6 2 3>, <&cpu7 2 3>;
> +                cooling-device = <&cpu4 2 2>, <&cpu5 2 2>,
> +                         <&cpu6 2 2>, <&cpu7 2 2>;
>               };
>               map2 {
>                   /* Set maximum frequency as 1600MHz  */
> @@ -229,14 +229,14 @@ cooling-maps {
>               map0 {
>                   /* Set maximum frequency as 1200MHz  */
>                   trip = <&apollo_alert_2>;
> -                cooling-device = <&cpu0 1 2>, <&cpu1 1 2>,
> -                         <&cpu2 1 2>, <&cpu3 1 2>;
> +                cooling-device = <&cpu0 1 1>, <&cpu1 1 1>,
> +                         <&cpu2 1 1>, <&cpu3 1 1>;
>               };
>               map1 {
>                   /* Set maximum frequency as 1100MHz  */
>                   trip = <&apollo_alert_3>;
> -                cooling-device = <&cpu0 2 3>, <&cpu1 2 3>,
> -                         <&cpu2 2 3>, <&cpu3 2 3>;
> +                cooling-device = <&cpu0 2 2>, <&cpu1 2 2>,
> +                         <&cpu2 2 2>, <&cpu3 2 2>;
>               };
>               map2 {
>                   /* Set maximum frequency as 1000MHz  */
> 
>     Two trip points need to change and not only one, since the calculation in the
>     governor is based on the maximum of all states and not only the state of a
>     single instance. It's not clear if that would be enough in all cases, but
>     this feels hacky anyway. Though since we only give the governor information
>     when the trip point is passed, it does make some limited sense to make it
>     simply set the state to a specific value instead of making decisions.

Yes, this is problematic - playing with the limits for scope in the DT
+ step_wise governor + only interrupt mode.

Also, I never like this approach, since we are dealing with dynamic
system (nonlinear) and control with static approach is not recommended.
BTW, IPA was born to introduce dynamic control alg.

> 2. The issue also disappears when polling is enabled, since this means that the
>     governor is called periodically. However, it would be great to not have to do
>     so and keep using only interrupts, since we already have them in our SoC.

There are pros and cons with the polling approach. The thermal and
temperature is a dynamic system and observability is important. The
interrupt mode limits the observability, that's what you've shown.

I understand your requirement for the interrupts only mode, but
maybe till the moment there is no fix upstream, you can enable
it as well?

> 
> It seems that in the past, there has been an attempt to handle this case
> differently: https://lore.kernel.org/all/1352348786-24255-1-git-send-email-amit.kachhap@linaro.org/
> However it seems that the attempt has never been completed, and the remains
> have been removed: https://lore.kernel.org/all/20220629151012.3115773-2-daniel.lezcano@linaro.org/
> 
> There also might be a race condition possible here, as it might be the case
> that after the interrupt, when the thermal framework calls get_temp, the value
> will already change to a value that would not trigger the trip point. This
> could be problematic when the temperature is raising, as then the governor will
> essentially ignore that trip point (trend will be RAISING, but throttle will be
> false, so nothing will happen). It is less problematic when the temperature is
> falling, as the temperature will be much lower than the trip point due to
> hysteresis. However, for the Exynos 5433 SoC, hysteresis is unfortunately set
> to 1 deg. C and the temperature values are also rounded to 1 deg. C. This means
> that the race condition might also be possible in this direction on this SoC. I
> have once managed to get the state stuck at 2 instead of the usual 1 on TM2E. I
> have not investigated that further, but it seems that this race condition is a
> good explanation of this behavior.

Interesting. That's really tough situation in that platform. AFAICS the
code_to_temp() cannot do much, that rounding happens in the HW.

In such situation that you've described: temp value can 'oscillate'
very fast around the value point (next gets rounded in HW) in the
meantime the run of the code is progressing. I would not relay on
interrupt only mode in this case.

The value stuck at 2 worries me because even if you change those
values in the DT as above, the race condition could happen and leave
you with state stuck at 1.

Escaping the control algorithm while going down is less dangerous,
but based on the code AFAICS it might be also in theory possible
while going up with the temperature. Very unlikely for the 2nd
trip point, because the temp would be higher than 1st trip. That would
set the throttling to true for 1st trip (because we loop over all
possible trip points). I'm worried about last trip point, because
it could escape the control alg there. Furthermore, since those
cooling state values are limited-space and hard-coded for the last
trip points, the temp can reach hot/critical trip point and shut
down the device (fortunately it's not handled by the governor
interpretation code, but fwk and it would work fine).
So please make sure you have hot/critical trip point as last one,
if you already don't have that.


> 
> I feel very incompetent to attempt to resolve these issues, as I have only read
> the thermal framework code for a bit. What do you think should be done here?
> 

Therefore, IMO this deserves a fix in step_wise governor code. It has to
be re-designed how it interprets trip point that has checked vs.
the temperature comparisons. Also, this confusing passive mode
handled with the cooling state values comparisons (not ideal)...

Regards,
Lukasz

Lukasz Luba Dec. 4, 2023, 11:57 a.m. UTC | #2

On 11/14/23 10:32, Lukasz Luba wrote:
> Hi Mateusz,
> 
> On 11/13/23 13:04, Mateusz Majewski wrote:
>> Hi,
>>
>> While working on some fixes on the Exynos thermal driver, I have found 
>> that some
>> of the Exynos-based boards will never return to no cooling. That is, 
>> after
>> heating the board a bit and letting it cool, we see in the sysfs 
>> output similar
>> to this:
>>

Regarding this topic, I just wanted to tell you that I had conversation
with Rafael & Daniel last Fri. Rafael gave me a hint to his latest work
in his repo regarding potentially similar race with temperature value.

I'll have a look there and experiment next week, than come back to you.

Regards,
Lukasz

Mateusz Majewski Dec. 13, 2023, 1:42 p.m. UTC | #3

Hi,

> I understand your requirement for the interrupts only mode, but
> maybe till the moment there is no fix upstream, you can enable
> it as well?

We (actually Marek and independently another coworker) had an idea how
to solve this while still avoiding polling all the time, and it turned
out to be quite simple to implement (PoC-quality). The idea was to run
several cycles of polling after each interrupt. This could be done like
this:

diff --git a/drivers/thermal/samsung/exynos_tmu.c b/drivers/thermal/samsung/exynos_tmu.c
index 6482513bfe66..b4bffe405194 100644
--- a/drivers/thermal/samsung/exynos_tmu.c
+++ b/drivers/thermal/samsung/exynos_tmu.c
@@ -760,6 +760,12 @@ static irqreturn_t exynos_tmu_threaded_irq(int irq, void *id)
 {
 	struct exynos_tmu_data *data = id;
 
+	/* TODO: would need some API */
+	mutex_lock(&data->tzd->lock);
+	data->tzd->additional_poll_reps = 10;
+	data->tzd->additional_poll_jiffies = HZ / 10;
+	mutex_unlock(&data->tzd->lock);
+
 	thermal_zone_device_update(data->tzd, THERMAL_EVENT_UNSPECIFIED);
 
 	mutex_lock(&data->lock);
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 625ba07cbe2f..c825d068402f 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -299,12 +299,24 @@ static void thermal_zone_device_set_polling(struct thermal_zone_device *tz,
 
 static void monitor_thermal_zone(struct thermal_zone_device *tz)
 {
+	unsigned long delay;
+
 	if (tz->mode != THERMAL_DEVICE_ENABLED)
-		thermal_zone_device_set_polling(tz, 0);
+		delay = 0;
 	else if (tz->passive)
-		thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
+		delay = tz->passive_delay_jiffies;
 	else if (tz->polling_delay_jiffies)
-		thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
+		delay = tz->polling_delay_jiffies;
+	else
+		delay = 0; /* TODO: ??? */
+
+	if (tz->additional_poll_reps > 0) {
+		tz->additional_poll_reps -= 1;
+		if (delay == 0 || tz->additional_poll_jiffies < delay)
+			delay = tz->additional_poll_jiffies;
+	}
+
+	thermal_zone_device_set_polling(tz, delay);
 }
 
 static void handle_non_critical_trips(struct thermal_zone_device *tz,
@@ -425,6 +437,8 @@ static void thermal_zone_device_init(struct thermal_zone_device *tz)
 	tz->temperature = THERMAL_TEMP_INVALID;
 	tz->prev_low_trip = -INT_MAX;
 	tz->prev_high_trip = INT_MAX;
+	tz->additional_poll_jiffies = 0;
+	tz->additional_poll_reps = 0;
 	list_for_each_entry(pos, &tz->thermal_instances, tz_node)
 		pos->initialized = false;
 }
diff --git a/include/linux/thermal.h b/include/linux/thermal.h
index c7190e2dfcb4..576b1f3ef25d 100644
--- a/include/linux/thermal.h
+++ b/include/linux/thermal.h
@@ -172,6 +172,8 @@ struct thermal_zone_device {
 	int passive;
 	int prev_low_trip;
 	int prev_high_trip;
+	int additional_poll_reps;
+	unsigned long additional_poll_jiffies;
 	atomic_t need_update;
 	struct thermal_zone_device_ops *ops;
 	struct thermal_zone_params *tzp;

In my tests this is enough to resolve the issue consistently on both
TM2E and XU4, both before and after my other patchset.

To be honest, this is not the most elegant solution probably and it
still doesn't really take into account the governor needs. Therefore, if

> Regarding this topic, I just wanted to tell you that I had conversation
> with Rafael & Daniel last Fri. Rafael gave me a hint to his latest work
> in his repo regarding potentially similar race with temperature value.

brings a better solution, it would be great :)

Thank you!
Mateusz

Lukasz Luba Dec. 15, 2023, 6:48 p.m. UTC | #4

Hi Mateusz,

On 12/13/23 13:42, Mateusz Majewski wrote:
> Hi,
> 
>> I understand your requirement for the interrupts only mode, but
>> maybe till the moment there is no fix upstream, you can enable
>> it as well?
> 
> We (actually Marek and independently another coworker) had an idea how
> to solve this while still avoiding polling all the time, and it turned
> out to be quite simple to implement (PoC-quality). The idea was to run
> several cycles of polling after each interrupt. This could be done like
> this:

That's cool PoC, if we don't find any other solution, we would have
consider it as well. So let's try first tackle from other angle...

> 
> diff --git a/drivers/thermal/samsung/exynos_tmu.c b/drivers/thermal/samsung/exynos_tmu.c
> index 6482513bfe66..b4bffe405194 100644
> --- a/drivers/thermal/samsung/exynos_tmu.c
> +++ b/drivers/thermal/samsung/exynos_tmu.c
> @@ -760,6 +760,12 @@ static irqreturn_t exynos_tmu_threaded_irq(int irq, void *id)
>   {
>   	struct exynos_tmu_data *data = id;
>   
> +	/* TODO: would need some API */
> +	mutex_lock(&data->tzd->lock);
> +	data->tzd->additional_poll_reps = 10;
> +	data->tzd->additional_poll_jiffies = HZ / 10;
> +	mutex_unlock(&data->tzd->lock);
> +
>   	thermal_zone_device_update(data->tzd, THERMAL_EVENT_UNSPECIFIED);
>   
>   	mutex_lock(&data->lock);
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 625ba07cbe2f..c825d068402f 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -299,12 +299,24 @@ static void thermal_zone_device_set_polling(struct thermal_zone_device *tz,
>   
>   static void monitor_thermal_zone(struct thermal_zone_device *tz)
>   {
> +	unsigned long delay;
> +
>   	if (tz->mode != THERMAL_DEVICE_ENABLED)
> -		thermal_zone_device_set_polling(tz, 0);
> +		delay = 0;
>   	else if (tz->passive)
> -		thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
> +		delay = tz->passive_delay_jiffies;
>   	else if (tz->polling_delay_jiffies)
> -		thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
> +		delay = tz->polling_delay_jiffies;
> +	else
> +		delay = 0; /* TODO: ??? */
> +
> +	if (tz->additional_poll_reps > 0) {
> +		tz->additional_poll_reps -= 1;
> +		if (delay == 0 || tz->additional_poll_jiffies < delay)
> +			delay = tz->additional_poll_jiffies;
> +	}
> +
> +	thermal_zone_device_set_polling(tz, delay);
>   }
>   
>   static void handle_non_critical_trips(struct thermal_zone_device *tz,
> @@ -425,6 +437,8 @@ static void thermal_zone_device_init(struct thermal_zone_device *tz)
>   	tz->temperature = THERMAL_TEMP_INVALID;
>   	tz->prev_low_trip = -INT_MAX;
>   	tz->prev_high_trip = INT_MAX;
> +	tz->additional_poll_jiffies = 0;
> +	tz->additional_poll_reps = 0;
>   	list_for_each_entry(pos, &tz->thermal_instances, tz_node)
>   		pos->initialized = false;
>   }
> diff --git a/include/linux/thermal.h b/include/linux/thermal.h
> index c7190e2dfcb4..576b1f3ef25d 100644
> --- a/include/linux/thermal.h
> +++ b/include/linux/thermal.h
> @@ -172,6 +172,8 @@ struct thermal_zone_device {
>   	int passive;
>   	int prev_low_trip;
>   	int prev_high_trip;
> +	int additional_poll_reps;
> +	unsigned long additional_poll_jiffies;
>   	atomic_t need_update;
>   	struct thermal_zone_device_ops *ops;
>   	struct thermal_zone_params *tzp;
> 
> In my tests this is enough to resolve the issue consistently on both
> TM2E and XU4, both before and after my other patchset.
> 
> To be honest, this is not the most elegant solution probably and it
> still doesn't really take into account the governor needs. Therefore, if
> 
>> Regarding this topic, I just wanted to tell you that I had conversation
>> with Rafael & Daniel last Fri. Rafael gave me a hint to his latest work
>> in his repo regarding potentially similar race with temperature value.
> 
> brings a better solution, it would be great :)
> 

I have spent some time to better understand the machinery in those
updates and notifications in the core and the governors. I've also
checked the hint from Rafael about maybe similar trip & temp dance.
(that change is in PM linux-next:
"44844db91397 thermal: core: Add trip thresholds for trip crossing 
detection")

Unfortunately, that won't help, since in this TMU we get the
temperature value wrong sometimes IIUC (CMIIW).

Are we able inside the exynos_tmu_threaded_irq() get information
which of the 2 (3 if count critical) IRQs for low_temp, high_temp
has triggered?
I can see in the ->tmu_clear_irqs() we read the IRQ pending reg.
If we are able to say which of the two temp values triggered
that action, then thermal framework shouldn't ignore that and use
extra features to mitigate the glitching of the temp value.
IMO it would be correct because IRQ was triggered based on physics
so the value that we would pass and use is valid.
You would probably need to properly track those temp values and where
you are with the programmed trips (some extra context in the driver).

The exynos_get_temp() needs to work like today - no faking temperature,
no filtering, etc, always report best known temp from that reg.
Different components in thermal can read that temp e.g. sysfs.

In all governors 'tz->temperature' is used, which is good (not the
tz->ops->get_temp(). So let's try to introduce something simple:

----------------------------------------8<---------------------------

diff --git a/drivers/thermal/samsung/exynos_tmu.c 
b/drivers/thermal/samsung/exynos_tmu.c
index 667bb18205fc..345ea0836b99 100644
--- a/drivers/thermal/samsung/exynos_tmu.c
+++ b/drivers/thermal/samsung/exynos_tmu.c
@@ -786,8 +786,13 @@ static int exynos7_tmu_read(struct exynos_tmu_data 
*data)
  static irqreturn_t exynos_tmu_threaded_irq(int irq, void *id)
  {
         struct exynos_tmu_data *data = id;
+       int temp;

-       thermal_zone_device_update(data->tzd, THERMAL_EVENT_UNSPECIFIED);
+       temp = exynos_determine_temp_based_on_irq(data);
+
+       thermal_zone_device_update_with_temp(data->tzd,
+                               THERMAL_EVENT_TEMP_SAMPLE,
+                               temp);

         mutex_lock(&data->lock);
         clk_enable(data->clk);
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 9c17d35ccbbd..4cdc7b7eed1e 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -495,6 +495,43 @@ void thermal_zone_device_update(struct 
thermal_zone_device *tz,
  }
  EXPORT_SYMBOL_GPL(thermal_zone_device_update);

+void thermal_zone_device_update_with_temp(struct thermal_zone_device *tz,
+                                         enum thermal_notify_event event,
+                                         int temp)
+{
+       const struct thermal_trip *trip;
+
+       mutex_lock(&tz->lock);
+       if (device_is_registered(&tz->device))
+               goto unlock;
+
+       if (atomic_read(&in_suspend))
+               goto unlock;
+
+       if (!thermal_zone_device_is_enabled(tz))
+               goto unlock;
+
+       tz->last_temperature = tz->temperature;
+       tz->temperature = temp;
+
+       trace_thermal_temperature(tz);
+
+       thermal_genl_sampling_temp(tz->id, temp);
+
+       __thermal_zone_set_trips(tz);
+
+       tz->notify_event = event;
+
+       for_each_trip(tz, trip)
+               handle_thermal_trip(tz, trip);
+
+       monitor_thermal_zone(tz);
+
+unlock:
+       mutex_unlock(&tz->lock);
+}
+EXPORT_SYMBOL_GPL(thermal_zone_device_update_with_temp);
+
  static void thermal_zone_device_check(struct work_struct *work)
  {
         struct thermal_zone_device *tz = container_of(work, struct
-------------------------------->8-----------------------------------

There is this event not used in thermal fwk: THERMAL_EVENT_TEMP_SAMPLE,
which could also suit here.

Some extras:
There is a callback which is used in step_wise governor:
get_tz_trend()
      tz->ops->get_trend()
which is not implemented in the Exynos TMU tz. We can use it
and provide the trend, if we still have to after this modification.

Beside:
1. This new API function should be faster on some devices, e.g.
those which has temperature sensor attached to some slow bus
(I have heard about temperature overhead even a few
milli-seconds)
2. Sounds to re-use the information from such drivers like
this Exynos which has IRQs for particular temp value and
can avoid 2nd reading, which might cause confusion by providing
different value due to glitch/rounding.

So I think I could justify this new API interface.

Let me know what you think about this approach.

Regards,
Lukasz

diff --git a/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi b/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
index 81b72393dd0d..145c4c80893a 100644
--- a/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
+++ b/arch/arm64/boot/dts/exynos/exynos5433-tmu.dtsi
@@ -55,14 +55,14 @@  cooling-maps {
             map0 {
                 /* Set maximum frequency as 1800MHz  */
                 trip = <&atlas0_alert_0>;
-                cooling-device = <&cpu4 1 2>, <&cpu5 1 2>,
-                         <&cpu6 1 2>, <&cpu7 1 2>;
+                cooling-device = <&cpu4 1 1>, <&cpu5 1 1>,
+                         <&cpu6 1 1>, <&cpu7 1 1>;
             };
             map1 {
                 /* Set maximum frequency as 1700MHz  */
                 trip = <&atlas0_alert_1>;
-                cooling-device = <&cpu4 2 3>, <&cpu5 2 3>,
-                         <&cpu6 2 3>, <&cpu7 2 3>;
+                cooling-device = <&cpu4 2 2>, <&cpu5 2 2>,
+                         <&cpu6 2 2>, <&cpu7 2 2>;
             };
             map2 {
                 /* Set maximum frequency as 1600MHz  */
@@ -229,14 +229,14 @@  cooling-maps {
             map0 {
                 /* Set maximum frequency as 1200MHz  */
                 trip = <&apollo_alert_2>;
-                cooling-device = <&cpu0 1 2>, <&cpu1 1 2>,
-                         <&cpu2 1 2>, <&cpu3 1 2>;
+                cooling-device = <&cpu0 1 1>, <&cpu1 1 1>,
+                         <&cpu2 1 1>, <&cpu3 1 1>;
             };
             map1 {
                 /* Set maximum frequency as 1100MHz  */
                 trip = <&apollo_alert_3>;
-                cooling-device = <&cpu0 2 3>, <&cpu1 2 3>,
-                         <&cpu2 2 3>, <&cpu3 2 3>;
+                cooling-device = <&cpu0 2 2>, <&cpu1 2 2>,
+                         <&cpu2 2 2>, <&cpu3 2 2>;
             };
             map2 {
                 /* Set maximum frequency as 1000MHz  */

Various Exynos targets never return to no cooling

Commit Message

Comments

Patch