[v3] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid

Message ID 6064157.lOV4Wx5bFT@rjwysocki.net (mailing list archive)
State Superseded, archived

Commit Message

Rafael J. Wysocki July 4, 2024, 11:46 a.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
if zone temperature is invalid") caused __thermal_zone_device_update()
to return early if the current thermal zone temperature was invalid.

This was done to avoid running handle_thermal_trip() and governor
callbacks in that case which led to confusion.  However, it went too
far because monitor_thermal_zone() still needs to be called even when
the zone temperature is invalid to ensure that it will be updated
eventually in case thermal polling is enabled and the driver has no
other means to notify the core of zone temperature changes (for example,
it does not register an interrupt handler or ACPI notifier).

Also, if the .set_trips() zone callback is expected to set up monitoring
interrupts for a thermal zone, it needs to be provided with valid
boundaries, and that can only be done if the zone temperature is known.

Accordingly, to ensure that __thermal_zone_device_update() will
run again after a failing zone temperature check, make it call
monitor_thermal_zone() regardless of whether or not the zone
temperature is valid and make the latter schedule a thermal zone
temperature update if the zone temperature is invalid even if
polling is not enabled for the thermal zone (however, if this
continues to fail, give up after some time).

Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/thermal/thermal_core.c |   13 ++++++++++++-
 drivers/thermal/thermal_core.h |    9 +++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)
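
The retry behaviour described in the changelog can be modelled in user space.
The following is only a sketch, not kernel code: simulate_recheck_backoff()
and struct backoff_result are hypothetical names, and the model assumes a
zone without polling whose temperature never becomes valid, starting from a
recheck delay of 1 jiffy and a cap of 30 * HZ as in the patch.

```c
#include <assert.h>

/* Mirrors THERMAL_MAX_RECHECK_DELAY from the patch: 30 seconds in jiffies. */
static unsigned long max_recheck_delay(unsigned long hz)
{
	return 30 * hz;
}

/* Hypothetical result type, used only by this model. */
struct backoff_result {
	unsigned int rechecks;		/* number of rechecks scheduled */
	unsigned long total_jiffies;	/* sum of all scheduled delays */
};

/*
 * User-space model of the new branch in monitor_thermal_zone() for a zone
 * whose temperature stays invalid: schedule a recheck, double the delay,
 * and stop once the delay exceeds the cap.
 */
static struct backoff_result simulate_recheck_backoff(unsigned long hz)
{
	struct backoff_result res = { 0, 0 };
	unsigned long delay = 1;	/* initial recheck_delay_jiffies */

	assert(hz > 0);

	while (delay <= max_recheck_delay(hz)) {
		res.rechecks++;
		res.total_jiffies += delay;	/* stands in for set_polling() */
		delay += delay;			/* double for the next attempt */
	}
	return res;
}
```

Under these assumptions, HZ=250 gives 13 rechecks spread over 8191 jiffies
(about 32.8 seconds), and HZ=1000 gives 15 rechecks over 32767 jiffies.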

Comments

Daniel Lezcano July 4, 2024, 12:49 p.m. UTC | #1
On 04/07/2024 13:46, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> if zone temperature is invalid") caused __thermal_zone_device_update()
> to return early if the current thermal zone temperature was invalid.
> 
> This was done to avoid running handle_thermal_trip() and governor
> callbacks in that case which led to confusion.  However, it went too
> far because monitor_thermal_zone() still needs to be called even when
> the zone temperature is invalid to ensure that it will be updated
> eventually in case thermal polling is enabled and the driver has no
> other means to notify the core of zone temperature changes (for example,
> it does not register an interrupt handler or ACPI notifier).
> 
> Also if the .set_trips() zone callback is expected to set up monitoring
> interrupts for a thermal zone, it needs to be provided with valid
> boundaries and that can only be done if the zone temperature is known.
> 
> Accordingly, to ensure that __thermal_zone_device_update() will
> run again after a failing zone temperature check, make it call
> monitor_thermal_zone() regardless of whether or not the zone
> temperature is valid and make the latter schedule a thermal zone
> temperature update if the zone temperature is invalid even if
> polling is not enabled for the thermal zone (however, if this
> continues to fail, give up after some time).

Rafael,

Do we agree that we should fix the current issue this way for now,
because we are close to the merge window, even though this is not the
proper fix?


> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
> Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>   drivers/thermal/thermal_core.c |   13 ++++++++++++-
>   drivers/thermal/thermal_core.h |    9 +++++++++
>   2 files changed, 21 insertions(+), 1 deletion(-)
> 
> Index: linux-pm/drivers/thermal/thermal_core.c
> ===================================================================
> --- linux-pm.orig/drivers/thermal/thermal_core.c
> +++ linux-pm/drivers/thermal/thermal_core.c
> @@ -300,6 +300,14 @@ static void monitor_thermal_zone(struct
>   		thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
>   	else if (tz->polling_delay_jiffies)
>   		thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
> +	else if (tz->temperature == THERMAL_TEMP_INVALID &&
> +		 tz->recheck_delay_jiffies <= THERMAL_MAX_RECHECK_DELAY) {
> +		thermal_zone_device_set_polling(tz, tz->recheck_delay_jiffies);
> +		/* Double the recheck delay for the next attempt. */
> +		tz->recheck_delay_jiffies += tz->recheck_delay_jiffies;
> +		if (tz->recheck_delay_jiffies > THERMAL_MAX_RECHECK_DELAY)
> +			dev_info(&tz->device, "Temperature unknown, giving up\n");
> +	}
>   }
>   
>   static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
> @@ -430,6 +438,7 @@ static void update_temperature(struct th
>   
>   	tz->last_temperature = tz->temperature;
>   	tz->temperature = temp;
> +	tz->recheck_delay_jiffies = 1;
>   
>   	trace_thermal_temperature(tz);
>   
> @@ -514,7 +523,7 @@ void __thermal_zone_device_update(struct
>   	update_temperature(tz);
>   
>   	if (tz->temperature == THERMAL_TEMP_INVALID)
> -		return;
> +		goto monitor;
>   
>   	tz->notify_event = event;
>   
> @@ -536,6 +545,7 @@ void __thermal_zone_device_update(struct
>   
>   	thermal_debug_update_trip_stats(tz);
>   
> +monitor:
>   	monitor_thermal_zone(tz);
>   }
>   
> @@ -1438,6 +1448,7 @@ thermal_zone_device_register_with_trips(
>   
>   	thermal_set_delay_jiffies(&tz->passive_delay_jiffies, passive_delay);
>   	thermal_set_delay_jiffies(&tz->polling_delay_jiffies, polling_delay);
> +	tz->recheck_delay_jiffies = 1;
>   
>   	/* sys I/F */
>   	/* Add nodes that are always present via .groups */
> Index: linux-pm/drivers/thermal/thermal_core.h
> ===================================================================
> --- linux-pm.orig/drivers/thermal/thermal_core.h
> +++ linux-pm/drivers/thermal/thermal_core.h
> @@ -67,6 +67,8 @@ struct thermal_governor {
>    * @polling_delay_jiffies: number of jiffies to wait between polls when
>    *			checking whether trip points have been crossed (0 for
>    *			interrupt driven systems)
> + * @recheck_delay_jiffies: delay after a failed thermal zone temperature check
> + * 			before attempting to check it again
>    * @temperature:	current temperature.  This is only for core code,
>    *			drivers should use thermal_zone_get_temp() to get the
>    *			current temperature
> @@ -108,6 +110,7 @@ struct thermal_zone_device {
>   	int num_trips;
>   	unsigned long passive_delay_jiffies;
>   	unsigned long polling_delay_jiffies;
> +	unsigned long recheck_delay_jiffies;
>   	int temperature;
>   	int last_temperature;
>   	int emul_temperature;
> @@ -133,6 +136,12 @@ struct thermal_zone_device {
>   	struct thermal_trip_desc trips[] __counted_by(num_trips);
>   };
>   
> +/*
> + * Maximum delay after a failing thermal zone temperature check before
> + * attempting to check it again (in jiffies).
> + */
> +#define THERMAL_MAX_RECHECK_DELAY	(30 * HZ)
> +
>   /* Default Thermal Governor */
>   #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
>   #define DEFAULT_THERMAL_GOVERNOR       "step_wise"
> 
> 
>
Neil Armstrong July 4, 2024, 12:52 p.m. UTC | #2
Hi,

On 04/07/2024 14:49, Daniel Lezcano wrote:
> On 04/07/2024 13:46, Rafael J. Wysocki wrote:
> 
> Rafael,
> 
> do we agree that we should fix somehow the current issue in this way because we are close to the merge window, but the proper fix is not doing that ?

I've tested this patch, but I have no opinion about it.

I sent https://lore.kernel.org/all/20240704-topic-sm8x50-upstream-fix-battmgr-temp-tz-warn-v1-1-9d66d6f6efde@linaro.org/ which
fixes the warning print, leaving the option for thermal core to update the tz once it becomes available,
which is the initial goal of this patchset.

Neil

Rafael J. Wysocki July 4, 2024, 2:21 p.m. UTC | #3
On Thu, Jul 4, 2024 at 2:49 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>
> On 04/07/2024 13:46, Rafael J. Wysocki wrote:
>
> Rafael,
>
> do we agree that we should fix somehow the current issue in this way
> because we are close to the merge window,

Yes.

> but the proper fix is not doing that ?

We need to decide what to do in general when __thermal_zone_get_temp()
returns an error.  A proper fix would result from that, but it would
require more time than is available IMV.  We can properly fix this in
6.11.

For 6.10 I see two options:

1. Apply the v2 of this patch:

https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net/

I slightly prefer it because it is simpler and doesn't change the size
of struct thermal_zone_device.  However, the clear disadvantage of it
is that it will poke at dead thermal zones indefinitely.

The THERMAL_RECHECK_DELAY_MS value in it can be adjusted.  Maybe 250
ms would be a better choice?

2. Apply this patch (ie. v3)

It is nicer to thermal zones that never become operational, but it may
miss thermal zones that become operational very late.
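
The trade-off between the two options can be sketched with a small user-space
model. This is an illustration only: the function names are hypothetical,
time is counted from the first failed temperature check, and scheduling
latency is ignored. The v2 model assumes a fixed recheck interval in
milliseconds; the v3 model uses the doubling delay capped at 30 * HZ.

```c
#include <assert.h>
#include <stdbool.h>

/* v2 model: rechecks a dead zone receives in a window, at a fixed interval. */
static unsigned long v2_polls_in_window(unsigned long window_ms,
					unsigned long delay_ms)
{
	assert(delay_ms > 0);
	return window_ms / delay_ms;
}

/* v3 model: time (in jiffies) of the last recheck before giving up. */
static unsigned long v3_last_recheck(unsigned long hz)
{
	unsigned long delay = 1, elapsed = 0;

	while (delay <= 30 * hz) {
		elapsed += delay;
		delay += delay;
	}
	return elapsed;
}

/* v3 model: is a zone that becomes ready at ready_at_jiffies still noticed? */
static bool v3_notices_zone(unsigned long ready_at_jiffies, unsigned long hz)
{
	return ready_at_jiffies <= v3_last_recheck(hz);
}
```

In this model, a zone that stays dead for a minute is polled 240 times by v2
with a 250 ms interval, while v3 with HZ=1000 stops after 32767 jiffies and
would not notice a zone that only becomes ready 40 seconds in.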

Rafael J. Wysocki July 4, 2024, 2:23 p.m. UTC | #4
Hi,

On Thu, Jul 4, 2024 at 2:52 PM Neil Armstrong <neil.armstrong@linaro.org> wrote:
>
> Hi,
>
> On 04/07/2024 14:49, Daniel Lezcano wrote:
> > On 04/07/2024 13:46, Rafael J. Wysocki wrote:
> >
> > Rafael,
> >
> > do we agree that we should fix somehow the current issue in this way because we are close to the merge window, but the proper fix is not doing that ?
>
> I've tested this patch, but I have no opinion about it.
>
> I sent https://lore.kernel.org/all/20240704-topic-sm8x50-upstream-fix-battmgr-temp-tz-warn-v1-1-9d66d6f6efde@linaro.org/ which
> fixes the warning print, leaving the option for thermal core to update the tz once it becomes available,
> which is the initial goal of this patchset.

Thank you!

I gather that I can use the v2 of the $subject patch without worrying
about the problem you have reported.
Daniel Lezcano July 4, 2024, 4:53 p.m. UTC | #5
On 04/07/2024 16:21, Rafael J. Wysocki wrote:
> On Thu, Jul 4, 2024 at 2:49 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>>
>> On 04/07/2024 13:46, Rafael J. Wysocki wrote:
>>
>> Rafael,
>>
>> do we agree that we should fix somehow the current issue in this way
>> because we are close to the merge window,
> 
> Yes.
> 
>> but the proper fix is not doing that ?
> 
> We need to decide what to do in general when __thermal_zone_get_temp()
> returns an error.  A proper fix would result from that, but it would
> require more time than is available IMV.  We can properly fix this in
> 6.11.

Right. In general, we should make the various functions
(update_temperature(), etc.) return values, so that
thermal_zone_device_update() can return a value as well.

Then we can catch the result in the initialization function and take
the proper action.

From a higher-level perspective, IMO the code contains too many
functions returning void. We should convert them to return values and
handle the error cases.
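
A minimal user-space sketch of that direction, assuming nothing about how the
kernel would actually do it: struct zone, zone_update_temperature(),
zone_update() and the two sample sensor callbacks are all hypothetical names,
and TEMP_INVALID is only a stand-in for THERMAL_TEMP_INVALID.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

#define TEMP_INVALID	INT_MIN	/* hypothetical stand-in sentinel */

/* Hypothetical, simplified zone: just a temperature and a read callback. */
struct zone {
	int temperature;
	int (*get_temp)(struct zone *z, int *temp);	/* 0 or -errno */
};

/* Returns 0 on success or the negative error code from the callback. */
static int zone_update_temperature(struct zone *z)
{
	int temp, ret;

	assert(z && z->get_temp);

	ret = z->get_temp(z, &temp);
	if (ret)
		return ret;	/* leave z->temperature untouched */

	z->temperature = temp;
	return 0;
}

/* Propagates the error so the caller can retry, defer, or bail out. */
static int zone_update(struct zone *z)
{
	int ret = zone_update_temperature(z);

	if (ret)
		return ret;

	/* Trip handling and governor callbacks would run here. */
	return 0;
}

/* Sample callbacks for illustration only. */
static int sensor_not_ready(struct zone *z, int *temp)
{
	(void)z;
	(void)temp;
	return -EAGAIN;
}

static int sensor_ok(struct zone *z, int *temp)
{
	(void)z;
	*temp = 45000;
	return 0;
}
```

With this shape, an initialization path could check the return value of
zone_update() and decide what to do when the first reading fails, instead of
inferring the failure from an invalid temperature afterwards.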

> For 6.10 I see two options:
> 
> 1. Apply the v2 of this patch:
> 
> https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net/
> 
> I slightly prefer it because it is simpler and doesn't change the size
> of struct thermal_zone_device.

I agree

>  However, the clear disadvantage of it
> is that it will poke at dead thermal zones indefinitely.

Yes, but the upside of this disadvantage is that it is so visible that
buggy routines will be brought to light, so they can be fixed. I don't
think there should be many of them, perhaps none.

> The THERMAL_RECHECK_DELAY_MS value in it can be adjusted.  Maybe 250
> ms would be a better choice?

Yes

> 2. Apply this patch (ie. v3)
> 
> It is nicer to thermal zones that never become operational, but it may
> miss thermal zones that become operational very late.

I would keep this v3 as a backup in case there are too many complaints,
but I doubt it.

Rafael J. Wysocki July 4, 2024, 4:58 p.m. UTC | #6
On Thu, Jul 4, 2024 at 6:53 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>
> On 04/07/2024 16:21, Rafael J. Wysocki wrote:
> > On Thu, Jul 4, 2024 at 2:49 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
> >>
> >> On 04/07/2024 13:46, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> >>> if zone temperature is invalid") caused __thermal_zone_device_update()
> >>> to return early if the current thermal zone temperature was invalid.
> >>>
> >>> This was done to avoid running handle_thermal_trip() and governor
> >>> callbacks in that case which led to confusion.  However, it went too
> >>> far because monitor_thermal_zone() still needs to be called even when
> >>> the zone temperature is invalid to ensure that it will be updated
> >>> eventually in case thermal polling is enabled and the driver has no
> >>> other means to notify the core of zone temperature changes (for example,
> >>> it does not register an interrupt handler or ACPI notifier).
> >>>
> >>> Also if the .set_trips() zone callback is expected to set up monitoring
> >>> interrupts for a thermal zone, it needs to be provided with valid
> >>> boundaries and that can only be done if the zone temperature is known.
> >>>
> >>> Accordingly, to ensure that __thermal_zone_device_update() will
> >>> run again after a failing zone temperature check, make it call
> >>> monitor_thermal_zone() regardless of whether or not the zone
> >>> temperature is valid and make the latter schedule a thermal zone
> >>> temperature update if the zone temperature is invalid even if
> >>> polling is not enabled for the thermal zone (however, if this
> >>> continues to fail, give up after some time).
> >>
> >> Rafael,
> >>
> >> do we agree that we should fix somehow the current issue in this way
> >> because we are close to the merge window,
> >
> > Yes.
> >
> >> but the proper fix is not doing that ?
> >
> > We need to decide what to do in general when __thermal_zone_get_temp()
> > returns an error.  A proper fix would result from that, but it would
> > require more time than is available IMV.  We can properly fix this in
> > 6.11.
>
> Right, in general we should make the different functions
> (update_temperature(), etc.) return values, so that
> thermal_zone_device_update() can return a value as well.
>
> So from there we can catch the result in the initialization function and
> do the proper actions.
>
> From a higher perspective, IMO the code contains too many functions
> returning void. We should convert them to return values and handle
> the error cases.
>
> > For 6.10 I see two options:
> >
> > 1. Apply the v2 of this patch:
> >
> > https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net/
> >
> > I slightly prefer it because it is simpler and doesn't change the size
> > of struct thermal_zone_device.
>
> I agree
>
> >  However, the clear disadvantage of it
> > is that it will poke at dead thermal zones indefinitely.
>
> Yes, but the upside of this disadvantage is that it is so visible that
> buggy routines will be brought to light, so they can be fixed. I
> don't think there should be many of them, perhaps none.
>
> > The THERMAL_RECHECK_DELAY_MS value in it can be adjusted.  Maybe 250
> > ms would be a better choice?
>
> Yes
>
> > 2. Apply this patch (ie. v3)
> >
> > It is nicer to thermal zones that never become operational, but it may
> > miss thermal zones that become operational very late.
>
> I would keep this v3 as a backup in case there are too many complaints,
> but I doubt there will be.

OK, I'll go for the v2 with THERMAL_RECHECK_DELAY_MS equal to 250 ms.

Thanks!
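
To make the trade-off between the two options concrete, here is a minimal
userspace sketch (not kernel code) of the two retry policies discussed above.
It assumes the v3 recheck delay starts at 1 jiffy, is doubled after every
scheduled recheck, and is reset only after a successful temperature read; it
also assumes v2 rechecks at a fixed THERMAL_RECHECK_DELAY_MS interval. The
names recheck_summary, v3_recheck_schedule() and v2_polls_in_window() are
hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>

/* Hypothetical summary type for this sketch; not part of the kernel. */
struct recheck_summary {
	unsigned int attempts;		/* rechecks scheduled before giving up */
	unsigned long total_jiffies;	/* sum of all scheduled recheck delays */
};

/*
 * Model of the v3 branch added to monitor_thermal_zone(): while the
 * temperature stays invalid and the delay does not exceed 30 * HZ,
 * schedule a recheck and double the delay for the next attempt.
 */
static struct recheck_summary v3_recheck_schedule(unsigned long hz)
{
	struct recheck_summary s = { 0, 0 };
	unsigned long max_delay;
	unsigned long delay = 1;	/* initial recheck_delay_jiffies */

	assert(hz > 0);
	max_delay = 30 * hz;		/* THERMAL_MAX_RECHECK_DELAY */

	while (delay <= max_delay) {
		s.attempts++;
		s.total_jiffies += delay;
		delay += delay;		/* same doubling as in the patch */
	}

	return s;
}

/*
 * Model of the v2 policy: recheck at a fixed interval forever. Returns
 * how many rechecks fall into a window of the given length.
 */
static unsigned long v2_polls_in_window(unsigned long window_ms,
					unsigned long recheck_ms)
{
	assert(recheck_ms > 0);
	return window_ms / recheck_ms;
}
```

Under these assumptions, v3_recheck_schedule(250) yields 13 scheduled rechecks
spanning 8191 jiffies (about 32.8 s at HZ=250), after which the zone is no
longer polled; a zone that becomes operational later than that, and has no
other way to notify the core, would be missed. With the v2 policy at 250 ms,
v2_polls_in_window(30000, 250) gives 120 rechecks in the first 30 seconds, and
the rechecks continue at that rate for as long as the zone stays dead.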

> >>> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> >>> Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> >>> Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
> >>> Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> ---
> >>>    drivers/thermal/thermal_core.c |   13 ++++++++++++-
> >>>    drivers/thermal/thermal_core.h |    9 +++++++++
> >>>    2 files changed, 21 insertions(+), 1 deletion(-)
> >>>
> >>> Index: linux-pm/drivers/thermal/thermal_core.c
> >>> ===================================================================
> >>> --- linux-pm.orig/drivers/thermal/thermal_core.c
> >>> +++ linux-pm/drivers/thermal/thermal_core.c
> >>> @@ -300,6 +300,14 @@ static void monitor_thermal_zone(struct
> >>>                thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
> >>>        else if (tz->polling_delay_jiffies)
> >>>                thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
> >>> +     else if (tz->temperature == THERMAL_TEMP_INVALID &&
> >>> +              tz->recheck_delay_jiffies <= THERMAL_MAX_RECHECK_DELAY) {
> >>> +             thermal_zone_device_set_polling(tz, tz->recheck_delay_jiffies);
> >>> +             /* Double the recheck delay for the next attempt. */
> >>> +             tz->recheck_delay_jiffies += tz->recheck_delay_jiffies;
> >>> +             if (tz->recheck_delay_jiffies > THERMAL_MAX_RECHECK_DELAY)
> >>> +                     dev_info(&tz->device, "Temperature unknown, giving up\n");
> >>> +     }
> >>>    }
> >>>
> >>>    static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
> >>> @@ -430,6 +438,7 @@ static void update_temperature(struct th
> >>>
> >>>        tz->last_temperature = tz->temperature;
> >>>        tz->temperature = temp;
> >>> +     tz->recheck_delay_jiffies = 1;
> >>>
> >>>        trace_thermal_temperature(tz);
> >>>
> >>> @@ -514,7 +523,7 @@ void __thermal_zone_device_update(struct
> >>>        update_temperature(tz);
> >>>
> >>>        if (tz->temperature == THERMAL_TEMP_INVALID)
> >>> -             return;
> >>> +             goto monitor;
> >>>
> >>>        tz->notify_event = event;
> >>>
> >>> @@ -536,6 +545,7 @@ void __thermal_zone_device_update(struct
> >>>
> >>>        thermal_debug_update_trip_stats(tz);
> >>>
> >>> +monitor:
> >>>        monitor_thermal_zone(tz);
> >>>    }
> >>>
> >>> @@ -1438,6 +1448,7 @@ thermal_zone_device_register_with_trips(
> >>>
> >>>        thermal_set_delay_jiffies(&tz->passive_delay_jiffies, passive_delay);
> >>>        thermal_set_delay_jiffies(&tz->polling_delay_jiffies, polling_delay);
> >>> +     tz->recheck_delay_jiffies = 1;
> >>>
> >>>        /* sys I/F */
> >>>        /* Add nodes that are always present via .groups */
> >>> Index: linux-pm/drivers/thermal/thermal_core.h
> >>> ===================================================================
> >>> --- linux-pm.orig/drivers/thermal/thermal_core.h
> >>> +++ linux-pm/drivers/thermal/thermal_core.h
> >>> @@ -67,6 +67,8 @@ struct thermal_governor {
> >>>     * @polling_delay_jiffies: number of jiffies to wait between polls when
> >>>     *                  checking whether trip points have been crossed (0 for
> >>>     *                  interrupt driven systems)
> >>> + * @recheck_delay_jiffies: delay after a failed thermal zone temperature check
> >>> + *                   before attempting to check it again
> >>>     * @temperature:    current temperature.  This is only for core code,
> >>>     *                  drivers should use thermal_zone_get_temp() to get the
> >>>     *                  current temperature
> >>> @@ -108,6 +110,7 @@ struct thermal_zone_device {
> >>>        int num_trips;
> >>>        unsigned long passive_delay_jiffies;
> >>>        unsigned long polling_delay_jiffies;
> >>> +     unsigned long recheck_delay_jiffies;
> >>>        int temperature;
> >>>        int last_temperature;
> >>>        int emul_temperature;
> >>> @@ -133,6 +136,12 @@ struct thermal_zone_device {
> >>>        struct thermal_trip_desc trips[] __counted_by(num_trips);
> >>>    };
> >>>
> >>> +/*
> >>> + * Maximum delay after a failing thermal zone temperature check before
> >>> + * attempting to check it again (in jiffies).
> >>> + */
> >>> +#define THERMAL_MAX_RECHECK_DELAY    (30 * HZ)
> >>> +
> >>>    /* Default Thermal Governor */
> >>>    #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
> >>>    #define DEFAULT_THERMAL_GOVERNOR       "step_wise"
> >>>
> >>>
> >>>

Patch

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -300,6 +300,14 @@  static void monitor_thermal_zone(struct
 		thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
 	else if (tz->polling_delay_jiffies)
 		thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
+	else if (tz->temperature == THERMAL_TEMP_INVALID &&
+		 tz->recheck_delay_jiffies <= THERMAL_MAX_RECHECK_DELAY) {
+		thermal_zone_device_set_polling(tz, tz->recheck_delay_jiffies);
+		/* Double the recheck delay for the next attempt. */
+		tz->recheck_delay_jiffies += tz->recheck_delay_jiffies;
+		if (tz->recheck_delay_jiffies > THERMAL_MAX_RECHECK_DELAY)
+			dev_info(&tz->device, "Temperature unknown, giving up\n");
+	}
 }
 
 static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
@@ -430,6 +438,7 @@  static void update_temperature(struct th
 
 	tz->last_temperature = tz->temperature;
 	tz->temperature = temp;
+	tz->recheck_delay_jiffies = 1;
 
 	trace_thermal_temperature(tz);
 
@@ -514,7 +523,7 @@  void __thermal_zone_device_update(struct
 	update_temperature(tz);
 
 	if (tz->temperature == THERMAL_TEMP_INVALID)
-		return;
+		goto monitor;
 
 	tz->notify_event = event;
 
@@ -536,6 +545,7 @@  void __thermal_zone_device_update(struct
 
 	thermal_debug_update_trip_stats(tz);
 
+monitor:
 	monitor_thermal_zone(tz);
 }
 
@@ -1438,6 +1448,7 @@  thermal_zone_device_register_with_trips(
 
 	thermal_set_delay_jiffies(&tz->passive_delay_jiffies, passive_delay);
 	thermal_set_delay_jiffies(&tz->polling_delay_jiffies, polling_delay);
+	tz->recheck_delay_jiffies = 1;
 
 	/* sys I/F */
 	/* Add nodes that are always present via .groups */
Index: linux-pm/drivers/thermal/thermal_core.h
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.h
+++ linux-pm/drivers/thermal/thermal_core.h
@@ -67,6 +67,8 @@  struct thermal_governor {
  * @polling_delay_jiffies: number of jiffies to wait between polls when
  *			checking whether trip points have been crossed (0 for
  *			interrupt driven systems)
+ * @recheck_delay_jiffies: delay after a failed thermal zone temperature check
+ * 			before attempting to check it again
  * @temperature:	current temperature.  This is only for core code,
  *			drivers should use thermal_zone_get_temp() to get the
  *			current temperature
@@ -108,6 +110,7 @@  struct thermal_zone_device {
 	int num_trips;
 	unsigned long passive_delay_jiffies;
 	unsigned long polling_delay_jiffies;
+	unsigned long recheck_delay_jiffies;
 	int temperature;
 	int last_temperature;
 	int emul_temperature;
@@ -133,6 +136,12 @@  struct thermal_zone_device {
 	struct thermal_trip_desc trips[] __counted_by(num_trips);
 };
 
+/*
+ * Maximum delay after a failing thermal zone temperature check before
+ * attempting to check it again (in jiffies).
+ */
+#define THERMAL_MAX_RECHECK_DELAY	(30 * HZ)
+
 /* Default Thermal Governor */
 #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
 #define DEFAULT_THERMAL_GOVERNOR       "step_wise"