
[v3] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid

Message ID 6064157.lOV4Wx5bFT@rjwysocki.net (mailing list archive)
State Superseded, archived

Commit Message

Rafael J. Wysocki July 4, 2024, 11:46 a.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
if zone temperature is invalid") caused __thermal_zone_device_update()
to return early if the current thermal zone temperature was invalid.

This was done to avoid running handle_thermal_trip() and governor
callbacks in that case which led to confusion.  However, it went too
far because monitor_thermal_zone() still needs to be called even when
the zone temperature is invalid to ensure that it will be updated
eventually in case thermal polling is enabled and the driver has no
other means to notify the core of zone temperature changes (for example,
it does not register an interrupt handler or ACPI notifier).

Also if the .set_trips() zone callback is expected to set up monitoring
interrupts for a thermal zone, it needs to be provided with valid
boundaries and that can only be done if the zone temperature is known.

Accordingly, to ensure that __thermal_zone_device_update() will
run again after a failing zone temperature check, make it call
monitor_thermal_zone() regardless of whether or not the zone
temperature is valid and make the latter schedule a thermal zone
temperature update if the zone temperature is invalid even if
polling is not enabled for the thermal zone (however, if this
continues to fail, give up after some time).

Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/thermal/thermal_core.c |   13 ++++++++++++-
 drivers/thermal/thermal_core.h |    9 +++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)
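
The control-flow change described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: struct zone_model, update_model() and the two flags are hypothetical names made up for illustration, and TEMP_INVALID merely stands in for THERMAL_TEMP_INVALID.

```c
#include <stdbool.h>

/* Stand-in for the kernel's THERMAL_TEMP_INVALID (assumed value). */
#define TEMP_INVALID (-274000)

/* Hypothetical model of a thermal zone, for illustration only. */
struct zone_model {
	int temperature;
	bool trips_handled;	/* trip and governor handling ran */
	bool monitor_called;	/* the monitoring step ran */
};

/*
 * Mirrors the shape of the patched update path: an invalid temperature
 * skips trip handling but still reaches the monitoring step instead of
 * returning early.
 */
void update_model(struct zone_model *z, int temp)
{
	z->temperature = temp;
	z->trips_handled = false;
	z->monitor_called = false;

	if (z->temperature == TEMP_INVALID)
		goto monitor;

	z->trips_handled = true;

monitor:
	z->monitor_called = true;
}
```

With the early return that the Fixes: commit introduced, the invalid case would leave monitor_called false, which is the situation the patch is meant to avoid.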

Comments

Daniel Lezcano July 4, 2024, 12:49 p.m. UTC | #1
On 04/07/2024 13:46, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> if zone temperature is invalid") caused __thermal_zone_device_update()
> to return early if the current thermal zone temperature was invalid.
> 
> This was done to avoid running handle_thermal_trip() and governor
> callbacks in that case which led to confusion.  However, it went too
> far because monitor_thermal_zone() still needs to be called even when
> the zone temperature is invalid to ensure that it will be updated
> eventually in case thermal polling is enabled and the driver has no
> other means to notify the core of zone temperature changes (for example,
> it does not register an interrupt handler or ACPI notifier).
> 
> Also if the .set_trips() zone callback is expected to set up monitoring
> interrupts for a thermal zone, it needs to be provided with valid
> boundaries and that can only be done if the zone temperature is known.
> 
> Accordingly, to ensure that __thermal_zone_device_update() will
> run again after a failing zone temperature check, make it call
> monitor_thermal_zone() regardless of whether or not the zone
> temperature is valid and make the latter schedule a thermal zone
> temperature update if the zone temperature is invalid even if
> polling is not enabled for the thermal zone (however, if this
> continues to fail, give up after some time).

Rafael,

do we agree that we should fix somehow the current issue in this way 
because we are close to the merge window, but the proper fix is not 
doing that ?


> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
> Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>   drivers/thermal/thermal_core.c |   13 ++++++++++++-
>   drivers/thermal/thermal_core.h |    9 +++++++++
>   2 files changed, 21 insertions(+), 1 deletion(-)
> 
> Index: linux-pm/drivers/thermal/thermal_core.c
> ===================================================================
> --- linux-pm.orig/drivers/thermal/thermal_core.c
> +++ linux-pm/drivers/thermal/thermal_core.c
> @@ -300,6 +300,14 @@ static void monitor_thermal_zone(struct
>   		thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
>   	else if (tz->polling_delay_jiffies)
>   		thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
> +	else if (tz->temperature == THERMAL_TEMP_INVALID &&
> +		 tz->recheck_delay_jiffies <= THERMAL_MAX_RECHECK_DELAY) {
> +		thermal_zone_device_set_polling(tz, tz->recheck_delay_jiffies);
> +		/* Double the recheck delay for the next attempt. */
> +		tz->recheck_delay_jiffies += tz->recheck_delay_jiffies;
> +		if (tz->recheck_delay_jiffies > THERMAL_MAX_RECHECK_DELAY)
> +			dev_info(&tz->device, "Temperature unknown, giving up\n");
> +	}
>   }
>   
>   static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
> @@ -430,6 +438,7 @@ static void update_temperature(struct th
>   
>   	tz->last_temperature = tz->temperature;
>   	tz->temperature = temp;
> +	tz->recheck_delay_jiffies = 1;
>   
>   	trace_thermal_temperature(tz);
>   
> @@ -514,7 +523,7 @@ void __thermal_zone_device_update(struct
>   	update_temperature(tz);
>   
>   	if (tz->temperature == THERMAL_TEMP_INVALID)
> -		return;
> +		goto monitor;
>   
>   	tz->notify_event = event;
>   
> @@ -536,6 +545,7 @@ void __thermal_zone_device_update(struct
>   
>   	thermal_debug_update_trip_stats(tz);
>   
> +monitor:
>   	monitor_thermal_zone(tz);
>   }
>   
> @@ -1438,6 +1448,7 @@ thermal_zone_device_register_with_trips(
>   
>   	thermal_set_delay_jiffies(&tz->passive_delay_jiffies, passive_delay);
>   	thermal_set_delay_jiffies(&tz->polling_delay_jiffies, polling_delay);
> +	tz->recheck_delay_jiffies = 1;
>   
>   	/* sys I/F */
>   	/* Add nodes that are always present via .groups */
> Index: linux-pm/drivers/thermal/thermal_core.h
> ===================================================================
> --- linux-pm.orig/drivers/thermal/thermal_core.h
> +++ linux-pm/drivers/thermal/thermal_core.h
> @@ -67,6 +67,8 @@ struct thermal_governor {
>    * @polling_delay_jiffies: number of jiffies to wait between polls when
>    *			checking whether trip points have been crossed (0 for
>    *			interrupt driven systems)
> + * @recheck_delay_jiffies: delay after a failed thermal zone temperature check
> + * 			before attempting to check it again
>    * @temperature:	current temperature.  This is only for core code,
>    *			drivers should use thermal_zone_get_temp() to get the
>    *			current temperature
> @@ -108,6 +110,7 @@ struct thermal_zone_device {
>   	int num_trips;
>   	unsigned long passive_delay_jiffies;
>   	unsigned long polling_delay_jiffies;
> +	unsigned long recheck_delay_jiffies;
>   	int temperature;
>   	int last_temperature;
>   	int emul_temperature;
> @@ -133,6 +136,12 @@ struct thermal_zone_device {
>   	struct thermal_trip_desc trips[] __counted_by(num_trips);
>   };
>   
> +/*
> + * Maximum delay after a failing thermal zone temperature check before
> + * attempting to check it again (in jiffies).
> + */
> +#define THERMAL_MAX_RECHECK_DELAY	(30 * HZ)
> +
>   /* Default Thermal Governor */
>   #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
>   #define DEFAULT_THERMAL_GOVERNOR       "step_wise"
> 
> 
>
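
The recheck logic in the monitor_thermal_zone() hunk quoted above can be modeled in user space to see how long a zone without polling keeps being rechecked. The following is a rough sketch, not kernel code: simulate_recheck() and struct recheck_result are hypothetical names, the tick rate is passed in as a parameter because HZ is a build-time option, and the zone temperature is assumed to stay invalid for the whole run.

```c
/* Hypothetical result record, for illustration only. */
struct recheck_result {
	unsigned int attempts;		/* rechecks scheduled before giving up */
	unsigned long total_jiffies;	/* sum of all scheduled delays */
};

/*
 * Each pass schedules a recheck after `delay` jiffies and then doubles
 * the delay, as the patch does.  Rechecking stops once the delay
 * exceeds the maximum, which the patch defines as 30 * HZ.
 */
struct recheck_result simulate_recheck(unsigned long hz)
{
	const unsigned long max_delay = 30 * hz;
	struct recheck_result res = { 0, 0 };
	unsigned long delay = 1;	/* initial recheck delay in the patch */

	while (delay <= max_delay) {
		res.attempts++;
		res.total_jiffies += delay;
		delay += delay;
	}
	return res;
}
```

Under these assumptions, simulate_recheck(250) reports 13 scheduled rechecks spread over 8191 jiffies (roughly 32.8 seconds), and simulate_recheck(1000) reports 15 rechecks over 32767 jiffies, so the give-up point lands a little after 30 seconds rather than exactly on it.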
Neil Armstrong July 4, 2024, 12:52 p.m. UTC | #2
Hi,

On 04/07/2024 14:49, Daniel Lezcano wrote:
> On 04/07/2024 13:46, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
>> if zone temperature is invalid") caused __thermal_zone_device_update()
>> to return early if the current thermal zone temperature was invalid.
>>
>> This was done to avoid running handle_thermal_trip() and governor
>> callbacks in that case which led to confusion.  However, it went too
>> far because monitor_thermal_zone() still needs to be called even when
>> the zone temperature is invalid to ensure that it will be updated
>> eventually in case thermal polling is enabled and the driver has no
>> other means to notify the core of zone temperature changes (for example,
>> it does not register an interrupt handler or ACPI notifier).
>>
>> Also if the .set_trips() zone callback is expected to set up monitoring
>> interrupts for a thermal zone, it needs to be provided with valid
>> boundaries and that can only be done if the zone temperature is known.
>>
>> Accordingly, to ensure that __thermal_zone_device_update() will
>> run again after a failing zone temperature check, make it call
>> monitor_thermal_zone() regardless of whether or not the zone
>> temperature is valid and make the latter schedule a thermal zone
>> temperature update if the zone temperature is invalid even if
>> polling is not enabled for the thermal zone (however, if this
>> continues to fail, give up after some time).
> 
> Rafael,
> 
> do we agree that we should fix somehow the current issue in this way because we are close to the merge window, but the proper fix is not doing that ?

I've tested this patch, but I have no opinion about it.

I sent https://lore.kernel.org/all/20240704-topic-sm8x50-upstream-fix-battmgr-temp-tz-warn-v1-1-9d66d6f6efde@linaro.org/ which
fixes the warning print, leaving the option for thermal core to update the tz once it becomes available,
which is the initial goal of this patchset.

Neil

Rafael J. Wysocki July 4, 2024, 2:21 p.m. UTC | #3
On Thu, Jul 4, 2024 at 2:49 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>
> On 04/07/2024 13:46, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> > if zone temperature is invalid") caused __thermal_zone_device_update()
> > to return early if the current thermal zone temperature was invalid.
> >
> > This was done to avoid running handle_thermal_trip() and governor
> > callbacks in that case which led to confusion.  However, it went too
> > far because monitor_thermal_zone() still needs to be called even when
> > the zone temperature is invalid to ensure that it will be updated
> > eventually in case thermal polling is enabled and the driver has no
> > other means to notify the core of zone temperature changes (for example,
> > it does not register an interrupt handler or ACPI notifier).
> >
> > Also if the .set_trips() zone callback is expected to set up monitoring
> > interrupts for a thermal zone, it needs to be provided with valid
> > boundaries and that can only be done if the zone temperature is known.
> >
> > Accordingly, to ensure that __thermal_zone_device_update() will
> > run again after a failing zone temperature check, make it call
> > monitor_thermal_zone() regardless of whether or not the zone
> > temperature is valid and make the latter schedule a thermal zone
> > temperature update if the zone temperature is invalid even if
> > polling is not enabled for the thermal zone (however, if this
> > continues to fail, give up after some time).
>
> Rafael,
>
> do we agree that we should fix somehow the current issue in this way
> because we are close to the merge window,

Yes.

> but the proper fix is not doing that ?

We need to decide what to do in general when __thermal_zone_get_temp()
returns an error.  A proper fix would result from that, but it would
require more time than is available IMV.  We can properly fix this in
6.11.

For 6.10 I see two options:

1. Apply the v2 of this patch:

https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net/

I slightly prefer it because it is simpler and doesn't change the size
of struct thermal_zone_device.  However, the clear disadvantage of it
is that it will poke at dead thermal zones indefinitely.

The THERMAL_RECHECK_DELAY_MS value in it can be adjusted.  Maybe 250
ms would be a better choice?

2. Apply this patch (ie. v3)

It is nicer to thermal zones that never become operational, but it may
miss thermal zones that become operational very late.
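
The trade-off between the two options can be made concrete with a small user-space sketch. It is not taken from either patch: the function names are hypothetical, times are in milliseconds under the assumption that one jiffy is 1 ms (HZ=1000), and the 250 ms interval is only the value floated above, not necessarily what v2 uses.

```c
/* Rechecks a fixed-interval scheme (as in v2) issues within window_ms. */
unsigned long fixed_rechecks(unsigned long window_ms,
			     unsigned long interval_ms)
{
	return interval_ms ? window_ms / interval_ms : 0;
}

/*
 * Time in ms at which a doubling scheme (as in v3) issues its last
 * recheck: the delays first_ms, 2 * first_ms, ... are summed for as
 * long as they do not exceed max_delay_ms.
 */
unsigned long doubling_last_recheck_ms(unsigned long first_ms,
				       unsigned long max_delay_ms)
{
	unsigned long elapsed = 0;
	unsigned long delay = first_ms;

	while (delay && delay <= max_delay_ms) {
		elapsed += delay;
		delay += delay;
	}
	return elapsed;
}

/* Nonzero if a zone that becomes readable at ready_ms is still rechecked. */
int doubling_catches(unsigned long ready_ms, unsigned long first_ms,
		     unsigned long max_delay_ms)
{
	return ready_ms <= doubling_last_recheck_ms(first_ms, max_delay_ms);
}
```

With these numbers a fixed 250 ms interval pokes a dead zone 240 times per minute and never stops, while the doubling scheme issues its last recheck at 32767 ms, so a zone that only becomes readable after a minute would be missed.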

Rafael J. Wysocki July 4, 2024, 2:23 p.m. UTC | #4
Hi,

On Thu, Jul 4, 2024 at 2:52 PM Neil Armstrong <neil.armstrong@linaro.org> wrote:
>
> Hi,
>
> On 04/07/2024 14:49, Daniel Lezcano wrote:
> > On 04/07/2024 13:46, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> >> if zone temperature is invalid") caused __thermal_zone_device_update()
> >> to return early if the current thermal zone temperature was invalid.
> >>
> >> This was done to avoid running handle_thermal_trip() and governor
> >> callbacks in that case which led to confusion.  However, it went too
> >> far because monitor_thermal_zone() still needs to be called even when
> >> the zone temperature is invalid to ensure that it will be updated
> >> eventually in case thermal polling is enabled and the driver has no
> >> other means to notify the core of zone temperature changes (for example,
> >> it does not register an interrupt handler or ACPI notifier).
> >>
> >> Also if the .set_trips() zone callback is expected to set up monitoring
> >> interrupts for a thermal zone, it needs to be provided with valid
> >> boundaries and that can only be done if the zone temperature is known.
> >>
> >> Accordingly, to ensure that __thermal_zone_device_update() will
> >> run again after a failing zone temperature check, make it call
> >> monitor_thermal_zone() regardless of whether or not the zone
> >> temperature is valid and make the latter schedule a thermal zone
> >> temperature update if the zone temperature is invalid even if
> >> polling is not enabled for the thermal zone (however, if this
> >> continues to fail, give up after some time).
> >
> > Rafael,
> >
> > do we agree that we should fix somehow the current issue in this way because we are close to the merge window, but the proper fix is not doing that ?
>
> I've tested this patch, but I have no opinion about it.
>
> I sent https://lore.kernel.org/all/20240704-topic-sm8x50-upstream-fix-battmgr-temp-tz-warn-v1-1-9d66d6f6efde@linaro.org/ which
> fixes the warning print, leaving the option for thermal core to update the tz once it becomes available,
> which is the initial goal of this patchset.

Thank you!

I gather that I can use the v2 of the $subject patch without worrying
about the problem you have reported.
Daniel Lezcano July 4, 2024, 4:53 p.m. UTC | #5
On 04/07/2024 16:21, Rafael J. Wysocki wrote:
> On Thu, Jul 4, 2024 at 2:49 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>>
>> On 04/07/2024 13:46, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
>>> if zone temperature is invalid") caused __thermal_zone_device_update()
>>> to return early if the current thermal zone temperature was invalid.
>>>
>>> This was done to avoid running handle_thermal_trip() and governor
>>> callbacks in that case which led to confusion.  However, it went too
>>> far because monitor_thermal_zone() still needs to be called even when
>>> the zone temperature is invalid to ensure that it will be updated
>>> eventually in case thermal polling is enabled and the driver has no
>>> other means to notify the core of zone temperature changes (for example,
>>> it does not register an interrupt handler or ACPI notifier).
>>>
>>> Also if the .set_trips() zone callback is expected to set up monitoring
>>> interrupts for a thermal zone, it needs to be provided with valid
>>> boundaries and that can only be done if the zone temperature is known.
>>>
>>> Accordingly, to ensure that __thermal_zone_device_update() will
>>> run again after a failing zone temperature check, make it call
>>> monitor_thermal_zone() regardless of whether or not the zone
>>> temperature is valid and make the latter schedule a thermal zone
>>> temperature update if the zone temperature is invalid even if
>>> polling is not enabled for the thermal zone (however, if this
>>> continues to fail, give up after some time).
>>
>> Rafael,
>>
>> do we agree that we should fix somehow the current issue in this way
>> because we are close to the merge window,
> 
> Yes.
> 
>> but the proper fix is not doing that ?
> 
> We need to decide what to do in general when __thermal_zone_get_temp()
> returns an error.  A proper fix would result from that, but it would
> require more time than is available IMV.  We can properly fix this in
> 6.11.

Right, in general we should make the different functions
(update_temperature(), etc.) return values, so that
thermal_zone_device_update() can return a value as well.

That way we can catch the result in the initialization function and
take the proper action.

From a higher perspective, IMO the code contains too many functions
returning void. We should convert them to return values and handle
the error cases.
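
As a rough illustration of that direction, here is a user-space sketch. None of it is existing kernel API: struct zone, its get_temp callback signature, the two sample sensors and the zone_*() helpers are all hypothetical, and treating -EAGAIN as "retry later" in zone_init() is just one possible policy chosen for the example.

```c
#include <errno.h>

/* Stand-in for the kernel's THERMAL_TEMP_INVALID (assumed value). */
#define TEMP_UNKNOWN (-274000)

/* Hypothetical zone with a simplified driver callback. */
struct zone {
	int temperature;
	int needs_recheck;
	int (*get_temp)(int *temp);
};

/* Sample callbacks: one sensor is not ready yet, the other one works. */
int sensor_not_ready(int *temp) { (void)temp; return -EAGAIN; }
int sensor_ok(int *temp) { *temp = 45000; return 0; }

/* Returns 0 on success or the callback's negative error code. */
int zone_update_temperature(struct zone *z)
{
	int temp;
	int ret = z->get_temp(&temp);

	if (ret)
		return ret;

	z->temperature = temp;
	return 0;
}

/* Propagates the error instead of swallowing it in a void function. */
int zone_update(struct zone *z)
{
	int ret = zone_update_temperature(z);

	if (ret)
		return ret;

	/* Trip handling would go here once the temperature is known. */
	return 0;
}

/* The initialization path can now decide what a failure means. */
int zone_init(struct zone *z)
{
	int ret;

	z->temperature = TEMP_UNKNOWN;
	z->needs_recheck = 0;

	ret = zone_update(z);
	if (ret == -EAGAIN) {
		z->needs_recheck = 1;	/* example policy: retry later */
		return 0;
	}
	return ret;
}
```

The point of the sketch is only that the caller sees the failure and can choose between retrying, failing registration, or carrying on, instead of inferring it from an invalid temperature value.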

> For 6.10 I see two options:
> 
> 1. Apply the v2 of this patch:
> 
> https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net/
> 
> I slightly prefer it because it is simpler and doesn't change the size
> of struct thermal_zone_device.

I agree

>  However, the clear disadvantage of it
> is that it will poke at dead thermal zones indefinitely.

Yes, but the upside of this disadvantage is that it is so visible that
buggy routines will be brought to light, so they can be fixed. I
don't think we should have many of them, perhaps none.

> The THERMAL_RECHECK_DELAY_MS value in it can be adjusted.  Maybe 250
> ms would be a better choice?

Yes

> 2. Apply this patch (ie. v3)
> 
> It is nicer to thermal zones that never become operational, but it may
> miss thermal zones that become operational very late.

I would keep this v3 as a backup in case there are too many complaints,
but I doubt there will be.

Rafael J. Wysocki July 4, 2024, 4:58 p.m. UTC | #6
On Thu, Jul 4, 2024 at 6:53 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
>
> On 04/07/2024 16:21, Rafael J. Wysocki wrote:
> > On Thu, Jul 4, 2024 at 2:49 PM Daniel Lezcano <daniel.lezcano@linaro.org> wrote:
> >>
> >> On 04/07/2024 13:46, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> >>> if zone temperature is invalid") caused __thermal_zone_device_update()
> >>> to return early if the current thermal zone temperature was invalid.
> >>>
> >>> This was done to avoid running handle_thermal_trip() and governor
> >>> callbacks in that case which led to confusion.  However, it went too
> >>> far because monitor_thermal_zone() still needs to be called even when
> >>> the zone temperature is invalid to ensure that it will be updated
> >>> eventually in case thermal polling is enabled and the driver has no
> >>> other means to notify the core of zone temperature changes (for example,
> >>> it does not register an interrupt handler or ACPI notifier).
> >>>
> >>> Also if the .set_trips() zone callback is expected to set up monitoring
> >>> interrupts for a thermal zone, it needs to be provided with valid
> >>> boundaries and that can only be done if the zone temperature is known.
> >>>
> >>> Accordingly, to ensure that __thermal_zone_device_update() will
> >>> run again after a failing zone temperature check, make it call
> >>> monitor_thermal_zone() regardless of whether or not the zone
> >>> temperature is valid and make the latter schedule a thermal zone
> >>> temperature update if the zone temperature is invalid even if
> >>> polling is not enabled for the thermal zone (however, if this
> >>> continues to fail, give up after some time).
> >>
> >> Rafael,
> >>
> >> do we agree that we should fix somehow the current issue in this way
> >> because we are close to the merge window,
> >
> > Yes.
> >
> >> but the proper fix is not doing that ?
> >
> > We need to decide what to do in general when __thermal_zone_get_temp()
> > returns an error.  A proper fix would result from that, but it would
> > require more time than is available IMV.  We can properly fix this in
> > 6.11.
>
> Right, in general we should propagate return values from the
> different functions (update_temperature(), etc.) so that
> thermal_zone_device_update() can return a value.
>
> From there we can catch the result in the initialization function and
> take the proper action.
>
> From a higher perspective, IMO the code contains too many functions
> returning void. We should convert them to return values and handle
> the error cases.
>
> > For 6.10 I see two options:
> >
> > 1. Apply the v2 of this patch:
> >
> > https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net/
> >
> > I slightly prefer it because it is simpler and doesn't change the size
> > of struct thermal_zone_device.
>
> I agree
>
> >  However, the clear disadvantage of it
> > is that it will poke at dead thermal zones indefinitely.
>
> Yes, but the upside of this disadvantage is that it is so visible that
> buggy routines will be brought to light, so they can be fixed. I
> don't think there will be many of them, perhaps none.
>
> > The THERMAL_RECHECK_DELAY_MS value in it can be adjusted.  Maybe 250
> > ms would be a better choice?
>
> Yes
>
> > 2. Apply this patch (ie. v3)
> >
> > It is nicer to thermal zones that never become operational, but it may
> > miss thermal zones that become operational very late.
>
> I would keep this v3 as a backup in case there are too many complaints,
> but I doubt it.

OK, I'll go for the v2 with THERMAL_RECHECK_DELAY_MS equal to 250 ms.

Thanks!

> >>> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> >>> Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> >>> Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
> >>> Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>> ---
> >>>    drivers/thermal/thermal_core.c |   13 ++++++++++++-
> >>>    drivers/thermal/thermal_core.h |    9 +++++++++
> >>>    2 files changed, 21 insertions(+), 1 deletion(-)
> >>>
> >>> Index: linux-pm/drivers/thermal/thermal_core.c
> >>> ===================================================================
> >>> --- linux-pm.orig/drivers/thermal/thermal_core.c
> >>> +++ linux-pm/drivers/thermal/thermal_core.c
> >>> @@ -300,6 +300,14 @@ static void monitor_thermal_zone(struct
> >>>                thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
> >>>        else if (tz->polling_delay_jiffies)
> >>>                thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
> >>> +     else if (tz->temperature == THERMAL_TEMP_INVALID &&
> >>> +              tz->recheck_delay_jiffies <= THERMAL_MAX_RECHECK_DELAY) {
> >>> +             thermal_zone_device_set_polling(tz, tz->recheck_delay_jiffies);
> >>> +             /* Double the recheck delay for the next attempt. */
> >>> +             tz->recheck_delay_jiffies += tz->recheck_delay_jiffies;
> >>> +             if (tz->recheck_delay_jiffies > THERMAL_MAX_RECHECK_DELAY)
> >>> +                     dev_info(&tz->device, "Temperature unknown, giving up\n");
> >>> +     }
> >>>    }
> >>>
> >>>    static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
> >>> @@ -430,6 +438,7 @@ static void update_temperature(struct th
> >>>
> >>>        tz->last_temperature = tz->temperature;
> >>>        tz->temperature = temp;
> >>> +     tz->recheck_delay_jiffies = 1;
> >>>
> >>>        trace_thermal_temperature(tz);
> >>>
> >>> @@ -514,7 +523,7 @@ void __thermal_zone_device_update(struct
> >>>        update_temperature(tz);
> >>>
> >>>        if (tz->temperature == THERMAL_TEMP_INVALID)
> >>> -             return;
> >>> +             goto monitor;
> >>>
> >>>        tz->notify_event = event;
> >>>
> >>> @@ -536,6 +545,7 @@ void __thermal_zone_device_update(struct
> >>>
> >>>        thermal_debug_update_trip_stats(tz);
> >>>
> >>> +monitor:
> >>>        monitor_thermal_zone(tz);
> >>>    }
> >>>
> >>> @@ -1438,6 +1448,7 @@ thermal_zone_device_register_with_trips(
> >>>
> >>>        thermal_set_delay_jiffies(&tz->passive_delay_jiffies, passive_delay);
> >>>        thermal_set_delay_jiffies(&tz->polling_delay_jiffies, polling_delay);
> >>> +     tz->recheck_delay_jiffies = 1;
> >>>
> >>>        /* sys I/F */
> >>>        /* Add nodes that are always present via .groups */
> >>> Index: linux-pm/drivers/thermal/thermal_core.h
> >>> ===================================================================
> >>> --- linux-pm.orig/drivers/thermal/thermal_core.h
> >>> +++ linux-pm/drivers/thermal/thermal_core.h
> >>> @@ -67,6 +67,8 @@ struct thermal_governor {
> >>>     * @polling_delay_jiffies: number of jiffies to wait between polls when
> >>>     *                  checking whether trip points have been crossed (0 for
> >>>     *                  interrupt driven systems)
> >>> + * @recheck_delay_jiffies: delay after a failed thermal zone temperature check
> >>> + *                   before attempting to check it again
> >>>     * @temperature:    current temperature.  This is only for core code,
> >>>     *                  drivers should use thermal_zone_get_temp() to get the
> >>>     *                  current temperature
> >>> @@ -108,6 +110,7 @@ struct thermal_zone_device {
> >>>        int num_trips;
> >>>        unsigned long passive_delay_jiffies;
> >>>        unsigned long polling_delay_jiffies;
> >>> +     unsigned long recheck_delay_jiffies;
> >>>        int temperature;
> >>>        int last_temperature;
> >>>        int emul_temperature;
> >>> @@ -133,6 +136,12 @@ struct thermal_zone_device {
> >>>        struct thermal_trip_desc trips[] __counted_by(num_trips);
> >>>    };
> >>>
> >>> +/*
> >>> + * Maximum delay after a failing thermal zone temperature check before
> >>> + * attempting to check it again (in jiffies).
> >>> + */
> >>> +#define THERMAL_MAX_RECHECK_DELAY    (30 * HZ)
> >>> +
> >>>    /* Default Thermal Governor */
> >>>    #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
> >>>    #define DEFAULT_THERMAL_GOVERNOR       "step_wise"
> >>>
> >>>
> >>>
> >>
>
Eric Biggers July 15, 2024, 4:45 a.m. UTC | #7
Hello,

On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> if zone temperature is invalid") caused __thermal_zone_device_update()
> to return early if the current thermal zone temperature was invalid.
> 
> This was done to avoid running handle_thermal_trip() and governor
> callbacks in that case which led to confusion.  However, it went too
> far because monitor_thermal_zone() still needs to be called even when
> the zone temperature is invalid to ensure that it will be updated
> eventually in case thermal polling is enabled and the driver has no
> other means to notify the core of zone temperature changes (for example,
> it does not register an interrupt handler or ACPI notifier).
> 
> Also if the .set_trips() zone callback is expected to set up monitoring
> interrupts for a thermal zone, it needs to be provided with valid
> boundaries and that can only be done if the zone temperature is known.
> 
> Accordingly, to ensure that __thermal_zone_device_update() will
> run again after a failing zone temperature check, make it call
> monitor_thermal_zone() regardless of whether or not the zone
> temperature is valid and make the latter schedule a thermal zone
> temperature update if the zone temperature is invalid even if
> polling is not enabled for the thermal zone (however, if this
> continues to fail, give up after some time).
> 
> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
> Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

On v6.10 I'm seeing the following messages spammed to the kernel log endlessly,
and reverting this commit fixes it.

    [  156.410567] thermal thermal_zone0: failed to read out thermal zone (-61)
    [  156.666583] thermal thermal_zone0: failed to read out thermal zone (-61)
    [  156.922598] thermal thermal_zone0: failed to read out thermal zone (-61)
    [  157.178613] thermal thermal_zone0: failed to read out thermal zone (-61)
    [  157.434636] thermal thermal_zone0: failed to read out thermal zone (-61)
    [  157.690774] thermal thermal_zone0: failed to read out thermal zone (-61)
    [  157.946659] thermal thermal_zone0: failed to read out thermal zone (-61)
    [  158.202717] thermal thermal_zone0: failed to read out thermal zone (-61)
    [  158.458697] thermal thermal_zone0: failed to read out thermal zone (-61)

/sys/class/thermal/thermal_zone0/type contains "iwlwifi_1".

- Eric
Stefan Lippers-Hollmann July 15, 2024, 9:06 a.m. UTC | #8
Hi

On 2024-07-14, Eric Biggers wrote:
> On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> > if zone temperature is invalid") caused __thermal_zone_device_update()
> > to return early if the current thermal zone temperature was invalid.
> >
> > This was done to avoid running handle_thermal_trip() and governor
> > callbacks in that case which led to confusion.  However, it went too
> > far because monitor_thermal_zone() still needs to be called even when
> > the zone temperature is invalid to ensure that it will be updated
> > eventually in case thermal polling is enabled and the driver has no
> > other means to notify the core of zone temperature changes (for example,
> > it does not register an interrupt handler or ACPI notifier).
> >
> > Also if the .set_trips() zone callback is expected to set up monitoring
> > interrupts for a thermal zone, it needs to be provided with valid
> > boundaries and that can only be done if the zone temperature is known.
> >
> > Accordingly, to ensure that __thermal_zone_device_update() will
> > run again after a failing zone temperature check, make it call
> > monitor_thermal_zone() regardless of whether or not the zone
> > temperature is valid and make the latter schedule a thermal zone
> > temperature update if the zone temperature is invalid even if
> > polling is not enabled for the thermal zone (however, if this
> > continues to fail, give up after some time).
> >
> > Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> > Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
> > Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> On v6.10 I'm seeing the following messages spammed to the kernel log endlessly,
> and reverting this commit fixes it.
>
>     [  156.410567] thermal thermal_zone0: failed to read out thermal zone (-61)
[...]
>     [  158.458697] thermal thermal_zone0: failed to read out thermal zone (-61)
>
> /sys/class/thermal/thermal_zone0/type contains "iwlwifi_1".

I am observing the same issue on v6.10 with an Intel AX200 WLAN
card in a Kaby Lake (i5-7400) system with a Fujitsu D3400-B22
mainboard and the latest BIOS (V5.0.0.12 R1.29.0):

$ dmesg | grep -i -e iwlwifi -e thermal_zone2
[    3.692433] iwlwifi 0000:04:00.0: enabling device (0140 -> 0142)
[    3.698547] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
[    3.698556] iwlwifi 0000:04:00.0: PCI dev 2723/0084, rev=0x340, rfid=0x10a100
[    3.703292] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
[    3.797296] iwlwifi 0000:04:00.0: loaded firmware version 77.a20fb07d.0 cc-a0-77.ucode op_mode iwlmvm
[    4.090341] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[    4.090524] thermal thermal_zone2: failed to read out thermal zone (-61)
[    4.218496] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
[    4.285399] iwlwifi 0000:04:00.0: base HW address: 94:e6:f7:XX:XX:XX
[    4.341754] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
[    4.345445] thermal thermal_zone2: failed to read out thermal zone (-61)
[    4.601400] thermal thermal_zone2: failed to read out thermal zone (-61)
[    4.857372] thermal thermal_zone2: failed to read out thermal zone (-61)
[    5.114387] thermal thermal_zone2: failed to read out thermal zone (-61)
[...]
[  143.643801] thermal thermal_zone2: failed to read out thermal zone (-61)
[  143.899818] thermal thermal_zone2: failed to read out thermal zone (-61)
[  144.155813] thermal thermal_zone2: failed to read out thermal zone (-61)
[  144.411815] thermal thermal_zone2: failed to read out thermal zone (-61)
[  144.667828] thermal thermal_zone2: failed to read out thermal zone (-61)
[  144.923801] thermal thermal_zone2: failed to read out thermal zone (-61)
[  145.179822] thermal thermal_zone2: failed to read out thermal zone (-61)
[...]

$ cat  /sys/class/thermal/thermal_zone2/type
iwlwifi_1

38cba05a86d157685d930a4400022eb4  /lib/firmware/iwlwifi-cc-a0-77.ucode
ce9c6e3bda22003f9a9b97cbca94b8215911b7a146c0f4f017963dbb1a233351  /lib/firmware/iwlwifi-cc-a0-77.ucode

git bisect led me to this commit as part of kernel v6.10:

$ LANG= git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [0c3836482481200ead7b416ca80c68a29cfdaabd] Linux 6.10
git bisect bad 0c3836482481200ead7b416ca80c68a29cfdaabd
# status: waiting for good commit(s), bad commit known
# good: [a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6] Linux 6.9
git bisect good a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
# good: [33e02dc69afbd8f1b85a51d74d72f139ba4ca623] Merge tag 'sound-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 33e02dc69afbd8f1b85a51d74d72f139ba4ca623
# good: [29c73fc794c83505066ee6db893b2a83ac5fac63] Merge tag 'perf-tools-for-v6.10-1-2024-05-21' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools
git bisect good 29c73fc794c83505066ee6db893b2a83ac5fac63
# good: [e159d63e6940a2a16bb73616d8c528e93b84a6bb] Merge tag 'kvm-riscv-fixes-6.10-2' of https://github.com/kvm-riscv/linux into HEAD
git bisect good e159d63e6940a2a16bb73616d8c528e93b84a6bb
# good: [d1505b5cd0426bbddbbc99f10e3ae0b52aaa1d1f] Merge tag 'powerpc-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
git bisect good d1505b5cd0426bbddbbc99f10e3ae0b52aaa1d1f
# good: [4a0929b0062a6b04207a414be9be97eb22965bc1] Merge tag 'media/v6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect good 4a0929b0062a6b04207a414be9be97eb22965bc1
# bad: [ef2b7eb55e10294f4f384f21506ef20a6184128c] Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad ef2b7eb55e10294f4f384f21506ef20a6184128c
# good: [968460731f95be9977bc59a513acbc5afc71117d] Merge tag 'gpio-fixes-for-v6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
git bisect good 968460731f95be9977bc59a513acbc5afc71117d
# good: [5a4bd506ddad75f1f2711cfbcf7551a5504e3f1e] Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect good 5a4bd506ddad75f1f2711cfbcf7551a5504e3f1e
# bad: [a19ea421490dcc45c9f78145bb2703ac5d373b28] Merge tag 'platform-drivers-x86-v6.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
git bisect bad a19ea421490dcc45c9f78145bb2703ac5d373b28
# good: [34afb82a3c67f869267a26f593b6f8fc6bf35905] Merge tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd
git bisect good 34afb82a3c67f869267a26f593b6f8fc6bf35905
# bad: [d045c46c52740b0d5e92d376f0b7843b0c0d935a] Merge tag 'thermal-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad d045c46c52740b0d5e92d376f0b7843b0c0d935a
# bad: [94eacc1c583dd2ba51a2158fb13285f5dc42714b] thermal: core: Fix list sorting in __thermal_zone_device_update()
git bisect bad 94eacc1c583dd2ba51a2158fb13285f5dc42714b
# bad: [a8a261774466d8691e555ea674c193bb1b09edab] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
git bisect bad a8a261774466d8691e555ea674c193bb1b09edab
# good: [aaa18ff54b97706b84306b6613630262706b1f6b] thermal: gov_power_allocator: Return early in manage if trip_max is NULL
git bisect good aaa18ff54b97706b84306b6613630262706b1f6b
# first bad commit: [a8a261774466d8691e555ea674c193bb1b09edab] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid

Reverting 202aa0d4bb532338cd27bcc64c60abc2987a2be7 on top of v6.10 avoids
the issue for me.

$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:591f] (rev 05)
00:01.0 PCI bridge [0604]: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 05)
00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 630 [8086:5912] (rev 04)
00:14.0 USB controller [0c03]: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller [8086:a12f] (rev 31)
00:14.2 Signal processing controller [1180]: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem [8086:a131] (rev 31)
00:16.0 Communication controller [0780]: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 [8086:a13a] (rev 31)
00:17.0 SATA controller [0106]: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] [8086:a102] (rev 31)
00:1c.0 PCI bridge [0604]: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 [8086:a114] (rev f1)
00:1c.6 PCI bridge [0604]: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #7 [8086:a116] (rev f1)
00:1c.7 PCI bridge [0604]: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #8 [8086:a117] (rev f1)
00:1f.0 ISA bridge [0601]: Intel Corporation H110 Chipset LPC/eSPI Controller [8086:a143] (rev 31)
00:1f.2 Memory controller [0580]: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller [8086:a121] (rev 31)
00:1f.3 Audio device [0403]: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller [8086:a170] (rev 31)
00:1f.4 SMBus [0c05]: Intel Corporation 100 Series/C230 Series Chipset Family SMBus [8086:a123] (rev 31)
01:00.0 Non-Volatile memory controller [0108]: SK hynix BC901 NVMe Solid State Drive (DRAM-less) [1c5c:1d59] (rev 03)
02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller [10ec:8125] (rev 05)
03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 0c)
04:00.0 Network controller [0280]: Intel Corporation Wi-Fi 6 AX200 [8086:2723] (rev 1a)

04:00.0 Network controller: Intel Corporation Wi-Fi 6 AX200 (rev 1a)
        Subsystem: Intel Corporation Wi-Fi 6 AX200NGW
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 19
        IOMMU group: 12
        Region 0: Memory at efb00000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [c8] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [d0] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [40] Express (v2) Endpoint, IntMsgNum 0
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <512ns, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W TEE-IO-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr+ NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L1, Exit Latency L1 <8us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x1
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 16ms to 55ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [80] MSI-X: Enable+ Count=16 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=0 offset=00003000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF-
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [14c v1] Latency Tolerance Reporting
                Max snoop latency: 3145728ns
                Max no snoop latency: 3145728ns
        Capabilities: [154 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=30us PortTPowerOnTime=18us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=44us
        Kernel driver in use: iwlwifi
        Kernel modules: iwlwifi

Regards
	Stefan Lippers-Hollmann
Daniel Lezcano July 15, 2024, 9:09 a.m. UTC | #9
On 15/07/2024 06:45, Eric Biggers wrote:
> Hello,
> 
> On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
>> if zone temperature is invalid") caused __thermal_zone_device_update()
>> to return early if the current thermal zone temperature was invalid.
>>
>> This was done to avoid running handle_thermal_trip() and governor
>> callbacks in that case which led to confusion.  However, it went too
>> far because monitor_thermal_zone() still needs to be called even when
>> the zone temperature is invalid to ensure that it will be updated
>> eventually in case thermal polling is enabled and the driver has no
>> other means to notify the core of zone temperature changes (for example,
>> it does not register an interrupt handler or ACPI notifier).
>>
>> Also if the .set_trips() zone callback is expected to set up monitoring
>> interrupts for a thermal zone, it needs to be provided with valid
>> boundaries and that can only be done if the zone temperature is known.
>>
>> Accordingly, to ensure that __thermal_zone_device_update() will
>> run again after a failing zone temperature check, make it call
>> monitor_thermal_zone() regardless of whether or not the zone
>> temperature is valid and make the latter schedule a thermal zone
>> temperature update if the zone temperature is invalid even if
>> polling is not enabled for the thermal zone (however, if this
>> continues to fail, give up after some time).
>>
>> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
>> Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>> Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
>> Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> On v6.10 I'm seeing the following messages spammed to the kernel log endlessly,
> and reverting this commit fixes it.
> 
>      [  156.410567] thermal thermal_zone0: failed to read out thermal zone (-61)
>      [  156.666583] thermal thermal_zone0: failed to read out thermal zone (-61)
>      [  156.922598] thermal thermal_zone0: failed to read out thermal zone (-61)
>      [  157.178613] thermal thermal_zone0: failed to read out thermal zone (-61)
>      [  157.434636] thermal thermal_zone0: failed to read out thermal zone (-61)
>      [  157.690774] thermal thermal_zone0: failed to read out thermal zone (-61)
>      [  157.946659] thermal thermal_zone0: failed to read out thermal zone (-61)
>      [  158.202717] thermal thermal_zone0: failed to read out thermal zone (-61)
>      [  158.458697] thermal thermal_zone0: failed to read out thermal zone (-61)
> 
> /sys/class/thermal/thermal_zone0/type contains "iwlwifi_1".

Does the following change fix the messages?

diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
index 61a4638d1be2..b519db76d402 100644
--- a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
+++ b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
@@ -622,7 +622,7 @@ static int iwl_mvm_tzone_get_temp(struct thermal_zone_device *device,

  	if (!iwl_mvm_firmware_running(mvm) ||
  	    mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
-		ret = -ENODATA;
+		ret = -EAGAIN;
  		goto out;
  	}
Rafael J. Wysocki July 15, 2024, 10:49 a.m. UTC | #10
On Mon, Jul 15, 2024 at 6:45 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> Hello,
>
> On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> > if zone temperature is invalid") caused __thermal_zone_device_update()
> > to return early if the current thermal zone temperature was invalid.
> >
> > This was done to avoid running handle_thermal_trip() and governor
> > callbacks in that case which led to confusion.  However, it went too
> > far because monitor_thermal_zone() still needs to be called even when
> > the zone temperature is invalid to ensure that it will be updated
> > eventually in case thermal polling is enabled and the driver has no
> > other means to notify the core of zone temperature changes (for example,
> > it does not register an interrupt handler or ACPI notifier).
> >
> > Also if the .set_trips() zone callback is expected to set up monitoring
> > interrupts for a thermal zone, it needs to be provided with valid
> > boundaries and that can only be done if the zone temperature is known.
> >
> > Accordingly, to ensure that __thermal_zone_device_update() will
> > run again after a failing zone temperature check, make it call
> > monitor_thermal_zone() regardless of whether or not the zone
> > temperature is valid and make the latter schedule a thermal zone
> > temperature update if the zone temperature is invalid even if
> > polling is not enabled for the thermal zone (however, if this
> > continues to fail, give up after some time).
> >
> > Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> > Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> > Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
> > Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> On v6.10 I'm seeing the following messages spammed to the kernel log endlessly,
> and reverting this commit fixes it.
>
>     [  156.410567] thermal thermal_zone0: failed to read out thermal zone (-61)
>     [  156.666583] thermal thermal_zone0: failed to read out thermal zone (-61)
>     [  156.922598] thermal thermal_zone0: failed to read out thermal zone (-61)
>     [  157.178613] thermal thermal_zone0: failed to read out thermal zone (-61)
>     [  157.434636] thermal thermal_zone0: failed to read out thermal zone (-61)
>     [  157.690774] thermal thermal_zone0: failed to read out thermal zone (-61)
>     [  157.946659] thermal thermal_zone0: failed to read out thermal zone (-61)
>     [  158.202717] thermal thermal_zone0: failed to read out thermal zone (-61)
>     [  158.458697] thermal thermal_zone0: failed to read out thermal zone (-61)
>
> /sys/class/thermal/thermal_zone0/type contains "iwlwifi_1".

thermal_zone0 is useless on your system and hence the message (note
that it is a debug-level one).  That thermal zone certainly shouldn't
have been enabled and it probably shouldn't have been registered
either.

Previously, the core would just leave it alone and now it is poked at
periodically.

You can make the message go away by echoing "disabled" to the mode
attribute of thermal_zone0.

I think we'll see more of this, so we'll probably need to add some
kind of a backoff to it.
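
One possible shape for such a backoff, as a minimal userspace sketch and
not the actual thermal core change: double the recheck delay after each
failed read and stop retrying once the delay exceeds a cap.  All names
here (struct zone_recheck, recheck_init(), recheck_next_delay()) and both
constants are hypothetical; the 250 ms starting value only roughly
matches the spacing of the log timestamps quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values, not taken from the kernel sources. */
#define RECHECK_DELAY_INIT_MS	250U
#define RECHECK_DELAY_MAX_MS	120000U

struct zone_recheck {
	unsigned int delay_ms;	/* delay to use for the next retry */
	bool gave_up;		/* set once the cap has been exceeded */
};

static void recheck_init(struct zone_recheck *rc)
{
	rc->delay_ms = RECHECK_DELAY_INIT_MS;
	rc->gave_up = false;
}

/*
 * Called after a failed temperature read.  Returns the delay (in ms)
 * before the next attempt, or 0 if no further attempt should be made.
 */
static unsigned int recheck_next_delay(struct zone_recheck *rc)
{
	unsigned int delay;

	assert(rc->delay_ms > 0);

	if (rc->gave_up)
		return 0;

	delay = rc->delay_ms;
	if (delay > RECHECK_DELAY_MAX_MS) {
		rc->gave_up = true;
		return 0;
	}

	rc->delay_ms = delay * 2;	/* exponential backoff */
	return delay;
}
```

With these values a dead zone would be retried a handful of times at
growing intervals and then left alone, instead of being polled (and
logged) every quarter of a second indefinitely.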
Rafael J. Wysocki July 15, 2024, 10:52 a.m. UTC | #11
On Mon, Jul 15, 2024 at 11:07 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
>
> Hi
>
> On 2024-07-14, Eric Biggers wrote:
> > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >
> > > Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> > > if zone temperature is invalid") caused __thermal_zone_device_update()
> > > to return early if the current thermal zone temperature was invalid.
> > >
> > > This was done to avoid running handle_thermal_trip() and governor
> > > callbacks in that case which led to confusion.  However, it went too
> > > far because monitor_thermal_zone() still needs to be called even when
> > > the zone temperature is invalid to ensure that it will be updated
> > > eventually in case thermal polling is enabled and the driver has no
> > > other means to notify the core of zone temperature changes (for example,
> > > it does not register an interrupt handler or ACPI notifier).
> > >
> > > Also if the .set_trips() zone callback is expected to set up monitoring
> > > interrupts for a thermal zone, it needs to be provided with valid
> > > boundaries and that can only be done if the zone temperature is known.
> > >
> > > Accordingly, to ensure that __thermal_zone_device_update() will
> > > run again after a failing zone temperature check, make it call
> > > monitor_thermal_zone() regardless of whether or not the zone
> > > temperature is valid and make the latter schedule a thermal zone
> > > temperature update if the zone temperature is invalid even if
> > > polling is not enabled for the thermal zone (however, if this
> > > continues to fail, give up after some time).
> > >
> > > Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> > > Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> > > Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
> > > Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
> > > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > On v6.10 I'm seeing the following messages spammed to the kernel log endlessly,
> > and reverting this commit fixes it.
> >
> >     [  156.410567] thermal thermal_zone0: failed to read out thermal zone (-61)
> [...]
> >     [  158.458697] thermal thermal_zone0: failed to read out thermal zone (-61)
> >
> > /sys/class/thermal/thermal_zone0/type contains "iwlwifi_1".
>
> I am observing the same issue on v6.10 with an Intel ax200 WLAN
> card in a kaby-lake/ i5-7400 system and a Fujitsu D3400-B22
> mainboard and the 'newest' BIOS (V5.0.0.12 R1.29.0) as well:
>
> $ dmesg | grep -i -e iwlwifi -e thermal_zone2
> [    3.692433] iwlwifi 0000:04:00.0: enabling device (0140 -> 0142)
> [    3.698547] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
> [    3.698556] iwlwifi 0000:04:00.0: PCI dev 2723/0084, rev=0x340, rfid=0x10a100
> [    3.703292] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
> [    3.797296] iwlwifi 0000:04:00.0: loaded firmware version 77.a20fb07d.0 cc-a0-77.ucode op_mode iwlmvm
> [    4.090341] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
> [    4.090524] thermal thermal_zone2: failed to read out thermal zone (-61)
> [    4.218496] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
> [    4.285399] iwlwifi 0000:04:00.0: base HW address: 94:e6:f7:XX:XX:XX
> [    4.341754] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
> [    4.345445] thermal thermal_zone2: failed to read out thermal zone (-61)
> [    4.601400] thermal thermal_zone2: failed to read out thermal zone (-61)
> [    4.857372] thermal thermal_zone2: failed to read out thermal zone (-61)
> [    5.114387] thermal thermal_zone2: failed to read out thermal zone (-61)
> [...]
> [  143.643801] thermal thermal_zone2: failed to read out thermal zone (-61)
> [  143.899818] thermal thermal_zone2: failed to read out thermal zone (-61)
> [  144.155813] thermal thermal_zone2: failed to read out thermal zone (-61)
> [  144.411815] thermal thermal_zone2: failed to read out thermal zone (-61)
> [  144.667828] thermal thermal_zone2: failed to read out thermal zone (-61)
> [  144.923801] thermal thermal_zone2: failed to read out thermal zone (-61)
> [  145.179822] thermal thermal_zone2: failed to read out thermal zone (-61)
> [...]

As I said in the reply to the previous report, this thermal zone is
useless and it can be disabled via sysfs.  The message will go away
then.

We'll see what can be done to make the message go away completely or
at least stop being printed after a certain number of iterations.
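
For completeness, the sysfs workaround amounts to writing "disabled" to
the zone's mode attribute (for example
/sys/class/thermal/thermal_zone2/mode here), which needs root.  A minimal
C sketch of that write follows; the helper name thermal_zone_disable() is
made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "disabled" to the given thermal zone mode attribute.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int thermal_zone_disable(const char *mode_path)
{
	FILE *f;
	int ret = 0;

	assert(mode_path);

	f = fopen(mode_path, "w");
	if (!f)
		return -1;

	if (fputs("disabled\n", f) == EOF)
		ret = -1;

	/* Write errors on sysfs attributes may only show up at close time. */
	if (fclose(f) == EOF)
		ret = -1;

	return ret;
}
```

Note that the zone number is not stable across boots or systems, so the
type attribute should be checked for "iwlwifi_1" before writing.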
Rafael J. Wysocki July 15, 2024, 11:21 a.m. UTC | #12
On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 15/07/2024 06:45, Eric Biggers wrote:
> > Hello,
> >
> > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> >> if zone temperature is invalid") caused __thermal_zone_device_update()
> >> to return early if the current thermal zone temperature was invalid.
> >>
> >> This was done to avoid running handle_thermal_trip() and governor
> >> callbacks in that case which led to confusion.  However, it went too
> >> far because monitor_thermal_zone() still needs to be called even when
> >> the zone temperature is invalid to ensure that it will be updated
> >> eventually in case thermal polling is enabled and the driver has no
> >> other means to notify the core of zone temperature changes (for example,
> >> it does not register an interrupt handler or ACPI notifier).
> >>
> >> Also if the .set_trips() zone callback is expected to set up monitoring
> >> interrupts for a thermal zone, it needs to be provided with valid
> >> boundaries and that can only be done if the zone temperature is known.
> >>
> >> Accordingly, to ensure that __thermal_zone_device_update() will
> >> run again after a failing zone temperature check, make it call
> >> monitor_thermal_zone() regardless of whether or not the zone
> >> temperature is valid and make the latter schedule a thermal zone
> >> temperature update if the zone temperature is invalid even if
> >> polling is not enabled for the thermal zone (however, if this
> >> continues to fail, give up after some time).
> >>
> >> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> >> Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> >> Link: https://lore.kernel.org/linux-pm/dc1e6cba-352b-4c78-93b5-94dd033fca16@linaro.org
> >> Link: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net
> >> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > On v6.10 I'm seeing the following messages spammed to the kernel log endlessly,
> > and reverting this commit fixes it.
> >
> >      [  156.410567] thermal thermal_zone0: failed to read out thermal zone (-61)
> >      [  156.666583] thermal thermal_zone0: failed to read out thermal zone (-61)
> >      [  156.922598] thermal thermal_zone0: failed to read out thermal zone (-61)
> >      [  157.178613] thermal thermal_zone0: failed to read out thermal zone (-61)
> >      [  157.434636] thermal thermal_zone0: failed to read out thermal zone (-61)
> >      [  157.690774] thermal thermal_zone0: failed to read out thermal zone (-61)
> >      [  157.946659] thermal thermal_zone0: failed to read out thermal zone (-61)
> >      [  158.202717] thermal thermal_zone0: failed to read out thermal zone (-61)
> >      [  158.458697] thermal thermal_zone0: failed to read out thermal zone (-61)
> >
> > /sys/class/thermal/thermal_zone0/type contains "iwlwifi_1".
>
> Does the following change fix the messages?
>
> diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> index 61a4638d1be2..b519db76d402 100644
> --- a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> +++ b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> @@ -622,7 +622,7 @@ static int iwl_mvm_tzone_get_temp(struct
> thermal_zone_device *device,
>
>         if (!iwl_mvm_firmware_running(mvm) ||
>             mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
> -               ret = -ENODATA;
> +               ret = -EAGAIN;
>                 goto out;
>         }
>
>
> --

It would make the message go away, but it wouldn't stop the useless
polling of the dead thermal zone.

I think that two things need to be done:

(1) Add backoff to the thermal core as proposed previously.
(2) Make iwlwifi enable the thermal zone only if the firmware is running.
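
The difference between silencing the message and stopping the polling
can be shown with a small userspace model.  It assumes, in line with the
observation above, that the core skips the log message when the read
callback returns -EAGAIN but keeps polling after any failure.
poll_zone(), the two callbacks and struct poll_stats are hypothetical
and are not the thermal core code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

struct poll_stats {
	unsigned int polls;	/* how many times the zone was polled */
	unsigned int messages;	/* how many failures were logged */
};

/* Stand-ins for a get_temp() callback while the firmware is not running. */
static int get_temp_enodata(int *temp)
{
	(void)temp;
	return -ENODATA;
}

static int get_temp_eagain(int *temp)
{
	(void)temp;
	return -EAGAIN;
}

static void poll_zone(int (*get_temp)(int *temp), unsigned int iterations,
		      struct poll_stats *st)
{
	unsigned int i;
	int temp, ret;

	assert(get_temp && st);

	for (i = 0; i < iterations; i++) {
		st->polls++;
		ret = get_temp(&temp);
		if (ret && ret != -EAGAIN) {
			st->messages++;
			fprintf(stderr,
				"failed to read out thermal zone (%d)\n", ret);
		}
		/* Either way, the zone is polled again next time around. */
	}
}
```

In this model both callbacks cause the same number of polls; only the
number of logged messages differs, which is why (1) and (2) are still
needed.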
Stefan Lippers-Hollmann July 15, 2024, 12:54 p.m. UTC | #13
Hi

On 2024-07-15, Rafael J. Wysocki wrote:
> On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
> > On 15/07/2024 06:45, Eric Biggers wrote:
> > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >>
> > >> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
[...]
> > Does the following change fix the messages?
> >
> > diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > index 61a4638d1be2..b519db76d402 100644
> > --- a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > +++ b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > @@ -622,7 +622,7 @@ static int iwl_mvm_tzone_get_temp(struct
> > thermal_zone_device *device,
> >
> >         if (!iwl_mvm_firmware_running(mvm) ||
> >             mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
> > -               ret = -ENODATA;
> > +               ret = -EAGAIN;
> >                 goto out;
> >         }
> >
> >
> > --
>
> It would make the message go away, but it wouldn't stop the useless
> polling of the dead thermal zone.

Silencing the warnings is already a big improvement - and that patch
works to this extent for me with an ax200, thanks.

> I think that two things need to be done:
>
> (1) Add backoff to the thermal core as proposed previously.
> (2) Make iwlwifi enable the thermal zone only if the firmware is running.

Regards
	Stefan Lippers-Hollmann
Rafael J. Wysocki July 15, 2024, 2:48 p.m. UTC | #14
On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
>
> Hi
>
> On 2024-07-15, Rafael J. Wysocki wrote:
> > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > <daniel.lezcano@linaro.org> wrote:
> > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >>
> > > >> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> [...]
> > > Does the following change fix the messages?
> > >
> > > diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > > b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > > index 61a4638d1be2..b519db76d402 100644
> > > --- a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > > +++ b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > > @@ -622,7 +622,7 @@ static int iwl_mvm_tzone_get_temp(struct
> > > thermal_zone_device *device,
> > >
> > >         if (!iwl_mvm_firmware_running(mvm) ||
> > >             mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
> > > -               ret = -ENODATA;
> > > +               ret = -EAGAIN;
> > >                 goto out;
> > >         }
> > >
> > >
> > > --
> >
> > It would make the message go away, but it wouldn't stop the useless
> > polling of the dead thermal zone.
>
> Silencing the warnings is already a big improvement - and that patch
> works to this extent for me with an ax200, thanks.

So attached is a patch that should avoid enabling the thermal zone
when it is not ready for use in the first place, so it should address
both the message and the useless polling.

I would appreciate giving it a go (please note that it hasn't received
much testing so far, though).
Eric Biggers July 15, 2024, 9:12 p.m. UTC | #15
On Mon, Jul 15, 2024 at 04:48:20PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> >
> > Hi
> >
> > On 2024-07-15, Rafael J. Wysocki wrote:
> > > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > > <daniel.lezcano@linaro.org> wrote:
> > > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > >>
> > > > >> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> > [...]
> > > > Does the following change fix the messages?
> > > >
> > > > diff --git a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > > > b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > > > index 61a4638d1be2..b519db76d402 100644
> > > > --- a/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > > > +++ b/drivers/net/wireless/intel/iwlwifi/mvm/tt.c
> > > > @@ -622,7 +622,7 @@ static int iwl_mvm_tzone_get_temp(struct
> > > > thermal_zone_device *device,
> > > >
> > > >         if (!iwl_mvm_firmware_running(mvm) ||
> > > >             mvm->fwrt.cur_fw_img != IWL_UCODE_REGULAR) {
> > > > -               ret = -ENODATA;
> > > > +               ret = -EAGAIN;
> > > >                 goto out;
> > > >         }
> > > >
> > > >
> > > > --
> > >
> > > It would make the message go away, but it wouldn't stop the useless
> > > polling of the dead thermal zone.
> >
> > Silencing the warnings is already a big improvement - and that patch
> > works to this extent for me with an ax200, thanks.
> 
> So attached is a patch that should avoid enabling the thermal zone
> when it is not ready for use in the first place, so it should address
> both the message and the useless polling.
> 
> I would appreciate giving it a go (please note that it hasn't received
> much testing so far, though).

> ---
>  drivers/net/wireless/intel/iwlwifi/mvm/fw.c  |    1 
>  drivers/net/wireless/intel/iwlwifi/mvm/mvm.h |    1 
>  drivers/net/wireless/intel/iwlwifi/mvm/tt.c  |   55 ++++++++++++++++++++++-----
>  drivers/thermal/thermal_core.c               |   46 ++++++++++++++++++++++
>  include/linux/thermal.h                      |    1 
>  5 files changed, 95 insertions(+), 9 deletions(-)

I'm still getting the warning messages with this patch applied.

- Eric
Stefan Lippers-Hollmann July 15, 2024, 11:48 p.m. UTC | #16
Hi

On 2024-07-15, Rafael J. Wysocki wrote:
> On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > On 2024-07-15, Rafael J. Wysocki wrote:
> > > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > > <daniel.lezcano@linaro.org> wrote:
> > > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[...]
> > Silencing the warnings is already a big improvement - and that patch
> > works to this extent for me with an ax200, thanks.
>
> So attached is a patch that should avoid enabling the thermal zone
> when it is not ready for use in the first place, so it should address
> both the message and the useless polling.
>
> I would appreciate giving it a go (please note that it hasn't received
> much testing so far, though).

Sadly this patch doesn't seem to help:

$ dmesg  | grep -e iwlwifi -e thermal
[    0.113700] thermal_sys: Registered thermal governor 'fair_share'
[    0.113700] thermal_sys: Registered thermal governor 'bang_bang'
[    0.113700] thermal_sys: Registered thermal governor 'step_wise'
[    0.113700] thermal_sys: Registered thermal governor 'user_space'
[    0.113700] thermal_sys: Registered thermal governor 'power_allocator'
[    3.885485] iwlwifi 0000:04:00.0: enabling device (0140 -> 0142)
[    3.888462] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
[    3.888471] iwlwifi 0000:04:00.0: PCI dev 2723/0084, rev=0x340, rfid=0x10a100
[    3.892720] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
[    3.994292] iwlwifi 0000:04:00.0: loaded firmware version 77.a20fb07d.0 cc-a0-77.ucode op_mode iwlmvm
[    4.383879] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[    4.513229] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
[    4.578828] iwlwifi 0000:04:00.0: base HW address: 94:e6:f7:XX:XX:XX
[    4.592597] thermal thermal_zone2: failed to read out thermal zone (-61)
[    4.604651] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
[    4.849442] thermal thermal_zone2: failed to read out thermal zone (-61)
[    5.105488] thermal thermal_zone2: failed to read out thermal zone (-61)
[    5.361470] thermal thermal_zone2: failed to read out thermal zone (-61)
[    5.618458] thermal thermal_zone2: failed to read out thermal zone (-61)
[    5.873428] thermal thermal_zone2: failed to read out thermal zone (-61)
[    6.129429] thermal thermal_zone2: failed to read out thermal zone (-61)
[    6.385446] thermal thermal_zone2: failed to read out thermal zone (-61)
[    6.641695] thermal thermal_zone2: failed to read out thermal zone (-61)

Regards
	Stefan Lippers-Hollmann

P.S.: I've now also noticed the same issue on a raptor-lake system with AX201.
Rafael J. Wysocki July 16, 2024, 10:05 a.m. UTC | #17
On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
>
> Hi
>
> On 2024-07-15, Rafael J. Wysocki wrote:
> > On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > > > <daniel.lezcano@linaro.org> wrote:
> > > > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> [...]
> > > Silencing the warnings is already a big improvement - and that patch
> > > works to this extent for me with an ax200, thanks.
> >
> > So attached is a patch that should avoid enabling the thermal zone
> > when it is not ready for use in the first place, so it should address
> > both the message and the useless polling.
> >
> > I would appreciate giving it a go (please note that it hasn't received
> > much testing so far, though).
>
> Sadly this patch doesn't seem to help:

This is likely because it is missing checks for firmware image type.
I've added them to the attached new version.  Please try it.

I've also added two pr_info() messages to get a better idea of what's
going on, so please grep dmesg for "Thermal zone not ready" and
"Enabling thermal zone".

In the meantime, I'll prepare thermal core changes that should
mitigate the problem independently.
Stefan Lippers-Hollmann July 16, 2024, 10:55 a.m. UTC | #18
Hi

On 2024-07-16, Rafael J. Wysocki wrote:
> On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > On 2024-07-15, Rafael J. Wysocki wrote:
> > > On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > > > > <daniel.lezcano@linaro.org> wrote:
> > > > > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > > > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > [...]
> > > > Silencing the warnings is already a big improvement - and that patch
> > > > works to this extent for me with an ax200, thanks.
> > >
> > > So attached is a patch that should avoid enabling the thermal zone
> > > when it is not ready for use in the first place, so it should address
> > > both the message and the useless polling.
> > >
> > > I would appreciate giving it a go (please note that it hasn't received
> > > much testing so far, though).
> >
> > Sadly this patch doesn't seem to help:
> 
> This is likely because it is missing checks for firmware image type.
> I've added them to the attached new version.  Please try it.
> 
> I've also added two pr_info() messages to get a better idea of what's
> going on, so please grep dmesg for "Thermal zone not ready" and
> "Enabling thermal zone".

This is the output with the patch applied:

$ dmesg | grep -i -e iwlwifi -e thermal
[    0.081026] CPU0: Thermal monitoring enabled (TM1)
[    0.113898] thermal_sys: Registered thermal governor 'fair_share'
[    0.113900] thermal_sys: Registered thermal governor 'bang_bang'
[    0.113901] thermal_sys: Registered thermal governor 'step_wise'
[    0.113902] thermal_sys: Registered thermal governor 'user_space'
[    0.113903] thermal_sys: Registered thermal governor 'power_allocator'
[    3.917770] iwlwifi 0000:04:00.0: enabling device (0140 -> 0142)
[    3.926543] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
[    3.926551] iwlwifi 0000:04:00.0: PCI dev 2723/0084, rev=0x340, rfid=0x10a100
[    3.936737] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
[    4.021494] iwlwifi 0000:04:00.0: loaded firmware version 77.a20fb07d.0 cc-a0-77.ucode op_mode iwlmvm
[    4.347478] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[    4.347616] iwl_mvm_thermal_zone_register: Thermal zone not ready
[    4.478749] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
[    4.478777] thermal thermal_zone2: Enabling thermal zone
[    4.543601] iwlwifi 0000:04:00.0: base HW address: 94:e6:f7:XX:XX:XX
[    4.559564] thermal thermal_zone2: failed to read out thermal zone (-61)
[    4.602339] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
[    4.810373] thermal thermal_zone2: failed to read out thermal zone (-61)
[    5.066381] thermal thermal_zone2: failed to read out thermal zone (-61)
[    5.322385] thermal thermal_zone2: failed to read out thermal zone (-61)
[    5.579377] thermal thermal_zone2: failed to read out thermal zone (-61)
[    5.834375] thermal thermal_zone2: failed to read out thermal zone (-61)
[    6.091372] thermal thermal_zone2: failed to read out thermal zone (-61)
[    6.346400] thermal thermal_zone2: failed to read out thermal zone (-61)
               [...]

Regards
	Stefan Lippers-Hollmann
Stefan Lippers-Hollmann July 16, 2024, 11:15 a.m. UTC | #19
Hi

On 2024-07-16, Stefan Lippers-Hollmann wrote:
> On 2024-07-16, Rafael J. Wysocki wrote:
> > On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > > > > > <daniel.lezcano@linaro.org> wrote:
> > > > > > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > > > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > > > > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > [...]
> > > > > Silencing the warnings is already a big improvement - and that patch
> > > > > works to this extent for me with an ax200, thanks.
> > > >
> > > > So attached is a patch that should avoid enabling the thermal zone
> > > > when it is not ready for use in the first place, so it should address
> > > > both the message and the useless polling.
> > > >
> > > > I would appreciate giving it a go (please note that it hasn't received
> > > > much testing so far, though).
> > >
> > > Sadly this patch doesn't seem to help:
> >
> > This is likely because it is missing checks for firmware image type.
> > I've added them to the attached new version.  Please try it.
> >
> > I've also added two pr_info() messages to get a better idea of what's
> > going on, so please grep dmesg for "Thermal zone not ready" and
> > "Enabling thermal zone".
>
> This is the output with the patch applied:

The ax200 WLAN interface is currently not up or configured (the system
is using its wired Ethernet cards instead).  The thermal_zone1 messages
stop if I manually bring the interface up (ip link set dev wlp4s0 up)
after booting:

$ dmesg | grep -i -e iwlwifi -e thermal
[    0.080899] CPU0: Thermal monitoring enabled (TM1)
[    0.113768] thermal_sys: Registered thermal governor 'fair_share'
[    0.113770] thermal_sys: Registered thermal governor 'bang_bang'
[    0.113771] thermal_sys: Registered thermal governor 'step_wise'
[    0.113772] thermal_sys: Registered thermal governor 'user_space'
[    0.113773] thermal_sys: Registered thermal governor 'power_allocator'
[    3.759673] iwlwifi 0000:04:00.0: enabling device (0140 -> 0142)
[    3.764918] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
[    3.764974] iwlwifi 0000:04:00.0: PCI dev 2723/0084, rev=0x340, rfid=0x10a100
[    3.769432] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
[    3.873466] iwlwifi 0000:04:00.0: loaded firmware version 77.a20fb07d.0 cc-a0-77.ucode op_mode iwlmvm
[    3.907122] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
[    3.907886] iwl_mvm_thermal_zone_register: Thermal zone not ready
[    4.032380] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
[    4.032392] thermal thermal_zone1: Enabling thermal zone
[    4.098308] iwlwifi 0000:04:00.0: base HW address: 94:e6:f7:XX:XX:XX
[    4.112535] thermal thermal_zone1: failed to read out thermal zone (-61)
[    4.128306] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
[    4.369396] thermal thermal_zone1: failed to read out thermal zone (-61)
[    4.625385] thermal thermal_zone1: failed to read out thermal zone (-61)
[    4.881416] thermal thermal_zone1: failed to read out thermal zone (-61)
[    5.137377] thermal thermal_zone1: failed to read out thermal zone (-61)
[    5.394377] thermal thermal_zone1: failed to read out thermal zone (-61)
[    5.649412] thermal thermal_zone1: failed to read out thermal zone (-61)
[    5.905379] thermal thermal_zone1: failed to read out thermal zone (-61)
[    6.161380] thermal thermal_zone1: failed to read out thermal zone (-61)
[    6.418381] thermal thermal_zone1: failed to read out thermal zone (-61)
[    6.673381] thermal thermal_zone1: failed to read out thermal zone (-61)
[    6.929377] thermal thermal_zone1: failed to read out thermal zone (-61)
               [...]
[   21.009413] thermal thermal_zone1: failed to read out thermal zone (-61)
[   21.265496] thermal thermal_zone1: failed to read out thermal zone (-61)
[   21.521462] thermal thermal_zone1: failed to read out thermal zone (-61)
[   21.777481] thermal thermal_zone1: failed to read out thermal zone (-61)
[   22.033468] thermal thermal_zone1: failed to read out thermal zone (-61)
[   22.213120] thermal thermal_zone1: Enabling thermal zone
[   22.283954] iwlwifi 0000:04:00.0: Registered PHC clock: iwlwifi-PTP, with index: 0

Regards
	Stefan Lippers-Hollmann
Rafael J. Wysocki July 16, 2024, 11:19 a.m. UTC | #20
On Tue, Jul 16, 2024 at 12:55 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
>
> Hi
>
> On 2024-07-16, Rafael J. Wysocki wrote:
> > On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > > > > > <daniel.lezcano@linaro.org> wrote:
> > > > > > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > > > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > > > > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > [...]
> > > > > Silencing the warnings is already a big improvement - and that patch
> > > > > works to this extent for me with an ax200, thanks.
> > > >
> > > > So attached is a patch that should avoid enabling the thermal zone
> > > > when it is not ready for use in the first place, so it should address
> > > > both the message and the useless polling.
> > > >
> > > > I would appreciate giving it a go (please note that it hasn't received
> > > > much testing so far, though).
> > >
> > > Sadly this patch doesn't seem to help:
> >
> > This is likely because it is missing checks for firmware image type.
> > I've added them to the attached new version.  Please try it.
> >
> > I've also added two pr_info() messages to get a better idea of what's
> > going on, so please grep dmesg for "Thermal zone not ready" and
> > "Enabling thermal zone".
>
> This is the output with the patch applied:

Thanks for testing!

> $ dmesg | grep -i -e iwlwifi -e thermal
> [    0.081026] CPU0: Thermal monitoring enabled (TM1)
> [    0.113898] thermal_sys: Registered thermal governor 'fair_share'
> [    0.113900] thermal_sys: Registered thermal governor 'bang_bang'
> [    0.113901] thermal_sys: Registered thermal governor 'step_wise'
> [    0.113902] thermal_sys: Registered thermal governor 'user_space'
> [    0.113903] thermal_sys: Registered thermal governor 'power_allocator'
> [    3.917770] iwlwifi 0000:04:00.0: enabling device (0140 -> 0142)
> [    3.926543] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
> [    3.926551] iwlwifi 0000:04:00.0: PCI dev 2723/0084, rev=0x340, rfid=0x10a100
> [    3.936737] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
> [    4.021494] iwlwifi 0000:04:00.0: loaded firmware version 77.a20fb07d.0 cc-a0-77.ucode op_mode iwlmvm
> [    4.347478] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
> [    4.347616] iwl_mvm_thermal_zone_register: Thermal zone not ready

So this means that iwl_mvm_thermal_zone_register() sees that the
thermal zone is not ready and returns without enabling it.  So far so
good.

> [    4.478749] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
> [    4.478777] thermal thermal_zone2: Enabling thermal zone

This means that iwl_mvm_load_ucode_wait_alive() has called
iwl_mvm_thermal_tzone_enable() for thermal_zone2 after checking that
the firmware image type is IWL_UCODE_REGULAR and after setting
IWL_MVM_STATUS_FIRMWARE_RUNNING in mvm->status.

> [    4.543601] iwlwifi 0000:04:00.0: base HW address: 94:e6:f7:XX:XX:XX
> [    4.559564] thermal thermal_zone2: failed to read out thermal zone (-61)

And interestingly enough, iwl_mvm_tzone_get_temp() sees that
IWL_MVM_STATUS_FIRMWARE_RUNNING is not set in mvm->status or the
firmware image type is not IWL_UCODE_REGULAR.  I'm guessing the
former.

> [    4.602339] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
> [    4.810373] thermal thermal_zone2: failed to read out thermal zone (-61)
> [    5.066381] thermal thermal_zone2: failed to read out thermal zone (-61)
> [    5.322385] thermal thermal_zone2: failed to read out thermal zone (-61)
> [    5.579377] thermal thermal_zone2: failed to read out thermal zone (-61)
> [    5.834375] thermal thermal_zone2: failed to read out thermal zone (-61)
> [    6.091372] thermal thermal_zone2: failed to read out thermal zone (-61)
> [    6.346400] thermal thermal_zone2: failed to read out thermal zone (-61)
>                [...]

Since there is only one place where IWL_MVM_STATUS_FIRMWARE_RUNNING is
set and that is in iwl_mvm_load_ucode_wait_alive(), I think that it is
cleared somewhere after iwl_mvm_load_ucode_wait_alive() has completed
and before iwl_mvm_tzone_get_temp() runs.
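
The sequence inferred above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the struct and all fake_* names are hypothetical stand-ins, not the iwlwifi code, and the "stop" step stands for whichever path clears the bit. Only the -ENODATA return value (printed as -61 in the log on Linux) is taken from the output above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; not the real struct iwl_mvm. */
struct fake_mvm {
	bool fw_running;	/* models IWL_MVM_STATUS_FIRMWARE_RUNNING */
	bool tz_enabled;	/* models the thermal zone being enabled */
};

/* Models the end of firmware load: set the bit, then enable the zone. */
static void fake_load_ucode_wait_alive(struct fake_mvm *mvm)
{
	mvm->fw_running = true;
	mvm->tz_enabled = true;
}

/* Models a later path that clears the bit but leaves the zone enabled. */
static void fake_stop_device(struct fake_mvm *mvm)
{
	mvm->fw_running = false;
}

/* Models the .get_temp() callback: without firmware there is no reading. */
static int fake_get_temp(const struct fake_mvm *mvm, int *temp)
{
	if (!mvm->fw_running)
		return -ENODATA;	/* -61 on Linux, as seen in the log */

	*temp = 40000;	/* arbitrary placeholder, millidegrees Celsius */
	return 0;
}
```

Calling fake_load_ucode_wait_alive() followed by fake_stop_device() leaves tz_enabled set while fake_get_temp() fails, which matches the repeated "failed to read out thermal zone" messages from a zone that the core keeps polling.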
Rafael J. Wysocki July 16, 2024, 11:36 a.m. UTC | #21
On Tue, Jul 16, 2024 at 1:15 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
>
> Hi
>
> On 2024-07-16, Stefan Lippers-Hollmann wrote:
> > On 2024-07-16, Rafael J. Wysocki wrote:
> > > On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > > > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > > > > > > <daniel.lezcano@linaro.org> wrote:
> > > > > > > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > > > > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > > > > > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > [...]
> > > > > > Silencing the warnings is already a big improvement - and that patch
> > > > > > works to this extent for me with an ax200, thanks.
> > > > >
> > > > > So attached is a patch that should avoid enabling the thermal zone
> > > > > when it is not ready for use in the first place, so it should address
> > > > > both the message and the useless polling.
> > > > >
> > > > > I would appreciate giving it a go (please note that it hasn't received
> > > > > much testing so far, though).
> > > >
> > > > Sadly this patch doesn't seem to help:
> > >
> > > This is likely because it is missing checks for firmware image type.
> > > I've added them to the attached new version.  Please try it.
> > >
> > > I've also added two pr_info() messages to get a better idea of what's
> > > going on, so please grep dmesg for "Thermal zone not ready" and
> > > "Enabling thermal zone".
> >
> > This is the output with the patch applied:
>
> The ax200 wlan interface is currently not up/ configured (system
> using its wired ethernet cards instead), the thermal_zone1 stops
> if I manually enable the interface (ip link set dev wlp4s0 up)
> after booting up:

This explains it, thanks!

The enabling of the thermal zone in iwl_mvm_load_ucode_wait_alive() is
premature or it should get disabled in the other two places that clear
the IWL_MVM_STATUS_FIRMWARE_RUNNING bit.

I'm not sure why the thermal zone depends on whether or not this bit
is set, though. Is it really a good idea to return errors from it if
the interface is not up?

> $ dmesg | grep -i -e iwlwifi -e thermal
> [    0.080899] CPU0: Thermal monitoring enabled (TM1)
> [    0.113768] thermal_sys: Registered thermal governor 'fair_share'
> [    0.113770] thermal_sys: Registered thermal governor 'bang_bang'
> [    0.113771] thermal_sys: Registered thermal governor 'step_wise'
> [    0.113772] thermal_sys: Registered thermal governor 'user_space'
> [    0.113773] thermal_sys: Registered thermal governor 'power_allocator'
> [    3.759673] iwlwifi 0000:04:00.0: enabling device (0140 -> 0142)
> [    3.764918] iwlwifi 0000:04:00.0: Detected crf-id 0x3617, cnv-id 0x100530 wfpm id 0x80000000
> [    3.764974] iwlwifi 0000:04:00.0: PCI dev 2723/0084, rev=0x340, rfid=0x10a100
> [    3.769432] iwlwifi 0000:04:00.0: TLV_FW_FSEQ_VERSION: FSEQ Version: 89.3.35.37
> [    3.873466] iwlwifi 0000:04:00.0: loaded firmware version 77.a20fb07d.0 cc-a0-77.ucode op_mode iwlmvm
> [    3.907122] iwlwifi 0000:04:00.0: Detected Intel(R) Wi-Fi 6 AX200 160MHz, REV=0x340
> [    3.907886] iwl_mvm_thermal_zone_register: Thermal zone not ready
> [    4.032380] iwlwifi 0000:04:00.0: Detected RF HR B3, rfid=0x10a100
> [    4.032392] thermal thermal_zone1: Enabling thermal zone
> [    4.098308] iwlwifi 0000:04:00.0: base HW address: 94:e6:f7:XX:XX:XX
> [    4.112535] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    4.128306] iwlwifi 0000:04:00.0 wlp4s0: renamed from wlan0
> [    4.369396] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    4.625385] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    4.881416] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    5.137377] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    5.394377] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    5.649412] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    5.905379] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    6.161380] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    6.418381] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    6.673381] thermal thermal_zone1: failed to read out thermal zone (-61)
> [    6.929377] thermal thermal_zone1: failed to read out thermal zone (-61)
>                [...]
> [   21.009413] thermal thermal_zone1: failed to read out thermal zone (-61)
> [   21.265496] thermal thermal_zone1: failed to read out thermal zone (-61)
> [   21.521462] thermal thermal_zone1: failed to read out thermal zone (-61)
> [   21.777481] thermal thermal_zone1: failed to read out thermal zone (-61)
> [   22.033468] thermal thermal_zone1: failed to read out thermal zone (-61)
> [   22.213120] thermal thermal_zone1: Enabling thermal zone
> [   22.283954] iwlwifi 0000:04:00.0: Registered PHC clock: iwlwifi-PTP, with index: 0

Thanks for this data point!

AFAICS the thermal zone in iwlwifi is always enabled, but only valid
if the interface is up.  It looks to me like the thermal core needs a
special "don't poll me" error code to be returned in such cases.
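
To illustrate the "don't poll me" idea, here is a minimal userspace sketch. It assumes a dedicated return value that the core treats as "nothing to do for now". The choice of -EAGAIN for that value and every fake_* name are assumptions made for illustration only; nothing here is the actual thermal core or iwlwifi code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical "ignore me for now" code; picked only for this sketch. */
#define FAKE_TZ_IGNORE	(-EAGAIN)

enum fake_action {
	FAKE_HANDLE_TRIPS,	/* valid reading: process trips and governor */
	FAKE_RETRY_LATER,	/* genuine failure: report it and re-check */
	FAKE_DO_NOTHING,	/* zone asked to be left alone for now */
};

typedef int (*fake_get_temp_t)(void *data, int *temp);

/* Models the core's update path deciding what to do with a reading. */
static enum fake_action fake_zone_update(fake_get_temp_t get_temp,
					 void *data, int *temp)
{
	int ret = get_temp(data, temp);

	if (ret == FAKE_TZ_IGNORE)
		return FAKE_DO_NOTHING;	/* no message, no polling */
	if (ret)
		return FAKE_RETRY_LATER;

	return FAKE_HANDLE_TRIPS;
}

/* Models a driver whose zone is only meaningful while the interface is up. */
static int fake_wifi_get_temp(void *data, int *temp)
{
	const bool *iface_up = data;

	if (!*iface_up)
		return FAKE_TZ_IGNORE;

	*temp = 45000;	/* arbitrary placeholder, millidegrees Celsius */
	return 0;
}
```

The point of a separate code is that other errors still take the retry path, so a zone that is really broken is not silently ignored.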
Daniel Lezcano July 16, 2024, 12:10 p.m. UTC | #22
On 16/07/2024 13:36, Rafael J. Wysocki wrote:
> On Tue, Jul 16, 2024 at 1:15 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
>>
>> Hi
>>
>> On 2024-07-16, Stefan Lippers-Hollmann wrote:
>>> On 2024-07-16, Rafael J. Wysocki wrote:
>>>> On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
>>>>> On 2024-07-15, Rafael J. Wysocki wrote:
>>>>>> On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
>>>>>>> On 2024-07-15, Rafael J. Wysocki wrote:
>>>>>>>> On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
>>>>>>>> <daniel.lezcano@linaro.org> wrote:
>>>>>>>>> On 15/07/2024 06:45, Eric Biggers wrote:
>>>>>>>>>> On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
>>>>>>>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>>> [...]
>>>>>>> Silencing the warnings is already a big improvement - and that patch
>>>>>>> works to this extent for me with an ax200, thanks.
>>>>>>
>>>>>> So attached is a patch that should avoid enabling the thermal zone
>>>>>> when it is not ready for use in the first place, so it should address
>>>>>> both the message and the useless polling.
>>>>>>
>>>>>> I would appreciate giving it a go (please note that it hasn't received
>>>>>> much testing so far, though).
>>>>>
>>>>> Sadly this patch doesn't seem to help:
>>>>
>>>> This is likely because it is missing checks for firmware image type.
>>>> I've added them to the attached new version.  Please try it.
>>>>
>>>> I've also added two pr_info() messages to get a better idea of what's
>>>> going on, so please grep dmesg for "Thermal zone not ready" and
>>>> "Enabling thermal zone".
>>>
>>> This is the output with the patch applied:
>>
>> The ax200 wlan interface is currently not up/ configured (system
>> using its wired ethernet cards instead), the thermal_zone1 stops
>> if I manually enable the interface (ip link set dev wlp4s0 up)
>> after booting up:
> 
> This explains it, thanks!
> 
> The enabling of the thermal zone in iwl_mvm_load_ucode_wait_alive() is
> premature or it should get disabled in the other two places that clear
> the IWL_MVM_STATUS_FIRMWARE_RUNNING bit.
> 
> I'm not sure why the thermal zone depends on whether or not this bit
> is set, though. Is it really a good idea to return errors from it if
> the interface is not up?
> 
>> $ dmesg | grep -i -e iwlwifi -e thermal
>> [...]
> 
> Thanks for this data point!
> 
> AFAICS the thermal zone in iwlwifi is always enabled, but only valid
> if the interface is up.  It looks to me like the thermal core needs a
> special "don't poll me" error code to be returned in such cases.

From my POV, it is not up to the thermal core to adapt to the driver.

Usually network devices have ops that run when they transition to up or
down. Would it make sense to move enabling / disabling the thermal zone
into these ops?
Rafael J. Wysocki July 16, 2024, 12:18 p.m. UTC | #23
On Tue, Jul 16, 2024 at 2:10 PM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 16/07/2024 13:36, Rafael J. Wysocki wrote:
> > On Tue, Jul 16, 2024 at 1:15 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> >>
> >> Hi
> >>
> >> On 2024-07-16, Stefan Lippers-Hollmann wrote:
> >>> On 2024-07-16, Rafael J. Wysocki wrote:
> >>>> On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> >>>>> On 2024-07-15, Rafael J. Wysocki wrote:
> >>>>>> On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> >>>>>>> On 2024-07-15, Rafael J. Wysocki wrote:
> >>>>>>>> On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> >>>>>>>> <daniel.lezcano@linaro.org> wrote:
> >>>>>>>>> On 15/07/2024 06:45, Eric Biggers wrote:
> >>>>>>>>>> On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> >>>>>>>>>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>>> [...]
> >>>>>>> Silencing the warnings is already a big improvement - and that patch
> >>>>>>> works to this extent for me with an ax200, thanks.
> >>>>>>
> >>>>>> So attached is a patch that should avoid enabling the thermal zone
> >>>>>> when it is not ready for use in the first place, so it should address
> >>>>>> both the message and the useless polling.
> >>>>>>
> >>>>>> I would appreciate giving it a go (please note that it hasn't received
> >>>>>> much testing so far, though).
> >>>>>
> >>>>> Sadly this patch doesn't seem to help:
> >>>>
> >>>> This is likely because it is missing checks for firmware image type.
> >>>> I've added them to the attached new version.  Please try it.
> >>>>
> >>>> I've also added two pr_info() messages to get a better idea of what's
> >>>> going on, so please grep dmesg for "Thermal zone not ready" and
> >>>> "Enabling thermal zone".
> >>>
> >>> This is the output with the patch applied:
> >>
> >> The ax200 wlan interface is currently not up/ configured (system
> >> using its wired ethernet cards instead), the thermal_zone1 stops
> >> if I manually enable the interface (ip link set dev wlp4s0 up)
> >> after booting up:
> >
> > This explains it, thanks!
> >
> > The enabling of the thermal zone in iwl_mvm_load_ucode_wait_alive() is
> > premature or it should get disabled in the other two places that clear
> > the IWL_MVM_STATUS_FIRMWARE_RUNNING bit.
> >
> > I'm not sure why the thermal zone depends on whether or not this bit
> > is set, though. Is it really a good idea to return errors from it if
> > the interface is not up?
> >
> >> $ dmesg | grep -i -e iwlwifi -e thermal
> >> [...]
> >
> > Thanks for this data point!
> >
> > AFAICS the thermal zone in iwlwifi is always enabled, but only valid
> > if the interface is up.  It looks to me like the thermal core needs a
> > special "don't poll me" error code to be returned in such cases.
>
> From my POV, it is not up to the thermal core to adapt to the driver.

The core provides a service to its users, not the other way around,
and this is a valid use case.

The owner of the thermal zone knows that it is only useful when the
interface is up and so it should be possible for them to indicate to
the core that, for the time being, nothing needs to be done.

> Usually network devices have ops when they are transitioning to up or
> down, would it make sense to move enable / disable the thermal zone in
> these ops ?

Not really, because it can be enabled and disabled via sysfs in the meantime.
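
A small userspace model of that objection, as a sketch only: it assumes the driver toggles the zone from its interface up/down ops and that user space can still flip the zone through its sysfs "mode" attribute. All fake_* names are hypothetical and none of this is kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a zone owned by a network driver. */
struct fake_zone {
	bool enabled;	/* zone mode as seen by the core */
	bool iface_up;	/* whether the interface is up */
};

/* Models the driver's "interface up" op enabling the zone. */
static void fake_ndo_open(struct fake_zone *z)
{
	z->iface_up = true;
	z->enabled = true;
}

/* Models the driver's "interface down" op disabling the zone. */
static void fake_ndo_stop(struct fake_zone *z)
{
	z->iface_up = false;
	z->enabled = false;
}

/* Models a write to the zone's sysfs "mode" attribute from user space. */
static void fake_sysfs_mode_store(struct fake_zone *z, bool enable)
{
	z->enabled = enable;
}

/*
 * Models one polling pass: returns 0 if the zone is disabled and skipped,
 * 1 if a temperature was read, -ENODATA if the zone was polled but the
 * interface is down.
 */
static int fake_poll(const struct fake_zone *z)
{
	if (!z->enabled)
		return 0;

	return z->iface_up ? 1 : -ENODATA;
}
```

After fake_ndo_stop() the zone is skipped, but a later fake_sysfs_mode_store(z, true) brings back the failing reads even though the driver did everything right in its ops, so the callback still needs a way to say "not now".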
Rafael J. Wysocki July 16, 2024, 12:30 p.m. UTC | #24
On Tue, Jul 16, 2024 at 1:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Jul 16, 2024 at 1:15 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> >
> > Hi
> >
> > On 2024-07-16, Stefan Lippers-Hollmann wrote:
> > > On 2024-07-16, Rafael J. Wysocki wrote:
> > > > On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > > On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > > > > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > > > > > > > <daniel.lezcano@linaro.org> wrote:
> > > > > > > > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > > > > > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > > > > > > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > [...]
> > > > > > > Silencing the warnings is already a big improvement - and that patch
> > > > > > > works to this extent for me with an ax200, thanks.
> > > > > >
> > > > > > So attached is a patch that should avoid enabling the thermal zone
> > > > > > when it is not ready for use in the first place, so it should address
> > > > > > both the message and the useless polling.
> > > > > >
> > > > > > I would appreciate giving it a go (please note that it hasn't received
> > > > > > much testing so far, though).
> > > > >
> > > > > Sadly this patch doesn't seem to help:
> > > >
> > > > This is likely because it is missing checks for firmware image type.
> > > > I've added them to the attached new version.  Please try it.
> > > >
> > > > I've also added two pr_info() messages to get a better idea of what's
> > > > going on, so please grep dmesg for "Thermal zone not ready" and
> > > > "Enabling thermal zone".
> > >
> > > This is the output with the patch applied:
> >
> > The ax200 wlan interface is currently not up/ configured (system
> > using its wired ethernet cards instead), the thermal_zone1 stops
> > if I manually enable the interface (ip link set dev wlp4s0 up)
> > after booting up:
>
> This explains it, thanks!
>
> The enabling of the thermal zone in iwl_mvm_load_ucode_wait_alive() is
> premature or it should get disabled in the other two places that clear
> the IWL_MVM_STATUS_FIRMWARE_RUNNING bit.
>
> I'm not sure why the thermal zone depends on whether or not this bit
> is set, though. Is it really a good idea to return errors from it if
> the interface is not up?
>
> > $ dmesg | grep -i -e iwlwifi -e thermal
> > [...]
>
> Thanks for this data point!
>
> AFAICS the thermal zone in iwlwifi is always enabled, but only valid
> if the interface is up.  It looks to me like the thermal core needs a
> special "don't poll me" error code to be returned in such cases.

Attached is a thermal core patch with an iwlwifi piece along the lines
above (tested lightly).  It adds a way for a driver to indicate that
temperature cannot be provided at the moment, but that's OK and the
core need not worry about that.

Please give it a go.
Stefan Lippers-Hollmann July 16, 2024, 1:20 p.m. UTC | #25
Hi

On 2024-07-16, Rafael J. Wysocki wrote:
> On Tue, Jul 16, 2024 at 1:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Tue, Jul 16, 2024 at 1:15 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > On 2024-07-16, Stefan Lippers-Hollmann wrote:
> > > > On 2024-07-16, Rafael J. Wysocki wrote:
> > > > > On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > > > On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > > > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > > > > > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > > > > > > > > <daniel.lezcano@linaro.org> wrote:
> > > > > > > > > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > > > > > > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > > > > > > > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > > [...]
> > > > > > > > Silencing the warnings is already a big improvement - and that patch
> > > > > > > > works to this extent for me with an ax200, thanks.
> > > > > > >
> > > > > > > So attached is a patch that should avoid enabling the thermal zone
> > > > > > > when it is not ready for use in the first place, so it should address
> > > > > > > both the message and the useless polling.
> > > > > > >
> > > > > > > I would appreciate giving it a go (please note that it hasn't received
> > > > > > > much testing so far, though).
> > > > > >
> > > > > > Sadly this patch doesn't seem to help:
> > > > >
> > > > > This is likely because it is missing checks for firmware image type.
> > > > > I've added them to the attached new version.  Please try it.
> > > > >
> > > > > I've also added two pr_info() messages to get a better idea of what's
> > > > > going on, so please grep dmesg for "Thermal zone not ready" and
> > > > > "Enabling thermal zone".
> > > >
> > > > This is the output with the patch applied:
> > >
> > > The ax200 wlan interface is currently not up/ configured (system
> > > using its wired ethernet cards instead), the thermal_zone1 stops
> > > if I manually enable the interface (ip link set dev wlp4s0 up)
> > > after booting up:
> >
> > This explains it, thanks!
> >
> > The enabling of the thermal zone in iwl_mvm_load_ucode_wait_alive() is
> > premature or it should get disabled in the other two places that clear
> > the IWL_MVM_STATUS_FIRMWARE_RUNNING bit.
> >
> > I'm not sure why the thermal zone depends on whether or not this bit
> > is set, though. Is it really a good idea to return errors from it if
> > the interface is not up?
[...]
> > > [   22.033468] thermal thermal_zone1: failed to read out thermal zone (-61)
> > > [   22.213120] thermal thermal_zone1: Enabling thermal zone
> > > [   22.283954] iwlwifi 0000:04:00.0: Registered PHC clock: iwlwifi-PTP, with index: 0
> >
> > Thanks for this data point!
> >
> > AFAICS the thermal zone in iwlwifi is always enabled, but only valid
> > if the interface is up.  It looks to me like the thermal core needs a
> > special "don't poll me" error code to be returned in such cases.
>
> Attached is a thermal core patch with an iwlwifi piece along the lines
> above (tested lightly).  It adds a way for a driver to indicate that
> temperature cannot be provided at the moment, but that's OK and the
> core need not worry about that.
>
> Please give it a go.

This seems to fail to build on top of v6.10, should I test Linus' HEAD
or some staging tree instead?

[ I will be offline for the next few hours now, but will test it as soon
  as possible, probably in ~9-10 hours ]

  CC      drivers/thermal/thermal_core.o
drivers/thermal/thermal_core.c: In function 'handle_thermal_trip':
drivers/thermal/thermal_core.c:383:37: error: 'THERMAL_TEMP_INIT' undeclared (first use in this function); did you mean 'THERMAL_TEMP_INVALID'?
  383 |             tz->last_temperature != THERMAL_TEMP_INIT) {
      |                                     ^~~~~~~~~~~~~~~~~
      |                                     THERMAL_TEMP_INVALID
drivers/thermal/thermal_core.c:383:37: note: each undeclared identifier is reported only once for each function it appears in
drivers/thermal/thermal_core.c: In function 'thermal_zone_device_init':
drivers/thermal/thermal_core.c:432:27: error: 'THERMAL_TEMP_INIT' undeclared (first use in this function); did you mean 'THERMAL_TEMP_INVALID'?
  432 |         tz->temperature = THERMAL_TEMP_INIT;
      |                           ^~~~~~~~~~~~~~~~~
      |                           THERMAL_TEMP_INVALID

Regards
	Stefan Lippers-Hollmann
Rafael J. Wysocki July 16, 2024, 2:04 p.m. UTC | #26
On Tue, Jul 16, 2024 at 3:20 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
>
> Hi
>
> On 2024-07-16, Rafael J. Wysocki wrote:
> > On Tue, Jul 16, 2024 at 1:36 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > > On Tue, Jul 16, 2024 at 1:15 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > On 2024-07-16, Stefan Lippers-Hollmann wrote:
> > > > > On 2024-07-16, Rafael J. Wysocki wrote:
> > > > > > On Tue, Jul 16, 2024 at 1:48 AM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > > > > On Mon, Jul 15, 2024 at 2:54 PM Stefan Lippers-Hollmann <s.l-h@gmx.de> wrote:
> > > > > > > > > On 2024-07-15, Rafael J. Wysocki wrote:
> > > > > > > > > > On Mon, Jul 15, 2024 at 11:09 AM Daniel Lezcano
> > > > > > > > > > <daniel.lezcano@linaro.org> wrote:
> > > > > > > > > > > On 15/07/2024 06:45, Eric Biggers wrote:
> > > > > > > > > > > > On Thu, Jul 04, 2024 at 01:46:26PM +0200, Rafael J. Wysocki wrote:
> > > > > > > > > > > >> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > > > > > [...]
> > > > > > > > > Silencing the warnings is already a big improvement - and that patch
> > > > > > > > > works to this extent for me with an ax200, thanks.
> > > > > > > >
> > > > > > > > So attached is a patch that should avoid enabling the thermal zone
> > > > > > > > when it is not ready for use in the first place, so it should address
> > > > > > > > both the message and the useless polling.
> > > > > > > >
> > > > > > > > I would appreciate giving it a go (please note that it hasn't received
> > > > > > > > much testing so far, though).
> > > > > > >
> > > > > > > Sadly this patch doesn't seem to help:
> > > > > >
> > > > > > This is likely because it is missing checks for firmware image type.
> > > > > > I've added them to the attached new version.  Please try it.
> > > > > >
> > > > > > I've also added two pr_info() messages to get a better idea of what's
> > > > > > going on, so please grep dmesg for "Thermal zone not ready" and
> > > > > > "Enabling thermal zone".
> > > > >
> > > > > This is the output with the patch applied:
> > > >
> > > > The ax200 wlan interface is currently not up/ configured (system
> > > > using its wired ethernet cards instead), the thermal_zone1 stops
> > > > if I manually enable the interface (ip link set dev wlp4s0 up)
> > > > after booting up:
> > >
> > > This explains it, thanks!
> > >
> > > The enabling of the thermal zone in iwl_mvm_load_ucode_wait_alive() is
> > > premature or it should get disabled in the other two places that clear
> > > the IWL_MVM_STATUS_FIRMWARE_RUNNING bit.
> > >
> > > I'm not sure why the thermal zone depends on whether or not this bit
> > > is set, though. Is it really a good idea to return errors from it if
> > > the interface is not up?
> [...]
> > > > [   22.033468] thermal thermal_zone1: failed to read out thermal zone (-61)
> > > > [   22.213120] thermal thermal_zone1: Enabling thermal zone
> > > > [   22.283954] iwlwifi 0000:04:00.0: Registered PHC clock: iwlwifi-PTP, with index: 0
> > >
> > > Thanks for this data point!
> > >
> > > AFAICS the thermal zone in iwlwifi is always enabled, but only valid
> > > if the interface is up.  It looks to me like the thermal core needs a
> > > special "don't poll me" error code to be returned in such cases.
> >
> > Attached is a thermal core patch with an iwlwifi piece along the lines
> > above (tested lightly).  It adds a way for a driver to indicate that
> > temperature cannot be provided at the moment, but that's OK and the
> > core need not worry about that.
> >
> > Please give it a go.
>
> This seems to fail to build on top of v6.10, should I test Linus' HEAD
> or some staging tree instead?

No, it's missing one hunk, sorry about that.

> [ I will be offline for the next few hours now, but will test it as soon
>   as possible, probably in ~9-10 hours ]

No worries and thanks for your persistence!

>   CC      drivers/thermal/thermal_core.o
> drivers/thermal/thermal_core.c: In function 'handle_thermal_trip':
> drivers/thermal/thermal_core.c:383:37: error: 'THERMAL_TEMP_INIT' undeclared (first use in this function); did you mean 'THERMAL_TEMP_INVALID'?
>   383 |             tz->last_temperature != THERMAL_TEMP_INIT) {
>       |                                     ^~~~~~~~~~~~~~~~~
>       |                                     THERMAL_TEMP_INVALID
> drivers/thermal/thermal_core.c:383:37: note: each undeclared identifier is reported only once for each function it appears in
> drivers/thermal/thermal_core.c: In function 'thermal_zone_device_init':
> drivers/thermal/thermal_core.c:432:27: error: 'THERMAL_TEMP_INIT' undeclared (first use in this function); did you mean 'THERMAL_TEMP_INVALID'?
>   432 |         tz->temperature = THERMAL_TEMP_INIT;
>       |                           ^~~~~~~~~~~~~~~~~
>       |                           THERMAL_TEMP_INVALID
>

Attached is a new version that builds for me on top of plain 6.10.
Oleksandr Natalenko July 16, 2024, 4:37 p.m. UTC | #27
Hello.

On Tuesday, July 16, 2024, 16:04:16 CEST, Rafael J. Wysocki wrote:
> [...]
>
> Attached is a new version that builds for me on top of plain 6.10.
> 

This builds and runs fine for me, no dmesg spamming any more. In `sensors` I get this:

```
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:       -274.0°C
```

(very beneficial during the heat wave)
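
The -274.0°C reading is most likely the thermal core's invalid-temperature sentinel rather than a measurement: the kernel defines `THERMAL_TEMP_INVALID` as -274000 millidegrees Celsius, just below absolute zero, and `sensors` shows millidegree values divided by 1000. A minimal sketch of that conversion, assuming a hypothetical helper `format_zone_temp()` that is not part of the kernel or lm-sensors:

```c
#include <stdio.h>

/* Same value as the kernel's THERMAL_TEMP_INVALID (millidegrees Celsius). */
#define THERMAL_TEMP_INVALID	(-274000)

/*
 * Hypothetical helper: format a millidegree reading with one decimal,
 * similar to what `sensors` displays. Returns 1 if the value is a real
 * temperature and 0 if it is the invalid-temperature sentinel.
 */
static int format_zone_temp(int millicelsius, char *buf, size_t len)
{
	snprintf(buf, len, "%.1f", millicelsius / 1000.0);
	return millicelsius != THERMAL_TEMP_INVALID;
}
```

Calling `format_zone_temp(-274000, buf, sizeof(buf))` fills `buf` with "-274.0" and returns 0, which matches the output above and suggests the zone is being reported as "no valid temperature" instead of producing read errors.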

There are no "thermal" messages in dmesg whatsoever, any other info you'd like me to provide?

Also, feel free to add:

Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>

Thank you.
Rafael J. Wysocki July 16, 2024, 5:03 p.m. UTC | #28
Hi,

On Tue, Jul 16, 2024 at 6:38 PM Oleksandr Natalenko
<oleksandr@natalenko.name> wrote:
>
> Hello.
>
> On Tuesday, July 16, 2024, 16:04:16 CEST, Rafael J. Wysocki wrote:
> > [...]
> >
> > Attached is a new version that builds for me on top of plain 6.10.
> >
>
> This builds and runs fine for me, no dmesg spamming any more. In `sensors` I get this:
>
> ```
> iwlwifi_1-virtual-0
> Adapter: Virtual device
> temp1:       -274.0°C
> ```
>
> (very beneficial during the heat wave)
>
> There are no "thermal" messages in dmesg whatsoever, any other info you'd like me to provide?

No, thank you, it works as expected.

> Also, feel free to add:
>
> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>

Thanks!

Patch
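
The core of the change below is the retry schedule in monitor_thermal_zone(): when polling is disabled and the zone temperature is invalid, the recheck delay starts at 1 jiffy and doubles after each scheduled recheck until it exceeds THERMAL_MAX_RECHECK_DELAY (30 * HZ), at which point the core gives up. A minimal user-space sketch of that schedule follows. The names `struct zone_sim`, `next_recheck_delay()` and `count_rechecks()` are hypothetical and exist only for illustration, and HZ is passed as a parameter because its value depends on the kernel configuration.

```c
/* Hypothetical stand-in for the one field of struct thermal_zone_device used here. */
struct zone_sim {
	unsigned long recheck_delay_jiffies;
};

/*
 * Mirrors the new branch in monitor_thermal_zone(): return the delay to
 * schedule, or 0 once the delay has grown past the limit (giving up).
 */
static unsigned long next_recheck_delay(struct zone_sim *tz, unsigned long max_delay)
{
	unsigned long delay = tz->recheck_delay_jiffies;

	if (delay > max_delay)
		return 0;

	/* Double the recheck delay for the next attempt. */
	tz->recheck_delay_jiffies += tz->recheck_delay_jiffies;
	return delay;
}

/* Count the rechecks scheduled before giving up; the summed delay goes to *total_jiffies. */
static unsigned int count_rechecks(unsigned long hz, unsigned long *total_jiffies)
{
	struct zone_sim tz = { .recheck_delay_jiffies = 1 };
	unsigned long max_delay = 30 * hz;
	unsigned long delay;
	unsigned int n = 0;

	*total_jiffies = 0;
	while ((delay = next_recheck_delay(&tz, max_delay)) != 0) {
		*total_jiffies += delay;
		n++;
	}
	return n;
}
```

Assuming HZ = 250, `count_rechecks(250, &total)` returns 13 with `total` equal to 8191 jiffies, so the core would keep retrying for roughly 33 seconds. In the patch itself, a successful read in update_temperature() resets the delay to 1, so the backoff starts over the next time the temperature becomes invalid.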

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -300,6 +300,14 @@  static void monitor_thermal_zone(struct
 		thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
 	else if (tz->polling_delay_jiffies)
 		thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
+	else if (tz->temperature == THERMAL_TEMP_INVALID &&
+		 tz->recheck_delay_jiffies <= THERMAL_MAX_RECHECK_DELAY) {
+		thermal_zone_device_set_polling(tz, tz->recheck_delay_jiffies);
+		/* Double the recheck delay for the next attempt. */
+		tz->recheck_delay_jiffies += tz->recheck_delay_jiffies;
+		if (tz->recheck_delay_jiffies > THERMAL_MAX_RECHECK_DELAY)
+			dev_info(&tz->device, "Temperature unknown, giving up\n");
+	}
 }
 
 static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
@@ -430,6 +438,7 @@  static void update_temperature(struct th
 
 	tz->last_temperature = tz->temperature;
 	tz->temperature = temp;
+	tz->recheck_delay_jiffies = 1;
 
 	trace_thermal_temperature(tz);
 
@@ -514,7 +523,7 @@  void __thermal_zone_device_update(struct
 	update_temperature(tz);
 
 	if (tz->temperature == THERMAL_TEMP_INVALID)
-		return;
+		goto monitor;
 
 	tz->notify_event = event;
 
@@ -536,6 +545,7 @@  void __thermal_zone_device_update(struct
 
 	thermal_debug_update_trip_stats(tz);
 
+monitor:
 	monitor_thermal_zone(tz);
 }
 
@@ -1438,6 +1448,7 @@  thermal_zone_device_register_with_trips(
 
 	thermal_set_delay_jiffies(&tz->passive_delay_jiffies, passive_delay);
 	thermal_set_delay_jiffies(&tz->polling_delay_jiffies, polling_delay);
+	tz->recheck_delay_jiffies = 1;
 
 	/* sys I/F */
 	/* Add nodes that are always present via .groups */
Index: linux-pm/drivers/thermal/thermal_core.h
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.h
+++ linux-pm/drivers/thermal/thermal_core.h
@@ -67,6 +67,8 @@  struct thermal_governor {
  * @polling_delay_jiffies: number of jiffies to wait between polls when
  *			checking whether trip points have been crossed (0 for
  *			interrupt driven systems)
+ * @recheck_delay_jiffies: delay after a failed thermal zone temperature check
+ * 			before attempting to check it again
  * @temperature:	current temperature.  This is only for core code,
  *			drivers should use thermal_zone_get_temp() to get the
  *			current temperature
@@ -108,6 +110,7 @@  struct thermal_zone_device {
 	int num_trips;
 	unsigned long passive_delay_jiffies;
 	unsigned long polling_delay_jiffies;
+	unsigned long recheck_delay_jiffies;
 	int temperature;
 	int last_temperature;
 	int emul_temperature;
@@ -133,6 +136,12 @@  struct thermal_zone_device {
 	struct thermal_trip_desc trips[] __counted_by(num_trips);
 };
 
+/*
+ * Maximum delay after a failing thermal zone temperature check before
+ * attempting to check it again (in jiffies).
+ */
+#define THERMAL_MAX_RECHECK_DELAY	(30 * HZ)
+
 /* Default Thermal Governor */
 #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
 #define DEFAULT_THERMAL_GOVERNOR       "step_wise"