diff mbox series

[v1,12/17] platform/x86: acerhdf: Use the .should_bind() thermal zone callback

Message ID 2242500.C4sosBPzcN@rjwysocki.net (mailing list archive)
State Changes Requested, archived
Headers show
Series None | expand

Commit Message

Rafael J. Wysocki July 30, 2024, 6:33 p.m. UTC
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Make the acerhdf driver use the .should_bind() thermal zone
callback to provide the thermal core with the information on whether or
not to bind the given cooling device to the given trip point in the
given thermal zone.  If it returns 'true', the thermal core will bind
the cooling device to the trip and the corresponding unbinding will be
taken care of automatically by the core on the removal of the involved
thermal zone or cooling device.

The previously existing acerhdf_bind() function bound cooling devices
to thermal trip point 0 only, so the new callback needs to return 'true'
for trip point 0.  However, it is straightforward to observe that trip
point 0 is an active trip point and the only other trip point in the
driver's thermal zone is a critical one, so it is sufficient to return
'true' from that callback if the type of the given trip point is
THERMAL_TRIP_ACTIVE.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

This patch only depends on patch [09/17].

---
 drivers/platform/x86/acerhdf.c |   33 ++++++---------------------------
 1 file changed, 6 insertions(+), 27 deletions(-)

Comments

Peter Kästle July 31, 2024, 8:50 p.m. UTC | #1
Hi Rafael,

On 30.07.24 20:33, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> 
> Make the acerhdf driver use the .should_bind() thermal zone
> callback to provide the thermal core with the information on whether or
> not to bind the given cooling device to the given trip point in the
> given thermal zone.  If it returns 'true', the thermal core will bind
> the cooling device to the trip and the corresponding unbinding will be
> taken care of automatically by the core on the removal of the involved
> thermal zone or cooling device.
> 
> The previously existing acerhdf_bind() function bound cooling devices
> to thermal trip point 0 only, so the new callback needs to return 'true'
> for trip point 0.  However, it is straightforward to observe that trip
> point 0 is an active trip point and the only other trip point in the
> driver's thermal zone is a critical one, so it is sufficient to return
> 'true' from that callback if the type of the given trip point is
> THERMAL_TRIP_ACTIVE.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Thanks for including me on the review.
I'm working on it, but unfortunately the refactoring of the thermal layer
around gov_bang_bang.c earlier this year broke acerhdf.
This needs some debugging and refactoring.  I think I can finish it on
upcoming weekend.
Rafael J. Wysocki Aug. 1, 2024, 10:14 a.m. UTC | #2
Hi Peter,

On Wed, Jul 31, 2024 at 10:50 PM Peter Kästle <xypiie@gmail.com> wrote:
>
> Hi Rafael,
>
> On 30.07.24 20:33, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >
> > Make the acerhdf driver use the .should_bind() thermal zone
> > callback to provide the thermal core with the information on whether or
> > not to bind the given cooling device to the given trip point in the
> > given thermal zone.  If it returns 'true', the thermal core will bind
> > the cooling device to the trip and the corresponding unbinding will be
> > taken care of automatically by the core on the removal of the involved
> > thermal zone or cooling device.
> >
> > The previously existing acerhdf_bind() function bound cooling devices
> > to thermal trip point 0 only, so the new callback needs to return 'true'
> > for trip point 0.  However, it is straightforward to observe that trip
> > point 0 is an active trip point and the only other trip point in the
> > driver's thermal zone is a critical one, so it is sufficient to return
> > 'true' from that callback if the type of the given trip point is
> > THERMAL_TRIP_ACTIVE.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>
> Thanks for including me on the review.
> I'm working on it, but unfortunately the refactoring of the thermal layer
> around gov_bang_bang.c earlier this year broke acerhdf.

Well, sorry about that.

> This needs some debugging and refactoring.  I think I can finish it on
> upcoming weekend.

Thank you!

I'll be offline next week, so I'll go back to this material in two
weeks or so anyway.
Peter Kaestle Aug. 5, 2024, 9:51 p.m. UTC | #3
Hi Rafael,

On 01.08.24 12:14, Rafael J. Wysocki wrote:
> Hi Peter,
> 
> On Wed, Jul 31, 2024 at 10:50 PM Peter Kästle <xypiie@gmail.com> wrote:
>>
>> Hi Rafael,
>>
>> On 30.07.24 20:33, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>>
>>> Make the acerhdf driver use the .should_bind() thermal zone
>>> callback to provide the thermal core with the information on whether or
>>> not to bind the given cooling device to the given trip point in the
>>> given thermal zone.  If it returns 'true', the thermal core will bind
>>> the cooling device to the trip and the corresponding unbinding will be
>>> taken care of automatically by the core on the removal of the involved
>>> thermal zone or cooling device.
>>>
>>> The previously existing acerhdf_bind() function bound cooling devices
>>> to thermal trip point 0 only, so the new callback needs to return 'true'
>>> for trip point 0.  However, it is straightforward to observe that trip
>>> point 0 is an active trip point and the only other trip point in the
>>> driver's thermal zone is a critical one, so it is sufficient to return
>>> 'true' from that callback if the type of the given trip point is
>>> THERMAL_TRIP_ACTIVE.
>>>
>>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>>
>> Thanks for including me on the review.
>> I'm working on it, but unfortunately the refactoring of the thermal layer
>> around gov_bang_bang.c earlier this year broke acerhdf.
> 
> Well, sorry about that.

I already fixed the main problem, which caused full malfunction of acerhdf:

The new functionality of .trip_crossed in the gov_bang_bang is missing an
initial check, whether the temperature is below the fanoff temperature
(trip_point.temperature - hysteresis) and by that it did not turn the
fan off.  This then caused that the system will never heat up as much to
cross the upper temperature. As a consequence it could never cross the
lower temperature to turn the fan off. -> Fan was locked always on.
And that's obviously not what we want.
As I didn't find any API call, to ask the governor doing that for me, I
added an "acerhdf_init_fan()" functionality into acerhdf init function right
after registering the thermal zone (and on resume from suspend) which turns
the fan off if the temperature is lower than the fanoff parameter.
Probably not the nicest solution, but maybe the most pragmatic one without
touching the thermal layer.

>> This needs some debugging and refactoring.  I think I can finish it on
>> upcoming weekend.
> 
> Thank you!

I'll need some more time to check why other features of acerhdf broke:
* interval cannot be changed to longer than one second.
   No idea yet, do you have any idea?  Maybe I'll simply drop this
   functionality, as there's no big overhead by polling every second.
* changing /sys/module/acerhdf/parameters/{fanon,fanoff} at runtime
   to change the trip point settings stopped working.  This needs some
   restructuring using module_param_cb callbacks.

> I'll be offline next week, so I'll go back to this material in two
> weeks or so anyway.

I still need some time to fix the remaining part anyhow.  Once this is
done, I'll check your latest patch series and send my acerhdf rework for
review / submission.
Rafael J. Wysocki Aug. 12, 2024, 2:56 p.m. UTC | #4
Hi Peter,

On Tue, Aug 6, 2024 at 12:10 AM Peter Kästle <peter@piie.net> wrote:
>
> Hi Rafael,
>
> On 01.08.24 12:14, Rafael J. Wysocki wrote:
> > Hi Peter,
> >
> > On Wed, Jul 31, 2024 at 10:50 PM Peter Kästle <xypiie@gmail.com> wrote:
> >>
> >> Hi Rafael,
> >>
> >> On 30.07.24 20:33, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>>
> >>> Make the acerhdf driver use the .should_bind() thermal zone
> >>> callback to provide the thermal core with the information on whether or
> >>> not to bind the given cooling device to the given trip point in the
> >>> given thermal zone.  If it returns 'true', the thermal core will bind
> >>> the cooling device to the trip and the corresponding unbinding will be
> >>> taken care of automatically by the core on the removal of the involved
> >>> thermal zone or cooling device.
> >>>
> >>> The previously existing acerhdf_bind() function bound cooling devices
> >>> to thermal trip point 0 only, so the new callback needs to return 'true'
> >>> for trip point 0.  However, it is straightforward to observe that trip
> >>> point 0 is an active trip point and the only other trip point in the
> >>> driver's thermal zone is a critical one, so it is sufficient to return
> >>> 'true' from that callback if the type of the given trip point is
> >>> THERMAL_TRIP_ACTIVE.
> >>>
> >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> >>
> >> Thanks for including me on the review.
> >> I'm working on it, but unfortunately the refactoring of the thermal layer
> >> around gov_bang_bang.c earlier this year broke acerhdf.
> >
> > Well, sorry about that.
>
> I already fixed the main problem, which caused full malfunction of acerhdf:
>
> The new functionality of .trip_crossed in the gov_bang_bang is missing an
> initial check, whether the temperature is below the fanoff temperature
> (trip_point.temperature - hysteresis) and by that it did not turn the
> fan off.

So IIUC this is when the fan starts in the "on" state and the thermal
core is expected to turn it off when the zone temperature is not in
fact above the trip point low temperature.

> This then caused that the system will never heat up as much to
> cross the upper temperature. As a consequence it could never cross the
> lower temperature to turn the fan off. -> Fan was locked always on.
> And that's obviously not what we want.

Sure.

> As I didn't find any API call, to ask the governor doing that for me, I
> added an "acerhdf_init_fan()" functionality into acerhdf init function right
> after registering the thermal zone (and on resume from suspend) which turns
> the fan off if the temperature is lower than the fanoff parameter.
> Probably not the nicest solution, but maybe the most pragmatic one without
> touching the thermal layer.

Well, this issue may not be limited to the acerhdf case, so it may be
good to address it in the thermal core.  There is kind of a
chicken-and-egg situation in there, however, because the cooling
device state is determined by the governor which only runs when it is
called, but the core doesn't know that the governor should be invoked
when there are no trip point crossing events.

It may just be a matter of adding an ->update_tz() callback to the
bang-bang governor, let me see.

> >> This needs some debugging and refactoring.  I think I can finish it on
> >> upcoming weekend.
> >
> > Thank you!
>
> I'll need some more time to check why other features of acerhdf broke:
> * interval cannot be changed to longer than one second.
>    No idea yet, do you have any idea?

No, I don't, but I'll have a look.

>  Maybe I'll simply drop this
>    functionality, as there's no big overhead by polling every second.

No, there isn't, but anyway it would be good to know why this does not
work as expected any more.

> * changing /sys/module/acerhdf/parameters/{fanon,fanoff} at runtime
>    to change the trip point settings stopped working.  This needs some
>    restructuring using module_param_cb callbacks.

I'll have a look at this too.

> > I'll be offline next week, so I'll go back to this material in two
> > weeks or so anyway.
>
> I still need some time to fix the remaining part anyhow.  Once this is
> done, I'll check your latest patch series and send my acerhdf rework for
> review / submission.

Sure.

In the meantime, I have resent the series including the acerhdf
change, but it is the same as before.

Thanks!
Rafael J. Wysocki Aug. 12, 2024, 4:15 p.m. UTC | #5
Hi Peter,

On Mon, Aug 12, 2024 at 4:56 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Tue, Aug 6, 2024 at 12:10 AM Peter Kästle <peter@piie.net> wrote:
> >
> > Hi Rafael,
> >
> > On 01.08.24 12:14, Rafael J. Wysocki wrote:
> > > Hi Peter,
> > >
> > > On Wed, Jul 31, 2024 at 10:50 PM Peter Kästle <xypiie@gmail.com> wrote:
> > >>
> > >> Hi Rafael,
> > >>
> > >> On 30.07.24 20:33, Rafael J. Wysocki wrote:
> > >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >>>
> > >>> Make the acerhdf driver use the .should_bind() thermal zone
> > >>> callback to provide the thermal core with the information on whether or
> > >>> not to bind the given cooling device to the given trip point in the
> > >>> given thermal zone.  If it returns 'true', the thermal core will bind
> > >>> the cooling device to the trip and the corresponding unbinding will be
> > >>> taken care of automatically by the core on the removal of the involved
> > >>> thermal zone or cooling device.
> > >>>
> > >>> The previously existing acerhdf_bind() function bound cooling devices
> > >>> to thermal trip point 0 only, so the new callback needs to return 'true'
> > >>> for trip point 0.  However, it is straightforward to observe that trip
> > >>> point 0 is an active trip point and the only other trip point in the
> > >>> driver's thermal zone is a critical one, so it is sufficient to return
> > >>> 'true' from that callback if the type of the given trip point is
> > >>> THERMAL_TRIP_ACTIVE.
> > >>>
> > >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > >>
> > >> Thanks for including me on the review.
> > >> I'm working on it, but unfortunately the refactoring of the thermal layer
> > >> around gov_bang_bang.c earlier this year broke acerhdf.
> > >
> > > Well, sorry about that.
> >
> > I already fixed the main problem, which caused full malfunction of acerhdf:
> >
> > The new functionality of .trip_crossed in the gov_bang_bang is missing an
> > initial check, whether the temperature is below the fanoff temperature
> > (trip_point.temperature - hysteresis) and by that it did not turn the
> > fan off.
>
> So IIUC this is when the fan starts in the "on" state and the thermal
> core is expected to turn it off when the zone temperature is not in
> fact above the trip point low temperature.
>
> > This then caused that the system will never heat up as much to
> > cross the upper temperature. As a consequence it could never cross the
> > lower temperature to turn the fan off. -> Fan was locked always on.
> > And that's obviously not what we want.
>
> Sure.
>
> > As I didn't find any API call, to ask the governor doing that for me, I
> > added an "acerhdf_init_fan()" functionality into acerhdf init function right
> > after registering the thermal zone (and on resume from suspend) which turns
> > the fan off if the temperature is lower than the fanoff parameter.
> > Probably not the nicest solution, but maybe the most pragmatic one without
> > touching the thermal layer.
>
> Well, this issue may not be limited to the acerhdf case, so it may be
> good to address it in the thermal core.  There is kind of a
> chicken-and-egg situation in there, however, because the cooling
> device state is determined by the governor which only runs when it is
> called, but the core doesn't know that the governor should be invoked
> when there are no trip point crossing events.
>
> It may just be a matter of adding an ->update_tz() callback to the
> bang-bang governor, let me see.

Attached is a patch doing that - and one more change to invoke this
callback during thermal zone resume.

It should apply to 6.11-rc3 (without any of the thermal core changes
in linux-next or on the list).

If I'm not mistaken, this should address the issue above.

> > >> This needs some debugging and refactoring.  I think I can finish it on
> > >> upcoming weekend.
> > >
> > > Thank you!
> >
> > I'll need some more time to check why other features of acerhdf broke:
> > * interval cannot be changed to longer than one second.
> >    No idea yet, do you have any idea?
>
> No, I don't, but I'll have a look.

This may be related to the way in which round_jiffies() is used by
thermal_set_delay_jiffies() in the thermal core.  I think it should
call round_jiffies_relative() instead, so you may try to replace
round_jiffies() with round_jiffies_relative() in
thermal_set_delay_jiffies() and see if that helps.

> >  Maybe I'll simply drop this
> >    functionality, as there's no big overhead by polling every second.
>
> No, there isn't, but anyway it would be good to know why this does not
> work as expected any more.
>
> > * changing /sys/module/acerhdf/parameters/{fanon,fanoff} at runtime
> >    to change the trip point settings stopped working.  This needs some
> >    restructuring using module_param_cb callbacks.
>
> I'll have a look at this too.

So acerhdf_check_param() updates the trips table, but it is only
called from interval_set_uint(), so it will only trigger when the
interval module param is updated AFAICS.

Apart from this, it is not sufficient to update the original trips
table passed to register_thermal_zone_device_with_trips() because the
core makes a copy of it and does not use the original one any more
after zone registration.

What you need is to either mark the trip point as writable (set
THERMAL_TRIP_FLAG_RW_TEMP and THERMAL_TRIP_FLAG_RW_HYST for it in
flags), but that will not engage the fanon/fanoff sanity check in the
driver, or add callbacks for setting trip temperature and hysteresis
to the thermal zone operations and hook them up to the module
parameters.  However, you'd need to restore .set_trip_hyst() for this
which was removed because it was unused.

Thanks!
Rafael J. Wysocki Aug. 12, 2024, 6:09 p.m. UTC | #6
Hi Peter,

On Mon, Aug 12, 2024 at 6:15 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Mon, Aug 12, 2024 at 4:56 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Tue, Aug 6, 2024 at 12:10 AM Peter Kästle <peter@piie.net> wrote:
> > >
> > > Hi Rafael,
> > >
> > > On 01.08.24 12:14, Rafael J. Wysocki wrote:
> > > > Hi Peter,
> > > >
> > > > On Wed, Jul 31, 2024 at 10:50 PM Peter Kästle <xypiie@gmail.com> wrote:
> > > >>
> > > >> Hi Rafael,
> > > >>
> > > >> On 30.07.24 20:33, Rafael J. Wysocki wrote:
> > > >>> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >>>
> > > >>> Make the acerhdf driver use the .should_bind() thermal zone
> > > >>> callback to provide the thermal core with the information on whether or
> > > >>> not to bind the given cooling device to the given trip point in the
> > > >>> given thermal zone.  If it returns 'true', the thermal core will bind
> > > >>> the cooling device to the trip and the corresponding unbinding will be
> > > >>> taken care of automatically by the core on the removal of the involved
> > > >>> thermal zone or cooling device.
> > > >>>
> > > >>> The previously existing acerhdf_bind() function bound cooling devices
> > > >>> to thermal trip point 0 only, so the new callback needs to return 'true'
> > > >>> for trip point 0.  However, it is straightforward to observe that trip
> > > >>> point 0 is an active trip point and the only other trip point in the
> > > >>> driver's thermal zone is a critical one, so it is sufficient to return
> > > >>> 'true' from that callback if the type of the given trip point is
> > > >>> THERMAL_TRIP_ACTIVE.
> > > >>>
> > > >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > > >>
> > > >> Thanks for including me on the review.
> > > >> I'm working on it, but unfortunately the refactoring of the thermal layer
> > > >> around gov_bang_bang.c earlier this year broke acerhdf.
> > > >
> > > > Well, sorry about that.
> > >
> > > I already fixed the main problem, which caused full malfunction of acerhdf:
> > >
> > > The new functionality of .trip_crossed in the gov_bang_bang is missing an
> > > initial check, whether the temperature is below the fanoff temperature
> > > (trip_point.temperature - hysteresis) and by that it did not turn the
> > > fan off.
> >
> > So IIUC this is when the fan starts in the "on" state and the thermal
> > core is expected to turn it off when the zone temperature is not in
> > fact above the trip point low temperature.
> >
> > > This then caused that the system will never heat up as much to
> > > cross the upper temperature. As a consequence it could never cross the
> > > lower temperature to turn the fan off. -> Fan was locked always on.
> > > And that's obviously not what we want.
> >
> > Sure.
> >
> > > As I didn't find any API call, to ask the governor doing that for me, I
> > > added an "acerhdf_init_fan()" functionality into acerhdf init function right
> > > after registering the thermal zone (and on resume from suspend) which turns
> > > the fan off if the temperature is lower than the fanoff parameter.
> > > Probably not the nicest solution, but maybe the most pragmatic one without
> > > touching the thermal layer.
> >
> > Well, this issue may not be limited to the acerhdf case, so it may be
> > good to address it in the thermal core.  There is kind of a
> > chicken-and-egg situation in there, however, because the cooling
> > device state is determined by the governor which only runs when it is
> > called, but the core doesn't know that the governor should be invoked
> > when there are no trip point crossing events.
> >
> > It may just be a matter of adding an ->update_tz() callback to the
> > bang-bang governor, let me see.

This isn't the right approach because .update_tz() is called before
checking the zone temperature for the first time, but to remedy the
issue at hand, code needs to run when the zone temperature is known to
the thermal core.

This means that the Bang-bang governor needs a .manage() callback in
addition to the .trip_crossed() one, but that callback will only need
to check if the states of cooling devices bound to the trip points
below the zone temperature need to be adjusted, and just once.

So something like in the attached patch.
diff mbox series

Patch

Index: linux-pm/drivers/platform/x86/acerhdf.c
===================================================================
--- linux-pm.orig/drivers/platform/x86/acerhdf.c
+++ linux-pm/drivers/platform/x86/acerhdf.c
@@ -378,33 +378,13 @@  static int acerhdf_get_ec_temp(struct th
 	return 0;
 }
 
-static int acerhdf_bind(struct thermal_zone_device *thermal,
-			struct thermal_cooling_device *cdev)
+static bool acerhdf_should_bind(struct thermal_zone_device *thermal,
+				const struct thermal_trip *trip,
+				struct thermal_cooling_device *cdev,
+				struct cooling_spec *c)
 {
 	/* if the cooling device is the one from acerhdf bind it */
-	if (cdev != cl_dev)
-		return 0;
-
-	if (thermal_zone_bind_cooling_device(thermal, 0, cdev,
-			THERMAL_NO_LIMIT, THERMAL_NO_LIMIT,
-			THERMAL_WEIGHT_DEFAULT)) {
-		pr_err("error binding cooling dev\n");
-		return -EINVAL;
-	}
-	return 0;
-}
-
-static int acerhdf_unbind(struct thermal_zone_device *thermal,
-			  struct thermal_cooling_device *cdev)
-{
-	if (cdev != cl_dev)
-		return 0;
-
-	if (thermal_zone_unbind_cooling_device(thermal, 0, cdev)) {
-		pr_err("error unbinding cooling dev\n");
-		return -EINVAL;
-	}
-	return 0;
+	return cdev == cl_dev && trip->type == THERMAL_TRIP_ACTIVE;
 }
 
 static inline void acerhdf_revert_to_bios_mode(void)
@@ -447,8 +427,7 @@  static int acerhdf_get_crit_temp(struct
 
 /* bind callback functions to thermalzone */
 static struct thermal_zone_device_ops acerhdf_dev_ops = {
-	.bind = acerhdf_bind,
-	.unbind = acerhdf_unbind,
+	.should_bind = acerhdf_should_bind,
 	.get_temp = acerhdf_get_ec_temp,
 	.change_mode = acerhdf_change_mode,
 	.get_crit_temp = acerhdf_get_crit_temp,