mbox series

[v2,00/13] thermal: tegra: Do not register cooling device

Message ID 20231012175836.3408077-1-thierry.reding@gmail.com (mailing list archive)
Headers show
Series thermal: tegra: Do not register cooling device | expand

Message

Thierry Reding Oct. 12, 2023, 5:58 p.m. UTC
From: Thierry Reding <treding@nvidia.com>

Hi,

this set of patches removes the registration of the SOCTHERM internal
throttling mechanism as cooling device. Since this throttling starts
automatically once a certain temperature threshold is crossed, it
doesn't make sense to represent it as a cooling device, which are
typically "manually" activated by the thermal framework when thermal
sensors report temperature thresholds being crossed.

Instead of using the cooling device mechanism, this statically programs
the throttling mechanism when it is configured in device tree. In order
to do this, an additional device tree property is needed to replace the
information that was previously contained in trip points.

There's a few preparatory patches to make the removal a bit simpler and
also some follow up cleanups included as well.

Changes in v2:
- rework the device tree bindings:
  - add nvidia,thermal-zones property to attach throttling to zones
  - use -millicelsius suffix and add hysteresis
- add patch to store thermal zone device tree node for later use
- add patch to enforce self-encapsulation of the thermal core now that
  no drivers need to reach into it anymore

This applies on top of Daniel's self-encapsulation hardening series:

	https://lore.kernel.org/all/20231012102700.2858952-1-daniel.lezcano@linaro.org/

Thierry

Thierry Reding (13):
  thermal: Store device tree node for thermal zone devices
  dt-bindings: thermal: tegra: Document throttle temperature
  dt-bindings: thermal: tegra: Add nvidia,thermal-zones property
  thermal: tegra: Use driver-private data consistently
  thermal: tegra: Constify SoC-specific data
  thermal: tegra: Do not register cooling device
  thermal: tegra: Use unsigned int where appropriate
  thermal: tegra: Avoid over-allocation of temporary array
  thermal: tegra: Remove gratuitous error assignment
  thermal: tegra: Minor stylistic cleanups
  ARM: tegra: Rework SOCTHERM on Tegra124
  arm64: tegra: Rework SOCTHERM on Tegra132 and Tegra210
  thermal: Enforce self-encapsulation

 .../thermal/nvidia,tegra124-soctherm.yaml     |  19 +
 arch/arm/boot/dts/nvidia/tegra124.dtsi        |  68 +--
 arch/arm64/boot/dts/nvidia/tegra132.dtsi      |  66 +--
 arch/arm64/boot/dts/nvidia/tegra210.dtsi      |  86 +--
 drivers/thermal/tegra/soctherm.c              | 525 ++++++++----------
 drivers/thermal/tegra/soctherm.h              |   1 +
 drivers/thermal/tegra/tegra124-soctherm.c     |   4 +
 drivers/thermal/tegra/tegra132-soctherm.c     |   4 +
 drivers/thermal/tegra/tegra210-soctherm.c     |   4 +
 drivers/thermal/thermal_core.h                |   2 +-
 drivers/thermal/thermal_of.c                  |   3 +
 11 files changed, 329 insertions(+), 453 deletions(-)

Comments

Nicolas Chauvet Oct. 13, 2023, 9:14 a.m. UTC | #1
Le jeu. 12 oct. 2023 à 19:58, Thierry Reding
<thierry.reding@gmail.com> a écrit :
>
> From: Thierry Reding <treding@nvidia.com>
>
> Hi,
>
> this set of patches removes the registration of the SOCTHERM internal
> throttling mechanism as cooling device. Since this throttling starts
> automatically once a certain temperature threshold is crossed, it
> doesn't make sense to represent it as a cooling device, which are
> typically "manually" activated by the thermal framework when thermal
> sensors report temperature thresholds being crossed.
>
> Instead of using the cooling device mechanism, this statically programs
> the throttling mechanism when it is configured in device tree. In order
> to do this, an additional device tree property is needed to replace the
> information that was previously contained in trip points.
>
> There's a few preparatory patches to make the removal a bit simpler and
> also some follow up cleanups included as well.
>
> Changes in v2:
> - rework the device tree bindings:
>   - add nvidia,thermal-zones property to attach throttling to zones
>   - use -millicelsius suffix and add hysteresis
> - add patch to store thermal zone device tree node for later use
> - add patch to enforce self-encapsulation of the thermal core now that
>   no drivers need to reach into it anymore
>
> This applies on top of Daniel's self-encapsulation hardening series:
>
>         https://lore.kernel.org/all/20231012102700.2858952-1-daniel.lezcano@linaro.org/
>
> Thierry
>
> Thierry Reding (13):
>   thermal: Store device tree node for thermal zone devices
>   dt-bindings: thermal: tegra: Document throttle temperature
>   dt-bindings: thermal: tegra: Add nvidia,thermal-zones property
>   thermal: tegra: Use driver-private data consistently
>   thermal: tegra: Constify SoC-specific data
>   thermal: tegra: Do not register cooling device
>   thermal: tegra: Use unsigned int where appropriate
>   thermal: tegra: Avoid over-allocation of temporary array
>   thermal: tegra: Remove gratuitous error assignment
>   thermal: tegra: Minor stylistic cleanups
>   ARM: tegra: Rework SOCTHERM on Tegra124
>   arm64: tegra: Rework SOCTHERM on Tegra132 and Tegra210
>   thermal: Enforce self-encapsulation
>
>  .../thermal/nvidia,tegra124-soctherm.yaml     |  19 +
>  arch/arm/boot/dts/nvidia/tegra124.dtsi        |  68 +--
>  arch/arm64/boot/dts/nvidia/tegra132.dtsi      |  66 +--
>  arch/arm64/boot/dts/nvidia/tegra210.dtsi      |  86 +--
>  drivers/thermal/tegra/soctherm.c              | 525 ++++++++----------
>  drivers/thermal/tegra/soctherm.h              |   1 +
>  drivers/thermal/tegra/tegra124-soctherm.c     |   4 +
>  drivers/thermal/tegra/tegra132-soctherm.c     |   4 +
>  drivers/thermal/tegra/tegra210-soctherm.c     |   4 +
>  drivers/thermal/thermal_core.h                |   2 +-
>  drivers/thermal/thermal_of.c                  |   3 +
>  11 files changed, 329 insertions(+), 453 deletions(-)
>
> --
> 2.42.0
>

I'm still experiencing the following message on jetson-tx1 with this
serie applied on top of 6.6-rc5 (with iommu-next and tegra-next
applied).
oct. 13 10:53:16 jetson-tx1 kernel: max77620-thermal max77620-thermal:
Failed to register thermal zone: -19
oct. 13 10:53:16 jetson-tx1 kernel: tegra_soctherm
700e2000.thermal-sensor: throttle-cfg: heavy: no throt prop or invalid
prop

Is this expected ?
Thierry Reding Oct. 13, 2023, 11:43 a.m. UTC | #2
On Fri, Oct 13, 2023 at 11:14:25AM +0200, Nicolas Chauvet wrote:
> Le jeu. 12 oct. 2023 à 19:58, Thierry Reding
> <thierry.reding@gmail.com> a écrit :
> >
> > From: Thierry Reding <treding@nvidia.com>
> >
> > Hi,
> >
> > this set of patches removes the registration of the SOCTHERM internal
> > throttling mechanism as cooling device. Since this throttling starts
> > automatically once a certain temperature threshold is crossed, it
> > doesn't make sense to represent it as a cooling device, which are
> > typically "manually" activated by the thermal framework when thermal
> > sensors report temperature thresholds being crossed.
> >
> > Instead of using the cooling device mechanism, this statically programs
> > the throttling mechanism when it is configured in device tree. In order
> > to do this, an additional device tree property is needed to replace the
> > information that was previously contained in trip points.
> >
> > There's a few preparatory patches to make the removal a bit simpler and
> > also some follow up cleanups included as well.
> >
> > Changes in v2:
> > - rework the device tree bindings:
> >   - add nvidia,thermal-zones property to attach throttling to zones
> >   - use -millicelsius suffix and add hysteresis
> > - add patch to store thermal zone device tree node for later use
> > - add patch to enforce self-encapsulation of the thermal core now that
> >   no drivers need to reach into it anymore
> >
> > This applies on top of Daniel's self-encapsulation hardening series:
> >
> >         https://lore.kernel.org/all/20231012102700.2858952-1-daniel.lezcano@linaro.org/
> >
> > Thierry
> >
> > Thierry Reding (13):
> >   thermal: Store device tree node for thermal zone devices
> >   dt-bindings: thermal: tegra: Document throttle temperature
> >   dt-bindings: thermal: tegra: Add nvidia,thermal-zones property
> >   thermal: tegra: Use driver-private data consistently
> >   thermal: tegra: Constify SoC-specific data
> >   thermal: tegra: Do not register cooling device
> >   thermal: tegra: Use unsigned int where appropriate
> >   thermal: tegra: Avoid over-allocation of temporary array
> >   thermal: tegra: Remove gratuitous error assignment
> >   thermal: tegra: Minor stylistic cleanups
> >   ARM: tegra: Rework SOCTHERM on Tegra124
> >   arm64: tegra: Rework SOCTHERM on Tegra132 and Tegra210
> >   thermal: Enforce self-encapsulation
> >
> >  .../thermal/nvidia,tegra124-soctherm.yaml     |  19 +
> >  arch/arm/boot/dts/nvidia/tegra124.dtsi        |  68 +--
> >  arch/arm64/boot/dts/nvidia/tegra132.dtsi      |  66 +--
> >  arch/arm64/boot/dts/nvidia/tegra210.dtsi      |  86 +--
> >  drivers/thermal/tegra/soctherm.c              | 525 ++++++++----------
> >  drivers/thermal/tegra/soctherm.h              |   1 +
> >  drivers/thermal/tegra/tegra124-soctherm.c     |   4 +
> >  drivers/thermal/tegra/tegra132-soctherm.c     |   4 +
> >  drivers/thermal/tegra/tegra210-soctherm.c     |   4 +
> >  drivers/thermal/thermal_core.h                |   2 +-
> >  drivers/thermal/thermal_of.c                  |   3 +
> >  11 files changed, 329 insertions(+), 453 deletions(-)
> >
> > --
> > 2.42.0
> >
> 
> I'm still experiencing the following message on jetson-tx1 with this
> serie applied on top of 6.6-rc5 (with iommu-next and tegra-next
> applied).
> oct. 13 10:53:16 jetson-tx1 kernel: max77620-thermal max77620-thermal:
> Failed to register thermal zone: -19

Not sure about this one. I don't think I've seen it. Do you know if
that's somehow caused by this series, or is it preexisting?

> oct. 13 10:53:16 jetson-tx1 kernel: tegra_soctherm
> 700e2000.thermal-sensor: throttle-cfg: heavy: no throt prop or invalid
> prop
> 
> Is this expected ?

This one is definitely not expected. I have seen this before, and it
happens when the device tree doesn't contain all the properties that are
expected. Patch 12 in this series should take care of that. Have you
made sure that that's been applied and is what the kernel uses to boot
with?

Thierry
Nicolas Chauvet Oct. 13, 2023, 12:45 p.m. UTC | #3
Le ven. 13 oct. 2023 à 13:43, Thierry Reding
<thierry.reding@gmail.com> a écrit :
>
> On Fri, Oct 13, 2023 at 11:14:25AM +0200, Nicolas Chauvet wrote:
> > Le jeu. 12 oct. 2023 à 19:58, Thierry Reding
> > <thierry.reding@gmail.com> a écrit :
> > >
> > > From: Thierry Reding <treding@nvidia.com>
> > >
> > > Hi,
> > >
> > > this set of patches removes the registration of the SOCTHERM internal
> > > throttling mechanism as cooling device. Since this throttling starts
> > > automatically once a certain temperature threshold is crossed, it
> > > doesn't make sense to represent it as a cooling device, which are
> > > typically "manually" activated by the thermal framework when thermal
> > > sensors report temperature thresholds being crossed.
> > >
> > > Instead of using the cooling device mechanism, this statically programs
> > > the throttling mechanism when it is configured in device tree. In order
> > > to do this, an additional device tree property is needed to replace the
> > > information that was previously contained in trip points.
> > >
> > > There's a few preparatory patches to make the removal a bit simpler and
> > > also some follow up cleanups included as well.
> > >
> > > Changes in v2:
> > > - rework the device tree bindings:
> > >   - add nvidia,thermal-zones property to attach throttling to zones
> > >   - use -millicelsius suffix and add hysteresis
> > > - add patch to store thermal zone device tree node for later use
> > > - add patch to enforce self-encapsulation of the thermal core now that
> > >   no drivers need to reach into it anymore
> > >
> > > This applies on top of Daniel's self-encapsulation hardening series:
> > >
> > >         https://lore.kernel.org/all/20231012102700.2858952-1-daniel.lezcano@linaro.org/
> > >
> > > Thierry
> > >
> > > Thierry Reding (13):
> > >   thermal: Store device tree node for thermal zone devices
> > >   dt-bindings: thermal: tegra: Document throttle temperature
> > >   dt-bindings: thermal: tegra: Add nvidia,thermal-zones property
> > >   thermal: tegra: Use driver-private data consistently
> > >   thermal: tegra: Constify SoC-specific data
> > >   thermal: tegra: Do not register cooling device
> > >   thermal: tegra: Use unsigned int where appropriate
> > >   thermal: tegra: Avoid over-allocation of temporary array
> > >   thermal: tegra: Remove gratuitous error assignment
> > >   thermal: tegra: Minor stylistic cleanups
> > >   ARM: tegra: Rework SOCTHERM on Tegra124
> > >   arm64: tegra: Rework SOCTHERM on Tegra132 and Tegra210
> > >   thermal: Enforce self-encapsulation
> > >
> > >  .../thermal/nvidia,tegra124-soctherm.yaml     |  19 +
> > >  arch/arm/boot/dts/nvidia/tegra124.dtsi        |  68 +--
> > >  arch/arm64/boot/dts/nvidia/tegra132.dtsi      |  66 +--
> > >  arch/arm64/boot/dts/nvidia/tegra210.dtsi      |  86 +--
> > >  drivers/thermal/tegra/soctherm.c              | 525 ++++++++----------
> > >  drivers/thermal/tegra/soctherm.h              |   1 +
> > >  drivers/thermal/tegra/tegra124-soctherm.c     |   4 +
> > >  drivers/thermal/tegra/tegra132-soctherm.c     |   4 +
> > >  drivers/thermal/tegra/tegra210-soctherm.c     |   4 +
> > >  drivers/thermal/thermal_core.h                |   2 +-
> > >  drivers/thermal/thermal_of.c                  |   3 +
> > >  11 files changed, 329 insertions(+), 453 deletions(-)
> > >
> > > --
> > > 2.42.0
> > >
> >
> > I'm still experiencing the following message on jetson-tx1 with this
> > serie applied on top of 6.6-rc5 (with iommu-next and tegra-next
> > applied).
> > oct. 13 10:53:16 jetson-tx1 kernel: max77620-thermal max77620-thermal:
> > Failed to register thermal zone: -19
>
> Not sure about this one. I don't think I've seen it. Do you know if
> that's somehow caused by this series, or is it preexisting?

It's pre-existing from this serie.

> > oct. 13 10:53:16 jetson-tx1 kernel: tegra_soctherm
> > 700e2000.thermal-sensor: throttle-cfg: heavy: no throt prop or invalid
> > prop
> >
> > Is this expected ?
>
> This one is definitely not expected. I have seen this before, and it
> happens when the device tree doesn't contain all the properties that are
> expected. Patch 12 in this series should take care of that. Have you
> made sure that that's been applied and is what the kernel uses to boot
> with?

Yes, this dtb change in patch12 is propagated to the device (as seen
in /boot/dtbs)
But comparing with what's available at runtime in /proc/device-tree, I
see some changes

                        heavy {
-                               hysteresis-millicelsius = <0xfa0>;
+                               #cooling-cells = <0x02>;
                                nvidia,cpu-throt-percent = <0x55>;
                                nvidia,gpu-throt-level = <0x03>;
                                nvidia,priority = <0x64>;
-                               nvidia,thermal-zones = <0x49 0x4a>;
-                               temperature-millicelsius = <0x180c4>;
+                               phandle = <0x130>;
                        };

I'm using u-boot 2023.07 with EFI boot (L4T 32.7.4).
Could it be that the bootloader has changed these entries ? Can this
be prevented ?
(MAC ethernet address is set as appropropriate).

Thanks
Thierry Reding Oct. 13, 2023, 1:13 p.m. UTC | #4
On Fri, Oct 13, 2023 at 02:45:57PM +0200, Nicolas Chauvet wrote:
> Le ven. 13 oct. 2023 à 13:43, Thierry Reding
> <thierry.reding@gmail.com> a écrit :
> >
> > On Fri, Oct 13, 2023 at 11:14:25AM +0200, Nicolas Chauvet wrote:
> > > Le jeu. 12 oct. 2023 à 19:58, Thierry Reding
> > > <thierry.reding@gmail.com> a écrit :
> > > >
> > > > From: Thierry Reding <treding@nvidia.com>
> > > >
> > > > Hi,
> > > >
> > > > this set of patches removes the registration of the SOCTHERM internal
> > > > throttling mechanism as cooling device. Since this throttling starts
> > > > automatically once a certain temperature threshold is crossed, it
> > > > doesn't make sense to represent it as a cooling device, which are
> > > > typically "manually" activated by the thermal framework when thermal
> > > > sensors report temperature thresholds being crossed.
> > > >
> > > > Instead of using the cooling device mechanism, this statically programs
> > > > the throttling mechanism when it is configured in device tree. In order
> > > > to do this, an additional device tree property is needed to replace the
> > > > information that was previously contained in trip points.
> > > >
> > > > There's a few preparatory patches to make the removal a bit simpler and
> > > > also some follow up cleanups included as well.
> > > >
> > > > Changes in v2:
> > > > - rework the device tree bindings:
> > > >   - add nvidia,thermal-zones property to attach throttling to zones
> > > >   - use -millicelsius suffix and add hysteresis
> > > > - add patch to store thermal zone device tree node for later use
> > > > - add patch to enforce self-encapsulation of the thermal core now that
> > > >   no drivers need to reach into it anymore
> > > >
> > > > This applies on top of Daniel's self-encapsulation hardening series:
> > > >
> > > >         https://lore.kernel.org/all/20231012102700.2858952-1-daniel.lezcano@linaro.org/
> > > >
> > > > Thierry
> > > >
> > > > Thierry Reding (13):
> > > >   thermal: Store device tree node for thermal zone devices
> > > >   dt-bindings: thermal: tegra: Document throttle temperature
> > > >   dt-bindings: thermal: tegra: Add nvidia,thermal-zones property
> > > >   thermal: tegra: Use driver-private data consistently
> > > >   thermal: tegra: Constify SoC-specific data
> > > >   thermal: tegra: Do not register cooling device
> > > >   thermal: tegra: Use unsigned int where appropriate
> > > >   thermal: tegra: Avoid over-allocation of temporary array
> > > >   thermal: tegra: Remove gratuitous error assignment
> > > >   thermal: tegra: Minor stylistic cleanups
> > > >   ARM: tegra: Rework SOCTHERM on Tegra124
> > > >   arm64: tegra: Rework SOCTHERM on Tegra132 and Tegra210
> > > >   thermal: Enforce self-encapsulation
> > > >
> > > >  .../thermal/nvidia,tegra124-soctherm.yaml     |  19 +
> > > >  arch/arm/boot/dts/nvidia/tegra124.dtsi        |  68 +--
> > > >  arch/arm64/boot/dts/nvidia/tegra132.dtsi      |  66 +--
> > > >  arch/arm64/boot/dts/nvidia/tegra210.dtsi      |  86 +--
> > > >  drivers/thermal/tegra/soctherm.c              | 525 ++++++++----------
> > > >  drivers/thermal/tegra/soctherm.h              |   1 +
> > > >  drivers/thermal/tegra/tegra124-soctherm.c     |   4 +
> > > >  drivers/thermal/tegra/tegra132-soctherm.c     |   4 +
> > > >  drivers/thermal/tegra/tegra210-soctherm.c     |   4 +
> > > >  drivers/thermal/thermal_core.h                |   2 +-
> > > >  drivers/thermal/thermal_of.c                  |   3 +
> > > >  11 files changed, 329 insertions(+), 453 deletions(-)
> > > >
> > > > --
> > > > 2.42.0
> > > >
> > >
> > > I'm still experiencing the following message on jetson-tx1 with this
> > > serie applied on top of 6.6-rc5 (with iommu-next and tegra-next
> > > applied).
> > > oct. 13 10:53:16 jetson-tx1 kernel: max77620-thermal max77620-thermal:
> > > Failed to register thermal zone: -19
> >
> > Not sure about this one. I don't think I've seen it. Do you know if
> > that's somehow caused by this series, or is it preexisting?
> 
> It's pre-existing from this serie.
> 
> > > oct. 13 10:53:16 jetson-tx1 kernel: tegra_soctherm
> > > 700e2000.thermal-sensor: throttle-cfg: heavy: no throt prop or invalid
> > > prop
> > >
> > > Is this expected ?
> >
> > This one is definitely not expected. I have seen this before, and it
> > happens when the device tree doesn't contain all the properties that are
> > expected. Patch 12 in this series should take care of that. Have you
> > made sure that that's been applied and is what the kernel uses to boot
> > with?
> 
> Yes, this dtb change in patch12 is propagated to the device (as seen
> in /boot/dtbs)
> But comparing with what's available at runtime in /proc/device-tree, I
> see some changes
> 
>                         heavy {
> -                               hysteresis-millicelsius = <0xfa0>;
> +                               #cooling-cells = <0x02>;
>                                 nvidia,cpu-throt-percent = <0x55>;
>                                 nvidia,gpu-throt-level = <0x03>;
>                                 nvidia,priority = <0x64>;
> -                               nvidia,thermal-zones = <0x49 0x4a>;
> -                               temperature-millicelsius = <0x180c4>;
> +                               phandle = <0x130>;
>                         };

Okay, that explains the error message.

> 
> I'm using u-boot 2023.07 with EFI boot (L4T 32.7.4).
> Could it be that the bootloader has changed these entries ? Can this
> be prevented ?

I'm not aware of anything in the bootloader that would do this. Some
versions of U-Boot that ships with L4T can copy certain nodes in DTB but
I have never seen anything that would've touched thermal.

Is it possible that you're not loading the DTB and end up receiving one
from UEFI or cboot?

> (MAC ethernet address is set as appropropriate).

That's a completely separate mechanism and shouldn't touch thermal at
all.

Thierry
Nicolas Chauvet Oct. 13, 2023, 1:55 p.m. UTC | #5
Le ven. 13 oct. 2023 à 15:13, Thierry Reding
<thierry.reding@gmail.com> a écrit :
>
> On Fri, Oct 13, 2023 at 02:45:57PM +0200, Nicolas Chauvet wrote:
> > Le ven. 13 oct. 2023 à 13:43, Thierry Reding
> > <thierry.reding@gmail.com> a écrit :
> > >
> > > On Fri, Oct 13, 2023 at 11:14:25AM +0200, Nicolas Chauvet wrote:
> > > > Le jeu. 12 oct. 2023 à 19:58, Thierry Reding
> > > > <thierry.reding@gmail.com> a écrit :
> > > > >
> > > > > From: Thierry Reding <treding@nvidia.com>
> > > > >
> > > > > Hi,
> > > > >
> > > > > this set of patches removes the registration of the SOCTHERM internal
> > > > > throttling mechanism as cooling device. Since this throttling starts
> > > > > automatically once a certain temperature threshold is crossed, it
> > > > > doesn't make sense to represent it as a cooling device, which are
> > > > > typically "manually" activated by the thermal framework when thermal
> > > > > sensors report temperature thresholds being crossed.
> > > > >
> > > > > Instead of using the cooling device mechanism, this statically programs
> > > > > the throttling mechanism when it is configured in device tree. In order
> > > > > to do this, an additional device tree property is needed to replace the
> > > > > information that was previously contained in trip points.
> > > > >
> > > > > There's a few preparatory patches to make the removal a bit simpler and
> > > > > also some follow up cleanups included as well.
> > > > >
> > > > > Changes in v2:
> > > > > - rework the device tree bindings:
> > > > >   - add nvidia,thermal-zones property to attach throttling to zones
> > > > >   - use -millicelsius suffix and add hysteresis
> > > > > - add patch to store thermal zone device tree node for later use
> > > > > - add patch to enforce self-encapsulation of the thermal core now that
> > > > >   no drivers need to reach into it anymore
> > > > >
> > > > > This applies on top of Daniel's self-encapsulation hardening series:
> > > > >
> > > > >         https://lore.kernel.org/all/20231012102700.2858952-1-daniel.lezcano@linaro.org/
> > > > >
> > > > > Thierry
> > > > >
> > > > > Thierry Reding (13):
> > > > >   thermal: Store device tree node for thermal zone devices
> > > > >   dt-bindings: thermal: tegra: Document throttle temperature
> > > > >   dt-bindings: thermal: tegra: Add nvidia,thermal-zones property
> > > > >   thermal: tegra: Use driver-private data consistently
> > > > >   thermal: tegra: Constify SoC-specific data
> > > > >   thermal: tegra: Do not register cooling device
> > > > >   thermal: tegra: Use unsigned int where appropriate
> > > > >   thermal: tegra: Avoid over-allocation of temporary array
> > > > >   thermal: tegra: Remove gratuitous error assignment
> > > > >   thermal: tegra: Minor stylistic cleanups
> > > > >   ARM: tegra: Rework SOCTHERM on Tegra124
> > > > >   arm64: tegra: Rework SOCTHERM on Tegra132 and Tegra210
> > > > >   thermal: Enforce self-encapsulation
> > > > >
> > > > >  .../thermal/nvidia,tegra124-soctherm.yaml     |  19 +
> > > > >  arch/arm/boot/dts/nvidia/tegra124.dtsi        |  68 +--
> > > > >  arch/arm64/boot/dts/nvidia/tegra132.dtsi      |  66 +--
> > > > >  arch/arm64/boot/dts/nvidia/tegra210.dtsi      |  86 +--
> > > > >  drivers/thermal/tegra/soctherm.c              | 525 ++++++++----------
> > > > >  drivers/thermal/tegra/soctherm.h              |   1 +
> > > > >  drivers/thermal/tegra/tegra124-soctherm.c     |   4 +
> > > > >  drivers/thermal/tegra/tegra132-soctherm.c     |   4 +
> > > > >  drivers/thermal/tegra/tegra210-soctherm.c     |   4 +
> > > > >  drivers/thermal/thermal_core.h                |   2 +-
> > > > >  drivers/thermal/thermal_of.c                  |   3 +
> > > > >  11 files changed, 329 insertions(+), 453 deletions(-)
> > > > >
> > > > > --
> > > > > 2.42.0
> > > > >
> > > >
> > > > I'm still experiencing the following message on jetson-tx1 with this
> > > > serie applied on top of 6.6-rc5 (with iommu-next and tegra-next
> > > > applied).
> > > > oct. 13 10:53:16 jetson-tx1 kernel: max77620-thermal max77620-thermal:
> > > > Failed to register thermal zone: -19
> > >
> > > Not sure about this one. I don't think I've seen it. Do you know if
> > > that's somehow caused by this series, or is it preexisting?
> >
> > It's pre-existing from this serie.
> >
> > > > oct. 13 10:53:16 jetson-tx1 kernel: tegra_soctherm
> > > > 700e2000.thermal-sensor: throttle-cfg: heavy: no throt prop or invalid
> > > > prop
> > > >
> > > > Is this expected ?
> > >
> > > This one is definitely not expected. I have seen this before, and it
> > > happens when the device tree doesn't contain all the properties that are
> > > expected. Patch 12 in this series should take care of that. Have you
> > > made sure that that's been applied and is what the kernel uses to boot
> > > with?
> >
> > Yes, this dtb change in patch12 is propagated to the device (as seen
> > in /boot/dtbs)
> > But comparing with what's available at runtime in /proc/device-tree, I
> > see some changes
> >
> >                         heavy {
> > -                               hysteresis-millicelsius = <0xfa0>;
> > +                               #cooling-cells = <0x02>;
> >                                 nvidia,cpu-throt-percent = <0x55>;
> >                                 nvidia,gpu-throt-level = <0x03>;
> >                                 nvidia,priority = <0x64>;
> > -                               nvidia,thermal-zones = <0x49 0x4a>;
> > -                               temperature-millicelsius = <0x180c4>;
> > +                               phandle = <0x130>;
> >                         };
>
> Okay, that explains the error message.
>
> >
> > I'm using u-boot 2023.07 with EFI boot (L4T 32.7.4).
> > Could it be that the bootloader has changed these entries ? Can this
> > be prevented ?
>
> I'm not aware of anything in the bootloader that would do this. Some
> versions of U-Boot that ships with L4T can copy certain nodes in DTB but
> I have never seen anything that would've touched thermal.
>
> Is it possible that you're not loading the DTB and end up receiving one
> from UEFI or cboot?
Seems like it: a previous copy was in /boot/dts that took over /boot/dtbs.
With updated dtb, the warning disappeared.

Only remaining error is: kernel: max77620-thermal max77620-thermal:
Failed to register thermal zone: -19

Thanks
Thierry Reding Oct. 13, 2023, 3:45 p.m. UTC | #6
On Fri, Oct 13, 2023 at 03:55:43PM +0200, Nicolas Chauvet wrote:
> Le ven. 13 oct. 2023 à 15:13, Thierry Reding
> <thierry.reding@gmail.com> a écrit :
> >
> > On Fri, Oct 13, 2023 at 02:45:57PM +0200, Nicolas Chauvet wrote:
> > > Le ven. 13 oct. 2023 à 13:43, Thierry Reding
> > > <thierry.reding@gmail.com> a écrit :
> > > >
> > > > On Fri, Oct 13, 2023 at 11:14:25AM +0200, Nicolas Chauvet wrote:
> > > > > Le jeu. 12 oct. 2023 à 19:58, Thierry Reding
> > > > > <thierry.reding@gmail.com> a écrit :
> > > > > >
> > > > > > From: Thierry Reding <treding@nvidia.com>
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > this set of patches removes the registration of the SOCTHERM internal
> > > > > > throttling mechanism as cooling device. Since this throttling starts
> > > > > > automatically once a certain temperature threshold is crossed, it
> > > > > > doesn't make sense to represent it as a cooling device, which are
> > > > > > typically "manually" activated by the thermal framework when thermal
> > > > > > sensors report temperature thresholds being crossed.
> > > > > >
> > > > > > Instead of using the cooling device mechanism, this statically programs
> > > > > > the throttling mechanism when it is configured in device tree. In order
> > > > > > to do this, an additional device tree property is needed to replace the
> > > > > > information that was previously contained in trip points.
> > > > > >
> > > > > > There's a few preparatory patches to make the removal a bit simpler and
> > > > > > also some follow up cleanups included as well.
> > > > > >
> > > > > > Changes in v2:
> > > > > > - rework the device tree bindings:
> > > > > >   - add nvidia,thermal-zones property to attach throttling to zones
> > > > > >   - use -millicelsius suffix and add hysteresis
> > > > > > - add patch to store thermal zone device tree node for later use
> > > > > > - add patch to enforce self-encapsulation of the thermal core now that
> > > > > >   no drivers need to reach into it anymore
> > > > > >
> > > > > > This applies on top of Daniel's self-encapsulation hardening series:
> > > > > >
> > > > > >         https://lore.kernel.org/all/20231012102700.2858952-1-daniel.lezcano@linaro.org/
> > > > > >
> > > > > > Thierry
> > > > > >
> > > > > > Thierry Reding (13):
> > > > > >   thermal: Store device tree node for thermal zone devices
> > > > > >   dt-bindings: thermal: tegra: Document throttle temperature
> > > > > >   dt-bindings: thermal: tegra: Add nvidia,thermal-zones property
> > > > > >   thermal: tegra: Use driver-private data consistently
> > > > > >   thermal: tegra: Constify SoC-specific data
> > > > > >   thermal: tegra: Do not register cooling device
> > > > > >   thermal: tegra: Use unsigned int where appropriate
> > > > > >   thermal: tegra: Avoid over-allocation of temporary array
> > > > > >   thermal: tegra: Remove gratuitous error assignment
> > > > > >   thermal: tegra: Minor stylistic cleanups
> > > > > >   ARM: tegra: Rework SOCTHERM on Tegra124
> > > > > >   arm64: tegra: Rework SOCTHERM on Tegra132 and Tegra210
> > > > > >   thermal: Enforce self-encapsulation
> > > > > >
> > > > > >  .../thermal/nvidia,tegra124-soctherm.yaml     |  19 +
> > > > > >  arch/arm/boot/dts/nvidia/tegra124.dtsi        |  68 +--
> > > > > >  arch/arm64/boot/dts/nvidia/tegra132.dtsi      |  66 +--
> > > > > >  arch/arm64/boot/dts/nvidia/tegra210.dtsi      |  86 +--
> > > > > >  drivers/thermal/tegra/soctherm.c              | 525 ++++++++----------
> > > > > >  drivers/thermal/tegra/soctherm.h              |   1 +
> > > > > >  drivers/thermal/tegra/tegra124-soctherm.c     |   4 +
> > > > > >  drivers/thermal/tegra/tegra132-soctherm.c     |   4 +
> > > > > >  drivers/thermal/tegra/tegra210-soctherm.c     |   4 +
> > > > > >  drivers/thermal/thermal_core.h                |   2 +-
> > > > > >  drivers/thermal/thermal_of.c                  |   3 +
> > > > > >  11 files changed, 329 insertions(+), 453 deletions(-)
> > > > > >
> > > > > > --
> > > > > > 2.42.0
> > > > > >
> > > > >
> > > > > I'm still experiencing the following message on jetson-tx1 with this
> > > > > serie applied on top of 6.6-rc5 (with iommu-next and tegra-next
> > > > > applied).
> > > > > oct. 13 10:53:16 jetson-tx1 kernel: max77620-thermal max77620-thermal:
> > > > > Failed to register thermal zone: -19
> > > >
> > > > Not sure about this one. I don't think I've seen it. Do you know if
> > > > that's somehow caused by this series, or is it preexisting?
> > >
> > > It's pre-existing from this serie.
> > >
> > > > > oct. 13 10:53:16 jetson-tx1 kernel: tegra_soctherm
> > > > > 700e2000.thermal-sensor: throttle-cfg: heavy: no throt prop or invalid
> > > > > prop
> > > > >
> > > > > Is this expected ?
> > > >
> > > > This one is definitely not expected. I have seen this before, and it
> > > > happens when the device tree doesn't contain all the properties that are
> > > > expected. Patch 12 in this series should take care of that. Have you
> > > > made sure that that's been applied and is what the kernel uses to boot
> > > > with?
> > >
> > > Yes, this dtb change in patch12 is propagated to the device (as seen
> > > in /boot/dtbs)
> > > But comparing with what's available at runtime in /proc/device-tree, I
> > > see some changes
> > >
> > >                         heavy {
> > > -                               hysteresis-millicelsius = <0xfa0>;
> > > +                               #cooling-cells = <0x02>;
> > >                                 nvidia,cpu-throt-percent = <0x55>;
> > >                                 nvidia,gpu-throt-level = <0x03>;
> > >                                 nvidia,priority = <0x64>;
> > > -                               nvidia,thermal-zones = <0x49 0x4a>;
> > > -                               temperature-millicelsius = <0x180c4>;
> > > +                               phandle = <0x130>;
> > >                         };
> >
> > Okay, that explains the error message.
> >
> > >
> > > I'm using u-boot 2023.07 with EFI boot (L4T 32.7.4).
> > > Could it be that the bootloader has changed these entries ? Can this
> > > be prevented ?
> >
> > I'm not aware of anything in the bootloader that would do this. Some
> > versions of U-Boot that ships with L4T can copy certain nodes in DTB but
> > I have never seen anything that would've touched thermal.
> >
> > Is it possible that you're not loading the DTB and end up receiving one
> > from UEFI or cboot?
> Seems like it: a previous copy was in /boot/dts that took over /boot/dtbs.
> With updated dtb, the warning disappeared.

Okay, great!

> Only remaining error is: kernel: max77620-thermal max77620-thermal:
> Failed to register thermal zone: -19

Looks like that's -ENODEV from devm_thermal_of_zone_register() via
max77620_thermal_probe(). Looking at thermal_of_zone_register(), the
-ENODEV case is treated specially, so perhaps we should be doing the
same thing in max77620_probe()?

Let me send out a patch.

Thierry