mbox series

[v5,00/33] New thermal OF code

Message ID 20220804224349.1926752-1-daniel.lezcano@linexp.org (mailing list archive)
Headers show
Series New thermal OF code | expand

Message

Daniel Lezcano Aug. 4, 2022, 10:43 p.m. UTC
The following changes are depending on:

 - 20220722200007.1839356-1-daniel.lezcano@linexp.org

which are present in the thermal/linux-next branch:

https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/log/?h=thermal/linux-next

The series introduces a new thermal OF code. The patch description gives
a detailed explanation of the changes. Basically we write new OF parsing
functions, we migrate all the users of the old thermal OF API to the new
one and then we finish by removing the old OF code.

That is the second step to rework the thermal OF code. More patches will
come after that to remove the duplication of the trip definitions in the
different drivers which will result in more code duplication removed and
consolidation of the core thermal framework.

Thanks for those who tested the series on their platform and
investigated the regression with the disabled by default thermal zones.

Changelog:
 v5:
   - Fixed st, exynos_tmu and adc iio sun4i compilation issue when
     THERMAL_EMULATION or THERMAL_OF are not set
   - Add the pmbus_core.c API conversion which was missing
   - Collected more tags
 v4:
   - Fixed a compilation error when THERMAL_OF=n
   - Collected more tags
 v3:
   - Rebased on the right branch as reported by Niklas Söderlund
   - Collected more tags
 v2:
   - Changed the code in the register thermal zone function to prevent
     the 'const' annotation being removed in the different drivers
   - Collected the tags and adding Cc for more context
   - Changed the first line patch description to comply to the 'input'
     subsystem format
   - Give a more detailed description in the changelog for the drivers
   - Remove pointless calls to unregister as the devm version is used
     instead
   - Moved dummy functions from one patch to another to prevent git
     bisecting issue when THERMAL_OF=n
   - Fixed thermal zone disabled by default


Daniel Lezcano (33):
  thermal/of: Rework the thermal device tree initialization
  thermal/of: Make new code and old code co-exist
  thermal/drivers/rockchip: Switch to new of API
  thermal/drivers/uniphier: Switch to new of API
  thermal/drivers/generic-adc: Switch to new of API
  thermal/drivers/mmio: Switch to new of API
  thermal/drivers/tegra: Switch to new of API
  thermal/drivers/sun8i: Switch to new of API
  thermal/drivers/sprd: Switch to new of API
  thermal/drivers/broadcom: Switch to new of API
  thermal/drivers/qcom: Switch to new of API
  thermal/drivers/st: Switch to new of API
  thermal/drivers/amlogic: Switch to new of API
  thermal/drivers/armada: Switch to new of API
  thermal/drivers/db8500: Switch to new of API
  thermal/drivers/imx: Switch to new of API
  thermal/drivers/rcar: Switch to new of API
  thermal/drivers/rzg2l: Switch to new of API
  thermal/drivers/qoriq: Switch to new of API
  thermal/drivers/mtk: Switch to new of API
  thermal/drivers/banggap: Switch to new of API
  thermal/drivers/maxim: Switch to new of API
  thermal/drivers/hisilicon: Switch to new of API
  thermal/drivers/ti-soc: Switch to new of API
  ata/drivers/ahci_imx: Switch to new of thermal API
  hwmon/drivers/pm_bus: Switch to new of thermal API
  hwmon/drivers/core: Switch to new of thermal API
  iio/drivers/sun4i_gpadc: Switch to new of thermal API
  Input: sun4i-ts - switch to new of thermal API
  regulator/drivers/max8976: Switch to new of thermal API
  thermal/drivers/samsung: Switch to new of thermal API
  thermal/core: Move set_trip_temp ops to the sysfs code
  thermal/of: Remove old OF code

 drivers/ata/ahci_imx.c                        |   15 +-
 drivers/hwmon/hwmon.c                         |   14 +-
 drivers/hwmon/pmbus/pmbus_core.c              |   10 +-
 drivers/hwmon/scpi-hwmon.c                    |   14 +-
 drivers/iio/adc/sun4i-gpadc-iio.c             |   14 +-
 drivers/input/touchscreen/sun4i-ts.c          |   10 +-
 drivers/regulator/max8973-regulator.c         |   10 +-
 drivers/thermal/amlogic_thermal.c             |   16 +-
 drivers/thermal/armada_thermal.c              |   12 +-
 drivers/thermal/broadcom/bcm2711_thermal.c    |   14 +-
 drivers/thermal/broadcom/bcm2835_thermal.c    |   14 +-
 drivers/thermal/broadcom/brcmstb_thermal.c    |   20 +-
 drivers/thermal/broadcom/ns-thermal.c         |   50 +-
 drivers/thermal/broadcom/sr-thermal.c         |   16 +-
 drivers/thermal/db8500_thermal.c              |    8 +-
 drivers/thermal/hisi_thermal.c                |   14 +-
 drivers/thermal/imx8mm_thermal.c              |   14 +-
 drivers/thermal/imx_sc_thermal.c              |   14 +-
 drivers/thermal/k3_bandgap.c                  |   12 +-
 drivers/thermal/k3_j72xx_bandgap.c            |   12 +-
 drivers/thermal/max77620_thermal.c            |    8 +-
 drivers/thermal/mtk_thermal.c                 |   10 +-
 drivers/thermal/qcom/qcom-spmi-adc-tm5.c      |   19 +-
 drivers/thermal/qcom/qcom-spmi-temp-alarm.c   |   12 +-
 drivers/thermal/qcom/tsens.c                  |   16 +-
 drivers/thermal/qoriq_thermal.c               |   12 +-
 drivers/thermal/rcar_gen3_thermal.c           |   16 +-
 drivers/thermal/rcar_thermal.c                |   13 +-
 drivers/thermal/rockchip_thermal.c            |   14 +-
 drivers/thermal/rzg2l_thermal.c               |   10 +-
 drivers/thermal/samsung/exynos_tmu.c          |   24 +-
 drivers/thermal/sprd_thermal.c                |   18 +-
 drivers/thermal/st/stm_thermal.c              |   18 +-
 drivers/thermal/sun8i_thermal.c               |   14 +-
 drivers/thermal/tegra/soctherm.c              |   21 +-
 drivers/thermal/tegra/tegra-bpmp-thermal.c    |   19 +-
 drivers/thermal/tegra/tegra30-tsensor.c       |   12 +-
 drivers/thermal/thermal-generic-adc.c         |   10 +-
 drivers/thermal/thermal_core.c                |    6 -
 drivers/thermal/thermal_core.h                |    2 -
 drivers/thermal/thermal_mmio.c                |   17 +-
 drivers/thermal/thermal_of.c                  | 1140 ++++++-----------
 drivers/thermal/thermal_sysfs.c               |    5 +-
 .../ti-soc-thermal/ti-thermal-common.c        |   16 +-
 drivers/thermal/uniphier_thermal.c            |   10 +-
 include/linux/thermal.h                       |   85 +-
 46 files changed, 708 insertions(+), 1142 deletions(-)

Comments

Michael Walle Aug. 8, 2022, 9:42 a.m. UTC | #1
Hi,

> The following changes are depending on:
> 
>  - 20220722200007.1839356-1-daniel.lezcano@linexp.org
> 
> which are present in the thermal/linux-next branch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/log/?h=thermal/linux-next
> 
> The series introduces a new thermal OF code. The patch description gives
> a detailed explanation of the changes. Basically we write new OF parsing
> functions, we migrate all the users of the old thermal OF API to the new
> one and then we finish by removing the old OF code.
> 
> That is the second step to rework the thermal OF code. More patches will
> come after that to remove the duplication of the trip definitions in the
> different drivers which will result in more code duplication removed and
> consolidation of the core thermal framework.
> 
> Thanks for those who tested the series on their platform and
> investigated the regression with the disabled by default thermal zones.

I haven't looked closely yet, but this series is breaking two of my
boards.

There seems to be one mistake within the new thermal code:

[    2.030452] thermal_sys: Failed to find 'trips' node
[    2.033664] usb 1-1: new high-speed USB device number 2 using xhci-hcd
[    2.035434] thermal_sys: Failed to find trip points for tmu id=2
[    2.048010] qoriq_thermal 1f80000.tmu: Failed to register sensors
[    2.054128] qoriq_thermal: probe of 1f80000.tmu failed with error -22
[    2.060607] devm_thermal_of_zone_release:707 res=ffff002002377180
[    2.067044] Unable to handle kernel paging request at virtual address 01adadadadadad88
[    2.075003] Mem abort info:
[    2.077805]   ESR = 0x0000000096000004
[    2.081562]   EC = 0x25: DABT (current EL), IL = 32 bits
[    2.086893]   SET = 0, FnV = 0
[    2.089955]   EA = 0, S1PTW = 0
[    2.093100]   FSC = 0x04: level 0 translation fault
[    2.097993] Data abort info:
[    2.100876]   ISV = 0, ISS = 0x00000004
[    2.104724]   CM = 0, WnR = 0
[    2.107698] [01adadadadadad88] address between user and kernel address ranges
[    2.114863] Internal error: Oops: 96000004 [#1] SMP
[    2.119754] Modules linked in:
[    2.122815] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.19.0-next-20220808-00078-ga957a15f74fc-dirty #1694
[    2.132504] Hardware name: Kontron KBox A-230-LS (DT)
[    2.137568] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    2.144554] pc : kfree+0x5c/0x3c0
[    2.147885] lr : thermal_of_zone_unregister+0x34/0x54
[    2.152954] sp : ffff80000a22bab0
[    2.156274] x29: ffff80000a22bab0 x28: 0000000000000000 x27: ffff800009960464
[    2.163438] x26: ffff800009a16960 x25: 0000000000000006 x24: ffff800009f09a40
[    2.170601] x23: ffff800009ab9008 x22: ffff800008d0d684 x21: 01adadadadadad80
[    2.177763] x20: 6b6b6b6b6b6b6b6b x19: ffff002002335000 x18: 00000000fffffffb
[    2.184925] x17: ffff800008d0d67c x16: ffff800008d072b4 x15: ffff800008d0c6c4
[    2.192087] x14: ffff800008d0c34c x13: ffff8000088d5034 x12: ffff8000088d46d4
[    2.199248] x11: ffff8000088d4624 x10: 0000000000000000 x9 : ffff800008d0d684
[    2.206410] x8 : ffff002000b1a158 x7 : bbbbbbbbbbbbbbbb x6 : ffff80000a0f53b8
[    2.213572] x5 : ffff80000a22b940 x4 : 0000000000000000 x3 : 0000000000000000
[    2.220733] x2 : fffffc0000000000 x1 : ffff002000838040 x0 : 01adb1adadadad80
[    2.227895] Call trace:
[    2.230342]  kfree+0x5c/0x3c0
[    2.233318]  thermal_of_zone_unregister+0x34/0x54
[    2.238036]  devm_thermal_of_zone_release+0x44/0x54
[    2.242931]  release_nodes+0x64/0xd0
[    2.246516]  devres_release_all+0xbc/0x350
[    2.250623]  device_unbind_cleanup+0x20/0x70
[    2.254905]  really_probe+0x1a0/0x2e4
[    2.258577]  __driver_probe_device+0x80/0xec
[    2.262859]  driver_probe_device+0x44/0x130
[    2.267055]  __driver_attach+0x104/0x1b4
[    2.270989]  bus_for_each_dev+0x7c/0xe0
[    2.274834]  driver_attach+0x30/0x40
[    2.278418]  bus_add_driver+0x160/0x210
[    2.281900] hub 1-1:1.0: USB hub found
[    2.282264]  driver_register+0x84/0x140
[    2.286109] hub 1-1:1.0: 7 ports detected
[    2.289859]  __platform_driver_register+0x34/0x40
[    2.289867]  qoriq_tmu_init+0x28/0x34
[    2.302258]  do_one_initcall+0x50/0x250
[    2.306104]  kernel_init_freeable+0x278/0x31c
[    2.310474]  kernel_init+0x30/0x140
[    2.313972]  ret_from_fork+0x10/0x20
[    2.317559] Code: b25657e2 d34cfc00 d37ae400 8b020015 (f94006a1) 
[    2.323672] ---[ end trace 0000000000000000 ]---
[    2.328317] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[    2.335999] SMP: stopping secondary CPUs
[    2.339932] Kernel Offset: disabled
[    2.343425] CPU features: 0x2000,0800f021,00001086
[    2.348229] Memory Limit: none
[    2.351289] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

This was seen a sl28 board
(arch/arm64/boot/dts/freescale/fsl-ls1028a-kontron-kbox-a-230-ls.dts).
The same board in the KernelCI also have some more information:
https://lavalab.kontron.com/scheduler/job/151900#L1162

But I guess even if that is fixed, the driver will not probe due to the
missing trip points? Are they now mandatory? Does it mean we'd need to
update our device trees? But that will then mean older devices trees
don't work anymore.

On my second board
(arch/arm/boot/dts/lan966x-kontron-kswitch-d10-mmt-6g-2gs.dts). I get the
following error:

[    6.292819] thermal_sys: Unable to find thermal zones description
[    6.298872] thermal_sys: Failed to find thermal zone for hwmon id=0
[    6.305375] lan966x-hwmon e2010180.hwmon: error -EINVAL: failed to register hwmon device
[    6.313508] lan966x-hwmon: probe of e2010180.hwmon failed with error -22

Again, is there seems to be something missing in the device tree. For this
board a device tree change should be easily doable, as it is still in
development.

Let me know if I can help testing changes.

-michael
Daniel Lezcano Aug. 8, 2022, 10:21 a.m. UTC | #2
Hi Michael,

On 08/08/2022 11:42, Michael Walle wrote:
> Hi,
>
>> The following changes are depending on:
>>
>>   - 20220722200007.1839356-1-daniel.lezcano@linexp.org
>>
>> which are present in the thermal/linux-next branch:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/log/?h=thermal/linux-next
>>
>> The series introduces a new thermal OF code. The patch description gives
>> a detailed explanation of the changes. Basically we write new OF parsing
>> functions, we migrate all the users of the old thermal OF API to the new
>> one and then we finish by removing the old OF code.
>>
>> That is the second step to rework the thermal OF code. More patches will
>> come after that to remove the duplication of the trip definitions in the
>> different drivers which will result in more code duplication removed and
>> consolidation of the core thermal framework.
>>
>> Thanks for those who tested the series on their platform and
>> investigated the regression with the disabled by default thermal zones.
> I haven't looked closely yet, but this series is breaking two of my
> boards.
>
> There seems to be one mistake within the new thermal code:
>
> [    2.030452] thermal_sys: Failed to find 'trips' node
> [    2.033664] usb 1-1: new high-speed USB device number 2 using xhci-hcd
> [    2.035434] thermal_sys: Failed to find trip points for tmu id=2
> [    2.048010] qoriq_thermal 1f80000.tmu: Failed to register sensors
> [    2.054128] qoriq_thermal: probe of 1f80000.tmu failed with error -22
> [    2.060607] devm_thermal_of_zone_release:707 res=ffff002002377180
> [    2.067044] Unable to handle kernel paging request at virtual address 01adadadadadad88
> [    2.075003] Mem abort info:
> [    2.077805]   ESR = 0x0000000096000004
> [    2.081562]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    2.086893]   SET = 0, FnV = 0
> [    2.089955]   EA = 0, S1PTW = 0
> [    2.093100]   FSC = 0x04: level 0 translation fault
> [    2.097993] Data abort info:
> [    2.100876]   ISV = 0, ISS = 0x00000004
> [    2.104724]   CM = 0, WnR = 0
> [    2.107698] [01adadadadadad88] address between user and kernel address ranges
> [    2.114863] Internal error: Oops: 96000004 [#1] SMP
> [    2.119754] Modules linked in:
> [    2.122815] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.19.0-next-20220808-00078-ga957a15f74fc-dirty #1694
> [    2.132504] Hardware name: Kontron KBox A-230-LS (DT)
> [    2.137568] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    2.144554] pc : kfree+0x5c/0x3c0
> [    2.147885] lr : thermal_of_zone_unregister+0x34/0x54
> [    2.152954] sp : ffff80000a22bab0
> [    2.156274] x29: ffff80000a22bab0 x28: 0000000000000000 x27: ffff800009960464
> [    2.163438] x26: ffff800009a16960 x25: 0000000000000006 x24: ffff800009f09a40
> [    2.170601] x23: ffff800009ab9008 x22: ffff800008d0d684 x21: 01adadadadadad80
> [    2.177763] x20: 6b6b6b6b6b6b6b6b x19: ffff002002335000 x18: 00000000fffffffb
> [    2.184925] x17: ffff800008d0d67c x16: ffff800008d072b4 x15: ffff800008d0c6c4
> [    2.192087] x14: ffff800008d0c34c x13: ffff8000088d5034 x12: ffff8000088d46d4
> [    2.199248] x11: ffff8000088d4624 x10: 0000000000000000 x9 : ffff800008d0d684
> [    2.206410] x8 : ffff002000b1a158 x7 : bbbbbbbbbbbbbbbb x6 : ffff80000a0f53b8
> [    2.213572] x5 : ffff80000a22b940 x4 : 0000000000000000 x3 : 0000000000000000
> [    2.220733] x2 : fffffc0000000000 x1 : ffff002000838040 x0 : 01adb1adadadad80
> [    2.227895] Call trace:
> [    2.230342]  kfree+0x5c/0x3c0
> [    2.233318]  thermal_of_zone_unregister+0x34/0x54
> [    2.238036]  devm_thermal_of_zone_release+0x44/0x54
> [    2.242931]  release_nodes+0x64/0xd0
> [    2.246516]  devres_release_all+0xbc/0x350
> [    2.250623]  device_unbind_cleanup+0x20/0x70
> [    2.254905]  really_probe+0x1a0/0x2e4
> [    2.258577]  __driver_probe_device+0x80/0xec
> [    2.262859]  driver_probe_device+0x44/0x130
> [    2.267055]  __driver_attach+0x104/0x1b4
> [    2.270989]  bus_for_each_dev+0x7c/0xe0
> [    2.274834]  driver_attach+0x30/0x40
> [    2.278418]  bus_add_driver+0x160/0x210
> [    2.281900] hub 1-1:1.0: USB hub found
> [    2.282264]  driver_register+0x84/0x140
> [    2.286109] hub 1-1:1.0: 7 ports detected
> [    2.289859]  __platform_driver_register+0x34/0x40
> [    2.289867]  qoriq_tmu_init+0x28/0x34
> [    2.302258]  do_one_initcall+0x50/0x250
> [    2.306104]  kernel_init_freeable+0x278/0x31c
> [    2.310474]  kernel_init+0x30/0x140
> [    2.313972]  ret_from_fork+0x10/0x20
> [    2.317559] Code: b25657e2 d34cfc00 d37ae400 8b020015 (f94006a1)
> [    2.323672] ---[ end trace 0000000000000000 ]---
> [    2.328317] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    2.335999] SMP: stopping secondary CPUs
> [    2.339932] Kernel Offset: disabled
> [    2.343425] CPU features: 0x2000,0800f021,00001086
> [    2.348229] Memory Limit: none
> [    2.351289] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
>
> This was seen a sl28 board
> (arch/arm64/boot/dts/freescale/fsl-ls1028a-kontron-kbox-a-230-ls.dts).
> The same board in the KernelCI also have some more information:
> https://lavalab.kontron.com/scheduler/job/151900#L1162
>
> But I guess even if that is fixed, the driver will not probe due to the
> missing trip points? Are they now mandatory? Does it mean we'd need to
> update our device trees? But that will then mean older devices trees
> don't work anymore.

Thanks for reporting, I'll investigate the issues you are reporting.

There is no need to update any device tree file. The code does not 
change the bindings, it is a rewrite of the implementation supposed to 
be without impact on the existing bindings, thus the existing device 
tree descriptions.

Why are you saying there are missing trip points ? The dts shows trip 
points for 'core-cluster' and 'ddr-controller' ?

> On my second board
> (arch/arm/boot/dts/lan966x-kontron-kswitch-d10-mmt-6g-2gs.dts). I get the
> following error:
>
> [    6.292819] thermal_sys: Unable to find thermal zones description
> [    6.298872] thermal_sys: Failed to find thermal zone for hwmon id=0
> [    6.305375] lan966x-hwmon e2010180.hwmon: error -EINVAL: failed to register hwmon device
> [    6.313508] lan966x-hwmon: probe of e2010180.hwmon failed with error -22

Interesting ...

> Again, is there seems to be something missing in the device tree. For this
> board a device tree change should be easily doable, as it is still in
> development.


> Let me know if I can help testing changes.

Yes, definitively, thanks for proposing
Guenter Roeck Aug. 8, 2022, 10:26 a.m. UTC | #3
On Mon, Aug 08, 2022 at 11:42:16AM +0200, Michael Walle wrote:
> Hi,
> 
> > The following changes are depending on:
> > 
> >  - 20220722200007.1839356-1-daniel.lezcano@linexp.org
> > 
> > which are present in the thermal/linux-next branch:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/log/?h=thermal/linux-next
> > 
> > The series introduces a new thermal OF code. The patch description gives
> > a detailed explanation of the changes. Basically we write new OF parsing
> > functions, we migrate all the users of the old thermal OF API to the new
> > one and then we finish by removing the old OF code.
> > 
> > That is the second step to rework the thermal OF code. More patches will
> > come after that to remove the duplication of the trip definitions in the
> > different drivers which will result in more code duplication removed and
> > consolidation of the core thermal framework.
> > 
> > Thanks for those who tested the series on their platform and
> > investigated the regression with the disabled by default thermal zones.
> 
> I haven't looked closely yet, but this series is breaking two of my
> boards.
> 
> There seems to be one mistake within the new thermal code:
> 
> [    2.030452] thermal_sys: Failed to find 'trips' node
> [    2.033664] usb 1-1: new high-speed USB device number 2 using xhci-hcd
> [    2.035434] thermal_sys: Failed to find trip points for tmu id=2
> [    2.048010] qoriq_thermal 1f80000.tmu: Failed to register sensors
> [    2.054128] qoriq_thermal: probe of 1f80000.tmu failed with error -22
> [    2.060607] devm_thermal_of_zone_release:707 res=ffff002002377180
> [    2.067044] Unable to handle kernel paging request at virtual address 01adadadadadad88
> [    2.075003] Mem abort info:
> [    2.077805]   ESR = 0x0000000096000004
> [    2.081562]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    2.086893]   SET = 0, FnV = 0
> [    2.089955]   EA = 0, S1PTW = 0
> [    2.093100]   FSC = 0x04: level 0 translation fault
> [    2.097993] Data abort info:
> [    2.100876]   ISV = 0, ISS = 0x00000004
> [    2.104724]   CM = 0, WnR = 0
> [    2.107698] [01adadadadadad88] address between user and kernel address ranges
> [    2.114863] Internal error: Oops: 96000004 [#1] SMP
> [    2.119754] Modules linked in:
> [    2.122815] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.19.0-next-20220808-00078-ga957a15f74fc-dirty #1694
> [    2.132504] Hardware name: Kontron KBox A-230-LS (DT)
> [    2.137568] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    2.144554] pc : kfree+0x5c/0x3c0
> [    2.147885] lr : thermal_of_zone_unregister+0x34/0x54
> [    2.152954] sp : ffff80000a22bab0
> [    2.156274] x29: ffff80000a22bab0 x28: 0000000000000000 x27: ffff800009960464
> [    2.163438] x26: ffff800009a16960 x25: 0000000000000006 x24: ffff800009f09a40
> [    2.170601] x23: ffff800009ab9008 x22: ffff800008d0d684 x21: 01adadadadadad80
> [    2.177763] x20: 6b6b6b6b6b6b6b6b x19: ffff002002335000 x18: 00000000fffffffb
> [    2.184925] x17: ffff800008d0d67c x16: ffff800008d072b4 x15: ffff800008d0c6c4
> [    2.192087] x14: ffff800008d0c34c x13: ffff8000088d5034 x12: ffff8000088d46d4
> [    2.199248] x11: ffff8000088d4624 x10: 0000000000000000 x9 : ffff800008d0d684
> [    2.206410] x8 : ffff002000b1a158 x7 : bbbbbbbbbbbbbbbb x6 : ffff80000a0f53b8
> [    2.213572] x5 : ffff80000a22b940 x4 : 0000000000000000 x3 : 0000000000000000
> [    2.220733] x2 : fffffc0000000000 x1 : ffff002000838040 x0 : 01adb1adadadad80
> [    2.227895] Call trace:
> [    2.230342]  kfree+0x5c/0x3c0
> [    2.233318]  thermal_of_zone_unregister+0x34/0x54
> [    2.238036]  devm_thermal_of_zone_release+0x44/0x54
> [    2.242931]  release_nodes+0x64/0xd0
> [    2.246516]  devres_release_all+0xbc/0x350
> [    2.250623]  device_unbind_cleanup+0x20/0x70
> [    2.254905]  really_probe+0x1a0/0x2e4
> [    2.258577]  __driver_probe_device+0x80/0xec
> [    2.262859]  driver_probe_device+0x44/0x130
> [    2.267055]  __driver_attach+0x104/0x1b4
> [    2.270989]  bus_for_each_dev+0x7c/0xe0
> [    2.274834]  driver_attach+0x30/0x40
> [    2.278418]  bus_add_driver+0x160/0x210
> [    2.281900] hub 1-1:1.0: USB hub found
> [    2.282264]  driver_register+0x84/0x140
> [    2.286109] hub 1-1:1.0: 7 ports detected
> [    2.289859]  __platform_driver_register+0x34/0x40
> [    2.289867]  qoriq_tmu_init+0x28/0x34
> [    2.302258]  do_one_initcall+0x50/0x250
> [    2.306104]  kernel_init_freeable+0x278/0x31c
> [    2.310474]  kernel_init+0x30/0x140
> [    2.313972]  ret_from_fork+0x10/0x20
> [    2.317559] Code: b25657e2 d34cfc00 d37ae400 8b020015 (f94006a1) 
> [    2.323672] ---[ end trace 0000000000000000 ]---
> [    2.328317] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    2.335999] SMP: stopping secondary CPUs
> [    2.339932] Kernel Offset: disabled
> [    2.343425] CPU features: 0x2000,0800f021,00001086
> [    2.348229] Memory Limit: none
> [    2.351289] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
> 
> This was seen a sl28 board
> (arch/arm64/boot/dts/freescale/fsl-ls1028a-kontron-kbox-a-230-ls.dts).
> The same board in the KernelCI also have some more information:
> https://lavalab.kontron.com/scheduler/job/151900#L1162
> 
> But I guess even if that is fixed, the driver will not probe due to the
> missing trip points? Are they now mandatory? Does it mean we'd need to
> update our device trees? But that will then mean older devices trees
> don't work anymore.

It would also mean that all hwmon drivers registering a thermal zone sensor
would fail to register unless such a thermal zone actually exists. This
would make the whole concept of having the hwmon core register thermal
zone sensors impossible. I have no idea how this is expected to work now,
but there is an apparent flaw in the logic. That means I withdraw my
Acked-by: for the hwmon patches in this series until it is guaranteed
that hwmon registration does not fail as above if there is no thermal
zone associated with a sensor.

> 
> On my second board
> (arch/arm/boot/dts/lan966x-kontron-kswitch-d10-mmt-6g-2gs.dts). I get the
> following error:
> 
> [    6.292819] thermal_sys: Unable to find thermal zones description
> [    6.298872] thermal_sys: Failed to find thermal zone for hwmon id=0
> [    6.305375] lan966x-hwmon e2010180.hwmon: error -EINVAL: failed to register hwmon device
> [    6.313508] lan966x-hwmon: probe of e2010180.hwmon failed with error -22
> 
> Again, is there seems to be something missing in the device tree. For this
> board a device tree change should be easily doable, as it is still in
> development.
> 

That would work for this board, but not for all other boards where a sensor
tries to register with the thermal subsystem but there is no thermal zone
defined for it.

Guenter
Michael Walle Aug. 8, 2022, 10:55 a.m. UTC | #4
Hi Daniel,

Am 2022-08-08 12:21, schrieb Daniel Lezcano:
> On 08/08/2022 11:42, Michael Walle wrote:
>>> The following changes are depending on:
>>> 
>>>   - 20220722200007.1839356-1-daniel.lezcano@linexp.org
>>> 
>>> which are present in the thermal/linux-next branch:
>>> 
>>> https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/log/?h=thermal/linux-next
>>> 
>>> The series introduces a new thermal OF code. The patch description 
>>> gives
>>> a detailed explanation of the changes. Basically we write new OF 
>>> parsing
>>> functions, we migrate all the users of the old thermal OF API to the 
>>> new
>>> one and then we finish by removing the old OF code.
>>> 
>>> That is the second step to rework the thermal OF code. More patches 
>>> will
>>> come after that to remove the duplication of the trip definitions in 
>>> the
>>> different drivers which will result in more code duplication removed 
>>> and
>>> consolidation of the core thermal framework.
>>> 
>>> Thanks for those who tested the series on their platform and
>>> investigated the regression with the disabled by default thermal 
>>> zones.
>> I haven't looked closely yet, but this series is breaking two of my
>> boards.
>> 
>> There seems to be one mistake within the new thermal code:
>> 
>> [    2.030452] thermal_sys: Failed to find 'trips' node
>> [    2.033664] usb 1-1: new high-speed USB device number 2 using 
>> xhci-hcd
>> [    2.035434] thermal_sys: Failed to find trip points for tmu id=2
>> [    2.048010] qoriq_thermal 1f80000.tmu: Failed to register sensors
>> [    2.054128] qoriq_thermal: probe of 1f80000.tmu failed with error 
>> -22
>> [    2.060607] devm_thermal_of_zone_release:707 res=ffff002002377180
>> [    2.067044] Unable to handle kernel paging request at virtual 
>> address 01adadadadadad88
>> [    2.075003] Mem abort info:
>> [    2.077805]   ESR = 0x0000000096000004
>> [    2.081562]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [    2.086893]   SET = 0, FnV = 0
>> [    2.089955]   EA = 0, S1PTW = 0
>> [    2.093100]   FSC = 0x04: level 0 translation fault
>> [    2.097993] Data abort info:
>> [    2.100876]   ISV = 0, ISS = 0x00000004
>> [    2.104724]   CM = 0, WnR = 0
>> [    2.107698] [01adadadadadad88] address between user and kernel 
>> address ranges
>> [    2.114863] Internal error: Oops: 96000004 [#1] SMP
>> [    2.119754] Modules linked in:
>> [    2.122815] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 
>> 5.19.0-next-20220808-00078-ga957a15f74fc-dirty #1694
>> [    2.132504] Hardware name: Kontron KBox A-230-LS (DT)
>> [    2.137568] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS 
>> BTYPE=--)
>> [    2.144554] pc : kfree+0x5c/0x3c0
>> [    2.147885] lr : thermal_of_zone_unregister+0x34/0x54
>> [    2.152954] sp : ffff80000a22bab0
>> [    2.156274] x29: ffff80000a22bab0 x28: 0000000000000000 x27: 
>> ffff800009960464
>> [    2.163438] x26: ffff800009a16960 x25: 0000000000000006 x24: 
>> ffff800009f09a40
>> [    2.170601] x23: ffff800009ab9008 x22: ffff800008d0d684 x21: 
>> 01adadadadadad80
>> [    2.177763] x20: 6b6b6b6b6b6b6b6b x19: ffff002002335000 x18: 
>> 00000000fffffffb
>> [    2.184925] x17: ffff800008d0d67c x16: ffff800008d072b4 x15: 
>> ffff800008d0c6c4
>> [    2.192087] x14: ffff800008d0c34c x13: ffff8000088d5034 x12: 
>> ffff8000088d46d4
>> [    2.199248] x11: ffff8000088d4624 x10: 0000000000000000 x9 : 
>> ffff800008d0d684
>> [    2.206410] x8 : ffff002000b1a158 x7 : bbbbbbbbbbbbbbbb x6 : 
>> ffff80000a0f53b8
>> [    2.213572] x5 : ffff80000a22b940 x4 : 0000000000000000 x3 : 
>> 0000000000000000
>> [    2.220733] x2 : fffffc0000000000 x1 : ffff002000838040 x0 : 
>> 01adb1adadadad80
>> [    2.227895] Call trace:
>> [    2.230342]  kfree+0x5c/0x3c0
>> [    2.233318]  thermal_of_zone_unregister+0x34/0x54
>> [    2.238036]  devm_thermal_of_zone_release+0x44/0x54
>> [    2.242931]  release_nodes+0x64/0xd0
>> [    2.246516]  devres_release_all+0xbc/0x350
>> [    2.250623]  device_unbind_cleanup+0x20/0x70
>> [    2.254905]  really_probe+0x1a0/0x2e4
>> [    2.258577]  __driver_probe_device+0x80/0xec
>> [    2.262859]  driver_probe_device+0x44/0x130
>> [    2.267055]  __driver_attach+0x104/0x1b4
>> [    2.270989]  bus_for_each_dev+0x7c/0xe0
>> [    2.274834]  driver_attach+0x30/0x40
>> [    2.278418]  bus_add_driver+0x160/0x210
>> [    2.281900] hub 1-1:1.0: USB hub found
>> [    2.282264]  driver_register+0x84/0x140
>> [    2.286109] hub 1-1:1.0: 7 ports detected
>> [    2.289859]  __platform_driver_register+0x34/0x40
>> [    2.289867]  qoriq_tmu_init+0x28/0x34
>> [    2.302258]  do_one_initcall+0x50/0x250
>> [    2.306104]  kernel_init_freeable+0x278/0x31c
>> [    2.310474]  kernel_init+0x30/0x140
>> [    2.313972]  ret_from_fork+0x10/0x20
>> [    2.317559] Code: b25657e2 d34cfc00 d37ae400 8b020015 (f94006a1)
>> [    2.323672] ---[ end trace 0000000000000000 ]---
>> [    2.328317] Kernel panic - not syncing: Attempted to kill init! 
>> exitcode=0x0000000b
>> [    2.335999] SMP: stopping secondary CPUs
>> [    2.339932] Kernel Offset: disabled
>> [    2.343425] CPU features: 0x2000,0800f021,00001086
>> [    2.348229] Memory Limit: none
>> [    2.351289] ---[ end Kernel panic - not syncing: Attempted to kill 
>> init! exitcode=0x0000000b ]---
>> 
>> This was seen a sl28 board
>> (arch/arm64/boot/dts/freescale/fsl-ls1028a-kontron-kbox-a-230-ls.dts).
>> The same board in the KernelCI also have some more information:
>> https://lavalab.kontron.com/scheduler/job/151900#L1162
>> 
>> But I guess even if that is fixed, the driver will not probe due to 
>> the
>> missing trip points? Are they now mandatory? Does it mean we'd need to
>> update our device trees? But that will then mean older devices trees
>> don't work anymore.
> 
> Thanks for reporting, I'll investigate the issues you are reporting.
> 
> There is no need to update any device tree file. The code does not
> change the bindings, it is a rewrite of the implementation supposed to
> be without impact on the existing bindings, thus the existing device
> tree descriptions.
> 
> Why are you saying there are missing trip points ? The dts shows trip
> points for 'core-cluster' and 'ddr-controller' ?

You are right. I've just looked at the error message:

[    2.030452] thermal_sys: Failed to find 'trips' node
[    2.033664] usb 1-1: new high-speed USB device number 2 using 
xhci-hcd
[    2.035434] thermal_sys: Failed to find trip points for tmu id=2

So maybe the code just don't find em.

-michael
Daniel Lezcano Aug. 8, 2022, 11:06 a.m. UTC | #5
On 08/08/2022 12:55, Michael Walle wrote:
> Hi Daniel,

[ ... ]

>>> There seems to be one mistake within the new thermal code:

[ ... ]

>>> This was seen a sl28 board
>>> (arch/arm64/boot/dts/freescale/fsl-ls1028a-kontron-kbox-a-230-ls.dts).
>>> The same board in the KernelCI also have some more information:
>>> https://lavalab.kontron.com/scheduler/job/151900#L1162
>>>
>>> But I guess even if that is fixed, the driver will not probe due to the
>>> missing trip points? Are they now mandatory? Does it mean we'd need to
>>> update our device trees? But that will then mean older devices trees
>>> don't work anymore.
>>
>> Thanks for reporting, I'll investigate the issues you are reporting.
>>
>> There is no need to update any device tree file. The code does not
>> change the bindings, it is a rewrite of the implementation supposed to
>> be without impact on the existing bindings, thus the existing device
>> tree descriptions.
>>
>> Why are you saying there are missing trip points ? The dts shows trip
>> points for 'core-cluster' and 'ddr-controller' ?
> 
> You are right. I've just looked at the error message:
> 
> [    2.030452] thermal_sys: Failed to find 'trips' node
> [    2.033664] usb 1-1: new high-speed USB device number 2 using xhci-hcd
> [    2.035434] thermal_sys: Failed to find trip points for tmu id=2
> 
> So maybe the code just don't find em.

For the board it seems there is no definition for tmu with id 2
Daniel Lezcano Aug. 8, 2022, 1:09 p.m. UTC | #6
On 08/08/2022 11:42, Michael Walle wrote:
> Hi,
> 
>> The following changes are depending on:
>>
>>   - 20220722200007.1839356-1-daniel.lezcano@linexp.org
>>
>> which are present in the thermal/linux-next branch:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux.git/log/?h=thermal/linux-next
>>
>> The series introduces a new thermal OF code. The patch description gives
>> a detailed explanation of the changes. Basically we write new OF parsing
>> functions, we migrate all the users of the old thermal OF API to the new
>> one and then we finish by removing the old OF code.
>>
>> That is the second step to rework the thermal OF code. More patches will
>> come after that to remove the duplication of the trip definitions in the
>> different drivers which will result in more code duplication removed and
>> consolidation of the core thermal framework.
>>
>> Thanks for those who tested the series on their platform and
>> investigated the regression with the disabled by default thermal zones.
> 
> I haven't looked closely yet, but this series is breaking two of my
> boards.
> 
> There seems to be one mistake within the new thermal code:
> 
> [    2.030452] thermal_sys: Failed to find 'trips' node
> [    2.033664] usb 1-1: new high-speed USB device number 2 using xhci-hcd
> [    2.035434] thermal_sys: Failed to find trip points for tmu id=2
> [    2.048010] qoriq_thermal 1f80000.tmu: Failed to register sensors
> [    2.054128] qoriq_thermal: probe of 1f80000.tmu failed with error -22
> [    2.060607] devm_thermal_of_zone_release:707 res=ffff002002377180
> [    2.067044] Unable to handle kernel paging request at virtual address 01adadadadadad88
> [    2.075003] Mem abort info:
> [    2.077805]   ESR = 0x0000000096000004
> [    2.081562]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    2.086893]   SET = 0, FnV = 0
> [    2.089955]   EA = 0, S1PTW = 0
> [    2.093100]   FSC = 0x04: level 0 translation fault
> [    2.097993] Data abort info:
> [    2.100876]   ISV = 0, ISS = 0x00000004
> [    2.104724]   CM = 0, WnR = 0
> [    2.107698] [01adadadadadad88] address between user and kernel address ranges
> [    2.114863] Internal error: Oops: 96000004 [#1] SMP
> [    2.119754] Modules linked in:
> [    2.122815] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.19.0-next-20220808-00078-ga957a15f74fc-dirty #1694
> [    2.132504] Hardware name: Kontron KBox A-230-LS (DT)
> [    2.137568] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [    2.144554] pc : kfree+0x5c/0x3c0
> [    2.147885] lr : thermal_of_zone_unregister+0x34/0x54
> [    2.152954] sp : ffff80000a22bab0
> [    2.156274] x29: ffff80000a22bab0 x28: 0000000000000000 x27: ffff800009960464
> [    2.163438] x26: ffff800009a16960 x25: 0000000000000006 x24: ffff800009f09a40
> [    2.170601] x23: ffff800009ab9008 x22: ffff800008d0d684 x21: 01adadadadadad80
> [    2.177763] x20: 6b6b6b6b6b6b6b6b x19: ffff002002335000 x18: 00000000fffffffb
> [    2.184925] x17: ffff800008d0d67c x16: ffff800008d072b4 x15: ffff800008d0c6c4
> [    2.192087] x14: ffff800008d0c34c x13: ffff8000088d5034 x12: ffff8000088d46d4
> [    2.199248] x11: ffff8000088d4624 x10: 0000000000000000 x9 : ffff800008d0d684
> [    2.206410] x8 : ffff002000b1a158 x7 : bbbbbbbbbbbbbbbb x6 : ffff80000a0f53b8
> [    2.213572] x5 : ffff80000a22b940 x4 : 0000000000000000 x3 : 0000000000000000
> [    2.220733] x2 : fffffc0000000000 x1 : ffff002000838040 x0 : 01adb1adadadad80
> [    2.227895] Call trace:
> [    2.230342]  kfree+0x5c/0x3c0
> [    2.233318]  thermal_of_zone_unregister+0x34/0x54
> [    2.238036]  devm_thermal_of_zone_release+0x44/0x54
> [    2.242931]  release_nodes+0x64/0xd0
> [    2.246516]  devres_release_all+0xbc/0x350
> [    2.250623]  device_unbind_cleanup+0x20/0x70
> [    2.254905]  really_probe+0x1a0/0x2e4
> [    2.258577]  __driver_probe_device+0x80/0xec
> [    2.262859]  driver_probe_device+0x44/0x130
> [    2.267055]  __driver_attach+0x104/0x1b4
> [    2.270989]  bus_for_each_dev+0x7c/0xe0
> [    2.274834]  driver_attach+0x30/0x40
> [    2.278418]  bus_add_driver+0x160/0x210
> [    2.281900] hub 1-1:1.0: USB hub found
> [    2.282264]  driver_register+0x84/0x140
> [    2.286109] hub 1-1:1.0: 7 ports detected
> [    2.289859]  __platform_driver_register+0x34/0x40
> [    2.289867]  qoriq_tmu_init+0x28/0x34
> [    2.302258]  do_one_initcall+0x50/0x250
> [    2.306104]  kernel_init_freeable+0x278/0x31c
> [    2.310474]  kernel_init+0x30/0x140
> [    2.313972]  ret_from_fork+0x10/0x20
> [    2.317559] Code: b25657e2 d34cfc00 d37ae400 8b020015 (f94006a1)
> [    2.323672] ---[ end trace 0000000000000000 ]---
> [    2.328317] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [    2.335999] SMP: stopping secondary CPUs
> [    2.339932] Kernel Offset: disabled
> [    2.343425] CPU features: 0x2000,0800f021,00001086
> [    2.348229] Memory Limit: none
> [    2.351289] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
> 
> This was seen a sl28 board
> (arch/arm64/boot/dts/freescale/fsl-ls1028a-kontron-kbox-a-230-ls.dts).
> The same board in the KernelCI also have some more information:
> https://lavalab.kontron.com/scheduler/job/151900#L1162
> 
> But I guess even if that is fixed, the driver will not probe due to the
> missing trip points? Are they now mandatory? Does it mean we'd need to
> update our device trees? But that will then mean older devices trees
> don't work anymore.

Does this fix solves this first issue ?

https://lore.kernel.org/all/YvDzovkMCQecPDjz@kili/
Michael Walle Aug. 8, 2022, 1:24 p.m. UTC | #7
Hi Daniel,

Am 2022-08-08 15:09, schrieb Daniel Lezcano:
> Does this fix solves this first issue ?
> 
> https://lore.kernel.org/all/YvDzovkMCQecPDjz@kili/

Unfortunately not, it is still the same:

[    1.915140] thermal_sys: Failed to find thermal zone for tmu id=2
[    1.921279] qoriq_thermal 1f80000.tmu: Failed to register sensors
[    1.927395] qoriq_thermal: probe of 1f80000.tmu failed with error -22
[    1.934189] Unable to handle kernel paging request at virtual address 
01adadadadadad88
[    1.942146] Mem abort info:
[    1.944948]   ESR = 0x0000000096000004
[    1.948708]   EC = 0x25: DABT (current EL), IL = 32 bits
[    1.954042]   SET = 0, FnV = 0
[    1.957107]   EA = 0, S1PTW = 0
[    1.960253]   FSC = 0x04: level 0 translation fault
[    1.965147] Data abort info:
[    1.968030]   ISV = 0, ISS = 0x00000004
[    1.971878]   CM = 0, WnR = 0
[    1.974852] [01adadadadadad88] address between user and kernel 
address ranges
[    1.982016] Internal error: Oops: 96000004 [#1] SMP
[    1.986907] Modules linked in:
[    1.989969] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 
5.19.0-next-20220808-00080-g1c46f44502e0 #1697
[    1.999135] Hardware name: Kontron KBox A-230-LS (DT)
[    2.004199] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS 
BTYPE=--)
[    2.011185] pc : kfree+0x5c/0x3c0
[    2.014516] lr : devm_thermal_of_zone_release+0x38/0x60
[    2.019761] sp : ffff80000a22bad0
[    2.023081] x29: ffff80000a22bad0 x28: 0000000000000000 x27: 
ffff800009960464
[    2.030245] x26: ffff800009a16960 x25: 0000000000000006 x24: 
ffff800009f09a40
[    2.037407] x23: ffff800009ab9008 x22: ffff800008d0eea8 x21: 
01adadadadadad80
[    2.044569] x20: 6b6b6b6b6b6b6b6b x19: ffff00200232b800 x18: 
00000000fffffffb
[    2.051731] x17: ffff800008d0eea0 x16: ffff800008d07d44 x15: 
ffff800008d0d154
[    2.056647] usb 1-1: new high-speed USB device number 2 using 
xhci-hcd
[    2.058893] x14: ffff800008d0cddc x13: ffff8000088d1c2c x12: 
ffff8000088d5034
[    2.072597] x11: ffff8000088d46d4 x10: 0000000000000000 x9 : 
ffff800008d0eea8
[    2.079759] x8 : ffff002000b1a158 x7 : bbbbbbbbbbbbbbbb x6 : 
ffff80000a0f53b8
[    2.086921] x5 : ffff80000a22b960 x4 : 0000000000000000 x3 : 
0000000000000000
[    2.094082] x2 : fffffc0000000000 x1 : ffff002000838040 x0 : 
01adb1adadadad80
[    2.101244] Call trace:
[    2.103692]  kfree+0x5c/0x3c0
[    2.106666]  devm_thermal_of_zone_release+0x38/0x60
[    2.111561]  release_nodes+0x64/0xd0
[    2.115146]  devres_release_all+0xbc/0x350
[    2.119253]  device_unbind_cleanup+0x20/0x70
[    2.123536]  really_probe+0x1a0/0x2e4
[    2.127208]  __driver_probe_device+0x80/0xec
[    2.131490]  driver_probe_device+0x44/0x130
[    2.135685]  __driver_attach+0x104/0x1b4
[    2.139619]  bus_for_each_dev+0x7c/0xe0
[    2.143465]  driver_attach+0x30/0x40
[    2.147048]  bus_add_driver+0x160/0x210
[    2.150894]  driver_register+0x84/0x140
[    2.154741]  __platform_driver_register+0x34/0x40
[    2.159461]  qoriq_tmu_init+0x28/0x34
[    2.163133]  do_one_initcall+0x50/0x250
[    2.166979]  kernel_init_freeable+0x278/0x31c
[    2.171349]  kernel_init+0x30/0x140
[    2.174847]  ret_from_fork+0x10/0x20
[    2.178433] Code: b25657e2 d34cfc00 d37ae400 8b020015 (f94006a1)
[    2.184546] ---[ end trace 0000000000000000 ]---
[    2.189188] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0000000b
[    2.196869] SMP: stopping secondary CPUs
[    2.200803] Kernel Offset: disabled
[    2.204296] CPU features: 0x2000,0800f021,00001086
[    2.209100] Memory Limit: none
[    2.212158] ---[ end Kernel panic - not syncing: Attempted to kill 
init! exitcode=0x0000000b ]---

-michael
Daniel Lezcano Aug. 8, 2022, 1:31 p.m. UTC | #8
On 08/08/2022 15:24, Michael Walle wrote:
> Hi Daniel,
> 
> Am 2022-08-08 15:09, schrieb Daniel Lezcano:
>> Does this fix solves this first issue ?
>>
>> https://lore.kernel.org/all/YvDzovkMCQecPDjz@kili/
> 
> Unfortunately not, it is still the same:

Ok, thanks for testing

> 
> [    1.915140] thermal_sys: Failed to find thermal zone for tmu id=2
> [    1.921279] qoriq_thermal 1f80000.tmu: Failed to register sensors
> [    1.927395] qoriq_thermal: probe of 1f80000.tmu failed with error -22
> [    1.934189] Unable to handle kernel paging request at virtual address 
> 01adadadadadad88
> [    1.942146] Mem abort info:
> [    1.944948]   ESR = 0x0000000096000004
> [    1.948708]   EC = 0x25: DABT (current EL), IL = 32 bits
> [    1.954042]   SET = 0, FnV = 0
> [    1.957107]   EA = 0, S1PTW = 0
> [    1.960253]   FSC = 0x04: level 0 translation fault
> [    1.965147] Data abort info:
> [    1.968030]   ISV = 0, ISS = 0x00000004
> [    1.971878]   CM = 0, WnR = 0
> [    1.974852] [01adadadadadad88] address between user and kernel 
> address ranges
> [    1.982016] Internal error: Oops: 96000004 [#1] SMP
> [    1.986907] Modules linked in:
> [    1.989969] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 
> 5.19.0-next-20220808-00080-g1c46f44502e0 #1697
> [    1.999135] Hardware name: Kontron KBox A-230-LS (DT)
> [    2.004199] pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS 
> BTYPE=--)
> [    2.011185] pc : kfree+0x5c/0x3c0
> [    2.014516] lr : devm_thermal_of_zone_release+0x38/0x60
> [    2.019761] sp : ffff80000a22bad0
> [    2.023081] x29: ffff80000a22bad0 x28: 0000000000000000 x27: 
> ffff800009960464
> [    2.030245] x26: ffff800009a16960 x25: 0000000000000006 x24: 
> ffff800009f09a40
> [    2.037407] x23: ffff800009ab9008 x22: ffff800008d0eea8 x21: 
> 01adadadadadad80
> [    2.044569] x20: 6b6b6b6b6b6b6b6b x19: ffff00200232b800 x18: 
> 00000000fffffffb
> [    2.051731] x17: ffff800008d0eea0 x16: ffff800008d07d44 x15: 
> ffff800008d0d154
> [    2.056647] usb 1-1: new high-speed USB device number 2 using xhci-hcd
> [    2.058893] x14: ffff800008d0cddc x13: ffff8000088d1c2c x12: 
> ffff8000088d5034
> [    2.072597] x11: ffff8000088d46d4 x10: 0000000000000000 x9 : 
> ffff800008d0eea8
> [    2.079759] x8 : ffff002000b1a158 x7 : bbbbbbbbbbbbbbbb x6 : 
> ffff80000a0f53b8
> [    2.086921] x5 : ffff80000a22b960 x4 : 0000000000000000 x3 : 
> 0000000000000000
> [    2.094082] x2 : fffffc0000000000 x1 : ffff002000838040 x0 : 
> 01adb1adadadad80
> [    2.101244] Call trace:
> [    2.103692]  kfree+0x5c/0x3c0
> [    2.106666]  devm_thermal_of_zone_release+0x38/0x60
> [    2.111561]  release_nodes+0x64/0xd0
> [    2.115146]  devres_release_all+0xbc/0x350
> [    2.119253]  device_unbind_cleanup+0x20/0x70
> [    2.123536]  really_probe+0x1a0/0x2e4
> [    2.127208]  __driver_probe_device+0x80/0xec
> [    2.131490]  driver_probe_device+0x44/0x130
> [    2.135685]  __driver_attach+0x104/0x1b4
> [    2.139619]  bus_for_each_dev+0x7c/0xe0
> [    2.143465]  driver_attach+0x30/0x40
> [    2.147048]  bus_add_driver+0x160/0x210
> [    2.150894]  driver_register+0x84/0x140
> [    2.154741]  __platform_driver_register+0x34/0x40
> [    2.159461]  qoriq_tmu_init+0x28/0x34
> [    2.163133]  do_one_initcall+0x50/0x250
> [    2.166979]  kernel_init_freeable+0x278/0x31c
> [    2.171349]  kernel_init+0x30/0x140
> [    2.174847]  ret_from_fork+0x10/0x20
> [    2.178433] Code: b25657e2 d34cfc00 d37ae400 8b020015 (f94006a1)
> [    2.184546] ---[ end trace 0000000000000000 ]---
> [    2.189188] Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x0000000b
> [    2.196869] SMP: stopping secondary CPUs
> [    2.200803] Kernel Offset: disabled
> [    2.204296] CPU features: 0x2000,0800f021,00001086
> [    2.209100] Memory Limit: none
> [    2.212158] ---[ end Kernel panic - not syncing: Attempted to kill 
> init! exitcode=0x0000000b ]---
> 
> -michael
Daniel Lezcano Aug. 9, 2022, 8:23 a.m. UTC | #9
On 08/08/2022 11:42, Michael Walle wrote:

[ ... ]

> On my second board
> (arch/arm/boot/dts/lan966x-kontron-kswitch-d10-mmt-6g-2gs.dts). I get the
> following error:
> 
> [    6.292819] thermal_sys: Unable to find thermal zones description
> [    6.298872] thermal_sys: Failed to find thermal zone for hwmon id=0
> [    6.305375] lan966x-hwmon e2010180.hwmon: error -EINVAL: failed to register hwmon device
> [    6.313508] lan966x-hwmon: probe of e2010180.hwmon failed with error -22
> 
> Again, is there seems to be something missing in the device tree. For this
> board a device tree change should be easily doable, as it is still in
> development.

Logically with the fixes I've send these errors should have gone. Just a 
pr_info should appear "... not attached to any thermal zone".

If I'm correct, without or with the changes (new thermal OF code + 
fixes), the hwmon message is the same and the hwmon thermal zone is not 
created. So no regression hopefully.

Is it possible to check that?

[ ... ]
Daniel Lezcano Aug. 9, 2022, 8:53 a.m. UTC | #10
Hi Guenter,

On 08/08/2022 12:26, Guenter Roeck wrote:

[ ... ]

>> But I guess even if that is fixed, the driver will not probe due to the
>> missing trip points? Are they now mandatory? Does it mean we'd need to
>> update our device trees? But that will then mean older devices trees
>> don't work anymore.
> 
> It would also mean that all hwmon drivers registering a thermal zone sensor
> would fail to register unless such a thermal zone actually exists. 

Probably missing something but if the thermal zone is not described, the 
hwmon driver won't initialize. And except if I'm wrong, that was already 
the case before these changes, no?

> This
> would make the whole concept of having the hwmon core register thermal
> zone sensors impossible.

No, only the way the thermal OF is implemented changed. No functional 
changes. So AFAICT, you can still create thermal zones with the hwmon.

> I have no idea how this is expected to work now,
> but there is an apparent flaw in the logic. That means I withdraw my
> Acked-by: for the hwmon patches in this series until it is guaranteed
> that hwmon registration does not fail as above if there is no thermal
> zone associated with a sensor.


If the thermal zone creation fails with -ENODEV, then it is no 
considered as an error when creating the hwmon [1]

The function [devm]_thermal_zone_of_sensor_register() checks if there is 
a thermal zone description, if not it bails out with -ENODEV [2]

Otherwise it checks all the thermal zones if the device passed as 
parameter matches a sensor in the thermal zone [3][4]

If there is no match, then it returns -ENODEV which is the default error 
code [5]

My understanding is there is no thermal zone creation if there is no 
description in the device tree for such a device in the thermal zone.

The issue we had here was the confusing error message when -ENODEV 
(before was -EINVAL) is returning while before the code was silently 
continuing without creating the thermal zone.

We are talking here about what is in under CONFIG_THERMAL_OF in the 
hwmon code path. The rest is untouched.

Am I missing something?



[1] 
https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/refs/tags/v5.18/drivers/hwmon/hwmon.c#230

[2] 
https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/refs/tags/v5.18/drivers/thermal/thermal_of.c#499

[3] 
https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/refs/tags/v5.18/drivers/thermal/thermal_of.c#510

[4] 
https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/refs/tags/v5.18/drivers/thermal/thermal_of.c#428

[5] 
https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/refs/tags/v5.18/drivers/thermal/thermal_of.c#497
Guenter Roeck Aug. 9, 2022, 2:32 p.m. UTC | #11
On 8/9/22 01:53, Daniel Lezcano wrote:
> Hi Guenter,
> 
> On 08/08/2022 12:26, Guenter Roeck wrote:
> 
> [ ... ]
> 
>>> But I guess even if that is fixed, the driver will not probe due to the
>>> missing trip points? Are they now mandatory? Does it mean we'd need to
>>> update our device trees? But that will then mean older devices trees
>>> don't work anymore.
>>
>> It would also mean that all hwmon drivers registering a thermal zone sensor
>> would fail to register unless such a thermal zone actually exists. 
> 
> Probably missing something but if the thermal zone is not described, the hwmon driver won't initialize. And except if I'm wrong, that was already the case before these changes, no?
> 

In the hwmon source (you point to it below):

         if (IS_ERR(tzd)) {
                 if (PTR_ERR(tzd) != -ENODEV)
                         return PTR_ERR(tzd);
                 dev_info(dev, "temp%d_input not attached to any thermal zone\n",
                          index + 1);
                 devm_kfree(dev, tdata);
                 return 0;
         }

That contradicts "if the thermal zone is not described, the hwmon driver won't initialize".
Now I must be missing something, since you mention that yourself below, and your new patch
series fixes the problem, at least AFAICS. Confused.

Guenter

>> This
>> would make the whole concept of having the hwmon core register thermal
>> zone sensors impossible.
> 
> No, only the way the thermal OF is implemented changed. No functional changes. So AFAICT, you can still create thermal zones with the hwmon.
> 
>> I have no idea how this is expected to work now,
>> but there is an apparent flaw in the logic. That means I withdraw my
>> Acked-by: for the hwmon patches in this series until it is guaranteed
>> that hwmon registration does not fail as above if there is no thermal
>> zone associated with a sensor.
> 
> 
> If the thermal zone creation fails with -ENODEV, then it is no considered as an error when creating the hwmon [1]
> 
> The function [devm]_thermal_zone_of_sensor_register() checks if there is a thermal zone description, if not it bails out with -ENODEV [2]
> 
> Otherwise it checks all the thermal zones if the device passed as parameter matches a sensor in the thermal zone [3][4]
> 
> If there is no match, then it returns -ENODEV which is the default error code [5]
> 
> My understanding is there is no thermal zone creation if there is no description in the device tree for such a device in the thermal zone.
> 
> The issue we had here was the confusing error message when -ENODEV (before was -EINVAL) is returning while before the code was silently continuing without creating the thermal zone.
> 
> We are talking here about what is in under CONFIG_THERMAL_OF in the hwmon code path. The rest is untouched.
> 
> Am I missing something?
> 
> 
> 
> [1] https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/refs/tags/v5.18/drivers/hwmon/hwmon.c#230
> 
> [2] https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/refs/tags/v5.18/drivers/thermal/thermal_of.c#499
> 
> [3] https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/refs/tags/v5.18/drivers/thermal/thermal_of.c#510
> 
> [4] https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/refs/tags/v5.18/drivers/thermal/thermal_of.c#428
> 
> [5] https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux/+/refs/tags/v5.18/drivers/thermal/thermal_of.c#497
Daniel Lezcano Aug. 9, 2022, 3:37 p.m. UTC | #12
On 09/08/2022 16:32, Guenter Roeck wrote:
> On 8/9/22 01:53, Daniel Lezcano wrote:
>> Hi Guenter,
>>
>> On 08/08/2022 12:26, Guenter Roeck wrote:
>>
>> [ ... ]
>>
>>>> But I guess even if that is fixed, the driver will not probe due to the
>>>> missing trip points? Are they now mandatory? Does it mean we'd need to
>>>> update our device trees? But that will then mean older devices trees
>>>> don't work anymore.
>>>
>>> It would also mean that all hwmon drivers registering a thermal zone 
>>> sensor
>>> would fail to register unless such a thermal zone actually exists. 
>>
>> Probably missing something but if the thermal zone is not described, 
>> the hwmon driver won't initialize. And except if I'm wrong, that was 
>> already the case before these changes, no?
>>
> 
> In the hwmon source (you point to it below):
> 
>          if (IS_ERR(tzd)) {
>                  if (PTR_ERR(tzd) != -ENODEV)
>                          return PTR_ERR(tzd);
>                  dev_info(dev, "temp%d_input not attached to any thermal 
> zone\n",
>                           index + 1);
>                  devm_kfree(dev, tdata);
>                  return 0;
>          }
> 
> That contradicts "if the thermal zone is not described, the hwmon driver 
> won't initialize".
> Now I must be missing something, since you mention that yourself below, 
> and your new patch
> series fixes the problem, at least AFAICS. Confused.

Sorry for not being clear. Let me try to explain it differently.

The function hwmon_thermal_add_sensor() is calling:


Without "[PATCH v5 00/33] New thermal OF code":
-----------------------------------------------

   devm_thermal_zone_of_sensor_register(dev, index, ...);

If there is no thermal zone description or the 'dev' does not belong to 
any thermal zone then -ENODEV is returned -> OK and the hwmon thermal 
zone is _not_ created



With "[PATCH v5 00/33] New thermal OF code":
--------------------------------------------

  devm_thermal_of_zone_register(dev, index, ...);

If there is no thermal zone description or the 'dev' does not belong to 
any thermal zone then *-EINVAL* is returned -> NOK, error message, and 
hwmon thermal zone is _not_ created



With "[PATCH v5 00/33] New thermal OF code" + fixes:
----------------------------------------------------

  devm_thermal_of_zone_register(dev, index, ...);

If there is no thermal zone description or the 'dev' does not belong to 
any thermal zone then *-ENODEV* is returned -> OK and the hwmon thermal 
zone is _not_ created


Quoting your initial message:

"It would also mean that all hwmon drivers registering a thermal zone 
sensor would fail to register unless such a thermal zone actually 
exists. This would make the whole concept of having the hwmon core 
register thermal zone sensors impossible [...]"

The thermal zone must be described if the OF registering function 
variant is used. Except I'm wrong that was already the case before the 
changes.

Otherwise, nothing prevents a hwmon to register itself with the core 
code, that will create the thermal zone.
Michael Walle Aug. 10, 2022, 8:01 a.m. UTC | #13
Hi Daniel,

Am 2022-08-09 10:23, schrieb Daniel Lezcano:
>> On my second board
>> (arch/arm/boot/dts/lan966x-kontron-kswitch-d10-mmt-6g-2gs.dts). I get 
>> the
>> following error:
>> 
>> [    6.292819] thermal_sys: Unable to find thermal zones description
>> [    6.298872] thermal_sys: Failed to find thermal zone for hwmon id=0
>> [    6.305375] lan966x-hwmon e2010180.hwmon: error -EINVAL: failed to 
>> register hwmon device
>> [    6.313508] lan966x-hwmon: probe of e2010180.hwmon failed with 
>> error -22
>> 
>> Again, is there seems to be something missing in the device tree. For 
>> this
>> board a device tree change should be easily doable, as it is still in
>> development.
> 
> Logically with the fixes I've send these errors should have gone. Just
> a pr_info should appear "... not attached to any thermal zone".
> 
> If I'm correct, without or with the changes (new thermal OF code +
> fixes), the hwmon message is the same and the hwmon thermal zone is
> not created. So no regression hopefully.
> 
> Is it possible to check that?

Yes, I'm no more seeing any error messages and the device
is probed successfully. Both on the kswitch board as well as on
the sl28 board.

Thanks,
-michael
Daniel Lezcano Aug. 10, 2022, 8:26 a.m. UTC | #14
Hi Michael,

On 10/08/2022 10:01, Michael Walle wrote:
> Hi Daniel, >
> Am 2022-08-09 10:23, schrieb Daniel Lezcano:
>>> On my second board
>>> (arch/arm/boot/dts/lan966x-kontron-kswitch-d10-mmt-6g-2gs.dts). I get 
>>> the
>>> following error:
>>>
>>> [    6.292819] thermal_sys: Unable to find thermal zones description
>>> [    6.298872] thermal_sys: Failed to find thermal zone for hwmon id=0
>>> [    6.305375] lan966x-hwmon e2010180.hwmon: error -EINVAL: failed to 
>>> register hwmon device
>>> [    6.313508] lan966x-hwmon: probe of e2010180.hwmon failed with 
>>> error -22
>>>
>>> Again, is there seems to be something missing in the device tree. For 
>>> this
>>> board a device tree change should be easily doable, as it is still in
>>> development.
>>
>> Logically with the fixes I've send these errors should have gone. Just
>> a pr_info should appear "... not attached to any thermal zone".
>>
>> If I'm correct, without or with the changes (new thermal OF code +
>> fixes), the hwmon message is the same and the hwmon thermal zone is
>> not created. So no regression hopefully.
>>
>> Is it possible to check that?
> 
> Yes, I'm no more seeing any error messages and the device
> is probed successfully. Both on the kswitch board as well as on
> the sl28 board.

Great! thanks for testing

   -- Daniel
Daniel Lezcano Aug. 10, 2022, 8:34 a.m. UTC | #15
Hi Guenter,

On 08/08/2022 12:26, Guenter Roeck wrote:
> On Mon, Aug 08, 2022 at 11:42:16AM +0200, Michael Walle wrote:

[ ... ]

> It would also mean that all hwmon drivers registering a thermal zone sensor
> would fail to register unless such a thermal zone actually exists. This
> would make the whole concept of having the hwmon core register thermal
> zone sensors impossible. I have no idea how this is expected to work now,
> but there is an apparent flaw in the logic. That means I withdraw my
> Acked-by: for the hwmon patches in this series until it is guaranteed
> that hwmon registration does not fail as above if there is no thermal
> zone associated with a sensor.

I tried to figure out where we are not in sync and I suspect there is a 
misunderstanding from my side of your initial statement.

I understood you meant the thermal zone can not be created with the 
hwmon if there is no thermal zone description and that was the case before.

But actually, I suspect I misunderstood and you meant if the thermal OF 
registration fails, the hwmon creation fails for the hwmon subsystem, 
and before the changes it was created anyway.

Is that correct ?

If yes, the change -EINVAL --> -ENODEV fixes the issue, shall I consider 
in this case your Acked-by ?

Thanks
   -- Daniel
Guenter Roeck Aug. 10, 2022, 9:56 a.m. UTC | #16
On 8/10/22 01:34, Daniel Lezcano wrote:
> 
> Hi Guenter,
> 
> On 08/08/2022 12:26, Guenter Roeck wrote:
>> On Mon, Aug 08, 2022 at 11:42:16AM +0200, Michael Walle wrote:
> 
> [ ... ]
> 
>> It would also mean that all hwmon drivers registering a thermal zone sensor
>> would fail to register unless such a thermal zone actually exists. This
>> would make the whole concept of having the hwmon core register thermal
>> zone sensors impossible. I have no idea how this is expected to work now,
>> but there is an apparent flaw in the logic. That means I withdraw my
>> Acked-by: for the hwmon patches in this series until it is guaranteed
>> that hwmon registration does not fail as above if there is no thermal
>> zone associated with a sensor.
> 
> I tried to figure out where we are not in sync and I suspect there is a misunderstanding from my side of your initial statement.
> 
> I understood you meant the thermal zone can not be created with the hwmon if there is no thermal zone description and that was the case before.
> 
> But actually, I suspect I misunderstood and you meant if the thermal OF registration fails, the hwmon creation fails for the hwmon subsystem, and before the changes it was created anyway.
> 
> Is that correct ?
> 

Yes, that is what I meant.

> If yes, the change -EINVAL --> -ENODEV fixes the issue, shall I consider in this case your Acked-by ?
> 
Yes.

Thanks,
Guenter

> Thanks
>    -- Daniel
>