diff mbox series

[v2] hwmon: Handle failure to register sensor with thermal zone correctly

Message ID 20220221182209.1795242-1-linux@roeck-us.net (mailing list archive)
State Accepted
Headers show
Series [v2] hwmon: Handle failure to register sensor with thermal zone correctly | expand

Commit Message

Guenter Roeck Feb. 21, 2022, 6:22 p.m. UTC
If an attempt is made to a sensor with a thermal zone and it fails,
the call to devm_thermal_zone_of_sensor_register() may return -ENODEV.
This may result in crashes similar to the following.

Unable to handle kernel NULL pointer dereference at virtual address 00000000000003cd
...
Internal error: Oops: 96000021 [#1] PREEMPT SMP
...
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mutex_lock+0x18/0x60
lr : thermal_zone_device_update+0x40/0x2e0
sp : ffff800014c4fc60
x29: ffff800014c4fc60 x28: ffff365ee3f6e000 x27: ffffdde218426790
x26: ffff365ee3f6e000 x25: 0000000000000000 x24: ffff365ee3f6e000
x23: ffffdde218426870 x22: ffff365ee3f6e000 x21: 00000000000003cd
x20: ffff365ee8bf3308 x19: ffffffffffffffed x18: 0000000000000000
x17: ffffdde21842689c x16: ffffdde1cb7a0b7c x15: 0000000000000040
x14: ffffdde21a4889a0 x13: 0000000000000228 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
x8 : 0000000001120000 x7 : 0000000000000001 x6 : 0000000000000000
x5 : 0068000878e20f07 x4 : 0000000000000000 x3 : 00000000000003cd
x2 : ffff365ee3f6e000 x1 : 0000000000000000 x0 : 00000000000003cd
Call trace:
 mutex_lock+0x18/0x60
 hwmon_notify_event+0xfc/0x110
 0xffffdde1cb7a0a90
 0xffffdde1cb7a0b7c
 irq_thread_fn+0x2c/0xa0
 irq_thread+0x134/0x240
 kthread+0x178/0x190
 ret_from_fork+0x10/0x20
Code: d503201f d503201f d2800001 aa0103e4 (c8e47c02)

Jon Hunter reports that the exact call sequence is:

hwmon_notify_event()
  --> hwmon_thermal_notify()
    --> thermal_zone_device_update()
      --> update_temperature()
        --> mutex_lock()

The hwmon core needs to handle all errors returned from calls
to devm_thermal_zone_of_sensor_register(). If the call fails
with -ENODEV, report that the sensor was not attached to a
thermal zone  but continue to register the hwmon device.

Reported-by: Jon Hunter <jonathanh@nvidia.com>
Cc: Dmitry Osipenko <digetx@gmail.com>
Fixes: 1597b374af222 ("hwmon: Add notification support")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
v2: Use dev_info instead of dev_warn, and change message to be
    less alarming.

 drivers/hwmon/hwmon.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

Comments

Dmitry Osipenko Feb. 21, 2022, 6:49 p.m. UTC | #1
On 2/21/22 21:22, Guenter Roeck wrote:
> If an attempt is made to a sensor with a thermal zone and it fails,
> the call to devm_thermal_zone_of_sensor_register() may return -ENODEV.
> This may result in crashes similar to the following.
> 
> Unable to handle kernel NULL pointer dereference at virtual address 00000000000003cd
> ...
> Internal error: Oops: 96000021 [#1] PREEMPT SMP
> ...
> pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : mutex_lock+0x18/0x60
> lr : thermal_zone_device_update+0x40/0x2e0
> sp : ffff800014c4fc60
> x29: ffff800014c4fc60 x28: ffff365ee3f6e000 x27: ffffdde218426790
> x26: ffff365ee3f6e000 x25: 0000000000000000 x24: ffff365ee3f6e000
> x23: ffffdde218426870 x22: ffff365ee3f6e000 x21: 00000000000003cd
> x20: ffff365ee8bf3308 x19: ffffffffffffffed x18: 0000000000000000
> x17: ffffdde21842689c x16: ffffdde1cb7a0b7c x15: 0000000000000040
> x14: ffffdde21a4889a0 x13: 0000000000000228 x12: 0000000000000000
> x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
> x8 : 0000000001120000 x7 : 0000000000000001 x6 : 0000000000000000
> x5 : 0068000878e20f07 x4 : 0000000000000000 x3 : 00000000000003cd
> x2 : ffff365ee3f6e000 x1 : 0000000000000000 x0 : 00000000000003cd
> Call trace:
>  mutex_lock+0x18/0x60
>  hwmon_notify_event+0xfc/0x110
>  0xffffdde1cb7a0a90
>  0xffffdde1cb7a0b7c
>  irq_thread_fn+0x2c/0xa0
>  irq_thread+0x134/0x240
>  kthread+0x178/0x190
>  ret_from_fork+0x10/0x20
> Code: d503201f d503201f d2800001 aa0103e4 (c8e47c02)
> 
> Jon Hunter reports that the exact call sequence is:
> 
> hwmon_notify_event()
>   --> hwmon_thermal_notify()
>     --> thermal_zone_device_update()
>       --> update_temperature()
>         --> mutex_lock()
> 
> The hwmon core needs to handle all errors returned from calls
> to devm_thermal_zone_of_sensor_register(). If the call fails
> with -ENODEV, report that the sensor was not attached to a
> thermal zone  but continue to register the hwmon device.
> 
> Reported-by: Jon Hunter <jonathanh@nvidia.com>
> Cc: Dmitry Osipenko <digetx@gmail.com>
> Fixes: 1597b374af222 ("hwmon: Add notification support")
> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
> ---
> v2: Use dev_info instead of dev_warn, and change message to be
>     less alarming.
> 
>  drivers/hwmon/hwmon.c | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
> index 3501a3ead4ba..3ae961986fc3 100644
> --- a/drivers/hwmon/hwmon.c
> +++ b/drivers/hwmon/hwmon.c
> @@ -214,12 +214,14 @@ static int hwmon_thermal_add_sensor(struct device *dev, int index)
>  
>  	tzd = devm_thermal_zone_of_sensor_register(dev, index, tdata,
>  						   &hwmon_thermal_ops);
> -	/*
> -	 * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
> -	 * so ignore that error but forward any other error.
> -	 */
> -	if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
> -		return PTR_ERR(tzd);
> +	if (IS_ERR(tzd)) {
> +		if (PTR_ERR(tzd) != -ENODEV)
> +			return PTR_ERR(tzd);
> +		dev_info(dev, "temp%d_input not attached to any thermal zone\n",
> +			 index + 1);
> +		devm_kfree(dev, tdata);
> +		return 0;
> +	}
>  
>  	err = devm_add_action(dev, hwmon_thermal_remove_sensor, &tdata->node);
>  	if (err)

LGTM, thank you. But still needs a t-b from Jon.

Reviewed-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Guenter Roeck Feb. 21, 2022, 8:12 p.m. UTC | #2
On 2/21/22 10:49, Dmitry Osipenko wrote:
> 
> On 2/21/22 21:22, Guenter Roeck wrote:
>> If an attempt is made to a sensor with a thermal zone and it fails,
>> the call to devm_thermal_zone_of_sensor_register() may return -ENODEV.
>> This may result in crashes similar to the following.
>>
>> Unable to handle kernel NULL pointer dereference at virtual address 00000000000003cd
>> ...
>> Internal error: Oops: 96000021 [#1] PREEMPT SMP
>> ...
>> pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> pc : mutex_lock+0x18/0x60
>> lr : thermal_zone_device_update+0x40/0x2e0
>> sp : ffff800014c4fc60
>> x29: ffff800014c4fc60 x28: ffff365ee3f6e000 x27: ffffdde218426790
>> x26: ffff365ee3f6e000 x25: 0000000000000000 x24: ffff365ee3f6e000
>> x23: ffffdde218426870 x22: ffff365ee3f6e000 x21: 00000000000003cd
>> x20: ffff365ee8bf3308 x19: ffffffffffffffed x18: 0000000000000000
>> x17: ffffdde21842689c x16: ffffdde1cb7a0b7c x15: 0000000000000040
>> x14: ffffdde21a4889a0 x13: 0000000000000228 x12: 0000000000000000
>> x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
>> x8 : 0000000001120000 x7 : 0000000000000001 x6 : 0000000000000000
>> x5 : 0068000878e20f07 x4 : 0000000000000000 x3 : 00000000000003cd
>> x2 : ffff365ee3f6e000 x1 : 0000000000000000 x0 : 00000000000003cd
>> Call trace:
>>   mutex_lock+0x18/0x60
>>   hwmon_notify_event+0xfc/0x110
>>   0xffffdde1cb7a0a90
>>   0xffffdde1cb7a0b7c
>>   irq_thread_fn+0x2c/0xa0
>>   irq_thread+0x134/0x240
>>   kthread+0x178/0x190
>>   ret_from_fork+0x10/0x20
>> Code: d503201f d503201f d2800001 aa0103e4 (c8e47c02)
>>
>> Jon Hunter reports that the exact call sequence is:
>>
>> hwmon_notify_event()
>>    --> hwmon_thermal_notify()
>>      --> thermal_zone_device_update()
>>        --> update_temperature()
>>          --> mutex_lock()
>>
>> The hwmon core needs to handle all errors returned from calls
>> to devm_thermal_zone_of_sensor_register(). If the call fails
>> with -ENODEV, report that the sensor was not attached to a
>> thermal zone  but continue to register the hwmon device.
>>
>> Reported-by: Jon Hunter <jonathanh@nvidia.com>
>> Cc: Dmitry Osipenko <digetx@gmail.com>
>> Fixes: 1597b374af222 ("hwmon: Add notification support")
>> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
>> ---
>> v2: Use dev_info instead of dev_warn, and change message to be
>>      less alarming.
>>
>>   drivers/hwmon/hwmon.c | 14 ++++++++------
>>   1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
>> index 3501a3ead4ba..3ae961986fc3 100644
>> --- a/drivers/hwmon/hwmon.c
>> +++ b/drivers/hwmon/hwmon.c
>> @@ -214,12 +214,14 @@ static int hwmon_thermal_add_sensor(struct device *dev, int index)
>>   
>>   	tzd = devm_thermal_zone_of_sensor_register(dev, index, tdata,
>>   						   &hwmon_thermal_ops);
>> -	/*
>> -	 * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
>> -	 * so ignore that error but forward any other error.
>> -	 */
>> -	if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
>> -		return PTR_ERR(tzd);
>> +	if (IS_ERR(tzd)) {
>> +		if (PTR_ERR(tzd) != -ENODEV)
>> +			return PTR_ERR(tzd);
>> +		dev_info(dev, "temp%d_input not attached to any thermal zone\n",
>> +			 index + 1);
>> +		devm_kfree(dev, tdata);
>> +		return 0;
>> +	}
>>   
>>   	err = devm_add_action(dev, hwmon_thermal_remove_sensor, &tdata->node);
>>   	if (err)
> 
> LGTM, thank you. But still needs a t-b from Jon.
> 

Yes, I'll wait for a tested-by: tag before sending the patch upstream
to make sure that it actually fixes the problem.

Thanks,
Guenter

> Reviewed-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Jon Hunter Feb. 22, 2022, 12:55 p.m. UTC | #3
Hi Guenter,

On 21/02/2022 18:22, Guenter Roeck wrote:
> If an attempt is made to a sensor with a thermal zone and it fails,
> the call to devm_thermal_zone_of_sensor_register() may return -ENODEV.
> This may result in crashes similar to the following.
> 
> Unable to handle kernel NULL pointer dereference at virtual address 00000000000003cd
> ...
> Internal error: Oops: 96000021 [#1] PREEMPT SMP
> ...
> pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : mutex_lock+0x18/0x60
> lr : thermal_zone_device_update+0x40/0x2e0
> sp : ffff800014c4fc60
> x29: ffff800014c4fc60 x28: ffff365ee3f6e000 x27: ffffdde218426790
> x26: ffff365ee3f6e000 x25: 0000000000000000 x24: ffff365ee3f6e000
> x23: ffffdde218426870 x22: ffff365ee3f6e000 x21: 00000000000003cd
> x20: ffff365ee8bf3308 x19: ffffffffffffffed x18: 0000000000000000
> x17: ffffdde21842689c x16: ffffdde1cb7a0b7c x15: 0000000000000040
> x14: ffffdde21a4889a0 x13: 0000000000000228 x12: 0000000000000000
> x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
> x8 : 0000000001120000 x7 : 0000000000000001 x6 : 0000000000000000
> x5 : 0068000878e20f07 x4 : 0000000000000000 x3 : 00000000000003cd
> x2 : ffff365ee3f6e000 x1 : 0000000000000000 x0 : 00000000000003cd
> Call trace:
>   mutex_lock+0x18/0x60
>   hwmon_notify_event+0xfc/0x110
>   0xffffdde1cb7a0a90
>   0xffffdde1cb7a0b7c
>   irq_thread_fn+0x2c/0xa0
>   irq_thread+0x134/0x240
>   kthread+0x178/0x190
>   ret_from_fork+0x10/0x20
> Code: d503201f d503201f d2800001 aa0103e4 (c8e47c02)
> 
> Jon Hunter reports that the exact call sequence is:
> 
> hwmon_notify_event()
>    --> hwmon_thermal_notify()
>      --> thermal_zone_device_update()
>        --> update_temperature()
>          --> mutex_lock()
> 
> The hwmon core needs to handle all errors returned from calls
> to devm_thermal_zone_of_sensor_register(). If the call fails
> with -ENODEV, report that the sensor was not attached to a
> thermal zone  but continue to register the hwmon device.
> 
> Reported-by: Jon Hunter <jonathanh@nvidia.com>
> Cc: Dmitry Osipenko <digetx@gmail.com>
> Fixes: 1597b374af222 ("hwmon: Add notification support")
> Signed-off-by: Guenter Roeck <linux@roeck-us.net>
> ---
> v2: Use dev_info instead of dev_warn, and change message to be
>      less alarming.
> 
>   drivers/hwmon/hwmon.c | 14 ++++++++------
>   1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
> index 3501a3ead4ba..3ae961986fc3 100644
> --- a/drivers/hwmon/hwmon.c
> +++ b/drivers/hwmon/hwmon.c
> @@ -214,12 +214,14 @@ static int hwmon_thermal_add_sensor(struct device *dev, int index)
>   
>   	tzd = devm_thermal_zone_of_sensor_register(dev, index, tdata,
>   						   &hwmon_thermal_ops);
> -	/*
> -	 * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
> -	 * so ignore that error but forward any other error.
> -	 */
> -	if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
> -		return PTR_ERR(tzd);
> +	if (IS_ERR(tzd)) {
> +		if (PTR_ERR(tzd) != -ENODEV)
> +			return PTR_ERR(tzd);
> +		dev_info(dev, "temp%d_input not attached to any thermal zone\n",
> +			 index + 1);
> +		devm_kfree(dev, tdata);
> +		return 0;
> +	}
>   
>   	err = devm_add_action(dev, hwmon_thermal_remove_sensor, &tdata->node);
>   	if (err)


Thanks for the quick fix. Works for me!

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Jon
diff mbox series

Patch

diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
index 3501a3ead4ba..3ae961986fc3 100644
--- a/drivers/hwmon/hwmon.c
+++ b/drivers/hwmon/hwmon.c
@@ -214,12 +214,14 @@  static int hwmon_thermal_add_sensor(struct device *dev, int index)
 
 	tzd = devm_thermal_zone_of_sensor_register(dev, index, tdata,
 						   &hwmon_thermal_ops);
-	/*
-	 * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
-	 * so ignore that error but forward any other error.
-	 */
-	if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
-		return PTR_ERR(tzd);
+	if (IS_ERR(tzd)) {
+		if (PTR_ERR(tzd) != -ENODEV)
+			return PTR_ERR(tzd);
+		dev_info(dev, "temp%d_input not attached to any thermal zone\n",
+			 index + 1);
+		devm_kfree(dev, tdata);
+		return 0;
+	}
 
 	err = devm_add_action(dev, hwmon_thermal_remove_sensor, &tdata->node);
 	if (err)