diff mbox series

hwmon: Handle failure to register sensor with thermal zone correctly

Message ID 20220221164434.4169560-1-linux@roeck-us.net (mailing list archive)
State Superseded
Headers show
Series hwmon: Handle failure to register sensor with thermal zone correctly | expand

Commit Message

Guenter Roeck Feb. 21, 2022, 4:44 p.m. UTC
If an attempt is made to a sensor with a thermal zone and it fails,
the call to devm_thermal_zone_of_sensor_register() may return -ENODEV.
This may result in crashes similar to the following.

Unable to handle kernel NULL pointer dereference at virtual address 00000000000003cd
...
Internal error: Oops: 96000021 [#1] PREEMPT SMP
...
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mutex_lock+0x18/0x60
lr : thermal_zone_device_update+0x40/0x2e0
sp : ffff800014c4fc60
x29: ffff800014c4fc60 x28: ffff365ee3f6e000 x27: ffffdde218426790
x26: ffff365ee3f6e000 x25: 0000000000000000 x24: ffff365ee3f6e000
x23: ffffdde218426870 x22: ffff365ee3f6e000 x21: 00000000000003cd
x20: ffff365ee8bf3308 x19: ffffffffffffffed x18: 0000000000000000
x17: ffffdde21842689c x16: ffffdde1cb7a0b7c x15: 0000000000000040
x14: ffffdde21a4889a0 x13: 0000000000000228 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
x8 : 0000000001120000 x7 : 0000000000000001 x6 : 0000000000000000
x5 : 0068000878e20f07 x4 : 0000000000000000 x3 : 00000000000003cd
x2 : ffff365ee3f6e000 x1 : 0000000000000000 x0 : 00000000000003cd
Call trace:
 mutex_lock+0x18/0x60
 hwmon_notify_event+0xfc/0x110
 0xffffdde1cb7a0a90
 0xffffdde1cb7a0b7c
 irq_thread_fn+0x2c/0xa0
 irq_thread+0x134/0x240
 kthread+0x178/0x190
 ret_from_fork+0x10/0x20
Code: d503201f d503201f d2800001 aa0103e4 (c8e47c02)

Jon Hunter reports that the exact call sequence is:

hwmon_notify_event()
  --> hwmon_thermal_notify()
    --> thermal_zone_device_update()
      --> update_temperature()
        --> mutex_lock()

The hwmon core needs to handle all errors returned from calls
to devm_thermal_zone_of_sensor_register(). If the call fails
with -ENODEV, report the problem as warning but continue to
register the hwmon device.

Reported-by: Jon Hunter <jonathanh@nvidia.com>
Cc: Dmitry Osipenko <digetx@gmail.com>
Fixes: 1597b374af222 ("hwmon: Add notification support")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
---
 drivers/hwmon/hwmon.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

Comments

Dmitry Osipenko Feb. 21, 2022, 4:56 p.m. UTC | #1
21.02.2022 19:44, Guenter Roeck пишет:
> diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
> index 3501a3ead4ba..4bfe3791a5ba 100644
> --- a/drivers/hwmon/hwmon.c
> +++ b/drivers/hwmon/hwmon.c
> @@ -214,12 +214,14 @@ static int hwmon_thermal_add_sensor(struct device *dev, int index)
>  
>  	tzd = devm_thermal_zone_of_sensor_register(dev, index, tdata,
>  						   &hwmon_thermal_ops);
> -	/*
> -	 * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
> -	 * so ignore that error but forward any other error.
> -	 */
> -	if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
> -		return PTR_ERR(tzd);
> +	if (IS_ERR(tzd)) {
> +		if (PTR_ERR(tzd) != -ENODEV)
> +			return PTR_ERR(tzd);
> +		dev_warn(dev, "Failed to register temp%d_input with thermal zone\n",
> +			 index + 1);

Do we really need this warning? I suppose it should be okay if sensor
isn't attached to any device in a device-tree and just reports temperature.
Guenter Roeck Feb. 21, 2022, 5:12 p.m. UTC | #2
On 2/21/22 08:56, Dmitry Osipenko wrote:
> 21.02.2022 19:44, Guenter Roeck пишет:
>> diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
>> index 3501a3ead4ba..4bfe3791a5ba 100644
>> --- a/drivers/hwmon/hwmon.c
>> +++ b/drivers/hwmon/hwmon.c
>> @@ -214,12 +214,14 @@ static int hwmon_thermal_add_sensor(struct device *dev, int index)
>>   
>>   	tzd = devm_thermal_zone_of_sensor_register(dev, index, tdata,
>>   						   &hwmon_thermal_ops);
>> -	/*
>> -	 * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
>> -	 * so ignore that error but forward any other error.
>> -	 */
>> -	if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
>> -		return PTR_ERR(tzd);
>> +	if (IS_ERR(tzd)) {
>> +		if (PTR_ERR(tzd) != -ENODEV)
>> +			return PTR_ERR(tzd);
>> +		dev_warn(dev, "Failed to register temp%d_input with thermal zone\n",
>> +			 index + 1);
> 
> Do we really need this warning? I suppose it should be okay if sensor
> isn't attached to any device in a device-tree and just reports temperature.

I'd rather leave it there for the time being. It will only affect devicetree
systems (turns out there is already a check for of_node elsewhere). Thermal
zone specification is not always easy and there may be a mismatch between
what is reported by the driver and what the user (programmer) expects to
see (which I think is what happens here). I don't want to silently
ignore such problems without any notification.

We could make it dev_notice and/or change the message (instead of "Failed to
..." just say "temp%d_input not registered with thermal zone" , maybe ?).

Guenter
Dmitry Osipenko Feb. 21, 2022, 5:59 p.m. UTC | #3
21.02.2022 20:12, Guenter Roeck пишет:
> On 2/21/22 08:56, Dmitry Osipenko wrote:
>> 21.02.2022 19:44, Guenter Roeck пишет:
>>> diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
>>> index 3501a3ead4ba..4bfe3791a5ba 100644
>>> --- a/drivers/hwmon/hwmon.c
>>> +++ b/drivers/hwmon/hwmon.c
>>> @@ -214,12 +214,14 @@ static int hwmon_thermal_add_sensor(struct
>>> device *dev, int index)
>>>         tzd = devm_thermal_zone_of_sensor_register(dev, index, tdata,
>>>                              &hwmon_thermal_ops);
>>> -    /*
>>> -     * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
>>> -     * so ignore that error but forward any other error.
>>> -     */
>>> -    if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
>>> -        return PTR_ERR(tzd);
>>> +    if (IS_ERR(tzd)) {
>>> +        if (PTR_ERR(tzd) != -ENODEV)
>>> +            return PTR_ERR(tzd);
>>> +        dev_warn(dev, "Failed to register temp%d_input with thermal
>>> zone\n",
>>> +             index + 1);
>>
>> Do we really need this warning? I suppose it should be okay if sensor
>> isn't attached to any device in a device-tree and just reports
>> temperature.
> 
> I'd rather leave it there for the time being. It will only affect
> devicetree
> systems (turns out there is already a check for of_node elsewhere). Thermal
> zone specification is not always easy and there may be a mismatch between
> what is reported by the driver and what the user (programmer) expects to
> see (which I think is what happens here). I don't want to silently
> ignore such problems without any notification.
> 
> We could make it dev_notice and/or change the message (instead of
> "Failed to
> ..." just say "temp%d_input not registered with thermal zone" , maybe ?).

I'd change it to:

dev_info(dev, "temp%d_input not attached to any thermal zone\n", index + 1);

I'd also add an info message to print out to which tzd attachment happened.
Guenter Roeck Feb. 21, 2022, 6:23 p.m. UTC | #4
On 2/21/22 09:59, Dmitry Osipenko wrote:
> 21.02.2022 20:12, Guenter Roeck пишет:
>> On 2/21/22 08:56, Dmitry Osipenko wrote:
>>> 21.02.2022 19:44, Guenter Roeck пишет:
>>>> diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
>>>> index 3501a3ead4ba..4bfe3791a5ba 100644
>>>> --- a/drivers/hwmon/hwmon.c
>>>> +++ b/drivers/hwmon/hwmon.c
>>>> @@ -214,12 +214,14 @@ static int hwmon_thermal_add_sensor(struct
>>>> device *dev, int index)
>>>>          tzd = devm_thermal_zone_of_sensor_register(dev, index, tdata,
>>>>                               &hwmon_thermal_ops);
>>>> -    /*
>>>> -     * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
>>>> -     * so ignore that error but forward any other error.
>>>> -     */
>>>> -    if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
>>>> -        return PTR_ERR(tzd);
>>>> +    if (IS_ERR(tzd)) {
>>>> +        if (PTR_ERR(tzd) != -ENODEV)
>>>> +            return PTR_ERR(tzd);
>>>> +        dev_warn(dev, "Failed to register temp%d_input with thermal
>>>> zone\n",
>>>> +             index + 1);
>>>
>>> Do we really need this warning? I suppose it should be okay if sensor
>>> isn't attached to any device in a device-tree and just reports
>>> temperature.
>>
>> I'd rather leave it there for the time being. It will only affect
>> devicetree
>> systems (turns out there is already a check for of_node elsewhere). Thermal
>> zone specification is not always easy and there may be a mismatch between
>> what is reported by the driver and what the user (programmer) expects to
>> see (which I think is what happens here). I don't want to silently
>> ignore such problems without any notification.
>>
>> We could make it dev_notice and/or change the message (instead of
>> "Failed to
>> ..." just say "temp%d_input not registered with thermal zone" , maybe ?).
> 
> I'd change it to:
> 
> dev_info(dev, "temp%d_input not attached to any thermal zone\n", index + 1);
> 

Makes sense. I just sent v2 with that change.

> I'd also add an info message to print out to which tzd attachment happened.

Oh, I don't know. Kernel logs are already way too noisy. Either case that would
not be a bug fix, so such a change should not be part of this patch.

Thanks,
Guenter
diff mbox series

Patch

diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
index 3501a3ead4ba..4bfe3791a5ba 100644
--- a/drivers/hwmon/hwmon.c
+++ b/drivers/hwmon/hwmon.c
@@ -214,12 +214,14 @@  static int hwmon_thermal_add_sensor(struct device *dev, int index)
 
 	tzd = devm_thermal_zone_of_sensor_register(dev, index, tdata,
 						   &hwmon_thermal_ops);
-	/*
-	 * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
-	 * so ignore that error but forward any other error.
-	 */
-	if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
-		return PTR_ERR(tzd);
+	if (IS_ERR(tzd)) {
+		if (PTR_ERR(tzd) != -ENODEV)
+			return PTR_ERR(tzd);
+		dev_warn(dev, "Failed to register temp%d_input with thermal zone\n",
+			 index + 1);
+		devm_kfree(dev, tdata);
+		return 0;
+	}
 
 	err = devm_add_action(dev, hwmon_thermal_remove_sensor, &tdata->node);
 	if (err)