diff mbox series

hwmon: (ntc_thermistor) return error instead of clipping on OOB

Message ID 20250304-ntc_oob-v1-1-600d8992478d@gocontroll.com (mailing list archive)
State Changes Requested
Headers show
Series hwmon: (ntc_thermistor) return error instead of clipping on OOB | expand

Commit Message

Maud Spierings via B4 Relay March 4, 2025, 2:10 p.m. UTC
From: Maud Spierings <maudspierings@gocontroll.com>

When the ntc is reading Out Of Bounds instead of clipping to the nearest
limit (min/max) return -ENODATA. This prevents malfunctioning sensors
from sending a device into a shutdown loop due to a critical trip.

This implementation will only work for ntc type thermistors if a ptc
type is to be implemented the min/max ohm calculation must be adjusted
to take that into account.

Signed-off-by: Maud Spierings <maudspierings@gocontroll.com>
---
This patch is a continuation of another discussion [1]. I felt like it
should be a new patch, not a v2 as this is a very different change.

I have left the clamping of n to INT_MAX in the code with a comment, but
it may be possible to find a better solution to it. One I thought of is
to make the ohm field of the ntc_compensation struct a signed int as
well. It would get rid of this weird edge case, but it doesn't make
sense to allow for negative resistances to be entered into the sensor
table.

Currently the only feedback this provides to the user is when they
manually try to read the temperature and it returns the error. I have
added a simple printk to these error points to see how spammy it gets
and this is the result:

dmesg | grep hwmon
[    4.982682] hwmon: sensor out of bounds
[    5.249758] hwmon: sensor out of bounds
[    5.633729] hwmon: sensor out of bounds
[    6.215285] hwmon: sensor out of bounds
[    7.073882] hwmon: sensor out of bounds
[    7.486620] hwmon: sensor out of bounds
[    8.833765] hwmon: sensor out of bounds
[   10.785969] hwmon: sensor out of bounds
[   13.793722] hwmon: sensor out of bounds
[   16.761124] hwmon: sensor out of bounds
[   17.889706] hwmon: sensor out of bounds
[   25.057715] hwmon: sensor out of bounds
[   35.041725] hwmon: sensor out of bounds
[   50.110346] hwmon: sensor out of bounds
[   72.945283] hwmon: sensor out of bounds
[  105.712619] hwmon: sensor out of bounds
[  154.863976] hwmon: sensor out of bounds
[  164.937104] hwmon: sensor out of bounds
[  228.590909] hwmon: sensor out of bounds
[  315.365777] hwmon: sensor out of bounds
[  464.718403] hwmon: sensor out of bounds
[  615.079123] hwmon: sensor out of bounds
[  764.496780] hwmon: sensor out of bounds

This is with polling-delay set to 1000, it seems to rate-limit itself?
But I feel there should be a better way to communicate the potential
sensor failure to the user, but I can't think of anything.

[1]: https://lore.kernel.org/all/20250304-ntc_min_max-v1-1-b08e70e56459@gocontroll.com/
---
 drivers/hwmon/ntc_thermistor.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)


---
base-commit: 20d5c66e1810e6e8805ec0d01373afb2dba9f51a
change-id: 20250304-ntc_oob-8224b5b2578f

Best regards,

Comments

Guenter Roeck March 4, 2025, 3:04 p.m. UTC | #1
On 3/4/25 06:10, Maud Spierings via B4 Relay wrote:
> From: Maud Spierings <maudspierings@gocontroll.com>
> 
> When the ntc is reading Out Of Bounds instead of clipping to the nearest
> limit (min/max) return -ENODATA. This prevents malfunctioning sensors
> from sending a device into a shutdown loop due to a critical trip.
> 
> This implementation will only work for ntc type thermistors if a ptc
> type is to be implemented the min/max ohm calculation must be adjusted
> to take that into account.
> 
> Signed-off-by: Maud Spierings <maudspierings@gocontroll.com>
> ---
> This patch is a continuation of another discussion [1]. I felt like it
> should be a new patch, not a v2 as this is a very different change.
> 
> I have left the clamping of n to INT_MAX in the code with a comment, but
> it may be possible to find a better solution to it. One I thought of is
> to make the ohm field of the ntc_compensation struct a signed int as
> well. It would get rid of this weird edge case, but it doesn't make
> sense to allow for negative resistances to be entered into the sensor
> table.
> 
> Currently the only feedback this provides to the user is when they
> manually try to read the temperature and it returns the error. I have
> added a simple printk to these error points to see how spammy it gets
> and this is the result:
> 
> dmesg | grep hwmon
> [    4.982682] hwmon: sensor out of bounds
> [    5.249758] hwmon: sensor out of bounds
> [    5.633729] hwmon: sensor out of bounds
> [    6.215285] hwmon: sensor out of bounds
> [    7.073882] hwmon: sensor out of bounds
> [    7.486620] hwmon: sensor out of bounds
> [    8.833765] hwmon: sensor out of bounds
> [   10.785969] hwmon: sensor out of bounds
> [   13.793722] hwmon: sensor out of bounds
> [   16.761124] hwmon: sensor out of bounds
> [   17.889706] hwmon: sensor out of bounds
> [   25.057715] hwmon: sensor out of bounds
> [   35.041725] hwmon: sensor out of bounds
> [   50.110346] hwmon: sensor out of bounds
> [   72.945283] hwmon: sensor out of bounds
> [  105.712619] hwmon: sensor out of bounds
> [  154.863976] hwmon: sensor out of bounds
> [  164.937104] hwmon: sensor out of bounds
> [  228.590909] hwmon: sensor out of bounds
> [  315.365777] hwmon: sensor out of bounds
> [  464.718403] hwmon: sensor out of bounds
> [  615.079123] hwmon: sensor out of bounds
> [  764.496780] hwmon: sensor out of bounds
> 
> This is with polling-delay set to 1000, it seems to rate-limit itself?
> But I feel there should be a better way to communicate the potential
> sensor failure to the user, but I can't think of anything.
> 

Hmm ... we could add "fault" attributes and report a sensor fault
if uv == 0 or uv >= puv, together with the -ENODATA temperature reading.
That could distinguish the fault from the "resistor value out of range" case.

> [1]: https://lore.kernel.org/all/20250304-ntc_min_max-v1-1-b08e70e56459@gocontroll.com/

Most of the above should be after "---" since it is irrelevant for the commit log.

> ---
>   drivers/hwmon/ntc_thermistor.c | 19 +++++++++++++------
>   1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/hwmon/ntc_thermistor.c b/drivers/hwmon/ntc_thermistor.c
> index 0d29c8f97ba7c2f264588b6309b91ca494012ad6..311a60498d409ea068a18590415b2d5b43e73eb1 100644
> --- a/drivers/hwmon/ntc_thermistor.c
> +++ b/drivers/hwmon/ntc_thermistor.c
> @@ -387,12 +387,9 @@ static int get_ohm_of_thermistor(struct ntc_data *data, unsigned int uv)
>   	puo = data->pullup_ohm;
>   	pdo = data->pulldown_ohm;
>   
> -	if (uv == 0)
> -		return (data->connect == NTC_CONNECTED_POSITIVE) ?
> -			INT_MAX : 0;
> -	if (uv >= puv)
> -		return (data->connect == NTC_CONNECTED_POSITIVE) ?
> -			0 : INT_MAX;
> +	/* faulty adc value */
> +	if (uv == 0 || uv >= puv)
> +		return -ENODATA;
>   
>   	if (data->connect == NTC_CONNECTED_POSITIVE && puo == 0)
>   		n = div_u64(pdo * (puv - uv), uv);
> @@ -404,6 +401,16 @@ static int get_ohm_of_thermistor(struct ntc_data *data, unsigned int uv)
>   	else
>   		n = div64_u64_safe(pdo * puo * uv, pdo * (puv - uv) - puo * uv);
>   
> +	/* sensor out of bounds */
> +	if (n > data->comp[0].ohm || n < data->comp[data->n_comp-1].ohm)
> +		return -ENODATA;
> +
> +	/*
> +	 * theoretically data->comp[0].ohm can be greater than INT_MAX as it is an
> +	 * unsigned integer, but it doesn't make any sense for it to be so as the
> +	 * maximum return value of this function is INT_MAX, so it will never be
> +	 * able to properly calculate that temperature.
> +	 */
>   	if (n > INT_MAX)
>   		n = INT_MAX;

I am not a friend of theoretic code or comments like this. Please drop.
The original code was intended to catch out-of-bounds values which would
otherwise have been reported as error, not to catch unrealistic resistor
values in the compensation tables.

Thanks,
Guenter
Maud Spierings | GOcontroll March 6, 2025, 1:35 p.m. UTC | #2
From: Guenter Roeck <groeck7@gmail.com> on behalf of Guenter Roeck <linux@roeck-us.net>
Sent: Tuesday, March 4, 2025 4:04 PM
 
>On 3/4/25 06:10, Maud Spierings via B4 Relay wrote:
>> From: Maud Spierings <maudspierings@gocontroll.com>
>>
>> When the ntc is reading Out Of Bounds instead of clipping to the nearest
>> limit (min/max) return -ENODATA. This prevents malfunctioning sensors
>> from sending a device into a shutdown loop due to a critical trip.
>>
>> This implementation will only work for ntc type thermistors if a ptc
>> type is to be implemented the min/max ohm calculation must be adjusted
>> to take that into account.
>>
>> Signed-off-by: Maud Spierings <maudspierings@gocontroll.com>
>> ---
>> This patch is a continuation of another discussion [1]. I felt like it
>> should be a new patch, not a v2 as this is a very different change.
thoughts about the theoretic code comment here.
>> I have left the clamping of n to INT_MAX in the code with a comment, but
>> it may be possible to find a better solution to it. One I thought of is
>> to make the ohm field of the ntc_compensation struct a signed int as
>> well. It would get rid of this weird edge case, but it doesn't make
>> sense to allow for negative resistances to be entered into the sensor
>> table.
>>
>> Currently the only feedback this provides to the user is when they
>> manually try to read the temperature and it returns the error. I have
>> added a simple printk to these error points to see how spammy it gets
>> and this is the result:
>>
>> dmesg | grep hwmon
>> [    4.982682] hwmon: sensor out of bounds
>> [    5.249758] hwmon: sensor out of bounds
>> [    5.633729] hwmon: sensor out of bounds
>> [    6.215285] hwmon: sensor out of bounds
>> [    7.073882] hwmon: sensor out of bounds
>> [    7.486620] hwmon: sensor out of bounds
>> [    8.833765] hwmon: sensor out of bounds
>> [   10.785969] hwmon: sensor out of bounds
>> [   13.793722] hwmon: sensor out of bounds
>> [   16.761124] hwmon: sensor out of bounds
>> [   17.889706] hwmon: sensor out of bounds
>> [   25.057715] hwmon: sensor out of bounds
>> [   35.041725] hwmon: sensor out of bounds
>> [   50.110346] hwmon: sensor out of bounds
>> [   72.945283] hwmon: sensor out of bounds
>> [  105.712619] hwmon: sensor out of bounds
>> [  154.863976] hwmon: sensor out of bounds
>> [  164.937104] hwmon: sensor out of bounds
>> [  228.590909] hwmon: sensor out of bounds
>> [  315.365777] hwmon: sensor out of bounds
>> [  464.718403] hwmon: sensor out of bounds
>> [  615.079123] hwmon: sensor out of bounds
>> [  764.496780] hwmon: sensor out of bounds
>>
>> This is with polling-delay set to 1000, it seems to rate-limit itself?
>> But I feel there should be a better way to communicate the potential
>> sensor failure to the user, but I can't think of anything.
>>
>
>Hmm ... we could add "fault" attributes and report a sensor fault
>if uv == 0 or uv >= puv, together with the -ENODATA temperature reading.
>That could distinguish the fault from the "resistor value out of range" case.

I've done some working around that fault attribute, but I can't seem to
find a satisfactory way to implement it. This fault also is not triggered
if anything is disconnected, maybe the 0 case but I can't even get that.

I think leaving it at this current solution is good for now.

>> [1]: https://lore.kernel.org/all/20250304-ntc_min_max-v1-1-b08e70e56459@gocontroll.com/
>
>Most of the above should be after "---" since it is irrelevant for the commit log.

I believe my cover letter seperated it correctly after my signed-off-by
tag.

>> ---
>>   drivers/hwmon/ntc_thermistor.c | 19 +++++++++++++------
>>   1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/hwmon/ntc_thermistor.c b/drivers/hwmon/ntc_thermistor.c
>> index 0d29c8f97ba7c2f264588b6309b91ca494012ad6..311a60498d409ea068a18590415b2d5b43e73eb1 100644
>> --- a/drivers/hwmon/ntc_thermistor.c
>> +++ b/drivers/hwmon/ntc_thermistor.c
>> @@ -387,12 +387,9 @@ static int get_ohm_of_thermistor(struct ntc_data *data, unsigned int uv)
>>        puo = data->pullup_ohm;
>>        pdo = data->pulldown_ohm;
>>  
>> -     if (uv == 0)
>> -             return (data->connect == NTC_CONNECTED_POSITIVE) ?
>> -                     INT_MAX : 0;
>> -     if (uv >= puv)
>> -             return (data->connect == NTC_CONNECTED_POSITIVE) ?
>> -                     0 : INT_MAX;
>> +     /* faulty adc value */
>> +     if (uv == 0 || uv >= puv)
>> +             return -ENODATA;
>>  
>>        if (data->connect == NTC_CONNECTED_POSITIVE && puo == 0)
>>                n = div_u64(pdo * (puv - uv), uv);
>> @@ -404,6 +401,16 @@ static int get_ohm_of_thermistor(struct ntc_data *data, unsigned int uv)
>>        else
>>                n = div64_u64_safe(pdo * puo * uv, pdo * (puv - uv) - puo * uv);
>>  
>> +     /* sensor out of bounds */
>> +     if (n > data->comp[0].ohm || n < data->comp[data->n_comp-1].ohm)
>> +             return -ENODATA;
>> +
>> +     /*
>> +      * theoretically data->comp[0].ohm can be greater than INT_MAX as it is an
>> +      * unsigned integer, but it doesn't make any sense for it to be so as the
>> +      * maximum return value of this function is INT_MAX, so it will never be
>> +      * able to properly calculate that temperature.
>> +      */
>>        if (n > INT_MAX)
>>                n = INT_MAX;
>
>I am not a friend of theoretic code or comments like this. Please drop.
>The original code was intended to catch out-of-bounds values which would
>otherwise have been reported as error, not to catch unrealistic resistor
>values in the compensation tables.

So, drop the check and comment? Or just drop the comment? I was thinking
to fully remove that check in an earlier comment in my cover letter.

Kind regards,
Maud
Guenter Roeck March 6, 2025, 8:52 p.m. UTC | #3
On Thu, Mar 06, 2025 at 01:35:53PM +0000, Maud Spierings | GOcontroll wrote:
> >
> >Most of the above should be after "---" since it is irrelevant for the commit log.
> 
> I believe my cover letter seperated it correctly after my signed-off-by
> tag.
> 
Yes, you are correct. I missed the second (or, rather, first) "---"
above. Sorry for the noise.

> >> +     /*
> >> +      * theoretically data->comp[0].ohm can be greater than INT_MAX as it is an
> >> +      * unsigned integer, but it doesn't make any sense for it to be so as the
> >> +      * maximum return value of this function is INT_MAX, so it will never be
> >> +      * able to properly calculate that temperature.
> >> +      */
> >>        if (n > INT_MAX)
> >>                n = INT_MAX;
> >
> >I am not a friend of theoretic code or comments like this. Please drop.
> >The original code was intended to catch out-of-bounds values which would
> >otherwise have been reported as error, not to catch unrealistic resistor
> >values in the compensation tables.
> 
> So, drop the check and comment? Or just drop the comment? I was thinking
> to fully remove that check in an earlier comment in my cover letter.

Drop both.

Thanks,
Guenter
diff mbox series

Patch

diff --git a/drivers/hwmon/ntc_thermistor.c b/drivers/hwmon/ntc_thermistor.c
index 0d29c8f97ba7c2f264588b6309b91ca494012ad6..311a60498d409ea068a18590415b2d5b43e73eb1 100644
--- a/drivers/hwmon/ntc_thermistor.c
+++ b/drivers/hwmon/ntc_thermistor.c
@@ -387,12 +387,9 @@  static int get_ohm_of_thermistor(struct ntc_data *data, unsigned int uv)
 	puo = data->pullup_ohm;
 	pdo = data->pulldown_ohm;
 
-	if (uv == 0)
-		return (data->connect == NTC_CONNECTED_POSITIVE) ?
-			INT_MAX : 0;
-	if (uv >= puv)
-		return (data->connect == NTC_CONNECTED_POSITIVE) ?
-			0 : INT_MAX;
+	/* faulty adc value */
+	if (uv == 0 || uv >= puv)
+		return -ENODATA;
 
 	if (data->connect == NTC_CONNECTED_POSITIVE && puo == 0)
 		n = div_u64(pdo * (puv - uv), uv);
@@ -404,6 +401,16 @@  static int get_ohm_of_thermistor(struct ntc_data *data, unsigned int uv)
 	else
 		n = div64_u64_safe(pdo * puo * uv, pdo * (puv - uv) - puo * uv);
 
+	/* sensor out of bounds */
+	if (n > data->comp[0].ohm || n < data->comp[data->n_comp-1].ohm)
+		return -ENODATA;
+
+	/*
+	 * theoretically data->comp[0].ohm can be greater than INT_MAX as it is an
+	 * unsigned integer, but it doesn't make any sense for it to be so as the
+	 * maximum return value of this function is INT_MAX, so it will never be
+	 * able to properly calculate that temperature.
+	 */
 	if (n > INT_MAX)
 		n = INT_MAX;
 	return n;