diff mbox series

nvme-hwmon: drop not common HWMON_T_MIN and HWMON_T_MAX

Message ID 20210313083256.68158-1-wangyugui@e16-tech.com (mailing list archive)
State Rejected
Headers show
Series nvme-hwmon: drop not common HWMON_T_MIN and HWMON_T_MAX | expand

Commit Message

Wang Yugui March 13, 2021, 8:32 a.m. UTC
HWMON_T_MIN is not common in NVMe SSD, so drop all of them in nvme-hwmon.

HWMON_T_MAX is only common in NVMe SSD Composite sensor, so drop them in other sensors.

Before this patch(SSD: PM1733):
#sensors
nvme-pci-4300
Adapter: PCI adapter
Composite:    +49.9°C  (low  = -273.1°C, high = +71.8°C)
                       (crit = +84.8°C)
Sensor 1:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
ERROR: Can't get value of subfeature temp3_min: I/O error
ERROR: Can't get value of subfeature temp3_max: I/O error
Sensor 2:     +49.9°C  (low  =  +0.0°C, high =  +0.0°C)

# cat /sys/class/nvme/nvme0/hwmon1/temp3_min
cat: /sys/class/nvme/nvme0/hwmon1/temp3_min: Input/output error
# cat /sys/class/nvme/nvme0/hwmon1/temp3_max
cat: /sys/class/nvme/nvme0/hwmon1/temp3_max: Input/output error

After this patch(SSD: PM1733):
#sensors
nvme-pci-4300
Adapter: PCI adapter
Composite:    +48.9°C  (high = +71.8°C, crit = +84.8°C)
Sensor 1:     +46.9°C
Sensor 2:     +48.9°C

---
 drivers/nvme/host/hwmon.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

Comments

Guenter Roeck March 13, 2021, 3:32 p.m. UTC | #1
On 3/13/21 12:32 AM, wangyugui wrote:
> HWMON_T_MIN is not common in NVMe SSD, so drop all of them in nvme-hwmon.
> 
> HWMON_T_MAX is only common in NVMe SSD Composite sensor, so drop them in other sensors.
> 
> Before this patch(SSD: PM1733):
> #sensors
> nvme-pci-4300
> Adapter: PCI adapter
> Composite:    +49.9°C  (low  = -273.1°C, high = +71.8°C)
>                        (crit = +84.8°C)
> Sensor 1:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
> ERROR: Can't get value of subfeature temp3_min: I/O error
> ERROR: Can't get value of subfeature temp3_max: I/O error
> Sensor 2:     +49.9°C  (low  =  +0.0°C, high =  +0.0°C)
> 
> # cat /sys/class/nvme/nvme0/hwmon1/temp3_min
> cat: /sys/class/nvme/nvme0/hwmon1/temp3_min: Input/output error
> # cat /sys/class/nvme/nvme0/hwmon1/temp3_max
> cat: /sys/class/nvme/nvme0/hwmon1/temp3_max: Input/output error
> 
> After this patch(SSD: PM1733):
> #sensors
> nvme-pci-4300
> Adapter: PCI adapter
> Composite:    +48.9°C  (high = +71.8°C, crit = +84.8°C)
> Sensor 1:     +46.9°C
> Sensor 2:     +48.9°C
> 

Signed-off-by: missing.

Either case, no.

On one of my NVMEs, after setting the limits:

nvme-pci-0100
Adapter: PCI adapter
Composite:    +29.9°C  (low  =  -0.1°C, high = +76.8°C)
                       (crit = +78.8°C)
Sensor 1:     +29.9°C  (low  =  -0.1°C, high = +254.8°C)
Sensor 2:     +37.9°C  (low  =  -0.1°C, high = +254.8°C)

That it doesn't work on yours doesn't mean it needs to be disabled for
all other NVMEs. Instead, we'll need to figure out how to correctly
identify that limits for the second subfeature sensor are not supported,
or what exactly the NVME complains about when trying to read the limits
for the second subsensor. One possible solution might be to call
nvme_get_temp_thresh() from nvme_hwmon_is_visible() and return 0 if the
call returns an error.

Also, the low/high limit readings on the composite sensor and on
Sensor 1 only mean that limits are not configured. That is not a reason
to disable the limit attributes entirely. One could instead write useful
limits into those attributes.

Guenter

> ---
>  drivers/nvme/host/hwmon.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c
> index 552dbc0..a93612f 100644
> --- a/drivers/nvme/host/hwmon.c
> +++ b/drivers/nvme/host/hwmon.c
> @@ -188,23 +188,23 @@ static umode_t nvme_hwmon_is_visible(const void *_data,
>  static const struct hwmon_channel_info *nvme_hwmon_info[] = {
>  	HWMON_CHANNEL_INFO(chip, HWMON_C_REGISTER_TZ),
>  	HWMON_CHANNEL_INFO(temp,
> -			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
> +			   HWMON_T_INPUT | HWMON_T_MAX |
>  				HWMON_T_CRIT | HWMON_T_LABEL | HWMON_T_ALARM,
> -			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
> +			   HWMON_T_INPUT |
>  				HWMON_T_LABEL,
> -			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
> +			   HWMON_T_INPUT |
>  				HWMON_T_LABEL,
> -			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
> +			   HWMON_T_INPUT |
>  				HWMON_T_LABEL,
> -			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
> +			   HWMON_T_INPUT |
>  				HWMON_T_LABEL,
> -			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
> +			   HWMON_T_INPUT |
>  				HWMON_T_LABEL,
> -			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
> +			   HWMON_T_INPUT |
>  				HWMON_T_LABEL,
> -			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
> +			   HWMON_T_INPUT |
>  				HWMON_T_LABEL,
> -			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
> +			   HWMON_T_INPUT |
>  				HWMON_T_LABEL),
>  	NULL
>  };
>
Wang Yugui March 13, 2021, 10:44 p.m. UTC | #2
Hi,

> On 3/13/21 12:32 AM, wangyugui wrote:
> > HWMON_T_MIN is not common in NVMe SSD, so drop all of them in nvme-hwmon.
> > 
> > HWMON_T_MAX is only common in NVMe SSD Composite sensor, so drop them in other sensors.
> > 
> > Before this patch(SSD: PM1733):
> > #sensors
> > nvme-pci-4300
> > Adapter: PCI adapter
> > Composite:    +49.9°C  (low  = -273.1°C, high = +71.8°C)
> >                        (crit = +84.8°C)
> > Sensor 1:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
> > ERROR: Can't get value of subfeature temp3_min: I/O error
> > ERROR: Can't get value of subfeature temp3_max: I/O error
> > Sensor 2:     +49.9°C  (low  =  +0.0°C, high =  +0.0°C)
> > 
> > # cat /sys/class/nvme/nvme0/hwmon1/temp3_min
> > cat: /sys/class/nvme/nvme0/hwmon1/temp3_min: Input/output error
> > # cat /sys/class/nvme/nvme0/hwmon1/temp3_max
> > cat: /sys/class/nvme/nvme0/hwmon1/temp3_max: Input/output error
> > 
> > After this patch(SSD: PM1733):
> > #sensors
> > nvme-pci-4300
> > Adapter: PCI adapter
> > Composite:    +48.9°C  (high = +71.8°C, crit = +84.8°C)
> > Sensor 1:     +46.9°C
> > Sensor 2:     +48.9°C
> > 
> 
> Signed-off-by: missing.
> 
> Either case, no.
> 
> On one of my NVMEs, after setting the limits:
> 
> nvme-pci-0100
> Adapter: PCI adapter
> Composite:    +29.9°C  (low  =  -0.1°C, high = +76.8°C)
>                        (crit = +78.8°C)
> Sensor 1:     +29.9°C  (low  =  -0.1°C, high = +254.8°C)
> Sensor 2:     +37.9°C  (low  =  -0.1°C, high = +254.8°C)

high = +254.8°C is a real value or unconfigured value ?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/03/14

> That it doesn't work on yours doesn't mean it needs to be disabled for
> all other NVMEs. Instead, we'll need to figure out how to correctly
> identify that limits for the second subfeature sensor are not supported,
> or what exactly the NVME complains about when trying to read the limits
> for the second subsensor. One possible solution might be to call
> nvme_get_temp_thresh() from nvme_hwmon_is_visible() and return 0 if the
> call returns an error.
> 
> Also, the low/high limit readings on the composite sensor and on
> Sensor 1 only mean that limits are not configured. That is not a reason
> to disable the limit attributes entirely. One could instead write useful
> limits into those attributes.
Guenter Roeck March 14, 2021, 12:22 a.m. UTC | #3
On 3/13/21 2:44 PM, Wang Yugui wrote:
> Hi,
> 
>> On 3/13/21 12:32 AM, wangyugui wrote:
>>> HWMON_T_MIN is not common in NVMe SSD, so drop all of them in nvme-hwmon.
>>>
>>> HWMON_T_MAX is only common in NVMe SSD Composite sensor, so drop them in other sensors.
>>>
>>> Before this patch(SSD: PM1733):
>>> #sensors
>>> nvme-pci-4300
>>> Adapter: PCI adapter
>>> Composite:    +49.9°C  (low  = -273.1°C, high = +71.8°C)
>>>                        (crit = +84.8°C)
>>> Sensor 1:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
>>> ERROR: Can't get value of subfeature temp3_min: I/O error
>>> ERROR: Can't get value of subfeature temp3_max: I/O error
>>> Sensor 2:     +49.9°C  (low  =  +0.0°C, high =  +0.0°C)
>>>
>>> # cat /sys/class/nvme/nvme0/hwmon1/temp3_min
>>> cat: /sys/class/nvme/nvme0/hwmon1/temp3_min: Input/output error
>>> # cat /sys/class/nvme/nvme0/hwmon1/temp3_max
>>> cat: /sys/class/nvme/nvme0/hwmon1/temp3_max: Input/output error
>>>
>>> After this patch(SSD: PM1733):
>>> #sensors
>>> nvme-pci-4300
>>> Adapter: PCI adapter
>>> Composite:    +48.9°C  (high = +71.8°C, crit = +84.8°C)
>>> Sensor 1:     +46.9°C
>>> Sensor 2:     +48.9°C
>>>
>>
>> Signed-off-by: missing.
>>
>> Either case, no.
>>
>> On one of my NVMEs, after setting the limits:
>>
>> nvme-pci-0100
>> Adapter: PCI adapter
>> Composite:    +29.9°C  (low  =  -0.1°C, high = +76.8°C)
>>                        (crit = +78.8°C)
>> Sensor 1:     +29.9°C  (low  =  -0.1°C, high = +254.8°C)
>> Sensor 2:     +37.9°C  (low  =  -0.1°C, high = +254.8°C)
> 
> high = +254.8°C is a real value or unconfigured value ?
> 

This was a configured value, as mentioned above. Same system and NVME,
with differently configured limit values:

nvme-pci-0100
Adapter: PCI adapter
Composite:    +31.9°C  (low  =  -0.1°C, high = +76.8°C)
                       (crit = +78.8°C)
Sensor 1:     +31.9°C  (low  = -10.2°C, high = +126.8°C)
Sensor 2:     +49.9°C  (low  =  +4.8°C, high = +89.8°C)

Default values, with a different NVME, on a different system:

nvme-pci-0100
Adapter: PCI adapter
Composite:    +38.9°C  (low  = -273.1°C, high = +84.8°C)
                       (crit = +84.8°C)
Sensor 1:     +38.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +42.9°C  (low  = -273.1°C, high = +65261.8°C)

Same NVME as the first one, in a different system, with default
values:

nvme-pci-2500
Adapter: PCI adapter
Composite:    +38.9°C  (low  = -273.1°C, high = +76.8°C)
                       (crit = +78.8°C)
Sensor 1:     +38.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +44.9°C  (low  = -273.1°C, high = +65261.8°C)

Guenter
Wang Yugui March 14, 2021, 3:51 p.m. UTC | #4
Hi,

> On 3/13/21 2:44 PM, Wang Yugui wrote:
> > Hi,
> > 
> >> On 3/13/21 12:32 AM, wangyugui wrote:
> >>> HWMON_T_MIN is not common in NVMe SSD, so drop all of them in nvme-hwmon.
> >>>
> >>> HWMON_T_MAX is only common in NVMe SSD Composite sensor, so drop them in other sensors.
> >>>
> >>> Before this patch(SSD: PM1733):
> >>> #sensors
> >>> nvme-pci-4300
> >>> Adapter: PCI adapter
> >>> Composite:    +49.9°C  (low  = -273.1°C, high = +71.8°C)
> >>>                        (crit = +84.8°C)
> >>> Sensor 1:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
> >>> ERROR: Can't get value of subfeature temp3_min: I/O error
> >>> ERROR: Can't get value of subfeature temp3_max: I/O error
> >>> Sensor 2:     +49.9°C  (low  =  +0.0°C, high =  +0.0°C)
> >>>
> >>> # cat /sys/class/nvme/nvme0/hwmon1/temp3_min
> >>> cat: /sys/class/nvme/nvme0/hwmon1/temp3_min: Input/output error
> >>> # cat /sys/class/nvme/nvme0/hwmon1/temp3_max
> >>> cat: /sys/class/nvme/nvme0/hwmon1/temp3_max: Input/output error
> >>>
> >>> After this patch(SSD: PM1733):
> >>> #sensors
> >>> nvme-pci-4300
> >>> Adapter: PCI adapter
> >>> Composite:    +48.9°C  (high = +71.8°C, crit = +84.8°C)
> >>> Sensor 1:     +46.9°C
> >>> Sensor 2:     +48.9°C
> >>>
> >>
> >> Signed-off-by: missing.
> >>
> >> Either case, no.
> >>
> >> On one of my NVMEs, after setting the limits:
> >>
> >> nvme-pci-0100
> >> Adapter: PCI adapter
> >> Composite:    +29.9°C  (low  =  -0.1°C, high = +76.8°C)
> >>                        (crit = +78.8°C)
> >> Sensor 1:     +29.9°C  (low  =  -0.1°C, high = +254.8°C)
> >> Sensor 2:     +37.9°C  (low  =  -0.1°C, high = +254.8°C)
> > 
> > high = +254.8°C is a real value or unconfigured value ?
> > 
> 
> This was a configured value, as mentioned above. Same system and NVME,
> with differently configured limit values:
> 
> nvme-pci-0100
> Adapter: PCI adapter
> Composite:    +31.9°C  (low  =  -0.1°C, high = +76.8°C)
>                        (crit = +78.8°C)
> Sensor 1:     +31.9°C  (low  = -10.2°C, high = +126.8°C)
> Sensor 2:     +49.9°C  (low  =  +4.8°C, high = +89.8°C)
> 
> Default values, with a different NVME, on a different system:
> 
> nvme-pci-0100
> Adapter: PCI adapter
> Composite:    +38.9°C  (low  = -273.1°C, high = +84.8°C)
>                        (crit = +84.8°C)
> Sensor 1:     +38.9°C  (low  = -273.1°C, high = +65261.8°C)
> Sensor 2:     +42.9°C  (low  = -273.1°C, high = +65261.8°C)
> 
> Same NVME as the first one, in a different system, with default
> values:
> 
> nvme-pci-2500
> Adapter: PCI adapter
> Composite:    +38.9°C  (low  = -273.1°C, high = +76.8°C)
>                        (crit = +78.8°C)
> Sensor 1:     +38.9°C  (low  = -273.1°C, high = +65261.8°C)
> Sensor 2:     +44.9°C  (low  = -273.1°C, high = +65261.8°C)
> 
> Guenter

Most of the users do not like something like 'low  = -273.1°C' or 
'high = +65261.8°C' that just seem noise, or software bug, or hardware
error, and that will cause unnecessary worry.

We can support more HWMON_T_MIN and HWMON_T_MAX for advanced 
user iff a valid configure value in the futher, and without any noise.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/03/14
Guenter Roeck March 14, 2021, 6:04 p.m. UTC | #5
On 3/14/21 8:51 AM, Wang Yugui wrote:
[ ...]
> 
> Most of the users do not like something like 'low  = -273.1°C' or 
> 'high = +65261.8°C' that just seem noise, or software bug, or hardware
> error, and that will cause unnecessary worry.
> 
> We can support more HWMON_T_MIN and HWMON_T_MAX for advanced 
> user iff a valid configure value in the futher, and without any noise.
> 

This is not a valid argument. This is similar for almost all
sensors. You are effectively arguing that all limits for all sensors
should be disabled unless a default is configured by the hardware.
For example, looking at the Super-IO chip output in one of my systems,
I see:

nct6797-isa-0a20
Adapter: ISA adapter
in0:                    +1.42 V  (min =  +0.00 V, max =  +1.74 V)
in1:                    +1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in2:                    +3.39 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in3:                    +3.31 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in4:                    +0.99 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:                    +0.15 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:                    +0.55 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in7:                    +3.39 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in8:                    +3.28 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in9:                    +1.82 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in10:                   +0.00 V  (min =  +0.00 V, max =  +0.00 V)
in11:                   +0.66 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:                   +1.21 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:                   +0.68 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:                   +1.50 V  (min =  +0.00 V, max =  +0.00 V)  ALARM

Following your line of argument, we should disable reporting (and setting)
minimum and maximum values for pretty much all hardware monitoring drivers
because the hardware (or the BIOS) doesn't set default values. This simply
does not make any sense at all.

If you don't like the default limit output, I would suggest to either
configure limits or to disable the driver in your system. Depriving
all other users from setting limits (and getting alarms if they are
exceeded) just because you don't like that the hardware does not
initialize those limits is an exceedingly bad idea.

Thanks,
Guenter
diff mbox series

Patch

diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c
index 552dbc0..a93612f 100644
--- a/drivers/nvme/host/hwmon.c
+++ b/drivers/nvme/host/hwmon.c
@@ -188,23 +188,23 @@  static umode_t nvme_hwmon_is_visible(const void *_data,
 static const struct hwmon_channel_info *nvme_hwmon_info[] = {
 	HWMON_CHANNEL_INFO(chip, HWMON_C_REGISTER_TZ),
 	HWMON_CHANNEL_INFO(temp,
-			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
+			   HWMON_T_INPUT | HWMON_T_MAX |
 				HWMON_T_CRIT | HWMON_T_LABEL | HWMON_T_ALARM,
-			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
+			   HWMON_T_INPUT |
 				HWMON_T_LABEL,
-			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
+			   HWMON_T_INPUT |
 				HWMON_T_LABEL,
-			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
+			   HWMON_T_INPUT |
 				HWMON_T_LABEL,
-			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
+			   HWMON_T_INPUT |
 				HWMON_T_LABEL,
-			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
+			   HWMON_T_INPUT |
 				HWMON_T_LABEL,
-			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
+			   HWMON_T_INPUT |
 				HWMON_T_LABEL,
-			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
+			   HWMON_T_INPUT |
 				HWMON_T_LABEL,
-			   HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN |
+			   HWMON_T_INPUT |
 				HWMON_T_LABEL),
 	NULL
 };