Message ID | 20210313083256.68158-1-wangyugui@e16-tech.com (mailing list archive) |
---|---|
State | Rejected |
Headers | show |
Series | nvme-hwmon: drop not common HWMON_T_MIN and HWMON_T_MAX | expand |
On 3/13/21 12:32 AM, wangyugui wrote: > HWMON_T_MIN is not common in NVMe SSD, so drop all of them in nvme-hwmon. > > HWMON_T_MAX is only common in NVMe SSD Composite sensor, so drop them in other sensors. > > Before this patch(SSD: PM1733): > #sensors > nvme-pci-4300 > Adapter: PCI adapter > Composite: +49.9°C (low = -273.1°C, high = +71.8°C) > (crit = +84.8°C) > Sensor 1: +47.9°C (low = -273.1°C, high = +65261.8°C) > ERROR: Can't get value of subfeature temp3_min: I/O error > ERROR: Can't get value of subfeature temp3_max: I/O error > Sensor 2: +49.9°C (low = +0.0°C, high = +0.0°C) > > # cat /sys/class/nvme/nvme0/hwmon1/temp3_min > cat: /sys/class/nvme/nvme0/hwmon1/temp3_min: Input/output error > # cat /sys/class/nvme/nvme0/hwmon1/temp3_max > cat: /sys/class/nvme/nvme0/hwmon1/temp3_max: Input/output error > > After this patch(SSD: PM1733): > #sensors > nvme-pci-4300 > Adapter: PCI adapter > Composite: +48.9°C (high = +71.8°C, crit = +84.8°C) > Sensor 1: +46.9°C > Sensor 2: +48.9°C > Signed-off-by: missing. Either case, no. On one of my NVMEs, after setting the limits: nvme-pci-0100 Adapter: PCI adapter Composite: +29.9°C (low = -0.1°C, high = +76.8°C) (crit = +78.8°C) Sensor 1: +29.9°C (low = -0.1°C, high = +254.8°C) Sensor 2: +37.9°C (low = -0.1°C, high = +254.8°C) That it doesn't work on yours doesn't mean it needs to be disabled for all other NVMEs. Instead, we'll need to figure out how to correctly identify that limits for the second subfeature sensor are not supported, or what exactly the NVME complains about when trying to read the limits for the second subsensor. One possible solution might be to call nvme_get_temp_thresh() from nvme_hwmon_is_visible() and return 0 if the call returns an error. Also, the low/high limit readings on the composite sensor and on Sensor 1 only mean that limits are not configured. That is not a reason to disable the limit attributes entirely. One could instead write useful limits into those attributes. Guenter > --- > drivers/nvme/host/hwmon.c | 18 +++++++++--------- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c > index 552dbc0..a93612f 100644 > --- a/drivers/nvme/host/hwmon.c > +++ b/drivers/nvme/host/hwmon.c > @@ -188,23 +188,23 @@ static umode_t nvme_hwmon_is_visible(const void *_data, > static const struct hwmon_channel_info *nvme_hwmon_info[] = { > HWMON_CHANNEL_INFO(chip, HWMON_C_REGISTER_TZ), > HWMON_CHANNEL_INFO(temp, > - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_INPUT | HWMON_T_MAX | > HWMON_T_CRIT | HWMON_T_LABEL | HWMON_T_ALARM, > - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_INPUT | > HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_INPUT | > HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_INPUT | > HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_INPUT | > HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_INPUT | > HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_INPUT | > HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_INPUT | > HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_INPUT | > HWMON_T_LABEL), > NULL > }; >
Hi, > On 3/13/21 12:32 AM, wangyugui wrote: > > HWMON_T_MIN is not common in NVMe SSD, so drop all of them in nvme-hwmon. > > > > HWMON_T_MAX is only common in NVMe SSD Composite sensor, so drop them in other sensors. > > > > Before this patch(SSD: PM1733): > > #sensors > > nvme-pci-4300 > > Adapter: PCI adapter > > Composite: +49.9°C (low = -273.1°C, high = +71.8°C) > > (crit = +84.8°C) > > Sensor 1: +47.9°C (low = -273.1°C, high = +65261.8°C) > > ERROR: Can't get value of subfeature temp3_min: I/O error > > ERROR: Can't get value of subfeature temp3_max: I/O error > > Sensor 2: +49.9°C (low = +0.0°C, high = +0.0°C) > > > > # cat /sys/class/nvme/nvme0/hwmon1/temp3_min > > cat: /sys/class/nvme/nvme0/hwmon1/temp3_min: Input/output error > > # cat /sys/class/nvme/nvme0/hwmon1/temp3_max > > cat: /sys/class/nvme/nvme0/hwmon1/temp3_max: Input/output error > > > > After this patch(SSD: PM1733): > > #sensors > > nvme-pci-4300 > > Adapter: PCI adapter > > Composite: +48.9°C (high = +71.8°C, crit = +84.8°C) > > Sensor 1: +46.9°C > > Sensor 2: +48.9°C > > > > Signed-off-by: missing. > > Either case, no. > > On one of my NVMEs, after setting the limits: > > nvme-pci-0100 > Adapter: PCI adapter > Composite: +29.9°C (low = -0.1°C, high = +76.8°C) > (crit = +78.8°C) > Sensor 1: +29.9°C (low = -0.1°C, high = +254.8°C) > Sensor 2: +37.9°C (low = -0.1°C, high = +254.8°C) high = +254.8°C is a real value or unconfigured value ? Best Regards Wang Yugui (wangyugui@e16-tech.com) 2021/03/14 > That it doesn't work on yours doesn't mean it needs to be disabled for > all other NVMEs. Instead, we'll need to figure out how to correctly > identify that limits for the second subfeature sensor are not supported, > or what exactly the NVME complains about when trying to read the limits > for the second subsensor. One possible solution might be to call > nvme_get_temp_thresh() from nvme_hwmon_is_visible() and return 0 if the > call returns an error. > > Also, the low/high limit readings on the composite sensor and on > Sensor 1 only mean that limits are not configured. That is not a reason > to disable the limit attributes entirely. One could instead write useful > limits into those attributes.
On 3/13/21 2:44 PM, Wang Yugui wrote: > Hi, > >> On 3/13/21 12:32 AM, wangyugui wrote: >>> HWMON_T_MIN is not common in NVMe SSD, so drop all of them in nvme-hwmon. >>> >>> HWMON_T_MAX is only common in NVMe SSD Composite sensor, so drop them in other sensors. >>> >>> Before this patch(SSD: PM1733): >>> #sensors >>> nvme-pci-4300 >>> Adapter: PCI adapter >>> Composite: +49.9°C (low = -273.1°C, high = +71.8°C) >>> (crit = +84.8°C) >>> Sensor 1: +47.9°C (low = -273.1°C, high = +65261.8°C) >>> ERROR: Can't get value of subfeature temp3_min: I/O error >>> ERROR: Can't get value of subfeature temp3_max: I/O error >>> Sensor 2: +49.9°C (low = +0.0°C, high = +0.0°C) >>> >>> # cat /sys/class/nvme/nvme0/hwmon1/temp3_min >>> cat: /sys/class/nvme/nvme0/hwmon1/temp3_min: Input/output error >>> # cat /sys/class/nvme/nvme0/hwmon1/temp3_max >>> cat: /sys/class/nvme/nvme0/hwmon1/temp3_max: Input/output error >>> >>> After this patch(SSD: PM1733): >>> #sensors >>> nvme-pci-4300 >>> Adapter: PCI adapter >>> Composite: +48.9°C (high = +71.8°C, crit = +84.8°C) >>> Sensor 1: +46.9°C >>> Sensor 2: +48.9°C >>> >> >> Signed-off-by: missing. >> >> Either case, no. >> >> On one of my NVMEs, after setting the limits: >> >> nvme-pci-0100 >> Adapter: PCI adapter >> Composite: +29.9°C (low = -0.1°C, high = +76.8°C) >> (crit = +78.8°C) >> Sensor 1: +29.9°C (low = -0.1°C, high = +254.8°C) >> Sensor 2: +37.9°C (low = -0.1°C, high = +254.8°C) > > high = +254.8°C is a real value or unconfigured value ? > This was a configured value, as mentioned above. Same system and NVME, with differently configured limit values: nvme-pci-0100 Adapter: PCI adapter Composite: +31.9°C (low = -0.1°C, high = +76.8°C) (crit = +78.8°C) Sensor 1: +31.9°C (low = -10.2°C, high = +126.8°C) Sensor 2: +49.9°C (low = +4.8°C, high = +89.8°C) Default values, with a different NVME, on a different system: nvme-pci-0100 Adapter: PCI adapter Composite: +38.9°C (low = -273.1°C, high = +84.8°C) (crit = +84.8°C) Sensor 1: +38.9°C (low = -273.1°C, high = +65261.8°C) Sensor 2: +42.9°C (low = -273.1°C, high = +65261.8°C) Same NVME as the first one, in a different system, with default values: nvme-pci-2500 Adapter: PCI adapter Composite: +38.9°C (low = -273.1°C, high = +76.8°C) (crit = +78.8°C) Sensor 1: +38.9°C (low = -273.1°C, high = +65261.8°C) Sensor 2: +44.9°C (low = -273.1°C, high = +65261.8°C) Guenter
Hi, > On 3/13/21 2:44 PM, Wang Yugui wrote: > > Hi, > > > >> On 3/13/21 12:32 AM, wangyugui wrote: > >>> HWMON_T_MIN is not common in NVMe SSD, so drop all of them in nvme-hwmon. > >>> > >>> HWMON_T_MAX is only common in NVMe SSD Composite sensor, so drop them in other sensors. > >>> > >>> Before this patch(SSD: PM1733): > >>> #sensors > >>> nvme-pci-4300 > >>> Adapter: PCI adapter > >>> Composite: +49.9°C (low = -273.1°C, high = +71.8°C) > >>> (crit = +84.8°C) > >>> Sensor 1: +47.9°C (low = -273.1°C, high = +65261.8°C) > >>> ERROR: Can't get value of subfeature temp3_min: I/O error > >>> ERROR: Can't get value of subfeature temp3_max: I/O error > >>> Sensor 2: +49.9°C (low = +0.0°C, high = +0.0°C) > >>> > >>> # cat /sys/class/nvme/nvme0/hwmon1/temp3_min > >>> cat: /sys/class/nvme/nvme0/hwmon1/temp3_min: Input/output error > >>> # cat /sys/class/nvme/nvme0/hwmon1/temp3_max > >>> cat: /sys/class/nvme/nvme0/hwmon1/temp3_max: Input/output error > >>> > >>> After this patch(SSD: PM1733): > >>> #sensors > >>> nvme-pci-4300 > >>> Adapter: PCI adapter > >>> Composite: +48.9°C (high = +71.8°C, crit = +84.8°C) > >>> Sensor 1: +46.9°C > >>> Sensor 2: +48.9°C > >>> > >> > >> Signed-off-by: missing. > >> > >> Either case, no. > >> > >> On one of my NVMEs, after setting the limits: > >> > >> nvme-pci-0100 > >> Adapter: PCI adapter > >> Composite: +29.9°C (low = -0.1°C, high = +76.8°C) > >> (crit = +78.8°C) > >> Sensor 1: +29.9°C (low = -0.1°C, high = +254.8°C) > >> Sensor 2: +37.9°C (low = -0.1°C, high = +254.8°C) > > > > high = +254.8°C is a real value or unconfigured value ? > > > > This was a configured value, as mentioned above. Same system and NVME, > with differently configured limit values: > > nvme-pci-0100 > Adapter: PCI adapter > Composite: +31.9°C (low = -0.1°C, high = +76.8°C) > (crit = +78.8°C) > Sensor 1: +31.9°C (low = -10.2°C, high = +126.8°C) > Sensor 2: +49.9°C (low = +4.8°C, high = +89.8°C) > > Default values, with a different NVME, on a different system: > > nvme-pci-0100 > Adapter: PCI adapter > Composite: +38.9°C (low = -273.1°C, high = +84.8°C) > (crit = +84.8°C) > Sensor 1: +38.9°C (low = -273.1°C, high = +65261.8°C) > Sensor 2: +42.9°C (low = -273.1°C, high = +65261.8°C) > > Same NVME as the first one, in a different system, with default > values: > > nvme-pci-2500 > Adapter: PCI adapter > Composite: +38.9°C (low = -273.1°C, high = +76.8°C) > (crit = +78.8°C) > Sensor 1: +38.9°C (low = -273.1°C, high = +65261.8°C) > Sensor 2: +44.9°C (low = -273.1°C, high = +65261.8°C) > > Guenter Most of the users do not like something like 'low = -273.1°C' or 'high = +65261.8°C' that just seem noise, or software bug, or hardware error, and that will cause unnecessary worry. We can support more HWMON_T_MIN and HWMON_T_MAX for advanced user iff a valid configure value in the futher, and without any noise. Best Regards Wang Yugui (wangyugui@e16-tech.com) 2021/03/14
On 3/14/21 8:51 AM, Wang Yugui wrote: [ ...] > > Most of the users do not like something like 'low = -273.1°C' or > 'high = +65261.8°C' that just seem noise, or software bug, or hardware > error, and that will cause unnecessary worry. > > We can support more HWMON_T_MIN and HWMON_T_MAX for advanced > user iff a valid configure value in the futher, and without any noise. > This is not a valid argument. This is similar for almost all sensors. You are effectively arguing that all limits for all sensors should be disabled unless a default is configured by the hardware. For example, looking at the Super-IO chip output in one of my systems, I see: nct6797-isa-0a20 Adapter: ISA adapter in0: +1.42 V (min = +0.00 V, max = +1.74 V) in1: +1.02 V (min = +0.00 V, max = +0.00 V) ALARM in2: +3.39 V (min = +0.00 V, max = +0.00 V) ALARM in3: +3.31 V (min = +0.00 V, max = +0.00 V) ALARM in4: +0.99 V (min = +0.00 V, max = +0.00 V) ALARM in5: +0.15 V (min = +0.00 V, max = +0.00 V) ALARM in6: +0.55 V (min = +0.00 V, max = +0.00 V) ALARM in7: +3.39 V (min = +0.00 V, max = +0.00 V) ALARM in8: +3.28 V (min = +0.00 V, max = +0.00 V) ALARM in9: +1.82 V (min = +0.00 V, max = +0.00 V) ALARM in10: +0.00 V (min = +0.00 V, max = +0.00 V) in11: +0.66 V (min = +0.00 V, max = +0.00 V) ALARM in12: +1.21 V (min = +0.00 V, max = +0.00 V) ALARM in13: +0.68 V (min = +0.00 V, max = +0.00 V) ALARM in14: +1.50 V (min = +0.00 V, max = +0.00 V) ALARM Following your line of argument, we should disable reporting (and setting) minimum and maximum values for pretty much all hardware monitoring drivers because the hardware (or the BIOS) doesn't set default values. This simply does not make any sense at all. If you don't like the default limit output, I would suggest to either configure limits or to disable the driver in your system. Depriving all other users from setting limits (and getting alarms if they are exceeded) just because you don't like that the hardware does not initialize those limits is an exceedingly bad idea. Thanks, Guenter
diff --git a/drivers/nvme/host/hwmon.c b/drivers/nvme/host/hwmon.c index 552dbc0..a93612f 100644 --- a/drivers/nvme/host/hwmon.c +++ b/drivers/nvme/host/hwmon.c @@ -188,23 +188,23 @@ static umode_t nvme_hwmon_is_visible(const void *_data, static const struct hwmon_channel_info *nvme_hwmon_info[] = { HWMON_CHANNEL_INFO(chip, HWMON_C_REGISTER_TZ), HWMON_CHANNEL_INFO(temp, - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT | HWMON_T_LABEL | HWMON_T_ALARM, - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_INPUT | HWMON_T_LABEL), NULL };