Message ID | 1573395466-19526-1-git-send-email-akinobu.mita@gmail.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | nvme: hwmon: provide temperature min and max values for each sensor | expand |
On 11/10/19 6:17 AM, Akinobu Mita wrote: > According to the NVMe specification, the over temperature threshold and > under temperature threshold features shall be implemented for Composite > Temperature if a non-zero WCTEMP field value is reported in the Identify > Controller data structure. The features are also implemented for all > implemented temperature sensors (i.e., all Temperature Sensor fields that > report a non-zero value). > > This provides the over temperature threshold and under temperature > threshold for each sensor as temperature min and max values of hwmon > sysfs attributes. > > The WCTEMP is already provided as a temperature max value for Composite > Temperature, but this change isn't incompatible. Because the default > value of the over temperature threshold for Composite Temperature is > the WCTEMP. > > This also provides alarm attributes for each temperature sensor. But all > alarm conditions are same, because there is only a single bit in > Critical Warning field that indicates one of the temperature is outside of > a temperature threshold. > I think it would be more appropriate to report the alarm only for the composite temperature, reason being that we don't really know which individual sensor it is associated with. > Example output from the "sensors" command: > > nvme-pci-0100 > Adapter: PCI adapter > Composite: +53.0 C (low = -273.0 C, high = +70.0 C) > (crit = +80.0 C) > Sensor 1: +56.0 C (low = -273.0 C, high = +65262.0 C) > Sensor 2: +51.0 C (low = -273.0 C, high = +65262.0 C) > Sensor 5: +73.0 C (low = -273.0 C, high = +65262.0 C) > Have you tried writing the limits ? On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the Composite temperature sensor results in a temperature warning, and that warning is sticky until I reset the controller. I don't see that problem on Samsung SSD 970 EVO 500GB; I have not yet tried others. root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100 nvme-pci-0100 Adapter: PCI adapter Composite: +30.0°C (low = -273.0°C, high = +70.0°C) (crit = +80.0°C) root@jupiter:/sys/class/hwmon/hwmon0# echo 0 > temp1_min root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100 nvme-pci-0100 Adapter: PCI adapter Composite: +30.0°C (low = +0.0°C, high = +70.0°C) ALARM (crit = +80.0°C) It doesn't seem to matter which temperature I write; writing -273000 has the same result. [This is actually why I didn't use the features commands; not that I had observed the problem, but I was concerned that problems like this would show up.] > Cc: Keith Busch <kbusch@kernel.org> > Cc: Jens Axboe <axboe@fb.com> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Sagi Grimberg <sagi@grimberg.me> > Cc: Jean Delvare <jdelvare@suse.com> > Cc: Guenter Roeck <linux@roeck-us.net> > Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> > --- > This patch depends on the patch "nvme: Add hardware monitoring support" [1] > [1] http://lists.infradead.org/pipermail/linux-nvme/2019-November/027883.html > > drivers/nvme/host/nvme-hwmon.c | 98 ++++++++++++++++++++++++++++++++++++------ > include/linux/nvme.h | 6 +++ > 2 files changed, 90 insertions(+), 14 deletions(-) > > diff --git a/drivers/nvme/host/nvme-hwmon.c b/drivers/nvme/host/nvme-hwmon.c > index 5480cbb..79323b2 100644 > --- a/drivers/nvme/host/nvme-hwmon.c > +++ b/drivers/nvme/host/nvme-hwmon.c > @@ -15,6 +15,46 @@ struct nvme_hwmon_data { > struct mutex read_lock; > }; > > +static int nvme_get_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under, > + long *temp) > +{ > + unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT; > + int status; > + int ret; > + > + if (under) > + threshold |= NVME_TEMP_THRESH_TYPE_UNDER; > + > + ret = nvme_get_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0, > + &status); > + if (!ret) > + *temp = ((status & NVME_TEMP_THRESH_MASK) - 273) * 1000; > + > + return ret <= 0 ? ret : -EIO; > +} > + > +static int nvme_set_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under, > + long temp) > +{ > + unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT; > + int status; > + int ret; > + > + temp = temp / 1000 + 273; > + if (temp > NVME_TEMP_THRESH_MASK) > + return -EINVAL; > + Traditionally we use clamp_val() in hwmon drivers to adjust value ranges for limit attributes, reason being that we can't expect userspace to dig through per-sensor-type documentation to identify valid limits. Also, note that the above does not handle negative values well (-274000 -> -274 -> -1). I would suggest something like temp = temp / 1000 + 273; temp = clamp_val(temp, 0, NVME_TEMP_THRESH_MASK); or, if you want to be fancy; temp = DIV_ROUND_CLOSEST(temp, 1000) - 273; temp = clamp_val(temp, 0, NVME_TEMP_THRESH_MASK); > + threshold |= temp; > + > + if (under) > + threshold |= NVME_TEMP_THRESH_TYPE_UNDER; > + > + ret = nvme_set_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0, > + &status); I am a bit baffled here. The last parameter of nvme_set_features() (and nvme_get_features) is a pointer to u32, but status is declared as int. I would have assumed this generates a compiler warning, but it doesn't, at least not with my version of gcc. Either case, it might be better to declare status as u32 (unless I did not have enough coffee and I am missing something). Also, I assume that the returned status value is irrelevant. I don't find useful information in the specification, but I may be missing it. > + > + return ret <= 0 ? ret : -EIO; > +} > + > static int nvme_hwmon_get_smart_log(struct nvme_hwmon_data *data) > { > int ret; > @@ -39,8 +79,12 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, > */ > switch (attr) { > case hwmon_temp_max: > - *val = (data->ctrl->wctemp - 273) * 1000; > + err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > + if (err) > + *val = (data->ctrl->wctemp - 273) * 1000; This would report WCTEMP for all sensors on errors, including errors seen while the controller is resetting. I think it should be something like int err = 0; ... if (!channel) *val = (data->ctrl->wctemp - 273) * 1000; else err = nvme_get_temp_thresh(data->ctrl, channel, false, val); return err; assuming we keep using ctrl->wctemp (see below). If changing the upper Composite temperature sensor limit changes wctemp, and we don't update it, we should not use it at all after registration and just report the error. > return 0; > + case hwmon_temp_min: > + return nvme_get_temp_thresh(data->ctrl, channel, true, val); > case hwmon_temp_crit: > *val = (data->ctrl->cctemp - 273) * 1000; > return 0; > @@ -73,6 +117,23 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, > return err; > } > > +static int nvme_hwmon_write(struct device *dev, enum hwmon_sensor_types type, > + u32 attr, int channel, long val) > +{ > + struct nvme_hwmon_data *data = dev_get_drvdata(dev); > + > + switch (attr) { > + case hwmon_temp_max: > + return nvme_set_temp_thresh(data->ctrl, channel, false, val); Does this change WCTEMP if written on channel 0 ? If so, we would have to update the cached value of ctrl->wctemp (or never use it after registration). > + case hwmon_temp_min: > + return nvme_set_temp_thresh(data->ctrl, channel, true, val); > + default: > + break; > + } > + > + return -EOPNOTSUPP; > +} > + > static const char * const nvme_hwmon_sensor_names[] = { > "Composite", > "Sensor 1", > @@ -105,13 +166,13 @@ static umode_t nvme_hwmon_is_visible(const void *_data, > return 0444; > break; > case hwmon_temp_max: > + case hwmon_temp_min: > if (!channel && data->ctrl->wctemp) > - return 0444; > + return 0644; > + else if (data->log.temp_sensor[channel - 1]) > + return 0644; This ends up with a negative index into data->log.temp_sensor if data->ctrl->wctemp == 0. It needs to be else if (channel && data->log.temp_sensor[channel - 1]) It can also be written as a single conditional since the return value is the same. > break; > case hwmon_temp_alarm: > - if (!channel) > - return 0444; > - break; > case hwmon_temp_input: > case hwmon_temp_label: > if (!channel || data->log.temp_sensor[channel - 1]) > @@ -126,16 +187,24 @@ static umode_t nvme_hwmon_is_visible(const void *_data, > static const struct hwmon_channel_info *nvme_hwmon_info[] = { > HWMON_CHANNEL_INFO(chip, HWMON_C_REGISTER_TZ), > HWMON_CHANNEL_INFO(temp, > - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT | > + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_CRIT | HWMON_T_LABEL | HWMON_T_ALARM, > + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_LABEL | HWMON_T_ALARM, > + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_LABEL | HWMON_T_ALARM, > + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_LABEL | HWMON_T_ALARM, > + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_LABEL | HWMON_T_ALARM, > + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_LABEL | HWMON_T_ALARM, > + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_LABEL | HWMON_T_ALARM, > + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > HWMON_T_LABEL | HWMON_T_ALARM, > - HWMON_T_INPUT | HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_LABEL, > - HWMON_T_INPUT | HWMON_T_LABEL), > + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | > + HWMON_T_LABEL | HWMON_T_ALARM), > NULL > }; > > @@ -143,6 +212,7 @@ static const struct hwmon_ops nvme_hwmon_ops = { > .is_visible = nvme_hwmon_is_visible, > .read = nvme_hwmon_read, > .read_string = nvme_hwmon_read_string, > + .write = nvme_hwmon_write, > }; > > static const struct hwmon_chip_info nvme_hwmon_chip_info = { > diff --git a/include/linux/nvme.h b/include/linux/nvme.h > index 269b2e8..347a44f 100644 > --- a/include/linux/nvme.h > +++ b/include/linux/nvme.h > @@ -803,6 +803,12 @@ struct nvme_write_zeroes_cmd { > > /* Features */ > > +enum { > + NVME_TEMP_THRESH_MASK = 0xffff, > + NVME_TEMP_THRESH_SELECT_SHIFT = 16, > + NVME_TEMP_THRESH_TYPE_UNDER = 0x100000, > +}; > + > struct nvme_feat_auto_pst { > __le64 entries[32]; > }; >
2019年11月11日(月) 1:30 Guenter Roeck <linux@roeck-us.net>: > > On 11/10/19 6:17 AM, Akinobu Mita wrote: > > According to the NVMe specification, the over temperature threshold and > > under temperature threshold features shall be implemented for Composite > > Temperature if a non-zero WCTEMP field value is reported in the Identify > > Controller data structure. The features are also implemented for all > > implemented temperature sensors (i.e., all Temperature Sensor fields that > > report a non-zero value). > > > > This provides the over temperature threshold and under temperature > > threshold for each sensor as temperature min and max values of hwmon > > sysfs attributes. > > > > The WCTEMP is already provided as a temperature max value for Composite > > Temperature, but this change isn't incompatible. Because the default > > value of the over temperature threshold for Composite Temperature is > > the WCTEMP. > > > > This also provides alarm attributes for each temperature sensor. But all > > alarm conditions are same, because there is only a single bit in > > Critical Warning field that indicates one of the temperature is outside of > > a temperature threshold. > > > > I think it would be more appropriate to report the alarm only for the > composite temperature, reason being that we don't really know which individual > sensor it is associated with. OK. > > Example output from the "sensors" command: > > > > nvme-pci-0100 > > Adapter: PCI adapter > > Composite: +53.0 C (low = -273.0 C, high = +70.0 C) > > (crit = +80.0 C) > > Sensor 1: +56.0 C (low = -273.0 C, high = +65262.0 C) > > Sensor 2: +51.0 C (low = -273.0 C, high = +65262.0 C) > > Sensor 5: +73.0 C (low = -273.0 C, high = +65262.0 C) > > > > Have you tried writing the limits ? On my Intel NVME drive (SSDPEKKW512G7), writing > any minimum limit on the Composite temperature sensor results in a temperature > warning, and that warning is sticky until I reset the controller. > I don't see that problem on Samsung SSD 970 EVO 500GB; I have not yet tried others. I have Crucial CT500P1SSD8 and WDC WDS512G1X0C-00ENX0, and I have no problem with these devices. > root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100 > nvme-pci-0100 > Adapter: PCI adapter > Composite: +30.0°C (low = -273.0°C, high = +70.0°C) > (crit = +80.0°C) > > root@jupiter:/sys/class/hwmon/hwmon0# echo 0 > temp1_min > root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100 > nvme-pci-0100 > Adapter: PCI adapter > Composite: +30.0°C (low = +0.0°C, high = +70.0°C) ALARM > (crit = +80.0°C) > > It doesn't seem to matter which temperature I write; writing -273000 has > the same result. > > [This is actually why I didn't use the features commands; not that I had observed > the problem, but I was concerned that problems like this would show up.] Maybe we should introduce a new quirk so that we can avoid changing temperature threshold for such devices. Could you tell SSDPEKKW512G7's vendor and device ID? Quick googling answers it's 8086:f1a5, but I want to make sure. > > Cc: Keith Busch <kbusch@kernel.org> > > Cc: Jens Axboe <axboe@fb.com> > > Cc: Christoph Hellwig <hch@lst.de> > > Cc: Sagi Grimberg <sagi@grimberg.me> > > Cc: Jean Delvare <jdelvare@suse.com> > > Cc: Guenter Roeck <linux@roeck-us.net> > > Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> > > --- > > This patch depends on the patch "nvme: Add hardware monitoring support" [1] > > [1] http://lists.infradead.org/pipermail/linux-nvme/2019-November/027883.html > > > > drivers/nvme/host/nvme-hwmon.c | 98 ++++++++++++++++++++++++++++++++++++------ > > include/linux/nvme.h | 6 +++ > > 2 files changed, 90 insertions(+), 14 deletions(-) > > > > diff --git a/drivers/nvme/host/nvme-hwmon.c b/drivers/nvme/host/nvme-hwmon.c > > index 5480cbb..79323b2 100644 > > --- a/drivers/nvme/host/nvme-hwmon.c > > +++ b/drivers/nvme/host/nvme-hwmon.c > > @@ -15,6 +15,46 @@ struct nvme_hwmon_data { > > struct mutex read_lock; > > }; > > > > +static int nvme_get_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under, > > + long *temp) > > +{ > > + unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT; > > + int status; > > + int ret; > > + > > + if (under) > > + threshold |= NVME_TEMP_THRESH_TYPE_UNDER; > > + > > + ret = nvme_get_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0, > > + &status); > > + if (!ret) > > + *temp = ((status & NVME_TEMP_THRESH_MASK) - 273) * 1000; > > + > > + return ret <= 0 ? ret : -EIO; > > +} > > + > > +static int nvme_set_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under, > > + long temp) > > +{ > > + unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT; > > + int status; > > + int ret; > > + > > + temp = temp / 1000 + 273; > > + if (temp > NVME_TEMP_THRESH_MASK) > > + return -EINVAL; > > + > > Traditionally we use clamp_val() in hwmon drivers to adjust value ranges > for limit attributes, reason being that we can't expect userspace to dig > through per-sensor-type documentation to identify valid limits. Also, note > that the above does not handle negative values well (-274000 -> -274 -> -1). > I would suggest something like > > temp = temp / 1000 + 273; > temp = clamp_val(temp, 0, NVME_TEMP_THRESH_MASK); > > or, if you want to be fancy; > > temp = DIV_ROUND_CLOSEST(temp, 1000) - 273; > temp = clamp_val(temp, 0, NVME_TEMP_THRESH_MASK); Either way looks good. > > + threshold |= temp; > > + > > + if (under) > > + threshold |= NVME_TEMP_THRESH_TYPE_UNDER; > > + > > + ret = nvme_set_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0, > > + &status); > > I am a bit baffled here. The last parameter of nvme_set_features() (and nvme_get_features) > is a pointer to u32, but status is declared as int. I would have assumed this generates > a compiler warning, but it doesn't, at least not with my version of gcc. > > Either case, it might be better to declare status as u32 (unless I did not have enough > coffee and I am missing something). > > Also, I assume that the returned status value is irrelevant. I don't find useful > information in the specification, but I may be missing it. You are right. I'll change the last parameter of nvme_set_features() with NULL. > > + > > + return ret <= 0 ? ret : -EIO; > > +} > > + > > static int nvme_hwmon_get_smart_log(struct nvme_hwmon_data *data) > > { > > int ret; > > @@ -39,8 +79,12 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, > > */ > > switch (attr) { > > case hwmon_temp_max: > > - *val = (data->ctrl->wctemp - 273) * 1000; > > + err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > > + if (err) > > + *val = (data->ctrl->wctemp - 273) * 1000; > > This would report WCTEMP for all sensors on errors, including errors seen while > the controller is resetting. I think it should be something like > > int err = 0; > ... > > if (!channel) > *val = (data->ctrl->wctemp - 273) * 1000; > else > err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > return err; > > assuming we keep using ctrl->wctemp (see below). If changing the upper Composite > temperature sensor limit changes wctemp, and we don't update it, we should not > use it at all after registration and just report the error. > > > return 0; > > + case hwmon_temp_min: > > + return nvme_get_temp_thresh(data->ctrl, channel, true, val); > > case hwmon_temp_crit: > > *val = (data->ctrl->cctemp - 273) * 1000; > > return 0; > > @@ -73,6 +117,23 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, > > return err; > > } > > > > +static int nvme_hwmon_write(struct device *dev, enum hwmon_sensor_types type, > > + u32 attr, int channel, long val) > > +{ > > + struct nvme_hwmon_data *data = dev_get_drvdata(dev); > > + > > + switch (attr) { > > + case hwmon_temp_max: > > + return nvme_set_temp_thresh(data->ctrl, channel, false, val); > > Does this change WCTEMP if written on channel 0 ? If so, we would have to update > the cached value of ctrl->wctemp (or never use it after registration). At least for the devices I have, setting the over temperature threshold doesn't change the WCTEMP. I have checked with 'nvme id-ctrl /dev/nvme0 | grep ctemp'. > > + case hwmon_temp_min: > > + return nvme_set_temp_thresh(data->ctrl, channel, true, val); > > + default: > > + break; > > + } > > + > > + return -EOPNOTSUPP; > > +} > > + > > static const char * const nvme_hwmon_sensor_names[] = { > > "Composite", > > "Sensor 1", > > @@ -105,13 +166,13 @@ static umode_t nvme_hwmon_is_visible(const void *_data, > > return 0444; > > break; > > case hwmon_temp_max: > > + case hwmon_temp_min: > > if (!channel && data->ctrl->wctemp) > > - return 0444; > > + return 0644; > > + else if (data->log.temp_sensor[channel - 1]) > > + return 0644; > > This ends up with a negative index into data->log.temp_sensor > if data->ctrl->wctemp == 0. It needs to be Oops. > else if (channel && data->log.temp_sensor[channel - 1]) > It can also be written as a single conditional since the return value is the same. Sounds good.
On Sun, Nov 10, 2019 at 11:17:46PM +0900, Akinobu Mita wrote: > +static int nvme_get_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under, > + long *temp) > +{ > + unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT; > + int status; > + int ret; > + > + if (under) > + threshold |= NVME_TEMP_THRESH_TYPE_UNDER; > + > + ret = nvme_get_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0, > + &status); > + if (!ret) > + *temp = ((status & NVME_TEMP_THRESH_MASK) - 273) * 1000; > + > + return ret <= 0 ? ret : -EIO; This looks a little obsfucated. aI'd prefer something like: if (ret > 0) return -EIO; if (ret < 0) return ret; *temp = ((status & NVME_TEMP_THRESH_MASK) - 273) * 1000; return 0; > + return ret <= 0 ? ret : -EIO; Similarly here, something like: if (ret > 0) return -EIO; return ret; > + err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > + if (err) > + *val = (data->ctrl->wctemp - 273) * 1000; Can we have a helper for this (x - 273) * 1000 conversion? It is repeated quite a bit over the code in this file.
On Tue, Nov 12, 2019 at 12:56:21AM +0900, Akinobu Mita wrote: > 2019年11月11日(月) 1:30 Guenter Roeck <linux@roeck-us.net>: > > > [ ... ] > > > Example output from the "sensors" command: > > > > > > nvme-pci-0100 > > > Adapter: PCI adapter > > > Composite: +53.0 C (low = -273.0 C, high = +70.0 C) > > > (crit = +80.0 C) > > > Sensor 1: +56.0 C (low = -273.0 C, high = +65262.0 C) > > > Sensor 2: +51.0 C (low = -273.0 C, high = +65262.0 C) > > > Sensor 5: +73.0 C (low = -273.0 C, high = +65262.0 C) > > > > > > > Have you tried writing the limits ? On my Intel NVME drive (SSDPEKKW512G7), writing > > any minimum limit on the Composite temperature sensor results in a temperature > > warning, and that warning is sticky until I reset the controller. > > I don't see that problem on Samsung SSD 970 EVO 500GB; I have not yet tried others. > > I have Crucial CT500P1SSD8 and WDC WDS512G1X0C-00ENX0, and I have no > problem with these devices. > > > root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100 > > nvme-pci-0100 > > Adapter: PCI adapter > > Composite: +30.0°C (low = -273.0°C, high = +70.0°C) > > (crit = +80.0°C) > > > > root@jupiter:/sys/class/hwmon/hwmon0# echo 0 > temp1_min > > root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100 > > nvme-pci-0100 > > Adapter: PCI adapter > > Composite: +30.0°C (low = +0.0°C, high = +70.0°C) ALARM > > (crit = +80.0°C) > > > > It doesn't seem to matter which temperature I write; writing -273000 has > > the same result. > > > > [This is actually why I didn't use the features commands; not that I had observed > > the problem, but I was concerned that problems like this would show up.] > > Maybe we should introduce a new quirk so that we can avoid changing > temperature threshold for such devices. Could you tell SSDPEKKW512G7's > vendor and device ID? Quick googling answers it's 8086:f1a5, but I want > to make sure. Yes, that is correct. 01:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:f1a5] (rev 03) I'll see if I can test this tonight on my other NVMEs. I also dug up an old NVMe drive from Toshiba; I'll see if I can connect and test it as well. [ ... ] > > > */ > > > switch (attr) { > > > case hwmon_temp_max: > > > - *val = (data->ctrl->wctemp - 273) * 1000; > > > + err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > > > + if (err) > > > + *val = (data->ctrl->wctemp - 273) * 1000; > > > > This would report WCTEMP for all sensors on errors, including errors seen while > > the controller is resetting. I think it should be something like > > > > int err = 0; > > ... > > > > if (!channel) > > *val = (data->ctrl->wctemp - 273) * 1000; > > else > > err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > > return err; > > > > assuming we keep using ctrl->wctemp (see below). If changing the upper Composite > > temperature sensor limit changes wctemp, and we don't update it, we should not > > use it at all after registration and just report the error. > > > > > return 0; > > > + case hwmon_temp_min: > > > + return nvme_get_temp_thresh(data->ctrl, channel, true, val); > > > case hwmon_temp_crit: > > > *val = (data->ctrl->cctemp - 273) * 1000; > > > return 0; > > > @@ -73,6 +117,23 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, > > > return err; > > > } > > > > > > +static int nvme_hwmon_write(struct device *dev, enum hwmon_sensor_types type, > > > + u32 attr, int channel, long val) > > > +{ > > > + struct nvme_hwmon_data *data = dev_get_drvdata(dev); > > > + > > > + switch (attr) { > > > + case hwmon_temp_max: > > > + return nvme_set_temp_thresh(data->ctrl, channel, false, val); > > > > Does this change WCTEMP if written on channel 0 ? If so, we would have to update > > the cached value of ctrl->wctemp (or never use it after registration). > > At least for the devices I have, setting the over temperature threshold > doesn't change the WCTEMP. > I have checked with 'nvme id-ctrl /dev/nvme0 | grep ctemp'. > Interesting. I just tested this, and the result is the same with Samsung SSD 970 EVO. With that in mind, maybe we should really not use wctemp at all after initialization, as I had suggested above. What do you think ? Thanks, Guenter
2019年11月12日(火) 1:53 Christoph Hellwig <hch@lst.de>: > > On Sun, Nov 10, 2019 at 11:17:46PM +0900, Akinobu Mita wrote: > > +static int nvme_get_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under, > > + long *temp) > > +{ > > + unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT; > > + int status; > > + int ret; > > + > > + if (under) > > + threshold |= NVME_TEMP_THRESH_TYPE_UNDER; > > + > > + ret = nvme_get_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0, > > + &status); > > + if (!ret) > > + *temp = ((status & NVME_TEMP_THRESH_MASK) - 273) * 1000; > > + > > + return ret <= 0 ? ret : -EIO; > > This looks a little obsfucated. aI'd prefer something like: > > if (ret > 0) > return -EIO; > if (ret < 0) > return ret; > *temp = ((status & NVME_TEMP_THRESH_MASK) - 273) * 1000; > return 0; Sounds good. > > + return ret <= 0 ? ret : -EIO; > > Similarly here, something like: > > if (ret > 0) > return -EIO; > return ret; OK. > > + err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > > + if (err) > > + *val = (data->ctrl->wctemp - 273) * 1000; > > Can we have a helper for this (x - 273) * 1000 conversion? It is > repeated quite a bit over the code in this file. OK. I'll add two macros. #define MILLICELSIUS_TO_KELVIN(t) ((t) / 1000 + 273) #define KELVIN_TO_MILLICELSIUS(t) (((t) - 273) * 1000)
On Tue, Nov 12, 2019 at 11:19:46PM +0900, Akinobu Mita wrote: > OK. I'll add two macros. > > #define MILLICELSIUS_TO_KELVIN(t) ((t) / 1000 + 273) > #define KELVIN_TO_MILLICELSIUS(t) (((t) - 273) * 1000) Can you add them to linux/thermal.h that already has similar helpers?
2019年11月12日(火) 2:35 Guenter Roeck <linux@roeck-us.net>: > > On Tue, Nov 12, 2019 at 12:56:21AM +0900, Akinobu Mita wrote: > > 2019年11月11日(月) 1:30 Guenter Roeck <linux@roeck-us.net>: > > > > > > [ ... ] > > > > > Example output from the "sensors" command: > > > > > > > > nvme-pci-0100 > > > > Adapter: PCI adapter > > > > Composite: +53.0 C (low = -273.0 C, high = +70.0 C) > > > > (crit = +80.0 C) > > > > Sensor 1: +56.0 C (low = -273.0 C, high = +65262.0 C) > > > > Sensor 2: +51.0 C (low = -273.0 C, high = +65262.0 C) > > > > Sensor 5: +73.0 C (low = -273.0 C, high = +65262.0 C) > > > > > > > > > > Have you tried writing the limits ? On my Intel NVME drive (SSDPEKKW512G7), writing > > > any minimum limit on the Composite temperature sensor results in a temperature > > > warning, and that warning is sticky until I reset the controller. > > > I don't see that problem on Samsung SSD 970 EVO 500GB; I have not yet tried others. > > > > I have Crucial CT500P1SSD8 and WDC WDS512G1X0C-00ENX0, and I have no > > problem with these devices. > > > > > root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100 > > > nvme-pci-0100 > > > Adapter: PCI adapter > > > Composite: +30.0°C (low = -273.0°C, high = +70.0°C) > > > (crit = +80.0°C) > > > > > > root@jupiter:/sys/class/hwmon/hwmon0# echo 0 > temp1_min > > > root@jupiter:/sys/class/hwmon/hwmon0# sensors nvme-pci-0100 > > > nvme-pci-0100 > > > Adapter: PCI adapter > > > Composite: +30.0°C (low = +0.0°C, high = +70.0°C) ALARM > > > (crit = +80.0°C) > > > > > > It doesn't seem to matter which temperature I write; writing -273000 has > > > the same result. > > > > > > [This is actually why I didn't use the features commands; not that I had observed > > > the problem, but I was concerned that problems like this would show up.] > > > > Maybe we should introduce a new quirk so that we can avoid changing > > temperature threshold for such devices. Could you tell SSDPEKKW512G7's > > vendor and device ID? Quick googling answers it's 8086:f1a5, but I want > > to make sure. > > Yes, that is correct. > > 01:00.0 Non-Volatile memory controller [0108]: Intel Corporation Device [8086:f1a5] (rev 03) OK. I'll add NVME_QUIRK_NO_TEMP_THRESH_CHANGE and mark the device. > I'll see if I can test this tonight on my other NVMEs. I also dug up an old > NVMe drive from Toshiba; I'll see if I can connect and test it as well. That's nice. > [ ... ] > > > > > */ > > > > switch (attr) { > > > > case hwmon_temp_max: > > > > - *val = (data->ctrl->wctemp - 273) * 1000; > > > > + err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > > > > + if (err) > > > > + *val = (data->ctrl->wctemp - 273) * 1000; > > > > > > This would report WCTEMP for all sensors on errors, including errors seen while > > > the controller is resetting. I think it should be something like > > > > > > int err = 0; > > > ... > > > > > > if (!channel) > > > *val = (data->ctrl->wctemp - 273) * 1000; > > > else > > > err = nvme_get_temp_thresh(data->ctrl, channel, false, val); > > > return err; > > > > > > assuming we keep using ctrl->wctemp (see below). If changing the upper Composite > > > temperature sensor limit changes wctemp, and we don't update it, we should not > > > use it at all after registration and just report the error. > > > > > > > return 0; > > > > + case hwmon_temp_min: > > > > + return nvme_get_temp_thresh(data->ctrl, channel, true, val); > > > > case hwmon_temp_crit: > > > > *val = (data->ctrl->cctemp - 273) * 1000; > > > > return 0; > > > > @@ -73,6 +117,23 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, > > > > return err; > > > > } > > > > > > > > +static int nvme_hwmon_write(struct device *dev, enum hwmon_sensor_types type, > > > > + u32 attr, int channel, long val) > > > > +{ > > > > + struct nvme_hwmon_data *data = dev_get_drvdata(dev); > > > > + > > > > + switch (attr) { > > > > + case hwmon_temp_max: > > > > + return nvme_set_temp_thresh(data->ctrl, channel, false, val); > > > > > > Does this change WCTEMP if written on channel 0 ? If so, we would have to update > > > the cached value of ctrl->wctemp (or never use it after registration). > > > > At least for the devices I have, setting the over temperature threshold > > doesn't change the WCTEMP. > > I have checked with 'nvme id-ctrl /dev/nvme0 | grep ctemp'. > > > > Interesting. I just tested this, and the result is the same with Samsung > SSD 970 EVO. With that in mind, maybe we should really not use wctemp > at all after initialization, as I had suggested above. What do you think ? Yes. We only need to use the WCTEMP in nvme_hwmon_is_visible().
2019年11月12日(火) 23:21 Christoph Hellwig <hch@lst.de>: > > On Tue, Nov 12, 2019 at 11:19:46PM +0900, Akinobu Mita wrote: > > OK. I'll add two macros. > > > > #define MILLICELSIUS_TO_KELVIN(t) ((t) / 1000 + 273) > > #define KELVIN_TO_MILLICELSIUS(t) (((t) - 273) * 1000) > > Can you add them to linux/thermal.h that already has similar > helpers? Should we add a new linux/temperature.h so that any other drivers or subsystems (including thermal.h and hwmon.h) can use these macros?
On 11/12/19 6:40 AM, Akinobu Mita wrote: [ ... ] >> I'll see if I can test this tonight on my other NVMEs. I also dug up an old >> NVMe drive from Toshiba; I'll see if I can connect and test it as well. > > That's nice. > It works on all devices I have, including the Toshiba, with the exception of the Intel. Interestingly, the Toshiba NVMe reports: nvme-pci-0600 Adapter: PCI adapter Composite: +56.0°C (low = -20.0°C, high = +85.0°C) (crit = +82.0°C) Sensor 1: +56.0°C (low = -20.0°C, high = +85.0°C) ie the critical temperature is lower than the high temperature. Go figure. The Toshiba model is THNSN5256GPU7 NVMe TOSHIBA 256GB The Intel NVMe has the latest firmware version installed, so this isn't a problem that was ever fixed. Guenter
On Tue, Nov 12, 2019 at 07:04:38AM -0800, Guenter Roeck wrote: > The Intel NVMe has the latest firmware version installed, so this isn't > a problem that was ever fixed. What Intel device is that? If it is one of the XXXp models, those just seem some of the most buggy NVMe SSDs around unfortuntely (excluding the Apple ones that don't actually claim to be NVMe at least).
On Wed, Nov 13, 2019 at 12:00:22AM +0900, Akinobu Mita wrote: > 2019年11月12日(火) 23:21 Christoph Hellwig <hch@lst.de>: > > > > On Tue, Nov 12, 2019 at 11:19:46PM +0900, Akinobu Mita wrote: > > > OK. I'll add two macros. > > > > > > #define MILLICELSIUS_TO_KELVIN(t) ((t) / 1000 + 273) > > > #define KELVIN_TO_MILLICELSIUS(t) (((t) - 273) * 1000) > > > > Can you add them to linux/thermal.h that already has similar > > helpers? > > Should we add a new linux/temperature.h so that any other drivers or > subsystems (including thermal.h and hwmon.h) can use these macros? Fine with me.
On Tue, Nov 12, 2019 at 04:06:59PM +0100, Christoph Hellwig wrote: > On Tue, Nov 12, 2019 at 07:04:38AM -0800, Guenter Roeck wrote: > > The Intel NVMe has the latest firmware version installed, so this isn't > > a problem that was ever fixed. > > What Intel device is that? If it is one of the XXXp models, those just > seem some of the most buggy NVMe SSDs around unfortuntely (excluding > the Apple ones that don't actually claim to be NVMe at least). SSDPEKKW512G7; this is a 600p model. Guenter
On Wed, Nov 13, 2019 at 12:00:22AM +0900, Akinobu Mita wrote: > 2019年11月12日(火) 23:21 Christoph Hellwig <hch@lst.de>: > > > > On Tue, Nov 12, 2019 at 11:19:46PM +0900, Akinobu Mita wrote: > > > OK. I'll add two macros. > > > > > > #define MILLICELSIUS_TO_KELVIN(t) ((t) / 1000 + 273) DIV_ROUND_CLOSEST() ? > > > #define KELVIN_TO_MILLICELSIUS(t) (((t) - 273) * 1000) > > > > Can you add them to linux/thermal.h that already has similar > > helpers? > > Should we add a new linux/temperature.h so that any other drivers or > subsystems (including thermal.h and hwmon.h) can use these macros? Good idea. I don't like the idea of pulling in all of linux/thermal.h just for the above macros. Guenter
2019年11月13日(水) 1:38 Guenter Roeck <linux@roeck-us.net>: > > On Wed, Nov 13, 2019 at 12:00:22AM +0900, Akinobu Mita wrote: > > 2019年11月12日(火) 23:21 Christoph Hellwig <hch@lst.de>: > > > > > > On Tue, Nov 12, 2019 at 11:19:46PM +0900, Akinobu Mita wrote: > > > > OK. I'll add two macros. > > > > > > > > #define MILLICELSIUS_TO_KELVIN(t) ((t) / 1000 + 273) > > DIV_ROUND_CLOSEST() ? How about these definitions? #define MILLICELSIUS_TO_KELVIN(t) DIV_ROUND_CLOSEST((t) + 273150, 1000) #define KELVIN_TO_MILLICELSIUS(t) ((t) * 1000L - 273150)
On 11/13/19 4:58 AM, Akinobu Mita wrote: > 2019年11月13日(水) 1:38 Guenter Roeck <linux@roeck-us.net>: >> >> On Wed, Nov 13, 2019 at 12:00:22AM +0900, Akinobu Mita wrote: >>> 2019年11月12日(火) 23:21 Christoph Hellwig <hch@lst.de>: >>>> >>>> On Tue, Nov 12, 2019 at 11:19:46PM +0900, Akinobu Mita wrote: >>>>> OK. I'll add two macros. >>>>> >>>>> #define MILLICELSIUS_TO_KELVIN(t) ((t) / 1000 + 273) >> >> DIV_ROUND_CLOSEST() ? > > How about these definitions? > > #define MILLICELSIUS_TO_KELVIN(t) DIV_ROUND_CLOSEST((t) + 273150, 1000) > #define KELVIN_TO_MILLICELSIUS(t) ((t) * 1000L - 273150) > LGTM Guenter
diff --git a/drivers/nvme/host/nvme-hwmon.c b/drivers/nvme/host/nvme-hwmon.c index 5480cbb..79323b2 100644 --- a/drivers/nvme/host/nvme-hwmon.c +++ b/drivers/nvme/host/nvme-hwmon.c @@ -15,6 +15,46 @@ struct nvme_hwmon_data { struct mutex read_lock; }; +static int nvme_get_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under, + long *temp) +{ + unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT; + int status; + int ret; + + if (under) + threshold |= NVME_TEMP_THRESH_TYPE_UNDER; + + ret = nvme_get_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0, + &status); + if (!ret) + *temp = ((status & NVME_TEMP_THRESH_MASK) - 273) * 1000; + + return ret <= 0 ? ret : -EIO; +} + +static int nvme_set_temp_thresh(struct nvme_ctrl *ctrl, int sensor, bool under, + long temp) +{ + unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT; + int status; + int ret; + + temp = temp / 1000 + 273; + if (temp > NVME_TEMP_THRESH_MASK) + return -EINVAL; + + threshold |= temp; + + if (under) + threshold |= NVME_TEMP_THRESH_TYPE_UNDER; + + ret = nvme_set_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0, + &status); + + return ret <= 0 ? ret : -EIO; +} + static int nvme_hwmon_get_smart_log(struct nvme_hwmon_data *data) { int ret; @@ -39,8 +79,12 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, */ switch (attr) { case hwmon_temp_max: - *val = (data->ctrl->wctemp - 273) * 1000; + err = nvme_get_temp_thresh(data->ctrl, channel, false, val); + if (err) + *val = (data->ctrl->wctemp - 273) * 1000; return 0; + case hwmon_temp_min: + return nvme_get_temp_thresh(data->ctrl, channel, true, val); case hwmon_temp_crit: *val = (data->ctrl->cctemp - 273) * 1000; return 0; @@ -73,6 +117,23 @@ static int nvme_hwmon_read(struct device *dev, enum hwmon_sensor_types type, return err; } +static int nvme_hwmon_write(struct device *dev, enum hwmon_sensor_types type, + u32 attr, int channel, long val) +{ + struct nvme_hwmon_data *data = dev_get_drvdata(dev); + + switch (attr) { + case hwmon_temp_max: + return nvme_set_temp_thresh(data->ctrl, channel, false, val); + case hwmon_temp_min: + return nvme_set_temp_thresh(data->ctrl, channel, true, val); + default: + break; + } + + return -EOPNOTSUPP; +} + static const char * const nvme_hwmon_sensor_names[] = { "Composite", "Sensor 1", @@ -105,13 +166,13 @@ static umode_t nvme_hwmon_is_visible(const void *_data, return 0444; break; case hwmon_temp_max: + case hwmon_temp_min: if (!channel && data->ctrl->wctemp) - return 0444; + return 0644; + else if (data->log.temp_sensor[channel - 1]) + return 0644; break; case hwmon_temp_alarm: - if (!channel) - return 0444; - break; case hwmon_temp_input: case hwmon_temp_label: if (!channel || data->log.temp_sensor[channel - 1]) @@ -126,16 +187,24 @@ static umode_t nvme_hwmon_is_visible(const void *_data, static const struct hwmon_channel_info *nvme_hwmon_info[] = { HWMON_CHANNEL_INFO(chip, HWMON_C_REGISTER_TZ), HWMON_CHANNEL_INFO(temp, - HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_CRIT | + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_CRIT | HWMON_T_LABEL | HWMON_T_ALARM, + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_LABEL | HWMON_T_ALARM, + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_LABEL | HWMON_T_ALARM, + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_LABEL | HWMON_T_ALARM, + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_LABEL | HWMON_T_ALARM, + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_LABEL | HWMON_T_ALARM, + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_LABEL | HWMON_T_ALARM, + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | HWMON_T_LABEL | HWMON_T_ALARM, - HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_LABEL, - HWMON_T_INPUT | HWMON_T_LABEL), + HWMON_T_INPUT | HWMON_T_MAX | HWMON_T_MIN | + HWMON_T_LABEL | HWMON_T_ALARM), NULL }; @@ -143,6 +212,7 @@ static const struct hwmon_ops nvme_hwmon_ops = { .is_visible = nvme_hwmon_is_visible, .read = nvme_hwmon_read, .read_string = nvme_hwmon_read_string, + .write = nvme_hwmon_write, }; static const struct hwmon_chip_info nvme_hwmon_chip_info = { diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 269b2e8..347a44f 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -803,6 +803,12 @@ struct nvme_write_zeroes_cmd { /* Features */ +enum { + NVME_TEMP_THRESH_MASK = 0xffff, + NVME_TEMP_THRESH_SELECT_SHIFT = 16, + NVME_TEMP_THRESH_TYPE_UNDER = 0x100000, +}; + struct nvme_feat_auto_pst { __le64 entries[32]; };
According to the NVMe specification, the over temperature threshold and under temperature threshold features shall be implemented for Composite Temperature if a non-zero WCTEMP field value is reported in the Identify Controller data structure. The features are also implemented for all implemented temperature sensors (i.e., all Temperature Sensor fields that report a non-zero value). This provides the over temperature threshold and under temperature threshold for each sensor as temperature min and max values of hwmon sysfs attributes. The WCTEMP is already provided as a temperature max value for Composite Temperature, but this change isn't incompatible. Because the default value of the over temperature threshold for Composite Temperature is the WCTEMP. This also provides alarm attributes for each temperature sensor. But all alarm conditions are same, because there is only a single bit in Critical Warning field that indicates one of the temperature is outside of a temperature threshold. Example output from the "sensors" command: nvme-pci-0100 Adapter: PCI adapter Composite: +53.0 C (low = -273.0 C, high = +70.0 C) (crit = +80.0 C) Sensor 1: +56.0 C (low = -273.0 C, high = +65262.0 C) Sensor 2: +51.0 C (low = -273.0 C, high = +65262.0 C) Sensor 5: +73.0 C (low = -273.0 C, high = +65262.0 C) Cc: Keith Busch <kbusch@kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Jean Delvare <jdelvare@suse.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> --- This patch depends on the patch "nvme: Add hardware monitoring support" [1] [1] http://lists.infradead.org/pipermail/linux-nvme/2019-November/027883.html drivers/nvme/host/nvme-hwmon.c | 98 ++++++++++++++++++++++++++++++++++++------ include/linux/nvme.h | 6 +++ 2 files changed, 90 insertions(+), 14 deletions(-)