Message ID | 20191215174509.1847-2-linux@roeck-us.net (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | Summary: hwmon driver for temperature sensors on SATA drives | expand |
Guenter, > This driver solves this problem by adding support for reading the > temperature of SATA drives from the kernel using the hwmon API and > by adding a temperature zone for each drive. My working tree is available here: https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=5.6/drivetemp A few notes: - Before applying your patch I did s/satatemp/drivetemp/ - I get a crash in the driver core during probe if the drivetemp module is loaded prior to loading ahci or a SCSI HBA driver. This crash is unrelated to my changes. Haven't had time to debug. - I tweaked your ATA detection heuristics and now use the cached VPD page 0x89 instead of fetching one from the device. - I also added support for reading the temperature log page on SCSI drives. - Tested with a mixed bag of about 40 SCSI and SATA drives attached. - I still think sensor naming needs work. How and where are the "drivetemp-scsi-8-140" names generated? I'll tinker some more but thought I'd share what I have for now.
On Wed, Dec 18, 2019 at 07:15:07PM -0500, Martin K. Petersen wrote: > > Guenter, > > > This driver solves this problem by adding support for reading the > > temperature of SATA drives from the kernel using the hwmon API and > > by adding a temperature zone for each drive. > > My working tree is available here: > > https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=5.6/drivetemp > > A few notes: > > - Before applying your patch I did s/satatemp/drivetemp/ > > - I get a crash in the driver core during probe if the drivetemp module > is loaded prior to loading ahci or a SCSI HBA driver. This crash is > unrelated to my changes. Haven't had time to debug. > > - I tweaked your ATA detection heuristics and now use the cached VPD > page 0x89 instead of fetching one from the device. > > - I also added support for reading the temperature log page on SCSI > drives. > > - Tested with a mixed bag of about 40 SCSI and SATA drives attached. > > - I still think sensor naming needs work. How and where are the > "drivetemp-scsi-8-140" names generated? > Quick one: In libsensors, outside the kernel. The naming is generic, along the line of <driver name>-<bus name>-<bus index>-<slot>. > I'll tinker some more but thought I'd share what I have for now. > Thanks for sharing. I'll be out on vacation until January 1. I'll look at the code after I am back. Guenter > -- > Martin K. Petersen Oracle Linux Engineering
Hi Martin, On 12/18/19 4:15 PM, Martin K. Petersen wrote: > > Guenter, > >> This driver solves this problem by adding support for reading the >> temperature of SATA drives from the kernel using the hwmon API and >> by adding a temperature zone for each drive. > > My working tree is available here: > > https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=5.6/drivetemp > I had a quick look at the patch. Looks good overall. I think you should be able to use the buffer allocated in struct drivetemp_data (named smartdata). Maybe rename it to something more generic. If it needs to be dma-aligned, maybe we can use kzalloc() to allocate it. Only reason I didn't use it for vpd was because that needed 1024 bytes. > A few notes: > > - Before applying your patch I did s/satatemp/drivetemp/ > > - I get a crash in the driver core during probe if the drivetemp module > is loaded prior to loading ahci or a SCSI HBA driver. This crash is > unrelated to my changes. Haven't had time to debug. > Definitely something we'll need to look into. Do you have a traceback ? Thanks, Guenter > - I tweaked your ATA detection heuristics and now use the cached VPD > page 0x89 instead of fetching one from the device. > > - I also added support for reading the temperature log page on SCSI > drives. > > - Tested with a mixed bag of about 40 SCSI and SATA drives attached. > > - I still think sensor naming needs work. How and where are the > "drivetemp-scsi-8-140" names generated? > > I'll tinker some more but thought I'd share what I have for now. >
Hi Martin, On 12/18/19 4:15 PM, Martin K. Petersen wrote: > > Guenter, > >> This driver solves this problem by adding support for reading the >> temperature of SATA drives from the kernel using the hwmon API and >> by adding a temperature zone for each drive. > > My working tree is available here: > > https://git.kernel.org/pub/scm/linux/kernel/git/mkp/linux.git/log/?h=5.6/drivetemp > > A few notes: > > - Before applying your patch I did s/satatemp/drivetemp/ > > - I get a crash in the driver core during probe if the drivetemp module > is loaded prior to loading ahci or a SCSI HBA driver. This crash is > unrelated to my changes. Haven't had time to debug. > Any idea how I might be able to reproduce this ? So far I have been unsuccessful. Building drivetemp into the kernel, with ahci and everything SCSI built as module, doesn't trigger the crash for me. This is with the drivetemp patch (v3) as well as commit d188b0675b ("scsi: core: Add sysfs attributes for VPD pages 0h and 89h") applied on top of v5.4.7. Thanks, Guenter > - I tweaked your ATA detection heuristics and now use the cached VPD > page 0x89 instead of fetching one from the device. > > - I also added support for reading the temperature log page on SCSI > drives. > > - Tested with a mixed bag of about 40 SCSI and SATA drives attached. > > - I still think sensor naming needs work. How and where are the > "drivetemp-scsi-8-140" names generated? > > I'll tinker some more but thought I'd share what I have for now. >
Guenter, >> - I get a crash in the driver core during probe if the drivetemp module >> is loaded prior to loading ahci or a SCSI HBA driver. This crash is >> unrelated to my changes. Haven't had time to debug. >> > > Any idea how I might be able to reproduce this ? So far I have been > unsuccessful. I'm chipping away at a mountain of email. Will take a look tomorrow.
Hi Guenter! >> - I still think sensor naming needs work. How and where are the >> "drivetemp-scsi-8-140" names generated? >> > Quick one: In libsensors, outside the kernel. The naming is generic, > along the line of <driver name>-<bus name>-<bus index>-<slot>. I understand that there are sensors that may not have an obvious associated topology and therefore necessitate coming up with a suitable naming or enumeration scheme. But in this case we already have a well-defined SCSI device name. Any particular reason you don't shift the chip.addr back and print the H:C:T:L format that you used as input? However arcane H:C:T:L may seem, I think that predictable naming would make things a lot easier for users that need to identify which device matches which sensor...
On 1/6/20 8:10 PM, Martin K. Petersen wrote: > > Hi Guenter! > >>> - I still think sensor naming needs work. How and where are the >>> "drivetemp-scsi-8-140" names generated? >>> >> Quick one: In libsensors, outside the kernel. The naming is generic, >> along the line of <driver name>-<bus name>-<bus index>-<slot>. > > I understand that there are sensors that may not have an obvious > associated topology and therefore necessitate coming up with a suitable > naming or enumeration scheme. But in this case we already have a > well-defined SCSI device name. Any particular reason you don't shift the > chip.addr back and print the H:C:T:L format that you used as input? > > However arcane H:C:T:L may seem, I think that predictable naming would > make things a lot easier for users that need to identify which device > matches which sensor... > Not sure I understand. Do you mean to add "H:C:T:L" to "drivetemp" ? That would make it something like "drivetemp:H:C:T:L-scsi-8-140". Not sure if that is really useful, and it would at least be partially redundant. "scsi-8-140" is created by libsensors, so any change in that would have to be made there, not in the kernel driver. Guenter
Guenter, > Any idea how I might be able to reproduce this ? So far I have been > unsuccessful. > > Building drivetemp into the kernel, with ahci and everything SCSI > built as module, doesn't trigger the crash for me. This is with the > drivetemp patch (v3) as well as commit d188b0675b ("scsi: core: Add > sysfs attributes for VPD pages 0h and 89h") applied on top of v5.4.7. This is with 5.5-rc1. I'll try another kernel. My repro is: # modprobe drivetemp # modprobe <any SCSI driver, including ahci>
Guenter, > "scsi-8-140" is created by libsensors, so any change in that would > have to be made there, not in the kernel driver. Yes. Something like the patch below which will produce actual SCSI device instance names: drivetemp-scsi-7:0:29:0 drivetemp-scsi-8:0:30:0 drivetemp-scsi-8:0:15:0 drivetemp-scsi-7:0:24:0 Instead of the current: drivetemp-scsi-7-1d0 drivetemp-scsi-8-1e0 drivetemp-scsi-8-f0 drivetemp-scsi-7-180 Other question: Does hwmon have any notion of sensor topology? As I mentioned earlier, SCSI installations typically rely on SAF-TE or SES instead of the physical drive sensors. SES also includes monitoring of fans, power supplies, etc. And more importantly, it provides a representation of the location of a given component. E.g.: Tray number #4, disk drive bay #5. So it would be helpful if libsensors had a way to represent sensors in a way that mimics the physical device layout reported by SES.
Hi Martin, On Tue, Jan 07, 2020 at 08:29:40PM -0500, Martin K. Petersen wrote: > > Guenter, > > > "scsi-8-140" is created by libsensors, so any change in that would > > have to be made there, not in the kernel driver. > > Yes. Something like the patch below which will produce actual SCSI > device instance names: > > drivetemp-scsi-7:0:29:0 > drivetemp-scsi-8:0:30:0 > drivetemp-scsi-8:0:15:0 > drivetemp-scsi-7:0:24:0 > > Instead of the current: > > drivetemp-scsi-7-1d0 > drivetemp-scsi-8-1e0 > drivetemp-scsi-8-f0 > drivetemp-scsi-7-180 > > Other question: Does hwmon have any notion of sensor topology? As I > mentioned earlier, SCSI installations typically rely on SAF-TE or SES > instead of the physical drive sensors. SES also includes monitoring of > fans, power supplies, etc. And more importantly, it provides a > representation of the location of a given component. E.g.: Tray number > #4, disk drive bay #5. > Please go ahead and submit the patch for libsensors. > So it would be helpful if libsensors had a way to represent sensors in a > way that mimics the physical device layout reported by SES. > I guess it would be up to you to come up with a proposal. Thanks, Guenter
On Tue, Jan 07, 2020 at 08:12:06PM -0500, Martin K. Petersen wrote: > > Guenter, > > > Any idea how I might be able to reproduce this ? So far I have been > > unsuccessful. > > > > Building drivetemp into the kernel, with ahci and everything SCSI > > built as module, doesn't trigger the crash for me. This is with the > > drivetemp patch (v3) as well as commit d188b0675b ("scsi: core: Add > > sysfs attributes for VPD pages 0h and 89h") applied on top of v5.4.7. > > This is with 5.5-rc1. I'll try another kernel. > > My repro is: > > # modprobe drivetemp > # modprobe <any SCSI driver, including ahci> > No luck on my side. Can you provide a traceback ? Maybe we can use it to find out what is happening. Thanks, Guenter
On 1/8/20 7:33 AM, Guenter Roeck wrote: > On Tue, Jan 07, 2020 at 08:12:06PM -0500, Martin K. Petersen wrote: >> >> Guenter, >> >>> Any idea how I might be able to reproduce this ? So far I have been >>> unsuccessful. >>> >>> Building drivetemp into the kernel, with ahci and everything SCSI >>> built as module, doesn't trigger the crash for me. This is with the >>> drivetemp patch (v3) as well as commit d188b0675b ("scsi: core: Add >>> sysfs attributes for VPD pages 0h and 89h") applied on top of v5.4.7. >> >> This is with 5.5-rc1. I'll try another kernel. >> >> My repro is: >> >> # modprobe drivetemp >> # modprobe <any SCSI driver, including ahci> >> > No luck on my side. Can you provide a traceback ? Maybe we can use it > to find out what is happening. > I tried again, this time with v5.5-rc5. Loading and unloading ahci and drivetemp in any order does not cause any problems for me. At this point I don't know what else I could test. I went ahead and applied the drivetemp patch to hwmon-next. Maybe we'll get some additional test feedback this way. Guenter
Am Sa., 11. Jan. 2020 um 21:24 Uhr schrieb Guenter Roeck <linux@roeck-us.net>: > > On 1/8/20 7:33 AM, Guenter Roeck wrote: > > On Tue, Jan 07, 2020 at 08:12:06PM -0500, Martin K. Petersen wrote: > >> > >> Guenter, > >> > >>> Any idea how I might be able to reproduce this ? So far I have been > >>> unsuccessful. > >>> > >>> Building drivetemp into the kernel, with ahci and everything SCSI > >>> built as module, doesn't trigger the crash for me. This is with the > >>> drivetemp patch (v3) as well as commit d188b0675b ("scsi: core: Add > >>> sysfs attributes for VPD pages 0h and 89h") applied on top of v5.4.7. > >> > >> This is with 5.5-rc1. I'll try another kernel. > >> > >> My repro is: > >> > >> # modprobe drivetemp > >> # modprobe <any SCSI driver, including ahci> > >> > > No luck on my side. Can you provide a traceback ? Maybe we can use it > > to find out what is happening. > > > > I tried again, this time with v5.5-rc5. Loading and unloading ahci and > drivetemp in any order does not cause any problems for me. > > At this point I don't know what else I could test. I went ahead and > applied the drivetemp patch to hwmon-next. Maybe we'll get some additional > test feedback this way. I've tested Linus git tree from right now + hwmon-next and I cannot make it crash. The driver seems to work fine here and temperature reportings are very accurate on all HDDs on that box. ( 8 x Seagate IronWolf 2 TB (ST2000VN004) ) What I've noticed however is the nvme temperature low/high values on the Sensors X are strange here. I'm not sure it is a v5.5 issue or a hwmon-next one right now, I didn't boot a vanilla v5.5-rc5 yet. Both nvme's are Samsung SSD 960 EVO 250GB. They look like this: nvme-pci-1300 Adapter: PCI adapter Composite: +27.9°C (low = -273.1°C, high = +76.8°C) (crit = +78.8°C) Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C) Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C) nvme-pci-6100 Adapter: PCI adapter Composite: +23.9°C (low = -273.1°C, high = +76.8°C) (crit = +78.8°C) Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C) Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C) Best Regards, Gabriel C.
On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <nix.or.die@gmail.com> wrote: > What I've noticed however is the nvme temperature low/high values on > the Sensors X are strange here. (...) > Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C) > Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C) (...) > Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C) > Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C) That doesn't look strange to me. It seems like reasonable defaults from the firmware if either it doesn't really log the min/max temperatures or hasn't been through a cycle of updating these yet. Just set both to absolute min/max temperatures possible. Yours, Linus Walleij
Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij <linus.walleij@linaro.org>: > > On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <nix.or.die@gmail.com> wrote: > > > What I've noticed however is the nvme temperature low/high values on > > the Sensors X are strange here. > (...) > > Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C) > > Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C) > (...) > > Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C) > > Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C) > > That doesn't look strange to me. It seems like reasonable defaults > from the firmware if either it doesn't really log the min/max temperatures > or hasn't been through a cycle of updating these yet. Just set both > to absolute min/max temperatures possible. Ok I'll check that. Do you mean by setting the temperatures to use a lmsensors config? Or is there a way to set these with a nvme command?
On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <nix.or.die@gmail.com> wrote: > Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij > <linus.walleij@linaro.org>: > > > > On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <nix.or.die@gmail.com> wrote: > > > > > What I've noticed however is the nvme temperature low/high values on > > > the Sensors X are strange here. > > (...) > > > Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C) > > > Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C) > > (...) > > > Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C) > > > Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C) > > > > That doesn't look strange to me. It seems like reasonable defaults > > from the firmware if either it doesn't really log the min/max temperatures > > or hasn't been through a cycle of updating these yet. Just set both > > to absolute min/max temperatures possible. > > Ok I'll check that. > > Do you mean by setting the temperatures to use a lmsensors config? > Or is there a way to set these with a nvme command? Not that I know of. The min/max are the minumum and maximum temperatures the device has experienced during this power-on cycle. (If I understood things right!) If the device firmware doesn't log that, or the firmware hasn't ran through a log point, it makes sense to report absolute min/max of the scales. Yours, Linus Walleij
On 1/12/20 4:07 AM, Linus Walleij wrote: > On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <nix.or.die@gmail.com> wrote: >> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij >> <linus.walleij@linaro.org>: >>> >>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <nix.or.die@gmail.com> wrote: >>> >>>> What I've noticed however is the nvme temperature low/high values on >>>> the Sensors X are strange here. >>> (...) >>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C) >>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C) >>> (...) >>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C) >>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C) >>> >>> That doesn't look strange to me. It seems like reasonable defaults >>> from the firmware if either it doesn't really log the min/max temperatures >>> or hasn't been through a cycle of updating these yet. Just set both >>> to absolute min/max temperatures possible. >> >> Ok I'll check that. >> >> Do you mean by setting the temperatures to use a lmsensors config? >> Or is there a way to set these with a nvme command? > > Not that I know of. > > The min/max are the minumum and maximum temperatures the > device has experienced during this power-on cycle. > No, that would be lowest/highest. The above are (or should be) per-sensor setpoints. The default for those is typically the absolute minimum / maximum of the supported range. Some SATA drives report the lowest/highest temperatures experienced since power cycle, like here. drivetemp-scsi-5-0 Adapter: SCSI adapter temp1: +23.0°C (low = +0.0°C, high = +60.0°C) (crit low = -41.0°C, crit = +85.0°C) (lowest = +20.0°C, highest = +31.0°C) Guenter
Am So., 12. Jan. 2020 um 14:07 Uhr schrieb Guenter Roeck <linux@roeck-us.net>: > > On 1/12/20 4:07 AM, Linus Walleij wrote: > > On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <nix.or.die@gmail.com> wrote: > >> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij > >> <linus.walleij@linaro.org>: > >>> > >>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <nix.or.die@gmail.com> wrote: > >>> > >>>> What I've noticed however is the nvme temperature low/high values on > >>>> the Sensors X are strange here. > >>> (...) > >>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C) > >>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C) > >>> (...) > >>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C) > >>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C) > >>> > >>> That doesn't look strange to me. It seems like reasonable defaults > >>> from the firmware if either it doesn't really log the min/max temperatures > >>> or hasn't been through a cycle of updating these yet. Just set both > >>> to absolute min/max temperatures possible. > >> > >> Ok I'll check that. > >> > >> Do you mean by setting the temperatures to use a lmsensors config? > >> Or is there a way to set these with a nvme command? > > > > Not that I know of. > > > > The min/max are the minumum and maximum temperatures the > > device has experienced during this power-on cycle. > > > > No, that would be lowest/highest. The above are (or should be) per-sensor > setpoints. The default for those is typically the absolute minimum / > maximum of the supported range. > > Some SATA drives report the lowest/highest temperatures experienced > since power cycle, like here. > > drivetemp-scsi-5-0 > Adapter: SCSI adapter > temp1: +23.0°C (low = +0.0°C, high = +60.0°C) > (crit low = -41.0°C, crit = +85.0°C) > (lowest = +20.0°C, highest = +31.0°C) > The SATA temperatures are fine and reported like this here too, just the nvme ones are strange. drivetemp-scsi-4-0 Adapter: SCSI adapter temp1: +28.0°C (low = +1.0°C, high = +61.0°C) (crit low = +2.0°C, crit = +60.0°C) (lowest = +16.0°C, highest = +31.0°C) drivetemp-scsi-12-0 Adapter: SCSI adapter temp1: +29.0°C (low = +1.0°C, high = +61.0°C) (crit low = +2.0°C, crit = +60.0°C) (lowest = +18.0°C, highest = +32.0°C) and so on. Btw, where I can find the code does these calculations? Best regards, Gabriel C.
On 1/12/20 5:45 AM, Gabriel C wrote: > Am So., 12. Jan. 2020 um 14:07 Uhr schrieb Guenter Roeck <linux@roeck-us.net>: >> >> On 1/12/20 4:07 AM, Linus Walleij wrote: >>> On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <nix.or.die@gmail.com> wrote: >>>> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij >>>> <linus.walleij@linaro.org>: >>>>> >>>>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <nix.or.die@gmail.com> wrote: >>>>> >>>>>> What I've noticed however is the nvme temperature low/high values on >>>>>> the Sensors X are strange here. >>>>> (...) >>>>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C) >>>>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C) >>>>> (...) >>>>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C) >>>>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C) >>>>> >>>>> That doesn't look strange to me. It seems like reasonable defaults >>>>> from the firmware if either it doesn't really log the min/max temperatures >>>>> or hasn't been through a cycle of updating these yet. Just set both >>>>> to absolute min/max temperatures possible. >>>> >>>> Ok I'll check that. >>>> >>>> Do you mean by setting the temperatures to use a lmsensors config? >>>> Or is there a way to set these with a nvme command? >>> >>> Not that I know of. >>> >>> The min/max are the minumum and maximum temperatures the >>> device has experienced during this power-on cycle. >>> >> >> No, that would be lowest/highest. The above are (or should be) per-sensor >> setpoints. The default for those is typically the absolute minimum / >> maximum of the supported range. >> >> Some SATA drives report the lowest/highest temperatures experienced >> since power cycle, like here. >> >> drivetemp-scsi-5-0 >> Adapter: SCSI adapter >> temp1: +23.0°C (low = +0.0°C, high = +60.0°C) >> (crit low = -41.0°C, crit = +85.0°C) >> (lowest = +20.0°C, highest = +31.0°C) >> > > The SATA temperatures are fine and reported like this here too, just > the nvme ones are strange. > > drivetemp-scsi-4-0 > Adapter: SCSI adapter > temp1: +28.0°C (low = +1.0°C, high = +61.0°C) > (crit low = +2.0°C, crit = +60.0°C) > (lowest = +16.0°C, highest = +31.0°C) > > drivetemp-scsi-12-0 > Adapter: SCSI adapter > temp1: +29.0°C (low = +1.0°C, high = +61.0°C) > (crit low = +2.0°C, crit = +60.0°C) > (lowest = +18.0°C, highest = +32.0°C) > > and so on. > > Btw, where I can find the code does these calculations? > Not sure if that is what you are looking for, but the nvme hardware monitoring driver is at drivers/nvme/host/hwmon.c, the SATA hardware monitoring driver is at drivers/hwmon/drivetemp.c. The limits on nvme drives are configurable. root@server:/sys/class/hwmon# sensors nvme-pci-0100 nvme-pci-0100 Adapter: PCI adapter Composite: +40.9°C (low = -273.1°C, high = +84.8°C) (crit = +84.8°C) Sensor 1: +40.9°C (low = -273.1°C, high = +65261.8°C) Sensor 2: +43.9°C (low = -273.1°C, high = +65261.8°C) root@server:/sys/class/hwmon# echo 0 > hwmon1/temp2_min root@server:/sys/class/hwmon# echo 100000 > hwmon1/temp2_max root@server:/sys/class/hwmon# sensors nvme-pci-0100 nvme-pci-0100 Adapter: PCI adapter Composite: +38.9°C (low = -273.1°C, high = +84.8°C) (crit = +84.8°C) Sensor 1: +38.9°C (low = -0.1°C, high = +99.8°C) Sensor 2: +42.9°C (low = -273.1°C, high = +65261.8°C) If you dislike the defaults, just configure whatever you think is appropriate for your system. Thanks, Guenter
Am So., 12. Jan. 2020 um 16:26 Uhr schrieb Guenter Roeck <linux@roeck-us.net>: > > On 1/12/20 5:45 AM, Gabriel C wrote: > > Am So., 12. Jan. 2020 um 14:07 Uhr schrieb Guenter Roeck <linux@roeck-us.net>: > >> > >> On 1/12/20 4:07 AM, Linus Walleij wrote: > >>> On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <nix.or.die@gmail.com> wrote: > >>>> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij > >>>> <linus.walleij@linaro.org>: > >>>>> > >>>>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <nix.or.die@gmail.com> wrote: > >>>>> > >>>>>> What I've noticed however is the nvme temperature low/high values on > >>>>>> the Sensors X are strange here. > >>>>> (...) > >>>>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C) > >>>>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C) > >>>>> (...) > >>>>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C) > >>>>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C) > >>>>> > >>>>> That doesn't look strange to me. It seems like reasonable defaults > >>>>> from the firmware if either it doesn't really log the min/max temperatures > >>>>> or hasn't been through a cycle of updating these yet. Just set both > >>>>> to absolute min/max temperatures possible. > >>>> > >>>> Ok I'll check that. > >>>> > >>>> Do you mean by setting the temperatures to use a lmsensors config? > >>>> Or is there a way to set these with a nvme command? > >>> > >>> Not that I know of. > >>> > >>> The min/max are the minumum and maximum temperatures the > >>> device has experienced during this power-on cycle. > >>> > >> > >> No, that would be lowest/highest. The above are (or should be) per-sensor > >> setpoints. The default for those is typically the absolute minimum / > >> maximum of the supported range. > >> > >> Some SATA drives report the lowest/highest temperatures experienced > >> since power cycle, like here. > >> > >> drivetemp-scsi-5-0 > >> Adapter: SCSI adapter > >> temp1: +23.0°C (low = +0.0°C, high = +60.0°C) > >> (crit low = -41.0°C, crit = +85.0°C) > >> (lowest = +20.0°C, highest = +31.0°C) > >> > > > > The SATA temperatures are fine and reported like this here too, just > > the nvme ones are strange. > > > > drivetemp-scsi-4-0 > > Adapter: SCSI adapter > > temp1: +28.0°C (low = +1.0°C, high = +61.0°C) > > (crit low = +2.0°C, crit = +60.0°C) > > (lowest = +16.0°C, highest = +31.0°C) > > > > drivetemp-scsi-12-0 > > Adapter: SCSI adapter > > temp1: +29.0°C (low = +1.0°C, high = +61.0°C) > > (crit low = +2.0°C, crit = +60.0°C) > > (lowest = +18.0°C, highest = +32.0°C) > > > > and so on. > > > > Btw, where I can find the code does these calculations? > > > > Not sure if that is what you are looking for, but the nvme hardware > monitoring driver is at drivers/nvme/host/hwmon.c, the SATA hardware > monitoring driver is at drivers/hwmon/drivetemp.c. > I have a look thanks. I'm using your v2 patch for the nvme part since you posted it on 5.4 kernels. This is probably why I find the way the temperatures are now reported very strange. The ADATA XPG SX8200 Pro in my laptop seems to work better: nvme-pci-0200 Adapter: PCI adapter Composite: +37.9°C (low = -0.1°C, high = +74.8°C) (crit = +79.8°C) Low is 0° which is what the spec suggests. > The limits on nvme drives are configurable. Yes, I found this out already. > root@server:/sys/class/hwmon# sensors nvme-pci-0100 > nvme-pci-0100 > Adapter: PCI adapter > Composite: +40.9°C (low = -273.1°C, high = +84.8°C) > (crit = +84.8°C) > Sensor 1: +40.9°C (low = -273.1°C, high = +65261.8°C) > Sensor 2: +43.9°C (low = -273.1°C, high = +65261.8°C) > > root@server:/sys/class/hwmon# echo 0 > hwmon1/temp2_min > root@server:/sys/class/hwmon# echo 100000 > hwmon1/temp2_max An lm-sensors configuration will work too. > root@server:/sys/class/hwmon# sensors nvme-pci-0100 > nvme-pci-0100 > Adapter: PCI adapter > Composite: +38.9°C (low = -273.1°C, high = +84.8°C) > (crit = +84.8°C) > Sensor 1: +38.9°C (low = -0.1°C, high = +99.8°C) > Sensor 2: +42.9°C (low = -273.1°C, high = +65261.8°C) > > If you dislike the defaults, just configure whatever you think is > appropriate for your system. It's not about disliking the values. I want to find out if these Samsung models don't support that, or it is a bug somewhere in writing/calculating the values. In the case, Samsung and others don't support such a thing wouldn't be better to just ignore the bogus reading altogether?
On 1/12/20 10:37 AM, Gabriel C wrote: > Am So., 12. Jan. 2020 um 16:26 Uhr schrieb Guenter Roeck <linux@roeck-us.net>: >> >> On 1/12/20 5:45 AM, Gabriel C wrote: >>> Am So., 12. Jan. 2020 um 14:07 Uhr schrieb Guenter Roeck <linux@roeck-us.net>: >>>> >>>> On 1/12/20 4:07 AM, Linus Walleij wrote: >>>>> On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <nix.or.die@gmail.com> wrote: >>>>>> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij >>>>>> <linus.walleij@linaro.org>: >>>>>>> >>>>>>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <nix.or.die@gmail.com> wrote: >>>>>>> >>>>>>>> What I've noticed however is the nvme temperature low/high values on >>>>>>>> the Sensors X are strange here. >>>>>>> (...) >>>>>>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C) >>>>>>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C) >>>>>>> (...) >>>>>>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C) >>>>>>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C) >>>>>>> >>>>>>> That doesn't look strange to me. It seems like reasonable defaults >>>>>>> from the firmware if either it doesn't really log the min/max temperatures >>>>>>> or hasn't been through a cycle of updating these yet. Just set both >>>>>>> to absolute min/max temperatures possible. >>>>>> >>>>>> Ok I'll check that. >>>>>> >>>>>> Do you mean by setting the temperatures to use a lmsensors config? >>>>>> Or is there a way to set these with a nvme command? >>>>> >>>>> Not that I know of. >>>>> >>>>> The min/max are the minumum and maximum temperatures the >>>>> device has experienced during this power-on cycle. >>>>> >>>> >>>> No, that would be lowest/highest. The above are (or should be) per-sensor >>>> setpoints. The default for those is typically the absolute minimum / >>>> maximum of the supported range. >>>> >>>> Some SATA drives report the lowest/highest temperatures experienced >>>> since power cycle, like here. >>>> >>>> drivetemp-scsi-5-0 >>>> Adapter: SCSI adapter >>>> temp1: +23.0°C (low = +0.0°C, high = +60.0°C) >>>> (crit low = -41.0°C, crit = +85.0°C) >>>> (lowest = +20.0°C, highest = +31.0°C) >>>> >>> >>> The SATA temperatures are fine and reported like this here too, just >>> the nvme ones are strange. >>> >>> drivetemp-scsi-4-0 >>> Adapter: SCSI adapter >>> temp1: +28.0°C (low = +1.0°C, high = +61.0°C) >>> (crit low = +2.0°C, crit = +60.0°C) >>> (lowest = +16.0°C, highest = +31.0°C) >>> >>> drivetemp-scsi-12-0 >>> Adapter: SCSI adapter >>> temp1: +29.0°C (low = +1.0°C, high = +61.0°C) >>> (crit low = +2.0°C, crit = +60.0°C) >>> (lowest = +18.0°C, highest = +32.0°C) >>> >>> and so on. >>> >>> Btw, where I can find the code does these calculations? >>> >> >> Not sure if that is what you are looking for, but the nvme hardware >> monitoring driver is at drivers/nvme/host/hwmon.c, the SATA hardware >> monitoring driver is at drivers/hwmon/drivetemp.c. >> > > I have a look thanks. > > I'm using your v2 patch for the nvme part since you posted it on 5.4 kernels. > This is probably why I find the way the temperatures are now reported > very strange. > > The ADATA XPG SX8200 Pro in my laptop seems to work better: > > nvme-pci-0200 > Adapter: PCI adapter > Composite: +37.9°C (low = -0.1°C, high = +74.8°C) > (crit = +79.8°C) > > Low is 0° which is what the spec suggests. > >> The limits on nvme drives are configurable. > > Yes, I found this out already. > >> root@server:/sys/class/hwmon# sensors nvme-pci-0100 >> nvme-pci-0100 >> Adapter: PCI adapter >> Composite: +40.9°C (low = -273.1°C, high = +84.8°C) >> (crit = +84.8°C) >> Sensor 1: +40.9°C (low = -273.1°C, high = +65261.8°C) >> Sensor 2: +43.9°C (low = -273.1°C, high = +65261.8°C) >> >> root@server:/sys/class/hwmon# echo 0 > hwmon1/temp2_min >> root@server:/sys/class/hwmon# echo 100000 > hwmon1/temp2_max > > An lm-sensors configuration will work too. > Sure, the above was just an example. >> root@server:/sys/class/hwmon# sensors nvme-pci-0100 >> nvme-pci-0100 >> Adapter: PCI adapter >> Composite: +38.9°C (low = -273.1°C, high = +84.8°C) >> (crit = +84.8°C) >> Sensor 1: +38.9°C (low = -0.1°C, high = +99.8°C) >> Sensor 2: +42.9°C (low = -273.1°C, high = +65261.8°C) >> >> If you dislike the defaults, just configure whatever you think is >> appropriate for your system. > > It's not about disliking the values. I want to find out if these Samsung models > don't support that, or it is a bug somewhere in writing/calculating the values. > No, this is not a bug. It is perfectly valid for individual sensors to have uninitialized limits. If I recall correctly, the NVME specification specifically states that the default settings for individual sensors shall be those values (0 and 65535 Kelvin, specifically). And, yes, I would agree that is a bit odd that NVME drives report temperatures in Kelvin, but such is the world. > In the case, Samsung and others don't support such a thing wouldn't be > better to just ignore > the bogus reading altogether? Again, you can set whatever limits you like. The default limits on many hardware sensor chips have odd values. Just looking at my system: nct6797-isa-0a20 Adapter: ISA adapter in0: +0.48 V (min = +0.00 V, max = +1.74 V) in1: +1.02 V (min = +0.00 V, max = +0.00 V) ALARM in2: +3.39 V (min = +0.00 V, max = +0.00 V) ALARM in3: +3.31 V (min = +0.00 V, max = +0.00 V) ALARM in4: +1.00 V (min = +0.00 V, max = +0.00 V) ALARM in5: +0.14 V (min = +0.00 V, max = +0.00 V) ALARM in6: +0.82 V (min = +0.00 V, max = +0.00 V) ALARM in7: +3.38 V (min = +0.00 V, max = +0.00 V) ALARM in8: +3.26 V (min = +0.00 V, max = +0.00 V) ALARM in9: +1.82 V (min = +0.00 V, max = +0.00 V) ALARM in10: +0.00 V (min = +0.00 V, max = +0.00 V) in11: +0.74 V (min = +0.00 V, max = +0.00 V) ALARM in12: +1.20 V (min = +0.00 V, max = +0.00 V) ALARM in13: +0.68 V (min = +0.00 V, max = +0.00 V) ALARM in14: +1.50 V (min = +0.00 V, max = +0.00 V) ALARM Are you suggesting that we should not support setting min/max values for all drivers just because they are often not initialized to reasonable values by default ? Guenter
Am So., 12. Jan. 2020 um 21:08 Uhr schrieb Guenter Roeck <linux@roeck-us.net>: > > On 1/12/20 10:37 AM, Gabriel C wrote: > > Am So., 12. Jan. 2020 um 16:26 Uhr schrieb Guenter Roeck <linux@roeck-us.net>: > >> > >> On 1/12/20 5:45 AM, Gabriel C wrote: > >>> Am So., 12. Jan. 2020 um 14:07 Uhr schrieb Guenter Roeck <linux@roeck-us.net>: > >>>> > >>>> On 1/12/20 4:07 AM, Linus Walleij wrote: > >>>>> On Sun, Jan 12, 2020 at 1:03 PM Gabriel C <nix.or.die@gmail.com> wrote: > >>>>>> Am So., 12. Jan. 2020 um 12:22 Uhr schrieb Linus Walleij > >>>>>> <linus.walleij@linaro.org>: > >>>>>>> > >>>>>>> On Sun, Jan 12, 2020 at 12:18 PM Gabriel C <nix.or.die@gmail.com> wrote: > >>>>>>> > >>>>>>>> What I've noticed however is the nvme temperature low/high values on > >>>>>>>> the Sensors X are strange here. > >>>>>>> (...) > >>>>>>>> Sensor 1: +27.9°C (low = -273.1°C, high = +65261.8°C) > >>>>>>>> Sensor 2: +29.9°C (low = -273.1°C, high = +65261.8°C) > >>>>>>> (...) > >>>>>>>> Sensor 1: +23.9°C (low = -273.1°C, high = +65261.8°C) > >>>>>>>> Sensor 2: +25.9°C (low = -273.1°C, high = +65261.8°C) > >>>>>>> > >>>>>>> That doesn't look strange to me. It seems like reasonable defaults > >>>>>>> from the firmware if either it doesn't really log the min/max temperatures > >>>>>>> or hasn't been through a cycle of updating these yet. Just set both > >>>>>>> to absolute min/max temperatures possible. > >>>>>> > >>>>>> Ok I'll check that. > >>>>>> > >>>>>> Do you mean by setting the temperatures to use a lmsensors config? > >>>>>> Or is there a way to set these with a nvme command? > >>>>> > >>>>> Not that I know of. > >>>>> > >>>>> The min/max are the minumum and maximum temperatures the > >>>>> device has experienced during this power-on cycle. > >>>>> > >>>> > >>>> No, that would be lowest/highest. The above are (or should be) per-sensor > >>>> setpoints. The default for those is typically the absolute minimum / > >>>> maximum of the supported range. > >>>> > >>>> Some SATA drives report the lowest/highest temperatures experienced > >>>> since power cycle, like here. > >>>> > >>>> drivetemp-scsi-5-0 > >>>> Adapter: SCSI adapter > >>>> temp1: +23.0°C (low = +0.0°C, high = +60.0°C) > >>>> (crit low = -41.0°C, crit = +85.0°C) > >>>> (lowest = +20.0°C, highest = +31.0°C) > >>>> > >>> > >>> The SATA temperatures are fine and reported like this here too, just > >>> the nvme ones are strange. > >>> > >>> drivetemp-scsi-4-0 > >>> Adapter: SCSI adapter > >>> temp1: +28.0°C (low = +1.0°C, high = +61.0°C) > >>> (crit low = +2.0°C, crit = +60.0°C) > >>> (lowest = +16.0°C, highest = +31.0°C) > >>> > >>> drivetemp-scsi-12-0 > >>> Adapter: SCSI adapter > >>> temp1: +29.0°C (low = +1.0°C, high = +61.0°C) > >>> (crit low = +2.0°C, crit = +60.0°C) > >>> (lowest = +18.0°C, highest = +32.0°C) > >>> > >>> and so on. > >>> > >>> Btw, where I can find the code does these calculations? > >>> > >> > >> Not sure if that is what you are looking for, but the nvme hardware > >> monitoring driver is at drivers/nvme/host/hwmon.c, the SATA hardware > >> monitoring driver is at drivers/hwmon/drivetemp.c. > >> > > > > I have a look thanks. > > > > I'm using your v2 patch for the nvme part since you posted it on 5.4 kernels. > > This is probably why I find the way the temperatures are now reported > > very strange. > > > > The ADATA XPG SX8200 Pro in my laptop seems to work better: > > > > nvme-pci-0200 > > Adapter: PCI adapter > > Composite: +37.9°C (low = -0.1°C, high = +74.8°C) > > (crit = +79.8°C) > > > > Low is 0° which is what the spec suggests. > > > >> The limits on nvme drives are configurable. > > > > Yes, I found this out already. > > > >> root@server:/sys/class/hwmon# sensors nvme-pci-0100 > >> nvme-pci-0100 > >> Adapter: PCI adapter > >> Composite: +40.9°C (low = -273.1°C, high = +84.8°C) > >> (crit = +84.8°C) > >> Sensor 1: +40.9°C (low = -273.1°C, high = +65261.8°C) > >> Sensor 2: +43.9°C (low = -273.1°C, high = +65261.8°C) > >> > >> root@server:/sys/class/hwmon# echo 0 > hwmon1/temp2_min > >> root@server:/sys/class/hwmon# echo 100000 > hwmon1/temp2_max > > > > An lm-sensors configuration will work too. > > > Sure, the above was just an example. > > >> root@server:/sys/class/hwmon# sensors nvme-pci-0100 > >> nvme-pci-0100 > >> Adapter: PCI adapter > >> Composite: +38.9°C (low = -273.1°C, high = +84.8°C) > >> (crit = +84.8°C) > >> Sensor 1: +38.9°C (low = -0.1°C, high = +99.8°C) > >> Sensor 2: +42.9°C (low = -273.1°C, high = +65261.8°C) > >> > >> If you dislike the defaults, just configure whatever you think is > >> appropriate for your system. > > > > It's not about disliking the values. I want to find out if these Samsung models > > don't support that, or it is a bug somewhere in writing/calculating the values. > > > No, this is not a bug. It is perfectly valid for individual sensors to have > uninitialized limits. If I recall correctly, the NVME specification > specifically states that the default settings for individual sensors > shall be those values (0 and 65535 Kelvin, specifically). > > And, yes, I would agree that is a bit odd that NVME drives report temperatures > in Kelvin, but such is the world. > > > In the case, Samsung and others don't support such a thing wouldn't be > > better to just ignore > > the bogus reading altogether? > > Again, you can set whatever limits you like. The default limits on many > hardware sensor chips have odd values. Just looking at my system: > > nct6797-isa-0a20 > Adapter: ISA adapter > in0: +0.48 V (min = +0.00 V, max = +1.74 V) > in1: +1.02 V (min = +0.00 V, max = +0.00 V) ALARM > in2: +3.39 V (min = +0.00 V, max = +0.00 V) ALARM > in3: +3.31 V (min = +0.00 V, max = +0.00 V) ALARM > in4: +1.00 V (min = +0.00 V, max = +0.00 V) ALARM > in5: +0.14 V (min = +0.00 V, max = +0.00 V) ALARM > in6: +0.82 V (min = +0.00 V, max = +0.00 V) ALARM > in7: +3.38 V (min = +0.00 V, max = +0.00 V) ALARM > in8: +3.26 V (min = +0.00 V, max = +0.00 V) ALARM > in9: +1.82 V (min = +0.00 V, max = +0.00 V) ALARM > in10: +0.00 V (min = +0.00 V, max = +0.00 V) > in11: +0.74 V (min = +0.00 V, max = +0.00 V) ALARM > in12: +1.20 V (min = +0.00 V, max = +0.00 V) ALARM > in13: +0.68 V (min = +0.00 V, max = +0.00 V) ALARM > in14: +1.50 V (min = +0.00 V, max = +0.00 V) ALARM > > > Are you suggesting that we should not support setting min/max values for > all drivers just because they are often not initialized to reasonable values > by default ? No, I'm not suggesting that. I'm aware of strange I/O monitoring chips values and the lack of documentation, so in this case, something is better than nothing. In the nvme case, these are only 2 values who either are working/supported by firmware or not, so I thought it would be reasonable to have known-good values instead of 65261.8°C, which will probably cause users to report that as a bug a lot. Can we at least have that documented and explain how the values can be set/changed?
Hi Guenter! > I tried again, this time with v5.5-rc5. Loading and unloading ahci and > drivetemp in any order does not cause any problems for me. I tried your hwmon-next branch and it still happens for me. Both in qemu and on real hw. I'm really low on bandwidth the next couple of days. Will try to look later this week unless you beat me to it. I get lots of these warnings after modprobe drivetemp; modprobe ahci: [ 1055.611922] WARNING: CPU: 3 PID: 3233 at drivers/base/dd.c:519 really_probe+0x436/0x4f0 A quick test forcing synchronous SCSI scanning made no difference.
Hi Martin, On 1/13/20 7:03 PM, Martin K. Petersen wrote: > > Hi Guenter! > >> I tried again, this time with v5.5-rc5. Loading and unloading ahci and >> drivetemp in any order does not cause any problems for me. > > I tried your hwmon-next branch and it still happens for me. Both in qemu > and on real hw. I'm really low on bandwidth the next couple of days. > Will try to look later this week unless you beat me to it. I get lots of > these warnings after modprobe drivetemp; modprobe ahci: > > [ 1055.611922] WARNING: CPU: 3 PID: 3233 at drivers/base/dd.c:519 really_probe+0x436/0x4f0 > > A quick test forcing synchronous SCSI scanning made no difference. > The hwmon-next branch is based on v5.5-rc1. It might be better to either merge hwmon-next into mainline, or to apply the drivetemp patch to mainline, and test the result. I have seen some (unrelated) weird tracebacks in the driver core with v5.5-rc1, so that may not be the best baseline for a test. Thanks, Guenter
Guenter, > The hwmon-next branch is based on v5.5-rc1. It might be better to > either merge hwmon-next into mainline, or to apply the drivetemp patch > to mainline, and test the result. I have seen some (unrelated) weird > tracebacks in the driver core with v5.5-rc1, so that may not be the > best baseline for a test. I'm afraid the warnings still happen with hwmon-next on top of linus/master.
Hi Martin, On 1/15/20 8:12 PM, Martin K. Petersen wrote: > > Guenter, > >> The hwmon-next branch is based on v5.5-rc1. It might be better to >> either merge hwmon-next into mainline, or to apply the drivetemp patch >> to mainline, and test the result. I have seen some (unrelated) weird >> tracebacks in the driver core with v5.5-rc1, so that may not be the >> best baseline for a test. > > I'm afraid the warnings still happen with hwmon-next on top of > linus/master. > Can you possibly provide details, like the configuration you use for qemu, the qemu command line, and the exact command sequence you use in qemu to reproduce the problem ? Thanks, Guenter
On Wed, Jan 15, 2020 at 11:12:23PM -0500, Martin K. Petersen wrote: > > Guenter, > > > The hwmon-next branch is based on v5.5-rc1. It might be better to > > either merge hwmon-next into mainline, or to apply the drivetemp patch > > to mainline, and test the result. I have seen some (unrelated) weird > > tracebacks in the driver core with v5.5-rc1, so that may not be the > > best baseline for a test. > > I'm afraid the warnings still happen with hwmon-next on top of > linus/master. > Can you by any chance provide a full traceback ? The warning you reported suggests that a devm_ function was called on a device pointer prior to a device registration. I don't immediately see how that happens. A full traceback might give us an idea. I suspect that the underlying problem is in hwmon_device_register_with_info(), which uses devm_ functions to allocate memory associated with the device pointer passed to it, in this case the SCSI device. This is inherently wrong (independent of this driver), since the lifetime of the hardware monitoring device does not necessarily match the lifetime of its parent. The impact here is that we may get a memory leak under some circumstances. I'll have to fix that in the hardware monitoring core. Either case, I would like to track down how the warning happens, so any information you can provide that lets me reproduce the problem would be very helpful. Thanks, Guenter
Guenter, > Can you by any chance provide a full traceback ? My test machines are tied up with something else right now. This is from a few days ago (pristine hwmon-next, I believe): [ 1055.611912] ------------[ cut here ]------------ [ 1055.611922] WARNING: CPU: 3 PID: 3233 at drivers/base/dd.c:519 really_probe+0x436/0x4f0 [ 1055.611925] Modules linked in: sd_mod sg ahci libahci libata drivetemp scsi_mod crc32c_intel igb i2c_algo_bit i2c_core dca hwmon ipv6 nf_defrag_ipv6 crc_ccitt [ 1055.611955] CPU: 3 PID: 3233 Comm: kworker/u17:1 Tainted: G W 5.5.0-rc1+ #21 [ 1055.611965] Workqueue: events_unbound async_run_entry_fn [ 1055.611973] RIP: 0010:really_probe+0x436/0x4f0 [ 1055.611979] Code: c7 30 69 f8 82 e8 ba 94 e5 ff e9 60 ff ff ff 48 8d 7b 38 e8 cc d9 b4 ff 48 8b 43 38 48 85 c0 0f 85 41 fd ff ff e9 4f fd ff ff <0f> 0b e9 66 fc ff ff 48 8d 7d 50 e8 aa d9 b4 ff 4c 8b 6d 50 4d 85 [ 1055.611983] RSP: 0018:ffff8881edb77c98 EFLAGS: 00010287 [ 1055.611989] RAX: ffff8881e1f8fb80 RBX: ffffffffa033a000 RCX: ffffffff8182e583 [ 1055.611993] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8881dec506a8 [ 1055.611997] RBP: ffff8881dec50238 R08: 0000000000000001 R09: fffffbfff09629ed [ 1055.612000] R10: fffffbfff09629ec R11: 0000000000000003 R12: 0000000000000000 [ 1055.612004] R13: ffff8881dec506a8 R14: ffffffff8182eca0 R15: 000000000000000b [ 1055.612009] FS: 0000000000000000(0000) GS:ffff8881f8900000(0000) knlGS:0000000000000000 [ 1055.612013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1055.612017] CR2: 00007f957884a000 CR3: 00000001df5ec003 CR4: 00000000000606e0 [ 1055.612020] Call Trace: [ 1055.612038] ? driver_probe_device+0x170/0x170 [ 1055.612045] driver_probe_device+0x82/0x170 [ 1055.612058] ? driver_probe_device+0x170/0x170 [ 1055.612064] __driver_attach_async_helper+0xa3/0xe0 [ 1055.612076] async_run_entry_fn+0x68/0x2a0 [ 1055.612094] process_one_work+0x4df/0x990 [ 1055.612121] ? pwq_dec_nr_in_flight+0x110/0x110 [ 1055.612127] ? do_raw_spin_lock+0x113/0x1d0 [ 1055.612161] worker_thread+0x78/0x5c0 [ 1055.612190] ? process_one_work+0x990/0x990 [ 1055.612195] kthread+0x1be/0x1e0 [ 1055.612202] ? kthread_create_worker_on_cpu+0xd0/0xd0 [ 1055.612215] ret_from_fork+0x3a/0x50 [ 1055.612251] irq event stamp: 3512 [ 1055.612259] hardirqs last enabled at (3511): [<ffffffff81d2b874>] _raw_spin_unlock_irq+0x24/0x30 [ 1055.612265] hardirqs last disabled at (3512): [<ffffffff810029c9>] trace_hardirqs_off_thunk+0x1a/0x1c [ 1055.612272] softirqs last enabled at (3500): [<ffffffff820003a5>] __do_softirq+0x3a5/0x5a8 [ 1055.612281] softirqs last disabled at (3489): [<ffffffff810cec7b>] irq_exit+0xfb/0x100 [ 1055.612284] ---[ end trace f0a8dd9a37bea031 ]--- > Either case, I would like to track down how the warning happens, so any > information you can provide that lets me reproduce the problem would be > very helpful. The three systems that exhibit the problem are stock (2010/2012/2014 vintage) x86_64 servers with onboard AHCI and a variety of 4-6 SATA drives each. For the qemu test I didn't have ahci configured but I had my SCSI temp patch on top of yours and ran modprobe drivetemp; modprobe scsi_debug to trigger the warnings.
On 1/16/20 5:43 PM, Martin K. Petersen wrote: > > Guenter, > >> Can you by any chance provide a full traceback ? > > My test machines are tied up with something else right now. This is from > a few days ago (pristine hwmon-next, I believe): > > [ 1055.611912] ------------[ cut here ]------------ > [ 1055.611922] WARNING: CPU: 3 PID: 3233 at drivers/base/dd.c:519 really_probe+0x436/0x4f0 > [ 1055.611925] Modules linked in: sd_mod sg ahci libahci libata drivetemp scsi_mod crc32c_intel igb i2c_algo_bit i2c_core dca hwmon ipv6 nf_defrag_ipv6 crc_ccitt > [ 1055.611955] CPU: 3 PID: 3233 Comm: kworker/u17:1 Tainted: G W 5.5.0-rc1+ #21 > [ 1055.611965] Workqueue: events_unbound async_run_entry_fn > [ 1055.611973] RIP: 0010:really_probe+0x436/0x4f0 > [ 1055.611979] Code: c7 30 69 f8 82 e8 ba 94 e5 ff e9 60 ff ff ff 48 8d 7b 38 e8 cc d9 b4 ff 48 8b 43 38 48 85 c0 0f 85 41 fd ff ff e9 4f fd ff ff <0f> 0b e9 66 fc ff ff 48 8d 7d 50 e8 aa d9 b4 ff 4c 8b 6d 50 4d 85 > [ 1055.611983] RSP: 0018:ffff8881edb77c98 EFLAGS: 00010287 > [ 1055.611989] RAX: ffff8881e1f8fb80 RBX: ffffffffa033a000 RCX: ffffffff8182e583 > [ 1055.611993] RDX: dffffc0000000000 RSI: 0000000000000004 RDI: ffff8881dec506a8 > [ 1055.611997] RBP: ffff8881dec50238 R08: 0000000000000001 R09: fffffbfff09629ed > [ 1055.612000] R10: fffffbfff09629ec R11: 0000000000000003 R12: 0000000000000000 > [ 1055.612004] R13: ffff8881dec506a8 R14: ffffffff8182eca0 R15: 000000000000000b > [ 1055.612009] FS: 0000000000000000(0000) GS:ffff8881f8900000(0000) knlGS:0000000000000000 > [ 1055.612013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 1055.612017] CR2: 00007f957884a000 CR3: 00000001df5ec003 CR4: 00000000000606e0 > [ 1055.612020] Call Trace: > [ 1055.612038] ? driver_probe_device+0x170/0x170 > [ 1055.612045] driver_probe_device+0x82/0x170 > [ 1055.612058] ? driver_probe_device+0x170/0x170 > [ 1055.612064] __driver_attach_async_helper+0xa3/0xe0 > [ 1055.612076] async_run_entry_fn+0x68/0x2a0 > [ 1055.612094] process_one_work+0x4df/0x990 > [ 1055.612121] ? pwq_dec_nr_in_flight+0x110/0x110 > [ 1055.612127] ? do_raw_spin_lock+0x113/0x1d0 > [ 1055.612161] worker_thread+0x78/0x5c0 > [ 1055.612190] ? process_one_work+0x990/0x990 > [ 1055.612195] kthread+0x1be/0x1e0 > [ 1055.612202] ? kthread_create_worker_on_cpu+0xd0/0xd0 > [ 1055.612215] ret_from_fork+0x3a/0x50 > [ 1055.612251] irq event stamp: 3512 > [ 1055.612259] hardirqs last enabled at (3511): [<ffffffff81d2b874>] _raw_spin_unlock_irq+0x24/0x30 > [ 1055.612265] hardirqs last disabled at (3512): [<ffffffff810029c9>] trace_hardirqs_off_thunk+0x1a/0x1c > [ 1055.612272] softirqs last enabled at (3500): [<ffffffff820003a5>] __do_softirq+0x3a5/0x5a8 > [ 1055.612281] softirqs last disabled at (3489): [<ffffffff810cec7b>] irq_exit+0xfb/0x100 > [ 1055.612284] ---[ end trace f0a8dd9a37bea031 ]--- > >> Either case, I would like to track down how the warning happens, so any >> information you can provide that lets me reproduce the problem would be >> very helpful. > > The three systems that exhibit the problem are stock (2010/2012/2014 > vintage) x86_64 servers with onboard AHCI and a variety of 4-6 SATA > drives each. > > For the qemu test I didn't have ahci configured but I had my SCSI temp > patch on top of yours and ran modprobe drivetemp; modprobe scsi_debug to > trigger the warnings. > Interesting. Looks like your system performs asynchronous probing. No idea how that can result in that kind of problem, but who knows. Can you send me the qemu command line ? Thanks, Guenter
diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst index 230ad59b462b..ecf1832dd013 100644 --- a/Documentation/hwmon/index.rst +++ b/Documentation/hwmon/index.rst @@ -133,6 +133,7 @@ Hardware Monitoring Kernel Drivers pxe1610 pwm-fan raspberrypi-hwmon + satatemp sch5627 sch5636 scpi-hwmon diff --git a/Documentation/hwmon/satatemp.rst b/Documentation/hwmon/satatemp.rst new file mode 100644 index 000000000000..59b105f3c79a --- /dev/null +++ b/Documentation/hwmon/satatemp.rst @@ -0,0 +1,48 @@ +Kernel driver satatemp +====================== + + +References +---------- + +ANS T13/1699-D +Information technology - AT Attachment 8 - ATA/ATAPI Command Set (ATA8-ACS) + +ANS Project T10/BSR INCITS 513 +Information technology - SCSI Primary Commands - 4 (SPC-4) + +ANS Project INCITS 557 +Information technology - SCSI / ATA Translation - 5 (SAT-5) + + +Description +----------- + +This driver supports reporting the temperature of SATA drives. +If supported, it uses the SCT Command Transport feature to read +the current drive temperature and, if available, temperature limits +as well as historic minimum and maximum temperatures. If SCT Command +Transport is not supported, the driver uses SMART attributes to read +the drive temperature. + + +Sysfs entries +------------- + +Only the temp1_input attribute is always available. Other attributes are +available only if reported by the drive. All temperatures are reported in +milli-degrees Celsius. + +======================= ===================================================== +temp1_input Current drive temperature +temp1_lcrit Minimum temperature limit. Operating the device below + this temperature may cause physical damage to the + device. +temp1_min Minimum recommended continuous operating limit +temp1_max Maximum recommended continuous operating temperature +temp1_crit Maximum temperature limit. Operating the device above + this temperature may cause physical damage to the + device. +temp1_lowest Minimum temperature seen this power cycle +temp1_highest Maximum temperature seen this power cycle +======================= ===================================================== diff --git a/drivers/hwmon/Kconfig b/drivers/hwmon/Kconfig index 13a6b4afb4b3..97acf4ebd5ca 100644 --- a/drivers/hwmon/Kconfig +++ b/drivers/hwmon/Kconfig @@ -1346,6 +1346,16 @@ config SENSORS_RASPBERRYPI_HWMON This driver can also be built as a module. If so, the module will be called raspberrypi-hwmon. +config SENSORS_SATATEMP + tristate "SATA hard disk drives with temperature sensors" + depends on SCSI && ATA + help + If you say yes you get support for the temperature sensor on + SATA hard disk drives. + + This driver can also be built as a module. If so, the module + will be called satatemp. + config SENSORS_SHT15 tristate "Sensiron humidity and temperature sensors. SHT15 and compat." depends on GPIOLIB || COMPILE_TEST diff --git a/drivers/hwmon/Makefile b/drivers/hwmon/Makefile index 40c036ea45e6..fe55b8f76af9 100644 --- a/drivers/hwmon/Makefile +++ b/drivers/hwmon/Makefile @@ -148,6 +148,7 @@ obj-$(CONFIG_SENSORS_S3C) += s3c-hwmon.o obj-$(CONFIG_SENSORS_SCH56XX_COMMON)+= sch56xx-common.o obj-$(CONFIG_SENSORS_SCH5627) += sch5627.o obj-$(CONFIG_SENSORS_SCH5636) += sch5636.o +obj-$(CONFIG_SENSORS_SATATEMP) += satatemp.o obj-$(CONFIG_SENSORS_SHT15) += sht15.o obj-$(CONFIG_SENSORS_SHT21) += sht21.o obj-$(CONFIG_SENSORS_SHT3x) += sht3x.o diff --git a/drivers/hwmon/satatemp.c b/drivers/hwmon/satatemp.c new file mode 100644 index 000000000000..9608716d422c --- /dev/null +++ b/drivers/hwmon/satatemp.c @@ -0,0 +1,575 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Hwmon client for SATA hard disk drives with temperature sensors + * Copyright (C) 2019 Zodiac Inflight Innovations + * + * With input from: + * Hwmon client for S.M.A.R.T. hard disk drives with temperature sensors. + * (C) 2018 Linus Walleij + * + * hwmon: Driver for SCSI/ATA temperature sensors + * by Constantin Baranov <const@mimas.ru>, submitted September 2009 + * + * The primary means to read hard drive temperatures and temperature limits + * is the SCT Command Transport feature set as specified in ATA8-ACS. + * It can be used to read the current drive temperature, temperature limits, + * and historic minimum and maximum temperatures. The SCT Command Transport + * feature set is documented in "AT Attachment 8 - ATA/ATAPI Command Set + * (ATA8-ACS)". + * + * If the SCT Command Transport feature set is not available, drive temperatures + * may be readable through SMART attributes. Since SMART attributes are not well + * defined, this method is only used as fallback mechanism. + * + * There are three SMART attributes which may report drive temperatures. + * Those are defined as follows (from + * http://www.cropel.com/library/smart-attribute-list.aspx). + * + * 190 Temperature Temperature, monitored by a sensor somewhere inside + * the drive. Raw value typicaly holds the actual + * temperature (hexadecimal) in its rightmost two digits. + * + * 194 Temperature Temperature, monitored by a sensor somewhere inside + * the drive. Raw value typicaly holds the actual + * temperature (hexadecimal) in its rightmost two digits. + * + * 231 Temperature Temperature, monitored by a sensor somewhere inside + * the drive. Raw value typicaly holds the actual + * temperature (hexadecimal) in its rightmost two digits. + * + * Wikipedia defines attributes a bit differently. + * + * 190 Temperature Value is equal to (100-temp. °C), allowing manufacturer + * Difference or to set a minimum threshold which corresponds to a + * Airflow maximum temperature. This also follows the convention of + * Temperature 100 being a best-case value and lower values being + * undesirable. However, some older drives may instead + * report raw Temperature (identical to 0xC2) or + * Temperature minus 50 here. + * 194 Temperature or Indicates the device temperature, if the appropriate + * Temperature sensor is fitted. Lowest byte of the raw value contains + * Celsius the exact temperature value (Celsius degrees). + * 231 Life Left Indicates the approximate SSD life left, in terms of + * (SSDs) or program/erase cycles or available reserved blocks. + * Temperature A normalized value of 100 represents a new drive, with + * a threshold value at 10 indicating a need for + * replacement. A value of 0 may mean that the drive is + * operating in read-only mode to allow data recovery. + * Previously (pre-2010) occasionally used for Drive + * Temperature (more typically reported at 0xC2). + * + * Common denominator is that the first raw byte reports the temperature + * in degrees C on almost all drives. Some drives may report a fractional + * temperature in the second raw byte. + * + * Known exceptions (from libatasmart): + * - SAMSUNG SV0412H and SAMSUNG SV1204H) report the temperature in 10th + * degrees C in the first two raw bytes. + * - A few Maxtor drives report an unknown or bad value in attribute 194. + * - Certain Apple SSD drives report an unknown value in attribute 190. + * Only certain firmware versions are affected. + * + * Those exceptions affect older ATA drives and are currently ignored. + * Also, the second raw byte (possibly reporting the fractional temperature) + * is currently ignored. + * + * Many drives also report temperature limits in additional SMART data raw + * bytes. The format of those is not well defined and varies widely. + * The driver does not currently attempt to report those limits. + * + * According to data in smartmontools, attribute 231 is rarely used to report + * drive temperatures. At the same time, several drives report SSD life left + * in attribute 231, but do not support temperature sensors. For this reason, + * attribute 231 is currently ignored. + * + * Following above definitions, temperatures are reported as follows. + * If SCT Command Transport is supported, it is used to read the + * temperature and, if available, temperature limits. + * - Otherwise, if SMART attribute 194 is supported, it is used to read + * the temperature. + * - Otherwise, if SMART attribute 190 is supported, it is used to read + * the temperature. + */ + +#include <linux/ata.h> +#include <linux/bits.h> +#include <linux/device.h> +#include <linux/hwmon.h> +#include <linux/kernel.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/mutex.h> +#include <scsi/scsi_cmnd.h> +#include <scsi/scsi_device.h> +#include <scsi/scsi_driver.h> +#include <scsi/scsi_proto.h> + +struct satatemp_data { + struct list_head list; /* list of instantiated devices */ + struct mutex lock; /* protect data buffer accesses */ + struct scsi_device *sdev; /* SCSI device */ + struct device *dev; /* instantiating device */ + struct device *hwdev; /* hardware monitoring device */ + u8 smartdata[ATA_SECT_SIZE]; /* local buffer */ + int (*get_temp)(struct satatemp_data *st, u32 attr, long *val); + bool have_temp_lowest; /* lowest temp in SCT status */ + bool have_temp_highest; /* highest temp in SCT status */ + bool have_temp_min; /* have min temp */ + bool have_temp_max; /* have max temp */ + bool have_temp_lcrit; /* have lower critical limit */ + bool have_temp_crit; /* have critical limit */ + int temp_min; /* min temp */ + int temp_max; /* max temp */ + int temp_lcrit; /* lower critical limit */ + int temp_crit; /* critical limit */ +}; + +static LIST_HEAD(satatemp_devlist); + +#define ATA_MAX_SMART_ATTRS 30 +#define SMART_TEMP_PROP_190 190 +#define SMART_TEMP_PROP_194 194 + +#define SCT_STATUS_REQ_ADDR 0xe0 +#define SCT_STATUS_VERSION_LOW 0 /* log byte offsets */ +#define SCT_STATUS_VERSION_HIGH 1 +#define SCT_STATUS_TEMP 200 +#define SCT_STATUS_TEMP_LOWEST 201 +#define SCT_STATUS_TEMP_HIGHEST 202 +#define SCT_READ_LOG_ADDR 0xe1 +#define SMART_READ_LOG 0xd5 +#define SMART_WRITE_LOG 0xd6 + +#define INVALID_TEMP 0x80 + +#define temp_is_valid(temp) ((temp) != INVALID_TEMP) +#define temp_from_sct(temp) (((s8)(temp)) * 1000) + +static inline bool ata_id_smart_supported(u16 *id) +{ + return id[ATA_ID_COMMAND_SET_1] & BIT(0); +} + +static inline bool ata_id_smart_enabled(u16 *id) +{ + return id[ATA_ID_CFS_ENABLE_1] & BIT(0); +} + +static int satatemp_scsi_command(struct satatemp_data *st, + u8 ata_command, u8 feature, + u8 lba_low, u8 lba_mid, u8 lba_high) +{ + u8 scsi_cmd[MAX_COMMAND_SIZE]; + int data_dir; + + memset(scsi_cmd, 0, sizeof(scsi_cmd)); + scsi_cmd[0] = ATA_16; + if (ata_command == ATA_CMD_SMART && feature == SMART_WRITE_LOG) { + scsi_cmd[1] = (5 << 1); /* PIO Data-out */ + /* + * No off.line or cc, write to dev, block count in sector count + * field. + */ + scsi_cmd[2] = 0x06; + data_dir = DMA_TO_DEVICE; + } else { + scsi_cmd[1] = (4 << 1); /* PIO Data-in */ + /* + * No off.line or cc, read from dev, block count in sector count + * field. + */ + scsi_cmd[2] = 0x0e; + data_dir = DMA_FROM_DEVICE; + } + scsi_cmd[4] = feature; + scsi_cmd[6] = 1; /* 1 sector */ + scsi_cmd[8] = lba_low; + scsi_cmd[10] = lba_mid; + scsi_cmd[12] = lba_high; + scsi_cmd[14] = ata_command; + + return scsi_execute_req(st->sdev, scsi_cmd, data_dir, + st->smartdata, ATA_SECT_SIZE, NULL, HZ, 5, + NULL); +} + +static int satatemp_ata_command(struct satatemp_data *st, u8 feature, u8 select) +{ + return satatemp_scsi_command(st, ATA_CMD_SMART, feature, select, + ATA_SMART_LBAM_PASS, ATA_SMART_LBAH_PASS); +} + +static int satatemp_get_smarttemp(struct satatemp_data *st, u32 attr, + long *temp) +{ + u8 *buf = st->smartdata; + bool have_temp = false; + u8 temp_raw; + u8 csum; + int err; + int i; + + err = satatemp_ata_command(st, ATA_SMART_READ_VALUES, 0); + if (err) + return err; + + /* Checksum the read value table */ + csum = 0; + for (i = 0; i < ATA_SECT_SIZE; i++) + csum += buf[i]; + if (csum) { + dev_dbg(&st->sdev->sdev_gendev, + "checksum error reading SMART values\n"); + return -EIO; + } + + for (i = 0; i < ATA_MAX_SMART_ATTRS; i++) { + u8 *attr = buf + i * 12; + int id = attr[2]; + + if (!id) + continue; + + if (id == SMART_TEMP_PROP_190) { + temp_raw = attr[7]; + have_temp = true; + } + if (id == SMART_TEMP_PROP_194) { + temp_raw = attr[7]; + have_temp = true; + break; + } + } + + if (have_temp) { + *temp = temp_raw * 1000; + return 0; + } + + return -ENXIO; +} + +static int satatemp_get_scttemp(struct satatemp_data *st, u32 attr, long *val) +{ + u8 *buf = st->smartdata; + int err; + + err = satatemp_ata_command(st, SMART_READ_LOG, SCT_STATUS_REQ_ADDR); + if (err) + return err; + switch (attr) { + case hwmon_temp_input: + *val = temp_from_sct(buf[SCT_STATUS_TEMP]); + break; + case hwmon_temp_lowest: + *val = temp_from_sct(buf[SCT_STATUS_TEMP_LOWEST]); + break; + case hwmon_temp_highest: + *val = temp_from_sct(buf[SCT_STATUS_TEMP_HIGHEST]); + break; + default: + err = -EINVAL; + break; + } + return err; +} + +static int satatemp_identify(struct satatemp_data *st) +{ + struct scsi_device *sdev = st->sdev; + u8 *buf = st->smartdata; + bool is_ata, is_sata; + bool have_sct_data_table; + bool have_sct_temp; + bool have_smart; + bool have_sct; + u16 *ata_id; + u16 version; + long temp; + u8 *vpd; + int err; + + /* bail out immediately if there is no inquiry data */ + if (!sdev->inquiry || sdev->inquiry_len < 16) + return -ENODEV; + + /* + * Inquiry data sanity checks (per SAT-5): + * - peripheral qualifier must be 0 + * - peripheral device type must be 0x0 (Direct access block device) + * - SCSI Vendor ID is "ATA " + */ + if (sdev->inquiry[0] || + strncmp(&sdev->inquiry[8], "ATA ", 8)) + return -ENODEV; + + vpd = kzalloc(1024, GFP_KERNEL); + if (!vpd) + return -ENOMEM; + + err = scsi_get_vpd_page(sdev, 0x89, vpd, 1024); + if (err) { + kfree(vpd); + return err; + } + + /* + * More sanity checks. + * For VPD offsets and values see ANS Project INCITS 557, + * "Information technology - SCSI / ATA Translation - 5 (SAT-5)". + */ + if (vpd[1] != 0x89 || vpd[2] != 0x02 || vpd[3] != 0x38 || + vpd[36] != 0x34 || vpd[56] != ATA_CMD_ID_ATA) { + kfree(vpd); + return -ENODEV; + } + ata_id = (u16 *)&vpd[60]; + is_ata = ata_id_is_ata(ata_id); + is_sata = ata_id_is_sata(ata_id); + have_sct = ata_id_sct_supported(ata_id); + have_sct_data_table = ata_id_sct_data_tables(ata_id); + have_smart = ata_id_smart_supported(ata_id) && + ata_id_smart_enabled(ata_id); + + kfree(vpd); + + /* bail out if this is not a SATA device */ + if (!is_ata || !is_sata) + return -ENODEV; + if (!have_sct) + goto skip_sct; + + err = satatemp_ata_command(st, SMART_READ_LOG, SCT_STATUS_REQ_ADDR); + if (err) + goto skip_sct; + + version = (buf[SCT_STATUS_VERSION_HIGH] << 8) | + buf[SCT_STATUS_VERSION_LOW]; + if (version != 2 && version != 3) + goto skip_sct; + + have_sct_temp = temp_is_valid(buf[SCT_STATUS_TEMP]); + if (!have_sct_temp) + goto skip_sct; + + st->have_temp_lowest = temp_is_valid(buf[SCT_STATUS_TEMP_LOWEST]); + st->have_temp_highest = temp_is_valid(buf[SCT_STATUS_TEMP_HIGHEST]); + + if (!have_sct_data_table) + goto skip_sct; + + /* Request and read temperature history table */ + memset(buf, '\0', sizeof(st->smartdata)); + buf[0] = 5; /* data table command */ + buf[2] = 1; /* read table */ + buf[4] = 2; /* temperature history table */ + + err = satatemp_ata_command(st, SMART_WRITE_LOG, SCT_STATUS_REQ_ADDR); + if (err) + goto skip_sct_data; + + err = satatemp_ata_command(st, SMART_READ_LOG, SCT_READ_LOG_ADDR); + if (err) + goto skip_sct_data; + + /* + * Temperature limits per AT Attachment 8 - + * ATA/ATAPI Command Set (ATA8-ACS) + */ + st->have_temp_max = temp_is_valid(buf[6]); + st->have_temp_crit = temp_is_valid(buf[7]); + st->have_temp_min = temp_is_valid(buf[8]); + st->have_temp_lcrit = temp_is_valid(buf[9]); + + st->temp_max = temp_from_sct(buf[6]); + st->temp_crit = temp_from_sct(buf[7]); + st->temp_min = temp_from_sct(buf[8]); + st->temp_lcrit = temp_from_sct(buf[9]); + +skip_sct_data: + if (have_sct_temp) { + st->get_temp = satatemp_get_scttemp; + return 0; + } +skip_sct: + if (!have_smart) + return -ENODEV; + st->get_temp = satatemp_get_smarttemp; + return satatemp_get_smarttemp(st, hwmon_temp_input, &temp); +} + +static int satatemp_read(struct device *dev, enum hwmon_sensor_types type, + u32 attr, int channel, long *val) +{ + struct satatemp_data *st = dev_get_drvdata(dev); + int err = 0; + + if (type != hwmon_temp) + return -EINVAL; + + switch (attr) { + case hwmon_temp_input: + case hwmon_temp_lowest: + case hwmon_temp_highest: + mutex_lock(&st->lock); + err = st->get_temp(st, attr, val); + mutex_unlock(&st->lock); + break; + case hwmon_temp_lcrit: + *val = st->temp_lcrit; + break; + case hwmon_temp_min: + *val = st->temp_min; + break; + case hwmon_temp_max: + *val = st->temp_max; + break; + case hwmon_temp_crit: + *val = st->temp_crit; + break; + default: + err = -EINVAL; + break; + } + return err; +} + +static umode_t satatemp_is_visible(const void *data, + enum hwmon_sensor_types type, + u32 attr, int channel) +{ + const struct satatemp_data *st = data; + + switch (type) { + case hwmon_temp: + switch (attr) { + case hwmon_temp_input: + return 0444; + case hwmon_temp_lowest: + if (st->have_temp_lowest) + return 0444; + break; + case hwmon_temp_highest: + if (st->have_temp_highest) + return 0444; + break; + case hwmon_temp_min: + if (st->have_temp_min) + return 0444; + break; + case hwmon_temp_max: + if (st->have_temp_max) + return 0444; + break; + case hwmon_temp_lcrit: + if (st->have_temp_lcrit) + return 0444; + break; + case hwmon_temp_crit: + if (st->have_temp_crit) + return 0444; + break; + default: + break; + } + break; + default: + break; + } + return 0; +} + +static const struct hwmon_channel_info *satatemp_info[] = { + HWMON_CHANNEL_INFO(chip, + HWMON_C_REGISTER_TZ), + HWMON_CHANNEL_INFO(temp, HWMON_T_INPUT | + HWMON_T_LOWEST | HWMON_T_HIGHEST | + HWMON_T_MIN | HWMON_T_MAX | + HWMON_T_LCRIT | HWMON_T_CRIT), + NULL +}; + +static const struct hwmon_ops satatemp_ops = { + .is_visible = satatemp_is_visible, + .read = satatemp_read, +}; + +static const struct hwmon_chip_info satatemp_chip_info = { + .ops = &satatemp_ops, + .info = satatemp_info, +}; + +/* + * The device argument points to sdev->sdev_dev. Its parent is + * sdev->sdev_gendev, which we can use to get the scsi_device pointer. + */ +static int satatemp_add(struct device *dev, struct class_interface *intf) +{ + struct scsi_device *sdev = to_scsi_device(dev->parent); + struct satatemp_data *st; + int err; + + st = kzalloc(sizeof(*st), GFP_KERNEL); + if (!st) + return -ENOMEM; + + st->sdev = sdev; + st->dev = dev; + mutex_init(&st->lock); + + if (satatemp_identify(st)) { + err = -ENODEV; + goto abort; + } + + st->hwdev = hwmon_device_register_with_info(dev->parent, "satatemp", + st, &satatemp_chip_info, + NULL); + if (IS_ERR(st->hwdev)) { + err = PTR_ERR(st->hwdev); + goto abort; + } + + list_add(&st->list, &satatemp_devlist); + return 0; + +abort: + kfree(st); + return err; +} + +static void satatemp_remove(struct device *dev, struct class_interface *intf) +{ + struct satatemp_data *st, *tmp; + + list_for_each_entry_safe(st, tmp, &satatemp_devlist, list) { + if (st->dev == dev) { + list_del(&st->list); + hwmon_device_unregister(st->hwdev); + kfree(st); + break; + } + } +} + +static struct class_interface satatemp_interface = { + .add_dev = satatemp_add, + .remove_dev = satatemp_remove, +}; + +static int __init satatemp_init(void) +{ + return scsi_register_interface(&satatemp_interface); +} + +static void __exit satatemp_exit(void) +{ + scsi_unregister_interface(&satatemp_interface); +} + +module_init(satatemp_init); +module_exit(satatemp_exit); + +MODULE_AUTHOR("Guenter Roeck <linus@roeck-us.net>"); +MODULE_DESCRIPTION("ATA temperature monitor"); +MODULE_LICENSE("GPL");