mbox series

[0/1] Summary: hwmon driver for temperature sensors on SATA drives

Message ID 20191209052119.32072-1-linux@roeck-us.net (mailing list archive)
Headers show
Series Summary: hwmon driver for temperature sensors on SATA drives | expand

Message

Guenter Roeck Dec. 9, 2019, 5:21 a.m. UTC
In the past, several attempts have been made to add support for reporting
SCSI/[S]ATA drive temperatures to the Linux kernel. This is desirable to
have a means to report drive temperatures to userspace without root
privileges and in a standard format, but also to be able to tie reported
temperatures with the thermal subsystem.

The most recent attempt was [1] by Linus Walleij. It went through a total
of seven iterations. At the end, it was rejected for a number of reasons;
see the provided link for details. This implementation resides in the
SCSI core. It originally resided in libata but was moved to SCSI per
maintainer request, where it was ultimately rejected.

The feedback on this approach suggests to use the SCSI Temperature log page
[0x0d] as means to access drive temperature information. It is unknown
if this is implemented in any real SCSI drive. The feedback also suggests to
obtain temperature from ATA drives, convert it into the SCSI temperature log
page in libata-scsi, and to use that information in a hardware monitoring
driver. The format and method to do this is documented in [3]. This is not
currently implemented in the Linux kernel.

An earlier submission of a driver to report SCSI/SATA drive temperatures
was made back in 2009 by Constantin Baranov [2]. This submission resides
in the hardware monitoring subsystem. It does not rely on changes in the
SCSI subsystem or in libata-scsi. Instead, it registers itself with the
SCSI subsystem using scsi_register_interface(). It was rejected primarily
because it executes ATA passthrough commands without verification that it
is actually connected to an ATA drive.

Both submissions use SMART attributes to read drive temperature information.
[1] also tries to identify temperature limits from those attributes.
Unfortunately, SMART attributes are not well defined, resulting in relative
complex code trying to identify the exact format of the reported data.

With the available information and feedback, we can make a number of
observations and conclusions.
a) Using available (S)ATA drive temperature information and convert it to
   a SCSI log page is an interesting idea. On the downside, it would add a
   substantial amount of complexity to libata-scsi. The code would either
   have to be optional, or it would have to be built into the kernel even
   if it is never used on a given system. Without access to SCSI drives
   supporting this feature, it would be all but impossible to test the code
   against such a drive. It would neither be possible to test correctness
   of the code in libata-scsi nor in the driver using that information.
   Overall it would be much easier and much less risky to implement such
   code on the receiving side (ie in a driver reporting the temperatures)
   instead of trying to convert the information from one format to another
   first. In summary, it is neither practical nor feasible. On top of that,
   there is no guarantee that code implementing this functionality would
   ever be accepted into the kernel for this very reason.
b) The code needed to read and analyze SCSI temperature log pages is quite
   complex (see smartmontools [5]). There is no existing support code
   in the Linux kernel; such code would have to be written. This makes
   the approach discussed in a) even more risky and less practical.
c) Overall, any attempt to report temperature information for anything
   but SATA drives in the kernel is not practical due to the complexity
   involved, and due to the inability to test the resulting code with
   non-SATA drives.
d) Using SMART data for anything but basic temperature reporting is not
   really feasible due to the lack of standardization. Any attempt to do
   this would add a substantial amount of code, ambiguity, and risk.

This submission implements a driver to report the temperature of SATA
drives through the hardware monitoring subsystem. It is implemented as
stand-alone driver in the hardware monitoring subsystem. The driver uses
the mechanism from submission [1] to register with the SCSI subsystem.
By using this mechanism, changes in the SCSI or ATA subsystems are not
required.  To reduce risk and complexity, it only instantiates after
reliably validating that it is connected to a SATA drive. It does not
attempt to report the temperature of non-SATA drives.

The driver uses the SCT Command Transport feature set as specified in
ATA8-ACS [4] to read and report the temperature as well as temperature
limits and lowest/highest temperature information (if available) for
SATA drives. If a drive does not support SCT Command Transport, the driver
attempts to access a limited set of well known SMART attributes to read
the drive temperature. In that case, only the current drive temperature
is reported.

---
References:
[1] https://patchwork.kernel.org/patch/10688021/
[2] https://lore.kernel.org/lkml/20090913040104.ab1d0b69.const@mimas.ru/
[3] http://www.t10.org/cgi-bin/ac.pl?t=f&f=sat5r02.pdf
    Information technology - SCSI / ATA Translation - 5 (SAT-5),
    section 10.3.8 (Temperature log page).
[4] http://www.t13.org/documents/uploadeddocuments/docs2008/d1699r6a-ata8-acs.pdf
    ANS T13/1699-D "Information technology - AT Attachment 8 - ATA/ATAPI Command
    Set (ATA8-ACS)"
[5] https://github.com/mirror/smartmontools.git

Comments

Martin K. Petersen Dec. 11, 2019, 4:08 a.m. UTC | #1
Hi Guenter,

> The most recent attempt was [1] by Linus Walleij. It went through a total
> of seven iterations. At the end, it was rejected for a number of reasons;
> see the provided link for details. This implementation resides in the
> SCSI core. It originally resided in libata but was moved to SCSI per
> maintainer request, where it was ultimately rejected.

While I am sure I come across as a curmudgeon, regressions is a major
concern for me. That, and making sure we pick the right architecture. I
thought we were making good progress in that department when Linus
abandoned the effort.

> The feedback on this approach suggests to use the SCSI Temperature log
> page [0x0d] as means to access drive temperature information. It is
> unknown if this is implemented in any real SCSI drive.

Almost every SCSI drive has it.

> The feedback also suggests to obtain temperature from ATA drives,
> convert it into the SCSI temperature log page in libata-scsi, and to
> use that information in a hardware monitoring driver. The format and
> method to do this is documented in [3]. This is not currently
> implemented in the Linux kernel.

Correct, but I have no qualms over exporting the SCSI temperature log
page. The devices that export that page are generally well-behaved.

My concerns are wrt. identifying whether SMART data is available for
USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID
controllers that hide the real drives in various ways).

I am not sure why the SCSI temperature log page parsing would be
complex. I will have to go check smartmontools to see what that is all
about. The spec is as simple as can be.

Anyway. I think the overall approach wrt. SCT and falling back to
well-known SMART fields is reasonably sane and fine for libata. But I
don't understand the pushback wrt. using the SCSI temperature log page
as a conduit. I think it would be fine if this worked out of the box for
both SCSI and ATA drives.

The elephant in the room remains USB. And coming up with a way we can
reliably detect whether it is safe to start poking at the device to
discover if SMART is provided. If we eventually want to pursue USB, I
think your heuristic stuff needs to be a library that can be leveraged
by both libata and USB. But that doesn't have to be part of the initial
effort.

And finally, my concerns wrt. reacting to bad sensors remain. Not too
familiar with hwmon, but I would still like any actions based on
reported temperatures to be under user control and not the kernel.
Guenter Roeck Dec. 11, 2019, 5:57 a.m. UTC | #2
Hi Martin,

On 12/10/19 8:08 PM, Martin K. Petersen wrote:
> 
> Hi Guenter,
> 
>> The most recent attempt was [1] by Linus Walleij. It went through a total
>> of seven iterations. At the end, it was rejected for a number of reasons;
>> see the provided link for details. This implementation resides in the
>> SCSI core. It originally resided in libata but was moved to SCSI per
>> maintainer request, where it was ultimately rejected.
> 
> While I am sure I come across as a curmudgeon, regressions is a major
> concern for me. That, and making sure we pick the right architecture. I
> thought we were making good progress in that department when Linus
> abandoned the effort.
> 

If anything, I am surprised that he did not give up earlier. Personally
I did not see a path to success after v7 of the patch set was rejected.

I also no longer believe that temperature monitoring of SATA drives
should be implemented within the ATA or SCSI subsystem. I came to the
conclusion that it is much better suited as separate hardware monitoring
driver. As separate driver, its instantiation is in full control of
the user. If it causes trouble (or, as mentioned separately, if it adds
too much instantiation time, or if it is considered to be too large),
it can simply be disabled in a given system by blacklisting it (or,
rather, by not explicitly loading it in the first place). With that,
there is no real compatibility concern. If and when drives are detected
which report bad information, such drives can be added to a blacklist
without impact on the core SCSI or ATA code. Until that happens, not
loading the driver solves the problem on any affected system.

>> The feedback on this approach suggests to use the SCSI Temperature log
>> page [0x0d] as means to access drive temperature information. It is
>> unknown if this is implemented in any real SCSI drive.
> 
> Almost every SCSI drive has it.
> 
Good to hear.

>> The feedback also suggests to obtain temperature from ATA drives,
>> convert it into the SCSI temperature log page in libata-scsi, and to
>> use that information in a hardware monitoring driver. The format and
>> method to do this is documented in [3]. This is not currently
>> implemented in the Linux kernel.
> 
> Correct, but I have no qualms over exporting the SCSI temperature log
> page. The devices that export that page are generally well-behaved.
> 
Also good to hear. However, for my part, I have no means to test such
code since I don't have any SCSI drives.

> My concerns are wrt. identifying whether SMART data is available for
> USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID
> controllers that hide the real drives in various ways).
> 

The one USB/UAS connected SATA drive I have (a WD passport) reports
itself as "WD      ", not as "ATA     ". I would expect other drives
to do the same. That drive reports (via smartctl) that it supports
both SCT and SMART data. It doesn't report temperatures through SCT,
but it does report the drive temperature with SMART attribute 194.
I did not attempt to add support for this and similar drives since
I don't know if I can reliably detect it. The potential benefit
compared to the risk seemed too low (we would be getting into
possible regression space) for me to try. Such code (effectively
it boils down to relaxing SATA drive detection) can always be added
at a later time.

> I am not sure why the SCSI temperature log page parsing would be
> complex. I will have to go check smartmontools to see what that is all

Not as much the parsing, but detection if the information is there.

> about. The spec is as simple as can be.
> 

Possibly. I personally also find it quite vague. It is definitely not
something I would want to try to implement without ability to see how
the data actually looks like as reported by a real drive, and without
ability to test the code.

> Anyway. I think the overall approach wrt. SCT and falling back to
> well-known SMART fields is reasonably sane and fine for libata. But I
> don't understand the pushback wrt. using the SCSI temperature log page
> as a conduit. I think it would be fine if this worked out of the box for

This is not a pushback per se. It is simply a matter of ability (or lack
of it) to test any such code.

Regarding "conduit", I assume you mean converting SATA/SCT information
into SCST temperature pages and reporting temperature purely based
on those. I personally think that this would be the wrong approach:
It would effectively require code in the ATA core which is not really
needed there. This would bloat the ATA code with no real advantage.
In my opinion, available temperature information should be interpreted
where it is needed, and only there, not in several places. I see that
as much less risky and error prone than spreading the code to multiple
places.

> both SCSI and ATA drives.
> 
The elegance of my approach is that adding support for reading temperatures
from SCSI drives (or, for that matter, USB/UAS drives) would be
straightforward. All one would need to do is to implement the necessary
detection code as well as a function to actually read the information
from the drive. This can be done at any time, and, again, it should be
done by someone with the ability to test the code.

> The elephant in the room remains USB. And coming up with a way we can
> reliably detect whether it is safe to start poking at the device to
> discover if SMART is provided. If we eventually want to pursue USB,  > think your heuristic stuff needs to be a library that can be leveraged
> by both libata and USB. But that doesn't have to be part of the initial
> effort.
> 
> And finally, my concerns wrt. reacting to bad sensors remain. Not too
> familiar with hwmon, but I would still like any actions based on
> reported temperatures to be under user control and not the kernel.
> 
All sensors can report bad information, and quite often they do.
This is actually quite normal in any given system. That doesn't mean
that the available (connected) sensors should be ignored.

Also, when it comes to actions, the one subsystem performing any actions
in the kernel based on temperature sensor information is the thermal
subsystem, and that is on purpose implemented in the kernel.
The hardware monitoring subsystem, on its own, is purely passive
and only reports sensor information; it does not act on it. Any action
will either be done by userspace (eg with fancontrol) or by the thermal
subsystem.

Overall, I understand the desire to also support temperature reporting
for SCSI and USB/UAS drives. As hardware monitoring maintainer, I'd
be happy to accept patches implementing that support. However, I don't
see this as immediately necessary, and I would want to have some
reassurance that such code is well tested and doesn't cause any
regressions (especially since concerns about possible regressions were
mentioned several times in the context of the previous submissions).

Thanks,
Guenter
Martin K. Petersen Dec. 17, 2019, 2:35 a.m. UTC | #3
Guenter,

> If and when drives are detected which report bad information, such
> drives can be added to a blacklist without impact on the core SCSI or
> ATA code. Until that happens, not loading the driver solves the
> problem on any affected system.

My only concern with that is that we'll have blacklisting several
places. We already have ATA and SCSI blacklists. If we now add a third
place, that's going to be a maintenance nightmare.

More on that below.

>> My concerns are wrt. identifying whether SMART data is available for
>> USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID
>> controllers that hide the real drives in various ways).

OK, so I spent my weekend tinkering with 15+ years of accumulated USB
devices. And my conclusion is that no, we can't in any sensible manner,
support USB storage monitoring in the kernel. There is no heuristic that
I can find that identifies that "this is a hard drive or an SSD and
attempting one of the various SMART methods may be safe". As opposed to
"this is a USB key that's likely to lock up if you try". And that's
ignoring the drives with USB-ATA bridges that I managed to wedge in my
attempt at sending down commands.

Even smartmontools is failing to work on a huge part of my vintage
collection.  Thanks to a wide variety of bridges with random, custom
interfaces.

So my stance on all this is that I'm fine with your general approach for
ATA. I will post a patch adding the required bits for SCSI. And if a
device does not implement either of the two standard methods, people
should use smartmontools.

Wrt. name, since I've added SCSI support, satatemp is a bit of a
misnomer. drivetemp, maybe? No particular preference.

> The one USB/UAS connected SATA drive I have (a WD passport) reports
> itself as "WD      ", not as "ATA     ". I would expect other drives
> to do the same.

Yes. Most vendors are too fond of their brand names to put "ATA" in
there. So my suggestion is to relax the heuristic to trigger on the ATA
Information VPD page only and ignore the name.

Also, there are some devices that will lock up the way you access that
VPD page. So a tweak is also required there.

To avoid the multiple blacklists and heuristic collections my suggestion
is that I introduce a helper function in SCSI (based on what I did in
the disk driver) that can be called to identify whether something is an
ATA device. And then the hwmon driver can call that and we can keep the
heuristics in one place.

If a device turns out to be problematic wrt. getting the ATA VPD for the
purpose of SMART, for instance, it will also need to be blacklisted for
other reasons in SCSI. So I would really like to keep the heuristics in
one place.
Guenter Roeck Dec. 17, 2019, 3:57 a.m. UTC | #4
On 12/16/19 6:35 PM, Martin K. Petersen wrote:
> 
> Guenter,
> 
>> If and when drives are detected which report bad information, such
>> drives can be added to a blacklist without impact on the core SCSI or
>> ATA code. Until that happens, not loading the driver solves the
>> problem on any affected system.
> 
> My only concern with that is that we'll have blacklisting several
> places. We already have ATA and SCSI blacklists. If we now add a third
> place, that's going to be a maintenance nightmare.
> 
> More on that below.
> 
>>> My concerns are wrt. identifying whether SMART data is available for
>>> USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID
>>> controllers that hide the real drives in various ways).
> 
> OK, so I spent my weekend tinkering with 15+ years of accumulated USB
> devices. And my conclusion is that no, we can't in any sensible manner,
> support USB storage monitoring in the kernel. There is no heuristic that
> I can find that identifies that "this is a hard drive or an SSD and
> attempting one of the various SMART methods may be safe". As opposed to
> "this is a USB key that's likely to lock up if you try". And that's
> ignoring the drives with USB-ATA bridges that I managed to wedge in my
> attempt at sending down commands.
> 
> Even smartmontools is failing to work on a huge part of my vintage
> collection.  Thanks to a wide variety of bridges with random, custom
> interfaces.
> 
> So my stance on all this is that I'm fine with your general approach for
> ATA. I will post a patch adding the required bits for SCSI. And if a
> device does not implement either of the two standard methods, people
> should use smartmontools.
> 
> Wrt. name, since I've added SCSI support, satatemp is a bit of a
> misnomer. drivetemp, maybe? No particular preference.
> 
Agreed, if we extend this to SCSI, satatemp is less than perfect.
drivetemp ? disktemp ? I am open to suggestions, with maybe a small
personal preference for disktemp out of those two.

>> The one USB/UAS connected SATA drive I have (a WD passport) reports
>> itself as "WD      ", not as "ATA     ". I would expect other drives
>> to do the same.
> 
> Yes. Most vendors are too fond of their brand names to put "ATA" in
> there. So my suggestion is to relax the heuristic to trigger on the ATA
> Information VPD page only and ignore the name.
> 

Fine with me. I wanted to be as restrictive as possible.

> Also, there are some devices that will lock up the way you access that
> VPD page. So a tweak is also required there.
> 
Do you have details ? Do I need to add a call to scsi_device_supports_vpd(),
maybe ?

> To avoid the multiple blacklists and heuristic collections my suggestion
> is that I introduce a helper function in SCSI (based on what I did in
> the disk driver) that can be called to identify whether something is an
> ATA device. And then the hwmon driver can call that and we can keep the
> heuristics in one place.
> 
> If a device turns out to be problematic wrt. getting the ATA VPD for the
> purpose of SMART, for instance, it will also need to be blacklisted for
> other reasons in SCSI. So I would really like to keep the heuristics in
> one place.
> 
Fine with me. My only concern is that I don't want the driver to disappear
into nowhere-land (again).

Thanks,
Guenter
Damien Le Moal Dec. 17, 2019, 5:50 a.m. UTC | #5
On 2019/12/17 12:57, Guenter Roeck wrote:
> On 12/16/19 6:35 PM, Martin K. Petersen wrote:
>>
>> Guenter,
>>
>>> If and when drives are detected which report bad information, such
>>> drives can be added to a blacklist without impact on the core SCSI or
>>> ATA code. Until that happens, not loading the driver solves the
>>> problem on any affected system.
>>
>> My only concern with that is that we'll have blacklisting several
>> places. We already have ATA and SCSI blacklists. If we now add a third
>> place, that's going to be a maintenance nightmare.
>>
>> More on that below.
>>
>>>> My concerns are wrt. identifying whether SMART data is available for
>>>> USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID
>>>> controllers that hide the real drives in various ways).
>>
>> OK, so I spent my weekend tinkering with 15+ years of accumulated USB
>> devices. And my conclusion is that no, we can't in any sensible manner,
>> support USB storage monitoring in the kernel. There is no heuristic that
>> I can find that identifies that "this is a hard drive or an SSD and
>> attempting one of the various SMART methods may be safe". As opposed to
>> "this is a USB key that's likely to lock up if you try". And that's
>> ignoring the drives with USB-ATA bridges that I managed to wedge in my
>> attempt at sending down commands.
>>
>> Even smartmontools is failing to work on a huge part of my vintage
>> collection.  Thanks to a wide variety of bridges with random, custom
>> interfaces.
>>
>> So my stance on all this is that I'm fine with your general approach for
>> ATA. I will post a patch adding the required bits for SCSI. And if a
>> device does not implement either of the two standard methods, people
>> should use smartmontools.
>>
>> Wrt. name, since I've added SCSI support, satatemp is a bit of a
>> misnomer. drivetemp, maybe? No particular preference.
>>
> Agreed, if we extend this to SCSI, satatemp is less than perfect.
> drivetemp ? disktemp ? I am open to suggestions, with maybe a small
> personal preference for disktemp out of those two.

"disk" tend to imply HDD, excluding SSDs. So my vote goes to
"drivetemp", or even the more generic, "devtemp".
Guenter Roeck Dec. 17, 2019, 3:47 p.m. UTC | #6
On Tue, Dec 17, 2019 at 05:50:17AM +0000, Damien Le Moal wrote:
> On 2019/12/17 12:57, Guenter Roeck wrote:
> > On 12/16/19 6:35 PM, Martin K. Petersen wrote:
> >>
> >> Guenter,
> >>
> >>> If and when drives are detected which report bad information, such
> >>> drives can be added to a blacklist without impact on the core SCSI or
> >>> ATA code. Until that happens, not loading the driver solves the
> >>> problem on any affected system.
> >>
> >> My only concern with that is that we'll have blacklisting several
> >> places. We already have ATA and SCSI blacklists. If we now add a third
> >> place, that's going to be a maintenance nightmare.
> >>
> >> More on that below.
> >>
> >>>> My concerns are wrt. identifying whether SMART data is available for
> >>>> USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID
> >>>> controllers that hide the real drives in various ways).
> >>
> >> OK, so I spent my weekend tinkering with 15+ years of accumulated USB
> >> devices. And my conclusion is that no, we can't in any sensible manner,
> >> support USB storage monitoring in the kernel. There is no heuristic that
> >> I can find that identifies that "this is a hard drive or an SSD and
> >> attempting one of the various SMART methods may be safe". As opposed to
> >> "this is a USB key that's likely to lock up if you try". And that's
> >> ignoring the drives with USB-ATA bridges that I managed to wedge in my
> >> attempt at sending down commands.
> >>
> >> Even smartmontools is failing to work on a huge part of my vintage
> >> collection.  Thanks to a wide variety of bridges with random, custom
> >> interfaces.
> >>
> >> So my stance on all this is that I'm fine with your general approach for
> >> ATA. I will post a patch adding the required bits for SCSI. And if a
> >> device does not implement either of the two standard methods, people
> >> should use smartmontools.
> >>
> >> Wrt. name, since I've added SCSI support, satatemp is a bit of a
> >> misnomer. drivetemp, maybe? No particular preference.
> >>
> > Agreed, if we extend this to SCSI, satatemp is less than perfect.
> > drivetemp ? disktemp ? I am open to suggestions, with maybe a small
> > personal preference for disktemp out of those two.
> 
> "disk" tend to imply HDD, excluding SSDs. So my vote goes to
> "drivetemp", or even the more generic, "devtemp".
> 
"devtemp" would apply to all devices with temperature sensors, which
would be a bit too generic. I'll take that as a vote for "drivetemp".

Guenter
Martin K. Petersen Dec. 18, 2019, 3:42 a.m. UTC | #7
Guenter,

>> Also, there are some devices that will lock up the way you access that
>> VPD page. So a tweak is also required there.
>>
> Do you have details ? Do I need to add a call to scsi_device_supports_vpd(),
> maybe ?

Some devices lock up if you ask for too much data. I actually discovered
a VPD handling regression in 5.5 while working on a series of prep
patches for you today. Working on a fix. I'll try to get a patch series
out for review tomorrow.