Message ID | 20191209052119.32072-1-linux@roeck-us.net (mailing list archive) |
---|---|
Headers | show |
Series | Summary: hwmon driver for temperature sensors on SATA drives | expand |
Hi Guenter, > The most recent attempt was [1] by Linus Walleij. It went through a total > of seven iterations. At the end, it was rejected for a number of reasons; > see the provided link for details. This implementation resides in the > SCSI core. It originally resided in libata but was moved to SCSI per > maintainer request, where it was ultimately rejected. While I am sure I come across as a curmudgeon, regressions is a major concern for me. That, and making sure we pick the right architecture. I thought we were making good progress in that department when Linus abandoned the effort. > The feedback on this approach suggests to use the SCSI Temperature log > page [0x0d] as means to access drive temperature information. It is > unknown if this is implemented in any real SCSI drive. Almost every SCSI drive has it. > The feedback also suggests to obtain temperature from ATA drives, > convert it into the SCSI temperature log page in libata-scsi, and to > use that information in a hardware monitoring driver. The format and > method to do this is documented in [3]. This is not currently > implemented in the Linux kernel. Correct, but I have no qualms over exporting the SCSI temperature log page. The devices that export that page are generally well-behaved. My concerns are wrt. identifying whether SMART data is available for USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID controllers that hide the real drives in various ways). I am not sure why the SCSI temperature log page parsing would be complex. I will have to go check smartmontools to see what that is all about. The spec is as simple as can be. Anyway. I think the overall approach wrt. SCT and falling back to well-known SMART fields is reasonably sane and fine for libata. But I don't understand the pushback wrt. using the SCSI temperature log page as a conduit. I think it would be fine if this worked out of the box for both SCSI and ATA drives. The elephant in the room remains USB. And coming up with a way we can reliably detect whether it is safe to start poking at the device to discover if SMART is provided. If we eventually want to pursue USB, I think your heuristic stuff needs to be a library that can be leveraged by both libata and USB. But that doesn't have to be part of the initial effort. And finally, my concerns wrt. reacting to bad sensors remain. Not too familiar with hwmon, but I would still like any actions based on reported temperatures to be under user control and not the kernel.
Hi Martin, On 12/10/19 8:08 PM, Martin K. Petersen wrote: > > Hi Guenter, > >> The most recent attempt was [1] by Linus Walleij. It went through a total >> of seven iterations. At the end, it was rejected for a number of reasons; >> see the provided link for details. This implementation resides in the >> SCSI core. It originally resided in libata but was moved to SCSI per >> maintainer request, where it was ultimately rejected. > > While I am sure I come across as a curmudgeon, regressions is a major > concern for me. That, and making sure we pick the right architecture. I > thought we were making good progress in that department when Linus > abandoned the effort. > If anything, I am surprised that he did not give up earlier. Personally I did not see a path to success after v7 of the patch set was rejected. I also no longer believe that temperature monitoring of SATA drives should be implemented within the ATA or SCSI subsystem. I came to the conclusion that it is much better suited as separate hardware monitoring driver. As separate driver, its instantiation is in full control of the user. If it causes trouble (or, as mentioned separately, if it adds too much instantiation time, or if it is considered to be too large), it can simply be disabled in a given system by blacklisting it (or, rather, by not explicitly loading it in the first place). With that, there is no real compatibility concern. If and when drives are detected which report bad information, such drives can be added to a blacklist without impact on the core SCSI or ATA code. Until that happens, not loading the driver solves the problem on any affected system. >> The feedback on this approach suggests to use the SCSI Temperature log >> page [0x0d] as means to access drive temperature information. It is >> unknown if this is implemented in any real SCSI drive. > > Almost every SCSI drive has it. > Good to hear. >> The feedback also suggests to obtain temperature from ATA drives, >> convert it into the SCSI temperature log page in libata-scsi, and to >> use that information in a hardware monitoring driver. The format and >> method to do this is documented in [3]. This is not currently >> implemented in the Linux kernel. > > Correct, but I have no qualms over exporting the SCSI temperature log > page. The devices that export that page are generally well-behaved. > Also good to hear. However, for my part, I have no means to test such code since I don't have any SCSI drives. > My concerns are wrt. identifying whether SMART data is available for > USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID > controllers that hide the real drives in various ways). > The one USB/UAS connected SATA drive I have (a WD passport) reports itself as "WD ", not as "ATA ". I would expect other drives to do the same. That drive reports (via smartctl) that it supports both SCT and SMART data. It doesn't report temperatures through SCT, but it does report the drive temperature with SMART attribute 194. I did not attempt to add support for this and similar drives since I don't know if I can reliably detect it. The potential benefit compared to the risk seemed too low (we would be getting into possible regression space) for me to try. Such code (effectively it boils down to relaxing SATA drive detection) can always be added at a later time. > I am not sure why the SCSI temperature log page parsing would be > complex. I will have to go check smartmontools to see what that is all Not as much the parsing, but detection if the information is there. > about. The spec is as simple as can be. > Possibly. I personally also find it quite vague. It is definitely not something I would want to try to implement without ability to see how the data actually looks like as reported by a real drive, and without ability to test the code. > Anyway. I think the overall approach wrt. SCT and falling back to > well-known SMART fields is reasonably sane and fine for libata. But I > don't understand the pushback wrt. using the SCSI temperature log page > as a conduit. I think it would be fine if this worked out of the box for This is not a pushback per se. It is simply a matter of ability (or lack of it) to test any such code. Regarding "conduit", I assume you mean converting SATA/SCT information into SCST temperature pages and reporting temperature purely based on those. I personally think that this would be the wrong approach: It would effectively require code in the ATA core which is not really needed there. This would bloat the ATA code with no real advantage. In my opinion, available temperature information should be interpreted where it is needed, and only there, not in several places. I see that as much less risky and error prone than spreading the code to multiple places. > both SCSI and ATA drives. > The elegance of my approach is that adding support for reading temperatures from SCSI drives (or, for that matter, USB/UAS drives) would be straightforward. All one would need to do is to implement the necessary detection code as well as a function to actually read the information from the drive. This can be done at any time, and, again, it should be done by someone with the ability to test the code. > The elephant in the room remains USB. And coming up with a way we can > reliably detect whether it is safe to start poking at the device to > discover if SMART is provided. If we eventually want to pursue USB, > think your heuristic stuff needs to be a library that can be leveraged > by both libata and USB. But that doesn't have to be part of the initial > effort. > > And finally, my concerns wrt. reacting to bad sensors remain. Not too > familiar with hwmon, but I would still like any actions based on > reported temperatures to be under user control and not the kernel. > All sensors can report bad information, and quite often they do. This is actually quite normal in any given system. That doesn't mean that the available (connected) sensors should be ignored. Also, when it comes to actions, the one subsystem performing any actions in the kernel based on temperature sensor information is the thermal subsystem, and that is on purpose implemented in the kernel. The hardware monitoring subsystem, on its own, is purely passive and only reports sensor information; it does not act on it. Any action will either be done by userspace (eg with fancontrol) or by the thermal subsystem. Overall, I understand the desire to also support temperature reporting for SCSI and USB/UAS drives. As hardware monitoring maintainer, I'd be happy to accept patches implementing that support. However, I don't see this as immediately necessary, and I would want to have some reassurance that such code is well tested and doesn't cause any regressions (especially since concerns about possible regressions were mentioned several times in the context of the previous submissions). Thanks, Guenter
Guenter, > If and when drives are detected which report bad information, such > drives can be added to a blacklist without impact on the core SCSI or > ATA code. Until that happens, not loading the driver solves the > problem on any affected system. My only concern with that is that we'll have blacklisting several places. We already have ATA and SCSI blacklists. If we now add a third place, that's going to be a maintenance nightmare. More on that below. >> My concerns are wrt. identifying whether SMART data is available for >> USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID >> controllers that hide the real drives in various ways). OK, so I spent my weekend tinkering with 15+ years of accumulated USB devices. And my conclusion is that no, we can't in any sensible manner, support USB storage monitoring in the kernel. There is no heuristic that I can find that identifies that "this is a hard drive or an SSD and attempting one of the various SMART methods may be safe". As opposed to "this is a USB key that's likely to lock up if you try". And that's ignoring the drives with USB-ATA bridges that I managed to wedge in my attempt at sending down commands. Even smartmontools is failing to work on a huge part of my vintage collection. Thanks to a wide variety of bridges with random, custom interfaces. So my stance on all this is that I'm fine with your general approach for ATA. I will post a patch adding the required bits for SCSI. And if a device does not implement either of the two standard methods, people should use smartmontools. Wrt. name, since I've added SCSI support, satatemp is a bit of a misnomer. drivetemp, maybe? No particular preference. > The one USB/UAS connected SATA drive I have (a WD passport) reports > itself as "WD ", not as "ATA ". I would expect other drives > to do the same. Yes. Most vendors are too fond of their brand names to put "ATA" in there. So my suggestion is to relax the heuristic to trigger on the ATA Information VPD page only and ignore the name. Also, there are some devices that will lock up the way you access that VPD page. So a tweak is also required there. To avoid the multiple blacklists and heuristic collections my suggestion is that I introduce a helper function in SCSI (based on what I did in the disk driver) that can be called to identify whether something is an ATA device. And then the hwmon driver can call that and we can keep the heuristics in one place. If a device turns out to be problematic wrt. getting the ATA VPD for the purpose of SMART, for instance, it will also need to be blacklisted for other reasons in SCSI. So I would really like to keep the heuristics in one place.
On 12/16/19 6:35 PM, Martin K. Petersen wrote: > > Guenter, > >> If and when drives are detected which report bad information, such >> drives can be added to a blacklist without impact on the core SCSI or >> ATA code. Until that happens, not loading the driver solves the >> problem on any affected system. > > My only concern with that is that we'll have blacklisting several > places. We already have ATA and SCSI blacklists. If we now add a third > place, that's going to be a maintenance nightmare. > > More on that below. > >>> My concerns are wrt. identifying whether SMART data is available for >>> USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID >>> controllers that hide the real drives in various ways). > > OK, so I spent my weekend tinkering with 15+ years of accumulated USB > devices. And my conclusion is that no, we can't in any sensible manner, > support USB storage monitoring in the kernel. There is no heuristic that > I can find that identifies that "this is a hard drive or an SSD and > attempting one of the various SMART methods may be safe". As opposed to > "this is a USB key that's likely to lock up if you try". And that's > ignoring the drives with USB-ATA bridges that I managed to wedge in my > attempt at sending down commands. > > Even smartmontools is failing to work on a huge part of my vintage > collection. Thanks to a wide variety of bridges with random, custom > interfaces. > > So my stance on all this is that I'm fine with your general approach for > ATA. I will post a patch adding the required bits for SCSI. And if a > device does not implement either of the two standard methods, people > should use smartmontools. > > Wrt. name, since I've added SCSI support, satatemp is a bit of a > misnomer. drivetemp, maybe? No particular preference. > Agreed, if we extend this to SCSI, satatemp is less than perfect. drivetemp ? disktemp ? I am open to suggestions, with maybe a small personal preference for disktemp out of those two. >> The one USB/UAS connected SATA drive I have (a WD passport) reports >> itself as "WD ", not as "ATA ". I would expect other drives >> to do the same. > > Yes. Most vendors are too fond of their brand names to put "ATA" in > there. So my suggestion is to relax the heuristic to trigger on the ATA > Information VPD page only and ignore the name. > Fine with me. I wanted to be as restrictive as possible. > Also, there are some devices that will lock up the way you access that > VPD page. So a tweak is also required there. > Do you have details ? Do I need to add a call to scsi_device_supports_vpd(), maybe ? > To avoid the multiple blacklists and heuristic collections my suggestion > is that I introduce a helper function in SCSI (based on what I did in > the disk driver) that can be called to identify whether something is an > ATA device. And then the hwmon driver can call that and we can keep the > heuristics in one place. > > If a device turns out to be problematic wrt. getting the ATA VPD for the > purpose of SMART, for instance, it will also need to be blacklisted for > other reasons in SCSI. So I would really like to keep the heuristics in > one place. > Fine with me. My only concern is that I don't want the driver to disappear into nowhere-land (again). Thanks, Guenter
On 2019/12/17 12:57, Guenter Roeck wrote: > On 12/16/19 6:35 PM, Martin K. Petersen wrote: >> >> Guenter, >> >>> If and when drives are detected which report bad information, such >>> drives can be added to a blacklist without impact on the core SCSI or >>> ATA code. Until that happens, not loading the driver solves the >>> problem on any affected system. >> >> My only concern with that is that we'll have blacklisting several >> places. We already have ATA and SCSI blacklists. If we now add a third >> place, that's going to be a maintenance nightmare. >> >> More on that below. >> >>>> My concerns are wrt. identifying whether SMART data is available for >>>> USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID >>>> controllers that hide the real drives in various ways). >> >> OK, so I spent my weekend tinkering with 15+ years of accumulated USB >> devices. And my conclusion is that no, we can't in any sensible manner, >> support USB storage monitoring in the kernel. There is no heuristic that >> I can find that identifies that "this is a hard drive or an SSD and >> attempting one of the various SMART methods may be safe". As opposed to >> "this is a USB key that's likely to lock up if you try". And that's >> ignoring the drives with USB-ATA bridges that I managed to wedge in my >> attempt at sending down commands. >> >> Even smartmontools is failing to work on a huge part of my vintage >> collection. Thanks to a wide variety of bridges with random, custom >> interfaces. >> >> So my stance on all this is that I'm fine with your general approach for >> ATA. I will post a patch adding the required bits for SCSI. And if a >> device does not implement either of the two standard methods, people >> should use smartmontools. >> >> Wrt. name, since I've added SCSI support, satatemp is a bit of a >> misnomer. drivetemp, maybe? No particular preference. >> > Agreed, if we extend this to SCSI, satatemp is less than perfect. > drivetemp ? disktemp ? I am open to suggestions, with maybe a small > personal preference for disktemp out of those two. "disk" tend to imply HDD, excluding SSDs. So my vote goes to "drivetemp", or even the more generic, "devtemp".
On Tue, Dec 17, 2019 at 05:50:17AM +0000, Damien Le Moal wrote: > On 2019/12/17 12:57, Guenter Roeck wrote: > > On 12/16/19 6:35 PM, Martin K. Petersen wrote: > >> > >> Guenter, > >> > >>> If and when drives are detected which report bad information, such > >>> drives can be added to a blacklist without impact on the core SCSI or > >>> ATA code. Until that happens, not loading the driver solves the > >>> problem on any affected system. > >> > >> My only concern with that is that we'll have blacklisting several > >> places. We already have ATA and SCSI blacklists. If we now add a third > >> place, that's going to be a maintenance nightmare. > >> > >> More on that below. > >> > >>>> My concerns are wrt. identifying whether SMART data is available for > >>>> USB/UAS. I am not too worried about ATA and "real" SCSI (ignoring RAID > >>>> controllers that hide the real drives in various ways). > >> > >> OK, so I spent my weekend tinkering with 15+ years of accumulated USB > >> devices. And my conclusion is that no, we can't in any sensible manner, > >> support USB storage monitoring in the kernel. There is no heuristic that > >> I can find that identifies that "this is a hard drive or an SSD and > >> attempting one of the various SMART methods may be safe". As opposed to > >> "this is a USB key that's likely to lock up if you try". And that's > >> ignoring the drives with USB-ATA bridges that I managed to wedge in my > >> attempt at sending down commands. > >> > >> Even smartmontools is failing to work on a huge part of my vintage > >> collection. Thanks to a wide variety of bridges with random, custom > >> interfaces. > >> > >> So my stance on all this is that I'm fine with your general approach for > >> ATA. I will post a patch adding the required bits for SCSI. And if a > >> device does not implement either of the two standard methods, people > >> should use smartmontools. > >> > >> Wrt. name, since I've added SCSI support, satatemp is a bit of a > >> misnomer. drivetemp, maybe? No particular preference. > >> > > Agreed, if we extend this to SCSI, satatemp is less than perfect. > > drivetemp ? disktemp ? I am open to suggestions, with maybe a small > > personal preference for disktemp out of those two. > > "disk" tend to imply HDD, excluding SSDs. So my vote goes to > "drivetemp", or even the more generic, "devtemp". > "devtemp" would apply to all devices with temperature sensors, which would be a bit too generic. I'll take that as a vote for "drivetemp". Guenter
Guenter, >> Also, there are some devices that will lock up the way you access that >> VPD page. So a tweak is also required there. >> > Do you have details ? Do I need to add a call to scsi_device_supports_vpd(), > maybe ? Some devices lock up if you ask for too much data. I actually discovered a VPD handling regression in 5.5 while working on a series of prep patches for you today. Working on a fix. I'll try to get a patch series out for review tomorrow.