Message ID | 20220630091406.19624-1-kch@nvidia.com (mailing list archive)
Series     | block: add support for REQ_OP_VERIFY
On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
> This adds support for the REQ_OP_VERIFY. In this version we add
> support for block layer. NVMe host side, NVMeOF Block device backend,
> and NVMeOF File device backend and null_blk driver.

Who is "we" in this patch set?

> Here is a link for the complete cover-letter for the background to save
> reviewer's time :-
> https://patchwork.kernel.org/project/dm-devel/cover/20211104064634.4481-1-chaitanyak@nvidia.com/

Well, the cover letter should be here.

Also, I don't see the point of an NVMe-only implementation. If we don't
also cover SCSI and ATA there isn't much this API buys a potential user.
On 7/5/22 01:38, Christoph Hellwig wrote:
> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>> This adds support for the REQ_OP_VERIFY. In this version we add
>> support for block layer. NVMe host side, NVMeOF Block device backend,
>> and NVMeOF File device backend and null_blk driver.
>
> Who is "we" in this patch set?
>
>> Here is a link for the complete cover-letter for the background to save
>> reviewer's time :-
>> https://patchwork.kernel.org/project/dm-devel/cover/20211104064634.4481-1-chaitanyak@nvidia.com/
>
> Well, the cover letter should be here.

I'll send out the next version with the updated cover-letter and fix
the wording.

> Also, I don't see the point of an NVMe-only implementation. If we don't
> also cover SCSI and ATA there isn't much this API buys a potential user.

I'll add that in the next version. The device I use for testing the
SCSI code went bad; I think scsi_debug should be okay for testing until
I have the SCSI controller.

-ck
On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
> This adds support for the REQ_OP_VERIFY. In this version we add
IMO, VERIFY is a useless command. The history of storage is full of
devices which simply lie. Since there's no way for the host to check if
the device did any work, cheap devices may simply implement it as a NOOP.
Even expensive devices where there's an ironclad legal contract between
the vendor and customer may have bugs that result in only some of the
bytes being VERIFYed. We shouldn't support it.
Now, everything you say about its value (not consuming bus bandwidth)
is true, but the device should provide the host with proof-of-work.
I'd suggest calculating some kind of checksum, even something like a
SHA-1 of the contents would be worth having. It doesn't need to be
crypto-secure; just something the host can verify the device didn't spoof.
On 7/6/22 10:42, Matthew Wilcox wrote:
> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>> This adds support for the REQ_OP_VERIFY. In this version we add
>
> IMO, VERIFY is a useless command. The history of storage is full of
> devices which simply lie. Since there's no way for the host to check if
> the device did any work, cheap devices may simply implement it as a NOOP.

Thanks for sharing your feedback regarding cheap devices.

This falls outside of the scope of this work, as the scope of this work
is not to analyze different vendor implementations of the verify command.

> Even expensive devices where there's an ironclad legal contract between
> the vendor and customer may have bugs that result in only some of the
> bytes being VERIFYed. We shouldn't support it.

This is not true of enterprise SSDs. I've been involved with product
qualification of high-end enterprise SSDs since 2012, including good
old non-NVMe devices with e.g. the skd driver on Linux/Windows/VMware.

At product qualification time for large data centers every single
feature gets reviewed in excruciating architectural detail in the data
center environment, and a detailed analysis of the feature, including
running cost and actual impact, is performed when Service Level
Agreements are confirmed between the vendor and the client. In case the
vendor fails to meet the SLA, the product gets disqualified.

What you are describing is a vendor failing to meet the SLA, and I
think we shouldn't consider vendor-specific implementations for a
generic feature.

> Now, everything you say about its value (not consuming bus bandwidth)
> is true, but the device should provide the host with proof-of-work.

Yes, that seems to be missing, but it is not a blocker for this work
since the protocol needs to provide this information.

We can update the respective specification to add a log page which
shows proof of work for the verify command, e.g. a log page consisting
of information such as:

1. How many LBAs were verified and how long it took.
2. What kind of errors were detected.
3. How many blocks were moved to a safe location.
4. How much data (LBAs) has been moved successfully.
5. How much data was lost permanently to uncorrectable errors.
6. What the impact is on the overall size of the storage, e.g. for
   flash, a reduction in over-provisioning due to uncorrectable errors.

But clearly this is outside of the scope of this work; if you are
willing to work on it I'd be happy to draft a TP and work with you.

> I'd suggest calculating some kind of checksum, even something like a
> SHA-1 of the contents would be worth having. It doesn't need to be
> crypto-secure; just something the host can verify the device didn't spoof.

I did not understand exactly what you mean here.

-ck
On 13.07.22 11:14, Chaitanya Kulkarni wrote:
>> I'd suggest calculating some kind of checksum, even something like a
>> SHA-1 of the contents would be worth having. It doesn't need to be
>> crypto-secure; just something the host can verify the device didn't spoof.
> I did not understand exactly what you mean here.

I _think_ what Willy wants to say here is, we need some kind of
"out-of-band" checksum to verify the device is not lying to us.
Something like the checksums for each data block that e.g. btrfs has.

On read, we verify the calculated checksum of the payload against the
saved one, and if they don't match (for whatever reason) we return
-EIO. As the device doesn't know the location of the data checksum, it
can't spoof it for the host.

(Excuse me for my btrfs-centric view on this topic, but that's what I
know and what I'm used to)
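[Editor's note: a minimal standalone sketch of the out-of-band checksum idea described above, not btrfs code. The host keeps a checksum per data block in its own metadata, so the device cannot spoof it; CRC-32C and the 4 KiB block size are arbitrary example choices.]

```c
/*
 * Sketch only: host-side per-block checksums kept out of band.
 * CRC-32C (Castagnoli) is used purely as an example.
 */
#include <stdint.h>
#include <stddef.h>

#define BLK_SIZE 4096u

static uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = ~0u;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
	}
	return ~crc;
}

/* At write time: remember the checksum for this logical block. */
static void csum_save(uint32_t *csum_tree, uint64_t lba, const void *data)
{
	csum_tree[lba] = crc32c(data, BLK_SIZE);
}

/* At read time: 0 on match, -EIO-style failure on mismatch. */
static int csum_check(const uint32_t *csum_tree, uint64_t lba, const void *data)
{
	return crc32c(data, BLK_SIZE) == csum_tree[lba] ? 0 : -5 /* EIO */;
}
```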
On Wed, Jul 13, 2022 at 09:14:42AM +0000, Chaitanya Kulkarni wrote:
> On 7/6/22 10:42, Matthew Wilcox wrote:
> > On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
> >> This adds support for the REQ_OP_VERIFY. In this version we add
> >
> > IMO, VERIFY is a useless command. The history of storage is full of
> > devices which simply lie. Since there's no way for the host to check if
> > the device did any work, cheap devices may simply implement it as a NOOP.
>
> Thanks for sharing your feedback regarding cheap devices.
>
> This falls outside of the scope of this work, as the scope of this work
> is not to analyze different vendor implementations of the verify command.

The work is pointless. As a customer, I can't ever use the VERIFY
command because I have no reason for trusting the outcome. And there's
no way for a vendor to convince me that I should trust the result.

> > Even expensive devices where there's an ironclad legal contract between
> > the vendor and customer may have bugs that result in only some of the
> > bytes being VERIFYed. We shouldn't support it.
>
> This is not true of enterprise SSDs. I've been involved with product
> qualification of high-end enterprise SSDs since 2012, including good
> old non-NVMe devices with e.g. the skd driver on Linux/Windows/VMware.

Oh, I'm sure there's good faith at the high end. But bugs happen in
firmware, and everybody knows it.

> > Now, everything you say about its value (not consuming bus bandwidth)
> > is true, but the device should provide the host with proof-of-work.
>
> Yes, that seems to be missing, but it is not a blocker for this work
> since the protocol needs to provide this information.

There's no point in providing access to a feature when that feature is
not useful.

> We can update the respective specification to add a log page which
> shows proof of work for the verify command, e.g. a log page consisting
> of information such as:
>
> 1. How many LBAs were verified and how long it took.
> 2. What kind of errors were detected.
> 3. How many blocks were moved to a safe location.
> 4. How much data (LBAs) has been moved successfully.
> 5. How much data was lost permanently to uncorrectable errors.
> 6. What the impact is on the overall size of the storage, e.g. for
>    flash, a reduction in over-provisioning due to uncorrectable errors.

That's not proof of work. That's a claim of work.

> > I'd suggest calculating some kind of checksum, even something like a
> > SHA-1 of the contents would be worth having. It doesn't need to be
> > crypto-secure; just something the host can verify the device didn't spoof.
>
> I did not understand exactly what you mean here.

The firmware needs to prove to me that it *did something*. That it
actually read those bytes that it claims to have verified. The simplest
way to do so is to calculate a hash over the blocks which were read
(maybe the host needs to provide a nonce as part of the VERIFY command
so the drive can't "remember" the checksum).
On Wed, Jul 13, 2022 at 01:17:56PM +0100, Matthew Wilcox wrote:
> The firmware needs to prove to me that it *did something*. That it
> actually read those bytes that it claims to have verified. The simplest
> way to do so is to calculate a hash over the blocks which were read
> (maybe the host needs to provide a nonce as part of the VERIFY command
> so the drive can't "remember" the checksum).

From a protocol perspective, NVMe's verify command currently leaves
8 bytes in the response unused, so this could be a reasonable way to
return something like a crc64 of the verified blocks. The verify
command starts at an arbitrary block and spans an arbitrary number of
them, so I think a device "remembering" all possible combinations as a
way to cheat may be harder than just doing the work :). But this is
just theoretical; it'd be some time before we'd see devices supporting
such a scheme, assuming there's support from the standards committees.
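[Editor's note: purely illustrative of the kind of check described above, not anything specified by NVMe today. The sketch folds a verify-time nonce into a CRC-64 over the verified range; the ECMA-182 polynomial and the nonce placement are assumptions for the example.]

```c
/*
 * Sketch only: CRC-64 over (nonce || data).  A device could return
 * such a value in the 8 unused verify-response bytes and the host
 * could recompute it from its own copy of the data.
 */
#include <stdint.h>
#include <stddef.h>

#define CRC64_POLY 0x42F0E1EBA9EA3693ULL /* ECMA-182, example choice */

static uint64_t crc64_update(uint64_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= (uint64_t)*p++ << 56;
		for (int i = 0; i < 8; i++)
			crc = (crc & (1ULL << 63)) ?
			      (crc << 1) ^ CRC64_POLY : crc << 1;
	}
	return crc;
}

/* Fold in a nonce so a precomputed/remembered value can't be replayed. */
static uint64_t verify_proof(uint64_t nonce, const void *data, size_t len)
{
	/* Byte order of the nonce is left unspecified in this sketch. */
	uint64_t crc = crc64_update(0, &nonce, sizeof(nonce));

	return crc64_update(crc, data, len);
}
```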
On 7/6/22 10:42, Matthew Wilcox wrote:
> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>> This adds support for the REQ_OP_VERIFY. In this version we add
>
> IMO, VERIFY is a useless command. The history of storage is full of
> devices which simply lie. Since there's no way for the host to check if
> the device did any work, cheap devices may simply implement it as a NOOP.

In the past few months at various storage conferences I've talked to
different people to address your comment about devices lying in their
verify implementation or even implementing it as a NOOP.

With all due respect, this is not true.

The Verify command for NVMe has a significant impact when it comes to
doing maintenance work during downtime, work that can otherwise get in
the way when the device is under a heavy workload, e.g. a social media
website being active at peak hours, or high-performance trading hours.
In such a scenario the controller doing maintenance work can increase
the tail latency of the workload, and one can see spikes as the SSD
starts aging; this is also true for file systems and databases when
complex queries generate complex I/O patterns under heavy I/O.

I respect your experience, but calling every single SSD and enterprise
grade SSD vendor a liar is just not true, as the enterprise SSD
qualification process is extremely detail-oriented and each feature and
command gets an excruciating design review, including performance/watt.
So nobody can get away with a lie.

This patch-series has an option to emulate the verify operation. If you
or someone else don't want to trust the device, one can simply turn off
the verify command and let the kernel emulate Verify; that will also
save a ton of bandwidth, especially in the case of fabrics, and if it
is not there I'll make sure to have it in the final version (see the
sketch after this mail for what the emulation amounts to).

> Even expensive devices where there's an ironclad legal contract between
> the vendor and customer may have bugs that result in only some of the
> bytes being VERIFYed. We shouldn't support it.
>

One cannot simply write anything without bugs; see how many quirks we
have for NVMe PCIe SSDs. That doesn't stop us from supporting features
in the kernel: if one can't trust the device, just add a quirk and
emulate. Bugs don't make this feature useless or the design of the
verify command unusable.

There is a significant reason why it exists in the major device specs
on which data-center workloads are running. Just because there are a
few cheap vendors not being authentic, we cannot call multiple TWGs'
decision to add the verify command useless and not add support. In
fact, all file systems should be sending the verify command as a part
of scrubbing, which only XFS does as far as I know, since the lack of
an interface is preventing its use.

> Now, everything you say about its value (not consuming bus bandwidth)
> is true, but the device should provide the host with proof-of-work.
> I'd suggest calculating some kind of checksum, even something like a
> SHA-1 of the contents would be worth having. It doesn't need to be
> crypto-secure; just something the host can verify the device didn't spoof.

I'm not sure how SSD vendors will entertain the proof-of-work idea,
since it will open the door for similar questions about discard and
other commands; one of the blatant answers I got was "if you don't
trust it, don't buy it".

I'm absolutely not discarding your concerns and the idea of proof of
work. I'm willing to work with you in the TWG and submit the proposal
offline, but right now there is no support for this in the existing
specs.

-ck
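[Editor's note: for readers unfamiliar with what "emulating" verify means in practice, here is a minimal userspace approximation, not the series' block-layer code: read the LBA range back in chunks, throw the data away, and report the first read error. The chunk size and alignment assumptions are arbitrary.]

```c
/*
 * Userspace sketch of verify emulation: read the range with O_DIRECT,
 * discard the data, return the first I/O error.  The real series would
 * do this with bios inside the kernel.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (1u << 20)	/* 1 MiB per read, arbitrary */

/* start and len are assumed to be multiples of the logical block size. */
static int emulate_verify(const char *dev, off_t start, off_t len)
{
	void *buf;
	int fd, ret = 0;

	if (posix_memalign(&buf, 4096, CHUNK))
		return -ENOMEM;

	fd = open(dev, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		free(buf);
		return -errno;
	}

	while (len > 0) {
		size_t n = len > CHUNK ? CHUNK : (size_t)len;
		ssize_t rd = pread(fd, buf, n, start);

		if (rd <= 0) {		/* unreadable block(s) in range */
			ret = rd ? -errno : -EIO;
			break;
		}
		start += rd;
		len -= rd;
	}

	close(fd);
	free(buf);
	return ret;
}
```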
On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> So nobody can get away with a lie.
And yet devices do exist which lie. I'm not surprised that vendors
vehemently claim that they don't, or "nobody would get away with it".
But, of course, they do. And there's no way for us to find out if
they're lying!
On 12/1/22 19:12, Chaitanya Kulkarni wrote:
> On 7/6/22 10:42, Matthew Wilcox wrote:
>> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>>> This adds support for the REQ_OP_VERIFY. In this version we add
>>
>> IMO, VERIFY is a useless command. The history of storage is full of
>> devices which simply lie. Since there's no way for the host to check if
>> the device did any work, cheap devices may simply implement it as a NOOP.
>
> In the past few months at various storage conferences I've talked to
> different people to address your comment about devices lying in their
> verify implementation or even implementing it as a NOOP.
>
> With all due respect, this is not true.
>
[ .. ]

Be careful not to fall into the copy-offload trap. The arguments given
pretty much mirror the discussion we had while designing and
implementing copy offload.

Turns out that the major benefit for any of these 'advanced' commands
is not so much functionality but rather performance. If the
functionality provided by the command is _slower_ than the host doing
it manually, there hardly is a point in even calling that
functionality; we've learned that the hard way with copy offload, and I
guess the same is true for 'verify'.

Thing is, none of the commands will _tell_ you how long that command
will take; you just have to issue it and wait for it to complete.
(Remember TRIM? And devices which took minutes to complete it?)
For copy offload we only recently added a bit telling the application
whether it's worth calling it, or if we should rather do it the
old-fashioned way.

But all of the above discussion has nothing to do with the
implementation. Why should we disallow an implementation of a
functionality which is present in hardware? How could we figure out the
actual usability if we don't test? Where is the innovation?

We need room for experimenting; if it turns out to be useless we can
always disable or remove it later. But we shouldn't bar implementations
because hardware _might_ be faulty.

I do see at least two topics for the next LSF:
- Implementation of REQ_OP_VERIFY
- Innovation rules for the kernel
...

Cheers,

Hannes
On 12/1/22 20:39, Matthew Wilcox wrote:
> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
>> So nobody can get away with a lie.
>
> And yet devices do exist which lie. I'm not surprised that vendors
> vehemently claim that they don't, or "nobody would get away with it".
> But, of course, they do. And there's no way for us to find out if
> they're lying!
>
But we'll never be able to figure that out unless we try.

Once we've tried we will have proof either way. Which is why I think we
should have that implementation. Only then will we know if it's worth
it, both from the hardware and the application side. Without an
implementation it'll just degrade into a shouting match.

Cheers,

Hannes
On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
> On 12/1/22 20:39, Matthew Wilcox wrote:
> > On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> > > So nobody can get away with a lie.
> >
> > And yet devices do exist which lie. I'm not surprised that vendors
> > vehemently claim that they don't, or "nobody would get away with it".
> > But, of course, they do. And there's no way for us to find out if
> > they're lying!
> >
> But we'll never be able to figure that out unless we try.
>
> Once we've tried we will have proof either way.

As long as the protocols don't provide proof-of-work, trying this
doesn't really prove anything with respect to this concern.
> From: Keith Busch
> On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
> > On 12/1/22 20:39, Matthew Wilcox wrote:
> > > On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> > > > So nobody can get away with a lie.
> > >
> > > And yet devices do exist which lie. I'm not surprised that vendors
> > > vehemently claim that they don't, or "nobody would get away with it".
> > > But, of course, they do. And there's no way for us to find out if
> > > they're lying!

My guess, if true, is it's rationalized with the device already doing
patrols in the background - why verify when it's already been recently
patrolled?

> > But we'll never be able to figure that out unless we try.
> >
> > Once we've tried we will have proof either way.
>
> As long as the protocols don't provide proof-of-work, trying this
> doesn't really prove anything with respect to this concern.

I'm out of my depth here, but isn't VERIFY tightly related to PI and at
the heart of detecting SAN bit-rot? The proof of work can be via
end-to-end data protection. VERIFY has to actually read to detect bad
host-generated PI guard/tags. I'm assuming the PI checks can be
disabled for WRITE and enabled for VERIFY as the test.
On Fri, Dec 02, 2022 at 05:33:33PM +0000, Clay Mayers wrote:
> > As long as the protocols don't provide proof-of-work, trying this
> > doesn't really prove anything with respect to this concern.
>
> I'm out of my depth here, but isn't VERIFY tightly related to PI and
> at the heart of detecting SAN bit-rot? The proof of work can be via
> end-to-end data protection. VERIFY has to actually read to detect bad
> host-generated PI guard/tags. I'm assuming the PI checks can be
> disabled for WRITE and enabled for VERIFY as the test.

I suppose if you happen to have a PI-formatted drive, you could WRITE
garbage Guard tags with PRCHK disabled, then see if VERIFY with PRCHK
enabled returns the Guard Check Error; but PI is an optional format
feature orthogonal to the VERIFY command: we can't count on that format
being available in most implementations.
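[Editor's note: to make the guard-tag test above concrete, the T10 DIF guard tag is a CRC-16 over the data block with generator polynomial 0x8BB7. A host that deliberately writes a wrong guard, as described, would expect a verify that really reads the data to fail the guard check. Standalone sketch, not kernel code; the kernel has crc_t10dif() in lib/crc-t10dif.c for the same calculation.]

```c
/*
 * T10 DIF guard tag: CRC-16, polynomial 0x8BB7, init 0, no reflection.
 */
#include <stdint.h>
#include <stddef.h>

static uint16_t t10dif_crc(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint16_t crc = 0;

	while (len--) {
		crc ^= (uint16_t)*p++ << 8;
		for (int i = 0; i < 8; i++)
			crc = (crc & 0x8000) ? (crc << 1) ^ 0x8BB7 : crc << 1;
	}
	return crc;
}

/*
 * Deliberately corrupt the guard so a spec-conforming VERIFY with
 * guard checking enabled must report a Guard Check Error.
 */
static uint16_t bogus_guard(const void *block, size_t len)
{
	return ~t10dif_crc(block, len);
}
```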
> On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
>>> On 12/1/22 20:39, Matthew Wilcox wrote:
>>> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
>>>> So nobody can get away with a lie.
>>>
>>> And yet devices do exist which lie. I'm not surprised that vendors
>>> vehemently claim that they don't, or "nobody would get away with it".
>>> But, of course, they do. And there's no way for us to find out if
>>> they're lying!
>>>
>> But we'll never be able to figure that out unless we try.
>>
>> Once we've tried we will have proof either way.
>
> As long as the protocols don't provide proof-of-work, trying this
> doesn't really prove anything with respect to this concern.

Is this something we should bring to NVMe? Seems like the main
disagreement can be addressed there.

I will check internally if there is any existing proof-of-work that we
are missing.
On Sat, Dec 03, 2022 at 07:19:17AM +0300, Javier González wrote:
> > On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
> >
> > On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
> >>> On 12/1/22 20:39, Matthew Wilcox wrote:
> >>> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> >>>> So nobody can get away with a lie.
> >>>
> >>> And yet devices do exist which lie. I'm not surprised that vendors
> >>> vehemently claim that they don't, or "nobody would get away with it".
> >>> But, of course, they do. And there's no way for us to find out if
> >>> they're lying!
> >>>
> >> But we'll never be able to figure that out unless we try.
> >>
> >> Once we've tried we will have proof either way.
> >
> > As long as the protocols don't provide proof-of-work, trying this
> > doesn't really prove anything with respect to this concern.
>
> Is this something we should bring to NVMe? Seems like the main
> disagreement can be addressed there.

Yeah, proof for the host appears to require a new feature, so we'd need
to bring this to the TWG. I can draft a TPAR if there's interest and
have ideas on how the feature could be implemented, but I currently
don't have enough skin in this game to sponsor it.
On Sat, Dec 03, 2022 at 07:19:17AM +0300, Javier González wrote:
> > On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
> > As long as the protocols don't provide proof-of-work, trying this
> > doesn't really prove anything with respect to this concern.
>
> Is this something we should bring to NVMe? Seems like the main
> disagreement can be addressed there.
>
> I will check internally if there is any existing proof-of-work that we
> are missing.

I think the right thing for NVMe to standardise is a new command, HASH.
It should be quite similar to the READ command, at least for command
Dwords 10-15. Need to specify the hash algorithm to use somewhere
(maybe it's a per-namespace setting, maybe a per-command setting).
The ctte will have to decide which hash algos are permitted / required
(it probably makes sense to permit even some quite weak hashes as we're
not necessarily looking to have cryptographic security).

While there may be ways to make use of this command for its obvious
usage (ie actually producing a hash and communicating that to somebody
else), its real purpose is for the drive to read the data from storage
and prove it's still there without transferring it to the host, as
discussed in this thread.
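[Editor's note: nothing like this exists in any spec today. Purely to make the proposal tangible, here is a hedged sketch of how a host might issue such a hypothetical HASH command through the existing NVMe passthrough interface. The opcode, the Dword layout mirroring READ, and returning the digest through a host buffer are all assumptions, not anything standardised.]

```c
/*
 * Hypothetical only: issuing a future "HASH" I/O command via the NVMe
 * passthrough ioctl.  Opcode and Dword layout are made up for the
 * sketch (SLBA in CDW10/11, 0-based NLB in CDW12, nonce in CDW14/15).
 */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

#define NVME_CMD_HASH	0xe0	/* made-up opcode, not in any spec */

static int nvme_hash_range(int fd, uint32_t nsid, uint64_t slba,
			   uint32_t nlb, uint64_t nonce,
			   uint8_t *digest, uint32_t digest_len)
{
	struct nvme_passthru_cmd cmd;

	memset(&cmd, 0, sizeof(cmd));
	cmd.opcode   = NVME_CMD_HASH;
	cmd.nsid     = nsid;
	cmd.addr     = (uintptr_t)digest;	/* digest comes back here */
	cmd.data_len = digest_len;
	cmd.cdw10    = (uint32_t)slba;		/* SLBA, low 32 bits  */
	cmd.cdw11    = (uint32_t)(slba >> 32);	/* SLBA, high 32 bits */
	cmd.cdw12    = nlb - 1;			/* 0-based block count */
	cmd.cdw14    = (uint32_t)nonce;		/* nonce so the result */
	cmd.cdw15    = (uint32_t)(nonce >> 32);	/* can't be precomputed */

	return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
}
```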
On 04.12.2022 20:29, Keith Busch wrote:
>On Sat, Dec 03, 2022 at 07:19:17AM +0300, Javier González wrote:
>>
>> > On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
>> >
>> > On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
>> >>> On 12/1/22 20:39, Matthew Wilcox wrote:
>> >>> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
>> >>>> So nobody can get away with a lie.
>> >>>
>> >>> And yet devices do exist which lie. I'm not surprised that vendors
>> >>> vehemently claim that they don't, or "nobody would get away with it".
>> >>> But, of course, they do. And there's no way for us to find out if
>> >>> they're lying!
>> >>>
>> >> But we'll never be able to figure that out unless we try.
>> >>
>> >> Once we've tried we will have proof either way.
>> >
>> > As long as the protocols don't provide proof-of-work, trying this
>> > doesn't really prove anything with respect to this concern.
>>
>> Is this something we should bring to NVMe? Seems like the main
>> disagreement can be addressed there.
>
>Yeah, proof for the host appears to require a new feature, so we'd need
>to bring this to the TWG. I can draft a TPAR if there's interest and
>have ideas on how the feature could be implemented, but I currently
>don't have enough skin in this game to sponsor it.

Happy to review the TPAR, but I am in a similar situation.
> My guess, if true, is it's rationalized with the device already doing
> patrols in the background - why verify when it's already been recently
> patrolled?

The original SCSI VERIFY operation allowed RAID array firmware to do
background scrubs before disk drives developed anything resembling
sophisticated media management. If verification of a block failed, the
array firmware could go ahead and reconstruct the data from the rest of
the stripe in the background. This substantially reduced the risk of
having to perform block reconstruction in the hot path. And
verification did not have to burden the already slow array
CPU/memory/DMA combo with transferring every block on every attached
drive.

I suspect that these days it is very hard to find a storage device that
doesn't do media management internally in the background. So from the
perspective of physically exercising the media, VERIFY is probably not
terribly useful anymore.

In that light, having to run VERIFY over the full block range of a
device to identify unreadable blocks seems like a fairly clunky
mechanism. Querying the device for a list of unrecoverable blocks
already identified by the firmware seems like a better interface.

I am not sure I understand this whole "proof that the drive did
something" requirement. If a device lies and implements VERIFY as a
noop it just means you'll get the error during a future READ operation
instead. No matter what, a successful VERIFY is obviously no guarantee
that a future READ of a given block will be possible. But it doesn't
matter, because the useful outcome of a verify operation is the
failure, not the success. It's the verification failure scenario which
allows you to take a corrective action.

If you really want to validate a device's VERIFY implementation, we do
have WRITE UNCORRECTABLE commands in both SCSI and NVMe which allow you
to do that. But I think device validation is a secondary issue. The
more pertinent question is whether we have use cases in the kernel (MD,
btrfs) which would benefit from being able to preemptively identify
unreadable blocks?
Martin K. Petersen (martin.petersen@oracle.com) wrote on Fri, Dec 09, 2022 at 01:52:01AM -03:
> I suspect that these days it is very hard to find a storage device that
> doesn't do media management internally in the background. So from the
> perspective of physically exercising the media, VERIFY is probably not
> terribly useful anymore.
>
> In that light, having to run VERIFY over the full block range of a
> device to identify unreadable blocks seems like a fairly clunky
> mechanism. Querying the device for a list of unrecoverable blocks
> already identified by the firmware seems like a better interface.

Sure.

> But I think device validation is a secondary issue. The more
> pertinent question is whether we have use cases in the kernel (MD,
> btrfs) which would benefit from being able to preemptively identify
> unreadable blocks?

Certainly we have. Currently admins have to periodically run full block
range checks in redundant arrays to detect bad blocks and correct them
while redundancy is available. Otherwise, when a disk fails and you try
to reconstruct onto the replacement, you hit another block in the
remaining disks that's bad and you cannot complete the reconstruction,
and you have data loss. These checks are a burden because they have
HIGH overhead, significantly reducing bandwidth for the normal use of
the array.

If there was a standard interface for getting the list of bad blocks
that the firmware secretly knows, the kernel could implement the repair
continuously, with logs etc. That'd really be a relief for admins and,
especially, users.
On Sat, Dec 10, 2022 at 10:06:34AM -0300, Carlos Carvalho wrote:
> Certainly we have. Currently admins have to periodically run full block
> range checks in redundant arrays to detect bad blocks and correct them
> while redundancy is available. Otherwise, when a disk fails and you try
> to reconstruct onto the replacement, you hit another block in the
> remaining disks that's bad and you cannot complete the reconstruction,
> and you have data loss. These checks are a burden because they have
> HIGH overhead, significantly reducing bandwidth for the normal use of
> the array.
>
> If there was a standard interface for getting the list of bad blocks
> that the firmware secretly knows, the kernel could implement the repair
> continuously, with logs etc. That'd really be a relief for admins and,
> especially, users.

Both SCSI and NVMe can do this through the GET LBA STATUS command - in
SCSI this was a later addition abusing the command, and in NVMe only
the abuse survived. NVMe also has a log page and an AEN associated with
it; I'd have to spend more time reading SBC to remember if SCSI also
has a notification mechanism of some sort.
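[Editor's note: purely a sketch of the use case Carlos describes, not existing MD code. It assumes the bad-LBA list has already been retrieved from the device by some means such as the reporting Christoph mentions; extent_t, read_from_peer() and rewrite_extent() are hypothetical helpers.]

```c
/*
 * Sketch only: continuous repair driven by a device-reported bad-LBA
 * list in a two-way mirror.  The helpers below are hypothetical, not
 * kernel or MD APIs.
 */
#include <stdint.h>
#include <stddef.h>

typedef struct {
	uint64_t lba;	/* first unreadable logical block */
	uint32_t nlb;	/* number of unreadable blocks */
} extent_t;

int read_from_peer(int mirror, const extent_t *e, void *buf);
int rewrite_extent(int mirror, const extent_t *e, const void *buf);

/*
 * For each extent the device reported as unrecoverable, fetch a good
 * copy from the healthy mirror leg and rewrite it on the failing one,
 * while full redundancy is still available.
 */
static int repair_reported_extents(const extent_t *list, size_t count,
				   int bad_leg, int good_leg, void *buf)
{
	for (size_t i = 0; i < count; i++) {
		if (read_from_peer(good_leg, &list[i], buf))
			return -1;	/* lost redundancy for this extent */
		if (rewrite_extent(bad_leg, &list[i], buf))
			return -1;	/* rewrite failed; needs remapping */
	}
	return 0;
}
```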