[0/6] block: add support for REQ_OP_VERIFY

Message ID: 20220630091406.19624-1-kch@nvidia.com (mailing list archive)
Series: block: add support for REQ_OP_VERIFY

Message

Chaitanya Kulkarni June 30, 2022, 9:14 a.m. UTC
Hi,

This adds support for REQ_OP_VERIFY. In this version we add support in
the block layer, on the NVMe host side, in the NVMeOF block device and
file device backends, and in the null_blk driver.

In this version we also add a new blkverify userspace tool, along with a
testcases patch for util-linux; that patch will follow this series.
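
For illustration, here is a minimal sketch of what blkverify presumably
boils down to: a byte-range ioctl on the block device, in the style of
BLKDISCARD/BLKZEROOUT. The ioctl name, request number and argument
layout below are assumptions; see the util-linux patch for the real
definitions.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef BLKVERIFY                       /* assumed name of the new ioctl */
#define BLKVERIFY _IO(0x12, 130)        /* placeholder request number */
#endif

int main(int argc, char **argv)
{
	/* offset and length in bytes, as in: blkverify -o 0 -l 40960 <dev> */
	uint64_t range[2] = { 0, 40960 };
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);   /* required open mode is an assumption */
	if (fd < 0 || ioctl(fd, BLKVERIFY, range) < 0) {
		perror("BLKVERIFY");
		return 1;
	}
	close(fd);
	return 0;
}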

Below is a summary of the test log :-

1. NVMeOF bdev-ns null_blk verify=0 (triggering bdev emulation code) :-
-----------------------------------------------------------------------

linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme1n1 
linux-block (for-next) # dmesg -c 
[ 1171.171536] nvmet: nvmet_bdev_emulate_verify_work 467
linux-block (for-next) # nvme verify -s 0 -c 1024 /dev/nvme1n1  
NVME Verify Success
linux-block (for-next) # dmesg -c 
[ 1199.322161] nvmet: nvmet_bdev_emulate_verify_work 467

2. NVMeOF bdev-ns null_blk verify=1.
-----------------------------------------------------------------------

linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme1n1 
linux-block (for-next) # dmesg -c 
[ 1257.661548] nvmet: nvmet_bdev_execute_verify 506
[ 1257.661558] null_blk: null_process_cmd 1406
linux-block (for-next) # nvme verify -s 0 -c 1024 /dev/nvme1n1  
NVME Verify Success
linux-block (for-next) # dmesg -c 
[ 1269.613415] nvmet: nvmet_bdev_execute_verify 506
[ 1269.613425] null_blk: null_process_cmd 1406

3. NVMeOF file-ns :-
-----------------------------------------------------------------------

linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme1n1 
linux-block (for-next) # dmesg -c 
[ 3452.675959] nvme_setup_verify 844
[ 3452.675969] nvmet: nvmet_file_execute_verify 525
[ 3452.675971] nvmet: nvmet_file_emulate_verify_work 502
[ 3452.675972] nvmet: do_direct_io_emulate_verify 431
linux-block (for-next) # nvme verify -s 0 -c 1024 /dev/nvme1n1
NVME Verify Success
linux-block (for-next) # dmesg  -c 
[ 3459.794385] nvmet: nvmet_file_execute_verify 525
[ 3459.794389] nvmet: nvmet_file_emulate_verify_work 502
[ 3459.794391] nvmet: do_direct_io_emulate_verify 431

4. NVMe PCIe device.
-----------------------------------------------------------------------

linux-block (for-next) # modprobe nvme
linux-block (for-next) # blkverify -o 0 -l 40960 /dev/nvme0n1
linux-block (for-next) # dmesg  -c
[ 2763.432194] nvme nvme0: pci function 0000:00:04.0
[ 2763.473827] nvme nvme0: 48/0/0 default/read/poll queues
[ 2763.478868] nvme nvme0: Ignoring bogus Namespace Identifiers
[ 2766.583923] nvme_setup_verify 844

Here is a link to the complete cover letter with the background, to save
reviewers' time :-
https://patchwork.kernel.org/project/dm-devel/cover/20211104064634.4481-1-chaitanyak@nvidia.com/

-ck

Chaitanya Kulkarni (6):
  block: add support for REQ_OP_VERIFY
  nvme: add support for the Verify command
  nvmet: add Verify command support for bdev-ns
  nvmet: add Verify emulation support for bdev-ns
  nvmet: add verify emulation support for file-ns
  null_blk: add REQ_OP_VERIFY support

 Documentation/ABI/stable/sysfs-block |  12 +++
 block/blk-core.c                     |   5 +
 block/blk-lib.c                      | 155 +++++++++++++++++++++++++++
 block/blk-merge.c                    |  18 ++++
 block/blk-settings.c                 |  17 +++
 block/blk-sysfs.c                    |   8 ++
 block/blk.h                          |   4 +
 block/ioctl.c                        |  35 ++++++
 drivers/block/null_blk/main.c        |  20 +++-
 drivers/block/null_blk/null_blk.h    |   1 +
 drivers/nvme/host/core.c             |  33 ++++++
 drivers/nvme/target/admin-cmd.c      |   3 +-
 drivers/nvme/target/core.c           |  14 ++-
 drivers/nvme/target/io-cmd-bdev.c    |  75 +++++++++++++
 drivers/nvme/target/io-cmd-file.c    | 152 ++++++++++++++++++++++++++
 drivers/nvme/target/nvmet.h          |   4 +-
 include/linux/bio.h                  |   9 +-
 include/linux/blk_types.h            |   2 +
 include/linux/blkdev.h               |  22 ++++
 include/linux/nvme.h                 |  19 ++++
 include/uapi/linux/fs.h              |   1 +
 21 files changed, 601 insertions(+), 8 deletions(-)
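
For reviewers who want a feel for the plumbing, a minimal sketch (with
assumed names, not the actual code in block/blk-lib.c) of how a caller
could submit a REQ_OP_VERIFY bio for a single range, loosely modeled on
the discard/write-zeroes helpers:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical sketch only; a real helper would also split the range
 * against the queue's verify limits. */
static int submit_verify_range(struct block_device *bdev, sector_t sector,
			       sector_t nr_sects, gfp_t gfp_mask)
{
	struct bio *bio;
	int ret;

	/* REQ_OP_VERIFY carries only an LBA range, no data pages. */
	bio = bio_alloc(bdev, 0, REQ_OP_VERIFY | REQ_SYNC, gfp_mask);
	bio->bi_iter.bi_sector = sector;
	bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT;

	ret = submit_bio_wait(bio);
	bio_put(bio);
	return ret;
}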

Comments

Christoph Hellwig July 5, 2022, 8:38 a.m. UTC | #1
On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
> This adds support for the REQ_OP_VERIFY. In this version we add
> support for block layer. NVMe host side, NVMeOF Block device backend,
> and NVMeOF File device backend and null_blk driver.

Who is "we" in this patch set?

> Here is a link for the complete cover-letter for the background to save
> reviewer's time :-
> https://patchwork.kernel.org/project/dm-devel/cover/20211104064634.4481-1-chaitanyak@nvidia.com/

Well, the cover letter should be here.

Also I don't see how an NVMe-only implementation helps.  If we don't also
cover SCSI and ATA there isn't much this API buys a potential user.
Chaitanya Kulkarni July 5, 2022, 4:47 p.m. UTC | #2
On 7/5/22 01:38, Christoph Hellwig wrote:
> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>> This adds support for the REQ_OP_VERIFY. In this version we add
>> support for block layer. NVMe host side, NVMeOF Block device backend,
>> and NVMeOF File device backend and null_blk driver.
> 
> Who is "we" in this patch set?
> 
>> Here is a link for the complete cover-letter for the background to save
>> reviewer's time :-
>> https://patchwork.kernel.org/project/dm-devel/cover/20211104064634.4481-1-chaitanyak@nvidia.com/
> 
> Well, the cover letter should be here.

I'll send out the next version with an updated cover letter and fix the wording.

> 
> Also I don't see how an NVMe-only implementation.  If we don't also
> cover SCSI and ATA there isn't much this API buys a potential user.

I'll add it in the next version. The device I use to test the SCSI
code went bad; I think scsi_debug should be okay for testing until
I have a SCSI controller again.

-ck
Matthew Wilcox July 6, 2022, 5:42 p.m. UTC | #3
On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
> This adds support for the REQ_OP_VERIFY. In this version we add

IMO, VERIFY is a useless command.  The history of storage is full of
devices which simply lie.  Since there's no way for the host to check if
the device did any work, cheap devices may simply implement it as a NOOP.
Even expensive devices where there's an ironclad legal contract between
the vendor and customer may have bugs that result in only some of the
bytes being VERIFYed.  We shouldn't support it.

Now, everything you say about its value (not consuming bus bandwidth)
is true, but the device should provide the host with proof-of-work.
I'd suggest calculating some kind of checksum, even something like a
SHA-1 of the contents would be worth having.  It doesn't need to be
crypto-secure; just something the host can verify the device didn't spoof.
Chaitanya Kulkarni July 13, 2022, 9:14 a.m. UTC | #4
On 7/6/22 10:42, Matthew Wilcox wrote:
> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>> This adds support for the REQ_OP_VERIFY. In this version we add
> 
> IMO, VERIFY is a useless command.  The history of storage is full of
> devices which simply lie.  Since there's no way for the host to check if
> the device did any work, cheap devices may simply implement it as a NOOP.

Thanks for sharing your feedback regarding cheap devices.

This falls outside the scope of this work; the goal here is not to
analyze different vendors' implementations of the Verify command.

> Even expensive devices where there's an ironclad legal contract between
> the vendor and customer may have bugs that result in only some of the
> bytes being VERIFYed.  We shouldn't support it.
This is not true for enterprise SSDs. I've been involved in product
qualification of high-end enterprise SSDs since 2012, including good
old non-NVMe devices, e.g. with the skd driver on Linux/Windows/VMware.

At product qualification time for large data centers, every single
feature gets reviewed in excruciating architectural detail in the
data center environment, and a detailed analysis of the feature,
including running cost and actual impact, is done when Service Level
Agreements are confirmed between the vendor and the client. If the
vendor fails to meet the SLA, the product gets disqualified.

What you are describing is a vendor failing to meet its SLA, and I
don't think vendor-specific implementations should decide whether we
support a generic feature.

> 
> Now, everything you say about its value (not consuming bus bandwidth)
> is true, but the device should provide the host with proof-of-work.

Yes, that does seem to be missing, but it is not a blocker for this
work, since the protocol needs to provide this information.

We can update the respective specification to add a log page which
shows proof of work for the Verify command, e.g. a log page consisting
of information such as :-

1. How many LBAs were verified and how long it took.
2. What kind of errors were detected.
3. How many blocks were moved to a safe location.
4. How much data (how many LBAs) was moved successfully.
5. How much data was lost permanently to uncorrectable errors.
6. What the impact is on the overall size of the storage, e.g. for
   flash, a reduction in over-provisioning due to uncorrectable errors.

But clearly this is outside the scope of this work; if you are willing
to work on this I'd be happy to draft a TP and work with you.

> I'd suggest calculating some kind of checksum, even something like a
> SHA-1 of the contents would be worth having.  It doesn't need to be
> crypto-secure; just something the host can verify the device didn't spoof.

I did not understand exactly what you mean here.

-ck
Johannes Thumshirn July 13, 2022, 9:36 a.m. UTC | #5
On 13.07.22 11:14, Chaitanya Kulkarni wrote:
>> I'd suggest calculating some kind of checksum, even something like a
>> SHA-1 of the contents would be worth having.  It doesn't need to be
>> crypto-secure; just something the host can verify the device didn't spoof.
> I did not understand exactly what you mean here.

I _think_ what Willy wants to say here is, we need some kind of "out-of-band"
checksums to verify the device is not lying to us.

Something like the checksums for each data block that i.e. btrfs has. On read,
we're verifying the calculated checksum of the payload with the saved one and
if they're not matching (for whatever reason) return -EIO.

As the device doesn't know the location of the data checksum, it can't spoof it
for the host.

(Excuse me for my btrfs centric view on this topic, but that's what I know and
what I'm used to)
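
To make the idea concrete, a minimal sketch of that host-side check (not
btrfs code; crc32c() just stands in for whatever checksum the filesystem
stores out of band):

#include <linux/crc32c.h>
#include <linux/errno.h>
#include <linux/types.h>

static int check_block_against_stored_csum(const void *data, size_t len,
					   u32 stored_csum)
{
	/* Recompute the checksum over the payload we just read back. */
	u32 csum = crc32c(~0, data, len);

	/* The device cannot spoof this; it has no idea where stored_csum lives. */
	return (csum == stored_csum) ? 0 : -EIO;
}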
Matthew Wilcox July 13, 2022, 12:17 p.m. UTC | #6
On Wed, Jul 13, 2022 at 09:14:42AM +0000, Chaitanya Kulkarni wrote:
> On 7/6/22 10:42, Matthew Wilcox wrote:
> > On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
> >> This adds support for the REQ_OP_VERIFY. In this version we add
> > 
> > IMO, VERIFY is a useless command.  The history of storage is full of
> > devices which simply lie.  Since there's no way for the host to check if
> > the device did any work, cheap devices may simply implement it as a NOOP.
> 
> Thanks for sharing your feedback regarding cheap devices.
> 
> This falls outside of the scope of the work, as scope of this work is
> not to analyze different vendor implementations of the verify command.

The work is pointless.  As a customer, I can't ever use the VERIFY
command because I have no reason for trusting the outcome.  And there's
no way for a vendor to convince me that I should trust the result.

> > Even expensive devices where there's an ironclad legal contract between
> > the vendor and customer may have bugs that result in only some of the
> > bytes being VERIFYed.  We shouldn't support it.
> This is not true with enterprise SSDs, I've been involved with product
> qualification of the high end enterprise SSDs since 2012 including good
> old non-nvme devices with e.g. skd driver on linux/windows/vmware.

Oh, I'm sure there's good faith at the high end.  But bugs happen in
firmware, and everybody knows it.

> > Now, everything you say about its value (not consuming bus bandwidth)
> > is true, but the device should provide the host with proof-of-work.
> 
> Yes that seems to be missing but it is not a blocker in this work since
> protocol needs to provide this information.

There's no point in providing access to a feature when that feature is
not useful.

> We can update the respective specification to add a log page which
> shows proof of work for verify command e.g.
> A log page consist of the information such as :-
> 
> 1. How many LBAs were verified ? How long it took.
> 2. What kind of errors were detected ?
> 3. How many blocks were moved to safe location ?
> 4. How much data (LBAs) been moved successfully ?
> 5. How much data we lost permanently with uncorrectible errors?
> 6. What is the impact on the overall size of the storage, in
>     case of flash reduction in the over provisioning due to
>     uncorrectible errors.

That's not proof of work.  That's claim of work.

> > I'd suggest calculating some kind of checksum, even something like a
> > SHA-1 of the contents would be worth having.  It doesn't need to be
> > crypto-secure; just something the host can verify the device didn't spoof.
> 
> I did not understand exactly what you mean here.

The firmware needs to prove to me that it *did something*.  That it
actually read those bytes that it claims to have verified.  The simplest
way to do so is to calculate a hash over the blocks which were read
(maybe the host needs to provide a nonce as part of the VERIFY command
so the drive can't "remember" the checksum).
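
A toy sketch of how such a nonce-keyed proof could be spot-checked by
the host: the device would return H(nonce || data) with the VERIFY
completion, and the host can occasionally read a range back itself and
recompute. FNV-1a is just a placeholder hash for illustration; nothing
here is a proposed wire format.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t fnv1a64(uint64_t h, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	while (len--) {
		h ^= *p++;
		h *= 0x100000001b3ULL;          /* FNV-1a 64-bit prime */
	}
	return h;
}

/* Host-side spot check over data the host read back on its own. */
static bool verify_proof(uint64_t nonce, const void *data, size_t len,
			 uint64_t digest_from_device)
{
	uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a offset basis */

	h = fnv1a64(h, &nonce, sizeof(nonce));
	h = fnv1a64(h, data, len);
	return h == digest_from_device;
}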
Keith Busch July 13, 2022, 2:04 p.m. UTC | #7
On Wed, Jul 13, 2022 at 01:17:56PM +0100, Matthew Wilcox wrote:
> The firmware needs to prove to me that it *did something*.  That it
> actually read those bytes that it claims to have verified.  The simplest
> way to do so is to calculate a hash over the blocks which were read
> (maybe the host needs to provide a nonce as part of the VERIFY command
> so the drive can't "remember" the checksum).

From a protocol perspective, NVMe's verify command currently leaves 8 bytes in
the response unused, so this could be a reasonable way to return something like
a crc64 of the verified blocks. The verify command starts at an arbitrary block
and spans an arbitrary number of them, so I think a device "remembering" all
possible combinations as a way to cheat may be harder than just doing the work
:).

But this is just theoretical; it'd be some time before we'd see devices
supporting such a scheme, assuming there's support from the standards
committees.
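
For illustration, the kind of per-range CRC-64 such a scheme implies is
tiny to express; the polynomial below (CRC-64/ECMA-182) is chosen
arbitrarily here, not something the spec defines for Verify today:

#include <stddef.h>
#include <stdint.h>

#define CRC64_ECMA_POLY 0x42F0E1EBA9EA3693ULL

/* Bit-at-a-time CRC-64; fold in each logical block of the verified range. */
static uint64_t crc64_update(uint64_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;
	int i;

	while (len--) {
		crc ^= (uint64_t)*p++ << 56;
		for (i = 0; i < 8; i++)
			crc = (crc & (1ULL << 63)) ?
			      (crc << 1) ^ CRC64_ECMA_POLY : crc << 1;
	}
	return crc;
}

The device would return the final value in those spare completion bytes,
and the host could spot-check it by reading the same range and
recomputing.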
Chaitanya Kulkarni Dec. 1, 2022, 6:12 p.m. UTC | #8
On 7/6/22 10:42, Matthew Wilcox wrote:
> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>> This adds support for the REQ_OP_VERIFY. In this version we add
> 
> IMO, VERIFY is a useless command.  The history of storage is full of
> devices which simply lie.  Since there's no way for the host to check if
> the device did any work, cheap devices may simply implement it as a NOOP.

In the past few months at various storage conferences I've talked to
different people to address your comment about devices lying in their
Verify implementation or even implementing it as a NOOP.

With all due respect, this is not true.

The NVMe Verify command has a significant impact when it comes to doing
maintenance work during downtime; such work can get in the way when the
device is under a heavy workload, e.g. a social media website being
active at peak hours or high-performance trading hours. In such a
scenario the controller doing maintenance work can increase the tail
latency of the workload, and one can see spikes as the SSD ages. This is
also true for file systems and databases when complex queries generate
complex I/O patterns under heavy I/O.

I respect your experience, but calling every single SSD and
enterprise-grade SSD vendor a liar is just not true; the enterprise SSD
qualification process is extremely detail-oriented, and each feature
and command gets an excruciating design review, including
performance/watt.

So nobody can get away with a lie.

This patch series has an option to emulate the verify operation: if you
or someone else doesn't want to trust the device, one can simply turn
off the Verify command and let the kernel emulate it, which still saves
a ton of bandwidth, especially in the fabrics case. If that option is
missing anywhere, I'll make sure it is there in the final version.

> Even expensive devices where there's an ironclad legal contract between
> the vendor and customer may have bugs that result in only some of the
> bytes being VERIFYed.  We shouldn't support it.
> 

One cannot simply write anything without bugs; see how many quirks we
have for NVMe PCIe SSDs. That doesn't stop us supporting features in
the kernel: if one can't trust the device, just add a quirk and emulate.

Bugs don't make this feature useless or the design of the Verify
command unusable. There is a significant reason why it exists in the
major device specs that data-center workloads run on; just because a
few cheap vendors are not being authentic, we cannot call multiple
TWGs' decision to add the Verify command useless and refuse to add
support.

In fact all file systems should be sending the Verify command as part
of scrubbing, which as far as I know only XFS does, since the lack of
an interface is preventing its use.

> Now, everything you say about its value (not consuming bus bandwidth)
> is true, but the device should provide the host with proof-of-work.
> I'd suggest calculating some kind of checksum, even something like a
> SHA-1 of the contents would be worth having.  It doesn't need to be
> crypto-secure; just something the host can verify the device didn't spoof.

I'm not sure how SSD vendors will entertain the proof-of-work idea,
since it will open the door to the same question for Discard and other
commands; one of the blunt answers I got was "if you don't trust it,
don't buy it".

I'm absolutely not dismissing your concerns and the idea of proof of
work; I'm willing to work with you in the TWG and submit the proposal
offline, but right now there is no support for this in the existing
specs.

-ck
Matthew Wilcox Dec. 1, 2022, 7:39 p.m. UTC | #9
On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> So nobody can get away with a lie.

And yet devices do exist which lie.  I'm not surprised that vendors
vehemently claim that they don't, or "nobody would get away with it".
But, of course, they do.  And there's no way for us to find out if
they're lying!
Hannes Reinecke Dec. 2, 2022, 7:13 a.m. UTC | #10
On 12/1/22 19:12, Chaitanya Kulkarni wrote:
> On 7/6/22 10:42, Matthew Wilcox wrote:
>> On Thu, Jun 30, 2022 at 02:14:00AM -0700, Chaitanya Kulkarni wrote:
>>> This adds support for the REQ_OP_VERIFY. In this version we add
>>
>> IMO, VERIFY is a useless command.  The history of storage is full of
>> devices which simply lie.  Since there's no way for the host to check if
>> the device did any work, cheap devices may simply implement it as a NOOP.
> 
> In past few months at various storage conferences I've talked to
> different people to address your comment where device being
> a liar verify implementation or even implementing NOOP.
> 
> With all do respect this is not true.
> 
[ .. ]

Be careful not to fall into the copy-offload trap.
The arguments given pretty much mirror the discussion we had while
designing and implementing copy offload.

It turns out that the major benefit of any of these 'advanced' commands
is not so much functionality but rather performance.
If the functionality provided by the command is _slower_ than the host
doing it manually, there hardly is a point in even calling that
functionality; we learned that the hard way with copy offload, and I
guess the same is true for 'verify'.
Thing is, none of these commands will _tell_ you how long they will
take; you just have to issue them and wait for them to complete.
(Remember TRIM? And devices which took minutes to complete it?)

For copy offload we only recently added a bit telling the application
whether it's worth calling, or whether we should rather do it the
old-fashioned way.

But all of the above discussion has nothing to do with the 
implementation. Why should we disallow an implementation of
functionality which is present in hardware?
How could we figure out the actual usability if we don't test?
Where is the innovation?

We need room for experimenting; if it turns out to be useless we can 
always disable or remove it later. But we shouldn't bar implementations 
because hardware _might_ be faulty.

I do see at least two topics for the next LSF:

- Implementation of REQ_OP_VERIFY
- Innovation rules for the kernel
...

Cheers,

Hannes
Hannes Reinecke Dec. 2, 2022, 7:16 a.m. UTC | #11
On 12/1/22 20:39, Matthew Wilcox wrote:
> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
>> So nobody can get away with a lie.
> 
> And yet devices do exist which lie.  I'm not surprised that vendors
> vehemently claim that they don't, or "nobody would get away with it".
> But, of course, they do.  And there's no way for us to find out if
> they're lying!
> 
But we'll never be able to figure that out unless we try.

Once we've tried we will have proof either way.
Which is why I think we should have that implementation.
Only then will we know if it's worth it, both from the hardware and the
application side.

Without an implementation it'll just degrade to a shouting match.

Cheers,

Hannes
Keith Busch Dec. 2, 2022, 2:58 p.m. UTC | #12
On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
> On 12/1/22 20:39, Matthew Wilcox wrote:
> > On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> > > So nobody can get away with a lie.
> > 
> > And yet devices do exist which lie.  I'm not surprised that vendors
> > vehemently claim that they don't, or "nobody would get away with it".
> > But, of course, they do.  And there's no way for us to find out if
> > they're lying!
> > 
> But we'll never be able to figure that out unless we try.
> 
> Once we've tried we will have proof either way.

As long as the protocols don't provide proof-of-work, trying this
doesn't really prove anything with respect to this concern.
Clay Mayers Dec. 2, 2022, 5:33 p.m. UTC | #13
> From: Keith Busch
> On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
> > On 12/1/22 20:39, Matthew Wilcox wrote:
> > > On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> > > > So nobody can get away with a lie.
> > >
> > > And yet devices do exist which lie.  I'm not surprised that vendors
> > > vehemently claim that they don't, or "nobody would get away with it".
> > > But, of course, they do.  And there's no way for us to find out if
> > > they're lying!

My guess, if true, is that it's rationalized by the device already
doing patrols in the background - why verify when it's already been
recently patrolled?
 
> > >
> > But we'll never be able to figure that out unless we try.
> >
> > Once we've tried we will have proof either way.
> 
> As long as the protocols don't provide proof-of-work, trying this
> doesn't really prove anything with respect to this concern.

I'm out of my depth here, but isn't VERIFY tightly related to PI and
at the heart of detecting SAN bit-rot? The proof of work can be via
end-to-end data protection. VERIFY has to actually read to detect bad
host generated PI guard/tags.  I'm assuming the PI checks can be
disabled for WRITE and enabled for VERIFY as the test.
Keith Busch Dec. 2, 2022, 6:58 p.m. UTC | #14
On Fri, Dec 02, 2022 at 05:33:33PM +0000, Clay Mayers wrote:
> > 
> > As long as the protocols don't provide proof-of-work, trying this
> > doesn't really prove anything with respect to this concern.
> 
> I'm out of my depth here, but isn't VERIFY tightly related to PI and
> at the heart of detecting SAN bit-rot? The proof of work can be via
> end-to-end data protection. VERIFY has to actually read to detect bad
> host generated PI guard/tags.  I'm assuming the PI checks can be
> disabled for WRITE and enabled for VERIFY as the test.

I suppose if you happen to have a PI formatted drive, you could WRITE
garbage Guard tags with PRCHK disabled, then see if VERIFY with PRCHK
enabled returns the Guard Check Error; but PI is an optional format
feature orthogonal to the VERIFY command: we can't count on that format
being available in most implementations.
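
For the PI-formatted case, a rough sketch of the second half of that
test via the NVMe passthrough ioctl: issue Verify over the range that
was written with bogus Guard tags, with the Guard bit of PRCHK set, and
expect a Guard Check Error status rather than success. The opcode and
CDW12 bit positions below are my reading of the NVM Command Set spec
and should be double-checked; nothing here comes from the patch set.

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

static int verify_with_guard_check(int fd, uint32_t nsid, uint64_t slba,
				   uint16_t nlb0)  /* 0-based block count */
{
	struct nvme_passthru_cmd cmd = {
		.opcode = 0x0c,                 /* Verify (NVM command set) */
		.nsid   = nsid,
		.cdw10  = (uint32_t)slba,
		.cdw11  = (uint32_t)(slba >> 32),
		.cdw12  = nlb0 | (1u << 28),    /* PRINFO: check the Guard tag */
	};
	int status = ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);

	/* A positive return carries the NVMe status, e.g. Guard Check Error. */
	printf("verify completed with status 0x%x\n", status);
	return status;
}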
Javier González Dec. 3, 2022, 4:19 a.m. UTC | #15
> On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
> 
> On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
>>> On 12/1/22 20:39, Matthew Wilcox wrote:
>>> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
>>>> So nobody can get away with a lie.
>>> 
>>> And yet devices do exist which lie.  I'm not surprised that vendors
>>> vehemently claim that they don't, or "nobody would get away with it".
>>> But, of course, they do.  And there's no way for us to find out if
>>> they're lying!
>>> 
>> But we'll never be able to figure that out unless we try.
>> 
>> Once we've tried we will have proof either way.
> 
> As long as the protocols don't provide proof-of-work, trying this
> doesn't really prove anything with respect to this concern.

Is this something we should bring to NVMe? Seems like the main disagreement can be addressed there. 

I will check internally if there is any existing proof-of-work that we are missing.
Keith Busch Dec. 4, 2022, 8:29 p.m. UTC | #16
On Sat, Dec 03, 2022 at 07:19:17AM +0300, Javier González wrote:
> 
> > On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
> > 
> > On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
> >>> On 12/1/22 20:39, Matthew Wilcox wrote:
> >>> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
> >>>> So nobody can get away with a lie.
> >>> 
> >>> And yet devices do exist which lie.  I'm not surprised that vendors
> >>> vehemently claim that they don't, or "nobody would get away with it".
> >>> But, of course, they do.  And there's no way for us to find out if
> >>> they're lying!
> >>> 
> >> But we'll never be able to figure that out unless we try.
> >> 
> >> Once we've tried we will have proof either way.
> > 
> > As long as the protocols don't provide proof-of-work, trying this
> > doesn't really prove anything with respect to this concern.
> 
> Is this something we should bring to NVMe? Seems like the main disagreement can be addressed there. 

Yeah, proof for the host appears to require a new feature, so we'd need
to bring this to the TWG. I can draft a TPAR if there's interest and
have ideas on how the feature could be implemented, but I currently
don't have enough skin in this game to sponsor it.
Matthew Wilcox Dec. 8, 2022, 8:02 p.m. UTC | #17
On Sat, Dec 03, 2022 at 07:19:17AM +0300, Javier González wrote:
> > On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
> > As long as the protocols don't provide proof-of-work, trying this
> > doesn't really prove anything with respect to this concern.
> 
> Is this something we should bring to NVMe? Seems like the main disagreement can be addressed there. 
> 
> I will check internally if there is any existing proof-of-work that we are missing. 

I think the right thing for NVMe to standardise is a new command, HASH.

It should be quite similar to the READ command, at least for command
Dwords 10-15.  Need to specify the hash algorithm to use somewhere
(maybe it's a per-namespace setting, maybe a per-command setting).
The committee will have to decide which hash algos are permitted / required
(it probably makes sense to permit even some quite weak hashes as we're
not necessarily looking to have cryptographic security).

While there may be ways to make use of this command for its obvious usage
(ie actually producing a hash and communicating that to somebody else),
its real purpose is for the drive to read the data from storage and prove
it's still there without transferring it to the host, as discussed in
this thread.
Javier González Dec. 8, 2022, 8:06 p.m. UTC | #18
On 04.12.2022 20:29, Keith Busch wrote:
>On Sat, Dec 03, 2022 at 07:19:17AM +0300, Javier González wrote:
>>
>> > On 2 Dec 2022, at 17.58, Keith Busch <kbusch@kernel.org> wrote:
>> >
>> > On Fri, Dec 02, 2022 at 08:16:30AM +0100, Hannes Reinecke wrote:
>> >>> On 12/1/22 20:39, Matthew Wilcox wrote:
>> >>> On Thu, Dec 01, 2022 at 06:12:46PM +0000, Chaitanya Kulkarni wrote:
>> >>>> So nobody can get away with a lie.
>> >>>
>> >>> And yet devices do exist which lie.  I'm not surprised that vendors
>> >>> vehemently claim that they don't, or "nobody would get away with it".
>> >>> But, of course, they do.  And there's no way for us to find out if
>> >>> they're lying!
>> >>>
>> >> But we'll never be able to figure that out unless we try.
>> >>
>> >> Once we've tried we will have proof either way.
>> >
>> > As long as the protocols don't provide proof-of-work, trying this
>> > doesn't really prove anything with respect to this concern.
>>
>> Is this something we should bring to NVMe? Seems like the main disagreement can be addressed there.
>
>Yeah, proof for the host appears to require a new feature, so we'd need
>to bring this to the TWG. I can draft a TPAR if there's interest and
>have ideas on how the feature could be implemented, but I currently
>don't have enough skin in this game to sponser it.

Happy to review the TPAR, but I am in a similar situation.
Martin K. Petersen Dec. 9, 2022, 4:52 a.m. UTC | #19
> My guess, if true, is it's rationalized with the device is already
> doing patrols in the background - why verify when it's already
> been recently patrolled?

The original SCSI VERIFY operation allowed RAID array firmware to do
background scrubs before disk drives developed anything resembling
sophisticated media management. If verification of a block failed, the
array firmware could go ahead and reconstruct the data from the rest of
the stripe in the background. This substantially reduced the risk of
having to perform block reconstruction in the hot path. And verification
did not have to burden the already slow array CPU/memory/DMA combo with
transferring every block on every attached drive.

I suspect that these days it is very hard to find a storage device that
doesn't do media management internally in the background. So from the
perspective of physically exercising the media, VERIFY is probably not
terribly useful anymore.

In that light, having to run VERIFY over the full block range of a
device to identify unreadable blocks seems like a fairly clunky
mechanism. Querying the device for a list of unrecoverable blocks
already identified by the firmware seems like a better interface.

I am not sure I understand this whole "proof that the drive did
something" requirement. If a device lies and implements VERIFY as a noop
it just means you'll get the error during a future READ operation
instead.

No matter what, a successful VERIFY is obviously no guarantee that a
future READ on a given block will be possible. But it doesn't matter
because the useful outcome of a verify operation is the failure, not the
success. It's the verification failure scenario which allows you to take
a corrective action.

If you really want to verify a device's VERIFY implementation, we do have
WRITE UNCORRECTABLE commands in both SCSI and NVMe which allow you to do
that. But I think device validation is a secondary issue. The more
pertinent question is whether we have use cases in the kernel (MD,
btrfs) which would benefit from being able to preemptively identify
unreadable blocks?
Carlos Carvalho Dec. 10, 2022, 1:06 p.m. UTC | #20
Martin K. Petersen (martin.petersen@oracle.com) wrote on Fri, Dec 09, 2022 at 01:52:01AM -03:
> I suspect that these days it is very hard to find a storage device that
> doesn't do media management internally in the background. So from the
> perspective of physically exercising the media, VERIFY is probably not
> terribly useful anymore.
> 
> In that light, having to run VERIFY over the full block range of a
> device to identify unreadable blocks seems like a fairly clunky
> mechanism. Querying the device for a list of unrecoverable blocks
> already identified by the firmware seems like a better interface.

Sure.

> But I think device validation is a secondary issue. The more
> pertinent question is whether we have use cases in the kernel (MD,
> btrfs) which would benefit from being able to preemptively identify
> unreadable blocks?

Certainly we have. Currently admins have to periodically run full block-range
checks on redundant arrays to detect bad blocks and correct them while
redundancy is still available. Otherwise, when a disk fails and you try to
rebuild onto the replacement, you hit another bad block on one of the remaining
disks, cannot complete the reconstruction, and suffer data loss. These checks
are a burden because they have HIGH overhead, significantly reducing bandwidth
for the normal use of the array.

If there were a standard interface for getting the list of bad blocks that the
firmware secretly knows about, the kernel could implement the repair
continuously, with logs etc. That'd really be a relief for admins and,
especially, users.
Christoph Hellwig Dec. 12, 2022, 6:30 a.m. UTC | #21
On Sat, Dec 10, 2022 at 10:06:34AM -0300, Carlos Carvalho wrote:
> Certainly we have. Currently admins have to periodically run full block range
> checks in redundant arrays to detect bad blocks and correct them while
> redundancy is available. Otherwise when a disk fails and you try to reconstruct
> the replacement you hit another block in the remaining disks that's bad and you
> cannot complete the reconstruction and have data loss. These checks are a
> burden because they have HIGH overhead, significantly reducing bandwidth for
> the normal use of the array.
> 
> If there was a standard interface for getting the list of bad blocks that the
> firmware secretly knows the kernel could implement the repair continuosly, with
> logs etc. That'd really be a relief for admins and, specially, users.

Both SCSI and NVMe can do this through the GET LBA STATUS command -
in SCSI this was a later addition abusing the command, and in NVMe
only the abuse survived.  NVMe also has a log page and an AEN associated
with it; I'd have to spend more time reading SBC to remember whether SCSI
also has a notification mechanism of some sort.