
[RFC,0/3] Btrfs checksum offload

Message ID 20250129140207.22718-1-joshi.k@samsung.com (mailing list archive)

Message

Kanchan Joshi Jan. 29, 2025, 2:02 p.m. UTC
TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
SSD for data checksumming.

Now, the longer version for why/how.

End-to-end data protection (E2EDP)-capable drives require the transfer
of integrity metadata (PI).
This is currently handled by the block layer, without filesystem
involvement/awareness.
The block layer attaches the metadata buffer, generates the checksum
(and reftag) for write I/O, and verifies it during read I/O.

Btrfs has its own data and metadata checksumming, which is currently
disconnected from the above.
It maintains a separate on-device 'checksum tree' for data checksums,
while the block layer will also be checksumming each Btrfs I/O.

There is value in avoiding the copy-on-write (COW) checksum tree on
a device that can store checksums inline anyway (as part of PI).
This would eliminate extra checksum writes/reads, making I/O
more CPU-efficient.
Additionally, usable space would increase, and write
amplification, both in Btrfs and eventually at the device level, would
be reduced [*].

NVMe drives can also automatically insert and strip the PI/checksum
and provide a per-I/O control knob (the PRACT bit) for this.
The block layer currently makes no attempt to know about or advertise this offload.

This patch series: (a) adds checksum offload awareness to the
block layer (patch #1),
(b) enables the NVMe driver to register and support the offload
(patch #2), and
(c) introduces an opt-in (datasum_offload mount option) in Btrfs to
apply checksum offload for data (patch #3).
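
Conceptually, the filesystem-side opt-in boils down to tagging data bios
when the mount option is set. A rough sketch (helper/option spelling here
is illustrative, not necessarily what the patch uses):

/* Sketch: tag Btrfs data bios for checksum offload when mounted with
 * datasum_offload. Helper/option spelling is illustrative. */
static void btrfs_maybe_offload_datasum(struct btrfs_fs_info *fs_info,
					struct bio *bio)
{
	/* Data I/O only; metadata keeps Btrfs' own checksums. */
	if (btrfs_test_opt(fs_info, DATASUM_OFFLOAD))
		bio->bi_opf |= REQ_INTEGRITY_OFFLOAD;
}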

[*] Here are some perf/write-amplification numbers from randwrite test [1]
on 3 configs (same device):
Config 1: No meta format (4K) + Btrfs (base)
Config 2: Meta format (4K + 8b) + Btrfs (base)
Config 3: Meta format (4K + 8b) + Btrfs (datasum_offload)

In configs 1 and 2, Btrfs operates with a checksum tree.
Only in config 2 does the block layer attach an integrity buffer to each I/O
and do checksum/reftag verification.
Only in config 3 does the offload take place, with the device generating/verifying
the checksum.

AppW: writes issued by app, 120G (4 Jobs, each writing 30G)
FsW: writes issued to device (from iostat)
ExtraW: extra writes compared to AppW

Direct I/O
---------------------------------------------------------
Config		IOPS(K)		FsW(G)		ExtraW(G)
1		144		186		66
2		141		181		61
3		172		129		9

Buffered I/O
---------------------------------------------------------
Config		IOPS(K)		FsW(G)		ExtraW(G)
1		82		255		135
2		80		181		132
3		100		199		79
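
(ExtraW is simply FsW minus the 120G issued by the application; e.g. for
direct I/O config 1, 186 - 120 = 66.)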

Write amplification is generally high (and that's understandable given
B-trees), but I am not sure why buffered I/O shows that much.

[1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 --iodepth=32 --randseed=0 --direct=1 -output=out --group_reporting


Kanchan Joshi (3):
  block: add integrity offload
  nvme: support integrity offload
  btrfs: add checksum offload

 block/bio-integrity.c     | 42 ++++++++++++++++++++++++++++++++++++++-
 block/t10-pi.c            |  7 +++++++
 drivers/nvme/host/core.c  | 24 ++++++++++++++++++++++
 drivers/nvme/host/nvme.h  |  1 +
 fs/btrfs/bio.c            | 12 +++++++++++
 fs/btrfs/fs.h             |  1 +
 fs/btrfs/super.c          |  9 +++++++++
 include/linux/blk_types.h |  3 +++
 include/linux/blkdev.h    |  7 +++++++
 9 files changed, 105 insertions(+), 1 deletion(-)

Comments

Johannes Thumshirn Jan. 29, 2025, 2:55 p.m. UTC | #1
On 29.01.25 15:13, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> Now, the longer version for why/how.
> 
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
> 
> Btrfs has its own data and metadata checksumming, which is currently
> disconnected from the above.
> It maintains a separate on-device 'checksum tree' for data checksums,
> while the block layer will also be checksumming each Btrfs I/O.
> 
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.
> Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].
> 
> NVMe drives can also automatically insert and strip the PI/checksum
> and provide a per-I/O control knob (the PRACT bit) for this.
> Block layer currently makes no attempt to know/advertise this offload.
> 
> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload
> (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).

Hi Kanchan,

This is an interesting approach to offloading the checksum work. I've 
only had a quick glance over it from a bird's-eye view, and one thing that 
I noticed is the missing connection of error reporting between the layers.

For instance, if we get a checksum error on btrfs, we not only report it 
in dmesg but also try to repair the affected sector if we have a 
data profile with redundancy.

So while this patchset offloads the submission side work of the checksum 
tree to the PI code, I don't see the back-propagation of the errors into 
btrfs and the triggering of the repair code.

I get that it's an RFC, but as it is now it essentially breaks functionality 
we rely on. Can you add this part as well, so we can evaluate the 
patchset not only from the write side but also from the read side?

Byte,
	Johannes
Keith Busch Jan. 29, 2025, 3:28 p.m. UTC | #2
On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.

Another potential benefit: if the device does the checksum, then I think
btrfs could avoid the stable page writeback overhead and let the
contents be changeable all the way until the data goes out on the wire.

Though I feel the very specific device format constraints that can
support an offload like this are unfortunate.
Christoph Hellwig Jan. 29, 2025, 3:35 p.m. UTC | #3
On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.

That's not quite true.  The block layer automatically adds a PI
payload if none is added by the caller.  The caller can add its own
PI payload, but currently no file system does this - only the block
device fops as of 6.13 and the nvme and scsi targets.  But file systems
can do that, and I have (hacky and outdated) patches wiring this up
in XFS.

Note that the "auto-PI" vs "caller-PI" isn't very cleanly split
currently, which causes some confusion.  I have a series almost
ready to clean that up a bit.
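
For reference, caller-attached PI today looks roughly like the below (a
minimal sketch modeled on the in-kernel users such as the nvme target;
PI buffer allocation, PI generation and most error paths are omitted):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/mm.h>

/* Sketch of caller-PI: the filesystem allocates the bip and attaches its
 * own, pre-filled PI buffer (kmalloc'ed here) instead of letting auto-PI
 * generate one. */
static int fs_attach_pi(struct bio *bio, void *pi_buf)
{
	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev);
	struct bio_integrity_payload *bip;
	unsigned int len = bio_integrity_bytes(bi, bio_sectors(bio));

	bip = bio_integrity_alloc(bio, GFP_NOFS, 1);
	if (IS_ERR(bip))
		return PTR_ERR(bip);

	bip->bip_iter.bi_size = len;
	bip->bip_iter.bi_sector = bio->bi_iter.bi_sector;

	if (bio_integrity_add_page(bio, virt_to_page(pi_buf), len,
				   offset_in_page(pi_buf)) != len)
		return -ENOMEM;
	return 0;
}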

> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).

Yes.

> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),

I've skipped over the patches and don't understand what this offload
awareness concept does compared to the file system simply attaching PI
metadata.

> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).

Not really important for an initial prototype, but incompatible on-disk
format changes like this need feature flags and not just a mount
option.
Christoph Hellwig Jan. 29, 2025, 3:40 p.m. UTC | #4
On Wed, Jan 29, 2025 at 08:28:24AM -0700, Keith Busch wrote:
> On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> > There is value in avoiding Copy-on-write (COW) checksum tree on
> > a device that can anyway store checksums inline (as part of PI).
> > This would eliminate extra checksum writes/reads, making I/O
> > more CPU-efficient.
> 
> Another potential benefit: if the device does the checksum, then I think
> btrfs could avoid the stable page writeback overhead and let the
> contents be changable all the way until it goes out on the wire.

If the device generates the checksum (aka DIF insert) that problem goes
away.  But we also lose integrity protection over the wire, which would
be unfortunate.

If you feed the checksum / guard tag from the kernel we still have the
same problem.  A while ago I did a prototype where we'd bubble up to the
fs that we had a guard tag error vs. just the non-specific "protection
error", and the file system would then retry after copying.  This was
pretty sketchy as the error handling blew up frequently, and at least my
version would only work for synchronous I/O and not with aio / io_uring
due to the missing MM context.  But if someone has enough spare cycles
that could be something interesting to look into again.
Mark Harmstone Jan. 29, 2025, 3:55 p.m. UTC | #5
On 29/1/25 14:02, Kanchan Joshi wrote:
> > 
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> Now, the longer version for why/how.
> 
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
> 
> Btrfs has its own data and metadata checksumming, which is currently
> disconnected from the above.
> It maintains a separate on-device 'checksum tree' for data checksums,
> while the block layer will also be checksumming each Btrfs I/O.
> 
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.
> Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].
> 
> NVMe drives can also automatically insert and strip the PI/checksum
> and provide a per-I/O control knob (the PRACT bit) for this.
> Block layer currently makes no attempt to know/advertise this offload.
> 
> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload
> (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).
> 
> [*] Here are some perf/write-amplification numbers from randwrite test [1]
> on 3 configs (same device):
> Config 1: No meta format (4K) + Btrfs (base)
> Config 2: Meta format (4K + 8b) + Btrfs (base)
> Config 3: Meta format (4K + 8b) + Btrfs (datasum_offload)
> 
> In config 1 and 2, Btrfs will operate with a checksum tree.
> Only in config 2, block-layer will attach integrity buffer with each I/O and
> do checksum/reftag verification.
> Only in config 3, offload will take place and device will generate/verify
> the checksum.
> 
> AppW: writes issued by app, 120G (4 Jobs, each writing 30G)
> FsW: writes issued to device (from iostat)
> ExtraW: extra writes compared to AppW
> 
> Direct I/O
> ---------------------------------------------------------
> Config		IOPS(K)		FsW(G)		ExtraW(G)
> 1		144		186		66
> 2		141		181		61
> 3		172		129		9
> 
> Buffered I/O
> ---------------------------------------------------------
> Config		IOPS(K)		FsW(G)		ExtraW(G)
> 1		82		255		135
> 2		80		181		132
> 3		100		199		79
> 
> Write amplification is generally high (and that's understandable given
> B-trees) but not sure why buffered I/O shows that much.
> 
> [1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 --iodepth=32 --randseed=0 --direct=1 -output=out --group_reporting
> 
> 
> Kanchan Joshi (3):
>    block: add integrity offload
>    nvme: support integrity offload
>    btrfs: add checksum offload
> 
>   block/bio-integrity.c     | 42 ++++++++++++++++++++++++++++++++++++++-
>   block/t10-pi.c            |  7 +++++++
>   drivers/nvme/host/core.c  | 24 ++++++++++++++++++++++
>   drivers/nvme/host/nvme.h  |  1 +
>   fs/btrfs/bio.c            | 12 +++++++++++
>   fs/btrfs/fs.h             |  1 +
>   fs/btrfs/super.c          |  9 +++++++++
>   include/linux/blk_types.h |  3 +++
>   include/linux/blkdev.h    |  7 +++++++
>   9 files changed, 105 insertions(+), 1 deletion(-)
> 

There's also checksumming done on the metadata trees, which could be 
avoided if we're trusting the block device to do it.

Maybe rather than putting this behind a new compat flag, add a new csum 
type of "none"? With the logic being that it also zeroes out the csum 
field in the B-tree headers.
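
A minimal sketch of that idea, extending the existing on-disk enum (the
current values are from include/uapi/linux/btrfs_tree.h; the NONE entry is
hypothetical):

enum btrfs_csum_type {
	BTRFS_CSUM_TYPE_CRC32	= 0,
	BTRFS_CSUM_TYPE_XXHASH	= 1,
	BTRFS_CSUM_TYPE_SHA256	= 2,
	BTRFS_CSUM_TYPE_BLAKE2	= 3,
	BTRFS_CSUM_TYPE_NONE	= 4,	/* hypothetical: csum field zeroed, device PI trusted */
};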

Mark
Keith Busch Jan. 29, 2025, 6:03 p.m. UTC | #6
On Wed, Jan 29, 2025 at 04:40:25PM +0100, Christoph Hellwig wrote:
> > 
> > Another potential benefit: if the device does the checksum, then I think
> > btrfs could avoid the stable page writeback overhead and let the
> > contents be changable all the way until it goes out on the wire.
> 
> If the device generates the checksum (aka DIF insert) that problem goes
> away.  But we also lose integrity protection over the wire, which would
> be unfortunate.

If the "wire" is only PCIe, I don't see why it matters. What kind of
wire corruption gets undetected by the protocol's encoding and LCRC that
would get caught by the host's CRC payload?
Goffredo Baroncelli Jan. 29, 2025, 7:02 p.m. UTC | #7
On 29/01/2025 15.02, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
> 
> Now, the longer version for why/how.
> 
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.
> 
Maybe this is a stupid question, but if we can (will) avoid storing the checksums
in the FS, what is the advantage of having a COW filesystem?

My understanding is that a COW filesystem is needed mainly to synchronize
csum and data. Am I wrong?

[...]

BR
Kanchan Joshi Jan. 30, 2025, 9:22 a.m. UTC | #8
On 1/29/2025 9:05 PM, Christoph Hellwig wrote:
>> This patch series: (a) adds checksum offload awareness to the
>> block layer (patch #1),
> I've skipped over the patches and don't understand what this offload
> awareness concept does compared the file system simply attaching PI
> metadata.

The difference is that the FS does not have to attach any PI for the offload.

The offload is about the host doing as little as possible, and the closest 
we get there is by setting the PRACT bit.

Attaching PI is not really needed, either by the FS or the block layer, 
for a pure offload.
When the device has an "ms == pi_size" format, we only need to send the I/O 
with PRACT set, and the device takes care of attaching the integrity buffer 
and generating/verifying the checksum.
This is abstracted as 'offload type 1' in this series.

For the other format, "ms > pi_size", we also set PRACT, but an integrity 
buffer needs to be passed as well. This is abstracted as 'offload type 2'.
It is still an offload, as the checksum processing is done only by the device.

Block-layer auto-PI is a good place for this because all the above details 
are common and remain abstracted, while filesystems only need to decide whether 
they want to send the flag (REQ_INTEGRITY_OFFLOAD) to use the facility.
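
Roughly, on the driver side that maps onto the PRACT bit like this (a
sketch following the description above, not the literal patch;
NVME_RW_PRINFO_PRACT is the existing bit from include/linux/nvme.h):

/* In the NVMe read/write command setup, once a bio asks for the offload: */
if (bio->bi_opf & REQ_INTEGRITY_OFFLOAD) {
	control |= NVME_RW_PRINFO_PRACT;
	if (ns->head->ms == ns->head->pi_size) {
		/* 'offload type 1': no host integrity buffer at all; the
		 * device inserts/strips the PI by itself. */
	} else {
		/* 'offload type 2' (ms > pi_size): the metadata buffer still
		 * travels with the I/O, but the device generates/verifies
		 * the checksum. */
	}
}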
Daniel Vacek Jan. 30, 2025, 9:33 a.m. UTC | #9
On Wed, 29 Jan 2025 at 20:04, Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 29/01/2025 15.02, Kanchan Joshi wrote:
> > TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> > SSD for data checksumming.
> >
> > Now, the longer version for why/how.
> >
> > End-to-end data protection (E2EDP)-capable drives require the transfer
> > of integrity metadata (PI).
> > This is currently handled by the block layer, without filesystem
> > involvement/awareness.
> > The block layer attaches the metadata buffer, generates the checksum
> > (and reftag) for write I/O, and verifies it during read I/O.
> >
> May be this is a stupid question, but if we can (will) avoid storing the checksum
> in the FS, which is the advantage of having a COW filesystem ?

I was wondering the same. My understanding is the checksums are there
primarily to protect against untrusted devices or data transfers over
the line. And now suddenly we're going to trust them? What's even the
point then?

Is there any other advantage of having these checksums that I may be missing?
Perhaps logic bugs in the code accidentally corrupting the data? Is the
stored payload ever even touched? That would not be wanted, right?
Or perhaps data mangled on the storage by an attacker?

> My understand is that a COW filesystem is needed mainly to synchronize
> csum and data. Am I wrong ?
>
> [...]
>
> BR
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
>
Christoph Hellwig Jan. 30, 2025, 12:53 p.m. UTC | #10
On Thu, Jan 30, 2025 at 02:52:23PM +0530, Kanchan Joshi wrote:
> On 1/29/2025 9:05 PM, Christoph Hellwig wrote:
> >> This patch series: (a) adds checksum offload awareness to the
> >> block layer (patch #1),
> > I've skipped over the patches and don't understand what this offload
> > awareness concept does compared the file system simply attaching PI
> > metadata.
> 
> Difference is that FS does not have to attach any PI for offload.
> 
> Offload is about the Host doing as little as possible, and the closest 
> we get there is by setting PRACT bit.

But that doesn't actually work.  The file system needs to be able
to verify the checksum for failing over to other mirrors, repair,
etc.  Also if you trust the device to get things right you do not
need to use PI at all - SSDs or hard drives that support PI generally
use PI internally anyway, and PRACT just means you treat a format
with PI like one without.  In other words - no need for an offload
here, you might as well just trust the device if you're not doing
end to end protection.

> 
> Attaching PI is not really needed, neither for FS nor for block-layer, 
> for pure offload.
> When device has "ms == pi_size" format, we only need to send I/O with 
> PRACT set and device take care of attaching integrity buffer and 
> checksum generation/verification.
> This is abstracted as 'offload type 1' in this series.
> 
> For other format "ms > pi_size" also we set the PRACT but integrity 
> buffer also needs to be passed. This is abstracted as 'offload type 2'.
> Still offload as the checksum processing is done only by the device.
> 
> Block layer Auto-PI is a good place because all above details are common 
> and remain abstracted, while filesystems only need to decide whether 
> they want to send the flag (REQ_INTEGRITY_OFFLOAD) to use the facility.
---end quoted text---
Christoph Hellwig Jan. 30, 2025, 12:54 p.m. UTC | #11
On Wed, Jan 29, 2025 at 11:03:36AM -0700, Keith Busch wrote:
> > away.  But we also lose integrity protection over the wire, which would
> > be unfortunate.
> 
> If the "wire" is only PCIe, I don't see why it matters. What kind of
> wire corruption gets undetected by the protocol's encoding and LCRC that
> would get caught by the host's CRC payload?

The "wire" could be anything.  And includes a little more than than
than the wire, like the entire host side driver stack and the device
data path between the phy and wherever in the stack the PI insert/strip
accelerator sits.
Martin K. Petersen Jan. 30, 2025, 8:21 p.m. UTC | #12
Hi Kanchan!

> There is value in avoiding Copy-on-write (COW) checksum tree on a
> device that can anyway store checksums inline (as part of PI). This
> would eliminate extra checksum writes/reads, making I/O more
> CPU-efficient. Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].

I have a couple of observations.

First of all, there is no inherent benefit to PI if it is generated at
the same time as the ECC. The ECC is usually far superior when it comes
to protecting data at rest. And you'll still get an error if uncorrected
corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does not offer
any benefits in my book.

The motivation for T10 PI is that it is generated in close temporal
proximity to the data. I.e. ideally the PI protecting the data is
calculated as soon as the data has been created in memory. And then the
I/O will eventually be queued, submitted, traverse the kernel, through
the storage fabric, and out to the end device. The PI and data have
traveled along different paths (potentially, more on that later) to get
there. The device will calculate the ECC and then perform a validation
of the PI wrt. to the data buffer. And if those two line up, we know the
ECC is also good. At that point we have confirmed that the data to be
stored matches the data that was used as input when the PI was generated
N seconds ago in host memory. And therefore we can write.

I.e. the goal of PI is protect against problems that happen between data
creation time and the data being persisted to media. Once the ECC has
been calculated, PI essentially stops being interesting.

The second point I would like to make is that the separation between PI
and data that we introduced with DIX, and which NVMe subsequently
adopted, was a feature. It was not just to avoid the inconvenience of
having to deal with buffers that were multiples of 520 bytes in host
memory. The separation between the data and its associated protection
information had proven critical for data protection in many common
corruption scenarios. Inline protection had been tried and had failed to
catch many of the scenarios we had come across in the field.

At the time T10 PI was designed spinning rust was the only game in town.
And nobody was willing to take the performance hit of having to seek
twice per I/O to store PI separately from the data. And while schemes
involving sending all the PI ahead of the data were entertained, they
never came to fruition. Storing 512+8 in the same sector was a necessity
in the context of SCSI drives, not a desired behavior. Addressing that
in DIX was key.

So to me, it's a highly desirable feature that btrfs stores its
checksums elsewhere on media. But that's obviously a trade-off a user
can make. In some cases media WAF may be more important than extending
the protection envelope for the data, and that's OK. I would suggest you
look at using CRC32C given the intended 4KB block use case, though,
because the 16-bit CRC isn't fantastic for large blocks.
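
Both guards are a single library call away in the kernel, for what it's
worth (sketch; crc_t10dif() and crc32c() are the existing helpers, the
seed choice here is arbitrary):

#include <linux/crc-t10dif.h>
#include <linux/crc32c.h>
#include <linux/printk.h>

/* Compare the two guard candidates over one 4KB block. */
static void guard_demo(const void *block)
{
	__u16 g16 = crc_t10dif(block, 4096);	/* T10-DIF CRC16 */
	u32   g32 = crc32c(~0U, block, 4096);	/* CRC32C, arbitrary seed */

	pr_info("crc16 guard=%#x crc32c guard=%#x\n", g16, g32);
}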
Christoph Hellwig Jan. 31, 2025, 7:44 a.m. UTC | #13
On Thu, Jan 30, 2025 at 03:21:45PM -0500, Martin K. Petersen wrote:
> So to me, it's a highly desirable feature that btrfs stores its
> checksums elsewhere on media.

Except for SSDs it generally doesn't - the fact that they are written
at the same time means there is a very high chance they will end up
on media together for traditional SSD designs.  This might be different
when explicitly using some form of data placement scheme, and SSD
vendors might be able to place PI/metadata differently under the hood
when a big enough customer asks for it (they might not be very
happy about the request :)).


> But that's obviously a trade-off a user
> can make. In some cases media WAR may be more important than extending
> the protection envelope for the data, that's OK. I would suggest you
> look at using CRC32C given the intended 4KB block use case, though,
> because the 16-bit CRC isn't fantastic for large blocks.

One thing that I did implement for my XFS hack/prototype is the ability
to store a crc32c in the non-PI metadata supported by nvme.  This allows
for low-overhead data checksumming as you don't need a separate data
structure to track where the checksums for a data block are located, and
it doesn't require out-of-place writes.  It doesn't provide a ref tag
equivalent or device-side checking of the guard tag, unfortunately.

I never could come up with a good use of the app_tag for file systems,
so not wasting space for that is actually a good thing.
Kanchan Joshi Jan. 31, 2025, 10:19 a.m. UTC | #14
On 1/29/2025 8:25 PM, Johannes Thumshirn wrote:
> For instance if we get a checksum error on btrfs we not only report in
> in dmesg, but also try to repair the affected sector if we do have a
> data profile with redundancy.
> 
> So while this patchset offloads the submission side work of the checksum
> tree to the PI code, I don't see the back-propagation of the errors into
> btrfs and the triggering of the repair code.
> 
> I get it's a RFC, but as it is now it essentially breaks functionality
> we rely on. Can you add this part as well so we can evaluate the
> patchset not only from the write but also from the read side.

I tested the series for reads, but only the success cases. In this case 
checksum generation/verification happens only within the device. It was 
slightly tricky to inject an error, so I skipped that.

Since separate checksum I/Os are omitted, this is about handling the 
error condition in the data read I/O path itself. I have not yet checked 
whether the repair code triggers when Btrfs is working with the existing 
'nodatasum' mount option. But I get your point that this needs to be handled.
Johannes Thumshirn Jan. 31, 2025, 10:29 a.m. UTC | #15
On 31.01.25 11:19, Kanchan Joshi wrote:
> On 1/29/2025 8:25 PM, Johannes Thumshirn wrote:
>> For instance if we get a checksum error on btrfs we not only report in
>> in dmesg, but also try to repair the affected sector if we do have a
>> data profile with redundancy.
>>
>> So while this patchset offloads the submission side work of the checksum
>> tree to the PI code, I don't see the back-propagation of the errors into
>> btrfs and the triggering of the repair code.
>>
>> I get it's a RFC, but as it is now it essentially breaks functionality
>> we rely on. Can you add this part as well so we can evaluate the
>> patchset not only from the write but also from the read side.
> 
> I tested the series for read, but only the success cases. In this case
> checksum generation/verification happens only within the device. It was
> slightly tricky to inject an error and I skipped that.
> 
> Since separate checksum I/Os are omitted, this is about handling the
> error condition in data read I/O path itself. I have not yet checked if
> repair code triggers when Btrfs is working with existing 'nodatasum'
> mount option. But I get your point that this needs to be handled.
> 

So this as of now disables one of the most useful features of the FS, 
repairing bad data. The whole "story" for the RAID code in the FS is 
built around this assumption, that we can repair bad data, unlike say MD 
RAID.
Kanchan Joshi Jan. 31, 2025, 10:29 a.m. UTC | #16
On 1/30/2025 6:23 PM, Christoph Hellwig wrote:
>> Difference is that FS does not have to attach any PI for offload.
>>
>> Offload is about the Host doing as little as possible, and the closest
>> we get there is by setting PRACT bit.
> But that doesn't actually work.  The file system needs to be able
> to verify the checksum for failing over to other mirrors, repair,
> etc.

Right. That sounds like reusing the existing code on detecting a 
checksum-specific failure. So maybe that can be handled if this gets 
any further.

> Also if you trust the device to get things right you do not
> need to use PI at all - SSDs or hard drives that support PI generally
> use PI internally anyway, and PRACT just means you treat a format
> with PI like one without.  In other words - no need for an offload
> here, you might as well just trust the device if you're not doing
> end to end protection.

Agreed that the device may be implementing internal E2E, but that's not a 
contract to be honored. The host can't trust it until the device says so explicitly.

Since Btrfs already has the 'nodatasum' mount option, I assume there are 
deployments that prefer to optimize the checksum cost away. I thought Btrfs 
would be more comfortable giving up its own checksum tree, and would exercise 
this more often, if it knows for certain that the device is also capable.
Christoph Hellwig Jan. 31, 2025, 10:42 a.m. UTC | #17
On Fri, Jan 31, 2025 at 03:59:17PM +0530, Kanchan Joshi wrote:
> > Also if you trust the device to get things right you do not
> > need to use PI at all - SSDs or hard drives that support PI generally
> > use PI internally anyway, and PRACT just means you treat a format
> > with PI like one without.  In other words - no need for an offload
> > here, you might as well just trust the device if you're not doing
> > end to end protection.
> 
> Agree that device maybe implementing internal E2E, but that's not a 
> contract to be honored. Host can't trust until device says it explicitly.

But you're not doing end to end protection.  Once you set PRACT you
basically tell the device to pretend the LU/namespace was formatted
without protection information.  The fact that you even need the
flag has always been very confusing to me - the logical way to expose
PI would have been to make PRACT the default and require a flag to
actually look at the passed information.  I suspect for SCSI this
is a result of shoe-horning DIF and DIX into existing infrastructure,
and NVMe then blindly copied much of that without thinking about how it
fits into an architecture without a separate HBA and without all
the legacy concerns.
Kanchan Joshi Feb. 3, 2025, 1:24 p.m. UTC | #18
On 1/31/2025 1:51 AM, Martin K. Petersen wrote:
>> There is value in avoiding Copy-on-write (COW) checksum tree on a
>> device that can anyway store checksums inline (as part of PI). This
>> would eliminate extra checksum writes/reads, making I/O more
>> CPU-efficient. Additionally, usable space would increase, and write
>> amplification, both in Btrfs and eventually at the device level, would
>> be reduced [*].
> I have a couple of observations.
> 
> First of all, there is no inherent benefit to PI if it is generated at
> the same time as the ECC. The ECC is usually far superior when it comes
> to protecting data at rest. And you'll still get an error if uncorrected
> corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does not offer
> any benefits in my book.

I fully agree; there is no benefit if we look at it from the E2E use case.
But for a different use case (i.e., checksum offload), 
BLK_INTEGRITY_OFFLOAD_NO_BUF is as good as it gets.

[Application -> FS -> Block-layer -> Device]

I understand that E2E gets stronger when the integrity/checksum is placed at 
the origin of the data (the application), and each component then collaborates 
in checking it.

But I am not doing E2E here. Call it abuse or creativity, but I am using 
the same E2E-capable device to do checksum offload. If the device had 
exposed checksum offload in a different way, I would have taken that 
route rather than the E2E one.
Kanchan Joshi Feb. 3, 2025, 1:25 p.m. UTC | #19
On 1/31/2025 3:59 PM, Johannes Thumshirn wrote:
>> I tested the series for read, but only the success cases. In this case
>> checksum generation/verification happens only within the device. It was
>> slightly tricky to inject an error and I skipped that.
>>
>> Since separate checksum I/Os are omitted, this is about handling the
>> error condition in data read I/O path itself. I have not yet checked if
>> repair code triggers when Btrfs is working with existing 'nodatasum'
>> mount option. But I get your point that this needs to be handled.
>>
> So this as of now disables one of the most useful features of the FS,
> repairing bad data. The whole "story" for the RAID code in the FS is
> build around this assumption, that we can repair bad data, unlike say MD
> RAID.

Does repairing bad data work when Btrfs is mounted with NODATASUM?
If not, shouldn't the proposed option be seen as an upgrade over that?

You might know this already, but I do not know how Btrfs currently decides 
to apply NODATASUM. With these patches it becomes possible to know whether 
checksum offload is supported by the underlying hardware. And that makes 
it possible to apply NODATASUM in an informed manner.

I have not reduced anything, but added an opt-in for deployments that 
may have a different definition of what is useful. Not all planets are 
Mars. The cost of the checksum tree will be different (say, on QLC vs. SLC).
Johannes Thumshirn Feb. 3, 2025, 1:40 p.m. UTC | #20
On 03.02.25 14:25, Kanchan Joshi wrote:
> On 1/31/2025 3:59 PM, Johannes Thumshirn wrote:
>>> I tested the series for read, but only the success cases. In this case
>>> checksum generation/verification happens only within the device. It was
>>> slightly tricky to inject an error and I skipped that.
>>>
>>> Since separate checksum I/Os are omitted, this is about handling the
>>> error condition in data read I/O path itself. I have not yet checked if
>>> repair code triggers when Btrfs is working with existing 'nodatasum'
>>> mount option. But I get your point that this needs to be handled.
>>>
>> So this as of now disables one of the most useful features of the FS,
>> repairing bad data. The whole "story" for the RAID code in the FS is
>> build around this assumption, that we can repair bad data, unlike say MD
>> RAID.
> 
> Does repairing-bad-data work when Btrfs is mounted with NODATASUM?
> If not, should not the proposed option be seen as an upgrade over that?
> 
> You might be knowing, but I do not know how does Btrfs currently decide
> to apply NODATSUM! With these patches it becomes possible to know if
> checksum-offload is supported by the underlying hardware. And that makes
> it possible to apply NODATASUM in an informed manner.

NODATASUM is something I personally would only ever turn on for VM images, 
so we don't have the stable page vs. checksums f*up (see Qu's answers).

> I have not reduced anything, but added an opt-in for deployments that
> may have a different definition of what is useful. Not all planets are
> Mars. The cost of checksum tree will be different (say on QLC vs SLC).
> 

But NODATASUM isn't something that is actively recommended unless you 
know what you're doing. I thought of your patches as an offload of the 
checksum tree to the T10-PI extended sector format, which I personally 
like. And it's not that hard to do that.

If it's about getting people to use NODATASUM, I'm really starting to 
dislike the patchset. Also, NODATASUM implies deactivated compression.

So to me this sounds like a good use case for one or two specific 
scenarios, but not really all that helpful for the broader picture, which 
it could've been.
Kanchan Joshi Feb. 3, 2025, 2:03 p.m. UTC | #21
On 2/3/2025 7:10 PM, Johannes Thumshirn wrote:
> But NODATASUM isn't something that is actively recommended unless you
> know what you're doing. I thought of your patches as an offload of the
> checksum tree to the T10-PI extended sector format

You thought right; the patches do the "offload to T10-PI format" part. That 
part is generic for any upper-layer user. One only needs to send the 
REQ_INTEGRITY_OFFLOAD flag for that.
And for the other part, "suppress data csums at the Btrfs level", I thought 
of using NODATASUM.
Johannes Thumshirn Feb. 3, 2025, 2:41 p.m. UTC | #22
On 03.02.25 15:04, Kanchan Joshi wrote:
> On 2/3/2025 7:10 PM, Johannes Thumshirn wrote:
>> But NODATASUM isn't something that is actively recommended unless you
>> know what you're doing. I thought of your patches as an offload of the
>> checksum tree to the T10-PI extended sector format
> 
> You thought right, patches do "offload to T10-PI format" part. That part
> is generic for any upper-layer user. One only needs to send flag
> REQ_INTEGRITY_OFFLOAD for that.

That one I think is nice.

> And for the other part "suppress data-csums at Btrfs level", I thought
> of using NODATASUM.
> 

That one I hate with a passion, IFF done the way it's done in this patchset.

For the generation/write side you can go that way. But it needs to be 
wired up so that btrfs can be the consumer (is this the correct term 
here?) of the checksums as well. Luckily 'btrfs_check_read_bio()' 
treats bio.bi_ioerror != BLK_STS_OK the same way as
!btrfs_data_csum_ok(). So that part looks safe to me. Also 
btrfs_data_csum_ok() does an early return if there's no checksum, so I 
think we're good there.

In order to make scrub (and RAID5) work, you'd need a fallback for 
'btrfs_lookup_csums_bitmap()' that returns the sector checksum from the 
bio integrity layer instead of looking at the checksum tree.
Martin K. Petersen Feb. 3, 2025, 7:31 p.m. UTC | #23
Christoph,

> Except for SSDs it generally doesn't - the fact that they are written
> at the same time means there is a very high chance they will end up
> on media together for traditional SSDs designs.
>
> This might be different when explicitly using some form of data
> placement scheme, and SSD vendors might be able to place PI/metadata
> different under the hood when using a big enough customer aks for it
> (they might not be very happy about the request :)).

There was a multi-vendor effort many years ago (first gen SSD era) to
make vendors guarantee that metadata and data would be written to
different channels. But performance got in the way, obviously.

> One thing that I did implement for my XFS hack/prototype is the ability
> to store a crc32c in the non-PI metadata support by nvme.  This allows
> for low overhead data checksumming as you don't need a separate data
> structure to track where the checksums for a data block are located and
> doesn't require out of place writes.  It doesn't provide a reg tag
> equivalent or device side checking of the guard tag unfortunately.

That sounds fine. Again, I don't have a problem with having the ability
to choose whether checksum placement or WAF is more important for a
given application.

> I never could come up with a good use of the app_tag for file systems,
> so not wasting space for that is actually a good thing.

I wish we could just do 4 bytes of CRC32C + 4 bytes of ref tag. I think
that would be a reasonable compromise between space and utility. But we
can't do that because of the app tag escape. We're essentially wasting 2
bytes per block to store a single bit flag.

In general I think 4096+16 is a reasonable format going forward. With
either CRC32C or CRC64 plus full LBA as ref tag.
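
For reference, the two tuple layouts Linux wires up today (from
include/linux/t10-pi.h); the 2-byte app tag discussed above sits in the
middle of both:

struct t10_pi_tuple {		/* 8 bytes: the 512/4096 + 8 formats */
	__be16 guard_tag;	/* CRC16 (T10-DIF) */
	__be16 app_tag;
	__be32 ref_tag;
};

struct crc64_pi_tuple {		/* 16 bytes: the 4096 + 16 format */
	__be64 guard_tag;	/* CRC64 */
	__be16 app_tag;
	__u8   ref_tag[6];
};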
Christoph Hellwig Feb. 4, 2025, 5:12 a.m. UTC | #24
On Mon, Feb 03, 2025 at 02:31:13PM -0500, Martin K. Petersen wrote:
> > I never could come up with a good use of the app_tag for file systems,
> > so not wasting space for that is actually a good thing.
> 
> I wish we could just do 4 bytes of CRC32C + 4 bytes of ref tag. I think
> that would be a reasonable compromise between space and utility.

Agreed.

> But we
> can't do that because of the app tag escape. We're essentially wasting 2
> bytes per block to store a single bit flag.

Well, what do we actually need the app tag escape for except for
historical precedence?  In NVMe the app tag escape is an option for
deallocated blocks, but it's only one of the options, the other being
a synthetic guard/ref tag with the expected value.

> In general I think 4096+16 is a reasonable format going forward. With
> either CRC32C or CRC64 plus full LBA as ref tag.

That would also work fine.  NVMe supports 4-byte crc32c + 2-byte app tag +
10-byte ref tag / storage tag and 8-byte crc64 + 2-byte app tag + 6-byte
ref tag / storage tag, although Linux only supports the latter so far.
Martin K. Petersen Feb. 4, 2025, 12:52 p.m. UTC | #25
Christoph,

> Well, what do we actually need the app tag escape for except for
> historical precedence?  In NVMe the app tag escape is an option for
> deallocated blocks, but it's only one of the options, there other beeing
> a synthetic guard/ref tag with the expected value.

I have been told that some arrays use it to disable PI when writing the
RAID parity blocks. I guess that makes sense if the array firmware is
mixing data and parity blocks in a single write operation. For my test
tool I just use WRPROTECT=3 to disable checking when writing "bad" PI.

And obviously the original reason for the initialization pattern escape
made sense on a disk drive that had no FTL to track whether a block was
in a written state or not. But I really don't think any of this is
useful in a modern SSD or even array context.

>> In general I think 4096+16 is a reasonable format going forward. With
>> either CRC32C or CRC64 plus full LBA as ref tag.
>
> That would also work fine.  NVMe supports 4byte crc32c + 2 byte app tag +
> 12 byte guard tag / storage tag and 8-byte crc64 + 2 byte app tag + 6
> byte guard / storage tag, although Linux only supports the latter so far.

Yep, the CRC32C variant should be easy to wire up. I've thought about
the storage tag but haven't really come up with a good use case. It's
essentially the same situation as with the app tag.
Christoph Hellwig Feb. 4, 2025, 1:49 p.m. UTC | #26
On Tue, Feb 04, 2025 at 07:52:38AM -0500, Martin K. Petersen wrote:
> I have been told that some arrays use it to disable PI when writing the
> RAID parity blocks. I guess that makes sense if the array firmware is
> mixing data and parity blocks in a single write operation. For my test
> tool I just use WRPROTECT=3 to disable checking when writing "bad" PI.

Hmm, why would you disable PI for parity blocks?  But yes, outside of
Linux there might be uses.  Just thinking of a "perfect" format for
our needs.

> >
> > That would also work fine.  NVMe supports 4byte crc32c + 2 byte app tag +
> > 12 byte guard tag / storage tag and 8-byte crc64 + 2 byte app tag + 6
> > byte guard / storage tag, although Linux only supports the latter so far.
> 
> Yep, the CRC32C variant should be easy to wire up. I've thought about
> the storage tag but haven't really come up with a good use case. It's
> essentially the same situation as with the app tag.

Exactly, it's an app tag by other means.
Martin K. Petersen Feb. 5, 2025, 2:31 a.m. UTC | #27
Christoph,

> Hmm, why would you disable PI for parity blocks?

Some devices calculate P and Q on the data portion of the block only and
store valid PI. Others calculate P and Q for the entire block, including
metadata.