Message ID: 20250129140207.22718-1-joshi.k@samsung.com
Series: Btrfs checksum offload
On 29.01.25 15:13, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe SSD for data checksumming.
>
> Now, the longer version for why/how.
>
> End-to-end data protection (E2EDP)-capable drives require the transfer of integrity metadata (PI). This is currently handled by the block layer, without filesystem involvement/awareness. The block layer attaches the metadata buffer, generates the checksum (and reftag) for write I/O, and verifies it during read I/O.
>
> Btrfs has its own data and metadata checksumming, which is currently disconnected from the above. It maintains a separate on-device 'checksum tree' for data checksums, while the block layer will also be checksumming each Btrfs I/O.
>
> There is value in avoiding Copy-on-write (COW) checksum tree on a device that can anyway store checksums inline (as part of PI). This would eliminate extra checksum writes/reads, making I/O more CPU-efficient. Additionally, usable space would increase, and write amplification, both in Btrfs and eventually at the device level, would be reduced [*].
>
> NVMe drives can also automatically insert and strip the PI/checksum and provide a per-I/O control knob (the PRACT bit) for this. Block layer currently makes no attempt to know/advertise this offload.
>
> This patch series: (a) adds checksum offload awareness to the block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to apply checksum offload for data (patch #3).

Hi Kanchan,

This is an interesting approach to offloading the checksum work. I've only had a quick glance over it with a bird's eye view, and one thing I noticed is the missing connection of error reporting between the layers.

For instance, if we get a checksum error on btrfs we not only report it in dmesg, but also try to repair the affected sector if we do have a data profile with redundancy.

So while this patchset offloads the submission-side work of the checksum tree to the PI code, I don't see the back-propagation of the errors into btrfs and the triggering of the repair code.

I get it's an RFC, but as it is now it essentially breaks functionality we rely on. Can you add this part as well, so we can evaluate the patchset not only from the write but also from the read side?

Byte,
Johannes
On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).
> This would eliminate extra checksum writes/reads, making I/O
> more CPU-efficient.

Another potential benefit: if the device does the checksum, then I think btrfs could avoid the stable page writeback overhead and let the contents be changeable all the way until it goes out on the wire.

Though I feel the very specific device format constraints that can support an offload like this are unfortunate.
On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.

That's not quite true. The block layer automatically adds a PI payload if none is added by the caller. The caller can add its own PI payload, but currently no file system does this - only the block device fops as of 6.13 and the nvme and scsi targets. But file systems can do that, and I have (hacky and outdated) patches wiring this up in XFS.

Note that the "auto-PI" vs "caller-PI" split isn't very clean currently, which causes some confusion. I have a series almost ready to clean that up a bit.

> There is value in avoiding Copy-on-write (COW) checksum tree on
> a device that can anyway store checksums inline (as part of PI).

Yes.

> This patch series: (a) adds checksum offload awareness to the
> block layer (patch #1),

I've skipped over the patches and don't understand what this offload awareness concept does compared to the file system simply attaching PI metadata.

> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to
> apply checksum offload for data (patch #3).

Not really important for an initial prototype, but incompatible on-disk format changes like this need feature flags and not just a mount option.
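A minimal sketch of what the caller-PI path described above could look like for a filesystem, assuming a Type 1 style format with 8-byte tuples, a 16-bit CRC guard, and 512-byte protection intervals. Freeing the PI buffer on I/O completion and most error handling are omitted, and the header locations for the integrity helpers vary between kernel versions; this is an illustration, not Christoph's XFS patches.

/*
 * Sketch: caller-attached PI for a write bio. Assumes 512b intervals,
 * t10_pi_tuple layout (16-bit CRC guard, 32-bit ref tag).
 */
#include <linux/bio.h>
#include <linux/t10-pi.h>
#include <linux/crc-t10dif.h>
#include <linux/slab.h>

static int fs_attach_pi(struct bio *bio, void *data, unsigned int nr_intervals)
{
	sector_t sector = bio->bi_iter.bi_sector;
	struct bio_integrity_payload *bip;
	struct t10_pi_tuple *pi;
	unsigned int i;

	pi = kcalloc(nr_intervals, sizeof(*pi), GFP_NOIO);
	if (!pi)
		return -ENOMEM;

	for (i = 0; i < nr_intervals; i++) {
		/* guard over each 512-byte interval, ref tag from the LBA */
		pi[i].guard_tag = cpu_to_be16(crc_t10dif(data + i * 512, 512));
		pi[i].app_tag = 0;
		pi[i].ref_tag = cpu_to_be32((u32)(sector + i));
	}

	bip = bio_integrity_alloc(bio, GFP_NOIO, 1);
	if (IS_ERR(bip)) {
		kfree(pi);
		return PTR_ERR(bip);
	}
	bip_set_seed(bip, sector);	/* expected ref tag seed for the lower layers */
	bio_integrity_add_page(bio, virt_to_page(pi),
			       nr_intervals * sizeof(*pi), offset_in_page(pi));
	return 0;	/* pi must be freed on bio completion (not shown) */
}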
On Wed, Jan 29, 2025 at 08:28:24AM -0700, Keith Busch wrote:
> On Wed, Jan 29, 2025 at 07:32:04PM +0530, Kanchan Joshi wrote:
> > There is value in avoiding Copy-on-write (COW) checksum tree on
> > a device that can anyway store checksums inline (as part of PI).
> > This would eliminate extra checksum writes/reads, making I/O
> > more CPU-efficient.
>
> Another potential benefit: if the device does the checksum, then I think
> btrfs could avoid the stable page writeback overhead and let the
> contents be changeable all the way until it goes out on the wire.

If the device generates the checksum (aka DIF insert) that problem goes away. But we also lose integrity protection over the wire, which would be unfortunate.

If you feed the checksum / guard tag from the kernel we still have the same problem. A while ago I did a prototype where we'd bubble up to the fs that we had a guard tag error vs just the non-specific "protection error", and the file system would then retry after copying. This was pretty sketchy as the error handling blew up frequently, and at least my version would only work for synchronous I/O and not with aio / io_uring due to the missing MM context. But if someone has enough spare cycles that could be something interesting to look into again.
On 29/1/25 14:02, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe SSD for data checksumming.
>
> Now, the longer version for why/how.
>
> End-to-end data protection (E2EDP)-capable drives require the transfer of integrity metadata (PI). This is currently handled by the block layer, without filesystem involvement/awareness. The block layer attaches the metadata buffer, generates the checksum (and reftag) for write I/O, and verifies it during read I/O.
>
> Btrfs has its own data and metadata checksumming, which is currently disconnected from the above. It maintains a separate on-device 'checksum tree' for data checksums, while the block layer will also be checksumming each Btrfs I/O.
>
> There is value in avoiding Copy-on-write (COW) checksum tree on a device that can anyway store checksums inline (as part of PI). This would eliminate extra checksum writes/reads, making I/O more CPU-efficient. Additionally, usable space would increase, and write amplification, both in Btrfs and eventually at the device level, would be reduced [*].
>
> NVMe drives can also automatically insert and strip the PI/checksum and provide a per-I/O control knob (the PRACT bit) for this. Block layer currently makes no attempt to know/advertise this offload.
>
> This patch series: (a) adds checksum offload awareness to the block layer (patch #1),
> (b) enables the NVMe driver to register and support the offload (patch #2), and
> (c) introduces an opt-in (datasum_offload mount option) in Btrfs to apply checksum offload for data (patch #3).
>
> [*] Here are some perf/write-amplification numbers from randwrite test [1] on 3 configs (same device):
> Config 1: No meta format (4K) + Btrfs (base)
> Config 2: Meta format (4K + 8b) + Btrfs (base)
> Config 3: Meta format (4K + 8b) + Btrfs (datasum_offload)
>
> In config 1 and 2, Btrfs will operate with a checksum tree.
> Only in config 2, block-layer will attach integrity buffer with each I/O and do checksum/reftag verification.
> Only in config 3, offload will take place and device will generate/verify the checksum.
>
> AppW: writes issued by app, 120G (4 Jobs, each writing 30G)
> FsW: writes issued to device (from iostat)
> ExtraW: extra writes compared to AppW
>
> Direct I/O
> ---------------------------------------------------------
> Config    IOPS(K)    FsW(G)    ExtraW(G)
> 1         144        186        66
> 2         141        181        61
> 3         172        129         9
>
> Buffered I/O
> ---------------------------------------------------------
> Config    IOPS(K)    FsW(G)    ExtraW(G)
> 1          82        255       135
> 2          80        181       132
> 3         100        199        79
>
> Write amplification is generally high (and that's understandable given B-trees) but not sure why buffered I/O shows that much.
>
> [1] fio --name=btrfswrite --ioengine=io_uring --directory=/mnt --blocksize=4k --readwrite=randwrite --filesize=30G --numjobs=4 --iodepth=32 --randseed=0 --direct=1 -output=out --group_reporting
>
> Kanchan Joshi (3):
>   block: add integrity offload
>   nvme: support integrity offload
>   btrfs: add checksum offload
>
>  block/bio-integrity.c     | 42 ++++++++++++++++++++++++++++++++++++++-
>  block/t10-pi.c            |  7 +++++++
>  drivers/nvme/host/core.c  | 24 ++++++++++++++++++++++
>  drivers/nvme/host/nvme.h  |  1 +
>  fs/btrfs/bio.c            | 12 +++++++++++
>  fs/btrfs/fs.h             |  1 +
>  fs/btrfs/super.c          |  9 +++++++++
>  include/linux/blk_types.h |  3 +++
>  include/linux/blkdev.h    |  7 +++++++
>  9 files changed, 105 insertions(+), 1 deletion(-)

There's also checksumming done on the metadata trees, which could be avoided if we're trusting the block device to do it.

Maybe rather than putting this behind a new compat flag, add a new csum type of "none"? With the logic being that it also zeroes out the csum field in the B-tree headers.

Mark
On Wed, Jan 29, 2025 at 04:40:25PM +0100, Christoph Hellwig wrote:
> > Another potential benefit: if the device does the checksum, then I think
> > btrfs could avoid the stable page writeback overhead and let the
> > contents be changeable all the way until it goes out on the wire.
>
> If the device generates the checksum (aka DIF insert) that problem goes
> away. But we also lose integrity protection over the wire, which would
> be unfortunate.

If the "wire" is only PCIe, I don't see why it matters. What kind of wire corruption goes undetected by the protocol's encoding and LCRC that would get caught by the host's CRC payload?
On 29/01/2025 15.02, Kanchan Joshi wrote:
> TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> SSD for data checksumming.
>
> Now, the longer version for why/how.
>
> End-to-end data protection (E2EDP)-capable drives require the transfer
> of integrity metadata (PI).
> This is currently handled by the block layer, without filesystem
> involvement/awareness.
> The block layer attaches the metadata buffer, generates the checksum
> (and reftag) for write I/O, and verifies it during read I/O.

Maybe this is a stupid question, but if we can (will) avoid storing the checksum in the FS, what is the advantage of having a COW filesystem?

My understanding is that a COW filesystem is needed mainly to synchronize csum and data. Am I wrong?

[...]

BR
On 1/29/2025 9:05 PM, Christoph Hellwig wrote:
>> This patch series: (a) adds checksum offload awareness to the
>> block layer (patch #1),
> I've skipped over the patches and don't understand what this offload
> awareness concept does compared to the file system simply attaching PI
> metadata.

The difference is that the FS does not have to attach any PI for offload.

Offload is about the host doing as little as possible, and the closest we get there is by setting the PRACT bit. Attaching PI is not really needed, neither for the FS nor for the block layer, for a pure offload.

When the device has an "ms == pi_size" format, we only need to send the I/O with PRACT set, and the device takes care of attaching the integrity buffer and generating/verifying the checksum. This is abstracted as 'offload type 1' in this series.

For the other format, "ms > pi_size", we also set PRACT, but an integrity buffer needs to be passed as well. This is abstracted as 'offload type 2'. It is still an offload, as the checksum processing is done only by the device.

Block layer auto-PI is a good place for this because all the above details are common and remain abstracted, while filesystems only need to decide whether they want to send the flag (REQ_INTEGRITY_OFFLOAD) to use the facility.
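As a rough illustration of how small the filesystem-side opt-in described here could be: the sketch below sets the series' REQ_INTEGRITY_OFFLOAD flag on a data bio when the mount option is active and the disk has registered an integrity profile. The helper name and the exact capability check are assumptions for illustration; the actual patch #3 may differ.

#include <linux/bio.h>
#include <linux/blk-integrity.h>

/* Sketch (not the actual patch): opt a data bio into checksum offload. */
static void fs_maybe_set_csum_offload(struct bio *bio, bool datasum_offload)
{
	struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);

	/* only when mounted with datasum_offload and the device registered
	 * an integrity profile that advertises the offload */
	if (datasum_offload && bi)
		bio->bi_opf |= REQ_INTEGRITY_OFFLOAD;	/* flag added by patch #1 */
}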
On Wed, 29 Jan 2025 at 20:04, Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 29/01/2025 15.02, Kanchan Joshi wrote:
> > TL;DR first: this makes Btrfs chuck its checksum tree and leverage NVMe
> > SSD for data checksumming.
> >
> > Now, the longer version for why/how.
> >
> > End-to-end data protection (E2EDP)-capable drives require the transfer
> > of integrity metadata (PI).
> > This is currently handled by the block layer, without filesystem
> > involvement/awareness.
> > The block layer attaches the metadata buffer, generates the checksum
> > (and reftag) for write I/O, and verifies it during read I/O.
> >
> Maybe this is a stupid question, but if we can (will) avoid storing the checksum
> in the FS, what is the advantage of having a COW filesystem?

I was wondering the same. My understanding is that the checksums are there primarily to protect against untrusted devices or data transfers over the line. And now suddenly we're going to trust them? What's even the point then?

Is there any other advantage of having these checksums that I may be missing? Perhaps logic bugs in the code accidentally corrupting the data? Is the stored payload even ever touched? That would not be wanted, right? Or perhaps data mangled on the storage by an attacker?

> My understanding is that a COW filesystem is needed mainly to synchronize
> csum and data. Am I wrong?
>
> [...]
>
> BR
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
>
On Thu, Jan 30, 2025 at 02:52:23PM +0530, Kanchan Joshi wrote:
> On 1/29/2025 9:05 PM, Christoph Hellwig wrote:
> >> This patch series: (a) adds checksum offload awareness to the
> >> block layer (patch #1),
> > I've skipped over the patches and don't understand what this offload
> > awareness concept does compared to the file system simply attaching PI
> > metadata.
>
> The difference is that the FS does not have to attach any PI for offload.
>
> Offload is about the host doing as little as possible, and the closest
> we get there is by setting the PRACT bit.

But that doesn't actually work. The file system needs to be able to verify the checksum for failing over to other mirrors, repair, etc.

Also, if you trust the device to get things right you do not need to use PI at all - SSDs or hard drives that support PI generally use PI internally anyway, and PRACT just means you treat a format with PI like one without. In other words - no need for an offload here, you might as well just trust the device if you're not doing end to end protection.

> Attaching PI is not really needed, neither for the FS nor for the block
> layer, for a pure offload.
> When the device has an "ms == pi_size" format, we only need to send the
> I/O with PRACT set, and the device takes care of attaching the integrity
> buffer and generating/verifying the checksum.
> This is abstracted as 'offload type 1' in this series.
>
> For the other format, "ms > pi_size", we also set PRACT, but an integrity
> buffer needs to be passed as well. This is abstracted as 'offload type 2'.
> It is still an offload, as the checksum processing is done only by the device.
>
> Block layer auto-PI is a good place for this because all the above details
> are common and remain abstracted, while filesystems only need to decide
> whether they want to send the flag (REQ_INTEGRITY_OFFLOAD) to use the facility.
On Wed, Jan 29, 2025 at 11:03:36AM -0700, Keith Busch wrote:
> > away. But we also lose integrity protection over the wire, which would
> > be unfortunate.
>
> If the "wire" is only PCIe, I don't see why it matters. What kind of
> wire corruption goes undetected by the protocol's encoding and LCRC that
> would get caught by the host's CRC payload?

The "wire" could be anything. And it includes a little more than just the wire, like the entire host-side driver stack and the device data path between the phy and wherever in the stack the PI insert/strip accelerator sits.
Hi Kanchan!

> There is value in avoiding Copy-on-write (COW) checksum tree on a
> device that can anyway store checksums inline (as part of PI). This
> would eliminate extra checksum writes/reads, making I/O more
> CPU-efficient. Additionally, usable space would increase, and write
> amplification, both in Btrfs and eventually at the device level, would
> be reduced [*].

I have a couple of observations.

First of all, there is no inherent benefit to PI if it is generated at the same time as the ECC. The ECC is usually far superior when it comes to protecting data at rest. And you'll still get an error if uncorrected corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does not offer any benefits in my book.

The motivation for T10 PI is that it is generated in close temporal proximity to the data. I.e. ideally the PI protecting the data is calculated as soon as the data has been created in memory. And then the I/O will eventually be queued, submitted, traverse the kernel, through the storage fabric, and out to the end device. The PI and data have traveled along different paths (potentially, more on that later) to get there. The device will calculate the ECC and then perform a validation of the PI wrt. the data buffer. And if those two line up, we know the ECC is also good. At that point we have confirmed that the data to be stored matches the data that was used as input when the PI was generated N seconds ago in host memory. And therefore we can write.

I.e. the goal of PI is to protect against problems that happen between data creation time and the data being persisted to media. Once the ECC has been calculated, PI essentially stops being interesting.

The second point I would like to make is that the separation between PI and data that we introduced with DIX, and which NVMe subsequently adopted, was a feature. It was not just to avoid the inconvenience of having to deal with buffers that were multiples of 520 bytes in host memory. The separation between the data and its associated protection information had proven critical for data protection in many common corruption scenarios. Inline protection had been tried and had failed to catch many of the scenarios we had come across in the field.

At the time T10 PI was designed, spinning rust was the only game in town. And nobody was willing to take the performance hit of having to seek twice per I/O to store PI separately from the data. And while schemes involving sending all the PI ahead of the data were entertained, they never came to fruition. Storing 512+8 in the same sector was a necessity in the context of SCSI drives, not a desired behavior. Addressing that in DIX was key.

So to me, it's a highly desirable feature that btrfs stores its checksums elsewhere on media. But that's obviously a trade-off a user can make. In some cases media WAR may be more important than extending the protection envelope for the data, and that's OK. I would suggest you look at using CRC32C given the intended 4KB block use case, though, because the 16-bit CRC isn't fantastic for large blocks.
On Thu, Jan 30, 2025 at 03:21:45PM -0500, Martin K. Petersen wrote:
> So to me, it's a highly desirable feature that btrfs stores its
> checksums elsewhere on media.

Except for SSDs it generally doesn't - the fact that they are written at the same time means there is a very high chance they will end up on media together for traditional SSD designs.

This might be different when explicitly using some form of data placement scheme, and SSD vendors might be able to place PI/metadata differently under the hood when a big enough customer asks for it (they might not be very happy about the request :)).

> But that's obviously a trade-off a user
> can make. In some cases media WAR may be more important than extending
> the protection envelope for the data, and that's OK. I would suggest you
> look at using CRC32C given the intended 4KB block use case, though,
> because the 16-bit CRC isn't fantastic for large blocks.

One thing that I did implement for my XFS hack/prototype is the ability to store a crc32c in the non-PI metadata supported by nvme. This allows for low-overhead data checksumming, as you don't need a separate data structure to track where the checksums for a data block are located and it doesn't require out-of-place writes. It doesn't provide a ref tag equivalent or device-side checking of the guard tag, unfortunately.

I never could come up with a good use of the app_tag for file systems, so not wasting space for that is actually a good thing.
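A minimal sketch of the per-block checksum side of such a scheme, assuming 4KB logical blocks with 8 bytes of non-PI metadata each; the layout (crc32c in the first four bytes, remainder zeroed) is chosen purely for illustration and is not what the XFS prototype necessarily does. The filled buffer would then be attached through the usual bio integrity path.

#include <linux/crc32c.h>
#include <linux/sizes.h>
#include <linux/string.h>
#include <linux/types.h>

/* Fill an 8-byte-per-block metadata buffer with a crc32c of each 4KB block. */
static void fs_fill_csum_metadata(const void *data, unsigned int nr_blocks, u8 *meta)
{
	unsigned int i;

	for (i = 0; i < nr_blocks; i++) {
		__le32 csum = cpu_to_le32(crc32c(~0, data + i * SZ_4K, SZ_4K));

		memcpy(meta + i * 8, &csum, sizeof(csum));	/* checksum bytes */
		memset(meta + i * 8 + 4, 0, 4);			/* unused metadata bytes */
	}
}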
On 1/29/2025 8:25 PM, Johannes Thumshirn wrote:
> For instance, if we get a checksum error on btrfs we not only report it
> in dmesg, but also try to repair the affected sector if we do have a
> data profile with redundancy.
>
> So while this patchset offloads the submission-side work of the checksum
> tree to the PI code, I don't see the back-propagation of the errors into
> btrfs and the triggering of the repair code.
>
> I get it's an RFC, but as it is now it essentially breaks functionality
> we rely on. Can you add this part as well, so we can evaluate the
> patchset not only from the write but also from the read side?

I tested the series for reads, but only the success cases. In this case checksum generation/verification happens only within the device. It was slightly tricky to inject an error, and I skipped that.

Since separate checksum I/Os are omitted, this is about handling the error condition in the data read I/O path itself. I have not yet checked whether the repair code triggers when Btrfs is working with the existing 'nodatasum' mount option. But I get your point that this needs to be handled.
On 31.01.25 11:19, Kanchan Joshi wrote:
> On 1/29/2025 8:25 PM, Johannes Thumshirn wrote:
>> For instance, if we get a checksum error on btrfs we not only report it
>> in dmesg, but also try to repair the affected sector if we do have a
>> data profile with redundancy.
>>
>> So while this patchset offloads the submission-side work of the checksum
>> tree to the PI code, I don't see the back-propagation of the errors into
>> btrfs and the triggering of the repair code.
>>
>> I get it's an RFC, but as it is now it essentially breaks functionality
>> we rely on. Can you add this part as well, so we can evaluate the
>> patchset not only from the write but also from the read side?
>
> I tested the series for reads, but only the success cases. In this case
> checksum generation/verification happens only within the device. It was
> slightly tricky to inject an error, and I skipped that.
>
> Since separate checksum I/Os are omitted, this is about handling the
> error condition in the data read I/O path itself. I have not yet checked
> whether the repair code triggers when Btrfs is working with the existing
> 'nodatasum' mount option. But I get your point that this needs to be handled.

So this as of now disables one of the most useful features of the FS, repairing bad data. The whole "story" for the RAID code in the FS is built around this assumption, that we can repair bad data, unlike say MD RAID.
On 1/30/2025 6:23 PM, Christoph Hellwig wrote:
>> The difference is that the FS does not have to attach any PI for offload.
>>
>> Offload is about the host doing as little as possible, and the closest
>> we get there is by setting the PRACT bit.
> But that doesn't actually work. The file system needs to be able
> to verify the checksum for failing over to other mirrors, repair,
> etc.

Right. That sounds like reusing the existing code upon detecting a checksum-specific failure. So maybe that can be handled if this goes any further.

> Also, if you trust the device to get things right you do not
> need to use PI at all - SSDs or hard drives that support PI generally
> use PI internally anyway, and PRACT just means you treat a format
> with PI like one without. In other words - no need for an offload
> here, you might as well just trust the device if you're not doing
> end to end protection.

Agreed that the device may be implementing internal E2E, but that's not a contract to be honored. The host can't trust it until the device says so explicitly.

Since Btrfs already has the 'nodatasum' mount option, I assume there are deployments that prefer to optimize checksumming away. I thought Btrfs would be more comfortable giving up its own checksum tree, and would exercise this more often, if it knows for certain that the device is also capable.
On Fri, Jan 31, 2025 at 03:59:17PM +0530, Kanchan Joshi wrote:
> > Also, if you trust the device to get things right you do not
> > need to use PI at all - SSDs or hard drives that support PI generally
> > use PI internally anyway, and PRACT just means you treat a format
> > with PI like one without. In other words - no need for an offload
> > here, you might as well just trust the device if you're not doing
> > end to end protection.
>
> Agreed that the device may be implementing internal E2E, but that's not a
> contract to be honored. The host can't trust it until the device says so
> explicitly.

But you're not doing end to end protection. Once you set PRACT you basically tell the device to pretend the LU/namespace was formatted without protection information.

The fact that you even need the flag has always been very confusing to me - the logical way to expose PI would have been to make PRACT the default and require a flag to actually look at the passed information. I suspect for SCSI this is a result of shoe-horning DIF and DIX into existing infrastructure, and NVMe then blindly copied much of that without thinking about how it fits into an architecture without a separate HBA and without all the legacy concerns.
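For readers following along in the driver, this is roughly what "setting PRACT" amounts to on the host side. The sketch below is a simplified, hedged rendering of the read/write setup logic in the nvme driver; the NVME_RW_PRINFO_* flags are the real ones from include/linux/nvme.h, while the helper itself and its surrounding logic are illustrative only.

#include <linux/blk-integrity.h>
#include <linux/nvme.h>

/* Sketch of the driver-side protection-information control bits. */
static __le16 pi_control_sketch(struct request *req, bool ns_has_pi)
{
	u16 control = 0;

	if (!ns_has_pi)
		return 0;

	if (!blk_integrity_rq(req)) {
		/* No PI passed by the host: the device inserts PI on write and
		 * strips/checks it on read -- the "offload" case above. */
		control |= NVME_RW_PRINFO_PRACT;
	} else {
		/* Host-supplied PI: ask the device to verify guard and ref tags. */
		control |= NVME_RW_PRINFO_PRCHK_GUARD | NVME_RW_PRINFO_PRCHK_REF;
	}
	return cpu_to_le16(control);
}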
On 1/31/2025 1:51 AM, Martin K. Petersen wrote:
>> There is value in avoiding Copy-on-write (COW) checksum tree on a
>> device that can anyway store checksums inline (as part of PI). This
>> would eliminate extra checksum writes/reads, making I/O more
>> CPU-efficient. Additionally, usable space would increase, and write
>> amplification, both in Btrfs and eventually at the device level, would
>> be reduced [*].
> I have a couple of observations.
>
> First of all, there is no inherent benefit to PI if it is generated at
> the same time as the ECC. The ECC is usually far superior when it comes
> to protecting data at rest. And you'll still get an error if uncorrected
> corruption is detected. So BLK_INTEGRITY_OFFLOAD_NO_BUF does not offer
> any benefits in my book.

I fully agree there is no benefit if we see it from the E2E use case. But for a different use case (i.e., checksum offload), BLK_INTEGRITY_OFFLOAD_NO_BUF is as good as it gets.

[Application -> FS -> Block-layer -> Device]

I understand that E2E gets stronger when the integrity/checksum is placed at the origin of the data (the application), and then each component collaborates in checking it. But I am not doing E2E here.

Call it abuse or creativity, but I am using the same E2E-capable device to do checksum offload. If the device had exposed checksum offload in a different way, I would have taken that route rather than the E2E one.
On 1/31/2025 3:59 PM, Johannes Thumshirn wrote:
>> I tested the series for reads, but only the success cases. In this case
>> checksum generation/verification happens only within the device. It was
>> slightly tricky to inject an error, and I skipped that.
>>
>> Since separate checksum I/Os are omitted, this is about handling the
>> error condition in the data read I/O path itself. I have not yet checked
>> whether the repair code triggers when Btrfs is working with the existing
>> 'nodatasum' mount option. But I get your point that this needs to be handled.
>>
> So this as of now disables one of the most useful features of the FS,
> repairing bad data. The whole "story" for the RAID code in the FS is
> built around this assumption, that we can repair bad data, unlike say MD
> RAID.

Does repairing bad data work when Btrfs is mounted with NODATASUM? If not, shouldn't the proposed option be seen as an upgrade over that?

You may already know this, but I do not know how Btrfs currently decides to apply NODATASUM. With these patches it becomes possible to know whether checksum offload is supported by the underlying hardware. And that makes it possible to apply NODATASUM in an informed manner.

I have not reduced anything, but added an opt-in for deployments that may have a different definition of what is useful. Not all planets are Mars. The cost of the checksum tree will be different (say, on QLC vs SLC).
On 03.02.25 14:25, Kanchan Joshi wrote:
> On 1/31/2025 3:59 PM, Johannes Thumshirn wrote:
>>> I tested the series for reads, but only the success cases. In this case
>>> checksum generation/verification happens only within the device. It was
>>> slightly tricky to inject an error, and I skipped that.
>>>
>>> Since separate checksum I/Os are omitted, this is about handling the
>>> error condition in the data read I/O path itself. I have not yet checked
>>> whether the repair code triggers when Btrfs is working with the existing
>>> 'nodatasum' mount option. But I get your point that this needs to be handled.
>>>
>> So this as of now disables one of the most useful features of the FS,
>> repairing bad data. The whole "story" for the RAID code in the FS is
>> built around this assumption, that we can repair bad data, unlike say MD
>> RAID.
>
> Does repairing bad data work when Btrfs is mounted with NODATASUM?
> If not, shouldn't the proposed option be seen as an upgrade over that?
>
> You may already know this, but I do not know how Btrfs currently decides
> to apply NODATASUM. With these patches it becomes possible to know whether
> checksum offload is supported by the underlying hardware. And that makes
> it possible to apply NODATASUM in an informed manner.

NODATASUM is something I personally would only ever turn on for VM images, so we don't have the stable page vs checksums f*up (see Qu's answers).

> I have not reduced anything, but added an opt-in for deployments that
> may have a different definition of what is useful. Not all planets are
> Mars. The cost of the checksum tree will be different (say, on QLC vs SLC).

But NODATASUM isn't something that is actively recommended unless you know what you're doing. I thought of your patches as an offload of the checksum tree to the T10-PI extended sector format, which I personally like. And it's not that hard to do that.

If it's about getting people to use NODATASUM I'm really starting to dislike the patchset. Also, NODATASUM implies deactivated compression. So to me this sounds like a good use case for one or two specific scenarios, but not really all that helpful for the broader picture, which it could have been.
On 2/3/2025 7:10 PM, Johannes Thumshirn wrote:
> But NODATASUM isn't something that is actively recommended unless you
> know what you're doing. I thought of your patches as an offload of the
> checksum tree to the T10-PI extended sector format

You thought right; the patches do the "offload to T10-PI format" part. That part is generic for any upper-layer user. One only needs to send the REQ_INTEGRITY_OFFLOAD flag for that.

And for the other part, "suppress data csums at the Btrfs level", I thought of using NODATASUM.
On 03.02.25 15:04, Kanchan Joshi wrote:
> On 2/3/2025 7:10 PM, Johannes Thumshirn wrote:
>> But NODATASUM isn't something that is actively recommended unless you
>> know what you're doing. I thought of your patches as an offload of the
>> checksum tree to the T10-PI extended sector format
>
> You thought right; the patches do the "offload to T10-PI format" part.
> That part is generic for any upper-layer user. One only needs to send
> the REQ_INTEGRITY_OFFLOAD flag for that.

That one I think is nice.

> And for the other part, "suppress data csums at the Btrfs level", I thought
> of using NODATASUM.

That one I hate with passion, IFF done the way it's done in this patchset. For the generation/write side you can go that way. But it needs to be wired up so that btrfs can be the consumer (is this the correct term here?) of the checksums as well.

Luckily 'btrfs_check_read_bio()' treats bio.bi_status != BLK_STS_OK the same way as !btrfs_data_csum_ok(), so that part looks safe to me. Also, btrfs_data_csum_ok() does an early return if there's no checksum, so I think we're good there.

In order to make scrub (and RAID5) work, you'd need a fallback for 'btrfs_lookup_csums_bitmap()' that returns the sector checksum from the bio integrity layer instead of looking at the checksum tree.
Christoph,

> Except for SSDs it generally doesn't - the fact that they are written
> at the same time means there is a very high chance they will end up
> on media together for traditional SSD designs.
>
> This might be different when explicitly using some form of data
> placement scheme, and SSD vendors might be able to place PI/metadata
> differently under the hood when a big enough customer asks for it
> (they might not be very happy about the request :)).

There was a multi-vendor effort many years ago (first-gen SSD era) to make vendors guarantee that metadata and data would be written to different channels. But performance got in the way, obviously.

> One thing that I did implement for my XFS hack/prototype is the ability
> to store a crc32c in the non-PI metadata supported by nvme. This allows
> for low-overhead data checksumming, as you don't need a separate data
> structure to track where the checksums for a data block are located and
> it doesn't require out-of-place writes. It doesn't provide a ref tag
> equivalent or device-side checking of the guard tag, unfortunately.

That sounds fine. Again, I don't have a problem with having the ability to choose whether checksum placement or WAF is more important for a given application.

> I never could come up with a good use of the app_tag for file systems,
> so not wasting space for that is actually a good thing.

I wish we could just do 4 bytes of CRC32C + 4 bytes of ref tag. I think that would be a reasonable compromise between space and utility. But we can't do that because of the app tag escape. We're essentially wasting 2 bytes per block to store a single bit flag.

In general I think 4096+16 is a reasonable format going forward, with either CRC32C or CRC64 plus the full LBA as ref tag.
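For reference, the metadata layouts being discussed look roughly like this. The first two structs mirror the tuple definitions the kernel already has in include/linux/t10-pi.h; the third is a purely hypothetical layout matching the "4 bytes of CRC32C + 4 bytes of ref tag" wish above and is not something T10 or NVMe define today.

#include <linux/types.h>

/* 512/4096 + 8: classic T10 DIF tuple (as in include/linux/t10-pi.h) */
struct t10_pi_tuple {
	__be16 guard_tag;	/* 16-bit CRC */
	__be16 app_tag;		/* 0xffff = app tag escape, disables checking */
	__be32 ref_tag;		/* low 32 bits of the LBA */
};

/* 4096 + 16 with a 64-bit guard, as used by the NVMe 64b Guard format */
struct crc64_pi_tuple {
	__be64 guard_tag;
	__be16 app_tag;
	__u8   ref_tag[6];	/* storage + reference space */
};

/* Hypothetical 8-byte layout for the wish above: CRC32C plus a full 32-bit
 * ref tag, with no app tag escape. Not defined by any current standard. */
struct crc32_lba_tuple {
	__be32 guard_tag;	/* CRC32C */
	__be32 ref_tag;		/* low 32 bits of the LBA */
};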
On Mon, Feb 03, 2025 at 02:31:13PM -0500, Martin K. Petersen wrote:
> > I never could come up with a good use of the app_tag for file systems,
> > so not wasting space for that is actually a good thing.
>
> I wish we could just do 4 bytes of CRC32C + 4 bytes of ref tag. I think
> that would be a reasonable compromise between space and utility.

Agreed.

> But we
> can't do that because of the app tag escape. We're essentially wasting 2
> bytes per block to store a single bit flag.

Well, what do we actually need the app tag escape for, except for historical precedence? In NVMe the app tag escape is an option for deallocated blocks, but it's only one of the options, the other being a synthetic guard/ref tag with the expected value.

> In general I think 4096+16 is a reasonable format going forward, with
> either CRC32C or CRC64 plus the full LBA as ref tag.

That would also work fine. NVMe supports 4-byte crc32c + 2-byte app tag + 10 bytes of reference/storage tag, and 8-byte crc64 + 2-byte app tag + 6 bytes of reference/storage tag, although Linux only supports the latter so far.
Christoph,

> Well, what do we actually need the app tag escape for, except for
> historical precedence? In NVMe the app tag escape is an option for
> deallocated blocks, but it's only one of the options, the other being
> a synthetic guard/ref tag with the expected value.

I have been told that some arrays use it to disable PI when writing the RAID parity blocks. I guess that makes sense if the array firmware is mixing data and parity blocks in a single write operation. For my test tool I just use WRPROTECT=3 to disable checking when writing "bad" PI.

And obviously the original reason for the initialization pattern escape made sense on a disk drive that had no FTL to track whether a block was in a written state or not. But I really don't think any of this is useful in a modern SSD or even array context.

>> In general I think 4096+16 is a reasonable format going forward, with
>> either CRC32C or CRC64 plus the full LBA as ref tag.
>
> That would also work fine. NVMe supports 4-byte crc32c + 2-byte app tag +
> 10 bytes of reference/storage tag, and 8-byte crc64 + 2-byte app tag +
> 6 bytes of reference/storage tag, although Linux only supports the latter
> so far.

Yep, the CRC32C variant should be easy to wire up. I've thought about the storage tag but haven't really come up with a good use case. It's essentially the same situation as with the app tag.
On Tue, Feb 04, 2025 at 07:52:38AM -0500, Martin K. Petersen wrote:
> I have been told that some arrays use it to disable PI when writing the
> RAID parity blocks. I guess that makes sense if the array firmware is
> mixing data and parity blocks in a single write operation. For my test
> tool I just use WRPROTECT=3 to disable checking when writing "bad" PI.

Hmm, why would you disable PI for parity blocks? But yes, outside of Linux there might be uses. Just thinking of a "perfect" format for our needs.

> Yep, the CRC32C variant should be easy to wire up. I've thought about
> the storage tag but haven't really come up with a good use case. It's
> essentially the same situation as with the app tag.

Exactly, it's an app tag by other means.
Christoph,
> Hmm, why would you disable PI for parity blocks?
Some devices calculate P and Q on the data portion of the block only and
store valid PI. Others calculate P and Q for the entire block, including
metadata.