Message ID | 20250311133517.3095878-1-kent.overstreet@linux.dev (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | block: Allow REQ_FUA|REQ_READ | expand |
Hello Kent, On Tue, Mar 11, 2025 at 09:35:16AM -0400, Kent Overstreet wrote: > REQ_FUA|REQ_READ means "do a read that bypasses the controller cache", > the same as writes. > > This is useful for when the filesystem gets a checksum error, it's > possible that a bit was flipped in the controller cache, and when we > retry we want to retry the entire IO, not just from cache. Looking at ATA Command Set - 6 (ACS-6), 7.23 READ FPDMA QUEUED - 60h """ If the Forced Unit Access (FUA) bit is set to one, the device shall retrieve the data from the non-volatile media regardless of whether the device holds the requested information in the volatile cache. If the device holds a modified copy of the requested data as a result of having volatile cached writes, the modified data shall be written to the non-volatile media before being retrieved from the non-volatile media as part of this operation. """ So IIUC, at least for ATA, if something is corrupted in the volatile write cache, setting the FUA bit will ensure that the corruption will get propagated to the non-volatile media. Kind regards, Niklas
Kent, > REQ_FUA|REQ_READ means "do a read that bypasses the controller cache", > the same as writes. FUA does not bypass anything. On the contrary, A FUA READ causes data in the volatile cache (if any) to be flushed to non-volatile storage (either cache or media or both). So I'm afraid it has the exact opposite effect of what you are looking for.
On Tue, Mar 11, 2025 at 03:53:51PM +0100, Niklas Cassel wrote: > Hello Kent, > > On Tue, Mar 11, 2025 at 09:35:16AM -0400, Kent Overstreet wrote: > > REQ_FUA|REQ_READ means "do a read that bypasses the controller cache", > > the same as writes. > > > > This is useful for when the filesystem gets a checksum error, it's > > possible that a bit was flipped in the controller cache, and when we > > retry we want to retry the entire IO, not just from cache. > > Looking at ATA Command Set - 6 (ACS-6), > 7.23 READ FPDMA QUEUED - 60h > > """ > If the Forced Unit Access (FUA) bit is set to one, the device shall retrieve > the data from the non-volatile media regardless of whether the device holds > the requested information in the volatile cache. > > If the device holds a modified copy of the requested data as a result of > having volatile cached writes, the modified data shall be written to the > non-volatile media before being retrieved from the non-volatile media as > part of this operation. > """ > > So IIUC, at least for ATA, if something is corrupted in the volatile > write cache, setting the FUA bit will ensure that the corruption will > get propagated to the non-volatile media. Corrupted in the volatile writeback cache is not the expected case - we don't usually read data we've just written. Generally, the data will be in the device's cache only because we just read it, and we don't want to retry the read from the device cache. It's possible that either the device's cache was faulty, or there was a transient error in reading from flash (SSD ec algorithms are effectively probabilistic these days), so in either case retrying with a FUA read has a much better chance of clearing a transient error and correctly reading the bad data.
On Tue, Mar 11, 2025 at 03:53:51PM +0100, Niklas Cassel wrote: > So IIUC, at least for ATA, if something is corrupted in the volatile > write cache, setting the FUA bit will ensure that the corruption will > get propagated to the non-volatile media. It's not necessarily going to corrupt persistent media if the volatile write cache is corrupted. It should just apply only to data written to the volatile write cache that hasn't been flushed to the media. If the device reads persisted media data into a corrupted cache, the FUA here should catch that and see good data. But if you haven't flushed previous writes from a corrupted volatile write cache, then I believe you're right: the read FUA will presist that corruption. Same is true for NVMe.
diff --git a/block/blk-core.c b/block/blk-core.c index d6c4fa3943b5..7b1103eb877d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -793,20 +793,21 @@ void submit_bio_noacct(struct bio *bio) goto end_io; } + if (WARN_ON_ONCE((bio->bi_opf & REQ_PREFLUSH) && + bio_op(bio) != REQ_OP_WRITE && + bio_op(bio) != REQ_OP_ZONE_APPEND)) + goto end_io; + /* * Filter flush bio's early so that bio based drivers without flush * support don't have to worry about them. */ - if (op_is_flush(bio->bi_opf)) { - if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_WRITE && - bio_op(bio) != REQ_OP_ZONE_APPEND)) + if (op_is_flush(bio->bi_opf) && + !bdev_write_cache(bdev)) { + bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA); + if (!bio_sectors(bio)) { + status = BLK_STS_OK; goto end_io; - if (!bdev_write_cache(bdev)) { - bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA); - if (!bio_sectors(bio)) { - status = BLK_STS_OK; - goto end_io; - } } }
REQ_FUA|REQ_READ means "do a read that bypasses the controller cache", the same as writes. This is useful for when the filesystem gets a checksum error, it's possible that a bit was flipped in the controller cache, and when we retry we want to retry the entire IO, not just from cache. The nvme driver already passes through REQ_FUA for reads, not just writes, so disabling the warning is sufficient to start using it, and bcachefs is implementing additional retries for checksum errors so can immediately use it. Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> --- block/blk-core.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-)