Message ID: 20220517145039.3202184-13-hch@lst.de (mailing list archive)
State: New, archived
Series: [01/15] btrfs: introduce a pure data checksum checking helper
On 2022/5/17 22:50, Christoph Hellwig wrote:
> This adds a new read repair implementation for btrfs. It is synchronous
> in that the end I/O handlers call it, and will get back the results
> instead of potentially getting multiple concurrent calls back into the
> original end I/O handler. The synchronous nature has the following
> advantages:
>
> - there is no need for a per-I/O tree of I/O failures, as everything
>   related to the I/O failure can be handled locally
> - not having separate repair end I/O helpers will in the future help
>   to reuse the direct I/O bio from iomap for the actual submission and
>   thus remove the btrfs_dio_private infrastructure
>
> Because submitting many sector size synchronous I/Os would be very
> slow when multiple sectors (or a whole read) fail, this new code instead
> submits a single read and repair write bio for each contiguous section.
> It uses a clone of the bio to do that and thus does not need to allocate
> any extra bio_vecs. Note that this cloning is open coded instead of
> using the block layer clone helpers as the clone is based on the saved
> iter in the btrfs_bio, and not bio.bi_iter, which at the point that the
> repair code is called has been advanced by the low-level driver.
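The 1-based round-robin mirror rotation that the patch below relies on (its next_mirror()/prev_mirror() helpers) can be sketched in Python; the helper names and the 1-based mirror numbering follow the patch, while num_copies is passed in directly instead of living in a struct:

```python
def next_mirror(num_copies: int, cur_mirror: int) -> int:
    """Advance to the next mirror, wrapping from num_copies back to 1."""
    if cur_mirror == num_copies:
        return cur_mirror + 1 - num_copies
    return cur_mirror + 1

def prev_mirror(num_copies: int, cur_mirror: int) -> int:
    """Step back to the previous mirror, wrapping from 1 to num_copies."""
    if cur_mirror == 1:
        return num_copies
    return cur_mirror - 1

# Starting from mirror 2 of a 3-copy profile, a full rotation visits
# every other copy exactly once before returning to the start:
visited = []
m = next_mirror(3, 2)
while m != 2:
    visited.append(m)
    m = next_mirror(3, m)
print(visited)  # → [3, 1]
```

This is why the repair loop in __btrfs_read_repair_end() can terminate on `read_mirror != failed_bbio->mirror_num`: the rotation is guaranteed to come back to the failed mirror after trying every other copy.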
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/Makefile      |   2 +-
>  fs/btrfs/read-repair.c | 211 +++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/read-repair.h |  33 +++++++
>  fs/btrfs/super.c       |   9 +-
>  4 files changed, 252 insertions(+), 3 deletions(-)
>  create mode 100644 fs/btrfs/read-repair.c
>  create mode 100644 fs/btrfs/read-repair.h
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 99f9995670ea3..0b2605c750cab 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>  	backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  	uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>  	block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> -	subpage.o tree-mod-log.o
> +	subpage.o tree-mod-log.o read-repair.o
>
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
> new file mode 100644
> index 0000000000000..3ac93bfe09e4f
> --- /dev/null
> +++ b/fs/btrfs/read-repair.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Christoph Hellwig.
> + */
> +#include "ctree.h"
> +#include "volumes.h"
> +#include "read-repair.h"
> +#include "btrfs_inode.h"
> +
> +static struct bio_set read_repair_bioset;
> +
> +static int next_mirror(struct btrfs_read_repair *rr, int cur_mirror)
> +{
> +	if (cur_mirror == rr->num_copies)
> +		return cur_mirror + 1 - rr->num_copies;
> +	return cur_mirror + 1;
> +}
> +
> +static int prev_mirror(struct btrfs_read_repair *rr, int cur_mirror)
> +{
> +	if (cur_mirror == 1)
> +		return rr->num_copies;
> +	return cur_mirror - 1;
> +}
> +
> +/*
> + * Clone a new bio from the src_bbio, using the saved iter in the btrfs_bio
> + * instead of bi_iter.
> + */
> +static struct btrfs_bio *btrfs_repair_bio_clone(struct btrfs_bio *src_bbio,
> +		u64 offset, u32 size, unsigned int op)
> +{
> +	struct btrfs_bio *bbio;
> +	struct bio *bio;
> +
> +	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
> +			&read_repair_bioset);
> +	bio_set_flag(bio, BIO_CLONED);

Do we need to bother setting the CLONED flag?

Without the CLONED flag, we can easily go bio_for_each_segment_all() in the
endio function without the need of bbio->iter, and thus save some memory.

> +
> +	bio->bi_io_vec = src_bbio->bio.bi_io_vec;
> +	bio->bi_iter = src_bbio->iter;
> +	bio_advance(bio, offset);
> +	bio->bi_iter.bi_size = size;
> +
> +	bbio = btrfs_bio(bio);
> +	memset(bbio, 0, offsetof(struct btrfs_bio, bio));
> +	bbio->iter = bbio->bio.bi_iter;
> +	bbio->file_offset = src_bbio->file_offset + offset;
> +
> +	return bbio;
> +}
> +
> +static void btrfs_repair_one_mirror(struct btrfs_bio *read_bbio,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		int bad_mirror)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_inode *bi = BTRFS_I(inode);
> +	u64 logical = read_bbio->iter.bi_sector << SECTOR_SHIFT;
> +	u64 file_offset = read_bbio->file_offset;
> +	struct btrfs_bio *write_bbio;
> +	blk_status_t ret;
> +
> +	/*
> +	 * For zoned file systems repair has to relocate the whole zone.
> +	 */
> +	if (btrfs_repair_one_zone(fs_info, logical))
> +		return;
> +
> +	/*
> +	 * For RAID56, we can not just write the bad data back, as any write
> +	 * will trigger RMW and read back the corrupted on-disk stripe, causing
> +	 * further damage.
> +	 *
> +	 * Perform a special low-level repair that bypasses btrfs_map_bio.
> + */
> +	if (btrfs_is_parity_mirror(fs_info, logical, fs_info->sectorsize)) {
> +		struct bvec_iter iter;
> +		struct bio_vec bv;
> +		u32 offset;
> +
> +		btrfs_bio_for_each_sector(fs_info, bv, read_bbio, iter, offset)
> +			btrfs_repair_io_failure(fs_info, btrfs_ino(bi),
> +					file_offset + offset,
> +					fs_info->sectorsize,
> +					logical + offset,
> +					bv.bv_page, bv.bv_offset,
> +					bad_mirror);
> +		return;
> +	}
> +
> +	/*
> +	 * Otherwise just clone the whole bio and write it back to the
> +	 * previously bad mirror.
> +	 */
> +	write_bbio = btrfs_repair_bio_clone(read_bbio, 0,
> +			read_bbio->iter.bi_size, REQ_OP_WRITE);

Do we need to clone the whole bio?

Considering that in most cases read repair is already the cold path, having
more than one corruption is already rare.

> +	ret = btrfs_map_bio_wait(fs_info, &write_bbio->bio, bad_mirror);
> +	bio_put(&write_bbio->bio);
> +
> +	btrfs_info_rl(fs_info,
> +"%s: root %lld ino %llu off %llu logical %llu/%u from good mirror %d",
> +		ret ? "failed to correct read error" : "read error corrected",
> +		bi->root->root_key.objectid, btrfs_ino(bi),
> +		file_offset, logical, read_bbio->iter.bi_size, bad_mirror);
> +}
> +
> +static bool btrfs_repair_read_bio(struct btrfs_bio *bbio,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		int read_mirror)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	u32 start_offset = bbio->file_offset - failed_bbio->file_offset;
> +	u8 csum[BTRFS_CSUM_SIZE];
> +	struct bvec_iter iter;
> +	struct bio_vec bv;
> +	u32 offset;
> +
> +	if (btrfs_map_bio_wait(fs_info, &bbio->bio, read_mirror))
> +		return false;
> +
> +	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
> +		return true;
> +
> +	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
> +		u8 *expected_csum =
> +			btrfs_csum_ptr(fs_info, failed_bbio->csum,
> +					start_offset + offset);
> +
> +		if (btrfs_check_data_sector(fs_info, bv.bv_page, bv.bv_offset,
> +				csum, expected_csum))
> +			return false;
> +	}
> +
> +	return
true;
> +}
> +
> +bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		u64 end_offset, repair_endio_t endio)

The reason I don't use "end" in my patchset is that anything with "end" in
its name will make people think it is related to endio. And in fact when
checking the function, I really thought it was something related to endio,
but it's the equivalent of btrfs_read_repair_finish().

> +{
> +	u8 bad_mirror = failed_bbio->mirror_num;
> +	u8 read_mirror = next_mirror(rr, bad_mirror);
> +	struct btrfs_bio *read_bbio = NULL;
> +	bool uptodate = false;
> +
> +	do {
> +		if (read_bbio)
> +			bio_put(&read_bbio->bio);
> +		read_bbio = btrfs_repair_bio_clone(failed_bbio,
> +				rr->start_offset, end_offset - rr->start_offset,
> +				REQ_OP_READ);
> +		if (btrfs_repair_read_bio(read_bbio, failed_bbio, inode,
> +				read_mirror)) {

A big NONO here.

Function btrfs_repair_read_bio() will only return true if all of its data
matches its csum.

Consider the following case:

Profile RAID1C3, 2 sectors to read, the initial mirror is 1.

  Mirror 1: |X|X|
  Mirror 2: |X| |
  Mirror 3: | |X|

Now we will get -EIO, but in reality we can repair the read by using the
first sector from mirror 3 and the second sector from mirror 2.

This is a behavior regression. And that's why the original repair code and
all my patchsets do a sector-by-sector check, not a full range check.
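Qu's RAID1C3 counterexample can be simulated in a few lines to show the difference between whole-range and per-sector repair. The layout below mirrors the diagram above (True means the sector's checksum matches on that mirror); both helper names are just for illustration:

```python
# Mirror layout from the example above: 2 sectors, 3 copies.
mirrors = {
    1: [False, False],   # |X|X|
    2: [False, True],    # |X| |
    3: [True,  False],   # | |X|
}

def whole_range_repair_possible(mirrors):
    """A whole-bio retry succeeds only if some mirror has every sector good."""
    return any(all(sectors) for sectors in mirrors.values())

def per_sector_repair_possible(mirrors):
    """Per-sector repair succeeds if each sector is good on some mirror."""
    nsectors = len(next(iter(mirrors.values())))
    return all(any(copy[i] for copy in mirrors.values())
               for i in range(nsectors))

print(whole_range_repair_possible(mirrors))  # → False: whole-bio check gives -EIO
print(per_sector_repair_possible(mirrors))   # → True: sector 0 from mirror 3,
                                             #   sector 1 from mirror 2
```

This is exactly the regression being pointed out: with full-range checking none of the three mirrors passes, even though every sector is recoverable from some copy.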
Thanks,
Qu

> +			do {
> +				btrfs_repair_one_mirror(read_bbio, failed_bbio,
> +						inode, bad_mirror);
> +			} while ((bad_mirror = prev_mirror(rr, bad_mirror)) !=
> +					failed_bbio->mirror_num);
> +			uptodate = true;
> +			break;
> +		}
> +		bad_mirror = read_mirror;
> +		read_mirror = next_mirror(rr, bad_mirror);
> +	} while (read_mirror != failed_bbio->mirror_num);
> +
> +	if (endio)
> +		endio(read_bbio, inode, uptodate);
> +	bio_put(&read_bbio->bio);
> +
> +	rr->in_use = false;
> +	return uptodate;
> +}
> +
> +bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		u64 start_offset)
> +{
> +	if (rr->in_use)
> +		return true;
> +
> +	if (!rr->num_copies) {
> +		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +
> +		rr->num_copies = btrfs_num_copies(fs_info,
> +				failed_bbio->iter.bi_sector << SECTOR_SHIFT,
> +				failed_bbio->iter.bi_size);
> +	}
> +
> +	/*
> +	 * If there is no other copy of the data to recover from, give up now
> +	 * and don't even try to build up a larger batch.
> + */
> +	if (rr->num_copies < 2)
> +		return false;
> +
> +	rr->in_use = true;
> +	rr->start_offset = start_offset;
> +	return true;
> +}
> +
> +void btrfs_read_repair_exit(void)
> +{
> +	bioset_exit(&read_repair_bioset);
> +}
> +
> +int __init btrfs_read_repair_init(void)
> +{
> +	return bioset_init(&read_repair_bioset, BIO_POOL_SIZE,
> +			offsetof(struct btrfs_bio, bio), 0);
> +}
> diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
> new file mode 100644
> index 0000000000000..e371790af2b3e
> --- /dev/null
> +++ b/fs/btrfs/read-repair.h
> @@ -0,0 +1,33 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef BTRFS_READ_REPAIR_H
> +#define BTRFS_READ_REPAIR_H
> +
> +struct btrfs_read_repair {
> +	u64 start_offset;
> +	bool in_use;
> +	int num_copies;
> +};
> +
> +typedef void (*repair_endio_t)(struct btrfs_bio *repair_bbio,
> +		struct inode *inode, bool uptodate);
> +
> +bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		u64 bio_offset);
> +bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		u64 end_offset, repair_endio_t end_io);
> +static inline bool btrfs_read_repair_end(struct btrfs_read_repair *rr,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		u64 end_offset, repair_endio_t endio)
> +{
> +	if (!rr->in_use)
> +		return true;
> +	return __btrfs_read_repair_end(rr, failed_bbio, inode, end_offset,
> +			endio);
> +}
> +
> +int __init btrfs_read_repair_init(void);
> +void btrfs_read_repair_exit(void);
> +
> +#endif /* BTRFS_READ_REPAIR_H */
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index b1fdc6a26c76e..b33ad892c3058 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -48,6 +48,7 @@
>  #include "block-group.h"
>  #include "discard.h"
>  #include "qgroup.h"
> +#include "read-repair.h"
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/btrfs.h>
>
> @@ -2641,10 +2642,12 @@ static int __init
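The batching contract of btrfs_read_repair_add() above can be sketched as a small state machine. This is only a model: a dict stands in for struct btrfs_read_repair, and num_copies is passed in directly instead of being looked up via btrfs_num_copies():

```python
def read_repair_add(rr, num_copies, start_offset):
    """Start or extend a batch of contiguous bad sectors.

    Returns False only when there is no second copy to recover from;
    otherwise records the start of the batch on first use.
    """
    if rr["in_use"]:
        return True            # already batching, keep extending
    if num_copies < 2:
        return False           # no other copy of the data, give up now
    rr["in_use"] = True
    rr["start_offset"] = start_offset
    return True

rr = {"in_use": False, "start_offset": 0, "num_copies": 0}
print(read_repair_add(rr, 2, 4096))   # → True (batch opened at offset 4096)
print(read_repair_add(rr, 2, 8192))   # → True (batch extended)
print(rr["start_offset"])             # → 4096, start of the contiguous bad range
```

A single-copy profile (num_copies == 1) makes the first call return False, which is why the end I/O handler can fail the read immediately without building a batch.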
init_btrfs_fs(void)
>  	err = extent_io_init();
>  	if (err)
>  		goto free_cachep;
> -
> -	err = extent_state_cache_init();
> +	err = btrfs_read_repair_init();
>  	if (err)
>  		goto free_extent_io;
> +	err = extent_state_cache_init();
> +	if (err)
> +		goto free_read_repair;
>
>  	err = extent_map_init();
>  	if (err)
> @@ -2708,6 +2711,8 @@ static int __init init_btrfs_fs(void)
>  	extent_map_exit();
>  free_extent_state_cache:
>  	extent_state_cache_exit();
> +free_read_repair:
> +	btrfs_read_repair_exit();
>  free_extent_io:
>  	extent_io_exit();
>  free_cachep:
On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
>> +static struct btrfs_bio *btrfs_repair_bio_clone(struct btrfs_bio *src_bbio,
>> +		u64 offset, u32 size, unsigned int op)
>> +{
>> +	struct btrfs_bio *bbio;
>> +	struct bio *bio;
>> +
>> +	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
>> +			&read_repair_bioset);
>> +	bio_set_flag(bio, BIO_CLONED);
>
> Do we need to bother setting the CLONED flag?

The CLONED flag should never be set. Except for the one bogus check in
btrfs that I have a pending removal for, it is only used for debugging
checks.

> Without the CLONED flag, we can easily go bio_for_each_segment_all() in
> the endio function without the need of bbio->iter, thus can save some
> memory.

bio_for_each_segment_all ignores the iter and walks over bi_io_vec
directly. And that is something we absolutely can't do here, as the bio
reuses the bio_vecs from the failed bio, and depending on what failed
reduces the size.

>> +	/*
>> +	 * Otherwise just clone the whole bio and write it back to the
>> +	 * previously bad mirror.
>> +	 */
>> +	write_bbio = btrfs_repair_bio_clone(read_bbio, 0,
>> +			read_bbio->iter.bi_size, REQ_OP_WRITE);
>
> Do we need to clone the whole bio?
>
> Considering under most cases the read repair is already the cold path,
> having more than one corruption is already rare.

The read bio is trimmed to only cover the bad area, so this already
potentially does not cover the whole failed bbio. But except for the case
you note below we do need to write back the whole bio.

>> +bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
>> +		struct btrfs_bio *failed_bbio, struct inode *inode,
>> +		u64 end_offset, repair_endio_t endio)
>
> The reason I don't use "end" in my patchset is, anything related to "end"
> will let people think it has something to do with endio.

That seems like an odd connotation. Without io in the word, end just
connotes the end of a range in most of the Linux I/O stack.
> And in fact when checking the function, I really thought it was something
> related to endio, but it's the equivalent of btrfs_read_repair_finish().

But if there is a strong preference I can use finish instead of end.
On 2022/5/18 16:54, Christoph Hellwig wrote:
> On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
>>> +static struct btrfs_bio *btrfs_repair_bio_clone(struct btrfs_bio *src_bbio,
>>> +		u64 offset, u32 size, unsigned int op)
>>> +{
>>> +	struct btrfs_bio *bbio;
>>> +	struct bio *bio;
>>> +
>>> +	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
>>> +			&read_repair_bioset);
>>> +	bio_set_flag(bio, BIO_CLONED);
>>
>> Do we need to bother setting the CLONED flag?
>
> The CLONED flag should never be set. Except for the one bogus check
> in btrfs that I have a pending removal for, it is only used for
> debugging checks.
>
>> Without the CLONED flag, we can easily go bio_for_each_segment_all() in
>> the endio function without the need of bbio->iter, thus can save some
>> memory.
>
> bio_for_each_segment_all ignores the iter and walks over bi_io_vec
> directly. And that is something we absolutely can't do here, as the
> bio reuses the bio_vecs from the failed bio, and depending on what
> failed reduces the size.

My bad, I saw the bio_alloc_bioset() but didn't check that it allocates a
bi_io_vec with size 0 and then reuses the original bi_io_vec.

So the function matches its name, it really is a bio clone.

And it's very different from my version, which really allocates a new bio
with a large enough bi_io_vec and then adds back the needed sectors from
the original bio.

Then I guess the BIO_CLONED flag is completely fine.

But in that case, you may want to call bio_alloc_clone() directly? It can
handle biosets without problem.

>
>>> +	/*
>>> +	 * Otherwise just clone the whole bio and write it back to the
>>> +	 * previously bad mirror.
>>> +	 */
>>> +	write_bbio = btrfs_repair_bio_clone(read_bbio, 0,
>>> +			read_bbio->iter.bi_size, REQ_OP_WRITE);
>>
>> Do we need to clone the whole bio?
>>
>> Considering under most cases the read repair is already the cold path,
>> having more than one corruption is already rare.
>
> The read bio is trimmed to only cover the bad area, so this already
> potentially does not cover the whole failed bbio. But except for
> the case you note below we do need to write back the whole bio.
>
>>> +bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
>>> +		struct btrfs_bio *failed_bbio, struct inode *inode,
>>> +		u64 end_offset, repair_endio_t endio)
>>
>> The reason I don't use "end" in my patchset is, anything related to
>> "end" will let people think it has something to do with endio.
>
> That seems like an odd connotation. Without io in the word, end just
> connotes the end of a range in most of the Linux I/O stack.

OK, thanks for explaining this, I guess it's my problem linking "end" to
"end_io" or "endio".

Then no problem using the current naming.

Thanks,
Qu

>
>> And in fact when checking the function, I really thought it was
>> something related to endio, but it's the equivalent of
>> btrfs_read_repair_finish().
>
> But if there is a strong preference I can use finish instead of end.
On Wed, May 18, 2022 at 06:20:53PM +0800, Qu Wenruo wrote:
> My bad, I saw the bio_alloc_bioset() but didn't check that it allocates
> a bi_io_vec with size 0 and then reuses the original bi_io_vec.
>
> So the function matches its name, it really is a bio clone.
>
> And it's very different from my version, which really allocates a new
> bio with a large enough bi_io_vec and then adds back the needed sectors
> from the original bio.
>
> Then I guess the BIO_CLONED flag is completely fine.
>
> But in that case, you may want to call bio_alloc_clone() directly? It
> can handle biosets without problem.

The big difference is that bio_alloc_clone directly looks at bio->bi_iter,
while we must look at btrfs_bio->iter, as bio->bi_iter may already be
consumed by the driver. I thought the comment in the code explains that,
but maybe I need to improve it.
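The point about the saved iter can be illustrated with a toy model: by the time the repair code runs, the driver has consumed bio.bi_iter, so a clone based on it would describe an empty range, while the iter saved at submission time still describes the whole failed range. The field names mimic the kernel structs, but this is only a sketch, not kernel code:

```python
from dataclasses import dataclass, replace

@dataclass
class Iter:
    sector: int   # starting 512-byte sector
    size: int     # bytes remaining in the iter

@dataclass
class Bio:
    bi_iter: Iter      # advanced by the low-level driver on completion
    saved_iter: Iter   # copy btrfs takes at submission time (btrfs_bio->iter)

def driver_complete(bio: Bio) -> None:
    """Low-level drivers advance bi_iter as they complete the I/O."""
    bio.bi_iter.sector += bio.bi_iter.size // 512
    bio.bi_iter.size = 0

bio = Bio(bi_iter=Iter(sector=1000, size=8192),
          saved_iter=Iter(sector=1000, size=8192))
driver_complete(bio)

clone_from_bi_iter = replace(bio.bi_iter)    # what a bi_iter-based clone sees
clone_from_saved = replace(bio.saved_iter)   # what the open-coded clone uses

print(clone_from_bi_iter.size)  # → 0: nothing left to read for repair
print(clone_from_saved.size)    # → 8192: the full failed range
```

This is why the patch open codes the clone from btrfs_bio->iter instead of using the generic block layer clone helpers.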
On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
> Function btrfs_repair_read_bio() will only return true if all of its
> data matches csum.
>
> Consider the following case:
>
> Profile RAID1C3, 2 sectors to read, the initial mirror is 1.
>
> Mirror 1: |X|X|
> Mirror 2: |X| |
> Mirror 3: | |X|
>
> Now we will get -EIO, but in reality we can repair the read by using
> the first sector from mirror 3 and the 2nd sector from mirror 2.

I tried to write a test case for this by copying btrfs/140 and then, as a
first step, extending it to three mirrors using the raid1c3 profile. But
it seems that the tricks used there don't work, as the code in btrfs/140
relies on the fact that the btrfs logical address reported by filefrag is
reported by dump-tree as the item "index" in this line:

  item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsize 112

but for the raid1c3 profile that line reports something entirely
different.

for raid1:

  logical: 137756672
  item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsize 112

for raid1c3:

  logical: 343998464
  item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15621 itemsize 144

any idea how to find physical sectors to corrupt for raid1c3?
On 2022/5/19 17:36, Christoph Hellwig wrote:
> On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
>> Function btrfs_repair_read_bio() will only return true if all of its
>> data matches csum.
>>
>> Consider the following case:
>>
>> Profile RAID1C3, 2 sectors to read, the initial mirror is 1.
>>
>> Mirror 1: |X|X|
>> Mirror 2: |X| |
>> Mirror 3: | |X|
>>
>> Now we will get -EIO, but in reality we can repair the read by using
>> the first sector from mirror 3 and the 2nd sector from mirror 2.
>
> I tried to write a test case for this by copying btrfs/140 and then
> as a first step extending it to three mirrors using the raid1c3
> profile. But it seems that the tricks used there don't work,
> as the code in btrfs/140 relies on the fact that the btrfs logical
> address reported by filefrag is reported by dump-tree as the item
> "index" in this line:
>
>   item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsize 112
>
> but for the raid1c3 profile that line reports something entirely
> different.
>
> for raid1:
>
>   logical: 137756672
>   item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsize 112
>
> for raid1c3:
>
>   logical: 343998464
>   item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15621 itemsize 144
>
> any idea how to find physical sectors to corrupt for raid1c3?

I also recently hit weird cases where the extent allocator no longer puts
the first data extent at the beginning of a chunk.

So in that case, the best solution is to use "btrfs-map-logical -l
343998464", which will directly return the physical offset of the wanted
logical on each involved device.

Although we need to note:

- btrfs-map-logical may not always be shipped in progs in the future
  This tool really looks like a debug tool. I'm not sure if we will keep
  shipping it (personally I really hope to)

- btrfs-map-logical only returns data stripes
  Thus it doesn't work for RAID56, just in case you want to use it.

Despite the weird extent logical bytenr, everything should be fine with
btrfs-map-logical.

I'll take some time looking into the weird behavior change.

Thanks,
Qu
On 19.05.22 13:41, Qu Wenruo wrote:
> On 2022/5/19 17:36, Christoph Hellwig wrote:
>> [...]
>> any idea how to find physical sectors to corrupt for raid1c3?
>
> I also recently hit weird cases where the extent allocator no longer
> puts the first data extent at the beginning of a chunk.
>
> So in that case, the best solution is to use "btrfs-map-logical -l
> 343998464", which will directly return the physical offset of the
> wanted logical on each involved device.

Any reason why this is kept as a separate tool and not simply integrated
into btrfs-progs as a separate command?

> Although we need to note:
>
> - btrfs-map-logical may not always be shipped in progs in the future
>   This tool really looks like a debug tool. I'm not sure if we will
>   keep shipping it (personally I really hope to)
>
> - btrfs-map-logical only returns data stripes
>   Thus it doesn't work for RAID56, just in case you want to use it.
>
> Despite the weird extent logical bytenr, everything should be fine
> with btrfs-map-logical.
>
> I'll take some time looking into the weird behavior change.
>
> Thanks,
> Qu
On 2022/5/19 18:45, Nikolay Borisov wrote:
> On 19.05.22 13:41, Qu Wenruo wrote:
>> [...]
>> So in that case, the best solution is to use "btrfs-map-logical -l
>> 343998464", which will directly return the physical offset of the
>> wanted logical on each involved device.
>
> Any reason why this is kept as a separate tool and not simply
> integrated into btrfs-progs as a separate command?

For historical reasons, I guess.

I'm totally fine with moving it into a btrfs-ins subcommand.

Thanks,
Qu
On Thu, May 19, 2022 at 06:41:52PM +0800, Qu Wenruo wrote:
> So in that case, the best solution is to use "btrfs-map-logical -l
> 343998464", which will directly return the physical offset of the
> wanted logical on each involved device.
>
> Although we need to note:
>
> - btrfs-map-logical may not always be shipped in progs in the future
>   This tool really looks like a debug tool. I'm not sure if we will
>   keep shipping it (personally I really hope to)
>
> - btrfs-map-logical only returns data stripes
>   Thus it doesn't work for RAID56, just in case you want to use it.
>
> Despite the weird extent logical bytenr, everything should be fine
> with btrfs-map-logical.

Oh, nice, this is much better than the hackery in the existing tests.

That being said, the -c option does not seem to work. No matter what I
specify it always returns all three mirrors. I guess a little awk will
fix that, but the behavior seems odd.
On 2022/5/19 18:50, Christoph Hellwig wrote:
> On Thu, May 19, 2022 at 06:41:52PM +0800, Qu Wenruo wrote:
>> So in that case, the best solution is to use "btrfs-map-logical -l
>> 343998464", which will directly return the physical offset of the
>> wanted logical on each involved device.
>>
>> [...]
>
> Oh, nice, this is much better than the hackery in the existing tests.
>
> That being said, the -c option does not seem to work. No matter what
> I specify it always returns all three mirrors. I guess a little
> awk will fix that, but the behavior seems odd.

Currently the copy number is only used for the -o option, not for the
regular chunk mapping printing.

Thus we still need awk to fix the awkward behavior. (No pun intended.)

Another problem with btrfs-map-logical is that it doesn't work if the
range has no data/metadata extent there.

So Nik is right, we need a better tool integrated into btrfs-ins, rather
than this historical one.

Thanks,
Qu
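Until the copy number filters the regular mapping output, a little awk does the job. The sample below is only a stand-in: the exact btrfs-map-logical output format and the physical offsets are assumed for illustration, not taken from a real run:

```shell
# Hypothetical btrfs-map-logical output for a 3-copy logical address
# (format assumed: "mirror N logical L physical P device DEV"):
map_output='mirror 1 logical 343998464 physical 406847488 device /dev/test/scratch1
mirror 2 logical 343998464 physical 1481637888 device /dev/test/scratch2
mirror 3 logical 343998464 physical 2556428288 device /dev/test/scratch3'

# Extract the physical offset of copy 2 only; in the assumed format,
# field 2 is the mirror number and field 6 the physical offset:
physical=$(printf '%s\n' "$map_output" | awk '$2 == 2 { print $6 }')
echo "$physical"
```

In a real test the `map_output` variable would instead capture the tool's stdout, e.g. `map_output=$(btrfs-map-logical -l 343998464 "$dev")`, and the awk filter picks the one mirror the test wants to corrupt.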
>> I tried to write a test case for this by copying btrfs/140 and then
>> as a first step extending it to three mirrors using the raid1c3
>> profile. But it seems that the tricks used there don't work,
>> as the code in btrfs/140 relies on the fact that the btrfs logical
>> address reported by filefrag is reported by dump-tree as the item
>> "index" in this line:
>>
>> [...]
>>
>> any idea how to find physical sectors to corrupt for raid1c3?
>
> I also recently hit weird cases where the extent allocator no longer
> puts the first data extent at the beginning of a chunk.

Thankfully, this is not a bug, but a combination of seemingly
straightforward behaviors, which leads to a weird combined result.

It took me a short adventure into the free space handling to find the
problem.

For my example, I'm using 3x10G disks, and running RAID0 for data, RAID1
for metadata:

  Label:              (null)
  UUID:               bb10a539-0344-445a-9e77-bbda65d79366
  Node size:          16384
  Sector size:        4096
  Filesystem size:    30.00GiB
  Block group profiles:
    Data:             RAID0             3.00GiB
    Metadata:         RAID1           256.00MiB
    System:           RAID1             8.00MiB
  SSD detected:       no
  Zoned device:       no
  Incompat features:  extref, skinny-metadata, no-holes
  Runtime features:   free-space-tree
  Checksum:           crc32c
  Number of devices:  3
  Devices:
     ID        SIZE  PATH
      1    10.00GiB  /dev/test/scratch1
      2    10.00GiB  /dev/test/scratch2
      3    10.00GiB  /dev/test/scratch3

The 3GiB data chunk (at logical 298844160, length 3GiB) is completely
empty, but notice that btrfs needs to avoid allocating extents over the
superblock reservations.

And we have one logical bytenr, 434110464, which is at the superblock
location of /dev/test/scratch1.

So the free space of that 3GiB chunk is split into two parts:

  [298844160, +135266304)
  [434176000, +3085893632)

Notice the latter part is much larger.

So far so good, but there is another thing involved: the cached free
space behavior.

In find_free_space(), if we are searching from the beginning of a block
group, we will use `rb_first_cached(&ctl->free_space_bytes)`.

But the free_space_bytes rbtree is not sorted by logical bytenr, but by
free space, and the leftmost node is the one with the largest amount of
free space.

So instead of choosing [298844160, +135266304), we choose
[434176000, +3085893632), which has the much larger free space.

Thus we get the seemingly weird bytenr, 434176000, for our first data
extent.

Each behavior by itself is completely sane and straightforward:

- We can not use space reserved for superblocks.
- We should use the free space entry which has the most free space.

But when combining the two, we get the behavior of not returning the
beginning of a seemingly empty chunk.

So in short, we should not rely on the dirty dump-tree hacks, but on a
better version of btrfs-map-logical to grab the real physical offset of
a logical bytenr.

Thanks,
Qu
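The combined effect described above can be reproduced in a few lines: the superblock reservation splits the chunk into two free regions, and picking the entry with the most free space (which is effectively what `rb_first_cached()` on the free_space_bytes tree returns) lands on the higher bytenr. The numbers are the ones from the message above:

```python
# Free space of the 3GiB data chunk after carving out the superblock
# reservation starting at 434110464: {start_bytenr: length_in_bytes}
free_regions = {
    298844160: 135266304,    # head of the chunk, before the superblock
    434176000: 3085893632,   # tail of the chunk, after the superblock
}

# Lowest bytenr, what one might naively expect the allocator to pick:
by_bytenr = min(free_regions)

# Entry with the most free space, what the cached lookup actually prefers:
by_size = max(free_regions, key=free_regions.get)

print(by_bytenr)  # → 298844160
print(by_size)    # → 434176000, the "weird" first data extent bytenr
```

Note also that the numbers are internally consistent: the head region ends at 298844160 + 135266304 = 434110464, the superblock reservation, and the tail resumes at 434176000, 64KiB later.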
On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
> A big NONO here.
>
> Function btrfs_repair_read_bio() will only return true if all of its
> data matches csum.
>
> Consider the following case:
>
> Profile RAID1C3, 2 sectors to read, the initial mirror is 1.
>
> Mirror 1: |X|X|
> Mirror 2: |X| |
> Mirror 3: | |X|
>
> Now we will get -EIO, but in reality we can repair the read by using
> the first sector from mirror 3 and the 2nd sector from mirror 2.
>
> This is a behavior regression.

Now that I've written tests and code to treat this properly I have to say
that at least on x86 I can't actually trigger this behavior regression,
and had to add instrumentation to see a low-level change in behavior.

Why? For some reason (and I'd love to see an explanation for it!), btrfs
limits direct I/O reads to a single sector, so for direct I/O it is
impossible to hit this case, as all bios (and thus repair bios) are
limited to a single sector.

For buffered I/O we can hit this case, but all ->readahead errors are
ignored, and the pages actually needed are eventually retried using
->readpage. And ->readpage is limited to a single sector.

But given that btrfs does not seem to support sub-4K sector sizes, I
can't actually test the sub-sector code here - I assume this might be an
issue on larger page size systems.
diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 99f9995670ea3..0b2605c750cab 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o tree-mod-log.o
+	   subpage.o tree-mod-log.o read-repair.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
new file mode 100644
index 0000000000000..3ac93bfe09e4f
--- /dev/null
+++ b/fs/btrfs/read-repair.c
@@ -0,0 +1,211 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Christoph Hellwig.
+ */
+#include "ctree.h"
+#include "volumes.h"
+#include "read-repair.h"
+#include "btrfs_inode.h"
+
+static struct bio_set read_repair_bioset;
+
+static int next_mirror(struct btrfs_read_repair *rr, int cur_mirror)
+{
+	if (cur_mirror == rr->num_copies)
+		return cur_mirror + 1 - rr->num_copies;
+	return cur_mirror + 1;
+}
+
+static int prev_mirror(struct btrfs_read_repair *rr, int cur_mirror)
+{
+	if (cur_mirror == 1)
+		return rr->num_copies;
+	return cur_mirror - 1;
+}
+
+/*
+ * Clone a new bio from the src_bbio, using the saved iter in the btrfs_bio
+ * instead of bi_iter.
+ */
+static struct btrfs_bio *btrfs_repair_bio_clone(struct btrfs_bio *src_bbio,
+		u64 offset, u32 size, unsigned int op)
+{
+	struct btrfs_bio *bbio;
+	struct bio *bio;
+
+	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
+			       &read_repair_bioset);
+	bio_set_flag(bio, BIO_CLONED);
+
+	bio->bi_io_vec = src_bbio->bio.bi_io_vec;
+	bio->bi_iter = src_bbio->iter;
+	bio_advance(bio, offset);
+	bio->bi_iter.bi_size = size;
+
+	bbio = btrfs_bio(bio);
+	memset(bbio, 0, offsetof(struct btrfs_bio, bio));
+	bbio->iter = bbio->bio.bi_iter;
+	bbio->file_offset = src_bbio->file_offset + offset;
+
+	return bbio;
+}
+
+static void btrfs_repair_one_mirror(struct btrfs_bio *read_bbio,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		int bad_mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_inode *bi = BTRFS_I(inode);
+	u64 logical = read_bbio->iter.bi_sector << SECTOR_SHIFT;
+	u64 file_offset = read_bbio->file_offset;
+	struct btrfs_bio *write_bbio;
+	blk_status_t ret;
+
+	/*
+	 * For zoned file systems repair has to relocate the whole zone.
+	 */
+	if (btrfs_repair_one_zone(fs_info, logical))
+		return;
+
+	/*
+	 * For RAID56, we can not just write the bad data back, as any write
+	 * will trigger RMW and read back the corrupted on-disk stripe, causing
+	 * further damage.
+	 *
+	 * Perform a special low-level repair that bypasses btrfs_map_bio.
+	 */
+	if (btrfs_is_parity_mirror(fs_info, logical, fs_info->sectorsize)) {
+		struct bvec_iter iter;
+		struct bio_vec bv;
+		u32 offset;
+
+		btrfs_bio_for_each_sector(fs_info, bv, read_bbio, iter, offset)
+			btrfs_repair_io_failure(fs_info, btrfs_ino(bi),
+					file_offset + offset,
+					fs_info->sectorsize,
+					logical + offset,
+					bv.bv_page, bv.bv_offset,
+					bad_mirror);
+		return;
+	}
+
+	/*
+	 * Otherwise just clone the whole bio and write it back to the
+	 * previously bad mirror.
+	 */
+	write_bbio = btrfs_repair_bio_clone(read_bbio, 0,
+			read_bbio->iter.bi_size, REQ_OP_WRITE);
+	ret = btrfs_map_bio_wait(fs_info, &write_bbio->bio, bad_mirror);
+	bio_put(&write_bbio->bio);
+
+	btrfs_info_rl(fs_info,
+		"%s: root %lld ino %llu off %llu logical %llu/%u from good mirror %d",
+		ret ? "failed to correct read error" : "read error corrected",
+		bi->root->root_key.objectid, btrfs_ino(bi),
+		file_offset, logical, read_bbio->iter.bi_size, bad_mirror);
+}
+
+static bool btrfs_repair_read_bio(struct btrfs_bio *bbio,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		int read_mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u32 start_offset = bbio->file_offset - failed_bbio->file_offset;
+	u8 csum[BTRFS_CSUM_SIZE];
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	u32 offset;
+
+	if (btrfs_map_bio_wait(fs_info, &bbio->bio, read_mirror))
+		return false;
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
+		return true;
+
+	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
+		u8 *expected_csum =
+			btrfs_csum_ptr(fs_info, failed_bbio->csum,
+					start_offset + offset);
+
+		if (btrfs_check_data_sector(fs_info, bv.bv_page, bv.bv_offset,
+				csum, expected_csum))
+			return false;
+	}
+
+	return true;
+}
+
+bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 end_offset, repair_endio_t endio)
+{
+	u8 bad_mirror = failed_bbio->mirror_num;
+	u8 read_mirror = next_mirror(rr, bad_mirror);
+	struct btrfs_bio *read_bbio = NULL;
+	bool uptodate = false;
+
+	do {
+		if (read_bbio)
+			bio_put(&read_bbio->bio);
+		read_bbio = btrfs_repair_bio_clone(failed_bbio,
+				rr->start_offset, end_offset - rr->start_offset,
+				REQ_OP_READ);
+		if (btrfs_repair_read_bio(read_bbio, failed_bbio, inode,
+				read_mirror)) {
+			do {
+				btrfs_repair_one_mirror(read_bbio, failed_bbio,
+						inode, bad_mirror);
+			} while ((bad_mirror = prev_mirror(rr, bad_mirror)) !=
+					failed_bbio->mirror_num);
+			uptodate = true;
+			break;
+		}
+		bad_mirror = read_mirror;
+		read_mirror = next_mirror(rr, bad_mirror);
+	} while (read_mirror != failed_bbio->mirror_num);
+
+	if (endio)
+		endio(read_bbio, inode, uptodate);
+	bio_put(&read_bbio->bio);
+
+	rr->in_use = false;
+	return uptodate;
+}
+
+bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 start_offset)
+{
+	if (rr->in_use)
+		return true;
+
+	if (!rr->num_copies) {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		rr->num_copies = btrfs_num_copies(fs_info,
+				failed_bbio->iter.bi_sector << SECTOR_SHIFT,
+				failed_bbio->iter.bi_size);
+	}
+
+	/*
+	 * If there is no other copy of the data to recover from, give up now
+	 * and don't even try to build up a larger batch.
+	 */
+	if (rr->num_copies < 2)
+		return false;
+
+	rr->in_use = true;
+	rr->start_offset = start_offset;
+	return true;
+}
+
+void btrfs_read_repair_exit(void)
+{
+	bioset_exit(&read_repair_bioset);
+}
+
+int __init btrfs_read_repair_init(void)
+{
+	return bioset_init(&read_repair_bioset, BIO_POOL_SIZE,
+			offsetof(struct btrfs_bio, bio), 0);
+}
diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
new file mode 100644
index 0000000000000..e371790af2b3e
--- /dev/null
+++ b/fs/btrfs/read-repair.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BTRFS_READ_REPAIR_H
+#define BTRFS_READ_REPAIR_H
+
+struct btrfs_read_repair {
+	u64 start_offset;
+	bool in_use;
+	int num_copies;
+};
+
+typedef void (*repair_endio_t)(struct btrfs_bio *repair_bbio,
+		struct inode *inode, bool uptodate);
+
+bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 bio_offset);
+bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 end_offset, repair_endio_t end_io);
+static inline bool btrfs_read_repair_end(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 end_offset, repair_endio_t endio)
+{
+	if (!rr->in_use)
+		return true;
+	return __btrfs_read_repair_end(rr, failed_bbio, inode, end_offset,
+			endio);
+}
+
+int __init btrfs_read_repair_init(void);
+void btrfs_read_repair_exit(void);
+
+#endif /* BTRFS_READ_REPAIR_H */
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b1fdc6a26c76e..b33ad892c3058 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -48,6 +48,7 @@
 #include "block-group.h"
 #include "discard.h"
 #include "qgroup.h"
+#include "read-repair.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
@@ -2641,10 +2642,12 @@ static int __init init_btrfs_fs(void)
 	err = extent_io_init();
 	if (err)
 		goto free_cachep;
-
-	err = extent_state_cache_init();
+	err = btrfs_read_repair_init();
 	if (err)
 		goto free_extent_io;
+	err = extent_state_cache_init();
+	if (err)
+		goto free_read_repair;
 
 	err = extent_map_init();
 	if (err)
@@ -2708,6 +2711,8 @@ static int __init init_btrfs_fs(void)
 	extent_map_exit();
 free_extent_state_cache:
 	extent_state_cache_exit();
+free_read_repair:
+	btrfs_read_repair_exit();
 free_extent_io:
 	extent_io_exit();
 free_cachep:
This adds a new read repair implementation for btrfs. It is synchronous in that the end I/O handlers call it and get back the results, instead of potentially getting multiple concurrent calls back into the original end I/O handler. The synchronous nature has the following advantages:

 - there is no need for a per-I/O tree of I/O failures, as everything
   related to the I/O failure can be handled locally
 - not having separate repair end I/O helpers will in the future help
   to reuse the direct I/O bio from iomap for the actual submission and
   thus remove the btrfs_dio_private infrastructure

Because submitting many sector-sized synchronous I/Os would be very slow when multiple sectors (or a whole read) fail, this new code instead submits a single read and repair write bio for each contiguous section. It uses a clone of the bio to do that and thus does not need to allocate any extra bio_vecs. Note that this cloning is open coded instead of using the block layer clone helpers, as the clone is based on the saved iter in the btrfs_bio, not on bio.bi_iter, which has already been advanced by the low-level driver by the time the repair code is called.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/Makefile      |   2 +-
 fs/btrfs/read-repair.c | 211 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/read-repair.h |  33 +++++++
 fs/btrfs/super.c       |   9 +-
 4 files changed, 252 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/read-repair.c
 create mode 100644 fs/btrfs/read-repair.h