[12/15] btrfs: add new read repair infrastructure

Message ID 20220517145039.3202184-13-hch@lst.de (mailing list archive)
State New, archived
Series [01/15] btrfs: introduce a pure data checksum checking helper

Commit Message

Christoph Hellwig May 17, 2022, 2:50 p.m. UTC
This adds a new read repair implementation for btrfs.  It is synchronous
in that the end I/O handlers call into it and get back the results,
instead of potentially getting multiple concurrent calls back into the
original end I/O handler.  The synchronous nature has the following
advantages:

 - there is no need for a per-I/O tree of I/O failures, as everything
   related to the I/O failure can be handled locally
 - not having separate repair end I/O helpers will in the future help
   to reuse the direct I/O bio from iomap for the actual submission and
   thus remove the btrfs_dio_private infrastructure

Because submitting many sector-sized synchronous I/Os would be very
slow when multiple sectors (or a whole read) fail, this new code instead
submits a single read and repair write bio for each contiguous section.
It uses clones of the bio to do that and thus does not need to allocate
any extra bio_vecs.  Note that this cloning is open coded instead of
using the block layer clone helpers, as the clone is based on the saved
iter in the btrfs_bio, and not on bio.bi_iter, which at the point that
the repair code is called has already been advanced by the low-level driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/Makefile      |   2 +-
 fs/btrfs/read-repair.c | 211 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/read-repair.h |  33 +++++++
 fs/btrfs/super.c       |   9 +-
 4 files changed, 252 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/read-repair.c
 create mode 100644 fs/btrfs/read-repair.h

Comments

Qu Wenruo May 17, 2022, 11:04 p.m. UTC | #1
On 2022/5/17 22:50, Christoph Hellwig wrote:
> This adds a new read repair implementation for btrfs.  It is synchronous
> in that the end I/O handlers call them, and will get back the results
> instead of potentially getting multiple concurrent calls back into the
> original end I/O handler.  The synchronous nature has the following
> advantages:
>
>   - there is no need for a per-I/O tree of I/O failures, as everything
>     related to the I/O failure can be handled locally
>   - not having separate repair end I/O helpers will in the future help
>     to reuse the direct I/O bio from iomap for the actual submission and
>     thus remove the btrfs_dio_private infrastructure
>
> Because submitting many sector size synchronous I/Os would be very
> slow when multiple sectors (or a whole read) fail, this new code instead
> submits a single read and repair write bio for each contiguous section.
> It uses clone of the bio to do that and thus does not need to allocate
> any extra bio_vecs.  Note that this cloning is open coded instead of
> using the block layer clone helpers as the clone is based on the save
> iter in the btrfs_bio, and not bio.bi_iter, which at the point that the
> repair code is called has been advanced by the low-level driver.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   fs/btrfs/Makefile      |   2 +-
>   fs/btrfs/read-repair.c | 211 +++++++++++++++++++++++++++++++++++++++++
>   fs/btrfs/read-repair.h |  33 +++++++
>   fs/btrfs/super.c       |   9 +-
>   4 files changed, 252 insertions(+), 3 deletions(-)
>   create mode 100644 fs/btrfs/read-repair.c
>   create mode 100644 fs/btrfs/read-repair.h
>
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index 99f9995670ea3..0b2605c750cab 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -31,7 +31,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
>   	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>   	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
>   	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
> -	   subpage.o tree-mod-log.o
> +	   subpage.o tree-mod-log.o read-repair.o
>
>   btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>   btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
> diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
> new file mode 100644
> index 0000000000000..3ac93bfe09e4f
> --- /dev/null
> +++ b/fs/btrfs/read-repair.c
> @@ -0,0 +1,211 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2022 Christoph Hellwig.
> + */
> +#include "ctree.h"
> +#include "volumes.h"
> +#include "read-repair.h"
> +#include "btrfs_inode.h"
> +
> +static struct bio_set read_repair_bioset;
> +
> +static int next_mirror(struct btrfs_read_repair *rr, int cur_mirror)
> +{
> +	if (cur_mirror == rr->num_copies)
> +		return cur_mirror + 1 - rr->num_copies;
> +	return cur_mirror + 1;
> +}
> +
> +static int prev_mirror(struct btrfs_read_repair *rr, int cur_mirror)
> +{
> +	if (cur_mirror == 1)
> +		return rr->num_copies;
> +	return cur_mirror - 1;
> +}
> +
> +/*
> + * Clone a new bio from the src_bbio, using the saved iter in the btrfs_bio
> + * instead of bi_iter.
> + */
> +static struct btrfs_bio *btrfs_repair_bio_clone(struct btrfs_bio *src_bbio,
> +		u64 offset, u32 size, unsigned int op)
> +{
> +	struct btrfs_bio *bbio;
> +	struct bio *bio;
> +
> +	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
> +			       &read_repair_bioset);
> +	bio_set_flag(bio, BIO_CLONED);

Do we need to bother setting the CLONED flag?

Without the CLONED flag, we can simply use bio_for_each_segment_all() in the
endio function without needing bbio->iter, and thus save some memory.

> +
> +	bio->bi_io_vec = src_bbio->bio.bi_io_vec;
> +	bio->bi_iter = src_bbio->iter;
> +	bio_advance(bio, offset);
> +	bio->bi_iter.bi_size = size;
> +
> +	bbio = btrfs_bio(bio);
> +	memset(bbio, 0, offsetof(struct btrfs_bio, bio));
> +	bbio->iter = bbio->bio.bi_iter;
> +	bbio->file_offset = src_bbio->file_offset + offset;
> +
> +	return bbio;
> +}
> +
> +static void btrfs_repair_one_mirror(struct btrfs_bio *read_bbio,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		int bad_mirror)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	struct btrfs_inode *bi = BTRFS_I(inode);
> +	u64 logical = read_bbio->iter.bi_sector << SECTOR_SHIFT;
> +	u64 file_offset = read_bbio->file_offset;
> +	struct btrfs_bio *write_bbio;
> +	blk_status_t ret;
> +
> +	/*
> +	 * For zoned file systems repair has to relocate the whole zone.
> +	 */
> +	if (btrfs_repair_one_zone(fs_info, logical))
> +		return;
> +
> +	/*
> +	 * For RAID56, we can not just write the bad data back, as any write
> +	 * will trigger RMW and read back the corrrupted on-disk stripe, causing
> +	 * further damage.
> +	 *
> +	 * Perform a special low-level repair that bypasses btrfs_map_bio.
> +	 */
> +	if (btrfs_is_parity_mirror(fs_info, logical, fs_info->sectorsize)) {
> +		struct bvec_iter iter;
> +		struct bio_vec bv;
> +		u32 offset;
> +
> +		btrfs_bio_for_each_sector(fs_info, bv, read_bbio, iter, offset)
> +			btrfs_repair_io_failure(fs_info, btrfs_ino(bi),
> +					file_offset + offset,
> +					fs_info->sectorsize,
> +					logical + offset,
> +					bv.bv_page, bv.bv_offset,
> +					bad_mirror);
> +		return;
> +	}
> +
> +	/*
> +	 * Otherwise just clone the whole bio and write it back to the
> +	 * previously bad mirror.
> +	 */
> +	write_bbio = btrfs_repair_bio_clone(read_bbio, 0,
> +			read_bbio->iter.bi_size, REQ_OP_WRITE);

Do we need to clone the whole bio?

Considering that in most cases read repair is already the cold path,
having more than one corruption is rare.

> +	ret = btrfs_map_bio_wait(fs_info, &write_bbio->bio, bad_mirror);
> +	bio_put(&write_bbio->bio);
> +
> +	btrfs_info_rl(fs_info,
> +		"%s: root %lld ino %llu off %llu logical %llu/%u from good mirror %d",
> +		ret ? "failed to correct read error" : "read error corrected",
> +		bi->root->root_key.objectid, btrfs_ino(bi),
> +		file_offset, logical, read_bbio->iter.bi_size, bad_mirror);
> +}
> +
> +static bool btrfs_repair_read_bio(struct btrfs_bio *bbio,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		int read_mirror)
> +{
> +	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +	u32 start_offset = bbio->file_offset - failed_bbio->file_offset;
> +	u8 csum[BTRFS_CSUM_SIZE];
> +	struct bvec_iter iter;
> +	struct bio_vec bv;
> +	u32 offset;
> +
> +	if (btrfs_map_bio_wait(fs_info, &bbio->bio, read_mirror))
> +		return false;
> +
> +	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
> +		return true;
> +
> +	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
> +		u8 *expected_csum =
> +			btrfs_csum_ptr(fs_info, failed_bbio->csum,
> +					start_offset + offset);
> +
> +		if (btrfs_check_data_sector(fs_info, bv.bv_page, bv.bv_offset,
> +				csum, expected_csum))
> +			return false;
> +	}
> +
> +	return true;
> +}
> +
> +bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		u64 end_offset, repair_endio_t endio)

The reason I don't use "end" in my patchset is that anything with "end"
in it makes people think it is related to endio.

And in fact when checking the function, I really thought it was something
related to endio, but it's the equivalent of btrfs_read_repair_finish().

> +{
> +	u8 bad_mirror = failed_bbio->mirror_num;
> +	u8 read_mirror = next_mirror(rr, bad_mirror);
> +	struct btrfs_bio *read_bbio = NULL;
> +	bool uptodate = false;
> +
> +	do {
> +		if (read_bbio)
> +			bio_put(&read_bbio->bio);
> +		read_bbio = btrfs_repair_bio_clone(failed_bbio,
> +				rr->start_offset, end_offset - rr->start_offset,
> +				REQ_OP_READ);
> +		if (btrfs_repair_read_bio(read_bbio, failed_bbio, inode,
> +				read_mirror)) {

A big NONO here.

Function btrfs_repair_read_bio() will only return true if all of its
data matches its csum.

Consider the following case:

Profile RAID1C3, 2 sectors to read, the initial mirror is 1.

Mirror 1:	|X|X|
Mirror 2:	|X| |
Mirror 3:	| |X|

Now we will get -EIO, but in reality, we can repair the read by using
the first sector from mirror 3 and the 2nd sector from mirror 2.

This is a behavior regression.

And that's why the original repair code and all my patchsets do a
sector-by-sector check, not a full-range check.
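
To make that concrete, here is a tiny userspace simulation of the two
strategies for the table above (purely illustrative, not btrfs code):

#include <stdbool.h>
#include <stdio.h>

#define NR_MIRRORS	3
#define NR_SECTORS	2

/* good[m][s]: does sector s of mirror m pass its csum? (table above) */
static const bool good[NR_MIRRORS][NR_SECTORS] = {
	{ false, false },	/* mirror 1: |X|X| */
	{ false, true  },	/* mirror 2: |X| | */
	{ true,  false },	/* mirror 3: | |X| */
};

/* whole-range retry: a mirror only helps if every sector verifies */
static bool full_range_repair(void)
{
	for (int m = 0; m < NR_MIRRORS; m++) {
		bool all_good = true;

		for (int s = 0; s < NR_SECTORS; s++)
			all_good &= good[m][s];
		if (all_good)
			return true;
	}
	return false;
}

/* per-sector retry: each sector may be healed from a different mirror */
static bool per_sector_repair(void)
{
	for (int s = 0; s < NR_SECTORS; s++) {
		bool found = false;

		for (int m = 0; m < NR_MIRRORS; m++)
			found |= good[m][s];
		if (!found)
			return false;
	}
	return true;
}

int main(void)
{
	/* prints "full range: -EIO" and "per sector: ok" */
	printf("full range: %s\n", full_range_repair() ? "ok" : "-EIO");
	printf("per sector: %s\n", per_sector_repair() ? "ok" : "-EIO");
	return 0;
}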

Thanks,
Qu

> +			do {
> +		    		btrfs_repair_one_mirror(read_bbio, failed_bbio,
> +						inode, bad_mirror);
> +			} while ((bad_mirror = prev_mirror(rr, bad_mirror)) !=
> +					failed_bbio->mirror_num);
> +			uptodate = true;
> +			break;
> +		}
> +		bad_mirror = read_mirror;
> +		read_mirror = next_mirror(rr, bad_mirror);
> +	} while (read_mirror != failed_bbio->mirror_num);
> +
> +	if (endio)
> +		endio(read_bbio, inode, uptodate);
> +	bio_put(&read_bbio->bio);
> +
> +	rr->in_use = false;
> +	return uptodate;
> +}
> +
> +bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		u64 start_offset)
> +{
> +	if (rr->in_use)
> +		return true;
> +
> +	if (!rr->num_copies) {
> +		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> +
> +		rr->num_copies = btrfs_num_copies(fs_info,
> +				failed_bbio->iter.bi_sector << SECTOR_SHIFT,
> +				failed_bbio->iter.bi_size);
> +	}
> +
> +	/*
> +	 * If there is no other copy of the data to recovery from, give up now
> +	 * and don't even try to build up a larget batch.
> +	 */
> +	if (rr->num_copies < 2)
> +		return false;
> +
> +	rr->in_use = true;
> +	rr->start_offset = start_offset;
> +	return true;
> +}
> +
> +void btrfs_read_repair_exit(void)
> +{
> +	bioset_exit(&read_repair_bioset);
> +}
> +
> +int __init btrfs_read_repair_init(void)
> +{
> +	return bioset_init(&read_repair_bioset, BIO_POOL_SIZE,
> +			offsetof(struct btrfs_bio, bio), 0);
> +}
> diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
> new file mode 100644
> index 0000000000000..e371790af2b3e
> --- /dev/null
> +++ b/fs/btrfs/read-repair.h
> @@ -0,0 +1,33 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef BTRFS_READ_REPAIR_H
> +#define BTRFS_READ_REPAIR_H
> +
> +struct btrfs_read_repair {
> +	u64 start_offset;
> +	bool in_use;
> +	int num_copies;
> +};
> +
> +typedef void (*repair_endio_t)(struct btrfs_bio *repair_bbio,
> +		struct inode *inode, bool uptodate);
> +
> +bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		u64 bio_offset);
> +bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		u64 end_offset, repair_endio_t end_io);
> +static inline bool btrfs_read_repair_end(struct btrfs_read_repair *rr,
> +		struct btrfs_bio *failed_bbio, struct inode *inode,
> +		u64 end_offset, repair_endio_t endio)
> +{
> +	if (!rr->in_use)
> +		return true;
> +	return __btrfs_read_repair_end(rr, failed_bbio, inode, end_offset,
> +			endio);
> +}
> +
> +int __init btrfs_read_repair_init(void);
> +void btrfs_read_repair_exit(void);
> +
> +#endif /* BTRFS_READ_REPAIR_H */
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index b1fdc6a26c76e..b33ad892c3058 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -48,6 +48,7 @@
>   #include "block-group.h"
>   #include "discard.h"
>   #include "qgroup.h"
> +#include "read-repair.h"
>   #define CREATE_TRACE_POINTS
>   #include <trace/events/btrfs.h>
>
> @@ -2641,10 +2642,12 @@ static int __init init_btrfs_fs(void)
>   	err = extent_io_init();
>   	if (err)
>   		goto free_cachep;
> -
> -	err = extent_state_cache_init();
> +	err = btrfs_read_repair_init();
>   	if (err)
>   		goto free_extent_io;
> +	err = extent_state_cache_init();
> +	if (err)
> +		goto free_read_repair;
>
>   	err = extent_map_init();
>   	if (err)
> @@ -2708,6 +2711,8 @@ static int __init init_btrfs_fs(void)
>   	extent_map_exit();
>   free_extent_state_cache:
>   	extent_state_cache_exit();
> +free_read_repair:
> +	btrfs_read_repair_exit();
>   free_extent_io:
>   	extent_io_exit();
>   free_cachep:
Christoph Hellwig May 18, 2022, 8:54 a.m. UTC | #2
On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
>> +static struct btrfs_bio *btrfs_repair_bio_clone(struct btrfs_bio *src_bbio,
>> +		u64 offset, u32 size, unsigned int op)
>> +{
>> +	struct btrfs_bio *bbio;
>> +	struct bio *bio;
>> +
>> +	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
>> +			       &read_repair_bioset);
>> +	bio_set_flag(bio, BIO_CLONED);
>
> Do we need to bother setting the CLONED flag?

The CLONED flag should never be set.  Except for the one bogus check
in btrfs, which I have a pending removal for, it is only used for
debugging checks.

> Without CLONED flag, we can easily go bio_for_each_segment_all() in the
> endio function without the need of bbio->iter, thus can save some memory.

bio_for_each_segment_all ignores the iter and walks over bi_io_vec
directly.  And that is something we absolutely can't do here, as this
bio reuses the bio_vecs from the failed bio and, depending on what
failed, covers a reduced size.
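
Roughly, for the repair clones built here (a sketch, function name made
up for illustration; bbio->iter is the iter saved by this series):

static void walk_repair_clone(struct btrfs_bio *bbio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	struct bio_vec *bvec;
	struct bvec_iter_all iter_all;

	/*
	 * Iter-based walk: starts from the iter saved in the btrfs_bio at
	 * submission time, so it covers exactly the (possibly shortened)
	 * range the repair clone was built for.
	 */
	__bio_for_each_segment(bv, &bbio->bio, iter, bbio->iter) {
		/* bv stays within bbio->iter.bi_size */
	}

	/*
	 * bio_for_each_segment_all() ignores any iter and walks bi_io_vec
	 * directly -- for these clones that array belongs to the failed
	 * bio, so it does not reflect the shortened repair range.
	 */
	bio_for_each_segment_all(bvec, &bbio->bio, iter_all) {
		/* not usable for the clones built by this patch */
	}
}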

>> +	/*
>> +	 * Otherwise just clone the whole bio and write it back to the
>> +	 * previously bad mirror.
>> +	 */
>> +	write_bbio = btrfs_repair_bio_clone(read_bbio, 0,
>> +			read_bbio->iter.bi_size, REQ_OP_WRITE);
>
> Do we need to clone the whole bio?
>
> Considering under most cases the read repair is already the cold path,
> have more than one corruption is already rare.

The read bio is trimmed to only cover the bad area, so this already
potentially does not cover the whole failed bbio.  But except for
the case you note below we do need to write back the whole bio.

>> +bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
>> +		struct btrfs_bio *failed_bbio, struct inode *inode,
>> +		u64 end_offset, repair_endio_t endio)
>
> The reason I don't use "end" in my patchset is, any thing related "end"
> will let people to think it has something related with endio.

That seems like an odd connotation.  Without "io" in the word, "end"
just connotes the end of a range in most of the Linux I/O stack.

> And in fact when checking the function, I really thought it's something
> related to endio, but it's the equivalent of btrfs_read_repair_finish().

But if there is a strong preference I can use finish instead of end.
Qu Wenruo May 18, 2022, 10:20 a.m. UTC | #3
On 2022/5/18 16:54, Christoph Hellwig wrote:
> On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
>>> +static struct btrfs_bio *btrfs_repair_bio_clone(struct btrfs_bio *src_bbio,
>>> +		u64 offset, u32 size, unsigned int op)
>>> +{
>>> +	struct btrfs_bio *bbio;
>>> +	struct bio *bio;
>>> +
>>> +	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
>>> +			       &read_repair_bioset);
>>> +	bio_set_flag(bio, BIO_CLONED);
>>
>> Do we need to bother setting the CLONED flag?
>
> The CLONED flag should never be set.  Except for the one bogus check
> in btrfs that I have a pending removal for it is only used for debugging
> checks.
>
>> Without CLONED flag, we can easily go bio_for_each_segment_all() in the
>> endio function without the need of bbio->iter, thus can save some memory.
>
> bio_for_each_segment_all ignores the iter and walks over bi_io_vec
> directly.  And that is something we absolutely can't do here, as the
> bio reuses the bio_vecs from the failed bio, and depending on what
> failed reduces the size.

My bad, I saw the bio_alloc_bioset() but didn't check that it allocates
zero bio_vecs and then reuses the original bi_io_vec.

So the function matches its name, it really is a bio clone.

And it's very different from my version, which allocates a new bio with
a large enough bi_io_vec and then adds back the needed sectors from the
original bio.

Then I guess the BIO_CLONED flag is completely fine.

But in that case, you may want to call bio_alloc_clone() directly? It
can handle a bioset without problems.
>
>>> +	/*
>>> +	 * Otherwise just clone the whole bio and write it back to the
>>> +	 * previously bad mirror.
>>> +	 */
>>> +	write_bbio = btrfs_repair_bio_clone(read_bbio, 0,
>>> +			read_bbio->iter.bi_size, REQ_OP_WRITE);
>>
>> Do we need to clone the whole bio?
>>
>> Considering under most cases the read repair is already the cold path,
>> have more than one corruption is already rare.
>
> The read bio is trimmed to only cover the bad area, so this already
> potentially does not cover the whole failed bbio.  But except for
> the case you note below we do need to write back the whole bio.
>
>>> +bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
>>> +		struct btrfs_bio *failed_bbio, struct inode *inode,
>>> +		u64 end_offset, repair_endio_t endio)
>>
>> The reason I don't use "end" in my patchset is, any thing related "end"
>> will let people to think it has something related with endio.
>
> That seems like an odd conotation.  Without io in the word end just
> conotates the end of a range in most of the Linux I/O stack.

OK, thanks for explaining this, I guess it's my problem linking "end" to
"end_io" or "endio".

Then no problem using this current naming.

Thanks,
Qu
>
>> And in fact when checking the function, I really thought it's something
>> related to endio, but it's the equivalent of btrfs_read_repair_finish().
>
> But if there is a strong preference I can use finish instead of end.
Christoph Hellwig May 18, 2022, 12:48 p.m. UTC | #4
On Wed, May 18, 2022 at 06:20:53PM +0800, Qu Wenruo wrote:
> My bad, I see the bio_alloc_bioset() but didn't check it's allocating a
> bi_io_vec with size 0, and soon utilize the original bi_io_vec.
>
> So the function matches its name, it's really bio clone.
>
> And it's very different from my version, which really allocates a new
> bio with larger enough bi_io_vec then adding back the needed sectors
> from the original bio.
>
> Then I guess the BIO_CLONE flag is completely fine.
>
> But in that case, you may want to call bio_alloc_clone() directly? Which
> can handle bioset without problem.

The big difference is that bio_alloc_clone directly looks at
bio->bi_iter, while we must look at btrfs_bio->iter as bio->bi_iter
may already be consumed by the driver.  I thought the comment in the
code explains that, but maybe I need to improve it.
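
For instance, condensed into a sketch (the helper name is made up; the
essential difference is which iter seeds the clone):

static struct bio *clone_for_repair(struct btrfs_bio *src_bbio,
		struct bio_set *bs, unsigned int op)
{
	struct bio *bio;

	/*
	 * bio_alloc_clone() would seed the clone from src_bbio->bio.bi_iter,
	 * which the low-level driver may already have advanced (typically
	 * all the way to the end) by the time the read completes.
	 *
	 * The repair clone instead reuses the bio_vecs, but restarts from
	 * the iter that was saved in the btrfs_bio at submission time.
	 */
	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS, bs);
	bio->bi_io_vec = src_bbio->bio.bi_io_vec;
	bio->bi_iter = src_bbio->iter;
	return bio;
}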
Christoph Hellwig May 19, 2022, 9:36 a.m. UTC | #5
On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
> Function btrfs_repair_read_bio() will only return true if all of its
> data matches csum.
>
> Consider the following case:
>
> Profile RAID1C3, 2 sectors to read, the initial mirror is 1.
>
> Mirror 1:	|X|X|
> Mirror 2:	|X| |
> Mirror 3:	| |X|
>
> Now we will got -EIO, but in reality, we can repair the read by using
> the first sector from mirror 3 and the 2nd sector from mirror 2.

I tried to write a test case for this by copying btrfs/140 and then
as a first step extending it to three mirrors using the raid1c3
profile.  But it seems that the tricks used there don't work,
as the code in btrfs/140 relies on the fact that the btrfs logical
address reported by filefrag is reported by dump-tree as the item
"index" in this line:

item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsiz

but for the raid1c3 profile that line reports something entirely
different.

for raid1:

logical: 137756672
item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsize 112

for raid1c3:

logical: 343998464
item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15621 itemsize 144

any idea how to find physical sectors to corrupt for raid1c3?
Qu Wenruo May 19, 2022, 10:41 a.m. UTC | #6
On 2022/5/19 17:36, Christoph Hellwig wrote:
> On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
>> Function btrfs_repair_read_bio() will only return true if all of its
>> data matches csum.
>>
>> Consider the following case:
>>
>> Profile RAID1C3, 2 sectors to read, the initial mirror is 1.
>>
>> Mirror 1:	|X|X|
>> Mirror 2:	|X| |
>> Mirror 3:	| |X|
>>
>> Now we will got -EIO, but in reality, we can repair the read by using
>> the first sector from mirror 3 and the 2nd sector from mirror 2.
> 
> I tried to write a test case for this by copying btrfs/140 and then
> as a first step extending it to three mirrors unsing the raid1c1
> profile.  But it seems that the tricks used there don't work,
> as the code in btrfs/140 relies on the fact that the btrfs logic
> address repored by file frag is reported by dump-tree as the item
> "index" ĭn this line:
> 
> item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsiz
> 
> but for the raid1c3 profile that line reports something entirely
> different.
> 
> for raid1:
> 
> logical: 137756672
> item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsize 112
> 
> for raid1c3:
> 
> logical: 343998464
> item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15621 itemsize 144
> 
> any idea how to find physical sectors to corrupt for raid1c1?
> 

I also recently hit weird cases where the extent allocator no longer puts
the first data extent at the beginning of a chunk.

So in that case, the best solution is to use "btrfs-map-logical -l
343998464", which will directly return the physical offset of the wanted
logical on each involved device.

Although we need to note:

- btrfs-map-logical may not always be shipped in progs in the future
   This tool really looks like a debug tool. I'm not sure if we will keep
   shipping it (personally I really hope to)

- btrfs-map-logical only returns data stripes
   Thus it doesn't work for RAID56 just in case you want to use it.

Despite the weird extent logical bytenr, everything should be fine with 
btrfs-map-logical.

I'll take some time looking into the weird behavior change.

Thanks,
Qu
Nikolay Borisov May 19, 2022, 10:45 a.m. UTC | #7
On 19.05.22 г. 13:41 ч., Qu Wenruo wrote:
> 
> 
> On 2022/5/19 17:36, Christoph Hellwig wrote:
>> On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
>>> Function btrfs_repair_read_bio() will only return true if all of its
>>> data matches csum.
>>>
>>> Consider the following case:
>>>
>>> Profile RAID1C3, 2 sectors to read, the initial mirror is 1.
>>>
>>> Mirror 1:    |X|X|
>>> Mirror 2:    |X| |
>>> Mirror 3:    | |X|
>>>
>>> Now we will got -EIO, but in reality, we can repair the read by using
>>> the first sector from mirror 3 and the 2nd sector from mirror 2.
>>
>> I tried to write a test case for this by copying btrfs/140 and then
>> as a first step extending it to three mirrors unsing the raid1c1
>> profile.  But it seems that the tricks used there don't work,
>> as the code in btrfs/140 relies on the fact that the btrfs logic
>> address repored by file frag is reported by dump-tree as the item
>> "index" ĭn this line:
>>
>> item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsiz
>>
>> but for the raid1c3 profile that line reports something entirely
>> different.
>>
>> for raid1:
>>
>> logical: 137756672
>> item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 
>> itemsize 112
>>
>> for raid1c3:
>>
>> logical: 343998464
>> item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15621 
>> itemsize 144
>>
>> any idea how to find physical sectors to corrupt for raid1c1?
>>
> 
> I also recently hit weird cases why extent allocator no longer puts the 
> first data extent at the beginning of a chunk.
> 
> So in that case, the best solution is to use "btrfs-map-logical -l 
> 343998464", which will directly return the physical offset of the wanted 
> logical on each involved devices.

Any reason why this is kept as a separate tool and not simply integrated 
into btrfs-progs as a separate command?

> 
> Although we need to note:
> 
> - btrfs-map-logical may not always be shipped in progs in the future
>    This tool really looks like a debug tool. I'm not sure if we will keep
>    shipping it (personally I really hope to)
> 
> - btrfs-map-logical only return data stripes
>    Thus it doesn't work for RAID56 just in case you want to use it.
> 
> Despite the weird extent logical bytenr, everything should be fine with 
> btrfs-map-logical.
> 
> I'll take some time looking into the weird behavior change.
> 
> Thanks,
> Qu
>
Qu Wenruo May 19, 2022, 10:46 a.m. UTC | #8
On 2022/5/19 18:45, Nikolay Borisov wrote:
>
>
> On 19.05.22 г. 13:41 ч., Qu Wenruo wrote:
>>
>>
>> On 2022/5/19 17:36, Christoph Hellwig wrote:
>>> On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
>>>> Function btrfs_repair_read_bio() will only return true if all of its
>>>> data matches csum.
>>>>
>>>> Consider the following case:
>>>>
>>>> Profile RAID1C3, 2 sectors to read, the initial mirror is 1.
>>>>
>>>> Mirror 1:    |X|X|
>>>> Mirror 2:    |X| |
>>>> Mirror 3:    | |X|
>>>>
>>>> Now we will got -EIO, but in reality, we can repair the read by using
>>>> the first sector from mirror 3 and the 2nd sector from mirror 2.
>>>
>>> I tried to write a test case for this by copying btrfs/140 and then
>>> as a first step extending it to three mirrors unsing the raid1c1
>>> profile.  But it seems that the tricks used there don't work,
>>> as the code in btrfs/140 relies on the fact that the btrfs logic
>>> address repored by file frag is reported by dump-tree as the item
>>> "index" ĭn this line:
>>>
>>> item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsiz
>>>
>>> but for the raid1c3 profile that line reports something entirely
>>> different.
>>>
>>> for raid1:
>>>
>>> logical: 137756672
>>> item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751
>>> itemsize 112
>>>
>>> for raid1c3:
>>>
>>> logical: 343998464
>>> item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15621
>>> itemsize 144
>>>
>>> any idea how to find physical sectors to corrupt for raid1c1?
>>>
>>
>> I also recently hit weird cases why extent allocator no longer puts
>> the first data extent at the beginning of a chunk.
>>
>> So in that case, the best solution is to use "btrfs-map-logical -l
>> 343998464", which will directly return the physical offset of the
>> wanted logical on each involved devices.
>
> Any reason why this is kept as a separate tool and not simply integrated
> into btrfs-progs as a separate command?

For historical reasons I guess.

I'm totally fine to move it into btrfs-ins subcommand.

Thanks,
Qu
>
>>
>> Although we need to note:
>>
>> - btrfs-map-logical may not always be shipped in progs in the future
>>    This tool really looks like a debug tool. I'm not sure if we will keep
>>    shipping it (personally I really hope to)
>>
>> - btrfs-map-logical only return data stripes
>>    Thus it doesn't work for RAID56 just in case you want to use it.
>>
>> Despite the weird extent logical bytenr, everything should be fine
>> with btrfs-map-logical.
>>
>> I'll take some time looking into the weird behavior change.
>>
>> Thanks,
>> Qu
>>
Christoph Hellwig May 19, 2022, 10:50 a.m. UTC | #9
On Thu, May 19, 2022 at 06:41:52PM +0800, Qu Wenruo wrote:
> So in that case, the best solution is to use "btrfs-map-logical -l 
> 343998464", which will directly return the physical offset of the wanted 
> logical on each involved devices.
>
> Although we need to note:
>
> - btrfs-map-logical may not always be shipped in progs in the future
>   This tool really looks like a debug tool. I'm not sure if we will keep
>   shipping it (personally I really hope to)
>
> - btrfs-map-logical only return data stripes
>   Thus it doesn't work for RAID56 just in case you want to use it.
>
> Despite the weird extent logical bytenr, everything should be fine with 
> btrfs-map-logical.


Oh, nice, this is much better than the hackery in the existing tests.

That being said, the -c option does not seem to work.  No matter what
I specify it always returns all three mirrors.  I guess a little
awk will fix that, but the behavior seems odd.
Qu Wenruo May 19, 2022, 11:27 a.m. UTC | #10
On 2022/5/19 18:50, Christoph Hellwig wrote:
> On Thu, May 19, 2022 at 06:41:52PM +0800, Qu Wenruo wrote:
>> So in that case, the best solution is to use "btrfs-map-logical -l
>> 343998464", which will directly return the physical offset of the wanted
>> logical on each involved devices.
>>
>> Although we need to note:
>>
>> - btrfs-map-logical may not always be shipped in progs in the future
>>    This tool really looks like a debug tool. I'm not sure if we will keep
>>    shipping it (personally I really hope to)
>>
>> - btrfs-map-logical only return data stripes
>>    Thus it doesn't work for RAID56 just in case you want to use it.
>>
>> Despite the weird extent logical bytenr, everything should be fine with
>> btrfs-map-logical.
>
>
> Oh, nice, this is much better than the hackery in the existing tests.
>
> That bein said the -c option does not seem to work.  No matter what
> I specify it always returns all three mirrors.  I guess a little
> awk will fix that, but the behavior seems odd.

Currently the copy number is only used for the -o option, not for regular
chunk mapping printing.

Thus we still need awk to fix the awkward behavior. (no pun intended)

Another problem related to btrfs-map-logical is that it doesn't work if
the range has no data/metadata extent in it.

So Nik is right, we need a better tool integrated into btrfs-ins, rather
than this historical one.

Thanks,
Qu
Qu Wenruo May 20, 2022, 6:43 a.m. UTC | #11
>> I tried to write a test case for this by copying btrfs/140 and then
>> as a first step extending it to three mirrors unsing the raid1c1
>> profile.  But it seems that the tricks used there don't work,
>> as the code in btrfs/140 relies on the fact that the btrfs logic
>> address repored by file frag is reported by dump-tree as the item
>> "index" ĭn this line:
>>
>> item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751 itemsiz
>>
>> but for the raid1c3 profile that line reports something entirely
>> different.
>>
>> for raid1:
>>
>> logical: 137756672
>> item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 137756672) itemoff 15751
>> itemsize 112
>>
>> for raid1c3:
>>
>> logical: 343998464
>> item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 298844160) itemoff 15621
>> itemsize 144
>>
>> any idea how to find physical sectors to corrupt for raid1c1?
>>
>
> I also recently hit weird cases why extent allocator no longer puts the
> first data extent at the beginning of a chunk.

Thankfully, this is not a bug, but a combination of seemingly
straightforward behaviors, which leads to a weird combined result.

It took me a short adventure into the free space handling to find the
problem.

For my example, I'm using 3x10G disks, and running RAID0 for data, RAID1
for metadata:

    Label:              (null)
    UUID:               bb10a539-0344-445a-9e77-bbda65d79366
    Node size:          16384
    Sector size:        4096
    Filesystem size:    30.00GiB
    Block group profiles:
      Data:             RAID0             3.00GiB
      Metadata:         RAID1           256.00MiB
      System:           RAID1             8.00MiB
    SSD detected:       no
    Zoned device:       no
    Incompat features:  extref, skinny-metadata, no-holes
    Runtime features:   free-space-tree
    Checksum:           crc32c
    Number of devices:  3
    Devices:
       ID        SIZE  PATH
        1    10.00GiB  /dev/test/scratch1
        2    10.00GiB  /dev/test/scratch2
        3    10.00GiB  /dev/test/scratch3

The 3GiB data chunk (at logical 298844160, length 3GiB) is completely
empty, but notice that btrfs needs to avoid allocating extents over the
superblock reservations.

And we have one logical bytenr 434110464, which is at the superblock
location of /dev/test/scratch1.

So the free space of that 3GiB chunk is split into two parts:

[298844160, +135266304)
[434176000, +3085893632)

Notice the latter part is much larger.
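
The numbers line up exactly, with the 64KiB gap being the space excluded
around the superblock copy:

  298844160 +  135266304 = 434110464  (end of part 1 = the superblock logical)
  434110464 +      65536 = 434176000  (start of part 2, 64KiB later)
  434176000 + 3085893632 = 3520069632 (= 298844160 + 3GiB, end of the chunk)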

So far so good, but there is another thing involved, the cached free
space behavior.

In find_free_space(), if we are searching from the beginning of a block
group, we will use `rb_first_cached(&ctl->free_space_bytes);`

But the free_space_bytes rbtree is not sorted by logical bytenr, but by
free space, and the leftmost node has the largest amount of free space.
So instead of choosing [298844160, +135266304), we choose [434176000,
+3085893632), which has much more free space.


Thus we got the seemingly weird bytenr, 434176000, for our first data
extent.


And each behavior by itself is completely sane and straightforward.
We can not use space reserved for superblocks.
We should use the free space entry that has the most free space.

But in the end, when combining the two, we get the behavior of not
returning the beginning of a seemingly empty chunk.

So in short, we should not rely on the dirty dump-tree hacks, but on a
better version of btrfs-map-logical to grab the real physical offset of
a logical bytenr.

Thanks,
Qu
Christoph Hellwig May 20, 2022, 3:25 p.m. UTC | #12
On Wed, May 18, 2022 at 07:04:22AM +0800, Qu Wenruo wrote:
> A big NONO here.
>
> Function btrfs_repair_read_bio() will only return true if all of its
> data matches csum.
>
> Consider the following case:
>
> Profile RAID1C3, 2 sectors to read, the initial mirror is 1.
>
> Mirror 1:	|X|X|
> Mirror 2:	|X| |
> Mirror 3:	| |X|
>
> Now we will got -EIO, but in reality, we can repair the read by using
> the first sector from mirror 3 and the 2nd sector from mirror 2.
>
> This is a behavior regression.

Now that I've written tests and code to treat this properly I have to
say that at least on x86 I can't actually trigger this behavior
regression and had to add instrumentation to see a low-level change
in behavior.

Why?

For some reason (and I'd love to see an explanation for it!), btrfs
limits direct I/O reads to a single sector, so for direct I/O it is
impossible to hit this case as all bios (and thus repair bios) are
limited to a single sector.

For buffered I/O we can hit this case, but all ->readahead errors are
ignored and the pages actually needed are eventually retried using
->readpage.  And ->readpage is limited to a single sector.  But given
that btrfs does not seem to support sub-4K sector sizes I can't
actually test the sub-sector code here - I assume this might be
an issue on larger page size systems.

Patch

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 99f9995670ea3..0b2605c750cab 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -31,7 +31,7 @@  btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o tree-mod-log.o
+	   subpage.o tree-mod-log.o read-repair.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/read-repair.c b/fs/btrfs/read-repair.c
new file mode 100644
index 0000000000000..3ac93bfe09e4f
--- /dev/null
+++ b/fs/btrfs/read-repair.c
@@ -0,0 +1,211 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2022 Christoph Hellwig.
+ */
+#include "ctree.h"
+#include "volumes.h"
+#include "read-repair.h"
+#include "btrfs_inode.h"
+
+static struct bio_set read_repair_bioset;
+
+static int next_mirror(struct btrfs_read_repair *rr, int cur_mirror)
+{
+	if (cur_mirror == rr->num_copies)
+		return cur_mirror + 1 - rr->num_copies;
+	return cur_mirror + 1;
+}
+
+static int prev_mirror(struct btrfs_read_repair *rr, int cur_mirror)
+{
+	if (cur_mirror == 1)
+		return rr->num_copies;
+	return cur_mirror - 1;
+}
+
+/*
+ * Clone a new bio from the src_bbio, using the saved iter in the btrfs_bio
+ * instead of bi_iter.
+ */
+static struct btrfs_bio *btrfs_repair_bio_clone(struct btrfs_bio *src_bbio,
+		u64 offset, u32 size, unsigned int op)
+{
+	struct btrfs_bio *bbio;
+	struct bio *bio;
+
+	bio = bio_alloc_bioset(NULL, 0, op | REQ_SYNC, GFP_NOFS,
+			       &read_repair_bioset);
+	bio_set_flag(bio, BIO_CLONED);
+
+	bio->bi_io_vec = src_bbio->bio.bi_io_vec;
+	bio->bi_iter = src_bbio->iter;
+	bio_advance(bio, offset);
+	bio->bi_iter.bi_size = size;
+
+	bbio = btrfs_bio(bio);
+	memset(bbio, 0, offsetof(struct btrfs_bio, bio));
+	bbio->iter = bbio->bio.bi_iter;
+	bbio->file_offset = src_bbio->file_offset + offset;
+
+	return bbio;
+}
+
+static void btrfs_repair_one_mirror(struct btrfs_bio *read_bbio,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		int bad_mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	struct btrfs_inode *bi = BTRFS_I(inode);
+	u64 logical = read_bbio->iter.bi_sector << SECTOR_SHIFT;
+	u64 file_offset = read_bbio->file_offset;
+	struct btrfs_bio *write_bbio;
+	blk_status_t ret;
+
+	/*
+	 * For zoned file systems repair has to relocate the whole zone.
+	 */
+	if (btrfs_repair_one_zone(fs_info, logical))
+		return;
+
+	/*
+	 * For RAID56, we can not just write the bad data back, as any write
+	 * will trigger RMW and read back the corrupted on-disk stripe, causing
+	 * further damage.
+	 *
+	 * Perform a special low-level repair that bypasses btrfs_map_bio.
+	 */
+	if (btrfs_is_parity_mirror(fs_info, logical, fs_info->sectorsize)) {
+		struct bvec_iter iter;
+		struct bio_vec bv;
+		u32 offset;
+
+		btrfs_bio_for_each_sector(fs_info, bv, read_bbio, iter, offset)
+			btrfs_repair_io_failure(fs_info, btrfs_ino(bi),
+					file_offset + offset,
+					fs_info->sectorsize,
+					logical + offset,
+					bv.bv_page, bv.bv_offset,
+					bad_mirror);
+		return;
+	}
+
+	/*
+	 * Otherwise just clone the whole bio and write it back to the
+	 * previously bad mirror.
+	 */
+	write_bbio = btrfs_repair_bio_clone(read_bbio, 0,
+			read_bbio->iter.bi_size, REQ_OP_WRITE);
+	ret = btrfs_map_bio_wait(fs_info, &write_bbio->bio, bad_mirror);
+	bio_put(&write_bbio->bio);
+
+	btrfs_info_rl(fs_info,
+		"%s: root %lld ino %llu off %llu logical %llu/%u from good mirror %d",
+		ret ? "failed to correct read error" : "read error corrected",
+		bi->root->root_key.objectid, btrfs_ino(bi),
+		file_offset, logical, read_bbio->iter.bi_size, bad_mirror);
+}
+
+static bool btrfs_repair_read_bio(struct btrfs_bio *bbio,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		int read_mirror)
+{
+	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+	u32 start_offset = bbio->file_offset - failed_bbio->file_offset;
+	u8 csum[BTRFS_CSUM_SIZE];
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	u32 offset;
+
+	if (btrfs_map_bio_wait(fs_info, &bbio->bio, read_mirror))
+		return false;
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
+		return true;
+
+	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
+		u8 *expected_csum =
+			btrfs_csum_ptr(fs_info, failed_bbio->csum,
+					start_offset + offset);
+
+		if (btrfs_check_data_sector(fs_info, bv.bv_page, bv.bv_offset,
+				csum, expected_csum))
+			return false;
+	}
+
+	return true;
+}
+
+bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 end_offset, repair_endio_t endio)
+{
+	u8 bad_mirror = failed_bbio->mirror_num;
+	u8 read_mirror = next_mirror(rr, bad_mirror);
+	struct btrfs_bio *read_bbio = NULL;
+	bool uptodate = false;
+
+	do {
+		if (read_bbio)
+			bio_put(&read_bbio->bio);
+		read_bbio = btrfs_repair_bio_clone(failed_bbio,
+				rr->start_offset, end_offset - rr->start_offset,
+				REQ_OP_READ);
+		if (btrfs_repair_read_bio(read_bbio, failed_bbio, inode,
+				read_mirror)) {
+			do {
+		    		btrfs_repair_one_mirror(read_bbio, failed_bbio,
+						inode, bad_mirror);
+			} while ((bad_mirror = prev_mirror(rr, bad_mirror)) !=
+					failed_bbio->mirror_num);
+			uptodate = true;
+			break;
+		}
+		bad_mirror = read_mirror;
+		read_mirror = next_mirror(rr, bad_mirror);
+	} while (read_mirror != failed_bbio->mirror_num);
+
+	if (endio)
+		endio(read_bbio, inode, uptodate);
+	bio_put(&read_bbio->bio);
+
+	rr->in_use = false;
+	return uptodate;
+}
+
+bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 start_offset)
+{
+	if (rr->in_use)
+		return true;
+
+	if (!rr->num_copies) {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		rr->num_copies = btrfs_num_copies(fs_info,
+				failed_bbio->iter.bi_sector << SECTOR_SHIFT,
+				failed_bbio->iter.bi_size);
+	}
+
+	/*
+	 * If there is no other copy of the data to recover from, give up now
+	 * and don't even try to build up a larger batch.
+	 */
+	if (rr->num_copies < 2)
+		return false;
+
+	rr->in_use = true;
+	rr->start_offset = start_offset;
+	return true;
+}
+
+void btrfs_read_repair_exit(void)
+{
+	bioset_exit(&read_repair_bioset);
+}
+
+int __init btrfs_read_repair_init(void)
+{
+	return bioset_init(&read_repair_bioset, BIO_POOL_SIZE,
+			offsetof(struct btrfs_bio, bio), 0);
+}
diff --git a/fs/btrfs/read-repair.h b/fs/btrfs/read-repair.h
new file mode 100644
index 0000000000000..e371790af2b3e
--- /dev/null
+++ b/fs/btrfs/read-repair.h
@@ -0,0 +1,33 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef BTRFS_READ_REPAIR_H
+#define BTRFS_READ_REPAIR_H
+
+struct btrfs_read_repair {
+	u64 start_offset;
+	bool in_use;
+	int num_copies;
+};
+
+typedef void (*repair_endio_t)(struct btrfs_bio *repair_bbio,
+		struct inode *inode, bool uptodate);
+
+bool btrfs_read_repair_add(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 bio_offset);
+bool __btrfs_read_repair_end(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 end_offset, repair_endio_t end_io);
+static inline bool btrfs_read_repair_end(struct btrfs_read_repair *rr,
+		struct btrfs_bio *failed_bbio, struct inode *inode,
+		u64 end_offset, repair_endio_t endio)
+{
+	if (!rr->in_use)
+		return true;
+	return __btrfs_read_repair_end(rr, failed_bbio, inode, end_offset,
+			endio);
+}
+
+int __init btrfs_read_repair_init(void);
+void btrfs_read_repair_exit(void);
+
+#endif /* BTRFS_READ_REPAIR_H */
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b1fdc6a26c76e..b33ad892c3058 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -48,6 +48,7 @@ 
 #include "block-group.h"
 #include "discard.h"
 #include "qgroup.h"
+#include "read-repair.h"
 #define CREATE_TRACE_POINTS
 #include <trace/events/btrfs.h>
 
@@ -2641,10 +2642,12 @@  static int __init init_btrfs_fs(void)
 	err = extent_io_init();
 	if (err)
 		goto free_cachep;
-
-	err = extent_state_cache_init();
+	err = btrfs_read_repair_init();
 	if (err)
 		goto free_extent_io;
+	err = extent_state_cache_init();
+	if (err)
+		goto free_read_repair;
 
 	err = extent_map_init();
 	if (err)
@@ -2708,6 +2711,8 @@  static int __init init_btrfs_fs(void)
 	extent_map_exit();
 free_extent_state_cache:
 	extent_state_cache_exit();
+free_read_repair:
+	btrfs_read_repair_exit();
 free_extent_io:
 	extent_io_exit();
 free_cachep: