[07/17] btrfs: allow btrfs_submit_bio to split bios

Message ID 20220901074216.1849941-8-hch@lst.de
State New, archived
Series [01/17] block: export bio_split_rw

Commit Message

Christoph Hellwig Sept. 1, 2022, 7:42 a.m. UTC
Currently the I/O submitters have to split bios according to the
chunk stripe boundaries.  This leads to extra lookups in the extent
trees and a lot of boilerplate code.

To drop this requirement, split the bio when __btrfs_map_block
returns a mapping that is smaller than the requested size and
keep a count of pending bios in the original btrfs_bio so that
the upper level completion is only invoked when all clones have
completed.

Based on a patch from Qu Wenruo.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/volumes.c | 106 +++++++++++++++++++++++++++++++++++++--------
 fs/btrfs/volumes.h |   1 +
 2 files changed, 90 insertions(+), 17 deletions(-)
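
In short, the mechanism the patch adds looks roughly like this (a condensed sketch of the diff at the bottom of this page; the __btrfs_map_block() call, checksum handling, and error paths are elided, so bioc and smap are shown unpopulated):

	static bool btrfs_submit_chunk(struct btrfs_fs_info *fs_info,
				       struct bio *bio, int mirror_num)
	{
		u64 length = bio->bi_iter.bi_size;
		u64 map_length = length;

		/* __btrfs_map_block() shortens map_length when the bio
		 * crosses a chunk stripe boundary. */
		map_length = min(map_length, length);
		if (map_length < length) {
			/* Split off the mappable front; the clone takes a
			 * reference on the parent via pending_ios. */
			bio = btrfs_split_bio(bio, map_length);
		}

		__btrfs_submit_bio(bio, bioc, &smap, mirror_num);

		/* False means a remainder is left to submit. */
		return map_length == length;
	}

	void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
			      int mirror_num)
	{
		while (!btrfs_submit_chunk(fs_info, bio, mirror_num))
			;
	}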

Comments

Johannes Thumshirn Sept. 1, 2022, 9:47 a.m. UTC | #1
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Josef Bacik Sept. 7, 2022, 8:55 p.m. UTC | #2
On Thu, Sep 01, 2022 at 10:42:06AM +0300, Christoph Hellwig wrote:
> Currently the I/O submitters have to split bios according to the
> chunk stripe boundaries.  This leads to extra lookups in the extent
> trees and a lot of boilerplate code.
> 
> To drop this requirement, split the bio when __btrfs_map_block
> returns a mapping that is smaller than the requested size and
> keep a count of pending bios in the original btrfs_bio so that
> the upper level completion is only invoked when all clones have
> completed.
> 
> Based on a patch from Qu Wenruo.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/btrfs/volumes.c | 106 +++++++++++++++++++++++++++++++++++++--------
>  fs/btrfs/volumes.h |   1 +
>  2 files changed, 90 insertions(+), 17 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 5c6535e10085d..0a2d144c20604 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -35,6 +35,7 @@
>  #include "zoned.h"
>  
>  static struct bio_set btrfs_bioset;
> +static struct bio_set btrfs_clone_bioset;
>  static struct bio_set btrfs_repair_bioset;
>  static mempool_t btrfs_failed_bio_pool;
>  
> @@ -6661,6 +6662,7 @@ static void btrfs_bio_init(struct btrfs_bio *bbio, struct inode *inode,
>  	bbio->inode = inode;
>  	bbio->end_io = end_io;
>  	bbio->private = private;
> +	atomic_set(&bbio->pending_ios, 1);
>  }
>  
>  /*
> @@ -6698,6 +6700,57 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
>  	return bio;
>  }
>  
> +static struct bio *btrfs_split_bio(struct bio *orig, u64 map_length)
> +{
> +	struct btrfs_bio *orig_bbio = btrfs_bio(orig);
> +	struct bio *bio;
> +
> +	bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
> +			&btrfs_clone_bioset);
> +	btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode, NULL, orig_bbio);
> +
> +	btrfs_bio(bio)->file_offset = orig_bbio->file_offset;
> +	orig_bbio->file_offset += map_length;

I'm worried about this for the ONE_ORDERED case.  We specifically used the
ONE_ORDERED flag because our file_offset was the start, but our length could go
past the range of the ordered extent, and then we wouldn't find our ordered
extent and things would go quite wrong.

Instead we should do something like

if (!(orig->bi_opf & REQ_BTRFS_ONE_ORDERED))
	orig_bbio->file_offset += map_length;

I've cc'ed Omar since he's the one who added this and I'm a little confused
about how this can happen.
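
(A minimal sketch of where that guard would sit in btrfs_split_bio() — hypothetical, not part of the posted patch; REQ_BTRFS_ONE_ORDERED is the flag name from the suggestion above, everything else is taken from the diff:)

	static struct bio *btrfs_split_bio(struct bio *orig, u64 map_length)
	{
		struct btrfs_bio *orig_bbio = btrfs_bio(orig);
		struct bio *bio;

		bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
				&btrfs_clone_bioset);
		btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode, NULL, orig_bbio);

		btrfs_bio(bio)->file_offset = orig_bbio->file_offset;
		/* Keep file_offset pointing at the start of the ordered
		 * extent so the completion path can still look it up. */
		if (!(orig->bi_opf & REQ_BTRFS_ONE_ORDERED))
			orig_bbio->file_offset += map_length;

		atomic_inc(&orig_bbio->pending_ios);
		return bio;
	}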

> +
> +	atomic_inc(&orig_bbio->pending_ios);
> +	return bio;
> +}
> +
> +static void btrfs_orig_write_end_io(struct bio *bio);
> +static void btrfs_bbio_propagate_error(struct btrfs_bio *bbio,
> +				       struct btrfs_bio *orig_bbio)
> +{
> +	/*
> +	 * For writes btrfs tolerates nr_mirrors - 1 write failures, so we
> +	 * can't just blindly propagate a write failure here.
> +	 * Instead increment the error count in the original I/O context so
> +	 * that it is guaranteed to be larger than the error tolerance.
> +	 */
> +	if (bbio->bio.bi_end_io == &btrfs_orig_write_end_io) {
> +		struct btrfs_io_stripe *orig_stripe = orig_bbio->bio.bi_private;
> +		struct btrfs_io_context *orig_bioc = orig_stripe->bioc;
> +		

Whitespace error here.  Thanks,

Josef
Qu Wenruo Sept. 12, 2022, 12:20 a.m. UTC | #3
On 2022/9/1 15:42, Christoph Hellwig wrote:
> Currently the I/O submitters have to split bios according to the
> chunk stripe boundaries.  This leads to extra lookups in the extent
> trees and a lot of boilerplate code.
>
> To drop this requirement, split the bio when __btrfs_map_block
> returns a mapping that is smaller than the requested size and
> keep a count of pending bios in the original btrfs_bio so that
> the upper level completion is only invoked when all clones have
> completed.

Sorry for the late reply, but I still have a question related to the
chained bio approach.

Since we use the chained method, if we hit an error for a split
bio, the whole bio will be marked as failed.

Especially for read bios this can be a problem (currently only for
RAID10 though), as it can affect the read-repair behavior.

E.g. we have a 4-disk RAID10 that looks like this:

Disk 1 (unreliable): Mirror 1 of logical range [X, X + 64K)
Disk 2 (reliable):   Mirror 2 of logical range [X, X + 64K)
Disk 3 (reliable):   Mirror 1 of logical range [X + 64K, X + 128K)
Disk 4 (unreliable): Mirror 2 of logical range [X + 64K, X + 128K)

And we submit a read for range [X, X + 128K)

The first 64K will use mirror 1, thus reading from Disk 1.
The second 64K will also use mirror 1, thus reading from Disk 3.

But say the first 64K read fails for whatever reason: we mark the
whole range as failed, and it needs to go through the repair code.

Note that the original bio is using mirror 1, thus for read-repair we
can only read from mirror 2 to repair.

But in that case Disk 4 is also unreliable. If at read-repair time we
don't try all mirrors, including the failed one (whose mirror_num is no
longer reliable for all ranges), we may fail to repair some ranges.

Does the read-repair code now have something to compensate for the
chained behavior?

Thanks,
Qu

Christoph Hellwig Sept. 12, 2022, 1:55 p.m. UTC | #4
On Mon, Sep 12, 2022 at 08:20:37AM +0800, Qu Wenruo wrote:
> Sorry for the late reply, but I still have a question related to the
> chained bio approach.
>
> Since we use the chained method, if we hit an error for a split
> bio, the whole bio will be marked as failed.

The only chained bios in the sense of using bio chaining are the
writes to the multiple legs of mirrored volumes.

> Especially for read bios this can be a problem (currently only for
> RAID10 though), as it can affect the read-repair behavior.
>
> E.g. we have a 4-disk RAID10 that looks like this:
>
> Disk 1 (unreliable): Mirror 1 of logical range [X, X + 64K)
> Disk 2 (reliable):   Mirror 2 of logical range [X, X + 64K)
> Disk 3 (reliable):   Mirror 1 of logical range [X + 64K, X + 128K)
> Disk 4 (unreliable): Mirror 2 of logical range [X + 64K, X + 128K)
>
> And we submit a read for range [X, X + 128K)
>
> The first 64K will use mirror 1, thus reading from Disk 1.
> The second 64K will also use mirror 1, thus reading from Disk 3.
>
> But say the first 64K read fails for whatever reason: we mark the
> whole range as failed, and it needs to go through the repair code.

With the code posted in this series that is not what happens.  Instead
the checksum validation and then repair happen when the read from
mirror 1 / disk 1 completes, but before the results are propagated
up.  That was the prime reason why I had to move the repair code
below btrfs_submit_bio (that it happened to remove code and consolidate
the exact behavior is a nice side-effect).

> Does the read-repair code now have something to compensate for the
> chained behavior?

It doesn't compensate it, but it is invoked at a low enough level so
that this problem does not happen.
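
(For reference, the completion helper from this series that implements that ordering, reproduced from the diff below with comments added:)

	static void btrfs_orig_bbio_end_io(struct btrfs_bio *bbio)
	{
		if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
			struct btrfs_bio *orig_bbio = bbio->private;

			/* Checksum verification and read-repair have already
			 * run for this split bio by the time we get here, so
			 * only the final status is propagated upwards. */
			if (bbio->bio.bi_status)
				btrfs_bbio_propagate_error(bbio, orig_bbio);
			bio_put(&bbio->bio);
			bbio = orig_bbio;
		}

		/* The parent completes only once every split has finished. */
		if (atomic_dec_and_test(&bbio->pending_ios))
			bbio->end_io(bbio);
	}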
Christoph Hellwig Sept. 12, 2022, 1:58 p.m. UTC | #5
On Wed, Sep 07, 2022 at 04:55:45PM -0400, Josef Bacik wrote:
> I'm worried about this for the ONE_ORDERED case.  We specifically used the
> ONE_ORDERED flag because our file_offset was the start, but our length could go
> past the range of the ordered extent, and then we wouldn't find our ordered
> extent and things would go quite wrong.
> 
> Instead we should do something like
> 
> if (!(orig->bi_opf & REQ_BTRFS_ONE_ORDERED))
> 	orig_bbio->file_offset += map_length;
> 
> I've cc'ed Omar since he's the one who added this and I'm a little confused
> about how this can happen.

I have to say I found the logic quite confusing as well, and when I
broke it during development of this series xfstests did not complain
either.  So shedding some more light on the flag would be really
helpful.
Qu Wenruo Sept. 12, 2022, 10:23 p.m. UTC | #6
On 2022/9/12 21:55, Christoph Hellwig wrote:
> On Mon, Sep 12, 2022 at 08:20:37AM +0800, Qu Wenruo wrote:
>> Sorry for the late reply, but I still have a question related to the
>> chained bio approach.
>>
>> Since we use the chained method, if we hit an error for a split
>> bio, the whole bio will be marked as failed.
>
> The only chained bios in the sense of using bio chaining are the
> writes to the multiple legs of mirrored volumes.
>
>> Especially for read bios this can be a problem (currently only for
>> RAID10 though), as it can affect the read-repair behavior.
>>
>> E.g. we have a 4-disk RAID10 that looks like this:
>>
>> Disk 1 (unreliable): Mirror 1 of logical range [X, X + 64K)
>> Disk 2 (reliable):   Mirror 2 of logical range [X, X + 64K)
>> Disk 3 (reliable):   Mirror 1 of logical range [X + 64K, X + 128K)
>> Disk 4 (unreliable): Mirror 2 of logical range [X + 64K, X + 128K)
>>
>> And we submit a read for range [X, X + 128K)
>>
>> The first 64K will use mirror 1, thus reading from Disk 1.
>> The second 64K will also use mirror 1, thus reading from Disk 3.
>>
>> But say the first 64K read fails for whatever reason: we mark the
>> whole range as failed, and it needs to go through the repair code.
>
> With the code posted in this series that is not what happens.  Instead
> the checksum validation and then repair happen when the read from
> mirror 1 / disk 1 completes, but before the results are propagated
> up.  That was the prime reason why I had to move the repair code
> below btrfs_submit_bio (that it happened to remove code and consolidate
> the exact behavior is a nice side-effect).
>
>> Does the read-repair code now have something to compensate for the
>> chained behavior?
>
> It doesn't compensate it, but it is invoked at a low enough level so
> that this problem does not happen.

You're completely right. It's the 4th patch that puts the verification
code into the endio function, so the verification is still done per
split bio.

I really should review the whole series in one go...

Then it looks pretty good to me.

Reviewed-by: Qu Wenruo <wqu@suse.com>

Thanks,
Qu

Patch

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5c6535e10085d..0a2d144c20604 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -35,6 +35,7 @@ 
 #include "zoned.h"
 
 static struct bio_set btrfs_bioset;
+static struct bio_set btrfs_clone_bioset;
 static struct bio_set btrfs_repair_bioset;
 static mempool_t btrfs_failed_bio_pool;
 
@@ -6661,6 +6662,7 @@  static void btrfs_bio_init(struct btrfs_bio *bbio, struct inode *inode,
 	bbio->inode = inode;
 	bbio->end_io = end_io;
 	bbio->private = private;
+	atomic_set(&bbio->pending_ios, 1);
 }
 
 /*
@@ -6698,6 +6700,57 @@  struct bio *btrfs_bio_clone_partial(struct bio *orig, u64 offset, u64 size,
 	return bio;
 }
 
+static struct bio *btrfs_split_bio(struct bio *orig, u64 map_length)
+{
+	struct btrfs_bio *orig_bbio = btrfs_bio(orig);
+	struct bio *bio;
+
+	bio = bio_split(orig, map_length >> SECTOR_SHIFT, GFP_NOFS,
+			&btrfs_clone_bioset);
+	btrfs_bio_init(btrfs_bio(bio), orig_bbio->inode, NULL, orig_bbio);
+
+	btrfs_bio(bio)->file_offset = orig_bbio->file_offset;
+	orig_bbio->file_offset += map_length;
+
+	atomic_inc(&orig_bbio->pending_ios);
+	return bio;
+}
+
+static void btrfs_orig_write_end_io(struct bio *bio);
+static void btrfs_bbio_propagate_error(struct btrfs_bio *bbio,
+				       struct btrfs_bio *orig_bbio)
+{
+	/*
+	 * For writes btrfs tolerates nr_mirrors - 1 write failures, so we
+	 * can't just blindly propagate a write failure here.
+	 * Instead increment the error count in the original I/O context so
+	 * that it is guaranteed to be larger than the error tolerance.
+	 */
+	if (bbio->bio.bi_end_io == &btrfs_orig_write_end_io) {
+		struct btrfs_io_stripe *orig_stripe = orig_bbio->bio.bi_private;
+		struct btrfs_io_context *orig_bioc = orig_stripe->bioc;
+		
+		atomic_add(orig_bioc->max_errors, &orig_bioc->error);
+	} else {
+		orig_bbio->bio.bi_status = bbio->bio.bi_status;
+	}
+}
+
+static void btrfs_orig_bbio_end_io(struct btrfs_bio *bbio)
+{
+	if (bbio->bio.bi_pool == &btrfs_clone_bioset) {
+		struct btrfs_bio *orig_bbio = bbio->private;
+
+		if (bbio->bio.bi_status)
+			btrfs_bbio_propagate_error(bbio, orig_bbio);
+		bio_put(&bbio->bio);
+		bbio = orig_bbio;
+	}
+
+	if (atomic_dec_and_test(&bbio->pending_ios))
+		bbio->end_io(bbio);
+}
+
 static int next_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
 {
 	if (cur_mirror == fbio->num_copies)
@@ -6715,7 +6768,7 @@  static int prev_repair_mirror(struct btrfs_failed_bio *fbio, int cur_mirror)
 static void btrfs_repair_done(struct btrfs_failed_bio *fbio)
 {
 	if (atomic_dec_and_test(&fbio->repair_count)) {
-		fbio->bbio->end_io(fbio->bbio);
+		btrfs_orig_bbio_end_io(fbio->bbio);
 		mempool_free(fbio, &btrfs_failed_bio_pool);
 	}
 }
@@ -6857,7 +6910,7 @@  static void btrfs_check_read_bio(struct btrfs_bio *bbio,
 	if (unlikely(fbio))
 		btrfs_repair_done(fbio);
 	else
-		bbio->end_io(bbio);
+		btrfs_orig_bbio_end_io(bbio);
 }
 
 static void btrfs_log_dev_io_error(struct bio *bio, struct btrfs_device *dev)
@@ -6908,7 +6961,7 @@  static void btrfs_simple_end_io(struct bio *bio)
 	} else {
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
 			btrfs_record_physical_zoned(bbio);
-		bbio->end_io(bbio);
+		btrfs_orig_bbio_end_io(bbio);
 	}
 }
 
@@ -6922,7 +6975,7 @@  static void btrfs_raid56_end_io(struct bio *bio)
 	if (bio_op(bio) == REQ_OP_READ)
 		btrfs_check_read_bio(bbio, NULL);
 	else
-		bbio->end_io(bbio);
+		btrfs_orig_bbio_end_io(bbio);
 
 	btrfs_put_bioc(bioc);
 }
@@ -6949,7 +7002,7 @@  static void btrfs_orig_write_end_io(struct bio *bio)
 	else
 		bio->bi_status = BLK_STS_OK;
 
-	bbio->end_io(bbio);
+	btrfs_orig_bbio_end_io(bbio);
 	btrfs_put_bioc(bioc);
 }
 
@@ -7190,8 +7243,8 @@  static bool btrfs_wq_submit_bio(struct btrfs_bio *bbio,
 	return true;
 }
 
-void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
-		      int mirror_num)
+static bool btrfs_submit_chunk(struct btrfs_fs_info *fs_info, struct bio *bio,
+			       int mirror_num)
 {
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	u64 logical = bio->bi_iter.bi_sector << 9;
@@ -7207,11 +7260,10 @@  void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	if (ret)
 		goto fail;
 
+	map_length = min(map_length, length);
 	if (map_length < length) {
-		btrfs_crit(fs_info,
-			   "mapping failed logical %llu bio len %llu len %llu",
-			   logical, length, map_length);
-		BUG();
+		bio = btrfs_split_bio(bio, map_length);
+		bbio = btrfs_bio(bio);
 	}
 
 	/*
@@ -7222,7 +7274,7 @@  void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		bbio->saved_iter = bio->bi_iter;
 		ret = btrfs_lookup_bio_sums(bbio);
 		if (ret)
-			goto fail;
+			goto fail_put_bio;
 	}
 
 	if (btrfs_op(bio) == BTRFS_MAP_WRITE) {
@@ -7231,7 +7283,7 @@  void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
 			ret = btrfs_extract_ordered_extent(btrfs_bio(bio));
 			if (ret)
-				goto fail;
+				goto fail_put_bio;
 		}
 
 		/*
@@ -7243,22 +7295,36 @@  void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		    !btrfs_is_data_reloc_root(bi->root)) {
 			if (should_async_write(bbio) &&
 			    btrfs_wq_submit_bio(bbio, bioc, &smap, mirror_num))
-				return;
+				goto done;
 
 			if (bio->bi_opf & REQ_META)
 				ret = btree_csum_one_bio(bbio);
 			else
 				ret = btrfs_csum_one_bio(bbio);
 			if (ret)
-				goto fail;
+				goto fail_put_bio;
 		}
 	}
 
 	__btrfs_submit_bio(bio, bioc, &smap, mirror_num);
-	return;
+done:
+	return map_length == length;
+
+fail_put_bio:
+	if (map_length < length)
+		bio_put(bio);
 fail:
 	btrfs_bio_counter_dec(fs_info);
 	btrfs_bio_end_io(bbio, errno_to_blk_status(ret));
+	/* Do not submit another chunk */
+	return true;
+}
+
+void btrfs_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
+		      int mirror_num)
+{
+	while (!btrfs_submit_chunk(fs_info, bio, mirror_num))
+		;
 }
 
 /*
@@ -8858,10 +8924,13 @@  int __init btrfs_bioset_init(void)
 			offsetof(struct btrfs_bio, bio),
 			BIOSET_NEED_BVECS))
 		return -ENOMEM;
+	if (bioset_init(&btrfs_clone_bioset, BIO_POOL_SIZE,
+			offsetof(struct btrfs_bio, bio), 0))
+		goto out_free_bioset;
 	if (bioset_init(&btrfs_repair_bioset, BIO_POOL_SIZE,
 			offsetof(struct btrfs_bio, bio),
 			BIOSET_NEED_BVECS))
-		goto out_free_bioset;
+		goto out_free_clone_bioset;
 	if (mempool_init_kmalloc_pool(&btrfs_failed_bio_pool, BIO_POOL_SIZE,
 				      sizeof(struct btrfs_failed_bio)))
 		goto out_free_repair_bioset;
@@ -8869,6 +8938,8 @@  int __init btrfs_bioset_init(void)
 
 out_free_repair_bioset:
 	bioset_exit(&btrfs_repair_bioset);
+out_free_clone_bioset:
+	bioset_exit(&btrfs_clone_bioset);
 out_free_bioset:
 	bioset_exit(&btrfs_bioset);
 	return -ENOMEM;
@@ -8878,5 +8949,6 @@  void __cold btrfs_bioset_exit(void)
 {
 	mempool_exit(&btrfs_failed_bio_pool);
 	bioset_exit(&btrfs_repair_bioset);
+	bioset_exit(&btrfs_clone_bioset);
 	bioset_exit(&btrfs_bioset);
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 8b248c9bd602b..97877184d0db1 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -386,6 +386,7 @@  struct btrfs_bio {
 
 	/* For internal use in read end I/O handling */
 	unsigned int mirror_num;
+	atomic_t pending_ios;
 	struct work_struct end_io_work;
 
 	/*