
[v4,04/13] btrfs: introduce a new helper to submit write bio for scrub

Message ID 72f4fa26c35f2e649bc562a80a40955d745f1118.1679826088.git.wqu@suse.com (mailing list archive)
State New, archived
Series btrfs: scrub: use a more reader friendly code to implement scrub_simple_mirror()

Commit Message

Qu Wenruo March 26, 2023, 11:06 a.m. UTC
Just like the special scrub read, scrub write has its own extra requirements:

- Only writes back to a single device
  Even for read-repair on RAID56, we only update the corrupted data
  stripe itself, not triggering the full RMW path.

  This makes scrub writeback a perfect match for the single stripe
  quick path.

- Requires a valid @mirror_num
  For the RAID56 case, only @mirror_num == 1 is supported.
  For non-RAID56 cases, we need @mirror_num to locate our stripe.

- Need to manually specify if it's for dev-replace
  For scrub path we can write back to the original device (for
  read-repair) and to the target device (for replace) at the same
  time, but with different sectors (read-repair only writes repaired
  sectors, while dev-replace writes all good sectors).

  So here we need a bool to specify the case.

- No data csum generation

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/bio.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/bio.h |  2 ++
 2 files changed, 94 insertions(+)

Comments

Christoph Hellwig March 27, 2023, 3:44 a.m. UTC | #1
On Sun, Mar 26, 2023 at 07:06:33PM +0800, Qu Wenruo wrote:
> +	/* Caller should ensure the @bbio doesn't cross stripe boundary. */
> +	ASSERT(map_length >= length);
> +	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE && btrfs_is_zoned(fs_info)) {
> +		bbio->bio.bi_opf &= ~REQ_OP_WRITE;
> +		bbio->bio.bi_opf |= REQ_OP_ZONE_APPEND;
> +	}

Not crossing the stripe boundary is not enough, for zone append it
also must not cross the zone append limit, which (at least in theory)
can be arbitrarily small.

> +	if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
> +		int data_stripes = (bioc->map_type & BTRFS_BLOCK_GROUP_RAID5) ?
> +				    bioc->num_stripes - 1 : bioc->num_stripes - 2;

This calculation feels like it should be in a simple, well-documented
helper.

> +		smap.physical = bioc->stripes[i].physical +
> +				((logical - bioc->full_stripe_logical) &
> +				 BTRFS_STRIPE_LEN_MASK);

This calculation feels like another candidate for a self-contained,
well-documented helper.
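A minimal sketch of the two helpers being suggested. The names `raid56_nr_data_stripes()` and `raid56_stripe_offset()` are hypothetical, and the kernel constants are stubbed out here under the assumption that BTRFS_STRIPE_LEN is 64K:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed constant: btrfs currently uses a 64K stripe length. */
#define BTRFS_STRIPE_LEN       (64ULL * 1024)
#define BTRFS_STRIPE_LEN_MASK  (BTRFS_STRIPE_LEN - 1)

/*
 * Number of data stripes in a RAID56 full stripe: all stripes minus
 * the parity ones (one parity stripe for RAID5, two for RAID6).
 */
static int raid56_nr_data_stripes(bool is_raid5, int num_stripes)
{
	return is_raid5 ? num_stripes - 1 : num_stripes - 2;
}

/*
 * Byte offset of @logical inside its data stripe: the distance from
 * the full stripe start, masked down to one stripe length.
 */
static uint64_t raid56_stripe_offset(uint64_t logical,
				     uint64_t full_stripe_logical)
{
	return (logical - full_stripe_logical) & BTRFS_STRIPE_LEN_MASK;
}
```

With helpers like these, both open-coded calculations in the hunk below collapse to a single named call each.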


> +		goto submit;
> +	}
> +	ASSERT(mirror_num <= bioc->num_stripes);
> +	smap.dev = bioc->stripes[mirror_num - 1].dev;
> +	smap.physical = bioc->stripes[mirror_num - 1].physical;
> +submit:

Why no else instead of the goto here?  That would read a lot easier.
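For illustration, here is the stripe-selection logic from the hunk rewritten with an explicit else instead of the goto, as suggested. The kernel structures and the submission path are replaced with simplified stand-ins, so this is a sketch of the control flow only, not the real function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed constants; btrfs currently uses a 64K stripe length. */
#define BTRFS_STRIPE_LEN       (64ULL * 1024)
#define BTRFS_STRIPE_LEN_MASK  (BTRFS_STRIPE_LEN - 1)
#define BTRFS_STRIPE_LEN_SHIFT 16

/* Simplified stand-ins for btrfs_io_stripe / btrfs_io_context. */
struct stripe { int dev; uint64_t physical; };
struct io_context {
	bool raid56;
	bool raid5;
	int num_stripes;
	uint64_t full_stripe_logical;
	struct stripe stripes[4];
};

/* Pick the single target stripe for a scrub write, goto-free. */
static struct stripe map_write_stripe(const struct io_context *bioc,
				      uint64_t logical, int mirror_num)
{
	struct stripe smap;

	if (bioc->raid56) {
		int data_stripes = bioc->raid5 ? bioc->num_stripes - 1 :
						 bioc->num_stripes - 2;
		int i;

		/* This special write only works for data stripes. */
		assert(mirror_num == 1);
		for (i = 0; i < data_stripes; i++) {
			uint64_t stripe_start = bioc->full_stripe_logical +
				((uint64_t)i << BTRFS_STRIPE_LEN_SHIFT);

			if (logical >= stripe_start &&
			    logical < stripe_start + BTRFS_STRIPE_LEN)
				break;
		}
		assert(i < data_stripes);
		smap.dev = bioc->stripes[i].dev;
		smap.physical = bioc->stripes[i].physical +
				((logical - bioc->full_stripe_logical) &
				 BTRFS_STRIPE_LEN_MASK);
	} else {
		assert(mirror_num <= bioc->num_stripes);
		smap.dev = bioc->stripes[mirror_num - 1].dev;
		smap.physical = bioc->stripes[mirror_num - 1].physical;
	}
	return smap;
}
```

The shared tail of the real function (the dev-replace remapping and the final submission) would then run unconditionally after the if/else.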
Qu Wenruo March 27, 2023, 4:32 a.m. UTC | #2
On 2023/3/27 11:44, Christoph Hellwig wrote:
> On Sun, Mar 26, 2023 at 07:06:33PM +0800, Qu Wenruo wrote:
>> +	/* Caller should ensure the @bbio doesn't cross stripe boundary. */
>> +	ASSERT(map_length >= length);
>> +	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE && btrfs_is_zoned(fs_info)) {
>> +		bbio->bio.bi_opf &= ~REQ_OP_WRITE;
>> +		bbio->bio.bi_opf |= REQ_OP_ZONE_APPEND;
>> +	}
> 
> Not crossing the stripe boundary is not enough, for zone append it
> also must not cross the zone append limit, which (at least in theory)
> can be arbitrarily small.

Do you mean that we can have some real zone append limit which is even 
smaller than 64K?

If so, then the write helper can be more complex than I thought...

Thanks,
Qu

> 
>> +	if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
>> +		int data_stripes = (bioc->map_type & BTRFS_BLOCK_GROUP_RAID5) ?
>> +				    bioc->num_stripes - 1 : bioc->num_stripes - 2;
> 
> This calculation feels like it should be in a simple, well-documented
> helper.
> 
>> +		smap.physical = bioc->stripes[i].physical +
>> +				((logical - bioc->full_stripe_logical) &
>> +				 BTRFS_STRIPE_LEN_MASK);
> 
> This calculation feels like another candidate for a self-contained,
> well-documented helper.
> 
> 
>> +		goto submit;
>> +	}
>> +	ASSERT(mirror_num <= bioc->num_stripes);
>> +	smap.dev = bioc->stripes[mirror_num - 1].dev;
>> +	smap.physical = bioc->stripes[mirror_num - 1].physical;
>> +submit:
> 
> Why no else instead of the goto here?  That would read a lot easier.
Christoph Hellwig March 27, 2023, 5:22 a.m. UTC | #3
On Mon, Mar 27, 2023 at 12:32:22PM +0800, Qu Wenruo wrote:
> > Not crossing the stripe boundary is not enough, for zone append it
> > also must not cross the zone append limit, which (at least in theory)
> > can be arbitrarily small.
> 
> Do you mean that we can have some real zone append limit which is even
> smaller than 64K?
> 
> If so, then the write helper can be more complex than I thought...

In theory it can be as small as a single LBA.  It might make sense
to require a minimum of 64k for btrfs, and reject the writable mount
if it is not, as a device with just a 64k zone append limit or smaller
would be really horrible to use anyway.
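The proposed mount-time rejection could look roughly like this. A sketch only: the function name and the read-only escape hatch are hypothetical, and it assumes the minimum is compared against the 64K stripe length:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed constant: btrfs currently uses a 64K stripe length. */
#define BTRFS_STRIPE_LEN (64ULL * 1024)

/*
 * Reject a writable mount when the device's zone append limit is
 * smaller than one stripe, so write bios never need to be split below
 * the stripe boundary. Read-only mounts issue no zone append writes,
 * so they could still be allowed.
 */
static bool btrfs_zoned_limit_ok(uint64_t max_zone_append_bytes,
				 bool read_only)
{
	if (read_only)
		return true;
	return max_zone_append_bytes >= BTRFS_STRIPE_LEN;
}
```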
Qu Wenruo March 27, 2023, 7:52 a.m. UTC | #4
On 2023/3/27 13:22, Christoph Hellwig wrote:
> On Mon, Mar 27, 2023 at 12:32:22PM +0800, Qu Wenruo wrote:
>>> Not crossing the stripe boundary is not enough, for zone append it
>>> also must not cross the zone append limit, which (at least in theory)
>>> can be arbitrarily small.
>>
>> Do you mean that we can have some real zone append limit which is even
>> smaller than 64K?
>>
>> If so, then the write helper can be more complex than I thought...
> 
> In theory it can be as small as a single LBA.  It might make sense
> to require a minimum of 64k for btrfs, and reject the writable mount
> if it is not, as a device with just a 64k zone append limit or smaller
> would be really horrible to use anyway.

The rejection method sounds very reasonable to me.

Although I'd prefer some feedback from Johannes on the real world 
values, and if possible let the zoned guys submit the patch.

I haven't yet built my VMs to run ZNS tests.

Another thing is, if we want to implement the rejection before the 
reworked scrub, the threshold would be 128K, as that's the bio size 
limit for the old scrub code.

Thanks,
Qu
Johannes Thumshirn March 27, 2023, 6:02 p.m. UTC | #5
On 27.03.23 09:52, Qu Wenruo wrote:
> 
> 
> On 2023/3/27 13:22, Christoph Hellwig wrote:
>> On Mon, Mar 27, 2023 at 12:32:22PM +0800, Qu Wenruo wrote:
>>>> Not crossing the stripe boundary is not enough, for zone append it
>>>> also must not cross the zone append limit, which (at least in theory)
>>>> can be arbitrarily small.
>>>
>>> Do you mean that we can have some real zone append limit which is even
>>> smaller than 64K?
>>>
>>> If so, then the write helper can be more complex than I thought...
>>
>> In theory it can be as small as a single LBA.  It might make sense
>> to require a minimum of 64k for btrfs, and reject the writable mount
>> if it is not, as a device with just a 64k zone append limit or smaller
>> would be really horrible to use anyway.
> 
> The rejection method sounds very reasonable to me.
> 
> Although I'd prefer some feedback from Johannes on the real world 
> values, and if possible let the zoned guys submit the patch.
> 

Sure, rejecting sub-64k ZA limits sounds reasonable to me as well.
I can craft a patch if you want.

> I haven't yet built my VMs to run ZNS tests.
> 
> Another thing is, if we want to implement the rejection before the 
> reworked scrub, the threshold would be 128K, as that's the bio size 
> limit for the old scrub code.
> 
> Thanks,
> Qu
>

Patch

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index bdef346c542c..34902d58bb4b 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -754,6 +754,98 @@  void btrfs_submit_scrub_read(struct btrfs_bio *bbio, int mirror_num)
 	btrfs_bio_end_io(bbio, errno_to_blk_status(ret));
 }
 
+/*
+ * Scrub write special version. Some extra limits:
+ *
+ * - Only supports write back for dev-replace and scrub read-repair.
+ *   This means the write bio, even for RAID56, is only
+ *   mapped to a single device.
+ *
+ * - @mirror_num must be >0.
+ *   It indicates which mirror to write to.
+ *   For RAID56, it must be 1 (the data stripes).
+ *
+ * - The @bbio must not cross stripe boundary.
+ *
+ * - If @dev_replace is true, the resulting stripe must map to the
+ *   replace source device.
+ *
+ * - No csum generation.
+ */
+void btrfs_submit_scrub_write(struct btrfs_bio *bbio, int mirror_num,
+			      bool dev_replace)
+{
+	struct btrfs_fs_info *fs_info = bbio->fs_info;
+	u64 logical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
+	u64 length = bbio->bio.bi_iter.bi_size;
+	u64 map_length = length;
+	struct btrfs_io_context *bioc = NULL;
+	struct btrfs_io_stripe smap;
+	int ret;
+
+	ASSERT(fs_info);
+	ASSERT(mirror_num > 0);
+	ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE);
+	ASSERT(!bbio->inode);
+
+	btrfs_bio_counter_inc_blocked(fs_info);
+	ret = __btrfs_map_block(fs_info, btrfs_op(&bbio->bio), logical,
+				&map_length, &bioc, &smap, &mirror_num, 1);
+	if (ret)
+		goto fail;
+
+	/* Caller should ensure the @bbio doesn't cross stripe boundary. */
+	ASSERT(map_length >= length);
+	if (btrfs_op(&bbio->bio) == BTRFS_MAP_WRITE && btrfs_is_zoned(fs_info)) {
+		bbio->bio.bi_opf &= ~REQ_OP_WRITE;
+		bbio->bio.bi_opf |= REQ_OP_ZONE_APPEND;
+	}
+
+	if (!bioc)
+		goto submit;
+
+	/* Map the RAID56 multi-stripe writes to a single one. */
+	if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
+		int data_stripes = (bioc->map_type & BTRFS_BLOCK_GROUP_RAID5) ?
+				    bioc->num_stripes - 1 : bioc->num_stripes - 2;
+		int i;
+
+		/* This special write only works for data stripes. */
+		ASSERT(mirror_num == 1);
+		for (i = 0; i < data_stripes; i++) {
+			u64 stripe_start = bioc->full_stripe_logical +
+					   (i << BTRFS_STRIPE_LEN_SHIFT);
+
+			if (logical >= stripe_start &&
+			    logical < stripe_start + BTRFS_STRIPE_LEN)
+				break;
+		}
+		ASSERT(i < data_stripes);
+		smap.dev = bioc->stripes[i].dev;
+		smap.physical = bioc->stripes[i].physical +
+				((logical - bioc->full_stripe_logical) &
+				 BTRFS_STRIPE_LEN_MASK);
+		goto submit;
+	}
+	ASSERT(mirror_num <= bioc->num_stripes);
+	smap.dev = bioc->stripes[mirror_num - 1].dev;
+	smap.physical = bioc->stripes[mirror_num - 1].physical;
+submit:
+	ASSERT(smap.dev);
+	btrfs_put_bioc(bioc);
+	bioc = NULL;
+	if (dev_replace) {
+		ASSERT(smap.dev == fs_info->dev_replace.srcdev);
+		smap.dev = fs_info->dev_replace.tgtdev;
+	}
+	__btrfs_submit_bio(&bbio->bio, bioc, &smap, mirror_num);
+	return;
+
+fail:
+	btrfs_bio_counter_dec(fs_info);
+	btrfs_bio_end_io(bbio, errno_to_blk_status(ret));
+}
+
 void btrfs_submit_bio(struct btrfs_bio *bbio, int mirror_num)
 {
 	while (!btrfs_submit_chunk(bbio, mirror_num))
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index afbcf318fdda..ad5a6a558662 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -107,6 +107,8 @@  static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
 
 void btrfs_submit_bio(struct btrfs_bio *bbio, int mirror_num);
 void btrfs_submit_scrub_read(struct btrfs_bio *bbio, int mirror_num);
+void btrfs_submit_scrub_write(struct btrfs_bio *bbio, int mirror_num,
+			      bool dev_replace);
 int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 			    u64 length, u64 logical, struct page *page,
 			    unsigned int pg_offset, int mirror_num);