diff mbox series

[v5,03/13] btrfs: introduce a new helper to submit read bio for scrub

Message ID 79a6604bc9ccb2a6e1355f9d897b45943c6bcca9.1679959770.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: scrub: use a more reader friendly code to implement scrub_simple_mirror() | expand

Commit Message

Qu Wenruo March 27, 2023, 11:30 p.m. UTC
The new helper, btrfs_submit_scrub_read(), would be mostly a subset of
btrfs_submit_bio(), with the following limitations:

- Only supports read
- @mirror_num must be > 0
- No read-time repair nor checksum verification
- The @bbio must not cross stripe boundary

This would provide the basis for unified read repair for scrub, as we no
longer needs to handle RAID56 recovery all by scrub, and RAID56 data
stripes scrub can share the same code of read and repair.

The repair part would be the same as non-RAID56, as we only need to try
the next mirror.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/bio.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 fs/btrfs/bio.h |  1 +
 2 files changed, 45 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig March 27, 2023, 11:35 p.m. UTC | #1
On Tue, Mar 28, 2023 at 07:30:53AM +0800, Qu Wenruo wrote:
> The new helper, btrfs_submit_scrub_read(), would be mostly a subset of
> btrfs_submit_bio(), with the following limitations:
> 
> - Only supports read
> - @mirror_num must be > 0
> - No read-time repair nor checksum verification
> - The @bbio must not cross stripe boundary
> 
> This would provide the basis for unified read repair for scrub, as we no
> longer needs to handle RAID56 recovery all by scrub, and RAID56 data
> stripes scrub can share the same code of read and repair.
> 
> The repair part would be the same as non-RAID56, as we only need to try
> the next mirror.

Stupid question: what do we actually still need this and the write helper
for now that I think the generic helpers should work just fine without
bbio->inode?
Qu Wenruo March 28, 2023, 12:16 a.m. UTC | #2
On 2023/3/28 07:35, Christoph Hellwig wrote:
> On Tue, Mar 28, 2023 at 07:30:53AM +0800, Qu Wenruo wrote:
>> The new helper, btrfs_submit_scrub_read(), would be mostly a subset of
>> btrfs_submit_bio(), with the following limitations:
>>
>> - Only supports read
>> - @mirror_num must be > 0
>> - No read-time repair nor checksum verification
>> - The @bbio must not cross stripe boundary
>>
>> This would provide the basis for unified read repair for scrub, as we no
>> longer needs to handle RAID56 recovery all by scrub, and RAID56 data
>> stripes scrub can share the same code of read and repair.
>>
>> The repair part would be the same as non-RAID56, as we only need to try
>> the next mirror.
> 
> Stupid question: what do we actually still need this and the write helper
> for now that I think the generic helpers should work just fine without
> bbio->inode?

For read part, as long as we skip all the csum verification and csum 
search, it can go the generic helper.
Although this may needs some coordinate during merge.

But in the future, mostly for the ZNS RST patches, we need special 
handling for the garbage range.

Johannes is already testing his RST branches upon this series, and he is 
hitting the ASSERT() on the mapped length due to RST changes.
In that case, we need to skip certain ranges, and that's specific to 
scrub usage.

Thus for read part, I'd still prefer a scrub specific helper, for the 
incoming RST expansion.


For the write helper, it's very different.

The btrfs_submit_bio() would duplicate the writes to all mirrors, which 
is exactly what I want to avoid for scrub usage.
Thus the scrub specific write helper would always map the write to a 
single stripe, even for RAID56 writes.

So the write helper must stay.

Thanks,
Qu
Christoph Hellwig March 28, 2023, 12:25 a.m. UTC | #3
On Tue, Mar 28, 2023 at 08:16:38AM +0800, Qu Wenruo wrote:
> For read part, as long as we skip all the csum verification and csum search,
> it can go the generic helper.

Yes, and we can just check for bbio->inode to decide that, like you did
in the earliest versions of the series.

> But in the future, mostly for the ZNS RST patches, we need special handling
> for the garbage range.
> 
> Johannes is already testing his RST branches upon this series, and he is
> hitting the ASSERT() on the mapped length due to RST changes.
> In that case, we need to skip certain ranges, and that's specific to scrub
> usage.

Can you explain what the issue is here?

> The btrfs_submit_bio() would duplicate the writes to all mirrors, which is
> exactly what I want to avoid for scrub usage.
> Thus the scrub specific write helper would always map the write to a single
> stripe, even for RAID56 writes.

That's basically what btrfs_repair_io_failure does.  Any chance we
could share the code for those instead of going to back to creating
multiple code paths for I/O submission?
Qu Wenruo March 28, 2023, 12:48 a.m. UTC | #4
On 2023/3/28 08:25, Christoph Hellwig wrote:
> On Tue, Mar 28, 2023 at 08:16:38AM +0800, Qu Wenruo wrote:
>> For read part, as long as we skip all the csum verification and csum search,
>> it can go the generic helper.
> 
> Yes, and we can just check for bbio->inode to decide that, like you did
> in the earliest versions of the series.
> 
>> But in the future, mostly for the ZNS RST patches, we need special handling
>> for the garbage range.
>>
>> Johannes is already testing his RST branches upon this series, and he is
>> hitting the ASSERT() on the mapped length due to RST changes.
>> In that case, we need to skip certain ranges, and that's specific to scrub
>> usage.
> 
> Can you explain what the issue is here?

The new RST would add a new layer between physical address and logical 
address.

Thus every read would do an extra lookup (almost at extent level), 
that's done in btrfs_map_block().

However the new scrub code introduced a new behavior, which is never 
really utilized by any other code path.
Now scrub would read the full 64K in one go, even this means we can read 
out sectors not utilized by any extent.

The new behavior itself reduces IOPS, thus it should be an overall win.
And for regular devices or non-RST zoned devices, we just read out some 
garbage, and scrub verification code would skip them.

But for RST cases, there would be no mapping for the garbage range, thus 
we need to skip those garbage for RST cases.

The dedicated read helper would provide the location for us to handle 
this RST scrub specific case.

> 
>> The btrfs_submit_bio() would duplicate the writes to all mirrors, which is
>> exactly what I want to avoid for scrub usage.
>> Thus the scrub specific write helper would always map the write to a single
>> stripe, even for RAID56 writes.
> 
> That's basically what btrfs_repair_io_failure does.  Any chance we
> could share the code for those instead of going to back to creating
> multiple code paths for I/O submission?

That sounds like a pretty good candidate to merge.

The core shares the same idea, only the support code is different.

I'm fine merging the core logic, although I believe we still need 
different entrance functions.
(e.g repair is only done for one sector, requires inode/file offset, and 
is done synchronously.)

Thanks,
Qu
Christoph Hellwig March 28, 2023, 12:53 a.m. UTC | #5
On Tue, Mar 28, 2023 at 08:48:29AM +0800, Qu Wenruo wrote:
> The new behavior itself reduces IOPS, thus it should be an overall win.
> And for regular devices or non-RST zoned devices, we just read out some
> garbage, and scrub verification code would skip them.
> 
> But for RST cases, there would be no mapping for the garbage range, thus we
> need to skip those garbage for RST cases.
> 
> The dedicated read helper would provide the location for us to handle this
> RST scrub specific case.

Well, if did just call btrfs_submit_chunk, the RST lookup would ensure
you only get the length of the RST mapping, and you get the behavior
you want without duplication.  We'd need to make it non-static (and
document it), but I'd still be much happier about that than yet another
I/O submission interface.

> 
> The core shares the same idea, only the support code is different.
> 
> I'm fine merging the core logic, although I believe we still need different
> entrance functions.
> (e.g repair is only done for one sector, requires inode/file offset, and is
> done synchronously.)

Well, the inode number and start is only used for a debug printk,
so we can easily move that into the caller.  As for the single sectors
vs not, what we really should do is to move the bio allocation
to the caller, which means the caller cna decided how large the bio
is.  Together with moving the printk that also makes the interface
much simpler as we'll need a lot less arguments.
Qu Wenruo March 28, 2023, 1:01 a.m. UTC | #6
On 2023/3/28 08:53, Christoph Hellwig wrote:
> On Tue, Mar 28, 2023 at 08:48:29AM +0800, Qu Wenruo wrote:
>> The new behavior itself reduces IOPS, thus it should be an overall win.
>> And for regular devices or non-RST zoned devices, we just read out some
>> garbage, and scrub verification code would skip them.
>>
>> But for RST cases, there would be no mapping for the garbage range, thus we
>> need to skip those garbage for RST cases.
>>
>> The dedicated read helper would provide the location for us to handle this
>> RST scrub specific case.
> 
> Well, if did just call btrfs_submit_chunk, the RST lookup would ensure
> you only get the length of the RST mapping, and you get the behavior
> you want without duplication.  We'd need to make it non-static (and
> document it), but I'd still be much happier about that than yet another
> I/O submission interface.

But in that case, wouldn't we just error out?

As if RST can not find the mapping, it can error out.

The behavior of handling missing RST mapping is different between the 
regular read path and scrub read path.

For regular path, if we're reading some range without an RST map, we 
should error out.

While for scrub path, we should skip (or if you want, we can also read 
some garbage) the unmapped range.
And for non-RST cases, we should not hit stripe boundary (although this 
is just a debugging check).

> 
>>
>> The core shares the same idea, only the support code is different.
>>
>> I'm fine merging the core logic, although I believe we still need different
>> entrance functions.
>> (e.g repair is only done for one sector, requires inode/file offset, and is
>> done synchronously.)
> 
> Well, the inode number and start is only used for a debug printk,
> so we can easily move that into the caller.  As for the single sectors
> vs not, what we really should do is to move the bio allocation
> to the caller, which means the caller cna decided how large the bio
> is.  Together with moving the printk that also makes the interface
> much simpler as we'll need a lot less arguments.

Yep, I'm totally fine with a helper to do the core single-dev mapping 
helper.
Just pass logical + mirror as parameters to the helper, and returns a 
smap or dev + physical, and let the caller do the submission.

Thanks,
Qu
Christoph Hellwig March 28, 2023, 1:04 a.m. UTC | #7
On Tue, Mar 28, 2023 at 09:01:57AM +0800, Qu Wenruo wrote:
> > Well, if did just call btrfs_submit_chunk, the RST lookup would ensure
> > you only get the length of the RST mapping, and you get the behavior
> > you want without duplication.  We'd need to make it non-static (and
> > document it), but I'd still be much happier about that than yet another
> > I/O submission interface.
> 
> But in that case, wouldn't we just error out?

btrfs_submit_chunk is greedy.  But yes, if there was no mapping at all
I guess it would error out and we'd need to find a way to handle that.
But that still seems better than duplicating the logic again.
Qu Wenruo March 28, 2023, 1:10 a.m. UTC | #8
On 2023/3/28 09:04, Christoph Hellwig wrote:
> On Tue, Mar 28, 2023 at 09:01:57AM +0800, Qu Wenruo wrote:
>>> Well, if did just call btrfs_submit_chunk, the RST lookup would ensure
>>> you only get the length of the RST mapping, and you get the behavior
>>> you want without duplication.  We'd need to make it non-static (and
>>> document it), but I'd still be much happier about that than yet another
>>> I/O submission interface.
>>
>> But in that case, wouldn't we just error out?
> 
> btrfs_submit_chunk is greedy.  But yes, if there was no mapping at all
> I guess it would error out and we'd need to find a way to handle that.
> But that still seems better than duplicating the logic again.

Well, to me the whole btrfs_submit_chunk() is already super modular 
(great work from you, very appreciated), thus new scrub_read path only 
goes less than 30 lines.

At least to me, a few more lines for a special path seems worthy.
Especially when that scrub path should never be utilized by anyone else.

But I'm still open minded to merge if proven the change is small enough.
Let us see what we can do after the next update of RST patchset and 
determine then.

Thanks,
Qu
diff mbox series

Patch

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index c1edadc17260..bdef346c542c 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -333,8 +333,8 @@  static void btrfs_end_bio_work(struct work_struct *work)
 {
 	struct btrfs_bio *bbio = container_of(work, struct btrfs_bio, end_io_work);
 
-	/* Metadata reads are checked and repaired by the submitter. */
-	if (bbio->bio.bi_opf & REQ_META)
+	/* Metadata or scrub reads are checked and repaired by the submitter. */
+	if (bbio->bio.bi_opf & REQ_META || !bbio->inode)
 		bbio->end_io(bbio);
 	else
 		btrfs_check_read_bio(bbio, bbio->bio.bi_private);
@@ -344,7 +344,7 @@  static void btrfs_simple_end_io(struct bio *bio)
 {
 	struct btrfs_bio *bbio = btrfs_bio(bio);
 	struct btrfs_device *dev = bio->bi_private;
-	struct btrfs_fs_info *fs_info = bbio->inode->root->fs_info;
+	struct btrfs_fs_info *fs_info = bbio->fs_info;
 
 	btrfs_bio_counter_dec(fs_info);
 
@@ -368,7 +368,8 @@  static void btrfs_raid56_end_io(struct bio *bio)
 
 	btrfs_bio_counter_dec(bioc->fs_info);
 	bbio->mirror_num = bioc->mirror_num;
-	if (bio_op(bio) == REQ_OP_READ && !(bbio->bio.bi_opf & REQ_META))
+	if (bio_op(bio) == REQ_OP_READ && bbio->inode &&
+	    !(bbio->bio.bi_opf & REQ_META))
 		btrfs_check_read_bio(bbio, NULL);
 	else
 		btrfs_orig_bbio_end_io(bbio);
@@ -714,6 +715,45 @@  static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
 	return true;
 }
 
+/*
+ * Scrub read special version, with extra limits:
+ *
+ * - Only support read for scrub usage
+ * - @mirror_num must be >0
+ * - No read-time repair nor checksum verification.
+ * - The @bbio must not cross stripe boundary.
+ */
+void btrfs_submit_scrub_read(struct btrfs_bio *bbio, int mirror_num)
+{
+	struct btrfs_fs_info *fs_info = bbio->fs_info;
+	u64 logical = bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT;
+	u64 length = bbio->bio.bi_iter.bi_size;
+	u64 map_length = length;
+	struct btrfs_io_context *bioc = NULL;
+	struct btrfs_io_stripe smap;
+	int ret;
+
+	ASSERT(fs_info);
+	ASSERT(mirror_num > 0);
+	ASSERT(btrfs_op(&bbio->bio) == BTRFS_MAP_READ);
+	ASSERT(!bbio->inode);
+
+	btrfs_bio_counter_inc_blocked(fs_info);
+	ret = __btrfs_map_block(fs_info, btrfs_op(&bbio->bio), logical,
+				&map_length, &bioc, &smap, &mirror_num, 1);
+	if (ret)
+		goto fail;
+
+	/* Caller should ensure the @bbio doesn't cross stripe boundary. */
+	ASSERT(map_length >= length);
+	__btrfs_submit_bio(&bbio->bio, bioc, &smap, mirror_num);
+	return;
+
+fail:
+	btrfs_bio_counter_dec(fs_info);
+	btrfs_bio_end_io(bbio, errno_to_blk_status(ret));
+}
+
 void btrfs_submit_bio(struct btrfs_bio *bbio, int mirror_num)
 {
 	while (!btrfs_submit_chunk(bbio, mirror_num))
diff --git a/fs/btrfs/bio.h b/fs/btrfs/bio.h
index 3b97ce54140a..afbcf318fdda 100644
--- a/fs/btrfs/bio.h
+++ b/fs/btrfs/bio.h
@@ -106,6 +106,7 @@  static inline void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
 #define REQ_BTRFS_ONE_ORDERED			REQ_DRV
 
 void btrfs_submit_bio(struct btrfs_bio *bbio, int mirror_num);
+void btrfs_submit_scrub_read(struct btrfs_bio *bbio, int mirror_num);
 int btrfs_repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 			    u64 length, u64 logical, struct page *page,
 			    unsigned int pg_offset, int mirror_num);