
[8/8] btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio

Message ID 20220522114754.173685-9-hch@lst.de (mailing list archive)
State New, archived
Series [1/8] btrfs: quit early if the fs has no RAID56 support for raid56 related checks

Commit Message

Christoph Hellwig May 22, 2022, 11:47 a.m. UTC
Use the new btrfs_bio_for_each_sector iterator to simplify
btrfs_check_read_dio_bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/btrfs/inode.c | 56 +++++++++++++++++++-----------------------------
 1 file changed, 22 insertions(+), 34 deletions(-)
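
The per-sector iterator used here is introduced earlier in this series.  As a
rough approximation only (not the actual helper definition, which may differ
in name and detail), it behaves like a wrapper around the block layer's bvec
iteration that steps one sectorsize at a time:

/*
 * Sketch of the idea behind btrfs_bio_for_each_sector(): walk the bio
 * using a local copy of the saved iterator, advancing one sector at a
 * time.  Only bv.bv_page and bv.bv_offset are meant to be used per
 * step; bv.bv_len may cover more than one sector.
 */
#define btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, bio_offset)	\
	for ((iter) = (bbio)->iter, (bio_offset) = 0;			\
	     (iter).bi_size &&						\
	     (((bv) = bio_iter_iovec(&(bbio)->bio, (iter))), 1);	\
	     (bio_offset) += (fs_info)->sectorsize,			\
	     bio_advance_iter_single(&(bbio)->bio, &(iter),		\
				     (fs_info)->sectorsize))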

Comments

Qu Wenruo May 22, 2022, 12:21 p.m. UTC | #1
On 2022/5/22 19:47, Christoph Hellwig wrote:
> Use the new btrfs_bio_for_each_sector iterator to simplify
> btrfs_check_read_dio_bio.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Qu Wenruo <wqu@suse.com>

In fact, in my version I also want to convert the buffered read endio to
use the helper, and get rid of the error bitmap thing.

But you're so fast sending out the patchset...

Thanks,
Qu
> ---
>   fs/btrfs/inode.c | 56 +++++++++++++++++++-----------------------------
>   1 file changed, 22 insertions(+), 34 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 095632977a798..34466b543ed97 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7886,47 +7886,35 @@ static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
>   {
>   	struct inode *inode = dip->inode;
>   	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> -	const u32 sectorsize = fs_info->sectorsize;
>   	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
>   	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
>   	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
> -	struct bio_vec bvec;
> -	struct bvec_iter iter;
> -	u32 bio_offset = 0;
>   	blk_status_t err = BLK_STS_OK;
> +	struct bvec_iter iter;
> +	struct bio_vec bv;
> +	u32 offset;
> +
> +	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
> +		u64 start = bbio->file_offset + offset;
> +
> +		if (uptodate &&
> +		    (!csum || !check_data_csum(inode, bbio, offset, bv.bv_page,
> +				bv.bv_offset, start))) {
> +			clean_io_failure(fs_info, failure_tree, io_tree, start,
> +					 bv.bv_page, btrfs_ino(BTRFS_I(inode)),
> +					 bv.bv_offset);
> +		} else {
> +			int ret;
>
> -	__bio_for_each_segment(bvec, &bbio->bio, iter, bbio->iter) {
> -		unsigned int i, nr_sectors, pgoff;
> -
> -		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec.bv_len);
> -		pgoff = bvec.bv_offset;
> -		for (i = 0; i < nr_sectors; i++) {
> -			u64 start = bbio->file_offset + bio_offset;
> -
> -			ASSERT(pgoff < PAGE_SIZE);
> -			if (uptodate &&
> -			    (!csum || !check_data_csum(inode, bbio,
> -						       bio_offset, bvec.bv_page,
> -						       pgoff, start))) {
> -				clean_io_failure(fs_info, failure_tree, io_tree,
> -						 start, bvec.bv_page,
> -						 btrfs_ino(BTRFS_I(inode)),
> -						 pgoff);
> -			} else {
> -				int ret;
> -
> -				ret = btrfs_repair_one_sector(inode, &bbio->bio,
> -						bio_offset, bvec.bv_page, pgoff,
> -						start, bbio->mirror_num,
> -						submit_dio_repair_bio);
> -				if (ret)
> -					err = errno_to_blk_status(ret);
> -			}
> -			ASSERT(bio_offset + sectorsize > bio_offset);
> -			bio_offset += sectorsize;
> -			pgoff += sectorsize;
> +			ret = btrfs_repair_one_sector(inode, &bbio->bio, offset,
> +					bv.bv_page, bv.bv_offset, start,
> +					bbio->mirror_num,
> +					submit_dio_repair_bio);
> +			if (ret)
> +				err = errno_to_blk_status(ret);
>   		}
>   	}
> +
>   	return err;
>   }
>
Christoph Hellwig May 22, 2022, 12:31 p.m. UTC | #2
On Sun, May 22, 2022 at 08:21:47PM +0800, Qu Wenruo wrote:
> In fact, in my version I also want to convert the buffered read endio to
> use the helper, and get rid of the error bitmap thing.

Yes, the buffered end I/O path could use some more love.  I also wonder
if splitting the data vs metadata case might be a good idea as well.  But
maybe we can finish the read repair series first and do that next?
'cause now a lot of the bio work will depend on the read repair, and
I don't want to block it on yet another series...
Qu Wenruo May 22, 2022, 12:38 p.m. UTC | #3
On 2022/5/22 20:31, Christoph Hellwig wrote:
> On Sun, May 22, 2022 at 08:21:47PM +0800, Qu Wenruo wrote:
>> In fact, in my version I also want to convert the buffered read endio to
>> use the helper, and get rid of the error bitmap thing.
>
> Yes, the buffered end I/O path could use some more love.  I also wonder
> if splitting the data vs metadata case might be good idea as well.  But
> maybe we can finish the read repair series first and do that next?

OK, that makes sense.

> 'cause now a lot of the bio works will depend on the read repair, and
> I don't want to block it on yet another series..

Although I believe we will have to spend more time on the read repair
code/functionality.

Especially since all our submitted versions have their own problems.

These range from the basic handling of checker-pattern corruption to the
trade-off between memory allocation and performance for reads of
corrupted data.

Thanks,
Qu
Christoph Hellwig May 22, 2022, 12:53 p.m. UTC | #4
On Sun, May 22, 2022 at 08:38:50PM +0800, Qu Wenruo wrote:
>> 'cause now a lot of the bio works will depend on the read repair, and
>> I don't want to block it on yet another series..
>
> Although I believe we will have to take more time on the read repair
> code/functionality.
>
> Especially all our submitted version have their own problems.
>
> From the basic handling of checker pattern corruption, to the trade-off
> between memory allocation and performance for read on corrupted data.

I've already pushed out a new version here:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-read_repair

so feel free to take a look.  I just don't want to spam the list with it
quite yet while this series is still outstanding.
Qu Wenruo May 23, 2022, 12:07 a.m. UTC | #5
On 2022/5/22 20:53, Christoph Hellwig wrote:
> On Sun, May 22, 2022 at 08:38:50PM +0800, Qu Wenruo wrote:
>>> 'cause now a lot of the bio works will depend on the read repair, and
>>> I don't want to block it on yet another series..
>>
>> Although I believe we will have to take more time on the read repair
>> code/functionality.
>>
>> Especially all our submitted version have their own problems.
>>
>>  From the basic handling of checker pattern corruption, to the trade-off
>> between memory allocation and performance for read on corrupted data.
>
> I've already pushed out a new version here:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/btrfs-read_repair
>
> so feel free to take a look.  I just don't want to spam the list with it
> quite yet with this series outstanding.

I checked the code, but I still find the code in the patch "btrfs: add new
read repair infrastructure" not that intuitive.

- Why do we bother with different repair methods in btrfs_repair_one_mirror()?
   In fact btrfs_repair_io_failure() can handle all profiles.

   Then why do we go back to writing the whole bio?
   The only reason I can think of is, we're still trying to do some
   "optimization".

   But all our bio submission is already synchronous, so I doubt such an
   "optimization" would make much difference.

- The bio truncation
   This really looks like a bandage to address the checker-pattern
   corruption.
   I wonder why we don't just do per-sector read/write like:

+	/* Init the read bio to contain that corrupted sector only. */
+	bool found_good = false;
+	int i;
+
+	for (i = get_next_mirror(init_mirror, num_copies); i != init_mirror;
+	     i = get_next_mirror(i, num_copies)) {
+		int ret;
+
+		ret = btrfs_map_bio_wait(inode, read_bio, i);
+		/* Failed to submit or read, try the next mirror. */
+		if (ret < 0)
+			continue;
+
+		/* Verify the checksum of the re-read sector. */
+		if (failed_bbio->csum)
+			ret = btrfs_check_data_sector(fs_info, page,
+					pgoff, repair_bbio->csum);
+		if (ret == 0) {
+			/* Good copy found, write it back to the bad mirror. */
+			found_good = true;
+			btrfs_repair_io_failure();
+			break;
+		}
+	}
+	if (!found_good)
+		return -EIO;

To me, the "optimization" of batched read/write is only relevant if we
have tons of corrupted sectors in a read bio, which I don't believe is a
hot path in the real world anyway.

Thanks,
Qu
Christoph Hellwig May 23, 2022, 6:26 a.m. UTC | #6
On Mon, May 23, 2022 at 08:07:01AM +0800, Qu Wenruo wrote:
>
> I checked the code, but still find the code in patch "btrfs: add new
> read repair infrastructure" not that instinctive.
>
> - Why we bother different repair methods in btrfs_repair_one_mirror()?
>   In fact btrfs_repair_io_failure() can handle all profiles.

Because btrfs_repair_io_failure can't handle multiple-page I/O.  It
is also rather cumbersome because it bypasses the normal bio
mapping.  As a follow-on I'd rather move it over to btrfs_map_bio
with a special flag for the single-mirror parity write rather than that
hack.

>   Then why we go back to write the whole bio?

Because the whole bio at this point is all the bad sectors.  There
is no point in writing only parts of the bio because that would leave
corruption on disk.

>   The only reason I can think of is, we're still trying to do some
>   "optimization".
>
>   But all our bio submission is already synchronous, I doubt such
>   "optimization" would make much difference.

Can you explain what you mean here?

>
> - The bio truncation
>   This really looks like a bandage to address the checker pattern
>   corruption.
>   I doubt why not just do per-sector read/write like:

Because now you wait for each I/O.

> To me, the "optimization" of batched read/write is only relevant if we
> have tons of corrupted sectors in a read bio, which I don't believe is a
> hot path in real world anyway.

It is very easy low-hanging fruit, and actually pretty common for
actual failures of components.  In other words: single-sector failures
happen sometimes, usually due to corruption on the transfer wire.  If
actual corruption happens on the drive it will usually fail quite
big areas.  These are rather rare these days as a lot of devices shut
down before returning bad data, but if it happens you'll rarely see
just a single sector.
Qu Wenruo May 23, 2022, 7:46 a.m. UTC | #7
On 2022/5/23 14:26, Christoph Hellwig wrote:
> On Mon, May 23, 2022 at 08:07:01AM +0800, Qu Wenruo wrote:
>>
>> I checked the code, but still find the code in patch "btrfs: add new
>> read repair infrastructure" not that instinctive.
>>
>> - Why we bother different repair methods in btrfs_repair_one_mirror()?
>>    In fact btrfs_repair_io_failure() can handle all profiles.
>
> Becasue btrfs_repair_io_failure can't handle multiple-page I/O.  It
> is also is rather cumbersome because it bypassed the normal bio
> mapping.  As a follow on I'd rather move it over to btrfs_map_bio
> with a special flag for the single mirror parity write rather than that
> hack.

In fact so far for all callers of btrfs_repair_io_failure(), we are
always handling things inside one stripe.

Thus we can easily enhance that function to handle multi-page ranges.

Although a dedicated btrfs_map_bio() flag seems more generic and better.
>
>>    Then why we go back to write the whole bio?
>
> Because the whole bio at this point is all the bad sectors.  There
> is no point in writing only parts of the bio because that would leave
> corruption on disk.
>
>>    The only reason I can think of is, we're still trying to do some
>>    "optimization".
>>
>>    But all our bio submission is already synchronous, I doubt such
>>    "optimization" would make much difference.
>
> Can you explain what you mean here?

We wait for the read bio anyway, so I doubt the batched write part is that
important.

If you really want, I can try to make the write part asynchronous, while
still keeping the read part synchronous and easier to read.

>
>>
>> - The bio truncation
>>    This really looks like a bandage to address the checker pattern
>>    corruption.
>>    I doubt why not just do per-sector read/write like:
>
> Because now you wait for each I/O.

Yeah, that's the biggest performance drop here, no doubt about that.

>
>> To me, the "optimization" of batched read/write is only relevant if we
>> have tons of corrupted sectors in a read bio, which I don't believe is a
>> hot path in real world anyway.
>
> It is is very easy low hanging fruit,

Unfortunately, this is not that simple a fruit, or I wouldn't have gone
the full bitmap way.

In your current version, the do {} while() loop iterates through all
mirrors.

But for the following case, we will hit problems thanks to RAID1C3 again:

Mirror 1 	|X|X|X|X|
Mirror 2	|X| |X| |
Mirror 3	| |X| |X|

We hit mirror 1 initially, thus @initial_mirror is 1.

Then when we try mirror 2, since the first sector is still bad, we jump
to the next mirror.

For mirror 3, we fix the first sector only. The 2nd sector is then still
read from mirror 3 and doesn't pass.
Now we have no more mirrors, and we still return -EIO.

What saves our day is the VFS read retry, which will retry reading the
range sector by sector, and gets every sector fixed.


Unfortunately we waste a lot of code hacking the bio size, but it still
doesn't work, and falls back to exactly the same sector-by-sector wait
behavior.

So my points still stand: if we want to do batched handling, either we
go bitmap or we give up.

Such a hacky bandage seems to work at first glance and will pass your new
test cases, but it doesn't do any better than sector-by-sector waiting.
(Forgot to mention: the new RAID1C3 test case may also be flawed, as any
read on other mirrors will cause read-repair, screwing up our later
retry; thus we must check the pid first before doing any read.)

Thus it's a low-hanging but rotten fruit.

Thanks,
Qu

> and actually pretty common for
> actual failures of components.  In other words:  single sector failures
> happen some times, usually due to corruption on the transfer wire.  If
> actual corruption happens on the driver it will usually fail quite
> bit areas.  These are rather rare these days as a lot of device shut
> down before returning bad data, but if it happens you'll rarely see
> just a single sector.
Christoph Hellwig May 24, 2022, 7:32 a.m. UTC | #8
On Mon, May 23, 2022 at 03:46:02PM +0800, Qu Wenruo wrote:
>> Becasue btrfs_repair_io_failure can't handle multiple-page I/O.  It
>> is also is rather cumbersome because it bypassed the normal bio
>> mapping.  As a follow on I'd rather move it over to btrfs_map_bio
>> with a special flag for the single mirror parity write rather than that
>> hack.
>
> In fact so far for all callers of btrfs_repair_io_failure(), we are
> always handling things inside one stripe.
>
> Thus we can easily enhance that function to handle multi page ranges.
>
> Although a dedicated btrfs_map_bio() flags seems more generic and better.

I did think of moving btrfs_repair_io_failure over to my new
infrastructure in fact, because it seems inherently possible.  Just
not the highest priority right now.

>> Because the whole bio at this point is all the bad sectors.  There
>> is no point in writing only parts of the bio because that would leave
>> corruption on disk.
>>
>>>    The only reason I can think of is, we're still trying to do some
>>>    "optimization".
>>>
>>>    But all our bio submission is already synchronous, I doubt such
>>>    "optimization" would make much difference.
>>
>> Can you explain what you mean here?
>
> We wait for the read bio anyway, I doubt the batched write part is that
> important.

I still don't understand the point.  Once we read more than a single
page, writing it back as a batch is completely trivial as shown by
this series.  Why would we not do it?

>
> If you really want, I can try to make the write part asynchronous, while
> still keep the read part synchronous, and easier to read.

Asynchronous writes get us back into all the I/O completion handler
complexities, which was the whole reason to start on the synchronous
repair.

> In your current version, the do {} while() loop iterates through all
> mirrors.
>
> But for the following case, we will hit problems thanks to RAID1C3 again:
>
> Mirror 1 	|X|X|X|X|
> Mirror 2	|X| |X| |
> Mirror 3	| |X| |X|
>
> We hit mirror 1 initially, thus @initial_mirror is 1.
>
> Then when we try mirror 2, since the first sector is still bad, we jump
> to the next mirror.
>
> For mirror 3, we fixed the first sector only. Then 2nd sector is still
> from mirror 3 and didn't pass.
> Now we have no more mirrors, and still return -EIO.

Can you share a test case?  The code resets initial_mirror as soon as
we make any progress, so that should not happen.
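
For reference, the loop shape I mean is roughly the sketch below;
next_mirror(), read_and_verify() and write_back_good_prefix() are made-up
placeholder names, not the code from the branch:

/*
 * Illustrative sketch only: rotate through the mirrors, and as soon as
 * a leading part of the bad range reads back good, write it back,
 * shrink the range and restart the rotation from the mirror that made
 * progress.
 */
int mirror = initial_mirror;

while (bad_len) {
	u32 good;

	mirror = next_mirror(mirror, num_copies);
	if (mirror == initial_mirror)
		return -EIO;	/* every mirror tried for this range */

	/* How much of the start of the range read back good (in bytes)? */
	good = read_and_verify(start, bad_len, mirror);
	if (!good)
		continue;

	/* Write the good copy back over the corrupted mirror(s). */
	write_back_good_prefix(start, good, mirror);

	/* Treat the remainder as a fresh repair attempt. */
	start += good;
	bad_len -= good;
	initial_mirror = mirror;
}
return 0;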

> So my points still stand, if we want to do batched handling, either we
> go bitmap or we give up.

Why?  For the very common case of clustered corruption or entirely
failing reads it is significantly faster than a simple synchronous
read of each sector, and also much better than the existing code.
It is also a lot less code than the existing code base, and (maybe
I'm biased) a lot more readable.

Bitmaps only help you with randomly splattered corruption, which simply
is not how SSDs or hard drives actually fail.

> Such hacky bandage seems to work at first glance and will pass your new
> test cases, but it doesn't do it any better than sector-by-sector waiting.
> (Forgot to mention, the new RAID1C3 test case may also be flawed, as any
> read on other mirrors will cause read-repair, screwing up our later
> retry, thus we must check pid first before doing any read.)

The updated version uses the read-from-mirror loop from btrfs/142,
which cleverly uses bash internals to not issue the read if it would
be done using the wrong mirror.  That also really nicely speeds up
the tests, including the existing btrfs/140 and btrfs/141 ones.
Qu Wenruo May 24, 2022, 8:04 a.m. UTC | #9
On 2022/5/24 15:32, Christoph Hellwig wrote:
> On Mon, May 23, 2022 at 03:46:02PM +0800, Qu Wenruo wrote:
>>> Becasue btrfs_repair_io_failure can't handle multiple-page I/O.  It
>>> is also is rather cumbersome because it bypassed the normal bio
>>> mapping.  As a follow on I'd rather move it over to btrfs_map_bio
>>> with a special flag for the single mirror parity write rather than that
>>> hack.
>>
>> In fact so far for all callers of btrfs_repair_io_failure(), we are
>> always handling things inside one stripe.
>>
>> Thus we can easily enhance that function to handle multi page ranges.
>>
>> Although a dedicated btrfs_map_bio() flags seems more generic and better.
>
> I did think of moving btrfs_repair_io_failure over to my new
> infrastructure in fact, because it seems inherently possible.  Just
> not the highest priority right now.
>
>>> Because the whole bio at this point is all the bad sectors.  There
>>> is no point in writing only parts of the bio because that would leave
>>> corruption on disk.
>>>
>>>>     The only reason I can think of is, we're still trying to do some
>>>>     "optimization".
>>>>
>>>>     But all our bio submission is already synchronous, I doubt such
>>>>     "optimization" would make much difference.
>>>
>>> Can you explain what you mean here?
>>
>> We wait for the read bio anyway, I doubt the batched write part is that
>> important.
>
> I still don't understand the point.  Once we read more than a single
> page, writing it back as a patch is completely trivial as shown by
> this series.  Why would we not do it?
>
>>
>> If you really want, I can try to make the write part asynchronous, while
>> still keep the read part synchronous, and easier to read.
>
> Asynchronous writes gets us back into all the I/O completion handler
> complexities, which was the whole reason to start on the synchronous
> repair.
>
>> In your current version, the do {} while() loop iterates through all
>> mirrors.
>>
>> But for the following case, we will hit problems thanks to RAID1C3 again:
>>
>> Mirror 1 	|X|X|X|X|
>> Mirror 2	|X| |X| |
>> Mirror 3	| |X| |X|
>>
>> We hit mirror 1 initially, thus @initial_mirror is 1.
>>
>> Then when we try mirror 2, since the first sector is still bad, we jump
>> to the next mirror.
>>
>> For mirror 3, we fixed the first sector only. Then 2nd sector is still
>> from mirror 3 and didn't pass.
>> Now we have no more mirrors, and still return -EIO.
>
> Can you share a test case?

Unfortunately no real test case can work here.

The problem is, VFS will try to re-read with a smaller block size.
In that case we fall back to sector-by-sector repair; thus even if we get
some policy terribly wrong, as long as the sector-by-sector behavior is
fine, it will be hidden.

That's why, when I did my bitmap version, I had to add extra trace events
to make sure it was not the retry but really my read-repair code doing
the correct repair, without triggering the re-read.

>  The code resets initial_mirror as soon as
> we made any progress so that should not happen.

Oh, didn't see that part.

And in that case, yes, it will work for the checker pattern, although
not really any more efficiently.

We will do 4 different reads to fix it, no better than sector-by-sector
repair, and worse than the 2 reads of the bitmap version.

But I get your point, it can handle contiguous corruption better than
sector-by-sector, and no worse than the bitmap version in that case.

>
>> So my points still stand, if we want to do batched handling, either we
>> go bitmap or we give up.
>
> Why?  For the very common case of clustered corruption or entirely
> failing reads it is significantly faster than a simple synchronous
> read of each sector, and also much better than the existing code.
> It also is a lot less code than the existing code base, and (maybe
> I'm biassed) a lot more readable.

The problem here is, yes, you can pick the common case, but it comes
with the burden of the worst cases too.

And on the readability part, I strongly doubt it.

Things like resetting initial_mirror make the naming "initial"
meaningless.
And the reset of the length is also very quirky.

Yes, my bitmap version is super complex, no doubt about that, but the idea
and code should be very straightforward.

Just loop over all mirrors until the bad bitmap is all zero. No need to
reset the length or anything halfway; the bitmap and the mirror number are
the only iterators.
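
Roughly the shape I have in mind, as a sketch only (next_mirror(),
read_bad_sectors() and the other helpers are placeholder names, not the
ones from my posted version):

/*
 * Sketch: the bitmap of still-bad sectors and the mirror number are the
 * only loop state; stop when the bitmap is empty or the mirrors are
 * exhausted.
 */
unsigned long bad_bitmap = initial_bad_bitmap;	/* bit set = still bad */
int mirror = failed_mirror;
unsigned int i;

while (bad_bitmap) {
	mirror = next_mirror(mirror, num_copies);
	if (mirror == failed_mirror)
		return -EIO;	/* every mirror tried, sectors still bad */

	/* One batched read covering all still-bad sectors. */
	read_bad_sectors(repair_bio, bad_bitmap, mirror);

	/* Clear the bit of every sector that now passes its csum. */
	for_each_set_bit(i, &bad_bitmap, nr_sectors) {
		if (verify_sector(repair_bio, i) == 0) {
			write_back_sector(repair_bio, i);
			clear_bit(i, &bad_bitmap);
		}
	}
}
return 0;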

And even for the bitmap preallocation failure case, we have VFS to bail
us out.
And code-wise, it's not that much simpler, if you ignore the bitmap
pre-allocation part...

And for ultimate readability, the sector-by-sector method cannot be
beaten.
Thus I'm not a particular fan of any middle ground here.

>
> Bitmaps only help you with randomly splattered corruption, which simply
> is not how SSDs or hard drives actually fail.

But that's the case you have to take into consideration.

Even in cases where a real-world SSD corrupts a big range of data, we
can still submit a read that crosses the corruption boundary.

>
>> Such hacky bandage seems to work at first glance and will pass your new
>> test cases, but it doesn't do it any better than sector-by-sector waiting.
>> (Forgot to mention, the new RAID1C3 test case may also be flawed, as any
>> read on other mirrors will cause read-repair, screwing up our later
>> retry, thus we must check pid first before doing any read.)
>
> The updated version uses the read from mirror loop from btrfs/142
> that cleverly used bash internals to not issue the read if it would
> be done using the wrong mirror.  Which also really nicely speeds up
> the tests including the exist 140 and 141 ones.
>
That's wonderful.

However the smaller problem is still there: we have no way to know whether
it's the read-repair itself doing its part correctly, or the VFS retry
saving our day.

But yeah, the btrfs/142 method is much better and should be backported to
btrfs/140 and btrfs/141.

Thanks,
Qu
Qu Wenruo May 24, 2022, 8:21 a.m. UTC | #10
On 2022/5/24 16:04, Qu Wenruo wrote:
>
>
> On 2022/5/24 15:32, Christoph Hellwig wrote:
>> On Mon, May 23, 2022 at 03:46:02PM +0800, Qu Wenruo wrote:
>>>> Becasue btrfs_repair_io_failure can't handle multiple-page I/O.  It
>>>> is also is rather cumbersome because it bypassed the normal bio
>>>> mapping.  As a follow on I'd rather move it over to btrfs_map_bio
>>>> with a special flag for the single mirror parity write rather than that
>>>> hack.
>>>
>>> In fact so far for all callers of btrfs_repair_io_failure(), we are
>>> always handling things inside one stripe.
>>>
>>> Thus we can easily enhance that function to handle multi page ranges.
>>>
>>> Although a dedicated btrfs_map_bio() flags seems more generic and
>>> better.
>>
>> I did think of moving btrfs_repair_io_failure over to my new
>> infrastructure in fact, because it seems inherently possible.  Just
>> not the highest priority right now.
>>
>>>> Because the whole bio at this point is all the bad sectors.  There
>>>> is no point in writing only parts of the bio because that would leave
>>>> corruption on disk.
>>>>
>>>>>     The only reason I can think of is, we're still trying to do some
>>>>>     "optimization".
>>>>>
>>>>>     But all our bio submission is already synchronous, I doubt such
>>>>>     "optimization" would make much difference.
>>>>
>>>> Can you explain what you mean here?
>>>
>>> We wait for the read bio anyway, I doubt the batched write part is that
>>> important.
>>
>> I still don't understand the point.  Once we read more than a single
>> page, writing it back as a patch is completely trivial as shown by
>> this series.  Why would we not do it?
>>
>>>
>>> If you really want, I can try to make the write part asynchronous, while
>>> still keep the read part synchronous, and easier to read.
>>
>> Asynchronous writes gets us back into all the I/O completion handler
>> complexities, which was the whole reason to start on the synchronous
>> repair.
>>
>>> In your current version, the do {} while() loop iterates through all
>>> mirrors.
>>>
>>> But for the following case, we will hit problems thanks to RAID1C3
>>> again:
>>>
>>> Mirror 1     |X|X|X|X|
>>> Mirror 2    |X| |X| |
>>> Mirror 3    | |X| |X|
>>>
>>> We hit mirror 1 initially, thus @initial_mirror is 1.
>>>
>>> Then when we try mirror 2, since the first sector is still bad, we jump
>>> to the next mirror.
>>>
>>> For mirror 3, we fixed the first sector only. Then 2nd sector is still
>>> from mirror 3 and didn't pass.
>>> Now we have no more mirrors, and still return -EIO.
>>
>> Can you share a test case?
>
> Unfortunately no real test case can work here.
>
> The problem is, VFS will try to re-read with smaller block size.
> In that case, fallback to sector-by-sector repair, thus even if we have
> some policy terribly wrong, as long as the sector-by-sector behavior is
>   fine, it will be hidden.
>
> That's why when I do my bitmap version, I have to add extra trace events
> to make sure it's not the retry, but really my read-repair code doing
> the correct repair, without triggering the re-read.
>
>>  The code resets initial_mirror as soon as
>> we made any progress so that should not happen.
>
> Oh, didn't see that part.
>
> And in that case, yes, it will work for the checker pattern, although
> not really any more efficient.
>
> We will do 4 different reads to fix it, no better than sector-by-sector
> repair.
> And worse than 2 reads from the bitmap version.
>
> But I get your point, it can handle continuous corruption better than
> sector-by-sector, and no worse than bitmap version in that case.
>
>>
>>> So my points still stand, if we want to do batched handling, either we
>>> go bitmap or we give up.
>>
>> Why?  For the very common case of clustered corruption or entirely
>> failing reads it is significantly faster than a simple synchronous
>> read of each sector, and also much better than the existing code.
>> It also is a lot less code than the existing code base, and (maybe
>> I'm biassed) a lot more readable.
>
> The problem here is, yes, you can pick the common case one, but it comes
> with the burden of worst cases too.
>
> And for the readable part, I strongly doubt.
>
> The things like resetting initial_mirror, making the naming "initial"
> meaningless.
> And the reset on the length part is also very quirky.

In fact, if you didn't do the initial_mirror and length change (which is
a big readability disaster, changing the iterator inside a loop, at least
to me) and instead relied on the VFS re-read behavior to fall back to
sector-by-sector reads, I would call that better readability...

Thanks,
Qu

>
> Yes, my bitmap version is super complex, that's no doubt, but the idea
> and code should be very straightforward.
>
> Just loop all mirrors until the bad bitmap is all zero. No need to reset
> the length or whatever halfway, bitmap and mirror number is the only
> iterator.
>
> And even for the bitmap preallocation failure part, we have VFS to bail
> us out.
> And code wise, it's not that simpler, if you ignore the bitmap
> pre-allocation part...
>
> And for the ultimate readability, the sector-by-sector method can not be
> beaten.
> Thus I'm not a particular fan of any middle ground here.
>
>>
>> Bitmaps only help you with randomly splattered corruption, which simply
>> is not how SSDs or hard drives actually fail.
>
> But that's the case you have to take into consideration.
>
> Even for cases where real world SSD to corrupt a big range of data, we
> can still submit a read that crosses the corruption boundary.
>
>>
>>> Such hacky bandage seems to work at first glance and will pass your new
>>> test cases, but it doesn't do it any better than sector-by-sector
>>> waiting.
>>> (Forgot to mention, the new RAID1C3 test case may also be flawed, as any
>>> read on other mirrors will cause read-repair, screwing up our later
>>> retry, thus we must check pid first before doing any read.)
>>
>> The updated version uses the read from mirror loop from btrfs/142
>> that cleverly used bash internals to not issue the read if it would
>> be done using the wrong mirror.  Which also really nicely speeds up
>> the tests including the exist 140 and 141 ones.
>>
> That's wonderful.
>
> However the smaller problem is still there, we have no way to know if
> it's the read-repair itself does its part correctly, or it's VFS retry
> saves our day.
>
> But yeah, btrfs/142 method is much better and should be backported to
> btrfs/140 and btrfs/141.
>
> Thanks,
> Qu
Christoph Hellwig May 24, 2022, 12:08 p.m. UTC | #11
On Tue, May 24, 2022 at 04:21:38PM +0800, Qu Wenruo wrote:
>> The things like resetting initial_mirror, making the naming "initial"
>> meaningless.
>> And the reset on the length part is also very quirky.
>
> In fact, if you didn't do the initial_mirror and length change (which is
> a big disaster of readability, to change iterator in a loop, at least to
> me),

So what is the big problem there?  Do I need more extensive documentation,
or is there anything in this concept that is just too confusing?

> and rely on the VFS re-read behavior to fall back to sector by
> secot read, I would call it better readability...

I don't think relying on undocumented VFS behavior is a good idea.  It
will also not work for direct I/O if any single direct I/O bio ever has
more than a single page, which is something btrfs badly needs if it
wants to get any kind of performance out of direct I/O.
Qu Wenruo May 24, 2022, 1:13 p.m. UTC | #12
On 2022/5/24 20:08, Christoph Hellwig wrote:
> On Tue, May 24, 2022 at 04:21:38PM +0800, Qu Wenruo wrote:
>>> The things like resetting initial_mirror, making the naming "initial"
>>> meaningless.
>>> And the reset on the length part is also very quirky.
>>
>> In fact, if you didn't do the initial_mirror and length change (which is
>> a big disaster of readability, to change iterator in a loop, at least to
>> me),
>
> So what is the big problem there?  Do I need more extensive documentation
> or as there anything in this concept that is just too confusing.

Modifying the iterator inside a loop is the biggest problem to me.

Yes, extra comments can help, but that doesn't help the readability.

And that's why most of the guys here prefer a for () loop when we can.

Thanks,
Qu
>
>> and rely on the VFS re-read behavior to fall back to sector by
>> secot read, I would call it better readability...
>
> I don't think relying on undocumented VFS behavior is a good idea.  It
> will also not work for direct I/O if any single direct I/O bio has
> ever more than a single page, which is something btrfs badly needs
> if it wants to get any kind of performance out of direct I/O.
Qu Wenruo May 24, 2022, 2:02 p.m. UTC | #13
On 2022/5/24 21:13, Qu Wenruo wrote:
>
>
> On 2022/5/24 20:08, Christoph Hellwig wrote:
>> On Tue, May 24, 2022 at 04:21:38PM +0800, Qu Wenruo wrote:
>>>> The things like resetting initial_mirror, making the naming "initial"
>>>> meaningless.
>>>> And the reset on the length part is also very quirky.
>>>
>>> In fact, if you didn't do the initial_mirror and length change (which is
>>> a big disaster of readability, to change iterator in a loop, at least to
>>> me),
>>
>> So what is the big problem there?  Do I need more extensive documentation
>> or as there anything in this concept that is just too confusing.
>
> Modifying the iterator inside a loop is the biggest problem to me.
>
> Yes, extra comments can help, but that doesn't help the readability.
>
> And that's why most of guys here prefer for () loop if we can.

Just allow me to make another attempt at this.

This time, still bitmap based, but with no pre-allocation or runtime
allocation.
Just use an on-stack structure to contain a u32 bitmap.

And we no longer use the bitmap for the whole bio, but only from the
first corrupted sector to the next good sector.

If the bitmap is not large enough to contain the next sector, we just
finish the current batch.

Now we should have all the best things: batched submission, no memory
allocation, and no modification of the iterator inside a loop.

Although the cost is that we are still doing the bitmap way, and we
have more call sites of btrfs_read_repair_finish().
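
Something along these lines, purely as a sketch of the idea (the structure
and helper names are made up, not actual code):

/* Sketch: on-stack batch state, at most 32 sectors, no allocation. */
struct read_repair_batch {
	u64 file_offset;	/* file offset of the first bad sector */
	u32 bad_bitmap;		/* bit i set: sector i of the batch is bad */
	u32 nr_sectors;		/* sectors covered by the batch so far */
};

/* Called for every corrupted sector found while walking the bio. */
static void batch_add_bad_sector(struct read_repair_batch *batch,
				 u64 file_offset, u32 sectorsize)
{
	u32 idx;

	if (!batch->nr_sectors) {
		/* The first bad sector starts a new batch. */
		batch->file_offset = file_offset;
		batch->bad_bitmap = 1;
		batch->nr_sectors = 1;
		return;
	}

	idx = (file_offset - batch->file_offset) / sectorsize;
	if (idx >= BITS_PER_TYPE(u32)) {
		/* Bitmap exhausted: repair this batch and start over. */
		read_repair_finish(batch);
		memset(batch, 0, sizeof(*batch));
		batch_add_bad_sector(batch, file_offset, sectorsize);
		return;
	}
	batch->bad_bitmap |= 1U << idx;
	batch->nr_sectors = idx + 1;
}

/*
 * The caller would also call read_repair_finish() when it walks past a
 * good sector after the batch, or when it reaches the end of the bio.
 */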

Thanks,
Qu
>
> Thanks,
> Qu
>>
>>> and rely on the VFS re-read behavior to fall back to sector by
>>> secot read, I would call it better readability...
>>
>> I don't think relying on undocumented VFS behavior is a good idea.  It
>> will also not work for direct I/O if any single direct I/O bio has
>> ever more than a single page, which is something btrfs badly needs
>> if it wants to get any kind of performance out of direct I/O.
Nikolay Borisov May 25, 2022, 3:12 p.m. UTC | #14
On 24.05.22 г. 16:13 ч., Qu Wenruo wrote:
> 
> 
> On 2022/5/24 20:08, Christoph Hellwig wrote:
>> On Tue, May 24, 2022 at 04:21:38PM +0800, Qu Wenruo wrote:
>>>> The things like resetting initial_mirror, making the naming "initial"
>>>> meaningless.
>>>> And the reset on the length part is also very quirky.
>>>
>>> In fact, if you didn't do the initial_mirror and length change (which is
>>> a big disaster of readability, to change iterator in a loop, at least to
>>> me),
>>
>> So what is the big problem there?  Do I need more extensive documentation
>> or as there anything in this concept that is just too confusing.
> 
> Modifying the iterator inside a loop is the biggest problem to me.

What iterator gets modified? The code defines a local one, which gets
initialized by copying the state from the bbio. Then the loop works as
usual; that's not confusing and is in line with how the other bio_vec
iterator macros work.
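
In other words, something like (sketch only):

/*
 * The macro only ever touches a local copy of the iterator seeded from
 * bbio->iter; the bbio itself is not modified by the loop.
 */
struct bvec_iter iter;	/* local, on the caller's stack */
struct bio_vec bv;
u32 offset;

btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
	/* use bv.bv_page / bv.bv_offset for the sector at offset */
}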

> 
> Yes, extra comments can help, but that doesn't help the readability.
> 
> And that's why most of guys here prefer for () loop if we can.
> 
> Thanks,
> Qu
>>
>>> and rely on the VFS re-read behavior to fall back to sector by
>>> secot read, I would call it better readability...
>>
>> I don't think relying on undocumented VFS behavior is a good idea.  It
>> will also not work for direct I/O if any single direct I/O bio has
>> ever more than a single page, which is something btrfs badly needs
>> if it wants to get any kind of performance out of direct I/O.
>
Qu Wenruo May 25, 2022, 10:46 p.m. UTC | #15
On 2022/5/25 23:12, Nikolay Borisov wrote:
>
>
> On 24.05.22 г. 16:13 ч., Qu Wenruo wrote:
>>
>>
>> On 2022/5/24 20:08, Christoph Hellwig wrote:
>>> On Tue, May 24, 2022 at 04:21:38PM +0800, Qu Wenruo wrote:
>>>>> The things like resetting initial_mirror, making the naming "initial"
>>>>> meaningless.
>>>>> And the reset on the length part is also very quirky.
>>>>
>>>> In fact, if you didn't do the initial_mirror and length change
>>>> (which is
>>>> a big disaster of readability, to change iterator in a loop, at
>>>> least to
>>>> me),
>>>
>>> So what is the big problem there?  Do I need more extensive
>>> documentation
>>> or as there anything in this concept that is just too confusing.
>>
>> Modifying the iterator inside a loop is the biggest problem to me.
>
> What iterator gets modified?

Mirror number.

Thanks,
Qu

> The code defines a local one, which gets
> initialized by copying the state from the bbio. Then the loop works as
> usual, that's not confusing and is in line with how the other bio vec
> iterator macros work.
>
>>
>> Yes, extra comments can help, but that doesn't help the readability.
>>
>> And that's why most of guys here prefer for () loop if we can.
>>
>> Thanks,
>> Qu
>>>
>>>> and rely on the VFS re-read behavior to fall back to sector by
>>>> secot read, I would call it better readability...
>>>
>>> I don't think relying on undocumented VFS behavior is a good idea.  It
>>> will also not work for direct I/O if any single direct I/O bio has
>>> ever more than a single page, which is something btrfs badly needs
>>> if it wants to get any kind of performance out of direct I/O.
>>
Nikolay Borisov May 26, 2022, 12:58 p.m. UTC | #16
On 22.05.22 г. 14:47 ч., Christoph Hellwig wrote:
> Use the new btrfs_bio_for_each_sector iterator to simplify
> btrfs_check_read_dio_bio.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Nikolay Borisov <nborisov@suse.com>

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 095632977a798..34466b543ed97 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7886,47 +7886,35 @@  static blk_status_t btrfs_check_read_dio_bio(struct btrfs_dio_private *dip,
 {
 	struct inode *inode = dip->inode;
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
-	const u32 sectorsize = fs_info->sectorsize;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
 	const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM);
-	struct bio_vec bvec;
-	struct bvec_iter iter;
-	u32 bio_offset = 0;
 	blk_status_t err = BLK_STS_OK;
+	struct bvec_iter iter;
+	struct bio_vec bv;
+	u32 offset;
+
+	btrfs_bio_for_each_sector(fs_info, bv, bbio, iter, offset) {
+		u64 start = bbio->file_offset + offset;
+
+		if (uptodate &&
+		    (!csum || !check_data_csum(inode, bbio, offset, bv.bv_page,
+				bv.bv_offset, start))) {
+			clean_io_failure(fs_info, failure_tree, io_tree, start,
+					 bv.bv_page, btrfs_ino(BTRFS_I(inode)),
+					 bv.bv_offset);
+		} else {
+			int ret;
 
-	__bio_for_each_segment(bvec, &bbio->bio, iter, bbio->iter) {
-		unsigned int i, nr_sectors, pgoff;
-
-		nr_sectors = BTRFS_BYTES_TO_BLKS(fs_info, bvec.bv_len);
-		pgoff = bvec.bv_offset;
-		for (i = 0; i < nr_sectors; i++) {
-			u64 start = bbio->file_offset + bio_offset;
-
-			ASSERT(pgoff < PAGE_SIZE);
-			if (uptodate &&
-			    (!csum || !check_data_csum(inode, bbio,
-						       bio_offset, bvec.bv_page,
-						       pgoff, start))) {
-				clean_io_failure(fs_info, failure_tree, io_tree,
-						 start, bvec.bv_page,
-						 btrfs_ino(BTRFS_I(inode)),
-						 pgoff);
-			} else {
-				int ret;
-
-				ret = btrfs_repair_one_sector(inode, &bbio->bio,
-						bio_offset, bvec.bv_page, pgoff,
-						start, bbio->mirror_num,
-						submit_dio_repair_bio);
-				if (ret)
-					err = errno_to_blk_status(ret);
-			}
-			ASSERT(bio_offset + sectorsize > bio_offset);
-			bio_offset += sectorsize;
-			pgoff += sectorsize;
+			ret = btrfs_repair_one_sector(inode, &bbio->bio, offset,
+					bv.bv_page, bv.bv_offset, start,
+					bbio->mirror_num,
+					submit_dio_repair_bio);
+			if (ret)
+				err = errno_to_blk_status(ret);
 		}
 	}
+
 	return err;
 }