diff mbox series

btrfs: don't merge pages into bio if their page offset is not continuous

Message ID 8641e5e791942d86de0a1b3ee120570441616fdc.1660375698.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: don't merge pages into bio if their page offset is not continuous | expand

Commit Message

Qu Wenruo Aug. 13, 2022, 7:28 a.m. UTC
[BUG]
Zygo reported on latest devel branch, he can hit ASSERT()/BUG_ON()
caused crash when doing RAID5 recovery (intentionally corrupt one disk,
and let btrfs to recovery the data during read/scrub).

And The following minimal reproducer can cause extent state leakage at
rmmod time:

  mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
  mount $dev1 $mnt
  fsstress -w -d $mnt -n 25 -s 1660807876
  sync
  fssum -A -f -w /tmp/fssum.saved $mnt
  umount $mnt

  # Wipe the dev1 but keeps its super block
  xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
  mount $dev1 $mnt
  fssum -r /tmp/fssum.saved $mnt > /dev/null
  umount $mnt
  rmmod btrfs

This will lead to the following extent states leakage:

 BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
 BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
 BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
 BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
 BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
 BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
 BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
 BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1

[CAUSE]
Since commit 7aa51232e204 ("btrfs: pass a btrfs_bio to
btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
determine the file offset of a page.

But that usage assume that, one bio has all its page having a continuous
page offsets.

Unfortunately that's not true, btrfs only requires the logical bytenr
continuous when assembling its bios.

From above script, we have one bio looks like this:
  fssum-27671  submit_one_bio: bio logical=217739264 len=36864
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=466944 <<<
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=724992 <<<
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=729088
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=733184
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=737280
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=741376
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=745472
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=749568
  fssum-27671  submit_one_bio:   r/i=5/261 page_offset=753664

Note that the 1st and the 2nd page has non-continuous page offsets.

This means, at repair time, we will have completely wrong file offset
passed in:

   kworker/u32:2-19927  btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192

Since the file offset is incorrect, we latter incorrectly set the extent
states, and no way to really release them.

Thus later it causes the leakage.

In fact, this can be even worse, since the file offset is incorrect, we
can hit cases like the incorrect file offset belongs to a HOLE, and
later cause btrfs_num_copies() to trigger error, finally hit
BUG_ON()/ASSERT() later.

[FIX]
This patch will add an extra condition in btrfs_bio_add_page() for
uncompressed IO.

Now we will have more strict requirement for bio pages:

- They should all have the same mapping
- Their logical bytenr should be adjacent
  Above two checks are the same as the old code.
  (the mapping check is original implied, now we just make it
   more explicit)

- Their page_offset() (file offset) should be adjacent
  This is the new check.
  This would result a slightly increased amount of bios from btrfs
  (needs holes and cross stripe boundary).

  But this would greatly reduce the confusion, as it's pretty common
  to assume a btrfs bio would only contain continuous page cache.

Currently this should be the minimal patch to fix commit 7aa51232e204
("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 7aa51232e204 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

Comments

Christoph Hellwig Aug. 13, 2022, 7:39 a.m. UTC | #1
> +	if (bio_ctrl->compress_type != BTRFS_COMPRESS_NONE) {
>  		contig = bio->bi_iter.bi_sector == sector;
> -	else
> -		contig = bio_end_sector(bio) == sector;
> +	} else {

I don't tink we can lose the bio_end_sector check here.

Also if you touch this, invrting the check so that it is

	if (bio_ctrl->compress_type == BTRFS_COMPRESS_NONE) {
		/* main version here */
	} else {
		/* compressed special case */
	}

might make it a bit readable.

> +		struct bio_vec *bvec;
> +
> +		/* Empty bio, we can always add page into it. */
> +		if (bio->bi_iter.bi_size == 0) {

This is also true for the compressed case, isn't it?

So maybe:

	if (bio->bi_iter.bi_size == 0) {
		contig = true;
	} else if (bio_ctrl->compress_type == BTRFS_COMPRESS_NONE) {
		struct bio_vec *bvec = bio_last_bvec_all(bio);

		/*
		 * The contig check requires the following conditions to be met:
		 * 1) The pages are belonging to the same inode
		 * 2) The range has adjacent logical bytenr
		 * 3) The range has adjacent file offset (NEW)
		 *    This is required for the usage of btrfs_bio->file_offset.
		 */
		contig = bio_end_sector(bio) == sector &&
			    page_offset(bvec->bv_page) + bvec->bv_offset +
			    bvec->bv_len == page_offset(page) + pg_offset);
	} else {
		contig = bio->bi_iter.bi_sector == sector;
	}

(we don't need the mapping check, as all the submit_extent_page
always loops over the same mapping)
Qu Wenruo Aug. 13, 2022, 7:43 a.m. UTC | #2
On 2022/8/13 15:39, Christoph Hellwig wrote:
>> +	if (bio_ctrl->compress_type != BTRFS_COMPRESS_NONE) {
>>   		contig = bio->bi_iter.bi_sector == sector;
>> -	else
>> -		contig = bio_end_sector(bio) == sector;
>> +	} else {
>
> I don't tink we can lose the bio_end_sector check here.
>
> Also if you touch this, invrting the check so that it is
>
> 	if (bio_ctrl->compress_type == BTRFS_COMPRESS_NONE) {
> 		/* main version here */
> 	} else {
> 		/* compressed special case */
> 	}
>
> might make it a bit readable.

Indeed.

>
>> +		struct bio_vec *bvec;
>> +
>> +		/* Empty bio, we can always add page into it. */
>> +		if (bio->bi_iter.bi_size == 0) {
>
> This is also true for the compressed case, isn't it?

That is already implied in the "contig = bio->bi_iter.bi_sector ==
sector;" check.

Remember for compressed IO, we always set the the disk_bytenr to
whatever the bytenr of the extent start.

>
> So maybe:
>
> 	if (bio->bi_iter.bi_size == 0) {
> 		contig = true;
> 	} else if (bio_ctrl->compress_type == BTRFS_COMPRESS_NONE) {
> 		struct bio_vec *bvec = bio_last_bvec_all(bio);
>
> 		/*
> 		 * The contig check requires the following conditions to be met:
> 		 * 1) The pages are belonging to the same inode
> 		 * 2) The range has adjacent logical bytenr
> 		 * 3) The range has adjacent file offset (NEW)
> 		 *    This is required for the usage of btrfs_bio->file_offset.
> 		 */
> 		contig = bio_end_sector(bio) == sector &&
> 			    page_offset(bvec->bv_page) + bvec->bv_offset +
> 			    bvec->bv_len == page_offset(page) + pg_offset);
> 	} else {
> 		contig = bio->bi_iter.bi_sector == sector;
> 	}

This indeed looks better.

>
> (we don't need the mapping check, as all the submit_extent_page
> always loops over the same mapping)

Yep, that mapping check is already implied by the call chain.

BTW, I'll also update the commit message to mention some future
cleanups, as now a btrfs_bio is completely a continous range, thus
things like endio_readpage_release_extent() related infrastructure can
also be removed.

Thanks,
Qu
Christoph Hellwig Aug. 13, 2022, 7:46 a.m. UTC | #3
On Sat, Aug 13, 2022 at 03:43:49PM +0800, Qu Wenruo wrote:
> BTW, I'll also update the commit message to mention some future
> cleanups, as now a btrfs_bio is completely a continous range, thus
> things like endio_readpage_release_extent() related infrastructure can
> also be removed.

Yes, there is a lot of things that can be cleaned up with this. For
that reason along I was planning to look into a change like this
eventually, but I should have done it earlier..
diff mbox series

Patch

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bc30c0a4a7f2..fc44125110b4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3256,7 +3256,7 @@  static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	u32 bio_size = bio->bi_iter.bi_size;
 	u32 real_size;
 	const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
-	bool contig;
+	bool contig = false;
 	int ret;
 
 	ASSERT(bio);
@@ -3265,10 +3265,32 @@  static int btrfs_bio_add_page(struct btrfs_bio_ctrl *bio_ctrl,
 	if (bio_ctrl->compress_type != compress_type)
 		return 0;
 
-	if (bio_ctrl->compress_type != BTRFS_COMPRESS_NONE)
+	if (bio_ctrl->compress_type != BTRFS_COMPRESS_NONE) {
 		contig = bio->bi_iter.bi_sector == sector;
-	else
-		contig = bio_end_sector(bio) == sector;
+	} else {
+		struct bio_vec *bvec;
+
+		/* Empty bio, we can always add page into it. */
+		if (bio->bi_iter.bi_size == 0) {
+			contig = true;
+			goto check;
+		}
+
+		bvec = bio_last_bvec_all(bio);
+		/*
+		 * The contig check requires the following conditions to be met:
+		 * 1) The pages are belonging to the same inode
+		 * 2) The range has adjacent logical bytenr
+		 * 3) The range has adjacent file offset (NEW)
+		 *    This is required for the usage of btrfs_bio->file_offset.
+		 */
+		if (bvec->bv_page->mapping == page->mapping &&
+		    bio_end_sector(bio) == sector &&
+		    page_offset(bvec->bv_page) + bvec->bv_offset +
+		    bvec->bv_len == page_offset(page) + pg_offset)
+			contig = true;
+	}
+check:
 	if (!contig)
 		return 0;