diff mbox series

btrfs: fix corrupt read due to bad offset of a compressed extent map

Message ID 7e186f2d4892bf5bfa1a66dd859a38c981acf8ab.1721124786.git.fdmanana@suse.com (mailing list archive)
State New, archived
Headers show
Series btrfs: fix corrupt read due to bad offset of a compressed extent map | expand

Commit Message

Filipe Manana July 16, 2024, 10:14 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

If we attempt to insert a compressed extent map that has a range that
overlaps another extent map we have in the inode's extent map tree, we
can end up with an incorrect offset after adjusting the new extent map at
merge_extent_mapping() because we don't update the extent map's offset.

For example consider the following scenario:

1) We have a file extent item for a compressed extent covering the file
   range [108K, 144K) and currently there's no corresponding extent map
   in the inode's extent map tree;

2) The inode's size is 141K;

3) We have an encoded write (compressed) into the file range [120K, 128K),
   which overlaps the existing file extent item. The encoded write creates
   a matching extent map, add's it to the inode's extent map tree and
   creates an ordered extent for it.

   Note that the corresponding file extent item is added to the subvolume
   tree only when the ordered extent completes (when executing
   btrfs_finish_one_ordered());

4) We have a write into the file range [160K, 164K[.

   This writes increases the i_size of the file, and there's a hole
   between the current i_size (141K) and the start offset of this write,
   and since the old i_size is in the middle of the block [140K, 144K),
   we have to write zeroes to the range [141K, 144K) (3072 bytes) and
   therefore dirty that page.

   We then call btrfs_set_extent_delalloc() with a start offset of 140K.
   We then end up at btrfs_find_new_delalloc_bytes() which will call
   btrfs_get_extent() for the range [140K, 144K);

5) The btrfs_get_extent() doesn't find any extent map in the inode's
   extent map tree covering the range [140K, 144K), so it searches the
   subvolume tree for any file extent items covering that range.

   There it finds the file extent item for the range [108K, 144K),
   creates a compressed extent map for that range and then calls
   btrfs_add_extent_mapping() with that extent map and passes the
   range [140K, 144K) via the "start" and "len" parameters;

6) The call to add_extent_mapping() done by btrfs_add_extent_mapping()
   fails with -EEXIST because there's an extent map, created at step 2
   for the [120K, 128K) range, that covers that overlaps with the range
   of the given extent map ([108K, 144K)).

   Then it does a lookup for extent map from step 2 add calls
   merge_extent_mapping() to adjust the input extent map ([108K, 144K)).
   That adjust the extent map to a start offset of 128K and a length
   of 16K (starting just after the extent map from step 2), but it does
   not update the offset field of the extent map, leaving it with a value
   of zero instead of updating to a value of 20K (128K - 108K = 20K).

   As a result any read for the range [128K, 144K) can return
   incorrect data since we read from a wrong section of the extent (unless
   both the correct and incorrect ranges happen to have the same data).

So fix this by changing merge_extent_mapping() to update the extent map's
offset even if it's compressed. Also add a test case to the self tests.

A test case for fstests that triggered this problem using send/receive
with compressed writes will be added soon.

Fixes: 3d2ac9922465 ("btrfs: introduce new members for extent_map")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent_map.c             |  2 +-
 fs/btrfs/tests/extent-map-tests.c | 99 +++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+), 1 deletion(-)

Comments

Qu Wenruo July 16, 2024, 10:59 a.m. UTC | #1
在 2024/7/16 19:44, fdmanana@kernel.org 写道:
> From: Filipe Manana <fdmanana@suse.com>
>
> If we attempt to insert a compressed extent map that has a range that
> overlaps another extent map we have in the inode's extent map tree, we
> can end up with an incorrect offset after adjusting the new extent map at
> merge_extent_mapping() because we don't update the extent map's offset.
>
> For example consider the following scenario:
>
> 1) We have a file extent item for a compressed extent covering the file
>     range [108K, 144K) and currently there's no corresponding extent map
>     in the inode's extent map tree;
>
> 2) The inode's size is 141K;
>
> 3) We have an encoded write (compressed) into the file range [120K, 128K),
>     which overlaps the existing file extent item. The encoded write creates
>     a matching extent map, add's it to the inode's extent map tree and
>     creates an ordered extent for it.
>
>     Note that the corresponding file extent item is added to the subvolume
>     tree only when the ordered extent completes (when executing
>     btrfs_finish_one_ordered());
>
> 4) We have a write into the file range [160K, 164K[.
>
>     This writes increases the i_size of the file, and there's a hole
>     between the current i_size (141K) and the start offset of this write,
>     and since the old i_size is in the middle of the block [140K, 144K),
>     we have to write zeroes to the range [141K, 144K) (3072 bytes) and
>     therefore dirty that page.
>
>     We then call btrfs_set_extent_delalloc() with a start offset of 140K.
>     We then end up at btrfs_find_new_delalloc_bytes() which will call
>     btrfs_get_extent() for the range [140K, 144K);
>
> 5) The btrfs_get_extent() doesn't find any extent map in the inode's
>     extent map tree covering the range [140K, 144K), so it searches the
>     subvolume tree for any file extent items covering that range.
>
>     There it finds the file extent item for the range [108K, 144K),
>     creates a compressed extent map for that range and then calls
>     btrfs_add_extent_mapping() with that extent map and passes the
>     range [140K, 144K) via the "start" and "len" parameters;
>
> 6) The call to add_extent_mapping() done by btrfs_add_extent_mapping()
>     fails with -EEXIST because there's an extent map, created at step 2
>     for the [120K, 128K) range, that covers that overlaps with the range
>     of the given extent map ([108K, 144K)).
>
>     Then it does a lookup for extent map from step 2 add calls
>     merge_extent_mapping() to adjust the input extent map ([108K, 144K)).
>     That adjust the extent map to a start offset of 128K and a length
>     of 16K (starting just after the extent map from step 2), but it does
>     not update the offset field of the extent map, leaving it with a value
>     of zero instead of updating to a value of 20K (128K - 108K = 20K).
>
>     As a result any read for the range [128K, 144K) can return
>     incorrect data since we read from a wrong section of the extent (unless
>     both the correct and incorrect ranges happen to have the same data).
>
> So fix this by changing merge_extent_mapping() to update the extent map's
> offset even if it's compressed. Also add a test case to the self tests.
>
> A test case for fstests that triggered this problem using send/receive
> with compressed writes will be added soon.
>
> Fixes: 3d2ac9922465 ("btrfs: introduce new members for extent_map")

Reviewed-by: Qu Wenruo <wqu@suse.com>


> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>   fs/btrfs/extent_map.c             |  2 +-
>   fs/btrfs/tests/extent-map-tests.c | 99 +++++++++++++++++++++++++++++++
>   2 files changed, 100 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index bacb76952fc4..f85f0172b58b 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -664,7 +664,7 @@ static noinline int merge_extent_mapping(struct btrfs_inode *inode,
>   	start_diff = start - em->start;
>   	em->start = start;
>   	em->len = end - start;
> -	if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE && !extent_map_is_compressed(em))
> +	if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
>   		em->offset += start_diff;

However I'm not sure if the fixes tag is correct.

The fix is changing the condition so that even for compressed extents we
can properly update the offset

However the condition line is not from that commit.
The condition is there way before the change, just the em member cleanup
touched that line by removing the tailing '{' since eventually there is
only one line.

So it looks like the problem exists way before that fixes tag.

Thanks,
Qu

>   	return add_extent_mapping(inode, em, 0);
>   }
> diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
> index ebec4ab361b8..e4d019c8e8b9 100644
> --- a/fs/btrfs/tests/extent-map-tests.c
> +++ b/fs/btrfs/tests/extent-map-tests.c
> @@ -900,6 +900,102 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>   	return ret;
>   }
>
> +/*
> + * Test a regression for compressed extent map adjustment when we attempt to
> + * add an extent map that is partially ovarlapped by another existing extent
> + * map. The resulting extent map offset was left unchanged despite having
> + * incremented its start offset.
> + */
> +static int test_case_8(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
> +{
> +	struct extent_map_tree *em_tree = &inode->extent_tree;
> +	struct extent_map *em;
> +	int ret;
> +	int ret2;
> +
> +	em = alloc_extent_map();
> +	if (!em) {
> +		test_std_err(TEST_ALLOC_EXTENT_MAP);
> +		return -ENOMEM;
> +	}
> +
> +	/* Compressed extent for the file range [120K, 128K). */
> +	em->start = SZ_1K * 120;
> +	em->len = SZ_8K;
> +	em->disk_num_bytes = SZ_4K;
> +	em->ram_bytes = SZ_8K;
> +	em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
> +	write_lock(&em_tree->lock);
> +	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
> +	write_unlock(&em_tree->lock);
> +	free_extent_map(em);
> +	if (ret < 0) {
> +		test_err("couldn't add extent map for range [120K, 128K)");
> +		goto out;
> +	}
> +
> +	em = alloc_extent_map();
> +	if (!em) {
> +		test_std_err(TEST_ALLOC_EXTENT_MAP);
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Compressed extent for the file range [108K, 144K), which overlaps
> +	 * with the [120K, 128K) we previously inserted.
> +	 */
> +	em->start = SZ_1K * 108;
> +	em->len = SZ_1K * 36;
> +	em->disk_num_bytes = SZ_4K;
> +	em->ram_bytes = SZ_1K * 36;
> +	em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
> +
> +	/*
> +	 * Try to add the extent map but with a search range of [140K, 144K),
> +	 * this should succeed and adjust the extent map to the range
> +	 * [128K, 144K), with a length of 16K and an offset of 20K.
> +	 *
> +	 * This simulates a scenario where in the subvolume tree of an inode we
> +	 * have a compressed file extent item for the range [108K, 144K) and we
> +	 * have an overlapping compressed extent map for the range [120K, 128K),
> +	 * which was created by an encoded write, but its ordered extent was not
> +	 * yet completed, so the subvolume tree doesn't have yet the file extent
> +	 * item for that range - we only have the extent map in the inode's
> +	 * extent map tree.
> +	 */
> +	write_lock(&em_tree->lock);
> +	ret = btrfs_add_extent_mapping(inode, &em, SZ_1K * 140, SZ_4K);
> +	write_unlock(&em_tree->lock);
> +	free_extent_map(em);
> +	if (ret < 0) {
> +		test_err("couldn't add extent map for range [108K, 144K)");
> +		goto out;
> +	}
> +
> +	if (em->start != SZ_128K) {
> +		test_err("unexpected extent map start %llu (should be 128K)", em->start);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	if (em->len != SZ_16K) {
> +		test_err("unexpected extent map length %llu (should be 16K)", em->len);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	if (em->offset != SZ_1K * 20) {
> +		test_err("unexpected extent map offset %llu (should be 20K)", em->offset);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +out:
> +	ret2 = free_extent_map_tree(inode);
> +	if (ret == 0)
> +		ret = ret2;
> +
> +	return ret;
> +}
> +
>   struct rmap_test_vector {
>   	u64 raid_type;
>   	u64 physical_start;
> @@ -1076,6 +1172,9 @@ int btrfs_test_extent_map(void)
>   	if (ret)
>   		goto out;
>   	ret = test_case_7(fs_info, BTRFS_I(inode));
> +	if (ret)
> +		goto out;
> +	ret = test_case_8(fs_info, BTRFS_I(inode));
>   	if (ret)
>   		goto out;
>
Filipe Manana July 16, 2024, 11:07 a.m. UTC | #2
On Tue, Jul 16, 2024 at 11:59 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/7/16 19:44, fdmanana@kernel.org 写道:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > If we attempt to insert a compressed extent map that has a range that
> > overlaps another extent map we have in the inode's extent map tree, we
> > can end up with an incorrect offset after adjusting the new extent map at
> > merge_extent_mapping() because we don't update the extent map's offset.
> >
> > For example consider the following scenario:
> >
> > 1) We have a file extent item for a compressed extent covering the file
> >     range [108K, 144K) and currently there's no corresponding extent map
> >     in the inode's extent map tree;
> >
> > 2) The inode's size is 141K;
> >
> > 3) We have an encoded write (compressed) into the file range [120K, 128K),
> >     which overlaps the existing file extent item. The encoded write creates
> >     a matching extent map, add's it to the inode's extent map tree and
> >     creates an ordered extent for it.
> >
> >     Note that the corresponding file extent item is added to the subvolume
> >     tree only when the ordered extent completes (when executing
> >     btrfs_finish_one_ordered());
> >
> > 4) We have a write into the file range [160K, 164K[.
> >
> >     This writes increases the i_size of the file, and there's a hole
> >     between the current i_size (141K) and the start offset of this write,
> >     and since the old i_size is in the middle of the block [140K, 144K),
> >     we have to write zeroes to the range [141K, 144K) (3072 bytes) and
> >     therefore dirty that page.
> >
> >     We then call btrfs_set_extent_delalloc() with a start offset of 140K.
> >     We then end up at btrfs_find_new_delalloc_bytes() which will call
> >     btrfs_get_extent() for the range [140K, 144K);
> >
> > 5) The btrfs_get_extent() doesn't find any extent map in the inode's
> >     extent map tree covering the range [140K, 144K), so it searches the
> >     subvolume tree for any file extent items covering that range.
> >
> >     There it finds the file extent item for the range [108K, 144K),
> >     creates a compressed extent map for that range and then calls
> >     btrfs_add_extent_mapping() with that extent map and passes the
> >     range [140K, 144K) via the "start" and "len" parameters;
> >
> > 6) The call to add_extent_mapping() done by btrfs_add_extent_mapping()
> >     fails with -EEXIST because there's an extent map, created at step 2
> >     for the [120K, 128K) range, that covers that overlaps with the range
> >     of the given extent map ([108K, 144K)).
> >
> >     Then it does a lookup for extent map from step 2 add calls
> >     merge_extent_mapping() to adjust the input extent map ([108K, 144K)).
> >     That adjust the extent map to a start offset of 128K and a length
> >     of 16K (starting just after the extent map from step 2), but it does
> >     not update the offset field of the extent map, leaving it with a value
> >     of zero instead of updating to a value of 20K (128K - 108K = 20K).
> >
> >     As a result any read for the range [128K, 144K) can return
> >     incorrect data since we read from a wrong section of the extent (unless
> >     both the correct and incorrect ranges happen to have the same data).
> >
> > So fix this by changing merge_extent_mapping() to update the extent map's
> > offset even if it's compressed. Also add a test case to the self tests.
> >
> > A test case for fstests that triggered this problem using send/receive
> > with compressed writes will be added soon.
> >
> > Fixes: 3d2ac9922465 ("btrfs: introduce new members for extent_map")
>
> Reviewed-by: Qu Wenruo <wqu@suse.com>
>
>
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >   fs/btrfs/extent_map.c             |  2 +-
> >   fs/btrfs/tests/extent-map-tests.c | 99 +++++++++++++++++++++++++++++++
> >   2 files changed, 100 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> > index bacb76952fc4..f85f0172b58b 100644
> > --- a/fs/btrfs/extent_map.c
> > +++ b/fs/btrfs/extent_map.c
> > @@ -664,7 +664,7 @@ static noinline int merge_extent_mapping(struct btrfs_inode *inode,
> >       start_diff = start - em->start;
> >       em->start = start;
> >       em->len = end - start;
> > -     if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE && !extent_map_is_compressed(em))
> > +     if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
> >               em->offset += start_diff;
>
> However I'm not sure if the fixes tag is correct.
>
> The fix is changing the condition so that even for compressed extents we
> can properly update the offset
>
> However the condition line is not from that commit.

No?

That's the only commit that adds:

em->offset += start_diff;

to add_extent_mapping(). How can it not be?
em->offset only exists after that patch.

> The condition is there way before the change, just the em member cleanup
> touched that line by removing the tailing '{' since eventually there is
> only one line.
>
> So it looks like the problem exists way before that fixes tag.

Nop.

Try the test case for fstests on a branch without the extent map
patchset, and you'll see - it will pass, while with the patchset it
fails.

The patchset had that problem I mentioned during review, that by
adding all members and then doing the switch in different patches
would make a bisection point to the switch patches.



>
> Thanks,
> Qu
>
> >       return add_extent_mapping(inode, em, 0);
> >   }
> > diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
> > index ebec4ab361b8..e4d019c8e8b9 100644
> > --- a/fs/btrfs/tests/extent-map-tests.c
> > +++ b/fs/btrfs/tests/extent-map-tests.c
> > @@ -900,6 +900,102 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
> >       return ret;
> >   }
> >
> > +/*
> > + * Test a regression for compressed extent map adjustment when we attempt to
> > + * add an extent map that is partially ovarlapped by another existing extent
> > + * map. The resulting extent map offset was left unchanged despite having
> > + * incremented its start offset.
> > + */
> > +static int test_case_8(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
> > +{
> > +     struct extent_map_tree *em_tree = &inode->extent_tree;
> > +     struct extent_map *em;
> > +     int ret;
> > +     int ret2;
> > +
> > +     em = alloc_extent_map();
> > +     if (!em) {
> > +             test_std_err(TEST_ALLOC_EXTENT_MAP);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     /* Compressed extent for the file range [120K, 128K). */
> > +     em->start = SZ_1K * 120;
> > +     em->len = SZ_8K;
> > +     em->disk_num_bytes = SZ_4K;
> > +     em->ram_bytes = SZ_8K;
> > +     em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
> > +     write_lock(&em_tree->lock);
> > +     ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
> > +     write_unlock(&em_tree->lock);
> > +     free_extent_map(em);
> > +     if (ret < 0) {
> > +             test_err("couldn't add extent map for range [120K, 128K)");
> > +             goto out;
> > +     }
> > +
> > +     em = alloc_extent_map();
> > +     if (!em) {
> > +             test_std_err(TEST_ALLOC_EXTENT_MAP);
> > +             ret = -ENOMEM;
> > +             goto out;
> > +     }
> > +
> > +     /*
> > +      * Compressed extent for the file range [108K, 144K), which overlaps
> > +      * with the [120K, 128K) we previously inserted.
> > +      */
> > +     em->start = SZ_1K * 108;
> > +     em->len = SZ_1K * 36;
> > +     em->disk_num_bytes = SZ_4K;
> > +     em->ram_bytes = SZ_1K * 36;
> > +     em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
> > +
> > +     /*
> > +      * Try to add the extent map but with a search range of [140K, 144K),
> > +      * this should succeed and adjust the extent map to the range
> > +      * [128K, 144K), with a length of 16K and an offset of 20K.
> > +      *
> > +      * This simulates a scenario where in the subvolume tree of an inode we
> > +      * have a compressed file extent item for the range [108K, 144K) and we
> > +      * have an overlapping compressed extent map for the range [120K, 128K),
> > +      * which was created by an encoded write, but its ordered extent was not
> > +      * yet completed, so the subvolume tree doesn't have yet the file extent
> > +      * item for that range - we only have the extent map in the inode's
> > +      * extent map tree.
> > +      */
> > +     write_lock(&em_tree->lock);
> > +     ret = btrfs_add_extent_mapping(inode, &em, SZ_1K * 140, SZ_4K);
> > +     write_unlock(&em_tree->lock);
> > +     free_extent_map(em);
> > +     if (ret < 0) {
> > +             test_err("couldn't add extent map for range [108K, 144K)");
> > +             goto out;
> > +     }
> > +
> > +     if (em->start != SZ_128K) {
> > +             test_err("unexpected extent map start %llu (should be 128K)", em->start);
> > +             ret = -EINVAL;
> > +             goto out;
> > +     }
> > +     if (em->len != SZ_16K) {
> > +             test_err("unexpected extent map length %llu (should be 16K)", em->len);
> > +             ret = -EINVAL;
> > +             goto out;
> > +     }
> > +     if (em->offset != SZ_1K * 20) {
> > +             test_err("unexpected extent map offset %llu (should be 20K)", em->offset);
> > +             ret = -EINVAL;
> > +             goto out;
> > +     }
> > +out:
> > +     ret2 = free_extent_map_tree(inode);
> > +     if (ret == 0)
> > +             ret = ret2;
> > +
> > +     return ret;
> > +}
> > +
> >   struct rmap_test_vector {
> >       u64 raid_type;
> >       u64 physical_start;
> > @@ -1076,6 +1172,9 @@ int btrfs_test_extent_map(void)
> >       if (ret)
> >               goto out;
> >       ret = test_case_7(fs_info, BTRFS_I(inode));
> > +     if (ret)
> > +             goto out;
> > +     ret = test_case_8(fs_info, BTRFS_I(inode));
> >       if (ret)
> >               goto out;
> >
Qu Wenruo July 16, 2024, 11:30 a.m. UTC | #3
在 2024/7/16 20:37, Filipe Manana 写道:
[...]
>> However I'm not sure if the fixes tag is correct.
>>
>> The fix is changing the condition so that even for compressed extents we
>> can properly update the offset
>>
>> However the condition line is not from that commit.
>
> No?
>
> That's the only commit that adds:
>
> em->offset += start_diff;
>
> to add_extent_mapping(). How can it not be?
> em->offset only exists after that patch.

Before the introduction of em->offset, the old code is:

	if (em->block_start < LAST_BYTE && !is_compressed) {
		em->block_start += start_diff;
		em->block_len = em->len;
	}

@block_start is the disk_bytenr + offset for uncompressed extents, or
just disk_bytenr for compressed extents.

@block_len is the num_bytes for uncompressed extents, or just
disk_num_bytes for compressed extents.

So for the old kernel without em member update, compressed extents won't
get any update at all, thus should still lead to the same problem.

Or did I miss something?

Thanks,
Qu

>
>> The condition is there way before the change, just the em member cleanup
>> touched that line by removing the tailing '{' since eventually there is
>> only one line.
>>
>> So it looks like the problem exists way before that fixes tag.
>
> Nop.
>
> Try the test case for fstests on a branch without the extent map
> patchset, and you'll see - it will pass, while with the patchset it
> fails.
>
> The patchset had that problem I mentioned during review, that by
> adding all members and then doing the switch in different patches
> would make a bisection point to the switch patches.
>
>
>
>>
>> Thanks,
>> Qu
>>
>>>        return add_extent_mapping(inode, em, 0);
>>>    }
>>> diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
>>> index ebec4ab361b8..e4d019c8e8b9 100644
>>> --- a/fs/btrfs/tests/extent-map-tests.c
>>> +++ b/fs/btrfs/tests/extent-map-tests.c
>>> @@ -900,6 +900,102 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>>>        return ret;
>>>    }
>>>
>>> +/*
>>> + * Test a regression for compressed extent map adjustment when we attempt to
>>> + * add an extent map that is partially ovarlapped by another existing extent
>>> + * map. The resulting extent map offset was left unchanged despite having
>>> + * incremented its start offset.
>>> + */
>>> +static int test_case_8(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
>>> +{
>>> +     struct extent_map_tree *em_tree = &inode->extent_tree;
>>> +     struct extent_map *em;
>>> +     int ret;
>>> +     int ret2;
>>> +
>>> +     em = alloc_extent_map();
>>> +     if (!em) {
>>> +             test_std_err(TEST_ALLOC_EXTENT_MAP);
>>> +             return -ENOMEM;
>>> +     }
>>> +
>>> +     /* Compressed extent for the file range [120K, 128K). */
>>> +     em->start = SZ_1K * 120;
>>> +     em->len = SZ_8K;
>>> +     em->disk_num_bytes = SZ_4K;
>>> +     em->ram_bytes = SZ_8K;
>>> +     em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
>>> +     write_lock(&em_tree->lock);
>>> +     ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
>>> +     write_unlock(&em_tree->lock);
>>> +     free_extent_map(em);
>>> +     if (ret < 0) {
>>> +             test_err("couldn't add extent map for range [120K, 128K)");
>>> +             goto out;
>>> +     }
>>> +
>>> +     em = alloc_extent_map();
>>> +     if (!em) {
>>> +             test_std_err(TEST_ALLOC_EXTENT_MAP);
>>> +             ret = -ENOMEM;
>>> +             goto out;
>>> +     }
>>> +
>>> +     /*
>>> +      * Compressed extent for the file range [108K, 144K), which overlaps
>>> +      * with the [120K, 128K) we previously inserted.
>>> +      */
>>> +     em->start = SZ_1K * 108;
>>> +     em->len = SZ_1K * 36;
>>> +     em->disk_num_bytes = SZ_4K;
>>> +     em->ram_bytes = SZ_1K * 36;
>>> +     em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
>>> +
>>> +     /*
>>> +      * Try to add the extent map but with a search range of [140K, 144K),
>>> +      * this should succeed and adjust the extent map to the range
>>> +      * [128K, 144K), with a length of 16K and an offset of 20K.
>>> +      *
>>> +      * This simulates a scenario where in the subvolume tree of an inode we
>>> +      * have a compressed file extent item for the range [108K, 144K) and we
>>> +      * have an overlapping compressed extent map for the range [120K, 128K),
>>> +      * which was created by an encoded write, but its ordered extent was not
>>> +      * yet completed, so the subvolume tree doesn't have yet the file extent
>>> +      * item for that range - we only have the extent map in the inode's
>>> +      * extent map tree.
>>> +      */
>>> +     write_lock(&em_tree->lock);
>>> +     ret = btrfs_add_extent_mapping(inode, &em, SZ_1K * 140, SZ_4K);
>>> +     write_unlock(&em_tree->lock);
>>> +     free_extent_map(em);
>>> +     if (ret < 0) {
>>> +             test_err("couldn't add extent map for range [108K, 144K)");
>>> +             goto out;
>>> +     }
>>> +
>>> +     if (em->start != SZ_128K) {
>>> +             test_err("unexpected extent map start %llu (should be 128K)", em->start);
>>> +             ret = -EINVAL;
>>> +             goto out;
>>> +     }
>>> +     if (em->len != SZ_16K) {
>>> +             test_err("unexpected extent map length %llu (should be 16K)", em->len);
>>> +             ret = -EINVAL;
>>> +             goto out;
>>> +     }
>>> +     if (em->offset != SZ_1K * 20) {
>>> +             test_err("unexpected extent map offset %llu (should be 20K)", em->offset);
>>> +             ret = -EINVAL;
>>> +             goto out;
>>> +     }
>>> +out:
>>> +     ret2 = free_extent_map_tree(inode);
>>> +     if (ret == 0)
>>> +             ret = ret2;
>>> +
>>> +     return ret;
>>> +}
>>> +
>>>    struct rmap_test_vector {
>>>        u64 raid_type;
>>>        u64 physical_start;
>>> @@ -1076,6 +1172,9 @@ int btrfs_test_extent_map(void)
>>>        if (ret)
>>>                goto out;
>>>        ret = test_case_7(fs_info, BTRFS_I(inode));
>>> +     if (ret)
>>> +             goto out;
>>> +     ret = test_case_8(fs_info, BTRFS_I(inode));
>>>        if (ret)
>>>                goto out;
>>>
Filipe Manana July 16, 2024, 11:36 a.m. UTC | #4
On Tue, Jul 16, 2024 at 12:30 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/7/16 20:37, Filipe Manana 写道:
> [...]
> >> However I'm not sure if the fixes tag is correct.
> >>
> >> The fix is changing the condition so that even for compressed extents we
> >> can properly update the offset
> >>
> >> However the condition line is not from that commit.
> >
> > No?
> >
> > That's the only commit that adds:
> >
> > em->offset += start_diff;
> >
> > to add_extent_mapping(). How can it not be?
> > em->offset only exists after that patch.
>
> Before the introduction of em->offset, the old code is:
>
>         if (em->block_start < LAST_BYTE && !is_compressed) {
>                 em->block_start += start_diff;
>                 em->block_len = em->len;
>         }
>
> @block_start is the disk_bytenr + offset for uncompressed extents, or
> just disk_bytenr for compressed extents.

But that's a different thing.
em->disk_bytenr should never be updated for a compressed extent, that
was correct.

>
> @block_len is the num_bytes for uncompressed extents, or just
> disk_num_bytes for compressed extents.
>
> So for the old kernel without em member update, compressed extents won't
> get any update at all, thus should still lead to the same problem.

And they shouldn't get any updates.
Because it's disk_bytenr - we have to read the whole compressed
extent, and then extract what we need from the decompressed data.

>
> Or did I miss something?

It seems so. Did you really authored the patchset? :)

If you still don't believe that commit is buggy (despite adding the
em->offset increment logic), then do the following:

1) Run for-next against the test case for fstests - it fails;

2) Checkout for-next into a commit before the patchset that changes
extent maps and run the test case for fstests - it will always pass.

>
> Thanks,
> Qu
>
> >
> >> The condition is there way before the change, just the em member cleanup
> >> touched that line by removing the tailing '{' since eventually there is
> >> only one line.
> >>
> >> So it looks like the problem exists way before that fixes tag.
> >
> > Nop.
> >
> > Try the test case for fstests on a branch without the extent map
> > patchset, and you'll see - it will pass, while with the patchset it
> > fails.
> >
> > The patchset had that problem I mentioned during review, that by
> > adding all members and then doing the switch in different patches
> > would make a bisection point to the switch patches.
> >
> >
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>>        return add_extent_mapping(inode, em, 0);
> >>>    }
> >>> diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
> >>> index ebec4ab361b8..e4d019c8e8b9 100644
> >>> --- a/fs/btrfs/tests/extent-map-tests.c
> >>> +++ b/fs/btrfs/tests/extent-map-tests.c
> >>> @@ -900,6 +900,102 @@ static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
> >>>        return ret;
> >>>    }
> >>>
> >>> +/*
> >>> + * Test a regression for compressed extent map adjustment when we attempt to
> >>> + * add an extent map that is partially ovarlapped by another existing extent
> >>> + * map. The resulting extent map offset was left unchanged despite having
> >>> + * incremented its start offset.
> >>> + */
> >>> +static int test_case_8(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
> >>> +{
> >>> +     struct extent_map_tree *em_tree = &inode->extent_tree;
> >>> +     struct extent_map *em;
> >>> +     int ret;
> >>> +     int ret2;
> >>> +
> >>> +     em = alloc_extent_map();
> >>> +     if (!em) {
> >>> +             test_std_err(TEST_ALLOC_EXTENT_MAP);
> >>> +             return -ENOMEM;
> >>> +     }
> >>> +
> >>> +     /* Compressed extent for the file range [120K, 128K). */
> >>> +     em->start = SZ_1K * 120;
> >>> +     em->len = SZ_8K;
> >>> +     em->disk_num_bytes = SZ_4K;
> >>> +     em->ram_bytes = SZ_8K;
> >>> +     em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
> >>> +     write_lock(&em_tree->lock);
> >>> +     ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
> >>> +     write_unlock(&em_tree->lock);
> >>> +     free_extent_map(em);
> >>> +     if (ret < 0) {
> >>> +             test_err("couldn't add extent map for range [120K, 128K)");
> >>> +             goto out;
> >>> +     }
> >>> +
> >>> +     em = alloc_extent_map();
> >>> +     if (!em) {
> >>> +             test_std_err(TEST_ALLOC_EXTENT_MAP);
> >>> +             ret = -ENOMEM;
> >>> +             goto out;
> >>> +     }
> >>> +
> >>> +     /*
> >>> +      * Compressed extent for the file range [108K, 144K), which overlaps
> >>> +      * with the [120K, 128K) we previously inserted.
> >>> +      */
> >>> +     em->start = SZ_1K * 108;
> >>> +     em->len = SZ_1K * 36;
> >>> +     em->disk_num_bytes = SZ_4K;
> >>> +     em->ram_bytes = SZ_1K * 36;
> >>> +     em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
> >>> +
> >>> +     /*
> >>> +      * Try to add the extent map but with a search range of [140K, 144K),
> >>> +      * this should succeed and adjust the extent map to the range
> >>> +      * [128K, 144K), with a length of 16K and an offset of 20K.
> >>> +      *
> >>> +      * This simulates a scenario where in the subvolume tree of an inode we
> >>> +      * have a compressed file extent item for the range [108K, 144K) and we
> >>> +      * have an overlapping compressed extent map for the range [120K, 128K),
> >>> +      * which was created by an encoded write, but its ordered extent was not
> >>> +      * yet completed, so the subvolume tree doesn't have yet the file extent
> >>> +      * item for that range - we only have the extent map in the inode's
> >>> +      * extent map tree.
> >>> +      */
> >>> +     write_lock(&em_tree->lock);
> >>> +     ret = btrfs_add_extent_mapping(inode, &em, SZ_1K * 140, SZ_4K);
> >>> +     write_unlock(&em_tree->lock);
> >>> +     free_extent_map(em);
> >>> +     if (ret < 0) {
> >>> +             test_err("couldn't add extent map for range [108K, 144K)");
> >>> +             goto out;
> >>> +     }
> >>> +
> >>> +     if (em->start != SZ_128K) {
> >>> +             test_err("unexpected extent map start %llu (should be 128K)", em->start);
> >>> +             ret = -EINVAL;
> >>> +             goto out;
> >>> +     }
> >>> +     if (em->len != SZ_16K) {
> >>> +             test_err("unexpected extent map length %llu (should be 16K)", em->len);
> >>> +             ret = -EINVAL;
> >>> +             goto out;
> >>> +     }
> >>> +     if (em->offset != SZ_1K * 20) {
> >>> +             test_err("unexpected extent map offset %llu (should be 20K)", em->offset);
> >>> +             ret = -EINVAL;
> >>> +             goto out;
> >>> +     }
> >>> +out:
> >>> +     ret2 = free_extent_map_tree(inode);
> >>> +     if (ret == 0)
> >>> +             ret = ret2;
> >>> +
> >>> +     return ret;
> >>> +}
> >>> +
> >>>    struct rmap_test_vector {
> >>>        u64 raid_type;
> >>>        u64 physical_start;
> >>> @@ -1076,6 +1172,9 @@ int btrfs_test_extent_map(void)
> >>>        if (ret)
> >>>                goto out;
> >>>        ret = test_case_7(fs_info, BTRFS_I(inode));
> >>> +     if (ret)
> >>> +             goto out;
> >>> +     ret = test_case_8(fs_info, BTRFS_I(inode));
> >>>        if (ret)
> >>>                goto out;
> >>>
Qu Wenruo July 16, 2024, 11:54 a.m. UTC | #5
在 2024/7/16 21:06, Filipe Manana 写道:
> On Tue, Jul 16, 2024 at 12:30 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> 在 2024/7/16 20:37, Filipe Manana 写道:
>> [...]
>>>> However I'm not sure if the fixes tag is correct.
>>>>
>>>> The fix is changing the condition so that even for compressed extents we
>>>> can properly update the offset
>>>>
>>>> However the condition line is not from that commit.
>>>
>>> No?
>>>
>>> That's the only commit that adds:
>>>
>>> em->offset += start_diff;
>>>
>>> to add_extent_mapping(). How can it not be?
>>> em->offset only exists after that patch.
>>
>> Before the introduction of em->offset, the old code is:
>>
>>          if (em->block_start < LAST_BYTE && !is_compressed) {
>>                  em->block_start += start_diff;
>>                  em->block_len = em->len;
>>          }
>>
>> @block_start is the disk_bytenr + offset for uncompressed extents, or
>> just disk_bytenr for compressed extents.
>
> But that's a different thing.
> em->disk_bytenr should never be updated for a compressed extent, that
> was correct.
>
>>
>> @block_len is the num_bytes for uncompressed extents, or just
>> disk_num_bytes for compressed extents.
>>
>> So for the old kernel without em member update, compressed extents won't
>> get any update at all, thus should still lead to the same problem.
>
> And they shouldn't get any updates.
> Because it's disk_bytenr - we have to read the whole compressed
> extent, and then extract what we need from the decompressed data.

OK, I got the answer.

For compressed extent we have other things to calculate the offset,
using em->orig_start.

This means before the em change, we handle offset differently:

- For compressed extents, it's em->start - em->orig_start.

- For uncompressed extents, it's already calculated into
   em->block_start.

So for the old code, even if we didn't update anything, as long as the
em->orig_start is still correct, we can always get a correct result from
btrfs_submit_compressed_read().

But after the em update, all the calculation is unified, and we no
longer have orig_start to do the hidden calculation for compressed read.

>
>>
>> Or did I miss something?
>
> It seems so. Did you really authored the patchset? :)
>
> If you still don't believe that commit is buggy (despite adding the
> em->offset increment logic), then do the following:

No, I think the fix is correct, or I won't give it a reviewed-by.

I just didn't see why the old code is also working.

Now I see the reason and I believe the reason is worthy to be mentioned
in your commit message then, especially considering the old code is not
that obvious why it works in the first place.

Thanks,
Qu


>
> 1) Run for-next against the test case for fstests - it fails;
>
> 2) Checkout for-next into a commit before the patchset that changes
> extent maps and run the test case for fstests - it will always pass.
>
diff mbox series

Patch

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index bacb76952fc4..f85f0172b58b 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -664,7 +664,7 @@  static noinline int merge_extent_mapping(struct btrfs_inode *inode,
 	start_diff = start - em->start;
 	em->start = start;
 	em->len = end - start;
-	if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE && !extent_map_is_compressed(em))
+	if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
 		em->offset += start_diff;
 	return add_extent_mapping(inode, em, 0);
 }
diff --git a/fs/btrfs/tests/extent-map-tests.c b/fs/btrfs/tests/extent-map-tests.c
index ebec4ab361b8..e4d019c8e8b9 100644
--- a/fs/btrfs/tests/extent-map-tests.c
+++ b/fs/btrfs/tests/extent-map-tests.c
@@ -900,6 +900,102 @@  static int test_case_7(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
 	return ret;
 }
 
+/*
+ * Test a regression for compressed extent map adjustment when we attempt to
+ * add an extent map that is partially ovarlapped by another existing extent
+ * map. The resulting extent map offset was left unchanged despite having
+ * incremented its start offset.
+ */
+static int test_case_8(struct btrfs_fs_info *fs_info, struct btrfs_inode *inode)
+{
+	struct extent_map_tree *em_tree = &inode->extent_tree;
+	struct extent_map *em;
+	int ret;
+	int ret2;
+
+	em = alloc_extent_map();
+	if (!em) {
+		test_std_err(TEST_ALLOC_EXTENT_MAP);
+		return -ENOMEM;
+	}
+
+	/* Compressed extent for the file range [120K, 128K). */
+	em->start = SZ_1K * 120;
+	em->len = SZ_8K;
+	em->disk_num_bytes = SZ_4K;
+	em->ram_bytes = SZ_8K;
+	em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
+	write_lock(&em_tree->lock);
+	ret = btrfs_add_extent_mapping(inode, &em, em->start, em->len);
+	write_unlock(&em_tree->lock);
+	free_extent_map(em);
+	if (ret < 0) {
+		test_err("couldn't add extent map for range [120K, 128K)");
+		goto out;
+	}
+
+	em = alloc_extent_map();
+	if (!em) {
+		test_std_err(TEST_ALLOC_EXTENT_MAP);
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/*
+	 * Compressed extent for the file range [108K, 144K), which overlaps
+	 * with the [120K, 128K) we previously inserted.
+	 */
+	em->start = SZ_1K * 108;
+	em->len = SZ_1K * 36;
+	em->disk_num_bytes = SZ_4K;
+	em->ram_bytes = SZ_1K * 36;
+	em->flags |= EXTENT_FLAG_COMPRESS_ZLIB;
+
+	/*
+	 * Try to add the extent map but with a search range of [140K, 144K),
+	 * this should succeed and adjust the extent map to the range
+	 * [128K, 144K), with a length of 16K and an offset of 20K.
+	 *
+	 * This simulates a scenario where in the subvolume tree of an inode we
+	 * have a compressed file extent item for the range [108K, 144K) and we
+	 * have an overlapping compressed extent map for the range [120K, 128K),
+	 * which was created by an encoded write, but its ordered extent was not
+	 * yet completed, so the subvolume tree doesn't have yet the file extent
+	 * item for that range - we only have the extent map in the inode's
+	 * extent map tree.
+	 */
+	write_lock(&em_tree->lock);
+	ret = btrfs_add_extent_mapping(inode, &em, SZ_1K * 140, SZ_4K);
+	write_unlock(&em_tree->lock);
+	free_extent_map(em);
+	if (ret < 0) {
+		test_err("couldn't add extent map for range [108K, 144K)");
+		goto out;
+	}
+
+	if (em->start != SZ_128K) {
+		test_err("unexpected extent map start %llu (should be 128K)", em->start);
+		ret = -EINVAL;
+		goto out;
+	}
+	if (em->len != SZ_16K) {
+		test_err("unexpected extent map length %llu (should be 16K)", em->len);
+		ret = -EINVAL;
+		goto out;
+	}
+	if (em->offset != SZ_1K * 20) {
+		test_err("unexpected extent map offset %llu (should be 20K)", em->offset);
+		ret = -EINVAL;
+		goto out;
+	}
+out:
+	ret2 = free_extent_map_tree(inode);
+	if (ret == 0)
+		ret = ret2;
+
+	return ret;
+}
+
 struct rmap_test_vector {
 	u64 raid_type;
 	u64 physical_start;
@@ -1076,6 +1172,9 @@  int btrfs_test_extent_map(void)
 	if (ret)
 		goto out;
 	ret = test_case_7(fs_info, BTRFS_I(inode));
+	if (ret)
+		goto out;
+	ret = test_case_8(fs_info, BTRFS_I(inode));
 	if (ret)
 		goto out;