diff mbox series

[v2] btrfs: zstd: fix and simplify the inline extent decompression

Message ID c3bed652c4e20c8a446fba371d529a78dc98b827.1705978912.git.wqu@suse.com (mailing list archive)
State New, archived
Headers show
Series [v2] btrfs: zstd: fix and simplify the inline extent decompression | expand

Commit Message

Qu Wenruo Jan. 23, 2024, 3:03 a.m. UTC
[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed
extent created like this:

	item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
		generation 8 transid 8 size 4096 nbytes 4096
		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
		sequence 1 flags 0x0(none)
	item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
		index 2 namelen 14 name: source_inlined
	item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
		generation 8 type 0 (inline)
		inline extent data size 48 ram_bytes 4096 compression 3 (zstd)

Then trying to reflink that extent in an aarch64 system with 64K page
size, the reflink would just fail:

  # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
  XFS_IOC_CLONE_RANGE: Input/output error

[CAUSE]
In zstd_decompress(), we didn't treat @start_byte as just a page offset,
but also use it as an indicator on whether we should error out, without
any proper explanation (this is copied from other decompression code).

In reality, for subpage cases, although @start_byte can be non-zero,
we should never switch input/output buffer nor error out, since the whole
input/output buffer should never exceed one sector, thus we should not
need to do any buffer switch.

Thus the current code using @start_byte as a condition to switch
input/output buffer or finish the decompression is completely incorrect.

[FIX]
The fix involves several modification:

- Rename @start_byte to @dest_pgoff to properly express its meaning

- Use @sectorsize other than PAGE_SIZE to properly initialize the
  output buffer size

- Use correct destination offset inside the destination page

- Simplify the main loop
  Since the input/output buffer should never switch, we only need one
  zstd_decompress_stream() call.

- Consider early end as an error

After the fix, even on 64K page sized aarch64, above reflink now
works as expected:

  # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
  linked 4096/4096 bytes at offset 61440

And results the correct file layout:

	item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
		generation 10 transid 10 size 65536 nbytes 4096
		block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
		sequence 1 flags 0x0(none)
	item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
		index 3 namelen 4 name: dest
	item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
		location key (0 UNKNOWN.0 0) type XATTR
		transid 10 data_len 37 name_len 16
		name: security.selinux
		data unconfined_u:object_r:unlabeled_t:s0
	item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
		generation 10 type 1 (regular)
		extent data disk byte 13631488 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0 (none)

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/compression.h |  2 +-
 fs/btrfs/zstd.c        | 73 ++++++++++++------------------------------
 2 files changed, 22 insertions(+), 53 deletions(-)
---
Changelog:
v2:
- Fix the incorrect memcpy_page() parameter:
  Previously the pgoff is (dest_pgoff + to_copy), not just (dest_pgoff).
  This leads to possible write beyond the current page and can lead to
  either incorrect contents (if to_copy is smaller than 2K), or
  triggering the VM_BUG_ON() inside memcpy_to_page() as our write
  destination is beyond the page boundary.

Comments

Neal Gompa Jan. 23, 2024, 11:49 p.m. UTC | #1
On Mon, Jan 22, 2024 at 10:04 PM Qu Wenruo <wqu@suse.com> wrote:
>
> [BUG]
> If we have a filesystem with 4k sectorsize, and an inlined compressed
> extent created like this:
>
>         item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
>                 generation 8 transid 8 size 4096 nbytes 4096
>                 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>                 sequence 1 flags 0x0(none)
>         item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
>                 index 2 namelen 14 name: source_inlined
>         item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
>                 generation 8 type 0 (inline)
>                 inline extent data size 48 ram_bytes 4096 compression 3 (zstd)
>
> Then trying to reflink that extent in an aarch64 system with 64K page
> size, the reflink would just fail:
>
>   # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
>   XFS_IOC_CLONE_RANGE: Input/output error
>
> [CAUSE]
> In zstd_decompress(), we didn't treat @start_byte as just a page offset,
> but also use it as an indicator on whether we should error out, without
> any proper explanation (this is copied from other decompression code).
>
> In reality, for subpage cases, although @start_byte can be non-zero,
> we should never switch input/output buffer nor error out, since the whole
> input/output buffer should never exceed one sector, thus we should not
> need to do any buffer switch.
>
> Thus the current code using @start_byte as a condition to switch
> input/output buffer or finish the decompression is completely incorrect.
>
> [FIX]
> The fix involves several modification:
>
> - Rename @start_byte to @dest_pgoff to properly express its meaning
>
> - Use @sectorsize other than PAGE_SIZE to properly initialize the
>   output buffer size
>
> - Use correct destination offset inside the destination page
>
> - Simplify the main loop
>   Since the input/output buffer should never switch, we only need one
>   zstd_decompress_stream() call.
>
> - Consider early end as an error
>
> After the fix, even on 64K page sized aarch64, above reflink now
> works as expected:
>
>   # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
>   linked 4096/4096 bytes at offset 61440
>
> And results the correct file layout:
>
>         item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
>                 generation 10 transid 10 size 65536 nbytes 4096
>                 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>                 sequence 1 flags 0x0(none)
>         item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
>                 index 3 namelen 4 name: dest
>         item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
>                 location key (0 UNKNOWN.0 0) type XATTR
>                 transid 10 data_len 37 name_len 16
>                 name: security.selinux
>                 data unconfined_u:object_r:unlabeled_t:s0
>         item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
>                 generation 10 type 1 (regular)
>                 extent data disk byte 13631488 nr 4096
>                 extent data offset 0 nr 4096 ram 4096
>                 extent compression 0 (none)
>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/compression.h |  2 +-
>  fs/btrfs/zstd.c        | 73 ++++++++++++------------------------------
>  2 files changed, 22 insertions(+), 53 deletions(-)
> ---
> Changelog:
> v2:
> - Fix the incorrect memcpy_page() parameter:
>   Previously the pgoff is (dest_pgoff + to_copy), not just (dest_pgoff).
>   This leads to possible write beyond the current page and can lead to
>   either incorrect contents (if to_copy is smaller than 2K), or
>   triggering the VM_BUG_ON() inside memcpy_to_page() as our write
>   destination is beyond the page boundary.
>

Have we checked to see if this problem shows up in the other
compression algorithms? I know the change was reverted for btrfs-zstd
because Linus saw the issue directly[1], but I'm concerned that the
only reason he saw it is because Fedora uses zstd by default[2] and
this might be an issue in the other algorithms.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e01a83e12604aa2f8d4ab359ec44e341a2248b4a
[2]: https://fedoraproject.org/wiki/Changes/BtrfsTransparentCompression
Qu Wenruo Jan. 24, 2024, 12:07 a.m. UTC | #2
On 2024/1/24 10:19, Neal Gompa wrote:
> On Mon, Jan 22, 2024 at 10:04 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BUG]
>> If we have a filesystem with 4k sectorsize, and an inlined compressed
>> extent created like this:
>>
>>          item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
>>                  generation 8 transid 8 size 4096 nbytes 4096
>>                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>>                  sequence 1 flags 0x0(none)
>>          item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
>>                  index 2 namelen 14 name: source_inlined
>>          item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
>>                  generation 8 type 0 (inline)
>>                  inline extent data size 48 ram_bytes 4096 compression 3 (zstd)
>>
>> Then trying to reflink that extent in an aarch64 system with 64K page
>> size, the reflink would just fail:
>>
>>    # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
>>    XFS_IOC_CLONE_RANGE: Input/output error
>>
>> [CAUSE]
>> In zstd_decompress(), we didn't treat @start_byte as just a page offset,
>> but also use it as an indicator on whether we should error out, without
>> any proper explanation (this is copied from other decompression code).
>>
>> In reality, for subpage cases, although @start_byte can be non-zero,
>> we should never switch input/output buffer nor error out, since the whole
>> input/output buffer should never exceed one sector, thus we should not
>> need to do any buffer switch.
>>
>> Thus the current code using @start_byte as a condition to switch
>> input/output buffer or finish the decompression is completely incorrect.
>>
>> [FIX]
>> The fix involves several modification:
>>
>> - Rename @start_byte to @dest_pgoff to properly express its meaning
>>
>> - Use @sectorsize other than PAGE_SIZE to properly initialize the
>>    output buffer size
>>
>> - Use correct destination offset inside the destination page
>>
>> - Simplify the main loop
>>    Since the input/output buffer should never switch, we only need one
>>    zstd_decompress_stream() call.
>>
>> - Consider early end as an error
>>
>> After the fix, even on 64K page sized aarch64, above reflink now
>> works as expected:
>>
>>    # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
>>    linked 4096/4096 bytes at offset 61440
>>
>> And results the correct file layout:
>>
>>          item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
>>                  generation 10 transid 10 size 65536 nbytes 4096
>>                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
>>                  sequence 1 flags 0x0(none)
>>          item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
>>                  index 3 namelen 4 name: dest
>>          item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
>>                  location key (0 UNKNOWN.0 0) type XATTR
>>                  transid 10 data_len 37 name_len 16
>>                  name: security.selinux
>>                  data unconfined_u:object_r:unlabeled_t:s0
>>          item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
>>                  generation 10 type 1 (regular)
>>                  extent data disk byte 13631488 nr 4096
>>                  extent data offset 0 nr 4096 ram 4096
>>                  extent compression 0 (none)
>>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/compression.h |  2 +-
>>   fs/btrfs/zstd.c        | 73 ++++++++++++------------------------------
>>   2 files changed, 22 insertions(+), 53 deletions(-)
>> ---
>> Changelog:
>> v2:
>> - Fix the incorrect memcpy_page() parameter:
>>    Previously the pgoff is (dest_pgoff + to_copy), not just (dest_pgoff).
>>    This leads to possible write beyond the current page and can lead to
>>    either incorrect contents (if to_copy is smaller than 2K), or
>>    triggering the VM_BUG_ON() inside memcpy_to_page() as our write
>>    destination is beyond the page boundary.
>>
>
> Have we checked to see if this problem shows up in the other
> compression algorithms? I know the change was reverted for btrfs-zstd
> because Linus saw the issue directly[1], but I'm concerned that the
> only reason he saw it is because Fedora uses zstd by default[2] and
> this might be an issue in the other algorithms.

Already checked, and new test case is also added (which you have also
replied).

As replied in the other thread, zstd is really the exception.

Thanks,
Qu

>
> [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e01a83e12604aa2f8d4ab359ec44e341a2248b4a
> [2]: https://fedoraproject.org/wiki/Changes/BtrfsTransparentCompression
>
Neal Gompa Jan. 24, 2024, 1:08 a.m. UTC | #3
On Tue, Jan 23, 2024 at 7:07 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2024/1/24 10:19, Neal Gompa wrote:
> > On Mon, Jan 22, 2024 at 10:04 PM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >> [BUG]
> >> If we have a filesystem with 4k sectorsize, and an inlined compressed
> >> extent created like this:
> >>
> >>          item 4 key (257 INODE_ITEM 0) itemoff 15863 itemsize 160
> >>                  generation 8 transid 8 size 4096 nbytes 4096
> >>                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
> >>                  sequence 1 flags 0x0(none)
> >>          item 5 key (257 INODE_REF 256) itemoff 15839 itemsize 24
> >>                  index 2 namelen 14 name: source_inlined
> >>          item 6 key (257 EXTENT_DATA 0) itemoff 15770 itemsize 69
> >>                  generation 8 type 0 (inline)
> >>                  inline extent data size 48 ram_bytes 4096 compression 3 (zstd)
> >>
> >> Then trying to reflink that extent in an aarch64 system with 64K page
> >> size, the reflink would just fail:
> >>
> >>    # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
> >>    XFS_IOC_CLONE_RANGE: Input/output error
> >>
> >> [CAUSE]
> >> In zstd_decompress(), we didn't treat @start_byte as just a page offset,
> >> but also use it as an indicator on whether we should error out, without
> >> any proper explanation (this is copied from other decompression code).
> >>
> >> In reality, for subpage cases, although @start_byte can be non-zero,
> >> we should never switch input/output buffer nor error out, since the whole
> >> input/output buffer should never exceed one sector, thus we should not
> >> need to do any buffer switch.
> >>
> >> Thus the current code using @start_byte as a condition to switch
> >> input/output buffer or finish the decompression is completely incorrect.
> >>
> >> [FIX]
> >> The fix involves several modification:
> >>
> >> - Rename @start_byte to @dest_pgoff to properly express its meaning
> >>
> >> - Use @sectorsize other than PAGE_SIZE to properly initialize the
> >>    output buffer size
> >>
> >> - Use correct destination offset inside the destination page
> >>
> >> - Simplify the main loop
> >>    Since the input/output buffer should never switch, we only need one
> >>    zstd_decompress_stream() call.
> >>
> >> - Consider early end as an error
> >>
> >> After the fix, even on 64K page sized aarch64, above reflink now
> >> works as expected:
> >>
> >>    # xfs_io -f -c "reflink $mnt/source_inlined 0 60k 4k" $mnt/dest
> >>    linked 4096/4096 bytes at offset 61440
> >>
> >> And results the correct file layout:
> >>
> >>          item 9 key (258 INODE_ITEM 0) itemoff 15542 itemsize 160
> >>                  generation 10 transid 10 size 65536 nbytes 4096
> >>                  block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
> >>                  sequence 1 flags 0x0(none)
> >>          item 10 key (258 INODE_REF 256) itemoff 15528 itemsize 14
> >>                  index 3 namelen 4 name: dest
> >>          item 11 key (258 XATTR_ITEM 3817753667) itemoff 15445 itemsize 83
> >>                  location key (0 UNKNOWN.0 0) type XATTR
> >>                  transid 10 data_len 37 name_len 16
> >>                  name: security.selinux
> >>                  data unconfined_u:object_r:unlabeled_t:s0
> >>          item 12 key (258 EXTENT_DATA 61440) itemoff 15392 itemsize 53
> >>                  generation 10 type 1 (regular)
> >>                  extent data disk byte 13631488 nr 4096
> >>                  extent data offset 0 nr 4096 ram 4096
> >>                  extent compression 0 (none)
> >>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >>   fs/btrfs/compression.h |  2 +-
> >>   fs/btrfs/zstd.c        | 73 ++++++++++++------------------------------
> >>   2 files changed, 22 insertions(+), 53 deletions(-)
> >> ---
> >> Changelog:
> >> v2:
> >> - Fix the incorrect memcpy_page() parameter:
> >>    Previously the pgoff is (dest_pgoff + to_copy), not just (dest_pgoff).
> >>    This leads to possible write beyond the current page and can lead to
> >>    either incorrect contents (if to_copy is smaller than 2K), or
> >>    triggering the VM_BUG_ON() inside memcpy_to_page() as our write
> >>    destination is beyond the page boundary.
> >>
> >
> > Have we checked to see if this problem shows up in the other
> > compression algorithms? I know the change was reverted for btrfs-zstd
> > because Linus saw the issue directly[1], but I'm concerned that the
> > only reason he saw it is because Fedora uses zstd by default[2] and
> > this might be an issue in the other algorithms.
>
> Already checked, and new test case is also added (which you have also
> replied).
>
> As replied in the other thread, zstd is really the exception.
>

Then we should be gravy here. Hopefully this time nothing else pops up! ^_^;

Reviewed-by: Neal Gompa <neal@gompa.dev>
diff mbox series

Patch

diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index afd7e50d073d..97fe3ebf11a2 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -169,7 +169,7 @@  int zstd_compress_pages(struct list_head *ws, struct address_space *mapping,
 		unsigned long *total_in, unsigned long *total_out);
 int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb);
 int zstd_decompress(struct list_head *ws, const u8 *data_in,
-		struct page *dest_page, unsigned long start_byte, size_t srclen,
+		struct page *dest_page, unsigned long dest_pgoff, size_t srclen,
 		size_t destlen);
 void zstd_init_workspace_manager(void);
 void zstd_cleanup_workspace_manager(void);
diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index 0d66db8bc1d4..7cfe9c06c795 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -20,6 +20,7 @@ 
 #include "misc.h"
 #include "compression.h"
 #include "ctree.h"
+#include "super.h"
 
 #define ZSTD_BTRFS_MAX_WINDOWLOG 17
 #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG)
@@ -618,80 +619,48 @@  int zstd_decompress_bio(struct list_head *ws, struct compressed_bio *cb)
 }
 
 int zstd_decompress(struct list_head *ws, const u8 *data_in,
-		struct page *dest_page, unsigned long start_byte, size_t srclen,
+		struct page *dest_page, unsigned long dest_pgoff, size_t srclen,
 		size_t destlen)
 {
 	struct workspace *workspace = list_entry(ws, struct workspace, list);
+	struct btrfs_fs_info *fs_info = btrfs_sb(dest_page->mapping->host->i_sb);
+	const u32 sectorsize = fs_info->sectorsize;
 	zstd_dstream *stream;
 	int ret = 0;
-	size_t ret2;
-	unsigned long total_out = 0;
-	unsigned long pg_offset = 0;
+	unsigned long to_copy = 0;
 
 	stream = zstd_init_dstream(
 			ZSTD_BTRFS_MAX_INPUT, workspace->mem, workspace->size);
 	if (!stream) {
 		pr_warn("BTRFS: zstd_init_dstream failed\n");
-		ret = -EIO;
 		goto finish;
 	}
 
-	destlen = min_t(size_t, destlen, PAGE_SIZE);
-
 	workspace->in_buf.src = data_in;
 	workspace->in_buf.pos = 0;
 	workspace->in_buf.size = srclen;
 
 	workspace->out_buf.dst = workspace->buf;
 	workspace->out_buf.pos = 0;
-	workspace->out_buf.size = PAGE_SIZE;
+	workspace->out_buf.size = sectorsize;
 
-	ret2 = 1;
-	while (pg_offset < destlen
-	       && workspace->in_buf.pos < workspace->in_buf.size) {
-		unsigned long buf_start;
-		unsigned long buf_offset;
-		unsigned long bytes;
-
-		/* Check if the frame is over and we still need more input */
-		if (ret2 == 0) {
-			pr_debug("BTRFS: zstd_decompress_stream ended early\n");
-			ret = -EIO;
-			goto finish;
-		}
-		ret2 = zstd_decompress_stream(stream, &workspace->out_buf,
-				&workspace->in_buf);
-		if (zstd_is_error(ret2)) {
-			pr_debug("BTRFS: zstd_decompress_stream returned %d\n",
-					zstd_get_error_code(ret2));
-			ret = -EIO;
-			goto finish;
-		}
-
-		buf_start = total_out;
-		total_out += workspace->out_buf.pos;
-		workspace->out_buf.pos = 0;
-
-		if (total_out <= start_byte)
-			continue;
-
-		if (total_out > start_byte && buf_start < start_byte)
-			buf_offset = start_byte - buf_start;
-		else
-			buf_offset = 0;
-
-		bytes = min_t(unsigned long, destlen - pg_offset,
-				workspace->out_buf.size - buf_offset);
-
-		memcpy_to_page(dest_page, pg_offset,
-			       workspace->out_buf.dst + buf_offset, bytes);
-
-		pg_offset += bytes;
+	/*
+	 * Since both input and output buffers should not exceed one sector,
+	 * one call should end the decompression.
+	 */
+	ret = zstd_decompress_stream(stream, &workspace->out_buf, &workspace->in_buf);
+	if (zstd_is_error(ret)) {
+		pr_warn_ratelimited("BTRFS: zstd_decompress_stream return %d\n",
+				    zstd_get_error_code(ret));
+		goto finish;
 	}
-	ret = 0;
+	to_copy = workspace->out_buf.pos;
+	memcpy_to_page(dest_page, dest_pgoff, workspace->out_buf.dst, to_copy);
 finish:
-	if (pg_offset < destlen) {
-		memzero_page(dest_page, pg_offset, destlen - pg_offset);
+	/* Error or early end. */
+	if (unlikely(to_copy < destlen)) {
+		ret = -EIO;
+		memzero_page(dest_page, dest_pgoff + to_copy, destlen - to_copy);
 	}
 	return ret;
 }