mbox series

[GIT,PULL] Btrfs fixes for 6.8-rc2

Message ID cover.1705946889.git.dsterba@suse.com (mailing list archive)
State New, archived
Headers show
Series [GIT,PULL] Btrfs fixes for 6.8-rc2 | expand

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag

Message

David Sterba Jan. 22, 2024, 6:33 p.m. UTC
Hi,

please pull the following fixes. Thanks.

- zoned mode fixes:
  - fix slowdown when writing large file sequentially by looking up
    block groups with enough space faster
  - locking fixes when activating a zone

- new mount API fixes:
  - preserve mount options for a ro/rw mount of the same subvolume

- scrub fixes:
  - fix use-after-free in case the chunk length is not aligned to 64K,
    this does not happen normally but has been reported on images
    converted from ext4
  - similar alignment check was missing with raid-stripe-tree

- subvolume deletion fixes:
  - prevent calling ioctl on already deleted subvolume
  - properly track flag tracking a deleted subvolume

- in subpage mode, fix decompression of an inline extent (zlib, lzo, zstd)

- fix crash when starting writeback on a folio, after integration with
  recent MM changes this needs to be started conditionally

- reject unknown flags in defrag ioctl

- error handling, API fixes, minor warning fixes

----------------------------------------------------------------
The following changes since commit e94dfb7a2935cb91faca88bf7136177d1ce0dda8:

  btrfs: pass btrfs_io_geometry into btrfs_max_io_len (2023-12-15 23:03:59 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag

for you to fetch changes up to 7f2d219e78e95a137a9c76fddac7ff8228260439:

  btrfs: scrub: limit RST scrub to chunk boundary (2024-01-18 23:43:08 +0100)

----------------------------------------------------------------
Chung-Chiang Cheng (1):
      btrfs: tree-checker: fix inline ref size in error messages

David Sterba (1):
      btrfs: don't warn if discard range is not aligned to sector

Dmitry Antipov (1):
      btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send()

Fedor Pchelkin (1):
      btrfs: ref-verify: free ref cache before clearing mount opt

Josef Bacik (2):
      btrfs: use the original mount's mount options for the legacy reconfigure
      btrfs: don't unconditionally call folio_start_writeback in subpage

Naohiro Aota (4):
      btrfs: zoned: factor out prepare_allocation_zoned()
      btrfs: zoned: optimize hint byte for zoned allocator
      btrfs: fix unbalanced unlock of mapping_tree_lock
      btrfs: zoned: fix lock ordering in btrfs_zone_activate()

Omar Sandoval (2):
      btrfs: don't abort filesystem when attempting to snapshot deleted subvolume
      btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted

Qu Wenruo (6):
      btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args
      btrfs: zlib: fix and simplify the inline extent decompression
      btrfs: lzo: fix and simplify the inline extent decompression
      btrfs: zstd: fix and simplify the inline extent decompression
      btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
      btrfs: scrub: limit RST scrub to chunk boundary

 fs/btrfs/compression.c     | 23 ++++++++++-----
 fs/btrfs/compression.h     |  6 ++--
 fs/btrfs/extent-tree.c     | 53 ++++++++++++++++++++++++---------
 fs/btrfs/inode.c           | 22 ++++++++------
 fs/btrfs/ioctl.c           |  7 +++++
 fs/btrfs/lzo.c             | 34 ++++++---------------
 fs/btrfs/ref-verify.c      |  6 ++--
 fs/btrfs/scrub.c           | 36 ++++++++++++++++++-----
 fs/btrfs/send.c            |  4 +--
 fs/btrfs/subpage.c         |  3 +-
 fs/btrfs/super.c           |  8 +++++
 fs/btrfs/tree-checker.c    |  2 +-
 fs/btrfs/volumes.c         |  2 --
 fs/btrfs/zlib.c            | 73 ++++++++++++----------------------------------
 fs/btrfs/zoned.c           |  8 ++---
 fs/btrfs/zstd.c            | 73 +++++++++++++---------------------------------
 include/uapi/linux/btrfs.h |  3 ++
 17 files changed, 178 insertions(+), 185 deletions(-)

Comments

pr-tracker-bot@kernel.org Jan. 22, 2024, 9:44 p.m. UTC | #1
The pull request you sent on Mon, 22 Jan 2024 19:33:44 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/5d9248eed48054bf26b3d5ad3d7073a356a17d19

Thank you!
Linus Torvalds Jan. 22, 2024, 10:34 p.m. UTC | #2
On Mon, 22 Jan 2024 at 10:34, David Sterba <dsterba@suse.com> wrote:
>
> please pull the following fixes.

Bah. These fixes are garbage. Now my machine doesn't even boot. I'm
bisecting, but it's between

good: e94dfb7a2935 ("btrfs: pass btrfs_io_geometry into btrfs_max_io_len")

bad: f398e70dd69e ("btrfs: tree-checker: fix inline ref size in error messages")

Not ok.

               Linus
Linus Torvalds Jan. 22, 2024, 10:54 p.m. UTC | #3
On Mon, 22 Jan 2024 at 14:34, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Bah. These fixes are garbage. Now my machine doesn't even boot. I'm
> bisecting

My bisection says

   1e7f6def8b2370ecefb54b3c8f390ff894b0c51b is the first bad commit

but I'll still have to verify by testing the revert on top of my current tree.

It did revert cleanly, but I also note that if the zstd case is wrong,
I assume the other very similar commits (for zlib and lzo) are
potentially also wrong.

Let me reboot to verify that at least my machine boots.

              Linus
Linus Torvalds Jan. 22, 2024, 11:01 p.m. UTC | #4
On Mon, 22 Jan 2024 at 14:54, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Let me reboot to verify that at least my machine boots.

My tree with that commit reverted does indeed boot:

  Revert "btrfs: zstd: fix and simplify the inline extent decompression"

is working ok for me.

I do not think I have anything odd in my Kconfig, and I didn't see any
messages, and there is nothing logged either - just a hang at boot.

                Linus
David Sterba Jan. 22, 2024, 11:05 p.m. UTC | #5
On Mon, Jan 22, 2024 at 02:54:31PM -0800, Linus Torvalds wrote:
> On Mon, 22 Jan 2024 at 14:34, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Bah. These fixes are garbage. Now my machine doesn't even boot. I'm
> > bisecting

Ah, sorry.

> My bisection says
> 
>    1e7f6def8b2370ecefb54b3c8f390ff894b0c51b is the first bad commit

We got a report today [1] that this commit is indeed bad,

https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/

the timing was also unfortuate and too late to recall the pull request.

> but I'll still have to verify by testing the revert on top of my current tree.
> 
> It did revert cleanly, but I also note that if the zstd case is wrong,
> I assume the other very similar commits (for zlib and lzo) are
> potentially also wrong.
> 
> Let me reboot to verify that at least my machine boots.

Per the report revert makes it work again and zlib and lzo cases are not
affected.

I can send a pull request reverting all the three until we figure out
what's wrong, or you can do it as all revert cleanly.
Qu Wenruo Jan. 22, 2024, 11:27 p.m. UTC | #6
On 2024/1/23 09:35, David Sterba wrote:
> On Mon, Jan 22, 2024 at 02:54:31PM -0800, Linus Torvalds wrote:
>> On Mon, 22 Jan 2024 at 14:34, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>> Bah. These fixes are garbage. Now my machine doesn't even boot. I'm
>>> bisecting
>
> Ah, sorry.
>
>> My bisection says
>>
>>     1e7f6def8b2370ecefb54b3c8f390ff894b0c51b is the first bad commit
>
> We got a report today [1] that this commit is indeed bad,
>
> https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/
>
> the timing was also unfortuate and too late to recall the pull request.

All my fault.

The offending line is:

+	memcpy_to_page(dest_page, dest_pgoff + to_copy, workspace->out_buf.dst,
+		       to_copy);

I'm using the bad pg_off for the memcpy_to_page() call.
And zstd is the only affected algo.

All the other algos go like this:

+	memcpy_to_page(dest_page, dest_pgoff, workspace->buf, out_len);

So that's why it's screwing up the zstd compressed inline extent
decompression, as we can easily write beyond the page boundary and write
into the next innocent page.

And the existing compression group didn't catch it at all.

Would fix it and add new test cases for the regression.

Thanks,
Qu
>
>> but I'll still have to verify by testing the revert on top of my current tree.
>>
>> It did revert cleanly, but I also note that if the zstd case is wrong,
>> I assume the other very similar commits (for zlib and lzo) are
>> potentially also wrong.
>>
>> Let me reboot to verify that at least my machine boots.
>
> Per the report revert makes it work again and zlib and lzo cases are not
> affected.
>
> I can send a pull request reverting all the three until we figure out
> what's wrong, or you can do it as all revert cleanly.
>
Linus Torvalds Jan. 26, 2024, 7:25 p.m. UTC | #7
On Mon, 22 Jan 2024 at 10:34, David Sterba <dsterba@suse.com> wrote:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag

I have no idea if this is related to the new fixes, but I have never
seen it before:

  BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
  BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
  BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
  SELinux: inode_doinit_use_xattr:  getxattr returned 117 for dev=dm-0
ino=5737268
  SELinux: inode_doinit_use_xattr:  getxattr returned 117 for dev=dm-0
ino=5737267

and it caused an actual warning to be printed for my kernel tree from 'git':

   error: failed to stat 'sound/pci/ice1712/se.c': Structure needs cleaning

(and yes, 117 is EUCLEAN, aka "Structure needs cleaning")

The problem seems to have self-corrected, because it didn't happen
when repeating the command, and that file that failed to stat looks
perfectly fine.

But it is clearly worrisome.

The "owner mismatch" check isn't new - it was added back in 5.19 in
commit 88c602ab4460 ("btrfs: tree-checker: check extent buffer owner
against owner rootid"). So something else must have changed to trigger
it.

           Linus
David Sterba Jan. 26, 2024, 8 p.m. UTC | #8
On Fri, Jan 26, 2024 at 11:25:19AM -0800, Linus Torvalds wrote:
> On Mon, 22 Jan 2024 at 10:34, David Sterba <dsterba@suse.com> wrote:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag
> 
> I have no idea if this is related to the new fixes, but I have never
> seen it before:
> 
>   BTRFS critical (device dm-0): corrupted node, root=256
> block=8550954455682405139 owner mismatch, have 11858205567642294356
> expect [256, 18446744073709551360]

Whick check: the numbers don't match constaints, blocks must be 4096
aligned

hex(8550954455682405139) = 0x76ab17c5c57e5313
(would have to end with 3 zeros)

and owner is sequentially increased such a high number is unlikely to be
reached on nowadays systems.

>   BTRFS critical (device dm-0): corrupted node, root=256
> block=8550954455682405139 owner mismatch, have 11858205567642294356
> expect [256, 18446744073709551360]
>   BTRFS critical (device dm-0): corrupted node, root=256
> block=8550954455682405139 owner mismatch, have 11858205567642294356
> expect [256, 18446744073709551360]
>   SELinux: inode_doinit_use_xattr:  getxattr returned 117 for dev=dm-0
> ino=5737268
>   SELinux: inode_doinit_use_xattr:  getxattr returned 117 for dev=dm-0
> ino=5737267
> 
> and it caused an actual warning to be printed for my kernel tree from 'git':
> 
>    error: failed to stat 'sound/pci/ice1712/se.c': Structure needs cleaning

Size of sound/pci/ice1712/se.c is 19875, this does not look like it
would be caused by the zstd bug as it was for inlined files where the
limit is 2048 bytes.

> (and yes, 117 is EUCLEAN, aka "Structure needs cleaning")
> 
> The problem seems to have self-corrected, because it didn't happen
> when repeating the command, and that file that failed to stat looks
> perfectly fine.
> 
> But it is clearly worrisome.
> 
> The "owner mismatch" check isn't new - it was added back in 5.19 in
> commit 88c602ab4460 ("btrfs: tree-checker: check extent buffer owner
> against owner rootid"). So something else must have changed to trigger
> it.

This looks like garbage data got read from disk, yet still passing the
checksum test (otherwise that would lead to an EIO and would not get to
the tree-checker).  Most likely cause is that damaged data were written
to the disk before.

The tree-checker also verifies data before they get written, the same
bogus values of block/owner would trigger it so you'd see a warning.

It's not possible to get the data damaged on the way from the device to
the filesystem, that would not pass the checksum, so unlikely a
driver/block layer bug.

What remains that the data were correctly written in the past, read
correctly passing checksum test but then somehow got damaged in the
memory between the checksum check and tree-checker. The window is quite
short:

disk-io.c:btrfs_validate_extent_buffer()

between csum_tree_block() (around line 397) and
        btrfs_check_node() (around line 464)
Qu Wenruo Jan. 26, 2024, 9:05 p.m. UTC | #9
On 2024/1/27 05:55, Linus Torvalds wrote:
> On Mon, 22 Jan 2024 at 10:34, David Sterba <dsterba@suse.com> wrote:
>>
>>    git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag
>
> I have no idea if this is related to the new fixes, but I have never
> seen it before:
>
>    BTRFS critical (device dm-0): corrupted node, root=256
> block=8550954455682405139 owner mismatch, have 11858205567642294356
> expect [256, 18446744073709551360]
>    BTRFS critical (device dm-0): corrupted node, root=256
> block=8550954455682405139 owner mismatch, have 11858205567642294356
> expect [256, 18446744073709551360]
>    BTRFS critical (device dm-0): corrupted node, root=256
> block=8550954455682405139 owner mismatch, have 11858205567642294356
> expect [256, 18446744073709551360]

This is triggered during a btrfs btree search, for XATTR read.

The root=256 means the tree search operation is triggered from subvolume
256, which is completely sane.

But the other number, 11858205567642294356, which is still inside the
allowed subvolume range, but beyond the 0 level qgroup, thus it makes
is_fstree() return false, and triggered the error.

Normally we should not have this many subvolumes, and since 2015 we
already has such check against subvolume creation.

But I don't really believe that's the case, unless there are really that
many snapshots/subvolumes in the fs (beyond 1 << 48 snapshots)

>    SELinux: inode_doinit_use_xattr:  getxattr returned 117 for dev=dm-0
> ino=5737268
>    SELinux: inode_doinit_use_xattr:  getxattr returned 117 for dev=dm-0
> ino=5737267
>
> and it caused an actual warning to be printed for my kernel tree from 'git':
>
>     error: failed to stat 'sound/pci/ice1712/se.c': Structure needs cleaning
>
> (and yes, 117 is EUCLEAN, aka "Structure needs cleaning")
>
> The problem seems to have self-corrected, because it didn't happen
> when repeating the command, and that file that failed to stat looks
> perfectly fine.

I guess since the error is self-corrected, there is no tree dump of
block 8550954455682405139 just after the error.

Personally speaking the number 11858205567642294356 is really too large,
so that it looks like some runtime garbage.
I don't have any clue for now, but if you hit it again, you may want to
run "btrfs ins dump-tree -b <bytenr> <device>" to dump the content.

Meanwhile I'll make kernel tree-checker to dump the content of the
offending tree block so that it's easier to debug.

>
> But it is clearly worrisome.
>
> The "owner mismatch" check isn't new - it was added back in 5.19 in
> commit 88c602ab4460 ("btrfs: tree-checker: check extent buffer owner
> against owner rootid"). So something else must have changed to trigger
> it.

Anyway I'll keep an eye on the situation.

Thanks,
Qu

>
>             Linus
>
Qu Wenruo Jan. 26, 2024, 9:39 p.m. UTC | #10
On 2024/1/27 06:30, David Sterba wrote:
> On Fri, Jan 26, 2024 at 11:25:19AM -0800, Linus Torvalds wrote:
>> On Mon, 22 Jan 2024 at 10:34, David Sterba <dsterba@suse.com> wrote:
>>>
>>>    git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag
>>
>> I have no idea if this is related to the new fixes, but I have never
>> seen it before:
>>
>>    BTRFS critical (device dm-0): corrupted node, root=256
>> block=8550954455682405139 owner mismatch, have 11858205567642294356
>> expect [256, 18446744073709551360]
> 
> Whick check: the numbers don't match constaints, blocks must be 4096
> aligned
> 
> hex(8550954455682405139) = 0x76ab17c5c57e5313
> (would have to end with 3 zeros)

Oh, I forgot the most obvious problem.

This means the extent buffer is full of garbage.

And this is means the content of the eb got some sudden change.
As during extent buffer read, we have a very basic eb bytenr check 
against the eb read from disk.

So it means at that point, at least we got something correct, and can 
even pass csum checks.

But later one, the pages of the eb got corrupted and we got garbarge.

(AKA, my previous analyse is totally wrong, but it may still inspire me 
to add extra check on eb_owner to reject higher level qgroup subvolumes)


What's the page size of the system? 4K or 16K or 64K?

For the first 2, the eb handling would go the regular path as 4K page 
sized systems (as 16K is the default nodesize, nodesize >= PAGE_SIZE 
would share the same page based eb handling).

For 64K, it would go full subpage, but I doubt it, as the Apple M series 
CPUs do not seem to support 64K page size.

Thanks,
Qu
> 
> and owner is sequentially increased such a high number is unlikely to be
> reached on nowadays systems.
> 
>>    BTRFS critical (device dm-0): corrupted node, root=256
>> block=8550954455682405139 owner mismatch, have 11858205567642294356
>> expect [256, 18446744073709551360]
>>    BTRFS critical (device dm-0): corrupted node, root=256
>> block=8550954455682405139 owner mismatch, have 11858205567642294356
>> expect [256, 18446744073709551360]
>>    SELinux: inode_doinit_use_xattr:  getxattr returned 117 for dev=dm-0
>> ino=5737268
>>    SELinux: inode_doinit_use_xattr:  getxattr returned 117 for dev=dm-0
>> ino=5737267
>>
>> and it caused an actual warning to be printed for my kernel tree from 'git':
>>
>>     error: failed to stat 'sound/pci/ice1712/se.c': Structure needs cleaning
> 
> Size of sound/pci/ice1712/se.c is 19875, this does not look like it
> would be caused by the zstd bug as it was for inlined files where the
> limit is 2048 bytes.
> 
>> (and yes, 117 is EUCLEAN, aka "Structure needs cleaning")
>>
>> The problem seems to have self-corrected, because it didn't happen
>> when repeating the command, and that file that failed to stat looks
>> perfectly fine.
>>
>> But it is clearly worrisome.
>>
>> The "owner mismatch" check isn't new - it was added back in 5.19 in
>> commit 88c602ab4460 ("btrfs: tree-checker: check extent buffer owner
>> against owner rootid"). So something else must have changed to trigger
>> it.
> 
> This looks like garbage data got read from disk, yet still passing the
> checksum test (otherwise that would lead to an EIO and would not get to
> the tree-checker).  Most likely cause is that damaged data were written
> to the disk before.
> 
> The tree-checker also verifies data before they get written, the same
> bogus values of block/owner would trigger it so you'd see a warning.
> 
> It's not possible to get the data damaged on the way from the device to
> the filesystem, that would not pass the checksum, so unlikely a
> driver/block layer bug.
> 
> What remains that the data were correctly written in the past, read
> correctly passing checksum test but then somehow got damaged in the
> memory between the checksum check and tree-checker. The window is quite
> short:
> 
> disk-io.c:btrfs_validate_extent_buffer()
> 
> between csum_tree_block() (around line 397) and
>          btrfs_check_node() (around line 464)
Linus Torvalds Jan. 26, 2024, 9:51 p.m. UTC | #11
On Fri, 26 Jan 2024 at 13:39, Qu Wenruo <wqu@suse.com> wrote:
>
> Oh, I forgot the most obvious problem.
>
> This means the extent buffer is full of garbage.

Allocation lifetime problems?

> What's the page size of the system? 4K or 16K or 64K?

This is a bog-standard x86-64 system. With 32 cores (and 64 threads),
but there's nothing remotely odd about it, except for the fact that
it's running a very recent kernel...

             Linus
Qu Wenruo Jan. 26, 2024, 9:56 p.m. UTC | #12
On 2024/1/27 08:21, Linus Torvalds wrote:
> On Fri, 26 Jan 2024 at 13:39, Qu Wenruo <wqu@suse.com> wrote:
>>
>> Oh, I forgot the most obvious problem.
>>
>> This means the extent buffer is full of garbage.
> 
> Allocation lifetime problems?

Could be, thus it may be better to output the flags of the first page 
for tree-checker.

Thanks,
Qu
> 
>> What's the page size of the system? 4K or 16K or 64K?
> 
> This is a bog-standard x86-64 system. With 32 cores (and 64 threads),
> but there's nothing remotely odd about it, except for the fact that
> it's running a very recent kernel...
> 
>               Linus
Linus Torvalds Jan. 26, 2024, 10:02 p.m. UTC | #13
On Fri, 26 Jan 2024 at 13:56, Qu Wenruo <wqu@suse.com> wrote:
>
> On 2024/1/27 08:21, Linus Torvalds wrote:
> >
> > Allocation lifetime problems?
>
> Could be, thus it may be better to output the flags of the first page
> for tree-checker.

Note that the fact that it magically went away certainly implies that
it never "really" existed, and that something was using a pointer or
similar.

IOW, this is not some IO that got scribbled over, or a cache that got
corrupted. If it had been real corruption, I would have expected that
it would have stayed around in memory.

                 Linus
Qu Wenruo Jan. 26, 2024, 10:05 p.m. UTC | #14
On 2024/1/27 08:32, Linus Torvalds wrote:
> On Fri, 26 Jan 2024 at 13:56, Qu Wenruo <wqu@suse.com> wrote:
>>
>> On 2024/1/27 08:21, Linus Torvalds wrote:
>>>
>>> Allocation lifetime problems?
>>
>> Could be, thus it may be better to output the flags of the first page
>> for tree-checker.
>
> Note that the fact that it magically went away certainly implies that
> it never "really" existed, and that something was using a pointer or
> similar.
>
> IOW, this is not some IO that got scribbled over, or a cache that got
> corrupted. If it had been real corruption, I would have expected that
> it would have stayed around in memory.

Yep, thus it makes sense to show the page status of an eb.

It could be some race that the eb pages are not properly hold, thus its
content changed unexpectedly.

Thanks,
Qu

>
>                   Linus
>