Message ID | cover.1705946889.git.dsterba@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [GIT,PULL] Btrfs fixes for 6.8-rc2 | expand |
The pull request you sent on Mon, 22 Jan 2024 19:33:44 +0100:
> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag
has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/5d9248eed48054bf26b3d5ad3d7073a356a17d19
Thank you!
On Mon, 22 Jan 2024 at 10:34, David Sterba <dsterba@suse.com> wrote: > > please pull the following fixes. Bah. These fixes are garbage. Now my machine doesn't even boot. I'm bisecting, but it's between good: e94dfb7a2935 ("btrfs: pass btrfs_io_geometry into btrfs_max_io_len") bad: f398e70dd69e ("btrfs: tree-checker: fix inline ref size in error messages") Not ok. Linus
On Mon, 22 Jan 2024 at 14:34, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Bah. These fixes are garbage. Now my machine doesn't even boot. I'm > bisecting My bisection says 1e7f6def8b2370ecefb54b3c8f390ff894b0c51b is the first bad commit but I'll still have to verify by testing the revert on top of my current tree. It did revert cleanly, but I also note that if the zstd case is wrong, I assume the other very similar commits (for zlib and lzo) are potentially also wrong. Let me reboot to verify that at least my machine boots. Linus
On Mon, 22 Jan 2024 at 14:54, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Let me reboot to verify that at least my machine boots. My tree with that commit reverted does indeed boot: Revert "btrfs: zstd: fix and simplify the inline extent decompression" is working ok for me. I do not think I have anything odd in my Kconfig, and I didn't see any messages, and there is nothing logged either - just a hang at boot. Linus
On Mon, Jan 22, 2024 at 02:54:31PM -0800, Linus Torvalds wrote: > On Mon, 22 Jan 2024 at 14:34, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Bah. These fixes are garbage. Now my machine doesn't even boot. I'm > > bisecting Ah, sorry. > My bisection says > > 1e7f6def8b2370ecefb54b3c8f390ff894b0c51b is the first bad commit We got a report today [1] that this commit is indeed bad, https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/ the timing was also unfortuate and too late to recall the pull request. > but I'll still have to verify by testing the revert on top of my current tree. > > It did revert cleanly, but I also note that if the zstd case is wrong, > I assume the other very similar commits (for zlib and lzo) are > potentially also wrong. > > Let me reboot to verify that at least my machine boots. Per the report revert makes it work again and zlib and lzo cases are not affected. I can send a pull request reverting all the three until we figure out what's wrong, or you can do it as all revert cleanly.
On 2024/1/23 09:35, David Sterba wrote: > On Mon, Jan 22, 2024 at 02:54:31PM -0800, Linus Torvalds wrote: >> On Mon, 22 Jan 2024 at 14:34, Linus Torvalds >> <torvalds@linux-foundation.org> wrote: >>> >>> Bah. These fixes are garbage. Now my machine doesn't even boot. I'm >>> bisecting > > Ah, sorry. > >> My bisection says >> >> 1e7f6def8b2370ecefb54b3c8f390ff894b0c51b is the first bad commit > > We got a report today [1] that this commit is indeed bad, > > https://lore.kernel.org/linux-btrfs/CABq1_vj4GpUeZpVG49OHCo-3sdbe2-2ROcu_xDvUG-6-5zPRXg@mail.gmail.com/ > > the timing was also unfortuate and too late to recall the pull request. All my fault. The offending line is: + memcpy_to_page(dest_page, dest_pgoff + to_copy, workspace->out_buf.dst, + to_copy); I'm using the bad pg_off for the memcpy_to_page() call. And zstd is the only affected algo. All the other algos go like this: + memcpy_to_page(dest_page, dest_pgoff, workspace->buf, out_len); So that's why it's screwing up the zstd compressed inline extent decompression, as we can easily write beyond the page boundary and write into the next innocent page. And the existing compression group didn't catch it at all. Would fix it and add new test cases for the regression. Thanks, Qu > >> but I'll still have to verify by testing the revert on top of my current tree. >> >> It did revert cleanly, but I also note that if the zstd case is wrong, >> I assume the other very similar commits (for zlib and lzo) are >> potentially also wrong. >> >> Let me reboot to verify that at least my machine boots. > > Per the report revert makes it work again and zlib and lzo cases are not > affected. > > I can send a pull request reverting all the three until we figure out > what's wrong, or you can do it as all revert cleanly. >
On Mon, 22 Jan 2024 at 10:34, David Sterba <dsterba@suse.com> wrote: > > git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag I have no idea if this is related to the new fixes, but I have never seen it before: BTRFS critical (device dm-0): corrupted node, root=256 block=8550954455682405139 owner mismatch, have 11858205567642294356 expect [256, 18446744073709551360] BTRFS critical (device dm-0): corrupted node, root=256 block=8550954455682405139 owner mismatch, have 11858205567642294356 expect [256, 18446744073709551360] BTRFS critical (device dm-0): corrupted node, root=256 block=8550954455682405139 owner mismatch, have 11858205567642294356 expect [256, 18446744073709551360] SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0 ino=5737268 SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0 ino=5737267 and it caused an actual warning to be printed for my kernel tree from 'git': error: failed to stat 'sound/pci/ice1712/se.c': Structure needs cleaning (and yes, 117 is EUCLEAN, aka "Structure needs cleaning") The problem seems to have self-corrected, because it didn't happen when repeating the command, and that file that failed to stat looks perfectly fine. But it is clearly worrisome. The "owner mismatch" check isn't new - it was added back in 5.19 in commit 88c602ab4460 ("btrfs: tree-checker: check extent buffer owner against owner rootid"). So something else must have changed to trigger it. Linus
On Fri, Jan 26, 2024 at 11:25:19AM -0800, Linus Torvalds wrote: > On Mon, 22 Jan 2024 at 10:34, David Sterba <dsterba@suse.com> wrote: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag > > I have no idea if this is related to the new fixes, but I have never > seen it before: > > BTRFS critical (device dm-0): corrupted node, root=256 > block=8550954455682405139 owner mismatch, have 11858205567642294356 > expect [256, 18446744073709551360] Whick check: the numbers don't match constaints, blocks must be 4096 aligned hex(8550954455682405139) = 0x76ab17c5c57e5313 (would have to end with 3 zeros) and owner is sequentially increased such a high number is unlikely to be reached on nowadays systems. > BTRFS critical (device dm-0): corrupted node, root=256 > block=8550954455682405139 owner mismatch, have 11858205567642294356 > expect [256, 18446744073709551360] > BTRFS critical (device dm-0): corrupted node, root=256 > block=8550954455682405139 owner mismatch, have 11858205567642294356 > expect [256, 18446744073709551360] > SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0 > ino=5737268 > SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0 > ino=5737267 > > and it caused an actual warning to be printed for my kernel tree from 'git': > > error: failed to stat 'sound/pci/ice1712/se.c': Structure needs cleaning Size of sound/pci/ice1712/se.c is 19875, this does not look like it would be caused by the zstd bug as it was for inlined files where the limit is 2048 bytes. > (and yes, 117 is EUCLEAN, aka "Structure needs cleaning") > > The problem seems to have self-corrected, because it didn't happen > when repeating the command, and that file that failed to stat looks > perfectly fine. > > But it is clearly worrisome. > > The "owner mismatch" check isn't new - it was added back in 5.19 in > commit 88c602ab4460 ("btrfs: tree-checker: check extent buffer owner > against owner rootid"). So something else must have changed to trigger > it. This looks like garbage data got read from disk, yet still passing the checksum test (otherwise that would lead to an EIO and would not get to the tree-checker). Most likely cause is that damaged data were written to the disk before. The tree-checker also verifies data before they get written, the same bogus values of block/owner would trigger it so you'd see a warning. It's not possible to get the data damaged on the way from the device to the filesystem, that would not pass the checksum, so unlikely a driver/block layer bug. What remains that the data were correctly written in the past, read correctly passing checksum test but then somehow got damaged in the memory between the checksum check and tree-checker. The window is quite short: disk-io.c:btrfs_validate_extent_buffer() between csum_tree_block() (around line 397) and btrfs_check_node() (around line 464)
On 2024/1/27 05:55, Linus Torvalds wrote: > On Mon, 22 Jan 2024 at 10:34, David Sterba <dsterba@suse.com> wrote: >> >> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag > > I have no idea if this is related to the new fixes, but I have never > seen it before: > > BTRFS critical (device dm-0): corrupted node, root=256 > block=8550954455682405139 owner mismatch, have 11858205567642294356 > expect [256, 18446744073709551360] > BTRFS critical (device dm-0): corrupted node, root=256 > block=8550954455682405139 owner mismatch, have 11858205567642294356 > expect [256, 18446744073709551360] > BTRFS critical (device dm-0): corrupted node, root=256 > block=8550954455682405139 owner mismatch, have 11858205567642294356 > expect [256, 18446744073709551360] This is triggered during a btrfs btree search, for XATTR read. The root=256 means the tree search operation is triggered from subvolume 256, which is completely sane. But the other number, 11858205567642294356, which is still inside the allowed subvolume range, but beyond the 0 level qgroup, thus it makes is_fstree() return false, and triggered the error. Normally we should not have this many subvolumes, and since 2015 we already has such check against subvolume creation. But I don't really believe that's the case, unless there are really that many snapshots/subvolumes in the fs (beyond 1 << 48 snapshots) > SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0 > ino=5737268 > SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0 > ino=5737267 > > and it caused an actual warning to be printed for my kernel tree from 'git': > > error: failed to stat 'sound/pci/ice1712/se.c': Structure needs cleaning > > (and yes, 117 is EUCLEAN, aka "Structure needs cleaning") > > The problem seems to have self-corrected, because it didn't happen > when repeating the command, and that file that failed to stat looks > perfectly fine. I guess since the error is self-corrected, there is no tree dump of block 8550954455682405139 just after the error. Personally speaking the number 11858205567642294356 is really too large, so that it looks like some runtime garbage. I don't have any clue for now, but if you hit it again, you may want to run "btrfs ins dump-tree -b <bytenr> <device>" to dump the content. Meanwhile I'll make kernel tree-checker to dump the content of the offending tree block so that it's easier to debug. > > But it is clearly worrisome. > > The "owner mismatch" check isn't new - it was added back in 5.19 in > commit 88c602ab4460 ("btrfs: tree-checker: check extent buffer owner > against owner rootid"). So something else must have changed to trigger > it. Anyway I'll keep an eye on the situation. Thanks, Qu > > Linus >
On 2024/1/27 06:30, David Sterba wrote: > On Fri, Jan 26, 2024 at 11:25:19AM -0800, Linus Torvalds wrote: >> On Mon, 22 Jan 2024 at 10:34, David Sterba <dsterba@suse.com> wrote: >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git tags/for-6.8-rc1-tag >> >> I have no idea if this is related to the new fixes, but I have never >> seen it before: >> >> BTRFS critical (device dm-0): corrupted node, root=256 >> block=8550954455682405139 owner mismatch, have 11858205567642294356 >> expect [256, 18446744073709551360] > > Whick check: the numbers don't match constaints, blocks must be 4096 > aligned > > hex(8550954455682405139) = 0x76ab17c5c57e5313 > (would have to end with 3 zeros) Oh, I forgot the most obvious problem. This means the extent buffer is full of garbage. And this is means the content of the eb got some sudden change. As during extent buffer read, we have a very basic eb bytenr check against the eb read from disk. So it means at that point, at least we got something correct, and can even pass csum checks. But later one, the pages of the eb got corrupted and we got garbarge. (AKA, my previous analyse is totally wrong, but it may still inspire me to add extra check on eb_owner to reject higher level qgroup subvolumes) What's the page size of the system? 4K or 16K or 64K? For the first 2, the eb handling would go the regular path as 4K page sized systems (as 16K is the default nodesize, nodesize >= PAGE_SIZE would share the same page based eb handling). For 64K, it would go full subpage, but I doubt it, as the Apple M series CPUs do not seem to support 64K page size. Thanks, Qu > > and owner is sequentially increased such a high number is unlikely to be > reached on nowadays systems. > >> BTRFS critical (device dm-0): corrupted node, root=256 >> block=8550954455682405139 owner mismatch, have 11858205567642294356 >> expect [256, 18446744073709551360] >> BTRFS critical (device dm-0): corrupted node, root=256 >> block=8550954455682405139 owner mismatch, have 11858205567642294356 >> expect [256, 18446744073709551360] >> SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0 >> ino=5737268 >> SELinux: inode_doinit_use_xattr: getxattr returned 117 for dev=dm-0 >> ino=5737267 >> >> and it caused an actual warning to be printed for my kernel tree from 'git': >> >> error: failed to stat 'sound/pci/ice1712/se.c': Structure needs cleaning > > Size of sound/pci/ice1712/se.c is 19875, this does not look like it > would be caused by the zstd bug as it was for inlined files where the > limit is 2048 bytes. > >> (and yes, 117 is EUCLEAN, aka "Structure needs cleaning") >> >> The problem seems to have self-corrected, because it didn't happen >> when repeating the command, and that file that failed to stat looks >> perfectly fine. >> >> But it is clearly worrisome. >> >> The "owner mismatch" check isn't new - it was added back in 5.19 in >> commit 88c602ab4460 ("btrfs: tree-checker: check extent buffer owner >> against owner rootid"). So something else must have changed to trigger >> it. > > This looks like garbage data got read from disk, yet still passing the > checksum test (otherwise that would lead to an EIO and would not get to > the tree-checker). Most likely cause is that damaged data were written > to the disk before. > > The tree-checker also verifies data before they get written, the same > bogus values of block/owner would trigger it so you'd see a warning. > > It's not possible to get the data damaged on the way from the device to > the filesystem, that would not pass the checksum, so unlikely a > driver/block layer bug. > > What remains that the data were correctly written in the past, read > correctly passing checksum test but then somehow got damaged in the > memory between the checksum check and tree-checker. The window is quite > short: > > disk-io.c:btrfs_validate_extent_buffer() > > between csum_tree_block() (around line 397) and > btrfs_check_node() (around line 464)
On Fri, 26 Jan 2024 at 13:39, Qu Wenruo <wqu@suse.com> wrote: > > Oh, I forgot the most obvious problem. > > This means the extent buffer is full of garbage. Allocation lifetime problems? > What's the page size of the system? 4K or 16K or 64K? This is a bog-standard x86-64 system. With 32 cores (and 64 threads), but there's nothing remotely odd about it, except for the fact that it's running a very recent kernel... Linus
On 2024/1/27 08:21, Linus Torvalds wrote: > On Fri, 26 Jan 2024 at 13:39, Qu Wenruo <wqu@suse.com> wrote: >> >> Oh, I forgot the most obvious problem. >> >> This means the extent buffer is full of garbage. > > Allocation lifetime problems? Could be, thus it may be better to output the flags of the first page for tree-checker. Thanks, Qu > >> What's the page size of the system? 4K or 16K or 64K? > > This is a bog-standard x86-64 system. With 32 cores (and 64 threads), > but there's nothing remotely odd about it, except for the fact that > it's running a very recent kernel... > > Linus
On Fri, 26 Jan 2024 at 13:56, Qu Wenruo <wqu@suse.com> wrote: > > On 2024/1/27 08:21, Linus Torvalds wrote: > > > > Allocation lifetime problems? > > Could be, thus it may be better to output the flags of the first page > for tree-checker. Note that the fact that it magically went away certainly implies that it never "really" existed, and that something was using a pointer or similar. IOW, this is not some IO that got scribbled over, or a cache that got corrupted. If it had been real corruption, I would have expected that it would have stayed around in memory. Linus
On 2024/1/27 08:32, Linus Torvalds wrote: > On Fri, 26 Jan 2024 at 13:56, Qu Wenruo <wqu@suse.com> wrote: >> >> On 2024/1/27 08:21, Linus Torvalds wrote: >>> >>> Allocation lifetime problems? >> >> Could be, thus it may be better to output the flags of the first page >> for tree-checker. > > Note that the fact that it magically went away certainly implies that > it never "really" existed, and that something was using a pointer or > similar. > > IOW, this is not some IO that got scribbled over, or a cache that got > corrupted. If it had been real corruption, I would have expected that > it would have stayed around in memory. Yep, thus it makes sense to show the page status of an eb. It could be some race that the eb pages are not properly hold, thus its content changed unexpectedly. Thanks, Qu > > Linus >