Message ID | c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com (mailing list archive)
---|---
State | New, archived
Series | About the weird tree block corruption
On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> Hi Tavian,
>
> Thanks for all the awesome help debugging the weird tree block corruption.
> And sorry for the late reply.

No worries, thanks for your help!

> Unfortunately I still failed to reproduce the bug, so I can only craft a
> debug patchset for you to test.

Good news: I also failed to reproduce the bug on the latest btrfs/for-next branch (ec616f34eba1 "btrfs: do not skip re-registration for the mounted device"). It was still reproducing the last time I pulled btrfs/for-next (09e6cef19c9f "btrfs: refactor alloc_extent_buffer() to allocate-then-attach method").

This is a much smaller range of commits. I didn't actually bisect them, but I skimmed through the log and saw

    commit 0a9bab391e336489169b95cb0d4553d921302189
    Author: Mikulas Patocka <mpatocka@redhat.com>
    Date:   Wed Jan 31 21:57:27 2024 +0100

        dm-crypt, dm-verity: disable tasklets

        Tasklets have an inherent problem with memory corruption. The function
        tasklet_action_common calls tasklet_trylock, then it calls the tasklet
        callback and then it calls tasklet_unlock. If the tasklet callback
        frees the structure that contains the tasklet or if it calls some code
        that may free it, tasklet_unlock will write into free memory.

which is awfully suspicious. It appears to affect LUKS devices on NVMe drives with --perf=no_{read,write}_workqueue enabled, which matches my configuration.

So my guess is that all my problems (including the disk corruption I experienced recently) are caused by this use-after-free in dm-crypt, which has been fixed since v6.8-rc3. I'll let you know if I ever see this bug again, but for now I assume the underlying issue was in dm-crypt, not btrfs.
On Thu, Mar 14, 2024 at 1:44 PM Tavian Barnes <tavianator@tavianator.com> wrote:
> [SNIP]
>
> This is a much smaller range of commits. I didn't actually bisect
> them, but I skimmed through the log and saw
>
>     commit 0a9bab391e336489169b95cb0d4553d921302189
>     Author: Mikulas Patocka <mpatocka@redhat.com>
>     Date:   Wed Jan 31 21:57:27 2024 +0100
>
>         dm-crypt, dm-verity: disable tasklets

Well, I suppose this can't be the relevant fix, because I reproduced the "corrupted node" bug on 6.7.9, which includes a backport of that commit. I'll try to bisect when I get more time.
On Thu, Mar 14, 2024 at 2:42 PM Tavian Barnes <tavianator@tavianator.com> wrote:
> [SNIP]
>
> Good news: I also failed to reproduce the bug on the latest
> btrfs/for-next branch (ec616f34eba1 "btrfs: do not skip
> re-registration for the mounted device"). It was still reproducing
> the last time I pulled btrfs/for-next (09e6cef19c9f "btrfs: refactor
> alloc_extent_buffer() to allocate-then-attach method").

Actually, I have to walk all of this back. I can still reproduce the bug on ec616f34eba1. What I can't reproduce it on is that commit *plus your debugging patch* ("btrfs: add extra debug for eb read path"), which suggests a timing issue: the debug output may be closing a race window.
Additionally, shortly after reproducing the bug on ec616f34eba1, I got this splat:

[  119.257741] BTRFS critical (device dm-2): corrupted node, root=518 block=16438782945911875046 owner mismatch, have 4517169229596899607 expect [256, 18446744073709551360]
[  119.257764] BTRFS critical (device dm-2): corrupted node, root=518 block=16438782945911875046 owner mismatch, have 4517169229596899607 expect [256, 18446744073709551360]
[  119.257775] BTRFS critical (device dm-2): corrupted node, root=518 block=16438782945911875046 owner mismatch, have 4517169229596899607 expect [256, 18446744073709551360]
[  119.257773] BTRFS critical (device dm-2): corrupted node, root=518 block=16438782945911875046 owner mismatch, have 4517169229596899607 expect [256, 18446744073709551360]
[  119.257782] BTRFS critical (device dm-2): corrupted node, root=518 block=16438782945911875046 owner mismatch, have 4517169229596899607 expect [256, 18446744073709551360]
[  119.257787] BTRFS critical (device dm-2): corrupted node, root=518 block=16438782945911875046 owner mismatch, have 4517169229596899607 expect [256, 18446744073709551360]
[  178.564993] ------------[ cut here ]------------
[  178.567792] BTRFS: Transaction aborted (error -17)
[  178.570606] WARNING: CPU: 2 PID: 5191 at fs/btrfs/inode.c:6450 btrfs_create_new_inode+0xa46/0xa70 [btrfs]
[  178.573433] Modules linked in: vhost_net vhost vhost_iotlb tap xt_addrtype xt_conntrack xt_comment veth rpcrdma rdma_cm cmac iw_cm algif_hash nct6775 algif_skcipher ib_cm nct6775_core af_alg ib_core bnep hwmon_vid lm92 uvcvideo uvc intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd snd_hda_codec_hdmi kvm_amd snd_hda_intel btusb snd_intel_dspcfg btrtl snd_intel_sdw_acpi gspca_vc032x btintel snd_usb_audio snd_hda_codec btbcm gspca_main kvm btmtk videobuf2_vmalloc snd_usbmidi_lib nls_iso8859_1 videobuf2_memops snd_ump videobuf2_v4l2 snd_hda_core snd_rawmidi vfat videodev bluetooth irqbypass snd_hwdep snd_seq_device fat videobuf2_common mc snd_pcm ecdh_generic crc16 snd_timer rapl wmi_bmof acpi_cpufreq sp5100_tco pcspkr snd soundcore k10temp i2c_piix4 mousedev joydev mac_hid nfsd usbip_host usbip_core pkcs8_key_parser auth_rpcgss i2c_dev sg nfs_acl lockd crypto_user grace fuse loop sunrpc btrfs blake2b_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee xt_MASQUERADE xt_tcpudp xt_mark
[  178.573514]  hid_logitech_hidpp hid_logitech_dj dm_mod tun uas crct10dif_pclmul crc32_pclmul polyval_clmulni usb_storage polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 usbhid sha256_ssse3 sha1_ssse3 aesni_intel nvme igb bridge iwlwifi crypto_simd nvme_core mxm_wmi ccp sr_mod cryptd dca xhci_pci cdrom i2c_algo_bit xhci_pci_renesas nvme_auth stp wmi llc nf_tables cfg80211 ip6table_nat ip6table_filter ip6_tables iptable_nat nf_nat nf_conntrack rfkill nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c crc32c_generic crc32c_intel iptable_filter nfnetlink ip_tables x_tables
[  178.600199] CPU: 2 PID: 5191 Comm: git Not tainted 6.8.0-rc7-euclean+ #15 6731aa79f7b285a380c42229104b62a18d4313c1
[  178.603981] Hardware name: Micro-Star International Co., Ltd. MS-7C60/TRX40 PRO WIFI (MS-7C60), BIOS 2.80 05/17/2022
[  178.607815] RIP: 0010:btrfs_create_new_inode+0xa46/0xa70 [btrfs]
[  178.611667] Code: cc 0a 00 c7 44 24 0c 01 00 00 00 e9 fc fb ff ff 41 bc f4 ff ff ff e9 a4 f8 ff ff 44 89 e6 48 c7 c7 48 00 3b c1 e8 1a e7 c4 c2 <0f> 0b eb a4 44 89 e6 48 c7 c7 48 00 3b c1 e8 07 e7 c4 c2 0f 0b eb
[  178.615664] RSP: 0018:ffffbb8edf8afad8 EFLAGS: 00010282
[  178.619577] RAX: 0000000000000000 RBX: ffffbb8edf8afbb8 RCX: 0000000000000000
[  178.623516] RDX: ffff8fe8beeae780 RSI: ffff8fe8beea19c0 RDI: ffff8fe8beea19c0
[  178.627458] RBP: ffffbb8edf8afba0 R08: 0000000000000000 R09: ffffbb8edf8af960
[  178.631370] R10: 0000000000000003 R11: ffff8fe8bf1df668 R12: 00000000ffffffef
[  178.635281] R13: ffff8fc9cdc57d88 R14: ffff8fcb654541f8 R15: ffff8fc9d7c07110
[  178.639187] FS:  00007f86d59d9740(0000) GS:ffff8fe8bee80000(0000) knlGS:0000000000000000
[  178.643131] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  178.647061] CR2: 000055c7d0a7a938 CR3: 000000023e8a8000 CR4: 0000000000350ef0
[  178.650998] Call Trace:
[  178.654898]  <TASK>
[  178.658761]  ? btrfs_create_new_inode+0xa46/0xa70 [btrfs 0221bef32b31f5d5536e63a0526995bec5247918]
[  178.662770]  ? __warn+0x81/0x130
[  178.666696]  ? btrfs_create_new_inode+0xa46/0xa70 [btrfs 0221bef32b31f5d5536e63a0526995bec5247918]
[  178.670749]  ? report_bug+0x171/0x1a0
[  178.674761]  ? prb_read_valid+0x1b/0x30
[  178.678786]  ? srso_return_thunk+0x5/0x5f
[  178.682786]  ? handle_bug+0x3c/0x80
[  178.686779]  ? exc_invalid_op+0x17/0x70
[  178.690769]  ? asm_exc_invalid_op+0x1a/0x20
[  178.694773]  ? btrfs_create_new_inode+0xa46/0xa70 [btrfs 0221bef32b31f5d5536e63a0526995bec5247918]
[  178.698925]  ? btrfs_create_new_inode+0xa46/0xa70 [btrfs 0221bef32b31f5d5536e63a0526995bec5247918]
[  178.703008]  btrfs_create_common+0xc4/0x130 [btrfs 0221bef32b31f5d5536e63a0526995bec5247918]
[  178.707133]  path_openat+0xe9f/0x1190
[  178.711196]  do_filp_open+0xb3/0x160
[  178.715254]  do_sys_openat2+0xab/0xe0
[  178.719294]  __x64_sys_openat+0x57/0xa0
[  178.723321]  do_syscall_64+0x89/0x170
[  178.727336]  ? srso_return_thunk+0x5/0x5f
[  178.731356]  ? do_syscall_64+0x96/0x170
[  178.735373]  ? srso_return_thunk+0x5/0x5f
[  178.739387]  ? syscall_exit_to_user_mode+0x80/0x230
[  178.743420]  ? srso_return_thunk+0x5/0x5f
[  178.747434]  ? do_syscall_64+0x96/0x170
[  178.751441]  ? srso_return_thunk+0x5/0x5f
[  178.755450]  ? srso_return_thunk+0x5/0x5f
[  178.759405]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  178.763406] RIP: 0033:0x7f86d5ad6dc0
[  178.767424] Code: 48 89 44 24 20 75 94 44 89 54 24 0c e8 d9 c9 f8 ff 44 8b 54 24 0c 89 da 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 89 44 24 0c e8 2c ca f8 ff 8b 44
[  178.771580] RSP: 002b:00007ffcc4791f50 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
[  178.775720] RAX: ffffffffffffffda RBX: 00000000000000c1 RCX: 00007f86d5ad6dc0
[  178.779865] RDX: 00000000000000c1 RSI: 000055c7d0a77140 RDI: 00000000ffffff9c
[  178.784014] RBP: 000055c7d0a77140 R08: 0000000000000000 R09: 000055c7d0a77140
[  178.788152] R10: 00000000000001b6 R11: 0000000000000293 R12: 00007ffcc4792360
[  178.792210] R13: 00007ffcc4792000 R14: 000055c7d06bb788 R15: 000055c7d0a77140
[  178.796192]  </TASK>
[  178.800079] ---[ end trace 0000000000000000 ]---
[  178.804024] BTRFS: error (device dm-2: state A) in btrfs_create_new_inode:6450: errno=-17 Object already exists
On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [SNIP]
>
> The second patch is to making tree-checker to BUG_ON() when something
> went wrong.
> This patch should only be applied if you can reliably reproduce it
> inside a VM.

Okay, I have finally reproduced this in a VM. I had to add this hunk to your patch 0002 in order to trigger the BUG_ON:

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index c8fbcae4e88e..4ee7a717642a 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -2047,6 +2051,7 @@ int btrfs_check_eb_owner(const struct extent_buffer *eb, u64 root_owner)
 			  btrfs_header_level(eb) == 0 ? "leaf" : "node",
 			  root_owner, btrfs_header_bytenr(eb), eb_owner,
 			  root_owner);
+		BUG_ON(1);
 		return -EUCLEAN;
 	}
 	return 0;
@@ -2062,6 +2067,7 @@ int btrfs_check_eb_owner(const struct extent_buffer *eb, u64 root_owner)
 			  btrfs_header_level(eb) == 0 ? "leaf" : "node",
 			  root_owner, btrfs_header_bytenr(eb), eb_owner,
 			  BTRFS_FIRST_FREE_OBJECTID, BTRFS_LAST_FREE_OBJECTID);
+		BUG_ON(1);
 		return -EUCLEAN;
 	}
 	return 0;

> When using the 2nd patch, it's strongly recommended to enable the
> following sysctls:
>
>     kernel.ftrace_dump_on_oops = 1
>     kernel.panic = 5
>     kernel.panic_on_oops = 1

I also set kernel.oops_all_cpu_backtrace = 1 and booted with nowatchdog; otherwise I got watchdog backtraces (due to the slow console) interspersed with the traces, which was hard to read.

> And you need a way to reliably access the VM (either netconsole or a
> serial console setup).
> In that case, you would get the entire ftrace buffer dumped to the
> netconsole/serial console.
>
> This has the extra benefit of reducing the noise. But it really needs a
> reliable VM setup and can be a little tricky to set up.

I got this to work; the console logs are attached. I added

    echo 1 > $tracefs/buffer_size_kb

otherwise it tried to dump 48 MiB over the serial console, which I didn't have the patience for. Hopefully that's a big enough buffer; I can re-run it if you need more logs.

> Feel free to ask for any extra help to set up the environment, as you're
> our last hope to properly pin down the bug.

Hopefully this trace helps you debug this. Let me know whenever you have something else for me to test.

I can also try to send you the VM, but I'm not sure how to package it up exactly. It has two (emulated) NVMe drives with LUKS and btrfs raid0 on top.
On 2024/3/16 01:53, Tavian Barnes wrote:
> [SNIP]
>
> I also set kernel.oops_all_cpu_backtrace = 1, and ran with nowatchdog,
> otherwise I got watchdog backtraces (due to slow console) interspersed
> with the traces which was hard to read.

oops_all_cpu_backtrace looks a little overkill, and it seems to flood the output.

> I got this to work, the console logs are attached. I added
>
>     echo 1 > $tracefs/buffer_size_kb
>
> otherwise it tried to dump 48MiB over the serial console which I
> didn't have the patience for. Hopefully that's a big enough buffer, I
> can re-run it if you need more logs.

That's totally fine, and that's exactly what I do during debugging.

> Hopefully this trace helps you debug this. Let me know whenever you
> have something else for me to test.

The btrfs_crit() line is using btrfs_header_bytenr(), which can be corrupted. So it's much better to add an extra trace_printk() to print eb->start, so that we can match the output.

But there is some interesting output: the trace_printk() calls in btrfs_release_extent_buffer_pages() already show that the page refs are already 0, and the contents are already incorrect.

It may be a clue, but without the proper matching trace, it's still hard to say. I'm afraid you will need to add extra trace_printk() lines like this at all of the "return -EUCLEAN" locations:

    trace_printk("eb=%llu\n", eb->start);

> I can also try to send you the VM, but I'm not sure how to package it
> up exactly. It has two (emulated) NVMEs with LUKS and BTRFS raid0 on
> top.

Just sending me the rootfs qcow2 would be more than enough. I can set up LUKS and btrfs all by myself here.

Thanks,
Qu
On Fri, Mar 15, 2024 at 11:23 AM Tavian Barnes <tavianator@tavianator.com> wrote:
> [SNIP]
>
> I can also try to send you the VM, but I'm not sure how to package it
> up exactly. It has two (emulated) NVMEs with LUKS and BTRFS raid0 on
> top.

I added eb->start to the "corrupted node/leaf" message so I could look for relevant lines in the trace output. From another run, I see this:

    $ grep 'eb=15302115328' typescript-5
    [ 2725.113804] BTRFS critical (device dm-0): corrupted leaf, root=258 block=15302115328 owner mismatch, have 13709377558419367261 expect [256, 18446744073709551360] eb=15302115328
    [ 2740.240046] iou-wrk--173727 15..... 2649295481us : alloc_extent_buffer: alloc_extent_buffer: alloc eb=15302115328 len=16384
    [ 2740.301767] kworker/-322 15..... 2649295735us : end_bbio_meta_read: read done, eb=15302115328 page_refs=3 eb level=0 fsid=b66a67f0-8273-4158-b7bf-988bb5683000
    [ 2740.328424] kworker/-5095 31..... 2649295941us : end_bbio_meta_read: read done, eb=15302115328 page_refs=8 eb level=0 fsid=b66a67f0-8273-4158-b7bf-988bb5683000

I am surprised to see two end_bbio_meta_read lines with only one matching alloc_extent_buffer. That made me check the locking in read_extent_buffer_pages() again, and I think I may have found something.

Let's say two threads simultaneously call read_extent_buffer_pages() on the same eb. Thread 1 starts reading:

    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
        return 0;

    /*
     * We could have had EXTENT_BUFFER_UPTODATE cleared by the write
     * operation, which could potentially still be in flight.  In this case
     * we simply want to return an error.
     */
    if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
        return -EIO;

    /* (Thread 2 preempted here) */

    /* Someone else is already reading the buffer, just wait for it. */
    if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
        goto done;
    ...

Meanwhile, thread 2 does the same thing but gets preempted at the marked point, before testing EXTENT_BUFFER_READING. Now the read finishes, and end_bbio_meta_read() does

    btrfs_folio_set_uptodate(fs_info, folio, start, len);
    ...
    clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
    smp_mb__after_atomic();
    wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);

Now thread 2 resumes executing, atomically sets EXTENT_BUFFER_READING (again), and starts reading into the already-filled-in extent buffer. This might normally be a benign race, except end_bbio_meta_read() has also set EXTENT_BUFFER_UPTODATE. So now if a third thread tries to read the same extent buffer, it will do

    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
        return 0;

and return 0 *while the eb is still under I/O*. The caller will then try to read data from the extent buffer, which is concurrently being overwritten by the extra read started by thread 2.

I attached a candidate patch. So far it is only compile-tested.
On 2024/3/16 06:31, Tavian Barnes wrote:
> [SNIP]
>
> Now thread 2 resumes executing, atomically sets EXTENT_BUFFER_READING
> (again) and starts reading into the already-filled-in extent buffer.
> This might normally be a benign race, except end_bbio_meta_read() has
> also set EXTENT_BUFFER_UPTODATE. So now if a third thread tries to
> read the same extent buffer, it will do
>
>     if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
>         return 0;
>
> and return 0 *while the eb is still under I/O*. The caller will then
> try to read data from the extent buffer which is concurrently being
> updated by the extra read started by thread 2.

Awesome! Not only did you get a super good trace, you even pinned down the root cause.

The trace indeed shows duplicated metadata read endio finishes, which means we indeed have a race where we shouldn't.

Although I'm not 100% sure about the fix, the normal way we handle it would be locking the page before IO and unlocking it at endio time. I believe we can do that and properly solve it. And as a fallback, we can use a spinlock to put all the bit testing in a critical section if the page lock approach is not feasible.

Thank you so much! You really saved us all!

Thanks,
Qu

> I attached a candidate patch. So far only compile-tested.
On Fri, Mar 15, 2024 at 4:21 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [SNIP]
>
> Awesome! Not only you got a super good trace, but even pinned down the
> root cause.

Thanks!

> The trace indeed shows duplicated metadata read endio finishes, which
> means we indeed have race where we shouldn't.

Okay, glad to see I read the trace correctly. Now that we think we understand it, can you think of a more reliable reproducer? It would need lots of read_block_for_search() calls on the same block in parallel.

> Although I'm not 100% sure about the fix, the normal way we handle it
> would be locking the page before IO, and unlock it at endio time.

Fair enough; I don't really know the locking rules in btrfs, or fs/mm generally. I'm pretty sure the double-checked approach from my patch is enough to protect read_extent_buffer_pages() from racing with itself in the way I described, but I don't know if a lock may be required for other reasons. It might be nice to avoid taking a lock in this path, but I'm not sure how important that is.

I do see that this code path used to take page locks, until commit d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading").

I'm currently re-running the reproducer with my patch. If it survives the night without crashing, I'll post it formally, unless you'd like to do it a different way.

> I believe we can do that and properly solve it.
> And go with fallbacks to use spinlock to make all the bits testing in a
> critical section if the page lock one is no feasible.
>
> Thank you so much!
> You really saved us all!

No problem, happy to help :)

> Thanks,
> Qu
On 2024/3/16 08:45, Tavian Barnes wrote:
> On Fri, Mar 15, 2024 at 4:21 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> On 2024/3/16 06:31, Tavian Barnes wrote:
>>> On Fri, Mar 15, 2024 at 11:23 AM Tavian Barnes
>>> <tavianator@tavianator.com> wrote:
>>>> On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>> [SNIP]
>>>>>
>>>>> The second patch is to making tree-checker to BUG_ON() when something
>>>>> went wrong.
>>>>> This patch should only be applied if you can reliably reproduce it
>>>>> inside a VM.
>>>>
>>>> Okay, I have finally reproduced this in a VM. I had to add this hunk
>>>> to your patch 0002 in order to trigger the BUG_ON:
>>>>
>>>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>>>> index c8fbcae4e88e..4ee7a717642a 100644
>>>> --- a/fs/btrfs/tree-checker.c
>>>> +++ b/fs/btrfs/tree-checker.c
>>>> @@ -2047,6 +2051,7 @@ int btrfs_check_eb_owner(const struct
>>>> extent_buffer *eb, u64 root_owner)
>>>>                         btrfs_header_level(eb) == 0 ? "leaf" : "node",
>>>>                         root_owner, btrfs_header_bytenr(eb), eb_owner,
>>>>                         root_owner);
>>>> +               BUG_ON(1);
>>>>                 return -EUCLEAN;
>>>>         }
>>>>         return 0;
>>>> @@ -2062,6 +2067,7 @@ int btrfs_check_eb_owner(const struct
>>>> extent_buffer *eb, u64 root_owner)
>>>>                         btrfs_header_level(eb) == 0 ? "leaf" : "node",
>>>>                         root_owner, btrfs_header_bytenr(eb), eb_owner,
>>>>                         BTRFS_FIRST_FREE_OBJECTID, BTRFS_LAST_FREE_OBJECTID);
>>>> +               BUG_ON(1);
>>>>                 return -EUCLEAN;
>>>>         }
>>>>         return 0;
>>>>
>>>>> When using the 2nd patch, it's strongly recommended to enable the
>>>>> following sysctl:
>>>>>
>>>>>     kernel.ftrace_dump_on_oops = 1
>>>>>     kernel.panic = 5
>>>>>     kernel.panic_on_oops = 1
>>>>
>>>> I also set kernel.oops_all_cpu_backtrace = 1, and ran with nowatchdog,
>>>> otherwise I got watchdog backtraces (due to slow console) interspersed
>>>> with the traces which was hard to read.
>>>>
>>>>> And you need a way to reliably access the VM (either netconsole or a
>>>>> serial console setup).
>>>>> In that case, you would got all the ftrace buffer to be dumped into the
>>>>> netconsole/serial console.
>>>>>
>>>>> This has the extra benefit of reducing the noise. But really needs a
>>>>> reliable VM setup and can be a little tricky to setup.
>>>>
>>>> I got this to work, the console logs are attached. I added
>>>>
>>>>     echo 1 > $tracefs/buffer_size_kb
>>>>
>>>> otherwise it tried to dump 48MiB over the serial console which I
>>>> didn't have the patience for. Hopefully that's a big enough buffer, I
>>>> can re-run it if you need more logs.
>>>>
>>>>> Feel free to ask for any extra help to setup the environment, as you're
>>>>> our last hope to properly pin down the bug.
>>>>
>>>> Hopefully this trace helps you debug this. Let me know whenever you
>>>> have something else for me to test.
>>>>
>>>> I can also try to send you the VM, but I'm not sure how to package it
>>>> up exactly. It has two (emulated) NVMEs with LUKS and BTRFS raid0 on
>>>> top.
>>>
>>> I added eb->start to the "corrupted node/leaf" message so I could look
>>> for relevant lines in the trace output. From another run, I see this:
>>>
>>>     $ grep 'eb=15302115328' typescript-5
>>>     [ 2725.113804] BTRFS critical (device dm-0): corrupted leaf, root=258
>>>     block=15302115328 owner mismatch, have 13709377558419367261 expect
>>>     [256, 18446744073709551360] eb=15302115328
>>>     [ 2740.240046] iou-wrk--173727 15..... 2649295481us :
>>>     alloc_extent_buffer: alloc_extent_buffer: alloc eb=15302115328
>>>     len=16384
>>>     [ 2740.301767] kworker/-322 15..... 2649295735us :
>>>     end_bbio_meta_read: read done, eb=15302115328 page_refs=3 eb level=0
>>>     fsid=b66a67f0-8273-4158-b7bf-988bb5683000
>>>     [ 2740.328424] kworker/-5095 31..... 2649295941us :
>>>     end_bbio_meta_read: read done, eb=15302115328 page_refs=8 eb level=0
>>>     fsid=b66a67f0-8273-4158-b7bf-988bb5683000
>>>
>>> I am surprised to see two end_bbio_meta_read lines with only one
>>> matching alloc_extent_buffer. That made me check the locking in
>>> read_extent_buffer_pages() again, and I think I may have found
>>> something.
>>>
>>> Let's say we get two threads simultaneously call
>>> read_extent_buffer_pages() on the same eb. Thread 1 starts reading:
>>>
>>>     if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
>>>             return 0;
>>>
>>>     /*
>>>      * We could have had EXTENT_BUFFER_UPTODATE cleared by the write
>>>      * operation, which could potentially still be in flight. In this case
>>>      * we simply want to return an error.
>>>      */
>>>     if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
>>>             return -EIO;
>>>
>>>     /* (Thread 2 pre-empted here) */
>>>
>>>     /* Someone else is already reading the buffer, just wait for it. */
>>>     if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
>>>             goto done;
>>>     ...
>>>
>>> Meanwhile, thread 2 does the same thing but gets preempted at the
>>> marked point, before testing EXTENT_BUFFER_READING. Now the read
>>> finishes, and end_bbio_meta_read() does
>>>
>>>     btrfs_folio_set_uptodate(fs_info, folio, start, len);
>>>     ...
>>>     clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
>>>     smp_mb__after_atomic();
>>>     wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
>>>
>>> Now thread 2 resumes executing, atomically sets EXTENT_BUFFER_READING
>>> (again) and starts reading into the already-filled-in extent buffer.
>>> This might normally be a benign race, except end_bbio_meta_read() has
>>> also set EXTENT_BUFFER_UPTODATE. So now if a third thread tries to
>>> read the same extent buffer, it will do
>>>
>>>     if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
>>>             return 0;
>>>
>>> and return 0 *while the eb is still under I/O*. The caller will then
>>> try to read data from the extent buffer which is concurrently being
>>> updated by the extra read started by thread 2.
>>
>> Awesome! Not only you got a super good trace, but even pinned down the
>> root cause.
>
> Thanks!
>
>> The trace indeed shows duplicated metadata read endio finishes, which
>> means we indeed have race where we shouldn't.
>
> Okay, glad to see I read the trace correctly. Now that we think we
> understand it, can you think of a more reliable reproducer? It would
> need lots of read_block_for_search() on the same block in parallel.
>
>> Although I'm not 100% sure about the fix, the normal way we handle it
>> would be locking the page before IO, and unlock it at endio time.
>
> Fair enough, I don't really know the locking rules in btrfs or fs/mm
> generally. I'm pretty sure the double-checked approach from my patch
> is enough to protect read_extent_buffer_pages() from racing with
> itself in the way I described, but I don't know if a lock may be
> required for other reasons. It might be nice to avoid taking a lock
> in this path but I'm not sure how important that is.
>
> I do see that this code path used to take page locks until commit
> d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer
> reading").
>
> I'm currently re-running the reproducer with my patch. If it survives
> the night without crashing, I'll post it formally, unless you'd like
> to do it a different way.

After digging deeper, your fix looks fine.
(Just remove the cleanup, as we need a minimal hot fix for backport,
and don't forget to CC the fix to stable, and if possible find the
regression commit if it's a regression, and I think the commit you
mentioned is the culprit, as previously we have extra page locking to
prevent this from happening)

I can only come up with the following racing cases, and all looks fine
with your fix:

Case 1:

              T1                  |              T2
----------------------------------+-------------------------------
test_bit(UPTODATE) == false       |
                                  | test_bit(UPTODATE) == false
test_set(READING) == true         |
                                  | test_set(READING) == false
                                  | goto wait

Everything is fine, the old common case.
Case 2:

              T1                  |              T2
----------------------------------+-------------------------------
test_bit(UPTODATE) == false       |
                                  | test_bit(UPTODATE) == false
test_set(READING) == true         |
end_io:                           |
set_bit(UPTODATE)                 |
clear_bit(READING)                |
                                  | test_set(READING) == true
                                  | test_bit(UPTODATE) == true
                                  | done

The old case where the double read would happen, and now it's fine.

Case 3:
The extra race between UPTODATE and READING:

              T1                  |              T2
----------------------------------+-------------------------------
test_bit(UPTODATE) == false       |
                                  | test_bit(UPTODATE) == false
test_set(READING) == true         |
end_io:                           |
set_bit(UPTODATE)                 |
                                  | test_set(READING) == false
                                  | wait
clear_bit(READING)                |

Still safe, due to the fact that UPTODATE is set before clearing
READING.

So yep, your fix should be fine.

Although initially I'd prefer a more straightforward spinlock based
solution like this:

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7441245b1ceb..88fcf1df87c7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4318,20 +4318,28 @@ int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num,
 	struct btrfs_bio *bbio;
 	bool ret;
 
-	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+	spin_lock(&eb->refs_lock);
+	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) {
+		spin_unlock(&eb->refs_lock);
 		return 0;
+	}
 
 	/*
 	 * We could have had EXTENT_BUFFER_UPTODATE cleared by the write
 	 * operation, which could potentially still be in flight. In this case
 	 * we simply want to return an error.
 	 */
-	if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
+	if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))) {
+		spin_unlock(&eb->refs_lock);
 		return -EIO;
+	}
 
 	/* Someone else is already reading the buffer, just wait for it. */
-	if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
+	if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags)) {
+		spin_unlock(&eb->refs_lock);
 		goto done;
+	}
+	spin_unlock(&eb->refs_lock);
 
 	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
 	eb->read_mirror = 0;

But this is much larger than your fix.
Although you may want some extra explanation on the fix.

Otherwise pretty happy to see we finally pinned it down.

Thanks,
Qu

>> I believe we can do that and properly solve it.
>> And go with fallbacks to use spinlock to make all the bits testing in a
>> critical section if the page lock one is no feasible.
>>
>> Thank you so much!
>> You really saved us all!
>
> No problem, happy to help :)
>
>> Thanks,
>> Qu
>
From fad1d5ff67e18ceaa8fa439f09be7f4320f327df Mon Sep 17 00:00:00 2001
Message-ID: <fad1d5ff67e18ceaa8fa439f09be7f4320f327df.1710309440.git.wqu@suse.com>
In-Reply-To: <cover.1710309440.git.wqu@suse.com>
References: <cover.1710309440.git.wqu@suse.com>
From: Qu Wenruo <wqu@suse.com>
Date: Wed, 13 Mar 2024 16:24:03 +1030
Subject: [PATCH 1/2] btrfs: add extra debug for eb read path

This is for the recently exposed but very hard to reproduce (for
everyone except awesome and helpful Tavian Barnes) bug, where we seem
to got trash contents for our extent buffer.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bb225798bb89..43331ba74c64 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -90,6 +90,18 @@ void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info)
 #define btrfs_leak_debug_del_eb(eb)			do {} while (0)
 #endif
 
+static inline void dump_extent_buffer(const char *prefix, struct extent_buffer *eb)
+{
+	u8 fsid[BTRFS_FSID_SIZE] = { 0 };
+
+	read_extent_buffer(eb, &fsid, offsetof(struct btrfs_header, fsid),
+			   BTRFS_FSID_SIZE);
+
+	trace_printk("%s, eb=%llu page_refs=%d eb level=%d fsid=%pUb\n",
+		     prefix, eb->start, atomic_read(&eb->refs),
+		     btrfs_header_level(eb), fsid);
+}
+
 /*
  * Structure to record info about the bio being assembled, and other info like
  * how many bytes are there before stripe/ordered extent boundary.
@@ -3481,6 +3493,7 @@ static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
 {
 	ASSERT(!extent_buffer_under_io(eb));
 
+	dump_extent_buffer("before release", eb);
 	for (int i = 0; i < INLINE_EXTENT_BUFFER_PAGES; i++) {
 		struct folio *folio = eb->folios[i];
 
@@ -3499,6 +3512,7 @@ static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
  */
 static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
 {
+	trace_printk("called from %s", __func__);
 	btrfs_release_extent_buffer_pages(eb);
 	btrfs_leak_debug_del_eb(eb);
 	__free_extent_buffer(eb);
@@ -3532,6 +3546,7 @@ struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
 	int num_folios = num_extent_folios(src);
 	int ret;
 
+	trace_printk("%s: alloc eb=%llu len=%u\n", __func__, src->start, src->len);
 	new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
 	if (new == NULL)
 		return NULL;
@@ -3573,6 +3588,7 @@ struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 	int num_folios = 0;
 	int ret;
 
+	trace_printk("%s: alloc eb=%llu len=%lu\n", __func__, start, len);
 	eb = __alloc_extent_buffer(fs_info, start, len);
 	if (!eb)
 		return NULL;
@@ -3893,6 +3909,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	if (eb)
 		return eb;
 
+	trace_printk("%s: alloc eb=%llu len=%lu\n", __func__, start, len);
 	eb = __alloc_extent_buffer(fs_info, start, len);
 	if (!eb)
 		return ERR_PTR(-ENOMEM);
@@ -4114,6 +4131,8 @@ static int release_extent_buffer(struct extent_buffer *eb)
 		}
 
 		btrfs_leak_debug_del_eb(eb);
+
+		trace_printk("called from %s", __func__);
 		/* Should be safe to release our pages at this point */
 		btrfs_release_extent_buffer_pages(eb);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
@@ -4347,9 +4366,12 @@ static void end_bbio_meta_read(struct btrfs_bio *bbio)
 
 	eb->read_mirror = bbio->mirror_num;
 
+	dump_extent_buffer("read done", eb);
 	if (uptodate &&
-	    btrfs_validate_extent_buffer(eb, &bbio->parent_check) < 0)
+	    btrfs_validate_extent_buffer(eb, &bbio->parent_check) < 0) {
+		trace_printk("!!! Validation failed, eb=%llu !!!\n", eb->start);
 		uptodate = false;
+	}
 
 	if (uptodate) {
 		set_extent_buffer_uptodate(eb);
-- 
2.44.0