
About the weird tree block corruption

Message ID: c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com

Commit Message

Qu Wenruo March 13, 2024, 6:07 a.m. UTC
Hi Tavian,

Thanks for all the awesome help debugging the weird tree block corruption.
And sorry for the late reply.

Unfortunately I still failed to reproduce the bug, so I can only craft a
debug patchset for you to test.

The patchset contains two patches. The first one adds extra
trace_printk() calls to btrfs.
It requires CONFIG_FTRACE and CONFIG_DYNAMIC_FTRACE kernel configs to be
enabled.

You can use trace-cmd to help set up your environment, but in my
case I normally go with the following bash snippet:

  tracefs="/sys/kernel/debug/tracing"
  begin_trace()
  {
	trace-cmd clear
	echo 1 > $tracefs/tracing_on
  }

  stop_trace()
  {
	cp $tracefs/trace /tmp
	chmod 666 /tmp/trace
	trace-cmd clear
  }

  # Setup the environment
  begin_trace
  workload
  stop_trace

The second patch makes the tree-checker BUG_ON() when something
goes wrong.
This patch should only be applied if you can reliably reproduce it
inside a VM.

When using the 2nd patch, it's strongly recommended to enable the
following sysctl:

  kernel.ftrace_dump_on_oops = 1
  kernel.panic = 5
  kernel.panic_on_oops = 1

And you need a way to reliably access the VM (either netconsole or a
serial console setup).
In that case, you will get the whole ftrace buffer dumped to the
netconsole/serial console.

This has the extra benefit of reducing the noise, but it really needs a
reliable VM setup and can be a little tricky to set up.

Feel free to ask for any extra help setting up the environment, as you're
our last hope to properly pin down the bug.

Thanks,
Qu

Comments

Tavian Barnes March 14, 2024, 5:44 p.m. UTC | #1
On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> Hi Tavian,
>
> Thanks for all the awesome help debugging the weird tree block corruption.
> And sorry for the late reply.

No worries, thanks for your help!

> Unfortunately I still failed to reproduce the bug, so I can only craft a
> debug patchset for you to test.

Good news: I also failed to reproduce the bug on the latest
btrfs/for-next branch (ec616f34eba1 "btrfs: do not skip
re-registration for the mounted device").  It was still reproducing
the last time I pulled btrfs/for-next (09e6cef19c9f "btrfs: refactor
alloc_extent_buffer() to allocate-then-attach method").

This is a much smaller range of commits.  I didn't actually bisect
them, but I skimmed through the log and saw

commit 0a9bab391e336489169b95cb0d4553d921302189
Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Wed Jan 31 21:57:27 2024 +0100

   dm-crypt, dm-verity: disable tasklets

   Tasklets have an inherent problem with memory corruption. The function
   tasklet_action_common calls tasklet_trylock, then it calls the tasklet
   callback and then it calls tasklet_unlock. If the tasklet callback frees
   the structure that contains the tasklet or if it calls some code that may
   free it, tasklet_unlock will write into free memory.

which is awfully suspicious.  It appears to affect LUKS devices on
NVME drives with --perf=no_{read,write}_workqueue enabled, which
matches my configuration.  So my guess is that all my problems
(including the disk corruption I experienced recently) are caused by
this UAF in dm-crypt, which has been fixed since v6.8-rc3.
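
(To illustrate the pattern that commit message describes, here is a minimal,
purely hypothetical sketch, not actual dm-crypt code: if the tasklet callback
ends up freeing the structure that embeds the tasklet, the tasklet core still
writes to that memory afterwards.  The struct and helper names are made up.)

    /* Hypothetical example of the unsafe pattern described above. */
    static void my_io_tasklet_fn(struct tasklet_struct *t)
    {
            struct my_io *io = from_tasklet(io, t, tasklet);

            my_io_complete(io);     /* may free 'io', and the embedded tasklet */
    }
    /*
     * After the callback returns, tasklet_action_common() still calls
     * tasklet_unlock(t), i.e. it writes into the now-freed 'io'.
     */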

I'll let you know if I ever see this bug again, but for now I assume
the underlying issue was in dm-crypt, not btrfs.
Tavian Barnes March 14, 2024, 6:42 p.m. UTC | #2
On Thu, Mar 14, 2024 at 1:44 PM Tavian Barnes <tavianator@tavianator.com> wrote:
> On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > Hi Tavian,
> >
> > Thanks for all the awesome help debugging the weird tree block corruption.
> > And sorry for the late reply.
>
> No worries, thanks for your help!
>
> > Unfortunately I still failed to reproduce the bug, so I can only craft a
> > debug patchset for you to test.
>
> Good news: I also failed to reproduce the bug on the latest
> btrfs/for-next branch (ec616f34eba1 "btrfs: do not skip
> re-registration for the mounted device").  It was still reproducing
> the last time I pulled btrfs/for-next (09e6cef19c9f "btrfs: refactor
> alloc_extent_buffer() to allocate-then-attach method").
>
> This is a much smaller range of commits.  I didn't actually bisect
> them, but I skimmed through the log and saw
>
> commit 0a9bab391e336489169b95cb0d4553d921302189
> Author: Mikulas Patocka <mpatocka@redhat.com>
> Date:   Wed Jan 31 21:57:27 2024 +0100
>
>    dm-crypt, dm-verity: disable tasklets

Well, I suppose this can't be the relevant fix, because I reproduced
the "corrupted node" bug on 6.7.9 which includes a backport of that
commit.  I'll try to bisect when I get more time.
Tavian Barnes March 14, 2024, 8:25 p.m. UTC | #3
On Thu, Mar 14, 2024 at 2:42 PM Tavian Barnes <tavianator@tavianator.com> wrote:
> On Thu, Mar 14, 2024 at 1:44 PM Tavian Barnes <tavianator@tavianator.com> wrote:
> > On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > > Hi Tavian,
> > >
> > > Thanks for all the awesome help debugging the weird tree block corruption.
> > > And sorry for the late reply.
> >
> > No worries, thanks for your help!
> >
> > > Unfortunately I still failed to reproduce the bug, so I can only craft a
> > > debug patchset for you to test.
> >
> > Good news: I also failed to reproduce the bug on the latest
> > btrfs/for-next branch (ec616f34eba1 "btrfs: do not skip
> > re-registration for the mounted device").  It was still reproducing
> > the last time I pulled btrfs/for-next (09e6cef19c9f "btrfs: refactor
> > alloc_extent_buffer() to allocate-then-attach method").

Actually I have to walk all of this back.  I can still reproduce the
bug on ec616f34eba1.  I can't reproduce it on that commit *plus your
debugging patch* (btrfs: add extra debug for eb read path), which
suggests a timing issue, i.e. the extra debugging may be closing the
race window.

Additionally, shortly after reproducing the bug on ec616f34eba1, I got
this splat:

[  119.257741] BTRFS critical (device dm-2): corrupted node, root=518
block=16438782945911875046 owner mismatch, have 4517169229596899607
expect [256, 18446744073709551360]
[  119.257764] BTRFS critical (device dm-2): corrupted node, root=518
block=16438782945911875046 owner mismatch, have 4517169229596899607
expect [256, 18446744073709551360]
[  119.257775] BTRFS critical (device dm-2): corrupted node, root=518
block=16438782945911875046 owner mismatch, have 4517169229596899607
expect [256, 18446744073709551360]
[  119.257773] BTRFS critical (device dm-2): corrupted node, root=518
block=16438782945911875046 owner mismatch, have 4517169229596899607
expect [256, 18446744073709551360]
[  119.257782] BTRFS critical (device dm-2): corrupted node, root=518
block=16438782945911875046 owner mismatch, have 4517169229596899607
expect [256, 18446744073709551360]
[  119.257787] BTRFS critical (device dm-2): corrupted node, root=518
block=16438782945911875046 owner mismatch, have 4517169229596899607
expect [256, 18446744073709551360]
[  178.564993] ------------[ cut here ]------------
[  178.567792] BTRFS: Transaction aborted (error -17)
[  178.570606] WARNING: CPU: 2 PID: 5191 at fs/btrfs/inode.c:6450
btrfs_create_new_inode+0xa46/0xa70 [btrfs]
[  178.573433] Modules linked in: vhost_net vhost vhost_iotlb tap
xt_addrtype xt_conntrack xt_comment veth rpcrdma rdma_cm cmac iw_cm
algif_hash nct6775 algif_skcipher ib_cm nct6775_core af_alg ib_core
bnep hwmon_vid lm92 uvcvideo uvc intel_rapl_msr intel_rapl_common
amd64_edac edac_mce_amd snd_hda_codec_hdmi kvm_amd snd_hda_intel btusb
snd_intel_dspcfg btrtl snd_intel_sdw_acpi gspca_vc032x btintel
snd_usb_audio snd_hda_codec btbcm gspca_main kvm btmtk
videobuf2_vmalloc snd_usbmidi_lib nls_iso8859_1 videobuf2_memops
snd_ump videobuf2_v4l2 snd_hda_core snd_rawmidi vfat videodev
bluetooth irqbypass snd_hwdep snd_seq_device fat videobuf2_common mc
snd_pcm ecdh_generic crc16 snd_timer rapl wmi_bmof acpi_cpufreq
sp5100_tco pcspkr snd soundcore k10temp i2c_piix4 mousedev joydev
mac_hid nfsd usbip_host usbip_core pkcs8_key_parser auth_rpcgss
i2c_dev sg nfs_acl lockd crypto_user grace fuse loop sunrpc btrfs
blake2b_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted
asn1_encoder tee xt_MASQUERADE xt_tcpudp xt_mark
[  178.573514]  hid_logitech_hidpp hid_logitech_dj dm_mod tun uas
crct10dif_pclmul crc32_pclmul polyval_clmulni usb_storage
polyval_generic gf128mul ghash_clmulni_intel sha512_ssse3 usbhid
sha256_ssse3 sha1_ssse3 aesni_intel nvme igb bridge iwlwifi
crypto_simd nvme_core mxm_wmi ccp sr_mod cryptd dca xhci_pci cdrom
i2c_algo_bit xhci_pci_renesas nvme_auth stp wmi llc nf_tables cfg80211
ip6table_nat ip6table_filter ip6_tables iptable_nat nf_nat
nf_conntrack rfkill nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
crc32c_generic crc32c_intel iptable_filter nfnetlink ip_tables
x_tables
[  178.600199] CPU: 2 PID: 5191 Comm: git Not tainted
6.8.0-rc7-euclean+ #15 6731aa79f7b285a380c42229104b62a18d4313c1
[  178.603981] Hardware name: Micro-Star International Co., Ltd.
MS-7C60/TRX40 PRO WIFI (MS-7C60), BIOS 2.80 05/17/2022
[  178.607815] RIP: 0010:btrfs_create_new_inode+0xa46/0xa70 [btrfs]
[  178.611667] Code: cc 0a 00 c7 44 24 0c 01 00 00 00 e9 fc fb ff ff
41 bc f4 ff ff ff e9 a4 f8 ff ff 44 89 e6 48 c7 c7 48 00 3b c1 e8 1a
e7 c4 c2 <0f> 0b eb a4 44 89 e6 48 c7 c7 48 00 3b c1 e8 07 e7 c4 c2 0f
0b eb
[  178.615664] RSP: 0018:ffffbb8edf8afad8 EFLAGS: 00010282
[  178.619577] RAX: 0000000000000000 RBX: ffffbb8edf8afbb8 RCX: 0000000000000000
[  178.623516] RDX: ffff8fe8beeae780 RSI: ffff8fe8beea19c0 RDI: ffff8fe8beea19c0
[  178.627458] RBP: ffffbb8edf8afba0 R08: 0000000000000000 R09: ffffbb8edf8af960
[  178.631370] R10: 0000000000000003 R11: ffff8fe8bf1df668 R12: 00000000ffffffef
[  178.635281] R13: ffff8fc9cdc57d88 R14: ffff8fcb654541f8 R15: ffff8fc9d7c07110
[  178.639187] FS:  00007f86d59d9740(0000) GS:ffff8fe8bee80000(0000)
knlGS:0000000000000000
[  178.643131] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  178.647061] CR2: 000055c7d0a7a938 CR3: 000000023e8a8000 CR4: 0000000000350ef0
[  178.650998] Call Trace:
[  178.654898]  <TASK>
[  178.658761]  ? btrfs_create_new_inode+0xa46/0xa70 [btrfs
0221bef32b31f5d5536e63a0526995bec5247918]
[  178.662770]  ? __warn+0x81/0x130
[  178.666696]  ? btrfs_create_new_inode+0xa46/0xa70 [btrfs
0221bef32b31f5d5536e63a0526995bec5247918]
[  178.670749]  ? report_bug+0x171/0x1a0
[  178.674761]  ? prb_read_valid+0x1b/0x30
[  178.678786]  ? srso_return_thunk+0x5/0x5f
[  178.682786]  ? handle_bug+0x3c/0x80
[  178.686779]  ? exc_invalid_op+0x17/0x70
[  178.690769]  ? asm_exc_invalid_op+0x1a/0x20
[  178.694773]  ? btrfs_create_new_inode+0xa46/0xa70 [btrfs
0221bef32b31f5d5536e63a0526995bec5247918]
[  178.698925]  ? btrfs_create_new_inode+0xa46/0xa70 [btrfs
0221bef32b31f5d5536e63a0526995bec5247918]
[  178.703008]  btrfs_create_common+0xc4/0x130 [btrfs
0221bef32b31f5d5536e63a0526995bec5247918]
[  178.707133]  path_openat+0xe9f/0x1190
[  178.711196]  do_filp_open+0xb3/0x160
[  178.715254]  do_sys_openat2+0xab/0xe0
[  178.719294]  __x64_sys_openat+0x57/0xa0
[  178.723321]  do_syscall_64+0x89/0x170
[  178.727336]  ? srso_return_thunk+0x5/0x5f
[  178.731356]  ? do_syscall_64+0x96/0x170
[  178.735373]  ? srso_return_thunk+0x5/0x5f
[  178.739387]  ? syscall_exit_to_user_mode+0x80/0x230
[  178.743420]  ? srso_return_thunk+0x5/0x5f
[  178.747434]  ? do_syscall_64+0x96/0x170
[  178.751441]  ? srso_return_thunk+0x5/0x5f
[  178.755450]  ? srso_return_thunk+0x5/0x5f
[  178.759405]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  178.763406] RIP: 0033:0x7f86d5ad6dc0
[  178.767424] Code: 48 89 44 24 20 75 94 44 89 54 24 0c e8 d9 c9 f8
ff 44 8b 54 24 0c 89 da 48 89 ee 41 89 c0 bf 9c ff ff ff b8 01 01 00
00 0f 05 <48> 3d 00 f0 ff ff 77 38 44 89 c7 89 44 24 0c e8 2c ca f8 ff
8b 44
[  178.771580] RSP: 002b:00007ffcc4791f50 EFLAGS: 00000293 ORIG_RAX:
0000000000000101
[  178.775720] RAX: ffffffffffffffda RBX: 00000000000000c1 RCX: 00007f86d5ad6dc0
[  178.779865] RDX: 00000000000000c1 RSI: 000055c7d0a77140 RDI: 00000000ffffff9c
[  178.784014] RBP: 000055c7d0a77140 R08: 0000000000000000 R09: 000055c7d0a77140
[  178.788152] R10: 00000000000001b6 R11: 0000000000000293 R12: 00007ffcc4792360
[  178.792210] R13: 00007ffcc4792000 R14: 000055c7d06bb788 R15: 000055c7d0a77140
[  178.796192]  </TASK>
[  178.800079] ---[ end trace 0000000000000000 ]---
[  178.804024] BTRFS: error (device dm-2: state A) in
btrfs_create_new_inode:6450: errno=-17 Object already exists
Tavian Barnes March 15, 2024, 3:23 p.m. UTC | #4
On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [SNIP]
>
> The second patch is to making tree-checker to BUG_ON() when something
> went wrong.
> This patch should only be applied if you can reliably reproduce it
> inside a VM.

Okay, I have finally reproduced this in a VM.  I had to add this hunk
to your patch 0002 in order to trigger the BUG_ON:

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index c8fbcae4e88e..4ee7a717642a 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -2047,6 +2051,7 @@ int btrfs_check_eb_owner(const struct
extent_buffer *eb, u64 root_owner)
                                btrfs_header_level(eb) == 0 ? "leaf" : "node",
                                root_owner, btrfs_header_bytenr(eb), eb_owner,
                                root_owner);
+                       BUG_ON(1);
                        return -EUCLEAN;
                }
                return 0;
@@ -2062,6 +2067,7 @@ int btrfs_check_eb_owner(const struct
extent_buffer *eb, u64 root_owner)
                        btrfs_header_level(eb) == 0 ? "leaf" : "node",
                        root_owner, btrfs_header_bytenr(eb), eb_owner,
                        BTRFS_FIRST_FREE_OBJECTID, BTRFS_LAST_FREE_OBJECTID);
+               BUG_ON(1);
                return -EUCLEAN;
        }
        return 0;

> When using the 2nd patch, it's strongly recommended to enable the
> following sysctl:
>
>   kernel.ftrace_dump_on_oops = 1
>   kernel.panic = 5
>   kernel.panic_on_oops = 1

I also set kernel.oops_all_cpu_backtrace = 1, and ran with nowatchdog;
otherwise I got watchdog backtraces (due to the slow console) interspersed
with the traces, which made them hard to read.

> And you need a way to reliably access the VM (either netconsole or a
> serial console setup).
> In that case, you would got all the ftrace buffer to be dumped into the
> netconsole/serial console.
>
> This has the extra benefit of reducing the noise. But really needs a
> reliable VM setup and can be a little tricky to setup.

I got this to work; the console logs are attached.  I added

    echo 1 > $tracefs/buffer_size_kb

otherwise it tried to dump 48MiB over the serial console, which I
didn't have the patience for.  Hopefully that's a big enough buffer; I
can re-run it if you need more logs.

> Feel free to ask for any extra help to setup the environment, as you're
> our last hope to properly pin down the bug.

Hopefully this trace helps you debug this.  Let me know whenever you
have something else for me to test.

I can also try to send you the VM, but I'm not sure how to package it
up exactly.  It has two (emulated) NVMEs with LUKS and BTRFS raid0 on
top.
Qu Wenruo March 15, 2024, 7:51 p.m. UTC | #5
On 2024/3/16 01:53, Tavian Barnes wrote:
> On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> [SNIP]
>>
>> The second patch is to making tree-checker to BUG_ON() when something
>> went wrong.
>> This patch should only be applied if you can reliably reproduce it
>> inside a VM.
>
> Okay, I have finally reproduced this in a VM.  I had to add this hunk
> to your patch 0002 in order to trigger the BUG_ON:
>
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index c8fbcae4e88e..4ee7a717642a 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -2047,6 +2051,7 @@ int btrfs_check_eb_owner(const struct
> extent_buffer *eb, u64 root_owner)
>                                  btrfs_header_level(eb) == 0 ? "leaf" : "node",
>                                  root_owner, btrfs_header_bytenr(eb), eb_owner,
>                                  root_owner);
> +                       BUG_ON(1);
>                          return -EUCLEAN;
>                  }
>                  return 0;
> @@ -2062,6 +2067,7 @@ int btrfs_check_eb_owner(const struct
> extent_buffer *eb, u64 root_owner)
>                          btrfs_header_level(eb) == 0 ? "leaf" : "node",
>                          root_owner, btrfs_header_bytenr(eb), eb_owner,
>                          BTRFS_FIRST_FREE_OBJECTID, BTRFS_LAST_FREE_OBJECTID);
> +               BUG_ON(1);
>                  return -EUCLEAN;
>          }
>          return 0;
>
>> When using the 2nd patch, it's strongly recommended to enable the
>> following sysctl:
>>
>>    kernel.ftrace_dump_on_oops = 1
>>    kernel.panic = 5
>>    kernel.panic_on_oops = 1
>
> I also set kernel.oops_all_cpu_backtrace = 1, and ran with nowatchdog,
> otherwise I got watchdog backtraces (due to slow console) interspersed
> with the traces which was hard to read.

oops_all_cpu_backtrace looks a little overkill, and it seems to flood
the output.

>
>> And you need a way to reliably access the VM (either netconsole or a
>> serial console setup).
>> In that case, you would got all the ftrace buffer to be dumped into the
>> netconsole/serial console.
>>
>> This has the extra benefit of reducing the noise. But really needs a
>> reliable VM setup and can be a little tricky to setup.
>
> I got this to work, the console logs are attached.  I added
>
>      echo 1 > $tracefs/buffer_size_kb
>
> otherwise it tried to dump 48MiB over the serial console which I
> didn't have the patience for.  Hopefully that's a big enough buffer, I
> can re-run it if you need more logs.

That's totally fine, and that's exactly what I do during debugging.

>
>> Feel free to ask for any extra help to setup the environment, as you're
>> our last hope to properly pin down the bug.
>
> Hopefully this trace helps you debug this.  Let me know whenever you
> have something else for me to test.

The btrfs_crit() line is using btrfs_header_bytenr(), which can be
corrupted.

So it's much better to add extra trace_printk() to print eb->start so
that we can match the output.

But there is some interesting output: the trace_printk() in
btrfs_release_extent_buffer_pages() already shows that the page refs are
0 and that the contents are already incorrect.

It may be a clue, but without the proper matching trace, it's still hard
to say.

I'm afraid you will need to add extra trace_printk() lines, much
like this one, to all the return -EUCLEAN locations:

	trace_printk("eb=%llu\n", eb->start);

>
> I can also try to send you the VM, but I'm not sure how to package it
> up exactly.  It has two (emulated) NVMEs with LUKS and BTRFS raid0 on
> top.
>

Just sending me the rootfs qcow2 would be more than enough.
I can set up LUKS and btrfs myself here.

Thanks,
Qu
Tavian Barnes March 15, 2024, 8:01 p.m. UTC | #6
On Fri, Mar 15, 2024 at 11:23 AM Tavian Barnes
<tavianator@tavianator.com> wrote:
> On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > [SNIP]
> >
> > The second patch is to making tree-checker to BUG_ON() when something
> > went wrong.
> > This patch should only be applied if you can reliably reproduce it
> > inside a VM.
>
> Okay, I have finally reproduced this in a VM.  I had to add this hunk
> to your patch 0002 in order to trigger the BUG_ON:
>
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index c8fbcae4e88e..4ee7a717642a 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -2047,6 +2051,7 @@ int btrfs_check_eb_owner(const struct
> extent_buffer *eb, u64 root_owner)
>                                 btrfs_header_level(eb) == 0 ? "leaf" : "node",
>                                 root_owner, btrfs_header_bytenr(eb), eb_owner,
>                                 root_owner);
> +                       BUG_ON(1);
>                         return -EUCLEAN;
>                 }
>                 return 0;
> @@ -2062,6 +2067,7 @@ int btrfs_check_eb_owner(const struct
> extent_buffer *eb, u64 root_owner)
>                         btrfs_header_level(eb) == 0 ? "leaf" : "node",
>                         root_owner, btrfs_header_bytenr(eb), eb_owner,
>                         BTRFS_FIRST_FREE_OBJECTID, BTRFS_LAST_FREE_OBJECTID);
> +               BUG_ON(1);
>                 return -EUCLEAN;
>         }
>         return 0;
>
> > When using the 2nd patch, it's strongly recommended to enable the
> > following sysctl:
> >
> >   kernel.ftrace_dump_on_oops = 1
> >   kernel.panic = 5
> >   kernel.panic_on_oops = 1
>
> I also set kernel.oops_all_cpu_backtrace = 1, and ran with nowatchdog,
> otherwise I got watchdog backtraces (due to slow console) interspersed
> with the traces which was hard to read.
>
> > And you need a way to reliably access the VM (either netconsole or a
> > serial console setup).
> > In that case, you would got all the ftrace buffer to be dumped into the
> > netconsole/serial console.
> >
> > This has the extra benefit of reducing the noise. But really needs a
> > reliable VM setup and can be a little tricky to setup.
>
> I got this to work, the console logs are attached.  I added
>
>     echo 1 > $tracefs/buffer_size_kb
>
> otherwise it tried to dump 48MiB over the serial console which I
> didn't have the patience for.  Hopefully that's a big enough buffer, I
> can re-run it if you need more logs.
>
> > Feel free to ask for any extra help to setup the environment, as you're
> > our last hope to properly pin down the bug.
>
> Hopefully this trace helps you debug this.  Let me know whenever you
> have something else for me to test.
>
> I can also try to send you the VM, but I'm not sure how to package it
> up exactly.  It has two (emulated) NVMEs with LUKS and BTRFS raid0 on
> top.

I added eb->start to the "corrupted node/leaf" message so I could look
for relevant lines in the trace output.  From another run, I see this:

$ grep 'eb=15302115328' typescript-5
[ 2725.113804] BTRFS critical (device dm-0): corrupted leaf, root=258
block=15302115328 owner mismatch, have 13709377558419367261 expect
[256, 18446744073709551360] eb=15302115328
[ 2740.240046] iou-wrk--173727   15..... 2649295481us :
alloc_extent_buffer: alloc_extent_buffer: alloc eb=15302115328
len=16384
[ 2740.301767] kworker/-322      15..... 2649295735us :
end_bbio_meta_read: read done, eb=15302115328 page_refs=3 eb level=0
fsid=b66a67f0-8273-4158-b7bf-988bb5683000
[ 2740.328424] kworker/-5095     31..... 2649295941us :
end_bbio_meta_read: read done, eb=15302115328 page_refs=8 eb level=0
fsid=b66a67f0-8273-4158-b7bf-988bb5683000

I am surprised to see two end_bbio_meta_read lines with only one
matching alloc_extent_buffer.  That made me check the locking in
read_extent_buffer_pages() again, and I think I may have found
something.

Let's say two threads simultaneously call read_extent_buffer_pages()
on the same eb.  Thread 1 starts reading:

    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
        return 0;

    /*
     * We could have had EXTENT_BUFFER_UPTODATE cleared by the write
     * operation, which could potentially still be in flight.  In this case
     * we simply want to return an error.
     */
    if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
        return -EIO;

    /* (Thread 2 pre-empted here) */

    /* Someone else is already reading the buffer, just wait for it. */
    if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
        goto done;
    ...

Meanwhile, thread 2 does the same thing but gets preempted at the
marked point, before testing EXTENT_BUFFER_READING.  Now the read
finishes, and end_bbio_meta_read() does

            btrfs_folio_set_uptodate(fs_info, folio, start, len);
    ...
    clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
    smp_mb__after_atomic();
    wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);

Now thread 2 resumes executing, atomically sets EXTENT_BUFFER_READING
(again) and starts reading into the already-filled-in extent buffer.
This might normally be a benign race, except end_bbio_meta_read() has
also set EXTENT_BUFFER_UPTODATE.  So now if a third thread tries to
read the same extent buffer, it will do

    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
        return 0;

and return 0 *while the eb is still under I/O*.  The caller will then
try to read data from the extent buffer which is concurrently being
updated by the extra read started by thread 2.

I attached a candidate patch.  So far only compile-tested.
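
(The attachment itself is not inlined in this archive.  As a rough sketch
only, under the assumption that the double-check re-tests
EXTENT_BUFFER_UPTODATE right after winning the READING bit, it could look
something like this in read_extent_buffer_pages():)

    /* Someone else is already reading the buffer, just wait for it. */
    if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
            goto done;

    /*
     * Re-check UPTODATE: between the first test_bit(UPTODATE) and winning
     * the READING bit above, another reader may have started *and finished*
     * the read, so the eb is already valid and we must not issue a second
     * read into it.
     */
    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) {
            clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
            smp_mb__after_atomic();
            wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
            return 0;
    }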
Qu Wenruo March 15, 2024, 8:21 p.m. UTC | #7
On 2024/3/16 06:31, Tavian Barnes wrote:
> On Fri, Mar 15, 2024 at 11:23 AM Tavian Barnes
> <tavianator@tavianator.com> wrote:
>> On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>> [SNIP]
>>>
>>> The second patch is to making tree-checker to BUG_ON() when something
>>> went wrong.
>>> This patch should only be applied if you can reliably reproduce it
>>> inside a VM.
>>
>> Okay, I have finally reproduced this in a VM.  I had to add this hunk
>> to your patch 0002 in order to trigger the BUG_ON:
>>
>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>> index c8fbcae4e88e..4ee7a717642a 100644
>> --- a/fs/btrfs/tree-checker.c
>> +++ b/fs/btrfs/tree-checker.c
>> @@ -2047,6 +2051,7 @@ int btrfs_check_eb_owner(const struct
>> extent_buffer *eb, u64 root_owner)
>>                                  btrfs_header_level(eb) == 0 ? "leaf" : "node",
>>                                  root_owner, btrfs_header_bytenr(eb), eb_owner,
>>                                  root_owner);
>> +                       BUG_ON(1);
>>                          return -EUCLEAN;
>>                  }
>>                  return 0;
>> @@ -2062,6 +2067,7 @@ int btrfs_check_eb_owner(const struct
>> extent_buffer *eb, u64 root_owner)
>>                          btrfs_header_level(eb) == 0 ? "leaf" : "node",
>>                          root_owner, btrfs_header_bytenr(eb), eb_owner,
>>                          BTRFS_FIRST_FREE_OBJECTID, BTRFS_LAST_FREE_OBJECTID);
>> +               BUG_ON(1);
>>                  return -EUCLEAN;
>>          }
>>          return 0;
>>
>>> When using the 2nd patch, it's strongly recommended to enable the
>>> following sysctl:
>>>
>>>    kernel.ftrace_dump_on_oops = 1
>>>    kernel.panic = 5
>>>    kernel.panic_on_oops = 1
>>
>> I also set kernel.oops_all_cpu_backtrace = 1, and ran with nowatchdog,
>> otherwise I got watchdog backtraces (due to slow console) interspersed
>> with the traces which was hard to read.
>>
>>> And you need a way to reliably access the VM (either netconsole or a
>>> serial console setup).
>>> In that case, you would got all the ftrace buffer to be dumped into the
>>> netconsole/serial console.
>>>
>>> This has the extra benefit of reducing the noise. But really needs a
>>> reliable VM setup and can be a little tricky to setup.
>>
>> I got this to work, the console logs are attached.  I added
>>
>>      echo 1 > $tracefs/buffer_size_kb
>>
>> otherwise it tried to dump 48MiB over the serial console which I
>> didn't have the patience for.  Hopefully that's a big enough buffer, I
>> can re-run it if you need more logs.
>>
>>> Feel free to ask for any extra help to setup the environment, as you're
>>> our last hope to properly pin down the bug.
>>
>> Hopefully this trace helps you debug this.  Let me know whenever you
>> have something else for me to test.
>>
>> I can also try to send you the VM, but I'm not sure how to package it
>> up exactly.  It has two (emulated) NVMEs with LUKS and BTRFS raid0 on
>> top.
>
> I added eb->start to the "corrupted node/leaf" message so I could look
> for relevant lines in the trace output.  From another run, I see this:
>
> $ grep 'eb=15302115328' typescript-5
> [ 2725.113804] BTRFS critical (device dm-0): corrupted leaf, root=258
> block=15302115328 owner mismatch, have 13709377558419367261 expect
> [256, 18446744073709551360] eb=15302115328
> [ 2740.240046] iou-wrk--173727   15..... 2649295481us :
> alloc_extent_buffer: alloc_extent_buffer: alloc eb=15302115328
> len=16384
> [ 2740.301767] kworker/-322      15..... 2649295735us :
> end_bbio_meta_read: read done, eb=15302115328 page_refs=3 eb level=0
> fsid=b66a67f0-8273-4158-b7bf-988bb5683000
> [ 2740.328424] kworker/-5095     31..... 2649295941us :
> end_bbio_meta_read: read done, eb=15302115328 page_refs=8 eb level=0
> fsid=b66a67f0-8273-4158-b7bf-988bb5683000
>
> I am surprised to see two end_bbio_meta_read lines with only one
> matching alloc_extent_buffer.  That made me check the locking in
> read_extent_buffer_pages() again, and I think I may have found
> something.
>
> Let's say we get two threads simultaneously call
> read_extent_buffer_pages() on the same eb.  Thread 1 starts reading:
>
>      if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
>          return 0;
>
>      /*
>       * We could have had EXTENT_BUFFER_UPTODATE cleared by the write
>       * operation, which could potentially still be in flight.  In this case
>       * we simply want to return an error.
>       */
>      if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
>          return -EIO;
>
>      /* (Thread 2 pre-empted here) */
>
>      /* Someone else is already reading the buffer, just wait for it. */
>      if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
>          goto done;
>      ...
>
> Meanwhile, thread 2 does the same thing but gets preempted at the
> marked point, before testing EXTENT_BUFFER_READING.  Now the read
> finishes, and end_bbio_meta_read() does
>
>              btrfs_folio_set_uptodate(fs_info, folio, start, len);
>      ...
>      clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
>      smp_mb__after_atomic();
>      wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
>
> Now thread 2 resumes executing, atomically sets EXTENT_BUFFER_READING
> (again) and starts reading into the already-filled-in extent buffer.
> This might normally be a benign race, except end_bbio_meta_read() has
> also set EXTENT_BUFFER_UPTODATE.  So now if a third thread tries to
> read the same extent buffer, it will do
>
>      if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
>          return 0;
>
> and return 0 *while the eb is still under I/O*.  The caller will then
> try to read data from the extent buffer which is concurrently being
> updated by the extra read started by thread 2.

Awesome! Not only did you get a super good trace, you even pinned down
the root cause.

The trace indeed shows duplicated metadata read endio finishes, which
means we indeed have a race where we shouldn't.

Although I'm not 100% sure about the fix, the normal way we handle it
would be locking the page before the IO and unlocking it at endio time.

I believe we can do that and properly solve it, and fall back to a
spinlock that puts all the bit tests in one critical section if the
page lock approach is not feasible.

Thank you so much!
You really saved us all!

Thanks,
Qu
>
> I attached a candidate patch.  So far only compile-tested.
>
Tavian Barnes March 15, 2024, 10:15 p.m. UTC | #8
On Fri, Mar 15, 2024 at 4:21 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> 在 2024/3/16 06:31, Tavian Barnes 写道:
> > On Fri, Mar 15, 2024 at 11:23 AM Tavian Barnes
> > <tavianator@tavianator.com> wrote:
> >> On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >>> [SNIP]
> >>>
> >>> The second patch is to making tree-checker to BUG_ON() when something
> >>> went wrong.
> >>> This patch should only be applied if you can reliably reproduce it
> >>> inside a VM.
> >>
> >> Okay, I have finally reproduced this in a VM.  I had to add this hunk
> >> to your patch 0002 in order to trigger the BUG_ON:
> >>
> >> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> >> index c8fbcae4e88e..4ee7a717642a 100644
> >> --- a/fs/btrfs/tree-checker.c
> >> +++ b/fs/btrfs/tree-checker.c
> >> @@ -2047,6 +2051,7 @@ int btrfs_check_eb_owner(const struct
> >> extent_buffer *eb, u64 root_owner)
> >>                                  btrfs_header_level(eb) == 0 ? "leaf" : "node",
> >>                                  root_owner, btrfs_header_bytenr(eb), eb_owner,
> >>                                  root_owner);
> >> +                       BUG_ON(1);
> >>                          return -EUCLEAN;
> >>                  }
> >>                  return 0;
> >> @@ -2062,6 +2067,7 @@ int btrfs_check_eb_owner(const struct
> >> extent_buffer *eb, u64 root_owner)
> >>                          btrfs_header_level(eb) == 0 ? "leaf" : "node",
> >>                          root_owner, btrfs_header_bytenr(eb), eb_owner,
> >>                          BTRFS_FIRST_FREE_OBJECTID, BTRFS_LAST_FREE_OBJECTID);
> >> +               BUG_ON(1);
> >>                  return -EUCLEAN;
> >>          }
> >>          return 0;
> >>
> >>> When using the 2nd patch, it's strongly recommended to enable the
> >>> following sysctl:
> >>>
> >>>    kernel.ftrace_dump_on_oops = 1
> >>>    kernel.panic = 5
> >>>    kernel.panic_on_oops = 1
> >>
> >> I also set kernel.oops_all_cpu_backtrace = 1, and ran with nowatchdog,
> >> otherwise I got watchdog backtraces (due to slow console) interspersed
> >> with the traces which was hard to read.
> >>
> >>> And you need a way to reliably access the VM (either netconsole or a
> >>> serial console setup).
> >>> In that case, you would got all the ftrace buffer to be dumped into the
> >>> netconsole/serial console.
> >>>
> >>> This has the extra benefit of reducing the noise. But really needs a
> >>> reliable VM setup and can be a little tricky to setup.
> >>
> >> I got this to work, the console logs are attached.  I added
> >>
> >>      echo 1 > $tracefs/buffer_size_kb
> >>
> >> otherwise it tried to dump 48MiB over the serial console which I
> >> didn't have the patience for.  Hopefully that's a big enough buffer, I
> >> can re-run it if you need more logs.
> >>
> >>> Feel free to ask for any extra help to setup the environment, as you're
> >>> our last hope to properly pin down the bug.
> >>
> >> Hopefully this trace helps you debug this.  Let me know whenever you
> >> have something else for me to test.
> >>
> >> I can also try to send you the VM, but I'm not sure how to package it
> >> up exactly.  It has two (emulated) NVMEs with LUKS and BTRFS raid0 on
> >> top.
> >
> > I added eb->start to the "corrupted node/leaf" message so I could look
> > for relevant lines in the trace output.  From another run, I see this:
> >
> > $ grep 'eb=15302115328' typescript-5
> > [ 2725.113804] BTRFS critical (device dm-0): corrupted leaf, root=258
> > block=15302115328 owner mismatch, have 13709377558419367261 expect
> > [256, 18446744073709551360] eb=15302115328
> > [ 2740.240046] iou-wrk--173727   15..... 2649295481us :
> > alloc_extent_buffer: alloc_extent_buffer: alloc eb=15302115328
> > len=16384
> > [ 2740.301767] kworker/-322      15..... 2649295735us :
> > end_bbio_meta_read: read done, eb=15302115328 page_refs=3 eb level=0
> > fsid=b66a67f0-8273-4158-b7bf-988bb5683000
> > [ 2740.328424] kworker/-5095     31..... 2649295941us :
> > end_bbio_meta_read: read done, eb=15302115328 page_refs=8 eb level=0
> > fsid=b66a67f0-8273-4158-b7bf-988bb5683000
> >
> > I am surprised to see two end_bbio_meta_read lines with only one
> > matching alloc_extent_buffer.  That made me check the locking in
> > read_extent_buffer_pages() again, and I think I may have found
> > something.
> >
> > Let's say we get two threads simultaneously call
> > read_extent_buffer_pages() on the same eb.  Thread 1 starts reading:
> >
> >      if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
> >          return 0;
> >
> >      /*
> >       * We could have had EXTENT_BUFFER_UPTODATE cleared by the write
> >       * operation, which could potentially still be in flight.  In this case
> >       * we simply want to return an error.
> >       */
> >      if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
> >          return -EIO;
> >
> >      /* (Thread 2 pre-empted here) */
> >
> >      /* Someone else is already reading the buffer, just wait for it. */
> >      if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
> >          goto done;
> >      ...
> >
> > Meanwhile, thread 2 does the same thing but gets preempted at the
> > marked point, before testing EXTENT_BUFFER_READING.  Now the read
> > finishes, and end_bbio_meta_read() does
> >
> >              btrfs_folio_set_uptodate(fs_info, folio, start, len);
> >      ...
> >      clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
> >      smp_mb__after_atomic();
> >      wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
> >
> > Now thread 2 resumes executing, atomically sets EXTENT_BUFFER_READING
> > (again) and starts reading into the already-filled-in extent buffer.
> > This might normally be a benign race, except end_bbio_meta_read() has
> > also set EXTENT_BUFFER_UPTODATE.  So now if a third thread tries to
> > read the same extent buffer, it will do
> >
> >      if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
> >          return 0;
> >
> > and return 0 *while the eb is still under I/O*.  The caller will then
> > try to read data from the extent buffer which is concurrently being
> > updated by the extra read started by thread 2.
>
> Awesome! Not only you got a super good trace, but even pinned down the
> root cause.

Thanks!

> The trace indeed shows duplicated metadata read endio finishes, which
> means we indeed have race where we shouldn't.

Okay, glad to see I read the trace correctly.  Now that we think we
understand it, can you think of a more reliable reproducer?  It would
need lots of read_block_for_search() on the same block in parallel.
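
(Purely as a hypothetical direction, untested: something that hammers the
same directory's metadata from many threads right after dropping caches, so
several tasks race to read the same tree block.  The path and thread count
below are placeholders.)

    #include <dirent.h>
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define NTHREADS 32
    static const char *dir = "/mnt/btrfs/bigdir";   /* placeholder path */

    static void *worker(void *arg)
    {
            char path[4096];
            struct stat st;

            for (int i = 0; i < 1000; i++) {
                    DIR *d = opendir(dir);
                    struct dirent *de;

                    if (!d)
                            return NULL;
                    /* Each stat() walks the fs tree, hitting the same leaves. */
                    while ((de = readdir(d)) != NULL) {
                            snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
                            stat(path, &st);
                    }
                    closedir(d);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t tid[NTHREADS];

            /* Drop clean caches (needs root) so metadata must be re-read. */
            int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
            if (fd >= 0) {
                    write(fd, "3", 1);
                    close(fd);
            }

            for (int i = 0; i < NTHREADS; i++)
                    pthread_create(&tid[i], NULL, worker, NULL);
            for (int i = 0; i < NTHREADS; i++)
                    pthread_join(tid[i], NULL);
            return 0;
    }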

> Although I'm not 100% sure about the fix, the normal way we handle it
> would be locking the page before IO, and unlock it at endio time.

Fair enough; I don't really know the locking rules in btrfs or fs/mm
generally.  I'm pretty sure the double-checked approach from my patch
is enough to protect read_extent_buffer_pages() from racing with
itself in the way I described, but I don't know if a lock may be
required for other reasons.  It might be nice to avoid taking a lock
in this path but I'm not sure how important that is.

I do see that this code path used to take page locks until commit
d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer
reading").

I'm currently re-running the reproducer with my patch.  If it survives
the night without crashing, I'll post it formally, unless you'd like
to do it a different way.

> I believe we can do that and properly solve it.
> And go with fallbacks to use spinlock to make all the bits testing in a
> critical section if the page lock one is no feasible.
>
> Thank you so much!
> You really saved us all!

No problem, happy to help :)

> Thanks,
> Qu
Qu Wenruo March 15, 2024, 11:14 p.m. UTC | #9
On 2024/3/16 08:45, Tavian Barnes wrote:
> On Fri, Mar 15, 2024 at 4:21 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> 在 2024/3/16 06:31, Tavian Barnes 写道:
>>> On Fri, Mar 15, 2024 at 11:23 AM Tavian Barnes
>>> <tavianator@tavianator.com> wrote:
>>>> On Wed, Mar 13, 2024 at 2:07 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>> [SNIP]
>>>>>
>>>>> The second patch is to making tree-checker to BUG_ON() when something
>>>>> went wrong.
>>>>> This patch should only be applied if you can reliably reproduce it
>>>>> inside a VM.
>>>>
>>>> Okay, I have finally reproduced this in a VM.  I had to add this hunk
>>>> to your patch 0002 in order to trigger the BUG_ON:
>>>>
>>>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>>>> index c8fbcae4e88e..4ee7a717642a 100644
>>>> --- a/fs/btrfs/tree-checker.c
>>>> +++ b/fs/btrfs/tree-checker.c
>>>> @@ -2047,6 +2051,7 @@ int btrfs_check_eb_owner(const struct
>>>> extent_buffer *eb, u64 root_owner)
>>>>                                   btrfs_header_level(eb) == 0 ? "leaf" : "node",
>>>>                                   root_owner, btrfs_header_bytenr(eb), eb_owner,
>>>>                                   root_owner);
>>>> +                       BUG_ON(1);
>>>>                           return -EUCLEAN;
>>>>                   }
>>>>                   return 0;
>>>> @@ -2062,6 +2067,7 @@ int btrfs_check_eb_owner(const struct
>>>> extent_buffer *eb, u64 root_owner)
>>>>                           btrfs_header_level(eb) == 0 ? "leaf" : "node",
>>>>                           root_owner, btrfs_header_bytenr(eb), eb_owner,
>>>>                           BTRFS_FIRST_FREE_OBJECTID, BTRFS_LAST_FREE_OBJECTID);
>>>> +               BUG_ON(1);
>>>>                   return -EUCLEAN;
>>>>           }
>>>>           return 0;
>>>>
>>>>> When using the 2nd patch, it's strongly recommended to enable the
>>>>> following sysctl:
>>>>>
>>>>>     kernel.ftrace_dump_on_oops = 1
>>>>>     kernel.panic = 5
>>>>>     kernel.panic_on_oops = 1
>>>>
>>>> I also set kernel.oops_all_cpu_backtrace = 1, and ran with nowatchdog,
>>>> otherwise I got watchdog backtraces (due to slow console) interspersed
>>>> with the traces which was hard to read.
>>>>
>>>>> And you need a way to reliably access the VM (either netconsole or a
>>>>> serial console setup).
>>>>> In that case, you would got all the ftrace buffer to be dumped into the
>>>>> netconsole/serial console.
>>>>>
>>>>> This has the extra benefit of reducing the noise. But really needs a
>>>>> reliable VM setup and can be a little tricky to setup.
>>>>
>>>> I got this to work, the console logs are attached.  I added
>>>>
>>>>       echo 1 > $tracefs/buffer_size_kb
>>>>
>>>> otherwise it tried to dump 48MiB over the serial console which I
>>>> didn't have the patience for.  Hopefully that's a big enough buffer, I
>>>> can re-run it if you need more logs.
>>>>
>>>>> Feel free to ask for any extra help to setup the environment, as you're
>>>>> our last hope to properly pin down the bug.
>>>>
>>>> Hopefully this trace helps you debug this.  Let me know whenever you
>>>> have something else for me to test.
>>>>
>>>> I can also try to send you the VM, but I'm not sure how to package it
>>>> up exactly.  It has two (emulated) NVMEs with LUKS and BTRFS raid0 on
>>>> top.
>>>
>>> I added eb->start to the "corrupted node/leaf" message so I could look
>>> for relevant lines in the trace output.  From another run, I see this:
>>>
>>> $ grep 'eb=15302115328' typescript-5
>>> [ 2725.113804] BTRFS critical (device dm-0): corrupted leaf, root=258
>>> block=15302115328 owner mismatch, have 13709377558419367261 expect
>>> [256, 18446744073709551360] eb=15302115328
>>> [ 2740.240046] iou-wrk--173727   15..... 2649295481us :
>>> alloc_extent_buffer: alloc_extent_buffer: alloc eb=15302115328
>>> len=16384
>>> [ 2740.301767] kworker/-322      15..... 2649295735us :
>>> end_bbio_meta_read: read done, eb=15302115328 page_refs=3 eb level=0
>>> fsid=b66a67f0-8273-4158-b7bf-988bb5683000
>>> [ 2740.328424] kworker/-5095     31..... 2649295941us :
>>> end_bbio_meta_read: read done, eb=15302115328 page_refs=8 eb level=0
>>> fsid=b66a67f0-8273-4158-b7bf-988bb5683000
>>>
>>> I am surprised to see two end_bbio_meta_read lines with only one
>>> matching alloc_extent_buffer.  That made me check the locking in
>>> read_extent_buffer_pages() again, and I think I may have found
>>> something.
>>>
>>> Let's say we get two threads simultaneously call
>>> read_extent_buffer_pages() on the same eb.  Thread 1 starts reading:
>>>
>>>       if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
>>>           return 0;
>>>
>>>       /*
>>>        * We could have had EXTENT_BUFFER_UPTODATE cleared by the write
>>>        * operation, which could potentially still be in flight.  In this case
>>>        * we simply want to return an error.
>>>        */
>>>       if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
>>>           return -EIO;
>>>
>>>       /* (Thread 2 pre-empted here) */
>>>
>>>       /* Someone else is already reading the buffer, just wait for it. */
>>>       if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
>>>           goto done;
>>>       ...
>>>
>>> Meanwhile, thread 2 does the same thing but gets preempted at the
>>> marked point, before testing EXTENT_BUFFER_READING.  Now the read
>>> finishes, and end_bbio_meta_read() does
>>>
>>>               btrfs_folio_set_uptodate(fs_info, folio, start, len);
>>>       ...
>>>       clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
>>>       smp_mb__after_atomic();
>>>       wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
>>>
>>> Now thread 2 resumes executing, atomically sets EXTENT_BUFFER_READING
>>> (again) and starts reading into the already-filled-in extent buffer.
>>> This might normally be a benign race, except end_bbio_meta_read() has
>>> also set EXTENT_BUFFER_UPTODATE.  So now if a third thread tries to
>>> read the same extent buffer, it will do
>>>
>>>       if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
>>>           return 0;
>>>
>>> and return 0 *while the eb is still under I/O*.  The caller will then
>>> try to read data from the extent buffer which is concurrently being
>>> updated by the extra read started by thread 2.
>>
>> Awesome! Not only you got a super good trace, but even pinned down the
>> root cause.
>
> Thanks!
>
>> The trace indeed shows duplicated metadata read endio finishes, which
>> means we indeed have race where we shouldn't.
>
> Okay, glad to see I read the trace correctly.  Now that we think we
> understand it, can you think of a more reliable reproducer?  It would
> need lots of read_block_for_search() on the same block in parallel.
>
>> Although I'm not 100% sure about the fix, the normal way we handle it
>> would be locking the page before IO, and unlock it at endio time.
>
> Fair enough, I don't really know the locking rules in btrfs or fs/mm
> generally.  I'm pretty sure the double-checked approach from my patch
> is enough to protect read_extent_buffer_pages() from racing with
> itself in the way I described, but I don't know if a lock may be
> required for other reasons.  It might be nice to avoid taking a lock
> in this path but I'm not sure how important that is.
>
> I do see that this code path used to take page locks until commit
> d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer
> reading").
>
> I'm currently re-running the reproducer with my patch.  If it survives
> the night without crashing, I'll post it formally, unless you'd like
> to do it a different way.

After digging deeper, your fix looks fine.
(Just remove the cleanup, as we need a minimal hotfix for backporting;
don't forget to CC the fix to stable, and if possible identify the
regression commit, if it is indeed a regression.  I think the commit you
mentioned is the culprit, as previously we had extra page locking to
prevent this from happening.)

I can only come up with the following racing cases, and all of them
look fine with your fix:

Case 1:
                   T1              |              T2
----------------------------------+-------------------------------
test_bit(UPTODATE) == false       |
                                   | test_bit(UPTODATE) == false
test_set(READING) == true         |
                                   | test_set(READING) == false
                                   | goto wait

Everything is fine; this is the old common case.


Case 2:
                   T1              |              T2
----------------------------------+-------------------------------
test_bit(UPTODATE) == false       |
                                   | test_bit(UPTODATE) == false
test_set(READING) == true         |
end_io:                           |
set_bit(UPTODATE)                 |
clear_bit(READING)                |
                                   | test_set(READING) == true
                                   | test_bit(UPTODATE) == true
                                   | done

This is the old case where the double read would happen; now it's fine.

Case 3:
The extra race between UPTODATE and READING:

                   T1              |              T2
----------------------------------+-------------------------------
test_bit(UPTODATE) == false       |
                                   | test_bit(UPTODATE) == false
test_set(READING) == true         |
end_io:                           |
set_bit(UPTODATE)                 |
                                   | test_set(READING) == false
                                   | wait
clear_bit(READING)                |

Still safe, due to the fact that UPTODATE is set before clearing READING.

So yep, your fix should be fine.

Although initially I'd prefer a more straightforward spinlock-based
solution like this:

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7441245b1ceb..88fcf1df87c7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4318,20 +4318,28 @@ int read_extent_buffer_pages(struct
extent_buffer *eb, int wait, int mirror_num,
  	struct btrfs_bio *bbio;
  	bool ret;

-	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
+	spin_lock(&eb->refs_lock);
+	if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) {
+		spin_unlock(&eb->refs_lock);
  		return 0;
+	}

  	/*
  	 * We could have had EXTENT_BUFFER_UPTODATE cleared by the write
  	 * operation, which could potentially still be in flight.  In this case
  	 * we simply want to return an error.
  	 */
-	if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
+	if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))) {
+		spin_unlock(&eb->refs_lock);
  		return -EIO;
+	}

  	/* Someone else is already reading the buffer, just wait for it. */
-	if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
+	if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags)) {
+		spin_unlock(&eb->refs_lock);
  		goto done;
+	}
+	spin_unlock(&eb->refs_lock);

  	clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
  	eb->read_mirror = 0;

But this is much larger than your fix, although you may want to add some
extra explanation to the fix.

Otherwise pretty happy to see we finally pinned it down.

Thanks,
Qu
>> I believe we can do that and properly solve it.
>> And go with fallbacks to use spinlock to make all the bits testing in a
>> critical section if the page lock one is no feasible.
>>
>> Thank you so much!
>> You really saved us all!
>
> No problem, happy to help :)
>
>> Thanks,
>> Qu
>

Patch

From fad1d5ff67e18ceaa8fa439f09be7f4320f327df Mon Sep 17 00:00:00 2001
Message-ID: <fad1d5ff67e18ceaa8fa439f09be7f4320f327df.1710309440.git.wqu@suse.com>
In-Reply-To: <cover.1710309440.git.wqu@suse.com>
References: <cover.1710309440.git.wqu@suse.com>
From: Qu Wenruo <wqu@suse.com>
Date: Wed, 13 Mar 2024 16:24:03 +1030
Subject: [PATCH 1/2] btrfs: add extra debug for eb read path

This is for the recently exposed but very hard to reproduce (for
everyone except the awesome and helpful Tavian Barnes) bug, where we seem
to get trash contents in our extent buffer.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index bb225798bb89..43331ba74c64 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -90,6 +90,18 @@  void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info)
 #define btrfs_leak_debug_del_eb(eb)			do {} while (0)
 #endif
 
+static inline void dump_extent_buffer(const char *prefix, struct extent_buffer *eb)
+{
+	u8 fsid[BTRFS_FSID_SIZE] = { 0 };
+
+	read_extent_buffer(eb, &fsid, offsetof(struct btrfs_header, fsid),
+			   BTRFS_FSID_SIZE);
+
+	trace_printk("%s, eb=%llu page_refs=%d eb level=%d fsid=%pUb\n",
+			prefix, eb->start, atomic_read(&eb->refs),
+			btrfs_header_level(eb), fsid);
+}
+
 /*
  * Structure to record info about the bio being assembled, and other info like
  * how many bytes are there before stripe/ordered extent boundary.
@@ -3481,6 +3493,7 @@  static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
 {
 	ASSERT(!extent_buffer_under_io(eb));
 
+	dump_extent_buffer("before release", eb);
 	for (int i = 0; i < INLINE_EXTENT_BUFFER_PAGES; i++) {
 		struct folio *folio = eb->folios[i];
 
@@ -3499,6 +3512,7 @@  static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
  */
 static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
 {
+	trace_printk("called from %s", __func__);
 	btrfs_release_extent_buffer_pages(eb);
 	btrfs_leak_debug_del_eb(eb);
 	__free_extent_buffer(eb);
@@ -3532,6 +3546,7 @@  struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
 	int num_folios = num_extent_folios(src);
 	int ret;
 
+	trace_printk("%s: alloc eb=%llu len=%u\n", __func__, src->start, src->len);
 	new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
 	if (new == NULL)
 		return NULL;
@@ -3573,6 +3588,7 @@  struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
 	int num_folios = 0;
 	int ret;
 
+	trace_printk("%s: alloc eb=%llu len=%lu\n", __func__, start, len);
 	eb = __alloc_extent_buffer(fs_info, start, len);
 	if (!eb)
 		return NULL;
@@ -3893,6 +3909,7 @@  struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 	if (eb)
 		return eb;
 
+	trace_printk("%s: alloc eb=%llu len=%lu\n", __func__, start, len);
 	eb = __alloc_extent_buffer(fs_info, start, len);
 	if (!eb)
 		return ERR_PTR(-ENOMEM);
@@ -4114,6 +4131,8 @@  static int release_extent_buffer(struct extent_buffer *eb)
 		}
 
 		btrfs_leak_debug_del_eb(eb);
+
+		trace_printk("called from %s", __func__);
 		/* Should be safe to release our pages at this point */
 		btrfs_release_extent_buffer_pages(eb);
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
@@ -4347,9 +4366,12 @@  static void end_bbio_meta_read(struct btrfs_bio *bbio)
 
 	eb->read_mirror = bbio->mirror_num;
 
+	dump_extent_buffer("read done", eb);
 	if (uptodate &&
-	    btrfs_validate_extent_buffer(eb, &bbio->parent_check) < 0)
+	    btrfs_validate_extent_buffer(eb, &bbio->parent_check) < 0) {
+		trace_printk("!!! Validation failed, eb=%llu !!!\n", eb->start);
 		uptodate = false;
+	}
 
 	if (uptodate) {
 		set_extent_buffer_uptodate(eb);
-- 
2.44.0