Message ID | 07df5461bf34cf138f2f4b281a6fa6a0b389ff68.1673568238.git.boris@bur.io (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: hold block_group refcount during async discard | expand |
On Thu, Jan 12, 2023 at 04:05:11PM -0800, Boris Burkov wrote: > Async discard does not acquire the block group reference count while it > holds a reference on the discard list. This is generally OK, as the > paths which destroy block groups tend to try to synchronize on > cancelling async discard work. However, relying on cancelling work > requires careful analysis to be sure it is safe from races with > unpinning scheduling more work. > > While I am unable to find a race with unpinning in the current code for > either the unused bgs or relocation paths, I believe we have one in an > older version of auto relocation in a Meta internal build. This suggests > that this is in fact an error prone model, and could be fragile to > future changes to these bg deletion paths. Which version is that? I'll add stable tag anyway but for a cross reference from a real deployment. Thanks.
On Fri, Jan 13, 2023 at 02:26:25PM +0100, David Sterba wrote: > On Thu, Jan 12, 2023 at 04:05:11PM -0800, Boris Burkov wrote: > > Async discard does not acquire the block group reference count while it > > holds a reference on the discard list. This is generally OK, as the > > paths which destroy block groups tend to try to synchronize on > > cancelling async discard work. However, relying on cancelling work > > requires careful analysis to be sure it is safe from races with > > unpinning scheduling more work. > > > > While I am unable to find a race with unpinning in the current code for > > either the unused bgs or relocation paths, I believe we have one in an > > older version of auto relocation in a Meta internal build. This suggests > > that this is in fact an error prone model, and could be fragile to > > future changes to these bg deletion paths. > > Which version is that? I'll add stable tag anyway but for a cross > reference from a real deployment. Thanks. I misunderstood the state of our branch. The code we were running has Johannes's original patches for the reclaim worker and an internal only patch that removes the pinned check. So I don't think there is any commit in your tree which lacks the pinned check.
On 2023/1/13 08:05, Boris Burkov wrote: > Async discard does not acquire the block group reference count while it > holds a reference on the discard list. This is generally OK, as the > paths which destroy block groups tend to try to synchronize on > cancelling async discard work. However, relying on cancelling work > requires careful analysis to be sure it is safe from races with > unpinning scheduling more work. > > While I am unable to find a race with unpinning in the current code for > either the unused bgs or relocation paths, I believe we have one in an > older version of auto relocation in a Meta internal build. This suggests > that this is in fact an error prone model, and could be fragile to > future changes to these bg deletion paths. > > To make this ownership more clear, add a refcount for async discard. If > work is queued for a block group, its refcount should be incremented, > and when work is completed or canceled, it should be decremented. > > Signed-off-by: Boris Burkov <boris@bur.io> It looks like the patch is causing btrfs/011 to panic with the following ASSERT() failure: assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4259 ------------[ cut here ]------------ kernel BUG at fs/btrfs/messages.c:259! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: btrfs xor xor_neon raid6_pq zstd_compress vfat fat cfg80211 rfkill dm_mod fuse ext4 mbcache jbd2 virtio_blk virtio_net virtio_rng net_failover virtio_balloon failover virtio_pci virtio_pci_legacy_dev virtio_mmio virtio_pci_modern_dev CPU: 3 PID: 166866 Comm: umount Not tainted 6.2.0-rc4-custom+ #16 Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : btrfs_assertfail+0x28/0x2c [btrfs] lr : btrfs_assertfail+0x28/0x2c [btrfs] sp : ffff80000ff9bc20 x29: ffff80000ff9bc20 x28: ffff0000ebde2dc0 x27: 0000000000000000 x26: 0000000000000000 x25: ffff0000c30c9000 x24: dead000000000100 x23: dead000000000122 x22: ffff0000c11c9090 x21: ffff0000c11c9088 x20: ffff0000c11c9000 x19: ffff0000c30c90d8 x18: ffffffffffffffff x17: 2f7366206e69202c x16: 31203d3d20297366 x15: 65723e2d70756f72 x14: 675f6b636f6c6226 x13: ffff80000919ace8 x12: 0000000000000e0d x11: 00000000000004af x10: ffff80000924ace8 x9 : ffff80000919ace8 x8 : 00000000ffffdfff x7 : ffff80000924ace8 x6 : 80000000ffffe000 x5 : 0000000000000000 x4 : 0000000000000002 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff0000ebde2dc0 x0 : 0000000000000058 Call trace: btrfs_assertfail+0x28/0x2c [btrfs] btrfs_free_block_groups+0x420/0x480 [btrfs] close_ctree+0x470/0x4d0 [btrfs] btrfs_put_super+0x14/0x20 [btrfs] generic_shutdown_super+0x7c/0x120 kill_anon_super+0x18/0x40 btrfs_kill_super+0x18/0x30 [btrfs] deactivate_locked_super+0x44/0xe0 deactivate_super+0x88/0xa0 cleanup_mnt+0x9c/0x130 __cleanup_mnt+0x14/0x20 task_work_run+0x80/0xe0 do_notify_resume+0x124/0x1b0 el0_svc+0x74/0x84 el0t_64_sync_handler+0xf4/0x120 el0t_64_sync+0x190/0x194 Code: aa0403e2 b0000100 9134e000 95f1de14 (d4210000) ---[ end trace 0000000000000000 ]--- Weirdly on x86_64 it's very hard to reproduce. I have to go my aarch64 machine to reliably reproduce it. With this patch (already in misc-next), my aarch64 VM failed 2/2 during daily runs. With this patch reverted, I haven't hit it in 8 runs. Thus I guess there is some race in the patch. Thanks, Qu > --- > fs/btrfs/discard.c | 25 +++++++++++++++++++++++-- > 1 file changed, 23 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c > index ff2e524d9937..bcfd8beca543 100644 > --- a/fs/btrfs/discard.c > +++ b/fs/btrfs/discard.c > @@ -89,6 +89,8 @@ static void __add_to_discard_list(struct btrfs_discard_ctl *discard_ctl, > BTRFS_DISCARD_DELAY); > block_group->discard_state = BTRFS_DISCARD_RESET_CURSOR; > } > + if (list_empty(&block_group->discard_list)) > + btrfs_get_block_group(block_group); > > list_move_tail(&block_group->discard_list, > get_discard_list(discard_ctl, block_group)); > @@ -108,8 +110,12 @@ static void add_to_discard_list(struct btrfs_discard_ctl *discard_ctl, > static void add_to_discard_unused_list(struct btrfs_discard_ctl *discard_ctl, > struct btrfs_block_group *block_group) > { > + bool queued; > + > spin_lock(&discard_ctl->lock); > > + queued = !list_empty(&block_group->discard_list); > + > if (!btrfs_run_discard_work(discard_ctl)) { > spin_unlock(&discard_ctl->lock); > return; > @@ -121,6 +127,8 @@ static void add_to_discard_unused_list(struct btrfs_discard_ctl *discard_ctl, > block_group->discard_eligible_time = (ktime_get_ns() + > BTRFS_DISCARD_UNUSED_DELAY); > block_group->discard_state = BTRFS_DISCARD_RESET_CURSOR; > + if (!queued) > + btrfs_get_block_group(block_group); > list_add_tail(&block_group->discard_list, > &discard_ctl->discard_list[BTRFS_DISCARD_INDEX_UNUSED]); > > @@ -131,6 +139,7 @@ static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl, > struct btrfs_block_group *block_group) > { > bool running = false; > + bool queued = false; > > spin_lock(&discard_ctl->lock); > > @@ -140,7 +149,10 @@ static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl, > } > > block_group->discard_eligible_time = 0; > + queued = !list_empty(&block_group->discard_list); > list_del_init(&block_group->discard_list); > + if (queued && !running) > + btrfs_put_block_group(block_group); > > spin_unlock(&discard_ctl->lock); > > @@ -216,8 +228,10 @@ static struct btrfs_block_group *peek_discard_list( > block_group->used != 0) { > if (btrfs_is_block_group_data_only(block_group)) > __add_to_discard_list(discard_ctl, block_group); > - else > + else { > list_del_init(&block_group->discard_list); > + btrfs_put_block_group(block_group); > + } > goto again; > } > if (block_group->discard_state == BTRFS_DISCARD_RESET_CURSOR) { > @@ -511,6 +525,8 @@ static void btrfs_discard_workfn(struct work_struct *work) > spin_lock(&discard_ctl->lock); > discard_ctl->prev_discard = trimmed; > discard_ctl->prev_discard_time = now; > + if (list_empty(&block_group->discard_list)) > + btrfs_put_block_group(block_group); > discard_ctl->block_group = NULL; > __btrfs_discard_schedule_work(discard_ctl, now, false); > spin_unlock(&discard_ctl->lock); > @@ -651,8 +667,12 @@ void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info) > list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs, > bg_list) { > list_del_init(&block_group->bg_list); > - btrfs_put_block_group(block_group); > btrfs_discard_queue_work(&fs_info->discard_ctl, block_group); > + /* > + * This put is for the get done by btrfs_mark_bg_unused. > + * queueing discard incremented it for discard's reference. > + */ > + btrfs_put_block_group(block_group); > } > spin_unlock(&fs_info->unused_bgs_lock); > } > @@ -683,6 +703,7 @@ static void btrfs_discard_purge_list(struct btrfs_discard_ctl *discard_ctl) > if (block_group->used == 0) > btrfs_mark_bg_unused(block_group); > spin_lock(&discard_ctl->lock); > + btrfs_put_block_group(block_group); > } > } > spin_unlock(&discard_ctl->lock);
On 2023/1/19 08:21, Qu Wenruo wrote: > > > On 2023/1/13 08:05, Boris Burkov wrote: >> Async discard does not acquire the block group reference count while it >> holds a reference on the discard list. This is generally OK, as the >> paths which destroy block groups tend to try to synchronize on >> cancelling async discard work. However, relying on cancelling work >> requires careful analysis to be sure it is safe from races with >> unpinning scheduling more work. >> >> While I am unable to find a race with unpinning in the current code for >> either the unused bgs or relocation paths, I believe we have one in an >> older version of auto relocation in a Meta internal build. This suggests >> that this is in fact an error prone model, and could be fragile to >> future changes to these bg deletion paths. >> >> To make this ownership more clear, add a refcount for async discard. If >> work is queued for a block group, its refcount should be incremented, >> and when work is completed or canceled, it should be decremented. >> >> Signed-off-by: Boris Burkov <boris@bur.io> > > It looks like the patch is causing btrfs/011 to panic with the following > ASSERT() failure: > > assertion failed: refcount_read(&block_group->refs) == 1, in > fs/btrfs/block-group.c:4259 > ------------[ cut here ]------------ > kernel BUG at fs/btrfs/messages.c:259! > Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP > Modules linked in: btrfs xor xor_neon raid6_pq zstd_compress vfat fat > cfg80211 rfkill dm_mod fuse ext4 mbcache jbd2 virtio_blk virtio_net > virtio_rng net_failover virtio_balloon failover virtio_pci > virtio_pci_legacy_dev virtio_mmio virtio_pci_modern_dev > CPU: 3 PID: 166866 Comm: umount Not tainted 6.2.0-rc4-custom+ #16 > Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 > pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > pc : btrfs_assertfail+0x28/0x2c [btrfs] > lr : btrfs_assertfail+0x28/0x2c [btrfs] > sp : ffff80000ff9bc20 > x29: ffff80000ff9bc20 x28: ffff0000ebde2dc0 x27: 0000000000000000 > x26: 0000000000000000 x25: ffff0000c30c9000 x24: dead000000000100 > x23: dead000000000122 x22: ffff0000c11c9090 x21: ffff0000c11c9088 > x20: ffff0000c11c9000 x19: ffff0000c30c90d8 x18: ffffffffffffffff > x17: 2f7366206e69202c x16: 31203d3d20297366 x15: 65723e2d70756f72 > x14: 675f6b636f6c6226 x13: ffff80000919ace8 x12: 0000000000000e0d > x11: 00000000000004af x10: ffff80000924ace8 x9 : ffff80000919ace8 > x8 : 00000000ffffdfff x7 : ffff80000924ace8 x6 : 80000000ffffe000 > x5 : 0000000000000000 x4 : 0000000000000002 x3 : 0000000000000000 > x2 : 0000000000000000 x1 : ffff0000ebde2dc0 x0 : 0000000000000058 > Call trace: > btrfs_assertfail+0x28/0x2c [btrfs] > btrfs_free_block_groups+0x420/0x480 [btrfs] > close_ctree+0x470/0x4d0 [btrfs] > btrfs_put_super+0x14/0x20 [btrfs] > generic_shutdown_super+0x7c/0x120 > kill_anon_super+0x18/0x40 > btrfs_kill_super+0x18/0x30 [btrfs] > deactivate_locked_super+0x44/0xe0 > deactivate_super+0x88/0xa0 > cleanup_mnt+0x9c/0x130 > __cleanup_mnt+0x14/0x20 > task_work_run+0x80/0xe0 > do_notify_resume+0x124/0x1b0 > el0_svc+0x74/0x84 > el0t_64_sync_handler+0xf4/0x120 > el0t_64_sync+0x190/0x194 > Code: aa0403e2 b0000100 9134e000 95f1de14 (d4210000) > ---[ end trace 0000000000000000 ]--- > > Weirdly on x86_64 it's very hard to reproduce. > I have to go my aarch64 machine to reliably reproduce it. Extra info on the reproducibility. The 100% reproducibility needs to run at least btrfs/010 and btrfs/011 on aarch64. I normally just go `-g auto` and it's 100% reproducible on aarch64 at least. With "-g auto" run, I can also reproduce it on x86_64 now. And with the extra reproducibility, I can confirm with this patch reverted the crash is gone for sure. Thanks, Qu > > With this patch (already in misc-next), my aarch64 VM failed 2/2 during > daily runs. > With this patch reverted, I haven't hit it in 8 runs. > > Thus I guess there is some race in the patch. > > Thanks, > Qu > >> --- >> fs/btrfs/discard.c | 25 +++++++++++++++++++++++-- >> 1 file changed, 23 insertions(+), 2 deletions(-) >> >> diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c >> index ff2e524d9937..bcfd8beca543 100644 >> --- a/fs/btrfs/discard.c >> +++ b/fs/btrfs/discard.c >> @@ -89,6 +89,8 @@ static void __add_to_discard_list(struct >> btrfs_discard_ctl *discard_ctl, >> BTRFS_DISCARD_DELAY); >> block_group->discard_state = BTRFS_DISCARD_RESET_CURSOR; >> } >> + if (list_empty(&block_group->discard_list)) >> + btrfs_get_block_group(block_group); >> list_move_tail(&block_group->discard_list, >> get_discard_list(discard_ctl, block_group)); >> @@ -108,8 +110,12 @@ static void add_to_discard_list(struct >> btrfs_discard_ctl *discard_ctl, >> static void add_to_discard_unused_list(struct btrfs_discard_ctl >> *discard_ctl, >> struct btrfs_block_group *block_group) >> { >> + bool queued; >> + >> spin_lock(&discard_ctl->lock); >> + queued = !list_empty(&block_group->discard_list); >> + >> if (!btrfs_run_discard_work(discard_ctl)) { >> spin_unlock(&discard_ctl->lock); >> return; >> @@ -121,6 +127,8 @@ static void add_to_discard_unused_list(struct >> btrfs_discard_ctl *discard_ctl, >> block_group->discard_eligible_time = (ktime_get_ns() + >> BTRFS_DISCARD_UNUSED_DELAY); >> block_group->discard_state = BTRFS_DISCARD_RESET_CURSOR; >> + if (!queued) >> + btrfs_get_block_group(block_group); >> list_add_tail(&block_group->discard_list, >> &discard_ctl->discard_list[BTRFS_DISCARD_INDEX_UNUSED]); >> @@ -131,6 +139,7 @@ static bool remove_from_discard_list(struct >> btrfs_discard_ctl *discard_ctl, >> struct btrfs_block_group *block_group) >> { >> bool running = false; >> + bool queued = false; >> spin_lock(&discard_ctl->lock); >> @@ -140,7 +149,10 @@ static bool remove_from_discard_list(struct >> btrfs_discard_ctl *discard_ctl, >> } >> block_group->discard_eligible_time = 0; >> + queued = !list_empty(&block_group->discard_list); >> list_del_init(&block_group->discard_list); >> + if (queued && !running) >> + btrfs_put_block_group(block_group); >> spin_unlock(&discard_ctl->lock); >> @@ -216,8 +228,10 @@ static struct btrfs_block_group *peek_discard_list( >> block_group->used != 0) { >> if (btrfs_is_block_group_data_only(block_group)) >> __add_to_discard_list(discard_ctl, block_group); >> - else >> + else { >> list_del_init(&block_group->discard_list); >> + btrfs_put_block_group(block_group); >> + } >> goto again; >> } >> if (block_group->discard_state == BTRFS_DISCARD_RESET_CURSOR) { >> @@ -511,6 +525,8 @@ static void btrfs_discard_workfn(struct >> work_struct *work) >> spin_lock(&discard_ctl->lock); >> discard_ctl->prev_discard = trimmed; >> discard_ctl->prev_discard_time = now; >> + if (list_empty(&block_group->discard_list)) >> + btrfs_put_block_group(block_group); >> discard_ctl->block_group = NULL; >> __btrfs_discard_schedule_work(discard_ctl, now, false); >> spin_unlock(&discard_ctl->lock); >> @@ -651,8 +667,12 @@ void btrfs_discard_punt_unused_bgs_list(struct >> btrfs_fs_info *fs_info) >> list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs, >> bg_list) { >> list_del_init(&block_group->bg_list); >> - btrfs_put_block_group(block_group); >> btrfs_discard_queue_work(&fs_info->discard_ctl, block_group); >> + /* >> + * This put is for the get done by btrfs_mark_bg_unused. >> + * queueing discard incremented it for discard's reference. >> + */ >> + btrfs_put_block_group(block_group); >> } >> spin_unlock(&fs_info->unused_bgs_lock); >> } >> @@ -683,6 +703,7 @@ static void btrfs_discard_purge_list(struct >> btrfs_discard_ctl *discard_ctl) >> if (block_group->used == 0) >> btrfs_mark_bg_unused(block_group); >> spin_lock(&discard_ctl->lock); >> + btrfs_put_block_group(block_group); >> } >> } >> spin_unlock(&discard_ctl->lock);
On Thu, Jan 19, 2023 at 08:21:42AM +0800, Qu Wenruo wrote: > > > On 2023/1/13 08:05, Boris Burkov wrote: > > Async discard does not acquire the block group reference count while it > > holds a reference on the discard list. This is generally OK, as the > > paths which destroy block groups tend to try to synchronize on > > cancelling async discard work. However, relying on cancelling work > > requires careful analysis to be sure it is safe from races with > > unpinning scheduling more work. > > > > While I am unable to find a race with unpinning in the current code for > > either the unused bgs or relocation paths, I believe we have one in an > > older version of auto relocation in a Meta internal build. This suggests > > that this is in fact an error prone model, and could be fragile to > > future changes to these bg deletion paths. > > > > To make this ownership more clear, add a refcount for async discard. If > > work is queued for a block group, its refcount should be incremented, > > and when work is completed or canceled, it should be decremented. > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > It looks like the patch is causing btrfs/011 to panic with the following > ASSERT() failure: I see that in btrfs/004 and there's a use after free in 005 that's also attached, though it obviously depends on the former. This seems like a race conditon, this was on an NVMe as backing storage, HDD VMs don't reproduce that. btrfs/004 [21:51:14][ 153.471653] run fstests btrfs/004 at 2023-01-18 21:51:15 [ 154.123838] BTRFS info (device vda): using crc32c (crc32c-intel) checksum algorithm [ 154.124856] BTRFS info (device vda): using free space tree [ 154.133918] BTRFS info (device vda): auto enabling async discard [ 154.846829] BTRFS: device fsid 7f589b0c-c7b5-44e5-bf6c-7635414513b2 devid 1 transid 6 /dev/vdb scanned by mkfs.btrfs (19201) [ 154.888485] BTRFS info (device vdb): using crc32c (crc32c-intel) checksum algorithm [ 154.889428] BTRFS info (device vdb): using free space tree [ 154.902184] BTRFS info (device vdb): auto enabling async discard [ 154.903864] BTRFS info (device vdb): checking UUID tree [ 159.653583] BTRFS info (device vdb): using crc32c (crc32c-intel) checksum algorithm [ 159.654524] BTRFS info (device vdb): setting incompat feature flag for COMPRESS_LZO (0x8) [ 159.655693] BTRFS info (device vdb): use lzo compression, level 0 [ 159.656538] BTRFS info (device vdb): using free space tree [ 159.666609] BTRFS info (device vdb): auto enabling async discard [ 165.550875] BTRFS info (device vdb): using crc32c (crc32c-intel) checksum algorithm [ 165.551799] BTRFS info (device vdb): using free space tree [ 165.570278] BTRFS info (device vdb): auto enabling async discard [ 298.053804] perf: interrupt took too long (2549 > 2500), lowering kernel.perf_event_max_sample_rate to 78250 [ 336.179737] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4259 [ 336.180918] ------------[ cut here ]------------ [ 336.181476] kernel BUG at fs/btrfs/messages.c:259! [ 336.182032] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [ 336.182597] CPU: 0 PID: 18955 Comm: umount Not tainted 6.2.0-rc4-default+ #67 [ 336.183368] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014 [ 336.184529] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs] [ 336.186947] RSP: 0018:ffff888108a97c20 EFLAGS: 00010282 [ 336.187431] RAX: 0000000000000058 RBX: ffff88800320b998 RCX: 0000000000000000 [ 336.188097] RDX: 0000000000000058 RSI: 0000000000000008 RDI: ffffed1021152f77 [ 336.188832] RBP: ffff888103840000 R08: 0000000000000001 R09: ffff888108a9796f [ 336.189572] R10: ffffed1021152f2d R11: 0000000000000000 R12: ffff88800320b9c0 [ 336.190307] R13: ffff88800320b800 R14: ffff888100ccd1b0 R15: ffff888109c3b5b8 [ 336.191037] FS: 00007f4b4d984800(0000) GS:ffff88811a800000(0000) knlGS:0000000000000000 [ 336.191880] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 336.192494] CR2: 00007f0ef124d31e CR3: 0000000100d02001 CR4: 00000000003706b0 [ 336.193207] Call Trace: [ 336.193559] <TASK> [ 336.193874] btrfs_free_block_groups.cold+0x52/0xae [btrfs] [ 336.194643] close_ctree+0x6c2/0x761 [btrfs] [ 336.195149] ? __wait_for_common+0x2b8/0x360 [ 336.195595] ? btrfs_cleanup_one_transaction.cold+0x7a/0x7a [btrfs] [ 336.196235] ? lockdep_unregister_key+0x214/0x3c0 [ 336.196713] ? mark_held_locks+0x6b/0x90 [ 336.197114] ? lockdep_hardirqs_on_prepare+0x13d/0x200 [ 336.197706] ? __call_rcu_common.constprop.0+0x1ea/0x3d0 [ 336.198320] ? trace_hardirqs_on+0x2d/0x110 [ 336.198829] ? __call_rcu_common.constprop.0+0x1ea/0x3d0 [ 336.199440] generic_shutdown_super+0xb0/0x1c0 [ 336.199995] kill_anon_super+0x1e/0x40 [ 336.200455] btrfs_kill_super+0x25/0x30 [btrfs] [ 336.201077] deactivate_locked_super+0x4c/0xc0 [ 336.201614] cleanup_mnt+0x13d/0x1f0 [ 336.202094] task_work_run+0xf2/0x170 [ 336.202591] ? task_work_cancel+0x20/0x20 [ 336.203084] ? lockdep_hardirqs_on_prepare+0x13d/0x200 [ 336.203719] exit_to_user_mode_prepare+0x109/0x130 [ 336.204292] syscall_exit_to_user_mode+0x19/0x40 [ 336.204843] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 336.205436] RIP: 0033:0x7f4b4dbb410b [ 336.207563] RSP: 002b:00007ffc89ec2038 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [ 336.208558] RAX: 0000000000000000 RBX: 00005643b954f9f0 RCX: 00007f4b4dbb410b [ 336.209478] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00005643b9550c90 [ 336.210390] RBP: 00005643b954fb00 R08: 0000000000000000 R09: 0000000000000073 [ 336.211278] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 336.211934] R13: 00005643b9550c90 R14: 00005643b954fc20 R15: 00007ffc89ec509e [ 336.212544] </TASK> [ 336.212804] Modules linked in: btrfs blake2b_generic libcrc32c crc32c_intel xor lzo_compress lzo_decompress raid6_pq zstd_decompress zstd_compress xxhash zstd_common loop [ 336.214123] ---[ end trace 0000000000000000 ]--- [ 336.214560] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs] [ 336.216649] RSP: 0018:ffff888108a97c20 EFLAGS: 00010282 [ 336.217141] RAX: 0000000000000058 RBX: ffff88800320b998 RCX: 0000000000000000 [ 336.217754] RDX: 0000000000000058 RSI: 0000000000000008 RDI: ffffed1021152f77 [ 336.218364] RBP: ffff888103840000 R08: 0000000000000001 R09: ffff888108a9796f [ 336.218972] R10: ffffed1021152f2d R11: 0000000000000000 R12: ffff88800320b9c0 [ 336.219582] R13: ffff88800320b800 R14: ffff888100ccd1b0 R15: ffff888109c3b5b8 [ 336.220194] FS: 00007f4b4d984800(0000) GS:ffff88811a800000(0000) knlGS:0000000000000000 [ 336.220898] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 336.221587] CR2: 00007f0ef124d31e CR3: 0000000100d02001 CR4: 00000000003706b0 _check_btrfs_filesystem: filesystem on /dev/vdb is inconsistent (see /tmp/fstests/results//btrfs/004.full for details) _check_dmesg: something found in dmesg (see /tmp/fstests/results//btrfs/004.dmesg) [21:54:17] btrfs/005 [21:54:17][ 336.472919] run fstests btrfs/005 at 2023-01-18 21:54:18 [ 336.919873] BTRFS info (device vda): using crc32c (crc32c-intel) checksum algorithm [ 336.920734] BTRFS info (device vda): using free space tree [ 336.927751] BTRFS info (device vda): auto enabling async discard [ 337.157673] ================================================================== [ 337.158514] BUG: KASAN: use-after-free in rwsem_down_write_slowpath+0xb3a/0xc50 [ 337.159355] Read of size 4 at addr ffff88810aabb334 by task mount/19191 [ 337.160087] [ 337.160335] CPU: 0 PID: 19191 Comm: mount Tainted: G D 6.2.0-rc4-default+ #67 [ 337.161292] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014 [ 337.162388] Call Trace: [ 337.162665] <TASK> [ 337.162919] dump_stack_lvl+0x4c/0x5f [ 337.163286] print_report+0x196/0x48e [ 337.166960] ? __virt_addr_valid+0xc3/0x100 [ 337.167372] ? kasan_addr_to_slab+0x9/0xb0 [ 337.167773] ? rwsem_down_write_slowpath+0xb3a/0xc50 [ 337.168228] kasan_report+0xbb/0xf0 [ 337.168576] ? rwsem_down_write_slowpath+0xb3a/0xc50 [ 337.169027] rwsem_down_write_slowpath+0xb3a/0xc50 [ 337.169483] ? memcmp+0x83/0xa0 [ 337.169824] ? down_timeout+0x70/0x70 [ 337.170186] ? lock_acquire+0xbd/0x3c0 [ 337.170556] ? lock_contended+0xb3/0x6d0 [ 337.170950] ? __down_write_trylock+0xb7/0x270 [ 337.171382] ? debug_check_no_locks_held+0x50/0x50 [ 337.171835] ? rcu_read_lock_sched_held+0x10/0x70 [ 337.172280] ? lock_release+0xb2/0x6c0 [ 337.172649] ? lock_acquire+0xbd/0x3c0 [ 337.173018] ? grab_super+0x34/0x100 [ 337.173375] down_write+0x1c9/0x1e0 [ 337.173728] ? down_write_killable_nested+0x200/0x200 [ 337.174203] ? lock_contended+0x6d0/0x6d0 [ 337.174594] grab_super+0x3c/0x100 [ 337.174937] ? __traceiter_raid56_scrub_read_recover+0x70/0x70 [btrfs] [ 337.175613] sget+0x201/0x2c0 [ 337.175982] ? btrfs_kill_super+0x30/0x30 [btrfs] [ 337.176635] btrfs_mount_root+0x396/0x610 [btrfs] [ 337.177244] ? __bpf_trace_find_free_extent_search_loop+0xe0/0xe0 [btrfs] [ 337.178003] ? legacy_parse_param+0x48/0x3a0 [ 337.178503] ? strcmp+0x2e/0x50 [ 337.178927] ? rcu_read_lock_sched_held+0x10/0x70 [ 337.179437] ? kfree+0x130/0x1b0 [ 337.179870] ? vfs_parse_fs_string+0xd4/0x120 [ 337.180408] ? vfs_parse_fs_param+0x180/0x180 [ 337.180890] ? kasan_set_track+0x21/0x30 [ 337.181285] ? __kasan_kmalloc+0x86/0x90 [ 337.181670] ? __bpf_trace_find_free_extent_search_loop+0xe0/0xe0 [btrfs] [ 337.182332] legacy_get_tree+0x80/0xd0 [ 337.182703] vfs_get_tree+0x43/0xf0 [ 337.183053] vfs_kern_mount.part.0+0x73/0xd0 [ 337.183521] btrfs_mount+0x1b6/0x590 [btrfs] [ 337.184164] ? vfs_parse_fs_string+0xac/0x120 [ 337.184587] ? entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 337.185064] ? btrfs_show_options+0x900/0x900 [btrfs] [ 337.185611] ? kasan_unpoison+0x23/0x50 [ 337.185994] ? __kasan_slab_alloc+0x2f/0x70 [ 337.186401] ? strcmp+0x2e/0x50 [ 337.186734] ? vfs_parse_fs_param_source+0x5c/0xe0 [ 337.187180] ? legacy_parse_param+0x48/0x3a0 [ 337.187620] ? strcmp+0x2e/0x50 [ 337.187952] ? rcu_read_lock_sched_held+0x10/0x70 [ 337.188385] ? kfree+0x130/0x1b0 [ 337.188718] ? vfs_parse_fs_string+0xd4/0x120 [ 337.189138] ? vfs_parse_fs_param+0x180/0x180 [ 337.189549] ? kasan_set_track+0x21/0x30 [ 337.189930] ? btrfs_show_options+0x900/0x900 [btrfs] [ 337.190480] legacy_get_tree+0x80/0xd0 [ 337.190848] vfs_get_tree+0x43/0xf0 [ 337.191198] path_mount+0x9b8/0xef0 [ 337.191551] ? lockdep_hardirqs_on_prepare+0xe/0x200 [ 337.192010] ? finish_automount+0x470/0x470 [ 337.192406] ? kasan_quarantine_put+0x76/0x1b0 [ 337.192832] ? user_path_at_empty+0x40/0x50 [ 337.193249] ? kmem_cache_free+0xf9/0x3f0 [ 337.193640] __x64_sys_mount+0x24c/0x2c0 [ 337.194026] ? copy_mnt_ns+0x5e0/0x5e0 [ 337.194396] ? getname_flags+0xa1/0x260 [ 337.194771] ? lockdep_hardirqs_on_prepare+0xe/0x200 [ 337.195224] ? syscall_enter_from_user_mode+0x1c/0x40 [ 337.195683] do_syscall_64+0x2b/0x50 [ 337.196045] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 337.196517] RIP: 0033:0x7f6b961da68e [ 337.198367] RSP: 002b:00007ffc8e3478e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [ 337.199038] RAX: ffffffffffffffda RBX: 0000563fcd6c39f0 RCX: 00007f6b961da68e [ 337.199647] RDX: 0000563fcd6c3c20 RSI: 0000563fcd6c3c60 RDI: 0000563fcd6c3c40 [ 337.200279] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000073 [ 337.200891] R10: 0000000000000000 R11: 0000000000000246 R12: 0000563fcd6c3c20 [ 337.201522] R13: 0000563fcd6c3c40 R14: 0000000000000000 R15: 00007f6b96322076 [ 337.202146] </TASK> [ 337.202407] [ 337.202620] Allocated by task 18950: [ 337.202977] kasan_save_stack+0x1c/0x40 [ 337.203356] kasan_set_track+0x21/0x30 [ 337.203747] __kasan_slab_alloc+0x65/0x70 [ 337.204162] kmem_cache_alloc_node+0x10e/0x220 [ 337.204666] copy_process+0x30c/0x2d10 [ 337.205098] kernel_clone+0xf4/0x4e0 [ 337.205484] __do_sys_clone+0xb6/0xf0 [ 337.205885] do_syscall_64+0x2b/0x50 [ 337.206250] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 337.206740] [ 337.206964] Freed by task 0: [ 337.207373] kasan_save_stack+0x1c/0x40 [ 337.207866] kasan_set_track+0x21/0x30 [ 337.208249] kasan_save_free_info+0x2a/0x40 [ 337.208656] ____kasan_slab_free+0x1b7/0x210 [ 337.209068] kmem_cache_free+0xf9/0x3f0 [ 337.209450] rcu_core+0x529/0xd40 [ 337.209794] __do_softirq+0x105/0x69a [ 337.210168] [ 337.210385] Last potentially related work creation: [ 337.210831] kasan_save_stack+0x1c/0x40 [ 337.211207] __kasan_record_aux_stack+0x100/0x110 [ 337.211652] __call_rcu_common.constprop.0+0x47/0x3d0 [ 337.212121] wait_consider_task+0xa61/0x1760 [ 337.212537] do_wait+0x3bd/0x5b0 [ 337.212872] kernel_wait4+0xeb/0x1c0 [ 337.213230] __do_sys_wait4+0xf0/0x100 [ 337.213601] do_syscall_64+0x2b/0x50 [ 337.213958] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 337.214424] [ 337.214635] Second to last potentially related work creation: [ 337.215141] kasan_save_stack+0x1c/0x40 [ 337.215517] __kasan_record_aux_stack+0x100/0x110 [ 337.215964] __call_rcu_common.constprop.0+0x47/0x3d0 [ 337.216431] wait_consider_task+0xa61/0x1760 [ 337.216840] do_wait+0x3bd/0x5b0 [ 337.217174] kernel_wait4+0xeb/0x1c0 [ 337.217532] __do_sys_wait4+0xf0/0x100 [ 337.217900] do_syscall_64+0x2b/0x50 [ 337.218257] entry_SYSCALL_64_after_hwframe+0x46/0xb0 [ 337.218722] [ 337.218933] The buggy address belongs to the object at ffff88810aabb300 [ 337.218933] which belongs to the cache task_struct of size 12616 [ 337.219953] The buggy address is located 52 bytes inside of [ 337.219953] 12616-byte region [ffff88810aabb300, ffff88810aabe448) [ 337.220930] [ 337.221144] The buggy address belongs to the physical page: [ 337.221731] page:ffff88813f0aae00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10aab8 [ 337.222586] head:ffff88813f0aae00 order:3 compound_mapcount:0 subpages_mapcount:0 compound_pincount:0 [ 337.223574] flags: 0x4300000010200(slab|head|section=33|zone=2) [ 337.224209] raw: 0004300000010200 ffff8881000519c0 ffff88813efde410 ffff88813cc72810 [ 337.225027] raw: 0000000000000000 0000000000020002 00000001ffffffff 0000000000000000 [ 337.225834] page dumped because: kasan: bad access detected [ 337.226425] [ 337.226663] Memory state around the buggy address: [ 337.227168] ffff88810aabb200: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 337.227936] ffff88810aabb280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 337.228703] >ffff88810aabb300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 337.229470] ^ [ 337.229997] ffff88810aabb380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 337.230766] ffff88810aabb400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 337.231534] ================================================================== [ 337.545046] EXT4-fs (sda): error count since last fsck: 63 [ 337.547861] EXT4-fs (sda): initial error at time 1667566565: ext4_mb_generate_buddy:1095 [ 337.550888] EXT4-fs (sda): last error at time 1674078512: ext4_mb_generate_buddy:1095 Connection closed by foreign host.
On Thu, Jan 19, 2023 at 08:21:42AM +0800, Qu Wenruo wrote: > > > On 2023/1/13 08:05, Boris Burkov wrote: > > Async discard does not acquire the block group reference count while it > > holds a reference on the discard list. This is generally OK, as the > > paths which destroy block groups tend to try to synchronize on > > cancelling async discard work. However, relying on cancelling work > > requires careful analysis to be sure it is safe from races with > > unpinning scheduling more work. > > > > While I am unable to find a race with unpinning in the current code for > > either the unused bgs or relocation paths, I believe we have one in an > > older version of auto relocation in a Meta internal build. This suggests > > that this is in fact an error prone model, and could be fragile to > > future changes to these bg deletion paths. > > > > To make this ownership more clear, add a refcount for async discard. If > > work is queued for a block group, its refcount should be incremented, > > and when work is completed or canceled, it should be decremented. > > > > Signed-off-by: Boris Burkov <boris@bur.io> > > It looks like the patch is causing btrfs/011 to panic with the following > ASSERT() failure: Fixed version replaced and pushed to misc-next. The fix is --- a/fs/btrfs/discard.c +++ b/fs/btrfs/discard.c @@ -540,7 +540,7 @@ static void btrfs_discard_workfn(struct work_struct *work) spin_lock(&discard_ctl->lock); discard_ctl->prev_discard = trimmed; discard_ctl->prev_discard_time = now; - if (list_empty(&block_group->discard_list)) + if (discard_ctl->block_group == NULL) __put_discard(block_group); discard_ctl->block_group = NULL; __btrfs_discard_schedule_work(discard_ctl, now, false); ---
diff --git a/fs/btrfs/discard.c b/fs/btrfs/discard.c index ff2e524d9937..bcfd8beca543 100644 --- a/fs/btrfs/discard.c +++ b/fs/btrfs/discard.c @@ -89,6 +89,8 @@ static void __add_to_discard_list(struct btrfs_discard_ctl *discard_ctl, BTRFS_DISCARD_DELAY); block_group->discard_state = BTRFS_DISCARD_RESET_CURSOR; } + if (list_empty(&block_group->discard_list)) + btrfs_get_block_group(block_group); list_move_tail(&block_group->discard_list, get_discard_list(discard_ctl, block_group)); @@ -108,8 +110,12 @@ static void add_to_discard_list(struct btrfs_discard_ctl *discard_ctl, static void add_to_discard_unused_list(struct btrfs_discard_ctl *discard_ctl, struct btrfs_block_group *block_group) { + bool queued; + spin_lock(&discard_ctl->lock); + queued = !list_empty(&block_group->discard_list); + if (!btrfs_run_discard_work(discard_ctl)) { spin_unlock(&discard_ctl->lock); return; @@ -121,6 +127,8 @@ static void add_to_discard_unused_list(struct btrfs_discard_ctl *discard_ctl, block_group->discard_eligible_time = (ktime_get_ns() + BTRFS_DISCARD_UNUSED_DELAY); block_group->discard_state = BTRFS_DISCARD_RESET_CURSOR; + if (!queued) + btrfs_get_block_group(block_group); list_add_tail(&block_group->discard_list, &discard_ctl->discard_list[BTRFS_DISCARD_INDEX_UNUSED]); @@ -131,6 +139,7 @@ static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl, struct btrfs_block_group *block_group) { bool running = false; + bool queued = false; spin_lock(&discard_ctl->lock); @@ -140,7 +149,10 @@ static bool remove_from_discard_list(struct btrfs_discard_ctl *discard_ctl, } block_group->discard_eligible_time = 0; + queued = !list_empty(&block_group->discard_list); list_del_init(&block_group->discard_list); + if (queued && !running) + btrfs_put_block_group(block_group); spin_unlock(&discard_ctl->lock); @@ -216,8 +228,10 @@ static struct btrfs_block_group *peek_discard_list( block_group->used != 0) { if (btrfs_is_block_group_data_only(block_group)) __add_to_discard_list(discard_ctl, block_group); - else + else { list_del_init(&block_group->discard_list); + btrfs_put_block_group(block_group); + } goto again; } if (block_group->discard_state == BTRFS_DISCARD_RESET_CURSOR) { @@ -511,6 +525,8 @@ static void btrfs_discard_workfn(struct work_struct *work) spin_lock(&discard_ctl->lock); discard_ctl->prev_discard = trimmed; discard_ctl->prev_discard_time = now; + if (list_empty(&block_group->discard_list)) + btrfs_put_block_group(block_group); discard_ctl->block_group = NULL; __btrfs_discard_schedule_work(discard_ctl, now, false); spin_unlock(&discard_ctl->lock); @@ -651,8 +667,12 @@ void btrfs_discard_punt_unused_bgs_list(struct btrfs_fs_info *fs_info) list_for_each_entry_safe(block_group, next, &fs_info->unused_bgs, bg_list) { list_del_init(&block_group->bg_list); - btrfs_put_block_group(block_group); btrfs_discard_queue_work(&fs_info->discard_ctl, block_group); + /* + * This put is for the get done by btrfs_mark_bg_unused. + * queueing discard incremented it for discard's reference. + */ + btrfs_put_block_group(block_group); } spin_unlock(&fs_info->unused_bgs_lock); } @@ -683,6 +703,7 @@ static void btrfs_discard_purge_list(struct btrfs_discard_ctl *discard_ctl) if (block_group->used == 0) btrfs_mark_bg_unused(block_group); spin_lock(&discard_ctl->lock); + btrfs_put_block_group(block_group); } } spin_unlock(&discard_ctl->lock);
Async discard does not acquire the block group reference count while it holds a reference on the discard list. This is generally OK, as the paths which destroy block groups tend to try to synchronize on cancelling async discard work. However, relying on cancelling work requires careful analysis to be sure it is safe from races with unpinning scheduling more work. While I am unable to find a race with unpinning in the current code for either the unused bgs or relocation paths, I believe we have one in an older version of auto relocation in a Meta internal build. This suggests that this is in fact an error prone model, and could be fragile to future changes to these bg deletion paths. To make this ownership more clear, add a refcount for async discard. If work is queued for a block group, its refcount should be incremented, and when work is completed or canceled, it should be decremented. Signed-off-by: Boris Burkov <boris@bur.io> --- fs/btrfs/discard.c | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-)