Message ID | 91b0cff4e87fbc01a7a6fd7d068c6e54ea7296ea.1727777967.git.fdmanana@suse.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | btrfs: wait for fixup workers before stopping cleaner kthread during umount | expand |
在 2024/10/1 19:51, fdmanana@kernel.org 写道: > From: Filipe Manana <fdmanana@suse.com> > > During unmount, at close_ctree(), we have the following steps in this order: > > 1) Park the cleaner kthread - this doesn't destroy the kthread, it basically > halts its execution (wake ups against it work but do nothing); > > 2) We stop the cleaner kthread - this results in freeing the respective > struct task_struct; > > 3) We call btrfs_stop_all_workers() which waits for any jobs running in all > the work queues and then free the work queues. > > Syzbot reported a case where a fixup worker resulted in a crash when doing > a delayed iput on its inode while attempting to wake up the cleaner at > btrfs_add_delayed_iput(), because the task_struct of the cleaner kthread > was already freed. This can happen during unmount because we don't wait > for any fixup workers still running before we call kthread_stop() against > the cleaner kthread, which stops and free all its resources. > > Fix this by waiting for any fixup workers at close_ctree() before we call > kthread_stop() against the cleaner and run pending delayed iputs. > > The stack traces reported by syzbot were the following: > > BUG: KASAN: slab-use-after-free in __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065 > Read of size 8 at addr ffff8880272a8a18 by task kworker/u8:3/52 > > CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.12.0-rc1-syzkaller #0 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024 > Workqueue: btrfs-fixup btrfs_work_helper > Call Trace: > <TASK> > __dump_stack lib/dump_stack.c:94 [inline] > dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 > print_address_description mm/kasan/report.c:377 [inline] > print_report+0x169/0x550 mm/kasan/report.c:488 > kasan_report+0x143/0x180 mm/kasan/report.c:601 > __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065 > lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825 > __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] > _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162 > class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline] > try_to_wake_up+0xb0/0x1480 kernel/sched/core.c:4154 > btrfs_writepage_fixup_worker+0xc16/0xdf0 fs/btrfs/inode.c:2842 > btrfs_work_helper+0x390/0xc50 fs/btrfs/async-thread.c:314 > process_one_work kernel/workqueue.c:3229 [inline] > process_scheduled_works+0xa63/0x1850 kernel/workqueue.c:3310 > worker_thread+0x870/0xd30 kernel/workqueue.c:3391 > kthread+0x2f0/0x390 kernel/kthread.c:389 > ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 > </TASK> > > Allocated by task 2: > kasan_save_stack mm/kasan/common.c:47 [inline] > kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 > unpoison_slab_object mm/kasan/common.c:319 [inline] > __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345 > kasan_slab_alloc include/linux/kasan.h:247 [inline] > slab_post_alloc_hook mm/slub.c:4086 [inline] > slab_alloc_node mm/slub.c:4135 [inline] > kmem_cache_alloc_node_noprof+0x16b/0x320 mm/slub.c:4187 > alloc_task_struct_node kernel/fork.c:180 [inline] > dup_task_struct+0x57/0x8c0 kernel/fork.c:1107 > copy_process+0x5d1/0x3d50 kernel/fork.c:2206 > kernel_clone+0x223/0x880 kernel/fork.c:2787 > kernel_thread+0x1bc/0x240 kernel/fork.c:2849 > create_kthread kernel/kthread.c:412 [inline] > kthreadd+0x60d/0x810 kernel/kthread.c:765 > ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 > > Freed by task 61: > kasan_save_stack mm/kasan/common.c:47 [inline] > kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 > kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579 > poison_slab_object mm/kasan/common.c:247 [inline] > __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264 > kasan_slab_free include/linux/kasan.h:230 [inline] > slab_free_hook mm/slub.c:2343 [inline] > slab_free mm/slub.c:4580 [inline] > kmem_cache_free+0x1a2/0x420 mm/slub.c:4682 > put_task_struct include/linux/sched/task.h:144 [inline] > delayed_put_task_struct+0x125/0x300 kernel/exit.c:228 > rcu_do_batch kernel/rcu/tree.c:2567 [inline] > rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823 > handle_softirqs+0x2c5/0x980 kernel/softirq.c:554 > __do_softirq kernel/softirq.c:588 [inline] > invoke_softirq kernel/softirq.c:428 [inline] > __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637 > irq_exit_rcu+0x9/0x30 kernel/softirq.c:649 > instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1037 [inline] > sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1037 > asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702 > > Last potentially related work creation: > kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47 > __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:541 > __call_rcu_common kernel/rcu/tree.c:3086 [inline] > call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190 > context_switch kernel/sched/core.c:5318 [inline] > __schedule+0x184b/0x4ae0 kernel/sched/core.c:6675 > schedule_idle+0x56/0x90 kernel/sched/core.c:6793 > do_idle+0x56a/0x5d0 kernel/sched/idle.c:354 > cpu_startup_entry+0x42/0x60 kernel/sched/idle.c:424 > start_secondary+0x102/0x110 arch/x86/kernel/smpboot.c:314 > common_startup_64+0x13e/0x147 > > The buggy address belongs to the object at ffff8880272a8000 > which belongs to the cache task_struct of size 7424 > The buggy address is located 2584 bytes inside of > freed 7424-byte region [ffff8880272a8000, ffff8880272a9d00) > > The buggy address belongs to the physical page: > page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x272a8 > head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 > flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff) > page_type: f5(slab) > raw: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000 > raw: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000 > head: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000 > head: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000 > head: 00fff00000000003 ffffea00009caa01 ffffffffffffffff 0000000000000000 > head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 > page dumped because: kasan: bad access detected > page_owner tracks the page as allocated > page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 2, tgid 2 (kthreadd), ts 71247381401, free_ts 71214998153 > set_page_owner include/linux/page_owner.h:32 [inline] > post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537 > prep_new_page mm/page_alloc.c:1545 [inline] > get_page_from_freelist+0x3039/0x3180 mm/page_alloc.c:3457 > __alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4733 > alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265 > alloc_slab_page+0x6a/0x120 mm/slub.c:2413 > allocate_slab+0x5a/0x2f0 mm/slub.c:2579 > new_slab mm/slub.c:2632 [inline] > ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3819 > __slab_alloc+0x58/0xa0 mm/slub.c:3909 > __slab_alloc_node mm/slub.c:3962 [inline] > slab_alloc_node mm/slub.c:4123 [inline] > kmem_cache_alloc_node_noprof+0x1fe/0x320 mm/slub.c:4187 > alloc_task_struct_node kernel/fork.c:180 [inline] > dup_task_struct+0x57/0x8c0 kernel/fork.c:1107 > copy_process+0x5d1/0x3d50 kernel/fork.c:2206 > kernel_clone+0x223/0x880 kernel/fork.c:2787 > kernel_thread+0x1bc/0x240 kernel/fork.c:2849 > create_kthread kernel/kthread.c:412 [inline] > kthreadd+0x60d/0x810 kernel/kthread.c:765 > ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 > page last free pid 5230 tgid 5230 stack trace: > reset_page_owner include/linux/page_owner.h:25 [inline] > free_pages_prepare mm/page_alloc.c:1108 [inline] > free_unref_page+0xcd0/0xf00 mm/page_alloc.c:2638 > discard_slab mm/slub.c:2678 [inline] > __put_partials+0xeb/0x130 mm/slub.c:3146 > put_cpu_partial+0x17c/0x250 mm/slub.c:3221 > __slab_free+0x2ea/0x3d0 mm/slub.c:4450 > qlink_free mm/kasan/quarantine.c:163 [inline] > qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179 > kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286 > __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329 > kasan_slab_alloc include/linux/kasan.h:247 [inline] > slab_post_alloc_hook mm/slub.c:4086 [inline] > slab_alloc_node mm/slub.c:4135 [inline] > kmem_cache_alloc_noprof+0x135/0x2a0 mm/slub.c:4142 > getname_flags+0xb7/0x540 fs/namei.c:139 > do_sys_openat2+0xd2/0x1d0 fs/open.c:1409 > do_sys_open fs/open.c:1430 [inline] > __do_sys_openat fs/open.c:1446 [inline] > __se_sys_openat fs/open.c:1441 [inline] > __x64_sys_openat+0x247/0x2a0 fs/open.c:1441 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > Memory state around the buggy address: > ffff8880272a8900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff8880272a8980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > >ffff8880272a8a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ^ > ffff8880272a8a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ffff8880272a8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > ================================================================== > > Link: https://lore.kernel.org/linux-btrfs/66fb36b1.050a0220.aab67.003b.GAE@google.com/ > Reported-by: syzbot+8aaf2df2ef0164ffe1fb@syzkaller.appspotmail.com > Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Thanks for the quick fix, Qu > --- > fs/btrfs/disk-io.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 25d768e67e37..1238a38c59b2 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -4255,6 +4255,17 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info) > /* clear out the rbtree of defraggable inodes */ > btrfs_cleanup_defrag_inodes(fs_info); > > + /* > + * Wait for any fixup workers to complete. > + * If we don't wait for them here and they are still running by the time > + * we call kthread_stop() against the cleaner kthread further below, we > + * get an use-after-free on the cleaner because the fixup worker adds an > + * inode to the list of delayed iputs and then attempts to wakeup the > + * cleaner kthread, which was already stopped and destroyed. We parked > + * already the cleaner, but below we run all pending delayed iputs. > + */ > + btrfs_flush_workqueue(fs_info->fixup_workers); > + > /* > * After we parked the cleaner kthread, ordered extents may have > * completed and created new delayed iputs. If one of the async reclaim
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
On Tue, Oct 01, 2024 at 11:21:35AM +0100, fdmanana@kernel.org wrote: > From: Filipe Manana <fdmanana@suse.com> > > During unmount, at close_ctree(), we have the following steps in this order: > > 1) Park the cleaner kthread - this doesn't destroy the kthread, it basically > halts its execution (wake ups against it work but do nothing); > > 2) We stop the cleaner kthread - this results in freeing the respective > struct task_struct; > > 3) We call btrfs_stop_all_workers() which waits for any jobs running in all > the work queues and then free the work queues. > > Syzbot reported a case where a fixup worker resulted in a crash when doing > a delayed iput on its inode while attempting to wake up the cleaner at > btrfs_add_delayed_iput(), because the task_struct of the cleaner kthread > was already freed. This can happen during unmount because we don't wait > for any fixup workers still running before we call kthread_stop() against > the cleaner kthread, which stops and free all its resources. What a lucky shot by syzbot, getting the fixup + delayed iput + umount lined up. Reviewed-by: David Sterba <dsterba@suse.com>
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 25d768e67e37..1238a38c59b2 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4255,6 +4255,17 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info) /* clear out the rbtree of defraggable inodes */ btrfs_cleanup_defrag_inodes(fs_info); + /* + * Wait for any fixup workers to complete. + * If we don't wait for them here and they are still running by the time + * we call kthread_stop() against the cleaner kthread further below, we + * get an use-after-free on the cleaner because the fixup worker adds an + * inode to the list of delayed iputs and then attempts to wakeup the + * cleaner kthread, which was already stopped and destroyed. We parked + * already the cleaner, but below we run all pending delayed iputs. + */ + btrfs_flush_workqueue(fs_info->fixup_workers); + /* * After we parked the cleaner kthread, ordered extents may have * completed and created new delayed iputs. If one of the async reclaim