Message ID | 20231108012821.56104-1-junxiao.bi@oracle.com (mailing list archive) |
---|---|
State | Handled Elsewhere, archived |
Series | [RFC] workqueue: allow system workqueue be used in memory reclaim |
Hello,

On Tue, Nov 07, 2023 at 05:28:21PM -0800, Junxiao Bi wrote:
> The following deadlock was triggered on Intel IMSM raid1 volumes.
>
> The sequence of events is this:
>
> 1. Memory reclaim was waiting for an xfs journal flush and got stuck
> behind md flush work.
>
> 2. The md flush work was queued onto the "md" workqueue but never
> executed: no kworker thread could be created, and the rescuer thread
> was already executing md flush work for another md disk, stuck because
> the "MD_SB_CHANGE_PENDING" flag was set.
>
> 3. That flag had been set by an md write process that needed the md
> superblock updated to change the in_sync status to 0. It used
> kernfs_notify() to ask the "mdmon" process to update the superblock,
> and then waited for the flag to be cleared.
>
> 4. But "mdmon" was never woken up, because kernfs_notify() depends on
> the system-wide workqueue "system_wq" to deliver the notification, and
> since that workqueue has no rescuer thread, the notification never
> happens.

Things like this can't be fixed by adding RECLAIM to system_wq because
system_wq is shared and someone else might occupy that rescuer thread. The
flag doesn't guarantee unlimited forward progress. It only guarantees
forward progress of one work item.

That seems to be where the problem is in #2 in the first place. If a work
item is required during memory reclaim, it must have guaranteed forward
progress, but it looks like that's waiting for someone else who can end up
waiting for userspace?

You'll need to untangle the dependencies earlier.

Thanks.
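For illustration, a minimal sketch of the single-rescuer guarantee described
above (hypothetical names such as demo_wq and first_work_fn; this is not code
from this patch or from the md driver): a WQ_MEM_RECLAIM workqueue only
guarantees forward progress of one work item at a time, so if the item the
rescuer is executing blocks on an external condition, everything queued
behind it on the same workqueue waits, which is the situation described in #2.

#include <linux/workqueue.h>
#include <linux/bitops.h>
#include <linux/delay.h>

static struct workqueue_struct *demo_wq;
static unsigned long demo_flags;	/* bit 0 plays the role of MD_SB_CHANGE_PENDING */

static void first_work_fn(struct work_struct *work)
{
	/*
	 * Blocks until some other agent (e.g. a userspace monitor) clears
	 * the flag.  Under memory pressure, when no new kworkers can be
	 * forked, the rescuer thread executing this item is pinned here,
	 * and it is the only thread guaranteed to make progress.
	 */
	while (test_bit(0, &demo_flags))
		msleep(100);
}

static void second_work_fn(struct work_struct *work)
{
	/* Not reached while the rescuer is stuck in first_work_fn(). */
}

static DECLARE_WORK(first_work, first_work_fn);
static DECLARE_WORK(second_work, second_work_fn);

static int demo_setup(void)
{
	/* WQ_MEM_RECLAIM creates exactly one rescuer thread for this queue. */
	demo_wq = alloc_workqueue("demo", WQ_MEM_RECLAIM, 0);
	if (!demo_wq)
		return -ENOMEM;

	set_bit(0, &demo_flags);
	queue_work(demo_wq, &first_work);
	queue_work(demo_wq, &second_work);	/* waits behind first_work */
	return 0;
}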
On 11/9/23 10:58 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Nov 07, 2023 at 05:28:21PM -0800, Junxiao Bi wrote:
>> The following deadlock was triggered on Intel IMSM raid1 volumes.
>>
>> The sequence of events is this:
>>
>> 1. Memory reclaim was waiting for an xfs journal flush and got stuck
>> behind md flush work.
>>
>> 2. The md flush work was queued onto the "md" workqueue but never
>> executed: no kworker thread could be created, and the rescuer thread
>> was already executing md flush work for another md disk, stuck because
>> the "MD_SB_CHANGE_PENDING" flag was set.
>>
>> 3. That flag had been set by an md write process that needed the md
>> superblock updated to change the in_sync status to 0. It used
>> kernfs_notify() to ask the "mdmon" process to update the superblock,
>> and then waited for the flag to be cleared.
>>
>> 4. But "mdmon" was never woken up, because kernfs_notify() depends on
>> the system-wide workqueue "system_wq" to deliver the notification, and
>> since that workqueue has no rescuer thread, the notification never
>> happens.
> Things like this can't be fixed by adding RECLAIM to system_wq because
> system_wq is shared and someone else might occupy that rescuer thread. The
> flag doesn't guarantee unlimited forward progress. It only guarantees
> forward progress of one work item.
>
> That seems to be where the problem is in #2 in the first place. If a work
> item is required during memory reclaim, it must have guaranteed forward
> progress, but it looks like that's waiting for someone else who can end up
> waiting for userspace?
>
> You'll need to untangle the dependencies earlier.

Makes sense. Thanks a lot for the comments.

>
> Thanks.
>
Hello,

kernel test robot noticed "WARNING:possible_circular_locking_dependency_detected" on:

commit: c8c183493c1dcc874a9d903cb6ba685c98f6c12a ("[RFC] workqueue: allow system workqueue be used in memory reclaim")
url: https://github.com/intel-lab-lkp/linux/commits/Junxiao-Bi/workqueue-allow-system-workqueue-be-used-in-memory-reclaim/20231108-093107
base: https://git.kernel.org/cgit/linux/kernel/git/tj/wq.git for-next
patch link: https://lore.kernel.org/all/20231108012821.56104-1-junxiao.bi@oracle.com/
patch subject: [RFC] workqueue: allow system workqueue be used in memory reclaim

in testcase: boot

compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)

If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202311161556.59af3ec9-oliver.sang@intel.com

[ 6.524239][ T9] WARNING: possible circular locking dependency detected
[ 6.524787][ T9] 6.6.0-rc6-00056-gc8c183493c1d #1 Not tainted
[ 6.525271][ T9] ------------------------------------------------------
[ 6.525606][ T9] kworker/0:1/9 is trying to acquire lock:
[ 6.525606][ T9] ffffffff88f6f480 (cpu_hotplug_lock){++++}-{0:0}, at: vmstat_shepherd (include/linux/find.h:63 mm/vmstat.c:2025)
[ 6.525606][ T9]
[ 6.525606][ T9] but task is already holding lock:
[ 6.525606][ T9] ffff888110aa7d88 ((shepherd).work){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:2606)
[ 6.525606][ T9]
[ 6.525606][ T9] which lock already depends on the new lock.
[ 6.525606][ T9]
[ 6.525606][ T9] the existing dependency chain (in reverse order) is:
[ 6.525606][ T9]
[ 6.525606][ T9] -> #2 ((shepherd).work){+.+.}-{0:0}:
[ 6.525606][ T9]        __lock_acquire (kernel/locking/lockdep.c:5136)
[ 6.525606][ T9]        lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755)
[ 6.525606][ T9]        process_one_work (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:444 include/linux/jump_label.h:260 include/linux/jump_label.h:270 include/trace/events/workqueue.h:82 kernel/workqueue.c:2629)
[ 6.525606][ T9]        worker_thread (kernel/workqueue.c:2697 kernel/workqueue.c:2784)
[ 6.525606][ T9]        kthread (kernel/kthread.c:388)
[ 6.525606][ T9]        ret_from_fork (arch/x86/kernel/process.c:153)
[ 6.525606][ T9]        ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
[ 6.525606][ T9]
[ 6.525606][ T9] -> #1 ((wq_completion)events){+.+.}-{0:0}:
[ 6.525606][ T9]        __lock_acquire (kernel/locking/lockdep.c:5136)
[ 6.525606][ T9]        lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755)
[ 6.525606][ T9]        start_flush_work (kernel/workqueue.c:3383)
[ 6.525606][ T9]        __flush_work (kernel/workqueue.c:3406)
[ 6.525606][ T9]        schedule_on_each_cpu (kernel/workqueue.c:3668 (discriminator 3))
[ 6.525606][ T9]        rcu_tasks_one_gp (kernel/rcu/rcu.h:109 kernel/rcu/tasks.h:587)
[ 6.525606][ T9]        rcu_tasks_kthread (kernel/rcu/tasks.h:625 (discriminator 1))
[ 6.525606][ T9]        kthread (kernel/kthread.c:388)
[ 6.525606][ T9]        ret_from_fork (arch/x86/kernel/process.c:153)
[ 6.525606][ T9]        ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
[ 6.525606][ T9]
[ 6.525606][ T9] -> #0 (cpu_hotplug_lock){++++}-{0:0}:
[ 6.525606][ T9]        check_prev_add (kernel/locking/lockdep.c:3135)
[ 6.525606][ T9]        validate_chain (kernel/locking/lockdep.c:3254 kernel/locking/lockdep.c:3868)
[ 6.525606][ T9]        __lock_acquire (kernel/locking/lockdep.c:5136)
[ 6.525606][ T9]        lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755)
[ 6.525606][ T9]        cpus_read_lock (include/linux/percpu-rwsem.h:53 kernel/cpu.c:489)
[ 6.525606][ T9]        vmstat_shepherd (include/linux/find.h:63 mm/vmstat.c:2025)
[ 6.525606][ T9]        process_one_work (kernel/workqueue.c:2635)
[ 6.525606][ T9]        worker_thread (kernel/workqueue.c:2697 kernel/workqueue.c:2784)
[ 6.525606][ T9]        kthread (kernel/kthread.c:388)
[ 6.525606][ T9]        ret_from_fork (arch/x86/kernel/process.c:153)
[ 6.525606][ T9]        ret_from_fork_asm (arch/x86/entry/entry_64.S:312)
[ 6.525606][ T9]
[ 6.525606][ T9] other info that might help us debug this:
[ 6.525606][ T9]
[ 6.525606][ T9] Chain exists of:
[ 6.525606][ T9]   cpu_hotplug_lock --> (wq_completion)events --> (shepherd).work
[ 6.525606][ T9]
[ 6.525606][ T9] Possible unsafe locking scenario:
[ 6.525606][ T9]
[ 6.525606][ T9]        CPU0                    CPU1
[ 6.525606][ T9]        ----                    ----
[ 6.525606][ T9]   lock((shepherd).work);
[ 6.525606][ T9]                               lock((wq_completion)events);
[ 6.525606][ T9]                               lock((shepherd).work);
[ 6.525606][ T9]   rlock(cpu_hotplug_lock);
[ 6.525606][ T9]
[ 6.525606][ T9]  *** DEADLOCK ***
[ 6.525606][ T9]
[ 6.525606][ T9] 2 locks held by kworker/0:1/9:
[ 6.525606][ T9]  #0: ffff88810007cd48 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:2603)
[ 6.525606][ T9]  #1: ffff888110aa7d88 ((shepherd).work){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:2606)
[ 6.525606][ T9]
[ 6.525606][ T9] stack backtrace:
[ 6.525606][ T9] CPU: 0 PID: 9 Comm: kworker/0:1 Not tainted 6.6.0-rc6-00056-gc8c183493c1d #1
[ 6.525606][ T9] Workqueue: events vmstat_shepherd
[ 6.525606][ T9] Call Trace:
[ 6.525606][ T9]  <TASK>
[ 6.525606][ T9] dump_stack_lvl (lib/dump_stack.c:107)
[ 6.525606][ T9] check_noncircular (kernel/locking/lockdep.c:2187)
[ 6.525606][ T9] ? print_circular_bug (kernel/locking/lockdep.c:2163)
[ 6.525606][ T9] ? stack_trace_save (kernel/stacktrace.c:123)
[ 6.525606][ T9] ? stack_trace_snprint (kernel/stacktrace.c:114)
[ 6.525606][ T9] check_prev_add (kernel/locking/lockdep.c:3135)
[ 6.525606][ T9] validate_chain (kernel/locking/lockdep.c:3254 kernel/locking/lockdep.c:3868)
[ 6.525606][ T9] ? check_prev_add (kernel/locking/lockdep.c:3824)
[ 6.525606][ T9] ? hlock_class (arch/x86/include/asm/bitops.h:228 arch/x86/include/asm/bitops.h:240 include/asm-generic/bitops/instrumented-non-atomic.h:142 kernel/locking/lockdep.c:228)
[ 6.525606][ T9] ? mark_lock (kernel/locking/lockdep.c:4655 (discriminator 3))
[ 6.525606][ T9] __lock_acquire (kernel/locking/lockdep.c:5136)
[ 6.525606][ T9] lock_acquire (kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5755)
[ 6.525606][ T9] ? vmstat_shepherd (include/linux/find.h:63 mm/vmstat.c:2025)
[ 6.525606][ T9] ? lock_sync (kernel/locking/lockdep.c:5721)
[ 6.525606][ T9] ? debug_object_active_state (lib/debugobjects.c:772)
[ 6.525606][ T9] ? __cant_migrate (kernel/sched/core.c:10142)
[ 6.525606][ T9] cpus_read_lock (include/linux/percpu-rwsem.h:53 kernel/cpu.c:489)
[ 6.525606][ T9] ? vmstat_shepherd (include/linux/find.h:63 mm/vmstat.c:2025)
[ 6.525606][ T9] vmstat_shepherd (include/linux/find.h:63 mm/vmstat.c:2025)
[ 6.525606][ T9] process_one_work (kernel/workqueue.c:2635)
[ 6.525606][ T9] ? worker_thread (kernel/workqueue.c:2740)
[ 6.525606][ T9] ? show_pwq (kernel/workqueue.c:2539)
[ 6.525606][ T9] ? assign_work (kernel/workqueue.c:1096)
[ 6.525606][ T9] worker_thread (kernel/workqueue.c:2697 kernel/workqueue.c:2784)
[ 6.525606][ T9] ? __kthread_parkme (kernel/kthread.c:293 (discriminator 3))
[ 6.525606][ T9] ? schedule (arch/x86/include/asm/bitops.h:207 (discriminator 1) arch/x86/include/asm/bitops.h:239 (discriminator 1) include/linux/thread_info.h:184 (discriminator 1) include/linux/sched.h:2255 (discriminator 1) kernel/sched/core.c:6773 (discriminator 1))
[ 6.525606][ T9] ? process_one_work (kernel/workqueue.c:2730)
[ 6.525606][ T9] kthread (kernel/kthread.c:388)
[ 6.525606][ T9] ? _raw_spin_unlock_irq (arch/x86/include/asm/irqflags.h:42 arch/x86/include/asm/irqflags.h:77 include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:202)

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231116/202311161556.59af3ec9-oliver.sang@intel.com
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6e578f576a6f..e3338e3be700 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6597,7 +6597,7 @@ void __init workqueue_init_early(void)
 		ordered_wq_attrs[i] = attrs;
 	}
 
-	system_wq = alloc_workqueue("events", 0, 0);
+	system_wq = alloc_workqueue("events", WQ_MEM_RECLAIM, 0);
 	system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
 	system_long_wq = alloc_workqueue("events_long", 0, 0);
 	system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
The following deadlock was triggered on Intel IMSM raid1 volumes.

The sequence of events is this:

1. Memory reclaim was waiting for an xfs journal flush and got stuck
behind md flush work.

2. The md flush work was queued onto the "md" workqueue but never
executed: no kworker thread could be created, and the rescuer thread was
already executing md flush work for another md disk, stuck because the
"MD_SB_CHANGE_PENDING" flag was set.

3. That flag had been set by an md write process that needed the md
superblock updated to change the in_sync status to 0. It used
kernfs_notify() to ask the "mdmon" process to update the superblock, and
then waited for the flag to be cleared.

4. But "mdmon" was never woken up, because kernfs_notify() depends on the
system-wide workqueue "system_wq" to deliver the notification, and since
that workqueue has no rescuer thread, the notification never happens (see
the sketch after the diffstat below).

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
---
 kernel/workqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
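As a reference for step 4 above, a simplified sketch of the dependency on
system_wq (the work function below is hypothetical; only the queueing path
is the point): kernfs_notify() defers the actual wakeup to a work item
queued with schedule_work(), and schedule_work() is a thin wrapper around
queue_work(system_wq, ...), so the notification inherits system_wq's lack
of a rescuer and cannot be guaranteed to run under memory pressure.

#include <linux/workqueue.h>

/*
 * Hypothetical work function standing in for the deferred notification;
 * the real kernfs code is more involved, this only shows the queueing path.
 */
static void notify_workfn(struct work_struct *work)
{
	/* wake up the waiting userspace agent, e.g. via fsnotify/poll */
}

static DECLARE_WORK(notify_work, notify_workfn);

static void trigger_notify(void)
{
	/*
	 * schedule_work() queues onto the shared system_wq, which is
	 * created without WQ_MEM_RECLAIM: if no kworker can be forked,
	 * there is no rescuer to execute this item.
	 */
	schedule_work(&notify_work);
}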