Message ID | 20240817093334.6062-2-chenridong@huawei.com (mailing list archive) |
---|---|
State | Handled Elsewhere |
Delegated to: | BPF |
Series | Fix deadlock caused by cgroup_mutex and cpu_hotplug_lock |
On 2024/8/17 17:33, Chen Ridong wrote: > We found a hung_task problem as shown below: > > INFO: task kworker/0:0:8 blocked for more than 327 seconds. > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:kworker/0:0 state:D stack:13920 pid:8 ppid:2 flags:0x00004000 > Workqueue: events cgroup_bpf_release > Call Trace: > <TASK> > __schedule+0x5a2/0x2050 > ? find_held_lock+0x33/0x100 > ? wq_worker_sleeping+0x9e/0xe0 > schedule+0x9f/0x180 > schedule_preempt_disabled+0x25/0x50 > __mutex_lock+0x512/0x740 > ? cgroup_bpf_release+0x1e/0x4d0 > ? cgroup_bpf_release+0xcf/0x4d0 > ? process_scheduled_works+0x161/0x8a0 > ? cgroup_bpf_release+0x1e/0x4d0 > ? mutex_lock_nested+0x2b/0x40 > ? __pfx_delay_tsc+0x10/0x10 > mutex_lock_nested+0x2b/0x40 > cgroup_bpf_release+0xcf/0x4d0 > ? process_scheduled_works+0x161/0x8a0 > ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0 > ? process_scheduled_works+0x161/0x8a0 > process_scheduled_works+0x23a/0x8a0 > worker_thread+0x231/0x5b0 > ? __pfx_worker_thread+0x10/0x10 > kthread+0x14d/0x1c0 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x59/0x70 > ? __pfx_kthread+0x10/0x10 > ret_from_fork_asm+0x1b/0x30 > </TASK> > > This issue can be reproduced by the following pressuse test: > 1. A large number of cpuset cgroups are deleted. > 2. Set cpu on and off repeatly. > 3. Set watchdog_thresh repeatly. > The scripts can be obtained at LINK mentioned above the signature. > > The reason for this issue is cgroup_mutex and cpu_hotplug_lock are > acquired in different tasks, which may lead to deadlock. > It can lead to a deadlock through the following steps: > 1. A large number of cpusets are deleted asynchronously, which puts a > large number of cgroup_bpf_release works into system_wq. The max_active > of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are > cgroup_bpf_release works, and many cgroup_bpf_release works will be put > into inactive queue. As illustrated in the diagram, there are 256 (in > the acvtive queue) + n (in the inactive queue) works. > 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put > smp_call_on_cpu work into system_wq. However step 1 has already filled > system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has > to wait until the works that were put into the inacvtive queue earlier > have executed (n cgroup_bpf_release), so it will be blocked for a while. > 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2. > 4. Cpusets that were deleted at step 1 put cgroup_release works into > cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. > When cgroup_metux is acqured by work at css_killed_work_fn, it will > call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. > However, cpuset_css_offline will be blocked for step 3. > 5. At this moment, there are 256 works in active queue that are > cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as > a result, all of them are blocked. Consequently, sscs.work can not be > executed. Ultimately, this situation leads to four processes being > blocked, forming a deadlock. > > system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4) > ... > 2000+ cgroups deleted asyn > 256 actives + n inactives > __lockup_detector_reconfigure > P(cpu_hotplug_lock.read) > put sscs.work into system_wq > 256 + n + 1(sscs.work) > sscs.work wait to be executed > warting sscs.work finish > percpu_down_write > P(cpu_hotplug_lock.write) > ...blocking... 
> css_killed_work_fn > P(cgroup_mutex) > cpuset_css_offline > P(cpu_hotplug_lock.read) > ...blocking... > 256 cgroup_bpf_release > mutex_lock(&cgroup_mutex); > ..blocking... > > To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq, > which can break the loop and solve the problem. System wqs are for misc > things which shouldn't create a large number of concurrent work items. > If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work > items, it should use its own dedicated workqueue. > > Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") > Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t > Signed-off-by: Chen Ridong <chenridong@huawei.com> > --- > kernel/bpf/cgroup.c | 2 +- > kernel/cgroup/cgroup-internal.h | 1 + > kernel/cgroup/cgroup.c | 2 +- > 3 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > index 8ba73042a239..a611a1274788 100644 > --- a/kernel/bpf/cgroup.c > +++ b/kernel/bpf/cgroup.c > @@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref) > struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt); > > INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release); > - queue_work(system_wq, &cgrp->bpf.release_work); > + queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work); > } > > /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through > diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h > index c964dd7ff967..17ac19bc8106 100644 > --- a/kernel/cgroup/cgroup-internal.h > +++ b/kernel/cgroup/cgroup-internal.h > @@ -13,6 +13,7 @@ > extern spinlock_t trace_cgroup_path_lock; > extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN]; > extern void __init enable_debug_cgroup(void); > +extern struct workqueue_struct *cgroup_destroy_wq; > > /* > * cgroup_path() takes a spin lock. It is good practice not to take > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c > index 75058fbf4450..77fa9ed69c86 100644 > --- a/kernel/cgroup/cgroup.c > +++ b/kernel/cgroup/cgroup.c > @@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem); > * destruction work items don't end up filling up max_active of system_wq > * which may lead to deadlock. > */ > -static struct workqueue_struct *cgroup_destroy_wq; > +struct workqueue_struct *cgroup_destroy_wq; > > /* generate an array of cgroup subsystem pointers */ > #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys, Ping. Hi,TJ, Roman and Michal, I have updated commit message, I think it can be much clearer now, can you review it again? Thanks, Ridong
On 2024/8/22 8:57, Chen Ridong wrote: > > > On 2024/8/17 17:33, Chen Ridong wrote: >> We found a hung_task problem as shown below: >> >> INFO: task kworker/0:0:8 blocked for more than 327 seconds. >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> task:kworker/0:0 state:D stack:13920 pid:8 ppid:2 >> flags:0x00004000 >> Workqueue: events cgroup_bpf_release >> Call Trace: >> <TASK> >> __schedule+0x5a2/0x2050 >> ? find_held_lock+0x33/0x100 >> ? wq_worker_sleeping+0x9e/0xe0 >> schedule+0x9f/0x180 >> schedule_preempt_disabled+0x25/0x50 >> __mutex_lock+0x512/0x740 >> ? cgroup_bpf_release+0x1e/0x4d0 >> ? cgroup_bpf_release+0xcf/0x4d0 >> ? process_scheduled_works+0x161/0x8a0 >> ? cgroup_bpf_release+0x1e/0x4d0 >> ? mutex_lock_nested+0x2b/0x40 >> ? __pfx_delay_tsc+0x10/0x10 >> mutex_lock_nested+0x2b/0x40 >> cgroup_bpf_release+0xcf/0x4d0 >> ? process_scheduled_works+0x161/0x8a0 >> ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0 >> ? process_scheduled_works+0x161/0x8a0 >> process_scheduled_works+0x23a/0x8a0 >> worker_thread+0x231/0x5b0 >> ? __pfx_worker_thread+0x10/0x10 >> kthread+0x14d/0x1c0 >> ? __pfx_kthread+0x10/0x10 >> ret_from_fork+0x59/0x70 >> ? __pfx_kthread+0x10/0x10 >> ret_from_fork_asm+0x1b/0x30 >> </TASK> >> >> This issue can be reproduced by the following pressuse test: >> 1. A large number of cpuset cgroups are deleted. >> 2. Set cpu on and off repeatly. >> 3. Set watchdog_thresh repeatly. >> The scripts can be obtained at LINK mentioned above the signature. >> >> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are >> acquired in different tasks, which may lead to deadlock. >> It can lead to a deadlock through the following steps: >> 1. A large number of cpusets are deleted asynchronously, which puts a >> large number of cgroup_bpf_release works into system_wq. The >> max_active >> of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works >> are >> cgroup_bpf_release works, and many cgroup_bpf_release works will >> be put >> into inactive queue. As illustrated in the diagram, there are 256 (in >> the acvtive queue) + n (in the inactive queue) works. >> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put >> smp_call_on_cpu work into system_wq. However step 1 has already >> filled >> system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has >> to wait until the works that were put into the inacvtive queue >> earlier >> have executed (n cgroup_bpf_release), so it will be blocked for a >> while. >> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by >> step 2. >> 4. Cpusets that were deleted at step 1 put cgroup_release works into >> cgroup_destroy_wq. They are competing to get cgroup_mutex all the >> time. >> When cgroup_metux is acqured by work at css_killed_work_fn, it will >> call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. >> However, cpuset_css_offline will be blocked for step 3. >> 5. At this moment, there are 256 works in active queue that are >> cgroup_bpf_release, they are attempting to acquire cgroup_mutex, >> and as >> a result, all of them are blocked. Consequently, sscs.work can not be >> executed. Ultimately, this situation leads to four processes being >> blocked, forming a deadlock. >> >> system_wq(step1) WatchDog(step2) cpu >> offline(step3) cgroup_destroy_wq(step4) >> ... 
>> 2000+ cgroups deleted asyn >> 256 actives + n inactives >> __lockup_detector_reconfigure >> P(cpu_hotplug_lock.read) >> put sscs.work into system_wq >> 256 + n + 1(sscs.work) >> sscs.work wait to be executed >> warting sscs.work finish >> percpu_down_write >> P(cpu_hotplug_lock.write) >> ...blocking... >> css_killed_work_fn >> P(cgroup_mutex) >> cpuset_css_offline >> P(cpu_hotplug_lock.read) >> ...blocking... >> 256 cgroup_bpf_release >> mutex_lock(&cgroup_mutex); >> ..blocking... >> >> To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq, >> which can break the loop and solve the problem. System wqs are for misc >> things which shouldn't create a large number of concurrent work items. >> If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work >> items, it should use its own dedicated workqueue. >> >> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from >> cgroup itself") >> Link: >> https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t >> Signed-off-by: Chen Ridong <chenridong@huawei.com> >> --- >> kernel/bpf/cgroup.c | 2 +- >> kernel/cgroup/cgroup-internal.h | 1 + >> kernel/cgroup/cgroup.c | 2 +- >> 3 files changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c >> index 8ba73042a239..a611a1274788 100644 >> --- a/kernel/bpf/cgroup.c >> +++ b/kernel/bpf/cgroup.c >> @@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct >> percpu_ref *ref) >> struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt); >> INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release); >> - queue_work(system_wq, &cgrp->bpf.release_work); >> + queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work); >> } >> /* Get underlying bpf_prog of bpf_prog_list entry, regardless if >> it's through >> diff --git a/kernel/cgroup/cgroup-internal.h >> b/kernel/cgroup/cgroup-internal.h >> index c964dd7ff967..17ac19bc8106 100644 >> --- a/kernel/cgroup/cgroup-internal.h >> +++ b/kernel/cgroup/cgroup-internal.h >> @@ -13,6 +13,7 @@ >> extern spinlock_t trace_cgroup_path_lock; >> extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN]; >> extern void __init enable_debug_cgroup(void); >> +extern struct workqueue_struct *cgroup_destroy_wq; >> /* >> * cgroup_path() takes a spin lock. It is good practice not to take >> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c >> index 75058fbf4450..77fa9ed69c86 100644 >> --- a/kernel/cgroup/cgroup.c >> +++ b/kernel/cgroup/cgroup.c >> @@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem); >> * destruction work items don't end up filling up max_active of >> system_wq >> * which may lead to deadlock. >> */ >> -static struct workqueue_struct *cgroup_destroy_wq; >> +struct workqueue_struct *cgroup_destroy_wq; >> /* generate an array of cgroup subsystem pointers */ >> #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys, > > Ping. > Hi,TJ, Roman and Michal, I have updated commit message, I think it can > be much clearer now, can you review it again? > > Thanks, > Ridong > Friendly ping.
On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote:
> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
> acquired in different tasks, which may lead to deadlock.
> It can lead to a deadlock through the following steps:
> 1. A large number of cpusets are deleted asynchronously, which puts a
>    large number of cgroup_bpf_release works into system_wq. The max_active
>    of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
>    cgroup_bpf_release works, and many cgroup_bpf_release works will be put
>    into inactive queue. As illustrated in the diagram, there are 256 (in
>    the acvtive queue) + n (in the inactive queue) works.
> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
>    smp_call_on_cpu work into system_wq. However step 1 has already filled
>    system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has
>    to wait until the works that were put into the inacvtive queue earlier
>    have executed (n cgroup_bpf_release), so it will be blocked for a while.
> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
> 4. Cpusets that were deleted at step 1 put cgroup_release works into
>    cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
>    When cgroup_metux is acqured by work at css_killed_work_fn, it will
>    call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read.
>    However, cpuset_css_offline will be blocked for step 3.
> 5. At this moment, there are 256 works in active queue that are
>    cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
>    a result, all of them are blocked. Consequently, sscs.work can not be
>    executed. Ultimately, this situation leads to four processes being
>    blocked, forming a deadlock.
>
> system_wq(step1)          WatchDog(step2)               cpu offline(step3)        cgroup_destroy_wq(step4)
> ...
> 2000+ cgroups deleted asyn
> 256 actives + n inactives
>                           __lockup_detector_reconfigure
>                           P(cpu_hotplug_lock.read)
>                           put sscs.work into system_wq
> 256 + n + 1(sscs.work)
>                           sscs.work wait to be executed
>                           warting sscs.work finish
>                                                         percpu_down_write
>                                                         P(cpu_hotplug_lock.write)
>                                                         ...blocking...
>                                                                                   css_killed_work_fn
>                                                                                   P(cgroup_mutex)
>                                                                                   cpuset_css_offline
>                                                                                   P(cpu_hotplug_lock.read)
>                                                                                   ...blocking...
> 256 cgroup_bpf_release
> mutex_lock(&cgroup_mutex);
> ..blocking...

Thanks, Ridong, for laying this out.
Let me try to extract the core of the deps above.

The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock.
However, the smp_call_on_cpu() under cpus_read_lock may lead to
a deadlock (ABBA over those two locks).

This is OK

	thread T				system_wq worker

	lock(cgroup_mutex)  (II)
	...
	unlock(cgroup_mutex)
	down(cpu_hotplug_lock.read)
	smp_call_on_cpu
	  queue_work_on(cpu, system_wq, scss)  (I)
						scss.func
	wait_for_completion(scss)
	up(cpu_hotplug_lock.read)

However, there is no ordering between (I) and (II) so they can also happen
in the opposite order

	thread T				system_wq worker

	down(cpu_hotplug_lock.read)
	smp_call_on_cpu
	  queue_work_on(cpu, system_wq, scss)  (I)
						lock(cgroup_mutex)  (II)
						...
						unlock(cgroup_mutex)
						scss.func
	wait_for_completion(scss)
	up(cpu_hotplug_lock.read)

And here the thread T + system_wq worker effectively call
cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're
two threads, it won't be caught by lockdep.)

By that reasoning any holder of cgroup_mutex on system_wq makes the system
susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
writers + cpuset operations). And the two work items must meet in the same
worker's processing, hence the probability is low (zero?) with less than
WQ_DFL_ACTIVE items.

(And more generally, any lock that is ordered before cpu_hotplug_lock
should not be taken in system_wq work functions. Or at least such work
items should not saturate WQ_DFL_ACTIVE workers.)

Wrt other uses of cgroup_mutex, I only see
	bpf_map_free_in_work
	  queue_work(system_unbound_wq)
	bpf_map_free_deferred
	  ops->map_free == cgroup_storage_map_free
	    cgroup_lock()
which is safe since it uses a different workqueue than system_wq.

> To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq,
> which can break the loop and solve the problem.

Yes, it moves the problematic cgroup_mutex holder away from system_wq,
and cgroup_destroy_wq could not cause similar problems because there are
no explicit waiters for particular work items or their flushing.

> System wqs are for misc things which shouldn't create a large number
> of concurrent work items. If something is going to generate
> >WQ_DFL_ACTIVE(256) concurrent work items, it should use its own
> dedicated workqueue.

Actually, I'm not sure (because I lack workqueue knowledge) if producing
less than WQ_DFL_ACTIVE work items completely eliminates the chance of
two offending work items producing the wrong lock ordering.

> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")

I'm now indifferent whether this is needed (perhaps in the sense it is
the _latest_ of multiple changes that contributed to the possibility of this
deadlock scenario).

> Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
> Signed-off-by: Chen Ridong <chenridong@huawei.com>
> ---
>  kernel/bpf/cgroup.c             | 2 +-
>  kernel/cgroup/cgroup-internal.h | 1 +
>  kernel/cgroup/cgroup.c          | 2 +-
>  3 files changed, 3 insertions(+), 2 deletions(-)

I have convinced myself now that you can put

Reviewed-by: Michal Koutný <mkoutny@suse.com>

Regards,
Michal
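The ABBA ordering Michal describes can be made concrete with a small, self-contained C sketch. Everything prefixed demo_ below is hypothetical and only mirrors the shapes involved: side A is a work item on system_wq that takes a mutex standing in for cgroup_mutex; side B holds cpu_hotplug_lock.read and waits on smp_call_on_cpu(), whose sscs.work lands on the same system_wq.

#include <linux/cpu.h>
#include <linux/mutex.h>
#include <linux/smp.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(demo_mutex);		/* stands in for cgroup_mutex */
static struct work_struct demo_release_work;	/* stands in for bpf.release_work */

/* Side A: hundreds of these items can saturate system_wq's max_active. */
static void demo_release_fn(struct work_struct *work)
{
	mutex_lock(&demo_mutex);	/* sleeps while another path holds it */
	/* ... release per-cgroup resources ... */
	mutex_unlock(&demo_mutex);
}

static void demo_queue_release(void)
{
	INIT_WORK(&demo_release_work, demo_release_fn);
	queue_work(system_wq, &demo_release_work);	/* the pre-fix behaviour */
}

static int demo_nop_on_cpu(void *arg)
{
	return 0;
}

/* Side B: mirrors __lockup_detector_reconfigure() taking cpus_read_lock(). */
static int demo_reconfigure(void)
{
	int ret;

	cpus_read_lock();	/* cpu_hotplug_lock.read */
	/*
	 * smp_call_on_cpu() queues its completion work (sscs.work) on
	 * system_wq and waits for it.  If system_wq is already saturated by
	 * demo_release_fn() items stuck on demo_mutex, this wait sits behind
	 * them in the inactive queue.
	 */
	ret = smp_call_on_cpu(0, demo_nop_on_cpu, NULL, false);
	cpus_read_unlock();
	return ret;
}

Lockdep sees nothing wrong with either side in isolation, which matches Michal's note that the inversion spans two threads.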
On 2024/9/9 22:19, Michal Koutný wrote: > On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote: >> The reason for this issue is cgroup_mutex and cpu_hotplug_lock are >> acquired in different tasks, which may lead to deadlock. >> It can lead to a deadlock through the following steps: >> 1. A large number of cpusets are deleted asynchronously, which puts a >> large number of cgroup_bpf_release works into system_wq. The max_active >> of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are >> cgroup_bpf_release works, and many cgroup_bpf_release works will be put >> into inactive queue. As illustrated in the diagram, there are 256 (in >> the acvtive queue) + n (in the inactive queue) works. >> 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put >> smp_call_on_cpu work into system_wq. However step 1 has already filled >> system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has >> to wait until the works that were put into the inacvtive queue earlier >> have executed (n cgroup_bpf_release), so it will be blocked for a while. >> 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2. >> 4. Cpusets that were deleted at step 1 put cgroup_release works into >> cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. >> When cgroup_metux is acqured by work at css_killed_work_fn, it will >> call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. >> However, cpuset_css_offline will be blocked for step 3. >> 5. At this moment, there are 256 works in active queue that are >> cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as >> a result, all of them are blocked. Consequently, sscs.work can not be >> executed. Ultimately, this situation leads to four processes being >> blocked, forming a deadlock. >> >> system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4) >> ... >> 2000+ cgroups deleted asyn >> 256 actives + n inactives >> __lockup_detector_reconfigure >> P(cpu_hotplug_lock.read) >> put sscs.work into system_wq >> 256 + n + 1(sscs.work) >> sscs.work wait to be executed >> warting sscs.work finish >> percpu_down_write >> P(cpu_hotplug_lock.write) >> ...blocking... >> css_killed_work_fn >> P(cgroup_mutex) >> cpuset_css_offline >> P(cpu_hotplug_lock.read) >> ...blocking... >> 256 cgroup_bpf_release >> mutex_lock(&cgroup_mutex); >> ..blocking... > > Thanks, Ridong, for laying this out. > Let me try to extract the core of the deps above. > > The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock. > However, the smp_call_on_cpu() under cpus_read_lock may lead to > a deadlock (ABBA over those two locks). > That's right. > This is OK > thread T system_wq worker > > lock(cgroup_mutex) (II) > ... > unlock(cgroup_mutex) > down(cpu_hotplug_lock.read) > smp_call_on_cpu > queue_work_on(cpu, system_wq, scss) (I) > scss.func > wait_for_completion(scss) > up(cpu_hotplug_lock.read) > > However, there is no ordering between (I) and (II) so they can also happen > in opposite > > thread T system_wq worker > > down(cpu_hotplug_lock.read) > smp_call_on_cpu > queue_work_on(cpu, system_wq, scss) (I) > lock(cgroup_mutex) (II) > ... > unlock(cgroup_mutex) > scss.func > wait_for_completion(scss) > up(cpu_hotplug_lock.read) > > And here the thread T + system_wq worker effectively call > cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're > two threads, it won't be caught by lockdep.) 
> > By that reasoning any holder of cgroup_mutex on system_wq makes system > susceptible to a deadlock (in presence of cpu_hotplug_lock waiting > writers + cpuset operations). And the two work items must meet in same > worker's processing hence probability is low (zero?) with less than > WQ_DFL_ACTIVE items. > > (And more generally, any lock that is ordered before cpu_hotplug_lock > should not be taken in system_wq work functions. Or at least such works > items should not saturate WQ_DFL_ACTIVE workers.) > > Wrt other uses of cgroup_mutex, I only see > bpf_map_free_in_work > queue_work(system_unbound_wq) > bpf_map_free_deferred > ops->map_free == cgroup_storage_map_free > cgroup_lock() > which is safe since it uses a different workqueue than system_wq. > >> To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq, >> which can break the loop and solve the problem. > > Yes, it moves the problematic cgroup_mutex holder away from system_wq > and cgroup_destroy_wq could not cause similar problems because there are > no explicit waiter for particular work items or its flushing. > > >> System wqs are for misc things which shouldn't create a large number >> of concurrent work items. If something is going to generate >>> WQ_DFL_ACTIVE(256) concurrent work >> items, it should use its own dedicated workqueue. > > Actually, I'm not sure (because I lack workqueue knowledge) if producing > less than WQ_DFL_ACTIVE work items completely eliminates the chance of > two offending work items producing the wrong lock ordering. > If producing less than WQ_DFL_ACTIVE work items, it won't lead to a deadlock. Because scss.func can be executed and doesn't have to wait for work that holds cgroup_mutex to be completed. Therefore, the probability is low and this issue can only be reproduced under pressure test. > >> Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself") > > I'm now indifferent whether this is needed (perhaps in the sense it is > the _latest_ of multiple changes that contributed to possibility of this > deadlock scenario). > > >> Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t >> Signed-off-by: Chen Ridong <chenridong@huawei.com> >> --- >> kernel/bpf/cgroup.c | 2 +- >> kernel/cgroup/cgroup-internal.h | 1 + >> kernel/cgroup/cgroup.c | 2 +- >> 3 files changed, 3 insertions(+), 2 deletions(-) > > I have convinved myself now that you can put > > Reviewed-by: Michal Koutný <mkoutny@suse.com> > > Regards, > Michal Thank you very much. Best Regards, Ridong
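To see why "fewer than WQ_DFL_ACTIVE items" is enough for scss.func to run, it helps to look at the shape of smp_call_on_cpu() itself. The sketch below is a condensed paraphrase with hypothetical sscs_like_* names, not verbatim kernel source (see kernel/smp.c for the real thing): sscs.work is an ordinary work item queued on system_wq, so it starts immediately while free active slots exist and is parked on the inactive queue once they are exhausted.

#include <linux/completion.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

struct sscs_like {
	struct work_struct	work;
	struct completion	done;
	int			(*func)(void *);
	void			*data;
	int			ret;
};

static void sscs_like_callback(struct work_struct *work)
{
	struct sscs_like *sscs = container_of(work, struct sscs_like, work);

	sscs->ret = sscs->func(sscs->data);
	complete(&sscs->done);		/* wakes the waiter below */
}

static int sscs_like_call_on_cpu(int cpu, int (*func)(void *), void *data)
{
	struct sscs_like sscs = {
		.done = COMPLETION_INITIALIZER_ONSTACK(sscs.done),
		.func = func,
		.data = data,
	};

	INIT_WORK_ONSTACK(&sscs.work, sscs_like_callback);
	/*
	 * With fewer than max_active (WQ_DFL_ACTIVE) items pending, this
	 * work runs right away; once system_wq is saturated it waits on the
	 * inactive queue behind everything queued earlier.
	 */
	queue_work_on(cpu, system_wq, &sscs.work);
	wait_for_completion(&sscs.done);
	destroy_work_on_stack(&sscs.work);

	return sscs.ret;
}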
On Tue, Sep 10, 2024 at 09:31:41AM +0800, Chen Ridong wrote: > > > On 2024/9/9 22:19, Michal Koutný wrote: > > On Sat, Aug 17, 2024 at 09:33:34AM GMT, Chen Ridong <chenridong@huawei.com> wrote: > > > The reason for this issue is cgroup_mutex and cpu_hotplug_lock are > > > acquired in different tasks, which may lead to deadlock. > > > It can lead to a deadlock through the following steps: > > > 1. A large number of cpusets are deleted asynchronously, which puts a > > > large number of cgroup_bpf_release works into system_wq. The max_active > > > of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are > > > cgroup_bpf_release works, and many cgroup_bpf_release works will be put > > > into inactive queue. As illustrated in the diagram, there are 256 (in > > > the acvtive queue) + n (in the inactive queue) works. > > > 2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put > > > smp_call_on_cpu work into system_wq. However step 1 has already filled > > > system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has > > > to wait until the works that were put into the inacvtive queue earlier > > > have executed (n cgroup_bpf_release), so it will be blocked for a while. > > > 3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2. > > > 4. Cpusets that were deleted at step 1 put cgroup_release works into > > > cgroup_destroy_wq. They are competing to get cgroup_mutex all the time. > > > When cgroup_metux is acqured by work at css_killed_work_fn, it will > > > call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read. > > > However, cpuset_css_offline will be blocked for step 3. > > > 5. At this moment, there are 256 works in active queue that are > > > cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as > > > a result, all of them are blocked. Consequently, sscs.work can not be > > > executed. Ultimately, this situation leads to four processes being > > > blocked, forming a deadlock. > > > > > > system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4) > > > ... > > > 2000+ cgroups deleted asyn > > > 256 actives + n inactives > > > __lockup_detector_reconfigure > > > P(cpu_hotplug_lock.read) > > > put sscs.work into system_wq > > > 256 + n + 1(sscs.work) > > > sscs.work wait to be executed > > > warting sscs.work finish > > > percpu_down_write > > > P(cpu_hotplug_lock.write) > > > ...blocking... > > > css_killed_work_fn > > > P(cgroup_mutex) > > > cpuset_css_offline > > > P(cpu_hotplug_lock.read) > > > ...blocking... > > > 256 cgroup_bpf_release > > > mutex_lock(&cgroup_mutex); > > > ..blocking... > > > > Thanks, Ridong, for laying this out. > > Let me try to extract the core of the deps above. > > > > The correct lock ordering is: cgroup_mutex then cpu_hotplug_lock. > > However, the smp_call_on_cpu() under cpus_read_lock may lead to > > a deadlock (ABBA over those two locks). > > > > That's right. > > > This is OK > > thread T system_wq worker > > > > lock(cgroup_mutex) (II) > > ... > > unlock(cgroup_mutex) > > down(cpu_hotplug_lock.read) > > smp_call_on_cpu > > queue_work_on(cpu, system_wq, scss) (I) > > scss.func > > wait_for_completion(scss) > > up(cpu_hotplug_lock.read) > > > > However, there is no ordering between (I) and (II) so they can also happen > > in opposite > > > > thread T system_wq worker > > > > down(cpu_hotplug_lock.read) > > smp_call_on_cpu > > queue_work_on(cpu, system_wq, scss) (I) > > lock(cgroup_mutex) (II) > > ... 
> > unlock(cgroup_mutex) > > scss.func > > wait_for_completion(scss) > > up(cpu_hotplug_lock.read) > > > > And here the thread T + system_wq worker effectively call > > cpu_hotplug_lock and cgroup_mutex in the wrong order. (And since they're > > two threads, it won't be caught by lockdep.) > > > > By that reasoning any holder of cgroup_mutex on system_wq makes system > > susceptible to a deadlock (in presence of cpu_hotplug_lock waiting > > writers + cpuset operations). And the two work items must meet in same > > worker's processing hence probability is low (zero?) with less than > > WQ_DFL_ACTIVE items. Right, I'm on the same page. Should we document then somewhere that the cgroup mutex can't be locked from a system wq context? I think thus will also make the Fixes tag more meaningful. Thank you!
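One possible shape for the note Roman asks about, purely as a suggestion: the comment below does not exist in the tree, and the wording is an assumption distilled from Michal's analysis above.

/*
 * Hypothetical documentation sketch (e.g. next to cgroup_mutex):
 *
 * Lock order is cgroup_mutex -> cpu_hotplug_lock.  smp_call_on_cpu()
 * queues its completion work on system_wq while the caller holds
 * cpus_read_lock(), so a work item that takes cgroup_mutex and is queued
 * on system_wq can order the two locks the other way around across two
 * threads.  Queue such destruction work on cgroup_destroy_wq (or another
 * dedicated workqueue) instead of system_wq.
 */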
On Tue, Sep 10, 2024 at 09:02:59PM +0000, Roman Gushchin wrote:
...
> > > By that reasoning any holder of cgroup_mutex on system_wq makes system
> > > susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
> > > writers + cpuset operations). And the two work items must meet in same
> > > worker's processing hence probability is low (zero?) with less than
> > > WQ_DFL_ACTIVE items.
>
> Right, I'm on the same page. Should we document then somewhere that
> the cgroup mutex can't be locked from a system wq context?
>
> I think thus will also make the Fixes tag more meaningful.

I think that's completely fine. What's not fine is saturating system_wq.
Anything which creates a large number of concurrent work items should be
using its own workqueue. If anything, workqueue needs to add a warning for
saturation conditions and who are the offenders.

Thanks.
On 2024/9/11 5:17, Tejun Heo wrote:
> On Tue, Sep 10, 2024 at 09:02:59PM +0000, Roman Gushchin wrote:
> ...
>>>> By that reasoning any holder of cgroup_mutex on system_wq makes system
>>>> susceptible to a deadlock (in presence of cpu_hotplug_lock waiting
>>>> writers + cpuset operations). And the two work items must meet in same
>>>> worker's processing hence probability is low (zero?) with less than
>>>> WQ_DFL_ACTIVE items.
>>
>> Right, I'm on the same page. Should we document then somewhere that
>> the cgroup mutex can't be locked from a system wq context?
>>
>> I think thus will also make the Fixes tag more meaningful.
>
> I think that's completely fine. What's not fine is saturating system_wq.
> Anything which creates a large number of concurrent work items should be
> using its own workqueue. If anything, workqueue needs to add a warning for
> saturation conditions and who are the offenders.
>
> Thanks.
>

I will add a patch to document that.
Should we modify WQ_DFL_ACTIVE (256 now)? Maybe 1024 is acceptable?

Best regards,
Ridong
On Thu, Sep 12, 2024 at 09:33:23AM +0800, Chen Ridong wrote:
> I will add a patch to document that.
> Should we modify WQ_DFL_ACTIVE (256 now)? Maybe 1024 is acceptable?

Yeah, let's bump it.

Thanks.
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 8ba73042a239..a611a1274788 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -334,7 +334,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
 	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
 
 	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
-	queue_work(system_wq, &cgrp->bpf.release_work);
+	queue_work(cgroup_destroy_wq, &cgrp->bpf.release_work);
 }
 
 /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index c964dd7ff967..17ac19bc8106 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -13,6 +13,7 @@
 extern spinlock_t trace_cgroup_path_lock;
 extern char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];
 extern void __init enable_debug_cgroup(void);
+extern struct workqueue_struct *cgroup_destroy_wq;
 
 /*
  * cgroup_path() takes a spin lock. It is good practice not to take
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 75058fbf4450..77fa9ed69c86 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -124,7 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
  * destruction work items don't end up filling up max_active of system_wq
  * which may lead to deadlock.
  */
-static struct workqueue_struct *cgroup_destroy_wq;
+struct workqueue_struct *cgroup_destroy_wq;
 
 /* generate an array of cgroup subsystem pointers */
 #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,
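For context on why cgroup_destroy_wq is a safe landing spot: it is a small dedicated queue created early during cgroup initialization. The sketch below paraphrases kernel/cgroup/cgroup.c from memory rather than quoting it (the comment wording is mine); the patch above only drops the static qualifier so kernel/bpf/cgroup.c can reach the queue through cgroup-internal.h.

#include <linux/bug.h>
#include <linux/init.h>
#include <linux/workqueue.h>

/* no longer static after the patch above */
struct workqueue_struct *cgroup_destroy_wq;

static int __init cgroup_wq_init(void)
{
	/*
	 * Destruction work does not benefit from running in parallel, so a
	 * dedicated queue with max_active == 1 is enough.  Crucially, it
	 * does not share system_wq's max_active budget, and nothing outside
	 * cgroup waits on specific items queued here.
	 */
	cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
	BUG_ON(!cgroup_destroy_wq);
	return 0;
}
core_initcall(cgroup_wq_init);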
We found a hung_task problem as shown below:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/0:0  state:D stack:13920 pid:8  ppid:2  flags:0x00004000
Workqueue: events cgroup_bpf_release
Call Trace:
 <TASK>
 __schedule+0x5a2/0x2050
 ? find_held_lock+0x33/0x100
 ? wq_worker_sleeping+0x9e/0xe0
 schedule+0x9f/0x180
 schedule_preempt_disabled+0x25/0x50
 __mutex_lock+0x512/0x740
 ? cgroup_bpf_release+0x1e/0x4d0
 ? cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? cgroup_bpf_release+0x1e/0x4d0
 ? mutex_lock_nested+0x2b/0x40
 ? __pfx_delay_tsc+0x10/0x10
 mutex_lock_nested+0x2b/0x40
 cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
 ? process_scheduled_works+0x161/0x8a0
 process_scheduled_works+0x23a/0x8a0
 worker_thread+0x231/0x5b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x14d/0x1c0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x59/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>

This issue can be reproduced by the following pressure test:
1. A large number of cpuset cgroups are deleted.
2. Set CPUs online and offline repeatedly.
3. Set watchdog_thresh repeatedly.
The scripts can be obtained at the LINK mentioned above the signature.

The reason for this issue is that cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to deadlock.
It can lead to a deadlock through the following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
   large number of cgroup_bpf_release works into system_wq. The max_active
   of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
   cgroup_bpf_release works, and many cgroup_bpf_release works will be put
   into the inactive queue. As illustrated in the diagram, there are 256 (in
   the active queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put the
   smp_call_on_cpu work into system_wq. However, step 1 has already filled
   system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work' has
   to wait until the works that were put into the inactive queue earlier
   have executed (n cgroup_bpf_release works), so it will be blocked for a
   while.
3. CPU offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into
   cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
   When cgroup_mutex is acquired by the work at css_killed_work_fn, it will
   call cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
   However, cpuset_css_offline will be blocked because of step 3.
5. At this moment, there are 256 works in the active queue that are
   cgroup_bpf_release; they are attempting to acquire cgroup_mutex, and as
   a result, all of them are blocked. Consequently, sscs.work can not be
   executed. Ultimately, this situation leads to four processes being
   blocked, forming a deadlock.

system_wq(step1)          WatchDog(step2)               cpu offline(step3)        cgroup_destroy_wq(step4)
...
2000+ cgroups deleted async
256 actives + n inactives
                          __lockup_detector_reconfigure
                          P(cpu_hotplug_lock.read)
                          put sscs.work into system_wq
256 + n + 1(sscs.work)
                          sscs.work wait to be executed
                          waiting sscs.work finish
                                                        percpu_down_write
                                                        P(cpu_hotplug_lock.write)
                                                        ...blocking...
                                                                                  css_killed_work_fn
                                                                                  P(cgroup_mutex)
                                                                                  cpuset_css_offline
                                                                                  P(cpu_hotplug_lock.read)
                                                                                  ...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
...blocking...
To fix the problem, place cgroup_bpf_release works on cgroup_destroy_wq,
which can break the loop and solve the problem. System wqs are for misc
things which shouldn't create a large number of concurrent work items.
If something is going to generate >WQ_DFL_ACTIVE(256) concurrent work
items, it should use its own dedicated workqueue.

Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/bpf/cgroup.c             | 2 +-
 kernel/cgroup/cgroup-internal.h | 1 +
 kernel/cgroup/cgroup.c          | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)
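As a generic template for the rule stated in the last paragraph ("use its own dedicated workqueue"), the sketch below uses hypothetical foo_* names; it illustrates the pattern the patch applies rather than code from the tree.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *foo_release_wq;	/* hypothetical */
static struct work_struct foo_release_work;

static void foo_release_fn(struct work_struct *work)
{
	/* teardown that may block on heavily contended locks */
}

static void foo_queue_release(void)
{
	/* items queued here never consume system_wq's shared active slots */
	queue_work(foo_release_wq, &foo_release_work);
}

static int __init foo_init(void)
{
	/*
	 * max_active 0 means the default limit (WQ_DFL_ACTIVE); pass a small
	 * explicit cap instead if the work is inherently serial.
	 */
	foo_release_wq = alloc_workqueue("foo_release", 0, 0);
	if (!foo_release_wq)
		return -ENOMEM;

	INIT_WORK(&foo_release_work, foo_release_fn);
	return 0;
}

Compared with queueing on system_wq, an unrelated waiter such as sscs.work can no longer end up behind thousands of these items, which is exactly the dependency the diagram above shows being broken.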