Message ID | 20241217121828.3219752-1-chenridong@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v1] memcg: fix soft lockup in the OOM process | expand |
On Tue 17-12-24 12:18:28, Chen Ridong wrote: [...] > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 1c485beb0b93..14260381cccc 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg) > if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) > return 0; > > + cond_resched(); > task = find_lock_task_mm(p); > if (!task) { > /* This is called from RCU read lock for the global OOM killer path and I do not think you can schedule there. I do not remember specifics of task traversal for crgoup path but I guess that you might need to silence the soft lockup detector instead or come up with a different iteration scheme.
On 2024/12/17 20:54, Michal Hocko wrote: > On Tue 17-12-24 12:18:28, Chen Ridong wrote: > [...] >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >> index 1c485beb0b93..14260381cccc 100644 >> --- a/mm/oom_kill.c >> +++ b/mm/oom_kill.c >> @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg) >> if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) >> return 0; >> >> + cond_resched(); >> task = find_lock_task_mm(p); >> if (!task) { >> /* > > This is called from RCU read lock for the global OOM killer path and I > do not think you can schedule there. I do not remember specifics of task > traversal for crgoup path but I guess that you might need to silence the > soft lockup detector instead or come up with a different iteration > scheme. Thank you, Michal. I made a mistake. I added cond_resched in the mem_cgroup_scan_tasks function below the fn, but after reconsideration, it may cause unnecessary scheduling for other callers of mem_cgroup_scan_tasks. Therefore, I moved it into the dump_task function. However, I missed the RCU lock from the global OOM. I think we can use touch_nmi_watchdog in place of cond_resched, which can silence the soft lockup detector. Do you think that is acceptable? @@ -390,7 +391,7 @@ static int dump_task(struct task_struct *p, void *arg) if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) return 0; + touch_nmi_watchdog(); task = find_lock_task_mm(p); Best regards, Ridong
On Wed 18-12-24 15:44:34, Chen Ridong wrote: > > > On 2024/12/17 20:54, Michal Hocko wrote: > > On Tue 17-12-24 12:18:28, Chen Ridong wrote: > > [...] > >> diff --git a/mm/oom_kill.c b/mm/oom_kill.c > >> index 1c485beb0b93..14260381cccc 100644 > >> --- a/mm/oom_kill.c > >> +++ b/mm/oom_kill.c > >> @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg) > >> if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) > >> return 0; > >> > >> + cond_resched(); > >> task = find_lock_task_mm(p); > >> if (!task) { > >> /* > > > > This is called from RCU read lock for the global OOM killer path and I > > do not think you can schedule there. I do not remember specifics of task > > traversal for crgoup path but I guess that you might need to silence the > > soft lockup detector instead or come up with a different iteration > > scheme. > > Thank you, Michal. > > I made a mistake. I added cond_resched in the mem_cgroup_scan_tasks > function below the fn, but after reconsideration, it may cause > unnecessary scheduling for other callers of mem_cgroup_scan_tasks. > Therefore, I moved it into the dump_task function. However, I missed the > RCU lock from the global OOM. > > I think we can use touch_nmi_watchdog in place of cond_resched, which > can silence the soft lockup detector. Do you think that is acceptable? It is certainly a way to go. Not the best one at that though. Maybe we need different solution for the global and for the memcg OOMs. During the global OOM we rarely care about latency as the whole system is likely to struggle. Memcg ooms are much more likely. Having that many tasks in a memcg certainly requires a further partitioning so if configured properly the OOM latency shouldn't be visible much. But I am wondering whether the cgroup task iteration could use cond_resched while the global one would touch_nmi_watchdog for every N iterations. I might be missing something but I do not see any locking required outside of css_task_iter_*.
On 2024/12/18 15:56, Michal Hocko wrote: > On Wed 18-12-24 15:44:34, Chen Ridong wrote: >> >> >> On 2024/12/17 20:54, Michal Hocko wrote: >>> On Tue 17-12-24 12:18:28, Chen Ridong wrote: >>> [...] >>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >>>> index 1c485beb0b93..14260381cccc 100644 >>>> --- a/mm/oom_kill.c >>>> +++ b/mm/oom_kill.c >>>> @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg) >>>> if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) >>>> return 0; >>>> >>>> + cond_resched(); >>>> task = find_lock_task_mm(p); >>>> if (!task) { >>>> /* >>> >>> This is called from RCU read lock for the global OOM killer path and I >>> do not think you can schedule there. I do not remember specifics of task >>> traversal for crgoup path but I guess that you might need to silence the >>> soft lockup detector instead or come up with a different iteration >>> scheme. >> >> Thank you, Michal. >> >> I made a mistake. I added cond_resched in the mem_cgroup_scan_tasks >> function below the fn, but after reconsideration, it may cause >> unnecessary scheduling for other callers of mem_cgroup_scan_tasks. >> Therefore, I moved it into the dump_task function. However, I missed the >> RCU lock from the global OOM. >> >> I think we can use touch_nmi_watchdog in place of cond_resched, which >> can silence the soft lockup detector. Do you think that is acceptable? > > It is certainly a way to go. Not the best one at that though. Maybe we > need different solution for the global and for the memcg OOMs. During > the global OOM we rarely care about latency as the whole system is > likely to struggle. Memcg ooms are much more likely. Having that many > tasks in a memcg certainly requires a further partitioning so if > configured properly the OOM latency shouldn't be visible much. But I am > wondering whether the cgroup task iteration could use cond_resched while > the global one would touch_nmi_watchdog for every N iterations. I might > be missing something but I do not see any locking required outside of > css_task_iter_*. Do you mean like that: diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index d9061bd55436..9d197a731841 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5023,7 +5023,7 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it) } spin_unlock_irqrestore(&css_set_lock, irqflags); - + cond_resched(); return it->cur_task; } @@ -433,8 +433,10 @@ static void dump_tasks(struct oom_control *oc) struct task_struct *p; rcu_read_lock(); - for_each_process(p) + for_each_process(p) { + touch_nmi_watchdog(); dump_task(p, oc); + } rcu_read_unlock(); } The 'css_task_iter_*' functions are used in many places. We should be very careful when adding cond_resched within these functions. I don't see any RCU or spinlock usage outside of css_task_iter_*, except for mutex locks, such as in cgroup_do_freeze. And perhaps Tj will have some opinions on this? Best regards, Ridong
On Wed 18-12-24 17:00:38, Chen Ridong wrote: > > > On 2024/12/18 15:56, Michal Hocko wrote: > > On Wed 18-12-24 15:44:34, Chen Ridong wrote: > >> > >> > >> On 2024/12/17 20:54, Michal Hocko wrote: > >>> On Tue 17-12-24 12:18:28, Chen Ridong wrote: > >>> [...] > >>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c > >>>> index 1c485beb0b93..14260381cccc 100644 > >>>> --- a/mm/oom_kill.c > >>>> +++ b/mm/oom_kill.c > >>>> @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg) > >>>> if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) > >>>> return 0; > >>>> > >>>> + cond_resched(); > >>>> task = find_lock_task_mm(p); > >>>> if (!task) { > >>>> /* > >>> > >>> This is called from RCU read lock for the global OOM killer path and I > >>> do not think you can schedule there. I do not remember specifics of task > >>> traversal for crgoup path but I guess that you might need to silence the > >>> soft lockup detector instead or come up with a different iteration > >>> scheme. > >> > >> Thank you, Michal. > >> > >> I made a mistake. I added cond_resched in the mem_cgroup_scan_tasks > >> function below the fn, but after reconsideration, it may cause > >> unnecessary scheduling for other callers of mem_cgroup_scan_tasks. > >> Therefore, I moved it into the dump_task function. However, I missed the > >> RCU lock from the global OOM. > >> > >> I think we can use touch_nmi_watchdog in place of cond_resched, which > >> can silence the soft lockup detector. Do you think that is acceptable? > > > > It is certainly a way to go. Not the best one at that though. Maybe we > > need different solution for the global and for the memcg OOMs. During > > the global OOM we rarely care about latency as the whole system is > > likely to struggle. Memcg ooms are much more likely. Having that many > > tasks in a memcg certainly requires a further partitioning so if > > configured properly the OOM latency shouldn't be visible much. But I am > > wondering whether the cgroup task iteration could use cond_resched while > > the global one would touch_nmi_watchdog for every N iterations. I might > > be missing something but I do not see any locking required outside of > > css_task_iter_*. > > Do you mean like that: I've had something like this (untested) in mind diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 7b3503d12aaf..37abc94abd2e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1167,10 +1167,14 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg, for_each_mem_cgroup_tree(iter, memcg) { struct css_task_iter it; struct task_struct *task; + unsigned int i = 0 css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it); - while (!ret && (task = css_task_iter_next(&it))) + while (!ret && (task = css_task_iter_next(&it))) { ret = fn(task, arg); + if (++i % 1000) + cond_resched(); + } css_task_iter_end(&it); if (ret) { mem_cgroup_iter_break(memcg, iter); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 1c485beb0b93..3bf2304ed20c 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -430,10 +430,14 @@ static void dump_tasks(struct oom_control *oc) mem_cgroup_scan_tasks(oc->memcg, dump_task, oc); else { struct task_struct *p; + unsigned int i = 0; rcu_read_lock(); - for_each_process(p) + for_each_process(p) { + if (++i % 1000) + touch_softlockup_watchdog(); dump_task(p, oc); + } rcu_read_unlock(); } }
On 2024/12/18 18:22, Michal Hocko wrote: > On Wed 18-12-24 17:00:38, Chen Ridong wrote: >> >> >> On 2024/12/18 15:56, Michal Hocko wrote: >>> On Wed 18-12-24 15:44:34, Chen Ridong wrote: >>>> >>>> >>>> On 2024/12/17 20:54, Michal Hocko wrote: >>>>> On Tue 17-12-24 12:18:28, Chen Ridong wrote: >>>>> [...] >>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >>>>>> index 1c485beb0b93..14260381cccc 100644 >>>>>> --- a/mm/oom_kill.c >>>>>> +++ b/mm/oom_kill.c >>>>>> @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg) >>>>>> if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) >>>>>> return 0; >>>>>> >>>>>> + cond_resched(); >>>>>> task = find_lock_task_mm(p); >>>>>> if (!task) { >>>>>> /* >>>>> >>>>> This is called from RCU read lock for the global OOM killer path and I >>>>> do not think you can schedule there. I do not remember specifics of task >>>>> traversal for crgoup path but I guess that you might need to silence the >>>>> soft lockup detector instead or come up with a different iteration >>>>> scheme. >>>> >>>> Thank you, Michal. >>>> >>>> I made a mistake. I added cond_resched in the mem_cgroup_scan_tasks >>>> function below the fn, but after reconsideration, it may cause >>>> unnecessary scheduling for other callers of mem_cgroup_scan_tasks. >>>> Therefore, I moved it into the dump_task function. However, I missed the >>>> RCU lock from the global OOM. >>>> >>>> I think we can use touch_nmi_watchdog in place of cond_resched, which >>>> can silence the soft lockup detector. Do you think that is acceptable? >>> >>> It is certainly a way to go. Not the best one at that though. Maybe we >>> need different solution for the global and for the memcg OOMs. During >>> the global OOM we rarely care about latency as the whole system is >>> likely to struggle. Memcg ooms are much more likely. Having that many >>> tasks in a memcg certainly requires a further partitioning so if >>> configured properly the OOM latency shouldn't be visible much. But I am >>> wondering whether the cgroup task iteration could use cond_resched while >>> the global one would touch_nmi_watchdog for every N iterations. I might >>> be missing something but I do not see any locking required outside of >>> css_task_iter_*. >> >> Do you mean like that: > > I've had something like this (untested) in mind > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 7b3503d12aaf..37abc94abd2e 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -1167,10 +1167,14 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg, > for_each_mem_cgroup_tree(iter, memcg) { > struct css_task_iter it; > struct task_struct *task; > + unsigned int i = 0 > > css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it); > - while (!ret && (task = css_task_iter_next(&it))) > + while (!ret && (task = css_task_iter_next(&it))) { > ret = fn(task, arg); > + if (++i % 1000) > + cond_resched(); > + } > css_task_iter_end(&it); > if (ret) { > mem_cgroup_iter_break(memcg, iter); Thank you for your patience. I had this idea in mind as well. However, there are two considerations that led me to reconsider it: 1. I wasn't convinced about how we should call cond_resched every N iterations. Should it be 1000 or 10000? 2. I don't think all callers of mem_cgroup_scan_tasks need cond_resched. Only fn is expensive (e.g., dump_tasks), and it needs cond_resched. At least, I have not encountered any other issue except except when fn is dump_tasks. If you think this is acceptable, I will test and update the patch. Best regards, Ridong
On Thu 19-12-24 09:27:52, Chen Ridong wrote: > > > On 2024/12/18 18:22, Michal Hocko wrote: > > On Wed 18-12-24 17:00:38, Chen Ridong wrote: > >> > >> > >> On 2024/12/18 15:56, Michal Hocko wrote: > >>> On Wed 18-12-24 15:44:34, Chen Ridong wrote: > >>>> > >>>> > >>>> On 2024/12/17 20:54, Michal Hocko wrote: > >>>>> On Tue 17-12-24 12:18:28, Chen Ridong wrote: > >>>>> [...] > >>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c > >>>>>> index 1c485beb0b93..14260381cccc 100644 > >>>>>> --- a/mm/oom_kill.c > >>>>>> +++ b/mm/oom_kill.c > >>>>>> @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg) > >>>>>> if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) > >>>>>> return 0; > >>>>>> > >>>>>> + cond_resched(); > >>>>>> task = find_lock_task_mm(p); > >>>>>> if (!task) { > >>>>>> /* > >>>>> > >>>>> This is called from RCU read lock for the global OOM killer path and I > >>>>> do not think you can schedule there. I do not remember specifics of task > >>>>> traversal for crgoup path but I guess that you might need to silence the > >>>>> soft lockup detector instead or come up with a different iteration > >>>>> scheme. > >>>> > >>>> Thank you, Michal. > >>>> > >>>> I made a mistake. I added cond_resched in the mem_cgroup_scan_tasks > >>>> function below the fn, but after reconsideration, it may cause > >>>> unnecessary scheduling for other callers of mem_cgroup_scan_tasks. > >>>> Therefore, I moved it into the dump_task function. However, I missed the > >>>> RCU lock from the global OOM. > >>>> > >>>> I think we can use touch_nmi_watchdog in place of cond_resched, which > >>>> can silence the soft lockup detector. Do you think that is acceptable? > >>> > >>> It is certainly a way to go. Not the best one at that though. Maybe we > >>> need different solution for the global and for the memcg OOMs. During > >>> the global OOM we rarely care about latency as the whole system is > >>> likely to struggle. Memcg ooms are much more likely. Having that many > >>> tasks in a memcg certainly requires a further partitioning so if > >>> configured properly the OOM latency shouldn't be visible much. But I am > >>> wondering whether the cgroup task iteration could use cond_resched while > >>> the global one would touch_nmi_watchdog for every N iterations. I might > >>> be missing something but I do not see any locking required outside of > >>> css_task_iter_*. > >> > >> Do you mean like that: > > > > I've had something like this (untested) in mind > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 7b3503d12aaf..37abc94abd2e 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -1167,10 +1167,14 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg, > > for_each_mem_cgroup_tree(iter, memcg) { > > struct css_task_iter it; > > struct task_struct *task; > > + unsigned int i = 0 > > > > css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it); > > - while (!ret && (task = css_task_iter_next(&it))) > > + while (!ret && (task = css_task_iter_next(&it))) { > > ret = fn(task, arg); > > + if (++i % 1000) > > + cond_resched(); > > + } > > css_task_iter_end(&it); > > if (ret) { > > mem_cgroup_iter_break(memcg, iter); > > Thank you for your patience. > > I had this idea in mind as well. > However, there are two considerations that led me to reconsider it: > > 1. I wasn't convinced about how we should call cond_resched every N > iterations. Should it be 1000 or 10000? Sure, there will likely not be any _right_ value. This is mostly to mitigate the overhead of cond_resched which is not completely free. Having a system with 1000 tasks is not completely uncommon and we do not really need cond_resched now. More importantly we can expect cond_resched will eventually go away with the PREEMPT_LAZY (or what is the current name of that) so I wouldn't overthink this. > 2. I don't think all callers of mem_cgroup_scan_tasks need cond_resched. > Only fn is expensive (e.g., dump_tasks), and it needs cond_resched. At > least, I have not encountered any other issue except except when fn is > dump_tasks. See above. I wouldn't really overthink this.
On 2024/12/19 15:57, Michal Hocko wrote: > On Thu 19-12-24 09:27:52, Chen Ridong wrote: >> >> >> On 2024/12/18 18:22, Michal Hocko wrote: >>> On Wed 18-12-24 17:00:38, Chen Ridong wrote: >>>> >>>> >>>> On 2024/12/18 15:56, Michal Hocko wrote: >>>>> On Wed 18-12-24 15:44:34, Chen Ridong wrote: >>>>>> >>>>>> >>>>>> On 2024/12/17 20:54, Michal Hocko wrote: >>>>>>> On Tue 17-12-24 12:18:28, Chen Ridong wrote: >>>>>>> [...] >>>>>>>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c >>>>>>>> index 1c485beb0b93..14260381cccc 100644 >>>>>>>> --- a/mm/oom_kill.c >>>>>>>> +++ b/mm/oom_kill.c >>>>>>>> @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg) >>>>>>>> if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) >>>>>>>> return 0; >>>>>>>> >>>>>>>> + cond_resched(); >>>>>>>> task = find_lock_task_mm(p); >>>>>>>> if (!task) { >>>>>>>> /* >>>>>>> >>>>>>> This is called from RCU read lock for the global OOM killer path and I >>>>>>> do not think you can schedule there. I do not remember specifics of task >>>>>>> traversal for crgoup path but I guess that you might need to silence the >>>>>>> soft lockup detector instead or come up with a different iteration >>>>>>> scheme. >>>>>> >>>>>> Thank you, Michal. >>>>>> >>>>>> I made a mistake. I added cond_resched in the mem_cgroup_scan_tasks >>>>>> function below the fn, but after reconsideration, it may cause >>>>>> unnecessary scheduling for other callers of mem_cgroup_scan_tasks. >>>>>> Therefore, I moved it into the dump_task function. However, I missed the >>>>>> RCU lock from the global OOM. >>>>>> >>>>>> I think we can use touch_nmi_watchdog in place of cond_resched, which >>>>>> can silence the soft lockup detector. Do you think that is acceptable? >>>>> >>>>> It is certainly a way to go. Not the best one at that though. Maybe we >>>>> need different solution for the global and for the memcg OOMs. During >>>>> the global OOM we rarely care about latency as the whole system is >>>>> likely to struggle. Memcg ooms are much more likely. Having that many >>>>> tasks in a memcg certainly requires a further partitioning so if >>>>> configured properly the OOM latency shouldn't be visible much. But I am >>>>> wondering whether the cgroup task iteration could use cond_resched while >>>>> the global one would touch_nmi_watchdog for every N iterations. I might >>>>> be missing something but I do not see any locking required outside of >>>>> css_task_iter_*. >>>> >>>> Do you mean like that: >>> >>> I've had something like this (untested) in mind >>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >>> index 7b3503d12aaf..37abc94abd2e 100644 >>> --- a/mm/memcontrol.c >>> +++ b/mm/memcontrol.c >>> @@ -1167,10 +1167,14 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg, >>> for_each_mem_cgroup_tree(iter, memcg) { >>> struct css_task_iter it; >>> struct task_struct *task; >>> + unsigned int i = 0 >>> >>> css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it); >>> - while (!ret && (task = css_task_iter_next(&it))) >>> + while (!ret && (task = css_task_iter_next(&it))) { >>> ret = fn(task, arg); >>> + if (++i % 1000) >>> + cond_resched(); >>> + } >>> css_task_iter_end(&it); >>> if (ret) { >>> mem_cgroup_iter_break(memcg, iter); >> >> Thank you for your patience. >> >> I had this idea in mind as well. >> However, there are two considerations that led me to reconsider it: >> >> 1. I wasn't convinced about how we should call cond_resched every N >> iterations. Should it be 1000 or 10000? > > Sure, there will likely not be any _right_ value. This is mostly to > mitigate the overhead of cond_resched which is not completely free. > Having a system with 1000 tasks is not completely uncommon and we do not > really need cond_resched now. > > More importantly we can expect cond_resched will eventually go away with > the PREEMPT_LAZY (or what is the current name of that) so I wouldn't > overthink this. > >> 2. I don't think all callers of mem_cgroup_scan_tasks need cond_resched. >> Only fn is expensive (e.g., dump_tasks), and it needs cond_resched. At >> least, I have not encountered any other issue except except when fn is >> dump_tasks. > > See above. I wouldn't really overthink this. Thanks, I tested and sent v2. Best regards, Ridong
diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 1c485beb0b93..14260381cccc 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg) if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc)) return 0; + cond_resched(); task = find_lock_task_mm(p); if (!task) { /*