[v1] memcg: fix soft lockup in the OOM process

Message ID	20241217121828.3219752-1-chenridong@huaweicloud.com (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
From: Chen Ridong <chenridong@huaweicloud.com>
To: akpm@linux-foundation.org, mhocko@kernel.org, hannes@cmpxchg.org, yosryahmed@google.com, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, davidf@vimeo.com, vbabka@suse.cz, handai.szj@taobao.com, rientjes@google.com, kamezawa.hiroyu@jp.fujitsu.com
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, chenridong@huawei.com, wangweiyang2@huawei.com
Subject: [PATCH v1] memcg: fix soft lockup in the OOM process
Date: Tue, 17 Dec 2024 12:18:28 +0000
Message-Id: <20241217121828.3219752-1-chenridong@huaweicloud.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Chen Ridong Dec. 17, 2024, 12:18 p.m. UTC

From: Chen Ridong <chenridong@huawei.com>

A soft lockup issue was found in the product with about 56,000 tasks in
the OOM cgroup. The kernel was traversing them when the soft lockup was
triggered.

watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [VM Thread:1503066]
CPU: 2 PID: 1503066 Comm: VM Thread Kdump: loaded Tainted: G
Hardware name: Huawei Cloud OpenStack Nova, BIOS
RIP: 0010:console_unlock+0x343/0x540
RSP: 0000:ffffb751447db9a0 EFLAGS: 00000247 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000ffffffff
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000247
RBP: ffffffffafc71f90 R08: 0000000000000000 R09: 0000000000000040
R10: 0000000000000080 R11: 0000000000000000 R12: ffffffffafc74bd0
R13: ffffffffaf60a220 R14: 0000000000000247 R15: 0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2fe6ad91f0 CR3: 00000004b2076003 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 vprintk_emit+0x193/0x280
 printk+0x52/0x6e
 dump_task+0x114/0x130
 mem_cgroup_scan_tasks+0x76/0x100
 dump_header+0x1fe/0x210
 oom_kill_process+0xd1/0x100
 out_of_memory+0x125/0x570
 mem_cgroup_out_of_memory+0xb5/0xd0
 try_charge+0x720/0x770
 mem_cgroup_try_charge+0x86/0x180
 mem_cgroup_try_charge_delay+0x1c/0x40
 do_anonymous_page+0xb5/0x390
 handle_mm_fault+0xc4/0x1f0

This is because thousands of processes are in the OOM cgroup and it takes
a long time to traverse all of them. As a result, this leads to a soft
lockup in the OOM process.

To fix this issue, add 'cond_resched' in the 'dump_task' function.

Fixes: 9cbb78bb3143 ("mm, memcg: introduce own oom handler to iterate only over its own threads")
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 mm/oom_kill.c | 1 +
 1 file changed, 1 insertion(+)

Michal Hocko Dec. 17, 2024, 12:54 p.m. UTC | #1

On Tue 17-12-24 12:18:28, Chen Ridong wrote:
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 1c485beb0b93..14260381cccc 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -390,6 +390,7 @@ static int dump_task(struct task_struct *p, void *arg)
>  	if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc))
>  		return 0;
>  
> +	cond_resched();
>  	task = find_lock_task_mm(p);
>  	if (!task) {
>  		/*

This is called under the RCU read lock on the global OOM killer path and I
do not think you can schedule there. I do not remember the specifics of task
traversal for the cgroup path but I guess that you might need to silence the
soft lockup detector instead or come up with a different iteration
scheme.
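
A minimal userspace sketch of the objection above, not kernel code. It
assumes a toy model in which a nesting counter stands in for
rcu_read_lock()/rcu_read_unlock(), and a hypothetical toy_cond_resched()
reports whether yielding would be allowed. Every toy_* name is invented for
illustration; the real kernel primitives behave in more subtle ways (for
example under preemptible RCU).

```c
#include <stdbool.h>

/* Toy model only: a nesting depth standing in for RCU read-side sections. */
static int toy_rcu_depth;

static void toy_rcu_read_lock(void)   { toy_rcu_depth++; }
static void toy_rcu_read_unlock(void) { toy_rcu_depth--; }

/*
 * Hypothetical stand-in for cond_resched(): returns true when yielding
 * would be legal in this model, false inside a read-side section.
 */
static bool toy_cond_resched(void)
{
	return toy_rcu_depth == 0;
}

/* Models the global OOM path: the callback runs with the "lock" held. */
static bool toy_global_path_may_yield(void)
{
	bool ok;

	toy_rcu_read_lock();
	ok = toy_cond_resched();
	toy_rcu_read_unlock();
	return ok;
}

/* Models a path that runs the callback outside any read-side section. */
static bool toy_memcg_path_may_yield(void)
{
	return toy_cond_resched();
}
```

In this model the global path reports that yielding is not allowed, while a
path that runs the callback outside any read-side section may yield.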

Chen Ridong Dec. 18, 2024, 7:44 a.m. UTC | #2

On 2024/12/17 20:54, Michal Hocko wrote:
> On Tue 17-12-24 12:18:28, Chen Ridong wrote:
> [...]
> 
> This is called under the RCU read lock on the global OOM killer path and I
> do not think you can schedule there. I do not remember the specifics of task
> traversal for the cgroup path but I guess that you might need to silence the
> soft lockup detector instead or come up with a different iteration
> scheme.

Thank you, Michal.

I made a mistake. I first added cond_resched in the mem_cgroup_scan_tasks
function, after the call to fn, but on reconsideration it may cause
unnecessary scheduling for other callers of mem_cgroup_scan_tasks.
Therefore, I moved it into the dump_task function. However, I missed the
RCU read lock on the global OOM path.

I think we can use touch_nmi_watchdog in place of cond_resched, which
can silence the soft lockup detector. Do you think that is acceptable?

@@ -390,7 +391,7 @@ static int dump_task(struct task_struct *p, void *arg)
        if (!is_memcg_oom(oc) && !oom_cpuset_eligible(p, oc))
                return 0;

+       touch_nmi_watchdog();
        task = find_lock_task_mm(p);

Best regards,
Ridong

Michal Hocko Dec. 18, 2024, 7:56 a.m. UTC | #3

On Wed 18-12-24 15:44:34, Chen Ridong wrote:
> [...]
> 
> I think we can use touch_nmi_watchdog in place of cond_resched, which
> can silence the soft lockup detector. Do you think that is acceptable?

It is certainly a way to go. Not the best one at that though. Maybe we
need a different solution for the global and for the memcg OOMs. During
the global OOM we rarely care about latency as the whole system is
likely to struggle. Memcg OOMs are much more likely. Having that many
tasks in a memcg certainly requires further partitioning, so if
configured properly the OOM latency shouldn't be visible much. But I am
wondering whether the cgroup task iteration could use cond_resched while
the global one would call touch_nmi_watchdog every N iterations. I might
be missing something but I do not see any locking required outside of
css_task_iter_*.
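
The split suggested above could be sketched in userspace roughly as
follows. This is only an illustration under stated assumptions:
scan_with_relax(), relax_fn and count_relax() are hypothetical names, and
the relax hook stands in for cond_resched() on the memcg path or a watchdog
touch on the global path.

```c
#include <stddef.h>

/* Hypothetical callback type: invoked once per completed batch of items. */
typedef void (*relax_fn)(void *ctx);

/*
 * Walk `total` items and call `relax` after every `batch` items.
 * `relax` must not be NULL. Returns how many times `relax` was called.
 */
static size_t scan_with_relax(size_t total, size_t batch,
			      relax_fn relax, void *ctx)
{
	size_t i, calls = 0;

	if (batch == 0)
		return 0;

	for (i = 1; i <= total; i++) {
		/* ... per-item work (the scan callback) would go here ... */
		if (i % batch == 0) {
			relax(ctx);
			calls++;
		}
	}
	return calls;
}

/* Example hook for the sketch: it only counts its own invocations. */
static void count_relax(void *ctx)
{
	(*(size_t *)ctx)++;
}
```

Passing a different hook per caller keeps the loop itself identical for both
paths; only the action taken at each batch boundary differs.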

Chen Ridong Dec. 18, 2024, 9 a.m. UTC | #4

On 2024/12/18 15:56, Michal Hocko wrote:
> [...]
> 
> It is certainly a way to go. Not the best one at that though. Maybe we
> need a different solution for the global and for the memcg OOMs. During
> the global OOM we rarely care about latency as the whole system is
> likely to struggle. Memcg OOMs are much more likely. Having that many
> tasks in a memcg certainly requires further partitioning, so if
> configured properly the OOM latency shouldn't be visible much. But I am
> wondering whether the cgroup task iteration could use cond_resched while
> the global one would call touch_nmi_watchdog every N iterations. I might
> be missing something but I do not see any locking required outside of
> css_task_iter_*.

Do you mean like that:

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index d9061bd55436..9d197a731841 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5023,7 +5023,7 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it)
        }

        spin_unlock_irqrestore(&css_set_lock, irqflags);
-
+       cond_resched();
        return it->cur_task;
 }

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -433,8 +433,10 @@ static void dump_tasks(struct oom_control *oc)
                struct task_struct *p;

                rcu_read_lock();
-               for_each_process(p)
+               for_each_process(p) {
+                       touch_nmi_watchdog();
                        dump_task(p, oc);
+               }
                rcu_read_unlock();
        }


The 'css_task_iter_*' functions are used in many places. We should be
very careful when adding cond_resched within these functions. I don't
see any RCU or spinlock usage outside of css_task_iter_*, except for
mutex locks, such as in cgroup_do_freeze.

And perhaps Tj will have some opinions on this?

Best regards,
Ridong

Michal Hocko Dec. 18, 2024, 10:22 a.m. UTC | #5

On Wed 18-12-24 17:00:38, Chen Ridong wrote:
> [...]
> 
> Do you mean like that:

I've had something like this (untested) in mind
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7b3503d12aaf..37abc94abd2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1167,10 +1167,14 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	for_each_mem_cgroup_tree(iter, memcg) {
 		struct css_task_iter it;
 		struct task_struct *task;
+		unsigned int i = 0;
 
 		css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
-		while (!ret && (task = css_task_iter_next(&it)))
+		while (!ret && (task = css_task_iter_next(&it))) {
 			ret = fn(task, arg);
+			if (!(++i % 1000))
+				cond_resched();
+		}
 		css_task_iter_end(&it);
 		if (ret) {
 			mem_cgroup_iter_break(memcg, iter);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1c485beb0b93..3bf2304ed20c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -430,10 +430,14 @@ static void dump_tasks(struct oom_control *oc)
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
 	else {
 		struct task_struct *p;
+		unsigned int i = 0;
 
 		rcu_read_lock();
-		for_each_process(p)
+		for_each_process(p) {
+			if (!(++i % 1000))
+				touch_softlockup_watchdog();
 			dump_task(p, oc);
+		}
 		rcu_read_unlock();
 	}
 }

Chen Ridong Dec. 19, 2024, 1:27 a.m. UTC | #6

On 2024/12/18 18:22, Michal Hocko wrote:
> [...]

Thank you for your patience.

I had this idea in mind as well.
However, there are two considerations that led me to reconsider it:

1. I wasn't sure how often we should call cond_resched. Should it be every
1000 iterations or every 10000?
2. I don't think all callers of mem_cgroup_scan_tasks need cond_resched.
It is only needed when fn is expensive (e.g., dump_task). At least, I have
not encountered any issue except when fn is dump_task.

If you think this is acceptable, I will test and update the patch.

Best regards,
Ridong

Michal Hocko Dec. 19, 2024, 7:57 a.m. UTC | #7

On Thu 19-12-24 09:27:52, Chen Ridong wrote:
> [...]
> 
> Thank you for your patience.
> 
> I had this idea in mind as well.
> However, there are two considerations that led me to reconsider it:
> 
> 1. I wasn't sure how often we should call cond_resched. Should it be every
> 1000 iterations or every 10000?

Sure, there will likely not be any _right_ value. This is mostly to
mitigate the overhead of cond_resched which is not completely free.
Having a system with 1000 tasks is not completely uncommon and we do not
really need cond_resched now.

More importantly, we can expect that cond_resched will eventually go away
with PREEMPT_LAZY (or whatever the current name of that is), so I wouldn't
overthink this.

> 2. I don't think all callers of mem_cgroup_scan_tasks need cond_resched.
> It is only needed when fn is expensive (e.g., dump_task). At least, I have
> not encountered any issue except when fn is dump_task.

See above. I wouldn't really overthink this.
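
For a rough sense of what the choice of N means, the sketch below counts
how many yield points a loop reaches. The helper names are hypothetical, and
the power-of-two mask check is a common variation that is assumed here
rather than taken from this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* How many times a loop over `total` items yields if it yields every `n` items. */
static size_t yield_points(size_t total, size_t n)
{
	return n ? total / n : 0;
}

/*
 * Variation (an assumption of this sketch): with a power-of-two batch size
 * the boundary check can use a mask instead of a modulo. `pow2_batch` must
 * be a power of two; note that i == 0 also counts as a boundary.
 */
static bool is_pow2_batch_boundary(size_t i, size_t pow2_batch)
{
	return (i & (pow2_batch - 1)) == 0;
}
```

With the roughly 56,000 tasks from the report, N = 1000 gives 56 yield
points and N = 10000 gives 5, while a group of fewer than 1000 tasks never
yields at N = 1000.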

Chen Ridong Dec. 20, 2024, 10:44 a.m. UTC | #8

On 2024/12/19 15:57, Michal Hocko wrote:
> [...]

Thanks, I tested and sent v2.

Best regards,
Ridong
