mm, oom: check memcg margin for parallel oom

Message ID 1594728512-18969-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived
Series mm, oom: check memcg margin for parallel oom

Commit Message

Yafang Shao July 14, 2020, 12:08 p.m. UTC
Commit 7775face2079 ("memcg: killed threads should not invoke memcg OOM
killer") resolves the problem of different threads in a multi-threaded
task doing parallel memcg oom, but it doesn't solve the problem of
different tasks doing parallel memcg oom.

It may happen that many different tasks in the same memcg are waiting on
the oom_lock at the same time. If one of them has already made progress
and freed enough available memory, the others don't need to trigger the
oom killer again. Checking the memcg margin after taking the oom_lock
helps achieve this.

Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
---
 mm/memcontrol.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)
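
For reference, the margin check added below relies on mem_cgroup_margin(),
which returns how many pages can still be charged before the memcg hits its
hard limit. A rough sketch of that helper, based on mm/memcontrol.c around
the kernels this series targets (field names vary across versions):

	static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
	{
		unsigned long margin = 0;
		unsigned long count;
		unsigned long limit;

		/* Headroom below the memory hard limit. */
		count = page_counter_read(&memcg->memory);
		limit = READ_ONCE(memcg->memory.max);
		if (count < limit)
			margin = limit - count;

		/* With memsw accounting, the tighter of the two limits wins. */
		if (do_memsw_account()) {
			count = page_counter_read(&memcg->memsw);
			limit = READ_ONCE(memcg->memsw.max);
			if (count < limit)
				margin = min(margin, limit - count);
			else
				margin = 0;
		}

		return margin;
	}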

Comments

Michal Hocko July 14, 2020, 12:37 p.m. UTC | #1
On Tue 14-07-20 08:08:32, Yafang Shao wrote:
> Commit 7775face2079 ("memcg: killed threads should not invoke memcg OOM
> killer") resolves the problem of different threads in a multi-threaded
> task doing parallel memcg oom, but it doesn't solve the problem of
> different tasks doing parallel memcg oom.
> 
> It may happen that many different tasks in the same memcg are waiting on
> the oom_lock at the same time. If one of them has already made progress
> and freed enough available memory, the others don't need to trigger the
> oom killer again. Checking the memcg margin after taking the oom_lock
> helps achieve this.

While the changelog makes sense, I believe it can be improved. I would
use something like the following. Feel free to use parts of it.

"
Memcg oom killer invocation is synchronized by the global oom_lock and
tasks are sleeping on the lock while somebody is selecting the victim or
potentially racing with the oom_reaper as it releases the victim's memory.
This can result in a pointless oom killer invocation because a waiter
might be racing with the oom_reaper:

	P1		oom_reaper		P2
			oom_reap_task		mutex_lock(oom_lock)
						out_of_memory # no victim because we have one already
			__oom_reap_task_mm	mutex_unlock(oom_lock)
mutex_lock(oom_lock)
			set MMF_OOM_SKIP
select_bad_process
# finds a new victim

The page allocator prevents this race by attempting the allocation once
the lock has been acquired (in __alloc_pages_may_oom), which acts as a
last minute check. Moreover, the page allocator doesn't block on the
oom_lock at all and simply retries the whole reclaim process.

Memcg oom killer should do the last minute check as well. Call
mem_cgroup_margin to do that. Trylock on the oom_lock could be done as
well but this doesn't seem to be necessary at this stage.
"

> Suggested-by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Cc: David Rientjes <rientjes@google.com>
> ---
>  mm/memcontrol.c | 19 +++++++++++++++++--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1962232..df141e1 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1560,16 +1560,31 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		.gfp_mask = gfp_mask,
>  		.order = order,
>  	};
> -	bool ret;
> +	bool ret = true;
>  
>  	if (mutex_lock_killable(&oom_lock))
>  		return true;
> +
>  	/*
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = should_force_charge() || out_of_memory(&oc);
> +	if (should_force_charge())
> +		goto out;
> +
> +	/*
> +	 * Different tasks may be doing parallel oom, so after taking the
> +	 * oom_lock the task should check the memcg margin again to see
> +	 * whether another task has already made progress.
> +	 */
> +	if (mem_cgroup_margin(memcg) >= (1 << order))
> +		goto out;

Is there any reason why you simply haven't done this? (+ your comment
which is helpful).

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 248e6cad0095..2c176825efe3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1561,15 +1561,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.order = order,
 		.chosen_points = LONG_MIN,
 	};
-	bool ret;
+	bool ret = true;
 
-	if (mutex_lock_killable(&oom_lock))
+	if (!mutex_trylock(&oom_lock))
 		return true;
+
+	if (mem_cgroup_margin(memcg) >= (1 << order))
+		goto unlock;
+
 	/*
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
 	ret = should_force_charge() || out_of_memory(&oc);
+
+unlock:
 	mutex_unlock(&oom_lock);
 	return ret;
 }
> +
> +	ret = out_of_memory(&oc);
> +
> +out:
>  	mutex_unlock(&oom_lock);
> +
>  	return ret;
>  }
>  
> -- 
> 1.8.3.1
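
For context, should_force_charge() referenced in the diffs above is a small
helper that lets an already-dying task bail out rather than invoke the oom
killer. A rough sketch, based on mm/memcontrol.c of the same era:

	static bool should_force_charge(void)
	{
		return tsk_is_oom_victim(current) ||
			fatal_signal_pending(current) ||
			(current->flags & PF_EXITING);
	}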
Michal Hocko July 14, 2020, 12:38 p.m. UTC | #2
And btw. "memcg, oom: check memcg margin for parallel oom" would be a
better fit because this is not really related to the global oom.
Tetsuo Handa July 14, 2020, 12:47 p.m. UTC | #3
On 2020/07/14 21:37, Michal Hocko wrote:
> -	if (mutex_lock_killable(&oom_lock))
> +	if (!mutex_trylock(&oom_lock))
>  		return true;

I don't like this change. The trylock needlessly wastes CPU time which could
have been utilized by the OOM killer/reaper for reclaiming memory.

Rather, I want to change

-	if (!mutex_trylock(&oom_lock)) {
+	if (mutex_lock_killable(&oom_lock)) {

in __alloc_pages_may_oom() side.
Michal Hocko July 14, 2020, 12:57 p.m. UTC | #4
On Tue 14-07-20 21:47:06, Tetsuo Handa wrote:
> On 2020/07/14 21:37, Michal Hocko wrote:
> > -	if (mutex_lock_killable(&oom_lock))
> > +	if (!mutex_trylock(&oom_lock))
> >  		return true;
> 
> I don't like this change. The trylock needlessly wastes CPU time which could
> have been utilized by the OOM killer/reaper for reclaiming memory.

Ohh, that was an unintended hunk which stayed there from the previous
discussion. With the margin check the lock itself can stay.
> 
> Rather, I want to change
> 
> -	if (!mutex_trylock(&oom_lock)) {
> +	if (mutex_lock_killable(&oom_lock)) {
> 
> in __alloc_pages_may_oom() side.

Please do not start a tangent discussion again. If you want to make this
change then post a patch with a justification.
Yafang Shao July 14, 2020, 1:25 p.m. UTC | #5
On Tue, Jul 14, 2020 at 8:37 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 14-07-20 08:08:32, Yafang Shao wrote:
> > The commit 7775face2079 ("memcg: killed threads should not invoke memcg OOM
> > killer") resolves the problem that different threads in a multi-threaded
> > task doing parallel memcg oom, but it doesn't solve the problem that
> > different tasks doing parallel memcg oom.
> >
> > It may happens that many different tasks in the same memcg are waiting
> > oom_lock at the same time, if one of them has already made progress and
> > freed enough available memory, the others don't need to trigger the oom
> > killer again. By checking memcg margin after hold oom_lock can help
> > achieve it.
>
> While the changelog makes sense, I believe it can be improved. I would
> use something like the following. Feel free to use parts of it.
>
> "
> Memcg oom killer invocation is synchronized by the global oom_lock and
> tasks are sleeping on the lock while somebody is selecting the victim or
> potentially racing with the oom_reaper as it releases the victim's memory.
> This can result in a pointless oom killer invocation because a waiter
> might be racing with the oom_reaper:
>
>         P1              oom_reaper              P2
>                         oom_reap_task           mutex_lock(oom_lock)
>                                                 out_of_memory # no victim because we have one already
>                         __oom_reap_task_mm      mutex_unlock(oom_lock)
> mutex_lock(oom_lock)
>                         set MMF_OOM_SKIP
> select_bad_process
> # finds a new victim
>
> The page allocator prevents this race by attempting the allocation once
> the lock has been acquired (in __alloc_pages_may_oom), which acts as a
> last minute check. Moreover, the page allocator doesn't block on the
> oom_lock at all and simply retries the whole reclaim process.
>
> Memcg oom killer should do the last minute check as well. Call
> mem_cgroup_margin to do that. Trylock on the oom_lock could be done as
> well but this doesn't seem to be necessary at this stage.
> "
>
> > Suggested-by: Michal Hocko <mhocko@kernel.org>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> > Cc: David Rientjes <rientjes@google.com>
> > ---
> >  mm/memcontrol.c | 19 +++++++++++++++++--
> >  1 file changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1962232..df141e1 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1560,16 +1560,31 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >               .gfp_mask = gfp_mask,
> >               .order = order,
> >       };
> > -     bool ret;
> > +     bool ret = true;
> >
> >       if (mutex_lock_killable(&oom_lock))
> >               return true;
> > +
> >       /*
> >        * A few threads which were not waiting at mutex_lock_killable() can
> >        * fail to bail out. Therefore, check again after holding oom_lock.
> >        */
> > -     ret = should_force_charge() || out_of_memory(&oc);
> > +     if (should_force_charge())
> > +             goto out;
> > +
> > +     /*
> > +      * Different tasks may be doing parallel oom, so after taking the
> > +      * oom_lock the task should check the memcg margin again to see
> > +      * whether another task has already made progress.
> > +      */
> > +     if (mem_cgroup_margin(memcg) >= (1 << order))
> > +             goto out;
>
> Is there any reason why you simply haven't done this? (+ your comment
> which is helpful).
>

No strong reason.
I just think that threads of a multi-threaded task are more likely to do
parallel OOM, so I checked it first.
I can change it as you suggested below, as it is simpler.


> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 248e6cad0095..2c176825efe3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1561,15 +1561,21 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>                 .order = order,
>                 .chosen_points = LONG_MIN,
>         };
> -       bool ret;
> +       bool ret = true;
>
> -       if (mutex_lock_killable(&oom_lock))
> +       if (!mutex_trylock(&oom_lock))
>                 return true;
> +
> +       if (mem_cgroup_margin(memcg) >= (1 << order))
> +               goto unlock;
> +
>         /*
>          * A few threads which were not waiting at mutex_lock_killable() can
>          * fail to bail out. Therefore, check again after holding oom_lock.
>          */
>         ret = should_force_charge() || out_of_memory(&oc);
> +
> +unlock:
>         mutex_unlock(&oom_lock);
>         return ret;
>  }
> > +
> > +     ret = out_of_memory(&oc);
> > +
> > +out:
> >       mutex_unlock(&oom_lock);
> > +
> >       return ret;
> >  }
> >
> > --
> > 1.8.3.1
>
> --
> Michal Hocko
> SUSE Labs
Yafang Shao July 14, 2020, 1:26 p.m. UTC | #6
On Tue, Jul 14, 2020 at 8:38 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> And btw. "memcg, oom: check memcg margin for parallel oom" would be a
> better fit because this is not really related to the global oom.

Sure, I will do it.
Michal Hocko July 14, 2020, 1:34 p.m. UTC | #7
On Tue 14-07-20 21:25:04, Yafang Shao wrote:
> On Tue, Jul 14, 2020 at 8:37 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > @@ -1560,16 +1560,31 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >               .gfp_mask = gfp_mask,
> > >               .order = order,
> > >       };
> > > -     bool ret;
> > > +     bool ret = true;
> > >
> > >       if (mutex_lock_killable(&oom_lock))
> > >               return true;
> > > +
> > >       /*
> > >        * A few threads which were not waiting at mutex_lock_killable() can
> > >        * fail to bail out. Therefore, check again after holding oom_lock.
> > >        */
> > > -     ret = should_force_charge() || out_of_memory(&oc);
> > > +     if (should_force_charge())
> > > +             goto out;
> > > +
> > > +     /*
> > > +      * Different tasks may be doing parallel oom, so after taking the
> > > +      * oom_lock the task should check the memcg margin again to see
> > > +      * whether another task has already made progress.
> > > +      */
> > > +     if (mem_cgroup_margin(memcg) >= (1 << order))
> > > +             goto out;
> >
> > Is there any reason why you simply haven't done this? (+ your comment
> > which is helpful).
> >
> 
> No strong reason.
> I just think that threads of a multi-thread task are more likely to do
> parallel OOM, so I checked it first.
> I can change it as you suggested below,  as it is more simple.

I would rather go with simplicity. This is a super slow path so ordering
of checks shouldn't matter much (if at all).
Yafang Shao July 14, 2020, 1:40 p.m. UTC | #8
On Tue, Jul 14, 2020 at 9:34 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 14-07-20 21:25:04, Yafang Shao wrote:
> > On Tue, Jul 14, 2020 at 8:37 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > > @@ -1560,16 +1560,31 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > > >               .gfp_mask = gfp_mask,
> > > >               .order = order,
> > > >       };
> > > > -     bool ret;
> > > > +     bool ret = true;
> > > >
> > > >       if (mutex_lock_killable(&oom_lock))
> > > >               return true;
> > > > +
> > > >       /*
> > > >        * A few threads which were not waiting at mutex_lock_killable() can
> > > >        * fail to bail out. Therefore, check again after holding oom_lock.
> > > >        */
> > > > -     ret = should_force_charge() || out_of_memory(&oc);
> > > > +     if (should_force_charge())
> > > > +             goto out;
> > > > +
> > > > +     /*
> > > > +      * Different tasks may be doing parallel oom, so after taking the
> > > > +      * oom_lock the task should check the memcg margin again to see
> > > > +      * whether another task has already made progress.
> > > > +      */
> > > > +     if (mem_cgroup_margin(memcg) >= (1 << order))
> > > > +             goto out;
> > >
> > > Is there any reason why you simply haven't done this? (+ your comment
> > > which is helpful).
> > >
> >
> > No strong reason.
> > I just think that threads of a multi-threaded task are more likely to do
> > parallel OOM, so I checked it first.
> > I can change it as you suggested below, as it is simpler.
>
> I would rather go with simplicity. This is a super slow path so ordering
> of checks shouldn't matter much (if at all).
>

Sure.

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1962232..df141e1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1560,16 +1560,31 @@  static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		.gfp_mask = gfp_mask,
 		.order = order,
 	};
-	bool ret;
+	bool ret = true;
 
 	if (mutex_lock_killable(&oom_lock))
 		return true;
+
 	/*
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	if (should_force_charge())
+		goto out;
+
+	/*
+	 * Different tasks may be doing parallel oom, so after taking the
+	 * oom_lock the task should check the memcg margin again to see
+	 * whether another task has already made progress.
+	 */
+	if (mem_cgroup_margin(memcg) >= (1 << order))
+		goto out;
+
+	ret = out_of_memory(&oc);
+
+out:
 	mutex_unlock(&oom_lock);
+
 	return ret;
 }
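
For clarity, mem_cgroup_out_of_memory() with this patch applied reads
roughly as follows (reconstructed from the diff; the oom_control
initializer is abbreviated):

	static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg,
					     gfp_t gfp_mask, int order)
	{
		struct oom_control oc = {
			.memcg = memcg,
			.gfp_mask = gfp_mask,
			.order = order,
			/* remaining fields initialized as before */
		};
		bool ret = true;

		if (mutex_lock_killable(&oom_lock))
			return true;

		/*
		 * A few threads which were not waiting at mutex_lock_killable()
		 * can fail to bail out. Therefore, check again after holding
		 * oom_lock.
		 */
		if (should_force_charge())
			goto out;

		/*
		 * Different tasks may be doing parallel oom, so after taking
		 * the oom_lock the task should check the memcg margin again to
		 * see whether another task has already made progress.
		 */
		if (mem_cgroup_margin(memcg) >= (1 << order))
			goto out;

		ret = out_of_memory(&oc);

	out:
		mutex_unlock(&oom_lock);

		return ret;
	}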