memcg: killed threads should not invoke memcg OOM killer

Message ID	1545819215-10892-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp (mailing list archive)
State	New, archived
Headers	Return-Path: <owner-linux-mm@kvack.org>
	Received-SPF: pass (google.com: best guess record for domain of penguin-kernel@i-love.sakura.ne.jp designates 202.181.97.72 as permitted sender) client-ip=202.181.97.72;
	From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
	To: Michal Hocko <mhocko@kernel.org>, Andrew Morton <akpm@linux-foundation.org>
	Cc: Johannes Weiner <hannes@cmpxchg.org>, David Rientjes <rientjes@google.com>, linux-mm@kvack.org, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
	Subject: [PATCH] memcg: killed threads should not invoke memcg OOM killer
	Date: Wed, 26 Dec 2018 19:13:35 +0900
	Message-Id: <1545819215-10892-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp>
	Sender: owner-linux-mm@kvack.org
	Precedence: bulk

Tetsuo Handa Dec. 26, 2018, 10:13 a.m. UTC

It is possible for a memcg containing a single process group to easily
swamp the log with "no eligible OOM victim" messages after the current
thread was OOM-killed, due to a race between the memcg charge and the
OOM reaper [1].

Thread-1                 Thread-2                       OOM reaper
try_charge()
  mem_cgroup_out_of_memory()
    mutex_lock(oom_lock)
                        try_charge()
                          mem_cgroup_out_of_memory()
                            mutex_lock(oom_lock)
    out_of_memory()
      select_bad_process()
      oom_kill_process(current)
      wake_oom_reaper()
                                                        oom_reap_task()
                                                        # sets MMF_OOM_SKIP
    mutex_unlock(oom_lock)
                            out_of_memory()
                              select_bad_process() # no task
                            mutex_unlock(oom_lock)

We don't need to invoke the memcg OOM killer if current thread was killed
when waiting for oom_lock, for mem_cgroup_oom_synchronize(true) and
memory_max_write() can bail out upon SIGKILL, and try_charge() allows
already killed/exiting threads to make forward progress.

Michal has a plan to use tsk_is_oom_victim() by calling mark_oom_victim()
on all thread groups sharing victim's mm. But fatal_signal_pending() in
this patch helps regardless of Michal's plan because it will avoid
needlessly calling out_of_memory() when current thread is already
terminating (e.g. got SIGINT after passing fatal_signal_pending() check
in try_charge() and mutex_lock_killable() did not block).

[1] https://lkml.kernel.org/r/ea637f9a-5dd0-f927-d26d-d0b4fd8ccb6f@i-love.sakura.ne.jp

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/memcontrol.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Kirill Tkhai Dec. 28, 2018, 10:22 a.m. UTC | #1

Hi, Tetsuo,

On 26.12.2018 13:13, Tetsuo Handa wrote:
> It is possible that a single process group memcg easily swamps the log
> with no-eligible OOM victim messages after current thread was OOM-killed,
> due to race between the memcg charge and the OOM reaper [1].
> 
> Thread-1                 Thread-2                       OOM reaper
> try_charge()
>   mem_cgroup_out_of_memory()
>     mutex_lock(oom_lock)
>                         try_charge()
>                           mem_cgroup_out_of_memory()
>                             mutex_lock(oom_lock)
>     out_of_memory()
>       select_bad_process()
>       oom_kill_process(current)
>       wake_oom_reaper()
>                                                         oom_reap_task()
>                                                         # sets MMF_OOM_SKIP
>     mutex_unlock(oom_lock)
>                             out_of_memory()
>                               select_bad_process() # no task
>                             mutex_unlock(oom_lock)
> 
> We don't need to invoke the memcg OOM killer if current thread was killed
> when waiting for oom_lock, for mem_cgroup_oom_synchronize(true) and
> memory_max_write() can bail out upon SIGKILL, and try_charge() allows
> already killed/exiting threads to make forward progress.
> 
> Michal has a plan to use tsk_is_oom_victim() by calling mark_oom_victim()
> on all thread groups sharing victim's mm. But fatal_signal_pending() in
> this patch helps regardless of Michal's plan because it will avoid
> needlessly calling out_of_memory() when current thread is already
> terminating (e.g. got SIGINT after passing fatal_signal_pending() check
> in try_charge() and mutex_lock_killable() did not block).
> 
> [1] https://lkml.kernel.org/r/ea637f9a-5dd0-f927-d26d-d0b4fd8ccb6f@i-love.sakura.ne.jp
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  mm/memcontrol.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b860dd4f7..b0d3bf3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1389,8 +1389,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	};
>  	bool ret;
>  
> -	mutex_lock(&oom_lock);
> -	ret = out_of_memory(&oc);
> +	if (mutex_lock_killable(&oom_lock))
> +		return true;
> +	/*
> +	 * A few threads which were not waiting at mutex_lock_killable() can
> +	 * fail to bail out. Therefore, check again after holding oom_lock.
> +	 */
> +	ret = fatal_signal_pending(current) || out_of_memory(&oc);

This fatal_signal_pending() check makes sense because a killed task
may wake up slowly and return from schedule() only when there are no
more waiters for the lock.

Why not make this approach generic, and add a check into
__mutex_lock_common() after schedule_preempt_disabled()
instead of this? This will handle all the places like
that at once.

(Adding a check alone is not enough for __mutex_lock_common(), since
 the mutex code will also need to wake the next waiter. So you will
 need a couple of changes in the mutex code.)

Kirill

>  	mutex_unlock(&oom_lock);
>  	return ret;
>  }
>

Tetsuo Handa Dec. 28, 2018, 11 a.m. UTC | #2

On 2018/12/28 19:22, Kirill Tkhai wrote:
>> @@ -1389,8 +1389,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>  	};
>>  	bool ret;
>>  
>> -	mutex_lock(&oom_lock);
>> -	ret = out_of_memory(&oc);
>> +	if (mutex_lock_killable(&oom_lock))
>> +		return true;
>> +	/*
>> +	 * A few threads which were not waiting at mutex_lock_killable() can
>> +	 * fail to bail out. Therefore, check again after holding oom_lock.
>> +	 */
>> +	ret = fatal_signal_pending(current) || out_of_memory(&oc);
> 
> This fatal_signal_pending() check has a sense because of
> it's possible, a killed task is waking up slowly, and it
> returns from schedule(), when there are no more waiters
> for a lock.

Thanks. Michal thinks that mutex_lock_killable() would be sufficient
( https://lkml.kernel.org/r/20181107100810.GA27423@dhcp22.suse.cz ) but
I can confirm that mutex_lock_killable() is not sufficient when I test
using a VM with 8 CPUs. Thus, I'd like to keep this fatal_signal_pending()
check.

> 
> Why not make this approach generic, and add a check into
> __mutex_lock_common() after schedule_preempt_disabled()
> instead of this? This will handle all the places like
> that at once.
> 
> (The only adding a check is not enough for __mutex_lock_common(),
>  since mutex code will require to wake next waiter also. So,
>  you will need a couple of changes in mutex code).

I think that we should not assume that everybody is ready for
mutex_lock_killable() returning -EINTR whenever fatal_signal_pending()
is true, and that adding the version below would be a safer choice.

int __sched mutex_lock_unless_killed(struct mutex *lock)
{
	const int ret = mutex_lock_killable(lock);

	if (ret)
		return ret;
	if (fatal_signal_pending(current)) {
		mutex_unlock(lock);
		return -EINTR;
	}
	return 0;
}

Kirill Tkhai Dec. 28, 2018, 11:28 a.m. UTC | #3

On 28.12.2018 14:00, Tetsuo Handa wrote:
> On 2018/12/28 19:22, Kirill Tkhai wrote:
>>> @@ -1389,8 +1389,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>>  	};
>>>  	bool ret;
>>>  
>>> -	mutex_lock(&oom_lock);
>>> -	ret = out_of_memory(&oc);
>>> +	if (mutex_lock_killable(&oom_lock))
>>> +		return true;
>>> +	/*
>>> +	 * A few threads which were not waiting at mutex_lock_killable() can
>>> +	 * fail to bail out. Therefore, check again after holding oom_lock.
>>> +	 */
>>> +	ret = fatal_signal_pending(current) || out_of_memory(&oc);
>>
>> This fatal_signal_pending() check has a sense because of
>> it's possible, a killed task is waking up slowly, and it
>> returns from schedule(), when there are no more waiters
>> for a lock.
> 
> Thanks. Michal thinks that mutex_lock_killable() would be sufficient
> ( https://lkml.kernel.org/r/20181107100810.GA27423@dhcp22.suse.cz ) but
> I can confirm that mutex_lock_killable() is not sufficient when I test
> using a VM with 8 CPUs. Thus, I'd like to keep this fatal_signal_pending()
> check.
> 
>>
>> Why not make this approach generic, and add a check into
>> __mutex_lock_common() after schedule_preempt_disabled()
>> instead of this? This will handle all the places like
>> that at once.
>>
>> (The only adding a check is not enough for __mutex_lock_common(),
>>  since mutex code will require to wake next waiter also. So,
>>  you will need a couple of changes in mutex code).
> 
> I think that we should not assume that everybody is ready for making
> mutex_lock_killable() to return -EINTR if fatal_signal_pending() is
> true, and that adding below version would be a safer choice.

There is a signal_pending_state() primitive, and that is the check
which should be used instead of fatal_signal_pending() in the mutex
code.

Let's ask Peter :) Peter, what do you think about the approach overall?
I.e., changing __mutex_lock_common() by adding one more check of
signal_pending_state() after schedule_preempt_disabled() (with respect
to other mutex code, e.g., waking the next waiter etc)?

Kirill

Tetsuo Handa Jan. 6, 2019, 6:02 a.m. UTC | #4

Michal and Johannes, can we please stop this stupid behavior now?

Reproducer:
----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>

#define NUMTHREADS 256
#define MMAPSIZE (4 * 10485760)
#define STACKSIZE 4096
static int pipe_fd[2] = { EOF, EOF };
static int memory_eater(void *unused)
{
	int fd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
			 MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
	read(pipe_fd[0], buf, 1);
	read(fd, buf, MMAPSIZE);
	pause();
	return 0;
}
int main(int argc, char *argv[])
{
	int i;
	char *stack;
	FILE *fp;
	const unsigned long size = 1048576 * 200;
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	if (setgid(-2) || setuid(-2) || pipe(pipe_fd))
		return 1;
	if (fork() == 0) {
		stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
			     MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
		for (i = 0; i < NUMTHREADS; i++)
			if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
				  CLONE_SIGHAND | CLONE_THREAD | CLONE_VM | CLONE_FS | CLONE_FILES, NULL) == -1)
				break;
		close(pipe_fd[1]);
		pause();
	}
	close(pipe_fd[0]);
	for (i = 0; i < NUMTHREADS / 2; i++)
		if (fork() == 0) {
			close(pipe_fd[1]);
			pause();
		}
	sleep(1);
	close(pipe_fd[1]);
	pause();
	return 0;
}
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20190106.txt.xz :
----------
[   79.104729] a.out invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
(...snipped...)
[   79.237203] memory: usage 204800kB, limit 204800kB, failcnt 2834
[   79.242176] memory+swap: usage 204800kB, limit 9007199254740988kB, failcnt 0
[   79.245175] kmem: usage 23420kB, limit 9007199254740988kB, failcnt 0
[   79.247945] Memory cgroup stats for /test1: cache:177456KB rss:3420KB rss_huge:0KB shmem:177456KB mapped_file:177540KB dirty:0KB writeback:0KB swap:0KB inactive_anon:177676KB active_anon:3612KB inactive_file:0KB active_file:0KB unevictable:0KB
[   79.256726] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=8204,uid=-2
[   79.262470] Memory cgroup out of memory: Kill process 8204 (a.out) score 822 or sacrifice child
[   79.266901] Killed process 8204 (a.out) total-vm:10491132kB, anon-rss:92kB, file-rss:444kB, shmem-rss:167028kB
[   79.272974] oom_reaper: reaped process 8447 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:167488kB
[   79.277733] a.out invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
(...snipped...)
[   79.386222] memory: usage 204708kB, limit 204800kB, failcnt 2837
[   79.412519] Killed process 8205 (a.out) total-vm:4348kB, anon-rss:92kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[   79.539042] memory: usage 204600kB, limit 204800kB, failcnt 2838
[   79.564617] Killed process 8206 (a.out) total-vm:4348kB, anon-rss:92kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[   81.967741] Memory cgroup out of memory: Kill process 8203 (a.out) score 6 or sacrifice child
[   81.971760] Killed process 8203 (a.out) total-vm:4348kB, anon-rss:92kB, file-rss:1156kB, shmem-rss:0kB
[   81.977329] a.out invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
(...snipped...)
[   81.977529] memory: usage 187160kB, limit 204800kB, failcnt 2838
[   81.977530] memory+swap: usage 187160kB, limit 9007199254740988kB, failcnt 0
[   81.977531] kmem: usage 8264kB, limit 9007199254740988kB, failcnt 0
[   81.977532] Memory cgroup stats for /test1: cache:178248KB rss:372KB rss_huge:0KB shmem:178248KB mapped_file:178332KB dirty:0KB writeback:0KB swap:0KB inactive_anon:178568KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[   81.977545] Out of memory and no killable processes...
(...snipped...)
[   87.914960] a.out invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
(...snipped...)
[   88.019110] memory: usage 183472kB, limit 204800kB, failcnt 2838
[   88.021629] memory+swap: usage 183472kB, limit 9007199254740988kB, failcnt 0
[   88.024513] kmem: usage 4448kB, limit 9007199254740988kB, failcnt 0
[   88.027137] Memory cgroup stats for /test1: cache:178512KB rss:372KB rss_huge:0KB shmem:178512KB mapped_file:178464KB dirty:0KB writeback:0KB swap:0KB inactive_anon:178760KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[   88.036008] Out of memory and no killable processes...
----------



From 0fb58415770a83d6c40d471e1840f8bc4a35ca83 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 26 Dec 2018 19:13:35 +0900
Subject: [PATCH] memcg: killed threads should not invoke memcg OOM killer

If $N > $M, a single process with $N threads in a memcg group can easily
kill all $M processes in that memcg group, for mem_cgroup_out_of_memory()
does not check if current thread needs to invoke the memcg OOM killer.

  T1@P1     |T2...$N@P1|P2...$M   |OOM reaper
  ----------+----------+----------+----------
                        # all sleeping
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(oom_lock)
      out_of_memory()
        select_bad_process()
        oom_kill_process(P1)
        wake_oom_reaper()
                                   oom_reap_task() # ignores P1
      mutex_unlock(oom_lock)
                 out_of_memory()
                   select_bad_process(P2...$M)
                        # all killed by T2...$N@P1
                   wake_oom_reaper()
                                   oom_reap_task() # ignores P2...$M
                 mutex_unlock(oom_lock)

We don't need to invoke the memcg OOM killer if current thread was killed
when waiting for oom_lock, for mem_cgroup_oom_synchronize(true) and
memory_max_write() can bail out upon SIGKILL, and try_charge() allows
already killed/exiting threads to make forward progress.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/memcontrol.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b860dd4f7..b0d3bf3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1389,8 +1389,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;
 
-	mutex_lock(&oom_lock);
-	ret = out_of_memory(&oc);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
+	/*
+	 * A few threads which were not waiting at mutex_lock_killable() can
+	 * fail to bail out. Therefore, check again after holding oom_lock.
+	 */
+	ret = fatal_signal_pending(current) || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
 }
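
The $N > $M argument in the changelog above can be checked with a toy
counting model. This is a sketch, not kernel code: model_count_kills()
is a hypothetical function, and it assumes that every out_of_memory()
call finds exactly one new victim as long as a process is left, because
the previous victim is already being skipped.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Count OOM kills in a memcg holding nr_procs processes when process P1
 * has nr_threads threads which all queue on oom_lock one after another.
 * bail_out_if_killed models the fatal_signal_pending(current) check
 * added by the patch.
 */
static unsigned int model_count_kills(unsigned int nr_threads,
				      unsigned int nr_procs,
				      bool bail_out_if_killed)
{
	unsigned int alive = nr_procs;
	unsigned int kills = 0;
	bool caller_killed = false;
	unsigned int i;

	assert(nr_threads >= 1 && nr_procs >= 1);
	for (i = 0; i < nr_threads; i++) {
		/* Patched path: an already killed caller skips the OOM killer. */
		if (bail_out_if_killed && caller_killed)
			continue;
		if (alive == 0)
			break;	/* "Out of memory and no killable processes..." */
		alive--;
		kills++;
		/* The first victim is the callers' own process (P1). */
		caller_killed = true;
	}
	return kills;
}
```

In this model, 4 threads in a memcg of 3 processes kill all 3 processes
without the check, and only P1 with it.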

Michal Hocko Jan. 7, 2019, 11:41 a.m. UTC | #5

On Sun 06-01-19 15:02:24, Tetsuo Handa wrote:
> Michal and Johannes, can we please stop this stupid behavior now?

I have proposed a patch with a much more limited scope which is still
waiting for feedback. I haven't heard it wouldn't be working so far.

Tetsuo Handa Jan. 7, 2019, 1:07 p.m. UTC | #6

On 2019/01/07 20:41, Michal Hocko wrote:
> On Sun 06-01-19 15:02:24, Tetsuo Handa wrote:
>> Michal and Johannes, can we please stop this stupid behavior now?
> 
> I have proposed a patch with a much more limited scope which is still
> waiting for feedback. I haven't heard it wouldn't be working so far.
> 

You mean

  mutex_lock_killable would take care of exiting task already. I would
  then still prefer to check for mark_oom_victim because that is not racy
  with the exit path clearing signals. I can update my patch to use
  _killable lock variant if we are really going with the memcg specific
  fix.

? No response for two months.

A single memcg OOM event killing all processes in that memcg is broken.
What is the race you are referring to by "racy with the exit path
clearing signals"? You are saying that a thread between clearing the
fatal signal and setting PF_EXITING can invoke the memcg OOM killer
again, aren't you? But how likely is that? Even if it can happen, your
patch can call mark_oom_victim() even if my patch bailed out upon
SIGKILL. That is, your patch and my patch are not conflicting/exclusive.

Michal Hocko Jan. 7, 2019, 1:37 p.m. UTC | #7

On Mon 07-01-19 22:07:43, Tetsuo Handa wrote:
> On 2019/01/07 20:41, Michal Hocko wrote:
> > On Sun 06-01-19 15:02:24, Tetsuo Handa wrote:
> >> Michal and Johannes, can we please stop this stupid behavior now?
> > 
> > I have proposed a patch with a much more limited scope which is still
> > waiting for feedback. I haven't heard it wouldn't be working so far.
> > 
> 
> You mean
> 
>   mutex_lock_killable would take care of exiting task already. I would
>   then still prefer to check for mark_oom_victim because that is not racy
>   with the exit path clearing signals. I can update my patch to use
>   _killable lock variant if we are really going with the memcg specific
>   fix.
> 
> ? No response for two months.

I mean http://lkml.kernel.org/r/20181022071323.9550-1-mhocko@kernel.org
which has died in nit picking. I am not very interested to go back there
and spend a lot of time with it again. If you do not respect my opinion
as the maintainer of this code then find somebody else to push it
through.

Tetsuo Handa Jan. 7, 2019, 2:20 p.m. UTC | #8

On 2019/01/07 22:37, Michal Hocko wrote:
> On Mon 07-01-19 22:07:43, Tetsuo Handa wrote:
>> On 2019/01/07 20:41, Michal Hocko wrote:
>>> On Sun 06-01-19 15:02:24, Tetsuo Handa wrote:
>>>> Michal and Johannes, can we please stop this stupid behavior now?
>>>
>>> I have proposed a patch with a much more limited scope which is still
>>> waiting for feedback. I haven't heard it wouldn't be working so far.
>>>
>>
>> You mean
>>
>>   mutex_lock_killable would take care of exiting task already. I would
>>   then still prefer to check for mark_oom_victim because that is not racy
>>   with the exit path clearing signals. I can update my patch to use
>>   _killable lock variant if we are really going with the memcg specific
>>   fix.
>>
>> ? No response for two months.
> 
> I mean http://lkml.kernel.org/r/20181022071323.9550-1-mhocko@kernel.org
> which has died in nit picking. I am not very interested to go back there
> and spend a lot of time with it again. If you do not respect my opinion
> as the maintainer of this code then find somebody else to push it
> through.
> 

OK, you haven't proposed an updated patch. Since nobody can test a
not-yet-proposed patch, you haven't heard it wouldn't be working so far.

Tetsuo Handa Jan. 9, 2019, 10:56 a.m. UTC | #9

On 2019/01/07 22:37, Michal Hocko wrote:
> On Mon 07-01-19 22:07:43, Tetsuo Handa wrote:
>> On 2019/01/07 20:41, Michal Hocko wrote:
>>> On Sun 06-01-19 15:02:24, Tetsuo Handa wrote:
>>>> Michal and Johannes, can we please stop this stupid behavior now?
>>>
>>> I have proposed a patch with a much more limited scope which is still
>>> waiting for feedback. I haven't heard it wouldn't be working so far.
>>>
>>
>> You mean
>>
>>   mutex_lock_killable would take care of exiting task already. I would
>>   then still prefer to check for mark_oom_victim because that is not racy
>>   with the exit path clearing signals. I can update my patch to use
>>   _killable lock variant if we are really going with the memcg specific
>>   fix.
>>
>> ? No response for two months.
> 
> I mean http://lkml.kernel.org/r/20181022071323.9550-1-mhocko@kernel.org
> which has died in nit picking. I am not very interested to go back there
> and spend a lot of time with it again. If you do not respect my opinion
> as the maintainer of this code then find somebody else to push it
> through.
> 

OK. It turned out that Michal's comment is independent of this patch.
We can apply both Michal's patch and my patch, and here is my patch.

From 0fb58415770a83d6c40d471e1840f8bc4a35ca83 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 26 Dec 2018 19:13:35 +0900
Subject: [PATCH] memcg: killed threads should not invoke memcg OOM killer

If $N > $M, a single process with $N threads in a memcg group can easily
kill all $M processes in that memcg group, for mem_cgroup_out_of_memory()
does not check if current thread needs to invoke the memcg OOM killer.

  T1@P1     |T2...$N@P1|P2...$M   |OOM reaper
  ----------+----------+----------+----------
                        # all sleeping
  try_charge()
    mem_cgroup_out_of_memory()
      mutex_lock(oom_lock)
             try_charge()
               mem_cgroup_out_of_memory()
                 mutex_lock(oom_lock)
      out_of_memory()
        select_bad_process()
        oom_kill_process(P1)
        wake_oom_reaper()
                                   oom_reap_task() # ignores P1
      mutex_unlock(oom_lock)
                 out_of_memory()
                   select_bad_process(P2...$M)
                        # all killed by T2...$N@P1
                   wake_oom_reaper()
                                   oom_reap_task() # ignores P2...$M
                 mutex_unlock(oom_lock)

We don't need to invoke the memcg OOM killer if current thread was killed
when waiting for oom_lock, for mem_cgroup_oom_synchronize(true) and
memory_max_write() can bail out upon SIGKILL, and try_charge() allows
already killed/exiting threads to make forward progress.

If memcg OOM events in different domains are pending, already OOM-killed
threads needlessly wait for pending memcg OOM events in different domains.
An out_of_memory() call is slow because it involves printk(). With slow
serial consoles, out_of_memory() might take more than a second. Therefore,
allowing killed processes to quickly call mmput() from exit_mm() from
do_exit() will help calling __mmput() (which can reclaim more memory than
the OOM reaper can reclaim) quickly.

At first Michal thought that fatal signal check is racy compared to
tsk_is_oom_victim() check. But actually there is no such race, for
by the moment mutex_unlock(&oom_lock) is called after returning from
out_of_memory(), fatal_signal_pending() == F && tsk_is_oom_victim() == T
can't happen if current thread is holding oom_lock inside
mem_cgroup_out_of_memory(). On the other hand,
fatal_signal_pending() == T && tsk_is_oom_victim() == F can happen, and
bailing out upon that condition will save some process from needlessly
being OOM-killed.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/memcontrol.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b860dd4f7..b0d3bf3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1389,8 +1389,13 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	};
 	bool ret;

-	mutex_lock(&oom_lock);
-	ret = out_of_memory(&oc);
+	if (mutex_lock_killable(&oom_lock))
+		return true;
+	/*
+	 * A few threads which were not waiting at mutex_lock_killable() can
+	 * fail to bail out. Therefore, check again after holding oom_lock.
+	 */
+	ret = fatal_signal_pending(current) || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
 }
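
The last paragraph of the changelog above compares two possible checks
by their four combinations. A small enumeration makes that comparison
explicit. This is a sketch only: struct model_state and both functions
are hypothetical, and the reachability rule is taken from the changelog
as stated, not verified independently.

```c
#include <assert.h>
#include <stdbool.h>

struct model_state {
	bool fatal_pending;	/* models fatal_signal_pending(current) */
	bool oom_victim;	/* models tsk_is_oom_victim(current) */
};

/*
 * Per the changelog: while holding oom_lock inside
 * mem_cgroup_out_of_memory(), "no fatal signal but already an OOM
 * victim" cannot happen. Every other combination can.
 */
static bool model_state_reachable(struct model_state s)
{
	return s.fatal_pending || !s.oom_victim;
}

/*
 * Count the reachable states in which the fatal-signal check bails out
 * but a victim-only check would still invoke the OOM killer.
 */
static unsigned int model_only_fatal_check_bails(void)
{
	unsigned int count = 0;
	int f, v;

	for (f = 0; f <= 1; f++) {
		for (v = 0; v <= 1; v++) {
			struct model_state s = {
				.fatal_pending = f,
				.oom_victim = v,
			};

			if (!model_state_reachable(s))
				continue;
			/* No reachable state is caught by the victim check alone. */
			assert(!(s.oom_victim && !s.fatal_pending));
			if (s.fatal_pending && !s.oom_victim)
				count++;
		}
	}
	return count;
}
```

Under that assumption the fatal-signal check covers everything a
victim-only check would cover, plus the one extra state (killed, but not
marked as an OOM victim) in which bailing out spares another process.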
