[0/2] oom, memcg: do not report racy no-eligible OOM

Message ID: 20190107143802.16847-1-mhocko@kernel.org

Message

Michal Hocko Jan. 7, 2019, 2:38 p.m. UTC
Hi,
I have previously posted this as an RFC [1]. Tetsuo pointed out some
issues with patch 1 which I have hopefully fixed. Other than that this
is just a rebase on top of the Linus tree.

The original cover:
this is a follow-up to [2], which was nacked mostly because Tetsuo was
able to find a simple workload that can trigger a race where a
no-eligible task is reported without a good reason. I believe patch 2
addresses that issue and we do not have to play dirty games with
throttling just because of the race. I still believe the patch proposed
in [2] is a useful one, but that can be addressed later.

This series comprises 2 patches. The first one is something I meant to
do a long time ago but never found the time for. We need it here to
handle the CLONE_VM without CLONE_SIGHAND cases. The second patch
closes the race.

Feedback is appreciated, of course.

[1] http://lkml.kernel.org/r/20181022071323.9550-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20181010151135.25766-1-mhocko@kernel.org

Comments

Michal Hocko Jan. 9, 2019, 11:03 a.m. UTC | #1
Tetsuo,
can you confirm that these two patches are fixing the issue you have
reported please?
Tetsuo Handa Jan. 9, 2019, 11:34 a.m. UTC | #2
On 2019/01/09 20:03, Michal Hocko wrote:
> Tetsuo,
> can you confirm that these two patches are fixing the issue you have
> reported please?
> 

My patch fixes the issue better than your "[PATCH 2/2] memcg: do not report racy no-eligible OOM tasks" does.

You can post "[PATCH 1/2] mm, oom: marks all killed tasks as oom victims"
based on a report that we needlessly select more OOM victims because
MMF_OOM_SKIP is quickly set by the OOM reaper. In fact, updating
oom_reap_task() / exit_mmap() to use

  mutex_lock(&oom_lock);
  set_bit(MMF_OOM_SKIP, &mm->flags);
  mutex_unlock(&oom_lock);

will mostly close the race as well.
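
For illustration only (a sketch, not a tested patch; the exit_mmap() side
would be analogous), the oom_reap_task() side would become something like:

  static void oom_reap_task(struct task_struct *tsk)
  {
          struct mm_struct *mm = tsk->signal->oom_mm;
          (...snipped...)
          /*
           * Serialize setting MMF_OOM_SKIP with the OOM killer so that an
           * out_of_memory() call which has already committed to this mm
           * cannot observe the flag halfway through and bail out with
           * "no killable processes".
           */
          mutex_lock(&oom_lock);
          set_bit(MMF_OOM_SKIP, &mm->flags);
          mutex_unlock(&oom_lock);
          (...snipped...)
  }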
Michal Hocko Jan. 9, 2019, 12:02 p.m. UTC | #3
On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
> On 2019/01/09 20:03, Michal Hocko wrote:
> > Tetsuo,
> > can you confirm that these two patches are fixing the issue you have
> > reported please?
> > 
> 
> My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
> report racy no-eligible OOM tasks" does.

OK, so we are stuck again. Hooray!
Tetsuo Handa Jan. 10, 2019, 11:59 p.m. UTC | #4
Michal Hocko wrote:
> On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
> > On 2019/01/09 20:03, Michal Hocko wrote:
> > > Tetsuo,
> > > can you confirm that these two patches are fixing the issue you have
> > > reported please?
> > > 
> > 
> > My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
> > report racy no-eligible OOM tasks" does.
> 
> OK, so we are stuck again. Hooray!

Andrew, will you pick up "[PATCH 3/2] memcg: Facilitate termination of memcg OOM victims." ?
Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
when task_will_free_mem() == true, memcg-do-not-report-racy-no-eligible-oom-tasks.patch
does not close the race whereas my patch closes the race better.
Tetsuo Handa Jan. 11, 2019, 10:25 a.m. UTC | #5
On 2019/01/11 8:59, Tetsuo Handa wrote:
> Michal Hocko wrote:
>> On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
>>> On 2019/01/09 20:03, Michal Hocko wrote:
>>>> Tetsuo,
>>>> can you confirm that these two patches are fixing the issue you have
>>>> reported please?
>>>>
>>>
>>> My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
>>> report racy no-eligible OOM tasks" does.
>>
>> OK, so we are stuck again. Hooray!
> 
> Andrew, will you pick up "[PATCH 3/2] memcg: Facilitate termination of memcg OOM victims." ?
> Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
> when task_will_free_mem() == true, memcg-do-not-report-racy-no-eligible-oom-tasks.patch
> does not close the race whereas my patch closes the race better.
> 

I confirmed that mm-oom-marks-all-killed-tasks-as-oom-victims.patch and
memcg-do-not-report-racy-no-eligible-oom-tasks.patch are completely failing
to fix the issue I am reporting. :-(

Reproducer:
----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>

#define NUMTHREADS 256
#define MMAPSIZE (4 * 10485760)
#define STACKSIZE 4096
static int pipe_fd[2] = { EOF, EOF };
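/*
 * Each clone()d worker maps 40MB of MAP_SHARED anonymous memory and fills
 * it from /dev/zero, so the group quickly exceeds the 200MB memcg limit.
 */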
static int memory_eater(void *unused)
{
	int fd = open("/dev/zero", O_RDONLY);
	char *buf = mmap(NULL, MMAPSIZE, PROT_WRITE | PROT_READ,
			 MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
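	/* Block until the parent closes the write end of the pipe. */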
	read(pipe_fd[0], buf, 1);
	read(fd, buf, MMAPSIZE);
	pause();
	return 0;
}
int main(int argc, char *argv[])
{
	int i;
	char *stack;
	FILE *fp;
	const unsigned long size = 1048576UL * 200;
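	/* Create a memcg with a 200MB limit and move this process into it. */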
	mkdir("/sys/fs/cgroup/memory/test1", 0755);
	fp = fopen("/sys/fs/cgroup/memory/test1/memory.limit_in_bytes", "w");
	fprintf(fp, "%lu\n", size);
	fclose(fp);
	fp = fopen("/sys/fs/cgroup/memory/test1/tasks", "w");
	fprintf(fp, "%u\n", getpid());
	fclose(fp);
	if (setgid(-2) || setuid(-2) || pipe(pipe_fd))
		return 1;
	stack = mmap(NULL, STACKSIZE * NUMTHREADS, PROT_WRITE | PROT_READ,
		     MAP_ANONYMOUS | MAP_SHARED, EOF, 0);
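	/*
	 * Create processes which share the mm but not the signal handlers:
	 * CLONE_VM without CLONE_SIGHAND (and without CLONE_THREAD).
	 */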
	for (i = 0; i < NUMTHREADS; i++)
		if (clone(memory_eater, stack + (i + 1) * STACKSIZE,
			  CLONE_VM | CLONE_FS | CLONE_FILES, NULL) == -1)
			break;
	close(pipe_fd[1]);
	pause(); // Manually enter Ctrl-C immediately after dump_header() started.
	return 0;
}
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20190111.txt.xz :
----------
[   71.146532][ T9694] a.out invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[   71.151647][ T9694] CPU: 1 PID: 9694 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc1-next-20190111 #272
(...snipped...)
[   71.304689][ T9694] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9692,uid=-2
[   71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child
[   71.309149][   T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB
[   71.328523][ T9748] a.out invoked oom-killer: gfp_mask=0x7080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
[   71.328552][ T9748] CPU: 4 PID: 9748 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc1-next-20190111 #272
(...snipped...)
[   71.328785][ T9748] Out of memory and no killable processes...
[   71.329194][ T9771] a.out invoked oom-killer: gfp_mask=0x7080c0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), order=0, oom_score_adj=0
(...snipped...)
[   99.696592][ T9924] Out of memory and no killable processes...
[   99.699001][ T9838] a.out invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
(...snipped...)
[   99.833413][ T9838] Out of memory and no killable processes...
----------

$ grep -F 'Out of memory and no killable processes...' serial-20190111.txt | wc -l
213
Michal Hocko Jan. 11, 2019, 11:33 a.m. UTC | #6
On Fri 11-01-19 19:25:22, Tetsuo Handa wrote:
> On 2019/01/11 8:59, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> >> On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
> >>> On 2019/01/09 20:03, Michal Hocko wrote:
> >>>> Tetsuo,
> >>>> can you confirm that these two patches are fixing the issue you have
> >>>> reported please?
> >>>>
> >>>
> >>> My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
> >>> report racy no-eligible OOM tasks" does.
> >>
> >> OK, so we are stuck again. Hooray!
> > 
> > Andrew, will you pick up "[PATCH 3/2] memcg: Facilitate termination of memcg OOM victims." ?
> > Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
> > when task_will_free_mem() == true, memcg-do-not-report-racy-no-eligible-oom-tasks.patch
> > does not close the race whereas my patch closes the race better.
> > 
> 
> I confirmed that mm-oom-marks-all-killed-tasks-as-oom-victims.patch and
> memcg-do-not-report-racy-no-eligible-oom-tasks.patch are completely failing
> to fix the issue I am reporting. :-(

OK, this is really interesting. This means that we are racing
when marking all the tasks sharing the mm with the clone syscall.
Does fatal_signal_pending handle this better?
Tetsuo Handa Jan. 11, 2019, 12:40 p.m. UTC | #7
On 2019/01/11 20:33, Michal Hocko wrote:
> On Fri 11-01-19 19:25:22, Tetsuo Handa wrote:
>> On 2019/01/11 8:59, Tetsuo Handa wrote:
>>> Michal Hocko wrote:
>>>> On Wed 09-01-19 20:34:46, Tetsuo Handa wrote:
>>>>> On 2019/01/09 20:03, Michal Hocko wrote:
>>>>>> Tetsuo,
>>>>>> can you confirm that these two patches are fixing the issue you have
>>>>>> reported please?
>>>>>>
>>>>>
>>>>> My patch fixes the issue better than your "[PATCH 2/2] memcg: do not
>>>>> report racy no-eligible OOM tasks" does.
>>>>
>>>> OK, so we are stuck again. Hooray!
>>>
>>> Andrew, will you pick up "[PATCH 3/2] memcg: Facilitate termination of memcg OOM victims." ?
>>> Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
>>> when task_will_free_mem() == true, memcg-do-not-report-racy-no-eligible-oom-tasks.patch
>>> does not close the race whereas my patch closes the race better.
>>>
>>
>> I confirmed that mm-oom-marks-all-killed-tasks-as-oom-victims.patch and
>> memcg-do-not-report-racy-no-eligible-oom-tasks.patch are completely failing
>> to fix the issue I am reporting. :-(
> 
> OK, this is really interesting. This means that we are racing
> when marking all the tasks sharing the mm with the clone syscall.

Nothing interesting. This is NOT a race between clone() and the OOM killer. :-(
By the time the OOM killer is invoked, all clone() requests have already completed.

Did you notice that there is no

  "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n"

line between

  [   71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child

and

  [   71.309149][   T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB

? Then, you will find that [ T9694] failed to reach for_each_process(p) loop inside
__oom_kill_process() in the first round of out_of_memory() call because
find_lock_task_mm() == NULL at __oom_kill_process() because Ctrl-C made that victim
complete exit_mm() before find_lock_task_mm() is called. Then, in the second round
of the out_of_memory() call, [ T9750] (which was fatal_signal_pending() == T &&
tsk_is_oom_victim() == F) hit the task_will_free_mem(current) path, called
mark_oom_victim() and woke up the OOM reaper. Then, before the third round of
the out_of_memory() call started, the OOM reaper set MMF_OOM_SKIP. When the third
round of the out_of_memory() call started, [ T9748] could not hit the
task_will_free_mem(current) path because MMF_OOM_SKIP was already set, and
oom_badness() ignored any mm which already had MMF_OOM_SKIP. As a result,
[ T9748] failed to find a candidate. And this step repeats for up to the number
of threads (213 times for this run).
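
For reference, the short-circuit that [ T9750] hit looks roughly like this
(simplified from out_of_memory() in 5.0-rc1; details omitted):

  bool out_of_memory(struct oom_control *oc)
  {
          (...snipped...)
          /*
           * If current is already dying, select it without going through
           * victim selection and the printk() path. Note that only current
           * gets mark_oom_victim(); the other thread groups sharing the mm
           * are not marked, and task_will_free_mem() starts returning false
           * as soon as the OOM reaper sets MMF_OOM_SKIP on that mm.
           */
          if (task_will_free_mem(current)) {
                  mark_oom_victim(current);
                  wake_oom_reaper(current);
                  return true;
          }
          (...snipped...)
  }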

> Does fatal_signal_pending handle this better?
> 

Of course. My patch handles it perfectly. Even if we raced with clone() requests,
why do we need to care about threads doing clone() requests? Such threads are not
inside try_charge(), and therefore such threads can't contribute to this issue
by calling out_of_memory() from try_charge().
Michal Hocko Jan. 11, 2019, 1:34 p.m. UTC | #8
On Fri 11-01-19 21:40:52, Tetsuo Handa wrote:
[...]
> Did you notice that there is no
> 
>   "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n"
> 
> line between
> 
>   [   71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child
> 
> and
> 
>   [   71.309149][   T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB
> 
> ? Then, you will find that [ T9694] failed to reach for_each_process(p) loop inside
> __oom_kill_process() in the first round of out_of_memory() call because
> find_lock_task_mm() == NULL at __oom_kill_process() because Ctrl-C made that victim
> complete exit_mm() before find_lock_task_mm() is called.

OK, so we haven't killed anything because the victim has exited by the
time we wanted to do so. We still have other tasks sharing that mm
pending and not killed because nothing has killed them yet, right?

How come the oom reaper could act on this oom event at all then?

What am I missing?
Tetsuo Handa Jan. 11, 2019, 2:31 p.m. UTC | #9
On 2019/01/11 22:34, Michal Hocko wrote:
> On Fri 11-01-19 21:40:52, Tetsuo Handa wrote:
> [...]
>> Did you notice that there is no
>>
>>   "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n"
>>
>> line between
>>
>>   [   71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child
>>
>> and
>>
>>   [   71.309149][   T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB
>>
>> ? Then, you will find that [ T9694] failed to reach for_each_process(p) loop inside
>> __oom_kill_process() in the first round of out_of_memory() call because
>> find_lock_task_mm() == NULL at __oom_kill_process() because Ctrl-C made that victim
>> complete exit_mm() before find_lock_task_mm() is called.
> 
> OK, so we haven't killed anything because the victim has exited by the
> time we wanted to do so. We still have other tasks sharing that mm
> pending and not killed because nothing has killed them yet, right?

The OOM killer invoked by [ T9694] called printk() but didn't kill anything.
Instead, SIGINT from Ctrl-C killed all thread groups sharing current->mm.

> 
> How come the oom reaper could act on this oom event at all then?
> 
> What am I missing?
> 

The OOM killer invoked by [ T9750] did not call printk() but hit
task_will_free_mem(current) in out_of_memory() and invoked the OOM reaper,
without calling mark_oom_victim() on all thread groups sharing current->mm.
Did you notice that I wrote that

  Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
  when task_will_free_mem() == true,

? :-(
Michal Hocko Jan. 11, 2019, 3:07 p.m. UTC | #10
On Fri 11-01-19 23:31:18, Tetsuo Handa wrote:
> On 2019/01/11 22:34, Michal Hocko wrote:
> > On Fri 11-01-19 21:40:52, Tetsuo Handa wrote:
> > [...]
> >> Did you notice that there is no
> >>
> >>   "Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n"
> >>
> >> line between
> >>
> >>   [   71.304703][ T9694] Memory cgroup out of memory: Kill process 9692 (a.out) score 904 or sacrifice child
> >>
> >> and
> >>
> >>   [   71.309149][   T54] oom_reaper: reaped process 9750 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:185532kB
> >>
> >> ? Then, you will find that [ T9694] failed to reach for_each_process(p) loop inside
> >> __oom_kill_process() in the first round of out_of_memory() call because
> >> find_lock_task_mm() == NULL at __oom_kill_process() because Ctrl-C made that victim
> >> complete exit_mm() before find_lock_task_mm() is called.
> > 
> > OK, so we haven't killed anything because the victim has exited by the
> > time we wanted to do so. We still have other tasks sharing that mm
> > pending and not killed because nothing has killed them yet, right?
> 
> The OOM killer invoked by [ T9694] called printk() but didn't kill anything.
> Instead, SIGINT from Ctrl-C killed all thread groups sharing current->mm.

I still do not get it. Those other processes are not sharing signals.
Or is it due to injecting the signal to all of them with the proper
timing?
 
> > How come the oom reaper could act on this oom event at all then?
> > 
> > What am I missing?
> > 
> 
> The OOM killer invoked by [ T9750] did not call printk() but hit
> task_will_free_mem(current) in out_of_memory() and invoked the OOM reaper,
> without calling mark_oom_victim() on all thread groups sharing current->mm.
> Did you notice that I wrote that

OK, now it finally starts making sense to me. I got hung up on
find_lock_task_mm failing in __oom_kill_process because we do see
"Memory cgroup out of memory" and that happens _after_
task_will_free_mem. So the whole oom_reaper scenario didn't make much
sense to me.

>   Since mm-oom-marks-all-killed-tasks-as-oom-victims.patch does not call mark_oom_victim()
>   when task_will_free_mem() == true,
> 
> ? :-(

No, I got lost in your writeup.

While task_will_free_mem is fixable, this would get us to even
uglier code, so I agree that the approach taken by my two patches is not
feasible.

I really wanted to have this heuristic based on the oom victim rather
than signal pending because one lesson I've learned over time was that
checks for fatal signals can lead to odd corner cases. Memcg is less
prone to those issues because we can bypass the charge but still.

Anyway, could you update your patch and abstract 
	if (unlikely(tsk_is_oom_victim(current) ||
		     fatal_signal_pending(current) ||
		     current->flags & PF_EXITING))

in try_charge and reuse it in mem_cgroup_out_of_memory under the
oom_lock with an explanation please?

Andrew, please drop my 2 patches.
Tetsuo Handa Jan. 11, 2019, 3:37 p.m. UTC | #11
On 2019/01/12 0:07, Michal Hocko wrote:
> On Fri 11-01-19 23:31:18, Tetsuo Handa wrote:
>> The OOM killer invoked by [ T9694] called printk() but didn't kill anything.
>> Instead, SIGINT from Ctrl-C killed all thread groups sharing current->mm.
> 
> I still do not get it. Those other processes are not sharing signals.
> Or is it due to injecting the signal to all of them with the proper
> timing?

Pressing Ctrl-C in the window after task_will_free_mem(p) in oom_kill_process()
and before __oom_kill_process() (e.g. while dump_header() is running) made
fatal_signal_pending() = T for all of them.
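
Schematically (a simplified sketch of oom_kill_process(), not the verbatim
kernel code):

  static void oom_kill_process(struct oom_control *oc, const char *message)
  {
          struct task_struct *victim = oc->chosen;
          (...snipped...)
          task_lock(victim);
          if (task_will_free_mem(victim)) {
                  /* not taken here: the victim was not dying at this point */
                  (...snipped...)
          }
          task_unlock(victim);

          /* <-- Ctrl-C (SIGINT) hits every thread group in this window */
          dump_header(oc, victim);
          pr_err("Memory cgroup out of memory: Kill process ...");

          __oom_kill_process(victim);  /* find_lock_task_mm() is NULL by now */
  }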

> Anyway, could you update your patch and abstract 
> 	if (unlikely(tsk_is_oom_victim(current) ||
> 		     fatal_signal_pending(current) ||
> 		     current->flags & PF_EXITING))
> 
> in try_charge and reuse it in mem_cgroup_out_of_memory under the
> oom_lock with an explanation please?

I don't think doing so makes sense, for

  tsk_is_oom_victim(current) = T && fatal_signal_pending(current) == F

can't happen for mem_cgroup_out_of_memory() under the oom_lock, and
current->flags cannot get PF_EXITING when current is inside
mem_cgroup_out_of_memory(). fatal_signal_pending(current) alone is
appropriate for mem_cgroup_out_of_memory() under the oom_lock because

  tsk_is_oom_victim(current) = F && fatal_signal_pending(current) == T

can happen there.

Also, doing so might become wrong in future, for mem_cgroup_out_of_memory()
is also called from memory_max_write() which does not bail out upon
PF_EXITING. I don't think we can call memory_max_write() after current
thread got PF_EXITING, but nobody knows what change will happen in future.
Michal Hocko Jan. 11, 2019, 4:45 p.m. UTC | #12
On Sat 12-01-19 00:37:05, Tetsuo Handa wrote:
> On 2019/01/12 0:07, Michal Hocko wrote:
> > On Fri 11-01-19 23:31:18, Tetsuo Handa wrote:
> >> The OOM killer invoked by [ T9694] called printk() but didn't kill anything.
> >> Instead, SIGINT from Ctrl-C killed all thread groups sharing current->mm.
> > 
> > I still do not get it. Those other processes are not sharing signals.
> > Or is it due to injecting the signal to all of them with the proper
> > timing?
> 
> Pressing Ctrl-C between after task_will_free_mem(p) in oom_kill_process() and
> before __oom_kill_process() (e.g. dump_header()) made fatal_signal_pending() = T
> for all of them.
> 
> > Anyway, could you update your patch and abstract 
> > 	if (unlikely(tsk_is_oom_victim(current) ||
> > 		     fatal_signal_pending(current) ||
> > 		     current->flags & PF_EXITING))
> > 
> > in try_charge and reuse it in mem_cgroup_out_of_memory under the
> > oom_lock with an explanation please?
> 
> I don't think doing so makes sense, for
> 
>   tsk_is_oom_victim(current) = T && fatal_signal_pending(current) == F
> 
> can't happen for mem_cgroup_out_of_memory() under the oom_lock, and
> current->flags cannot get PF_EXITING when current is inside
> mem_cgroup_out_of_memory(). fatal_signal_pending(current) alone is
> appropriate for mem_cgroup_out_of_memory() under the oom_lock because
> 
>   tsk_is_oom_victim(current) = F && fatal_signal_pending(current) == T
> 
> can happen there.

I meant to use the same check consistently. If we can bypass the charge
under a list of conditions in the charge path, we should surely be
able to do the same for the oom path. I will not insist, but unless
there is a strong reason I would prefer that.

> Also, doing so might become wrong in future, for mem_cgroup_out_of_memory()
> is also called from memory_max_write() which does not bail out upon
> PF_EXITING. I don't think we can call memory_max_write() after current
> thread got PF_EXITING, but nobody knows what change will happen in future.

No, this makes no sense what so ever.
Tetsuo Handa Jan. 12, 2019, 10:52 a.m. UTC | #13
On 2019/01/12 1:45, Michal Hocko wrote:
>>> Anyway, could you update your patch and abstract 
>>> 	if (unlikely(tsk_is_oom_victim(current) ||
>>> 		     fatal_signal_pending(current) ||
>>> 		     current->flags & PF_EXITING))
>>>
>>> in try_charge and reuse it in mem_cgroup_out_of_memory under the
>>> oom_lock with an explanation please?
>>
>> I don't think doing so makes sense, for
>>
>>   tsk_is_oom_victim(current) = T && fatal_signal_pending(current) == F
>>
>> can't happen for mem_cgroup_out_of_memory() under the oom_lock, and
>> current->flags cannot get PF_EXITING when current is inside
>> mem_cgroup_out_of_memory(). fatal_signal_pending(current) alone is
>> appropriate for mem_cgroup_out_of_memory() under the oom_lock because
>>
>>   tsk_is_oom_victim(current) = F && fatal_signal_pending(current) == T
>>
>> can happen there.
> 
> I meant to use the same check consistently. If we can bypass the charge
> under a list of conditions in the charge path, we should surely be
> able to do the same for the oom path. I will not insist, but unless
> there is a strong reason I would prefer that.
> 

You mean something like this? I'm not sure this change is safe.

 mm/memcontrol.c | 27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17189da..1733d019 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -248,6 +248,12 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
+static inline bool can_ignore_limit(void)
+{
+	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
+		(current->flags & PF_EXITING);
+}
+
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -1395,7 +1401,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = fatal_signal_pending(current) || out_of_memory(&oc);
+	ret = can_ignore_limit() || out_of_memory(&oc);
 	mutex_unlock(&oom_lock);
 	return ret;
 }
@@ -1724,6 +1730,10 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
 
 	mem_cgroup_unmark_under_oom(memcg);
 	if (mem_cgroup_out_of_memory(memcg, mask, order))
+		/*
+		 * Returning OOM_SUCCESS upon can_ignore_limit() is OK, for
+		 * the caller will check can_ignore_limit() again.
+		 */
 		ret = OOM_SUCCESS;
 	else
 		ret = OOM_FAILED;
@@ -1783,6 +1793,11 @@ bool mem_cgroup_oom_synchronize(bool handle)
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, current->memcg_oom_gfp_mask,
 					 current->memcg_oom_order);
+		/*
+		 * Returning upon can_ignore_limit() is OK, for the caller is
+		 * already killed... CheckMe: Is this assumption correct?
+		 * Page fault can't happen after getting PF_EXITING?
+		 */
 	} else {
 		schedule();
 		mem_cgroup_unmark_under_oom(memcg);
@@ -2215,9 +2230,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * bypass the last charges so that they can exit quickly and
 	 * free their memory.
 	 */
-	if (unlikely(tsk_is_oom_victim(current) ||
-		     fatal_signal_pending(current) ||
-		     current->flags & PF_EXITING))
+	if (unlikely(can_ignore_limit()))
 		goto force;
 
 	/*
@@ -5527,6 +5540,12 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 		memcg_memory_event(memcg, MEMCG_OOM);
 		if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
 			break;
+		/*
+		 * There is no need to check can_ignore_limit() here, for
+		 * signal_pending(current) above will break anyway.
+		 */
+		if (unlikely(can_ignore_limit()))
+			break;
 	}
 
 	memcg_wb_domain_size_changed(memcg);
Michal Hocko Jan. 13, 2019, 5:36 p.m. UTC | #14
On Sat 12-01-19 19:52:50, Tetsuo Handa wrote:
> On 2019/01/12 1:45, Michal Hocko wrote:
> >>> Anyway, could you update your patch and abstract 
> >>> 	if (unlikely(tsk_is_oom_victim(current) ||
> >>> 		     fatal_signal_pending(current) ||
> >>> 		     current->flags & PF_EXITING))
> >>>
> >>> in try_charge and reuse it in mem_cgroup_out_of_memory under the
> >>> oom_lock with an explanation please?
> >>
> >> I don't think doing so makes sense, for
> >>
> >>   tsk_is_oom_victim(current) = T && fatal_signal_pending(current) == F
> >>
> >> can't happen for mem_cgroup_out_of_memory() under the oom_lock, and
> >> current->flags cannot get PF_EXITING when current is inside
> >> mem_cgroup_out_of_memory(). fatal_signal_pending(current) alone is
> >> appropriate for mem_cgroup_out_of_memory() under the oom_lock because
> >>
> >>   tsk_is_oom_victim(current) = F && fatal_signal_pending(current) == T
> >>
> >> can happen there.
> > 
> > I meant to use the same check consistently. If we can bypass the charge
> > under a list of conditions in the charge path, we should surely be
> > able to do the same for the oom path. I will not insist, but unless
> > there is a strong reason I would prefer that.
> > 
> 
> You mean something like this? I'm not sure this change is safe.
> 
>  mm/memcontrol.c | 27 +++++++++++++++++++++++----
>  1 file changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 17189da..1733d019 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -248,6 +248,12 @@ enum res_type {
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>  
> +static inline bool can_ignore_limit(void)
> +{
> +	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> +		(current->flags & PF_EXITING);
> +}
> +
>  /* Some nice accessors for the vmpressure. */
>  struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
>  {
> @@ -1395,7 +1401,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * A few threads which were not waiting at mutex_lock_killable() can
>  	 * fail to bail out. Therefore, check again after holding oom_lock.
>  	 */
> -	ret = fatal_signal_pending(current) || out_of_memory(&oc);
> +	ret = can_ignore_limit() || out_of_memory(&oc);
>  	mutex_unlock(&oom_lock);
>  	return ret;
>  }
> @@ -2215,9 +2230,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * bypass the last charges so that they can exit quickly and
>  	 * free their memory.
>  	 */
> -	if (unlikely(tsk_is_oom_victim(current) ||
> -		     fatal_signal_pending(current) ||
> -		     current->flags & PF_EXITING))
> +	if (unlikely(can_ignore_limit()))
>  		goto force;
>  
>  	/*

I meant something as simple as this, indeed. I would just
s@can_ignore_limit@should_force_charge@ but this is a minor thing.
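
I.e. just the rename applied to your diff (a sketch, untested):

  static inline bool should_force_charge(void)
  {
          return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
                  (current->flags & PF_EXITING);
  }

with

          ret = should_force_charge() || out_of_memory(&oc);

in mem_cgroup_out_of_memory() and

          if (unlikely(should_force_charge()))
                  goto force;

in try_charge().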