mm, oom: OOM victims do not need to select next OOM victim unless __GFP_NOFAIL.
diff mbox series

Message ID 1534761465-6449-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
State New
Headers show
Series
  • mm, oom: OOM victims do not need to select next OOM victim unless __GFP_NOFAIL.
Related show

Commit Message

Tetsuo Handa Aug. 20, 2018, 10:37 a.m. UTC
Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
oom_reaped tasks") changed to select next OOM victim as soon as
MMF_OOM_SKIP is set. But since OOM victims can try ALLOC_OOM allocation
and then give up (if !memcg OOM) or can use forced charge and then retry
(if memcg OOM), OOM victims do not need to select next OOM victim unless
they are doing __GFP_NOFAIL allocations.

This is a quick mitigation because syzbot is hitting WARN(1) caused by
this race window [1]. More robust fix (e.g. make it possible to reclaim
more memory before MMF_OOM_SKIP is set, wait for some more after
MMF_OOM_SKIP is set) is a future work.

[1] https://syzkaller.appspot.com/bug?id=ea8c7912757d253537375e981b61749b2da69258

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-and-tested-by: syzbot <syzbot+bab151e82a4e973fa325@syzkaller.appspotmail.com>
---
 mm/oom_kill.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Tetsuo Handa Aug. 28, 2018, 10:20 a.m. UTC | #1
On 2018/08/20 20:10, Michal Hocko wrote:
> On Mon 20-08-18 20:02:30, Tetsuo Handa wrote:
>> On 2018/08/20 19:53, Michal Hocko wrote:
>>> On Mon 20-08-18 19:37:45, Tetsuo Handa wrote:
>>>> Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
>>>> oom_reaped tasks") changed to select next OOM victim as soon as
>>>> MMF_OOM_SKIP is set. But since OOM victims can try ALLOC_OOM allocation
>>>> and then give up (if !memcg OOM) or can use forced charge and then retry
>>>> (if memcg OOM), OOM victims do not need to select next OOM victim unless
>>>> they are doing __GFP_NOFAIL allocations.
>>>
>>> I do not like this at all. It seems hackish to say the least. And more
>>> importantly...
>>>
>>>> This is a quick mitigation because syzbot is hitting WARN(1) caused by
>>>> this race window [1]. More robust fix (e.g. make it possible to reclaim
>>>> more memory before MMF_OOM_SKIP is set, wait for some more after
>>>> MMF_OOM_SKIP is set) is a future work.
>>>
>>> .. there is already a patch (by Johannes) for that warning IIRC.
>>
>> You mean http://lkml.kernel.org/r/20180808144515.GA9276@cmpxchg.org ?
> 
> Yes
> 
>> But I can't find that patch in linux-next.git . And as far as I know,
>> no patch was sent to linux.git for handling this problem. Therefore,
>> I wrote this patch so that we can apply for 4.19-rc1.
> 
> I am pretty sure Johannes will post them later after merge window
> closes.
> 

But Johannes' patch will not prevent the OOM killer from needlessly selecting
next OOM victim, will it? I still think we can apply my patch in order to prevent
the OOM killer from needlessly selecting next OOM victim.
Michal Hocko Aug. 28, 2018, 10:59 a.m. UTC | #2
On Tue 28-08-18 19:20:32, Tetsuo Handa wrote:
> On 2018/08/20 20:10, Michal Hocko wrote:
> > On Mon 20-08-18 20:02:30, Tetsuo Handa wrote:
> >> On 2018/08/20 19:53, Michal Hocko wrote:
> >>> On Mon 20-08-18 19:37:45, Tetsuo Handa wrote:
> >>>> Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> >>>> oom_reaped tasks") changed to select next OOM victim as soon as
> >>>> MMF_OOM_SKIP is set. But since OOM victims can try ALLOC_OOM allocation
> >>>> and then give up (if !memcg OOM) or can use forced charge and then retry
> >>>> (if memcg OOM), OOM victims do not need to select next OOM victim unless
> >>>> they are doing __GFP_NOFAIL allocations.
> >>>
> >>> I do not like this at all. It seems hackish to say the least. And more
> >>> importantly...
> >>>
> >>>> This is a quick mitigation because syzbot is hitting WARN(1) caused by
> >>>> this race window [1]. More robust fix (e.g. make it possible to reclaim
> >>>> more memory before MMF_OOM_SKIP is set, wait for some more after
> >>>> MMF_OOM_SKIP is set) is a future work.
> >>>
> >>> .. there is already a patch (by Johannes) for that warning IIRC.
> >>
> >> You mean http://lkml.kernel.org/r/20180808144515.GA9276@cmpxchg.org ?
> > 
> > Yes
> > 
> >> But I can't find that patch in linux-next.git . And as far as I know,
> >> no patch was sent to linux.git for handling this problem. Therefore,
> >> I wrote this patch so that we can apply for 4.19-rc1.
> > 
> > I am pretty sure Johannes will post them later after merge window
> > closes.
> > 
> 
> But Johannes' patch will not prevent the OOM killer from needlessly selecting
> next OOM victim, will it? I still think we can apply my patch in order to prevent
> the OOM killer from needlessly selecting next OOM victim.

see my feedback on your patch.
Johannes Weiner Aug. 28, 2018, 11:59 a.m. UTC | #3
On Mon, Aug 20, 2018 at 08:02:30PM +0900, Tetsuo Handa wrote:
> On 2018/08/20 19:53, Michal Hocko wrote:
> > On Mon 20-08-18 19:37:45, Tetsuo Handa wrote:
> >> Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> >> oom_reaped tasks") changed to select next OOM victim as soon as
> >> MMF_OOM_SKIP is set. But since OOM victims can try ALLOC_OOM allocation
> >> and then give up (if !memcg OOM) or can use forced charge and then retry
> >> (if memcg OOM), OOM victims do not need to select next OOM victim unless
> >> they are doing __GFP_NOFAIL allocations.
> > 
> > I do not like this at all. It seems hackish to say the least. And more
> > importantly...
> > 
> >> This is a quick mitigation because syzbot is hitting WARN(1) caused by
> >> this race window [1]. More robust fix (e.g. make it possible to reclaim
> >> more memory before MMF_OOM_SKIP is set, wait for some more after
> >> MMF_OOM_SKIP is set) is a future work.
> > 
> > .. there is already a patch (by Johannes) for that warning IIRC.
> 
> You mean http://lkml.kernel.org/r/20180808144515.GA9276@cmpxchg.org ?
> But I can't find that patch in linux-next.git . And as far as I know,
> no patch was sent to linux.git for handling this problem. Therefore,
> I wrote this patch so that we can apply for 4.19-rc1.

I assume it'll go in soon, it's the first patch in the -mm tree:

$ cat http://ozlabs.org/~akpm/mmots/series
origin.patch
#NEXT_PATCHES_START linus
#NEXT_PATCHES_END
#NEXT_PATCHES_START mainline-urgent (this week, approximately)
#NEXT_PATCHES_END
#NEXT_PATCHES_START mainline-later (next week, approximately)
mm-memcontrol-print-proper-oom-header-when-no-eligible-victim-left.patch
Johannes Weiner Aug. 28, 2018, 12:40 p.m. UTC | #4
On Mon, Aug 20, 2018 at 07:37:45PM +0900, Tetsuo Handa wrote:
> Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed to select next OOM victim as soon as
> MMF_OOM_SKIP is set. But since OOM victims can try ALLOC_OOM allocation
> and then give up (if !memcg OOM) or can use forced charge and then retry
> (if memcg OOM), OOM victims do not need to select next OOM victim unless
> they are doing __GFP_NOFAIL allocations.

Can you outline the exact sequence here? After a task invokes the OOM
killer, it will retry and do ALLOC_OOM before invoking it again. If
that succeeds, OOM is not invoked another time.

If there is a race condition where the allocating task gets killed
right before it acquires the oom_lock itself, there is another attempt
to allocate under the oom lock to catch parallel kills. It's not using
ALLOC_OOM, but that's intentional because we want to restore the high
watermark, not just make a single allocation from reserves succeed.

If that doesn't succeed, then we are committed to killing something.
Racing with the OOM reaper then is no different than another task
voluntarily exiting or munmap()ing in parallel. I don't know why we
should special case your particular scenario.

Granted, the OOM reaper is not exactly like the others, because it can
be considered to be part of the OOM killer itself. But then we should
wait for it like we wait for any concurrent OOM kill, and not allow
another __alloc_pages_may_oom() while the reaper is still at work;
instead of more hard-to-understand special cases in this code.

> This is a quick mitigation because syzbot is hitting WARN(1) caused by
> this race window [1]. More robust fix (e.g. make it possible to reclaim
> more memory before MMF_OOM_SKIP is set, wait for some more after
> MMF_OOM_SKIP is set) is a future work.

As per the other email, the warning was already replaced.
Tetsuo Handa Aug. 28, 2018, 1:29 p.m. UTC | #5
On 2018/08/28 21:40, Johannes Weiner wrote:
> On Mon, Aug 20, 2018 at 07:37:45PM +0900, Tetsuo Handa wrote:
>> Commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
>> oom_reaped tasks") changed to select next OOM victim as soon as
>> MMF_OOM_SKIP is set. But since OOM victims can try ALLOC_OOM allocation
>> and then give up (if !memcg OOM) or can use forced charge and then retry
>> (if memcg OOM), OOM victims do not need to select next OOM victim unless
>> they are doing __GFP_NOFAIL allocations.
> 
> Can you outline the exact sequence here? After a task invokes the OOM
> killer, it will retry and do ALLOC_OOM before invoking it again. If
> that succeeds, OOM is not invoked another time.

Did you mean

  After a task invoked the OOM killer, that task will retry and an OOM
  victim will do ALLOC_OOM before that task or that OOM victim again
  invokes the OOM killer.

? Then, yes. But the OOM reaper disturbs this behavior.

> 
> If there is a race condition where the allocating task gets killed
> right before it acquires the oom_lock itself, there is another attempt
> to allocate under the oom lock to catch parallel kills. It's not using
> ALLOC_OOM, but that's intentional because we want to restore the high
> watermark, not just make a single allocation from reserves succeed.

Yes. Though an OOM victim will try ALLOC_OOM watermark unless
__GFP_NOMEMALLOC due to

  /* Avoid allocations with no watermarks from looping endlessly */
  if (tsk_is_oom_victim(current) &&
      (alloc_flags == ALLOC_OOM || (gfp_mask & __GFP_NOMEMALLOC)))
          goto nopage;

test after returning from __alloc_pages_may_oom().

> 
> If that doesn't succeed, then we are committed to killing something.

No. we want to avoid unnecessary killing of additional processes. The test
above was updated by commit c288983dddf71421 ("mm/page_alloc.c: make sure
OOM victim can try allocations with no watermarks once") in order to avoid
unnecessary killing of additional processes.

Thanks to the test above, an OOM victim is expected to give up allocation
without selecting next OOM victim. But if the OOM reaper set MMF_OOM_SKIP
before that OOM victim enters into out_of_memory(),

  /*
   * If current has a pending SIGKILL or is exiting, then automatically
   * select it.  The goal is to allow it to allocate so that it may
   * quickly exit and free its memory.
   */
  if (task_will_free_mem(current)) {
      mark_oom_victim(current);
      wake_oom_reaper(current);
      return true;
  }

test does not help. In other words, an OOM victim will select next OOM
victim when we can avoid selecting next OOM victim.

> Racing with the OOM reaper then is no different than another task
> voluntarily exiting or munmap()ing in parallel. I don't know why we
> should special case your particular scenario.

Because we want to avoid unnecessary killing of additional processes.

> 
> Granted, the OOM reaper is not exactly like the others, because it can
> be considered to be part of the OOM killer itself. But then we should
> wait for it like we wait for any concurrent OOM kill, and not allow
> another __alloc_pages_may_oom() while the reaper is still at work;
> instead of more hard-to-understand special cases in this code.

The OOM reaper may set MMF_OOM_SKIP without reclaiming any memory (due
to e.g. mlock()ed memory, shared memory, unable to grab mmap_sem for read).
We haven't reached to the point where the OOM reaper reclaims all memory
nor allocating threads wait some more after setting MMF_OOM_SKIP.
Therefore, this

  if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
      return true;

is the simplest mitigation we can do now.
Michal Hocko Aug. 28, 2018, 1:51 p.m. UTC | #6
On Tue 28-08-18 22:29:56, Tetsuo Handa wrote:
[...]
> The OOM reaper may set MMF_OOM_SKIP without reclaiming any memory (due
> to e.g. mlock()ed memory, shared memory, unable to grab mmap_sem for read).
> We haven't reached to the point where the OOM reaper reclaims all memory
> nor allocating threads wait some more after setting MMF_OOM_SKIP.
> Therefore, this
> 
>   if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
>       return true;
> 
> is the simplest mitigation we can do now.

But this is adding a mess because you pretend to make a forward progress
even the OOM path didn't do anything at all and rely on another kludge
elsewhere to work. This just makes the code fragile for not strong
reason. Yes, this whole area is racy and there are rare corner cases as
you mentioned. I have already mentioned how to deal with some of them
several times. It would be so much more helpful to go after those and
address them rather than post some random hacks and build castles on a
sand.
Tetsuo Handa Aug. 28, 2018, 9:17 p.m. UTC | #7
On 2018/08/28 22:51, Michal Hocko wrote:
> On Tue 28-08-18 22:29:56, Tetsuo Handa wrote:
> [...]
>> The OOM reaper may set MMF_OOM_SKIP without reclaiming any memory (due
>> to e.g. mlock()ed memory, shared memory, unable to grab mmap_sem for read).
>> We haven't reached to the point where the OOM reaper reclaims all memory
>> nor allocating threads wait some more after setting MMF_OOM_SKIP.
>> Therefore, this
>>
>>   if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
>>       return true;
>>
>> is the simplest mitigation we can do now.
> 
> But this is adding a mess because you pretend to make a forward progress
> even the OOM path didn't do anything at all and rely on another kludge
> elsewhere to work.

I'm not pretending to make a forward progress. If current thread is an OOM
victim, it is guaranteed to make forward progress (unless __GFP_NOFAIL) by
failing that allocation attempt after trying memory reserves. The OOM path
does not need to do anything at all.

Patch
diff mbox series

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 412f434..421c0f6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1031,6 +1031,9 @@  bool out_of_memory(struct oom_control *oc)
 	unsigned long freed = 0;
 	enum oom_constraint constraint = CONSTRAINT_NONE;
 
+	if (tsk_is_oom_victim(current) && !(oc->gfp_mask & __GFP_NOFAIL))
+		return true;
+
 	if (oom_killer_disabled)
 		return false;