mm, oom: fix unnecessary killing of additional processes

Message ID: alpine.DEB.2.21.1806141339580.4543@chino.kir.corp.google.com
State: New, archived

Commit Message

David Rientjes June 14, 2018, 8:42 p.m. UTC
The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
it cannot reap an mm.  This can happen for a variety of reasons,
including:

 - the inability to grab mm->mmap_sem in a sufficient amount of time,

 - when the mm has blockable mmu notifiers that could cause the oom reaper
   to stall indefinitely,

but we can also add a third when the oom reaper can "reap" an mm but doing
so is unlikely to free any amount of memory:

 - when the mm's memory is fully mlocked.

When all memory is mlocked, the oom reaper will not be able to free any
substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
unmap and free its memory in exit_mmap() and subsequent oom victims are
chosen unnecessarily.  This is trivial to reproduce if all eligible
processes on the system have mlocked their memory: the oom killer calls
panic() even though forward progress can be made.

This is the same issue where the exit path sets MMF_OOM_SKIP before
unmapping memory and additional processes can be chosen unnecessarily
because the oom killer is racing with exit_mmap().

We can't simply defer setting MMF_OOM_SKIP, however, because if there is
a true oom livelock in progress, it never gets set and no additional
killing is possible.

To fix this, this patch introduces a per-mm reaping timeout, initially set
at 10s.  It requires that the oom reaper's list becomes a properly linked
list so that other mm's may be reaped while waiting for an mm's timeout to
expire.

This replaces the current timeouts in the oom reaper: (1) when trying to
grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
a HZ sleep if there are blockable mmu notifiers.  It extends this with a
timeout to allow an oom victim to reach exit_mmap() before choosing
additional processes unnecessarily.

The exit path will now set MMF_OOM_SKIP only after all memory has been
freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
determine when it can race with the oom reaper.

The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
lapsed because it can no longer guarantee forward progress.

The reaping timeout is intentionally set for a substantial amount of time
since oom livelock is a very rare occurrence and it's better to optimize
for preventing additional (unnecessary) oom killing than a scenario that
is much more unlikely.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 Note: I understand there is an objection based on timeout based delays.
 This is currently the only possible way to avoid oom killing important
 processes completely unnecessarily.  If the oom reaper can someday free
 all memory, including mlocked memory and those mm's with blockable mmu
 notifiers, and is guaranteed to always be able to grab mm->mmap_sem,
 this can be removed.  I do not believe any such guarantee is possible
 and consider the massive killing of additional processes unnecessarily
 to be a regression introduced by the oom reaper and its very quick
 setting of MMF_OOM_SKIP to allow additional processes to be oom killed.

 include/linux/mm_types.h |   4 ++
 include/linux/sched.h    |   2 +-
 kernel/fork.c            |   4 ++
 mm/mmap.c                |  12 ++---
 mm/oom_kill.c            | 112 ++++++++++++++++++++++-----------------
 5 files changed, 79 insertions(+), 55 deletions(-)

Comments

Michal Hocko June 15, 2018, 6:55 a.m. UTC | #1
On Thu 14-06-18 13:42:59, David Rientjes wrote:
> The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> it cannot reap an mm.  This can happen for a variety of reasons,
> including:
> 
>  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> 
>  - when the mm has blockable mmu notifiers that could cause the oom reaper
>    to stall indefinitely,
> 
> but we can also add a third when the oom reaper can "reap" an mm but doing
> so is unlikely to free any amount of memory:
> 
>  - when the mm's memory is fully mlocked.
> 
> When all memory is mlocked, the oom reaper will not be able to free any
> substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
> unmap and free its memory in exit_mmap() and subsequent oom victims are
> chosen unnecessarily.  This is trivial to reproduce if all eligible
> processes on the system have mlocked their memory: the oom killer calls
> panic() even though forward progress can be made.
> 
> This is the same issue where the exit path sets MMF_OOM_SKIP before
> unmapping memory and additional processes can be chosen unnecessarily
> because the oom killer is racing with exit_mmap().
> 
> We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> a true oom livelock in progress, it never gets set and no additional
> killing is possible.
> 
> To fix this, this patch introduces a per-mm reaping timeout, initially set
> at 10s.  It requires that the oom reaper's list becomes a properly linked
> list so that other mm's may be reaped while waiting for an mm's timeout to
> expire.
> 
> This replaces the current timeouts in the oom reaper: (1) when trying to
> grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
> a HZ sleep if there are blockable mmu notifiers.  It extends this with a
> timeout to allow an oom victim to reach exit_mmap() before choosing
> additional processes unnecessarily.
> 
> The exit path will now set MMF_OOM_SKIP only after all memory has been
> freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
> determine when it can race with the oom reaper.
> 
> The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> lapsed because it can no longer guarantee forward progress.
> 
> The reaping timeout is intentionally set for a substantial amount of time
> since oom livelock is a very rare occurrence and it's better to optimize
> for preventing additional (unnecessary) oom killing than a scenario that
> is much more unlikely.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>

Nacked-by: Michal Hocko <mhocko@suse.com>
as already explained elsewhere in this email thread.

> ---
>  Note: I understand there is an objection based on timeout based delays.
>  This is currently the only possible way to avoid oom killing important
>  processes completely unnecessarily.  If the oom reaper can someday free
>  all memory, including mlocked memory and those mm's with blockable mmu
>  notifiers, and is guaranteed to always be able to grab mm->mmap_sem,
>  this can be removed.  I do not believe any such guarantee is possible
>  and consider the massive killing of additional processes unnecessarily
>  to be a regression introduced by the oom reaper and its very quick
>  setting of MMF_OOM_SKIP to allow additional processes to be oom killed.

If you find oom reaper more harmful than useful I would be willing to
ack a command line option to disable it. Especially when you keep
claiming that the lockups are not really happening in your environment.

Other than that I've already pointed to a more robust solution. If you
are reluctant to try it out I will do, but introducing a timeout is just
papering over the real problem. Maybe we will not reach the state that
_all_ the memory is reapable but we definitely should try to make as
much as possible to be reapable and I do not see any fundamental
problems in that direction.
David Rientjes June 15, 2018, 11:15 p.m. UTC | #2
On Fri, 15 Jun 2018, Michal Hocko wrote:

> > Signed-off-by: David Rientjes <rientjes@google.com>
> 
> Nacked-by: Michal Hocko <mhocko@suse.com>
> as already explained elsewhere in this email thread.
> 

I don't find this to be surprising, but I'm not sure that it actually 
matters if you won't fix a regression that you introduced.  Tetsuo 
initially found this issue and presented a similar solution, so I think 
his feedback on this is more important since it would fix a problem for 
him as well.

> > ---
> >  Note: I understand there is an objection based on timeout based delays.
> >  This is currently the only possible way to avoid oom killing important
> >  processes completely unnecessarily.  If the oom reaper can someday free
> >  all memory, including mlocked memory and those mm's with blockable mmu
> >  notifiers, and is guaranteed to always be able to grab mm->mmap_sem,
> >  this can be removed.  I do not believe any such guarantee is possible
> >  and consider the massive killing of additional processes unnecessarily
> >  to be a regression introduced by the oom reaper and its very quick
> >  setting of MMF_OOM_SKIP to allow additional processes to be oom killed.
> 
> If you find oom reaper more harmful than useful I would be willing to
> ack a command line option to disable it. Especially when you keep
> claiming that the lockups are not really happening in your environment.
> 

There's no need to disable it, we simply need to ensure that it doesn't 
set MMF_OOM_SKIP too early, which my patch does.  We also need to avoid 
setting MMF_OOM_SKIP in exit_mmap() until after all memory has been freed, 
i.e. after free_pgtables().

I'd be happy to make this timeout configurable, however, and default 
it to perhaps one second as the blockable mmu notifier timeout in your own 
code does.  I find it somewhat sad that we'd need a sysctl for this, but 
if that will appease you and it will help to move this into -mm then we 
can do that.
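
For illustration only, such a knob could be wired up as a standard sysctl;
this is a sketch under assumptions (the procname, default, and table name
are made up here), not part of the posted patch:

  /* Sketch only: exposing the reap timeout as a sysctl, in msecs. */
  static unsigned int sysctl_oom_reap_timeout_msecs = 10 * 1000;

  static struct ctl_table oom_reap_table[] = {
          {
                  .procname     = "oom_reap_timeout_ms",
                  .data         = &sysctl_oom_reap_timeout_msecs,
                  .maxlen       = sizeof(sysctl_oom_reap_timeout_msecs),
                  .mode         = 0644,
                  .proc_handler = proc_douintvec,
          },
          { }
  };

wake_oom_reaper() would then read the sysctl when arming the per-mm timeout
instead of using a compile-time constant.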

> Other than that I've already pointed to a more robust solution. If you
> are reluctant to try it out I will do, but introducing a timeout is just
> papering over the real problem. Maybe we will not reach the state that
> _all_ the memory is reapable but we definitely should try to make as
> much as possible to be reapable and I do not see any fundamental
> problems in that direction.

You introduced the timeout already, I'm sure you realized yourself that 
the oom reaper sets MMF_OOM_SKIP much too early.  Trying to grab 
mm->mmap_sem 10 times in a row with HZ/10 sleeps in between is a timeout.  
If there are blockable mmu notifiers, your code puts the oom reaper to 
sleep for HZ before setting MMF_OOM_SKIP, which is a timeout.  This patch 
moves the timeout to reaching exit_mmap() where we actually free all 
memory possible and still allow for additional oom killing if there is a 
very rare oom livelock.

You haven't provided any data that suggests oom livelocking isn't a very 
rare event and that we need to respond immediately by randomly killing 
more and more processes rather than wait a bounded period of time to allow 
for forward progress to be made.  I have constantly provided data showing 
oom livelock in our fleet is extremely rare, less than 0.04% of the time.  
Yet your solution is to kill many processes so this 0.04% is fast.

The reproducer on powerpc is very simple.  Do an mmap() and mlock() the 
length.  Fork one 120MB process that does that and two 60MB processes that 
do that in a 128MB memcg.

[  402.064375] Killed process 17024 (a.out) total-vm:134080kB, anon-rss:122032kB, file-rss:1600kB
[  402.107521] Killed process 17026 (a.out) total-vm:64448kB, anon-rss:44736kB, file-rss:1600kB

Completely reproducible and completely unnecessary.  Killing two processes 
pointlessly when the first oom kill would have been successful.
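
For reference, a minimal sketch of that reproducer; it assumes it is run
inside a memcg whose limit has already been set to 128MB (cgroup setup not
shown), and the sizes and error handling are illustrative:

  /* Each child mmaps and mlocks anonymous memory, faults it in, and
   * then sits until oom killed. */
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static void mmap_and_mlock(size_t len)
  {
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED || mlock(p, len))
                  exit(1);
          memset(p, 1, len);      /* fault everything in */
          for (;;)
                  pause();        /* hold the memory until oom killed */
  }

  int main(void)
  {
          const size_t sizes[] = { 120UL << 20, 60UL << 20, 60UL << 20 };
          int i;

          for (i = 0; i < 3; i++)
                  if (fork() == 0)
                          mmap_and_mlock(sizes[i]);
          while (wait(NULL) > 0)
                  ;
          return 0;
  }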

Killing processes is important, optimizing for 0.04% of cases of true oom 
livelock by insisting everybody tolerate excessive oom killing is not.  If 
you have data to suggest the 0.04% is higher, please present it.  I'd be 
interested in any data you have that suggests it's higher, even at
1/1,000,000th of the oom occurrence rate that I have shown.

It's inappropriate to merge code that oom kills many processes 
unnecessarily when one happens to be mlocked or have blockable mmu 
notifiers or when mm->mmap_sem can't be grabbed fast enough but forward 
progress is actually being made.  It's a regression, and it impacts real 
users.  Insisting that we fix the problem you introduced by making all mmu 
notifiers unblockable and mlocked memory can always be reaped and 
mm->mmap_sem can always be grabbed within a second is irresponsible.
Andrew Morton June 19, 2018, 12:27 a.m. UTC | #3
On Thu, 14 Jun 2018 13:42:59 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> it cannot reap an mm.  This can happen for a variety of reasons,
> including:
> 
>  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> 
>  - when the mm has blockable mmu notifiers that could cause the oom reaper
>    to stall indefinitely,

Maybe we should have more than one oom reaper thread?  I assume the
probability of the oom reaper thread blocking on an mmu notifier is
small, so perhaps just dive in and hope for the best.  If the oom
reaper gets stuck then there's another thread ready to take over.  And
revisit the decision to use a kernel thread instead of workqueues.

> but we can also add a third when the oom reaper can "reap" an mm but doing
> so is unlikely to free any amount of memory:
> 
>  - when the mm's memory is fully mlocked.
> 
> When all memory is mlocked, the oom reaper will not be able to free any
> substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
> unmap and free its memory in exit_mmap() and subsequent oom victims are
> chosen unnecessarily.  This is trivial to reproduce if all eligible
> processes on the system have mlocked their memory: the oom killer calls
> panic() even though forward progress can be made.
> 
> This is the same issue where the exit path sets MMF_OOM_SKIP before
> unmapping memory and additional processes can be chosen unnecessarily
> because the oom killer is racing with exit_mmap().

So what's actually happening here.  A process has a large amount of
mlocked memory, it has been oom-killed and it is in the process of
releasing its memory and exiting, yes?

If so, why does this task set MMF_OOM_SKIP on itself?  Why aren't we
just patiently waiting for its attempt to release memory?

> We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> a true oom livelock in progress, it never gets set and no additional
> killing is possible.

I guess that's my answer.  What causes this livelock?  Process looping
in alloc_pages while holding a lock the oom victim wants?

> To fix this, this patch introduces a per-mm reaping timeout, initially set
> at 10s.  It requires that the oom reaper's list becomes a properly linked
> list so that other mm's may be reaped while waiting for an mm's timeout to
> expire.
> 
> This replaces the current timeouts in the oom reaper: (1) when trying to
> grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
> a HZ sleep if there are blockable mmu notifiers.  It extends this with a
> timeout to allow an oom victim to reach exit_mmap() before choosing
> additional processes unnecessarily.
> 
> The exit path will now set MMF_OOM_SKIP only after all memory has been
> freed, so additional oom killing is justified,

That seems sensible, but why set MMF_OOM_SKIP at all?

> and rely on MMF_UNSTABLE to
> determine when it can race with the oom reaper.
> 
> The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> lapsed because it can no longer guarantee forward progress.
> 
> The reaping timeout is intentionally set for a substantial amount of time
> since oom livelock is a very rare occurrence and it's better to optimize
> for preventing additional (unnecessary) oom killing than a scenario that
> is much more unlikely.

What happened to the old idea of permitting the task which is blocking
the oom victim to access additional reserves?

Come to that, what happened to the really, really old Andrea idea of not
looping in the page allocator anyway?  Return NULL instead...

I dunno, I'm thrashing around here.  We seem to be piling mess on top
of mess and then being surprised that the result is a mess.

> +#ifdef CONFIG_MMU
> +	/* When to give up on oom reaping this mm */
> +	unsigned long reap_timeout;

"timeout" implies "interval".  To me, anyway.  This is an absolute
time, so something like reap_time would be clearer.  Along with a
comment explaining that the units are in jiffies.
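
Sketched, with the rename and the suggested comment, that might read:

  #ifdef CONFIG_MMU
          /* Absolute time, in jiffies, after which the oom reaper gives up */
          unsigned long reap_time;
  #endif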

> +#endif
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
>  	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
>  #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1163,7 +1163,7 @@ struct task_struct {
>  #endif
>  	int				pagefault_disabled;
>  #ifdef CONFIG_MMU
> -	struct task_struct		*oom_reaper_list;
> +	struct list_head		oom_reap_list;

Can we have a comment explaining its locking.

>  #endif
>  #ifdef CONFIG_VMAP_STACK
>  	struct vm_struct		*stack_vm_area;
>
> ...
>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm)
>  	if (unlikely(mm_is_oom_victim(mm))) {
>  		/*
>  		 * Manually reap the mm to free as much memory as possible.
> -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> -		 * this mm from further consideration.  Taking mm->mmap_sem for
> -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> -		 * reaper will not run on this mm again after mmap_sem is
> -		 * dropped.
> +		 * Then, set MMF_UNSTABLE to avoid racing with the oom reaper.
> +		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> +		 * guarantee that the oom reaper will not run on this mm again
> +		 * after mmap_sem is dropped.

Comment should explain *why* we don't want the reaper to run on this mm
again.

>
> ...
>
Michal Hocko June 19, 2018, 8:33 a.m. UTC | #4
On Fri 15-06-18 16:15:39, David Rientjes wrote:
[...]
> I'd be happy to make this timeout configurable, however, and default 
> it to perhaps one second as the blockable mmu notifier timeout in your own 
> code does.  I find it somewhat sad that we'd need a sysctl for this, but 
> if that will appease you and it will help to move this into -mm then we 
> can do that.

No. This has been nacked in the past and I do not see anything different
from back then.

> > Other than that I've already pointed to a more robust solution. If you
> > are reluctant to try it out I will do, but introducing a timeout is just
> > papering over the real problem. Maybe we will not reach the state that
> > _all_ the memory is reapable but we definitely should try to make as
> > much as possible to be reapable and I do not see any fundamental
> > problems in that direction.
> 
> You introduced the timeout already, I'm sure you realized yourself that 
> the oom reaper sets MMF_OOM_SKIP much too early.  Trying to grab 
> mm->mmap_sem 10 times in a row with HZ/10 sleeps in between is a timeout.  

Yes, it is. And it is a timeout based on some feedback. The lock is
held, let's retry later but do not retry for ever. We can do the same
with blockable mmu notifiers. We are currently giving up right away. I
was proposing to add a can_sleep parameter to mmu_notifier_invalidate_range_start
and have it return EAGAIN if it would block. This would allow us to simply
retry on EAGAIN like we do for the mmap_sem.
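
For illustration, roughly the shape of that proposal; the can_sleep
parameter and the EAGAIN return are hypothetical here, sketched from the
paragraph above rather than from any existing mainline API:

          /*
           * Hypothetical: invalidate_range_start() grows a can_sleep
           * parameter and reports -EAGAIN instead of blocking, so the
           * reaper can back off and retry, just as it already does when
           * down_read_trylock(&mm->mmap_sem) fails.
           */
          ret = mmu_notifier_invalidate_range_start(mm, start, end,
                                                    false /* can_sleep */);
          if (ret == -EAGAIN) {
                  up_read(&mm->mmap_sem);
                  return false;   /* oom_reap_task() will retry later */
          }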

[...]
 
> The reproducer on powerpc is very simple.  Do an mmap() and mlock() the 
> length.  Fork one 120MB process that does that and two 60MB processes that 
> do that in a 128MB memcg.

And again, to solve this we just need to teach oom_reaper to handle
mlocked memory. There shouldn't be any fundamental reason why this would
be impossible AFAICS. Timeout is not a solution!

[...]

> It's inappropriate to merge code that oom kills many processes 
> unnecessarily when one happens to be mlocked or have blockable mmu 
> notifiers or when mm->mmap_sem can't be grabbed fast enough but forward 
> progress is actually being made.  It's a regression, and it impacts real 
> users.  Insisting that we fix the problem you introduced by making all mmu 
> notifiers unblockable and mlocked memory can always be reaped and 
> mm->mmap_sem can always be grabbed within a second is irresponsible.

Well, a lack of real world bug reports doesn't really back your story
here. I have asked about non-artificial workloads suffering and your
responses were quite nonspecific, to say the least.

And I do insist to come with a reasonable solution rather than random
hacks. Jeez the oom killer was full of these.

As I've said, if you are not willing to work on a proper solution, I
will, but my nack holds for this patch until we see no other way around
existing and real world problems.
Michal Hocko June 19, 2018, 8:47 a.m. UTC | #5
On Mon 18-06-18 17:27:33, Andrew Morton wrote:
> On Thu, 14 Jun 2018 13:42:59 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> 
> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm.  This can happen for a variety of reasons,
> > including:
> > 
> >  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> > 
> >  - when the mm has blockable mmu notifiers that could cause the oom reaper
> >    to stall indefinitely,
> 
> Maybe we should have more than one oom reaper thread?  I assume the
> probability of the oom reaper thread blocking on an mmu notifier is
> small, so perhaps just dive in and hope for the best.  If the oom
> reaper gets stuck then there's another thread ready to take over.  And
> revisit the decision to use a kernel thread instead of workqueues.

Well, I think that having more threads would be wasteful for a rare
event like the oom. Creating one on demand could be tricky because we
are under strong memory pressure at the time and a new thread costs some
resources.

> > but we can also add a third when the oom reaper can "reap" an mm but doing
> > so is unlikely to free any amount of memory:
> > 
> >  - when the mm's memory is fully mlocked.
> > 
> > When all memory is mlocked, the oom reaper will not be able to free any
> > substantial amount of memory.  It sets MMF_OOM_SKIP before the victim can
> > unmap and free its memory in exit_mmap() and subsequent oom victims are
> > chosen unnecessarily.  This is trivial to reproduce if all eligible
> > processes on the system have mlocked their memory: the oom killer calls
> > panic() even though forward progress can be made.
> > 
> > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > unmapping memory and additional processes can be chosen unnecessarily
> > because the oom killer is racing with exit_mmap().
> 
> So what's actually happening here.  A process has a large amount of
> mlocked memory, it has been oom-killed and it is in the process of
> releasing its memory and exiting, yes?
> 
> If so, why does this task set MMF_OOM_SKIP on itself?  Why aren't we
> just patiently waiting for its attempt to release memory?

Because the oom victim is not guaranteed to proceed to exit and release
its own memory. The OOM reaper jumps in and skips over mlocked ranges
because they require the page lock, and that, again, cannot be taken from
the oom reaper path (the lock might be held by a task in the middle of an
allocation). This in turn means that the oom_reaper doesn't free mlocked
memory before it sets MMF_OOM_SKIP, which allows a new oom victim to be
selected.
At the time we merged the oom reaper this wasn't seen as a major
issue because tasks usually do not consume a lot of mlocked memory and
there is always some other memory to tear down to help relieve the
memory pressure. mlockall oom victims were deemed unlikely because they
need a large rlimit and as such should be trusted and therefore quite
safe from runaways. But there was definitely a plan to make mlocked
memory reapable. So time to do it finally.

> > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > a true oom livelock in progress, it never gets set and no additional
> > killing is possible.
> 
> I guess that's my answer.  What causes this livelock?  Process looping
> in alloc_pages while holding a lock the oom victim wants?

Yes.

> > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > at 10s.  It requires that the oom reaper's list becomes a properly linked
> > list so that other mm's may be reaped while waiting for an mm's timeout to
> > expire.
> > 
> > This replaces the current timeouts in the oom reaper: (1) when trying to
> > grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
> > a HZ sleep if there are blockable mmu notifiers.  It extends this with a
> > timeout to allow an oom victim to reach exit_mmap() before choosing
> > additional processes unnecessarily.
> > 
> > The exit path will now set MMF_OOM_SKIP only after all memory has been
> > freed, so additional oom killing is justified,
> 
> That seems sensible, but why set MMF_OOM_SKIP at all?

MMF_OOM_SKIP is a way to say that the task should be skipped when
evaluating OOM victims.

> > and rely on MMF_UNSTABLE to
> > determine when it can race with the oom reaper.
> > 
> > The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> > lapsed because it can no longer guarantee forward progress.
> > 
> > The reaping timeout is intentionally set for a substantial amount of time
> > since oom livelock is a very rare occurrence and it's better to optimize
> > for preventing additional (unnecessary) oom killing than a scenario that
> > is much more unlikely.
> 
> What happened to the old idea of permitting the task which is blocking
> the oom victim to access additional reserves?

How do you find such a task?

> Come to that, what happened to the really, really old Andrea idea of not
> looping in the page allocator anyway?  Return NULL instead...

Nacked by Linus because too-small-to-fail is a long term semantic that
cannot change easily. We do not have any way to audit syscall paths to
not return ENOMEM when inappropriate.

> I dunno, I'm thrashing around here.  We seem to be piling mess on top
> of mess and then being surprised that the result is a mess.

Are we? The current oom_reaper certainly has some shortcomings that
are addressable. We have started simple to cover most cases and move
on with more complex heuristics based on real life bug reports. But we
_do_ have a quite straightforward feedback based algorithm to reclaim
oom victims. This is a solid ground for future development. Something we
never had before. So I am really wondering what all the mess is about.
David Rientjes June 19, 2018, 8:34 p.m. UTC | #6
On Mon, 18 Jun 2018, Andrew Morton wrote:

> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm.  This can happen for a variety of reasons,
> > including:
> > 
> >  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> > 
> >  - when the mm has blockable mmu notifiers that could cause the oom reaper
> >    to stall indefinitely,
> 
> Maybe we should have more than one oom reaper thread?  I assume the
> probability of the oom reaper thread blocking on an mmu notifier is
> small, so perhaps just dive in and hope for the best.  If the oom
> reaper gets stuck then there's another thread ready to take over.  And
> revisit the decision to use a kernel thread instead of workqueues.
> 

I'm not sure that we need more than one thread, per se, but we need the 
ability to operate on more than one oom victim while deciding whether one 
victim can be reaped or not.  The current implementation only processes 
one victim at a time: it tries to grab mm->mmap_sem, it sleeps, retries, 
sleeps, etc.  We need to try other oom victims (we do parallel memcg oom 
stress testing, and the oom reaper can uncharge memory to a hierarchy that 
prevents livelock as well), which my patch does.

> So what's actually happening here.  A process has a large amount of
> mlocked memory, it has been oom-killed and it is in the process of
> releasing its memory and exiting, yes?
> 

That's one failure mode, yes, and three possible ways:

 - the oom reaper immediately sets MMF_OOM_SKIP because it tried to free
   memory and completely failed, so it actually declares this a success
   and sets MMF_OOM_SKIP assuming memory was freed, when it wasn't,

 - to avoid CVE-2018-1000200 exit_mmap() must set MMF_OOM_SKIP before 
   doing munlock_vma_pages_all() which the oom reaper uses to determine if
   it can safely operate on a vma, so the exit path sets MMF_OOM_SKIP 
   before any possible memory freeing as well, and

 - the previous iteration of the oom reaper to set MMF_OOM_SKIP between
   unmap_vmas() and free_pgtables() suffered from the same problem for 
   large amounts of virtual memory whereas subsequent oom kill could have 
   been prevented if free_pgtables() could have completed.

My patch fixes all these issues because MMF_OOM_SKIP only gets set after 
free_pgtables(), i.e. no additional memory freeing is possible through 
exit_mmap(), or a process has failed to exit for 10s by the oom reaper.  I 
will patch this to make the timeout configurable.  I use the existing 
MMF_UNSTABLE to determine if the oom reaper can safely operate on vmas of 
the mm.

> If so, why does this task set MMF_OOM_SKIP on itself?  Why aren't we
> just patiently waiting for its attempt to release memory?
> 

That's what my patch does, yes, it needs to wait to ensure forward 
progress is not being made before setting MMF_OOM_SKIP and allowing all 
other processes on the system to be oom killed.  Taken to an extreme, 
imagine a single large mlocked process or one with a blockable mmu 
notifier taking up almost all memory on a machine.  If there is a memory 
leak, it will be oom killed same as it always has been.  The difference 
now is that the machine panic()'s because MMF_OOM_SKIP is set with no 
memory freeing and the oom killer finds no more eligible processes so its 
only alternative is panicking.

> > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > a true oom livelock in progress, it never gets set and no additional
> > killing is possible.
> 
> I guess that's my answer.  What causes this livelock?  Process looping
> in alloc_pages while holding a lock the oom victim wants?
> 

That's one way, yes, the other is to be charging memory in the mem cgroup 
path while holding a mutex the victim wants.  If additional kmem starts 
being charged to mem cgroup hierarchies and the oom killer is called 
synchronously in the charge path (there is no fault path to unwind to), 
which has been discussed, this problem will become much more prolific.

> > The exit path will now set MMF_OOM_SKIP only after all memory has been
> > freed, so additional oom killing is justified,
> 
> That seems sensible, but why set MMF_OOM_SKIP at all?
> 

The oom reaper will eventually need to set it if it's actually livelocked, 
which happens extremely rarely in practice, because the oom reaper was 
unable to free memory such that an allocator holding our mutex could 
successfully allocate.  It sets it immediately now for mlocked processes 
(it doesn't realize it didn't free a single page).  It retries 10 times to 
grab mm->mmap_sem and sets it after one second if it fails.  If it has a 
blockable mmu notifier it sleeps for a second and sets it.  I'm replacing 
all the current timeouts with a per-mm timeout and volunteering to make it 
configurable so that it can be disabled or set to 10s as preferred by us 
because we are tired of every process getting oom killed pointlessly.  
I'll suggest a default of 1s to match the timeouts currently implemented 
in the oom reaper and generalize them to be per-mm.

> > and rely on MMF_UNSTABLE to
> > determine when it can race with the oom reaper.
> > 
> > The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> > lapsed because it can no longer guarantee forward progress.
> > 
> > The reaping timeout is intentionally set for a substantial amount of time
> > since oom livelock is a very rare occurrence and it's better to optimize
> > for preventing additional (unnecessary) oom killing than a scenario that
> > is much more unlikely.
> 
> What happened to the old idea of permitting the task which is blocking
> the oom victim to access additional reserves?
> 

That is an alternative to the oom reaper and worked quite successfully for 
us.  We'd detect when a process was looping endlessly waiting for the same 
victim to exit and then grant it access to additional reserves, 
specifically to detect oom livelock scenarios.  The oom reaper should 
theoretically make this extremely rare since it normally can free *some* 
memory so we aren't oom anymore and allocators holding mutexes can 
succeed.

> > +#ifdef CONFIG_MMU
> > +	/* When to give up on oom reaping this mm */
> > +	unsigned long reap_timeout;
> 
> "timeout" implies "interval".  To me, anyway.  This is an absolute
> time, so something like reap_time would be clearer.  Along with a
> comment explaining that the units are in jiffies.
> 

Ack.

> > +#endif
> >  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> >  	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
> >  #endif
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1163,7 +1163,7 @@ struct task_struct {
> >  #endif
> >  	int				pagefault_disabled;
> >  #ifdef CONFIG_MMU
> > -	struct task_struct		*oom_reaper_list;
> > +	struct list_head		oom_reap_list;
> 
> Can we have a comment explaining its locking.
> 

Ok.

> >  #endif
> >  #ifdef CONFIG_VMAP_STACK
> >  	struct vm_struct		*stack_vm_area;
> >
> > ...
> >
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm)
> >  	if (unlikely(mm_is_oom_victim(mm))) {
> >  		/*
> >  		 * Manually reap the mm to free as much memory as possible.
> > -		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > -		 * this mm from further consideration.  Taking mm->mmap_sem for
> > -		 * write after setting MMF_OOM_SKIP will guarantee that the oom
> > -		 * reaper will not run on this mm again after mmap_sem is
> > -		 * dropped.
> > +		 * Then, set MMF_UNSTABLE to avoid racing with the oom reaper.
> > +		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> > +		 * guarantee that the oom reaper will not run on this mm again
> > +		 * after mmap_sem is dropped.
> 
> Comment should explain *why* we don't want the reaper to run on this mm
> again.
> 

Sounds good.

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -449,6 +449,10 @@  struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_MMU
+	/* When to give up on oom reaping this mm */
+	unsigned long reap_timeout;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1163,7 +1163,7 @@  struct task_struct {
 #endif
 	int				pagefault_disabled;
 #ifdef CONFIG_MMU
-	struct task_struct		*oom_reaper_list;
+	struct list_head		oom_reap_list;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct		*stack_vm_area;
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -835,6 +835,10 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->fail_nth = 0;
 #endif
 
+#ifdef CONFIG_MMU
+	INIT_LIST_HEAD(&tsk->oom_reap_list);
+#endif
+
 	return tsk;
 
 free_stack:
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3059,11 +3059,10 @@  void exit_mmap(struct mm_struct *mm)
 	if (unlikely(mm_is_oom_victim(mm))) {
 		/*
 		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
+		 * Then, set MMF_UNSTABLE to avoid racing with the oom reaper.
+		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
+		 * guarantee that the oom reaper will not run on this mm again
+		 * after mmap_sem is dropped.
 		 *
 		 * Nothing can be holding mm->mmap_sem here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
@@ -3077,7 +3076,7 @@  void exit_mmap(struct mm_struct *mm)
 		__oom_reap_task_mm(mm);
 		mutex_unlock(&oom_lock);
 
-		set_bit(MMF_OOM_SKIP, &mm->flags);
+		set_bit(MMF_UNSTABLE, &mm->flags);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
@@ -3105,6 +3104,7 @@  void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -476,7 +476,7 @@  bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 void __oom_reap_task_mm(struct mm_struct *mm)
@@ -519,10 +519,8 @@  void __oom_reap_task_mm(struct mm_struct *mm)
 	}
 }
 
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
-	bool ret = true;
-
 	/*
 	 * We have to make sure to not race with the victim exit path
 	 * and cause premature new oom victim selection:
@@ -540,9 +538,8 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		goto out_oom;
 	}
 
 	/*
@@ -551,69 +548,81 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 * TODO: we really want to get rid of this ugly hack and make sure that
 	 * notifiers cannot block for unbounded amount of time
 	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
+	if (mm_has_blockable_invalidate_notifiers(mm))
+		goto out_mm;
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't
+	 * work on the mm anymore. The check for MMF_UNSTABLE must run
 	 * under mmap_sem for reading because it serializes against the
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-		up_read(&mm->mmap_sem);
+	if (test_bit(MMF_UNSTABLE, &mm->flags)) {
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		goto out_mm;
 	}
 
 	trace_start_task_reaping(tsk->pid);
-
 	__oom_reap_task_mm(mm);
+	trace_finish_task_reaping(tsk->pid);
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_mm:
 	up_read(&mm->mmap_sem);
-
-	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
+out_oom:
 	mutex_unlock(&oom_lock);
-	return ret;
 }
 
-#define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
-	int attempts = 0;
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
-		schedule_timeout_idle(HZ/10);
+	/*
+	 * If this mm has either been fully unmapped, or the oom reaper has
+	 * given up on it, nothing left to do except drop the refcount.
+	 */
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
 
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
-		goto done;
+	/*
+	 * If this mm has already been reaped, doing so again will not likely
+	 * free additional memory.
+	 */
+	if (!test_bit(MMF_UNSTABLE, &mm->flags))
+		oom_reap_task_mm(tsk, mm);
 
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
+	if (time_after_eq(jiffies, mm->reap_timeout)) {
+		if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+			debug_show_all_locks();
 
-done:
-	tsk->oom_reaper_list = NULL;
+			/*
+			 * Reaping has failed for the timeout period, so give up
+			 * and allow additional processes to be oom killed.
+			 */
+			set_bit(MMF_OOM_SKIP, &mm->flags);
+		}
+		goto drop;
+	}
 
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
 
-	/* Drop a reference taken by wake_oom_reaper */
+	/* Enqueue to be reaped again */
+	spin_lock(&oom_reaper_lock);
+	list_add_tail(&tsk->oom_reap_list, &oom_reaper_list);
+	spin_unlock(&oom_reaper_lock);
+
+	schedule_timeout_idle(HZ/10);
+	return;
+
+drop:
+	/* Drop the reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
 
@@ -622,11 +631,13 @@  static int oom_reaper(void *unused)
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_reaper_list));
 		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
+		if (!list_empty(&oom_reaper_list)) {
+			tsk = list_entry(oom_reaper_list.next,
+					 struct task_struct, oom_reap_list);
+			list_del(&tsk->oom_reap_list);
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -637,17 +648,22 @@  static int oom_reaper(void *unused)
 	return 0;
 }
 
+/* How long to wait to oom reap an mm before selecting another process */
+#define OOM_REAP_TIMEOUT_MSECS (10 * 1000)
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	/*
+	 * Set the reap timeout; if it's already set, the mm is enqueued and
+	 * this tsk can be ignored.
+	 */
+	if (cmpxchg(&tsk->signal->oom_mm->reap_timeout, 0UL,
+			jiffies + msecs_to_jiffies(OOM_REAP_TIMEOUT_MSECS)))
 		return;
 
 	get_task_struct(tsk);
 
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	list_add(&tsk->oom_reap_list, &oom_reaper_list);
 	spin_unlock(&oom_reaper_lock);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);