mm: Check for SIGKILL inside dup_mmap() loop.

Message ID	1522322870-4335-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>, Alexander Viro <viro@zeniv.linux.org.uk>, Andrew Morton <akpm@linux-foundation.org>, "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, Michal Hocko <mhocko@suse.com>, Rik van Riel <riel@redhat.com> Subject: [PATCH] mm: Check for SIGKILL inside dup_mmap() loop. Date: Thu, 29 Mar 2018 20:27:50 +0900 Message-Id: <1522322870-4335-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Tetsuo Handa March 29, 2018, 11:27 a.m. UTC

Theoretically it is possible that an mm_struct with 60000+ vmas loops
with potentially allocating memory, with mm->mmap_sem held for write by
the current thread. Unless I overlooked that fatal_signal_pending() is
somewhere in the loop, this is bad if current thread was selected as an
OOM victim, for the current thread will continue allocations using memory
reserves while the OOM reaper is unable to reclaim memory.

But there is no point with continuing the loop from the beginning if
current thread is killed. If there were __GFP_KILLABLE (or something
like memalloc_nofs_save()/memalloc_nofs_restore()), we could apply it
to all allocations inside the loop. But since we don't have such flag,
this patch uses fatal_signal_pending() check inside the loop.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 kernel/fork.c | 4 ++++
 1 file changed, 4 insertions(+)

Andrew Morton March 29, 2018, 9:30 p.m. UTC | #1

On Thu, 29 Mar 2018 20:27:50 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> Theoretically it is possible that an mm_struct with 60000+ vmas loops
> with potentially allocating memory, with mm->mmap_sem held for write by
> the current thread. Unless I overlooked that fatal_signal_pending() is
> somewhere in the loop, this is bad if current thread was selected as an
> OOM victim, for the current thread will continue allocations using memory
> reserves while the OOM reaper is unable to reclaim memory.

All of which implies to me that this patch fixes a problem which is not
known to exist!  

> But there is no point with continuing the loop from the beginning if
> current thread is killed. If there were __GFP_KILLABLE (or something
> like memalloc_nofs_save()/memalloc_nofs_restore()), we could apply it
> to all allocations inside the loop. But since we don't have such flag,
> this patch uses fatal_signal_pending() check inside the loop.

Dumb question: if a thread has been oom-killed and then tries to
allocate memory, should the page allocator just fail the allocation
attempt?  I suppose there are all sorts of reasons why not :(

In which case, yes, setting a new
PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
tidy enough solution.  It would be a bit sad to add another test in the
hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
already.

> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>  			continue;
>  		}
>  		charge = 0;
> +		if (fatal_signal_pending(current)) {
> +			retval = -EINTR;
> +			goto out;
> +		}
>  		if (mpnt->vm_flags & VM_ACCOUNT) {
>  			unsigned long len = vma_pages(mpnt);

I think a comment explaining why we're doing this would help.

Better would be to add a new function "current_is_oom_killed()" or
such, which becomes self-documenting.  Because there are other reasons
why a task may have a fatal signal pending.

Michal Hocko April 3, 2018, 11:16 a.m. UTC | #2

On Thu 29-03-18 14:30:03, Andrew Morton wrote:
[...]
> Dumb question: if a thread has been oom-killed and then tries to
> allocate memory, should the page allocator just fail the allocation
> attempt?  I suppose there are all sorts of reasons why not :(

We give those tasks access to memory reserves to move on (see
oom_reserves_allowed) and fail allocation if reserves do not help

	if (tsk_is_oom_victim(current) &&
	    (alloc_flags == ALLOC_OOM ||
	     (gfp_mask & __GFP_NOMEMALLOC)))
		goto nopage;
So we...

> In which case, yes, setting a new
> PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> tidy enough solution.  It would be a bit sad to add another test in the
> hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> already.

... do not need this.

Tetsuo Handa April 3, 2018, 11:32 a.m. UTC | #3

Michal Hocko wrote:
> On Thu 29-03-18 14:30:03, Andrew Morton wrote:
> [...]
> > Dumb question: if a thread has been oom-killed and then tries to
> > allocate memory, should the page allocator just fail the allocation
> > attempt?  I suppose there are all sorts of reasons why not :(
> 
> We give those tasks access to memory reserves to move on (see
> oom_reserves_allowed) and fail allocation if reserves do not help
> 
> 	if (tsk_is_oom_victim(current) &&
> 	    (alloc_flags == ALLOC_OOM ||
> 	     (gfp_mask & __GFP_NOMEMALLOC)))
> 		goto nopage;
> So we...
> 
> > In which case, yes, setting a new
> > PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> > tidy enough solution.  It would be a bit sad to add another test in the
> > hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> > already.
> 
> ... do not need this.

Excuse me? But that check is after

	/* Reclaim has failed us, start killing things */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;

which means that tsk_is_oom_victim(current) && alloc_flags == ALLOC_OOM threads
can still trigger the OOM killer as soon as the OOM reaper sets MMF_OOM_SKIP.

Michal Hocko April 3, 2018, 11:38 a.m. UTC | #4

On Tue 03-04-18 20:32:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 29-03-18 14:30:03, Andrew Morton wrote:
> > [...]
> > > Dumb question: if a thread has been oom-killed and then tries to
> > > allocate memory, should the page allocator just fail the allocation
> > > attempt?  I suppose there are all sorts of reasons why not :(
> > 
> > We give those tasks access to memory reserves to move on (see
> > oom_reserves_allowed) and fail allocation if reserves do not help
> > 
> > 	if (tsk_is_oom_victim(current) &&
> > 	    (alloc_flags == ALLOC_OOM ||
> > 	     (gfp_mask & __GFP_NOMEMALLOC)))
> > 		goto nopage;
> > So we...
> > 
> > > In which case, yes, setting a new
> > > PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> > > tidy enough solution.  It would be a bit sad to add another test in the
> > > hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> > > already.
> > 
> > ... do not need this.
> 
> Excuse me? But that check is after
> 
> 	/* Reclaim has failed us, start killing things */
> 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> 	if (page)
> 		goto got_pg;
> 
> which means that tsk_is_oom_victim(current) && alloc_flags == ALLOC_OOM threads
> can still trigger the OOM killer as soon as the OOM reaper sets MMF_OOM_SKIP.

Races are possible and I do not see them as critical _right now_. If
that turnes out to be not the case we can think of a more robust way.
The thing is that we have "bail out for OOM victims already".

Matthew Wilcox (Oracle) April 3, 2018, 11:58 a.m. UTC | #5

On Thu, Mar 29, 2018 at 02:30:03PM -0700, Andrew Morton wrote:
> > @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >  			continue;
> >  		}
> >  		charge = 0;
> > +		if (fatal_signal_pending(current)) {
> > +			retval = -EINTR;
> > +			goto out;
> > +		}
> >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> >  			unsigned long len = vma_pages(mpnt);
> 
> I think a comment explaining why we're doing this would help.
> 
> Better would be to add a new function "current_is_oom_killed()" or
> such, which becomes self-documenting.  Because there are other reasons
> why a task may have a fatal signal pending.

I disagree that we need a comment here, or to create an alias.  Someone
who knows nothing of the oom-killer (like, er, me) reading that code sees
"Oh, we're checking for fatal signals here.  I guess it doesn't make sense
to continue forking a process if it's already received a fatal signal."

One might speculate about the causes of the fatal signal having been
received and settle on reasons which make sense even without thinking
of the OOM case.  Because it's why it was introduced, I always think
about a task blocked on a dead NFS mount.  If it's multithreaded and
one of the threads called fork() while another thread was blocked on a
page fault and the dup_mmap() had to wait for the page fault to finish
... that would make some kind of sense.

Michal Hocko April 3, 2018, 12:08 p.m. UTC | #6

On Tue 03-04-18 04:58:57, Matthew Wilcox wrote:
> On Thu, Mar 29, 2018 at 02:30:03PM -0700, Andrew Morton wrote:
> > > @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> > >  			continue;
> > >  		}
> > >  		charge = 0;
> > > +		if (fatal_signal_pending(current)) {
> > > +			retval = -EINTR;
> > > +			goto out;
> > > +		}
> > >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> > >  			unsigned long len = vma_pages(mpnt);
> > 
> > I think a comment explaining why we're doing this would help.
> > 
> > Better would be to add a new function "current_is_oom_killed()" or
> > such, which becomes self-documenting.  Because there are other reasons
> > why a task may have a fatal signal pending.
> 
> I disagree that we need a comment here, or to create an alias.  Someone
> who knows nothing of the oom-killer (like, er, me) reading that code sees
> "Oh, we're checking for fatal signals here.  I guess it doesn't make sense
> to continue forking a process if it's already received a fatal signal."
> 
> One might speculate about the causes of the fatal signal having been
> received and settle on reasons which make sense even without thinking
> of the OOM case.  Because it's why it was introduced, I always think
> about a task blocked on a dead NFS mount.  If it's multithreaded and
> one of the threads called fork() while another thread was blocked on a
> page fault and the dup_mmap() had to wait for the page fault to finish
> ... that would make some kind of sense.

I completely agree. If the check is really correct then it should be
pretty much self explanatory like many other checks. There is absolutely
zero oom specific in there. If a check _is_ oom specific then there is
something fishy going on.

mm: Check for SIGKILL inside dup_mmap() loop.

Commit Message

Comments

Patch