mm: Check for SIGKILL inside dup_mmap() loop.

Andrew Morton wrote:
> On Thu, 29 Mar 2018 20:27:50 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > Theoretically it is possible that an mm_struct with 60000+ vmas loops
> > with potentially allocating memory, with mm->mmap_sem held for write by
> > the current thread. Unless I overlooked that fatal_signal_pending() is
> > somewhere in the loop, this is bad if current thread was selected as an
> > OOM victim, for the current thread will continue allocations using memory
> > reserves while the OOM reaper is unable to reclaim memory.
> 
> All of which implies to me that this patch fixes a problem which is not
> known to exist!

Yes.

What made me check this loop is that it is not difficult to hit
"oom_reaper: unable to reap pid:" messages if the victim thread is blocked at
i_mmap_lock_write() in this loop.

Checking for SIGKILL in this loop would be nice. It should help the victim terminate
faster, to some degree, by reducing the possibility of being blocked at i_mmap_lock_write().
That is, if i_mmap_lock_write() against the N'th vma would block, we can avoid needless
delay by escaping the loop via the fatal_signal_pending() test before reaching
i_mmap_lock_write() against the N'th vma. Even if we are already blocked at
i_mmap_lock_write() against the N'th vma, we can still avoid needless delay if
i_mmap_lock_write() against subsequent vmas would also block.
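
To illustrate the second case, here is a rough userspace model, not kernel code.
All model_* names are made up for this sketch: model_lock_wait() stands in for a
contended i_mmap_lock_write(), and the sigkill_pending flag stands in for
fatal_signal_pending(current). SIGKILL arrives while we wait on one vma, and the
check at the top of the loop saves us from waiting on every subsequent vma.

```c
#include <stdbool.h>
#include <stddef.h>
#include <errno.h>

/* Userspace model only; none of these names are kernel APIs. */
struct model_task {
	bool sigkill_pending;   /* stands in for fatal_signal_pending(current) */
	size_t kill_arrives_at; /* SIGKILL arrives while waiting on this vma index */
	size_t lock_waits;      /* number of contended lock waits we sat through */
};

/* Stand-in for a contended i_mmap_lock_write(): every call counts as one wait. */
static void model_lock_wait(struct model_task *t, size_t idx)
{
	t->lock_waits++;
	if (idx == t->kill_arrives_at)
		t->sigkill_pending = true;
}

/* Model of the per-vma loop, with or without the SIGKILL check at the top. */
static int model_dup_loop(struct model_task *t, size_t nr_vmas, bool check_signal)
{
	for (size_t i = 0; i < nr_vmas; i++) {
		if (check_signal && t->sigkill_pending)
			return -EINTR; /* escape before the next lock wait */
		model_lock_wait(t, i);
	}
	return 0;
}
```

With the check enabled, the model stops waiting right after the vma on which
SIGKILL arrived. Without it, the model waits on every remaining vma.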

> 
> > But there is no point with continuing the loop from the beginning if
> > current thread is killed. If there were __GFP_KILLABLE (or something
> > like memalloc_nofs_save()/memalloc_nofs_restore()), we could apply it
> > to all allocations inside the loop. But since we don't have such flag,
> > this patch uses fatal_signal_pending() check inside the loop.
> 
> Dumb question: if a thread has been oom-killed and then tries to
> allocate memory, should the page allocator just fail the allocation
> attempt?  I suppose there are all sorts of reasons why not :(

Maybe because allocation failure paths are not tested enough
( https://lwn.net/Articles/627419/ ). But that should not prevent the page
allocator from just failing the allocation attempt.

I do want a mechanism for telling the page allocator whether we want to
give up upon SIGKILL. I've been proposing it as __GFP_KILLABLE.

> 
> In which case, yes, setting a new
> PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> tidy enough solution.  It would be a bit sad to add another test in the
> hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> already.

Maybe we can make "give up upon SIGKILL" the default and let callers
explicitly say "do not give up upon SIGKILL".

----------------------------------------
 include/linux/gfp.h            | 10 ++++++++++
 include/trace/events/mmflags.h |  1 +
 mm/page_alloc.c                | 15 +++++++++++++++
 tools/perf/builtin-kmem.c      |  1 +
 4 files changed, 27 insertions(+)

----------------------------------------

> 
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >  			continue;
> >  		}
> >  		charge = 0;
> > +		if (fatal_signal_pending(current)) {
> > +			retval = -EINTR;
> > +			goto out;
> > +		}
> >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> >  			unsigned long len = vma_pages(mpnt);
> 
> I think a comment explaining why we're doing this would help.

I think such a comment can go in the patch description, as commit d1908f52557b3230
("fs: break out of iomap_file_buffered_write on fatal signals") did.

> 
> Better would be to add a new function "current_is_oom_killed()" or
> such, which becomes self-documenting.  Because there are other reasons
> why a task may have a fatal signal pending.
> 

current_is_oom_killed() is already there as tsk_is_oom_victim(current),
except that tsk_is_oom_victim(current) is not accurate because the OOM
killer sets the ->signal->oom_mm field for only one thread group even when
the OOM victim consists of multiple thread groups.

But I don't think we need to distinguish between "killed by the OOM killer" and
"killed by something other than the OOM killer" if we can tell the page allocator
whether we want to give up upon SIGKILL. Any allocation which does not
want to give up upon SIGKILL could get preferred access to memory reserves
if SIGKILL is pending.
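
As a rough sketch of that policy (a userspace model, not the page allocator;
MODEL_GFP_NO_GIVE_UP and the other model_* names are hypothetical and are not
the flag names used in any patch), the decision could look like:

```c
#include <stdbool.h>

/* Possible outcomes in this model; not real allocator return values. */
enum model_alloc_result {
	MODEL_ALLOC_NORMAL,   /* no SIGKILL pending: allocate as usual */
	MODEL_ALLOC_FAIL,     /* SIGKILL pending: give up by default */
	MODEL_ALLOC_RESERVES, /* SIGKILL pending, caller opted out: use reserves */
};

/* Hypothetical opt-out flag meaning "do not give up upon SIGKILL". */
#define MODEL_GFP_NO_GIVE_UP 0x1u

static enum model_alloc_result model_alloc_policy(unsigned int flags,
						   bool sigkill_pending)
{
	if (!sigkill_pending)
		return MODEL_ALLOC_NORMAL;
	if (flags & MODEL_GFP_NO_GIVE_UP)
		return MODEL_ALLOC_RESERVES;
	return MODEL_ALLOC_FAIL;
}
```

Note that the model never asks who sent SIGKILL; only the caller's flag and the
pending signal matter.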
