diff mbox

mm: Check for SIGKILL inside dup_mmap() loop.

Message ID 201803301934.DHF12420.SOFFJQMLVtHOOF@I-love.SAKURA.ne.jp (mailing list archive)
State New, archived
Headers show

Commit Message

Tetsuo Handa March 30, 2018, 10:34 a.m. UTC
Andrew Morton wrote:
> On Thu, 29 Mar 2018 20:27:50 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > Theoretically it is possible that an mm_struct with 60000+ vmas loops
> > with potentially allocating memory, with mm->mmap_sem held for write by
> > the current thread. Unless I overlooked that fatal_signal_pending() is
> > somewhere in the loop, this is bad if current thread was selected as an
> > OOM victim, for the current thread will continue allocations using memory
> > reserves while the OOM reaper is unable to reclaim memory.
> 
> All of which implies to me that this patch fixes a problem which is not
> known to exist!

Yes.

The trigger which made me to check this loop is that it is not difficult to hit
"oom_reaper: unable to reap pid:" messages if the victim thread is blocked at
i_mmap_lock_write() in this loop.

Checking for SIGKILL in this loop will be nice. Doing so should to some degree help
faster termination by reducing possibility of being blocked at i_mmap_sem_write().
That is, if i_mmap_lock_write() against N'th vma would block, we can avoid needless
delay by escaping the loop via fatal_signal_pending() test before reaching
i_mmap_lock_write() against N'th vma. Even if we are already blocked at
i_mmap_lock_write() against N'th vma, we can still avoid needless delay if
i_mmap_lock_write() against subsequent vmas would also block.

> 
> > But there is no point with continuing the loop from the beginning if
> > current thread is killed. If there were __GFP_KILLABLE (or something
> > like memalloc_nofs_save()/memalloc_nofs_restore()), we could apply it
> > to all allocations inside the loop. But since we don't have such flag,
> > this patch uses fatal_signal_pending() check inside the loop.
> 
> Dumb question: if a thread has been oom-killed and then tries to
> allocate memory, should the page allocator just fail the allocation
> attempt?  I suppose there are all sorts of reasons why not :(

Maybe because allocation failure paths are not tested enough
( https://lwn.net/Articles/627419/ ). But that should not prevent the page
allocator from just failing the allocation attempt.

I do want a mechanism for telling the page allocator whether we want to
give up upon SIGKILL. I've been proposing it as __GFP_KILLABLE.

> 
> In which case, yes, setting a new
> PF_MEMALLOC_MAY_FAIL_IF_I_WAS_OOMKILLED around such code might be a
> tidy enough solution.  It would be a bit sad to add another test in the
> hot path (should_fail_alloc_page()?), but geeze we do a lot of junk
> already.

Maybe we can make "give up by default upon SIGKILL" and let callers
explicitly say "do not give up upon SIGKILL".

----------------------------------------
 include/linux/gfp.h            | 10 ++++++++++
 include/trace/events/mmflags.h |  1 +
 mm/page_alloc.c                | 15 +++++++++++++++
 tools/perf/builtin-kmem.c      |  1 +
 4 files changed, 27 insertions(+)

----------------------------------------

> 
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -440,6 +440,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> >  			continue;
> >  		}
> >  		charge = 0;
> > +		if (fatal_signal_pending(current)) {
> > +			retval = -EINTR;
> > +			goto out;
> > +		}
> >  		if (mpnt->vm_flags & VM_ACCOUNT) {
> >  			unsigned long len = vma_pages(mpnt);
> 
> I think a comment explaining why we're doing this would help.

I think such comment can go to patch description like commit d1908f52557b3230
("fs: break out of iomap_file_buffered_write on fatal signals") did.

> 
> Better would be to add a new function "current_is_oom_killed()" or
> such, which becomes self-documenting.  Because there are other reasons
> why a task may have a fatal signal pending.
> 

current_is_oom_killed() is already there as tsk_is_oom_victim(current)
except that tsk_is_oom_victim(current) is not accurate because the OOM
killer sets ->signal->oom_mm field to only one thread group even when
the OOM victim consists of multiple thread groups.

But I don't think we need to distinguish "killed by the OOM killer" and
"killed by other than the OOM killer" if we can tell the page allocator
whether we want to give up upon SIGKILL. Any allocation which does not
want to give up upon SIGKILL could get preferred access to memory reserves
if SIGKILL is pending.

Comments

Matthew Wilcox (Oracle) April 3, 2018, 12:14 p.m. UTC | #1
On Fri, Mar 30, 2018 at 07:34:59PM +0900, Tetsuo Handa wrote:
> Maybe we can make "give up by default upon SIGKILL" and let callers
> explicitly say "do not give up upon SIGKILL".

I really strongly disapprove of this patch.  This GFP flag will be abused
like every other GFP flag.

> +++ b/mm/page_alloc.c
> @@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	if (current->flags & PF_MEMALLOC)
>  		goto nopage;
>  
> +	/* Can give up if caller is willing to give up upon fatal signals */
> +	if (fatal_signal_pending(current) &&
> +	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
> +		gfp_mask |= __GFP_NOWARN;
> +		goto nopage;
> +	}
> +
>  	/* Try direct reclaim and then allocating */

This part is superficially tempting, although without the UNKILLABLE.  ie:

+	if (fatal_signal_pending(current) && !(gfp_mask & __GFP_NOFAIL)) {
+		gfp_mask |= __GFP_NOWARN;
+		goto nopage;
+	}

It makes some sense to me to prevent tasks with a fatal signal pending
from being able to trigger reclaim.  But I'm worried about what memory
allocation failures it might trigger on paths that aren't accustomed to
seeing failures.
Michal Hocko April 3, 2018, 12:19 p.m. UTC | #2
On Tue 03-04-18 05:14:14, Matthew Wilcox wrote:
> On Fri, Mar 30, 2018 at 07:34:59PM +0900, Tetsuo Handa wrote:
> > Maybe we can make "give up by default upon SIGKILL" and let callers
> > explicitly say "do not give up upon SIGKILL".
> 
> I really strongly disapprove of this patch.  This GFP flag will be abused
> like every other GFP flag.
> 
> > +++ b/mm/page_alloc.c
> > @@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> >  	if (current->flags & PF_MEMALLOC)
> >  		goto nopage;
> >  
> > +	/* Can give up if caller is willing to give up upon fatal signals */
> > +	if (fatal_signal_pending(current) &&
> > +	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
> > +		gfp_mask |= __GFP_NOWARN;
> > +		goto nopage;
> > +	}
> > +
> >  	/* Try direct reclaim and then allocating */
> 
> This part is superficially tempting, although without the UNKILLABLE.  ie:
> 
> +	if (fatal_signal_pending(current) && !(gfp_mask & __GFP_NOFAIL)) {
> +		gfp_mask |= __GFP_NOWARN;
> +		goto nopage;
> +	}
> 
> It makes some sense to me to prevent tasks with a fatal signal pending
> from being able to trigger reclaim.  But I'm worried about what memory
> allocation failures it might trigger on paths that aren't accustomed to
> seeing failures.

Please be aware that we _do_ allocate in the exit path. I have a strong
suspicion that even while fatal signal is pending. Do we really want
fail those really easily.
Matthew Wilcox (Oracle) April 3, 2018, 12:25 p.m. UTC | #3
On Tue, Apr 03, 2018 at 02:19:50PM +0200, Michal Hocko wrote:
> On Tue 03-04-18 05:14:14, Matthew Wilcox wrote:
> > On Fri, Mar 30, 2018 at 07:34:59PM +0900, Tetsuo Handa wrote:
> > > Maybe we can make "give up by default upon SIGKILL" and let callers
> > > explicitly say "do not give up upon SIGKILL".
> > 
> > I really strongly disapprove of this patch.  This GFP flag will be abused
> > like every other GFP flag.
> > 
> > > +++ b/mm/page_alloc.c
> > > @@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > >  	if (current->flags & PF_MEMALLOC)
> > >  		goto nopage;
> > >  
> > > +	/* Can give up if caller is willing to give up upon fatal signals */
> > > +	if (fatal_signal_pending(current) &&
> > > +	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
> > > +		gfp_mask |= __GFP_NOWARN;
> > > +		goto nopage;
> > > +	}
> > > +
> > >  	/* Try direct reclaim and then allocating */
> > 
> > This part is superficially tempting, although without the UNKILLABLE.  ie:
> > 
> > +	if (fatal_signal_pending(current) && !(gfp_mask & __GFP_NOFAIL)) {
> > +		gfp_mask |= __GFP_NOWARN;
> > +		goto nopage;
> > +	}
> > 
> > It makes some sense to me to prevent tasks with a fatal signal pending
> > from being able to trigger reclaim.  But I'm worried about what memory
> > allocation failures it might trigger on paths that aren't accustomed to
> > seeing failures.
> 
> Please be aware that we _do_ allocate in the exit path. I have a strong
> suspicion that even while fatal signal is pending. Do we really want
> fail those really easily.

I agree.  The allocations I'm thinking about are NFS wanting to send
I/Os in order to fsync each file that gets closed.  We probably don't
want those to fail.  And we definitely don't want to chase around the
kernel adding __GFP_KILLABLE to each place that we discover needs to
allocate on the exit path.
Tetsuo Handa April 3, 2018, 12:29 p.m. UTC | #4
Michal Hocko wrote:
> On Tue 03-04-18 05:14:14, Matthew Wilcox wrote:
> > On Fri, Mar 30, 2018 at 07:34:59PM +0900, Tetsuo Handa wrote:
> > > Maybe we can make "give up by default upon SIGKILL" and let callers
> > > explicitly say "do not give up upon SIGKILL".
> > 
> > I really strongly disapprove of this patch.  This GFP flag will be abused
> > like every other GFP flag.
> > 
> > > +++ b/mm/page_alloc.c
> > > @@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > >  	if (current->flags & PF_MEMALLOC)
> > >  		goto nopage;
> > >  
> > > +	/* Can give up if caller is willing to give up upon fatal signals */
> > > +	if (fatal_signal_pending(current) &&
> > > +	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
> > > +		gfp_mask |= __GFP_NOWARN;
> > > +		goto nopage;
> > > +	}
> > > +
> > >  	/* Try direct reclaim and then allocating */
> > 
> > This part is superficially tempting, although without the UNKILLABLE.  ie:
> > 
> > +	if (fatal_signal_pending(current) && !(gfp_mask & __GFP_NOFAIL)) {
> > +		gfp_mask |= __GFP_NOWARN;
> > +		goto nopage;
> > +	}
> > 
> > It makes some sense to me to prevent tasks with a fatal signal pending
> > from being able to trigger reclaim.  But I'm worried about what memory
> > allocation failures it might trigger on paths that aren't accustomed to
> > seeing failures.

Userspace tasks might call routines which need GFP_FS or GFP_NOIO (for
direct reclaim), and giving up upon fatal signals leads to more problems
like FS error or I/O error. Thus, we can't unconditionally give up upon
fatal signals.

> 
> Please be aware that we _do_ allocate in the exit path. I have a strong
> suspicion that even while fatal signal is pending. Do we really want
> fail those really easily.

Does the exit path mean inside do_exit() ? If yes, fatal signals are already
cleared before reaching do_exit().
Michal Hocko April 3, 2018, 1:06 p.m. UTC | #5
On Tue 03-04-18 21:29:52, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Please be aware that we _do_ allocate in the exit path. I have a strong
> > suspicion that even while fatal signal is pending. Do we really want
> > fail those really easily.
> 
> Does the exit path mean inside do_exit() ? If yes, fatal signals are already
> cleared before reaching do_exit().

They usually are. But we can send a SIGKILL on an already killed task
after it removed the previously deadly signal already AFAIR. Maybe I
mis-remember of course. Signal handling code always makes my head
explode and I tend to forget all the details. Anyway relying on
fatal_signal_pending for some allocator semantic is just too subtle.
Tetsuo Handa April 3, 2018, 2:54 p.m. UTC | #6
Matthew Wilcox wrote:
> On Tue, Apr 03, 2018 at 02:19:50PM +0200, Michal Hocko wrote:
> > On Tue 03-04-18 05:14:14, Matthew Wilcox wrote:
> > > On Fri, Mar 30, 2018 at 07:34:59PM +0900, Tetsuo Handa wrote:
> > > > Maybe we can make "give up by default upon SIGKILL" and let callers
> > > > explicitly say "do not give up upon SIGKILL".
> > > 
> > > I really strongly disapprove of this patch.  This GFP flag will be abused
> > > like every other GFP flag.
> > > 
> > > > +++ b/mm/page_alloc.c
> > > > @@ -4183,6 +4183,13 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> > > >  	if (current->flags & PF_MEMALLOC)
> > > >  		goto nopage;
> > > >  
> > > > +	/* Can give up if caller is willing to give up upon fatal signals */
> > > > +	if (fatal_signal_pending(current) &&
> > > > +	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
> > > > +		gfp_mask |= __GFP_NOWARN;
> > > > +		goto nopage;
> > > > +	}
> > > > +
> > > >  	/* Try direct reclaim and then allocating */
> > > 
> > > This part is superficially tempting, although without the UNKILLABLE.  ie:
> > > 
> > > +	if (fatal_signal_pending(current) && !(gfp_mask & __GFP_NOFAIL)) {
> > > +		gfp_mask |= __GFP_NOWARN;
> > > +		goto nopage;
> > > +	}
> > > 
> > > It makes some sense to me to prevent tasks with a fatal signal pending
> > > from being able to trigger reclaim.  But I'm worried about what memory
> > > allocation failures it might trigger on paths that aren't accustomed to
> > > seeing failures.
> > 
> > Please be aware that we _do_ allocate in the exit path. I have a strong
> > suspicion that even while fatal signal is pending. Do we really want
> > fail those really easily.
> 
> I agree.  The allocations I'm thinking about are NFS wanting to send
> I/Os in order to fsync each file that gets closed.  We probably don't
> want those to fail.  And we definitely don't want to chase around the
> kernel adding __GFP_KILLABLE to each place that we discover needs to
> allocate on the exit path.
> 

But memory allocations for syscalls are willing to give up upon SIGKILL
regardless of OOM.

If we worry the exit/nofs/noio paths, we can use scoped masking like
memalloc_nofs_save()/memalloc_nofs_restore() for ignoring __GFP_KILLABLE.
diff mbox

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a4582b..a0e8a9c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -24,6 +24,7 @@ 
 #define ___GFP_HIGH		0x20u
 #define ___GFP_IO		0x40u
 #define ___GFP_FS		0x80u
+#define ___GFP_UNKILLABLE	0x100u
 #define ___GFP_NOWARN		0x200u
 #define ___GFP_RETRY_MAYFAIL	0x400u
 #define ___GFP_NOFAIL		0x800u
@@ -122,6 +123,14 @@ 
  *   allocator recursing into the filesystem which might already be holding
  *   locks.
  *
+ * __GFP_UNKILLABLE: The VM implementation does not fail by simply because
+ *   fatal_signal_pending(current) is true when the current thread in task
+ *   context is doing memory allocations. Those allocations which do not want
+ *   to be disturbed by SIGKILL can add this flag. But note that those
+ *   allocations which must not fail have to add __GFP_NOFAIL, for
+ *   __GFP_UNKILLABLE allocations can still fail by other reasons such as
+ *   __GFP_NORETRY, __GFP_RETRY_MAYFAIL, being selected as an OOM victim.
+ *
  * __GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim.
  *   This flag can be cleared to avoid unnecessary delays when a fallback
  *   option is available.
@@ -181,6 +190,7 @@ 
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
+#define __GFP_UNKILLABLE ((__force gfp_t)___GFP_UNKILLABLE)
 #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
 #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
 #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a81cffb..6a21654 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -32,6 +32,7 @@ 
 	{(unsigned long)__GFP_ATOMIC,		"__GFP_ATOMIC"},	\
 	{(unsigned long)__GFP_IO,		"__GFP_IO"},		\
 	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
+	{(unsigned long)__GFP_UNKILLABLE,	"__GFP_UNKILLABLE"},	\
 	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
 	{(unsigned long)__GFP_RETRY_MAYFAIL,	"__GFP_RETRY_MAYFAIL"},	\
 	{(unsigned long)__GFP_NOFAIL,		"__GFP_NOFAIL"},	\
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 905db9d..c8af32e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4183,6 +4183,13 @@  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (current->flags & PF_MEMALLOC)
 		goto nopage;
 
+	/* Can give up if caller is willing to give up upon fatal signals */
+	if (fatal_signal_pending(current) &&
+	    !(gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL))) {
+		gfp_mask |= __GFP_NOWARN;
+		goto nopage;
+	}
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
 							&did_some_progress);
@@ -4301,6 +4308,14 @@  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct alloc_context *ac, gfp_t *alloc_mask,
 		unsigned int *alloc_flags)
 {
+	/*
+	 * Can give up if caller in task context is willing to give up upon
+	 * fatal signals
+	 */
+	if (in_task() && fatal_signal_pending(current) &&
+	    (gfp_mask & (__GFP_UNKILLABLE | __GFP_NOFAIL)))
+		return false;
+
 	ac->high_zoneidx = gfp_zone(gfp_mask);
 	ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
 	ac->nodemask = nodemask;
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index ae11e4c..b36d945 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -641,6 +641,7 @@  static int gfpcmp(const void *a, const void *b)
 	{ "__GFP_ATOMIC",		"_A" },
 	{ "__GFP_IO",			"I" },
 	{ "__GFP_FS",			"F" },
+	{ "__GFP_UNKILLABLE",		"UK" },
 	{ "__GFP_NOWARN",		"NWR" },
 	{ "__GFP_RETRY_MAYFAIL",	"R" },
 	{ "__GFP_NOFAIL",		"NF" },