diff mbox

[08/12] mm: page_alloc: wait for OOM killer progress before retrying

Message ID 20150326152343.GE23973@cmpxchg.org (mailing list archive)
State New, archived
Headers show

Commit Message

Johannes Weiner March 26, 2015, 3:23 p.m. UTC
On Thu, Mar 26, 2015 at 03:32:23PM +0100, Michal Hocko wrote:
> On Thu 26-03-15 07:24:45, Johannes Weiner wrote:
> > On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> > > Johannes Weiner wrote:
> [...]
> > > >  	/*
> > > > -	 * Acquire the oom lock.  If that fails, somebody else is
> > > > -	 * making progress for us.
> > > > +	 * This allocating task can become the OOM victim itself at
> > > > +	 * any point before acquiring the lock.  In that case, exit
> > > > +	 * quickly and don't block on the lock held by another task
> > > > +	 * waiting for us to exit.
> > > >  	 */
> > > > -	if (!mutex_trylock(&oom_lock)) {
> > > > -		*did_some_progress = 1;
> > > > -		schedule_timeout_uninterruptible(1);
> > > > -		return NULL;
> > > > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > > > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > +		goto alloc;
> > > >  	}
> > > 
> > > When a thread group has 1000 threads and most of them are doing memory allocation
> > > request, all of them will get fatal_signal_pending() == true when one of them are
> > > chosen by OOM killer.
> > > This code will allow most of them to access memory reserves, won't it?
> > 
> > Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
> > any dying thread.  Thanks, I'll fix it in v2.
> 
> Do you plan to post this v2 here for review?

Yeah, I was going to wait for feedback to settle before updating the
code.  But I was thinking something like this?

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Michal Hocko March 26, 2015, 3:38 p.m. UTC | #1
On Thu 26-03-15 11:23:43, Johannes Weiner wrote:
> On Thu, Mar 26, 2015 at 03:32:23PM +0100, Michal Hocko wrote:
> > On Thu 26-03-15 07:24:45, Johannes Weiner wrote:
> > > On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> > > > Johannes Weiner wrote:
> > [...]
> > > > >  	/*
> > > > > -	 * Acquire the oom lock.  If that fails, somebody else is
> > > > > -	 * making progress for us.
> > > > > +	 * This allocating task can become the OOM victim itself at
> > > > > +	 * any point before acquiring the lock.  In that case, exit
> > > > > +	 * quickly and don't block on the lock held by another task
> > > > > +	 * waiting for us to exit.
> > > > >  	 */
> > > > > -	if (!mutex_trylock(&oom_lock)) {
> > > > > -		*did_some_progress = 1;
> > > > > -		schedule_timeout_uninterruptible(1);
> > > > > -		return NULL;
> > > > > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > > > > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > > +		goto alloc;
> > > > >  	}
> > > > 
> > > > When a thread group has 1000 threads and most of them are doing memory allocation
> > > > request, all of them will get fatal_signal_pending() == true when one of them are
> > > > chosen by OOM killer.
> > > > This code will allow most of them to access memory reserves, won't it?
> > > 
> > > Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
> > > any dying thread.  Thanks, I'll fix it in v2.
> > 
> > Do you plan to post this v2 here for review?
> 
> Yeah, I was going to wait for feedback to settle before updating the
> code.  But I was thinking something like this?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9ce9c4c083a0..106793a75461 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2344,7 +2344,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	 * waiting for us to exit.
>  	 */
>  	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> -		alloc_flags |= ALLOC_NO_WATERMARKS;
> +		if (test_thread_flag(TIF_MEMDIE))
> +			alloc_flags |= ALLOC_NO_WATERMARKS;
>  		goto alloc;
>  	}

OK, I have expected something like this. I understand why you want to
retry inside this function. But I would prefer if gfp_to_alloc_flags was
used here so that we do not have that TIF_MEMDIE logic duplicated at two
places.
Johannes Weiner March 26, 2015, 6:17 p.m. UTC | #2
On Thu, Mar 26, 2015 at 04:38:47PM +0100, Michal Hocko wrote:
> On Thu 26-03-15 11:23:43, Johannes Weiner wrote:
> > On Thu, Mar 26, 2015 at 03:32:23PM +0100, Michal Hocko wrote:
> > > On Thu 26-03-15 07:24:45, Johannes Weiner wrote:
> > > > On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> > > > > Johannes Weiner wrote:
> > > [...]
> > > > > >  	/*
> > > > > > -	 * Acquire the oom lock.  If that fails, somebody else is
> > > > > > -	 * making progress for us.
> > > > > > +	 * This allocating task can become the OOM victim itself at
> > > > > > +	 * any point before acquiring the lock.  In that case, exit
> > > > > > +	 * quickly and don't block on the lock held by another task
> > > > > > +	 * waiting for us to exit.
> > > > > >  	 */
> > > > > > -	if (!mutex_trylock(&oom_lock)) {
> > > > > > -		*did_some_progress = 1;
> > > > > > -		schedule_timeout_uninterruptible(1);
> > > > > > -		return NULL;
> > > > > > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > > > > > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > > > +		goto alloc;
> > > > > >  	}
> > > > > 
> > > > > When a thread group has 1000 threads and most of them are doing memory allocation
> > > > > request, all of them will get fatal_signal_pending() == true when one of them are
> > > > > chosen by OOM killer.
> > > > > This code will allow most of them to access memory reserves, won't it?
> > > > 
> > > > Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
> > > > any dying thread.  Thanks, I'll fix it in v2.
> > > 
> > > Do you plan to post this v2 here for review?
> > 
> > Yeah, I was going to wait for feedback to settle before updating the
> > code.  But I was thinking something like this?
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9ce9c4c083a0..106793a75461 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2344,7 +2344,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  	 * waiting for us to exit.
> >  	 */
> >  	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > -		alloc_flags |= ALLOC_NO_WATERMARKS;
> > +		if (test_thread_flag(TIF_MEMDIE))
> > +			alloc_flags |= ALLOC_NO_WATERMARKS;
> >  		goto alloc;
> >  	}
> 
> OK, I have expected something like this. I understand why you want to
> retry inside this function. But I would prefer if gfp_to_alloc_flags was
> used here so that we do not have that TIF_MEMDIE logic duplicated at two
> places.

I don't think that's a good idea.  gfp_to_alloc_flags() reinitializes
the entire allocation context from the gfp flags and the task state,
but the only thing we care about, which can actually change here, is
TIF_MEMDIE.  This is perfectly obvious and expected in the OOM kill
allocation function, which makes my code self-documenting, whereas if
you use gfp_to_alloc_flags() you have to explain why it is called.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tetsuo Handa March 27, 2015, 2:01 p.m. UTC | #3
Michal Hocko wrote:
> On Thu 26-03-15 11:23:43, Johannes Weiner wrote:
> > On Thu, Mar 26, 2015 at 03:32:23PM +0100, Michal Hocko wrote:
> > > On Thu 26-03-15 07:24:45, Johannes Weiner wrote:
> > > > On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> > > > > Johannes Weiner wrote:
> > > [...]
> > > > > >  	/*
> > > > > > -	 * Acquire the oom lock.  If that fails, somebody else is
> > > > > > -	 * making progress for us.
> > > > > > +	 * This allocating task can become the OOM victim itself at
> > > > > > +	 * any point before acquiring the lock.  In that case, exit
> > > > > > +	 * quickly and don't block on the lock held by another task
> > > > > > +	 * waiting for us to exit.
> > > > > >  	 */
> > > > > > -	if (!mutex_trylock(&oom_lock)) {
> > > > > > -		*did_some_progress = 1;
> > > > > > -		schedule_timeout_uninterruptible(1);
> > > > > > -		return NULL;
> > > > > > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > > > > > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > > > > > +		goto alloc;
> > > > > >  	}
> > > > > 
> > > > > When a thread group has 1000 threads and most of them are doing memory allocation
> > > > > request, all of them will get fatal_signal_pending() == true when one of them are
> > > > > chosen by OOM killer.
> > > > > This code will allow most of them to access memory reserves, won't it?
> > > > 
> > > > Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
> > > > any dying thread.  Thanks, I'll fix it in v2.
> > > 
> > > Do you plan to post this v2 here for review?
> > 
> > Yeah, I was going to wait for feedback to settle before updating the
> > code.  But I was thinking something like this?
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9ce9c4c083a0..106793a75461 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2344,7 +2344,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  	 * waiting for us to exit.
> >  	 */
> >  	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > -		alloc_flags |= ALLOC_NO_WATERMARKS;
> > +		if (test_thread_flag(TIF_MEMDIE))
> > +			alloc_flags |= ALLOC_NO_WATERMARKS;
> >  		goto alloc;
> >  	}
> 
> OK, I have expected something like this. I understand why you want to
> retry inside this function. But I would prefer if gfp_to_alloc_flags was
> used here so that we do not have that TIF_MEMDIE logic duplicated at two
> places.

I thought we expected something like

-	if (!mutex_trylock(&oom_lock)) {
+	if (mutex_lock_killable(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
-		return NULL;
+		if (test_thread_flag(TIF_MEMDIE))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+		else
+			schedule_timeout_uninterruptible(1);
+		goto alloc;
 	}

or

 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
+		if (test_thread_flag(TIF_MEMDIE)) {
+			alloc_flags |= ALLOC_NO_WATERMARKS;
+			goto alloc;
+		}
 		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}

because jumping to

  return get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

without modifying alloc_flags and without setting *did_some_progress to 1
will lead to immediate allocation failures for !TIF_MEMDIE threads.

I don't like allowing only TIF_MEMDIE to get reserve access, for it can be
one of !TIF_MEMDIE threads which really need memory to safely terminate without
failing allocations from do_exit(). Rather, why not to discontinue TIF_MEMDIE
handling and allow getting access to private memory reserves for all
fatal_signal_pending() threads (i.e. replacing WMARK_OOM with WMARK_KILLED
in "[patch 09/12] mm: page_alloc: private memory reserves for OOM-killing
allocations") ?

> > > @@ -2383,12 +2382,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > >  		if (gfp_mask & __GFP_THISNODE)
> > >  			goto out;
> > >  	}
> > > -	/* Exhausted what can be done so it's blamo time */
> > > -	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
> > > -			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > > +
> > > +	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
> > >  		*did_some_progress = 1;
> > > +	} else {
> > > +		/* Oops, these shouldn't happen with the OOM killer disabled */
> > > +		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > > +			*did_some_progress = 1;
> > > +	}
> > 
> > I think GFP_NOFAIL allocations need to involve OOM killer than
> > pretending as if forward progress is made. If all of in-flight
> > allocation requests are GFP_NOFAIL, the system will lock up.
> 
> Hm?  They do involve the OOM killer, but once userspace is frozen for
> suspend/hibernate we shouldn't kill and thaw random tasks anymore as
> that might corrupt the memory snapshot, so nofail allocations are a
> bug at this point.
> 

Aren't there still kernel threads which might do GFP_NOFAIL allocations?
I think corrupting the memory snapshot by involving the OOM killer is the
correct behavior than a bug.

> > After all, if we wait for OOM killer progress before retrying, I think
> > we should involve OOM killer after some bounded timeout regardless of
> > gfp flags, and let OOM killer kill more threads after another bounded
> > timeout. Otherwise, the corner cases will lock up the system.
> 
> Giving nofail allocations access to emergency reserves targets this
> problem, but I agree with you that it's still possible for the system
> to lock up if they have been consumed and still no task made enough
> forward progress to release memory.  It is unlikely but possible.
> 
> I will probably come back to the OOM victim timeout patch some time in
> the future as that seems more robust.  It would also drastically
> simplify memcg OOM handling.  But that patch was controversial in the
> past and seemed beyond the scope of this patch set.
> 

A timeout for recovering from WMARK_KILLED to WMARK_MIN could be used for
detecting whether we need to trigger additinal OOM-killing?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ce9c4c083a0..106793a75461 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2344,7 +2344,8 @@  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * waiting for us to exit.
 	 */
 	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
-		alloc_flags |= ALLOC_NO_WATERMARKS;
+		if (test_thread_flag(TIF_MEMDIE))
+			alloc_flags |= ALLOC_NO_WATERMARKS;
 		goto alloc;
 	}