
[08/12] mm: page_alloc: wait for OOM killer progress before retrying

Message ID 1427264236-17249-9-git-send-email-hannes@cmpxchg.org (mailing list archive)
State New, archived

Commit Message

Johannes Weiner March 25, 2015, 6:17 a.m. UTC
There is not much point in rushing back to the freelists and burning
CPU cycles in direct reclaim when somebody else is in the process of
OOM killing, or right after issuing a kill ourselves, because it could
take some time for the OOM victim to release memory.

This is a very cold error path, so there is not much hurry.  Use the
OOM victim waitqueue to wait for victims to actually exit, which is a
solid signal that the memory pinned by those tasks has been released.
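
For reference, the victim accounting that this wait relies on already
exists for oom_killer_disable(); simplified, the mark/exit pair looks
roughly like this (a sketch, not the verbatim source):

	void mark_oom_victim(struct task_struct *tsk)
	{
		if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
			return;
		atomic_inc(&oom_victims);
	}

	void exit_oom_victim(void)
	{
		clear_thread_flag(TIF_MEMDIE);
		/* the last victim going away wakes the waiters below */
		if (!atomic_dec_return(&oom_victims))
			wake_up_all(&oom_victims_wait);
	}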

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/oom_kill.c   | 11 +++++++----
 mm/page_alloc.c | 42 +++++++++++++++++++++++++-----------------
 2 files changed, 32 insertions(+), 21 deletions(-)

Comments

Tetsuo Handa March 25, 2015, 2:15 p.m. UTC | #1
Johannes Weiner wrote:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5cfda39b3268..e066ac7353a4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		killed = 1;
>  	}
>  out:
> +	if (test_thread_flag(TIF_MEMDIE))
> +		return true;
>  	/*
> -	 * Give the killed threads a good chance of exiting before trying to
> -	 * allocate memory again.
> +	 * Wait for any outstanding OOM victims to die.  In rare cases
> +	 * victims can get stuck behind the allocating tasks, so the
> +	 * wait needs to be bounded.  It's crude alright, but cheaper
> +	 * than keeping a global dependency tree between all tasks.
>  	 */
> -	if (killed)
> -		schedule_timeout_killable(1);
> +	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
>  
>  	return true;
>  }

out_of_memory() returning true after a bounded wait effectively means
that the caller waits forever, without choosing a subsequent OOM victim,
when the first OOM victim fails to die. The system will lock up, won't it?

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c1224ba45548..9ce9c4c083a0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2330,30 +2330,29 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  }
>  
>  static inline struct page *
> -__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> +__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  
>  	*did_some_progress = 0;
>  
>  	/*
> -	 * Acquire the oom lock.  If that fails, somebody else is
> -	 * making progress for us.
> +	 * This allocating task can become the OOM victim itself at
> +	 * any point before acquiring the lock.  In that case, exit
> +	 * quickly and don't block on the lock held by another task
> +	 * waiting for us to exit.
>  	 */
> -	if (!mutex_trylock(&oom_lock)) {
> -		*did_some_progress = 1;
> -		schedule_timeout_uninterruptible(1);
> -		return NULL;
> +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> +		alloc_flags |= ALLOC_NO_WATERMARKS;
> +		goto alloc;
>  	}

When a thread group has 1000 threads and most of them are doing memory
allocation requests, all of them will get fatal_signal_pending() == true
when one of them is chosen by the OOM killer.
This code will allow most of them to access memory reserves, won't it?
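
To illustrate, this is the hunk above annotated with what each of those
threads sees (the comments are mine, not from the patch):

	if (test_thread_flag(TIF_MEMDIE) ||	/* true only for the chosen victim */
	    mutex_lock_killable(&oom_lock)) {	/* -EINTR for the other 999 threads,
						 * because SIGKILL is delivered
						 * group-wide and they all have
						 * fatal_signal_pending() */
		alloc_flags |= ALLOC_NO_WATERMARKS;	/* reserves for all of them */
		goto alloc;
	}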

> @@ -2383,12 +2382,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (gfp_mask & __GFP_THISNODE)
>  			goto out;
>  	}
> -	/* Exhausted what can be done so it's blamo time */
> -	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
> -			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> +
> +	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
>  		*did_some_progress = 1;
> +	} else {
> +		/* Oops, these shouldn't happen with the OOM killer disabled */
> +		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> +			*did_some_progress = 1;
> +	}

I think GFP_NOFAIL allocations need to involve the OOM killer rather
than pretend that forward progress was made. If all in-flight
allocation requests are GFP_NOFAIL, the system will lock up.

After all, if we wait for OOM killer progress before retrying, I think
we should involve the OOM killer after some bounded timeout regardless of
gfp flags, and let the OOM killer kill more threads after another bounded
timeout. Otherwise, these corner cases will lock up the system.
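
Roughly what I have in mind is the following; this is only a sketch,
and oom_wait_timeout / oom_kill_timeout / memdie_start are made-up
names, not existing symbols:

	/* in the allocation retry loop: */
	if (!did_some_progress &&
	    time_after(jiffies, alloc_start + oom_wait_timeout))
		/* involve the OOM killer regardless of gfp flags */
		page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags,
					     ac, &did_some_progress);

	/* in oom_scan_process_thread(), stop protecting a stuck victim: */
	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
		if (time_before(jiffies, task->memdie_start + oom_kill_timeout))
			return OOM_SCAN_ABORT;	/* still give it time */
		/* otherwise fall through and allow another victim */
	}
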
Vlastimil Babka March 25, 2015, 5:01 p.m. UTC | #2
On 03/25/2015 03:15 PM, Tetsuo Handa wrote:
> Johannes Weiner wrote:
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 5cfda39b3268..e066ac7353a4 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>>   		killed = 1;
>>   	}
>>   out:
>> +	if (test_thread_flag(TIF_MEMDIE))
>> +		return true;
>>   	/*
>> -	 * Give the killed threads a good chance of exiting before trying to
>> -	 * allocate memory again.
>> +	 * Wait for any outstanding OOM victims to die.  In rare cases
>> +	 * victims can get stuck behind the allocating tasks, so the
>> +	 * wait needs to be bounded.  It's crude alright, but cheaper
>> +	 * than keeping a global dependency tree between all tasks.
>>   	 */
>> -	if (killed)
>> -		schedule_timeout_killable(1);
>> +	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
>>
>>   	return true;
>>   }
>
> out_of_memory() returning true after a bounded wait effectively means
> that the caller waits forever, without choosing a subsequent OOM victim,
> when the first OOM victim fails to die. The system will lock up, won't it?

And after patch 12, does this mean that you may not be waiting long
enough for the victim to die before you fail the allocation
prematurely? I can imagine there would be situations where the victim is
not deadlocked but still takes more than HZ to finish, no?

Johannes Weiner March 26, 2015, 11:24 a.m. UTC | #3
On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 5cfda39b3268..e066ac7353a4 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >  		killed = 1;
> >  	}
> >  out:
> > +	if (test_thread_flag(TIF_MEMDIE))
> > +		return true;
> >  	/*
> > -	 * Give the killed threads a good chance of exiting before trying to
> > -	 * allocate memory again.
> > +	 * Wait for any outstanding OOM victims to die.  In rare cases
> > +	 * victims can get stuck behind the allocating tasks, so the
> > +	 * wait needs to be bounded.  It's crude alright, but cheaper
> > +	 * than keeping a global dependency tree between all tasks.
> >  	 */
> > -	if (killed)
> > -		schedule_timeout_killable(1);
> > +	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
> >  
> >  	return true;
> >  }
> 
> out_of_memory() returning true after a bounded wait effectively means
> that the caller waits forever, without choosing a subsequent OOM victim,
> when the first OOM victim fails to die. The system will lock up, won't it?

The OOM killer already refuses to choose another victim as long as the
first one hasn't exited, see oom_scan_process_thread().  That's why
later patches in this series introduce a reserve for OOM-killing tasks
and give nofail allocations access to emergency reserves, in case they
themselves prevent that single OOM victim from exiting.  Otherwise,
victims should exit eventually.
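
For reference, that check looks roughly like this (simplified from
oom_scan_process_thread(), with the later cases elided):

	if (oom_unkillable_task(task, NULL, nodemask))
		return OOM_SCAN_CONTINUE;
	/*
	 * A TIF_MEMDIE task already has access to reserves and is on its
	 * way out; abort the scan instead of picking a second victim.
	 */
	if (test_tsk_thread_flag(task, TIF_MEMDIE) && !force_kill)
		return OOM_SCAN_ABORT;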

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c1224ba45548..9ce9c4c083a0 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2330,30 +2330,29 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> >  }
> >  
> >  static inline struct page *
> > -__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> > +__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  	const struct alloc_context *ac, unsigned long *did_some_progress)
> >  {
> > -	struct page *page;
> > +	struct page *page = NULL;
> >  
> >  	*did_some_progress = 0;
> >  
> >  	/*
> > -	 * Acquire the oom lock.  If that fails, somebody else is
> > -	 * making progress for us.
> > +	 * This allocating task can become the OOM victim itself at
> > +	 * any point before acquiring the lock.  In that case, exit
> > +	 * quickly and don't block on the lock held by another task
> > +	 * waiting for us to exit.
> >  	 */
> > -	if (!mutex_trylock(&oom_lock)) {
> > -		*did_some_progress = 1;
> > -		schedule_timeout_uninterruptible(1);
> > -		return NULL;
> > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > +		goto alloc;
> >  	}
> 
> When a thread group has 1000 threads and most of them are doing memory
> allocation requests, all of them will get fatal_signal_pending() == true
> when one of them is chosen by the OOM killer.
> This code will allow most of them to access memory reserves, won't it?

Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
any dying thread.  Thanks, I'll fix it in v2.
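
Something like this, probably (an untested sketch, not necessarily the
exact shape v2 will take):

	/* only the designated OOM victim gets reserve access */
	if (test_thread_flag(TIF_MEMDIE)) {
		alloc_flags |= ALLOC_NO_WATERMARKS;
		goto alloc;
	}
	if (mutex_lock_killable(&oom_lock))
		goto alloc;	/* killed while waiting: retry the freelists,
				 * but with the watermarks left in place */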

> > @@ -2383,12 +2382,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> >  		if (gfp_mask & __GFP_THISNODE)
> >  			goto out;
> >  	}
> > -	/* Exhausted what can be done so it's blamo time */
> > -	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
> > -			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > +
> > +	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
> >  		*did_some_progress = 1;
> > +	} else {
> > +		/* Oops, these shouldn't happen with the OOM killer disabled */
> > +		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> > +			*did_some_progress = 1;
> > +	}
> 
> I think GFP_NOFAIL allocations need to involve the OOM killer rather
> than pretend that forward progress was made. If all in-flight
> allocation requests are GFP_NOFAIL, the system will lock up.

Hm?  They do involve the OOM killer, but once userspace is frozen for
suspend/hibernate we shouldn't kill and thaw random tasks anymore as
that might corrupt the memory snapshot, so nofail allocations are a
bug at this point.
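
To be more concrete, out_of_memory() bails out early while the killer
is disabled, which is the only way the WARN_ON_ONCE() above can fire;
roughly (simplified):

	bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
			   int order, nodemask_t *nodemask, bool force_kill)
	{
		/* suspend/hibernate has frozen userspace and disabled us */
		if (oom_killer_disabled)
			return false;	/* caller warns on __GFP_NOFAIL */
		/* ... victim selection and the bounded wait ... */
		return true;
	}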

> After all, if we wait for OOM killer progress before retrying, I think
> we should involve the OOM killer after some bounded timeout regardless of
> gfp flags, and let the OOM killer kill more threads after another bounded
> timeout. Otherwise, these corner cases will lock up the system.

Giving nofail allocations access to emergency reserves targets this
problem, but I agree with you that it's still possible for the system
to lock up once those reserves are consumed and no task has made enough
forward progress to release memory.  It is unlikely, but possible.

I will probably come back to the OOM victim timeout patch some time in
the future as that seems more robust.  It would also drastically
simplify memcg OOM handling.  But that patch was controversial in the
past and seemed beyond the scope of this patch set.
Johannes Weiner March 26, 2015, 11:28 a.m. UTC | #4
On Wed, Mar 25, 2015 at 06:01:48PM +0100, Vlastimil Babka wrote:
> On 03/25/2015 03:15 PM, Tetsuo Handa wrote:
> >Johannes Weiner wrote:
> >>diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> >>index 5cfda39b3268..e066ac7353a4 100644
> >>--- a/mm/oom_kill.c
> >>+++ b/mm/oom_kill.c
> >>@@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >>  		killed = 1;
> >>  	}
> >>  out:
> >>+	if (test_thread_flag(TIF_MEMDIE))
> >>+		return true;
> >>  	/*
> >>-	 * Give the killed threads a good chance of exiting before trying to
> >>-	 * allocate memory again.
> >>+	 * Wait for any outstanding OOM victims to die.  In rare cases
> >>+	 * victims can get stuck behind the allocating tasks, so the
> >>+	 * wait needs to be bounded.  It's crude alright, but cheaper
> >>+	 * than keeping a global dependency tree between all tasks.
> >>  	 */
> >>-	if (killed)
> >>-		schedule_timeout_killable(1);
> >>+	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
> >>
> >>  	return true;
> >>  }
> >
> >out_of_memory() returning true after a bounded wait effectively means
> >that the caller waits forever, without choosing a subsequent OOM victim,
> >when the first OOM victim fails to die. The system will lock up, won't it?
> 
> And after patch 12, does this mean that you may not be waiting long
> enough for the victim to die before you fail the allocation prematurely?
> I can imagine there would be situations where the victim is not
> deadlocked but still takes more than HZ to finish, no?

Arguably, it is reasonable to fail allocations once the OOM victim has
been stuck for over a second and the OOM reserves have been depleted.

On the other hand, we don't need to play it that tight, because that
timeout is only targeted at the victim-blocked-on-alloc situations,
which aren't all that common.  Something like 5 seconds should still
be okay.
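
I.e. the bounded wait in out_of_memory() would simply grow a bigger
constant, e.g. (sketch):

	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims),
			   5 * HZ);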
Michal Hocko March 26, 2015, 2:32 p.m. UTC | #5
On Thu 26-03-15 07:24:45, Johannes Weiner wrote:
> On Wed, Mar 25, 2015 at 11:15:48PM +0900, Tetsuo Handa wrote:
> > Johannes Weiner wrote:
[...]
> > >  	/*
> > > -	 * Acquire the oom lock.  If that fails, somebody else is
> > > -	 * making progress for us.
> > > +	 * This allocating task can become the OOM victim itself at
> > > +	 * any point before acquiring the lock.  In that case, exit
> > > +	 * quickly and don't block on the lock held by another task
> > > +	 * waiting for us to exit.
> > >  	 */
> > > -	if (!mutex_trylock(&oom_lock)) {
> > > -		*did_some_progress = 1;
> > > -		schedule_timeout_uninterruptible(1);
> > > -		return NULL;
> > > +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> > > +		alloc_flags |= ALLOC_NO_WATERMARKS;
> > > +		goto alloc;
> > >  	}
> > 
> > When a thread group has 1000 threads and most of them are doing memory
> > allocation requests, all of them will get fatal_signal_pending() == true
> > when one of them is chosen by the OOM killer.
> > This code will allow most of them to access memory reserves, won't it?
> 
> Ah, good point!  Only TIF_MEMDIE should get reserve access, not just
> any dying thread.  Thanks, I'll fix it in v2.

Do you plan to post this v2 here for review?

[...]
Michal Hocko March 26, 2015, 3:58 p.m. UTC | #6
On Wed 25-03-15 02:17:12, Johannes Weiner wrote:
> There is not much point in rushing back to the freelists and burning
> CPU cycles in direct reclaim when somebody else is in the process of
> OOM killing, or right after issuing a kill ourselves, because it could
> take some time for the OOM victim to release memory.

Yes, this makes sense and is better than what we have now. The
question is how long we should wait. I can see you have gone with HZ.
What is that value based on? Has your testing shown that the OOM victim
manages to die within a second most of the time?

I do not want to get into a discussion of which value is best, but I
would expect a larger value. Most OOM victims are not blocked, so they
would wake up soon. This is a safety net for those that are blocked, and
I do not think we have to expedite those rare cases; we should rather
optimize for "regular" OOM situations. How about 10-30s?

> This is a very cold error path, so there is not much hurry.  Use the
> OOM victim waitqueue to wait for victims to actually exit, which is a
> solid signal that the memory pinned by those tasks has been released.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/oom_kill.c   | 11 +++++++----
>  mm/page_alloc.c | 42 +++++++++++++++++++++++++-----------------
>  2 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5cfda39b3268..e066ac7353a4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		killed = 1;
>  	}
>  out:
> +	if (test_thread_flag(TIF_MEMDIE))
> +		return true;
>  	/*
> -	 * Give the killed threads a good chance of exiting before trying to
> -	 * allocate memory again.
> +	 * Wait for any outstanding OOM victims to die.  In rare cases
> +	 * victims can get stuck behind the allocating tasks, so the
> +	 * wait needs to be bounded.  It's crude alright, but cheaper
> +	 * than keeping a global dependency tree between all tasks.
>  	 */
> -	if (killed)
> -		schedule_timeout_killable(1);
> +	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
>  
>  	return true;
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c1224ba45548..9ce9c4c083a0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2330,30 +2330,29 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  }
>  
>  static inline struct page *
> -__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> +__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  	const struct alloc_context *ac, unsigned long *did_some_progress)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  
>  	*did_some_progress = 0;
>  
>  	/*
> -	 * Acquire the oom lock.  If that fails, somebody else is
> -	 * making progress for us.
> +	 * This allocating task can become the OOM victim itself at
> +	 * any point before acquiring the lock.  In that case, exit
> +	 * quickly and don't block on the lock held by another task
> +	 * waiting for us to exit.
>  	 */
> -	if (!mutex_trylock(&oom_lock)) {
> -		*did_some_progress = 1;
> -		schedule_timeout_uninterruptible(1);
> -		return NULL;
> +	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
> +		alloc_flags |= ALLOC_NO_WATERMARKS;
> +		goto alloc;
>  	}
>  
>  	/*
> -	 * Go through the zonelist yet one more time, keep very high watermark
> -	 * here, this is only to catch a parallel oom killing, we must fail if
> -	 * we're still under heavy pressure.
> +	 * While we have been waiting for the lock, the previous OOM
> +	 * kill might have released enough memory for the both of us.
>  	 */
> -	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
> -					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
> +	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  	if (page)
>  		goto out;
>  
> @@ -2383,12 +2382,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  		if (gfp_mask & __GFP_THISNODE)
>  			goto out;
>  	}
> -	/* Exhausted what can be done so it's blamo time */
> -	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
> -			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> +
> +	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
>  		*did_some_progress = 1;
> +	} else {
> +		/* Oops, these shouldn't happen with the OOM killer disabled */
> +		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> +			*did_some_progress = 1;
> +	}
>  out:
>  	mutex_unlock(&oom_lock);
> +alloc:
> +	if (!page)
> +		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> +
>  	return page;
>  }
>  
> @@ -2775,7 +2782,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  	/* Reclaim has failed us, start killing things */
> -	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> +	page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags, ac,
> +				     &did_some_progress);
>  	if (page)
>  		goto got_pg;
>  
> -- 
> 2.3.3
>
Johannes Weiner March 26, 2015, 6:23 p.m. UTC | #7
On Thu, Mar 26, 2015 at 04:58:46PM +0100, Michal Hocko wrote:
> On Wed 25-03-15 02:17:12, Johannes Weiner wrote:
> > There is not much point in rushing back to the freelists and burning
> > CPU cycles in direct reclaim when somebody else is in the process of
> > OOM killing, or right after issuing a kill ourselves, because it could
> > take some time for the OOM victim to release memory.
> 
> Yes, this makes sense and is better than what we have now. The
> question is how long we should wait. I can see you have gone with HZ.
> What is that value based on? Has your testing shown that the OOM victim
> manages to die within a second most of the time?
> 
> I do not want to get into a discussion of which value is best, but I
> would expect a larger value. Most OOM victims are not blocked, so they
> would wake up soon. This is a safety net for those that are blocked, and
> I do not think we have to expedite those rare cases; we should rather
> optimize for "regular" OOM situations. How about 10-30s?

Yup, I agree with that reasoning.  We can certainly go higher than HZ.

However, we should probably try to stay within the thresholds of any
lock/hang detection watchdogs, which on a higher level includes the
user itself, who might get confused if the machine hangs for 30s.

As I replied to Vlastimil, once the OOM victim hangs for several
seconds without a deadlock, failing the allocation wouldn't seem
entirely unreasonable, either.

But yes, something like 5-10s would still sound good to me.

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5cfda39b3268..e066ac7353a4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -711,12 +711,15 @@  bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		killed = 1;
 	}
 out:
+	if (test_thread_flag(TIF_MEMDIE))
+		return true;
 	/*
-	 * Give the killed threads a good chance of exiting before trying to
-	 * allocate memory again.
+	 * Wait for any outstanding OOM victims to die.  In rare cases
+	 * victims can get stuck behind the allocating tasks, so the
+	 * wait needs to be bounded.  It's crude alright, but cheaper
+	 * than keeping a global dependency tree between all tasks.
 	 */
-	if (killed)
-		schedule_timeout_killable(1);
+	wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), HZ);
 
 	return true;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c1224ba45548..9ce9c4c083a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2330,30 +2330,29 @@  void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 }
 
 static inline struct page *
-__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	const struct alloc_context *ac, unsigned long *did_some_progress)
 {
-	struct page *page;
+	struct page *page = NULL;
 
 	*did_some_progress = 0;
 
 	/*
-	 * Acquire the oom lock.  If that fails, somebody else is
-	 * making progress for us.
+	 * This allocating task can become the OOM victim itself at
+	 * any point before acquiring the lock.  In that case, exit
+	 * quickly and don't block on the lock held by another task
+	 * waiting for us to exit.
 	 */
-	if (!mutex_trylock(&oom_lock)) {
-		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
-		return NULL;
+	if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
+		alloc_flags |= ALLOC_NO_WATERMARKS;
+		goto alloc;
 	}
 
 	/*
-	 * Go through the zonelist yet one more time, keep very high watermark
-	 * here, this is only to catch a parallel oom killing, we must fail if
-	 * we're still under heavy pressure.
+	 * While we have been waiting for the lock, the previous OOM
+	 * kill might have released enough memory for the both of us.
 	 */
-	page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
-					ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto out;
 
@@ -2383,12 +2382,20 @@  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (gfp_mask & __GFP_THISNODE)
 			goto out;
 	}
-	/* Exhausted what can be done so it's blamo time */
-	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
-			|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+
+	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
 		*did_some_progress = 1;
+	} else {
+		/* Oops, these shouldn't happen with the OOM killer disabled */
+		if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+			*did_some_progress = 1;
+	}
 out:
 	mutex_unlock(&oom_lock);
+alloc:
+	if (!page)
+		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+
 	return page;
 }
 
@@ -2775,7 +2782,8 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* Reclaim has failed us, start killing things */
-	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+	page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags, ac,
+				     &did_some_progress);
 	if (page)
 		goto got_pg;