[00/12] mm: page_alloc: improve OOM mechanism and policy

Message ID	20150407141822.GA3262@cmpxchg.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> Date: Tue, 7 Apr 2015 10:18:22 -0400 From: Johannes Weiner <hannes@cmpxchg.org> To: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <david@fromorbit.com>, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Linus Torvalds <torvalds@linux-foundation.org>, Andrew Morton <akpm@linux-foundation.org>, Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>, Huang Ying <ying.huang@intel.com>, Andrea Arcangeli <aarcange@redhat.com>, Theodore Ts'o <tytso@mit.edu> Subject: Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy Message-ID: <20150407141822.GA3262@cmpxchg.org> References: <1427264236-17249-1-git-send-email-hannes@cmpxchg.org> <20150326195822.GB28129@dastard> <20150327150509.GA21119@cmpxchg.org> <20150330003240.GB28621@dastard> <20150401151920.GB23824@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150401151920.GB23824@dhcp22.suse.cz> Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Johannes Weiner April 7, 2015, 2:18 p.m. UTC

On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> [...]
> > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > the allocator, even though many of them seem to have fallback code.
> > > My reasoning here is that if you *have* an exit strategy for failing
> > > allocations that is smarter than hanging, we should probably use that.
> > 
> > We already do that for allocations where we can handle failure in
> > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > tell the allocator to try really hard if we've already had a failure
> > and we are already in memory reclaim conditions (e.g. a shrinker
> > trying to clean dirty objects so they can be reclaimed).
> > 
> > From that perspective, I think that this patch set aims force us
> > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > more likely to fail, and b) provides no mechanism to "try harder"
> > when we really need the allocation to succeed.
> 
> You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> in your fallback case?

I would think __GFP_REPEAT would be more suitable here.  From the doc:

 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 * _might_ fail.  This depends upon the particular VM implementation.

so we can make the semantics of GFP_NOFS | __GFP_REPEAT such that they
are allowed to use the OOM killer and dip into the OOM reserves.

My question here would be: are there any NOFS allocations that *don't*
want this behavior?  Does it even make sense to require this separate
annotation or should we just make it the default?

The argument here was always that NOFS allocations are very limited in
their reclaim powers and will trigger OOM prematurely.  However, the
way we limit dirty memory these days forces most cache to be clean at
all times, and direct reclaim in general hasn't been allowed to issue
page writeback for quite some time.  So these days, NOFS reclaim isn't
really weaker than regular direct reclaim.  The only exception is that
it might block writeback, so we'd go OOM if the only reclaimables left
were dirty pages against that filesystem.  That should be acceptable.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Tetsuo Handa April 11, 2015, 7:29 a.m. UTC | #1

Johannes Weiner wrote:
> The argument here was always that NOFS allocations are very limited in
> their reclaim powers and will trigger OOM prematurely.  However, the
> way we limit dirty memory these days forces most cache to be clean at
> all times, and direct reclaim in general hasn't been allowed to issue
> page writeback for quite some time.  So these days, NOFS reclaim isn't
> really weaker than regular direct reclaim.  The only exception is that
> it might block writeback, so we'd go OOM if the only reclaimables left
> were dirty pages against that filesystem.  That should be acceptable.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47981c5e54c3..fe3cb2b0b85b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		/* The OOM killer does not needlessly kill tasks for lowmem */
>  		if (ac->high_zoneidx < ZONE_NORMAL)
>  			goto out;
> -		/* The OOM killer does not compensate for IO-less reclaim */
> -		if (!(gfp_mask & __GFP_FS)) {
> -			/*
> -			 * XXX: Page reclaim didn't yield anything,
> -			 * and the OOM killer can't be invoked, but
> -			 * keep looping as per tradition.
> -			 */
> -			*did_some_progress = 1;
> -			goto out;
> -		}
>  		if (pm_suspended_storage())
>  			goto out;
>  		/* The OOM killer may not free memory on a specific node */
> 

I think this change will allow calling out_of_memory() which results in
"oom_kill_process() is trivially called via pagefault_out_of_memory()"
problem described in https://lkml.org/lkml/2015/3/18/219 .

I myself think that we should trigger OOM killer for !__GFP_FS allocation
in order to make forward progress in case the OOM victim is blocked.
So, my question about this change is whether we can accept involving OOM
killer from page fault, no matter how trivially OOM killer will kill some
process?
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Hocko April 13, 2015, 12:46 p.m. UTC | #2

[Sorry for a late reply]

On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> > [...]
> > > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > > the allocator, even though many of them seem to have fallback code.
> > > > My reasoning here is that if you *have* an exit strategy for failing
> > > > allocations that is smarter than hanging, we should probably use that.
> > > 
> > > We already do that for allocations where we can handle failure in
> > > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > > tell the allocator to try really hard if we've already had a failure
> > > and we are already in memory reclaim conditions (e.g. a shrinker
> > > trying to clean dirty objects so they can be reclaimed).
> > > 
> > > From that perspective, I think that this patch set aims force us
> > > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > > more likely to fail, and b) provides no mechanism to "try harder"
> > > when we really need the allocation to succeed.
> > 
> > You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> > in your fallback case?
> 
> I would think __GFP_REPEAT would be more suitable here.  From the doc:
> 
>  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
>  * _might_ fail.  This depends upon the particular VM implementation.
> 
> so we can make the semantics of GFP_NOFS | __GFP_REPEAT such that they
> are allowed to use the OOM killer and dip into the OOM reserves.

__GFP_REPEAT is quite subtle already. It makes a difference only for
high order allocations and it is not clear to me why it should imply OOM
killer for small orders now. Or did you suggest making it special only
with GFP_NOFS? That sounds even more ugly.

AFAIU, David wasn't asking for the OOM killer as much as he was
interested in getting access to a small amount of reserves in order to
make a progress. __GFP_HIGH is there for this purpose.

> My question here would be: are there any NOFS allocations that *don't*
> want this behavior?  Does it even make sense to require this separate
> annotation or should we just make it the default?
> 
> The argument here was always that NOFS allocations are very limited in
> their reclaim powers and will trigger OOM prematurely.  However, the
> way we limit dirty memory these days forces most cache to be clean at
> all times, and direct reclaim in general hasn't been allowed to issue
> page writeback for quite some time.  So these days, NOFS reclaim isn't
> really weaker than regular direct reclaim. 

What about [di]cache and some others fs specific shrinkers (and heavy
metadata loads)?

> The only exception is that
> it might block writeback, so we'd go OOM if the only reclaimables left
> were dirty pages against that filesystem.  That should be acceptable.

OOM killer is hardly acceptable by most users I've heard from. OOM
killer is the _last_ resort and if the allocation is restricted then
we shouldn't use the big hammer. The allocator might use __GFP_HIGH to
get access to memory reserves if it can fail or __GFP_NOFAIL if it
cannot. With your patches the NOFAIL case would get an access to memory
reserves as well. So I do not really see a reason to change GFP_NOFS vs.
OOM killer semantic.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 47981c5e54c3..fe3cb2b0b85b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		/* The OOM killer does not needlessly kill tasks for lowmem */
>  		if (ac->high_zoneidx < ZONE_NORMAL)
>  			goto out;
> -		/* The OOM killer does not compensate for IO-less reclaim */
> -		if (!(gfp_mask & __GFP_FS)) {
> -			/*
> -			 * XXX: Page reclaim didn't yield anything,
> -			 * and the OOM killer can't be invoked, but
> -			 * keep looping as per tradition.
> -			 */
> -			*did_some_progress = 1;
> -			goto out;
> -		}
>  		if (pm_suspended_storage())
>  			goto out;
>  		/* The OOM killer may not free memory on a specific node */

Michal Hocko April 13, 2015, 12:49 p.m. UTC | #3

On Sat 11-04-15 16:29:26, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim.  The only exception is that
> > it might block writeback, so we'd go OOM if the only reclaimables left
> > were dirty pages against that filesystem.  That should be acceptable.
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 47981c5e54c3..fe3cb2b0b85b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2367,16 +2367,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  		/* The OOM killer does not needlessly kill tasks for lowmem */
> >  		if (ac->high_zoneidx < ZONE_NORMAL)
> >  			goto out;
> > -		/* The OOM killer does not compensate for IO-less reclaim */
> > -		if (!(gfp_mask & __GFP_FS)) {
> > -			/*
> > -			 * XXX: Page reclaim didn't yield anything,
> > -			 * and the OOM killer can't be invoked, but
> > -			 * keep looping as per tradition.
> > -			 */
> > -			*did_some_progress = 1;
> > -			goto out;
> > -		}
> >  		if (pm_suspended_storage())
> >  			goto out;
> >  		/* The OOM killer may not free memory on a specific node */
> > 
> 
> I think this change will allow calling out_of_memory() which results in
> "oom_kill_process() is trivially called via pagefault_out_of_memory()"
> problem described in https://lkml.org/lkml/2015/3/18/219 .
> 
> I myself think that we should trigger OOM killer for !__GFP_FS allocation
> in order to make forward progress in case the OOM victim is blocked.
> So, my question about this change is whether we can accept involving OOM
> killer from page fault, no matter how trivially OOM killer will kill some
> process?

We trigger OOM killer from the page fault path for ages. In fact the memcg
will trigger memcg OOM killer _only_ from the page fault path because
this context is safe as we do not sit on any locks at the time.

Dave Chinner April 14, 2015, 12:11 a.m. UTC | #4

On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
> [Sorry for a late reply]
> 
> On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> > On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > My question here would be: are there any NOFS allocations that *don't*
> > want this behavior?  Does it even make sense to require this separate
> > annotation or should we just make it the default?
> > 
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim. 
> 
> What about [di]cache and some others fs specific shrinkers (and heavy
> metadata loads)?

We don't do direct reclaim for fs shrinkers in GFP_NOFS context,
either.

*HOWEVER*

The shrinker reclaim we can not execute is deferred to the next
context that can do the reclaim, which is usually kswapd. So the
reclaim gets done according to the GFP_NOFS memory pressure that is
occurring, it is just done in a different context...

> > The only exception is that
> > it might block writeback, so we'd go OOM if the only reclaimables left
> > were dirty pages against that filesystem.  That should be acceptable.
> 
> OOM killer is hardly acceptable by most users I've heard from. OOM
> killer is the _last_ resort and if the allocation is restricted then
> we shouldn't use the big hammer. The allocator might use __GFP_HIGH to
> get access to memory reserves if it can fail or __GFP_NOFAIL if it
> cannot. With your patches the NOFAIL case would get an access to memory
> reserves as well. So I do not really see a reason to change GFP_NOFS vs.
> OOM killer semantic.

So, really, what we want is something like:

#define __GFP_USE_LOWMEM_RESERVE	__GFP_HIGH

So that it documents the code that is using it effectively and we
can find them easily with cscope/grep?

Cheers,

Dave.

Michal Hocko April 14, 2015, 7:20 a.m. UTC | #5

On Tue 14-04-15 10:11:18, Dave Chinner wrote:
> On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
> > [Sorry for a late reply]
> > 
> > On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> > > On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > > My question here would be: are there any NOFS allocations that *don't*
> > > want this behavior?  Does it even make sense to require this separate
> > > annotation or should we just make it the default?
> > > 
> > > The argument here was always that NOFS allocations are very limited in
> > > their reclaim powers and will trigger OOM prematurely.  However, the
> > > way we limit dirty memory these days forces most cache to be clean at
> > > all times, and direct reclaim in general hasn't been allowed to issue
> > > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > > really weaker than regular direct reclaim. 
> > 
> > What about [di]cache and some others fs specific shrinkers (and heavy
> > metadata loads)?
> 
> We don't do direct reclaim for fs shrinkers in GFP_NOFS context,
> either.

Yeah but we invoke fs shrinkers for the _regular_ direct reclaim (with
__GFP_FS), which was the point I've tried to make here.

> *HOWEVER*
> 
> The shrinker reclaim we can not execute is deferred to the next
> context that can do the reclaim, which is usually kswapd. So the
> reclaim gets done according to the GFP_NOFS memory pressure that is
> occurring, it is just done in a different context...

Right, deferring to kswapd is the reason why I think the direct reclaim
shouldn't invoke OOM killer in this context because that would be
premature - as kswapd still can make some progress. Sorry for not being
more clear.

> > > The only exception is that
> > > it might block writeback, so we'd go OOM if the only reclaimables left
> > > were dirty pages against that filesystem.  That should be acceptable.
> > 
> > OOM killer is hardly acceptable by most users I've heard from. OOM
> > killer is the _last_ resort and if the allocation is restricted then
> > we shouldn't use the big hammer. The allocator might use __GFP_HIGH to
> > get access to memory reserves if it can fail or __GFP_NOFAIL if it
> > cannot. With your patches the NOFAIL case would get an access to memory
> > reserves as well. So I do not really see a reason to change GFP_NOFS vs.
> > OOM killer semantic.
> 
> So, really, what we want is something like:
> 
> #define __GFP_USE_LOWMEM_RESERVE	__GFP_HIGH
> 
> So that it documents the code that is using it effectively and we
> can find them easily with cscope/grep?

I wouldn't be opposed. To be honest I was never fond of __GFP_HIGH. The
naming is counterintuitive. So I would rather go with renaminag it. We do
not have that many users in the tree.
git grep "GFP_HIGH\>" | wc -l
40

Johannes Weiner April 14, 2015, 10:36 a.m. UTC | #6

On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
> [Sorry for a late reply]
> 
> On Tue 07-04-15 10:18:22, Johannes Weiner wrote:
> > On Wed, Apr 01, 2015 at 05:19:20PM +0200, Michal Hocko wrote:
> > > On Mon 30-03-15 11:32:40, Dave Chinner wrote:
> > > > On Fri, Mar 27, 2015 at 11:05:09AM -0400, Johannes Weiner wrote:
> > > [...]
> > > > > GFP_NOFS sites are currently one of the sites that can deadlock inside
> > > > > the allocator, even though many of them seem to have fallback code.
> > > > > My reasoning here is that if you *have* an exit strategy for failing
> > > > > allocations that is smarter than hanging, we should probably use that.
> > > > 
> > > > We already do that for allocations where we can handle failure in
> > > > GFP_NOFS conditions. It is, however, somewhat useless if we can't
> > > > tell the allocator to try really hard if we've already had a failure
> > > > and we are already in memory reclaim conditions (e.g. a shrinker
> > > > trying to clean dirty objects so they can be reclaimed).
> > > > 
> > > > From that perspective, I think that this patch set aims force us
> > > > away from handling fallbacks ourselves because a) it makes GFP_NOFS
> > > > more likely to fail, and b) provides no mechanism to "try harder"
> > > > when we really need the allocation to succeed.
> > > 
> > > You can ask for this "try harder" by __GFP_HIGH flag. Would that help
> > > in your fallback case?
> > 
> > I would think __GFP_REPEAT would be more suitable here.  From the doc:
> > 
> >  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> >  * _might_ fail.  This depends upon the particular VM implementation.
> > 
> > so we can make the semantics of GFP_NOFS | __GFP_REPEAT such that they
> > are allowed to use the OOM killer and dip into the OOM reserves.
> 
> __GFP_REPEAT is quite subtle already.  It makes a difference only
> for high order allocations

That's an implementation detail, owed to the fact that smaller orders
already imply that behavior.  That doesn't change the semantics.  And
people currently *use* it all over the tree for small orders, because
of how the flag is defined in gfp.h; not because of how it's currently
implemented.

> and it is not clear to me why it should imply OOM killer for small
> orders now.  Or did you suggest making it special only with
> GFP_NOFS?  That sounds even more ugly.

Small orders already invoke the OOM killer.  I suggested using this
flag to override the specialness of GFP_NOFS not OOM killing - in
response to whether we can provide an annotation to make some GFP_NOFS
sites more robust.

This is exactly what __GFP_REPEAT is: try the allocation harder than
you would without this flag.  It identifies a caller that is willing
to put in extra effort or be more aggressive because the allocation is
more important than other allocations of the otherwise same gfp_mask.

> AFAIU, David wasn't asking for the OOM killer as much as he was
> interested in getting access to a small amount of reserves in order to
> make a progress. __GFP_HIGH is there for this purpose.

That's not just any reserve pool available to the generic caller, it's
the reserve pool for interrupts, which can not wait and replenish it.
It relies on kswapd to run soon after the interrupt, or right away on
SMP.  But locks held in the filesystem can hold up kswapd (the reason
we even still perform direct reclaim) so NOFS allocs shouldn't use it.

[hannes@dexter linux]$ git grep '__GFP_HIGH\b' | wc -l
39
[hannes@dexter linux]$ git grep GFP_ATOMIC | wc -l
4324

Interrupts have *no other option*.  It's misguided to deplete their
reserves, cause loss of network packets, loss of input events, from
allocations that can actually perform reclaim and have perfectly
acceptable fallback strategies in the caller.

Generally, for any reserve system there must be a way to replenish it.
For interrupts it's kswapd, for the OOM reserves I proposed it's the
OOM victim exiting soon after the allocation, if not right away.

__GFP_NOFAIL is the odd one out here because accessing the system's
emergency reserves without any prospect of near-future replenishing is
just slightly better than deadlocking right away.  Which is why this
reserve access can not be separated out: if you can do *anything*
better than hanging, do it.  If not, use __GFP_NOFAIL.

> > My question here would be: are there any NOFS allocations that *don't*
> > want this behavior?  Does it even make sense to require this separate
> > annotation or should we just make it the default?
> > 
> > The argument here was always that NOFS allocations are very limited in
> > their reclaim powers and will trigger OOM prematurely.  However, the
> > way we limit dirty memory these days forces most cache to be clean at
> > all times, and direct reclaim in general hasn't been allowed to issue
> > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > really weaker than regular direct reclaim. 
> 
> What about [di]cache and some others fs specific shrinkers (and heavy
> metadata loads)?

My bad, I forgot about those.  But it doesn't really change the basic
question of whether we want to change the GFP_NOFS default or merely
annotate individual sites that want to try harder.

> > The only exception is that
> > it might block writeback, so we'd go OOM if the only reclaimables left
> > were dirty pages against that filesystem.  That should be acceptable.
> 
> OOM killer is hardly acceptable by most users I've heard from. OOM
> killer is the _last_ resort and if the allocation is restricted then
> we shouldn't use the big hammer.

We *are* talking about the last resort for these allocations!  There
is nothing else we can do to avoid allocation failure at this point.
Absent a reservation system, we have the choice between failing after
reclaim - which Dave said was too fragile for XFS - or OOM killing.

Or continue to deadlock in the allocator of course if we can't figure
out what the filesystem side actually wants given the current options.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Michal Hocko April 14, 2015, 2:23 p.m. UTC | #7

On Tue 14-04-15 06:36:25, Johannes Weiner wrote:
> On Mon, Apr 13, 2015 at 02:46:14PM +0200, Michal Hocko wrote:
[...]
> > AFAIU, David wasn't asking for the OOM killer as much as he was
> > interested in getting access to a small amount of reserves in order to
> > make a progress. __GFP_HIGH is there for this purpose.
> 
> That's not just any reserve pool available to the generic caller, it's
> the reserve pool for interrupts, which can not wait and replenish it.
> It relies on kswapd to run soon after the interrupt, or right away on
> SMP.  But locks held in the filesystem can hold up kswapd (the reason
> we even still perform direct reclaim) so NOFS allocs shouldn't use it.
> 
> [hannes@dexter linux]$ git grep '__GFP_HIGH\b' | wc -l
> 39
> [hannes@dexter linux]$ git grep GFP_ATOMIC | wc -l
> 4324
> 
> Interrupts have *no other option*. 

Atomic context in general can ALLOC_HARDER so it has an access to
additional reserves wrt. __GFP_HIGH|__GFP_WAIT.

> It's misguided to deplete their
> reserves, cause loss of network packets, loss of input events, from
> allocations that can actually perform reclaim and have perfectly
> acceptable fallback strategies in the caller.

OK, I thought that it was clear that the proposed __GFP_HIGH is a
fallback strategy for those paths which cannot do much better. Not a
random solution for "this shouldn't fail to eagerly".

> Generally, for any reserve system there must be a way to replenish it.
> For interrupts it's kswapd, for the OOM reserves I proposed it's the
> OOM victim exiting soon after the allocation, if not right away.

And my understanding was that the fallback mode would be used in the
context which would lead to release of the fs pressure thus releasing a
memory as well.

> __GFP_NOFAIL is the odd one out here because accessing the system's
> emergency reserves without any prospect of near-future replenishing is
> just slightly better than deadlocking right away.  Which is why this
> reserve access can not be separated out: if you can do *anything*
> better than hanging, do it.  If not, use __GFP_NOFAIL.

Agreed.
 
> > > My question here would be: are there any NOFS allocations that *don't*
> > > want this behavior?  Does it even make sense to require this separate
> > > annotation or should we just make it the default?
> > > 
> > > The argument here was always that NOFS allocations are very limited in
> > > their reclaim powers and will trigger OOM prematurely.  However, the
> > > way we limit dirty memory these days forces most cache to be clean at
> > > all times, and direct reclaim in general hasn't been allowed to issue
> > > page writeback for quite some time.  So these days, NOFS reclaim isn't
> > > really weaker than regular direct reclaim. 
> > 
> > What about [di]cache and some others fs specific shrinkers (and heavy
> > metadata loads)?
> 
> My bad, I forgot about those.  But it doesn't really change the basic
> question of whether we want to change the GFP_NOFS default or merely
> annotate individual sites that want to try harder.

My understanding was the later one. If you look at page cache allocations
which use mapping_gfp_mask (e.g. xfs is using GFP_NOFS for that context
all the time) then those do not really have to try harder.

> > > The only exception is that
> > > it might block writeback, so we'd go OOM if the only reclaimables left
> > > were dirty pages against that filesystem.  That should be acceptable.
> > 
> > OOM killer is hardly acceptable by most users I've heard from. OOM
> > killer is the _last_ resort and if the allocation is restricted then
> > we shouldn't use the big hammer.
> 
> We *are* talking about the last resort for these allocations!  There
> is nothing else we can do to avoid allocation failure at this point.
> Absent a reservation system, we have the choice between failing after
> reclaim - which Dave said was too fragile for XFS - or OOM killing.

As per other emails in this thread (e.g.
http://marc.info/?l=linux-mm&m=142897087230385&w=2), I understood that
the access to a small portion of emergency pool would be sufficient to
release the pressure and that sounds preferable to me over a destructive
reclaim attempts.

[00/12] mm: page_alloc: improve OOM mechanism and policy

Commit Message

Comments

Patch