
bcachefs: Switch to memalloc_flags_do() for vmalloc allocations

Message ID 20240828140638.3204253-1-kent.overstreet@linux.dev (mailing list archive)
State New
Series bcachefs: Switch to memalloc_flags_do() for vmalloc allocations

Commit Message

Kent Overstreet Aug. 28, 2024, 2:06 p.m. UTC
vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for
pte allocation, so doing vmalloc/kvmalloc allocations with reclaim
unsafe locks is a potential deadlock.

Note that we also want to use PF_MEMALLOC_NORECLAIM, not
PF_MEMALLOC_NOFS, because when we're doing allocations with btree locks
held we have a fallback available - drop locks and do a normal
GFP_KERNEL allocation. We don't want to be invoking reclaim with btree
locks held at all, since these are big shared locks and overall system
performance is sensitive to hold times.
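
For illustration, the pattern this is aiming at looks roughly like the
sketch below (written from memory; memalloc_flags_do() is assumed to run
the given expression with the extra PF_MEMALLOC_* flag set on the task,
and the unlock/relock step is only indicative):

	/* With btree locks held: try the allocation without entering reclaim. */
	new = memalloc_flags_do(PF_MEMALLOC_NORECLAIM,
				kvmalloc(bytes, GFP_KERNEL));
	if (!new) {
		/* Fallback: drop btree locks, then reclaim is safe again. */
		bch2_trans_unlock(trans);
		new = kvmalloc(bytes, GFP_KERNEL);
		if (!new)
			return -ENOMEM;
		/* ... relock/restart the transaction before continuing ... */
	}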

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/acl.c             |  5 ++--
 fs/bcachefs/btree_cache.c     |  3 ++-
 fs/bcachefs/btree_iter.h      | 46 +++++++++++++++++++----------------
 fs/bcachefs/btree_key_cache.c | 10 ++++----
 fs/bcachefs/ec.c              | 12 ++++-----
 fs/bcachefs/fs.c              |  8 ------
 6 files changed, 41 insertions(+), 43 deletions(-)

Comments

Matthew Wilcox Aug. 28, 2024, 6:48 p.m. UTC | #1
On Wed, Aug 28, 2024 at 10:06:36AM -0400, Kent Overstreet wrote:
> vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for
> pte allocation, so doing vmalloc/kvmalloc allocations with reclaim
> unsafe locks is a potential deadlock.

Kent, the approach you've taken with this was NACKed.  You merged it
anyway (!).  Now you're spreading this crap further, presumably in an effort
to make it harder to remove.

Stop it.  Work with us to come up with an acceptable approach.  I
think there is one that will work, but you need to listen to the people
who're giving you feedback because Linux is too big of a code-base for
you to understand everything.
Kent Overstreet Aug. 28, 2024, 7:11 p.m. UTC | #2
On Wed, Aug 28, 2024 at 07:48:43PM GMT, Matthew Wilcox wrote:
> On Wed, Aug 28, 2024 at 10:06:36AM -0400, Kent Overstreet wrote:
> > vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for
> > pte allocation, so doing vmalloc/kvmalloc allocations with reclaim
> > unsafe locks is a potential deadlock.
> 
> Kent, the approach you've taken with this was NACKed.  You merged it
> anyway (!).  Now you're spreading this crap further, presumably in an effort
> to make it harder to remove.

Excuse me? This is fixing a real issue which has been known for years.

It was decided _years_ ago that PF_MEMALLOC flags were how this was
going to be addressed.

> Stop it.  Work with us to come up with an acceptable approach.  I
> think there is one that will work, but you need to listen to the people
> who're giving you feedback because Linux is too big of a code-base for
> you to understand everything.

No, you guys need to stop pushing broken shit.
Michal Hocko Aug. 28, 2024, 7:26 p.m. UTC | #3
On Wed 28-08-24 15:11:19, Kent Overstreet wrote:
> On Wed, Aug 28, 2024 at 07:48:43PM GMT, Matthew Wilcox wrote:
> > On Wed, Aug 28, 2024 at 10:06:36AM -0400, Kent Overstreet wrote:
> > > vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for
> > > pte allocation, so doing vmalloc/kvmalloc allocations with reclaim
> > > unsafe locks is a potential deadlock.
> > 
> > Kent, the approach you've taken with this was NACKed.  You merged it
> > anyway (!).  Now you're spreading this crap further, presumably in an effort
> > to make it harder to remove.
> 
> Excuse me? This is fixing a real issue which has been known for years.

If you mean a lack of GFP_NOWAIT support in vmalloc then this is not a
bug but a lack of a feature. vmalloc has never promised to support this
allocation mode and a scoped gfp flag will not magically make it work
because there is a sleeping lock involved in an allocation path in some
cases.

If you really need this feature to be added then you should clearly
describe your usecase and listen to people who are familiar with the
vmalloc internals rather than heavily pushing your direction which
doesn't work anyway.

> It was decided _years_ ago that PF_MEMALLOC flags were how this was
> going to be addressed.

Nope! It has been decided that _some_ gfp flags are acceptable to be used
by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim
modifiers and other flags so these are indeed safe to be used that way.
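
(For reference, the existing scoped NOFS interface looks like this - a
minimal sketch; every allocation inside the section, including nested
ones, implicitly behaves as if __GFP_FS had been cleared:)

	unsigned int nofs_flags = memalloc_nofs_save();

	/* behaves as GFP_NOFS even though GFP_KERNEL is passed */
	p = kmalloc(size, GFP_KERNEL);

	memalloc_nofs_restore(nofs_flags);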

> > Stop it.  Work with us to come up with an acceptable approach.  I
> > think there is one that will work, but you need to listen to the people
> > who're giving you feedback because Linux is too big of a code-base for
> > you to understand everything.
> 
> No, you guys need to stop pushing broken shit.

This is not the way to work with other people. Seriously!
Kent Overstreet Aug. 28, 2024, 10:58 p.m. UTC | #4
On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote:
> On Wed 28-08-24 15:11:19, Kent Overstreet wrote:
> > On Wed, Aug 28, 2024 at 07:48:43PM GMT, Matthew Wilcox wrote:
> > > On Wed, Aug 28, 2024 at 10:06:36AM -0400, Kent Overstreet wrote:
> > > > vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for
> > > > pte allocation, so doing vmalloc/kvmalloc allocations with reclaim
> > > > unsafe locks is a potential deadlock.
> > > 
> > > Kent, the approach you've taken with this was NACKed.  You merged it
> > > anyway (!).  Now you're spreading this crap further, presumably in an effort
> > > to make it harder to remove.
> > 
> > Excuse me? This is fixing a real issue which has been known for years.
> 
> If you mean a lack of GFP_NOWAIT support in vmalloc then this is not a
> bug but a lack of feature. vmalloc has never promissed to support this
> allocation mode and a scoped gfp flag will not magically make it work
> because there is a sleeping lock involved in an allocation path in some
> cases.
> 
> If you really need this feature to be added then you should clearly
> describe your usecase and listen to people who are familiar with the
> vmalloc internals rather than heavily pushing your direction which
> doesn't work anyway.

Michal, I'm plenty familiar with the vmalloc internals. Given that you
didn't even seem to be aware of how it doesn't respect gfp flags, you
seem to be the person who hasn't been up to speed in this discussion.

> > It was decided _years_ ago that PF_MEMALLOC flags were how this was
> > going to be addressed.
> 
> Nope! It has been decided that _some_ gfp flags are acceptable to be used
> by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim
> modifiers and other flags so these are indeed safe to be used that way.

Decided by who?

Cite the list discussion, and the reasoning.

> > > Stop it.  Work with us to come up with an acceptable approach.  I
> > > think there is one that will work, but you need to listen to the people
> > > who're giving you feedback because Linux is too big of a code-base for
> > > you to understand everything.
> > 
> > No, you guys need to stop pushing broken shit.
> 
> This is not the way to work with other people. Seriously!

No, your patches and the reasoning you've been pushing are broken.

A safe interface is not one that "does not fail", a safe interface is
one that never invokes undefined behaviour - and generally that means
it'll return an error instead.

And programming in the kernel means that you check for errors, period.

This idea that you have of "safety" meaning "we'll just conveniently
remove or ignore the cases where GFP_NOFAIL would have to fail" is
complete and utter garbage.
Michal Hocko Aug. 29, 2024, 7:19 a.m. UTC | #5
On Wed 28-08-24 18:58:43, Kent Overstreet wrote:
> On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote:
> > On Wed 28-08-24 15:11:19, Kent Overstreet wrote:
> > > On Wed, Aug 28, 2024 at 07:48:43PM GMT, Matthew Wilcox wrote:
> > > > On Wed, Aug 28, 2024 at 10:06:36AM -0400, Kent Overstreet wrote:
> > > > > vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for
> > > > > pte allocation, so doing vmalloc/kvmalloc allocations with reclaim
> > > > > unsafe locks is a potential deadlock.
> > > > 
> > > > Kent, the approach you've taken with this was NACKed.  You merged it
> > > > anyway (!).  Now you're spreading this crap further, presumably in an effort
> > > > to make it harder to remove.
> > > 
> > > Excuse me? This is fixing a real issue which has been known for years.
> > 
> > If you mean a lack of GFP_NOWAIT support in vmalloc then this is not a
> > bug but a lack of feature. vmalloc has never promissed to support this
> > allocation mode and a scoped gfp flag will not magically make it work
> > because there is a sleeping lock involved in an allocation path in some
> > cases.
> > 
> > If you really need this feature to be added then you should clearly
> > describe your usecase and listen to people who are familiar with the
> > vmalloc internals rather than heavily pushing your direction which
> > doesn't work anyway.
> 
> Michal, I'm plenty familiar with the vmalloc internals. Given that you
> didn't even seem to be aware of how it doesn't respect gfp flags, you
> seem to be the person who hasn't been up to speed in this discussion.

GFP_NOWAIT is explicitly documented as unsupported
(__vmalloc_node_range_noprof). vmalloc internals are using
vmap_purge_lock and blocking notifiers (vmap_notify_list) in rare cases
so PF_MEMALLOC_NORECLAIM is not really sufficient to provide NOWAIT
semantics (this is really not just about page table allocations). There
might be other places that require blocking - I do not claim to be an
expert on the vmalloc allocator.

Just my 2 cents; do whatever you want with this information.

It seems that this discussion is not going to be really productive so I
will leave you here.

If you reconsider and realize that a productive discussion really
requires listening and respect as well, then get back and we can try again.

Good luck!
Michal Hocko Aug. 29, 2024, 11:08 a.m. UTC | #6
On Wed 28-08-24 18:58:43, Kent Overstreet wrote:
> On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote:
> > On Wed 28-08-24 15:11:19, Kent Overstreet wrote:
[...]
> > > It was decided _years_ ago that PF_MEMALLOC flags were how this was
> > > going to be addressed.
> > 
> > Nope! It has been decided that _some_ gfp flags are acceptable to be used
> > by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim
> > modifiers and other flags so these are indeed safe to be used that way.
> 
> Decided by who?

The semantics of the respective GFP flags, and their compatibility with
others that could be nested in the scope, decide.

Zone modifiers __GFP_DMA, __GFP_HIGHMEM, __GFP_DMA32 and __GFP_MOVABLE
would allow only __GFP_DMA to have scoped semantic because it is the
most restrictive of all of them (i.e. __GFP_DMA32 can be served from
__GFP_DMA but not the other way around), but nobody really requested that.

__GFP_RECLAIMABLE is slab allocator specific and nested allocations
cannot be assumed to have shrinkers, so this cannot really have scoped
semantic.

__GFP_WRITE only implies node spreading. Likely OK for scope interface,
nobody requested that.

__GFP_HARDWALL only to be used for user space allocations. Wouldn't break
anything if it had scoped interface but nobody requested that.

__GFP_THISNODE only to be used by allocators internally to define NUMA
placement strategy. Not safe for scoped interface as it changes the
failure semantic

__GFP_ACCOUNT defines memcg accounting. Generally usable from user
context and safe for scope interface in that context as it doesn't
change the failure nor reclaim semantic

__GFP_NO_OBJ_EXT internal flag not to be used outside of mm.

__GFP_HIGH gives access to memory reserves. It could be used for scope
interface but nobody requested that.

__GFP_MEMALLOC - already has a scope interface PF_MEMALLOC. This is not
really great though because it grants unbounded access to memory
reserves and that means that it is really tricky to see how many
allocations really can use reserves. It has been added because swap over
NFS had to guarantee forward progress and networking layer was not
prepared for that. Fundamentally this doesn't change the allocation nor
reclaim semantic so it is safe for a scope API.

__GFP_NOMEMALLOC used to override PF_MEMALLOC so a scoped interface
doesn't make much sense

__GFP_IO already has scope interface to drop this flag. It is safe
because it doesn't change failure semantic and it makes the reclaim
context more constrained so it is compatible with other reclaim
modifiers. Contrary it would be unsafe to have a scope interface to add
this flag because all GFP_NOIO nested allocations could deadlock

__GFP_FS. Similar to __GFP_IO.

__GFP_DIRECT_RECLAIM allows the allocation to sleep. A scoped interface to
set the flag is unsafe for any nested GFP_NOWAIT/GFP_ATOMIC requests which
might be called from within atomic contexts. A scoped interface to clear
the flag is unsafe because the __GFP_NOFAIL allocation mode doesn't
support requests without this flag, so any nested NOFAIL allocation would
break and see an unexpected and potentially unhandled failure mode.

__GFP_KSWAPD_RECLAIM controls whether kswapd is woken up. Doesn't change
the failure nor direct reclaim behavior. Scoped interface to set the
flag seems rather pointless and one to clear the bit dangerous because
it could put the MM into an unbalanced state as kswapd wouldn't be woken up.

__GFP_RETRY_MAYFAIL - changes the failure mode so it is fundamentally
incompatible with nested __GFP_NOFAIL allocations. Scoped interface to
clear the flag would be safe but probably pointless.

__GFP_NORETRY - same as above

__GFP_NOFAIL - incompatible with any nested GFP_NOWAIT/GFP_ATOMIC
allocations. One could argue that those are fine to see allocation
failure so this will not create any unexpected failure mode which is a
fair argument but what would be the actual usecase for setting all
nested allocations to NOFAIL mode when they likely have a failure mode?
Interface to clear the flag for the scope would be unsafe because all
nested NOFAIL allocations would get an unexpected failure mode.

__GFP_NOWARN safe to have scope interface both to set and clear the
flag. 

__GFP_COMP only to be used for high order allocations and changes the
tail pages tracking which would break any nested high order
request without the flag. So unsafe for the scope interface both to set
and clear the flag.

__GFP_ZERO changes the initialization and safe for scope interface. We
even have a global switch to do that for all allocations init_on_alloc

__GFP_NOLOCKDEP disables lockdep reclaim recursion detection. Safe for
scope interface AFAICS.
Kent Overstreet Aug. 29, 2024, 11:41 a.m. UTC | #7
On Thu, Aug 29, 2024 at 09:19:13AM GMT, Michal Hocko wrote:
> GFP_NOWAIT is explicitly documented as unsupported
> (__vmalloc_node_range_noprof). vmalloc internals are using
> vmap_purge_lock and blocking notifiers (vmap_notify_list) in rare cases
> so PF_MEMALLOC_NORECLAIM is not really sufficient to provide NOWAIT
> semantic (this is really not just about page tables allocations). There
> might be other places that require blocking - I do not claim to be an
> expert on the vmalloc allocator.

Sure, but those are easily fixed, if necessary. We'd also want to be
explicit on whether NORECLAIM needs to be NOWAIT or just NORECLAIM -
those are slightly different things.

As long as it's NORECLAIM that should be fine - it's really only an
issue if we need to map GFP_NOWAIT to PF_MEMALLOC_*.

> It seems that this discussion is not going to be really productive so I
> will leave you here.

I think you'll find these discussions become much more productive when
you're able to stick to the technical, and when you yourself are able to
listen to counterarguments. I remind you that we hash things out here on
the list; we don't resort to things like "this has been decided" or "the
mm community says"; you need to be properly justifying what you want to
do here.

And if people aren't listening to your arguments, that should be a hint
to you that maybe you aren't making good ones.
Kent Overstreet Aug. 29, 2024, 11:55 a.m. UTC | #8
On Thu, Aug 29, 2024 at 01:08:53PM GMT, Michal Hocko wrote:
> On Wed 28-08-24 18:58:43, Kent Overstreet wrote:
> > On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote:
> > > On Wed 28-08-24 15:11:19, Kent Overstreet wrote:
> [...]
> > > > It was decided _years_ ago that PF_MEMALLOC flags were how this was
> > > > going to be addressed.
> > > 
> > > Nope! It has been decided that _some_ gfp flags are acceptable to be used
> > > by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim
> > > modifiers and other flags so these are indeed safe to be used that way.
> > 
> > Decided by who?
> 
> Decides semantic of respective GFP flags and their compatibility with
> others that could be nested in the scope.

Well, that's a bit of commentary, at least.

The question is which of those could properly apply to a section, not a
callsite, and a PF_MEMALLOC_NOWAIT (similar to but not exactly the same
as PF_MEMALLOC_NORECLAIM) would be at the top of that list since we
already have a clear concept of sections where we're not allowed to
sleep.

And that tells us how to resolve GFP_NOFAIL with other conflicting
PF_MEMALLOC flags: GFP_NOFAIL loses.

It is a _bug_ if GFP_NOFAIL is accidentally used in a non sleepable
context, and properly labelling those sections to the allocator would
allow us to turn undefined behaviour into an error - _that_ would be
turning kmalloc() into a safe interface.

Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe
according to call path and allocation size, you still need to be
checking for failure - in the same way that you shouldn't be using
BUG_ON() if you cannot prove that the condition won't occur in real world
usage.

Given that, it's easy to see how to handle __GFP_RETRY_MAYFAIL and
__GFP_NORETRY: if they're applied to a context, then the usage is saying
"I need to attempt to run this section with some sort of latency
bounds", and GFP_NOFAIL should lose - as well as emitting a warning.

BTW, this is how you should be interpreting PF_MEMALLOC_NORECLAIM today:
"I have strong latency bounds here, but not so strict that it needs to
be strictly nonblocking".
Michal Hocko Aug. 29, 2024, 12:34 p.m. UTC | #9
On Thu 29-08-24 07:55:08, Kent Overstreet wrote:
> On Thu, Aug 29, 2024 at 01:08:53PM GMT, Michal Hocko wrote:
> > On Wed 28-08-24 18:58:43, Kent Overstreet wrote:
> > > On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote:
> > > > On Wed 28-08-24 15:11:19, Kent Overstreet wrote:
> > [...]
> > > > > It was decided _years_ ago that PF_MEMALLOC flags were how this was
> > > > > going to be addressed.
> > > > 
> > > > Nope! It has been decided that _some_ gfp flags are acceptable to be used
> > > > by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim
> > > > modifiers and other flags so these are indeed safe to be used that way.
> > > 
> > > Decided by who?
> > 
> > Decides semantic of respective GFP flags and their compatibility with
> > others that could be nested in the scope.
> 
> Well, that's a bit of commentary, at least.
> 
> The question is which of those could properly apply to a section, not a
> callsite, and a PF_MEMALLOC_NOWAIT (similar to but not exactly the same
> as PF_MEMALLOC_NORECLAIM) would be at the top of that list since we
> already have a clear concept of sections where we're not allowed to
> sleep.

Unfortunately a lack of __GFP_DIRECT_RECLAIM means both no reclaim and
no sleeping allowed for historical reasons. GFP_NOWAIT is both used from
atomic contexts and as an optimistic allocation attempt with a heavier
fallback allocation strategy. If you want NORECLAIM semantic then this
would need to be represented by different means than __GFP_DIRECT_RECLAIM
alone.

> And that tells us how to resolve GFP_NOFAIL with other conflicting
> PF_MEMALLOC flags: GFP_NOFAIL loses.
> 
> It is a _bug_ if GFP_NOFAIL is accidentally used in a non sleepable
> context, and properly labelling those sections to the allocator would
> allow us to turn undefined behaviour into an error - _that_ would be
> turning kmalloc() into a safe interface.

If your definition of safe includes an oops or, worse, a silent failure,
then yes. Not a really safe interface in my book, though. E.g. (just
randomly looking at GFP_NOFAIL users) btree_paths_realloc doesn't check
the return value and if it happened to be called from such a scope it
would have blown up. That code is safe without the scope though. There
are many other callsites which do not have failure paths.
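
(As a sketch of that failure mode - the structure and field names below are
made up, only the shape of the problem matters:)

	struct paths *p;

	/* Deep in a callee that has always relied on the NOFAIL guarantee: */
	p = kmalloc(sizeof(*p), GFP_KERNEL | __GFP_NOFAIL);
	p->nr = 0;	/* no NULL check - __GFP_NOFAIL "cannot" return NULL */

	/*
	 * If an outer PF_MEMALLOC_NORECLAIM scope silently strips
	 * __GFP_DIRECT_RECLAIM, this allocation gains a failure mode the
	 * callee was never written to handle, and the dereference oopses.
	 */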
Kent Overstreet Aug. 29, 2024, 12:42 p.m. UTC | #10
On Thu, Aug 29, 2024 at 02:34:08PM GMT, Michal Hocko wrote:
> On Thu 29-08-24 07:55:08, Kent Overstreet wrote:
> > On Thu, Aug 29, 2024 at 01:08:53PM GMT, Michal Hocko wrote:
> > > On Wed 28-08-24 18:58:43, Kent Overstreet wrote:
> > > > On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote:
> > > > > On Wed 28-08-24 15:11:19, Kent Overstreet wrote:
> > > [...]
> > > > > > It was decided _years_ ago that PF_MEMALLOC flags were how this was
> > > > > > going to be addressed.
> > > > > 
> > > > > Nope! It has been decided that _some_ gfp flags are acceptable to be used
> > > > > by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim
> > > > > modifiers and other flags so these are indeed safe to be used that way.
> > > > 
> > > > Decided by who?
> > > 
> > > Decides semantic of respective GFP flags and their compatibility with
> > > others that could be nested in the scope.
> > 
> > Well, that's a bit of commentary, at least.
> > 
> > The question is which of those could properly apply to a section, not a
> > callsite, and a PF_MEMALLOC_NOWAIT (similar to but not exactly the same
> > as PF_MEMALLOC_NORECLAIM) would be at the top of that list since we
> > already have a clear concept of sections where we're not allowed to
> > sleep.
> 
> Unfortunately a lack of __GFP_DIRECT_RECLAIM means both no reclaim and
> no sleeping allowed for historical reasons. GFP_NOWAIT is both used from
> atomic contexts and as an optimistic allocation attempt with a heavier
> fallback allocation strategy. If you want NORECLAIM semantic then this
> would need to be represented by different means than __GFP_DIRECT_RECLAIM
> alone.

I don't see it as particularly needed - the vmalloc locks you mentioned
previously just mean it's something worth considering. In my usage I
probably wouldn't care about those locks, but for keeping the API simple
we probably want just PF_MEMALLOC_NOWAIT (where those locks become
trylock).

> > And that tells us how to resolve GFP_NOFAIL with other conflicting
> > PF_MEMALLOC flags: GFP_NOFAIL loses.
> > 
> > It is a _bug_ if GFP_NOFAIL is accidentally used in a non sleepable
> > context, and properly labelling those sections to the allocator would
> > allow us to turn undefined behaviour into an error - _that_ would be
> > turning kmalloc() into a safe interface.
> 
> If your definition of safe includes an oops or worse silent failure
> then yes. Not really safe interface in my book though. E.g. (just
> randomly looking at GFP_NOFAIL users) btree_paths_realloc doesn't check
> the return value and if it happened to be called from such a scope it
> would have blown up. That code is safe without the scope though. There
> are many other callsites which do not have failure paths.

Yes, but that's unsafe anyways due to the max allocation size of
GFP_NOFAIL - I'll have to fix that.

Note that even if we got rid of the smaller max allocation size of
GFP_NOFAIL allocations we'd _still_ generally need error paths due to
the hard limit of INT_MAX, and integer overflow checking for array
allocations.
Dave Chinner Aug. 29, 2024, 2:27 p.m. UTC | #11
On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote:
> Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe
> according to call path and allocation size, you still need to be
> checking for failure - in the same way that you shouldn't be using
> BUG_ON() if you cannot prove that the condition won't occur in real wold
> usage.

We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
now. This was the default Irix kernel allocator behaviour (it had a
forwards progress guarantee and would never fail allocation unless
told it could do so). We've been using the same "guaranteed not to
fail" semantics on Linux since the original port started 25 years
ago via open-coded loops.

IOWs, __GFP_NOFAIL semantics have been production tested for a
couple of decades on Linux via XFS, and nobody here can argue that
XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL
as it is used by XFS is reliable and lives up to the "will not fail"
guarantee that it is supposed to have.

Fundamentally, __GFP_NOFAIL came about to replace the callers doing

	do {
		p = kmalloc(size);
	} while (!p);

so that they blocked until memory allocation succeeded. The call
sites do not check for failure, because -failure never occurs-.

The MM devs want to have visibility of these allocations - they may
not like them, but having __GFP_NOFAIL means it's trivial to audit
all the allocations that use these semantics.  IOWs, __GFP_NOFAIL
was created with an explicit guarantee that it -will not fail- for
normal allocation contexts so it could replace all the open-coded
will-not-fail allocation loops.

Given this guarantee, we recently removed these historic allocation
wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the
allocation call sites. There's nearly a hundred memory allocation
locations in XFS that are tagged with __GFP_NOFAIL.

If we're now going to have the "will not fail" guarantee taken away
from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can
it be used anywhere else that a "will not fail" guarantee is
required.

Put simply: __GFP_NOFAIL will be rendered completely useless if it
can fail due to external scoped memory allocation contexts.  This
will force us to revert all __GFP_NOFAIL allocations back to
open-coded will-not-fail loops.

This is not a step forwards for anyone.

-Dave.
Theodore Ts'o Aug. 30, 2024, 3:39 a.m. UTC | #12
On Fri, Aug 30, 2024 at 12:27:11AM +1000, Dave Chinner wrote:
> 
> We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> now. This was the default Irix kernel allocator behaviour (it had a
> forwards progress guarantee and would never fail allocation unless
> told it could do so). We've been using the same "guaranteed not to
> fail" semantics on Linux since the original port started 25 years
> ago via open-coded loops.

Ext3/ext4 doesn't have quite the history that XFS has --- it's only been
around for 23 years --- but we've also used __GFP_NOFAIL or its
moral equivalent, e.g.:

> 	do {
> 		p = kmalloc(size);
> 	while (!p);

For the entire existence of ext3.

> Put simply: __GFP_NOFAIL will be rendered completely useless if it
> can fail due to external scoped memory allocation contexts.  This
> will force us to revert all __GFP_NOFAIL allocations back to
> open-coded will-not-fail loops.

The same will be true for ext4.  And as Dave has said, the MM
developers want to have visibility into when file systems have basically
said, "if you can't allow us to allocate memory, our only alternatives
are to cause user data loss, crash the kernel, or loop forever; we will
choose the last".  The MM developers tried to make __GFP_NOFAIL go
away several years ago, and ext4 put the retry loop back.  As a result,
the compromise was that the MM developers restored __GFP_NOFAIL, and
the file system developers have done their best to reduce the use of
__GFP_NOFAIL as much as possible.

So if you try to break the GFP_NOFAIL promise, both xfs and ext4 will
go back to the retry loop.  And the MM devs will be sad, and they will
forcibly revert your change to *their* code, even if that means
breaking bcachefs.  Because otherwise, you will be breaking ext4 and
xfs, and so we will go back to using a retry loop, which will be worse
for Linux users.

Cheers,

					- Ted
Yafang Shao Aug. 30, 2024, 9:14 a.m. UTC | #13
On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote:
> > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe
> > according to call path and allocation size, you still need to be
> > checking for failure - in the same way that you shouldn't be using
> > BUG_ON() if you cannot prove that the condition won't occur in real wold
> > usage.
>
> We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> now. This was the default Irix kernel allocator behaviour (it had a
> forwards progress guarantee and would never fail allocation unless
> told it could do so). We've been using the same "guaranteed not to
> fail" semantics on Linux since the original port started 25 years
> ago via open-coded loops.
>
> IOWs, __GFP_NOFAIL semantics have been production tested for a
> couple of decades on Linux via XFS, and nobody here can argue that
> XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL
> as it is used by XFS is reliable and lives up to the "will not fail"
> guarantee that it is supposed to have.
>
> Fundamentally, __GFP_NOFAIL came about to replace the callers doing
>
>         do {
>                 p = kmalloc(size);
>         while (!p);
>
> so that they blocked until memory allocation succeeded. The call
> sites do not check for failure, because -failure never occurs-.
>
> The MM devs want to have visibility of these allocations - they may
> not like them, but having __GFP_NOFAIL means it's trivial to audit
> all the allocations that use these semantics.  IOWs, __GFP_NOFAIL
> was created with an explicit guarantee that it -will not fail- for
> normal allocation contexts so it could replace all the open-coded
> will-not-fail allocation loops..
>
> Given this guarantee, we recently removed these historic allocation
> wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the
> allocation call sites. There's nearly a hundred memory allocation
> locations in XFS that are tagged with __GFP_NOFAIL.
>
> If we're now going to have the "will not fail" guarantee taken away
> from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can
> it be used anywhere else that a "will not fail" guarantee it
> required.
>
> Put simply: __GFP_NOFAIL will be rendered completely useless if it
> can fail due to external scoped memory allocation contexts.  This
> will force us to revert all __GFP_NOFAIL allocations back to
> open-coded will-not-fail loops.
>
> This is not a step forwards for anyone.

Hello Dave,

I've noticed that XFS has increasingly replaced kmem_alloc() with
__GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of
__GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41
occurrences. In kmem_alloc(), there's an explicit
memalloc_retry_wait() to throttle the allocator under heavy memory
pressure, which aligns with your filesystem design. However, using
__GFP_NOFAIL removes this throttling mechanism, potentially causing
issues when the system is under heavy memory load. I'm concerned that
this shift might not be a beneficial trend.
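
(For context, the retry/throttle pattern in the old kmem_alloc() looked
roughly like the sketch below - simplified from memory, not an exact copy
of the XFS code:)

	do {
		ptr = kmalloc(size, lflags);
		if (ptr || (flags & KM_MAYFAIL))
			return ptr;
		/* back off so reclaim can make progress before retrying */
		memalloc_retry_wait(lflags);
	} while (1);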

We have been using XFS for our big data servers for years, and it has
consistently performed well with older kernels like 4.19.y. However,
after upgrading all our servers from 4.19.y to 6.1.y over the past two
years, we have frequently encountered livelock issues caused by memory
exhaustion. To mitigate this, we've had to limit the RSS of
applications, which isn't an ideal solution and represents a worrying
trend.
Vlastimil Babka Aug. 30, 2024, 3:25 p.m. UTC | #14
On 8/30/24 11:14, Yafang Shao wrote:
> On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@fromorbit.com> wrote:
> 
> Hello Dave,
> 
> I've noticed that XFS has increasingly replaced kmem_alloc() with
> __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of
> __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41
> occurrences. In kmem_alloc(), there's an explicit
> memalloc_retry_wait() to throttle the allocator under heavy memory
> pressure, which aligns with your filesystem design. However, using
> __GFP_NOFAIL removes this throttling mechanism, potentially causing
> issues when the system is under heavy memory load. I'm concerned that
> this shift might not be a beneficial trend.
> 
> We have been using XFS for our big data servers for years, and it has
> consistently performed well with older kernels like 4.19.y. However,
> after upgrading all our servers from 4.19.y to 6.1.y over the past two
> years, we have frequently encountered livelock issues caused by memory
> exhaustion. To mitigate this, we've had to limit the RSS of
> applications, which isn't an ideal solution and represents a worrying
> trend.

By "livelock issues caused by memory exhaustion" you mean the long-standing
infamous issue that the system might become thrashing for the remaining
small amount of page cache, and anonymous memory being swapped out/in,
instead of issuing OOM, because there's always just enough progress of the
reclaim to keep going, but the system isn't basically doing anything else?

I think that's related to near-exhausted memory by userspace, so I'm not
sure why XFS would be to blame here.

That said, if memalloc_retry_wait() is indeed a useful mechanism, maybe we
could perform it inside the page allocator itself for __GFP_NOFAIL?
Kent Overstreet Aug. 31, 2024, 3:46 p.m. UTC | #15
On Thu, Aug 29, 2024 at 11:39:05PM GMT, Theodore Ts'o wrote:
> On Fri, Aug 30, 2024 at 12:27:11AM +1000, Dave Chinner wrote:
> > 
> > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> > now. This was the default Irix kernel allocator behaviour (it had a
> > forwards progress guarantee and would never fail allocation unless
> > told it could do so). We've been using the same "guaranteed not to
> > fail" semantics on Linux since the original port started 25 years
> > ago via open-coded loops.
> 
> Ext3/ext4 doesn't have quite the history as XFS --- it's only been
> around for 23 years --- but we've also used __GFP_NOFAIL or its
> moral equivalent, e.g.:
> 
> > 	do {
> > 		p = kmalloc(size);
> > 	while (!p);
> 
> For the entire existence of ext3.
> 
> > Put simply: __GFP_NOFAIL will be rendered completely useless if it
> > can fail due to external scoped memory allocation contexts.  This
> > will force us to revert all __GFP_NOFAIL allocations back to
> > open-coded will-not-fail loops.
> 
> The same will be true for ext4.  And as Dave has said, the MM
> developers want to have visibility to when file systems have basically
> said, "if you can't allow us to allocate memory, our only alternative
> is to cause user data loss, crash the kernel, or loop forever; we will
> choose the latter".  The MM developers tried to make __GFP_NOFAIL go
> away several years ago, and ext4 put the retry loop back, As a result,
> the compromise was that the MM developers restored __GFP_NOFAIL, and
> the file systems developers have done their best to reduce the use of
> __GFP_NOFAIL as much as possible.
> 
> So if you try to break the GFP_NOFAIL promise, both xfs and ext4 will
> back to the retry loop.  And the MM devs will be sad, and they will
> forcibly revert your change to *ther* code, even if that means
> breaking bcachefs.  Becuase otherwise, you will be breaking ext4 and
> xfs, and so we will go back to using a retry loop, which will be worse
> for Linux users.

GFP_NOFAIL may be better than the retry loop, but it's still not good.

Consider what happens when you have a GFP_NOFAIL in a critical IO path,
when the system is almost exhausted on memory; yes, that allocation will
succeed _eventually_, but without any latency bounds. When you're
thrashing or being fork bombed, that allocation is contending with
everything else.

Much like a lock in a critical path where the work done under the lock
grows as the system gets loaded, it's a contention point subject to
catastrophic failure.

Much better to preallocate, e.g. with a mempool, or have some other kind
of fallback.
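
(A minimal sketch of the mempool approach - NR_RESERVED and struct io_ctx
are placeholders:)

	/* At init time, preallocate a guaranteed reserve of objects: */
	pool = mempool_create_kmalloc_pool(NR_RESERVED, sizeof(struct io_ctx));
	if (!pool)
		return -ENOMEM;

	/*
	 * In the IO path: mempool_alloc() with a sleeping gfp doesn't fail;
	 * once the reserve is exhausted it waits for our own completions to
	 * return objects, so latency is bounded by our completion rate
	 * rather than by global reclaim progress.
	 */
	ctx = mempool_alloc(pool, GFP_NOIO);
	/* ... submit and complete the IO ... */
	mempool_free(ctx, pool);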

It might work to do __GFP_NOFAIL|__GFP_HIGH in critical paths, but I've
never seen that investigated or tried.

And this is an area filesystem people really need to be thinking about.
Block layer gets this right, filesystems do not, and I suspect this is a
key contributor to our performance and behaviour sucking when we're
thrashing.

bcachefs puts a lot of effort into making sure we can run in bounded
memory, because I put a lot of emphasis on consistent performance and
bounded latency, not just winning benchmarks. There's only two
__GFP_NOFAIL allocations in bcachefs, and I'll likely remove both of
them when I get around to it.
Dave Chinner Sept. 1, 2024, 3:35 a.m. UTC | #16
On Fri, Aug 30, 2024 at 05:14:28PM +0800, Yafang Shao wrote:
> On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote:
> > > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe
> > > according to call path and allocation size, you still need to be
> > > checking for failure - in the same way that you shouldn't be using
> > > BUG_ON() if you cannot prove that the condition won't occur in real wold
> > > usage.
> >
> > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> > now. This was the default Irix kernel allocator behaviour (it had a
> > forwards progress guarantee and would never fail allocation unless
> > told it could do so). We've been using the same "guaranteed not to
> > fail" semantics on Linux since the original port started 25 years
> > ago via open-coded loops.
> >
> > IOWs, __GFP_NOFAIL semantics have been production tested for a
> > couple of decades on Linux via XFS, and nobody here can argue that
> > XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL
> > as it is used by XFS is reliable and lives up to the "will not fail"
> > guarantee that it is supposed to have.
> >
> > Fundamentally, __GFP_NOFAIL came about to replace the callers doing
> >
> >         do {
> >                 p = kmalloc(size);
> >         while (!p);
> >
> > so that they blocked until memory allocation succeeded. The call
> > sites do not check for failure, because -failure never occurs-.
> >
> > The MM devs want to have visibility of these allocations - they may
> > not like them, but having __GFP_NOFAIL means it's trivial to audit
> > all the allocations that use these semantics.  IOWs, __GFP_NOFAIL
> > was created with an explicit guarantee that it -will not fail- for
> > normal allocation contexts so it could replace all the open-coded
> > will-not-fail allocation loops..
> >
> > Given this guarantee, we recently removed these historic allocation
> > wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the
> > allocation call sites. There's nearly a hundred memory allocation
> > locations in XFS that are tagged with __GFP_NOFAIL.
> >
> > If we're now going to have the "will not fail" guarantee taken away
> > from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can
> > it be used anywhere else that a "will not fail" guarantee it
> > required.
> >
> > Put simply: __GFP_NOFAIL will be rendered completely useless if it
> > can fail due to external scoped memory allocation contexts.  This
> > will force us to revert all __GFP_NOFAIL allocations back to
> > open-coded will-not-fail loops.
> >
> > This is not a step forwards for anyone.
> 
> Hello Dave,
> 
> I've noticed that XFS has increasingly replaced kmem_alloc() with
> __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of
> __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41
> occurrences. In kmem_alloc(), there's an explicit
> memalloc_retry_wait() to throttle the allocator under heavy memory
> pressure, which aligns with your filesystem design. However, using
> __GFP_NOFAIL removes this throttling mechanism, potentially causing
> issues when the system is under heavy memory load. I'm concerned that
> this shift might not be a beneficial trend.

AIUI, the memory allocation looping has back-offs already built in
to it when memory reserves are exhausted and/or reclaim is
congested.

e.g:

get_page_from_freelist()
  (zone below watermark)
  node_reclaim()
    __node_reclaim()
      shrink_node()
        reclaim_throttle()

And the call to reclaim_throttle() will do the equivalent of
memalloc_retry_wait() (a 2ms sleep).

> We have been using XFS for our big data servers for years, and it has
> consistently performed well with older kernels like 4.19.y. However,
> after upgrading all our servers from 4.19.y to 6.1.y over the past two
> years, we have frequently encountered livelock issues caused by memory
> exhaustion. To mitigate this, we've had to limit the RSS of
> applications, which isn't an ideal solution and represents a worrying
> trend.

If userspace uses all of memory all the time, then the best the
kernel can do is slowly limp along. Preventing userspace from
overcommitting memory to the point of OOM is the only way to avoid
these "userspace space wants more memory than the machine physically
has" sorts of issues. i.e. this is not a problem that the kernel
code can solve short of randomly killing userspace applications...

-Dave.
Yafang Shao Sept. 2, 2024, 3 a.m. UTC | #17
On Fri, Aug 30, 2024 at 11:25 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/30/24 11:14, Yafang Shao wrote:
> > On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > Hello Dave,
> >
> > I've noticed that XFS has increasingly replaced kmem_alloc() with
> > __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of
> > __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41
> > occurrences. In kmem_alloc(), there's an explicit
> > memalloc_retry_wait() to throttle the allocator under heavy memory
> > pressure, which aligns with your filesystem design. However, using
> > __GFP_NOFAIL removes this throttling mechanism, potentially causing
> > issues when the system is under heavy memory load. I'm concerned that
> > this shift might not be a beneficial trend.
> >
> > We have been using XFS for our big data servers for years, and it has
> > consistently performed well with older kernels like 4.19.y. However,
> > after upgrading all our servers from 4.19.y to 6.1.y over the past two
> > years, we have frequently encountered livelock issues caused by memory
> > exhaustion. To mitigate this, we've had to limit the RSS of
> > applications, which isn't an ideal solution and represents a worrying
> > trend.
>
> By "livelock issues caused by memory exhaustion" you mean the long-standing
> infamous issue that the system might become thrashing for the remaining
> small amount of page cache, and anonymous memory being swapped out/in,
> instead of issuing OOM, because there's always just enough progress of the
> reclaim to keep going, but the system isn't basically doing anything else?
>

Exactly

> I think that's related to near-exhausted memory by userspace,

If user space is the root cause, the appropriate response should be to
terminate the offending user tasks. However, this doesn't happen at
all.

> so I'm not
> sure why XFS would be to blame here.

Honestly, I'm not sure what to blame, as I don't have a clear
understanding of what's happening during memory allocation. One server
among tens of thousands in production randomly experiences a livelock
within days, making it extremely difficult to pinpoint the root cause.

>
> That said, if memalloc_retry_wait() is indeed a useful mechanism, maybe we
> could perform it inside the page allocator itself for __GFP_NOFAIL?

Perhaps an additional wait or exit mechanism should be implemented
specifically for __GFP_NOFAIL.

--
Regards
Yafang
Yafang Shao Sept. 2, 2024, 3:02 a.m. UTC | #18
On Sun, Sep 1, 2024 at 11:35 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Aug 30, 2024 at 05:14:28PM +0800, Yafang Shao wrote:
> > On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote:
> > > > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe
> > > > according to call path and allocation size, you still need to be
> > > > checking for failure - in the same way that you shouldn't be using
> > > > BUG_ON() if you cannot prove that the condition won't occur in real wold
> > > > usage.
> > >
> > > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years
> > > now. This was the default Irix kernel allocator behaviour (it had a
> > > forwards progress guarantee and would never fail allocation unless
> > > told it could do so). We've been using the same "guaranteed not to
> > > fail" semantics on Linux since the original port started 25 years
> > > ago via open-coded loops.
> > >
> > > IOWs, __GFP_NOFAIL semantics have been production tested for a
> > > couple of decades on Linux via XFS, and nobody here can argue that
> > > XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL
> > > as it is used by XFS is reliable and lives up to the "will not fail"
> > > guarantee that it is supposed to have.
> > >
> > > Fundamentally, __GFP_NOFAIL came about to replace the callers doing
> > >
> > >         do {
> > >                 p = kmalloc(size);
> > >         while (!p);
> > >
> > > so that they blocked until memory allocation succeeded. The call
> > > sites do not check for failure, because -failure never occurs-.
> > >
> > > The MM devs want to have visibility of these allocations - they may
> > > not like them, but having __GFP_NOFAIL means it's trivial to audit
> > > all the allocations that use these semantics.  IOWs, __GFP_NOFAIL
> > > was created with an explicit guarantee that it -will not fail- for
> > > normal allocation contexts so it could replace all the open-coded
> > > will-not-fail allocation loops..
> > >
> > > Given this guarantee, we recently removed these historic allocation
> > > wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the
> > > allocation call sites. There's nearly a hundred memory allocation
> > > locations in XFS that are tagged with __GFP_NOFAIL.
> > >
> > > If we're now going to have the "will not fail" guarantee taken away
> > > from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can
> > > it be used anywhere else that a "will not fail" guarantee it
> > > required.
> > >
> > > Put simply: __GFP_NOFAIL will be rendered completely useless if it
> > > can fail due to external scoped memory allocation contexts.  This
> > > will force us to revert all __GFP_NOFAIL allocations back to
> > > open-coded will-not-fail loops.
> > >
> > > This is not a step forwards for anyone.
> >
> > Hello Dave,
> >
> > I've noticed that XFS has increasingly replaced kmem_alloc() with
> > __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of
> > __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41
> > occurrences. In kmem_alloc(), there's an explicit
> > memalloc_retry_wait() to throttle the allocator under heavy memory
> > pressure, which aligns with your filesystem design. However, using
> > __GFP_NOFAIL removes this throttling mechanism, potentially causing
> > issues when the system is under heavy memory load. I'm concerned that
> > this shift might not be a beneficial trend.
>
> AIUI, the memory allocation looping has back-offs already built in
> to it when memory reserves are exhausted and/or reclaim is
> congested.
>
> e.g:
>
> get_page_from_freelist()
>   (zone below watermark)
>   node_reclaim()
>     __node_reclaim()
>       shrink_node()
>         reclaim_throttle()

It applies to all kinds of allocations.

>
> And the call to recalim_throttle() will do the equivalent of
> memalloc_retry_wait() (a 2ms sleep).

I'm wondering if we should take special action for __GFP_NOFAIL, as
currently, it only results in an endless loop with no intervention.

>
> > We have been using XFS for our big data servers for years, and it has
> > consistently performed well with older kernels like 4.19.y. However,
> > after upgrading all our servers from 4.19.y to 6.1.y over the past two
> > years, we have frequently encountered livelock issues caused by memory
> > exhaustion. To mitigate this, we've had to limit the RSS of
> > applications, which isn't an ideal solution and represents a worrying
> > trend.
>
> If userspace uses all of memory all the time, then the best the
> kernel can do is slowly limp along. Preventing userspace from
> overcommitting memory to the point of OOM is the only way to avoid
> these "userspace space wants more memory than the machine physically
> has" sorts of issues. i.e. this is not a problem that the kernel
> code can solve short of randomly killing userspace applications...

We expect an OOM event, but it never occurs, which is a problem.

--
Regards

Yafang
Michal Hocko Sept. 2, 2024, 8:11 a.m. UTC | #19
On Mon 02-09-24 11:02:50, Yafang Shao wrote:
> On Sun, Sep 1, 2024 at 11:35 AM Dave Chinner <david@fromorbit.com> wrote:
[...]
> > AIUI, the memory allocation looping has back-offs already built in
> > to it when memory reserves are exhausted and/or reclaim is
> > congested.
> >
> > e.g:
> >
> > get_page_from_freelist()
> >   (zone below watermark)
> >   node_reclaim()
> >     __node_reclaim()
> >       shrink_node()
> >         reclaim_throttle()
> 
> It applies to all kinds of allocations.
> 
> >
> > And the call to recalim_throttle() will do the equivalent of
> > memalloc_retry_wait() (a 2ms sleep).
> 
> I'm wondering if we should take special action for __GFP_NOFAIL, as
> currently, it only results in an endless loop with no intervention.

If the memory allocator/reclaim is thrashing on a couple of remaining pages
that are easy to drop and reallocate again, then the same endless loop
is de-facto the behavior for _all_ non-costly allocations. All of them
will loop. This is not really great but so far we haven't really
developed a reliable thrashing detection that would suit all potential
workloads. There are some that simply benefit from work not being lost
even if the cost is a severe performance penalty. A general conclusion
has been that workloads which would rather see OOM killer triggering
early should implement that policy in the userspace. We have PSI,
refault counters and other tools that could be used to detect
pathological patterns and trigger workload specific action.

I really do not see why GFP_NOFAIL should be any special in this
specific case.
Yafang Shao Sept. 2, 2024, 9:01 a.m. UTC | #20
On Mon, Sep 2, 2024 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 02-09-24 11:02:50, Yafang Shao wrote:
> > On Sun, Sep 1, 2024 at 11:35 AM Dave Chinner <david@fromorbit.com> wrote:
> [...]
> > > AIUI, the memory allocation looping has back-offs already built in
> > > to it when memory reserves are exhausted and/or reclaim is
> > > congested.
> > >
> > > e.g:
> > >
> > > get_page_from_freelist()
> > >   (zone below watermark)
> > >   node_reclaim()
> > >     __node_reclaim()
> > >       shrink_node()
> > >         reclaim_throttle()
> >
> > It applies to all kinds of allocations.
> >
> > >
> > > And the call to recalim_throttle() will do the equivalent of
> > > memalloc_retry_wait() (a 2ms sleep).
> >
> > I'm wondering if we should take special action for __GFP_NOFAIL, as
> > currently, it only results in an endless loop with no intervention.
>
> If the memory allocator/reclaim is trashing on couple of remaining pages
> that are easy to drop and reallocated again then the same endless loop
> is de-facto the behavior for _all_ non-costly allocations. All of them
> will loop. This is not really great but so far we haven't really
> developed a reliable thrashing detection that would suit all potential
> workloads. There are some that simply benefit from work not being lost
> even if the cost is a severe performance penalty. A general conclusion
> has been that workloads which would rather see OOM killer triggering
> early should implement that policy in the userspace. We have PSI,
> refault counters and other tools that could be used to detect
> pathological patterns and trigger workload specific action.

Indeed, we're currently working on developing that policy.

>
> I really do not see why GFP_NOFAIL should be any special in this
> specific case.

I believe there's no way to stop it from looping, even if you
implement a sophisticated user space OOM killer. ;)

--
Regards
Yafang
Michal Hocko Sept. 2, 2024, 9:09 a.m. UTC | #21
On Mon 02-09-24 17:01:12, Yafang Shao wrote:
> > I really do not see why GFP_NOFAIL should be any special in this
> > specific case.
> 
> I believe there's no way to stop it from looping, even if you
> implement a sophisticated user space OOM killer. ;)

A user space OOM killer should be helping to replenish free memory, and
we have some heuristics to help NOFAIL users out with some portion of
memory reserves already IIRC. So we do already give them some special
treatment in the page allocator path. Not so much in the reclaim path.
Yafang Shao Sept. 3, 2024, 6:34 a.m. UTC | #22
On Mon, Sep 2, 2024 at 5:09 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 02-09-24 17:01:12, Yafang Shao wrote:
> > > I really do not see why GFP_NOFAIL should be any special in this
> > > specific case.
> >
> > I believe there's no way to stop it from looping, even if you
> > implement a sophisticated user space OOM killer. ;)
>
> User space OOM killer should be helping to replenish a free memory and
> we have some heuristics to help NOFAIL users out with some portion of
> memory reserves already IIRC. So we do already give them some special
> treatment in the page allocator path. Not so much in the reclaim path.

When setting GFP_NOFAIL, it's important to enable not only direct
reclaim but also the OOM killer. In scenarios where swap is off and
there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can
result in an infinite loop. In other words, GFP_NOFAIL should not be
used with GFP_NOFS. Unfortunately, many call sites do combine them.
For example:

XFS:

fs/xfs/libxfs/xfs_exchmaps.c: GFP_NOFS | __GFP_NOFAIL
fs/xfs/xfs_attr_item.c: GFP_NOFS | __GFP_NOFAIL

EXT4:

fs/ext4/mballoc.c: GFP_NOFS | __GFP_NOFAIL
fs/ext4/extents.c: GFP_NOFS | __GFP_NOFAIL

This seems problematic, but I'm not an FS expert. Perhaps Dave or Ted
could provide further insight.

--
Regards

Yafang
Michal Hocko Sept. 3, 2024, 7:18 a.m. UTC | #23
On Tue 03-09-24 14:34:05, Yafang Shao wrote:
> On Mon, Sep 2, 2024 at 5:09 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 02-09-24 17:01:12, Yafang Shao wrote:
> > > > I really do not see why GFP_NOFAIL should be any special in this
> > > > specific case.
> > >
> > > I believe there's no way to stop it from looping, even if you
> > > implement a sophisticated user space OOM killer. ;)
> >
> > User space OOM killer should be helping to replenish a free memory and
> > we have some heuristics to help NOFAIL users out with some portion of
> > memory reserves already IIRC. So we do already give them some special
> > treatment in the page allocator path. Not so much in the reclaim path.
> 
> When setting GFP_NOFAIL, it's important to not only enable direct
> reclaim but also the OOM killer. In scenarios where swap is off and
> there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can
> result in an infinite loop. In other words, GFP_NOFAIL should not be
> used with GFP_NOFS. Unfortunately, many call sites do combine them.

This is the case with GFP_NOFS on its own already. NOFAIL is no
different and both will be looping forever. We heavily rely on kswapd
or other GFP_KERNEL allocations' direct reclaim to allow for forward progress.

Unfortunately we haven't really found a better way to deal with
NOFS-only/predominant workloads.
Theodore Ts'o Sept. 3, 2024, 12:44 p.m. UTC | #24
On Tue, Sep 03, 2024 at 02:34:05PM +0800, Yafang Shao wrote:
>
> When setting GFP_NOFAIL, it's important to not only enable direct
> reclaim but also the OOM killer. In scenarios where swap is off and
> there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can
> result in an infinite loop. In other words, GFP_NOFAIL should not be
> used with GFP_NOFS. Unfortunately, many call sites do combine them.
> For example:
> 
> XFS:
> 
> fs/xfs/libxfs/xfs_exchmaps.c: GFP_NOFS | __GFP_NOFAIL
> fs/xfs/xfs_attr_item.c: GFP_NOFS | __GFP_NOFAIL
> 
> EXT4:
> 
> fs/ext4/mballoc.c: GFP_NOFS | __GFP_NOFAIL
> fs/ext4/extents.c: GFP_NOFS | __GFP_NOFAIL
> 
> This seems problematic, but I'm not an FS expert. Perhaps Dave or Ted
> could provide further insight.

GFP_NOFS is needed because we need to signal to the mm layer to avoid
recursing into the file system layer --- for example, to clean a page by
writing it back to the FS.  Since we may have taken various file
system locks, recursing could lead to deadlock, which would make the
system (and the user) sad.
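
(As an aside, the scoped interface expresses the same constraint
without passing GFP_NOFS at every call site -- a minimal sketch, with
the lock and structure made up purely for illustration:

    #include <linux/sched/mm.h>
    #include <linux/slab.h>

    struct my_fs_info { struct mutex lock; };

    static void *my_fs_alloc_locked(struct my_fs_info *fi)
    {
            unsigned int nofs_flags;
            void *buf;

            mutex_lock(&fi->lock);
            /* Within this scope, GFP_KERNEL behaves like GFP_NOFS. */
            nofs_flags = memalloc_nofs_save();
            buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
            memalloc_nofs_restore(nofs_flags);
            mutex_unlock(&fi->lock);
            return buf;
    }

Inside the saved/restored region, reclaim triggered by the allocation
will not recurse into filesystem writeback.)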

If the mm layer wants to OOM kill a process, that should be fine as
far as the file system is concerned --- this could reclaim anonymous
pages that don't need to be written back, for example.  And we don't
need to write back dirty pages before the process is killed.  So I'm a
bit puzzled why (as you imply; I haven't dug into the mm code in
question) GFP_NOFS implies disabling the OOM killer?

Regards,

					- Ted

P.S.  Note that this is a fairly simplistic, very conservative set of
constraints.  If you have several dozen file systems mounted, and
we're deep in the guts of file system A, it might be *fine* to clean
pages associated with file system B or file system C.  Unless of
course, file system A is a loop-back mount onto a file located in file
system B, in which case writing into file system A might require
taking locks related to file system B.  But that aside, in theory we
could allow certain types of page reclaim if we were willing to track
which file systems are busy.

On the other hand, if the system is allowed to get that busy,
performance is going to be *terrible*, and so perhaps the better thing
to do is to teach the container manager not to schedule so many jobs
on the server in the first place, or having the mobile OS kill off
applications that aren't in the foreground, or giving the OOM killer
license to kill off jobs much earlier, etc.  By the time we get to the
point where we are trying to use these last dozen or so pages, the
system is going to be thrashing super-badly, and the user is going to
be *quite* unhappy.  So arguably these problems should be solved much
higher up the software stack, by not letting the system get into such
a condition in the first place.
Yafang Shao Sept. 3, 2024, 1:15 p.m. UTC | #25
On Tue, Sep 3, 2024 at 8:44 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Sep 03, 2024 at 02:34:05PM +0800, Yafang Shao wrote:
> >
> > When setting GFP_NOFAIL, it's important to not only enable direct
> > reclaim but also the OOM killer. In scenarios where swap is off and
> > there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can
> > result in an infinite loop. In other words, GFP_NOFAIL should not be
> > used with GFP_NOFS. Unfortunately, many call sites do combine them.
> > For example:
> >
> > XFS:
> >
> > fs/xfs/libxfs/xfs_exchmaps.c: GFP_NOFS | __GFP_NOFAIL
> > fs/xfs/xfs_attr_item.c: GFP_NOFS | __GFP_NOFAIL
> >
> > EXT4:
> >
> > fs/ext4/mballoc.c: GFP_NOFS | __GFP_NOFAIL
> > fs/ext4/extents.c: GFP_NOFS | __GFP_NOFAIL
> >
> > This seems problematic, but I'm not an FS expert. Perhaps Dave or Ted
> > could provide further insight.
>
> GFP_NOFS is needed because we need to signal to the mm layer to avoid
> recursing into file system layer --- for example, to clean a page by
> writing it back to the FS.  Since we may have taken various file
> system locks, recursing could lead to deadlock, which would make the
> system (and the user) sad.
>
> If the mm layer wants to OOM kill a process, that should be fine as
> far as the file system is concerned --- this could reclaim anonymous
> pages that don't need to be written back, for example.  And we don't
> need to write back dirty pages before the process is killed.  So I'm a
> bit puzzled why (as you imply; I haven't dug into the mm code in
> question) GFP_NOFS implies disabling the OOM killer?

Refer to the out_of_memory() function [0]:

    if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
        return true;

[0]. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/oom_kill.c#n1137

Is it possible that this check can be removed?
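
To make the question concrete, one possible relaxation (purely a
sketch of the idea being asked about, not a tested or reviewed change)
would be to let __GFP_NOFAIL requests proceed past that early return:

    if (!(oc->gfp_mask & __GFP_FS) &&
        !(oc->gfp_mask & __GFP_NOFAIL) &&
        !is_memcg_oom(oc))
        return true;

Whether killing a task is actually safe or useful for such callers is
exactly the question above.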

>
> Regards,
>
>                                         - Ted
>
> P.S.  Note that this is a fairly simplistic, very conservative set of
> constraints.  If you have several dozen file systems mounted, and
> we're deep in the guts of file system A, it might be *fine* to clean
> pages associated with file system B or file system C.  Unless of
> course, file system A is a loop-back mount onto a file located in file
> system B, in which case writing into file system A might require
> taking locks related to file system B.  But that aside, in theory we
> could allow certain types of page reclaim if we were willing to track
> which file systems are busy.
>
> On the other hand, if the system is allowed to get that busy,
> performance is going to be *terrible*, and so perhaps the better thing
> to do is to teach the container manager not to schedule so many jobs
> on the server in the first place, or having the mobile OS kill off
> applications that aren't in the foreground, or giving the OOM killer
> license to kill off jobs much earlier, etc.  By the time we get to the
> point where we are trying to use these last dozen or so pages, the
> system is going to be thrashing super-badly, and the user is going to
> be *quite* unhappy.  So arguably these problems should be solved much
> higher up the software stack, by not letting the system get into such
> a condition in the first place.

I completely agree with your point. However, in the real world, things
don't always work as expected, which is why it's crucial to ensure the
OOM killer is effective during system thrashing. Unfortunately, the
kernel's OOM killer doesn't always perform as expected, particularly
under heavy thrashing. This is one reason why user-space OOM killers
like oomd exist.


--
Regards
Yafang
Michal Hocko Sept. 3, 2024, 1:30 p.m. UTC | #26
On Tue 03-09-24 08:44:16, Theodore Ts'o wrote:
> On Tue, Sep 03, 2024 at 02:34:05PM +0800, Yafang Shao wrote:
> >
> > When setting GFP_NOFAIL, it's important to not only enable direct
> > reclaim but also the OOM killer. In scenarios where swap is off and
> > there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can
> > result in an infinite loop. In other words, GFP_NOFAIL should not be
> > used with GFP_NOFS. Unfortunately, many call sites do combine them.
> > For example:
> > 
> > XFS:
> > 
> > fs/xfs/libxfs/xfs_exchmaps.c: GFP_NOFS | __GFP_NOFAIL
> > fs/xfs/xfs_attr_item.c: GFP_NOFS | __GFP_NOFAIL
> > 
> > EXT4:
> > 
> > fs/ext4/mballoc.c: GFP_NOFS | __GFP_NOFAIL
> > fs/ext4/extents.c: GFP_NOFS | __GFP_NOFAIL
> > 
> > This seems problematic, but I'm not an FS expert. Perhaps Dave or Ted
> > could provide further insight.
> 
> GFP_NOFS is needed because we need to signal to the mm layer to avoid
> recursing into file system layer --- for example, to clean a page by
> writing it back to the FS.  Since we may have taken various file
> system locks, recursing could lead to deadlock, which would make the
> system (and the user) sad.
> 
> If the mm layer wants to OOM kill a process, that should be fine as
> far as the file system is concerned --- this could reclaim anonymous
> pages that don't need to be written back, for example.  And we don't
> need to write back dirty pages before the process is killed.  So I'm a
> bit puzzled why (as you imply; I haven't dug into the mm code in
> question) GFP_NOFS implies disabling the OOM killer?

Yes, because there might be a lot of fs pages pinned while performing a
NOFS allocation, and that could fire the OOM killer way too prematurely.
It has been quite some time since this was introduced, but I do remember
workloads hitting that. Also, there is usually kswapd making sufficient
progress to move forward. There are also cases where kswapd is
completely stuck but other __GFP_FS allocations triggering full direct
reclaim, or background kworkers freeing some memory, keep things moving,
and the OOM killer doesn't have a good enough picture to make an
educated guess that it is the only available way forward.

A typical example would be a workload that would care: it is thrashing
but still making slow progress, which is acceptable because the most
important workload makes decent progress (its working set fits in
memory or is mlocked) and rebuilding the state is more harmful than
the slow IO.
Michal Hocko Sept. 3, 2024, 2:03 p.m. UTC | #27
On Tue 03-09-24 21:15:59, Yafang Shao wrote:
[...]
> I completely agree with your point. However, in the real world, things
> don't always work as expected, which is why it's crucial to ensure the
> OOM killer is effective during system thrashing. Unfortunately, the
> kernel's OOM killer doesn't always perform as expected, particularly
> under heavy thrashing. This is one reason why user-space OOM killers
> like oomd exist.

I do understand your point. On the other hand, having seen all the
different use cases over a long time, we have concluded that the OOM
killer should be a really conservative last resort. More aggressive OOM
policies should be implemented by userspace to prevent regressions in
other use cases.

That doesn't really mean improvements to the kernel OOM killer are not
welcome or impossible. The bar is just quite high, as the wide variety
of workloads is really hard to support. Heavy thrashing is one example:
different workloads will have a different understanding of what that
actually means.

Patch

diff --git a/fs/bcachefs/acl.c b/fs/bcachefs/acl.c
index 87f1be9d4db4..1def61875a6f 100644
--- a/fs/bcachefs/acl.c
+++ b/fs/bcachefs/acl.c
@@ -137,7 +137,7 @@  static struct posix_acl *bch2_acl_from_disk(struct btree_trans *trans,
 		return NULL;
 
 	acl = allocate_dropping_locks(trans, ret,
-			posix_acl_alloc(count, _gfp));
+			posix_acl_alloc(count, GFP_KERNEL));
 	if (!acl)
 		return ERR_PTR(-ENOMEM);
 	if (ret) {
@@ -427,7 +427,8 @@  int bch2_acl_chmod(struct btree_trans *trans, subvol_inum inum,
 	if (ret)
 		goto err;
 
-	ret = allocate_dropping_locks_errcode(trans, __posix_acl_chmod(&acl, _gfp, mode));
+	ret = allocate_dropping_locks_errcode(trans,
+					 __posix_acl_chmod(&acl, GFP_KERNEL, mode));
 	if (ret)
 		goto err;
 
diff --git a/fs/bcachefs/btree_cache.c b/fs/bcachefs/btree_cache.c
index 9f096fdcaf9a..e0090fa551d7 100644
--- a/fs/bcachefs/btree_cache.c
+++ b/fs/bcachefs/btree_cache.c
@@ -729,7 +729,8 @@  struct btree *bch2_btree_node_mem_alloc(struct btree_trans *trans, bool pcpu_rea
 
 	mutex_unlock(&bc->lock);
 
-	if (btree_node_data_alloc(c, b, GFP_NOWAIT|__GFP_NOWARN)) {
+	if (memalloc_flags_do(PF_MEMALLOC_NORECLAIM,
+			      btree_node_data_alloc(c, b, GFP_KERNEL|__GFP_NOWARN))) {
 		bch2_trans_unlock(trans);
 		if (btree_node_data_alloc(c, b, GFP_KERNEL|__GFP_NOWARN))
 			goto err;
diff --git a/fs/bcachefs/btree_iter.h b/fs/bcachefs/btree_iter.h
index 6d87e57745da..0ea21a5f6d86 100644
--- a/fs/bcachefs/btree_iter.h
+++ b/fs/bcachefs/btree_iter.h
@@ -865,29 +865,33 @@  struct bkey_s_c bch2_btree_iter_peek_and_restart_outlined(struct btree_iter *);
 	(_do) ?: bch2_trans_relock(_trans);				\
 })
 
-#define allocate_dropping_locks_errcode(_trans, _do)			\
-({									\
-	gfp_t _gfp = GFP_NOWAIT|__GFP_NOWARN;				\
-	int _ret = _do;							\
-									\
-	if (bch2_err_matches(_ret, ENOMEM)) {				\
-		_gfp = GFP_KERNEL;					\
-		_ret = drop_locks_do(_trans, _do);			\
-	}								\
-	_ret;								\
+#define memalloc_flags_do(_flags, _do)						\
+({										\
+	unsigned _saved_flags = memalloc_flags_save(_flags);			\
+	typeof(_do) _ret = _do;							\
+	memalloc_noreclaim_restore(_saved_flags);				\
+	_ret;									\
 })
 
-#define allocate_dropping_locks(_trans, _ret, _do)			\
-({									\
-	gfp_t _gfp = GFP_NOWAIT|__GFP_NOWARN;				\
-	typeof(_do) _p = _do;						\
-									\
-	_ret = 0;							\
-	if (unlikely(!_p)) {						\
-		_gfp = GFP_KERNEL;					\
-		_ret = drop_locks_do(_trans, ((_p = _do), 0));		\
-	}								\
-	_p;								\
+#define allocate_dropping_locks_errcode(_trans, _do)				\
+({										\
+	int _ret = memalloc_flags_do(PF_MEMALLOC_NORECLAIM|PF_MEMALLOC_NOWARN, _do);\
+										\
+	if (bch2_err_matches(_ret, ENOMEM)) {					\
+		_ret = drop_locks_do(_trans, _do);				\
+	}									\
+	_ret;									\
+})
+
+#define allocate_dropping_locks(_trans, _ret, _do)				\
+({										\
+	typeof(_do) _p = memalloc_flags_do(PF_MEMALLOC_NORECLAIM|PF_MEMALLOC_NOWARN, _do);\
+										\
+	_ret = 0;								\
+	if (unlikely(!_p)) {							\
+		_ret = drop_locks_do(_trans, ((_p = _do), 0));			\
+	}									\
+	_p;									\
 })
 
 #define bch2_trans_run(_c, _do)						\
diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
index af84516fb607..feea58778d44 100644
--- a/fs/bcachefs/btree_key_cache.c
+++ b/fs/bcachefs/btree_key_cache.c
@@ -117,12 +117,12 @@  static void bkey_cached_free(struct btree_key_cache *bc,
 	this_cpu_inc(*bc->nr_pending);
 }
 
-static struct bkey_cached *__bkey_cached_alloc(unsigned key_u64s, gfp_t gfp)
+static struct bkey_cached *__bkey_cached_alloc(unsigned key_u64s)
 {
-	struct bkey_cached *ck = kmem_cache_zalloc(bch2_key_cache, gfp);
+	struct bkey_cached *ck = kmem_cache_zalloc(bch2_key_cache, GFP_KERNEL);
 	if (unlikely(!ck))
 		return NULL;
-	ck->k = kmalloc(key_u64s * sizeof(u64), gfp);
+	ck->k = kmalloc(key_u64s * sizeof(u64), GFP_KERNEL);
 	if (unlikely(!ck->k)) {
 		kmem_cache_free(bch2_key_cache, ck);
 		return NULL;
@@ -146,7 +146,7 @@  bkey_cached_alloc(struct btree_trans *trans, struct btree_path *path, unsigned k
 		goto lock;
 
 	ck = allocate_dropping_locks(trans, ret,
-				     __bkey_cached_alloc(key_u64s, _gfp));
+				     __bkey_cached_alloc(key_u64s));
 	if (ret) {
 		if (ck)
 			kfree(ck->k);
@@ -240,7 +240,7 @@  static int btree_key_cache_create(struct btree_trans *trans, struct btree_path *
 		mark_btree_node_locked_noreset(path, 0, BTREE_NODE_UNLOCKED);
 
 		struct bkey_i *new_k = allocate_dropping_locks(trans, ret,
-				kmalloc(key_u64s * sizeof(u64), _gfp));
+						kmalloc(key_u64s * sizeof(u64), GFP_KERNEL));
 		if (unlikely(!new_k)) {
 			bch_err(trans->c, "error allocating memory for key cache key, btree %s u64s %u",
 				bch2_btree_id_str(ck->key.btree_id), key_u64s);
diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c
index 141a4c63142f..e2163cbf63a9 100644
--- a/fs/bcachefs/ec.c
+++ b/fs/bcachefs/ec.c
@@ -890,12 +890,12 @@  int bch2_ec_read_extent(struct btree_trans *trans, struct bch_read_bio *rbio)
 
 /* stripe bucket accounting: */
 
-static int __ec_stripe_mem_alloc(struct bch_fs *c, size_t idx, gfp_t gfp)
+static int __ec_stripe_mem_alloc(struct bch_fs *c, size_t idx)
 {
 	ec_stripes_heap n, *h = &c->ec_stripes_heap;
 
 	if (idx >= h->size) {
-		if (!init_heap(&n, max(1024UL, roundup_pow_of_two(idx + 1)), gfp))
+		if (!init_heap(&n, max(1024UL, roundup_pow_of_two(idx + 1)), GFP_KERNEL))
 			return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
 
 		mutex_lock(&c->ec_stripes_heap_lock);
@@ -909,11 +909,11 @@  static int __ec_stripe_mem_alloc(struct bch_fs *c, size_t idx, gfp_t gfp)
 		free_heap(&n);
 	}
 
-	if (!genradix_ptr_alloc(&c->stripes, idx, gfp))
+	if (!genradix_ptr_alloc(&c->stripes, idx, GFP_KERNEL))
 		return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
 
 	if (c->gc_pos.phase != GC_PHASE_not_running &&
-	    !genradix_ptr_alloc(&c->gc_stripes, idx, gfp))
+	    !genradix_ptr_alloc(&c->gc_stripes, idx, GFP_KERNEL))
 		return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
 
 	return 0;
@@ -923,7 +923,7 @@  static int ec_stripe_mem_alloc(struct btree_trans *trans,
 			       struct btree_iter *iter)
 {
 	return allocate_dropping_locks_errcode(trans,
-			__ec_stripe_mem_alloc(trans->c, iter->pos.offset, _gfp));
+			__ec_stripe_mem_alloc(trans->c, iter->pos.offset));
 }
 
 /*
@@ -2193,7 +2193,7 @@  int bch2_stripes_read(struct bch_fs *c)
 			if (k.k->type != KEY_TYPE_stripe)
 				continue;
 
-			ret = __ec_stripe_mem_alloc(c, k.k->p.offset, GFP_KERNEL);
+			ret = __ec_stripe_mem_alloc(c, k.k->p.offset);
 			if (ret)
 				break;
 
diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index bc8dd89ec15c..153ec4e5c0f4 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -273,14 +273,6 @@  static struct bch_inode_info *bch2_inode_hash_insert(struct bch_fs *c,
 	}
 }
 
-#define memalloc_flags_do(_flags, _do)						\
-({										\
-	unsigned _saved_flags = memalloc_flags_save(_flags);			\
-	typeof(_do) _ret = _do;							\
-	memalloc_noreclaim_restore(_saved_flags);				\
-	_ret;									\
-})
-
 static struct inode *bch2_alloc_inode(struct super_block *sb)
 {
 	BUG();