
[1/2,v2] bcachefs: do not use PF_MEMALLOC_NORECLAIM

Message ID 20240827061543.1235703-1-mhocko@kernel.org (mailing list archive)
State Handled Elsewhere
Series [1/2,v2] bcachefs: do not use PF_MEMALLOC_NORECLAIM

Commit Message

Michal Hocko Aug. 27, 2024, 6:15 a.m. UTC
From: Michal Hocko <mhocko@suse.com>

bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
inode to achieve GFP_NOWAIT semantic while holding locks. If this
allocation fails it will drop locks and use GFP_NOFS allocation context.

We would like to drop PF_MEMALLOC_NORECLAIM because it is really
dangerous to use if the caller doesn't control the full call chain with
this flag set. E.g. if any of the functions down the chain needed a
GFP_NOFAIL request, PF_MEMALLOC_NORECLAIM would override this and
cause unexpected failure.

While this is not the case here, using the scoped gfp semantic is not
really needed because we can easily push the allocation context down
the chain without too much clutter.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/bcachefs/fs.c          | 14 ++++++--------
 fs/inode.c                |  6 +++---
 include/linux/fs.h        |  7 ++++++-
 include/linux/lsm_hooks.h |  2 +-
 include/linux/security.h  |  4 ++--
 security/security.c       |  8 ++++----
 6 files changed, 22 insertions(+), 19 deletions(-)

Changes since v1
- compile errors fixed
- dropped __GFP_NOWARN as it is now part of GFP_NOWAIT
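
For illustration, a minimal sketch of the hazard described above (struct foo,
alloc_foo(), caller() and use_foo() are hypothetical; memalloc_flags_do() is
the scoped helper used by the code this patch removes):

/* Hypothetical callee, written on the assumption that __GFP_NOFAIL
 * cannot return NULL, so the result is never checked. */
static struct foo *alloc_foo(void)
{
	return kmalloc(sizeof(struct foo), GFP_KERNEL | __GFP_NOFAIL);
}

static void caller(void)
{
	/*
	 * The scoped flag silently applies to every allocation in the
	 * nested call chain, so the "cannot fail" allocation above may
	 * now return NULL, and the eventual unchecked dereference of the
	 * result becomes a crash.
	 */
	struct foo *f = memalloc_flags_do(PF_MEMALLOC_NORECLAIM, alloc_foo());

	use_foo(f);
}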

Comments

Christoph Hellwig Aug. 27, 2024, 12:29 p.m. UTC | #1
Thanks for fixing this up!

Reviewed-by: Christoph Hellwig <hch@lst.de>
Dave Chinner Aug. 28, 2024, 4:09 a.m. UTC | #2
On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> inode to achieve GFP_NOWAIT semantic while holding locks. If this
> allocation fails it will drop locks and use GFP_NOFS allocation context.
> 
> We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> dangerous to use if the caller doesn't control the full call chain with
> this flag set. E.g. if any of the functions down the chain needed a
> GFP_NOFAIL request, PF_MEMALLOC_NORECLAIM would override this and
> cause unexpected failure.
> 
> While this is not the case here, using the scoped gfp semantic is not
> really needed because we can easily push the allocation context down
> the chain without too much clutter.
> 
> Acked-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Looks good to me.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Kent Overstreet Aug. 29, 2024, 10:02 a.m. UTC | #3
On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote:
> On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> > inode to achieve GFP_NOWAIT semantic while holding locks. If this
> > allocation fails it will drop locks and use GFP_NOFS allocation context.
> > 
> > We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> > dangerous to use if the caller doesn't control the full call chain with
> > this flag set. E.g. if any of the functions down the chain needed a
> > GFP_NOFAIL request, PF_MEMALLOC_NORECLAIM would override this and
> > cause unexpected failure.
> > 
> > While this is not the case here, using the scoped gfp semantic is not
> > really needed because we can easily push the allocation context down
> > the chain without too much clutter.
> > 
> > Acked-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Looks good to me.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>

Reposting what I wrote in the other thread:

This series is based on a fundamental misunderstanding of what a safe
API is: a safe API is not one that doesn't return errors, it's one that
never invokes undefined behaviour.

It was decided years ago that the scoped APIs were the better
replacement for the gfp APIs, and they need to exist.

This "GFP_NOFAIL exists therefore we can't tell the memory allocator
about situations where it would have to fail" is utter insanity - it's
the exact opposite of defining a safe API.

A safe API would be one where we /did/ always tell the memory allocator
when we're in non-sleepable context, and the allocator always returned
failure instead of context switching.

This is utter brain damage; rule #1 of kernel programming is that _you
check for errors_. If you don't know that your GFP_NOFAIL usage is in a
safe context (and that's not just memory reclaim context, it's also the
size of your allocation) then you have to check for errors.
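
A minimal sketch of what that checking looks like in practice (the function,
MAX_BLOB_LEN and the copy step are hypothetical):

static int copy_user_blob(const void __user *src, size_t len)
{
	void *buf;
	int ret;

	/* bound the user-controlled size before allocating */
	if (len > MAX_BLOB_LEN)
		return -EINVAL;

	/* no __GFP_NOFAIL: the failure path exists and gets exercised */
	buf = kmalloc(len, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	ret = copy_from_user(buf, src, len) ? -EFAULT : 0;
	kfree(buf);
	return ret;
}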
Dave Chinner Aug. 29, 2024, 1:12 p.m. UTC | #4
On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote:
> On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote:
> > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> > > inode to achieve GFP_NOWAIT semantic while holding locks. If this
> > > allocation fails it will drop locks and use GFP_NOFS allocation context.
> > > 
> > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> > > dangerous to use if the caller doesn't control the full call chain with
> > > this flag set. E.g. if any of the functions down the chain needed a
> > > GFP_NOFAIL request, PF_MEMALLOC_NORECLAIM would override this and
> > > cause unexpected failure.
> > > 
> > > While this is not the case here, using the scoped gfp semantic is not
> > > really needed because we can easily push the allocation context down
> > > the chain without too much clutter.
> > > 
> > > Acked-by: Christoph Hellwig <hch@lst.de>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > 
> > Looks good to me.
> > 
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> Reposting what I wrote in the other thread:

I've read the thread. I've heard what you have had to say. Like
several other people, I think your position is just not practical or
reasonable.

I don't care about the purity or the safety of the API - the
practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL
allocation can now fail and that will cause unexpected kernel
crashes.  Keeping existing code and API semantics working correctly
(i.e. regression free) takes precedence over new functionality or
API features that people want to introduce.

That's all there is to it. This is not a hill you need to die on.

-Dave.
Kent Overstreet Aug. 29, 2024, 1:22 p.m. UTC | #5
On Thu, Aug 29, 2024 at 11:12:18PM GMT, Dave Chinner wrote:
> On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote:
> > On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote:
> > > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> > > > inode to achieve GFP_NOWAIT semantic while holding locks. If this
> > > > allocation fails it will drop locks and use GFP_NOFS allocation context.
> > > > 
> > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> > > > dangerous to use if the caller doesn't control the full call chain with
> > > > this flag set. E.g. if any of the functions down the chain needed a
> > > > GFP_NOFAIL request, PF_MEMALLOC_NORECLAIM would override this and
> > > > cause unexpected failure.
> > > > 
> > > > While this is not the case here, using the scoped gfp semantic is not
> > > > really needed because we can easily push the allocation context down
> > > > the chain without too much clutter.
> > > > 
> > > > Acked-by: Christoph Hellwig <hch@lst.de>
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > 
> > > Looks good to me.
> > > 
> > > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > 
> > Reposting what I wrote in the other thread:
> 
> I've read the thread. I've heard what you have had to say. Like
> several other people, I think your position is just not practical or
> reasonable.
> 
> I don't care about the purity or the safety of the API - the
> practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL
> allocation can now fail and that will cause unexpected kernel
> crashes.  Keeping existing code and API semantics working correctly
> (i.e. regression free) takes precedence over new functionality or
> API features that people want to introduce.
> 
> That's all there is to it. This is not a hill you need to die on.

If you use GFP_NOFAIL in a context where you're not allowed to sleep,
that's a bug, same as any other bug where you get the gfp flags wrong
(e.g. GFP_KERNEL in a non-sleepable context).

This isn't going to affect you unless you start going around inserting
PF_MEMALLOC_NORECLAIM where it doesn't need to be. Why would you do
that?

But the lack of gfp flags for pte allocation means that this actually is
a serious gap we need to be fixing.
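
A quick sketch of the bug class being referred to (illustrative only; the
lock and variables are hypothetical):

	spin_lock(&foo_lock);
	/*
	 * Bug: GFP_KERNEL (with or without __GFP_NOFAIL) may sleep in
	 * reclaim, but a spinlock is held.  In atomic context the caller
	 * must use GFP_NOWAIT/GFP_ATOMIC and be prepared for NULL.
	 */
	p = kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
	spin_unlock(&foo_lock);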
Kent Overstreet Aug. 29, 2024, 1:32 p.m. UTC | #6
On Thu, Aug 29, 2024 at 11:12:18PM GMT, Dave Chinner wrote:
> On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote:
> > On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote:
> > > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> > > > inode to achieve GFP_NOWAIT semantic while holding locks. If this
> > > > allocation fails it will drop locks and use GFP_NOFS allocation context.
> > > > 
> > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> > > > dangerous to use if the caller doesn't control the full call chain with
> > > > this flag set. E.g. if any of the functions down the chain needed a
> > > > GFP_NOFAIL request, PF_MEMALLOC_NORECLAIM would override this and
> > > > cause unexpected failure.
> > > > 
> > > > While this is not the case here, using the scoped gfp semantic is not
> > > > really needed because we can easily push the allocation context down
> > > > the chain without too much clutter.
> > > > 
> > > > Acked-by: Christoph Hellwig <hch@lst.de>
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > 
> > > Looks good to me.
> > > 
> > > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > 
> > Reposting what I wrote in the other thread:
> 
> I've read the thread. I've heard what you have had to say. Like
> several other people, I think your position is just not practical or
> reasonable.
> 
> I don't care about the purity or the safety of the API - the
> practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL
> allocation can now fail and that will cause unexpected kernel
> crashes.  Keeping existing code and API semantics working correctly
> (i.e. regression free) takes precedence over new functionality or
> API features that people want to introduce.
> 
> That's all there is to it. This is not a hill you need to die on.

And more than that, this is coming from you saying "We didn't have to
handle memory allocation failures in IRIX, why can't we be like IRIX?
All those error paths are a pain to test, why can't we get rid of them?"

Except that's bullshit; at the very least any dynamically sized
allocation _definitely_ has to have an error path that's tested, and if
there are questions about the context a code path might run in,
that's another reason.

GFP_NOFAIL is the problem here, and if it's encouraging this brain
damaged "why can't we just get rid of error paths?" thinking, then it
should be removed.

Error paths have to exist, and they have to be tested.
Yafang Shao Aug. 29, 2024, 2:03 p.m. UTC | #7
On Thu, Aug 29, 2024 at 9:32 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Thu, Aug 29, 2024 at 11:12:18PM GMT, Dave Chinner wrote:
> > On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote:
> > > On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote:
> > > > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > >
> > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> > > > > inode to achieve GFP_NOWAIT semantic while holding locks. If this
> > > > > allocation fails it will drop locks and use GFP_NOFS allocation context.
> > > > >
> > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> > > > > dangerous to use if the caller doesn't control the full call chain with
> > > > > this flag set. E.g. if any of the functions down the chain needed a
> > > > > GFP_NOFAIL request, PF_MEMALLOC_NORECLAIM would override this and
> > > > > cause unexpected failure.
> > > > >
> > > > > While this is not the case here, using the scoped gfp semantic is not
> > > > > really needed because we can easily push the allocation context down
> > > > > the chain without too much clutter.
> > > > >
> > > > > Acked-by: Christoph Hellwig <hch@lst.de>
> > > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > >
> > > > Looks good to me.
> > > >
> > > > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > >
> > > Reposting what I wrote in the other thread:
> >
> > I've read the thread. I've heard what you have had to say. Like
> > several other people, I think your position is just not practical or
> > reasonable.
> >
> > I don't care about the purity or the safety of the API - the
> > practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL
> > allocation can now fail and that will cause unexpected kernel
> > crashes.  Keeping existing code and API semantics working correctly
> > (i.e. regression free) takes precedence over new functionality or
> > API features that people want to introduce.
> >
> > That's all there is to it. This is not a hill you need to die on.
>
> And more than that, this is coming from you saying "We didn't have to
> handle memory allocation failures in IRIX, why can't we be like IRIX?
> All those error paths are a pain to test, why can't we get rid of them?"
>
> Except that's bullshit; at the very least any dynamically sized
> allocation _definitely_ has to have an error path that's tested, and if
> there are questions about the context a code path might run in,
> that's another reason.
>
> GFP_NOFAIL is the problem here, and if it's encouraging this brain
> damaged "why can't we just get rid of error paths?" thinking, then it
> should be removed.
>
> Error paths have to exist, and they have to be tested.

I completely agree.
Adding an infinite loop in the core of the page allocator just to bypass
error handling is a reckless idea.
Dave Chinner Sept. 2, 2024, 12:23 a.m. UTC | #8
On Thu, Aug 29, 2024 at 09:32:52AM -0400, Kent Overstreet wrote:
> On Thu, Aug 29, 2024 at 11:12:18PM GMT, Dave Chinner wrote:
> > On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote:
> > > On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote:
> > > > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko
> > > > wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > 
> > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to
> > > > > allocate a new inode to achieve GFP_NOWAIT semantic while
> > > > > holding locks. If this allocation fails it will drop locks
> > > > > and use GFP_NOFS allocation context.
> > > > > 
> > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is
> > > > > really dangerous to use if the caller doesn't control the
> > > > > full call chain with this flag set. E.g. if any of the
> > > > > functions down the chain needed a GFP_NOFAIL request,
> > > > > PF_MEMALLOC_NORECLAIM would override this and cause
> > > > > unexpected failure.
> > > > > 
> > > > > While this is not the case here, using the scoped gfp
> > > > > semantic is not really needed because we can easily push
> > > > > the allocation context down the chain without too much
> > > > > clutter.
> > > > > 
> > > > > Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by:
> > > > > Michal Hocko <mhocko@suse.com>
> > > > 
> > > > Looks good to me.
> > > > 
> > > > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Reposting what I wrote in the other thread:
> > 
> > I've read the thread. I've heard what you have had to say. Like
> > several other people, I think your position is just not
> > practical or reasonable.
> > 
> > I don't care about the purity or the safety of the API - the
> > practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL
> > allocation can now fail and that will cause unexpected kernel
> > crashes.  Keeping existing code and API semantics working
> > correctly (i.e. regression free) takes precedence over new
> > functionality or API features that people want to introduce.
> > 
> > That's all there is to it. This is not a hill you need to die
> > on.
> 
> And more than that, this is coming from you saying "We didn't have
> to handle memory allocation failures in IRIX, why can't we be like
> IRIX?  All those error paths are a pain to test, why can't we get
> rid of them?"
>
You're not listening, Kent. We are not eliding error paths because
they aren't (or cannot be) tested.

It's a choice between whether a transient error (e.g. ENOMEM) should
be considered a fatal error or not. The architectural choice that
was made for XFS back in the 1990s was that the filesystem should
never fail when transient errors occur. The choice was to wait for
the transient error to go away and then continue on. The rest of the
filesystem was built around these fundamental behavioural choices.

This goes beyond memory allocation - we do it for IO errors, too.
e.g.  metadata writeback keeps trying to write back the metadata
repeatedly on -EIO.  On every EIO from a metadata write, we will
immediately attempt a rewrite without a backoff. If that rewrite
then fails, we requeue the write for later resubmission. That means
we back off and wait for up to 30s before attempting the next
rewrite. 

Hence -EIO  on async metadata writeback won't fail/shutdown the
filesystem until a (configurable) number of repeated failures occurs
or the filesystem unmounts before the metadata could be written back
successfully.
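
Roughly, the policy described above has this shape (heavily simplified,
with hypothetical names - not the actual XFS buffer code):

static void metadata_write_done(struct mbuf *bp, int error)
{
	if (!error) {
		bp->retries = 0;		/* success clears the retry state */
		return;
	}
	if (bp->retries++ == 0) {
		resubmit_write(bp);		/* immediate rewrite, no backoff */
		return;
	}
	if (bp->retries < bp->max_retries)
		requeue_delayed(bp, 30 * HZ);	/* back off, retry later */
	else
		shutdown_fs(bp->mount);		/* now treated as permanent */
}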

There's good reason for this "wait for transients to resolve" method
of error handling - go back to the late 1990s and early 2000s and
high-end multi-path FC SAN based storage was well known to have
transient path failures that can take minutes to resolve before a
secondary path takes over. That was the sort of storage environment
XFS was designed to operate in, and those users expected the
filesystem to be extremely tolerant of transient failure conditions.

Hence failing an IO and shutting down the filesystem because there
are transient errors occurring in either the storage or the OS was
absolutely the wrong thing to be doing. It still is the wrong thing
to be doing - we want to wait until the transient error has
progressed to being classified as a permanent error before we take
drastic action like denying service to the filesystem.

Memory allocation failure has always been considered a transient
error by XFS that the OS will resolve in one way or another in a
relatively short period of time. If we're prepared to wait minutes
for IO path failures to resolve, waiting a couple of seconds for
transient low memory situations to resolve isn't a big deal.

Ranting about how we handle errors without understanding the error
model we are working within is not productive. bcachefs has a
different error handling model to almost every other filesystem out
there, but that doesn't mean every other filesystem must work the
same way that bcachefs does.

If changing this transient error handling model was as simple as
detecting an allocation failure and returning -ENOMEM, we would have
done that 20 years ago. But it isn't - the error handling model is
"block until transients resolve" so that the error handling paths
only need to handle fatal errors.

Therein lies the problem - those error handling paths need to be
substantially changed to be able to handle transient errors such as
ENOMEM. We'd need to either be able to back out of a dirty
transaction or restart the transaction in some way rather than
shutting down the filesystem.

Put simply: reclassifying ENOMEM from a "wait for transient to
resolve" handler to a "back out and restart" mechanism like bcachefs
uses requires re-architecting the entire log item architecture for
metadata modification tracking and journal space management.

Even if I knew how to implement this right now, it would require
years worth of engineering effort and resources before it would be
completed and ready for merge. Then it will take years more for all
the existing kernels to cycle out of production.

Further: this "ENOMEM is transient so retry" model has been used
without any significant issues in production systems for mission
critical infrastructure for the past 25+ years. There's a very
strong "if it ain't broke, don't fix it" argument to be made here.
The cost-benefit analysis comes out very strongly on the side of
keeping __GFP_NOFAIL semantics as they currently stand.

> Except that's bullshit; at the very least any dynamically sized
> allocation _definitely_ has to have an error path that's tested, and if
> there are questions about the context a code path might run in,
> that's another reason.

We know the context we run __GFP_NOFAIL allocations in - transaction
context absolutely requires a task context because we take sleeping
locks, submit and wait on IO, do blocking memory allocation, etc. We
also know the size of the allocations because we've bounds checked
everything before we do an allocation.

Hence this argument of "__GFP_NOFAIL absolutely requires error
checking because an invalid size or wrong context might be used"
is completely irrelevant to XFS. If you call into the filesystem
from an atomic context, you've lost long before we get to memory
allocation because filesystems take sleeping locks....

> GFP_NOFAIL is the problem here, and if it's encouraging this brain
> damaged "why can't we just get rid of error paths?" thinking, then it
> should be removed.
>
> Error paths have to exist, and they have to be tested.

__GFP_NOFAIL is *not the problem*, and we are not "avoiding error
handling".  Backing off, looping and trying again is a valid
mechanism for handling transient failure conditions. Having a flag
that tells the allocator to "backoff, loop and try again" is a
perfectly good way of providing a generic error handling mechanism.

IOWs, using __GFP_NOFAIL doesn't mean we are "not handling errors";
it simply means we have moved the error handling we were doing
inside the allocator.  And yes, we test the hell out of this error
handling path....
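
Concretely, __GFP_NOFAIL stands in for the open-coded retry loop callers
would otherwise carry themselves - a rough sketch:

static void *alloc_nofs_retry_forever(size_t size)
{
	void *p;

	do {
		p = kmalloc(size, GFP_NOFS);
		if (!p)
			/* back off; let reclaim/writeback make progress */
			memalloc_retry_wait(GFP_NOFS);
	} while (!p);

	return p;
}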

-Dave.
Kent Overstreet Sept. 2, 2024, 1:35 a.m. UTC | #9
On Mon, Sep 02, 2024 at 10:23:06AM GMT, Dave Chinner wrote:
> On Thu, Aug 29, 2024 at 09:32:52AM -0400, Kent Overstreet wrote:
> > And more than that, this is coming from you saying "We didn't have
> > to handle memory allocation failures in IRIX, why can't we be like
> > IRIX?  All those error paths are a pain to test, why can't we get
> > rid of them?"
> >
> You're not listening, Kent. We are not eliding error paths because
> they aren't (or cannot be) tested.
> 
> It's a choice between whether a transient error (e.g. ENOMEM) should
> be considered a fatal error or not. The architectural choice that
> was made for XFS back in the 1990s was that the filesystem should
> never fail when transient errors occur. The choice was to wait for
> the transient error to go away and then continue on. The rest of the
> filesystem was built around these fundamental behavioural choices.

<snipped>

> Hence failing an IO and shutting down the filesystem because there
> are transient errors occurring in either the storage or the OS was
> absolutely the wrong thing to be doing. It still is the wrong thing
> to be doing - we want to wait until the transient error has
> progressed to being classified as a permanent error before we take
> drastic action like denying service to the filesystem.

Two different things are getting conflated here.

I'm saying forcefully that __GFP_NOFAIL isn't a good design decision,
and it's bad for system behaviour when we're thrashing, because you,
willy and others have been talking openly about making something like
that the norm, and that concerns me.

I'm _not_ saying that you should have to rearchitect your codebase to
deal with transient -ENOMEMs... that's clearly not going to happen.

But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL
in the case of bugs, because that's going to be an improvement w.r.t.
system robustness, in exactly the same way we don't use BUG_ON() if it's
something that we can't guarantee won't happen in the wild - we WARN()
and try to handle the error as best we can.

If we could someday get PF_MEMALLOC_NORECLAIM annotated everywhere we're
not in a sleepable context (or more likely, teach kmalloc() about
preempt count) - that turns a whole class of "scheduling while atomic"
bugs into recoverable errors. That'd be a good thing.
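
Something along these lines, purely as a sketch of the idea (this is not
existing allocator code; kmalloc_checked() is hypothetical):

static void *kmalloc_checked(size_t size, gfp_t gfp)
{
	/*
	 * The caller claims it may block, but we are actually in atomic
	 * context: don't sleep.  Degrade to a non-blocking attempt that
	 * is allowed to fail and warn - a NULL the caller can handle
	 * beats a "scheduling while atomic" crash.
	 */
	if (gfpflags_allow_blocking(gfp) &&
	    (in_atomic() || irqs_disabled())) {
		WARN_ON_ONCE(1);
		return kmalloc(size, GFP_NOWAIT);
	}
	return kmalloc(size, gfp);
}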

In XFS's case, the only thing you'd ever want to do on failure of a
__GFP_NOFAIL allocation is do an emergency shutdown - the code is buggy,
and those bugs tend to be quickly found and fixed (even quicker when the
system stays usable and the user can collect a backtrace).

> > Except that's bullshit; at the very least any dynamically sized
> > allocation _definitely_ has to have an error path that's tested, and if
> > there are questions about the context a code path might run in,
> > that's another reason.
> 
> We know the context we run __GFP_NOFAIL allocations in - transaction
> context absolutely requires a task context because we take sleeping
> locks, submit and wait on IO, do blocking memory allocation, etc. We
> also know the size of the allocations because we've bounds checked
> everything before we do an allocation.

*nod*

> Hence this argument of "__GFP_NOFAIL absolutely requires error
> checking because an invalid size or wrong context might be used"
> is completely irrelevant to XFS. If you call into the filesystem
> from an atomic context, you've lost long before we get to memory
> allocation because filesystems take sleeping locks....

Yeah, if you know the context of your allocation and you have bounds on
the allocation size that's all fine - that's your call.

_As a general policy_, I will say that even __GFP_NOFAIL allocations
should have error paths - because lots of allocations have sizes that
userspace can control, so it becomes a security issue and we need to be
careful what we tell people.

> > GFP_NOFAIL is the problem here, and if it's encouraging this brain
> > damaged "why can't we just get rid of error paths?" thinking, then it
> > should be removed.
> >
> > Error paths have to exist, and they have to be tested.
> 
> __GFP_NOFAIL is *not the problem*, and we are not "avoiding error
> handling".  Backing off, looping and trying again is a valid
> mechanism for handling transient failure conditions. Having a flag
> that tells the allocator to "backoff, loop and try again" is a
> perfectly good way of providing a generic error handling mechanism.

But we do have to differentiate between transient allocation failures
and failures that are a result of bugs. Because just looping means a
buggy allocation call becomes, at best, a hung task you can't kill.
Michal Hocko Sept. 2, 2024, 8:41 a.m. UTC | #10
On Sun 01-09-24 21:35:30, Kent Overstreet wrote:
[...]
> But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL
> in the case of bugs, because that's going to be an improvement w.r.t.
> system robustness, in exactly the same way we don't use BUG_ON() if it's
> something that we can't guarantee won't happen in the wild - we WARN()
> and try to handle the error as best we can.

We have discussed that in a different email thread. And I have to say
that I am not convinced that returning NULL makes broken code much
better. Why? Because we can expect that broken NOFAIL users will not have
an error checking path. Even valid NOFAIL users will not have one because
they _know_ they do not have any recovery path other than retrying
forever.

That means that an unexpected NULL return either means an OOPS or a subtle
silent error - e.g. memory corruption. The former is actually a saner
recovery model because the execution is stopped before more harm can be
done. I suspect most of those buggy users will simply OOPS but
systematically checking for the latter is a lot of work and needs to be
done constantly because code evolves...

I have tried to argue that if the allocator cannot or refuses to satisfy
a GFP_NOFAIL request because it is trying to use an unsupported allocation
mode or size, then we should terminate the allocation context. That would
make the API more predictable and therefore safer to use.

This is not what the allocator does today though. Atomic NOFAIL
allocations fail, the same as kvmalloc requests which are clearly overflows.
Especially the latter could become a risk if they are reachable from
userspace with a controllable allocation size.
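
To make the contrast concrete, a sketch of the behaviour argued for here
(hypothetical - not what the page allocator does today):

static struct page *nofail_alloc(gfp_t gfp, unsigned int order)
{
	/*
	 * The caller asked for "cannot fail" in a mode the allocator
	 * cannot honour (no reclaim allowed, or an unreasonably large
	 * request).  Terminate the context instead of handing NULL back
	 * to code that, by definition, has no error path.
	 */
	if ((gfp & __GFP_NOFAIL) &&
	    (!gfpflags_allow_blocking(gfp) || order > PAGE_ALLOC_COSTLY_ORDER))
		BUG();

	return alloc_pages(gfp, order);
}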
Kent Overstreet Sept. 2, 2024, 8:52 a.m. UTC | #11
On Mon, Sep 02, 2024 at 10:41:31AM GMT, Michal Hocko wrote:
> On Sun 01-09-24 21:35:30, Kent Overstreet wrote:
> [...]
> > But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL
> > in the case of bugs, because that's going to be an improvement w.r.t.
> > system robustness, in exactly the same way we don't use BUG_ON() if it's
> > something that we can't guarantee won't happen in the wild - we WARN()
> > and try to handle the error as best we can.
> 
> We have discussed that in a different email thread. And I have to say
> that I am not convinced that returning NULL makes broken code much
> better. Why? Because we can expect that broken NOFAIL users will not have
> an error checking path. Even valid NOFAIL users will not have one because
> they _know_ they do not have any recovery path other than retrying
> forever.

You mean where I asked you for a link to the discussion and rationale
you claimed had happened? Still waiting on that

> That means that an unexpected NULL return either means an OOPS or a subtle
> silent error - e.g. memory corruption. The former is actually a saner
> recovery model because the execution is stopped before more harm can be
> done. I suspect most of those buggy users will simply OOPS but
> systematically checking for the latter is a lot of work and needs to be
> done constantly because code evolves...
> 
> I have tried to argue that if the allocator cannot or refuses to satisfy
> a GFP_NOFAIL request because it is trying to use an unsupported allocation
> mode or size, then we should terminate the allocation context. That would
> make the API more predictable and therefore safer to use.

You're arguing that we treat it like an oops?

Yeah, enough of this insanity.
Michal Hocko Sept. 2, 2024, 9:39 a.m. UTC | #12
On Mon 02-09-24 04:52:49, Kent Overstreet wrote:
> On Mon, Sep 02, 2024 at 10:41:31AM GMT, Michal Hocko wrote:
> > On Sun 01-09-24 21:35:30, Kent Overstreet wrote:
> > [...]
> > > But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL
> > > in the case of bugs, because that's going to be an improvement w.r.t.
> > > system robustness, in exactly the same way we don't use BUG_ON() if it's
> > > something that we can't guarantee won't happen in the wild - we WARN()
> > > and try to handle the error as best we can.
> > 
> > We have discussed that in a different email thread. And I have to say
> > that I am not convinced that returning NULL makes broken code much
> > better. Why? Because we can expect that broken NOFAIL users will not have
> > an error checking path. Even valid NOFAIL users will not have one because
> > they _know_ they do not have any recovery path other than retrying
> > forever.
> 
> You mean where I asked you for a link to the discussion and rationale
> you claimed had happened? Still waiting on that

I am not your assistant to be tasked with searching through lore archives.
Find one if you need that.

Anyway, if you had read the email and even tried to understand what is
written there rather than immediately shouting a response, then
you would have noticed I have put actual arguments here. You are free to
disagree with them and lay down your arguments. You have decided to

[...]

> Yeah, enough of this insanity.

so I do not think you are able to do that. Again...
Kent Overstreet Sept. 2, 2024, 9:51 a.m. UTC | #13
On Mon, Sep 02, 2024 at 11:39:41AM GMT, Michal Hocko wrote:
> On Mon 02-09-24 04:52:49, Kent Overstreet wrote:
> > On Mon, Sep 02, 2024 at 10:41:31AM GMT, Michal Hocko wrote:
> > > On Sun 01-09-24 21:35:30, Kent Overstreet wrote:
> > > [...]
> > > > But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL
> > > > in the case of bugs, because that's going to be an improvement w.r.t.
> > > > system robustness, in exactly the same way we don't use BUG_ON() if it's
> > > > something that we can't guarantee won't happen in the wild - we WARN()
> > > > and try to handle the error as best we can.
> > > 
> > > We have discussed that in a different email thread. And I have to say
> > > that I am not convinced that returning NULL makes broken code much
> > > better. Why? Because we can expect that broken NOFAIL users will not have
> > > an error checking path. Even valid NOFAIL users will not have one because
> > > they _know_ they do not have any recovery path other than retrying
> > > forever.
> > 
> > You mean where I asked you for a link to the discussion and rationale
> > you claimed had happened? Still waiting on that
> 
> I am not your assistant to be tasked with searching through lore archives.
> Find one if you need that.
> 
> Anyway, if you had read the email and even tried to understand what is
> written there rather than immediately shouting a response, then
> you would have noticed I have put actual arguments here. You are free to
> disagree with them and lay down your arguments. You have decided to
> 
> [...]
> 
> > Yeah, enough of this insanity.
> 
> so I do not think you are able to do that. Again...

Michal, if you think crashing processes is an acceptable alternative to
error handling _you have no business writing kernel code_.

You have been stridently arguing for one bad idea after another, and
it's an insult to those of us who do give a shit about writing reliable
software.

You're arguing against basic precepts of kernel programming.

Get your head examined. And get the fuck out of here with this shit.
Jonathan Corbet Sept. 2, 2024, 2:07 p.m. UTC | #14
Kent Overstreet <kent.overstreet@linux.dev> writes:

> You're arguing against basic precepts of kernel programming.
>
> Get your head examined. And get the fuck out of here with this shit.

Kent, this is not the way to deal with your peers in this community, no
matter how strongly you disagree with them.  Might I suggest calming
down a bit, followed by an apology?

Thanks,

jon
Shuah Khan Sept. 4, 2024, 6:01 p.m. UTC | #15
On 9/2/24 03:51, Kent Overstreet wrote:
> On Mon, Sep 02, 2024 at 11:39:41AM GMT, Michal Hocko wrote:
>> On Mon 02-09-24 04:52:49, Kent Overstreet wrote:
>>> On Mon, Sep 02, 2024 at 10:41:31AM GMT, Michal Hocko wrote:
>>>> On Sun 01-09-24 21:35:30, Kent Overstreet wrote:
>>>> [...]
>>>>> But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL
>>>>> in the case of bugs, because that's going to be an improvement w.r.t.
>>>>> system robustness, in exactly the same way we don't use BUG_ON() if it's
>>>>> something that we can't guarantee won't happen in the wild - we WARN()
>>>>> and try to handle the error as best we can.
>>>>
>>>> We have discussed that in a different email thread. And I have to say
>>>> that I am not convinced that returning NULL makes broken code much
>>>> better. Why? Because we can expect that broken NOFAIL users will not have
>>>> an error checking path. Even valid NOFAIL users will not have one because
>>>> they _know_ they do not have any recovery path other than retrying
>>>> forever.
>>>
>>> You mean where I asked you for a link to the discussion and rationale
>>> you claimed had happened? Still waiting on that
>>
>> I am not your assistant to be tasked with searching through lore archives.
>> Find one if you need that.
>>
>> Anyway, if you had read the email and even tried to understand what is
>> written there rather than immediately shouting a response, then
>> you would have noticed I have put actual arguments here. You are free to
>> disagree with them and lay down your arguments. You have decided to
>>
>> [...]
>>
>>> Yeah, enough of this insanity.
>>
>> so I do not think you are able to do that. Again...
> 
> Michal, if you think crashing processes is an acceptable alternative to
> error handling _you have no business writing kernel code_.
> 
> You have been stridently arguing for one bad idea after another, and
> it's an insult to those of us who do give a shit about writing reliable
> software.
> 
> You're arguing against basic precepts of kernel programming.
> 
> Get your head examined. And get the fuck out of here with this shit.
> 

Kent,

Using language like this is clearly unacceptable and violates the
Code of Conduct. This type of language doesn't promote respectful
and productive discussions and is detrimental to the health of the
community.

You should be well aware that this type of language and personal
attack is a clear violation of the Linux kernel Contributor Covenant
Code of Conduct as outlined in the following:

https://www.kernel.org/doc/html/latest/process/code-of-conduct.html

Refer to the Code of Conduct and refrain from violating the Code of
Conduct in the future.

thanks,
-- Shuah (On behalf of the Code of Conduct Committee)

Patch

diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index 15fc41e63b6c..d151a2f28d12 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -231,9 +231,9 @@  static struct inode *bch2_alloc_inode(struct super_block *sb)
 	BUG();
 }
 
-static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c)
+static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c, gfp_t gfp)
 {
-	struct bch_inode_info *inode = kmem_cache_alloc(bch2_inode_cache, GFP_NOFS);
+	struct bch_inode_info *inode = kmem_cache_alloc(bch2_inode_cache, gfp);
 	if (!inode)
 		return NULL;
 
@@ -245,7 +245,7 @@  static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c)
 	mutex_init(&inode->ei_quota_lock);
 	memset(&inode->ei_devs_need_flush, 0, sizeof(inode->ei_devs_need_flush));
 
-	if (unlikely(inode_init_always(c->vfs_sb, &inode->v))) {
+	if (unlikely(inode_init_always_gfp(c->vfs_sb, &inode->v, gfp))) {
 		kmem_cache_free(bch2_inode_cache, inode);
 		return NULL;
 	}
@@ -258,12 +258,10 @@  static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c)
  */
 static struct bch_inode_info *bch2_new_inode(struct btree_trans *trans)
 {
-	struct bch_inode_info *inode =
-		memalloc_flags_do(PF_MEMALLOC_NORECLAIM|PF_MEMALLOC_NOWARN,
-				  __bch2_new_inode(trans->c));
+	struct bch_inode_info *inode = __bch2_new_inode(trans->c, GFP_NOWAIT);
 
 	if (unlikely(!inode)) {
-		int ret = drop_locks_do(trans, (inode = __bch2_new_inode(trans->c)) ? 0 : -ENOMEM);
+		int ret = drop_locks_do(trans, (inode = __bch2_new_inode(trans->c, GFP_NOFS)) ? 0 : -ENOMEM);
 		if (ret && inode) {
 			__destroy_inode(&inode->v);
 			kmem_cache_free(bch2_inode_cache, inode);
@@ -328,7 +326,7 @@  __bch2_create(struct mnt_idmap *idmap,
 	if (ret)
 		return ERR_PTR(ret);
 #endif
-	inode = __bch2_new_inode(c);
+	inode = __bch2_new_inode(c, GFP_NOFS);
 	if (unlikely(!inode)) {
 		inode = ERR_PTR(-ENOMEM);
 		goto err;
diff --git a/fs/inode.c b/fs/inode.c
index 86670941884b..a2aabbcffbe4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -153,7 +153,7 @@  static int no_open(struct inode *inode, struct file *file)
  * These are initializations that need to be done on every inode
  * allocation as the fields are not initialised by slab allocation.
  */
-int inode_init_always(struct super_block *sb, struct inode *inode)
+int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp)
 {
 	static const struct inode_operations empty_iops;
 	static const struct file_operations no_open_fops = {.open = no_open};
@@ -230,14 +230,14 @@  int inode_init_always(struct super_block *sb, struct inode *inode)
 #endif
 	inode->i_flctx = NULL;
 
-	if (unlikely(security_inode_alloc(inode)))
+	if (unlikely(security_inode_alloc(inode, gfp)))
 		return -ENOMEM;
 
 	this_cpu_inc(nr_inodes);
 
 	return 0;
 }
-EXPORT_SYMBOL(inode_init_always);
+EXPORT_SYMBOL(inode_init_always_gfp);
 
 void free_inode_nonrcu(struct inode *inode)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..d46ca71a7855 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3027,7 +3027,12 @@  extern loff_t default_llseek(struct file *file, loff_t offset, int whence);
 
 extern loff_t vfs_llseek(struct file *file, loff_t offset, int whence);
 
-extern int inode_init_always(struct super_block *, struct inode *);
+extern int inode_init_always_gfp(struct super_block *, struct inode *, gfp_t);
+static inline int inode_init_always(struct super_block *sb, struct inode *inode)
+{
+	return inode_init_always_gfp(sb, inode, GFP_NOFS);
+}
+
 extern void inode_init_once(struct inode *);
 extern void address_space_init_once(struct address_space *mapping);
 extern struct inode * igrab(struct inode *);
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index a2ade0ffe9e7..b08472d64765 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -150,6 +150,6 @@  extern struct lsm_info __start_early_lsm_info[], __end_early_lsm_info[];
 		__used __section(".early_lsm_info.init")		\
 		__aligned(sizeof(unsigned long))
 
-extern int lsm_inode_alloc(struct inode *inode);
+extern int lsm_inode_alloc(struct inode *inode, gfp_t gfp);
 
 #endif /* ! __LINUX_LSM_HOOKS_H */
diff --git a/include/linux/security.h b/include/linux/security.h
index 1390f1efb4f0..7c6b9b038a0d 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -336,7 +336,7 @@  int security_dentry_create_files_as(struct dentry *dentry, int mode,
 					struct cred *new);
 int security_path_notify(const struct path *path, u64 mask,
 					unsigned int obj_type);
-int security_inode_alloc(struct inode *inode);
+int security_inode_alloc(struct inode *inode, gfp_t gfp);
 void security_inode_free(struct inode *inode);
 int security_inode_init_security(struct inode *inode, struct inode *dir,
 				 const struct qstr *qstr,
@@ -769,7 +769,7 @@  static inline int security_path_notify(const struct path *path, u64 mask,
 	return 0;
 }
 
-static inline int security_inode_alloc(struct inode *inode)
+static inline int security_inode_alloc(struct inode *inode, gfp_t gfp)
 {
 	return 0;
 }
diff --git a/security/security.c b/security/security.c
index 8cee5b6c6e6d..3581262da5ee 100644
--- a/security/security.c
+++ b/security/security.c
@@ -660,14 +660,14 @@  static int lsm_file_alloc(struct file *file)
  *
  * Returns 0, or -ENOMEM if memory can't be allocated.
  */
-int lsm_inode_alloc(struct inode *inode)
+int lsm_inode_alloc(struct inode *inode, gfp_t gfp)
 {
 	if (!lsm_inode_cache) {
 		inode->i_security = NULL;
 		return 0;
 	}
 
-	inode->i_security = kmem_cache_zalloc(lsm_inode_cache, GFP_NOFS);
+	inode->i_security = kmem_cache_zalloc(lsm_inode_cache, gfp);
 	if (inode->i_security == NULL)
 		return -ENOMEM;
 	return 0;
@@ -1582,9 +1582,9 @@  int security_path_notify(const struct path *path, u64 mask,
  *
  * Return: Return 0 if operation was successful.
  */
-int security_inode_alloc(struct inode *inode)
+int security_inode_alloc(struct inode *inode, gfp_t gfp)
 {
-	int rc = lsm_inode_alloc(inode);
+	int rc = lsm_inode_alloc(inode, gfp);
 
 	if (unlikely(rc))
 		return rc;