Message ID | 20240827061543.1235703-1-mhocko@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [1/2,v2] bcachefs: do not use PF_MEMALLOC_NORECLAIM | expand |
Thanks for fixing this up!
Reviewed-by: Christoph Hellwig <hch@lst.de>
On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new > inode to achieve GFP_NOWAIT semantic while holding locks. If this > allocation fails it will drop locks and use GFP_NOFS allocation context. > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really > dangerous to use if the caller doesn't control the full call chain with > this flag set. E.g. if any of the function down the chain needed > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and > cause unexpected failure. > > While this is not the case in this particular case using the scoped gfp > semantic is not really needed bacause we can easily pus the allocation > context down the chain without too much clutter. > > Acked-by: Christoph Hellwig <hch@lst.de> > Signed-off-by: Michal Hocko <mhocko@suse.com> Looks good to me. Reviewed-by: Dave Chinner <dchinner@redhat.com>
On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote: > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote: > > From: Michal Hocko <mhocko@suse.com> > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new > > inode to achieve GFP_NOWAIT semantic while holding locks. If this > > allocation fails it will drop locks and use GFP_NOFS allocation context. > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really > > dangerous to use if the caller doesn't control the full call chain with > > this flag set. E.g. if any of the function down the chain needed > > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and > > cause unexpected failure. > > > > While this is not the case in this particular case using the scoped gfp > > semantic is not really needed bacause we can easily pus the allocation > > context down the chain without too much clutter. > > > > Acked-by: Christoph Hellwig <hch@lst.de> > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > Looks good to me. > > Reviewed-by: Dave Chinner <dchinner@redhat.com> Reposting what I wrote in the other thread: This series is based on a fundamental misunderstanding of what a safe API is: a safe API is not one that doesn't return errors, it's one that never invokes undefined behaviour. It was decided years ago that the scoped APIs were the better replacement for the gfp APIs, and they need to exist. This "GFP_NOFAIL exists therefore we can't tell the memory allocator about situations wehre it would have to fail" is utter insanity - it's the exact opposite of defining a safe API. A safe API would be one where we /did/ always tell the memory allocator when we're in non-sleepable context, and the allocator always returned failure instead of context switching. This is utter brain damage; rule #1 of kernel programming is that _you check for errors_. If you don't know that your GFP_NOFAIL usage is in a safe context (and that's not just memory reclaim context, it's also the size of your alloction) then you have to check for errors.
On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote: > On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote: > > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote: > > > From: Michal Hocko <mhocko@suse.com> > > > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new > > > inode to achieve GFP_NOWAIT semantic while holding locks. If this > > > allocation fails it will drop locks and use GFP_NOFS allocation context. > > > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really > > > dangerous to use if the caller doesn't control the full call chain with > > > this flag set. E.g. if any of the function down the chain needed > > > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and > > > cause unexpected failure. > > > > > > While this is not the case in this particular case using the scoped gfp > > > semantic is not really needed bacause we can easily pus the allocation > > > context down the chain without too much clutter. > > > > > > Acked-by: Christoph Hellwig <hch@lst.de> > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > > Looks good to me. > > > > Reviewed-by: Dave Chinner <dchinner@redhat.com> > > Reposting what I wrote in the other thread: I've read the thread. I've heard what you have had to say. Like several other people, I think your position is just not practical or reasonable. I don't care about the purity or the safety of the API - the practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL allocation can now fail and that will cause unexpected kernel crashes. Keeping existing code and API semantics working correctly (i.e. regression free) takes precedence over new functionality or API features that people want to introduce. That's all there is to it. This is not a hill you need to die on. -Dave.
On Thu, Aug 29, 2024 at 11:12:18PM GMT, Dave Chinner wrote: > On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote: > > On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote: > > > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote: > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new > > > > inode to achieve GFP_NOWAIT semantic while holding locks. If this > > > > allocation fails it will drop locks and use GFP_NOFS allocation context. > > > > > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really > > > > dangerous to use if the caller doesn't control the full call chain with > > > > this flag set. E.g. if any of the function down the chain needed > > > > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and > > > > cause unexpected failure. > > > > > > > > While this is not the case in this particular case using the scoped gfp > > > > semantic is not really needed bacause we can easily pus the allocation > > > > context down the chain without too much clutter. > > > > > > > > Acked-by: Christoph Hellwig <hch@lst.de> > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > > > > Looks good to me. > > > > > > Reviewed-by: Dave Chinner <dchinner@redhat.com> > > > > Reposting what I wrote in the other thread: > > I've read the thread. I've heard what you have had to say. Like > several other people, I think your position is just not practical or > reasonable. > > I don't care about the purity or the safety of the API - the > practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL > allocation can now fail and that will cause unexpected kernel > crashes. Keeping existing code and API semantics working correctly > (i.e. regression free) takes precedence over new functionality or > API features that people want to introduce. > > That's all there is to it. This is not a hill you need to die on. If you use GFP_NOFAIL in a context where you're not allowed to sleep, that's a bug, same as any other bug where you get the gfp flags wrong (e.g. GFP_KERNEL in non sleepable context). This isn't going to affect you unless you start going around inserting PF_MEMALLOC_NORECLAIM where it doesn't need to be. Why would you do that? But the lack of gfp flags for pte allocation means that this actually is a serious gap we need to be fixing.
On Thu, Aug 29, 2024 at 11:12:18PM GMT, Dave Chinner wrote: > On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote: > > On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote: > > > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote: > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new > > > > inode to achieve GFP_NOWAIT semantic while holding locks. If this > > > > allocation fails it will drop locks and use GFP_NOFS allocation context. > > > > > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really > > > > dangerous to use if the caller doesn't control the full call chain with > > > > this flag set. E.g. if any of the function down the chain needed > > > > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and > > > > cause unexpected failure. > > > > > > > > While this is not the case in this particular case using the scoped gfp > > > > semantic is not really needed bacause we can easily pus the allocation > > > > context down the chain without too much clutter. > > > > > > > > Acked-by: Christoph Hellwig <hch@lst.de> > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > > > > Looks good to me. > > > > > > Reviewed-by: Dave Chinner <dchinner@redhat.com> > > > > Reposting what I wrote in the other thread: > > I've read the thread. I've heard what you have had to say. Like > several other people, I think your position is just not practical or > reasonable. > > I don't care about the purity or the safety of the API - the > practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL > allocation can now fail and that will cause unexpected kernel > crashes. Keeping existing code and API semantics working correctly > (i.e. regression free) takes precedence over new functionality or > API features that people want to introduce. > > That's all there is to it. This is not a hill you need to die on. And more than that, this is coming from you saying "We didn't have to handle memory allocation failures in IRIX, why can't we be like IRIX? All those error paths are a pain to test, why can't we get rid of them?" Except that's bullshit; at the very least any dynamically sized allocation _definitely_ has to have an error path that's tested, and if there's questions about the context a code path might run in, that that's another reason. GFP_NOFAIL is the problem here, and if it's encouraging this brain damaged "why can't we just get rid of error paths?" thinking, then it should be removed. Error paths have to exist, and they have to be tested.
On Thu, Aug 29, 2024 at 9:32 PM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Thu, Aug 29, 2024 at 11:12:18PM GMT, Dave Chinner wrote: > > On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote: > > > On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote: > > > > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko wrote: > > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new > > > > > inode to achieve GFP_NOWAIT semantic while holding locks. If this > > > > > allocation fails it will drop locks and use GFP_NOFS allocation context. > > > > > > > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really > > > > > dangerous to use if the caller doesn't control the full call chain with > > > > > this flag set. E.g. if any of the function down the chain needed > > > > > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and > > > > > cause unexpected failure. > > > > > > > > > > While this is not the case in this particular case using the scoped gfp > > > > > semantic is not really needed bacause we can easily pus the allocation > > > > > context down the chain without too much clutter. > > > > > > > > > > Acked-by: Christoph Hellwig <hch@lst.de> > > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > > > > > > Looks good to me. > > > > > > > > Reviewed-by: Dave Chinner <dchinner@redhat.com> > > > > > > Reposting what I wrote in the other thread: > > > > I've read the thread. I've heard what you have had to say. Like > > several other people, I think your position is just not practical or > > reasonable. > > > > I don't care about the purity or the safety of the API - the > > practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL > > allocation can now fail and that will cause unexpected kernel > > crashes. Keeping existing code and API semantics working correctly > > (i.e. regression free) takes precedence over new functionality or > > API features that people want to introduce. > > > > That's all there is to it. This is not a hill you need to die on. > > And more than that, this is coming from you saying "We didn't have to > handle memory allocation failures in IRIX, why can't we be like IRIX? > All those error paths are a pain to test, why can't we get rid of them?" > > Except that's bullshit; at the very least any dynamically sized > allocation _definitely_ has to have an error path that's tested, and if > there's questions about the context a code path might run in, that > that's another reason. > > GFP_NOFAIL is the problem here, and if it's encouraging this brain > damaged "why can't we just get rid of error paths?" thinking, then it > should be removed. > > Error paths have to exist, and they have to be tested. I completely agree. Adding a dead loop in the core of the page allocator just to bypass error handling is a reckless idea.
On Thu, Aug 29, 2024 at 09:32:52AM -0400, Kent Overstreet wrote: > On Thu, Aug 29, 2024 at 11:12:18PM GMT, Dave Chinner wrote: > > On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote: > > > On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote: > > > > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko > > > > wrote: > > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to > > > > > allocate a new inode to achieve GFP_NOWAIT semantic while > > > > > holding locks. If this allocation fails it will drop locks > > > > > and use GFP_NOFS allocation context. > > > > > > > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is > > > > > really dangerous to use if the caller doesn't control the > > > > > full call chain with this flag set. E.g. if any of the > > > > > function down the chain needed GFP_NOFAIL request the > > > > > PF_MEMALLOC_NORECLAIM would override this and cause > > > > > unexpected failure. > > > > > > > > > > While this is not the case in this particular case using > > > > > the scoped gfp semantic is not really needed bacause we > > > > > can easily pus the allocation context down the chain > > > > > without too much clutter. > > > > > > > > > > Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: > > > > > Michal Hocko <mhocko@suse.com> > > > > > > > > Looks good to me. > > > > > > > > Reviewed-by: Dave Chinner <dchinner@redhat.com> > > > > > > Reposting what I wrote in the other thread: > > > > I've read the thread. I've heard what you have had to say. Like > > several other people, I think your position is just not > > practical or reasonable. > > > > I don't care about the purity or the safety of the API - the > > practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL > > allocation can now fail and that will cause unexpected kernel > > crashes. Keeping existing code and API semantics working > > correctly (i.e. regression free) takes precedence over new > > functionality or API features that people want to introduce. > > > > That's all there is to it. This is not a hill you need to die > > on. > > And more than that, this is coming from you saying "We didn't have > to handle memory allocation failures in IRIX, why can't we be like > IRIX? All those error paths are a pain to test, why can't we get > rid of them?" > You're not listening, Kent. We are not eliding error paths because they aren't (or cannot be) tested. It's a choice between whether a transient error (e.g. ENOMEM) should be considered a fatal error or not. The architectural choice that was made for XFS back in the 1990s was that the filesystem should never fail when transient errors occur. The choice was to wait for the transient error to go away and then continue on. The rest of the filesystem was build around these fundamental behavioural choices. This goes beyond memory allocation - we do it for IO errors, too. e.g. metadata writeback keeps trying to write back the metadata repeatedly on -EIO. On every EIO from a metadata write, we will immediately attempt a rewrite without a backoff. If that rewrite then fails, wei requeue the write for later resubmission. That means we back off and wait for up to 30s before attempting the next rewrite. Hence -EIO on async metadata writeback won't fail/shutdown the filesystem until a (configurable) number of repeated failures occurs or the filesystem unmounts before the metadata could be written back successfully. There's good reason for this "wait for transients to resolve" method of error handling - go back to the late 1990s and early 2000s and high-end multi-path FC SAN based storage was well known to have transient path failures that can take minutes to resolve before a secondary path takes over. That was the sort of storage environment XFS was designed to operate in, and those users expected the filesystem to be extremely tolerant of transient failure conditions. Hence failing an IO and shutting down the filesystem because there are transient errors occuring in either the storage or the OS was absolutely the wrong thing to be doing. It still is the wrong thing to be doing - we want to wait until the transient error has progressed to being classified as a permanent error before we take drastic action like denying service to the filesystem. Memory allocation failure has always been considered a transient error by XFS that the OS will resolve in one way or another in a realtively short period of time. If we're prepared to wait minutes for IO path failures to resolve, waiting a couple of seconds for transient low memory situations to resolve isn't a big deal. Ranting about how we handle errors without understanding the error model we are working within is not productive. bcachefs has a different error handling model to almost every other filesystem out there, but that doesn't mean every other filesystem must work the same way that bcachefs does. If changing this transient error handling model was as simple as detecting an allocation failure and returning -ENOMEM, we would have done that 20 years ago. But it isn't - the error handling model is "block until transients resolve" so that the error handling paths only need to handle fatal errors. Therein lies the problem - those error handling paths need to be substantially changed to be able to handle transient errors such as ENOMEM. We'd need to either be able to back out of a dirty transaction or restart the transaction in some way rather than shutting down the filesystem. Put simply: reclassifying ENOMEM from a "wait for transient to resolve" handler to a "back out and restart" mechanism like bcachefs uses requires re-architecting the entire log item architecture for metadata modification tracking and journal space management. Even if I knew how to implement this right now, it would require years worth of engineering effort and resources before it would be completed and ready for merge. Then it will take years more for all the existing kernels to cycle out of production. Further: this "ENOMEM is transient so retry" model has been used without any significant issues in production systems for mission critical infrastructure for the past 25+ years. There's a very strong "if it ain't broke, don't fix it" argument to be made here. The cost-benefit analysis comes out very strongly on the side of keeping __GFP_NOFAIL semantics as they currently stand. > Except that's bullshit; at the very least any dynamically sized > allocation _definitely_ has to have an error path that's tested, and if > there's questions about the context a code path might run in, that > that's another reason. We know the context we run __GFP_NOFAIL allocations in - transaction context absolutely requires a task context because we take sleeping locks, submit and wait on IO, do blocking memory allocation, etc. We also know the size of the allocations because we've bounds checked everything before we do an allocation. Hence this argument of "__GFP_NOFAIL aboslutely requires error checking because an invalid size or wonrg context might be used" is completely irrelevant to XFS. If you call into the filesytsem from an atomic context, you've lost long before we get to memory allocation because filesystems take sleeping locks.... > GFP_NOFAIL is the problem here, and if it's encouraging this brain > damaged "why can't we just get rid of error paths?" thinking, then it > should be removed. > > Error paths have to exist, and they have to be tested. __GFP_NOFAIL is *not the problem*, and we are not "avoiding error handling". Backing off, looping and trying again is a valid mechanism for handling transient failure conditions. Having a flag that tells the allocator to "backoff, loop and try again" is a perfectly good way of providing a generic error handling mechanism. IOWs, Using __GFP_NOFAIL doesn't mean we are "not handling errors"; it simply means we have moved the error handling we were doing inside the allocator. And yes, we test the hell out of this error handling path.... -Dave.
On Mon, Sep 02, 2024 at 10:23:06AM GMT, Dave Chinner wrote: > On Thu, Aug 29, 2024 at 09:32:52AM -0400, Kent Overstreet wrote: > > And more than that, this is coming from you saying "We didn't have > > to handle memory allocation failures in IRIX, why can't we be like > > IRIX? All those error paths are a pain to test, why can't we get > > rid of them?" > > > You're not listening, Kent. We are not eliding error paths because > they aren't (or cannot be) tested. > > It's a choice between whether a transient error (e.g. ENOMEM) should > be considered a fatal error or not. The architectural choice that > was made for XFS back in the 1990s was that the filesystem should > never fail when transient errors occur. The choice was to wait for > the transient error to go away and then continue on. The rest of the > filesystem was build around these fundamental behavioural choices. <snipped> > Hence failing an IO and shutting down the filesystem because there > are transient errors occuring in either the storage or the OS was > absolutely the wrong thing to be doing. It still is the wrong thing > to be doing - we want to wait until the transient error has > progressed to being classified as a permanent error before we take > drastic action like denying service to the filesystem. Two different things are getting conflated here. I'm saying forcefully that __GFP_NOFAIL isn't a good design decision, and it's bad for system behaviour when we're thrashing, because you, willy and others have been talking openly about making something like that the norm, and that concerns me. I'm _not_ saying that you should have to rearchitect your codebase to deal with transient -ENOMEMs... that's clearly not going to happen. But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL in the case of bugs, because that's going to be an improvement w.r.t. system robustness, in exactly the same way we don't use BUG_ON() if it's something that we can't guarantee won't happen in the wild - we WARN() and try to handle the error as best we can. If we could someday get PF_MEMALLOC_NORECLAIM annotated everywhere we're not in a sleepable context (or more likely, teach kmalloc() about preempt count) - that turns a whole class of "scheduling while atomic" bugs into recoverable errors. That'd be a good thing. In XFS's case, the only thing you'd ever want to do on failure of a __GFP_NOFAIL allocation is do an emergency shutdown - the code is buggy, and those bugs tend to be quickly found and fixed (even quicker when the system stays usable and the user can collect a backtrace). > > Except that's bullshit; at the very least any dynamically sized > > allocation _definitely_ has to have an error path that's tested, and if > > there's questions about the context a code path might run in, that > > that's another reason. > > We know the context we run __GFP_NOFAIL allocations in - transaction > context absolutely requires a task context because we take sleeping > locks, submit and wait on IO, do blocking memory allocation, etc. We > also know the size of the allocations because we've bounds checked > everything before we do an allocation. *nod* > Hence this argument of "__GFP_NOFAIL aboslutely requires error > checking because an invalid size or wonrg context might be used" > is completely irrelevant to XFS. If you call into the filesytsem > from an atomic context, you've lost long before we get to memory > allocation because filesystems take sleeping locks.... Yeah, if you know the context of your allocation and you have bounds on the allocation size that's all fine - that's your call. _As a general policy_, I will say that even __GFP_NOFAIL allocations should have error paths - becuase lots of allocations have sizes that userspace can control, so it becomes a security issue and we need to be careful what we tell people. > > GFP_NOFAIL is the problem here, and if it's encouraging this brain > > damaged "why can't we just get rid of error paths?" thinking, then it > > should be removed. > > > > Error paths have to exist, and they have to be tested. > > __GFP_NOFAIL is *not the problem*, and we are not "avoiding error > handling". Backing off, looping and trying again is a valid > mechanism for handling transient failure conditions. Having a flag > that tells the allocator to "backoff, loop and try again" is a > perfectly good way of providing a generic error handling mechanism. But we do have to differentiate between transient allocation failures and failures that are a result of bugs. Because just looping means a buggy allocation call becomes, at best, a hung task you can't kill.
On Sun 01-09-24 21:35:30, Kent Overstreet wrote: [...] > But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL > in the case of bugs, because that's going to be an improvement w.r.t. > system robustness, in exactly the same way we don't use BUG_ON() if it's > something that we can't guarantee won't happen in the wild - we WARN() > and try to handle the error as best we can. We have discussed that in a different email thread. And I have to say that I am not convinced that returning NULL makes a broken code much better. Why? Because we can expect that broken NOFAIL users will not have a error checking path. Even valid NOFAIL users will not have one because they _know_ they do not have a different than retry for ever recovery path. That means that an unexpected NULL return either means OOPS or a subtle silent error - e.g. memory corruption. The former is a actually a saner recovery model because the execution is stopped before more harm can be done. I suspect most of those buggy users will simply OOPS but systematically checking for latter is a lot of work and needs to be constantly because code evolves... I have tried to argue that if allocator cannot or refuse to satisfy GFP_NOFAIL request because it is trying to use unsupported allocation mode or size then we should terminate the allocation context. That would make the API more predictable and therefore safer to use. This is not what the allocator does today though. Atomic NOFAIL allocations fail same as kvmalloc requests which are clearly overflows. Especially the later could become a risk if they are reachable from the userspace with controlable allocation size.
On Mon, Sep 02, 2024 at 10:41:31AM GMT, Michal Hocko wrote: > On Sun 01-09-24 21:35:30, Kent Overstreet wrote: > [...] > > But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL > > in the case of bugs, because that's going to be an improvement w.r.t. > > system robustness, in exactly the same way we don't use BUG_ON() if it's > > something that we can't guarantee won't happen in the wild - we WARN() > > and try to handle the error as best we can. > > We have discussed that in a different email thread. And I have to say > that I am not convinced that returning NULL makes a broken code much > better. Why? Because we can expect that broken NOFAIL users will not have a > error checking path. Even valid NOFAIL users will not have one because > they _know_ they do not have a different than retry for ever recovery > path. You mean where I asked you for a link to the discussion and rationale you claimed had happened? Still waiting on that > That means that an unexpected NULL return either means OOPS or a subtle > silent error - e.g. memory corruption. The former is a actually a saner > recovery model because the execution is stopped before more harm can be > done. I suspect most of those buggy users will simply OOPS but > systematically checking for latter is a lot of work and needs to be > constantly because code evolves... > > I have tried to argue that if allocator cannot or refuse to satisfy > GFP_NOFAIL request because it is trying to use unsupported allocation > mode or size then we should terminate the allocation context. That would > make the API more predictable and therefore safer to use. You're arguing that we treat it like an oops? Yeah, enough of this insanity.
On Mon 02-09-24 04:52:49, Kent Overstreet wrote: > On Mon, Sep 02, 2024 at 10:41:31AM GMT, Michal Hocko wrote: > > On Sun 01-09-24 21:35:30, Kent Overstreet wrote: > > [...] > > > But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL > > > in the case of bugs, because that's going to be an improvement w.r.t. > > > system robustness, in exactly the same way we don't use BUG_ON() if it's > > > something that we can't guarantee won't happen in the wild - we WARN() > > > and try to handle the error as best we can. > > > > We have discussed that in a different email thread. And I have to say > > that I am not convinced that returning NULL makes a broken code much > > better. Why? Because we can expect that broken NOFAIL users will not have a > > error checking path. Even valid NOFAIL users will not have one because > > they _know_ they do not have a different than retry for ever recovery > > path. > > You mean where I asked you for a link to the discussion and rationale > you claimed had happened? Still waiting on that I am not your assistent to be tasked and search through lore archives. Find one if you need that. Anyway, if you read the email and even tried to understand what is written there rather than immediately started shouting a response then you would have noticed I have put actual arguments here. You are free to disagree with them and lay down your arguments. You have decided to [...] > Yeah, enough of this insanity. so I do not think you are able to do that. Again...
On Mon, Sep 02, 2024 at 11:39:41AM GMT, Michal Hocko wrote: > On Mon 02-09-24 04:52:49, Kent Overstreet wrote: > > On Mon, Sep 02, 2024 at 10:41:31AM GMT, Michal Hocko wrote: > > > On Sun 01-09-24 21:35:30, Kent Overstreet wrote: > > > [...] > > > > But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL > > > > in the case of bugs, because that's going to be an improvement w.r.t. > > > > system robustness, in exactly the same way we don't use BUG_ON() if it's > > > > something that we can't guarantee won't happen in the wild - we WARN() > > > > and try to handle the error as best we can. > > > > > > We have discussed that in a different email thread. And I have to say > > > that I am not convinced that returning NULL makes a broken code much > > > better. Why? Because we can expect that broken NOFAIL users will not have a > > > error checking path. Even valid NOFAIL users will not have one because > > > they _know_ they do not have a different than retry for ever recovery > > > path. > > > > You mean where I asked you for a link to the discussion and rationale > > you claimed had happened? Still waiting on that > > I am not your assistent to be tasked and search through lore archives. > Find one if you need that. > > Anyway, if you read the email and even tried to understand what is > written there rather than immediately started shouting a response then > you would have noticed I have put actual arguments here. You are free to > disagree with them and lay down your arguments. You have decided to > > [...] > > > Yeah, enough of this insanity. > > so I do not think you are able to do that. Again... Michal, if you think crashing processes is an acceptable alternative to error handling _you have no business writing kernel code_. You have been stridently arguing for one bad idea after another, and it's an insult to those of us who do give a shit about writing reliable software. You're arguing against basic precepts of kernel programming. Get your head examined. And get the fuck out of here with this shit.
Kent Overstreet <kent.overstreet@linux.dev> writes: > You're arguing against basic precepts of kernel programming. > > Get your head examined. And get the fuck out of here with this shit. Kent, this is not the way to deal with your peers in this community, no matter how strongly you disagree with them. Might I suggest calming down a bit, followed by an apology? Thanks, jon
On 9/2/24 03:51, Kent Overstreet wrote: > On Mon, Sep 02, 2024 at 11:39:41AM GMT, Michal Hocko wrote: >> On Mon 02-09-24 04:52:49, Kent Overstreet wrote: >>> On Mon, Sep 02, 2024 at 10:41:31AM GMT, Michal Hocko wrote: >>>> On Sun 01-09-24 21:35:30, Kent Overstreet wrote: >>>> [...] >>>>> But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL >>>>> in the case of bugs, because that's going to be an improvement w.r.t. >>>>> system robustness, in exactly the same way we don't use BUG_ON() if it's >>>>> something that we can't guarantee won't happen in the wild - we WARN() >>>>> and try to handle the error as best we can. >>>> >>>> We have discussed that in a different email thread. And I have to say >>>> that I am not convinced that returning NULL makes a broken code much >>>> better. Why? Because we can expect that broken NOFAIL users will not have a >>>> error checking path. Even valid NOFAIL users will not have one because >>>> they _know_ they do not have a different than retry for ever recovery >>>> path. >>> >>> You mean where I asked you for a link to the discussion and rationale >>> you claimed had happened? Still waiting on that >> >> I am not your assistent to be tasked and search through lore archives. >> Find one if you need that. >> >> Anyway, if you read the email and even tried to understand what is >> written there rather than immediately started shouting a response then >> you would have noticed I have put actual arguments here. You are free to >> disagree with them and lay down your arguments. You have decided to >> >> [...] >> >>> Yeah, enough of this insanity. >> >> so I do not think you are able to do that. Again... > > Michal, if you think crashing processes is an acceptable alternative to > error handling _you have no business writing kernel code_. > > You have been stridently arguing for one bad idea after another, and > it's an insult to those of us who do give a shit about writing reliable > software. > > You're arguing against basic precepts of kernel programming. > > Get your head examined. And get the fuck out of here with this shit. > Kent, Using language like this is clearly unacceptable and violates the Code of Conduct. This type of language doesn't promote respectful and productive discussions and is detrimental to the health of the community. You should be well aware that this type of language and personal attack is a clear violation of the Linux kernel Contributor Covenant Code of Conduct as outlined in the following: https://www.kernel.org/doc/html/latest/process/code-of-conduct.html Refer to the Code of Conduct and refrain from violating the Code of Conduct in the future. thanks, -- Shuah (On behalf of the Code of Conduct Committee)
diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c index 15fc41e63b6c..d151a2f28d12 100644 --- a/fs/bcachefs/fs.c +++ b/fs/bcachefs/fs.c @@ -231,9 +231,9 @@ static struct inode *bch2_alloc_inode(struct super_block *sb) BUG(); } -static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c) +static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c, gfp_t gfp) { - struct bch_inode_info *inode = kmem_cache_alloc(bch2_inode_cache, GFP_NOFS); + struct bch_inode_info *inode = kmem_cache_alloc(bch2_inode_cache, gfp); if (!inode) return NULL; @@ -245,7 +245,7 @@ static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c) mutex_init(&inode->ei_quota_lock); memset(&inode->ei_devs_need_flush, 0, sizeof(inode->ei_devs_need_flush)); - if (unlikely(inode_init_always(c->vfs_sb, &inode->v))) { + if (unlikely(inode_init_always_gfp(c->vfs_sb, &inode->v, gfp))) { kmem_cache_free(bch2_inode_cache, inode); return NULL; } @@ -258,12 +258,10 @@ static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c) */ static struct bch_inode_info *bch2_new_inode(struct btree_trans *trans) { - struct bch_inode_info *inode = - memalloc_flags_do(PF_MEMALLOC_NORECLAIM|PF_MEMALLOC_NOWARN, - __bch2_new_inode(trans->c)); + struct bch_inode_info *inode = __bch2_new_inode(trans->c, GFP_NOWAIT); if (unlikely(!inode)) { - int ret = drop_locks_do(trans, (inode = __bch2_new_inode(trans->c)) ? 0 : -ENOMEM); + int ret = drop_locks_do(trans, (inode = __bch2_new_inode(trans->c, GFP_NOFS)) ? 0 : -ENOMEM); if (ret && inode) { __destroy_inode(&inode->v); kmem_cache_free(bch2_inode_cache, inode); @@ -328,7 +326,7 @@ __bch2_create(struct mnt_idmap *idmap, if (ret) return ERR_PTR(ret); #endif - inode = __bch2_new_inode(c); + inode = __bch2_new_inode(c, GFP_NOFS); if (unlikely(!inode)) { inode = ERR_PTR(-ENOMEM); goto err; diff --git a/fs/inode.c b/fs/inode.c index 86670941884b..a2aabbcffbe4 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -153,7 +153,7 @@ static int no_open(struct inode *inode, struct file *file) * These are initializations that need to be done on every inode * allocation as the fields are not initialised by slab allocation. */ -int inode_init_always(struct super_block *sb, struct inode *inode) +int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp) { static const struct inode_operations empty_iops; static const struct file_operations no_open_fops = {.open = no_open}; @@ -230,14 +230,14 @@ int inode_init_always(struct super_block *sb, struct inode *inode) #endif inode->i_flctx = NULL; - if (unlikely(security_inode_alloc(inode))) + if (unlikely(security_inode_alloc(inode, gfp))) return -ENOMEM; this_cpu_inc(nr_inodes); return 0; } -EXPORT_SYMBOL(inode_init_always); +EXPORT_SYMBOL(inode_init_always_gfp); void free_inode_nonrcu(struct inode *inode) { diff --git a/include/linux/fs.h b/include/linux/fs.h index fd34b5755c0b..d46ca71a7855 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3027,7 +3027,12 @@ extern loff_t default_llseek(struct file *file, loff_t offset, int whence); extern loff_t vfs_llseek(struct file *file, loff_t offset, int whence); -extern int inode_init_always(struct super_block *, struct inode *); +extern int inode_init_always_gfp(struct super_block *, struct inode *, gfp_t); +static inline int inode_init_always(struct super_block *sb, struct inode *inode) +{ + return inode_init_always_gfp(sb, inode, GFP_NOFS); +} + extern void inode_init_once(struct inode *); extern void address_space_init_once(struct address_space *mapping); extern struct inode * igrab(struct inode *); diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index a2ade0ffe9e7..b08472d64765 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -150,6 +150,6 @@ extern struct lsm_info __start_early_lsm_info[], __end_early_lsm_info[]; __used __section(".early_lsm_info.init") \ __aligned(sizeof(unsigned long)) -extern int lsm_inode_alloc(struct inode *inode); +extern int lsm_inode_alloc(struct inode *inode, gfp_t gfp); #endif /* ! __LINUX_LSM_HOOKS_H */ diff --git a/include/linux/security.h b/include/linux/security.h index 1390f1efb4f0..7c6b9b038a0d 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -336,7 +336,7 @@ int security_dentry_create_files_as(struct dentry *dentry, int mode, struct cred *new); int security_path_notify(const struct path *path, u64 mask, unsigned int obj_type); -int security_inode_alloc(struct inode *inode); +int security_inode_alloc(struct inode *inode, gfp_t gfp); void security_inode_free(struct inode *inode); int security_inode_init_security(struct inode *inode, struct inode *dir, const struct qstr *qstr, @@ -769,7 +769,7 @@ static inline int security_path_notify(const struct path *path, u64 mask, return 0; } -static inline int security_inode_alloc(struct inode *inode) +static inline int security_inode_alloc(struct inode *inode, gfp_t gfp) { return 0; } diff --git a/security/security.c b/security/security.c index 8cee5b6c6e6d..3581262da5ee 100644 --- a/security/security.c +++ b/security/security.c @@ -660,14 +660,14 @@ static int lsm_file_alloc(struct file *file) * * Returns 0, or -ENOMEM if memory can't be allocated. */ -int lsm_inode_alloc(struct inode *inode) +int lsm_inode_alloc(struct inode *inode, gfp_t gfp) { if (!lsm_inode_cache) { inode->i_security = NULL; return 0; } - inode->i_security = kmem_cache_zalloc(lsm_inode_cache, GFP_NOFS); + inode->i_security = kmem_cache_zalloc(lsm_inode_cache, gfp); if (inode->i_security == NULL) return -ENOMEM; return 0; @@ -1582,9 +1582,9 @@ int security_path_notify(const struct path *path, u64 mask, * * Return: Return 0 if operation was successful. */ -int security_inode_alloc(struct inode *inode) +int security_inode_alloc(struct inode *inode, gfp_t gfp) { - int rc = lsm_inode_alloc(inode); + int rc = lsm_inode_alloc(inode, gfp); if (unlikely(rc)) return rc;