Message ID | 20240828140638.3204253-1-kent.overstreet@linux.dev (mailing list archive) |
---|---|
State | New |
Series | bcachefs: Switch to memalloc_flags_do() for vmalloc allocations |
On Wed, Aug 28, 2024 at 10:06:36AM -0400, Kent Overstreet wrote: > vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for > pte allocation, so doing vmalloc/kvmalloc allocations with reclaim > unsafe locks is a potential deadlock. Kent, the approach you've taken with this was NACKed. You merged it anyway (!). Now you're spreading this crap further, presumably in an effort to make it harder to remove. Stop it. Work with us to come up with an acceptable approach. I think there is one that will work, but you need to listen to the people who're giving you feedback because Linux is too big of a code-base for you to understand everything.
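To make the deadlock the patch description is talking about concrete, here is a minimal illustration (the structure and names are invented for the example; only the pattern matters): an allocation that ends up in vmalloc is made under a lock that the reclaim path can also take, and, per the patch description, the caller's gfp flags are not applied to the page-table pages vmalloc may have to allocate.

    #include <linux/mutex.h>
    #include <linux/vmalloc.h>

    /* Illustration only -- struct my_btree and update_table() are invented. */
    struct my_btree {
            struct mutex lock;      /* also taken from this filesystem's shrinker */
            void *tbl;
    };

    static int update_table(struct my_btree *b, size_t bytes)
    {
            mutex_lock(&b->lock);
            /*
             * Per the patch description, GFP_NOWAIT is not honoured for the
             * page-table pages vmalloc may need, so the request is not a
             * reclaim-safe allocation end to end even though the caller asked
             * for one -- the potential deadlock being described.
             */
            b->tbl = __vmalloc(bytes, GFP_NOWAIT);
            mutex_unlock(&b->lock);
            return b->tbl ? 0 : -ENOMEM;
    }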
On Wed, Aug 28, 2024 at 07:48:43PM GMT, Matthew Wilcox wrote: > On Wed, Aug 28, 2024 at 10:06:36AM -0400, Kent Overstreet wrote: > > vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for > > pte allocation, so doing vmalloc/kvmalloc allocations with reclaim > > unsafe locks is a potential deadlock. > > Kent, the approach you've taken with this was NACKed. You merged it > anyway (!). Now you're spreading this crap further, presumably in an effort > to make it harder to remove. Excuse me? This is fixing a real issue which has been known for years. It was decided _years_ ago that PF_MEMALLOC flags were how this was going to be addressed. > Stop it. Work with us to come up with an acceptable approach. I > think there is one that will work, but you need to listen to the people > who're giving you feedback because Linux is too big of a code-base for > you to understand everything. No, you guys need to stop pushing broken shit.
On Wed 28-08-24 15:11:19, Kent Overstreet wrote: > On Wed, Aug 28, 2024 at 07:48:43PM GMT, Matthew Wilcox wrote: > > On Wed, Aug 28, 2024 at 10:06:36AM -0400, Kent Overstreet wrote: > > > vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for > > > pte allocation, so doing vmalloc/kvmalloc allocations with reclaim > > > unsafe locks is a potential deadlock. > > > > Kent, the approach you've taken with this was NACKed. You merged it > > anyway (!). Now you're spreading this crap further, presumably in an effort > > to make it harder to remove. > > Excuse me? This is fixing a real issue which has been known for years. If you mean a lack of GFP_NOWAIT support in vmalloc then this is not a bug but a missing feature. vmalloc has never promised to support this allocation mode and a scoped gfp flag will not magically make it work because there is a sleeping lock involved in an allocation path in some cases. If you really need this feature to be added then you should clearly describe your usecase and listen to people who are familiar with the vmalloc internals rather than heavily pushing your direction which doesn't work anyway. > It was decided _years_ ago that PF_MEMALLOC flags were how this was > going to be addressed. Nope! It has been decided that _some_ gfp flags are acceptable to be used by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim modifiers and other flags so these are indeed safe to be used that way. > > Stop it. Work with us to come up with an acceptable approach. I > > think there is one that will work, but you need to listen to the people > > who're giving you feedback because Linux is too big of a code-base for > > you to understand everything. > > No, you guys need to stop pushing broken shit. This is not the way to work with other people. Seriously!
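For reference, the existing scoped APIs Michal is referring to look like this in use (a minimal sketch; the wrapper function is illustrative, but memalloc_nofs_save()/memalloc_nofs_restore() are the real interface from include/linux/sched/mm.h):

    #include <linux/sched/mm.h>
    #include <linux/slab.h>

    /*
     * While filesystem locks are held, enter a GFP_NOFS scope: every nested
     * allocation, even one in library code that passes GFP_KERNEL, has
     * __GFP_FS stripped by the allocator until the scope is restored.
     */
    static void *alloc_under_fs_lock(size_t size)
    {
            unsigned int nofs_flags = memalloc_nofs_save();
            void *p = kmalloc(size, GFP_KERNEL);    /* behaves as GFP_NOFS here */

            memalloc_nofs_restore(nofs_flags);
            return p;
    }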
On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote: > On Wed 28-08-24 15:11:19, Kent Overstreet wrote: > > On Wed, Aug 28, 2024 at 07:48:43PM GMT, Matthew Wilcox wrote: > > > On Wed, Aug 28, 2024 at 10:06:36AM -0400, Kent Overstreet wrote: > > > > vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for > > > > pte allocation, so doing vmalloc/kvmalloc allocations with reclaim > > > > unsafe locks is a potential deadlock. > > > > > > Kent, the approach you've taken with this was NACKed. You merged it > > > anyway (!). Now you're spreading this crap further, presumably in an effort > > > to make it harder to remove. > > > > Excuse me? This is fixing a real issue which has been known for years. > > If you mean a lack of GFP_NOWAIT support in vmalloc then this is not a > bug but a lack of feature. vmalloc has never promissed to support this > allocation mode and a scoped gfp flag will not magically make it work > because there is a sleeping lock involved in an allocation path in some > cases. > > If you really need this feature to be added then you should clearly > describe your usecase and listen to people who are familiar with the > vmalloc internals rather than heavily pushing your direction which > doesn't work anyway. Michal, I'm plenty familiar with the vmalloc internals. Given that you didn't even seem to be aware of how it doesn't respect gfp flags, you seem to be the person who hasn't been up to speed in this discussion. > > It was decided _years_ ago that PF_MEMALLOC flags were how this was > > going to be addressed. > > Nope! It has been decided that _some_ gfp flags are acceptable to be used > by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim > modifiers and other flags so these are indeed safe to be used that way. Decided by who? Cite the list discussion, and the reasoning. > > > Stop it. Work with us to come up with an acceptable approach. I > > > think there is one that will work, but you need to listen to the people > > > who're giving you feedback because Linux is too big of a code-base for > > > you to understand everything. > > > > No, you guys need to stop pushing broken shit. > > This is not the way to work with other people. Seriously! No, your patches and the reasoning you've been pushing are broken. A safe interface is not one that "does not fail", a safe interface is one that never invokes undefined behaviour - and generally that means it'll return an error instead. And programming in the kernel means that you check for errors, period. This idea that you have of "safety" meaning "we'll just conveniently remove or ignore the cases where GFP_NOFAIL would have to fail" is complete and utter garbage.
On Wed 28-08-24 18:58:43, Kent Overstreet wrote: > On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote: > > On Wed 28-08-24 15:11:19, Kent Overstreet wrote: > > > On Wed, Aug 28, 2024 at 07:48:43PM GMT, Matthew Wilcox wrote: > > > > On Wed, Aug 28, 2024 at 10:06:36AM -0400, Kent Overstreet wrote: > > > > > vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for > > > > > pte allocation, so doing vmalloc/kvmalloc allocations with reclaim > > > > > unsafe locks is a potential deadlock. > > > > > > > > Kent, the approach you've taken with this was NACKed. You merged it > > > > anyway (!). Now you're spreading this crap further, presumably in an effort > > > > to make it harder to remove. > > > > > > Excuse me? This is fixing a real issue which has been known for years. > > > > If you mean a lack of GFP_NOWAIT support in vmalloc then this is not a > > bug but a lack of feature. vmalloc has never promissed to support this > > allocation mode and a scoped gfp flag will not magically make it work > > because there is a sleeping lock involved in an allocation path in some > > cases. > > > > If you really need this feature to be added then you should clearly > > describe your usecase and listen to people who are familiar with the > > vmalloc internals rather than heavily pushing your direction which > > doesn't work anyway. > > Michal, I'm plenty familiar with the vmalloc internals. Given that you > didn't even seem to be aware of how it doesn't respect gfp flags, you > seem to be the person who hasn't been up to speed in this discussion. GFP_NOWAIT is explicitly documented as unsupported (__vmalloc_node_range_noprof). vmalloc internals are using vmap_purge_lock and blocking notifiers (vmap_notify_list) in rare cases so PF_MEMALLOC_NORECLAIM is not really sufficient to provide NOWAIT semantic (this is really not just about page tables allocations). There might be other places that require blocking - I do not claim to be an expert on the vmalloc allocator. Just my 2 cents do whatever you want with this information. It seems that this discussion is not going to be really productive so I will leave you here. If you reconsider and realize that a productive discussion realy requires also listening and respect then get back and we can try again. Good luck!
On Wed 28-08-24 18:58:43, Kent Overstreet wrote: > On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote: > > On Wed 28-08-24 15:11:19, Kent Overstreet wrote: [...] > > > It was decided _years_ ago that PF_MEMALLOC flags were how this was > > > going to be addressed. > > > > Nope! It has been decided that _some_ gfp flags are acceptable to be used > > by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim > > modifiers and other flags so these are indeed safe to be used that way. > > Decided by who?

Decides semantic of respective GFP flags and their compatibility with others that could be nested in the scope.

Zone modifiers __GFP_DMA, __GFP_HIGHMEM, __GFP_DMA32 and __GFP_MOVABLE would allow only __GFP_DMA to have scoped semantic because it is the most restrictive of all of them (i.e. __GFP_DMA32 can be served from __GFP_DMA but not the other way around), but nobody really requested that.

__GFP_RECLAIMABLE is slab allocator specific and nested allocations cannot be assumed to have shrinkers, so this cannot really have scoped semantic.

__GFP_WRITE only implies node spreading. Likely OK for a scope interface, but nobody requested that.

__GFP_HARDWALL only to be used for user space allocations. Wouldn't break anything if it had a scoped interface, but nobody requested that.

__GFP_THISNODE only to be used by allocators internally to define NUMA placement strategy. Not safe for a scoped interface as it changes the failure semantic.

__GFP_ACCOUNT defines memcg accounting. Generally usable from user context and safe for a scope interface in that context as it doesn't change the failure nor the reclaim semantic.

__GFP_NO_OBJ_EXT internal flag not to be used outside of mm.

__GFP_HIGH gives access to memory reserves. It could be used for a scope interface but nobody requested that.

__GFP_MEMALLOC - already has a scope interface, PF_MEMALLOC. This is not really great though because it grants unbounded access to memory reserves and that means it is really tricky to see how many allocations really can use reserves. It has been added because swap over NFS had to guarantee forward progress and the networking layer was not prepared for that. Fundamentally this doesn't change the allocation nor reclaim semantic so it is safe for a scope API.

__GFP_NOMEMALLOC used to override PF_MEMALLOC, so a scoped interface doesn't make much sense.

__GFP_IO already has a scope interface to drop this flag. It is safe because it doesn't change the failure semantic and it makes the reclaim context more constrained, so it is compatible with other reclaim modifiers. Conversely, it would be unsafe to have a scope interface to add this flag because all GFP_NOIO nested allocations could deadlock.

__GFP_FS. Similar to __GFP_IO.

__GFP_DIRECT_RECLAIM allows the allocation to sleep. A scoped interface to set the flag is unsafe for any nested GFP_NOWAIT/GFP_ATOMIC requests which might be called from within atomic contexts. A scope interface to clear the flag is unsafe because the __GFP_NOFAIL allocation mode doesn't support requests without this flag, so any nested NOFAIL allocation would break and see an unexpected and potentially unhandled failure mode.

__GFP_KSWAPD_RECLAIM controls whether kswapd is woken up. Doesn't change the failure nor the direct reclaim behavior. A scoped interface to set the flag seems rather pointless, and one to clear the bit is dangerous because it could put MM into an unbalanced state as kswapd wouldn't wake up.
__GFP_RETRY_MAYFAIL - changes the failure mode so it is fundamentally incompatible with nested __GFP_NOFAIL allocations. A scoped interface to clear the flag would be safe but probably pointless.

__GFP_NORETRY - same as above.

__GFP_NOFAIL - incompatible with any nested GFP_NOWAIT/GFP_ATOMIC allocations. One could argue that those are fine to see allocation failure so this will not create any unexpected failure mode, which is a fair argument, but what would be the actual usecase for setting all nested allocations to NOFAIL mode when they likely have a failure mode? An interface to clear the flag for the scope would be unsafe because all nested NOFAIL allocations would get an unexpected failure mode.

__GFP_NOWARN safe to have a scope interface both to set and clear the flag.

__GFP_COMP only to be used for high order allocations and changes the tail pages tracking, which would break any nested high order request without the flag. So unsafe for the scope interface both to set and clear the flag.

__GFP_ZERO changes the initialization and is safe for a scope interface. We even have a global switch to do that for all allocations, init_on_alloc.

__GFP_NOLOCKDEP disables lockdep reclaim recursion detection. Safe for a scope interface AFAICS.
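For context, this is roughly how the already-accepted scoped flags get folded into a nested allocation's gfp mask (simplified from current_gfp_context() in include/linux/sched/mm.h; exact details vary by kernel version):

    static inline gfp_t current_gfp_context(gfp_t flags)
    {
            unsigned int pflags = READ_ONCE(current->flags);

            if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS | PF_MEMALLOC_PIN))) {
                    /* NOIO is stricter than NOFS, so it takes precedence */
                    if (pflags & PF_MEMALLOC_NOIO)
                            flags &= ~(__GFP_IO | __GFP_FS);
                    else if (pflags & PF_MEMALLOC_NOFS)
                            flags &= ~__GFP_FS;

                    if (pflags & PF_MEMALLOC_PIN)
                            flags &= ~__GFP_MOVABLE;
            }
            return flags;
    }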
On Thu, Aug 29, 2024 at 09:19:13AM GMT, Michal Hocko wrote: > GFP_NOWAIT is explicitly documented as unsupported > (__vmalloc_node_range_noprof). vmalloc internals are using > vmap_purge_lock and blocking notifiers (vmap_notify_list) in rare cases > so PF_MEMALLOC_NORECLAIM is not really sufficient to provide NOWAIT > semantic (this is really not just about page tables allocations). There > might be other places that require blocking - I do not claim to be an > expert on the vmalloc allocator. Sure, but those are easily fixed, if necessary. We'd also want to be explicit on whether NORECLAIM needs to be NOWAIT or just NORECLAIM - those are slightly different things. As long as it's NORECLAIM that should be fine - it's really only an issue if we need to map GFP_NOWAIT to PF_MEMALLOC_*. > It seems that this discussion is not going to be really productive so I > will leave you here. I think you'll find these discussions become much more productive when you're able to stick to the technical, and when you yourself are able to listen to counterarguments. I remind you that we hash things out here on the list; we don't resort to things like "this has been decided" or "the mm community says"; you need to be properly justifying what you want to do here. And if people aren't listening to your arguments, that should be a hint to you that maybe you aren't making good ones.
On Thu, Aug 29, 2024 at 01:08:53PM GMT, Michal Hocko wrote: > On Wed 28-08-24 18:58:43, Kent Overstreet wrote: > > On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote: > > > On Wed 28-08-24 15:11:19, Kent Overstreet wrote: > [...] > > > > It was decided _years_ ago that PF_MEMALLOC flags were how this was > > > > going to be addressed. > > > > > > Nope! It has been decided that _some_ gfp flags are acceptable to be used > > > by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim > > > modifiers and other flags so these are indeed safe to be used that way. > > > > Decided by who? > > Decides semantic of respective GFP flags and their compatibility with > others that could be nested in the scope. Well, that's a bit of commentary, at least. The question is which of those could properly apply to a section, not a callsite, and a PF_MEMALLOC_NOWAIT (similar to but not exactly the same as PF_MEMALLOC_NORECLAIM) would be at the top of that list since we already have a clear concept of sections where we're not allowed to sleep. And that tells us how to resolve GFP_NOFAIL with other conflicting PF_MEMALLOC flags: GFP_NOFAIL loses. It is a _bug_ if GFP_NOFAIL is accidentally used in a non-sleepable context, and properly labelling those sections to the allocator would allow us to turn undefined behaviour into an error - _that_ would be turning kmalloc() into a safe interface. Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe according to call path and allocation size, you still need to be checking for failure - in the same way that you shouldn't be using BUG_ON() if you cannot prove that the condition won't occur in real world usage. Given that, it's easy to see how to handle __GFP_RETRY_MAYFAIL and __GFP_NORETRY: if they're applied to a context, then the usage is saying "I need to attempt to run this section with some sort of latency bounds", and GFP_NOFAIL should lose - as well as emitting a warning. BTW, this is how you should be interpreting PF_MEMALLOC_NORECLAIM today: "I have strong latency bounds here, but not so strict that it needs to be strictly nonblocking".
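PF_MEMALLOC_NOWAIT does not exist upstream; purely as a sketch of the semantics Kent is arguing for (a non-sleepable scope in which __GFP_NOFAIL "loses" and becomes a loud, failable request), the gfp-mask adjustment might look something like this:

    /*
     * Hypothetical sketch only: PF_MEMALLOC_NOWAIT is not a real task flag.
     * In such a scope, direct reclaim would be stripped and an accidental
     * __GFP_NOFAIL would be turned into an error rather than undefined
     * behaviour.
     */
    static inline gfp_t nowait_scope_gfp(gfp_t flags)
    {
            if (current->flags & PF_MEMALLOC_NOWAIT) {      /* hypothetical */
                    WARN_ON_ONCE(flags & __GFP_NOFAIL);
                    flags &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
            }
            return flags;
    }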
On Thu 29-08-24 07:55:08, Kent Overstreet wrote: > On Thu, Aug 29, 2024 at 01:08:53PM GMT, Michal Hocko wrote: > > On Wed 28-08-24 18:58:43, Kent Overstreet wrote: > > > On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote: > > > > On Wed 28-08-24 15:11:19, Kent Overstreet wrote: > > [...] > > > > > It was decided _years_ ago that PF_MEMALLOC flags were how this was > > > > > going to be addressed. > > > > > > > > Nope! It has been decided that _some_ gfp flags are acceptable to be used > > > > by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim > > > > modifiers and other flags so these are indeed safe to be used that way. > > > > > > Decided by who? > > > > Decides semantic of respective GFP flags and their compatibility with > > others that could be nested in the scope. > > Well, that's a bit of commentary, at least. > > The question is which of those could properly apply to a section, not a > callsite, and a PF_MEMALLOC_NOWAIT (similar to but not exactly the same > as PF_MEMALLOC_NORECLAIM) would be at the top of that list since we > already have a clear concept of sections where we're not allowed to > sleep. Unfortunately a lack of __GFP_DIRECT_RECLAIM means both no reclaim and no sleeping allowed for historical reasons. GFP_NOWAIT is both used from atomic contexts and as an optimistic allocation attempt with a heavier fallback allocation strategy. If you want NORECLAIM semantic then this would need to be represented by different means than __GFP_DIRECT_RECLAIM alone. > And that tells us how to resolve GFP_NOFAIL with other conflicting > PF_MEMALLOC flags: GFP_NOFAIL loses. > > It is a _bug_ if GFP_NOFAIL is accidentally used in a non sleepable > context, and properly labelling those sections to the allocator would > allow us to turn undefined behaviour into an error - _that_ would be > turning kmalloc() into a safe interface. If your definition of safe includes an oops or worse silent failure then yes. Not really safe interface in my book though. E.g. (just randomly looking at GFP_NOFAIL users) btree_paths_realloc doesn't check the return value and if it happened to be called from such a scope it would have blown up. That code is safe without the scope though. There are many other callsites which do not have failure paths.
On Thu, Aug 29, 2024 at 02:34:08PM GMT, Michal Hocko wrote: > On Thu 29-08-24 07:55:08, Kent Overstreet wrote: > > On Thu, Aug 29, 2024 at 01:08:53PM GMT, Michal Hocko wrote: > > > On Wed 28-08-24 18:58:43, Kent Overstreet wrote: > > > > On Wed, Aug 28, 2024 at 09:26:44PM GMT, Michal Hocko wrote: > > > > > On Wed 28-08-24 15:11:19, Kent Overstreet wrote: > > > [...] > > > > > > It was decided _years_ ago that PF_MEMALLOC flags were how this was > > > > > > going to be addressed. > > > > > > > > > > Nope! It has been decided that _some_ gfp flags are acceptable to be used > > > > > by scoped APIs. Most notably NOFS and NOIO are compatible with reclaim > > > > > modifiers and other flags so these are indeed safe to be used that way. > > > > > > > > Decided by who? > > > > > > Decides semantic of respective GFP flags and their compatibility with > > > others that could be nested in the scope. > > > > Well, that's a bit of commentary, at least. > > > > The question is which of those could properly apply to a section, not a > > callsite, and a PF_MEMALLOC_NOWAIT (similar to but not exactly the same > > as PF_MEMALLOC_NORECLAIM) would be at the top of that list since we > > already have a clear concept of sections where we're not allowed to > > sleep. > > Unfortunately a lack of __GFP_DIRECT_RECLAIM means both no reclaim and > no sleeping allowed for historical reasons. GFP_NOWAIT is both used from > atomic contexts and as an optimistic allocation attempt with a heavier > fallback allocation strategy. If you want NORECLAIM semantic then this > would need to be represented by different means than __GFP_DIRECT_RECLAIM > alone. I don't see it as particularly needed - the vmalloc locks you mentioned previously just mean it's something worth considering. In my usage I probably wouldn't care about those locks, but for keeping the API simple we probably want just PF_MEMALLOC_NOWAIT (where those locks become trylock). > > And that tells us how to resolve GFP_NOFAIL with other conflicting > > PF_MEMALLOC flags: GFP_NOFAIL loses. > > > > It is a _bug_ if GFP_NOFAIL is accidentally used in a non sleepable > > context, and properly labelling those sections to the allocator would > > allow us to turn undefined behaviour into an error - _that_ would be > > turning kmalloc() into a safe interface. > > If your definition of safe includes an oops or worse silent failure > then yes. Not really safe interface in my book though. E.g. (just > randomly looking at GFP_NOFAIL users) btree_paths_realloc doesn't check > the return value and if it happened to be called from such a scope it > would have blown up. That code is safe without the scope though. There > are many other callsites which do not have failure paths. Yes, but that's unsafe anyways due to the max allocation size of GFP_NOFAIL - I'll have to fix that. Note that even if we got rid of the smaller max allocation size of GFP_NOFAIL allocations we'd _still_ generally need error paths due to the hard limit of INT_MAX, and integer overflow checking for array allocations.
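As an aside on the overflow point: array allocations in the kernel are normally written with the overflow-checked helpers, which fail rather than allocate a short buffer, so such call sites keep an error path regardless of any no-fail guarantee (struct foo and the function name are illustrative):

    #include <linux/slab.h>

    struct foo {
            u64 key;
            u64 val;
    };

    /*
     * kvcalloc() returns NULL if n * sizeof(struct foo) would overflow,
     * instead of quietly allocating a too-small buffer, so the caller must
     * still handle failure even when the sizes it expects are small.
     */
    static struct foo *alloc_foo_array(size_t n)
    {
            return kvcalloc(n, sizeof(struct foo), GFP_KERNEL);
    }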
On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote: > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe > according to call path and allocation size, you still need to be > checking for failure - in the same way that you shouldn't be using > BUG_ON() if you cannot prove that the condition won't occur in real world > usage. We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years now. This was the default Irix kernel allocator behaviour (it had a forwards progress guarantee and would never fail allocation unless told it could do so). We've been using the same "guaranteed not to fail" semantics on Linux since the original port started 25 years ago via open-coded loops. IOWs, __GFP_NOFAIL semantics have been production tested for a couple of decades on Linux via XFS, and nobody here can argue that XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL as it is used by XFS is reliable and lives up to the "will not fail" guarantee that it is supposed to have. Fundamentally, __GFP_NOFAIL came about to replace the callers doing do { p = kmalloc(size); } while (!p); so that they blocked until memory allocation succeeded. The call sites do not check for failure, because -failure never occurs-. The MM devs want to have visibility of these allocations - they may not like them, but having __GFP_NOFAIL means it's trivial to audit all the allocations that use these semantics. IOWs, __GFP_NOFAIL was created with an explicit guarantee that it -will not fail- for normal allocation contexts so it could replace all the open-coded will-not-fail allocation loops. Given this guarantee, we recently removed these historic allocation wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the allocation call sites. There's nearly a hundred memory allocation locations in XFS that are tagged with __GFP_NOFAIL. If we're now going to have the "will not fail" guarantee taken away from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can it be used anywhere else that a "will not fail" guarantee is required. Put simply: __GFP_NOFAIL will be rendered completely useless if it can fail due to external scoped memory allocation contexts. This will force us to revert all __GFP_NOFAIL allocations back to open-coded will-not-fail loops. This is not a step forwards for anyone. -Dave.
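The two patterns Dave describes, side by side (a minimal sketch; the function names are illustrative):

    #include <linux/slab.h>

    /* The historic open-coded pattern: loop until the allocation succeeds. */
    static void *alloc_cannot_fail_loop(size_t size)
    {
            void *p;

            do {
                    p = kmalloc(size, GFP_KERNEL);
            } while (!p);

            return p;
    }

    /* The replacement: let the allocator itself provide the guarantee. */
    static void *alloc_cannot_fail(size_t size)
    {
            return kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
    }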
On Fri, Aug 30, 2024 at 12:27:11AM +1000, Dave Chinner wrote: > > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years > now. This was the default Irix kernel allocator behaviour (it had a > forwards progress guarantee and would never fail allocation unless > told it could do so). We've been using the same "guaranteed not to > fail" semantics on Linux since the original port started 25 years > ago via open-coded loops. Ext3/ext4 doesn't have quite the history that XFS has --- it's only been around for 23 years --- but we've also used __GFP_NOFAIL or its moral equivalent, e.g.: > do { > p = kmalloc(size); > } while (!p); For the entire existence of ext3. > Put simply: __GFP_NOFAIL will be rendered completely useless if it > can fail due to external scoped memory allocation contexts. This > will force us to revert all __GFP_NOFAIL allocations back to > open-coded will-not-fail loops. The same will be true for ext4. And as Dave has said, the MM developers want to have visibility into when file systems have basically said, "if you can't allow us to allocate memory, our only alternative is to cause user data loss, crash the kernel, or loop forever; we will choose the latter". The MM developers tried to make __GFP_NOFAIL go away several years ago, and ext4 put the retry loop back. As a result, the compromise was that the MM developers restored __GFP_NOFAIL, and the file systems developers have done their best to reduce the use of __GFP_NOFAIL as much as possible. So if you try to break the GFP_NOFAIL promise, both xfs and ext4 will go back to the retry loop. And the MM devs will be sad, and they will forcibly revert your change to *their* code, even if that means breaking bcachefs. Because otherwise, you will be breaking ext4 and xfs, and so we will go back to using a retry loop, which will be worse for Linux users. Cheers, - Ted
On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote: > > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe > > according to call path and allocation size, you still need to be > > checking for failure - in the same way that you shouldn't be using > > BUG_ON() if you cannot prove that the condition won't occur in real wold > > usage. > > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years > now. This was the default Irix kernel allocator behaviour (it had a > forwards progress guarantee and would never fail allocation unless > told it could do so). We've been using the same "guaranteed not to > fail" semantics on Linux since the original port started 25 years > ago via open-coded loops. > > IOWs, __GFP_NOFAIL semantics have been production tested for a > couple of decades on Linux via XFS, and nobody here can argue that > XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL > as it is used by XFS is reliable and lives up to the "will not fail" > guarantee that it is supposed to have. > > Fundamentally, __GFP_NOFAIL came about to replace the callers doing > > do { > p = kmalloc(size); > while (!p); > > so that they blocked until memory allocation succeeded. The call > sites do not check for failure, because -failure never occurs-. > > The MM devs want to have visibility of these allocations - they may > not like them, but having __GFP_NOFAIL means it's trivial to audit > all the allocations that use these semantics. IOWs, __GFP_NOFAIL > was created with an explicit guarantee that it -will not fail- for > normal allocation contexts so it could replace all the open-coded > will-not-fail allocation loops.. > > Given this guarantee, we recently removed these historic allocation > wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the > allocation call sites. There's nearly a hundred memory allocation > locations in XFS that are tagged with __GFP_NOFAIL. > > If we're now going to have the "will not fail" guarantee taken away > from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can > it be used anywhere else that a "will not fail" guarantee it > required. > > Put simply: __GFP_NOFAIL will be rendered completely useless if it > can fail due to external scoped memory allocation contexts. This > will force us to revert all __GFP_NOFAIL allocations back to > open-coded will-not-fail loops. > > This is not a step forwards for anyone. Hello Dave, I've noticed that XFS has increasingly replaced kmem_alloc() with __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41 occurrences. In kmem_alloc(), there's an explicit memalloc_retry_wait() to throttle the allocator under heavy memory pressure, which aligns with your filesystem design. However, using __GFP_NOFAIL removes this throttling mechanism, potentially causing issues when the system is under heavy memory load. I'm concerned that this shift might not be a beneficial trend. We have been using XFS for our big data servers for years, and it has consistently performed well with older kernels like 4.19.y. However, after upgrading all our servers from 4.19.y to 6.1.y over the past two years, we have frequently encountered livelock issues caused by memory exhaustion. To mitigate this, we've had to limit the RSS of applications, which isn't an ideal solution and represents a worrying trend.
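The throttled helper Yafang refers to looked roughly like this (a simplified sketch loosely based on the removed XFS kmem_alloc(), not the verbatim code; memalloc_retry_wait() is the real helper from include/linux/sched/mm.h):

    #include <linux/sched/mm.h>
    #include <linux/slab.h>

    static void *kmem_alloc_sketch(size_t size, gfp_t gfp)
    {
            void *ptr;

            do {
                    ptr = kmalloc(size, gfp);
                    if (ptr)
                            return ptr;
                    /*
                     * Back off before retrying so a caller that cannot make
                     * progress does not spin flat out against reclaim.
                     */
                    memalloc_retry_wait(gfp);
            } while (1);
    }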
On 8/30/24 11:14, Yafang Shao wrote: > On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@fromorbit.com> wrote: > > Hello Dave, > > I've noticed that XFS has increasingly replaced kmem_alloc() with > __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of > __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41 > occurrences. In kmem_alloc(), there's an explicit > memalloc_retry_wait() to throttle the allocator under heavy memory > pressure, which aligns with your filesystem design. However, using > __GFP_NOFAIL removes this throttling mechanism, potentially causing > issues when the system is under heavy memory load. I'm concerned that > this shift might not be a beneficial trend. > > We have been using XFS for our big data servers for years, and it has > consistently performed well with older kernels like 4.19.y. However, > after upgrading all our servers from 4.19.y to 6.1.y over the past two > years, we have frequently encountered livelock issues caused by memory > exhaustion. To mitigate this, we've had to limit the RSS of > applications, which isn't an ideal solution and represents a worrying > trend. By "livelock issues caused by memory exhaustion" you mean the long-standing infamous issue that the system might become thrashing for the remaining small amount of page cache, and anonymous memory being swapped out/in, instead of issuing OOM, because there's always just enough progress of the reclaim to keep going, but the system isn't basically doing anything else? I think that's related to near-exhausted memory by userspace, so I'm not sure why XFS would be to blame here. That said, if memalloc_retry_wait() is indeed a useful mechanism, maybe we could perform it inside the page allocator itself for __GFP_NOFAIL?
On Thu, Aug 29, 2024 at 11:39:05PM GMT, Theodore Ts'o wrote: > On Fri, Aug 30, 2024 at 12:27:11AM +1000, Dave Chinner wrote: > > > > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years > > now. This was the default Irix kernel allocator behaviour (it had a > > forwards progress guarantee and would never fail allocation unless > > told it could do so). We've been using the same "guaranteed not to > > fail" semantics on Linux since the original port started 25 years > > ago via open-coded loops. > > Ext3/ext4 doesn't have quite the history as XFS --- it's only been > around for 23 years --- but we've also used __GFP_NOFAIL or its > moral equivalent, e.g.: > > > do { > > p = kmalloc(size); > > while (!p); > > For the entire existence of ext3. > > > Put simply: __GFP_NOFAIL will be rendered completely useless if it > > can fail due to external scoped memory allocation contexts. This > > will force us to revert all __GFP_NOFAIL allocations back to > > open-coded will-not-fail loops. > > The same will be true for ext4. And as Dave has said, the MM > developers want to have visibility to when file systems have basically > said, "if you can't allow us to allocate memory, our only alternative > is to cause user data loss, crash the kernel, or loop forever; we will > choose the latter". The MM developers tried to make __GFP_NOFAIL go > away several years ago, and ext4 put the retry loop back, As a result, > the compromise was that the MM developers restored __GFP_NOFAIL, and > the file systems developers have done their best to reduce the use of > __GFP_NOFAIL as much as possible. > > So if you try to break the GFP_NOFAIL promise, both xfs and ext4 will > back to the retry loop. And the MM devs will be sad, and they will > forcibly revert your change to *ther* code, even if that means > breaking bcachefs. Becuase otherwise, you will be breaking ext4 and > xfs, and so we will go back to using a retry loop, which will be worse > for Linux users. GFP_NOFAIL may be better than the retry loop, but it's still not good. Consider what happens when you have a GFP_NOFAIL in a critical IO path, when the system is almost exhausted on memory; yes, that allocation will succeed _eventually_, but without any latency bounds. When you're thrashing or being fork bombed, that allocation is contending with everything else. Much the same way that a lock in a critical path where the work done under the lock grows when the system is loaded, it's a contention point subject to catastrophic failure. Much better to preallocate, e.g. with a mempool, or have some other kind of fallback. It might work to do __GFP_NOFAIL|__GFP_HIGH in critical paths, but I've never seen that investigated or tried. And this is an area filesystem people really need to be thinking about. Block layer gets this right, filesystems do not, and I suspect this is a key contributor to our performance and behaviour sucking when we're thrashing. bcachefs puts a lot of effort into making sure we can run in bounded memory, because I put a lot of emphasiss on consistent performance and bounded latency, not just winning benchmarks. There's only two __GFP_NOFAIL allocations in bcachefs, and I'll likely remove both of them when I get around to it.
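The preallocation approach Kent mentions is what mempools provide (a minimal sketch; the pool name, reserve depth and object size are illustrative):

    #include <linux/mempool.h>
    #include <linux/slab.h>

    #define IO_POOL_MIN_NR  16      /* illustrative reserve depth */
    #define IO_OBJ_SIZE     256     /* illustrative object size */

    static mempool_t *io_pool;

    /* At init time: preallocate a reserve for the critical IO path. */
    static int io_pool_init(void)
    {
            io_pool = mempool_create_kmalloc_pool(IO_POOL_MIN_NR, IO_OBJ_SIZE);
            return io_pool ? 0 : -ENOMEM;
    }

    /*
     * In the IO path: a normal allocation is tried first; under memory
     * pressure the caller falls back to (and then waits for) the reserve,
     * so forward progress does not depend on reclaim making progress.
     */
    static void *io_obj_alloc(void)
    {
            return mempool_alloc(io_pool, GFP_NOIO);
    }

    static void io_obj_free(void *obj)
    {
            mempool_free(obj, io_pool);
    }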
On Fri, Aug 30, 2024 at 05:14:28PM +0800, Yafang Shao wrote: > On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote: > > > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe > > > according to call path and allocation size, you still need to be > > > checking for failure - in the same way that you shouldn't be using > > > BUG_ON() if you cannot prove that the condition won't occur in real wold > > > usage. > > > > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years > > now. This was the default Irix kernel allocator behaviour (it had a > > forwards progress guarantee and would never fail allocation unless > > told it could do so). We've been using the same "guaranteed not to > > fail" semantics on Linux since the original port started 25 years > > ago via open-coded loops. > > > > IOWs, __GFP_NOFAIL semantics have been production tested for a > > couple of decades on Linux via XFS, and nobody here can argue that > > XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL > > as it is used by XFS is reliable and lives up to the "will not fail" > > guarantee that it is supposed to have. > > > > Fundamentally, __GFP_NOFAIL came about to replace the callers doing > > > > do { > > p = kmalloc(size); > > while (!p); > > > > so that they blocked until memory allocation succeeded. The call > > sites do not check for failure, because -failure never occurs-. > > > > The MM devs want to have visibility of these allocations - they may > > not like them, but having __GFP_NOFAIL means it's trivial to audit > > all the allocations that use these semantics. IOWs, __GFP_NOFAIL > > was created with an explicit guarantee that it -will not fail- for > > normal allocation contexts so it could replace all the open-coded > > will-not-fail allocation loops.. > > > > Given this guarantee, we recently removed these historic allocation > > wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the > > allocation call sites. There's nearly a hundred memory allocation > > locations in XFS that are tagged with __GFP_NOFAIL. > > > > If we're now going to have the "will not fail" guarantee taken away > > from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can > > it be used anywhere else that a "will not fail" guarantee it > > required. > > > > Put simply: __GFP_NOFAIL will be rendered completely useless if it > > can fail due to external scoped memory allocation contexts. This > > will force us to revert all __GFP_NOFAIL allocations back to > > open-coded will-not-fail loops. > > > > This is not a step forwards for anyone. > > Hello Dave, > > I've noticed that XFS has increasingly replaced kmem_alloc() with > __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of > __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41 > occurrences. In kmem_alloc(), there's an explicit > memalloc_retry_wait() to throttle the allocator under heavy memory > pressure, which aligns with your filesystem design. However, using > __GFP_NOFAIL removes this throttling mechanism, potentially causing > issues when the system is under heavy memory load. I'm concerned that > this shift might not be a beneficial trend. AIUI, the memory allocation looping has back-offs already built in to it when memory reserves are exhausted and/or reclaim is congested. 
e.g.:

    get_page_from_freelist()
      (zone below watermark)
      node_reclaim()
        __node_reclaim()
          shrink_node()
            reclaim_throttle()

And the call to reclaim_throttle() will do the equivalent of memalloc_retry_wait() (a 2ms sleep). > We have been using XFS for our big data servers for years, and it has > consistently performed well with older kernels like 4.19.y. However, > after upgrading all our servers from 4.19.y to 6.1.y over the past two > years, we have frequently encountered livelock issues caused by memory > exhaustion. To mitigate this, we've had to limit the RSS of > applications, which isn't an ideal solution and represents a worrying > trend. If userspace uses all of memory all the time, then the best the kernel can do is slowly limp along. Preventing userspace from overcommitting memory to the point of OOM is the only way to avoid these "userspace wants more memory than the machine physically has" sorts of issues. i.e. this is not a problem that the kernel code can solve short of randomly killing userspace applications... -Dave.
On Fri, Aug 30, 2024 at 11:25 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 8/30/24 11:14, Yafang Shao wrote: > > On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@fromorbit.com> wrote: > > > > Hello Dave, > > > > I've noticed that XFS has increasingly replaced kmem_alloc() with > > __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of > > __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41 > > occurrences. In kmem_alloc(), there's an explicit > > memalloc_retry_wait() to throttle the allocator under heavy memory > > pressure, which aligns with your filesystem design. However, using > > __GFP_NOFAIL removes this throttling mechanism, potentially causing > > issues when the system is under heavy memory load. I'm concerned that > > this shift might not be a beneficial trend. > > > > We have been using XFS for our big data servers for years, and it has > > consistently performed well with older kernels like 4.19.y. However, > > after upgrading all our servers from 4.19.y to 6.1.y over the past two > > years, we have frequently encountered livelock issues caused by memory > > exhaustion. To mitigate this, we've had to limit the RSS of > > applications, which isn't an ideal solution and represents a worrying > > trend. > > By "livelock issues caused by memory exhaustion" you mean the long-standing > infamous issue that the system might become thrashing for the remaining > small amount of page cache, and anonymous memory being swapped out/in, > instead of issuing OOM, because there's always just enough progress of the > reclaim to keep going, but the system isn't basically doing anything else? > Exactly > I think that's related to near-exhausted memory by userspace, If user space is the root cause, the appropriate response should be to terminate the offending user tasks. However, this doesn't happen at all. > so I'm not > sure why XFS would be to blame here. Honestly, I'm not sure what to blame, as I don't have a clear understanding of what's happening during memory allocation. One server among tens of thousands in production randomly experiences a livelock within days, making it extremely difficult to pinpoint the root cause. > > That said, if memalloc_retry_wait() is indeed a useful mechanism, maybe we > could perform it inside the page allocator itself for __GFP_NOFAIL? Perhaps an additional wait or exit mechanism should be implemented specifically for __GFP_NOFAIL. -- Regards Yafang
On Sun, Sep 1, 2024 at 11:35 AM Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Aug 30, 2024 at 05:14:28PM +0800, Yafang Shao wrote: > > On Thu, Aug 29, 2024 at 10:29 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Thu, Aug 29, 2024 at 07:55:08AM -0400, Kent Overstreet wrote: > > > > Ergo, if you're not absolutely sure that a GFP_NOFAIL use is safe > > > > according to call path and allocation size, you still need to be > > > > checking for failure - in the same way that you shouldn't be using > > > > BUG_ON() if you cannot prove that the condition won't occur in real wold > > > > usage. > > > > > > We've been using __GFP_NOFAIL semantics in XFS heavily for 30 years > > > now. This was the default Irix kernel allocator behaviour (it had a > > > forwards progress guarantee and would never fail allocation unless > > > told it could do so). We've been using the same "guaranteed not to > > > fail" semantics on Linux since the original port started 25 years > > > ago via open-coded loops. > > > > > > IOWs, __GFP_NOFAIL semantics have been production tested for a > > > couple of decades on Linux via XFS, and nobody here can argue that > > > XFS is unreliable or crashes in low memory scenarios. __GFP_NOFAIL > > > as it is used by XFS is reliable and lives up to the "will not fail" > > > guarantee that it is supposed to have. > > > > > > Fundamentally, __GFP_NOFAIL came about to replace the callers doing > > > > > > do { > > > p = kmalloc(size); > > > while (!p); > > > > > > so that they blocked until memory allocation succeeded. The call > > > sites do not check for failure, because -failure never occurs-. > > > > > > The MM devs want to have visibility of these allocations - they may > > > not like them, but having __GFP_NOFAIL means it's trivial to audit > > > all the allocations that use these semantics. IOWs, __GFP_NOFAIL > > > was created with an explicit guarantee that it -will not fail- for > > > normal allocation contexts so it could replace all the open-coded > > > will-not-fail allocation loops.. > > > > > > Given this guarantee, we recently removed these historic allocation > > > wrapper loops from XFS, and replaced them with __GFP_NOFAIL at the > > > allocation call sites. There's nearly a hundred memory allocation > > > locations in XFS that are tagged with __GFP_NOFAIL. > > > > > > If we're now going to have the "will not fail" guarantee taken away > > > from __GFP_NOFAIL, then we cannot use __GFP_NOFAIL in XFS. Nor can > > > it be used anywhere else that a "will not fail" guarantee it > > > required. > > > > > > Put simply: __GFP_NOFAIL will be rendered completely useless if it > > > can fail due to external scoped memory allocation contexts. This > > > will force us to revert all __GFP_NOFAIL allocations back to > > > open-coded will-not-fail loops. > > > > > > This is not a step forwards for anyone. > > > > Hello Dave, > > > > I've noticed that XFS has increasingly replaced kmem_alloc() with > > __GFP_NOFAIL. For example, in kernel 4.19.y, there are 0 instances of > > __GFP_NOFAIL under fs/xfs, but in kernel 6.1.y, there are 41 > > occurrences. In kmem_alloc(), there's an explicit > > memalloc_retry_wait() to throttle the allocator under heavy memory > > pressure, which aligns with your filesystem design. However, using > > __GFP_NOFAIL removes this throttling mechanism, potentially causing > > issues when the system is under heavy memory load. I'm concerned that > > this shift might not be a beneficial trend. 
> > AIUI, the memory allocation looping has back-offs already built in > to it when memory reserves are exhausted and/or reclaim is > congested. > > e.g: > > get_page_from_freelist() > (zone below watermark) > node_reclaim() > __node_reclaim() > shrink_node() > reclaim_throttle() It applies to all kinds of allocations. > > And the call to recalim_throttle() will do the equivalent of > memalloc_retry_wait() (a 2ms sleep). I'm wondering if we should take special action for __GFP_NOFAIL, as currently, it only results in an endless loop with no intervention. > > > We have been using XFS for our big data servers for years, and it has > > consistently performed well with older kernels like 4.19.y. However, > > after upgrading all our servers from 4.19.y to 6.1.y over the past two > > years, we have frequently encountered livelock issues caused by memory > > exhaustion. To mitigate this, we've had to limit the RSS of > > applications, which isn't an ideal solution and represents a worrying > > trend. > > If userspace uses all of memory all the time, then the best the > kernel can do is slowly limp along. Preventing userspace from > overcommitting memory to the point of OOM is the only way to avoid > these "userspace space wants more memory than the machine physically > has" sorts of issues. i.e. this is not a problem that the kernel > code can solve short of randomly killing userspace applications... We expect an OOM event, but it never occurs, which is a problem. -- Regards Yafang
On Mon 02-09-24 11:02:50, Yafang Shao wrote: > On Sun, Sep 1, 2024 at 11:35 AM Dave Chinner <david@fromorbit.com> wrote: [...] > > AIUI, the memory allocation looping has back-offs already built in > > to it when memory reserves are exhausted and/or reclaim is > > congested. > > > > e.g: > > > > get_page_from_freelist() > > (zone below watermark) > > node_reclaim() > > __node_reclaim() > > shrink_node() > > reclaim_throttle() > > It applies to all kinds of allocations. > > > > > > And the call to recalim_throttle() will do the equivalent of > > memalloc_retry_wait() (a 2ms sleep). > > I'm wondering if we should take special action for __GFP_NOFAIL, as > currently, it only results in an endless loop with no intervention. If the memory allocator/reclaim is thrashing on a couple of remaining pages that are easy to drop and reallocate again then the same endless loop is de-facto the behavior for _all_ non-costly allocations. All of them will loop. This is not really great but so far we haven't really developed a reliable thrashing detection that would suit all potential workloads. There are some that simply benefit from work not being lost even if the cost is a severe performance penalty. A general conclusion has been that workloads which would rather see OOM killer triggering early should implement that policy in userspace. We have PSI, refault counters and other tools that could be used to detect pathological patterns and trigger workload specific action. I really do not see why GFP_NOFAIL should be any special in this specific case.
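The userspace policy Michal describes typically watches the PSI memory pressure interface; a minimal userspace sketch follows (the threshold and the action taken are illustrative, and real daemons such as oomd use PSI triggers with poll() rather than a one-shot read):

    #include <stdio.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/pressure/memory", "r");

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f)) {
                    float avg10;

                    /* "full" is the share of time all tasks stalled on memory */
                    if (sscanf(line, "full avg10=%f", &avg10) == 1 && avg10 > 40.0f)
                            printf("sustained memory stall, pick and kill a victim\n");
            }
            fclose(f);
            return 0;
    }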
On Mon, Sep 2, 2024 at 4:11 PM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 02-09-24 11:02:50, Yafang Shao wrote: > > On Sun, Sep 1, 2024 at 11:35 AM Dave Chinner <david@fromorbit.com> wrote: > [...] > > > AIUI, the memory allocation looping has back-offs already built in > > > to it when memory reserves are exhausted and/or reclaim is > > > congested. > > > > > > e.g: > > > > > > get_page_from_freelist() > > > (zone below watermark) > > > node_reclaim() > > > __node_reclaim() > > > shrink_node() > > > reclaim_throttle() > > > > It applies to all kinds of allocations. > > > > > > > > And the call to recalim_throttle() will do the equivalent of > > > memalloc_retry_wait() (a 2ms sleep). > > > > I'm wondering if we should take special action for __GFP_NOFAIL, as > > currently, it only results in an endless loop with no intervention. > > If the memory allocator/reclaim is trashing on couple of remaining pages > that are easy to drop and reallocated again then the same endless loop > is de-facto the behavior for _all_ non-costly allocations. All of them > will loop. This is not really great but so far we haven't really > developed a reliable thrashing detection that would suit all potential > workloads. There are some that simply benefit from work not being lost > even if the cost is a severe performance penalty. A general conclusion > has been that workloads which would rather see OOM killer triggering > early should implement that policy in the userspace. We have PSI, > refault counters and other tools that could be used to detect > pathological patterns and trigger workload specific action. Indeed, we're currently working on developing that policy. > > I really do not see why GFP_NOFAIL should be any special in this > specific case. I believe there's no way to stop it from looping, even if you implement a sophisticated user space OOM killer. ;) -- Regards Yafang
On Mon 02-09-24 17:01:12, Yafang Shao wrote: > > I really do not see why GFP_NOFAIL should be any special in this > > specific case. > > I believe there's no way to stop it from looping, even if you > implement a sophisticated user space OOM killer. ;) User space OOM killer should be helping to replenish a free memory and we have some heuristics to help NOFAIL users out with some portion of memory reserves already IIRC. So we do already give them some special treatment in the page allocator path. Not so much in the reclaim path.
On Mon, Sep 2, 2024 at 5:09 PM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 02-09-24 17:01:12, Yafang Shao wrote: > > > I really do not see why GFP_NOFAIL should be any special in this > > > specific case. > > > > I believe there's no way to stop it from looping, even if you > > implement a sophisticated user space OOM killer. ;) > > User space OOM killer should be helping to replenish a free memory and > we have some heuristics to help NOFAIL users out with some portion of > memory reserves already IIRC. So we do already give them some special > treatment in the page allocator path. Not so much in the reclaim path. When setting GFP_NOFAIL, it's important to not only enable direct reclaim but also the OOM killer. In scenarios where swap is off and there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can result in an infinite loop. In other words, GFP_NOFAIL should not be used with GFP_NOFS. Unfortunately, many call sites do combine them. For example: XFS: fs/xfs/libxfs/xfs_exchmaps.c: GFP_NOFS | __GFP_NOFAIL fs/xfs/xfs_attr_item.c: GFP_NOFS | __GFP_NOFAIL EXT4: fs/ext4/mballoc.c: GFP_NOFS | __GFP_NOFAIL fs/ext4/extents.c: GFP_NOFS | __GFP_NOFAIL This seems problematic, but I'm not an FS expert. Perhaps Dave or Ted could provide further insight. -- Regards Yafang
On Tue 03-09-24 14:34:05, Yafang Shao wrote: > On Mon, Sep 2, 2024 at 5:09 PM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 02-09-24 17:01:12, Yafang Shao wrote: > > > > I really do not see why GFP_NOFAIL should be any special in this > > > > specific case. > > > > > > I believe there's no way to stop it from looping, even if you > > > implement a sophisticated user space OOM killer. ;) > > > > User space OOM killer should be helping to replenish a free memory and > > we have some heuristics to help NOFAIL users out with some portion of > > memory reserves already IIRC. So we do already give them some special > > treatment in the page allocator path. Not so much in the reclaim path. > > When setting GFP_NOFAIL, it's important to not only enable direct > reclaim but also the OOM killer. In scenarios where swap is off and > there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can > result in an infinite loop. In other words, GFP_NOFAIL should not be > used with GFP_NOFS. Unfortunately, many call sites do combine them. This is the case with GFP_NOFS on its own already. NOFAIL is no different and both will be looping forever. We heavily rely on kswapd or other GFP_KERNEL's direct reclaim to allow for forward progress. Unfortunately we haven't really found a better way to deal with NOFS-only/predominant workloads.
On Tue, Sep 03, 2024 at 02:34:05PM +0800, Yafang Shao wrote: > > When setting GFP_NOFAIL, it's important to not only enable direct > reclaim but also the OOM killer. In scenarios where swap is off and > there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can > result in an infinite loop. In other words, GFP_NOFAIL should not be > used with GFP_NOFS. Unfortunately, many call sites do combine them. > For example: > > XFS: > > fs/xfs/libxfs/xfs_exchmaps.c: GFP_NOFS | __GFP_NOFAIL > fs/xfs/xfs_attr_item.c: GFP_NOFS | __GFP_NOFAIL > > EXT4: > > fs/ext4/mballoc.c: GFP_NOFS | __GFP_NOFAIL > fs/ext4/extents.c: GFP_NOFS | __GFP_NOFAIL > > This seems problematic, but I'm not an FS expert. Perhaps Dave or Ted > could provide further insight. GFP_NOFS is needed because we need to signal to the mm layer to avoid recursing into the file system layer --- for example, to clean a page by writing it back to the FS. Since we may have taken various file system locks, recursing could lead to deadlock, which would make the system (and the user) sad. If the mm layer wants to OOM kill a process, that should be fine as far as the file system is concerned --- this could reclaim anonymous pages that don't need to be written back, for example. And we don't need to write back dirty pages before the process is killed. So I'm a bit puzzled why (as you imply; I haven't dug into the mm code in question) GFP_NOFS implies disabling the OOM killer? Regards, - Ted P.S. Note that this is a fairly simplistic, very conservative set of constraints. If you have several dozen file systems mounted, and we're deep in the guts of file system A, it might be *fine* to clean pages associated with file system B or file system C. Unless of course, file system A is a loop-back mount onto a file located in file system B, in which case writing into file system A might require taking locks related to file system B. But that aside, in theory we could allow certain types of page reclaim if we were willing to track which file systems are busy. On the other hand, if the system is allowed to get that busy, performance is going to be *terrible*, and so perhaps the better thing to do is to teach the container manager not to schedule so many jobs on the server in the first place, or to have the mobile OS kill off applications that aren't in the foreground, or to give the OOM killer license to kill off jobs much earlier, etc. By the time we get to the point where we are trying to use these last dozen or so pages, the system is going to be thrashing super-badly, and the user is going to be *quite* unhappy. So arguably these problems should be solved much higher up the software stack, by not letting the system get into such a condition in the first place.
On Tue, Sep 3, 2024 at 8:44 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Tue, Sep 03, 2024 at 02:34:05PM +0800, Yafang Shao wrote: > > > > When setting GFP_NOFAIL, it's important to not only enable direct > > reclaim but also the OOM killer. In scenarios where swap is off and > > there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can > > result in an infinite loop. In other words, GFP_NOFAIL should not be > > used with GFP_NOFS. Unfortunately, many call sites do combine them. > > For example: > > > > XFS: > > > > fs/xfs/libxfs/xfs_exchmaps.c: GFP_NOFS | __GFP_NOFAIL > > fs/xfs/xfs_attr_item.c: GFP_NOFS | __GFP_NOFAIL > > > > EXT4: > > > > fs/ext4/mballoc.c: GFP_NOFS | __GFP_NOFAIL > > fs/ext4/extents.c: GFP_NOFS | __GFP_NOFAIL > > > > This seems problematic, but I'm not an FS expert. Perhaps Dave or Ted > > could provide further insight. > > GFP_NOFS is needed because we need to signal to the mm layer to avoid > recursing into file system layer --- for example, to clean a page by > writing it back to the FS. Since we may have taken various file > system locks, recursing could lead to deadlock, which would make the > system (and the user) sad. > > If the mm layer wants to OOM kill a process, that should be fine as > far as the file system is concerned --- this could reclaim anonymous > pages that don't need to be written back, for example. And we don't > need to write back dirty pages before the process killed. So I'm a > bit puzzled why (as you imply; I haven't dug into the mm code in > question) GFP_NOFS implies disabling the OOM killer? Refer to the out_of_memory() function [0]: if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) return true; [0]. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/oom_kill.c#n1137 Is it possible that this check can be removed? > > Regards, > > - Ted > > P.S. Note that this is a fairly simplistic, very conservative set of > constraints. If you have several dozen file sysetems mounted, and > we're deep in the guts of file system A, it might be *fine* to clean > pages associated with file system B or file system C. Unless of > course, file system A is a loop-back mount onto a file located in file > system B, in which case writing into file system A might require > taking locks related to file system B. But that aside, in theory we > could allow certain types of page reclaim if we were willing to track > which file systems are busy. > > On the other hand, if the system is allowed to get that busy, > performance is going to be *terrible*, and so perhaps the better thing > to do is to teach the container manager not to schedule so many jobs > on the server in the first place, or having the mobile OS kill off > applications that aren't in the foreground, or giving the OOM killer > license to kill off jobs much earlier, etc. By the time we get to the > point where we are trying to use these last dozen or so pages, the > system is going to be thrashing super-badly, and the user is going to > be *quite* unhappy. So arguably these problems should be solved much > higher up the software stack, by not letting the system get into such > a condition in the first place. I completely agree with your point. However, in the real world, things don't always work as expected, which is why it's crucial to ensure the OOM killer is effective during system thrashing. Unfortunately, the kernel's OOM killer doesn't always perform as expected, particularly under heavy thrashing. This is one reason why user-space OOM killers like oomd exist. 
--
Regards
Yafang
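(To make the loop Yafang is worried about concrete, here is a heavily
condensed sketch of the slowpath behaviour under discussion. This is not the
real mm/page_alloc.c code; reclaim_and_alloc() and oom_kill_and_alloc() are
hypothetical stand-ins for direct reclaim and the OOM path:)

#include <linux/gfp.h>
#include <linux/mm_types.h>

struct page *reclaim_and_alloc(gfp_t gfp_mask, unsigned int order);	/* hypothetical */
struct page *oom_kill_and_alloc(gfp_t gfp_mask, unsigned int order);	/* hypothetical */

static struct page *nofail_slowpath_sketch(gfp_t gfp_mask, unsigned int order)
{
	struct page *page;

retry:
	page = reclaim_and_alloc(gfp_mask, order);
	if (page)
		return page;

	/*
	 * out_of_memory() bails out early for !__GFP_FS, non-memcg OOMs
	 * (the check quoted above), so a GFP_NOFS caller never gets a
	 * victim killed on its behalf.
	 */
	if (gfp_mask & __GFP_FS)
		page = oom_kill_and_alloc(gfp_mask, order);

	/*
	 * __GFP_NOFAIL forbids returning NULL. With swap off, little page
	 * cache and the OOM killer skipped, nothing above can make
	 * progress, and GFP_NOFS | __GFP_NOFAIL spins here forever.
	 */
	if (!page && (gfp_mask & __GFP_NOFAIL))
		goto retry;

	return page;
}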
On Tue 03-09-24 08:44:16, Theodore Ts'o wrote:
> On Tue, Sep 03, 2024 at 02:34:05PM +0800, Yafang Shao wrote:
> >
> > When setting GFP_NOFAIL, it's important to not only enable direct
> > reclaim but also the OOM killer. In scenarios where swap is off and
> > there is minimal page cache, setting GFP_NOFAIL without __GFP_FS can
> > result in an infinite loop. In other words, GFP_NOFAIL should not be
> > used with GFP_NOFS. Unfortunately, many call sites do combine them.
> > For example:
> >
> > XFS:
> >
> > fs/xfs/libxfs/xfs_exchmaps.c: GFP_NOFS | __GFP_NOFAIL
> > fs/xfs/xfs_attr_item.c: GFP_NOFS | __GFP_NOFAIL
> >
> > EXT4:
> >
> > fs/ext4/mballoc.c: GFP_NOFS | __GFP_NOFAIL
> > fs/ext4/extents.c: GFP_NOFS | __GFP_NOFAIL
> >
> > This seems problematic, but I'm not an FS expert. Perhaps Dave or Ted
> > could provide further insight.
>
> GFP_NOFS is needed because we need to signal to the mm layer to avoid
> recursing into the file system layer --- for example, to clean a page
> by writing it back to the FS. Since we may have taken various file
> system locks, recursing could lead to deadlock, which would make the
> system (and the user) sad.
>
> If the mm layer wants to OOM kill a process, that should be fine as far
> as the file system is concerned --- this could reclaim anonymous pages
> that don't need to be written back, for example. And we don't need to
> write back dirty pages before the process is killed. So I'm a bit
> puzzled why (as you imply; I haven't dug into the mm code in question)
> GFP_NOFS implies disabling the OOM killer?

Yes, because there might be a lot of fs pages pinned while performing a
NOFS allocation, and that could fire the OOM killer way too prematurely.
It has been quite some time since this was introduced, but I do remember
workloads hitting that. Also, there is usually kswapd making sufficient
progress to move forward.

There are cases where kswapd is completely stuck and it is other __GFP_FS
allocations triggering full direct reclaim, or background kworkers freeing
some memory, that keep things moving; a NOFS allocation doesn't have a
good enough picture to make an educated guess that the OOM killer is the
only available way forward.

A typical example of a workload that would care is one that is thrashing
but still making slow progress, which is acceptable because the most
important workload makes decent progress (its working set fits in memory
or is mlocked) and rebuilding the state would be more harmful than slow
IO.
On Tue 03-09-24 21:15:59, Yafang Shao wrote:
[...]
> I completely agree with your point. However, in the real world, things
> don't always work as expected, which is why it's crucial to ensure the
> OOM killer is effective during system thrashing. Unfortunately, the
> kernel's OOM killer doesn't always perform as expected, particularly
> under heavy thrashing. This is one reason why user-space OOM killers
> like oomd exist.

I do understand your point. On the other hand, over a long time of seeing
all the different usecases, we have concluded that the OOM killer should
really be a conservative last resort. More aggressive OOM policies should
be implemented by userspace to prevent regressions in other usecases.

That doesn't really mean improvements to the kernel oom killer are not
welcome or impossible. The bar is just quite high, as the wide variety of
workloads is really hard to support. Heavy thrashing is one example;
different workloads will have a different understanding of what that
actually means.
diff --git a/fs/bcachefs/acl.c b/fs/bcachefs/acl.c
index 87f1be9d4db4..1def61875a6f 100644
--- a/fs/bcachefs/acl.c
+++ b/fs/bcachefs/acl.c
@@ -137,7 +137,7 @@ static struct posix_acl *bch2_acl_from_disk(struct btree_trans *trans,
 		return NULL;
 
 	acl = allocate_dropping_locks(trans, ret,
-			posix_acl_alloc(count, _gfp));
+			posix_acl_alloc(count, GFP_KERNEL));
 	if (!acl)
 		return ERR_PTR(-ENOMEM);
 	if (ret) {
@@ -427,7 +427,8 @@ int bch2_acl_chmod(struct btree_trans *trans, subvol_inum inum,
 	if (ret)
 		goto err;
 
-	ret = allocate_dropping_locks_errcode(trans, __posix_acl_chmod(&acl, _gfp, mode));
+	ret = allocate_dropping_locks_errcode(trans,
+			__posix_acl_chmod(&acl, GFP_KERNEL, mode));
 	if (ret)
 		goto err;
 
diff --git a/fs/bcachefs/btree_cache.c b/fs/bcachefs/btree_cache.c
index 9f096fdcaf9a..e0090fa551d7 100644
--- a/fs/bcachefs/btree_cache.c
+++ b/fs/bcachefs/btree_cache.c
@@ -729,7 +729,8 @@ struct btree *bch2_btree_node_mem_alloc(struct btree_trans *trans, bool pcpu_rea
 
 	mutex_unlock(&bc->lock);
 
-	if (btree_node_data_alloc(c, b, GFP_NOWAIT|__GFP_NOWARN)) {
+	if (memalloc_flags_do(PF_MEMALLOC_NORECLAIM,
+			      btree_node_data_alloc(c, b, GFP_KERNEL|__GFP_NOWARN))) {
 		bch2_trans_unlock(trans);
 		if (btree_node_data_alloc(c, b, GFP_KERNEL|__GFP_NOWARN))
 			goto err;
diff --git a/fs/bcachefs/btree_iter.h b/fs/bcachefs/btree_iter.h
index 6d87e57745da..0ea21a5f6d86 100644
--- a/fs/bcachefs/btree_iter.h
+++ b/fs/bcachefs/btree_iter.h
@@ -865,29 +865,33 @@ struct bkey_s_c bch2_btree_iter_peek_and_restart_outlined(struct btree_iter *);
 	(_do) ?: bch2_trans_relock(_trans);				\
 })
 
-#define allocate_dropping_locks_errcode(_trans, _do)			\
-({									\
-	gfp_t _gfp = GFP_NOWAIT|__GFP_NOWARN;				\
-	int _ret = _do;							\
-									\
-	if (bch2_err_matches(_ret, ENOMEM)) {				\
-		_gfp = GFP_KERNEL;					\
-		_ret = drop_locks_do(_trans, _do);			\
-	}								\
-	_ret;								\
+#define memalloc_flags_do(_flags, _do)					\
+({									\
+	unsigned _saved_flags = memalloc_flags_save(_flags);		\
+	typeof(_do) _ret = _do;						\
+	memalloc_noreclaim_restore(_saved_flags);			\
+	_ret;								\
 })
 
-#define allocate_dropping_locks(_trans, _ret, _do)			\
-({									\
-	gfp_t _gfp = GFP_NOWAIT|__GFP_NOWARN;				\
-	typeof(_do) _p = _do;						\
-									\
-	_ret = 0;							\
-	if (unlikely(!_p)) {						\
-		_gfp = GFP_KERNEL;					\
-		_ret = drop_locks_do(_trans, ((_p = _do), 0));		\
-	}								\
-	_p;								\
+#define allocate_dropping_locks_errcode(_trans, _do)			\
+({									\
+	int _ret = memalloc_flags_do(PF_MEMALLOC_NORECLAIM|PF_MEMALLOC_NOWARN, _do);\
+									\
+	if (bch2_err_matches(_ret, ENOMEM)) {				\
+		_ret = drop_locks_do(_trans, _do);			\
+	}								\
+	_ret;								\
+})
+
+#define allocate_dropping_locks(_trans, _ret, _do)			\
+({									\
+	typeof(_do) _p = memalloc_flags_do(PF_MEMALLOC_NORECLAIM|PF_MEMALLOC_NOWARN, _do);\
+									\
+	_ret = 0;							\
+	if (unlikely(!_p)) {						\
+		_ret = drop_locks_do(_trans, ((_p = _do), 0));		\
+	}								\
+	_p;								\
 })
 
 #define bch2_trans_run(_c, _do)						\
diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
index af84516fb607..feea58778d44 100644
--- a/fs/bcachefs/btree_key_cache.c
+++ b/fs/bcachefs/btree_key_cache.c
@@ -117,12 +117,12 @@ static void bkey_cached_free(struct btree_key_cache *bc,
 	this_cpu_inc(*bc->nr_pending);
 }
 
-static struct bkey_cached *__bkey_cached_alloc(unsigned key_u64s, gfp_t gfp)
+static struct bkey_cached *__bkey_cached_alloc(unsigned key_u64s)
 {
-	struct bkey_cached *ck = kmem_cache_zalloc(bch2_key_cache, gfp);
+	struct bkey_cached *ck = kmem_cache_zalloc(bch2_key_cache, GFP_KERNEL);
 	if (unlikely(!ck))
 		return NULL;
-	ck->k = kmalloc(key_u64s * sizeof(u64), gfp);
+	ck->k = kmalloc(key_u64s * sizeof(u64), GFP_KERNEL);
 	if (unlikely(!ck->k)) {
 		kmem_cache_free(bch2_key_cache, ck);
 		return NULL;
@@ -146,7 +146,7 @@ bkey_cached_alloc(struct btree_trans *trans, struct btree_path *path, unsigned k
 		goto lock;
 
 	ck = allocate_dropping_locks(trans, ret,
-				     __bkey_cached_alloc(key_u64s, _gfp));
+				     __bkey_cached_alloc(key_u64s));
 	if (ret) {
 		if (ck)
 			kfree(ck->k);
@@ -240,7 +240,7 @@ static int btree_key_cache_create(struct btree_trans *trans, struct btree_path *
 	mark_btree_node_locked_noreset(path, 0, BTREE_NODE_UNLOCKED);
 
 	struct bkey_i *new_k = allocate_dropping_locks(trans, ret,
-				kmalloc(key_u64s * sizeof(u64), _gfp));
+				kmalloc(key_u64s * sizeof(u64), GFP_KERNEL));
 	if (unlikely(!new_k)) {
 		bch_err(trans->c, "error allocating memory for key cache key, btree %s u64s %u",
 			bch2_btree_id_str(ck->key.btree_id), key_u64s);
diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c
index 141a4c63142f..e2163cbf63a9 100644
--- a/fs/bcachefs/ec.c
+++ b/fs/bcachefs/ec.c
@@ -890,12 +890,12 @@ int bch2_ec_read_extent(struct btree_trans *trans, struct bch_read_bio *rbio)
 
 /* stripe bucket accounting: */
 
-static int __ec_stripe_mem_alloc(struct bch_fs *c, size_t idx, gfp_t gfp)
+static int __ec_stripe_mem_alloc(struct bch_fs *c, size_t idx)
 {
 	ec_stripes_heap n, *h = &c->ec_stripes_heap;
 
 	if (idx >= h->size) {
-		if (!init_heap(&n, max(1024UL, roundup_pow_of_two(idx + 1)), gfp))
+		if (!init_heap(&n, max(1024UL, roundup_pow_of_two(idx + 1)), GFP_KERNEL))
 			return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
 
 		mutex_lock(&c->ec_stripes_heap_lock);
@@ -909,11 +909,11 @@ static int __ec_stripe_mem_alloc(struct bch_fs *c, size_t idx, gfp_t gfp)
 		free_heap(&n);
 	}
 
-	if (!genradix_ptr_alloc(&c->stripes, idx, gfp))
+	if (!genradix_ptr_alloc(&c->stripes, idx, GFP_KERNEL))
 		return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
 
 	if (c->gc_pos.phase != GC_PHASE_not_running &&
-	    !genradix_ptr_alloc(&c->gc_stripes, idx, gfp))
+	    !genradix_ptr_alloc(&c->gc_stripes, idx, GFP_KERNEL))
 		return -BCH_ERR_ENOMEM_ec_stripe_mem_alloc;
 
 	return 0;
@@ -923,7 +923,7 @@ static int ec_stripe_mem_alloc(struct btree_trans *trans,
 			       struct btree_iter *iter)
 {
 	return allocate_dropping_locks_errcode(trans,
-			__ec_stripe_mem_alloc(trans->c, iter->pos.offset, _gfp));
+			__ec_stripe_mem_alloc(trans->c, iter->pos.offset));
 }
 
 /*
@@ -2193,7 +2193,7 @@ int bch2_stripes_read(struct bch_fs *c)
 		if (k.k->type != KEY_TYPE_stripe)
 			continue;
 
-		ret = __ec_stripe_mem_alloc(c, k.k->p.offset, GFP_KERNEL);
+		ret = __ec_stripe_mem_alloc(c, k.k->p.offset);
 		if (ret)
 			break;
 
diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
index bc8dd89ec15c..153ec4e5c0f4 100644
--- a/fs/bcachefs/fs.c
+++ b/fs/bcachefs/fs.c
@@ -273,14 +273,6 @@ static struct bch_inode_info *bch2_inode_hash_insert(struct bch_fs *c,
 	}
 }
 
-#define memalloc_flags_do(_flags, _do)					\
-({									\
-	unsigned _saved_flags = memalloc_flags_save(_flags);		\
-	typeof(_do) _ret = _do;						\
-	memalloc_noreclaim_restore(_saved_flags);			\
-	_ret;								\
-})
-
 static struct inode *bch2_alloc_inode(struct super_block *sb)
 {
 	BUG();
vmalloc doesn't correctly respect gfp flags - gfp flags aren't used for
pte allocation, so doing vmalloc/kvmalloc allocations while holding
reclaim-unsafe locks is a potential deadlock.

Note that we also want to use PF_MEMALLOC_NORECLAIM, not
PF_MEMALLOC_NOFS, because when we're doing allocations with btree locks
held we have a fallback available - drop locks and do a normal
GFP_KERNEL allocation. We don't want to be invoking reclaim with btree
locks held at all, since these are big shared locks and overall system
performance is sensitive to hold times.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/acl.c             |  5 ++--
 fs/bcachefs/btree_cache.c     |  3 ++-
 fs/bcachefs/btree_iter.h      | 46 +++++++++++++++++++----------------
 fs/bcachefs/btree_key_cache.c | 10 ++++----
 fs/bcachefs/ec.c              | 12 ++++-----
 fs/bcachefs/fs.c              |  8 ------
 6 files changed, 41 insertions(+), 43 deletions(-)
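(For readers following the fallback logic described above, a hypothetical
caller of the reworked allocate_dropping_locks() might look like the sketch
below; example_alloc() is made up for illustration and assumes the macros
added to fs/bcachefs/btree_iter.h by this patch. The first evaluation of the
allocation expression runs under PF_MEMALLOC_NORECLAIM, so the GFP_KERNEL
inside it cannot enter reclaim while btree locks are held; only on failure
are the locks dropped, the expression retried with reclaim allowed, and the
transaction relocked:)

/* Hypothetical caller - not part of the patch. */
static int example_alloc(struct btree_trans *trans, unsigned u64s,
			 struct bkey_i **out)
{
	int ret;
	struct bkey_i *k = allocate_dropping_locks(trans, ret,
				kmalloc(u64s * sizeof(u64), GFP_KERNEL));

	if (ret) {		/* relocking after the fallback path failed */
		kfree(k);	/* kfree(NULL) is a no-op */
		return ret;
	}
	if (!k)
		return -ENOMEM;

	*out = k;
	return 0;
}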