Message ID | 20240817062449.21164-1-21cnbao@gmail.com (mailing list archive)
---|---
Series | mm: clarify nofail memory allocation
On 17.08.24 08:24, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> -v3:
> * collect reviewed-by, acked-by etc. Michal, Christoph, Vlastimil,
>   Davidlohr, thanks!
> * use Jason's patch[1] to fix vdpa and refine his changelog.
> * refine changelogs
> [1] https://lore.kernel.org/all/20240808054320.10017-1-jasowang@redhat.com/
>
> -v2:
> https://lore.kernel.org/linux-mm/20240731000155.109583-1-21cnbao@gmail.com/
> * adjust the vdpa fix according to Jason and Michal's feedback; I would
>   expect Jason to test it, thanks!
> * split the BUG_ON for unavoidable failure and the GFP_ATOMIC |
>   __GFP_NOFAIL case into two patches according to Vlastimil and Michal.
> * collect Michal's acked-by for patch 2 - the doc;
> * remove the new GFP_NOFAIL from this series; that one would be a
>   separate enhancement patchset later on.
>
> -v1:
> https://lore.kernel.org/linux-mm/20240724085544.299090-1-21cnbao@gmail.com/
>
> __GFP_NOFAIL carries the semantics of never failing, so its callers
> do not check the return value:
>
>   %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>   cannot handle allocation failures. The allocation could block
>   indefinitely but will never return with failure. Testing for
>   failure is pointless.

You should have called this series "mm: clarify nofail and BUG_ON with
invalid nofail arguments". I was a bit surprised to find that many
BUG_ONs in this series ;)

> However, __GFP_NOFAIL can sometimes fail if it exceeds size limits
> or is used with GFP_ATOMIC/GFP_NOWAIT in a non-sleepable context.
> This can expose security vulnerabilities due to potential NULL
> dereferences.

Which code are we concerned about here? Some OOT driver? Some untested
in-tree driver? Just trying to understand what we are afraid of here.
I'm afraid we cannot save the world with some OOT drivers :(

Are we aware of such security vulnerabilities (out of pure interest!)?

Is the expectation that these code paths are never triggered during
early testing, where a WARN_ON_ONCE() would just highlight -- in
addition to a likely crash -- that something is very wrong?

> Since __GFP_NOFAIL does not support non-blocking allocation, we introduce
> GFP_NOFAIL with inclusive blocking semantics and encourage using GFP_NOFAIL
> as a replacement for __GFP_NOFAIL in non-mm code.

I still wonder if specifying GFP_NOFAIL should not simply come with the
semantics "implies direct reclaim; if called for non-blocking
allocations, bad things will happen (no security-bad things, though)" ;)

> If we must still fail a nofail allocation, we should trigger a BUG rather
> than exposing NULL dereferences to callers who do not check the return
> value.

I am not convinced that BUG_ON is the right tool here to save the world,
but I see how we arrived here.
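[To make the failure mode in the quoted changelog concrete, here is a
minimal sketch; the struct, lock, and field names are all hypothetical,
not taken from any real driver:]

#include <linux/slab.h>
#include <linux/spinlock.h>

struct foo {
	spinlock_t lock;
	int refcount;
};

static void buggy_nofail_user(struct foo *dev)
{
	struct foo *obj;

	spin_lock(&dev->lock);	/* non-sleepable context: no direct reclaim */
	obj = kmalloc(sizeof(*obj), GFP_ATOMIC | __GFP_NOFAIL);
	obj->refcount = 1;	/* unchecked: a NULL here is an oops, or worse */
	spin_unlock(&dev->lock);
}

[Because __GFP_NOFAIL documents "testing for failure is pointless", the
caller skips the NULL check; but under GFP_ATOMIC the allocator cannot
sleep to reclaim memory, so the allocation can still come back NULL.]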
On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
>
> > If we must still fail a nofail allocation, we should trigger a BUG rather
> > than exposing NULL dereferences to callers who do not check the return
> > value.
>
> I am not convinced that BUG_ON is the right tool here to save the world,
> but I see how we arrived here.

I think the thing to do is to just add a

    WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(order, flags));

or similar, where that bad_nofail_alloc() checks that the allocation
order is small and that the flags are sane for a NOFAIL allocation.

Because no, BUG_ON() is *never* the answer. The answer is to make sure
nobody ever sets NOFAIL in situations where the allocation can fail
and there is no way forward.

A BUG_ON() will quite likely just make things worse. You're better off
with a WARN_ON() and letting the caller just oops.

Honestly, I'm perfectly fine with just removing that stupid useless
flag entirely. The flag goes back to 2003 and was introduced in
2.5.69, and was meant to be for very particular uses that otherwise
just looped waiting for memory.

Back in 2.5.69, there was exactly one user: the jbd journal code, that
did a buffer head allocation with GFP_NOFAIL. By 2.6.0 that had
expanded by another user in XFS, and even that one had a comment
saying that it needed to be narrowed down. And in fact, by the 2.6.12
release, that XFS use had been removed, but the jbd journal had grown
another jbd_kmalloc case for transaction data. So at the beginning of
the git archives, we had exactly *one* user (with two places).

*THAT* is the kind of use that the flag was meant for: small
allocations required to make forward progress in writeout during
memory pressure.

It has then expanded and is now a problem. The cases using GFP_NOFAIL
for things like vmalloc() - which is by definition not a small
allocation - should be just removed as outright bugs.

Note that we had this comment back in 2010:

    * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
    * cannot handle allocation failures. This modifier is deprecated and no new
    * users should be added.

and then it was softened in 2015 to the current

    * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
    * cannot handle allocation failures. New users should be evaluated carefully
    ...

and clearly that "evaluated carefully" actually never happened, so the
new comment is just garbage.

I wonder how many modern users of GFP_NOFAIL are simply due to
over-eager allocation failure injection testing, and then people added
GFP_NOFAIL just because it shut up the mindless random allocation
failures.

I mean, we have a __GFP_NOFAIL in rhashtable_init() - which can
actually return an error just fine, but there was this crazy worry
about the IPC layer initialization failing:

    https://lore.kernel.org/all/20180523172500.anfvmjtumww65ief@linux-n805/

Things like that, where people just added mindless "theoretical
concerns" issues, or possibly had some error injection module that
inserted impossible failures.

I do NOT want those things to become BUG_ON()'s. It's better to just
return NULL with a "bogus GFP_NOFAIL" warning, and have the oops
happen in the actual bad place that did an invalid allocation.

Because the blame should go *there*, and it should not even remotely
look like "oh, the MM code failed". No. The caller was garbage.

So no. No MM BUG_ON code.

               Linus
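[bad_nofail_alloc() is hypothetical here; a minimal sketch of what such
a helper might check, assuming "sane" means a non-costly order with
direct reclaim allowed. The exact policy is an assumption, not a rule
anyone has agreed on:]

#include <linux/gfp.h>
#include <linux/mmzone.h>	/* PAGE_ALLOC_COSTLY_ORDER */

/* A NOFAIL request is "bad" if it is costly-order or cannot reclaim. */
static inline bool bad_nofail_alloc(unsigned int order, gfp_t flags)
{
	return order > PAGE_ALLOC_COSTLY_ORDER ||
	       !(flags & __GFP_DIRECT_RECLAIM);
}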
On Tue, Aug 20, 2024 at 4:05 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> >
> > > If we must still fail a nofail allocation, we should trigger a BUG rather
> > > than exposing NULL dereferences to callers who do not check the return
> > > value.
> >
> > I am not convinced that BUG_ON is the right tool here to save the world,
> > but I see how we arrived here.
>
> I think the thing to do is to just add a
>
>     WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(order, flags));
>
> or similar, where that bad_nofail_alloc() checks that the allocation
> order is small and that the flags are sane for a NOFAIL allocation.
>
> Because no, BUG_ON() is *never* the answer. The answer is to make sure
> nobody ever sets NOFAIL in situations where the allocation can fail
> and there is no way forward.
>
> A BUG_ON() will quite likely just make things worse. You're better off
> with a WARN_ON() and letting the caller just oops.

That could be an exploit taking advantage of those improper callers;
it wouldn't necessarily result in an immediate oops in the caller but
in an exploit, such as:

    p = malloc(gfp_nofail);

then using p->field or p + offset, where p->field or p + offset might
land in attacker-prepared memory even though p is NULL.

Could we instead consider infinitely blocking those bad callers:

    while (illegal_gfp_nofail)
            do_sth_relax_cpu();

> Honestly, I'm perfectly fine with just removing that stupid useless
> flag entirely. The flag goes back to 2003 and was introduced in
> 2.5.69, and was meant to be for very particular uses that otherwise
> just looped waiting for memory.
>
> Back in 2.5.69, there was exactly one user: the jbd journal code, that
> did a buffer head allocation with GFP_NOFAIL. By 2.6.0 that had
> expanded by another user in XFS, and even that one had a comment
> saying that it needed to be narrowed down. And in fact, by the 2.6.12
> release, that XFS use had been removed, but the jbd journal had grown
> another jbd_kmalloc case for transaction data. So at the beginning of
> the git archives, we had exactly *one* user (with two places).
>
> *THAT* is the kind of use that the flag was meant for: small
> allocations required to make forward progress in writeout during
> memory pressure.
>
> It has then expanded and is now a problem. The cases using GFP_NOFAIL
> for things like vmalloc() - which is by definition not a small
> allocation - should be just removed as outright bugs.
>
> Note that we had this comment back in 2010:
>
>     * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>     * cannot handle allocation failures. This modifier is deprecated and no new
>     * users should be added.
>
> and then it was softened in 2015 to the current
>
>     * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>     * cannot handle allocation failures. New users should be evaluated carefully
>     ...
>
> and clearly that "evaluated carefully" actually never happened, so the
> new comment is just garbage.
>
> I wonder how many modern users of GFP_NOFAIL are simply due to
> over-eager allocation failure injection testing, and then people added
> GFP_NOFAIL just because it shut up the mindless random allocation
> failures.
>
> I mean, we have a __GFP_NOFAIL in rhashtable_init() - which can
> actually return an error just fine, but there was this crazy worry
> about the IPC layer initialization failing:
>
> https://lore.kernel.org/all/20180523172500.anfvmjtumww65ief@linux-n805/
>
> Things like that, where people just added mindless "theoretical
> concerns" issues, or possibly had some error injection module that
> inserted impossible failures.
>
> I do NOT want those things to become BUG_ON()'s. It's better to just
> return NULL with a "bogus GFP_NOFAIL" warning, and have the oops
> happen in the actual bad place that did an invalid allocation.
>
> Because the blame should go *there*, and it should not even remotely
> look like "oh, the MM code failed". No. The caller was garbage.
>
> So no. No MM BUG_ON code.
>
>                Linus
On Mon, 19 Aug 2024 at 12:23, Barry Song <21cnbao@gmail.com> wrote:
>
> That could be an exploit taking advantage of those improper callers,

So?

FIX THE BUGGY CODE.

Don't make insane and incorrect changes to the MM code and spread
Fear, Uncertainty and Doubt.

> thus it wouldn't necessarily result in an immediate oops in callers but
> result in an exploit

No. Any bug can be an exploit. Don't try to make this something
special by calling it an exploit.

NULL pointer dereferences are some of the *least* worrisome bugs,
because we don't allow people to mmap the NULL area anyway.

So just stop spreading FUD. We don't improve the kernel by making
excuses for bugs, we improve it by fixing things.

And any caller that asks for NOFAIL with bad parameters is buggy. The
MM code should NOT try to fix it up, and dammit, BUG_ON() is not
acceptable as a debugging help. Never was, never will be.

Worry-warts already do "reboot-on-warn".

               Linus
On Tue, Aug 20, 2024 at 7:33 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 19 Aug 2024 at 12:23, Barry Song <21cnbao@gmail.com> wrote:
> >
> > That could be an exploit taking advantage of those improper callers,
>
> So?
>
> FIX THE BUGGY CODE.

That's definitely in progress, with patch 1/4 addressing vdpa. There's
also an RFC to enforce __GFP_DIRECT_RECLAIM for __GFP_NOFAIL, which
would prevent passing unsupported flags to the memory management
system:
https://lore.kernel.org/all/20240724085544.299090-6-21cnbao@gmail.com/

> Don't make insane and incorrect changes to the MM code and spread
> Fear, Uncertainty and Doubt.
>
> > thus it wouldn't necessarily result in an immediate oops in callers but
> > result in an exploit
>
> No. Any bug can be an exploit. Don't try to make this something
> special by calling it an exploit.
>
> NULL pointer dereferences are some of the *least* worrisome bugs,
> because we don't allow people to mmap the NULL area anyway.
>
> So just stop spreading FUD. We don't improve the kernel by making
> excuses for bugs, we improve it by fixing things.
>
> And any caller that asks for NOFAIL with bad parameters is buggy. The
> MM code should NOT try to fix it up, and dammit, BUG_ON() is not
> acceptable as a debugging help. Never was, never will be.

Okay, I see your point. However, the discussion originally began with
just a simple WARN_ON() to flag improper usage:
https://lore.kernel.org/linux-mm/20240717230025.77361-1-21cnbao@gmail.com/

Now, it seems we've come full circle and are opting to use
WARN_ON_ONCE() instead?

> Worry-warts already do "reboot-on-warn".
>
> Linus

Thanks
Barry
On Mon 19-08-24 12:33:17, Linus Torvalds wrote:
> On Mon, 19 Aug 2024 at 12:23, Barry Song <21cnbao@gmail.com> wrote:
> >
> > That could be an exploit taking advantage of those improper callers,
>
> So?
>
> FIX THE BUGGY CODE.
>
> Don't make insane and incorrect changes to the MM code and spread
> Fear, Uncertainty and Doubt.
>
> > thus it wouldn't necessarily result in an immediate oops in callers but
> > result in an exploit
>
> No. Any bug can be an exploit. Don't try to make this something
> special by calling it an exploit.
>
> NULL pointer dereferences are some of the *least* worrisome bugs,
> because we don't allow people to mmap the NULL area anyway.
>
> So just stop spreading FUD. We don't improve the kernel by making
> excuses for bugs, we improve it by fixing things.

I really do not think that accusing Barry of spreading FUD is fair! I
believe there is a good intention behind this initiative, and he has
used the tools that he has available. So rather than throwing FUD
arguments around, I would really like to stick to a technical
discussion, please.
On Tue, Aug 20, 2024 at 12:05 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> >
> > > If we must still fail a nofail allocation, we should trigger a BUG rather
> > > than exposing NULL dereferences to callers who do not check the return
> > > value.
> >
> > I am not convinced that BUG_ON is the right tool here to save the world,
> > but I see how we arrived here.
>
> I think the thing to do is to just add a
>
>     WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(order, flags));
>
> or similar, where that bad_nofail_alloc() checks that the allocation
> order is small and that the flags are sane for a NOFAIL allocation.
>
> Because no, BUG_ON() is *never* the answer. The answer is to make sure
> nobody ever sets NOFAIL in situations where the allocation can fail
> and there is no way forward.
>
> A BUG_ON() will quite likely just make things worse. You're better off
> with a WARN_ON() and letting the caller just oops.
>
> Honestly, I'm perfectly fine with just removing that stupid useless
> flag entirely. The flag goes back to 2003 and was introduced in
> 2.5.69, and was meant to be for very particular uses that otherwise
> just looped waiting for memory.
>
> Back in 2.5.69, there was exactly one user: the jbd journal code, that
> did a buffer head allocation with GFP_NOFAIL. By 2.6.0 that had
> expanded by another user in XFS, and even that one had a comment
> saying that it needed to be narrowed down. And in fact, by the 2.6.12
> release, that XFS use had been removed, but the jbd journal had grown
> another jbd_kmalloc case for transaction data. So at the beginning of
> the git archives, we had exactly *one* user (with two places).
>
> *THAT* is the kind of use that the flag was meant for: small
> allocations required to make forward progress in writeout during
> memory pressure.
>
> It has then expanded and is now a problem. The cases using GFP_NOFAIL
> for things like vmalloc() - which is by definition not a small
> allocation - should be just removed as outright bugs.

One potential approach could be to rename GFP_NOFAIL to
GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
to clear this flag for larger allocations. However, the challenge lies
in determining what constitutes a 'small' allocation.

> Note that we had this comment back in 2010:
>
>     * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>     * cannot handle allocation failures. This modifier is deprecated and no new
>     * users should be added.
>
> and then it was softened in 2015 to the current
>
>     * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>     * cannot handle allocation failures. New users should be evaluated carefully
>     ...
>
> and clearly that "evaluated carefully" actually never happened, so the
> new comment is just garbage.
>
> I wonder how many modern users of GFP_NOFAIL are simply due to
> over-eager allocation failure injection testing, and then people added
> GFP_NOFAIL just because it shut up the mindless random allocation
> failures.
>
> I mean, we have a __GFP_NOFAIL in rhashtable_init() - which can
> actually return an error just fine, but there was this crazy worry
> about the IPC layer initialization failing:
>
> https://lore.kernel.org/all/20180523172500.anfvmjtumww65ief@linux-n805/
>
> Things like that, where people just added mindless "theoretical
> concerns" issues, or possibly had some error injection module that
> inserted impossible failures.
>
> I do NOT want those things to become BUG_ON()'s. It's better to just
> return NULL with a "bogus GFP_NOFAIL" warning, and have the oops
> happen in the actual bad place that did an invalid allocation.
>
> Because the blame should go *there*, and it should not even remotely
> look like "oh, the MM code failed". No. The caller was garbage.
>
> So no. No MM BUG_ON code.
>
>                Linus

--
Regards
Yafang
On Wed, 21 Aug 2024 at 20:41, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> One potential approach could be to rename GFP_NOFAIL to
> GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> to clear this flag for larger allocations.

Yes, that sounds like a good way to make sure people don't blame the
MM layer when they themselves were the cause of problems.

> However, the challenge lies
> in determining what constitutes a 'small' allocation.

I think we could easily just stick to the historical "order <
PAGE_ALLOC_COSTLY_ORDER":

    * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
    * costly to service.

(And the value for that is 3 - orders 0-2 are considered "cheap")

               Linus
On Thu 22-08-24 06:59:08, Linus Torvalds wrote:
> On Wed, 21 Aug 2024 at 20:41, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > One potential approach could be to rename GFP_NOFAIL to
> > GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> > to clear this flag for larger allocations.
>
> Yes, that sounds like a good way to make sure people don't blame the
> MM layer when they themselves were the cause of problems.

The reality disagrees, because there is real demand for a real
GFP_NOFAIL semantic. By that I do not mean arbitrary requests, and
sure, GFP_NOFAIL for higher orders is really hard to achieve, but
kvmalloc GFP_NOFAIL for anything larger than PAGE_SIZE is doable
without a considerable burden on the MM end.
On Thu, Aug 22, 2024 at 12:41 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Aug 20, 2024 at 12:05 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> > >
> > > > If we must still fail a nofail allocation, we should trigger a BUG rather
> > > > than exposing NULL dereferences to callers who do not check the return
> > > > value.
> > >
> > > I am not convinced that BUG_ON is the right tool here to save the world,
> > > but I see how we arrived here.
> >
> > I think the thing to do is to just add a
> >
> >     WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(order, flags));
> >
> > or similar, where that bad_nofail_alloc() checks that the allocation
> > order is small and that the flags are sane for a NOFAIL allocation.
> >
> > Because no, BUG_ON() is *never* the answer. The answer is to make sure
> > nobody ever sets NOFAIL in situations where the allocation can fail
> > and there is no way forward.
> >
> > A BUG_ON() will quite likely just make things worse. You're better off
> > with a WARN_ON() and letting the caller just oops.
> >
> > Honestly, I'm perfectly fine with just removing that stupid useless
> > flag entirely. The flag goes back to 2003 and was introduced in
> > 2.5.69, and was meant to be for very particular uses that otherwise
> > just looped waiting for memory.
> >
> > Back in 2.5.69, there was exactly one user: the jbd journal code, that
> > did a buffer head allocation with GFP_NOFAIL. By 2.6.0 that had
> > expanded by another user in XFS, and even that one had a comment
> > saying that it needed to be narrowed down. And in fact, by the 2.6.12
> > release, that XFS use had been removed, but the jbd journal had grown
> > another jbd_kmalloc case for transaction data. So at the beginning of
> > the git archives, we had exactly *one* user (with two places).
> >
> > *THAT* is the kind of use that the flag was meant for: small
> > allocations required to make forward progress in writeout during
> > memory pressure.
> >
> > It has then expanded and is now a problem. The cases using GFP_NOFAIL
> > for things like vmalloc() - which is by definition not a small
> > allocation - should be just removed as outright bugs.
>
> One potential approach could be to rename GFP_NOFAIL to
> GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> to clear this flag for larger allocations. However, the challenge lies
> in determining what constitutes a 'small' allocation.

I'm not entirely sure if our concern is with higher order or larger
size. Higher order might pose a problem, but a larger size (not too
large) isn't always an issue. Allocating 100 * 4KiB pages is possibly
easier than allocating a single 128KB folio.

Are we trying to limit the physical size or the physical order? If the
concern is order, vmalloc manages __GFP_NOFAIL by mapping order-0
pages. If the concern is higher order, this sounds reasonable, but it
seems the buddy system already has code to trigger a warning even for
order > 1:

struct page *rmqueue(struct zone *preferred_zone,
			struct zone *zone, unsigned int order,
			gfp_t gfp_flags, unsigned int alloc_flags,
			int migratetype)
{
	struct page *page;

	/*
	 * We most definitely don't want callers attempting to
	 * allocate greater than order-1 page units with __GFP_NOFAIL.
	 */
	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

	if (likely(pcp_allowed_order(order))) {
		page = rmqueue_pcplist(preferred_zone, zone, order,
				       migratetype, alloc_flags);
		if (likely(page))
			goto out;
	}
	....
}

> > Note that we had this comment back in 2010:
> >
> >     * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >     * cannot handle allocation failures. This modifier is deprecated and no new
> >     * users should be added.
> >
> > and then it was softened in 2015 to the current
> >
> >     * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >     * cannot handle allocation failures. New users should be evaluated carefully
> >     ...
> >
> > and clearly that "evaluated carefully" actually never happened, so the
> > new comment is just garbage.
> >
> > I wonder how many modern users of GFP_NOFAIL are simply due to
> > over-eager allocation failure injection testing, and then people added
> > GFP_NOFAIL just because it shut up the mindless random allocation
> > failures.
> >
> > I mean, we have a __GFP_NOFAIL in rhashtable_init() - which can
> > actually return an error just fine, but there was this crazy worry
> > about the IPC layer initialization failing:
> >
> > https://lore.kernel.org/all/20180523172500.anfvmjtumww65ief@linux-n805/
> >
> > Things like that, where people just added mindless "theoretical
> > concerns" issues, or possibly had some error injection module that
> > inserted impossible failures.
> >
> > I do NOT want those things to become BUG_ON()'s. It's better to just
> > return NULL with a "bogus GFP_NOFAIL" warning, and have the oops
> > happen in the actual bad place that did an invalid allocation.
> >
> > Because the blame should go *there*, and it should not even remotely
> > look like "oh, the MM code failed". No. The caller was garbage.
> >
> > So no. No MM BUG_ON code.
> >
> >                Linus
>
> --
> Regards
> Yafang

Thanks
Barry
On Thu, 22 Aug 2024 at 14:21, Michal Hocko <mhocko@suse.com> wrote:
>
> The reality disagrees because there is a real demand for real GFP_NOFAIL
> semantic. By that I do not mean arbitrary requests and sure GFP_NOFAIL
> for higher orders is really hard to achieve but kvmalloc GFP_NOFAIL for
> anything larger than PAGE_SIZE is doable without a considerable burden
> on the MM end.

Doable? Sure. Sensible? Not clear.

I do not find a single case of that in the kernel.

I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
one in erofs. It's not clear that any of them make much sense (or that
the erofs one is actually a large allocation).

And there's some disgusting noise in hmm_test.c. Irrelevant.

Now, this was all from a pure mindless grep, and it didn't take any
context into account, so if somebody is building up the gfp flags and
not using them directly, that grep has failed.

But when you say "reality disagrees", I think you need to point at
said reality.

Because the *real* reality is that large allocations HAVE ALWAYS
FAILED. In fact, with flags like GFP_ATOMIC etc, it will fail very
aggressively. Christ, that's what started this whole thread and
discussion in the first place.

So I think *that* is the reality you need to face: GFP_NOFAIL has
never ever been a hard guarantee to begin with, and expecting it to be
that is a sign of insanity. It's fundamentally an impossible
operation. It really is a "Daddy, daddy, I want a pony" flag.

               Linus
On Thu, 22 Aug 2024 at 14:40, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> one in erofs. It's not clear that any of them make much sense (or that
> the erofs one is actually a large allocation).

Oh, and I missed one in btrfs because it needed five lines of context
due to being the allocation from hell.

That said, yes, the vmalloc case at least has no fragmentation issues,
but I do think even that needs to be size limited for sanity.

The size limit might be larger than a couple of pages, but not
_hugely_ larger. You can't just say "I want a megabyte, and you can't
fail me". That kind of code is garbage, and needs to be called out for
being garbage.

               Linus
Hi Linus,

On 2024/8/22 14:40, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 14:21, Michal Hocko <mhocko@suse.com> wrote:
>>
>> The reality disagrees because there is a real demand for real GFP_NOFAIL
>> semantic. By that I do not mean arbitrary requests and sure GFP_NOFAIL
>> for higher orders is really hard to achieve but kvmalloc GFP_NOFAIL for
>> anything larger than PAGE_SIZE is doable without a considerable burden
>> on the MM end.
>
> Doable? Sure. Sensible? Not clear.
>
> I do not find a single case of that in the kernel.
>
> I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> one in erofs. It's not clear that any of them make much sense (or that
> the erofs one is actually a large allocation).

I don't follow the whole thread due to other internal work ongoing, but
EROFS can do a _large_ kvmalloc NOFAIL allocation relative to
PAGE_ALLOC_COSTLY_ORDER (~24KB at most due to an on-disk restriction);
my detailed story was outlined in my previous reply (and thread):
https://lore.kernel.org/r/20d782ad-c059-4029-9c75-0ef278c98d81@linux.alibaba.com

Because EROFS needs page arrays for vmap and then does decompression,
in the worst case it needs a ~24KB temporary page array, but it is the
end user's choice to use such extreme compression (mostly just
syzkaller-crafted images).

In my opinion, I'm not sure what a PAGE_ALLOC_COSTLY_ORDER restriction
means for a single shot. Even without a virtually consecutive buffer,
people could do < PAGE_ALLOC_COSTLY_ORDER allocations multiple times
and put almost the same heavy workload on the whole system. And we also
allow direct/kswapd reclaim here.

The failure path is complex in some cases like this one, and it's hard
to reach or get right. If kvmalloc() will be restricted to
< PAGE_ALLOC_COSTLY_ORDER anyway, I guess I will use a global static
buffer (and a sleeping lock) as a worst-case fallback to fulfill the
extreme on-disk restriction.

Thanks,
Gao Xiang
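[A sketch of the worst-case fallback described above: a preallocated
static buffer guarded by a sleeping lock instead of kvmalloc(NOFAIL).
All names here are hypothetical and only the ~24KB bound comes from the
mail; this is not actual EROFS code:]

#include <linux/mutex.h>
#include <linux/slab.h>

static DEFINE_MUTEX(z_fallback_lock);
static struct page *z_fallback_pages[24 * 1024 / sizeof(void *)];

static struct page **alloc_page_array(unsigned int nr)
{
	struct page **arr = kvmalloc_array(nr, sizeof(*arr), GFP_KERNEL);

	if (arr)
		return arr;
	if (nr > ARRAY_SIZE(z_fallback_pages))
		return NULL;		/* the on-disk format bounds nr */
	mutex_lock(&z_fallback_lock);	/* sleep until the single buffer is free */
	return z_fallback_pages;
}

static void free_page_array(struct page **arr)
{
	if (arr == z_fallback_pages)
		mutex_unlock(&z_fallback_lock);
	else
		kvfree(arr);
}

[The design trade-off: progress is still guaranteed without any NOFAIL
allocation, at the cost of serializing the rare worst-case users behind
one mutex.]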
On Thu 22-08-24 14:56:08, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 14:40, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> > one in erofs. It's not clear that any of them make much sense (or that
> > the erofs one is actually a large allocation).
>
> Oh, and I missed one in btrfs because it needed five lines of context
> due to being the allocation from hell.
>
> That said, yes, the vmalloc case at least has no fragmentation issues,
> but I do think even that needs to be size limited for sanity.
>
> The size limit might be larger than a couple of pages, but not
> _hugely_ larger. You can't just say "I want a megabyte, and you can't
> fail me". That kind of code is garbage, and needs to be called out for
> being garbage.

Yes, no objection here. Our current limits are too large for any
practical purpose. We still need a strategy for how to communicate that
the size is not supported, though. Just returning NULL is IMHO a bad
thing because it adds a potentially silent failure. In another
subthread I was contemplating an OOPS_ON which would simply terminate
the user context and kill it right away. What do you think about
something like that?
On Thu 22-08-24 15:01:43, Gao Xiang wrote:
> In my opinion, I'm not sure what a PAGE_ALLOC_COSTLY_ORDER restriction
> means for a single shot. Even without a virtually consecutive buffer,
> people could do < PAGE_ALLOC_COSTLY_ORDER allocations multiple times
> and put almost the same heavy workload on the whole system. And we also
> allow direct/kswapd reclaim here.

Quite honestly, I do not think that the PAGE_ALLOC_COSTLY_ORDER
constraint makes sense outside of the page allocator proper. There is
no reason why vmalloc NOFAIL should be constrained by that. Sure, it
should be constrained to some value, but considering it is just a bunch
of PAGE_SIZE allocations, the limit could be higher. I am not sure
where the practical limit should be, but anything that requires more
than a couple of MBs seems really excessive.

And for PAGE_ALLOC_COSTLY_ORDER and NOFAIL at the page allocator level,
I would argue that we do not want to provide such a strong guarantee to
anything order > 0. Just use kvmalloc for that purpose. Sure, we
practically never fail up to costly order, but guaranteeing that is a
different story.
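[At a call site, the "just use kvmalloc" recommendation looks roughly
like this; the entry type and helper name are illustrative, not from
any real user:]

#include <linux/slab.h>
#include <linux/types.h>

struct entry {
	u64 key;
	u64 val;
};

/* May sleep; with __GFP_NOFAIL the allocator retries instead of
 * returning NULL. */
static struct entry *alloc_table(size_t nr)
{
	return kvmalloc_array(nr, sizeof(struct entry),
			      GFP_KERNEL | __GFP_NOFAIL);
}

[The point of routing such requests through kvmalloc() is that physical
contiguity is traded away: the allocation can fall back to vmalloc'ed
order-0 pages, which is exactly what makes the no-fail promise cheap
enough for the MM to keep.]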
On Thu, Aug 22, 2024 at 7:47 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 22-08-24 14:56:08, Linus Torvalds wrote:
> > On Thu, 22 Aug 2024 at 14:40, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> > > one in erofs. It's not clear that any of them make much sense (or that
> > > the erofs one is actually a large allocation).
> >
> > Oh, and I missed one in btrfs because it needed five lines of context
> > due to being the allocation from hell.
> >
> > That said, yes, the vmalloc case at least has no fragmentation issues,
> > but I do think even that needs to be size limited for sanity.
> >
> > The size limit might be larger than a couple of pages, but not
> > _hugely_ larger. You can't just say "I want a megabyte, and you can't
> > fail me". That kind of code is garbage, and needs to be called out for
> > being garbage.
>
> Yes, no objection here. Our current limits are too large for any
> practical purpose. We still need a strategy for how to communicate that
> the size is not supported, though. Just returning NULL is IMHO a bad
> thing because it adds a potentially silent failure. In another
> subthread I was contemplating an OOPS_ON which would simply terminate
> the user context and kill it right away. What do you think about
> something like that?

Personally, I fully support this approach that falls between BUG_ON and
returning NULL. Regarding the concern about 'leaving locks behind' you
raised in that subthread, I believe there's no difference when
returning NULL: it could still leave locks behind, but it offers a
chance for the calling process to avoid an immediate crash.

> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Hi Michal,

On 2024/8/22 15:54, Michal Hocko wrote:
> On Thu 22-08-24 15:01:43, Gao Xiang wrote:
>> In my opinion, I'm not sure what a PAGE_ALLOC_COSTLY_ORDER restriction
>> means for a single shot. Even without a virtually consecutive buffer,
>> people could do < PAGE_ALLOC_COSTLY_ORDER allocations multiple times
>> and put almost the same heavy workload on the whole system. And we also
>> allow direct/kswapd reclaim here.
>
> Quite honestly, I do not think that the PAGE_ALLOC_COSTLY_ORDER
> constraint makes sense outside of the page allocator proper. There is
> no reason why vmalloc NOFAIL should be constrained by that. Sure, it
> should be constrained to some value, but considering it is just a bunch
> of PAGE_SIZE allocations, the limit could be higher. I am not sure
> where the practical limit should be, but anything that requires more
> than a couple of MBs seems really excessive.

Yeah, totally agreed; that would make my own life easier, and of course
I will not allocate MBs insanely. I've always been trying to kill
unnecessary NOFAILs (mostly together with code cleanups), but if a
failure path adds more than 100 lines of code just for a rare failure
under extreme workloads, I _do_ hope kvmalloc(NOFAIL) could work
instead.

> And for PAGE_ALLOC_COSTLY_ORDER and NOFAIL at the page allocator level,
> I would argue that we do not want to provide such a strong guarantee to
> anything order > 0. Just use kvmalloc for that purpose. Sure, we
> practically never fail up to costly order, but guaranteeing that is a
> different story.

Agreed.

Thanks,
Gao Xiang
On Thu 22-08-24 19:57:41, Barry Song wrote:
> Regarding the concern about 'leaving locks behind' you raised in that
> subthread, I believe there's no difference when returning NULL: it
> could still leave locks behind, but it offers a chance for the calling
> process to avoid an immediate crash.

Yes, I have mentioned this risk just for completeness. Without having
some sort of unwinding mechanism we are doomed to not be able to handle
this.

The sole difference between just returning NULL and OOPSing right away
is that the former is not guaranteed to happen, and the caller can
cause actual harm by dereferencing non-oopsing addresses close to 0,
which would be a) much harder to find out and b) could cause much more
damage than killing the context right away.

Besides that, I believe we have many BUG_ON users who would really
prefer to just kill the current context instead; they just do not have
the means to do that, so OOPS_ON could be a safer way to stop bad users
and reduce the number of BUG_ONs as well. I am just not really sure how
to implement that. A stupid and obvious way would be to have a
dereference from a known (pre-defined) unmapped area. But this smells
like something that should be achievable in a better way.
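[OOPS_ON() does not exist in the kernel today; the following is only a
rough sketch of the "dereference a known unmapped area" idea above, to
make the intended semantics concrete: kill the calling context, not the
machine. The magic address and message format are made up:]

#include <linux/compiler.h>
#include <linux/printk.h>

#define OOPS_ON(cond)							\
	do {								\
		if (unlikely(cond)) {					\
			pr_err("OOPS_ON(%s) at %s:%d\n",		\
			       #cond, __FILE__, __LINE__);		\
			/* page 0 is never mapped: the fault oopses	\
			 * only the current task */			\
			*(volatile int *)16 = 0;			\
		}							\
	} while (0)

[As the mail itself notes, a real implementation would want something
less crude than a magic address; the observable behavior is the point:
a report is printed and the offending task dies, while any locks it
holds are leaked, just as with BUG_ON.]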
On 22.08.24 10:24, Michal Hocko wrote:
> On Thu 22-08-24 19:57:41, Barry Song wrote:
>> Regarding the concern about 'leaving locks behind' you raised in that
>> subthread, I believe there's no difference when returning NULL, as it
>> could still leave locks behind but offers a chance for the calling
>> process to avoid an immediate crash.
>
> Yes, I have mentioned this risk just for completeness. Without having
> some sort of unwinding mechanism we are doomed to not be able to handle
> this.
>
> The sole difference between just returning NULL and OOPSing right away
> is that the former is not guaranteed to happen and the caller can cause
> actual harm by dereferencing non-oopsing addresses close to 0, which
> would be a) much harder to find out and b) could cause much more damage
> than killing the context right away.
>
> Besides that, I believe we have many BUG_ON users who would really
> prefer to just kill the current context instead; they just do not have
> the means to do that, so OOPS_ON could be a safer way to stop bad users
> and reduce the number of BUG_ONs as well.

To me that sounds better as well, but I was also wondering if it's easy
to implement or easy to assemble from existing pieces.

Linus has a point that "retry forever" can also be nasty. I think the
important part here is, though, that we report sufficient information
(stacktrace), such that the problem can be debugged reasonably well,
and not just end up with a locked-up system.

But then the question is: does it really make sense to differentiate
between a NOFAIL allocation under memory pressure of MAX_ORDER compared
to MAX_ORDER+1 (Linus also touched on that)? It could well take
minutes/hours/days to satisfy a very large NOFAIL allocation. So
callers should be prepared to run into effective lockups ... :/

NOFAIL shouldn't exist, or at least not be used to this degree. I am to
blame myself: I made use of it in kernel/resource.c, where there is no
turning back once memory unplug is 99% complete (we even freed the
vmemmap), but we might still have to allocate a new node in the
resource tree when having to split an existing one. Maybe there would
be ways to preallocate before starting memory unplug, or to pre-split
... But then again, sizeof(struct resource) is probably so small that
it likely would never fail.
On Thu, 22 Aug 2024 at 16:39, David Hildenbrand <david@redhat.com> wrote:
>
> Linus has a point that "retry forever" can also be nasty. I think the
> important part here is, though, that we report sufficient information
> (stacktrace), such that the problem can be debugged reasonably well,
> and not just having a locked-up system.

Unless I missed some case, I *think* most NOFAIL cases are actually
fairly small.

In fact, I suspect many of them are so small that we already
effectively give that guarantee:

> But then again, sizeof(struct resource) is probably so small that it
> likely would never fail.

Iirc, we had the policy of never failing unrestricted kernel
allocations that are smaller than a page (where "unrestricted" means
that it's a regular GFP_KERNEL, not some NOFS or similar allocation).

In fact, I think we practically speaking still do. We really *really*
tend to try very hard to retry small allocations.

That was one of the things that GFP_USER does - it's identical to
GFP_KERNEL, but it basically tells the MM that it should not try so
hard because an allocation failure was fine.

(I say "was", because the semantics have changed over time. Originally,
GFP_KERNEL had the "GFP_HIGH" bit set that said "access emergency pools
for this allocation", and GFP_USER did not have that bit set. Now the
only difference seems to be GFP_HARDWALL, so for a user allocation the
cgroup limits are hard limits, etc. I think there have been other
versions of this kind of logic over the years as people try to make it
all work out well in practice).

In fact, kernel allocations try so hard that we have those "opposite
flags" of ___GFP_NORETRY and ___GFP_RETRY_MAYFAIL because we often try
*TOO* hard, and reasonably many code-paths have that whole "let's
optimistically ask for a big allocation, but not try very hard and not
warn if it fails, because we can fall back on a smaller one".

So it's _really_ hard to fail a small GFP_KERNEL allocation. It used
to be practically impossible, and in fact I think GFP_NOFAIL was
originally added long ago when the MM code was going through big
upheavals and one of the things that was mucked around with was the
whole "how hard to retry".

               Linus
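[The "ask big optimistically, fall back gracefully" pattern referred to
above, as a generic sketch; the sizes and function name are
illustrative. __GFP_NORETRY keeps the large attempt cheap, __GFP_NOWARN
keeps it quiet, and failure is an expected, handled outcome rather than
a reason for NOFAIL:]

#include <linux/slab.h>

static void *alloc_scratch(size_t big, size_t small, size_t *got)
{
	void *p = kmalloc(big, GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);

	if (p) {
		*got = big;
		return p;
	}
	/* small GFP_KERNEL allocations practically never fail */
	*got = small;
	return kmalloc(small, GFP_KERNEL);
}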
On Thu 22-08-24 10:39:09, David Hildenbrand wrote:
[...]
> But then the question is: does it really make sense to differentiate
> between a NOFAIL allocation under memory pressure of MAX_ORDER compared
> to MAX_ORDER+1 (Linus also touched on that)? It could well take
> minutes/hours/days to satisfy a very large NOFAIL allocation. So
> callers should be prepared to run into effective lockups ... :/

As pointed out in another subthread, we shouldn't really pretend we
support NOFAIL for order > 0, or at least anything > A_SMALL_ORDER, and
should encourage kvmalloc for those users.

A nofail order-2 allocation can kill most of the userspace on a
terribly fragmented system that is kernel-allocation heavy.

> NOFAIL shouldn't exist, or at least not be used to this degree.

Let's put wishful thinking aside. Unless somebody manages to go over
all existing NOFAIL users and fix them, then we should better focus on
providing a reasonable, clearly documented and enforced semantic.
On Thu 22-08-24 17:08:15, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 16:39, David Hildenbrand <david@redhat.com> wrote:
> >
> > Linus has a point that "retry forever" can also be nasty. I think the
> > important part here is, though, that we report sufficient information
> > (stacktrace), such that the problem can be debugged reasonably well,
> > and not just having a locked-up system.
>
> Unless I missed some case, I *think* most NOFAIL cases are actually
> fairly small.
>
> In fact, I suspect many of them are so small that we already
> effectively give that guarantee:
>
> > But then again, sizeof(struct resource) is probably so small that it
> > likely would never fail.
>
> Iirc, we had the policy of never failing unrestricted kernel
> allocations that are smaller than a page (where "unrestricted" means
> that it's a regular GFP_KERNEL, not some NOFS or similar allocation).
>
> In fact, I think we practically speaking still do. We really *really*
> tend to try very hard to retry small allocations.

Yes, we try very hard, but allocation failure is still possible in some
corner cases, so callers _must_ check the return value and deal with
it.

> That was one of the things that GFP_USER does - it's identical to
> GFP_KERNEL, but it basically tells the MM that it should not try so
> hard because an allocation failure was fine.

GFP_USER allocations only imply __GFP_HARDWALL, and that only makes a
difference for cpusets. It doesn't make a difference in most cases,
though.

> In fact, kernel allocations try so hard that we have those "opposite
> flags" of ___GFP_NORETRY and ___GFP_RETRY_MAYFAIL because we often try
> *TOO* hard, and reasonably many code-paths have that whole "let's
> optimistically ask for a big allocation, but not try very hard and not
> warn if it fails, because we can fall back on a smaller one".
>
> So it's _really_ hard to fail a small GFP_KERNEL allocation. It used
> to be practically impossible, and in fact I think GFP_NOFAIL was
> originally added long ago when the MM code was going through big
> upheavals and one of the things that was mucked around with was the
> whole "how hard to retry".

There is a fundamental difference here. GFP_NOFAIL _guarantees_ that
the allocation will not fail, so callers do not check for the failure
because they have (presumably) no (practical) way to handle the
failure.
On Thu, 22 Aug 2024 at 17:11, Michal Hocko <mhocko@suse.com> wrote:
>
> Let's put wishful thinking aside. Unless somebody manages to go over
> all existing NOFAIL users and fix them, then we should better focus on
> providing a reasonable, clearly documented and enforced semantic.

I do like changing the naming to make it clear that it's not some kind
of general MM guarantee for any random allocation.

So that's why I liked the NOFAIL_SMALL_ALLOC just to make people who
use it aware that no, they aren't getting a "get out of jail free"
card.

Admittedly that's probably _too_ long a name, but conceptually...

               Linus
On Thu, 22 Aug 2024 at 17:16, Michal Hocko <mhocko@suse.com> wrote:
>
> GFP_USER allocations only imply __GFP_HARDWALL, and that only makes a
> difference for cpusets. It doesn't make a difference in most cases,
> though.

That's what it does today.

We used to have a very clear notion of "how hard to try". It was
"LOW", "MED" and "HIGH".

And GFP_USER used __GFP_LOW, exactly so that the MM layer knew to not
try very hard. GFP_ATOMIC used __GFP_HIGH, to say "use the reserved
resources". GFP_KERNEL then at one point used __GFP_MED, to say "don't
dip into the reserved pool, but retry harder".

But exactly because people did want kernel allocations to basically
always succeed, then GFP_KERNEL ended up using __GFP_HIGH too.

> There is a fundamental difference here. GFP_NOFAIL _guarantees_ that
> the allocation will not fail, so callers do not check for the failure
> because they have (presumably) no (practical) way to handle the
> failure.

And this mindset needs to go away. That's what I've been trying to say.

It absolutely MUST NOT GUARANTEE THAT.

I've seen crap patches that say "BUG_ON() if we cannot guarantee it",
and I'm NACKing those kinds of completely bogus models.

The hard reality needs to be that GFP_NOFAIL is simply IGNORED if
people mis-use it. It absolutely HAS to be a "conditional no-failure".
And it needs to be conditional on both size and things like "I'm
allowed to do reclaim".

Any discussion that starts with "GFP_NOFAIL is a guarantee" needs to
*DIE*.

               Linus
On 22.08.24 11:11, Michal Hocko wrote:
> On Thu 22-08-24 10:39:09, David Hildenbrand wrote:
> [...]
>> But then the question is: does it really make sense to differentiate
>> between a NOFAIL allocation under memory pressure of MAX_ORDER compared
>> to MAX_ORDER+1 (Linus also touched on that)? It could well take
>> minutes/hours/days to satisfy a very large NOFAIL allocation. So
>> callers should be prepared to run into effective lockups ... :/
>
> As pointed out in another subthread, we shouldn't really pretend we
> support NOFAIL for order > 0, or at least anything > A_SMALL_ORDER, and
> should encourage kvmalloc for those users.
>
> A nofail order-2 allocation can kill most of the userspace on a
> terribly fragmented system that is kernel-allocation heavy.
>
>> NOFAIL shouldn't exist, or at least not be used to this degree.
>
> Let's put wishful thinking aside. Unless somebody manages to go over
> all existing NOFAIL users and fix them, then we should better focus on
> providing a reasonable, clearly documented and enforced semantic.

Probably it would be time better spent than trying to find ways to deal
with that mess. ;)

I think the documentation is mostly there:

    "The VM implementation _must_ retry infinitely: the caller cannot
    handle allocation failures. The allocation could block indefinitely
    but will never return with failure. Testing for failure is
    pointless."

To me, that implies that if you pass in MAX_ORDER+1, the VM will "retry
infinitely". Whether that means just OOPSing or actually being in a
busy loop, I don't care. It could effectively happen with MAX_ORDER as
well, as stated. But certainly not BUG_ON.

"Using this flag for costly allocations is _highly_ discouraged" should
be rephrased to "Using this flag with costly allocations is _highly
dangerous_ and will likely result in the allocation never succeeding
and this function never making any progress."

I do also agree that renaming NOFAIL to make some of that clearer makes
sense.

Likely, checkpatch should be updated to warn on any new NOFAIL usage.
On Thu 22-08-24 17:18:29, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 17:11, Michal Hocko <mhocko@suse.com> wrote:
> >
> > Let's put wishful thinking aside. Unless somebody manages to go over
> > all existing NOFAIL users and fix them, then we should better focus on
> > providing a reasonable, clearly documented and enforced semantic.
>
> I do like changing the naming to make it clear that it's not some kind
> of general MM guarantee for any random allocation.
>
> So that's why I liked the NOFAIL_SMALL_ALLOC just to make people who
> use it aware that no, they aren't getting a "get out of jail free"
> card.

Small means different things to different people. Also, "small" has a
completely different meaning for the page allocator and for kvmalloc. I
really do not like to carve any ambiguity like that into a flag that is
supposed to be used for both.

Quite honestly, I am not even sure we have actual GFP_NOFAIL users of
the page allocator outside of the MM (e.g. to implement SLUB internal
allocations). git grep just takes too much time to process because the
underlying allocator is not always immediately visible.

Limiting the NOFAIL semantic to the SLUB and {kv}malloc allocators
would make some sense then, as it could enforce reasonable use more
easily, I guess.
On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
>
> To me, that implies that if you pass in MAX_ORDER+1, the VM will "retry
> infinitely". Whether that means just OOPSing or actually being in a
> busy loop, I don't care. It could effectively happen with MAX_ORDER as
> well, as stated. But certainly not BUG_ON.

No BUG_ON(), but also no endless loop.

Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
make it easy to find offenders, and then let them deal with it.

Don't take it upon yourself to say "we have to deal with any amount of
stupidity".

The MM layer is not some slave to users. The MM layer is one of the
most core pieces of code in the kernel, and as such the MM layer is
damn well in charge.

Nobody has the right to say "I will not deal with allocation
failures". The MM should not bend over backwards over something like
that.

Seriously. Get a spine already, people. Tell random drivers that claim
that they cannot deal with errors to just f-ck off.

And you don't do it by looping forever, and you don't do it by killing
the kernel. You do it by ignoring their bullying tactics.

Then you document the *LIMITED* cases where you actually will try
forever.

This discussion has gone on for too damn long.

               Linus
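[Schematically, the allocator-side policy argued for here could build
on the bad_nofail_alloc() sketch from earlier in the thread; this is a
fragment of a hypothetical slow path, not current upstream behavior:]

	/* refuse bogus NOFAIL requests instead of looping or BUG()ing */
	if (unlikely((gfp & __GFP_NOFAIL) && bad_nofail_alloc(order, gfp))) {
		WARN_ONCE(1, "bogus __GFP_NOFAIL allocation, order=%u\n",
			  order);
		return NULL;	/* the oops then happens at the caller's deref */
	}

[The blame then lands in the caller's backtrace, which is exactly the
point being made.]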
On Thu 22-08-24 11:27:25, David Hildenbrand wrote:
[...]
> Likely, checkpatch should be updated to warn on any new NOFAIL usage.
What do you expect people to do? I do not see a pattern of willy-nilly
use of the flag (we have fewer than 200 users in the kernel outside of
mm; compare that to 40k GFP_KERNEL allocations). If you warn and the
only answer is to shrug and go on, then this serves no purpose.

I have learned a couple of things. People do not really give deep
thought to the naming (e.g. the GFP_TEMPORARY story). Or to the
documentation (e.g. the GFP_NOWAIT | GFP_NOFAIL user). They sometimes
pay attention to warnings, but WARN_ON_ONCE tells you about a single
abuser... I believe that enforcing constraints for GFP_NOFAIL by
killing the allocating user context would have a more visible effect
than WARN_ON, and it would stop the potential silent failure mode at
the same time.
On 22.08.24 11:41, Michal Hocko wrote:
> On Thu 22-08-24 11:27:25, David Hildenbrand wrote:
> [...]
>> Likely, checkpatch should be updated to warn on any new NOFAIL usage.
>
> What do you expect people to do? I do not see a pattern of willy-nilly
> use of the flag (we have fewer than 200 users in the kernel outside of
> mm; compare that to 40k GFP_KERNEL allocations). If you warn and the
> only answer is to shrug and go on, then this serves no purpose.

Like the BUG_ON checks I added. Some people still try to ignore them,
and we end up in discussions like this when I spot it ;)
On 22.08.24 11:34, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
>>
>> To me, that implies that if you pass in MAX_ORDER+1, the VM will "retry
>> infinitely". Whether that means just OOPSing or actually being in a
>> busy loop, I don't care. It could effectively happen with MAX_ORDER as
>> well, as stated. But certainly not BUG_ON.
>
> No BUG_ON(), but also no endless loop.
>
> Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
> make it easy to find offenders, and then let them deal with it.
>
> Don't take it upon yourself to say "we have to deal with any amount of
> stupidity".
>
> The MM layer is not some slave to users. The MM layer is one of the
> most core pieces of code in the kernel, and as such the MM layer is
> damn well in charge.
>
> Nobody has the right to say "I will not deal with allocation
> failures". The MM should not bend over backwards over something like
> that.
>
> Seriously. Get a spine already, people. Tell random drivers that claim
> that they cannot deal with errors to just f-ck off.
>
> And you don't do it by looping forever, and you don't do it by killing
> the kernel. You do it by ignoring their bullying tactics.
>
> Then you document the *LIMITED* cases where you actually will try
> forever.

So on the buddy level, that might mean that we limit it to a single
page, and document "NOFAIL is ineffective and ignored when allocating
pages of order > 0. Any attempt will result in a WARN_ON_ONCE()".

(assuming we can find and eliminate users that allocate order > 0
fairly easily)

{kv}malloc allocators would be different, as Michal said. No idea if
that is feasible, but it sounds like something you have in mind.
On Thu, 22 Aug 2024 at 17:33, Michal Hocko <mhocko@suse.com> wrote:
>
> Limiting the NOFAIL semantic to the SLUB and {kv}malloc allocators
> would make some sense then, as it could enforce reasonable use more
> easily, I guess.

If by "limit to SLUB" you mean "limit it to the kmalloc() cases that
can be done using the standard *SMALL* buckets", then maybe.

But even then it should probably be only if you don't ask for specific
nodes or other limitations on the allocation.

Because realize that "kmalloc()" and friends will fall back to other
things like __kmalloc_large_node_noprof(), and HELL NO those should
not honor NOFAIL.

And dammit, those kvmalloc() sizes need to be limited too. A number
like 24kB was mentioned. That sounds fine. Maybe even 64kB. But make
it *SMALL*.

And make it clear that it will return NULL if somebody misuses it.

               Linus
On Thu, 22 Aug 2024 at 17:43, David Hildenbrand <david@redhat.com> wrote:
>
> So on the buddy level, that might mean that we limit it to a single
> page,

Actually, for many SLUB allocations, you probably do have to accept
the small orders - the slab caches are often two or four pages.

For example, a kmalloc(256) is an order-1 allocation on a buddy level
from a quick look at /proc/slabinfo.

So it's not necessarily only single pages. We do handle small orders.
But it gets exponentially harder, so it really is just the small
orders that work.

Looks like slub will use up to order-3. That smells like an off-by-one
to me (I thought we made 0-2 be the "cheap" orders, but maybe I'm just
wrong), but it probably is still acceptable.

               Linus
On Thu 22-08-24 17:44:22, Linus Torvalds wrote:
[...]
> And dammit, those kvmalloc() sizes need to be limited too. A number
> like 24kB was mentioned. That sounds fine. Maybe even 64kB. But make
> it *SMALL*.

I am not particularly bothered by the exact size for kvmalloc. There
should be a limit all right, but whether that is 1MB or 64kB shouldn't
make much of a difference, because we use 4kB pages to back those
allocations anyway. Sure, 32-bit is a different story because the
vmalloc space itself is a scarce resource, but do we care about those?

> And make it clear that it will return NULL if somebody misuses it.

What do you expect users to do with the return value then?
[ Sorry, on mobile right now, so HTML crud ]

On Thu, Aug 22, 2024, 17:59 Michal Hocko <mhocko@suse.com> wrote:
>
> > And make it clear that it will return NULL if somebody misuses it.
>
> What do you expect users to do with the return value then?

NOT. YOUR. PROBLEM.

It's the problem of the caller.

What's so hard to understand about that?

The MM layer has enough problems of its own, it doesn't need to take
on the problems of others.

               Linus
On Thu 22-08-24 18:30:12, Linus Torvalds wrote:
> [ Sorry, on mobile right now, so HTML crud ]
>
> On Thu, Aug 22, 2024, 17:59 Michal Hocko <mhocko@suse.com> wrote:
> >
> > > And make it clear that it will return NULL if somebody misuses it.
> >
> > What do you expect users to do with the return value then?
>
> NOT. YOUR. PROBLEM.
>
> It's the problem of the caller.

They have already told you they (believe they) have no way to handle
the failure. So if they learn they can get NULL, they will either
BUG_ON, put a loop around it, or just close their eyes and hope for the
best. I argue that none of these is a good option, because each leads
to poor and buggy code.

Look Linus, I believe we are getting into a circle, and I do not have
much more to offer to this discussion. I do agree with you that we
should define the boundaries of GFP_NOFAIL more explicitly. The
documentation we have is not an effective way to do that. I strongly
disagree that "WARN_ONCE, just return NULL and close your eyes" is good
programming, nor does it encourage good programming on the caller side.

I have offered to explicitly oops in those cases inside the allocator
(while not holding any internal allocator locks) as a reasonable
compromise. I even believe that this is a useful tool in other contexts
as well. I haven't heard your opinion on that so far. If I had time to
work on that myself, I would give it a try.
On Thu, Aug 22, 2024 at 05:53:22PM +0800, Linus Torvalds wrote: > On Thu, 22 Aug 2024 at 17:43, David Hildenbrand <david@redhat.com> wrote: > > > > So on the buddy level, that might mean that we limit it to a single > > page, > > Actually, for many SLUB allocations, you probably do have to accept > the small orders - the slab caches are often two or four pages. > > For example, a kmalloc(256) is an order-1 allocation on a buddy level > from a quick look at /proc/slabinfo. It will try higher orders to reduce fragmentation (and mask out NOFAIL on those attempts), but it can fall back to the minimum size required for the object, i.e. get_order(size). > So it's not necessarily only single pages. We do handle small orders. > But it gets exponentially harder, so it really is just the small > orders that work. Agreed. > Looks like slub will use up to order-3. That smells like an off-by-one > to me (I thought we made 0-2 be the "cheap" orders, but maybe I'm just > wrong), but it probably is still acceptable. It's 0-3. We #define PAGE_ALLOC_COSTLY_ORDER 3, but it's exclusive - all the order checks are > costly and <= costly.
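A rough sketch of the two-step pattern Johannes describes; the real logic lives in slub's allocate_slab(), and the exact flag juggling there differs, so treat this purely as an illustration:

/*
 * Try the larger, fragmentation-friendly order first with __GFP_NOFAIL
 * and direct reclaim masked out so this attempt may cheaply fail, then
 * fall back to the minimum order the object needs with the full flags.
 */
static struct page *alloc_slab_pages_sketch(size_t size, gfp_t flags,
					    unsigned int preferred_order)
{
	gfp_t trial = (flags | __GFP_NOWARN | __GFP_NORETRY) &
		      ~(__GFP_NOFAIL | __GFP_DIRECT_RECLAIM);
	struct page *page = alloc_pages(trial, preferred_order);

	if (page)
		return page;

	/* minimum order that still fits one object */
	return alloc_pages(flags, get_order(size));
}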
On Thu, Aug 22, 2024 at 2:37 PM Barry Song <21cnbao@gmail.com> wrote: > > On Thu, Aug 22, 2024 at 12:41 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Tue, Aug 20, 2024 at 12:05 AM Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote: > > > > > > > > > > If we must still fail a nofail allocation, we should trigger a BUG rather > > > > > than exposing NULL dereferences to callers who do not check the return > > > > > value. > > > > > > > > I am not convinced that BUG_ON is the right tool here to save the world, > > > > but I see how we arrived here. > > > > > > I think the thing to do is to just add a > > > > > > WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(oder, flags)); > > > > > > or similar, where that bad_nofail_alloc() checks that the allocation > > > order is small and that the flags are sane for a NOFAIL allocation. > > > > > > Because no, BUG_ON() is *never* the answer. The answer is to make sure > > > nobody ever sets NOFAIL in situations where the allocation can fail > > > and there is no way forward. > > > > > > A BUG_ON() will quite likely just make things worse. You're better off > > > with a WARN_ON() and letting the caller just oops. > > > > > > Honestly, I'm perfectly fine with just removing that stupid useless > > > flag entirely. The flag goes back to 2003 and was introduced in > > > 2.5.69, and was meant to be for very particular uses that otherwise > > > just looped waiting for memory. > > > > > > Back in 2.5.69, there was exactly one user: the jbd journal code, that > > > did a buffer head allocation with GFP_NOFAIL. By 2.6.0 that had > > > expanded by another user in XFS, and even that one had a comment > > > saying that it needed to be narrowed down. And in fact, by the 2.6.12 > > > release, that XFS use had been removed, but the jbd journal had grown > > > another jbd_kmalloc case for transaction data. So at the beginning of > > > the git archives, we had exactly *one* user (with two places). > > > > > > *THAT* is the kind of use that the flag was meant for: small > > > allocations required to make forward progress in writeout during > > > memory pressure. > > > > > > It has then expanded and is now a problem. The cases using GFP_NOFAIL > > > for things like vmalloc() - which is by definition not a small > > > allocation - should be just removed as outright bugs. > > > > One potential approach could be to rename GFP_NOFAIL to > > GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and > > to clear this flag for larger allocations. However, the challenge lies > > in determining what constitutes a 'small' allocation. > > I'm not entirely sure if our concern is with higher order or larger size. I believe both should be considered. Since the higher-order task might be easier to address, starting with that seems like the more straightforward approach. > Higher > order might pose a problem, but larger size(not too large) isn't > always an issue. > Allocating 100 * 4KiB pages is possibly easier than allocating a single > 128KB folio. > > Are we trying to limit the physical size or the physical order? If the concern > is order, vmalloc manages __GFP_NOFAIL by mapping order-0 pages. If the > concern is higher order, this sounds reasonable. but it seems the buddy > system already has code to trigger a warning even for order > 1: To avoid potential livelock, it may be wise to drop this flag for higher-order allocations as well. 
Following Linus's suggestion, we could start by removing it for "> PAGE_ALLOC_COSTLY_ORDER". > > struct page *rmqueue(struct zone *preferred_zone, > struct zone *zone, unsigned int order, > gfp_t gfp_flags, unsigned int alloc_flags, > int migratetype) > { > struct page *page; > > /* > * We most definitely don't want callers attempting to > * allocate greater than order-1 page units with __GFP_NOFAIL. > */ > WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); This line was added by Michal in commit 0f352e5392c8 ("mm: remove __GFP_NOFAIL is deprecated comment"), but it appears that Michal has since reconsidered his stance. ;) -- Regards Yafang
On Thu, Aug 22, 2024 at 4:04 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: > > Hi Michal, > > On 2024/8/22 15:54, Michal Hocko wrote: > > On Thu 22-08-24 15:01:43, Gao Xiang wrote: > >> In my opinion, I'm not sure what the PAGE_ALLOC_COSTLY_ORDER restriction > >> means for a single shot. Because even if you don't consider > >> a virtually consecutive buffer, people could also do > >> < PAGE_ALLOC_COSTLY_ORDER allocations multiple times to impose almost > >> the same heavy workload on the whole system. And we also allow > >> direct/kswapd reclaim here. > > > > Quite honestly I do not think that the PAGE_ALLOC_COSTLY_ORDER constraint > > makes sense outside of the page allocator proper. There is no reason why > > vmalloc NOFAIL should be constrained by that. Sure it should be > > constrained to some value, but considering it is just a bunch of PAGE_SIZE > > allocations, the limit could be higher. I am not sure where the > > practical limit should be, but anything that requires more than a couple of > > MBs seems really excessive. > > Yeah, totally agreed, that would make my own life easier; of > course I will not allocate MBs insanely. > > I've always been trying to kill unnecessary NOFAILs (mostly together > with code cleanups), but if a failure path adds more than > 100 LOCs just for a rare failure and extreme workloads, I _do_ > hope kvmalloc(NOFAIL) could work instead. If the LOCs in the error handler are a concern, I believe we can simplify it to a single line: while (!alloc()), which is essentially what NOFAIL does and is also the reason we desperately want NOFAIL. A better approach might involve failing after a maximum number of retries at the call site, for example: while (try < max_retries && !alloc()) At least that is better than the endless loop in the page allocator.
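Spelling out the bounded-retry idea from the message above as a sketch (the helper and the max_retries parameter are illustrative, not existing API):

/*
 * Retry a bounded number of times instead of looping forever inside
 * the page allocator; the caller must still cope with NULL.
 */
static void *kvmalloc_bounded(size_t size, gfp_t gfp, int max_retries)
{
	void *p = NULL;
	int tries;

	for (tries = 0; tries < max_retries && !p; tries++)
		p = kvmalloc(size, gfp);

	return p;
}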
On 2024/8/22 22:35, Yafang Shao wrote: > On Thu, Aug 22, 2024 at 4:04 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote: >> >> Hi Michal, >> >> On 2024/8/22 15:54, Michal Hocko wrote: >>> On Thu 22-08-24 15:01:43, Gao Xiang wrote: >>>> In my opinion, I'm not sure what the PAGE_ALLOC_COSTLY_ORDER restriction >>>> means for a single shot. Because even if you don't consider >>>> a virtually consecutive buffer, people could also do >>>> < PAGE_ALLOC_COSTLY_ORDER allocations multiple times to impose almost >>>> the same heavy workload on the whole system. And we also allow >>>> direct/kswapd reclaim here. >>> >>> Quite honestly I do not think that the PAGE_ALLOC_COSTLY_ORDER constraint >>> makes sense outside of the page allocator proper. There is no reason why >>> vmalloc NOFAIL should be constrained by that. Sure it should be >>> constrained to some value, but considering it is just a bunch of PAGE_SIZE >>> allocations, the limit could be higher. I am not sure where the >>> practical limit should be, but anything that requires more than a couple of >>> MBs seems really excessive. >> >> Yeah, totally agreed, that would make my own life easier; of >> course I will not allocate MBs insanely. >> >> I've always been trying to kill unnecessary NOFAILs (mostly together >> with code cleanups), but if a failure path adds more than >> 100 LOCs just for a rare failure and extreme workloads, I _do_ >> hope kvmalloc(NOFAIL) could work instead. > > If the LOCs in the error handler are a concern, I believe we can > simplify it to a single line: while (!alloc()), which is essentially > what NOFAIL does and is also the reason we desperately want NOFAIL. > > A better approach might involve failing after a maximum number of > retries at the call site, for example: > > while (try < max_retries && !alloc()) > > At least that is better than the endless loop in the page allocator. Funny. Thanks, Gao Xiang
On 8/22/24 11:34, Linus Torvalds wrote: > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote: >> >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry >> infinitely". if that implies just OOPSing or actually be in a busy loop, >> I don't care. It could effectively happen with MAX_ORDER as well, as >> stated. But certainly not BUG_ON. > > No BUG_ON(), but also no endless loop. > > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to > make it easy to find offenders, and then let them deal with it. Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when we're about to actually return NULL, so the memory has to be depleted already. To make it easier to find the offenders much more reliably, we should consider doing it sooner, but also not add unnecessary overhead to allocator fastpaths just because of the potentially buggy users. So either always in __alloc_pages_slowpath(), which should be often enough (unless the system never needs to wake up kswapd to reclaim) but with negligible enough overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM? > Don't take it upon yourself to say "we have to deal with any amount of > stupidity". > > The MM layer is not some slave to users. The MM layer is one of the > most core pieces of code in the kernel, and as such the MM layer is > damn well in charge. > > Nobody has the right to say "I will not deal with allocation > failures". The MM should not bend over backwards over something like > that. > > Seriously. Get a spine already, people. Tell random drivers that claim > that they cannot deal with errors to just f-ck off. > > And you don't do it by looping forever, and you don't do it by killing > the kernel. You do it by ignoring their bullying tactics. > > Then you document the *LIMITED* cases where you actually will try forever. > > This discussion has gone on for too damn long. > > Linus
On Tue, 27 Aug 2024 at 00:10, Vlastimil Babka <vbabka@suse.cz> wrote: > > Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when > we're about to actually return NULL, so the memory has to be depleted > already. To make it easier to find the offenders much more reliably, we > should consider doing it sooner, but also not add unnecessary overhead to > allocator fastpaths just because of the potentially buggy users. Ack. Sounds like a sane model to me. And I agree that __alloc_pages_slowpath() is likely a reasonable middle path between "catch misuses, but don't put it in the hotpath". Linus
On Tue, Aug 27, 2024 at 12:10 AM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 8/22/24 11:34, Linus Torvalds wrote: > > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote: > >> > >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry > >> infinitely". if that implies just OOPSing or actually be in a busy loop, > >> I don't care. It could effectively happen with MAX_ORDER as well, as > >> stated. But certainly not BUG_ON. > > > > No BUG_ON(), but also no endless loop. > > > > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to > > make it easy to find offenders, and then let them deal with it. > > Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when > we're about to actually return NULL, so the memory has to be depleted > already. To make it easier to find the offenders much more reliably, we > should consider doing it sooner, but also not add unnecessary overhead to > allocator fastpaths just because of the potentially buggy users. So either > always in __alloc_pages_slowpath(), which should be often enough (unless the > system never needs to wake up kswapd to reclaim) but with negligible enough > overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM? We already have a WARN_ON for order > 1 in rmqueue. we might extend the condition there to include checking flags as well? diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7dcb0713eb57..b5717c6569f9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3071,8 +3071,11 @@ struct page *rmqueue(struct zone *preferred_zone, /* * We most definitely don't want callers attempting to * allocate greater than order-1 page units with __GFP_NOFAIL. + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, + * which can result in a lockup */ - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); + WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && + (order > 1 || !(gfp_flags & __GFP_DIRECT_RECLAIM))); if (likely(pcp_allowed_order(order))) { page = rmqueue_pcplist(preferred_zone, zone, order, > > > Don't take it upon yourself to say "we have to deal with any amount of > > stupidity". > > > > The MM layer is not some slave to users. The MM layer is one of the > > most core pieces of code in the kernel, and as such the MM layer is > > damn well in charge. > > > > Nobody has the right to say "I will not deal with allocation > > failures". The MM should not bend over backwards over something like > > that. > > > > Seriously. Get a spine already, people. Tell random drivers that claim > > that they cannot deal with errors to just f-ck off. > > > > And you don't do it by looping forever, and you don't do it by killing > > the kernel. You do it by ignoring their bullying tactics. > > > > Then you document the *LIMITED* cases where you actually will try forever. > > > > This discussion has gone on for too damn long. > > > > Linus >
On 8/27/24 09:15, Barry Song wrote: > On Tue, Aug 27, 2024 at 12:10 AM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> On 8/22/24 11:34, Linus Torvalds wrote: >> > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote: >> >> >> >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry >> >> infinitely". if that implies just OOPSing or actually be in a busy loop, >> >> I don't care. It could effectively happen with MAX_ORDER as well, as >> >> stated. But certainly not BUG_ON. >> > >> > No BUG_ON(), but also no endless loop. >> > >> > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to >> > make it easy to find offenders, and then let them deal with it. >> >> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when >> we're about to actually return NULL, so the memory has to be depleted >> already. To make it easier to find the offenders much more reliably, we >> should consider doing it sooner, but also not add unnecessary overhead to >> allocator fastpaths just because of the potentially buggy users. So either >> always in __alloc_pages_slowpath(), which should be often enough (unless the >> system never needs to wake up kswapd to reclaim) but with negligible enough >> overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM? > > We already have a WARN_ON for order > 1 in rmqueue. we might extend > the condition there to include checking flags as well? Ugh, wasn't aware, well spotted. So it means there at least shouldn't be existing users of __GFP_NOFAIL with order > 1 :) But also the check is in the hotpath, even before trying the pcplists, so we could move it to __alloc_pages_slowpath() while extending it? > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 7dcb0713eb57..b5717c6569f9 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3071,8 +3071,11 @@ struct page *rmqueue(struct zone *preferred_zone, > /* > * We most definitely don't want callers attempting to > * allocate greater than order-1 page units with __GFP_NOFAIL. > + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, > + * which can result in a lockup > */ > - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > + WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && > + (order > 1 || !(gfp_flags & __GFP_DIRECT_RECLAIM))); > > if (likely(pcp_allowed_order(order))) { > page = rmqueue_pcplist(preferred_zone, zone, order, > >> >> > Don't take it upon yourself to say "we have to deal with any amount of >> > stupidity". >> > >> > The MM layer is not some slave to users. The MM layer is one of the >> > most core pieces of code in the kernel, and as such the MM layer is >> > damn well in charge. >> > >> > Nobody has the right to say "I will not deal with allocation >> > failures". The MM should not bend over backwards over something like >> > that. >> > >> > Seriously. Get a spine already, people. Tell random drivers that claim >> > that they cannot deal with errors to just f-ck off. >> > >> > And you don't do it by looping forever, and you don't do it by killing >> > the kernel. You do it by ignoring their bullying tactics. >> > >> > Then you document the *LIMITED* cases where you actually will try forever. >> > >> > This discussion has gone on for too damn long. >> > >> > Linus >>
On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 8/27/24 09:15, Barry Song wrote: > > On Tue, Aug 27, 2024 at 12:10 AM Vlastimil Babka <vbabka@suse.cz> wrote: > >> > >> On 8/22/24 11:34, Linus Torvalds wrote: > >> > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote: > >> >> > >> >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry > >> >> infinitely". if that implies just OOPSing or actually be in a busy loop, > >> >> I don't care. It could effectively happen with MAX_ORDER as well, as > >> >> stated. But certainly not BUG_ON. > >> > > >> > No BUG_ON(), but also no endless loop. > >> > > >> > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to > >> > make it easy to find offenders, and then let them deal with it. > >> > >> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when > >> we're about to actually return NULL, so the memory has to be depleted > >> already. To make it easier to find the offenders much more reliably, we > >> should consider doing it sooner, but also not add unnecessary overhead to > >> allocator fastpaths just because of the potentially buggy users. So either > >> always in __alloc_pages_slowpath(), which should be often enough (unless the > >> system never needs to wake up kswapd to reclaim) but with negligible enough > >> overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM? > > > > We already have a WARN_ON for order > 1 in rmqueue. we might extend > > the condition there to include checking flags as well? > > Ugh, wasn't aware, well spotted. So it means there at least shouldn't be > existing users of __GFP_NOFAIL with order > 1 :) > > But also the check is in the hotpath, even before trying the pcplists, so we > could move it to __alloc_pages_slowpath() while extending it? Agreed. I don't think it is reasonable to check the order and flags in two different places especially rmqueue() has already had gfp_flags & __GFP_NOFAIL operation and order > 1 overhead. We can at least extend the current check to make some improvement though I still believe Michal's suggestion of implementing OOPS_ON is a better approach to pursue, as it doesn't crash the entire system while ensuring the problematic process is terminated. > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 7dcb0713eb57..b5717c6569f9 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -3071,8 +3071,11 @@ struct page *rmqueue(struct zone *preferred_zone, > > /* > > * We most definitely don't want callers attempting to > > * allocate greater than order-1 page units with __GFP_NOFAIL. > > + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, > > + * which can result in a lockup > > */ > > - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > > + WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && > > + (order > 1 || !(gfp_flags & __GFP_DIRECT_RECLAIM))); > > > > if (likely(pcp_allowed_order(order))) { > > page = rmqueue_pcplist(preferred_zone, zone, order, > > > >> > >> > Don't take it upon yourself to say "we have to deal with any amount of > >> > stupidity". > >> > > >> > The MM layer is not some slave to users. The MM layer is one of the > >> > most core pieces of code in the kernel, and as such the MM layer is > >> > damn well in charge. > >> > > >> > Nobody has the right to say "I will not deal with allocation > >> > failures". The MM should not bend over backwards over something like > >> > that. > >> > > >> > Seriously. Get a spine already, people. 
Tell random drivers that claim > >> > that they cannot deal with errors to just f-ck off. > >> > > >> > And you don't do it by looping forever, and you don't do it by killing > >> > the kernel. You do it by ignoring their bullying tactics. > >> > > >> > Then you document the *LIMITED* cases where you actually will try forever. > >> > > >> > This discussion has gone on for too damn long. > >> > > >> > Linus > >> > Thanks Barry
On 8/27/24 09:50, Barry Song wrote: > On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote: >> >> >> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be >> existing users of __GFP_NOFAIL with order > 1 :) >> >> But also the check is in the hotpath, even before trying the pcplists, so we >> could move it to __alloc_pages_slowpath() while extending it? > > Agreed. I don't think it is reasonable to check the order and flags in > two different places especially rmqueue() has already had > gfp_flags & __GFP_NOFAIL operation and order > 1 > overhead. > > We can at least extend the current check to make some improvement > though I still believe Michal's suggestion of implementing OOPS_ON is a > better approach to pursue, as it doesn't crash the entire system > while ensuring the problematic process is terminated. Linus made clear it's not a mm concern. If e.g. hardening people want to pursue that instead, they can. BTW I think BUG_ON already works like this: if possible, only the calling process is terminated. A panic happens in case of being in an irq context, or due to panic_on_oops. Which the security people are setting to 1 anyway, and OOPS_ON would have to observe it too. So AFAICS the only difference from BUG_ON would be not panicking in irq context, if panic_on_oops isn't set. (as for "no mm locks held", I think it's already satisfied at the points we check for __GFP_NOFAIL).
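For reference, a simplified sketch of the oops termination behavior described above, modeled loosely on x86's oops_end(); the details vary by architecture:

/*
 * An oops panics only in interrupt context or when panic_on_oops is
 * set; otherwise just the offending task is killed.
 */
static void oops_end_sketch(int signr)
{
	if (in_interrupt())
		panic("Fatal exception in interrupt");
	if (panic_on_oops)
		panic("Fatal exception");
	make_task_dead(signr);	/* only the current task dies */
}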
On Thu, Aug 29, 2024 at 10:24 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > On 8/27/24 09:50, Barry Song wrote: > > On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote: > >> > >> > >> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be > >> existing users of __GFP_NOFAIL with order > 1 :) > >> > >> But also the check is in the hotpath, even before trying the pcplists, so we > >> could move it to __alloc_pages_slowpath() while extending it? > > > > Agreed. I don't think it is reasonable to check the order and flags in > > two different places especially rmqueue() has already had > > gfp_flags & __GFP_NOFAIL operation and order > 1 > > overhead. > > > > We can at least extend the current check to make some improvement > > though I still believe Michal's suggestion of implementing OOPS_ON is a > > better approach to pursue, as it doesn't crash the entire system > > while ensuring the problematic process is terminated. > > Linus made clear it's not a mm concern. If e.g. hardening people want to > pursue that instead, they can. > > BTW I think BUG_ON already works like this: if possible, only the calling > process is terminated. A panic happens in case of being in an irq context, or You are right. This is a detail I overlooked in the last discussion. BUG_ON already behaves exactly this way, terminating only the bad process when it can (panic_on_oops=N and not in irq context). > due to panic_on_oops. Which the security people are setting to 1 anyway, and > OOPS_ON would have to observe it too. So AFAICS the only difference from > BUG_ON would be not panicking in irq context, if panic_on_oops isn't set. Right. > (as for "no mm locks held", I think it's already satisfied at the points we > check for __GFP_NOFAIL). Let me summarize the discussion: Patch 1/4, which fixes the misuse of combining gfp_nofail and atomic in vdpa driver, is necessary. Patch 2/4, which updates the documentation to clarify that non-blockable gfp_nofail is not supported, is needed. Patch 3/4: We will replace BUG_ON with WARN_ON_ONCE to warn when the size is too large, where gfp_nofail will return NULL. Patch 4/4: We will move the order > 1 check from the current fast path to the slow path and extend the check of gfp_direct_reclaim flag also in the slow path. If nobody has an objection, I will prepare v4 as above. Thanks Barry
On Thu 29-08-24 23:53:33, Barry Song wrote: > On Thu, Aug 29, 2024 at 10:24 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > > > On 8/27/24 09:50, Barry Song wrote: > > > On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > >> > > >> > > >> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be > > >> existing users of __GFP_NOFAIL with order > 1 :) > > >> > > >> But also the check is in the hotpath, even before trying the pcplists, so we > > >> could move it to __alloc_pages_slowpath() while extending it? > > > > > > Agreed. I don't think it is reasonable to check the order and flags in > > > two different places especially rmqueue() has already had > > > gfp_flags & __GFP_NOFAIL operation and order > 1 > > > overhead. > > > > > > We can at least extend the current check to make some improvement > > > though I still believe Michal's suggestion of implementing OOPS_ON is a > > > better approach to pursue, as it doesn't crash the entire system > > > while ensuring the problematic process is terminated. > > > > Linus made clear it's not a mm concern. If e.g. hardening people want to > > pursue that instead, they can. > > > > BTW I think BUG_ON already works like this: if possible, only the calling > > process is terminated. A panic happens in case of being in an irq context, or > > You are right. This is a detail I overlooked in the last discussion. > BUG_ON already behaves exactly this way, terminating only the bad > process when it can > (panic_on_oops=N and not in irq context). Are you sure about that? Maybe the x86 implementation treats BUG as an oops, but is this what it does on all arches? BUG() has historically meant stop everything and die, and I am not really sure when that would have changed TBH. > > due to panic_on_oops. Which the security people are setting to 1 anyway, and > > OOPS_ON would have to observe it too. So AFAICS the only difference from > > BUG_ON would be not panicking in irq context, if panic_on_oops isn't set. > > Right. > > > (as for "no mm locks held", I think it's already satisfied at the points we > > check for __GFP_NOFAIL). > > Let me summarize the discussion: > > Patch 1/4, which fixes the misuse of combining gfp_nofail and atomic > in vdpa driver, is necessary. > Patch 2/4, which updates the documentation to clarify that > non-blockable gfp_nofail is not > supported, is needed. Let's please have those merged now. > Patch 3/4: We will replace BUG_ON with WARN_ON_ONCE to warn when the > size is too large, > where gfp_nofail will return NULL. I would pull this one out for a separate discussion. We should really define what "too large" really means, and INT_MAX etc. is not it at all. > Patch 4/4: We will move the order > 1 check from the current fast path > to the slow path and extend > the check of gfp_direct_reclaim flag also in the slow path. OK, let's have that go in now as well.
On Fri, Aug 30, 2024 at 1:20 AM Michal Hocko <mhocko@suse.com> wrote: > > On Thu 29-08-24 23:53:33, Barry Song wrote: > > On Thu, Aug 29, 2024 at 10:24 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > > > > > On 8/27/24 09:50, Barry Song wrote: > > > > On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote: > > > >> > > > >> > > > >> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be > > > >> existing users of __GFP_NOFAIL with order > 1 :) > > > >> > > > >> But also the check is in the hotpath, even before trying the pcplists, so we > > > >> could move it to __alloc_pages_slowpath() while extending it? > > > > > > > > Agreed. I don't think it is reasonable to check the order and flags in > > > > two different places especially rmqueue() has already had > > > > gfp_flags & __GFP_NOFAIL operation and order > 1 > > > > overhead. > > > > > > > > We can at least extend the current check to make some improvement > > > > though I still believe Michal's suggestion of implementing OOPS_ON is a > > > > better approach to pursue, as it doesn't crash the entire system > > > > while ensuring the problematic process is terminated. > > > > > > Linus made clear it's not a mm concern. If e.g. hardening people want to > > > pursue that instead, they can. > > > > > > BTW I think BUG_ON already works like this: if possible, only the calling > > > process is terminated. A panic happens in case of being in an irq context, or > > > > You are right. This is a detail I overlooked in the last discussion. > > BUG_ON already behaves exactly this way, terminating only the bad > > process when it can > > (panic_on_oops=N and not in irq context). > > Are you sure about that? Maybe the x86 implementation treats BUG as an oops, but > is this what it does on all arches? BUG() has historically meant stop > everything and die, and I am not really sure when that would have > changed TBH. My ARM64 machine also only terminates the bad process on BUG_ON() if we are not in irq context and don't set panic_on_oops. I guess it depends on HAVE_ARCH_BUG? If an arch has no BUG(), BUG() will just be a panic? #ifndef HAVE_ARCH_BUG #define BUG() do { \ printk("BUG: failure at %s:%d/%s()!\n", __FILE__, __LINE__, __func__); \ barrier_before_unreachable(); \ panic("BUG!"); \ } while (0) #endif I assume it is equally difficult to implement OOPS_ON() if an arch lacks HAVE_ARCH_BUG? "grep" shows that most mainstream arches have their own HAVE_ARCH_BUG: $ git grep HAVE_ARCH_BUG arch/alpha/include/asm/bug.h:#define HAVE_ARCH_BUG arch/arc/include/asm/bug.h:#define HAVE_ARCH_BUG arch/arm/include/asm/bug.h:#define HAVE_ARCH_BUG arch/arm64/include/asm/bug.h:#define HAVE_ARCH_BUG arch/csky/include/asm/bug.h:#define HAVE_ARCH_BUG arch/loongarch/include/asm/bug.h:#define HAVE_ARCH_BUG arch/m68k/include/asm/bug.h:#define HAVE_ARCH_BUG arch/mips/include/asm/bug.h:#define HAVE_ARCH_BUG arch/mips/include/asm/bug.h:#define HAVE_ARCH_BUG_ON arch/parisc/include/asm/bug.h:#define HAVE_ARCH_BUG arch/powerpc/include/asm/bug.h:#define HAVE_ARCH_BUG arch/powerpc/include/asm/bug.h:#define HAVE_ARCH_BUG_ON arch/riscv/include/asm/bug.h:#define HAVE_ARCH_BUG arch/s390/include/asm/bug.h:#define HAVE_ARCH_BUG arch/sh/include/asm/bug.h:#define HAVE_ARCH_BUG arch/sparc/include/asm/bug.h:#define HAVE_ARCH_BUG arch/x86/include/asm/bug.h:#define HAVE_ARCH_BUG > > > > due to panic_on_oops. Which the security people are setting to 1 anyway, and > > > OOPS_ON would have to observe it too. So AFAICS the only difference from > > > BUG_ON would be not panicking in irq context, if panic_on_oops isn't set. > > > > Right. > > > > > (as for "no mm locks held", I think it's already satisfied at the points we > > > check for __GFP_NOFAIL). > > > > Let me summarize the discussion: > > > > Patch 1/4, which fixes the misuse of combining gfp_nofail and atomic > > in vdpa driver, is necessary. > > Patch 2/4, which updates the documentation to clarify that > > non-blockable gfp_nofail is not > > supported, is needed. > > Let's please have those merged now. > > > Patch 3/4: We will replace BUG_ON with WARN_ON_ONCE to warn when the > > size is too large, > > where gfp_nofail will return NULL. > > > I would pull this one out for a separate discussion. We should really > define what "too large" really means, and INT_MAX etc. is not it at > all. Makes sense. > > > Patch 4/4: We will move the order > 1 check from the current fast path > > to the slow path and extend > > the check of gfp_direct_reclaim flag also in the slow path. > > OK, let's have that go in now as well. > > -- > Michal Hocko > SUSE Labs Thanks Barry
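Purely as an illustration of the OOPS_ON() idea mentioned above; no such macro exists in the kernel, and on an arch without HAVE_ARCH_BUG it would face the same panicking generic fallback quoted in the previous message:

/*
 * Hypothetical: report the failure, then kill only the current task,
 * observing irq context and panic_on_oops the way an oops does.
 */
#define OOPS_ON(cond)						\
do {								\
	if (unlikely(cond)) {					\
		pr_emerg("OOPS_ON(%s) at %s:%d\n",		\
			 #cond, __FILE__, __LINE__);		\
		if (in_interrupt() || panic_on_oops)		\
			panic("OOPS_ON");			\
		make_task_dead(SIGKILL);			\
	}							\
} while (0)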
> > > Patch 4/4: We will move the order > 1 check from the current fast path > > > to the slow path and extend > > > the check of gfp_direct_reclaim flag also in the slow path. > > > > OK, let's have that go in now as well. Hi Michal and Vlastimil, Could you please review the changes below before I send v4 for patch 4/4? 1. We should consolidate all warnings in one place. Currently, the order > 1 warning is in the hotpath, while others are in less likely scenarios. Moving all warnings to the slowpath will reduce the overhead for order > 1 and increase the visibility of other warnings. 2. We currently have two warnings for order: one for order > 1 in the hotpath and another for order > costly_order in the laziest path. I suggest standardizing on order > 1 since it’s been in use for a long time. 3.I don't think we need to check for __GFP_NOWARN in this case. __GFP_NOWARN is meant to suppress allocation failure reports, but here we're dealing with bug detection, not allocation failures. So I'd rather use WARN_ON_ONCE than WARN_ON_ONCE_GFP. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index c81ee5662cc7..0d3dd679d0ab 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3033,12 +3033,6 @@ struct page *rmqueue(struct zone *preferred_zone, { struct page *page; - /* - * We most definitely don't want callers attempting to - * allocate greater than order-1 page units with __GFP_NOFAIL. - */ - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); - if (likely(pcp_allowed_order(order))) { page = rmqueue_pcplist(preferred_zone, zone, order, migratetype, alloc_flags); @@ -4174,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct alloc_context *ac) { bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM; + bool nofail = gfp_mask & __GFP_DIRECT_RECLAIM; bool can_compact = gfp_compaction_allowed(gfp_mask); const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER; struct page *page = NULL; @@ -4187,6 +4182,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, unsigned int zonelist_iter_cookie; int reserve_flags; + if (nofail) { + /* + * We most definitely don't want callers attempting to + * allocate greater than order-1 page units with __GFP_NOFAIL. + */ + WARN_ON_ONCE(order > 1); + /* + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, + * otherwise, we may result in lockup. + */ + WARN_ON_ONCE(!can_direct_reclaim); + /* + * PF_MEMALLOC request from this context is rather bizarre + * because we cannot reclaim anything and only can loop waiting + * for somebody to do a work for us. 
+ */ + WARN_ON_ONCE(current->flags & PF_MEMALLOC); + } + restart: compaction_retries = 0; no_progress_loops = 0; @@ -4404,29 +4418,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, * Make sure that __GFP_NOFAIL request doesn't leak out and make sure * we always retry */ - if (gfp_mask & __GFP_NOFAIL) { + if (nofail) { /* - * All existing users of the __GFP_NOFAIL are blockable, so warn - * of any new users that actually require GFP_NOWAIT + * Lacking direct_reclaim we can't do anything to reclaim memory, + * we disregard these unreasonable nofail requests and still + * return NULL */ - if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) + if (!can_direct_reclaim) goto fail; - /* - * PF_MEMALLOC request from this context is rather bizarre - * because we cannot reclaim anything and only can loop waiting - * for somebody to do a work for us - */ - WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask); - - /* - * non failing costly orders are a hard requirement which we - * are not prepared for much so let's warn about these users - * so that we can identify them and convert them to something - * else. - */ - WARN_ON_ONCE_GFP(costly_order, gfp_mask); - /* * Help non-failing allocations by giving some access to memory * reserves normally used for high priority non-blocking > > > > -- > > Michal Hocko > > SUSE Labs Thanks Barry
On Fri 30-08-24 10:31:14, Barry Song wrote: > > > > Patch 4/4: We will move the order > 1 check from the current fast path > > > > to the slow path and extend > > > > the check of gfp_direct_reclaim flag also in the slow path. > > > > > > OK, let's have that go in now as well. > > Hi Michal and Vlastimil, > Could you please review the changes below before I send v4 for patch 4/4? > > 1. We should consolidate all warnings in one place. Currently, the order > 1 warning is > in the hotpath, while others are in less likely scenarios. Moving all warnings to the > slowpath will reduce the overhead for order > 1 and increase the visibility of other > warnings. > > 2. We currently have two warnings for order: one for order > 1 in the hotpath and another > for order > costly_order in the laziest path. I suggest standardizing on order > 1 since > it’s been in use for a long time. > > 3.I don't think we need to check for __GFP_NOWARN in this case. __GFP_NOWARN is > meant to suppress allocation failure reports, but here we're dealing with bug detection, not > allocation failures. > So I'd rather use WARN_ON_ONCE than WARN_ON_ONCE_GFP. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c81ee5662cc7..0d3dd679d0ab 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3033,12 +3033,6 @@ struct page *rmqueue(struct zone *preferred_zone, > { > struct page *page; > > - /* > - * We most definitely don't want callers attempting to > - * allocate greater than order-1 page units with __GFP_NOFAIL. > - */ > - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); > - > if (likely(pcp_allowed_order(order))) { > page = rmqueue_pcplist(preferred_zone, zone, order, > migratetype, alloc_flags); > @@ -4174,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > struct alloc_context *ac) > { > bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM; > + bool nofail = gfp_mask & __GFP_DIRECT_RECLAIM; > bool can_compact = gfp_compaction_allowed(gfp_mask); > const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER; > struct page *page = NULL; > @@ -4187,6 +4182,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > unsigned int zonelist_iter_cookie; > int reserve_flags; > > + if (nofail) { > + /* > + * We most definitely don't want callers attempting to > + * allocate greater than order-1 page units with __GFP_NOFAIL. > + */ > + WARN_ON_ONCE(order > 1); > + /* > + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, > + * otherwise, we may result in lockup. > + */ > + WARN_ON_ONCE(!can_direct_reclaim); > + /* > + * PF_MEMALLOC request from this context is rather bizarre > + * because we cannot reclaim anything and only can loop waiting > + * for somebody to do a work for us. > + */ > + WARN_ON_ONCE(current->flags & PF_MEMALLOC); > + } Yes, this makes sense. Any reason you have not put that int the nofail branch below? 
> + > restart: > compaction_retries = 0; > no_progress_loops = 0; > @@ -4404,29 +4418,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > * Make sure that __GFP_NOFAIL request doesn't leak out and make sure > * we always retry > */ > - if (gfp_mask & __GFP_NOFAIL) { > + if (nofail) { > /* > - * All existing users of the __GFP_NOFAIL are blockable, so warn > - * of any new users that actually require GFP_NOWAIT > + * Lacking direct_reclaim we can't do anything to reclaim memory, > + * we disregard these unreasonable nofail requests and still > + * return NULL > */ > - if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) > + if (!can_direct_reclaim) > goto fail; > > - /* > - * PF_MEMALLOC request from this context is rather bizarre > - * because we cannot reclaim anything and only can loop waiting > - * for somebody to do a work for us > - */ > - WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask); > - > - /* > - * non failing costly orders are a hard requirement which we > - * are not prepared for much so let's warn about these users > - * so that we can identify them and convert them to something > - * else. > - */ > - WARN_ON_ONCE_GFP(costly_order, gfp_mask); > - > /* > * Help non-failing allocations by giving some access to memory > * reserves normally used for high priority non-blocking > > > > > > > -- > > > Michal Hocko > > > SUSE Labs > > Thanks > Barry
On 8/30/24 09:24, Michal Hocko wrote: > On Fri 30-08-24 10:31:14, Barry Song wrote: >> > > > Patch 4/4: We will move the order > 1 check from the current fast path >> > > > to the slow path and extend >> > > > the check of gfp_direct_reclaim flag also in the slow path. >> > > >> > > OK, let's have that go in now as well. >> >> Hi Michal and Vlastimil, >> Could you please review the changes below before I send v4 for patch 4/4? >> >> 1. We should consolidate all warnings in one place. Currently, the order > 1 warning is >> in the hotpath, while others are in less likely scenarios. Moving all warnings to the >> slowpath will reduce the overhead for order > 1 and increase the visibility of other >> warnings. >> >> 2. We currently have two warnings for order: one for order > 1 in the hotpath and another >> for order > costly_order in the laziest path. I suggest standardizing on order > 1 since >> it’s been in use for a long time. >> >> 3.I don't think we need to check for __GFP_NOWARN in this case. __GFP_NOWARN is >> meant to suppress allocation failure reports, but here we're dealing with bug detection, not >> allocation failures. Ack. __GFP_NOWARN is to suppress warnings in case the allocation has a less expensive fallback to the current attempt, which logically means the current attempt can't be a __GFP_NOFAIL one. So having both is a bug itself (not worth reporting) so we can just ignore __GFP_NOWARN. >> So I'd rather use WARN_ON_ONCE than WARN_ON_ONCE_GFP. >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index c81ee5662cc7..0d3dd679d0ab 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -3033,12 +3033,6 @@ struct page *rmqueue(struct zone *preferred_zone, >> { >> struct page *page; >> >> - /* >> - * We most definitely don't want callers attempting to >> - * allocate greater than order-1 page units with __GFP_NOFAIL. >> - */ >> - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)); >> - >> if (likely(pcp_allowed_order(order))) { >> page = rmqueue_pcplist(preferred_zone, zone, order, >> migratetype, alloc_flags); >> @@ -4174,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, >> struct alloc_context *ac) >> { >> bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM; >> + bool nofail = gfp_mask & __GFP_DIRECT_RECLAIM; __GFP_NOFAIL >> bool can_compact = gfp_compaction_allowed(gfp_mask); >> const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER; >> struct page *page = NULL; >> @@ -4187,6 +4182,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, >> unsigned int zonelist_iter_cookie; >> int reserve_flags; >> >> + if (nofail) { Could add unlikely() to put it off the instruction cache hotpath. >> + /* >> + * We most definitely don't want callers attempting to >> + * allocate greater than order-1 page units with __GFP_NOFAIL. >> + */ >> + WARN_ON_ONCE(order > 1); >> + /* >> + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM, >> + * otherwise, we may result in lockup. >> + */ >> + WARN_ON_ONCE(!can_direct_reclaim); >> + /* >> + * PF_MEMALLOC request from this context is rather bizarre >> + * because we cannot reclaim anything and only can loop waiting >> + * for somebody to do a work for us. >> + */ >> + WARN_ON_ONCE(current->flags & PF_MEMALLOC); >> + } > > Yes, this makes sense. Any reason you have not put that int the nofail > branch below? Because that branch is executed only when we're already so depleted we gave up retrying, and we want to warn about the buggy users more reliably (see point 1 above). 
>> + >> restart: >> compaction_retries = 0; >> no_progress_loops = 0; >> @@ -4404,29 +4418,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, >> * Make sure that __GFP_NOFAIL request doesn't leak out and make sure >> * we always retry >> */ >> - if (gfp_mask & __GFP_NOFAIL) { >> + if (nofail) { >> /* >> - * All existing users of the __GFP_NOFAIL are blockable, so warn >> - * of any new users that actually require GFP_NOWAIT >> + * Lacking direct_reclaim we can't do anything to reclaim memory, >> + * we disregard these unreasonable nofail requests and still >> + * return NULL >> */ >> - if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) >> + if (!can_direct_reclaim) >> goto fail; >> >> - /* >> - * PF_MEMALLOC request from this context is rather bizarre >> - * because we cannot reclaim anything and only can loop waiting >> - * for somebody to do a work for us >> - */ >> - WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask); >> - >> - /* >> - * non failing costly orders are a hard requirement which we >> - * are not prepared for much so let's warn about these users >> - * so that we can identify them and convert them to something >> - * else. >> - */ >> - WARN_ON_ONCE_GFP(costly_order, gfp_mask); >> - >> /* >> * Help non-failing allocations by giving some access to memory >> * reserves normally used for high priority non-blocking >> >> > > >> > > -- >> > > Michal Hocko >> > > SUSE Labs >> >> Thanks >> Barry >
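Folding the two review comments above into the draft gives roughly this shape for v4; a sketch of where the patch appears to be heading, not the merged code:

	bool nofail = gfp_mask & __GFP_NOFAIL;	/* the draft tested __GFP_DIRECT_RECLAIM by mistake */
	/* ... */
	if (unlikely(nofail)) {
		/*
		 * We most definitely don't want callers attempting to
		 * allocate greater than order-1 page units with __GFP_NOFAIL.
		 */
		WARN_ON_ONCE(order > 1);
		/*
		 * Also we don't support __GFP_NOFAIL without
		 * __GFP_DIRECT_RECLAIM; otherwise we may end up in a lockup.
		 */
		WARN_ON_ONCE(!can_direct_reclaim);
		/*
		 * PF_MEMALLOC request from this context is rather bizarre
		 * because we cannot reclaim anything and can only loop
		 * waiting for somebody to do some work for us.
		 */
		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
	}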
From: Barry Song <v-songbaohua@oppo.com> -v3: * collect reviewed-by, acked-by etc. Michal, Christoph, Vlastimil, Davidlohr, thanks! * use Jason's patch[1] to fix vdpa and refine his changelog. * refine changelogs [1] https://lore.kernel.org/all/20240808054320.10017-1-jasowang@redhat.com/ -v2: https://lore.kernel.org/linux-mm/20240731000155.109583-1-21cnbao@gmail.com/ * adjust vpda fix according to Jason and Michal's feedback, I would expect Jason to test it, thanks! * split BUG_ON of unavoidable failure and the case GFP_ATOMIC | __GFP_NOFAIL into two patches according to Vlastimil and Michal. * collect Michal's acked-by for patch 2 - the doc; * remove the new GFP_NOFAIL from this series, that one would be a separate enhancement patchset later on. -v1: https://lore.kernel.org/linux-mm/20240724085544.299090-1-21cnbao@gmail.com/ __GFP_NOFAIL carries the semantics of never failing, so its callers do not check the return value: %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller cannot handle allocation failures. The allocation could block indefinitely but will never return with failure. Testing for failure is pointless. However, __GFP_NOFAIL can sometimes fail if it exceeds size limits or is used with GFP_ATOMIC/GFP_NOWAIT in a non-sleepable context. This can expose security vulnerabilities due to potential NULL dereferences. Since __GFP_NOFAIL does not support non-blocking allocation, we introduce GFP_NOFAIL with inclusive blocking semantics and encourage using GFP_NOFAIL as a replacement for __GFP_NOFAIL in non-mm. If we must still fail a nofail allocation, we should trigger a BUG rather than exposing NULL dereferences to callers who do not check the return value. * The discussion started from this topic: [PATCH RFC] mm: warn potential return NULL for kmalloc_array and kvmalloc_array with __GFP_NOFAIL https://lore.kernel.org/linux-mm/20240717230025.77361-1-21cnbao@gmail.com/ Thank you to Michal, Christoph, Vlastimil, and Hailong for all the comments. Barry Song (3): mm: document __GFP_NOFAIL must be blockable mm: BUG_ON to avoid NULL dereference while __GFP_NOFAIL fails mm: prohibit NULL dereference exposed for unsupported non-blockable __GFP_NOFAIL Jason Wang (1): vduse: avoid using __GFP_NOFAIL drivers/vdpa/vdpa_user/iova_domain.c | 19 +++++++++++-------- drivers/vdpa/vdpa_user/iova_domain.h | 1 + include/linux/gfp_types.h | 5 ++++- include/linux/slab.h | 4 +++- mm/page_alloc.c | 14 ++++++++------ mm/util.c | 1 + 6 files changed, 28 insertions(+), 16 deletions(-)
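For background on the RFC referenced in the cover letter: even with __GFP_NOFAIL, kmalloc_array() returns NULL when the element-count multiplication overflows, which is exactly the unchecked-NULL exposure the series is about. A simplified sketch of that existing check (kmalloc_array() in include/linux/slab.h uses check_mul_overflow() this way):

static inline void *kmalloc_array_sketch(size_t n, size_t size, gfp_t flags)
{
	size_t bytes;

	/* the overflow check wins over __GFP_NOFAIL: caller gets NULL */
	if (unlikely(check_mul_overflow(n, size, &bytes)))
		return NULL;
	return kmalloc(bytes, flags);
}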