[v3,0/4] mm: clarify nofail memory allocation

Message ID 20240817062449.21164-1-21cnbao@gmail.com

Message

Barry Song Aug. 17, 2024, 6:24 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

-v3:
 * collect reviewed-by, acked-by etc. Michal, Christoph, Vlastimil, Davidlohr,
   thanks!
 * use Jason's patch[1] to fix vdpa and refine his changelog.
 * refine changelogs
[1] https://lore.kernel.org/all/20240808054320.10017-1-jasowang@redhat.com/

-v2:
 https://lore.kernel.org/linux-mm/20240731000155.109583-1-21cnbao@gmail.com/

 * adjust vdpa fix according to Jason and Michal's feedback, I would
   expect Jason to test it, thanks!
 * split BUG_ON of unavoidable failure and the case GFP_ATOMIC |
   __GFP_NOFAIL into two patches according to Vlastimil and Michal.
 * collect Michal's acked-by for patch 2 - the doc;
 * remove the new GFP_NOFAIL from this series, that one would be a
   separate enhancement patchset later on.

-v1:
 https://lore.kernel.org/linux-mm/20240724085544.299090-1-21cnbao@gmail.com/

__GFP_NOFAIL carries the semantics of never failing, so its callers
do not check the return value:
  %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
  cannot handle allocation failures. The allocation could block
  indefinitely but will never return with failure. Testing for
  failure is pointless.

However, __GFP_NOFAIL can sometimes fail if it exceeds size limits
or is used with GFP_ATOMIC/GFP_NOWAIT in a non-sleepable context.
This can expose security vulnerabilities due to potential NULL
dereferences.
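
For illustration, the dangerous pattern looks roughly like this
(a hypothetical caller, not code from this series):

  /* nofail result is dereferenced unchecked, as the doc permits */
  struct foo *f = kmalloc(sizeof(*f), GFP_ATOMIC | __GFP_NOFAIL);

  f->bar = 1; /* NULL dereference if the allocation actually failed */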

Since __GFP_NOFAIL does not support non-blocking allocation, we introduce
GFP_NOFAIL with inclusive blocking semantics and encourage using GFP_NOFAIL
as a replacement for __GFP_NOFAIL in non-mm code.

If we must still fail a nofail allocation, we should trigger a BUG rather
than exposing NULL dereferences to callers who do not check the return
value.

* The discussion started from this topic:
 [PATCH RFC] mm: warn potential return NULL for kmalloc_array and
             kvmalloc_array with __GFP_NOFAIL

 https://lore.kernel.org/linux-mm/20240717230025.77361-1-21cnbao@gmail.com/

Thank you to Michal, Christoph, Vlastimil, and Hailong for all the
comments.

Barry Song (3):
  mm: document __GFP_NOFAIL must be blockable
  mm: BUG_ON to avoid NULL dereference while __GFP_NOFAIL fails
  mm: prohibit NULL dereference exposed for unsupported non-blockable
    __GFP_NOFAIL

Jason Wang (1):
  vduse: avoid using __GFP_NOFAIL

 drivers/vdpa/vdpa_user/iova_domain.c | 19 +++++++++++--------
 drivers/vdpa/vdpa_user/iova_domain.h |  1 +
 include/linux/gfp_types.h            |  5 ++++-
 include/linux/slab.h                 |  4 +++-
 mm/page_alloc.c                      | 14 ++++++++------
 mm/util.c                            |  1 +
 6 files changed, 28 insertions(+), 16 deletions(-)

Comments

David Hildenbrand Aug. 19, 2024, 1:02 p.m. UTC | #1
On 17.08.24 08:24, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> -v3:
>   * collect reviewed-by, acked-by etc. Michal, Christoph, Vlastimil, Davidlohr,
>     thanks!
>   * use Jason's patch[1] to fix vdpa and refine his changelog.
>   * refine changelogs
> [1] https://lore.kernel.org/all/20240808054320.10017-1-jasowang@redhat.com/
> 
> -v2:
>   https://lore.kernel.org/linux-mm/20240731000155.109583-1-21cnbao@gmail.com/
> 
>   * adjust vdpa fix according to Jason and Michal's feedback, I would
>     expect Jason to test it, thanks!
>   * split BUG_ON of unavoidable failure and the case GFP_ATOMIC |
>     __GFP_NOFAIL into two patches according to Vlastimil and Michal.
>   * collect Michal's acked-by for patch 2 - the doc;
>   * remove the new GFP_NOFAIL from this series, that one would be a
>     separate enhancement patchset later on.
> 
> -v1:
>   https://lore.kernel.org/linux-mm/20240724085544.299090-1-21cnbao@gmail.com/
> 
> __GFP_NOFAIL carries the semantics of never failing, so its callers
> do not check the return value:
>    %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>    cannot handle allocation failures. The allocation could block
>    indefinitely but will never return with failure. Testing for
>    failure is pointless.
> 

You should have called this series "mm: clarify nofail and BUG_ON
with invalid arguments with nofail"

I was a bit surprised to find that many BUG_ONs in this series ;)

> However, __GFP_NOFAIL can sometimes fail if it exceeds size limits
> or is used with GFP_ATOMIC/GFP_NOWAIT in a non-sleepable context.
> This can expose security vulnerabilities due to potential NULL
> dereferences.

Which code is it that we are concerned about? Some OOT driver? Some
untested in-tree driver? Just trying to understand what we are afraid
of here. I'm afraid we cannot save the world with some OOT drivers :(

Are we aware of such security vulnerabilities (out of pure interest!)?

Is the expectation that these code paths are never triggered during 
early testing where a WARN_ON_ONCE() would just highlight -- in addition 
to a likely crash -- that something is very wrong?

> 
> Since __GFP_NOFAIL does not support non-blocking allocation, we introduce
> GFP_NOFAIL with inclusive blocking semantics and encourage using GFP_NOFAIL
> as a replacement for __GFP_NOFAIL in non-mm code.

I still wonder if specifying GFP_NOFAIL should not simply come with the 
semantics "implies direct reclaim. If called for non-blocking 
allocations, bad things will happen (no security bad things though)." ;)

> 
> If we must still fail a nofail allocation, we should trigger a BUG rather
> than exposing NULL dereferences to callers who do not check the return
> value.

I am not convinced that BUG_ON is the right tool here to save the world, 
but I see how we arrived here.
Linus Torvalds Aug. 19, 2024, 4:05 p.m. UTC | #2
On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> >
> > If we must still fail a nofail allocation, we should trigger a BUG rather
> > than exposing NULL dereferences to callers who do not check the return
> > value.
>
> I am not convinced that BUG_ON is the right tool here to save the world,
> but I see how we arrived here.

I think the thing to do is to just add a

     WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(order, flags));

or similar, where that bad_nofail_alloc() checks that the allocation
order is small and that the flags are sane for a NOFAIL allocation.
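
A sketch of what such a helper might check (bad_nofail_alloc() is
hypothetical, and the costly-order threshold is an assumption drawn
from later in this thread):

     /* hypothetical: can this __GFP_NOFAIL request be honored? */
     static bool bad_nofail_alloc(unsigned int order, gfp_t flags)
     {
             /* nofail is only realistic for cheap, low orders */
             if (order > PAGE_ALLOC_COSTLY_ORDER)
                     return true;
             /* and only in contexts that may sleep and reclaim */
             if (!(flags & __GFP_DIRECT_RECLAIM))
                     return true;
             return false;
     }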

Because no, BUG_ON() is *never* the answer. The answer is to make sure
nobody ever sets NOFAIL in situations where the allocation can fail
and there is no way forward.

A BUG_ON() will quite likely just make things worse. You're better off
with a WARN_ON() and letting the caller just oops.

Honestly, I'm perfectly fine with just removing that stupid useless
flag entirely. The flag goes back to 2003 and was introduced in
2.5.69, and was meant to be for very particular uses that otherwise
just looped waiting for memory.

Back in 2.5.69, there was exactly one user: the jbd journal code, that
did a buffer head allocation with GFP_NOFAIL.  By 2.6.0 that had
expanded by another user in XFS, and even that one had a comment
saying that it needed to be narrowed down. And in fact, by the 2.6.12
release, that XFS use had been removed, but the jbd journal had grown
another jbd_kmalloc case for transaction data. So at the beginning of
the git archives, we had exactly *one* user (with two places).

*THAT* is the kind of use that the flag was meant for: small
allocations required to make forward progress in writeout during
memory pressure.

It has then expanded and is now a problem. The cases using GFP_NOFAIL
for things like vmalloc() - which is by definition not a small
allocation - should be just removed as outright bugs.

Note that we had this comment back in 2010:

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures.  This modifier is deprecated and no new
 * users should be added.

and then it was softened in 2015 to the current

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. New users should be evaluated carefully
  ...

and clearly that "evaluated carefully" actually never happened, so the
new comment is just garbage.

I wonder how many modern users of GFP_NOFAIL are simply due to
over-eager allocation failure injection testing, and then people added
GFP_NOFAIL just because it shut up the mindless random allocation
failures.

I mean, we have a __GFP_NOFAIL in rhashtable_init() - which can
actually return an error just fine, but there was this crazy worry
about the IPC layer initialization failing:

   https://lore.kernel.org/all/20180523172500.anfvmjtumww65ief@linux-n805/

Things like that, where people just added mindless "theoretical
concerns" issues, or possibly had some error injection module that
inserted impossible failures.

I do NOT want those things to become BUG_ON()'s. It's better to just
return NULL with a "bogus GFP_NOFAIL" warning, and have the oops
happen in the actual bad place that did an invalid allocation.

Because the blame should go *there*, and it should not even remotely
look like "oh, the MM code failed". No. The caller was garbage.

So no. No MM BUG_ON code.

                    Linus
Barry Song Aug. 19, 2024, 7:23 p.m. UTC | #3
On Tue, Aug 20, 2024 at 4:05 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> > >
> > > If we must still fail a nofail allocation, we should trigger a BUG rather
> > > than exposing NULL dereferences to callers who do not check the return
> > > value.
> >
> > I am not convinced that BUG_ON is the right tool here to save the world,
> > but I see how we arrived here.
>
> I think the thing to do is to just add a
>
>      WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(order, flags));
>
> or similar, where that bad_nofail_alloc() checks that the allocation
> order is small and that the flags are sane for a NOFAIL allocation.
>
> Because no, BUG_ON() is *never* the answer. The answer is to make sure
> nobody ever sets NOFAIL in situations where the allocation can fail
> and there is no way forward.
>
> A BUG_ON() will quite likely just make things worse. You're better off
> with a WARN_ON() and letting the caller just oops.

That could be an exploit taking advantage of those improper callers,
thus it wouldn’t necessarily result in an immediate oops in the callers but
in an exploit such as

p = malloc(gfp_nofail);

using p->field
or
using p + offset

where p->field or p + offset might hit attacker-controlled memory even
though p is NULL.

Could we consider infinitely blocking those bad callers:
while (illegal_gfp_nofail)  do_sth_relax_cpu();

>
> Honestly, I'm perfectly fine with just removing that stupid useless
> flag entirely. The flag goes back to 2003 and was introduced in
> 2.5.69, and was meant to be for very particular uses that otherwise
> just looped waiting for memory.
>
> Back in 2.5.69, there was exactly one user: the jbd journal code, that
> did a buffer head allocation with GFP_NOFAIL.  By 2.6.0 that had
> expanded by another user in XFS, and even that one had a comment
> saying that it needed to be narrowed down. And in fact, by the 2.6.12
> release, that XFS use had been removed, but the jbd journal had grown
> another jbd_kmalloc case for transaction data. So at the beginning of
> the git archives, we had exactly *one* user (with two places).
>
> *THAT* is the kind of use that the flag was meant for: small
> allocations required to make forward progress in writeout during
> memory pressure.
>
> It has then expanded and is now a problem. The cases using GFP_NOFAIL
> for things like vmalloc() - which is by definition not a small
> allocation - should be just removed as outright bugs.
>
> Note that we had this comment back in 2010:
>
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures.  This modifier is deprecated and no new
>  * users should be added.
>
> and then it was softened in 2015 to the current
>
>  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures. New users should be evaluated carefully
>   ...
>
> and clearly that "evaluated carefully" actually never happened, so the
> new comment is just garbage.
>
> I wonder how many modern users of GFP_NOFAIL are simply due to
> over-eager allocation failure injection testing, and then people added
> GFP_NOFAIL just because it shut up the mindless random allocation
> failures.
>
> I mean, we have a __GFP_NOFAIL in rhashtable_init() - which can
> actually return an error just fine, but there was this crazy worry
> about the IPC layer initialization failing:
>
>    https://lore.kernel.org/all/20180523172500.anfvmjtumww65ief@linux-n805/
>
> Things like that, where people just added mindless "theoretical
> concerns" issues, or possibly had some error injection module that
> inserted impossible failures.
>
> I do NOT want those things to become BUG_ON()'s. It's better to just
> return NULL with a "bogus GFP_NOFAIL" warning, and have the oops
> happen in the actual bad place that did an invalid allocation.
>
> Because the blame should go *there*, and it should not even remotely
> look like "oh, the MM code failed". No. The caller was garbage.
>
> So no. No MM BUG_ON code.
>
>                     Linus
Linus Torvalds Aug. 19, 2024, 7:33 p.m. UTC | #4
On Mon, 19 Aug 2024 at 12:23, Barry Song <21cnbao@gmail.com> wrote:
>
>
> That could be an exploit taking advantage of those improper callers,

So?

FIX THE BUGGY CODE.

Don't make insane and incorrect changes to the MM code and spread
Fear, Uncertainty and Doubt.

> thus it wouldn’t necessarily result in an immediate oops in the callers but
> in an exploit

No. Any bug can be an exploit. Don't try to make this something
special by calling it an exploit.

NULL pointer dereferences are some of the *least* worrisome bugs,
because we don't allow people to mmap the NULL area anyway.

So just stop spreading FUD. We don't improve the kernel by making
excuses for bugs, we improve it by fixing things.

And any caller that asks for NOFAIL with bad parameters is buggy. The
MM code should NOT try to fix it up, and dammit, BUG_ON() is not
acceptable as a debugging help. Never was, never will be.

Worry-warts already do "reboot-on-warn".

            Linus
Barry Song Aug. 19, 2024, 9:48 p.m. UTC | #5
On Tue, Aug 20, 2024 at 7:33 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 19 Aug 2024 at 12:23, Barry Song <21cnbao@gmail.com> wrote:
> >
> >
> > That could be an exploit taking advantage of those improper callers,
>
> So?
>
> FIX THE BUGGY CODE.

That's definitely in progress, with patch 1/4 addressing vdpa. There's also
an RFC to enforce direct reclaim for __GFP_NOFAIL, which
will prevent unsupported flags from being passed to the memory management
system:

https://lore.kernel.org/all/20240724085544.299090-6-21cnbao@gmail.com/

>
> Don't make insane and incorrect changes to the MM code and spread
> Fear, Uncertainty and Doubt.
>
> > thus it wouldn’t necessarily result in an immediate oops in the callers but
> > in an exploit
>
> No. Any bug can be an exploit. Don't try to make this something
> special by calling it an exploit.
>
> NULL pointer dereferences are some of the *least* worrisome bugs,
> because we don't allow people to mmap the NULL area anyway.
>
> So just stop spreading FUD. We don't improve the kernel by making
> excuses for bugs, we improve it by fixing things.
>
> And any caller that asks for NOFAIL with bad parameters is buggy. The
> MM code should NOT try to fix it up, and dammit, BUG_ON() is not
> acceptable as a debugging help. Never was, never will be.

Okay, I see your point. However, the discussion originally began with just
a simple WARN_ON() to flag improper usage:
https://lore.kernel.org/linux-mm/20240717230025.77361-1-21cnbao@gmail.com/

Now, it seems we've come full circle and are opting to use
WARN_ON_ONCE() instead?

>
> Worry-warts already do "reboot-on-warn".
>
>             Linus

Thanks
Barry
Michal Hocko Aug. 20, 2024, 6:24 a.m. UTC | #6
On Mon 19-08-24 12:33:17, Linus Torvalds wrote:
> On Mon, 19 Aug 2024 at 12:23, Barry Song <21cnbao@gmail.com> wrote:
> >
> >
> > That could be an exploit taking advantage of those improper callers,
> 
> So?
> 
> FIX THE BUGGY CODE.
> 
> Don't make insane and incorrect changes to the MM code and spread
> Fear, Uncertainty and Doubt.
> 
> > thus it wouldn’t necessarily result in an immediate oops in the callers but
> > in an exploit
> 
> No. Any bug can be an exploit. Don't try to make this something
> special by calling it an exploit.
> 
> NULL pointer dereferences are some of the *least* worrisome bugs,
> because we don't allow people to mmap the NULL area anyway.
> 
> So just stop spreading FUD. We don't improve the kernel by making
> excuses for bugs, we improve it by fixing things.

I really do not think that accusing Barry of spreading FUD is fair!
I believe there is a good intention behind this initiative and he has
used the tools he has available.

So rather than throwing FUD arguments I would really like to stick to a
technical discussion, please.
Yafang Shao Aug. 21, 2024, 12:40 p.m. UTC | #7
On Tue, Aug 20, 2024 at 12:05 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
[...]
>
> It has then expanded and is now a problem. The cases using GFP_NOFAIL
> for things like vmalloc() - which is by definition not a small
> allocation - should be just removed as outright bugs.

One potential approach could be to rename GFP_NOFAIL to
GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
to clear this flag for larger allocations. However, the challenge lies
in determining what constitutes a 'small' allocation.

[...]


--
Regards
Yafang
Linus Torvalds Aug. 21, 2024, 10:59 p.m. UTC | #8
On Wed, 21 Aug 2024 at 20:41, Yafang Shao <laoar.shao@gmail.com> wrote:
>
> One potential approach could be to rename GFP_NOFAIL to
> GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> to clear this flag for larger allocations.

Yes, that sounds like a good way to make sure people don't blame the
MM layer when they themselves were the cause of problems.

> However, the challenge lies
> in determining what constitutes a 'small' allocation.

I think we could easily just stick to the historical "order <
PAGE_ALLOC_COSTLY_ORDER":

 * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
 * costly to service.

(And the value for that is 3 - orders 0-2 are considered "cheap")

             Linus
Michal Hocko Aug. 22, 2024, 6:21 a.m. UTC | #9
On Thu 22-08-24 06:59:08, Linus Torvalds wrote:
> On Wed, 21 Aug 2024 at 20:41, Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > One potential approach could be to rename GFP_NOFAIL to
> > GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> > to clear this flag for larger allocations.
> 
> Yes, that sounds like a good way to make sure people don't blame the
> MM layer when they themselves were the cause of problems.

The reality disagrees because there is a real demand for a real GFP_NOFAIL
semantic. By that I do not mean arbitrary requests, and sure, GFP_NOFAIL
for higher orders is really hard to achieve, but kvmalloc GFP_NOFAIL for
anything larger than PAGE_SIZE is doable without a considerable burden
on the MM end.
Barry Song Aug. 22, 2024, 6:37 a.m. UTC | #10
On Thu, Aug 22, 2024 at 12:41 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
[...]
> >
> > It has then expanded and is now a problem. The cases using GFP_NOFAIL
> > for things like vmalloc() - which is by definition not a small
> > allocation - should be just removed as outright bugs.
>
> One potential approach could be to rename GFP_NOFAIL to
> GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> to clear this flag for larger allocations. However, the challenge lies
> in determining what constitutes a 'small' allocation.

I'm not entirely sure if our concern is with higher order or larger size. Higher
order might pose a problem, but a larger size (if not too large) isn't
always an issue.
Allocating 100 * 4KiB pages is possibly easier than allocating a single
128KB folio.

Are we trying to limit the physical size or the physical order? If the concern
is order, vmalloc manages __GFP_NOFAIL by mapping order-0 pages. If the
concern is higher order, this sounds reasonable, but it seems the buddy
system already has code to trigger a warning even for order > 1:

struct page *rmqueue(struct zone *preferred_zone,
                        struct zone *zone, unsigned int order,
                        gfp_t gfp_flags, unsigned int alloc_flags,
                        int migratetype)
{
        struct page *page;

        /*
         * We most definitely don't want callers attempting to
         * allocate greater than order-1 page units with __GFP_NOFAIL.
         */
        WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

        if (likely(pcp_allowed_order(order))) {
                page = rmqueue_pcplist(preferred_zone, zone, order,
                                       migratetype, alloc_flags);
                if (likely(page))
                        goto out;
        }
        ....
}

[...]

Thanks
Barry
Linus Torvalds Aug. 22, 2024, 6:40 a.m. UTC | #11
On Thu, 22 Aug 2024 at 14:21, Michal Hocko <mhocko@suse.com> wrote:
>
> The reality disagrees because there is a real demand for a real GFP_NOFAIL
> semantic. By that I do not mean arbitrary requests, and sure, GFP_NOFAIL
> for higher orders is really hard to achieve, but kvmalloc GFP_NOFAIL for
> anything larger than PAGE_SIZE is doable without a considerable burden
> on the MM end.

Doable? Sure. Sensible? Not clear.

I do not find a single case of that in the kernel.

I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
one in erofs. It's not clear that any of them make much sense (or that
the erofs one is actually a large allocation).

And there's some disgusting noise in hmm_test.c. Irrelevant.

Now, this was all from a pure mindless grep, and it didn't take any
context into account, so if somebody is building up the gfp flags and
not using them directly, that grep has failed.

But when you say "reality disagrees", I think you need to point at said reality.

Because the *real* reality is that large allocations HAVE ALWAYS
FAILED. In fact, with flags like GFP_ATOMIC etc, they will fail very
aggressively. Christ, that's what started this whole thread and
discussion in the first place.

So I think *that* is the reality you need to face: GFP_NOFAIL has
never ever been a hard guarantee to begin with, and expecting it to be
that is a sign of insanity. It's fundamentally an impossible
operation.

It really is a "Daddy, daddy, I want a pony" flag.

             Linus
Linus Torvalds Aug. 22, 2024, 6:56 a.m. UTC | #12
On Thu, 22 Aug 2024 at 14:40, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> one in erofs. It's not clear that any of them make much sense (or that
> the erofs one is actually a large allocation).

Oh, and I missed one in btrfs because it needed five lines of context
due to being the allocation from hell.

That said, yes, the vmalloc case at least has no fragmentation issues,
but I do think even that needs to be size limited for sanity.

The size limit might be larger than a couple of pages, but not
_hugely_ larger. You can't just say "I want a megabyte, and you can't
fail me". That kind of code is garbage, and needs to be called out for
being garbage.

                 Linus
Gao Xiang Aug. 22, 2024, 7:01 a.m. UTC | #13
Hi Linus,

On 2024/8/22 14:40, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 14:21, Michal Hocko <mhocko@suse.com> wrote:
>>
>> The reality disagrees because there is a real demand for a real GFP_NOFAIL
>> semantic. By that I do not mean arbitrary requests, and sure, GFP_NOFAIL
>> for higher orders is really hard to achieve, but kvmalloc GFP_NOFAIL for
>> anything larger than PAGE_SIZE is doable without a considerable burden
>> on the MM end.
> 
> Doable? Sure. Sensible? Not clear.
> 
> I do not find a single case of that in the kernel.
> 
> I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> one in erofs. It's not clear that any of them make much sense (or that
> the erofs one is actually a large allocation).

I don't follow all of the thread due to other internal work ongoing,
but EROFS could do a _large_ kvmalloc NOFAIL allocation by the standard
of PAGE_ALLOC_COSTLY_ORDER (~24kb at most due to an on-disk restriction);
my detailed story was outlined in my previous reply (and thread):
https://lore.kernel.org/r/20d782ad-c059-4029-9c75-0ef278c98d81@linux.alibaba.com

Because EROFS needs page arrays for vmap and then does decompression,
for the worst case it needs a ~24kb temporary page array at most, but
that is the end user's choice to use such extreme compression
(mostly just syzkaller-crafted images.)

In my opinion, I'm not sure what the PAGE_ALLOC_COSTLY_ORDER restriction
means for a single shot.  Because even if you don't consider
a virtually consecutive buffer, people could also do
< PAGE_ALLOC_COSTLY_ORDER allocations multiple times to put almost
the same heavy workload on the whole system.  And we also allow
direct/kswapd reclaim here.

The failure path is complex in some cases like here, and it's hard
to reach or get right.  If kvmalloc() will be restricted to
< PAGE_ALLOC_COSTLY_ORDER anyway, I guess I will use a global
static buffer (and a sleeping lock) as a worst-case fallback to fulfill
the extreme on-disk restriction.
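
A rough sketch of that worst-case fallback (all names here are
hypothetical, just to show the shape):

  /* preallocated worst-case page array, bounded by the on-disk limit */
  static DEFINE_MUTEX(fallback_lock);
  static struct page *fallback_pages[MAX_ONDISK_PAGES];

  static struct page **get_page_array(unsigned int nr, bool *need_unlock)
  {
          struct page **arr = kvmalloc_array(nr, sizeof(*arr), GFP_KERNEL);

          *need_unlock = false;
          if (arr)
                  return arr;
          /* serialize on the static buffer; sleeps until it is free */
          mutex_lock(&fallback_lock);
          *need_unlock = true;
          return fallback_pages;
  }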

Thanks,
Gao Xiang
Michal Hocko Aug. 22, 2024, 7:47 a.m. UTC | #14
On Thu 22-08-24 14:56:08, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 14:40, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> > one in erofs. It's not clear that any of them make much sense (or that
> > the erofs one is actually a large allocation).
> 
> Oh, and I missed one in btrfs because it needed five lines of context
> due to being the allocation from hell.
> 
> That said, yes, the vmalloc case at least has no fragmentation issues,
> but I do think even that needs to be size limited for sanity.
> 
> The size limit might be larger than a couple of pages, but not
> _hugely_ larger. You can't just say "I want a megabyte, and you can't
> fail me". That kind of code is garbage, and needs to be called out for
> being garbage.

yes, no objection here. Our current limits are too large for any
practical purpose. We still need a strategy for how to communicate that
the size is not supported though. Just returning NULL is IMHO a bad thing
because it adds a potentially silent failure. In another subthread I was
contemplating an OOPS_ON which would simply terminate the user
context and kill it right away. What do you think about something like
that?
Michal Hocko Aug. 22, 2024, 7:54 a.m. UTC | #15
On Thu 22-08-24 15:01:43, Gao Xiang wrote:
> In my opinion, I'm not sure what the PAGE_ALLOC_COSTLY_ORDER restriction
> means for a single shot.  Because even if you don't consider
> a virtually consecutive buffer, people could also do
> < PAGE_ALLOC_COSTLY_ORDER allocations multiple times to put almost
> the same heavy workload on the whole system.  And we also allow
> direct/kswapd reclaim here.

Quite honestly I do not think that the PAGE_ALLOC_COSTLY_ORDER constraint
makes sense outside of the page allocator proper. There is no reason why
vmalloc NOFAIL should be constrained by that. Sure it should be
constrained to some value, but considering it is just a bunch of PAGE_SIZE
allocations the limit could be higher. I am not sure where the
practical limit should be, but anything that requires more than a couple
of MBs seems really excessive.

And for PAGE_ALLOC_COSTLY_ORDER and NOFAIL at the page allocator level I
would argue that we do not want to provide such a strong guarantee to
anything order > 0. Just use kvmalloc for that purpose. Sure we
practically never fail up to costly order but guaranteeing that is a
different story.
Barry Song Aug. 22, 2024, 7:57 a.m. UTC | #16
On Thu, Aug 22, 2024 at 7:47 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 22-08-24 14:56:08, Linus Torvalds wrote:
> > On Thu, 22 Aug 2024 at 14:40, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > I did find three cases of kvcalloc(NOFAIL) in the nouveau driver and
> > > one in erofs. It's not clear that any of them make much sense (or that
> > > the erofs one is actually a large allocation).
> >
> > Oh, and I missed one in btrfs because it needed five lines of context
> > due to being the allocation from hell.
> >
> > That said, yes, the vmalloc case at least has no fragmentation issues,
> > but I do think even that needs to be size limited for sanity.
> >
> > The size limit might be larger than a couple of pages, but not
> > _hugely_ larger. You can't just say "I want a megabyte, and you can't
> > fail me". That kind of code is garbage, and needs to be called out for
> > being garbage.
>
> yes, no objection here. Our current limits are too large for any
> practical purpose. We still need a strategy for how to communicate that
> the size is not supported though. Just returning NULL is IMHO a bad thing
> because it adds a potentially silent failure. In another subthread I was
> contemplating an OOPS_ON which would simply terminate the user
> context and kill it right away. What do you think about something like
> that?

Personally, I fully support this approach that falls between BUG_ON
and returning NULL. Regarding the concern about 'leaving locks
behind' that you raised in that subthread, I believe there's no difference
when returning NULL, as it could still leave locks behind but offers
a chance for the calling process to avoid an immediate crash.

> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Gao Xiang Aug. 22, 2024, 8:04 a.m. UTC | #17
Hi Michal,

On 2024/8/22 15:54, Michal Hocko wrote:
> On Thu 22-08-24 15:01:43, Gao Xiang wrote:
>> In my opinion, I'm not sure what the PAGE_ALLOC_COSTLY_ORDER restriction
>> means for a single shot.  Because even if you don't consider
>> a virtually consecutive buffer, people could also do
>> < PAGE_ALLOC_COSTLY_ORDER allocations multiple times to put almost
>> the same heavy workload on the whole system.  And we also allow
>> direct/kswapd reclaim here.
> 
> Quite honestly I do not think that the PAGE_ALLOC_COSTLY_ORDER constraint
> makes sense outside of the page allocator proper. There is no reason why
> vmalloc NOFAIL should be constrained by that. Sure it should be
> constrained to some value, but considering it is just a bunch of PAGE_SIZE
> allocations the limit could be higher. I am not sure where the
> practical limit should be, but anything that requires more than a couple
> of MBs seems really excessive.

Yeah, totally agreed, that would make my own life easier, of
course I will not allocate MBs insanely.

I've always been trying to kill unnecessary NOFAILs (mostly together
with code cleanups), but if a failure path adds more than
100 LOCs just for a rare failure and extreme workloads, I _do_
hope kvmalloc(NOFAIL) could work instead.


> 
> And for PAGE_ALLOC_COSTLY_ORDER and NOFAIL at the page allocator level I
> would argue that we do not want to provide such a strong guarantee to
> anything order > 0. Just use kvmalloc for that purpose. Sure we
> practically never fail up to costly order but guaranteeing that is a
> different story.

Agreed.

Thanks,
Gao Xiang

>
Michal Hocko Aug. 22, 2024, 8:24 a.m. UTC | #18
On Thu 22-08-24 19:57:41, Barry Song wrote:
> Regarding the concern about 'leaving locks
> behind' that you raised in that subthread, I believe there's no difference
> when returning NULL, as it could still leave locks behind but offers
> a chance for the calling process to avoid an immediate crash.

Yes, I have mentioned this risk just for completeness. Without having
some sort of unwinding mechanism we are doomed to not be able to handle
this.

The sole difference between just returning NULL and OOPSing right away
is that the former is not guaranteed to happen and the caller can cause
actual harm by dereferencing non-oopsing addresses close to 0, which
would be a) much harder to find out and b) could cause much more damage
than killing the context right away.

Besides that I believe we have many BUG_ON users which would really
prefer to just kill the current context instead; they just do not have
the means to do that, so OOPS_ON could be a safer way to stop bad users
and reduce the number of BUG_ONs as well.

I am just not really sure how to implement that. A stupid and obvious
way would be to have a dereference of a known (pre-defined) unmapped
area. But this smells like something that should be achievable in a
better way.
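
A minimal sketch of that obvious variant (OOPS_ON and the reserved
address are hypothetical, just to make the idea concrete):

  /* hypothetical: oops only the current context on a bogus nofail user;
   * OOPS_ON_UNMAPPED_ADDR is a pre-defined, never-mapped address */
  #define OOPS_ON(cond)                                         \
          do {                                                  \
                  if (unlikely(cond))                           \
                          *(volatile long *)OOPS_ON_UNMAPPED_ADDR = 0; \
          } while (0)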
David Hildenbrand Aug. 22, 2024, 8:39 a.m. UTC | #19
On 22.08.24 10:24, Michal Hocko wrote:
> On Thu 22-08-24 19:57:41, Barry Song wrote:
>> Regarding the concern about 'leaving locks
>> behind' you have in that subthread,  I believe there's no difference
>> when returning NULL, as it could still leave locks behind but offers
>> a chance for the calling process to avoid an immediate crash.
> 
> Yes, I have mentioned this risk just for completeness. Without having
> some sort of unwinding mechanism we are doomed to not be able to handle
> this.
> 
> The sole difference between just returning NULL and OOPSing right away
> is that the former is not guaranteed to happen and the caller can cause
> actual harm by dereferencing non-oopsing addresses close to 0, which
> would be a) much harder to find out and b) could cause much more damage
> than killing the context right away.
> 
> Besides that I believe we have many BUG_ON users which would really
> prefer to just kill the current context instead; they just do not have
> the means to do that, so OOPS_ON could be a safer way to stop bad users
> and reduce the number of BUG_ONs as well.

To me that sounds better as well, but I was also wondering if it's easy 
to implement or easy to assemble from existing pieces.


Linus has a point that "retry forever" can also be nasty. I think the 
important part here is, though, that we report sufficient information 
(stacktrace), such that the problem can be debugged reasonably well, and 
not just having a locked-up system.

But then the question is: does it really make sense to differentiate 
between a NOFAIL allocation under memory pressure of
MAX_ORDER compared to MAX_ORDER+1 (Linus also touched on that)? It could 
well take minutes/hours/days to satisfy a very large NOFAIL allocation. 
So callers should be prepared to run into effective lockups ... :/

NOFAIL shouldn't exist, or at least not be used to that degree.

I am to blame myself: I made use of it in kernel/resource.c, where there
is no turning back once memory unplug is 99% complete (even having
freed the vmemmap), but then we might have to allocate a new node in the
resource tree when having to split an existing one. Maybe there would
be ways to preallocate before starting memory unplug, or to pre-split ...

But then again, sizeof(struct resource) is probably so small that it 
likely would never fail.
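
A sketch of that preallocation idea (placement and names are
hypothetical):

  /* allocate the spare node up front, while failure is recoverable */
  struct resource *spare = kzalloc(sizeof(*spare), GFP_KERNEL);

  if (!spare)
          return -ENOMEM; /* abort before the point of no return */
  /* ... commit the unplug; consume 'spare' if a split is needed,
   * otherwise kfree(spare) ... */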
Linus Torvalds Aug. 22, 2024, 9:08 a.m. UTC | #20
On Thu, 22 Aug 2024 at 16:39, David Hildenbrand <david@redhat.com> wrote:
>
> Linus has a point that "retry forever" can also be nasty. I think the
> important part here is, though, that we report sufficient information
> (stacktrace), such that the problem can be debugged reasonably well, and
> not just having a locked-up system.

Unless I missed some case, I *think* most NOFAIL cases are actually
fairly small.

In fact, I suspect many of them are so small that we already
effectively give that guarantee:

> But then again, sizeof(struct resource) is probably so small that it
> likely would never fail.

Iirc, we had the policy of never failing unrestricted kernel
allocations that are smaller than a page (where "unrestricted" means
that it's a regular GFP_KERNEL, not some NOFS or similar allocation).

In fact, I think we practically speaking still do. We really *really*
tend to try very hard to retry small allocations.

That was one of the things that GFP_USER does - it's identical to
GFP_KERNEL, but it basically tells the MM that it should not try so
hard because an allocation failure was fine.

(I say "was", because the semantics have changed over time.
Originally, GFP_KERNEL had the "GFP_HIGH" bit set that said "access
emergency pools for this allocation", and GFP_USER did not have that
bit set. Now the only difference seems to be GFP_HARDWALL, so for a user
allocation the cgroup limits are hard limits etc. I think there's been
other versions of this kind of logic over the years as people try to
make it all work out well in practice).

In fact, kernel allocations try so hard that we have those "opposite
flags" of ___GFP_NORETRY and ___GFP_RETRY_MAYFAIL because we often try
*TOO* hard, and reasonably many code-paths have that whole "let's
optimistically ask for a big allocation, but not try very hard and not
warn if it fails, because we can fall back on a smaller one".

So it's _really_ hard to fail a small GFP_KERNEL allocation. It used
to be practically impossible, and in fact I think GFP_NOFAIL was
originally added long ago when the MM code was going through big
upheavals and one of the things that was mucked around with was the
whole "how hard to retry".

              Linus
Michal Hocko Aug. 22, 2024, 9:11 a.m. UTC | #21
On Thu 22-08-24 10:39:09, David Hildenbrand wrote:
[...]
> But then the question is: does it really make sense to differentiate
> between a NOFAIL allocation under memory pressure of MAX_ORDER
> compared to MAX_ORDER+1 (Linus also touched on that)? It could well take
> minutes/hours/days to satisfy a very large NOFAIL allocation. So callers
> should be prepared to run into effective lockups ... :/

As pointed out in another subthread, we shouldn't really pretend we
support NOFAIL for order > 0, or at least anything > A_SMALL_ORDER, and
should encourage kvmalloc for those users.

A nofail order 2 allocation can kill most of the userspace on a terribly
fragmented system that is kernel allocation heavy.

> NOFAIL shouldn't exist, or at least not be used to that degree.

Let's put wishful thinking aside. Unless somebody manages to go over
all existing NOFAIL users and fix them, we should better focus on
providing a reasonable, clearly documented and enforced semantic.
Michal Hocko Aug. 22, 2024, 9:16 a.m. UTC | #22
On Thu 22-08-24 17:08:15, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 16:39, David Hildenbrand <david@redhat.com> wrote:
> >
> > Linus has a point that "retry forever" can also be nasty. I think the
> > important part here is, though, that we report sufficient information
> > (stacktrace), such that the problem can be debugged reasonably well, and
> > not just having a locked-up system.
> 
> Unless I missed some case, I *think* most NOFAIL cases are actually
> fairly small.
> 
> In fact, I suspect many of them are so small that we already
> effectively give that guarantee:
> 
> > But then again, sizeof(struct resource) is probably so small that it
> > likely would never fail.
> 
> Iirc, we had the policy of never failing unrestricted kernel
> allocations that are smaller than a page (where "unrestricted" means
> that it's a regular GFP_KERNEL, not some NOFS or similar allocation).
> 
> In fact, I think we practically speaking still do. We really *really*
> tend to try very hard to retry small allocations.

yes, we try very hard, but allocation failure is still possible in some
corner cases, so callers _must_ check the return value and deal with it.

> That was one of the things that GFP_USER does - it's identical to
> GFP_KERNEL, but it basically tells the MM that it should not try so
> hard because an allocation failure was fine.

GFP_USER allocations only imply __GFP_HARDWALL and that only makes a
difference for cpusets. It doesn't make a difference in most cases though.
 
> In fact, kernel allocations try so hard that we have those "opposite
> flags" of ___GFP_NORETRY and ___GFP_RETRY_MAYFAIL because we often try
> *TOO* hard, and reasonably many code-paths have that whole "let's
> optimistically ask for a big allocation, but not try very hard and not
> warn if it fails, because we can fall back on a smaller one".
> 
> So it's _really_ hard to fail a small GFP_KERNEL allocation. It used
> to be practically impossible, and in fact I think GFP_NOFAIL was
> originally added long ago when the MM code was going through big
> upheavals and one of the things that was mucked around with was the
> whole "how hard to retry".

There is a fundamental difference here. GFP_NOFAIL _guarantees_ that the
allocation will not fail so callers do not check for the failure because
they have (presumably) no (practical) way to handle the failure.
Linus Torvalds Aug. 22, 2024, 9:18 a.m. UTC | #23
On Thu, 22 Aug 2024 at 17:11, Michal Hocko <mhocko@suse.com> wrote:
>
> Let's put whishful thinking aside. Unless somebody manages to go over
> all existing NOFAIL users and fix them then we should better focus on
> providing a reasonable clearly documented and enforced semantic.

I do like changing the naming to make it clear that it's not some kind
of general MM guarantee for any random allocation.

So that's why I liked the NOFAIL_SMALL_ALLOC just to make people who
use it aware that no, they aren't getting a "get out of jail free"
card.

Admittedly that's probably _too_ long a name, but conceptually...

              Linus
Linus Torvalds Aug. 22, 2024, 9:24 a.m. UTC | #24
On Thu, 22 Aug 2024 at 17:16, Michal Hocko <mhocko@suse.com> wrote:
>
> GFP_USER allocations only imply __GFP_HARDWALL and that only makes a
> difference for cpusets. It doesn't make a difference in most cases though.

That's what it does today.

We used to have a very clear notion of "how hard to try". It was
"LOW", "MED" and "HIGH".

And GFP_USER used __GFP_LOW, exactly so that the MM layer knew to not
try very hard.

GFP_ATOMIC used __GFP_HIGH, to say "use the reserved resources".

GFP_KERNEL then at one point used __GFP_MED, to say "don't dip into the
reserved pool, but retry harder".

But exactly because people did want kernel allocations to basically
always succeed, then GFP_KERNEL ended up using __GFP_HIGH too.

> There is a fundamental difference here. GFP_NOFAIL _guarantees_ that the
> allocation will not fail so callers do not check for the failure because
> they have (presumably) no (practical) way to handle the failure.

And this mindset needs to go away. That's what I've been trying to say.

It absolutely MUST NOT GUARANTEE THAT.

I've seen crap patches that say "BUG_ON() if we cannot guarantee it",
and I'm NACKing those kinds of completely bogus models.

The hard reality needs to be that GFP_NOFAIL is simply IGNORED if
people mis-use it. It absolutely HAS to be a "conditional no-failure".

And it needs to be conditional on both size and things like "I'm
allowed to do reclaim".

Any discussion that starts with "GFP_NOFAIL is a guarantee" needs to *DIE*.

                  Linus
David Hildenbrand Aug. 22, 2024, 9:27 a.m. UTC | #25
On 22.08.24 11:11, Michal Hocko wrote:
> On Thu 22-08-24 10:39:09, David Hildenbrand wrote:
> [...]
>> But then the question is: does it really make sense to differentiate
>> between a NOFAIL allocation under memory pressure of MAX_ORDER
>> compared to MAX_ORDER+1 (Linus also touched on that)? It could well take
>> minutes/hours/days to satisfy a very large NOFAIL allocation. So callers
>> should be prepared to run into effective lockups ... :/
> 
> As pointed out in another subthread, we shouldn't really pretend we
> support NOFAIL for order > 0, or at least anything > A_SMALL_ORDER, and
> should encourage kvmalloc for those users.
> 
> A nofail order 2 allocation can kill most of the userspace on a terribly
> fragmented system that is kernel allocation heavy.
> 
>> NOFAIL shouldn't exist, or at least not be used to that degree.
> 
> Let's put wishful thinking aside. Unless somebody manages to go over
> all existing NOFAIL users and fix them, we should better focus on
> providing a reasonable, clearly documented and enforced semantic.

Probably it would be time better spent than trying to find ways to deal 
with that mess. ;)


I think the documentation is mostly there:

"The VM implementation _must_ retry infinitely: the caller cannot handle 
allocation failures. The allocation could block indefinitely but will 
never return with failure. Testing for failure is pointless."

To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry 
infinitely". if that implies just OOPSing or actually be in a busy loop, 
I don't care. It could effectively happen with MAX_ORDER as well, as 
stated. But certainly not BUG_ON.


"Using this flag for costly allocations is _highly_ discouraged" should 
be rephrased to "Using this flag with costly allocations is _highly 
dangerous_ and will likely result in the allocation never succeeding and 
this function never making any progress."

I do also agree that renaming NOFAIL to make some of that clearer makes 
sense.

Likely, checkpatch should be updated to warn on any new NOFAIL usage.
Michal Hocko Aug. 22, 2024, 9:33 a.m. UTC | #26
On Thu 22-08-24 17:18:29, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 17:11, Michal Hocko <mhocko@suse.com> wrote:
> >
> > Let's put whishful thinking aside. Unless somebody manages to go over
> > all existing NOFAIL users and fix them then we should better focus on
> > providing a reasonable clearly documented and enforced semantic.
> 
> I do like changing the naming to make it clear that it's not some kind
> of general MM guarantee for any random allocation.
> 
> So that's why I liked the NOFAIL_SMALL_ALLOC just to make people who
> use it aware that no, they aren't getting a "get out of jail free"
> card.

Small means different things to different people. Also that small has a
completely different meaning for the page allocator and for kvmalloc. I
really do not like to carve any ambiguity like that into the flag that
is supposed to be used for both.

Quite honestly I am not even sure we have actual GFP_NOFAIL users of the
page allocator outside of the MM (e.g. to implement SLUB internal
allocations). git grep just takes too much time to process because the
underlying allocator is not always immediately visible.

Limiting NOFAIL semantic to SLUB and {kv}malloc allocators would make
some sense then as it could enforce reasonable use more easily I guess.
Linus Torvalds Aug. 22, 2024, 9:34 a.m. UTC | #27
On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
>
> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
>> infinitely". If that implies just OOPSing or actually being in a busy loop,
> I don't care. It could effectively happen with MAX_ORDER as well, as
> stated. But certainly not BUG_ON.

No BUG_ON(), but also no endless loop.

Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
make it easy to find offenders, and then let them deal with it.

Don't take it upon yourself to say "we have to deal with any amount of
stupidity".

The MM layer is not some slave to users. The MM layer is one of the
most core pieces of code in the kernel, and as such the MM layer is
damn well in charge.

Nobody has the right to say "I will not deal with allocation
failures". The MM should not bend over backwards over something like
that.

Seriously. Get a spine already, people. Tell random drivers that claim
that they cannot deal with errors to just f-ck off.

And you don't do it by looping forever, and you don't do it by killing
the kernel. You do it by ignoring their bullying tactics.

Then you document the *LIMITED* cases where you actually will try forever.

This discussion has gone on for too damn long.

              Linus
Michal Hocko Aug. 22, 2024, 9:41 a.m. UTC | #28
On Thu 22-08-24 11:27:25, David Hildenbrand wrote:
[...]
> Likely, checkpatch should be updated to warn on any new NOFAIL usage.

What do you expect people to do? I do not see a pattern of willy-nilly
use of the flag (we have fewer than 200 in the kernel outside of mm,
compare that to 40k GFP_KERNEL allocations). If you warn and the only
answer is a shrug and moving on, then this serves no purpose.

I have learned a couple of things. People do not really give deep
thought to the naming (e.g. the GFP_TEMPORARY story). Or to documentation
(e.g. the GFP_NOWAIT | GFP_NOFAIL user). They sometimes pay attention to
warnings, but WARN_ON_ONCE tells you about a single abuser... I believe
that enforcing constraints for GFP_NOFAIL by killing the allocating
user context would have a more visible effect than WARN_ON and it would
stop a potential silent failure mode at the same time.
David Hildenbrand Aug. 22, 2024, 9:42 a.m. UTC | #29
On 22.08.24 11:41, Michal Hocko wrote:
> On Thu 22-08-24 11:27:25, David Hildenbrand wrote:
> [...]
>> Likely, checkpatch should be updated to warn on any new NOFAIL usage.
> 
> What do you expect people to do? I do not see a pattern of willy-nilly
> use of the flag (we have fewer than 200 in the kernel outside of mm,
> compare that to 40k GFP_KERNEL allocations). If you warn and the only
> answer is a shrug and moving on, then this serves no purpose.

Like the BUG_ON checks I added. Some people still try to ignore them and 
we end up in discussions like this when I spot it ;)
David Hildenbrand Aug. 22, 2024, 9:43 a.m. UTC | #30
On 22.08.24 11:34, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
>>
>> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
>> infinitely". if that implies just OOPSing or actually be in a busy loop,
>> I don't care. It could effectively happen with MAX_ORDER as well, as
>> stated. But certainly not BUG_ON.
> 
> No BUG_ON(), but also no endless loop.
> 
> Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
> make it easy to find offenders, and then let them deal with it.
> 
> Don't take it upon yourself to say "we have to deal with any amount of
> stupidity".
> 
> The MM layer is not some slave to users. The MM layer is one of the
> most core pieces of code in the kernel, and as such the MM layer is
> damn well in charge.
> 
> Nobody has the right to say "I will not deal with allocation
> failures". The MM should not bend over backwards over something like
> that.
> 
> Seriously. Get a spine already, people. Tell random drivers that claim
> that they cannot deal with errors to just f-ck off.
> 
> And you don't do it by looping forever, and you don't do it by killing
> the kernel. You do it by ignoring their bullying tactics.
> 
> Then you document the *LIMITED* cases where you actually will try forever.

So on the buddy level, that might mean that we limit it to a single 
page, and document "NOFAIL is ineffective and ignored when allocating 
pages of order > 0. Any attempt will result in a WARN_ON_ONCE()". 
(assuming we can find and eliminate users that allocate order > 0 fairly 
easily)

{kv}malloc allocators would be different, as Michal said.

No idea if that is feasible, but it sounds like something you have in mind.
Linus Torvalds Aug. 22, 2024, 9:44 a.m. UTC | #31
On Thu, 22 Aug 2024 at 17:33, Michal Hocko <mhocko@suse.com> wrote:
>
> Limiting the NOFAIL semantics to SLUB and the {kv}malloc allocators would make
> some sense then, as it could enforce reasonable use more easily, I guess.

If by "limit to SLUB" you mean "limit it to the kmalloc() cases that
can be done using the standard *SMALL* buckets", then maybe.

But even then it should probably be only if you don't ask for specific
nodes or other limitations on the allocation.

Because realize that "kmalloc()" and friends will fall back to other
things like __kmalloc_large_node_noprof(), and HELL NO those should
not honor NOFAIL.

And dammit, those kvmalloc() sizes need to be limited too. A number
like 24kB was mentioned. That sounds fine. Maybe even 64kB. But make
it *SMALL*.

And make it clear that it will return NULL if somebody misuses it.

            Linus
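A rough sketch of the size cap Linus asks for here, assuming a 64kB bound
(that figure is just the one floated in this thread, and KVMALLOC_NOFAIL_MAX
is a made-up name, not an existing define):

#include <linux/sizes.h>

#define KVMALLOC_NOFAIL_MAX	SZ_64K	/* illustrative bound only */

static void *kvmalloc_checked(size_t size, gfp_t flags)
{
	if ((flags & __GFP_NOFAIL) && size > KVMALLOC_NOFAIL_MAX) {
		WARN_ON_ONCE(1);	/* misuse: NOFAIL is for small allocations */
		return NULL;		/* caller must handle it */
	}
	return kvmalloc(size, flags);
}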
Linus Torvalds Aug. 22, 2024, 9:53 a.m. UTC | #32
On Thu, 22 Aug 2024 at 17:43, David Hildenbrand <david@redhat.com> wrote:
>
> So on the buddy level, that might mean that we limit it to a single
> page,

Actually, for many SLUB allocations, you probably do have to accept
the small orders - the slab caches are often two or four pages.

For example, a kmalloc(256) is an order-1 allocation on a buddy level
from a quick look at /proc/slabinfo.

So it's not necessarily only single pages. We do handle small orders.
But it gets exponentially harder, so it really is just the small
orders that work.

Looks like slub will use up to order-3. That smells like an off-by-one
to me (I thought we made 0-2 be the "cheap" orders, but maybe I'm just
wrong), but it probably is still acceptable.

             Linus
Michal Hocko Aug. 22, 2024, 9:59 a.m. UTC | #33
On Thu 22-08-24 17:44:22, Linus Torvalds wrote:
[...]
> And dammit, those kvmalloc() sizes need to be limited too. A number
> like 24kB was mentioned. That sounds fine. Maybe even 64kB. But make
> it *SMALL*.

I am not particularly bothered by the exact size for kvmalloc. There should
be a limit all right, but whether that is 1MB or 64kB shouldn't make
much of a difference, because we do use 4kB pages to back those
allocations anyway. Sure, 32-bit is a different story because the vmalloc
space itself is a scarce resource, but do we care about those?

> And make it clear that it will return NULL if somebody misuses it.

What do you expect users to do with the return value then?
Linus Torvalds Aug. 22, 2024, 10:30 a.m. UTC | #34
[ Sorry, on mobile right now, so HTML crud ]

On Thu, Aug 22, 2024, 17:59 Michal Hocko <mhocko@suse.com> wrote:

>
> > And make it clear that it will return NULL if somebody misuses it.
>
> What do you expect users to do with the return value then?
>

NOT. YOUR. PROBLEM.

It's the problem of the caller.

What's so hard to understand about that? The MM layer has enough problems
of its own; it doesn't need to take on the problems of others.

      Linus

>
Michal Hocko Aug. 22, 2024, 10:46 a.m. UTC | #35
On Thu 22-08-24 18:30:12, Linus Torvalds wrote:
> [ Sorry, on mobile right now, so HTML crud ]
> 
> On Thu, Aug 22, 2024, 17:59 Michal Hocko <mhocko@suse.com> wrote:
> 
> >
> > > And make it clear that it will return NULL if somebody misuses it.
> >
> > What do you expect users to do with the return value then?
> >
> 
> NOT. YOUR. PROBLEM.
> 
> It's the problem of the caller.

They have already told you they (believe they) have no way to handle the failure.
So if they learn they can get NULL, they will either BUG_ON, put a loop
around the call, or just close their eyes and hope for the best. I argue that
none of those is a good option, because it leads to poor and buggy
code.

Look Linus, I believe we are going in circles and I do not have
much more to offer to this discussion. I do agree with you that we
should define the boundaries of GFP_NOFAIL more explicitly; the
documentation we have is not an effective way. I strongly disagree that
WARN_ONCE plus returning NULL and closing our eyes is good programming,
nor does it encourage good programming on the caller side. I have
offered to explicitly oops in those cases inside the allocator (while
not holding any internal allocator locks) as a reasonable compromise. I
even believe that this is a useful tool in other contexts as well. I
haven't heard your opinion on that so far. If I had time to work on
this myself I would give it a try.
Johannes Weiner Aug. 22, 2024, 11:58 a.m. UTC | #36
On Thu, Aug 22, 2024 at 05:53:22PM +0800, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 17:43, David Hildenbrand <david@redhat.com> wrote:
> >
> > So on the buddy level, that might mean that we limit it to a single
> > page,
> 
> Actually, for many SLUB allocations, you probably do have to accept
> the small orders - the slab caches are often two or four pages.
> 
> For example, a kmalloc(256) is an order-1 allocation on a buddy level
> from a quick look at /proc/slabinfo.

It will try higher orders to reduce fragmentation (and mask out NOFAIL
on those attempts), but it can fall back to the minimum size required
for the object, i.e. get_order(size).
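Roughly, the pattern Johannes describes looks like this (a simplified
sketch of SLUB's slab allocation, not the actual code; the names here
are illustrative):

/*
 * Sketch: try a larger slab order opportunistically with __GFP_NOFAIL
 * masked out (and warnings suppressed), then fall back to the minimum
 * order that fits one object.
 */
static struct page *alloc_slab_pages(gfp_t flags, unsigned int pref_order,
				     unsigned int min_order)
{
	struct page *page;

	/* opportunistic attempt: allowed to fail, must not loop forever */
	page = alloc_pages((flags | __GFP_NOWARN) & ~__GFP_NOFAIL, pref_order);
	if (page)
		return page;

	/* fallback: the original flags (NOFAIL included) at the minimum order */
	return alloc_pages(flags, min_order);
}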

> So it's not necessarily only single pages. We do handle small orders.
> But it gets exponentially harder, so it really is just the small
> orders that work.

Agreed.

> Looks like slub will use up to order-3. That smells like an off-by-one
> to me (I thought we made 0-2 be the "cheap" orders, but maybe I'm just
> wrong), but it probably is still acceptable.

It's 0-3. We #define PAGE_ALLOC_COSTLY_ORDER 3, but it's exclusive -
all the order checks are > costly and <= costly.
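In code, that convention reads as follows (the define and its value are
real; the helper is only for illustration):

/* orders 0-3 are the "cheap" ones; "costly" starts above the define */
#define PAGE_ALLOC_COSTLY_ORDER 3

static inline bool order_is_costly(unsigned int order)
{
	return order > PAGE_ALLOC_COSTLY_ORDER;	/* i.e. order >= 4 */
}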
Yafang Shao Aug. 22, 2024, 2:22 p.m. UTC | #37
On Thu, Aug 22, 2024 at 2:37 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Aug 22, 2024 at 12:41 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, Aug 20, 2024 at 12:05 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Mon, 19 Aug 2024 at 06:02, David Hildenbrand <david@redhat.com> wrote:
> > > > >
> > > > > If we must still fail a nofail allocation, we should trigger a BUG rather
> > > > > than exposing NULL dereferences to callers who do not check the return
> > > > > value.
> > > >
> > > > I am not convinced that BUG_ON is the right tool here to save the world,
> > > > but I see how we arrived here.
> > >
> > > I think the thing to do is to just add a
> > >
> > >      WARN_ON_ONCE((flags & __GFP_NOFAIL) && bad_nofail_alloc(order, flags));
> > >
> > > or similar, where that bad_nofail_alloc() checks that the allocation
> > > order is small and that the flags are sane for a NOFAIL allocation.
> > >
> > > Because no, BUG_ON() is *never* the answer. The answer is to make sure
> > > nobody ever sets NOFAIL in situations where the allocation can fail
> > > and there is no way forward.
> > >
> > > A BUG_ON() will quite likely just make things worse. You're better off
> > > with a WARN_ON() and letting the caller just oops.
> > >
> > > Honestly, I'm perfectly fine with just removing that stupid useless
> > > flag entirely. The flag goes back to 2003 and was introduced in
> > > 2.5.69, and was meant to be for very particular uses that otherwise
> > > just looped waiting for memory.
> > >
> > > Back in 2.5.69, there was exactly one user: the jbd journal code, that
> > > did a buffer head allocation with GFP_NOFAIL.  By 2.6.0 that had
> > > expanded by another user in XFS, and even that one had a comment
> > > saying that it needed to be narrowed down. And in fact, by the 2.6.12
> > > release, that XFS use had been removed, but the jbd journal had grown
> > > another jbd_kmalloc case for transaction data. So at the beginning of
> > > the git archives, we had exactly *one* user (with two places).
> > >
> > > *THAT* is the kind of use that the flag was meant for: small
> > > allocations required to make forward progress in writeout during
> > > memory pressure.
> > >
> > > It has then expanded and is now a problem. The cases using GFP_NOFAIL
> > > for things like vmalloc() - which is by definition not a small
> > > allocation - should be just removed as outright bugs.
> >
> > One potential approach could be to rename GFP_NOFAIL to
> > GFP_NOFAIL_FOR_SMALL_ALLOC, specifically for smaller allocations, and
> > to clear this flag for larger allocations. However, the challenge lies
> > in determining what constitutes a 'small' allocation.
>
> I'm not entirely sure if our concern is with higher order or larger size.

I believe both should be considered. Since the higher-order task might
be easier to address, starting with that seems like the more
straightforward approach.

> Higher
> order might pose a problem, but a larger size (if not too large) isn't
> always an issue.
> Allocating 100 * 4KiB pages is possibly easier than allocating a single
> 128KB folio.
>
> Are we trying to limit the physical size or the physical order? If the concern
> is order, vmalloc manages __GFP_NOFAIL by mapping order-0 pages. If the
> concern is higher order, this sounds reasonable. But it seems the buddy
> system already has code to trigger a warning even for order > 1:

To avoid potential livelock, it may be wise to drop this flag for
higher-order allocations as well. Following Linus's suggestion, we
could start by removing it for "> PAGE_ALLOC_COSTLY_ORDER".
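A sketch of what that removal could look like inside the allocator,
assuming we simply strip the flag and warn for costly orders (untested,
illustration only):

/* sketch: drop __GFP_NOFAIL for costly orders and flag the offender */
if (unlikely((gfp_flags & __GFP_NOFAIL) &&
	     order > PAGE_ALLOC_COSTLY_ORDER)) {
	WARN_ON_ONCE(1);
	gfp_flags &= ~__GFP_NOFAIL;	/* the allocation may now fail */
}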

>
> struct page *rmqueue(struct zone *preferred_zone,
>                         struct zone *zone, unsigned int order,
>                         gfp_t gfp_flags, unsigned int alloc_flags,
>                         int migratetype)
> {
>         struct page *page;
>
>         /*
>          * We most definitely don't want callers attempting to
>          * allocate greater than order-1 page units with __GFP_NOFAIL.
>          */
>         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

This line was added by Michal in commit 0f352e5392c8 ("mm: remove
__GFP_NOFAIL is deprecated comment"), but it appears that Michal has
since reconsidered his stance. ;)



--
Regards
Yafang
Yafang Shao Aug. 22, 2024, 2:35 p.m. UTC | #38
On Thu, Aug 22, 2024 at 4:04 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>
> Hi Michal,
>
> On 2024/8/22 15:54, Michal Hocko wrote:
> > On Thu 22-08-24 15:01:43, Gao Xiang wrote:
> >> In my opinion, I'm not sure what a PAGE_ALLOC_COSTLY_ORDER restriction
> >> means for a single shot.  Even if you don't consider
> >> a virtually consecutive buffer, people could also do
> >> < PAGE_ALLOC_COSTLY_ORDER allocations multiple times and put almost
> >> the same heavy workload on the whole system.  And we also allow
> >> direct/kswapd reclaim here.
> >
> > Quite honestly I do not think that the PAGE_ALLOC_COSTLY_ORDER constraint
> > makes sense outside of the page allocator proper. There is no reason why
> > vmalloc NOFAIL should be constrained by that. Sure, it should be
> > constrained to some value, but considering it is just a bunch of PAGE_SIZE
> > allocations, the limit could be higher. I am not sure where the
> > practical limit should be, but anything that requires more than a couple of
> > MBs seems really excessive.
>
> Yeah, totally agreed, that would make my own life easier, of
> course I will not allocate MBs insanely.
>
> I've always been trying to kill unnecessary NOFAILs (mostly together
> with code cleanups), but if a failure path adds more than
> 100 LOCs just for a rare failure and extreme workloads, I _do_
> hope kvmalloc(NOFAIL) could work instead.

If the LOCs in the error handler are a concern, I believe we can
simplify it to a single line: while (!alloc()), which is essentially
what NOFAIL does, and is also the reason we so desperately want NOFAIL.

A better approach might involve failing after a maximum number of
retries at the call site, for example:

  while (try < max_retries && !alloc())

At least that is better than the endless loop in the page allocator.
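Spelled out at a call site, that bounded-retry pattern could look like the
sketch below; the retry count and the sleep are arbitrary illustrative
choices, not recommendations:

/* sketch: caller-side bounded retry instead of __GFP_NOFAIL */
void *buf = NULL;
int tries;

for (tries = 0; tries < 10 && !buf; tries++) {
	buf = kvmalloc(size, GFP_KERNEL);
	if (!buf)
		msleep(100);	/* give reclaim a chance to make progress */
}
if (!buf)
	return -ENOMEM;		/* finally, a real error path */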
Gao Xiang Aug. 22, 2024, 3:02 p.m. UTC | #39
On 2024/8/22 22:35, Yafang Shao wrote:
> On Thu, Aug 22, 2024 at 4:04 PM Gao Xiang <hsiangkao@linux.alibaba.com> wrote:
>>
>> Hi Michal,
>>
>> On 2024/8/22 15:54, Michal Hocko wrote:
>>> On Thu 22-08-24 15:01:43, Gao Xiang wrote:
>>>> In my opinion, I'm not sure what a PAGE_ALLOC_COSTLY_ORDER restriction
>>>> means for a single shot.  Even if you don't consider
>>>> a virtually consecutive buffer, people could also do
>>>> < PAGE_ALLOC_COSTLY_ORDER allocations multiple times and put almost
>>>> the same heavy workload on the whole system.  And we also allow
>>>> direct/kswapd reclaim here.
>>>
>>> Quite honestly I do not think that the PAGE_ALLOC_COSTLY_ORDER constraint
>>> makes sense outside of the page allocator proper. There is no reason why
>>> vmalloc NOFAIL should be constrained by that. Sure, it should be
>>> constrained to some value, but considering it is just a bunch of PAGE_SIZE
>>> allocations, the limit could be higher. I am not sure where the
>>> practical limit should be, but anything that requires more than a couple of
>>> MBs seems really excessive.
>>
>> Yeah, totally agreed, that would make my own life easier, of
>> course I will not allocate MBs insanely.
>>
>> I've always been trying to kill unnecessary NOFAILs (mostly together
>> with code cleanups), but if a failure path adds more than
>> 100 LOCs just for a rare failure and extreme workloads, I _do_
>> hope kvmalloc(NOFAIL) could work instead.
> 
> If the LOCs in the error handler are a concern, I believe we can
> simplify it to a single line: while (!alloc()), which is essentially
> what NOFAIL does, and is also the reason we so desperately want NOFAIL.
> 
> A better approach might involve failing after a maximum number of
> retries at the call site, for example:
> 
>    while (try < max_retries && !alloc())
> 
> At least that is better than the endless loop in the page allocator.

Funny.

Thanks,
Gao Xiang
Vlastimil Babka Aug. 26, 2024, 12:10 p.m. UTC | #40
On 8/22/24 11:34, Linus Torvalds wrote:
> On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
>>
>> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
>> infinitely". if that implies just OOPSing or actually be in a busy loop,
>> I don't care. It could effectively happen with MAX_ORDER as well, as
>> stated. But certainly not BUG_ON.
> 
> No BUG_ON(), but also no endless loop.
> 
> Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
> make it easy to find offenders, and then let them deal with it.

Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
we're about to actually return NULL, so the memory has to be depleted
already. To make it easier to find the offenders much more reliably, we
should consider doing it sooner, but also not add unnecessary overhead to
allocator fastpaths just because of the potentially buggy users. So either
always in __alloc_pages_slowpath(), which should be often enough (unless the
system never needs to wake up kswapd to reclaim) but with negligible enough
overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM?
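For reference, the CONFIG_DEBUG_VM variant would be a one-liner along
these lines (VM_WARN_ON_ONCE compiles away when the option is off):

/* sketch: debug-build-only sanity check on every allocation */
VM_WARN_ON_ONCE((gfp_mask & __GFP_NOFAIL) &&
		!(gfp_mask & __GFP_DIRECT_RECLAIM));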

> Don't take it upon yourself to say "we have to deal with any amount of
> stupidity".
> 
> The MM layer is not some slave to users. The MM layer is one of the
> most core pieces of code in the kernel, and as such the MM layer is
> damn well in charge.
> 
> Nobody has the right to say "I will not deal with allocation
> failures". The MM should not bend over backwards over something like
> that.
> 
> Seriously. Get a spine already, people. Tell random drivers that claim
> that they cannot deal with errors to just f-ck off.
> 
> And you don't do it by looping forever, and you don't do it by killing
> the kernel. You do it by ignoring their bullying tactics.
> 
> Then you document the *LIMITED* cases where you actually will try forever.
> 
> This discussion has gone on for too damn long.
> 
>               Linus
Linus Torvalds Aug. 27, 2024, 6:57 a.m. UTC | #41
On Tue, 27 Aug 2024 at 00:10, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
> we're about to actually return NULL, so the memory has to be depleted
> already. To make it easier to find the offenders much more reliably, we
> should consider doing it sooner, but also not add unnecessary overhead to
> allocator fastpaths just because of the potentially buggy users.

Ack. Sounds like a sane model to me. And I agree that
__alloc_pages_slowpath() is likely a reasonable middle ground:
catch misuses, but don't put the check in the hotpath.

               Linus
Barry Song Aug. 27, 2024, 7:15 a.m. UTC | #42
On Tue, Aug 27, 2024 at 12:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/22/24 11:34, Linus Torvalds wrote:
> > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
> >>
> >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
> >> infinitely". if that implies just OOPSing or actually be in a busy loop,
> >> I don't care. It could effectively happen with MAX_ORDER as well, as
> >> stated. But certainly not BUG_ON.
> >
> > No BUG_ON(), but also no endless loop.
> >
> > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
> > make it easy to find offenders, and then let them deal with it.
>
> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
> we're about to actually return NULL, so the memory has to be depleted
> already. To make it easier to find the offenders much more reliably, we
> should consider doing it sooner, but also not add unnecessary overhead to
> allocator fastpaths just because of the potentially buggy users. So either
> always in __alloc_pages_slowpath(), which should be often enough (unless the
> system never needs to wake up kswapd to reclaim) but with negligible enough
> overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM?

We already have a WARN_ON for order > 1 in rmqueue. We might extend
the condition there to include checking flags as well?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7dcb0713eb57..b5717c6569f9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3071,8 +3071,11 @@ struct page *rmqueue(struct zone *preferred_zone,
  /*
  * We most definitely don't want callers attempting to
  * allocate greater than order-1 page units with __GFP_NOFAIL.
+ * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
+ * which can result in a lockup
  */
- WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
+ WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) &&
+     (order > 1 || !(gfp_flags & __GFP_DIRECT_RECLAIM)));

  if (likely(pcp_allowed_order(order))) {
  page = rmqueue_pcplist(preferred_zone, zone, order,

>
> > Don't take it upon yourself to say "we have to deal with any amount of
> > stupidity".
> >
> > The MM layer is not some slave to users. The MM layer is one of the
> > most core pieces of code in the kernel, and as such the MM layer is
> > damn well in charge.
> >
> > Nobody has the right to say "I will not deal with allocation
> > failures". The MM should not bend over backwards over something like
> > that.
> >
> > Seriously. Get a spine already, people. Tell random drivers that claim
> > that they cannot deal with errors to just f-ck off.
> >
> > And you don't do it by looping forever, and you don't do it by killing
> > the kernel. You do it by ignoring their bullying tactics.
> >
> > Then you document the *LIMITED* cases where you actually will try forever.
> >
> > This discussion has gone on for too damn long.
> >
> >               Linus
>
Vlastimil Babka Aug. 27, 2024, 7:38 a.m. UTC | #43
On 8/27/24 09:15, Barry Song wrote:
> On Tue, Aug 27, 2024 at 12:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 8/22/24 11:34, Linus Torvalds wrote:
>> > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
>> >>
>> >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
>> >> infinitely". if that implies just OOPSing or actually be in a busy loop,
>> >> I don't care. It could effectively happen with MAX_ORDER as well, as
>> >> stated. But certainly not BUG_ON.
>> >
>> > No BUG_ON(), but also no endless loop.
>> >
>> > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
>> > make it easy to find offenders, and then let them deal with it.
>>
>> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
>> we're about to actually return NULL, so the memory has to be depleted
>> already. To make it easier to find the offenders much more reliably, we
>> should consider doing it sooner, but also not add unnecessary overhead to
>> allocator fastpaths just because of the potentially buggy users. So either
>> always in __alloc_pages_slowpath(), which should be often enough (unless the
>> system never needs to wake up kswapd to reclaim) but with negligible enough
>> overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM?
> 
> We already have a WARN_ON for order > 1 in rmqueue. We might extend
> the condition there to include checking flags as well?

Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
existing users of __GFP_NOFAIL with order > 1 :)

But also the check is in the hotpath, even before trying the pcplists, so we
could move it to __alloc_pages_slowpath() while extending it?

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7dcb0713eb57..b5717c6569f9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3071,8 +3071,11 @@ struct page *rmqueue(struct zone *preferred_zone,
>   /*
>   * We most definitely don't want callers attempting to
>   * allocate greater than order-1 page units with __GFP_NOFAIL.
> + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
> + * which can result in a lockup
>   */
> - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> + WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) &&
> +     (order > 1 || !(gfp_flags & __GFP_DIRECT_RECLAIM)));
> 
>   if (likely(pcp_allowed_order(order))) {
>   page = rmqueue_pcplist(preferred_zone, zone, order,
> 
>>
>> > Don't take it upon yourself to say "we have to deal with any amount of
>> > stupidity".
>> >
>> > The MM layer is not some slave to users. The MM layer is one of the
>> > most core pieces of code in the kernel, and as such the MM layer is
>> > damn well in charge.
>> >
>> > Nobody has the right to say "I will not deal with allocation
>> > failures". The MM should not bend over backwards over something like
>> > that.
>> >
>> > Seriously. Get a spine already, people. Tell random drivers that claim
>> > that they cannot deal with errors to just f-ck off.
>> >
>> > And you don't do it by looping forever, and you don't do it by killing
>> > the kernel. You do it by ignoring their bullying tactics.
>> >
>> > Then you document the *LIMITED* cases where you actually will try forever.
>> >
>> > This discussion has gone on for too damn long.
>> >
>> >               Linus
>>
Barry Song Aug. 27, 2024, 7:50 a.m. UTC | #44
On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/27/24 09:15, Barry Song wrote:
> > On Tue, Aug 27, 2024 at 12:10 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 8/22/24 11:34, Linus Torvalds wrote:
> >> > On Thu, 22 Aug 2024 at 17:27, David Hildenbrand <david@redhat.com> wrote:
> >> >>
> >> >> To me, that implies that if you pass in MAX_ORDER+1 the VM will "retry
> >> >> infinitely". if that implies just OOPSing or actually be in a busy loop,
> >> >> I don't care. It could effectively happen with MAX_ORDER as well, as
> >> >> stated. But certainly not BUG_ON.
> >> >
> >> > No BUG_ON(), but also no endless loop.
> >> >
> >> > Just return NULL for bogus users. Really. Give a WARN_ON_ONCE() to
> >> > make it easy to find offenders, and then let them deal with it.
> >>
> >> Right now we give the WARN_ON_ONCE() (for !can_direct_reclaim) only when
> >> we're about to actually return NULL, so the memory has to be depleted
> >> already. To make it easier to find the offenders much more reliably, we
> >> should consider doing it sooner, but also not add unnecessary overhead to
> >> allocator fastpaths just because of the potentially buggy users. So either
> >> always in __alloc_pages_slowpath(), which should be often enough (unless the
> >> system never needs to wake up kswapd to reclaim) but with negligible enough
> >> overhead, or on every allocation but only with e.g. CONFIG_DEBUG_VM?
> >
> > We already have a WARN_ON for order > 1 in rmqueue. We might extend
> > the condition there to include checking flags as well?
>
> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
> existing users of __GFP_NOFAIL with order > 1 :)
>
> But also the check is in the hotpath, even before trying the pcplists, so we
> could move it to __alloc_pages_slowpath() while extending it?

Agreed. I don't think it is reasonable to check the order and flags in
two different places, especially since rmqueue() already has the
gfp_flags & __GFP_NOFAIL test and the order > 1
overhead.

We can at least extend the current check to make some improvement
though I still believe Michal's suggestion of implementing OOPS_ON is a
better approach to pursue, as it doesn't crash the entire system
while ensuring the problematic process is terminated.
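(No OOPS_ON() exists in mainline; as a purely hypothetical sketch, the
idea amounts to forcing an oops, which kills only the current task when
panic_on_oops=0 and we are not in irq context, instead of a BUG():)

/*
 * Hypothetical sketch, not a mainline macro. A deliberate NULL
 * dereference triggers an oops, terminating the offending task rather
 * than the whole machine (subject to panic_on_oops and context).
 */
#define OOPS_ON(cond)							\
do {									\
	if (unlikely(cond)) {						\
		pr_err("OOPS_ON(%s) at %s:%d\n",			\
		       #cond, __FILE__, __LINE__);			\
		*(volatile int *)NULL = 0;	/* force the oops */	\
	}								\
} while (0)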

>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7dcb0713eb57..b5717c6569f9 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3071,8 +3071,11 @@ struct page *rmqueue(struct zone *preferred_zone,
> >   /*
> >   * We most definitely don't want callers attempting to
> >   * allocate greater than order-1 page units with __GFP_NOFAIL.
> > + * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
> > + * which can result in a lockup
> >   */
> > - WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> > + WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) &&
> > +     (order > 1 || !(gfp_flags & __GFP_DIRECT_RECLAIM)));
> >
> >   if (likely(pcp_allowed_order(order))) {
> >   page = rmqueue_pcplist(preferred_zone, zone, order,
> >
> >>
> >> > Don't take it upon yourself to say "we have to deal with any amount of
> >> > stupidity".
> >> >
> >> > The MM layer is not some slave to users. The MM layer is one of the
> >> > most core pieces of code in the kernel, and as such the MM layer is
> >> > damn well in charge.
> >> >
> >> > Nobody has the right to say "I will not deal with allocation
> >> > failures". The MM should not bend over backwards over something like
> >> > that.
> >> >
> >> > Seriously. Get a spine already, people. Tell random drivers that claim
> >> > that they cannot deal with errors to just f-ck off.
> >> >
> >> > And you don't do it by looping forever, and you don't do it by killing
> >> > the kernel. You do it by ignoring their bullying tactics.
> >> >
> >> > Then you document the *LIMITED* cases where you actually will try forever.
> >> >
> >> > This discussion has gone on for too damn long.
> >> >
> >> >               Linus
> >>
>

Thanks
Barry
Vlastimil Babka Aug. 29, 2024, 10:24 a.m. UTC | #45
On 8/27/24 09:50, Barry Song wrote:
> On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>>
>> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
>> existing users of __GFP_NOFAIL with order > 1 :)
>>
>> But also the check is in the hotpath, even before trying the pcplists, so we
>> could move it to __alloc_pages_slowpath() while extending it?
> 
> Agreed. I don't think it is reasonable to check the order and flags in
> two different places, especially since rmqueue() already has the
> gfp_flags & __GFP_NOFAIL test and the order > 1
> overhead.
> 
> We can at least extend the current check to make some improvement
> though I still believe Michal's suggestion of implementing OOPS_ON is a
> better approach to pursue, as it doesn't crash the entire system
> while ensuring the problematic process is terminated.

Linus made it clear it's not an mm concern. If e.g. hardening people want to
pursue that instead, they can.

BTW I think BUG_ON already works like this: if possible, only the calling
process is terminated. A panic happens in case of being in an irq context, or
due to panic_on_oops. Which the security people are setting to 1 anyway, and
OOPS_ON would have to observe it too. So AFAICS the only difference from
BUG_ON would be not panicking in the irq context, if panic_on_oops isn't set.
(as for "no mm locks held" I think it's already satisfied at the points we
check for __GFP_NOFAIL).
Barry Song Aug. 29, 2024, 11:53 a.m. UTC | #46
On Thu, Aug 29, 2024 at 10:24 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 8/27/24 09:50, Barry Song wrote:
> > On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >>
> >> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
> >> existing users of __GFP_NOFAIL with order > 1 :)
> >>
> >> But also the check is in the hotpath, even before trying the pcplists, so we
> >> could move it to __alloc_pages_slowpath() while extending it?
> >
> > Agreed. I don't think it is reasonable to check the order and flags in
> > two different places, especially since rmqueue() already has the
> > gfp_flags & __GFP_NOFAIL test and the order > 1
> > overhead.
> >
> > We can at least extend the current check to make some improvement
> > though I still believe Michal's suggestion of implementing OOPS_ON is a
> > better approach to pursue, as it doesn't crash the entire system
> > while ensuring the problematic process is terminated.
>
> Linus made it clear it's not an mm concern. If e.g. hardening people want to
> pursue that instead, they can.
>
> BTW I think BUG_ON already works like this: if possible, only the calling
> process is terminated. A panic happens in case of being in an irq context, or

You are right. This is a detail I overlooked in the last discussion.
BUG_ON is already exactly the case that only terminates the bad
process when it can
(panic_on_oops=N and not in irq context).

> due to panic_on_oops. Which the security people are setting to 1 anyway, and
> OOPS_ON would have to observe it too. So AFAICS the only difference from
> BUG_ON would be not panicking in the irq context, if panic_on_oops isn't set.

right.

> (as for "no mm locks held" I think it's already satisfied at the points we
> check for __GFP_NOFAIL).

Let me summarize the discussion:

Patch 1/4, which fixes the misuse of combining gfp_nofail and atomic
in the vdpa driver, is necessary.
Patch 2/4, which updates the documentation to clarify that
non-blockable gfp_nofail is not supported, is needed.
Patch 3/4: We will replace BUG_ON with WARN_ON_ONCE to warn when the
size is too large, where gfp_nofail will return NULL.
Patch 4/4: We will move the order > 1 check from the current fast path
to the slow path and extend the check of the gfp_direct_reclaim flag
also in the slow path.

If nobody has an objection, I will prepare v4 as above.

Thanks
Barry
Michal Hocko Aug. 29, 2024, 1:20 p.m. UTC | #47
On Thu 29-08-24 23:53:33, Barry Song wrote:
> On Thu, Aug 29, 2024 at 10:24 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 8/27/24 09:50, Barry Song wrote:
> > > On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >>
> > >>
> > >> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
> > >> existing users of __GFP_NOFAIL with order > 1 :)
> > >>
> > >> But also the check is in the hotpath, even before trying the pcplists, so we
> > >> could move it to __alloc_pages_slowpath() while extending it?
> > >
> > > Agreed. I don't think it is reasonable to check the order and flags in
> > > two different places, especially since rmqueue() already has the
> > > gfp_flags & __GFP_NOFAIL test and the order > 1
> > > overhead.
> > >
> > > We can at least extend the current check to make some improvement
> > > though I still believe Michal's suggestion of implementing OOPS_ON is a
> > > better approach to pursue, as it doesn't crash the entire system
> > > while ensuring the problematic process is terminated.
> >
> > Linus made it clear it's not an mm concern. If e.g. hardening people want to
> > pursue that instead, they can.
> >
> > BTW I think BUG_ON already works like this: if possible, only the calling
> > process is terminated. A panic happens in case of being in an irq context, or
> 
> You are right. This is a detail I overlooked in the last discussion.
> BUG_ON is already exactly the case that only terminates the bad
> process when it can
> (panic_on_oops=N and not in irq context).

Are you sure about that? Maybe the x86 implementation treats BUG as an oops,
but is this what it does on all arches? BUG() has historically meant stop
everything and die, and I am not really sure when that would have
changed, TBH.

> > due to panic_on_oops. Which the security people are setting to 1 anyway, and
> > OOPS_ON would have to observe it too. So AFAICS the only difference from
> > BUG_ON would be not panicking in the irq context, if panic_on_oops isn't set.
> 
> right.
> 
> > (as for "no mm locks held" I think it's already satisfied at the points we
> > check for __GFP_NOFAIL).
> 
> Let me summarize the discussion:
> 
> Patch 1/4, which fixes the misuse of combining gfp_nofail and atomic
> in the vdpa driver, is necessary.
> Patch 2/4, which updates the documentation to clarify that
> non-blockable gfp_nofail is not supported, is needed.

Let's please have those merged now.

> Patch 3/4: We will replace BUG_ON with WARN_ON_ONCE to warn when the
> size is too large, where gfp_nofail will return NULL.


I would pull this one out for a separate discussion. We should really
define what "too large" really means, and INT_MAX etc. is not it at
all.

> Patch 4/4: We will move the order > 1 check from the current fast path
> to the slow path and extend the check of the gfp_direct_reclaim flag
> also in the slow path.

OK, let's have that go in now as well.
Barry Song Aug. 29, 2024, 9:27 p.m. UTC | #48
On Fri, Aug 30, 2024 at 1:20 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 29-08-24 23:53:33, Barry Song wrote:
> > On Thu, Aug 29, 2024 at 10:24 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 8/27/24 09:50, Barry Song wrote:
> > > > On Tue, Aug 27, 2024 at 7:38 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >>
> > > >>
> > > >> Ugh, wasn't aware, well spotted. So it means there at least shouldn't be
> > > >> existing users of __GFP_NOFAIL with order > 1 :)
> > > >>
> > > >> But also the check is in the hotpath, even before trying the pcplists, so we
> > > >> could move it to __alloc_pages_slowpath() while extending it?
> > > >
> > > > Agreed. I don't think it is reasonable to check the order and flags in
> > > > two different places, especially since rmqueue() already has the
> > > > gfp_flags & __GFP_NOFAIL test and the order > 1
> > > > overhead.
> > > >
> > > > We can at least extend the current check to make some improvement
> > > > though I still believe Michal's suggestion of implementing OOPS_ON is a
> > > > better approach to pursue, as it doesn't crash the entire system
> > > > while ensuring the problematic process is terminated.
> > >
> > > Linus made it clear it's not an mm concern. If e.g. hardening people want to
> > > pursue that instead, they can.
> > >
> > > BTW I think BUG_ON already works like this: if possible, only the calling
> > > process is terminated. A panic happens in case of being in an irq context, or
> >
> > You are right. This is a detail I overlooked in the last discussion.
> > BUG_ON is already exactly the case that only terminates the bad
> > process when it can
> > (panic_on_oops=N and not in irq context).
>
> Are you sure about that? Maybe the x86 implementation treats BUG as an oops,
> but is this what it does on all arches? BUG() has historically meant stop
> everything and die, and I am not really sure when that would have
> changed, TBH.

My ARM64 machine also only terminates the bad process on BUG_ON()
if we are not in irq context and we don't set panic_on_oops.

I guess it depends on HAVE_ARCH_BUG? If the arch has no BUG(), BUG()
will just be a panic?

#ifndef HAVE_ARCH_BUG
#define BUG() do { \
        printk("BUG: failure at %s:%d/%s()!\n", __FILE__, __LINE__, __func__); \
        barrier_before_unreachable(); \
        panic("BUG!"); \
} while (0)
#endif

I assume it is equally difficult to implement OOPS_ON() if the arch lacks
HAVE_ARCH_BUG?

"grep" shows the most mainstream archs have their own HAVE_ARCH_BUG:

$ git grep HAVE_ARCH_BUG
arch/alpha/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/arc/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/arm/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/arm64/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/csky/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/loongarch/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/m68k/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/mips/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/mips/include/asm/bug.h:#define HAVE_ARCH_BUG_ON
arch/parisc/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/powerpc/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/powerpc/include/asm/bug.h:#define HAVE_ARCH_BUG_ON
arch/riscv/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/s390/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/sh/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/sparc/include/asm/bug.h:#define HAVE_ARCH_BUG
arch/x86/include/asm/bug.h:#define HAVE_ARCH_BUG

>
> > > due to panic_on_oops. Which the security people are setting to 1 anyway, and
> > > OOPS_ON would have to observe it too. So AFAICS the only difference from
> > > BUG_ON would be not panicking in the irq context, if panic_on_oops isn't set.
> >
> > right.
> >
> > > (as for "no mm locks held" I think it's already satisfied at the points we
> > > check for __GFP_NOFAIL).
> >
> > Let me summarize the discussion:
> >
> > Patch 1/4, which fixes the misuse of combining gfp_nofail and atomic
> > in the vdpa driver, is necessary.
> > Patch 2/4, which updates the documentation to clarify that
> > non-blockable gfp_nofail is not supported, is needed.
>
> Let's please have those merged now.
>
> > Patch 3/4: We will replace BUG_ON with WARN_ON_ONCE to warn when the
> > size is too large, where gfp_nofail will return NULL.
>
>
> I would pull this one out for a separate discussion. We should really
> define what "too large" really means, and INT_MAX etc. is not it at
> all.

Makes sense.

>
> > Patch 4/4: We will move the order > 1 check from the current fast path
> > to the slow path and extend the check of the gfp_direct_reclaim flag
> > also in the slow path.
>
> OK, let's have that go in now as well.
>
> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Barry Song Aug. 29, 2024, 10:31 p.m. UTC | #49
> > > Patch 4/4: We will move the order > 1 check from the current fast path
> > > to the slow path and extend the check of the gfp_direct_reclaim flag
> > > also in the slow path.
> >
> > OK, let's have that go in now as well.

Hi Michal and Vlastimil,
Could you please review the changes below before I send v4 for patch 4/4?

1. We should consolidate all warnings in one place. Currently, the order > 1 warning is
in the hotpath, while others are in less likely scenarios. Moving all warnings to the
slowpath will reduce the overhead for order > 1 and increase the visibility of other
warnings.

2. We currently have two warnings for order: one for order > 1 in the hotpath and another
for order > costly_order in the rarely reached failure path. I suggest standardizing on order > 1 since
it’s been in use for a long time.

3. I don't think we need to check for __GFP_NOWARN in this case. __GFP_NOWARN is
meant to suppress allocation failure reports, but here we're dealing with bug detection, not
allocation failures.
So I'd rather use WARN_ON_ONCE than WARN_ON_ONCE_GFP.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c81ee5662cc7..0d3dd679d0ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3033,12 +3033,6 @@ struct page *rmqueue(struct zone *preferred_zone,
 {
 	struct page *page;
 
-	/*
-	 * We most definitely don't want callers attempting to
-	 * allocate greater than order-1 page units with __GFP_NOFAIL.
-	 */
-	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
-
 	if (likely(pcp_allowed_order(order))) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
 				       migratetype, alloc_flags);
@@ -4174,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
 {
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
+	bool nofail = gfp_mask & __GFP_DIRECT_RECLAIM;
 	bool can_compact = gfp_compaction_allowed(gfp_mask);
 	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
 	struct page *page = NULL;
@@ -4187,6 +4182,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned int zonelist_iter_cookie;
 	int reserve_flags;
 
+	if (nofail) {
+		/*
+		 * We most definitely don't want callers attempting to
+		 * allocate greater than order-1 page units with __GFP_NOFAIL.
+		 */
+		WARN_ON_ONCE(order > 1);
+		/*
+		 * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
+		 * otherwise we may end up in a lockup.
+		 */
+		WARN_ON_ONCE(!can_direct_reclaim);
+		/*
+		 * PF_MEMALLOC request from this context is rather bizarre
+		 * because we cannot reclaim anything and only can loop waiting
+		 * for somebody to do a work for us.
+		 */
+		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
+	}
+
 restart:
 	compaction_retries = 0;
 	no_progress_loops = 0;
@@ -4404,29 +4418,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
 	 * we always retry
 	 */
-	if (gfp_mask & __GFP_NOFAIL) {
+	if (nofail) {
 		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually require GFP_NOWAIT
+		 * Lacking direct_reclaim we can't do anything to reclaim memory,
+		 * we disregard these unreasonable nofail requests and still
+		 * return NULL
 		 */
-		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
+		if (!can_direct_reclaim)
 			goto fail;
 
-		/*
-		 * PF_MEMALLOC request from this context is rather bizarre
-		 * because we cannot reclaim anything and only can loop waiting
-		 * for somebody to do a work for us
-		 */
-		WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);
-
-		/*
-		 * non failing costly orders are a hard requirement which we
-		 * are not prepared for much so let's warn about these users
-		 * so that we can identify them and convert them to something
-		 * else.
-		 */
-		WARN_ON_ONCE_GFP(costly_order, gfp_mask);
-
 		/*
 		 * Help non-failing allocations by giving some access to memory
 		 * reserves normally used for high priority non-blocking

> >
> > --
> > Michal Hocko
> > SUSE Labs

Thanks
Barry
Michal Hocko Aug. 30, 2024, 7:24 a.m. UTC | #50
On Fri 30-08-24 10:31:14, Barry Song wrote:
> > > > Patch 4/4: We will move the order > 1 check from the current fast path
> > > > to the slow path and extend
> > > >                  the check of gfp_direct_reclaim flag also in the slow path.
> > >
> > > OK, let's have that go in now as well.
> 
> Hi Michal and Vlastimil,
> Could you please review the changes below before I send v4 for patch 4/4?
> 
> 1. We should consolidate all warnings in one place. Currently, the order > 1 warning is
> in the hotpath, while others are in less likely scenarios. Moving all warnings to the
> slowpath will reduce the overhead for order > 1 and increase the visibility of other
> warnings.
> 
> 2. We currently have two warnings for order: one for order > 1 in the hotpath and another
> for order > costly_order in the rarely reached failure path. I suggest standardizing on order > 1 since
> it’s been in use for a long time.
> 
> 3. I don't think we need to check for __GFP_NOWARN in this case. __GFP_NOWARN is
> meant to suppress allocation failure reports, but here we're dealing with bug detection, not
> allocation failures.
> So I'd rather use WARN_ON_ONCE than WARN_ON_ONCE_GFP.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c81ee5662cc7..0d3dd679d0ab 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3033,12 +3033,6 @@ struct page *rmqueue(struct zone *preferred_zone,
>  {
>  	struct page *page;
>  
> -	/*
> -	 * We most definitely don't want callers attempting to
> -	 * allocate greater than order-1 page units with __GFP_NOFAIL.
> -	 */
> -	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> -
>  	if (likely(pcp_allowed_order(order))) {
>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>  				       migratetype, alloc_flags);
> @@ -4174,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  						struct alloc_context *ac)
>  {
>  	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
> +	bool nofail = gfp_mask & __GFP_DIRECT_RECLAIM;
>  	bool can_compact = gfp_compaction_allowed(gfp_mask);
>  	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
>  	struct page *page = NULL;
> @@ -4187,6 +4182,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	unsigned int zonelist_iter_cookie;
>  	int reserve_flags;
>  
> +	if (nofail) {
> +		/*
> +		 * We most definitely don't want callers attempting to
> +		 * allocate greater than order-1 page units with __GFP_NOFAIL.
> +		 */
> +		WARN_ON_ONCE(order > 1);
> +		/*
> +		 * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
> +		 * otherwise we may end up in a lockup.
> +		 */
> +		WARN_ON_ONCE(!can_direct_reclaim);
> +		/*
> +		 * PF_MEMALLOC request from this context is rather bizarre
> +		 * because we cannot reclaim anything and only can loop waiting
> +		 * for somebody to do a work for us.
> +		 */
> +		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
> +	}

Yes, this makes sense. Any reason you have not put that in the nofail
branch below?

> +
>  restart:
>  	compaction_retries = 0;
>  	no_progress_loops = 0;
> @@ -4404,29 +4418,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
>  	 * we always retry
>  	 */
> -	if (gfp_mask & __GFP_NOFAIL) {
> +	if (nofail) {
>  		/*
> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> -		 * of any new users that actually require GFP_NOWAIT
> +		 * Lacking direct_reclaim, we can't do anything to reclaim memory, so
> +		 * we disregard these unreasonable nofail requests and still
> +		 * return NULL
>  		 */
> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> +		if (!can_direct_reclaim)
>  			goto fail;
>  
> -		/*
> -		 * PF_MEMALLOC request from this context is rather bizarre
> -		 * because we cannot reclaim anything and only can loop waiting
> -		 * for somebody to do a work for us
> -		 */
> -		WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);
> -
> -		/*
> -		 * non failing costly orders are a hard requirement which we
> -		 * are not prepared for much so let's warn about these users
> -		 * so that we can identify them and convert them to something
> -		 * else.
> -		 */
> -		WARN_ON_ONCE_GFP(costly_order, gfp_mask);
> -
>  		/*
>  		 * Help non-failing allocations by giving some access to memory
>  		 * reserves normally used for high priority non-blocking
> 
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs
> 
> Thanks
> Barry
Vlastimil Babka Aug. 30, 2024, 7:37 a.m. UTC | #51
On 8/30/24 09:24, Michal Hocko wrote:
> On Fri 30-08-24 10:31:14, Barry Song wrote:
>> > > > Patch 4/4: We will move the order > 1 check from the current fast path
>> > > > to the slow path and extend
>> > > >                  the check of gfp_direct_reclaim flag also in the slow path.
>> > >
>> > > OK, let's have that go in now as well.
>> 
>> Hi Michal and Vlastimil,
>> Could you please review the changes below before I send v4 for patch 4/4?
>> 
>> 1. We should consolidate all warnings in one place. Currently, the order > 1 warning is
>> in the hotpath, while others are in less likely scenarios. Moving all warnings to the
>> slowpath will reduce the overhead for order > 1 and increase the visibility of other
>> warnings.
>> 
>> 2. We currently have two warnings for order: one for order > 1 in the hotpath and another
>> for order > costly_order in the rarely reached failure path. I suggest standardizing on order > 1 since
>> it’s been in use for a long time.
>> 
>> 3. I don't think we need to check for __GFP_NOWARN in this case. __GFP_NOWARN is
>> meant to suppress allocation failure reports, but here we're dealing with bug detection, not
>> allocation failures.

Ack. __GFP_NOWARN is there to suppress warnings when the allocation has a less
expensive fallback to the current attempt, which logically means the current
attempt can't be a __GFP_NOFAIL one. So having both is itself a bug (not
worth reporting), and we can just ignore __GFP_NOWARN.
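In other words, a combination like the one below is self-contradictory by
that logic; the __GFP_NOWARN implies failure is tolerable while
__GFP_NOFAIL claims it is not (illustration only):

/* self-contradictory flag combination per the reasoning above */
p = kmalloc(size, GFP_KERNEL | __GFP_NOFAIL | __GFP_NOWARN);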

>> So I'd rather use WARN_ON_ONCE than WARN_ON_ONCE_GFP.
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c81ee5662cc7..0d3dd679d0ab 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3033,12 +3033,6 @@ struct page *rmqueue(struct zone *preferred_zone,
>>  {
>>  	struct page *page;
>>  
>> -	/*
>> -	 * We most definitely don't want callers attempting to
>> -	 * allocate greater than order-1 page units with __GFP_NOFAIL.
>> -	 */
>> -	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>> -
>>  	if (likely(pcp_allowed_order(order))) {
>>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>>  				       migratetype, alloc_flags);
>> @@ -4174,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  						struct alloc_context *ac)
>>  {
>>  	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
>> +	bool nofail = gfp_mask & __GFP_DIRECT_RECLAIM;

__GFP_NOFAIL

>>  	bool can_compact = gfp_compaction_allowed(gfp_mask);
>>  	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
>>  	struct page *page = NULL;
>> @@ -4187,6 +4182,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  	unsigned int zonelist_iter_cookie;
>>  	int reserve_flags;
>>  
>> +	if (nofail) {

Could add unlikely() to put it off the instruction cache hotpath.

>> +		/*
>> +		 * We most definitely don't want callers attempting to
>> +		 * allocate greater than order-1 page units with __GFP_NOFAIL.
>> +		 */
>> +		WARN_ON_ONCE(order > 1);
>> +		/*
>> +		 * Also we don't support __GFP_NOFAIL without __GFP_DIRECT_RECLAIM,
>> +		 * otherwise we may end up in a lockup.
>> +		 */
>> +		WARN_ON_ONCE(!can_direct_reclaim);
>> +		/*
>> +		 * PF_MEMALLOC request from this context is rather bizarre
>> +		 * because we cannot reclaim anything and only can loop waiting
>> +		 * for somebody to do a work for us.
>> +		 */
>> +		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
>> +	}
> 
> Yes, this makes sense. Any reason you have not put that in the nofail
> branch below?

Because that branch is executed only when we're already so depleted we gave
up retrying, and we want to warn about the buggy users more reliably (see
point 1 above).

>> +
>>  restart:
>>  	compaction_retries = 0;
>>  	no_progress_loops = 0;
>> @@ -4404,29 +4418,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  	 * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
>>  	 * we always retry
>>  	 */
>> -	if (gfp_mask & __GFP_NOFAIL) {
>> +	if (nofail) {
>>  		/*
>> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
>> -		 * of any new users that actually require GFP_NOWAIT
>> +		 * Lacking direct_reclaim, we can't do anything to reclaim memory, so
>> +		 * we disregard these unreasonable nofail requests and still
>> +		 * return NULL
>>  		 */
>> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
>> +		if (!can_direct_reclaim)
>>  			goto fail;
>>  
>> -		/*
>> -		 * PF_MEMALLOC request from this context is rather bizarre
>> -		 * because we cannot reclaim anything and only can loop waiting
>> -		 * for somebody to do a work for us
>> -		 */
>> -		WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);
>> -
>> -		/*
>> -		 * non failing costly orders are a hard requirement which we
>> -		 * are not prepared for much so let's warn about these users
>> -		 * so that we can identify them and convert them to something
>> -		 * else.
>> -		 */
>> -		WARN_ON_ONCE_GFP(costly_order, gfp_mask);
>> -
>>  		/*
>>  		 * Help non-failing allocations by giving some access to memory
>>  		 * reserves normally used for high priority non-blocking
>> 
>> > >
>> > > --
>> > > Michal Hocko
>> > > SUSE Labs
>> 
>> Thanks
>> Barry
>