diff mbox series

[v3,4/4] mm: prohibit NULL deference exposed for unsupported non-blockable __GFP_NOFAIL

Message ID 20240817062449.21164-5-21cnbao@gmail.com (mailing list archive)
State New
Headers show
Series mm: clarify nofail memory allocation | expand

Commit Message

Barry Song Aug. 17, 2024, 6:24 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

When users allocate memory with the __GFP_NOFAIL flag, they might
incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
attempt and still fail to allocate memory for these users, we have two
choices:

    1. We could busy-loop and hope that some other direct reclamation or
    kswapd rescues the current process. However, this is unreliable
    and could ultimately lead to hard or soft lockups, which might not
    be well supported by some architectures.

    2. We could use BUG_ON to trigger a reliable system crash, avoiding
    exposing NULL dereference.

Neither option is ideal, but both are improvements over the existing code.
This patch selects the second option because, with the introduction of
scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
nofail users(which is in my plan), non-blockable nofail allocations will
no longer be possible.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <kees@kernel.org>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: Hailong.Liu <hailong.liu@oppo.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 mm/page_alloc.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Yafang Shao Aug. 18, 2024, 2:55 a.m. UTC | #1
On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> When users allocate memory with the __GFP_NOFAIL flag, they might
> incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> attempt and still fail to allocate memory for these users, we have two
> choices:
>
>     1. We could busy-loop and hope that some other direct reclamation or
>     kswapd rescues the current process. However, this is unreliable
>     and could ultimately lead to hard or soft lockups,

That can occur even if we set both __GFP_NOFAIL and
__GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
design of __GFP_NOFAIL itself.

> which might not
>     be well supported by some architectures.
>
>     2. We could use BUG_ON to trigger a reliable system crash, avoiding
>     exposing NULL dereference.
>
> Neither option is ideal, but both are improvements over the existing code.
> This patch selects the second option because, with the introduction of
> scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> nofail users(which is in my plan), non-blockable nofail allocations will
> no longer be possible.
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> Cc: Hailong.Liu <hailong.liu@oppo.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>  mm/page_alloc.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d2c37f8f8d09..fb5850ecd3ae 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>          */
>         if (gfp_mask & __GFP_NOFAIL) {
>                 /*
> -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> -                * of any new users that actually require GFP_NOWAIT
> +                * All existing users of the __GFP_NOFAIL are blockable
> +                * otherwise we introduce a busy loop with inside the page
> +                * allocator from non-sleepable contexts
>                  */
> -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -                       goto fail;
> +               BUG_ON(!can_direct_reclaim);

I'm not in favor of using BUG_ON() here, as many call sites already
handle the return value of __GFP_NOFAIL.

If we believe BUG_ON() is necessary, why not place it at the beginning
of __alloc_pages_slowpath() instead of after numerous operations,
which could potentially obscure the issue?


>
>                 /*
>                  * PF_MEMALLOC request from this context is rather bizarre
> @@ -4434,7 +4434,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                 cond_resched();
>                 goto retry;
>         }
> -fail:
> +
>         warn_alloc(gfp_mask, ac->nodemask,
>                         "page allocation failure: order:%u", order);
>  got_pg:
> --
> 2.39.3 (Apple Git-146)
>
>


--
Regards
Yafang
Barry Song Aug. 18, 2024, 3:48 a.m. UTC | #2
On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > When users allocate memory with the __GFP_NOFAIL flag, they might
> > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > attempt and still fail to allocate memory for these users, we have two
> > choices:
> >
> >     1. We could busy-loop and hope that some other direct reclamation or
> >     kswapd rescues the current process. However, this is unreliable
> >     and could ultimately lead to hard or soft lockups,
>
> That can occur even if we set both __GFP_NOFAIL and
> __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> design of __GFP_NOFAIL itself.

the point of GFP_NOFAIL is that it won't fail and its user won't check
the return value. without direct_reclamation, it is sometimes impossible.
but with direct reclamation, users constantly wait and finally they can
get memory. if you read the doc of __GFP_NOFAIL you will find it.
it is absolutely clearly documented.

>
> > which might not
> >     be well supported by some architectures.
> >
> >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> >     exposing NULL dereference.
> >
> > Neither option is ideal, but both are improvements over the existing code.
> > This patch selects the second option because, with the introduction of
> > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > nofail users(which is in my plan), non-blockable nofail allocations will
> > no longer be possible.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Kees Cook <kees@kernel.org>
> > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> >  mm/page_alloc.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d2c37f8f8d09..fb5850ecd3ae 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >          */
> >         if (gfp_mask & __GFP_NOFAIL) {
> >                 /*
> > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > -                * of any new users that actually require GFP_NOWAIT
> > +                * All existing users of the __GFP_NOFAIL are blockable
> > +                * otherwise we introduce a busy loop with inside the page
> > +                * allocator from non-sleepable contexts
> >                  */
> > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > -                       goto fail;
> > +               BUG_ON(!can_direct_reclaim);
>
> I'm not in favor of using BUG_ON() here, as many call sites already
> handle the return value of __GFP_NOFAIL.
>

it is not correct to handle the return value of __GFP_NOFAIL.
if you check the ret, don't use __GFP_NOFAIL.

> If we believe BUG_ON() is necessary, why not place it at the beginning
> of __alloc_pages_slowpath() instead of after numerous operations,
> which could potentially obscure the issue?

to some extent I agree with you. but the point here is that we might
want to avoid this check in the hot path. so basically, we check when
we have to check. in 99%+ case, this check can be avoided.

>
>
> >
> >                 /*
> >                  * PF_MEMALLOC request from this context is rather bizarre
> > @@ -4434,7 +4434,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >                 cond_resched();
> >                 goto retry;
> >         }
> > -fail:
> > +
> >         warn_alloc(gfp_mask, ac->nodemask,
> >                         "page allocation failure: order:%u", order);
> >  got_pg:
> > --
> > 2.39.3 (Apple Git-146)
> >
> >
>
>
> --
> Regards
> Yafang
Yafang Shao Aug. 18, 2024, 5:51 a.m. UTC | #3
On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > attempt and still fail to allocate memory for these users, we have two
> > > choices:
> > >
> > >     1. We could busy-loop and hope that some other direct reclamation or
> > >     kswapd rescues the current process. However, this is unreliable
> > >     and could ultimately lead to hard or soft lockups,
> >
> > That can occur even if we set both __GFP_NOFAIL and
> > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > design of __GFP_NOFAIL itself.
>
> the point of GFP_NOFAIL is that it won't fail and its user won't check
> the return value. without direct_reclamation, it is sometimes impossible.
> but with direct reclamation, users constantly wait and finally they can

So, what exactly is the difference between 'constantly waiting' and
'busy looping'? Could you please clarify? Also, why can't we
'constantly wait' when __GFP_DIRECT_RECLAIM is not set?

> get memory. if you read the doc of __GFP_NOFAIL you will find it.
> it is absolutely clearly documented.
>
> >
> > > which might not
> > >     be well supported by some architectures.
> > >
> > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > >     exposing NULL dereference.
> > >
> > > Neither option is ideal, but both are improvements over the existing code.
> > > This patch selects the second option because, with the introduction of
> > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > no longer be possible.
> > >
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > Cc: Christoph Hellwig <hch@infradead.org>
> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Cc: Christoph Lameter <cl@linux.com>
> > > Cc: Pekka Enberg <penberg@kernel.org>
> > > Cc: David Rientjes <rientjes@google.com>
> > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > Cc: Kees Cook <kees@kernel.org>
> > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > > Cc: Jason Wang <jasowang@redhat.com>
> > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > ---
> > >  mm/page_alloc.c | 10 +++++-----
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > >          */
> > >         if (gfp_mask & __GFP_NOFAIL) {
> > >                 /*
> > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > -                * of any new users that actually require GFP_NOWAIT
> > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > +                * otherwise we introduce a busy loop with inside the page
> > > +                * allocator from non-sleepable contexts
> > >                  */
> > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > -                       goto fail;
> > > +               BUG_ON(!can_direct_reclaim);
> >
> > I'm not in favor of using BUG_ON() here, as many call sites already
> > handle the return value of __GFP_NOFAIL.
> >
>
> it is not correct to handle the return value of __GFP_NOFAIL.
> if you check the ret, don't use __GFP_NOFAIL.

If so, you have many code changes to make in the linux kernel ;)

>
> > If we believe BUG_ON() is necessary, why not place it at the beginning
> > of __alloc_pages_slowpath() instead of after numerous operations,
> > which could potentially obscure the issue?
>
> to some extent I agree with you. but the point here is that we might
> want to avoid this check in the hot path. so basically, we check when
> we have to check. in 99%+ case, this check can be avoided.

It's on the slow path, but that's not the main point here.

--
Regards
Yafang
Barry Song Aug. 18, 2024, 6:27 a.m. UTC | #4
On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > attempt and still fail to allocate memory for these users, we have two
> > > > choices:
> > > >
> > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > >     kswapd rescues the current process. However, this is unreliable
> > > >     and could ultimately lead to hard or soft lockups,
> > >
> > > That can occur even if we set both __GFP_NOFAIL and
> > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > design of __GFP_NOFAIL itself.
> >
> > the point of GFP_NOFAIL is that it won't fail and its user won't check
> > the return value. without direct_reclamation, it is sometimes impossible.
> > but with direct reclamation, users constantly wait and finally they can
>
> So, what exactly is the difference between 'constantly waiting' and
> 'busy looping'? Could you please clarify? Also, why can't we
> 'constantly wait' when __GFP_DIRECT_RECLAIM is not set?

I list two options in changelog
1: busy loop 2. bug_on. I am actually fine with either one. either one is
better than the existing code. but returning null in the current code
is definitely wrong.

1 somehow has the attempt to make __GFP_NOFAIL without direct_reclamation
legal. so it is a bit suspicious going in the wrong direction.

busy-loop is that you are not reclaiming memory you are not sleeping.
cpu is constantly working and busy, so it might result in a lockup, either
soft lockup or hard lockup.

with direct_reclamation, wait is the case you can sleep. it is not holding
cpu, not a busy loop. in rare case, users might end in endless wait,
but it matches the doc of __GFP_NOFAIL, never return till memory
is gotten (the current code is implemented in this way unless users
incorrectly combine __GFP_NOFAIL with aotmic/nowait etc.)

note, long-term we won't expose __GFP_NOFAIL any more. we
will only expose GFP_NOFAIL which enforces Blockable.  I am
quite busy on other issues, so this won't happen in a short time.

>
> > get memory. if you read the doc of __GFP_NOFAIL you will find it.
> > it is absolutely clearly documented.
> >
> > >
> > > > which might not
> > > >     be well supported by some architectures.
> > > >
> > > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > > >     exposing NULL dereference.
> > > >
> > > > Neither option is ideal, but both are improvements over the existing code.
> > > > This patch selects the second option because, with the introduction of
> > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > > no longer be possible.
> > > >
> > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > Cc: Christoph Lameter <cl@linux.com>
> > > > Cc: Pekka Enberg <penberg@kernel.org>
> > > > Cc: David Rientjes <rientjes@google.com>
> > > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > Cc: Kees Cook <kees@kernel.org>
> > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > ---
> > > >  mm/page_alloc.c | 10 +++++-----
> > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > >          */
> > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > >                 /*
> > > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > -                * of any new users that actually require GFP_NOWAIT
> > > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > > +                * otherwise we introduce a busy loop with inside the page
> > > > +                * allocator from non-sleepable contexts
> > > >                  */
> > > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > -                       goto fail;
> > > > +               BUG_ON(!can_direct_reclaim);
> > >
> > > I'm not in favor of using BUG_ON() here, as many call sites already
> > > handle the return value of __GFP_NOFAIL.
> > >
> >
> > it is not correct to handle the return value of __GFP_NOFAIL.
> > if you check the ret, don't use __GFP_NOFAIL.
>
> If so, you have many code changes to make in the linux kernel ;)
>

Please list those code using __GFP_NOFAIL and check the result
might fail, we should get them fixed. This is insane. NOFAIL means
no fail.

> >
> > > If we believe BUG_ON() is necessary, why not place it at the beginning
> > > of __alloc_pages_slowpath() instead of after numerous operations,
> > > which could potentially obscure the issue?
> >
> > to some extent I agree with you. but the point here is that we might
> > want to avoid this check in the hot path. so basically, we check when
> > we have to check. in 99%+ case, this check can be avoided.
>
> It's on the slow path, but that's not the main point here.

I actually recommended the approach, we can do an earlier check in the hotpath.
somehow, in the previous discussion, people didn't like it.

>
> --
> Regards
> Yafang

Thanks
barry
Barry Song Aug. 18, 2024, 6:45 a.m. UTC | #5
On Sun, Aug 18, 2024 at 6:27 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > choices:
> > > > >
> > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > >     kswapd rescues the current process. However, this is unreliable
> > > > >     and could ultimately lead to hard or soft lockups,
> > > >
> > > > That can occur even if we set both __GFP_NOFAIL and
> > > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > design of __GFP_NOFAIL itself.
> > >
> > > the point of GFP_NOFAIL is that it won't fail and its user won't check
> > > the return value. without direct_reclamation, it is sometimes impossible.
> > > but with direct reclamation, users constantly wait and finally they can
> >
> > So, what exactly is the difference between 'constantly waiting' and
> > 'busy looping'? Could you please clarify? Also, why can't we
> > 'constantly wait' when __GFP_DIRECT_RECLAIM is not set?
>
> I list two options in changelog
> 1: busy loop 2. bug_on. I am actually fine with either one. either one is
> better than the existing code. but returning null in the current code
> is definitely wrong.
>
> 1 somehow has the attempt to make __GFP_NOFAIL without direct_reclamation
> legal. so it is a bit suspicious going in the wrong direction.
>
> busy-loop is that you are not reclaiming memory you are not sleeping.
> cpu is constantly working and busy, so it might result in a lockup, either
> soft lockup or hard lockup.
>
> with direct_reclamation, wait is the case you can sleep. it is not holding
> cpu, not a busy loop. in rare case, users might end in endless wait,
> but it matches the doc of __GFP_NOFAIL, never return till memory
> is gotten (the current code is implemented in this way unless users
> incorrectly combine __GFP_NOFAIL with aotmic/nowait etc.)
>

and the essential difference between "w/ and w/o direct_reclaim": with
direct reclaim, the user is actively reclaiming memory to rescue itself
by all kinds of possible ways(compact, oom, reclamation), while without
direct reclamation, it can do nothing and just loop (busy-loop).

> note, long-term we won't expose __GFP_NOFAIL any more. we
> will only expose GFP_NOFAIL which enforces Blockable.  I am
> quite busy on other issues, so this won't happen in a short time.
>
> >
> > > get memory. if you read the doc of __GFP_NOFAIL you will find it.
> > > it is absolutely clearly documented.
> > >
> > > >
> > > > > which might not
> > > > >     be well supported by some architectures.
> > > > >
> > > > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > > > >     exposing NULL dereference.
> > > > >
> > > > > Neither option is ideal, but both are improvements over the existing code.
> > > > > This patch selects the second option because, with the introduction of
> > > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > > > no longer be possible.
> > > > >
> > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > Cc: Christoph Lameter <cl@linux.com>
> > > > > Cc: Pekka Enberg <penberg@kernel.org>
> > > > > Cc: David Rientjes <rientjes@google.com>
> > > > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > > Cc: Kees Cook <kees@kernel.org>
> > > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > ---
> > > > >  mm/page_alloc.c | 10 +++++-----
> > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > >          */
> > > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > > >                 /*
> > > > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > > -                * of any new users that actually require GFP_NOWAIT
> > > > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > > > +                * otherwise we introduce a busy loop with inside the page
> > > > > +                * allocator from non-sleepable contexts
> > > > >                  */
> > > > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > > -                       goto fail;
> > > > > +               BUG_ON(!can_direct_reclaim);
> > > >
> > > > I'm not in favor of using BUG_ON() here, as many call sites already
> > > > handle the return value of __GFP_NOFAIL.
> > > >
> > >
> > > it is not correct to handle the return value of __GFP_NOFAIL.
> > > if you check the ret, don't use __GFP_NOFAIL.
> >
> > If so, you have many code changes to make in the linux kernel ;)
> >
>
> Please list those code using __GFP_NOFAIL and check the result
> might fail, we should get them fixed. This is insane. NOFAIL means
> no fail.
>
> > >
> > > > If we believe BUG_ON() is necessary, why not place it at the beginning
> > > > of __alloc_pages_slowpath() instead of after numerous operations,
> > > > which could potentially obscure the issue?
> > >
> > > to some extent I agree with you. but the point here is that we might
> > > want to avoid this check in the hot path. so basically, we check when
> > > we have to check. in 99%+ case, this check can be avoided.
> >
> > It's on the slow path, but that's not the main point here.
>
> I actually recommended the approach, we can do an earlier check in the hotpath.
> somehow, in the previous discussion, people didn't like it.
>
> >
> > --
> > Regards
> > Yafang
>
> Thanks
> barry
Yafang Shao Aug. 18, 2024, 7:07 a.m. UTC | #6
On Sun, Aug 18, 2024 at 2:45 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sun, Aug 18, 2024 at 6:27 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > >
> > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > choices:
> > > > > >
> > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > >     and could ultimately lead to hard or soft lockups,
> > > > >
> > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > > design of __GFP_NOFAIL itself.
> > > >
> > > > the point of GFP_NOFAIL is that it won't fail and its user won't check
> > > > the return value. without direct_reclamation, it is sometimes impossible.
> > > > but with direct reclamation, users constantly wait and finally they can
> > >
> > > So, what exactly is the difference between 'constantly waiting' and
> > > 'busy looping'? Could you please clarify? Also, why can't we
> > > 'constantly wait' when __GFP_DIRECT_RECLAIM is not set?
> >
> > I list two options in changelog
> > 1: busy loop 2. bug_on. I am actually fine with either one. either one is
> > better than the existing code. but returning null in the current code
> > is definitely wrong.
> >
> > 1 somehow has the attempt to make __GFP_NOFAIL without direct_reclamation
> > legal. so it is a bit suspicious going in the wrong direction.
> >
> > busy-loop is that you are not reclaiming memory you are not sleeping.
> > cpu is constantly working and busy, so it might result in a lockup, either
> > soft lockup or hard lockup.

Thanks for the clarification.
That can be avoided by a simple cond_resched() if the hard lockup or
softlockup is the main issue ;)

> >
> > with direct_reclamation, wait is the case you can sleep. it is not holding
> > cpu, not a busy loop. in rare case, users might end in endless wait,
> > but it matches the doc of __GFP_NOFAIL, never return till memory
> > is gotten (the current code is implemented in this way unless users
> > incorrectly combine __GFP_NOFAIL with aotmic/nowait etc.)
> >
>
> and the essential difference between "w/ and w/o direct_reclaim": with
> direct reclaim, the user is actively reclaiming memory to rescue itself
> by all kinds of possible ways(compact, oom, reclamation), while without
> direct reclamation, it can do nothing and just loop (busy-loop).

It can wake up kswapd, which can then reclaim memory. If kswapd can't
keep up, the system is likely under heavy memory pressure. In such a
case, it makes little difference whether __GFP_DIRECT_RECLAIM is set
or not. For reference, see the old issue:
https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/.

I believe the core issue persists, and the design of __GFP_NOFAIL
exacerbates it.

By the way, I believe we could trigger an asynchronous OOM kill in the
case without direct reclaim to avoid busy looping.

>
> > note, long-term we won't expose __GFP_NOFAIL any more. we
> > will only expose GFP_NOFAIL which enforces Blockable.  I am
> > quite busy on other issues, so this won't happen in a short time.
> >
> > >
> > > > get memory. if you read the doc of __GFP_NOFAIL you will find it.
> > > > it is absolutely clearly documented.
> > > >
> > > > >
> > > > > > which might not
> > > > > >     be well supported by some architectures.
> > > > > >
> > > > > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > > > > >     exposing NULL dereference.
> > > > > >
> > > > > > Neither option is ideal, but both are improvements over the existing code.
> > > > > > This patch selects the second option because, with the introduction of
> > > > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > > > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > > > > no longer be possible.
> > > > > >
> > > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > Cc: Christoph Lameter <cl@linux.com>
> > > > > > Cc: Pekka Enberg <penberg@kernel.org>
> > > > > > Cc: David Rientjes <rientjes@google.com>
> > > > > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > > > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > > > Cc: Kees Cook <kees@kernel.org>
> > > > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > > > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > > > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > ---
> > > > > >  mm/page_alloc.c | 10 +++++-----
> > > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > > > > --- a/mm/page_alloc.c
> > > > > > +++ b/mm/page_alloc.c
> > > > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > > >          */
> > > > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > > > >                 /*
> > > > > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > > > -                * of any new users that actually require GFP_NOWAIT
> > > > > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > > > > +                * otherwise we introduce a busy loop with inside the page
> > > > > > +                * allocator from non-sleepable contexts
> > > > > >                  */
> > > > > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > > > -                       goto fail;
> > > > > > +               BUG_ON(!can_direct_reclaim);
> > > > >
> > > > > I'm not in favor of using BUG_ON() here, as many call sites already
> > > > > handle the return value of __GFP_NOFAIL.
> > > > >
> > > >
> > > > it is not correct to handle the return value of __GFP_NOFAIL.
> > > > if you check the ret, don't use __GFP_NOFAIL.
> > >
> > > If so, you have many code changes to make in the linux kernel ;)
> > >
> >
> > Please list those code using __GFP_NOFAIL and check the result
> > might fail, we should get them fixed. This is insane. NOFAIL means
> > no fail.

You can find some instances with grep commands, but there's no
reliable way to capture them all with a single command. Here are a few
examples:

                // drivers/infiniband/hw/cxgb4/mem.c
                skb = alloc_skb(wr_len, GFP_KERNEL | __GFP_NOFAIL);
                if (!skb)
                        return -ENOMEM;

        // fs/xfs/libxfs/xfs_dir2.c
        args = kzalloc(sizeof(*args), GFP_KERNEL | __GFP_NOFAIL);
        if (!args)
                return -ENOMEM;


--
Regards
Yafang
Barry Song Aug. 18, 2024, 7:25 a.m. UTC | #7
On Sun, Aug 18, 2024 at 7:07 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Sun, Aug 18, 2024 at 2:45 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sun, Aug 18, 2024 at 6:27 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sun, Aug 18, 2024 at 5:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Sun, Aug 18, 2024 at 11:48 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sun, Aug 18, 2024 at 2:55 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > >
> > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > > choices:
> > > > > > >
> > > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > > >     and could ultimately lead to hard or soft lockups,
> > > > > >
> > > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > > __GFP_DIRECT_RECLAIM, right? So, I don't believe the issue is related
> > > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > > > design of __GFP_NOFAIL itself.
> > > > >
> > > > > the point of GFP_NOFAIL is that it won't fail and its user won't check
> > > > > the return value. without direct_reclamation, it is sometimes impossible.
> > > > > but with direct reclamation, users constantly wait and finally they can
> > > >
> > > > So, what exactly is the difference between 'constantly waiting' and
> > > > 'busy looping'? Could you please clarify? Also, why can't we
> > > > 'constantly wait' when __GFP_DIRECT_RECLAIM is not set?
> > >
> > > I list two options in changelog
> > > 1: busy loop 2. bug_on. I am actually fine with either one. either one is
> > > better than the existing code. but returning null in the current code
> > > is definitely wrong.
> > >
> > > 1 somehow has the attempt to make __GFP_NOFAIL without direct_reclamation
> > > legal. so it is a bit suspicious going in the wrong direction.
> > >
> > > busy-loop is that you are not reclaiming memory you are not sleeping.
> > > cpu is constantly working and busy, so it might result in a lockup, either
> > > soft lockup or hard lockup.
>
> Thanks for the clarification.
> That can be avoided by a simple cond_resched() if the hard lockup or
> softlockup is the main issue ;)

Is your point to legitimize the combination of gfp_nofail and non-lockable?
I don't think you can simply fix the lockup. A direct example is
single-core, and preemptible kernel. you are calling __gfp_nofail without
direct reclamation in a spinlock,  cond_resched() will do nothing.

it doesn't have to be a single core, for example on phones, cpu hotplug is
used to save power. Sometimes, phones unplug lots of cpus to save
power. and even we are multiple cores, we have cpuset cgroups things
which can prevent async oom from running on other cores.

>
> > >
> > > with direct_reclamation, wait is the case you can sleep. it is not holding
> > > cpu, not a busy loop. in rare case, users might end in endless wait,
> > > but it matches the doc of __GFP_NOFAIL, never return till memory
> > > is gotten (the current code is implemented in this way unless users
> > > incorrectly combine __GFP_NOFAIL with aotmic/nowait etc.)
> > >
> >
> > and the essential difference between "w/ and w/o direct_reclaim": with
> > direct reclaim, the user is actively reclaiming memory to rescue itself
> > by all kinds of possible ways(compact, oom, reclamation), while without
> > direct reclamation, it can do nothing and just loop (busy-loop).
>
> It can wake up kswapd, which can then reclaim memory. If kswapd can't
> keep up, the system is likely under heavy memory pressure. In such a
> case, it makes little difference whether __GFP_DIRECT_RECLAIM is set
> or not. For reference, see the old issue:
> https://lore.kernel.org/lkml/d9802b6a-949b-b327-c4a6-3dbca485ec20@gmx.com/.
>
> I believe the core issue persists, and the design of __GFP_NOFAIL
> exacerbates it.
>
> By the way, I believe we could trigger an asynchronous OOM kill in the
> case without direct reclaim to avoid busy looping.

This async oom is sometimes impossible with the same reason as
the above example.

>
> >
> > > note, long-term we won't expose __GFP_NOFAIL any more. we
> > > will only expose GFP_NOFAIL which enforces Blockable.  I am
> > > quite busy on other issues, so this won't happen in a short time.
> > >
> > > >
> > > > > get memory. if you read the doc of __GFP_NOFAIL you will find it.
> > > > > it is absolutely clearly documented.
> > > > >
> > > > > >
> > > > > > > which might not
> > > > > > >     be well supported by some architectures.
> > > > > > >
> > > > > > >     2. We could use BUG_ON to trigger a reliable system crash, avoiding
> > > > > > >     exposing NULL dereference.
> > > > > > >
> > > > > > > Neither option is ideal, but both are improvements over the existing code.
> > > > > > > This patch selects the second option because, with the introduction of
> > > > > > > scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> > > > > > > nofail users(which is in my plan), non-blockable nofail allocations will
> > > > > > > no longer be possible.
> > > > > > >
> > > > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > > > > > > Cc: Christoph Hellwig <hch@infradead.org>
> > > > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > > > Cc: Christoph Lameter <cl@linux.com>
> > > > > > > Cc: Pekka Enberg <penberg@kernel.org>
> > > > > > > Cc: David Rientjes <rientjes@google.com>
> > > > > > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > > > > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > > > > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > > > > > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > > > > > > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > > > > > > Cc: Kees Cook <kees@kernel.org>
> > > > > > > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > > > > > > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > > > > > > Cc: Jason Wang <jasowang@redhat.com>
> > > > > > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > > > > > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > > > > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > > > > ---
> > > > > > >  mm/page_alloc.c | 10 +++++-----
> > > > > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > > > >
> > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > > index d2c37f8f8d09..fb5850ecd3ae 100644
> > > > > > > --- a/mm/page_alloc.c
> > > > > > > +++ b/mm/page_alloc.c
> > > > > > > @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > > > >          */
> > > > > > >         if (gfp_mask & __GFP_NOFAIL) {
> > > > > > >                 /*
> > > > > > > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > > > > > > -                * of any new users that actually require GFP_NOWAIT
> > > > > > > +                * All existing users of the __GFP_NOFAIL are blockable
> > > > > > > +                * otherwise we introduce a busy loop with inside the page
> > > > > > > +                * allocator from non-sleepable contexts
> > > > > > >                  */
> > > > > > > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > > > > > > -                       goto fail;
> > > > > > > +               BUG_ON(!can_direct_reclaim);
> > > > > >
> > > > > > I'm not in favor of using BUG_ON() here, as many call sites already
> > > > > > handle the return value of __GFP_NOFAIL.
> > > > > >
> > > > >
> > > > > it is not correct to handle the return value of __GFP_NOFAIL.
> > > > > if you check the ret, don't use __GFP_NOFAIL.
> > > >
> > > > If so, you have many code changes to make in the linux kernel ;)
> > > >
> > >
> > > Please list those code using __GFP_NOFAIL and check the result
> > > might fail, we should get them fixed. This is insane. NOFAIL means
> > > no fail.
>
> You can find some instances with grep commands, but there's no
> reliable way to capture them all with a single command. Here are a few
> examples:
>
>                 // drivers/infiniband/hw/cxgb4/mem.c
>                 skb = alloc_skb(wr_len, GFP_KERNEL | __GFP_NOFAIL);
>                 if (!skb)
>                         return -ENOMEM;
>

Thanks for pointing out this wrong code.

This is absolutely wrong. with GFP_KERNEL | __GFP_NOFAIL,
mm-core is doing endless retry till direct reclamation gives
memory to itself. there is no possibility to return NULL.

the only possibility is __GFP_NOFAIL | GFP_ATOMIC(NOWAIT),
mm-core is returning NULL incorrectly. actually mm-core just ignores
this __GFP_NOFAIL in this case. with GFP_FAIL enforcing
direct_reclamation done, the wrong case __GFP_NOFAIL |
GFP_ATOMIC(NOWAIT) will no longer exist.

>         // fs/xfs/libxfs/xfs_dir2.c
>         args = kzalloc(sizeof(*args), GFP_KERNEL | __GFP_NOFAIL);
>         if (!args)
>                 return -ENOMEM;
>
>
> --
> Regards
> Yafang
Michal Hocko Aug. 19, 2024, 7:50 a.m. UTC | #8
On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > When users allocate memory with the __GFP_NOFAIL flag, they might
> > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > attempt and still fail to allocate memory for these users, we have two
> > choices:
> >
> >     1. We could busy-loop and hope that some other direct reclamation or
> >     kswapd rescues the current process. However, this is unreliable
> >     and could ultimately lead to hard or soft lockups,
> 
> That can occur even if we set both __GFP_NOFAIL and
> __GFP_DIRECT_RECLAIM, right?

No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
time to satisfy the allocation but it will reclaim to get the memory, it
will sleep if necessary and it will will trigger OOM killer if there is
no other option. __GFP_DIRECT_RECLAIM is a completely different story
than without it which means _no_sleeping_ is allowed and therefore only
a busy loop waiting for the allocation to proceed is allowed.

> So, I don't believe the issue is related
> to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> design of __GFP_NOFAIL itself.

Care to elaborate?
Michal Hocko Aug. 19, 2024, 7:51 a.m. UTC | #9
On Sun 18-08-24 15:07:00, Yafang Shao wrote:
> That can be avoided by a simple cond_resched() if the hard lockup or
> softlockup is the main issue ;)

No, it cannot! cond_resched from withing an atomic context is no-no. And NOWAIT
is used from withing atomic contexts.
Yafang Shao Aug. 19, 2024, 9:25 a.m. UTC | #10
On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > attempt and still fail to allocate memory for these users, we have two
> > > choices:
> > >
> > >     1. We could busy-loop and hope that some other direct reclamation or
> > >     kswapd rescues the current process. However, this is unreliable
> > >     and could ultimately lead to hard or soft lockups,
> >
> > That can occur even if we set both __GFP_NOFAIL and
> > __GFP_DIRECT_RECLAIM, right?
>
> No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> time to satisfy the allocation but it will reclaim to get the memory, it
> will sleep if necessary and it will will trigger OOM killer if there is
> no other option. __GFP_DIRECT_RECLAIM is a completely different story
> than without it which means _no_sleeping_ is allowed and therefore only
> a busy loop waiting for the allocation to proceed is allowed.

That could be a livelock.
From the user's perspective, there's no noticeable difference between
a livelock, soft lockup, or hard lockup.

>
> > So, I don't believe the issue is related
> > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > design of __GFP_NOFAIL itself.
>
> Care to elaborate?

I've read the documentation explaining why the busy loop is embedded
within the page allocation process instead of letting users implement
it based on their needs. However, the complexity and numerous issues
suggest that this design might be fundamentally flawed.

--
Regards
Yafang
Barry Song Aug. 19, 2024, 9:39 a.m. UTC | #11
On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > attempt and still fail to allocate memory for these users, we have two
> > > > choices:
> > > >
> > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > >     kswapd rescues the current process. However, this is unreliable
> > > >     and could ultimately lead to hard or soft lockups,
> > >
> > > That can occur even if we set both __GFP_NOFAIL and
> > > __GFP_DIRECT_RECLAIM, right?
> >
> > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > time to satisfy the allocation but it will reclaim to get the memory, it
> > will sleep if necessary and it will will trigger OOM killer if there is
> > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > than without it which means _no_sleeping_ is allowed and therefore only
> > a busy loop waiting for the allocation to proceed is allowed.
>
> That could be a livelock.
> From the user's perspective, there's no noticeable difference between
> a livelock, soft lockup, or hard lockup.

This is certainly different. A lockup occurs when tasks can't be scheduled,
causing the entire system to stop functioning.

>
> >
> > > So, I don't believe the issue is related
> > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > design of __GFP_NOFAIL itself.
> >
> > Care to elaborate?
>
> I've read the documentation explaining why the busy loop is embedded
> within the page allocation process instead of letting users implement
> it based on their needs. However, the complexity and numerous issues
> suggest that this design might be fundamentally flawed.

I don't see "numerous issues", only two issues:

1. allocation size overflow with __GFP_NOFAIL
2. unsupported case: __GFP_NOWAIT/ATOMIC | __GFP_NOFAIL.

for 1, it has been a BUG to require an overflowed size to always succeed.

for 2,  it is an unsupported case. we just need to hide __GFP_NOFAIL
and only expose GFP_NOFAIL(which definitely includes blockable) so
any unsupported case like vdpa will no longer occur.  I would greatly
appreciate it if you or someone else could take over this task, as I am
currently extremely busy.

>
> --
> Regards
> Yafang

Thanks
Barry
David Hildenbrand Aug. 19, 2024, 9:44 a.m. UTC | #12
On 17.08.24 08:24, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> When users allocate memory with the __GFP_NOFAIL flag, they might
> incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> attempt and still fail to allocate memory for these users, we have two
> choices:
> 
>      1. We could busy-loop and hope that some other direct reclamation or
>      kswapd rescues the current process. However, this is unreliable
>      and could ultimately lead to hard or soft lockups, which might not
>      be well supported by some architectures.
> 
>      2. We could use BUG_ON to trigger a reliable system crash, avoiding
>      exposing NULL dereference.
> 
> Neither option is ideal, but both are improvements over the existing code.
> This patch selects the second option because, with the introduction of
> scoped API and GFP_NOFAIL—capable of enforcing direct reclamation for
> nofail users(which is in my plan), non-blockable nofail allocations will
> no longer be possible.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> Cc: Hailong.Liu <hailong.liu@oppo.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>   mm/page_alloc.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d2c37f8f8d09..fb5850ecd3ae 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4399,11 +4399,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>   	 */
>   	if (gfp_mask & __GFP_NOFAIL) {
>   		/*
> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> -		 * of any new users that actually require GFP_NOWAIT
> +		 * All existing users of the __GFP_NOFAIL are blockable
> +		 * otherwise we introduce a busy loop with inside the page
> +		 * allocator from non-sleepable contexts
>   		 */
> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -			goto fail;
> +		BUG_ON(!can_direct_reclaim);

No new BUG_ON(), WARN_ON_ONCE() is good enough for something that should 
be found during ordinary testing.
Yafang Shao Aug. 19, 2024, 9:45 a.m. UTC | #13
On Mon, Aug 19, 2024 at 5:39 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > choices:
> > > > >
> > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > >     kswapd rescues the current process. However, this is unreliable
> > > > >     and could ultimately lead to hard or soft lockups,
> > > >
> > > > That can occur even if we set both __GFP_NOFAIL and
> > > > __GFP_DIRECT_RECLAIM, right?
> > >
> > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > will sleep if necessary and it will will trigger OOM killer if there is
> > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > than without it which means _no_sleeping_ is allowed and therefore only
> > > a busy loop waiting for the allocation to proceed is allowed.
> >
> > That could be a livelock.
> > From the user's perspective, there's no noticeable difference between
> > a livelock, soft lockup, or hard lockup.
>
> This is certainly different. A lockup occurs when tasks can't be scheduled,
> causing the entire system to stop functioning.

When a livelock occurs, your only options are to migrate your
applications to other servers or reboot the system—there’s no other
resolution (except for using oomd, which is difficult for users
without cgroup2 or swap).

So, there's effectively no difference.

>
> >
> > >
> > > > So, I don't believe the issue is related
> > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > design of __GFP_NOFAIL itself.
> > >
> > > Care to elaborate?
> >
> > I've read the documentation explaining why the busy loop is embedded
> > within the page allocation process instead of letting users implement
> > it based on their needs. However, the complexity and numerous issues
> > suggest that this design might be fundamentally flawed.
>
> I don't see "numerous issues", only two issues:
>
> 1. allocation size overflow with __GFP_NOFAIL
> 2. unsupported case: __GFP_NOWAIT/ATOMIC | __GFP_NOFAIL.
>
> for 1, it has been a BUG to require an overflowed size to always succeed.
>
> for 2,  it is an unsupported case. we just need to hide __GFP_NOFAIL
> and only expose GFP_NOFAIL(which definitely includes blockable) so
> any unsupported case like vdpa will no longer occur.  I would greatly
> appreciate it if you or someone else could take over this task, as I am
> currently extremely busy.
>
> >
> > --
> > Regards
> > Yafang
>
> Thanks
> Barry



--
Regards
Yafang
Barry Song Aug. 19, 2024, 10:10 a.m. UTC | #14
On Mon, Aug 19, 2024 at 9:46 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 5:39 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > >
> > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > choices:
> > > > > >
> > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > >     and could ultimately lead to hard or soft lockups,
> > > > >
> > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > __GFP_DIRECT_RECLAIM, right?
> > > >
> > > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > will sleep if necessary and it will will trigger OOM killer if there is
> > > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > > than without it which means _no_sleeping_ is allowed and therefore only
> > > > a busy loop waiting for the allocation to proceed is allowed.
> > >
> > > That could be a livelock.
> > > From the user's perspective, there's no noticeable difference between
> > > a livelock, soft lockup, or hard lockup.
> >
> > This is certainly different. A lockup occurs when tasks can't be scheduled,
> > causing the entire system to stop functioning.
>
> When a livelock occurs, your only options are to migrate your
> applications to other servers or reboot the system—there’s no other
> resolution (except for using oomd, which is difficult for users
> without cgroup2 or swap).
>
> So, there's effectively no difference.

Could you express your options more clearly? I am guessing two
possibilities?
1. entirely drop __GFP_NOFAIL and require all users who are
using __GFP_NOFAIL to add error handlers instead?

2. no matter if it is an unsupported case, such as, GFP_ATOMIC|
__GFP_NOFAIL, we always loop till a soft or hard lockup?

>
> >
> > >
> > > >
> > > > > So, I don't believe the issue is related
> > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > > design of __GFP_NOFAIL itself.
> > > >
> > > > Care to elaborate?
> > >
> > > I've read the documentation explaining why the busy loop is embedded
> > > within the page allocation process instead of letting users implement
> > > it based on their needs. However, the complexity and numerous issues
> > > suggest that this design might be fundamentally flawed.
> >
> > I don't see "numerous issues", only two issues:
> >
> > 1. allocation size overflow with __GFP_NOFAIL
> > 2. unsupported case: __GFP_NOWAIT/ATOMIC | __GFP_NOFAIL.
> >
> > for 1, it has been a BUG to require an overflowed size to always succeed.
> >
> > for 2,  it is an unsupported case. we just need to hide __GFP_NOFAIL
> > and only expose GFP_NOFAIL(which definitely includes blockable) so
> > any unsupported case like vdpa will no longer occur.  I would greatly
> > appreciate it if you or someone else could take over this task, as I am
> > currently extremely busy.
> >
> > >
> > > --
> > > Regards
> > > Yafang
> >
> > Thanks
> > Barry
>
>
>
> --
> Regards
> Yafang

Thanks
Barry
Michal Hocko Aug. 19, 2024, 10:17 a.m. UTC | #15
On Mon 19-08-24 17:25:18, Yafang Shao wrote:
> On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > attempt and still fail to allocate memory for these users, we have two
> > > > choices:
> > > >
> > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > >     kswapd rescues the current process. However, this is unreliable
> > > >     and could ultimately lead to hard or soft lockups,
> > >
> > > That can occur even if we set both __GFP_NOFAIL and
> > > __GFP_DIRECT_RECLAIM, right?
> >
> > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > time to satisfy the allocation but it will reclaim to get the memory, it
> > will sleep if necessary and it will will trigger OOM killer if there is
> > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > than without it which means _no_sleeping_ is allowed and therefore only
> > a busy loop waiting for the allocation to proceed is allowed.
> 
> That could be a livelock.
> >From the user's perspective, there's no noticeable difference between
> a livelock, soft lockup, or hard lockup.

Ohh, it very much is different if somebody in a sleepable context is
taking too long to complete and making a CPU completely unusable for
anything else.

Please consider that asking for never failing allocation is a major
requirement.

> > > So, I don't believe the issue is related
> > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > design of __GFP_NOFAIL itself.
> >
> > Care to elaborate?
> 
> I've read the documentation explaining why the busy loop is embedded
> within the page allocation process instead of letting users implement
> it based on their needs. However, the complexity and numerous issues
> suggest that this design might be fundamentally flawed.

I really fail what you mean.
Michal Hocko Aug. 19, 2024, 10:19 a.m. UTC | #16
On Mon 19-08-24 11:44:39, David Hildenbrand wrote:
[...]
> >   	if (gfp_mask & __GFP_NOFAIL) {
> >   		/*
> > -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> > -		 * of any new users that actually require GFP_NOWAIT
> > +		 * All existing users of the __GFP_NOFAIL are blockable
> > +		 * otherwise we introduce a busy loop with inside the page
> > +		 * allocator from non-sleepable contexts
> >   		 */
> > -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > -			goto fail;
> > +		BUG_ON(!can_direct_reclaim);
> 
> No new BUG_ON(), WARN_ON_ONCE() is good enough for something that should be
> found during ordinary testing.

Do you mean 
	if (WARN_ON_ONCE_GFP(...))
		goto retry?

Barry has mentioned that option in the changelog.
Yafang Shao Aug. 19, 2024, 11:56 a.m. UTC | #17
On Mon, Aug 19, 2024 at 6:18 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 19-08-24 17:25:18, Yafang Shao wrote:
> > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > choices:
> > > > >
> > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > >     kswapd rescues the current process. However, this is unreliable
> > > > >     and could ultimately lead to hard or soft lockups,
> > > >
> > > > That can occur even if we set both __GFP_NOFAIL and
> > > > __GFP_DIRECT_RECLAIM, right?
> > >
> > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > will sleep if necessary and it will will trigger OOM killer if there is
> > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > than without it which means _no_sleeping_ is allowed and therefore only
> > > a busy loop waiting for the allocation to proceed is allowed.
> >
> > That could be a livelock.
> > >From the user's perspective, there's no noticeable difference between
> > a livelock, soft lockup, or hard lockup.
>
> Ohh, it very much is different if somebody in a sleepable context is
> taking too long to complete and making a CPU completely unusable for
> anything else.

__alloc_pages_slowpath
retry:
    if (gfp_mask & __GFP_NOFAIL) {
        goto retry;
    }

When the loop continues indefinitely here, it indicates that the
system is unstable. In such a scenario, does it really matter whether
you sleep or not?

>
> Please consider that asking for never failing allocation is a major
> requirement.
>
> > > > So, I don't believe the issue is related
> > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > design of __GFP_NOFAIL itself.
> > >
> > > Care to elaborate?
> >
> > I've read the documentation explaining why the busy loop is embedded
> > within the page allocation process instead of letting users implement
> > it based on their needs. However, the complexity and numerous issues
> > suggest that this design might be fundamentally flawed.
>
> I really fail what you mean.

I mean giving the user the option to handle the loop at the call site,
rather than having it loop within __alloc_pages_slowpath().

--
Regards
Yafang
Yafang Shao Aug. 19, 2024, 11:56 a.m. UTC | #18
On Mon, Aug 19, 2024 at 6:10 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 9:46 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, Aug 19, 2024 at 5:39 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > >
> > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > > choices:
> > > > > > >
> > > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > > >     and could ultimately lead to hard or soft lockups,
> > > > > >
> > > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > > __GFP_DIRECT_RECLAIM, right?
> > > > >
> > > > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > > will sleep if necessary and it will will trigger OOM killer if there is
> > > > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > > > than without it which means _no_sleeping_ is allowed and therefore only
> > > > > a busy loop waiting for the allocation to proceed is allowed.
> > > >
> > > > That could be a livelock.
> > > > From the user's perspective, there's no noticeable difference between
> > > > a livelock, soft lockup, or hard lockup.
> > >
> > > This is certainly different. A lockup occurs when tasks can't be scheduled,
> > > causing the entire system to stop functioning.
> >
> > When a livelock occurs, your only options are to migrate your
> > applications to other servers or reboot the system—there’s no other
> > resolution (except for using oomd, which is difficult for users
> > without cgroup2 or swap).
> >
> > So, there's effectively no difference.
>
> Could you express your options more clearly? I am guessing two
> possibilities?
> 1. entirely drop __GFP_NOFAIL and require all users who are
> using __GFP_NOFAIL to add error handlers instead?

When the system is unstable—such as after reaching the maximum retries
without successfully allocating pages—simply failing the operation
might be the better option.

>
> 2. no matter if it is an unsupported case, such as, GFP_ATOMIC|
> __GFP_NOFAIL, we always loop till a soft or hard lockup?

--
Regards
Yafang
Michal Hocko Aug. 19, 2024, 12:04 p.m. UTC | #19
On Mon 19-08-24 19:56:16, Yafang Shao wrote:
> On Mon, Aug 19, 2024 at 6:18 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 19-08-24 17:25:18, Yafang Shao wrote:
> > > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > >
> > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > choices:
> > > > > >
> > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > >     and could ultimately lead to hard or soft lockups,
> > > > >
> > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > __GFP_DIRECT_RECLAIM, right?
> > > >
> > > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > will sleep if necessary and it will will trigger OOM killer if there is
> > > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > > than without it which means _no_sleeping_ is allowed and therefore only
> > > > a busy loop waiting for the allocation to proceed is allowed.
> > >
> > > That could be a livelock.
> > > >From the user's perspective, there's no noticeable difference between
> > > a livelock, soft lockup, or hard lockup.
> >
> > Ohh, it very much is different if somebody in a sleepable context is
> > taking too long to complete and making a CPU completely unusable for
> > anything else.
> 
> __alloc_pages_slowpath
> retry:
>     if (gfp_mask & __GFP_NOFAIL) {
>         goto retry;
>     }
> 
> When the loop continues indefinitely here, it indicates that the
> system is unstable.

No, it means the system is low on memory to satisfy the allocation
request. This doesn't automatically imply the system is unstable. The
requested NUMA node(s) or zone(s) might be depleted.

> In such a scenario, does it really matter whether
> you sleep or not?

Absolutely! Hogging CPU might prevent anybody else running on it.

> > Please consider that asking for never failing allocation is a major
> > requirement.
> >
> > > > > So, I don't believe the issue is related
> > > > > to setting __GFP_DIRECT_RECLAIM; rather, it stems from the flawed
> > > > > design of __GFP_NOFAIL itself.
> > > >
> > > > Care to elaborate?
> > >
> > > I've read the documentation explaining why the busy loop is embedded
> > > within the page allocation process instead of letting users implement
> > > it based on their needs. However, the complexity and numerous issues
> > > suggest that this design might be fundamentally flawed.
> >
> > I really fail what you mean.
> 
> I mean giving the user the option to handle the loop at the call site,
> rather than having it loop within __alloc_pages_slowpath().

Users who have a allocation failure strategy do not and should not use
__GFP_NOFAIL.
Michal Hocko Aug. 19, 2024, 12:09 p.m. UTC | #20
On Mon 19-08-24 19:56:53, Yafang Shao wrote:
> On Mon, Aug 19, 2024 at 6:10 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Aug 19, 2024 at 9:46 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, Aug 19, 2024 at 5:39 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > > >
> > > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > > > choices:
> > > > > > > >
> > > > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > > > >     and could ultimately lead to hard or soft lockups,
> > > > > > >
> > > > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > > > __GFP_DIRECT_RECLAIM, right?
> > > > > >
> > > > > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > > > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > > > will sleep if necessary and it will will trigger OOM killer if there is
> > > > > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > > > > than without it which means _no_sleeping_ is allowed and therefore only
> > > > > > a busy loop waiting for the allocation to proceed is allowed.
> > > > >
> > > > > That could be a livelock.
> > > > > From the user's perspective, there's no noticeable difference between
> > > > > a livelock, soft lockup, or hard lockup.
> > > >
> > > > This is certainly different. A lockup occurs when tasks can't be scheduled,
> > > > causing the entire system to stop functioning.
> > >
> > > When a livelock occurs, your only options are to migrate your
> > > applications to other servers or reboot the system—there’s no other
> > > resolution (except for using oomd, which is difficult for users
> > > without cgroup2 or swap).
> > >
> > > So, there's effectively no difference.
> >
> > Could you express your options more clearly? I am guessing two
> > possibilities?
> > 1. entirely drop __GFP_NOFAIL and require all users who are
> > using __GFP_NOFAIL to add error handlers instead?
> 
> When the system is unstable—such as after reaching the maximum retries
> without successfully allocating pages—simply failing the operation
> might be the better option.

It seems you are failing to understand the __GFP_NOFAIL semantic and you
are circling around that. So let me repeat that for you here. Make sure
you understand before going forward with the discussion. Feel free if
something is not clear but please do not continue with what-if kind of
questions.

GFP_NOFAIL means that the caller has no way to deal with the allocation
strategy. Allocator simply cannot fail the request even if that takes
ages to succeed! To put it simpler if you have a code like

	while (!(ptr = alloc()));
or
	BUG_ON(!(ptr = alloc()));

then you should better use __GFP_NOFAIL rather than opencode the endless
loop or the bug on for the failure.

Our (page, vmalloc, kmalloc) allocators do support that node for
allocation that are allowed to sleep. But those allocators have never
supported and are unlikely to suppoort atomic non-failing allocations.

More clear?
Yafang Shao Aug. 19, 2024, 12:17 p.m. UTC | #21
On Mon, Aug 19, 2024 at 8:09 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 19-08-24 19:56:53, Yafang Shao wrote:
> > On Mon, Aug 19, 2024 at 6:10 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Aug 19, 2024 at 9:46 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 19, 2024 at 5:39 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 19, 2024 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 19, 2024 at 3:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > >
> > > > > > > On Sun 18-08-24 10:55:09, Yafang Shao wrote:
> > > > > > > > On Sat, Aug 17, 2024 at 2:25 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > > > >
> > > > > > > > > When users allocate memory with the __GFP_NOFAIL flag, they might
> > > > > > > > > incorrectly use it alongside GFP_ATOMIC, GFP_NOWAIT, etc.  This kind of
> > > > > > > > > non-blockable __GFP_NOFAIL is not supported and is pointless.  If we
> > > > > > > > > attempt and still fail to allocate memory for these users, we have two
> > > > > > > > > choices:
> > > > > > > > >
> > > > > > > > >     1. We could busy-loop and hope that some other direct reclamation or
> > > > > > > > >     kswapd rescues the current process. However, this is unreliable
> > > > > > > > >     and could ultimately lead to hard or soft lockups,
> > > > > > > >
> > > > > > > > That can occur even if we set both __GFP_NOFAIL and
> > > > > > > > __GFP_DIRECT_RECLAIM, right?
> > > > > > >
> > > > > > > No, it cannot! With __GFP_DIRECT_RECLAIM the allocator might take a long
> > > > > > > time to satisfy the allocation but it will reclaim to get the memory, it
> > > > > > > will sleep if necessary and it will will trigger OOM killer if there is
> > > > > > > no other option. __GFP_DIRECT_RECLAIM is a completely different story
> > > > > > > than without it which means _no_sleeping_ is allowed and therefore only
> > > > > > > a busy loop waiting for the allocation to proceed is allowed.
> > > > > >
> > > > > > That could be a livelock.
> > > > > > From the user's perspective, there's no noticeable difference between
> > > > > > a livelock, soft lockup, or hard lockup.
> > > > >
> > > > > This is certainly different. A lockup occurs when tasks can't be scheduled,
> > > > > causing the entire system to stop functioning.
> > > >
> > > > When a livelock occurs, your only options are to migrate your
> > > > applications to other servers or reboot the system—there’s no other
> > > > resolution (except for using oomd, which is difficult for users
> > > > without cgroup2 or swap).
> > > >
> > > > So, there's effectively no difference.
> > >
> > > Could you express your options more clearly? I am guessing two
> > > possibilities?
> > > 1. entirely drop __GFP_NOFAIL and require all users who are
> > > using __GFP_NOFAIL to add error handlers instead?
> >
> > When the system is unstable—such as after reaching the maximum retries
> > without successfully allocating pages—simply failing the operation
> > might be the better option.
>
> It seems you are failing to understand the __GFP_NOFAIL semantic and you
> are circling around that. So let me repeat that for you here. Make sure
> you understand before going forward with the discussion. Feel free if
> something is not clear but please do not continue with what-if kind of
> questions.
>
> GFP_NOFAIL means that the caller has no way to deal with the allocation
> strategy. Allocator simply cannot fail the request even if that takes
> ages to succeed! To put it simpler if you have a code like
>
>         while (!(ptr = alloc()));
> or
>         BUG_ON(!(ptr = alloc()));
>
> then you should better use __GFP_NOFAIL rather than opencode the endless
> loop or the bug on for the failure.
>
> Our (page, vmalloc, kmalloc) allocators do support that node for
> allocation that are allowed to sleep. But those allocators have never
> supported and are unlikely to suppoort atomic non-failing allocations.
>
> More clear?

 * New users should be evaluated carefully (and the flag should be
 * used only when there is no reasonable failure policy) but it is
 * definitely preferable to use the flag rather than opencode endless
 * loop around allocator.

The doc has already expressed what I mean.  My question is why is that
? Why not let it loop around the allocator?
David Hildenbrand Aug. 19, 2024, 12:48 p.m. UTC | #22
On 19.08.24 12:19, Michal Hocko wrote:
> On Mon 19-08-24 11:44:39, David Hildenbrand wrote:
> [...]
>>>    	if (gfp_mask & __GFP_NOFAIL) {
>>>    		/*
>>> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
>>> -		 * of any new users that actually require GFP_NOWAIT
>>> +		 * All existing users of the __GFP_NOFAIL are blockable
>>> +		 * otherwise we introduce a busy loop with inside the page
>>> +		 * allocator from non-sleepable contexts
>>>    		 */
>>> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
>>> -			goto fail;
>>> +		BUG_ON(!can_direct_reclaim);
>>
>> No new BUG_ON(), WARN_ON_ONCE() is good enough for something that should be
>> found during ordinary testing.
> 
> Do you mean
> 	if (WARN_ON_ONCE_GFP(...))
> 		goto retry?

Not really ...

but now I read the description more carefully and I am not sure why we 
are so into throwing around BUG_ONs here, for something that is simply 
not allowed and doesn't make sense.

If __GFP_NOFAIL is documented to

"
+ * It _must_ be blockable and used together with __GFP_DIRECT_RECLAIM.
+ * It should _never_ be used in non-sleepable contexts.
"

Why not document

"__GFP_NOFAIL always implies __GFP_DIRECT_RECLAIM" and do exactly that?
Michal Hocko Aug. 19, 2024, 2:01 p.m. UTC | #23
On Mon 19-08-24 20:17:36, Yafang Shao wrote:
> My question is why is that ? Why not let it loop around the allocator?

Because of 2 reasons. It is much easier to see NOFAIL allocations when
they are annotated properly (try to grep for all sorts of endless loops)
and also the allocator can make sertain heuristics if it knows that
allocation must not fail.
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d2c37f8f8d09..fb5850ecd3ae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4399,11 +4399,11 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (gfp_mask & __GFP_NOFAIL) {
 		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually require GFP_NOWAIT
+		 * All existing users of the __GFP_NOFAIL are blockable
+		 * otherwise we introduce a busy loop with inside the page
+		 * allocator from non-sleepable contexts
 		 */
-		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
-			goto fail;
+		BUG_ON(!can_direct_reclaim);
 
 		/*
 		 * PF_MEMALLOC request from this context is rather bizarre
@@ -4434,7 +4434,7 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		cond_resched();
 		goto retry;
 	}
-fail:
+
 	warn_alloc(gfp_mask, ac->nodemask,
 			"page allocation failure: order:%u", order);
 got_pg: