
[RFC] mm: warn potential return NULL for kmalloc_array and kvmalloc_array with __GFP_NOFAIL

Message ID 20240717230025.77361-1-21cnbao@gmail.com (mailing list archive)
State New
Series [RFC] mm: warn potential return NULL for kmalloc_array and kvmalloc_array with __GFP_NOFAIL

Commit Message

Barry Song July 17, 2024, 11 p.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

Overflow in this context is highly unlikely. However, allocations using
GFP_NOFAIL are guaranteed to succeed, so checking the return value is
unnecessary. One option to fix this is allowing memory allocation with
an overflowed size, but it seems pointless. Let's at least issue a
warning. Likely BUG_ON() seems better as anyway we can't fix it?

Cc: Michal Hocko <mhocko@suse.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/slab.h | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Michal Hocko July 18, 2024, 6:58 a.m. UTC | #1
On Thu 18-07-24 11:00:25, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> Overflow in this context is highly unlikely. However, allocations using
> GFP_NOFAIL are guaranteed to succeed, so checking the return value is
> unnecessary. One option to fix this is allowing memory allocation with
> an overflowed size, but it seems pointless. Let's at least issue a
> warning. Likely BUG_ON() seems better as anyway we can't fix it?

WARN_ON is effectively BUG_ON with panic_on_warn so if this happens to
be in a user triggerable path then you would have an easy way to panic
the whole machine. It is likely true that the kernel could oops just
right after the failure but that could be recoverable at least.

If anything I would just pr_warn with caller address or add dump_stack
to capture the full trace. That would give us the caller that needs
fixing without panicking the system with panic_on_warn.

Btw. what has led you to create this patch? Have you encountered a
misbehaving caller?

Realistically speaking k*malloc(GFP_NOFAIL) with large values (still
far from overflow) would be very dangerous as well. So I am not really
sure this is buying us much if anything at all.

> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/slab.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index a332dd2fa6cd..c6aec311864f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -692,8 +692,10 @@ static inline __alloc_size(1, 2) void *kmalloc_array_noprof(size_t n, size_t siz
>  {
>  	size_t bytes;
>  
> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> +		WARN_ON(flags & __GFP_NOFAIL);
>  		return NULL;
> +	}
>  	if (__builtin_constant_p(n) && __builtin_constant_p(size))
>  		return kmalloc_noprof(bytes, flags);
>  	return kmalloc_noprof(bytes, flags);
> @@ -794,8 +796,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>  {
>  	size_t bytes;
>  
> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> +		WARN_ON(flags & __GFP_NOFAIL);
>  		return NULL;
> +	}
>  
>  	return kvmalloc_node_noprof(bytes, flags, node);
>  }
> -- 
> 2.34.1
Christoph Hellwig July 18, 2024, 7:04 a.m. UTC | #2
On Thu, Jul 18, 2024 at 08:58:44AM +0200, Michal Hocko wrote:
> WARN_ON is effectively BUG_ON with panic_on_warn so if this happens to
> be in a user triggerable path then you would have an easy way to panic
> the whole machine. It is likely true that the kernel could oops just
> right after the failure but that could be recoverable at least.

If you set panic_on_warn you are either debugging in which case it's
the right thing, or you are fucked.  So don't do it unless you
expect frequent crashes.

> If anything I would just pr_warn with caller address or add dump_stack
> to capture the full trace. That would give us the caller that needs
> fixing without panicking the system with panic_on_warn.

The whole freaking point of __GFP_NOFAIL is that callers don't handle
allocation failures.  So in fact a straight BUG is the right thing
here.
Michal Hocko July 18, 2024, 7:12 a.m. UTC | #3
On Thu 18-07-24 00:04:54, Christoph Hellwig wrote:
> On Thu, Jul 18, 2024 at 08:58:44AM +0200, Michal Hocko wrote:
> > WARN_ON is effectively BUG_ON with panic_on_warn so if this happens to
> > be in a user triggerable path then you would have an easy way to panic
> > the whole machine. It is likely true that the kernel could oops just
> > right after the failure but that could be recoverable at least.
> 
> If you set panic_on_warn you are either debugging in which case it's
> the right thing, or you are fucked.  So don't do it unless you
> expect frequent crashes.

I do agree and I wouldn't recommend running panic_on_warn on anything
even touching production setups. Reality check disagrees.

> > If anything I would just pr_warn with caller address or add dump_stack
> > to capture the full trace. That would give us the caller that needs
> > fixing without panicking the system with panic_on_warn.
> 
> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> allocation failures.  So in fact a straight BUG is the right thing
> here.

An oops from the NULL ptr could just be a safer option because it could be
recoverable.

Anyway, I am questioning whether WARN/BUG/pr_warn on the overflow check
adds any actual value, because large GFP_NOFAIL allocations could be
even more dangerous, essentially rendering the system completely
unusable.
Barry Song July 18, 2024, 7:22 a.m. UTC | #4
On Thu, Jul 18, 2024 at 6:58 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 18-07-24 11:00:25, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Overflow in this context is highly unlikely. However, allocations using
> > GFP_NOFAIL are guaranteed to succeed, so checking the return value is
> > unnecessary. One option to fix this is allowing memory allocation with
> > an overflowed size, but it seems pointless. Let's at least issue a
> > warning. Likely BUG_ON() seems better as anyway we can't fix it?
>
> WARN_ON is effectively BUG_ON with panic_on_warn so if this happens to
> be in a user triggerable path then you would have an easy way to panic
> the whole machine. It is likely true that the kernel could oops just
> right after the failure but that could be recoverable at least.
>
> If anything I would just pr_warn with caller address or add dump_stack
> to capture the full trace. That would give us the caller that needs
> fixing without panicking the system with panic_on_warn.
>
> Btw. what has led you to create this patch? Have you encountered a
> misbehaving caller?

I didn't encounter any misbehaving callers in the in-tree code. I was
debugging another bug potentially related to kvmalloc in my out-of-tree
drivers, so I reviewed all the code and noticed the NULL return for
GFP_NOFAIL.

For future-proofing and security reasons, returning NULL for NOFAIL
still seems incorrect, as the callers won't check the return value. If any
future or existing in-tree code has a potential bug which might be
exploited by hackers, for example:

ptr = kvmalloc_array(NOFAIL);
ptr->callback(); //ptr=NULL;

callback could be a privilege escalation?

I'm not a security expert, so hopefully others can provide some
insights :-)

>
> Realistically speaking k*malloc(GFP_NOFAIL) with large values (still
> far from overflow) would be very dangerous as well. So I am not really
> sure this is buying us much if anything at all.
>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/slab.h | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index a332dd2fa6cd..c6aec311864f 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -692,8 +692,10 @@ static inline __alloc_size(1, 2) void *kmalloc_array_noprof(size_t n, size_t siz
> >  {
> >       size_t bytes;
> >
> > -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> > +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> > +             WARN_ON(flags & __GFP_NOFAIL);
> >               return NULL;
> > +     }
> >       if (__builtin_constant_p(n) && __builtin_constant_p(size))
> >               return kmalloc_noprof(bytes, flags);
> >       return kmalloc_noprof(bytes, flags);
> > @@ -794,8 +796,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >  {
> >       size_t bytes;
> >
> > -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> > +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> > +             WARN_ON(flags & __GFP_NOFAIL);
> >               return NULL;
> > +     }
> >
> >       return kvmalloc_node_noprof(bytes, flags, node);
> >  }
> > --
> > 2.34.1
>
> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Michal Hocko July 18, 2024, 7:27 a.m. UTC | #5
On Thu 18-07-24 19:22:37, Barry Song wrote:
[...]
> For future-proofing and security reasons, returning NULL for NOFAIL
> still seems incorrect as the callers won't check the ret. If any future or
> existing in-tree code has a potential bug which might be exploited by
> hackers, for example
> 
> ptr = kvmalloc_array(NOFAIL);
> ptr->callback(); //ptr=NULL;
> 
> callback could be a privilege escalation?

Only if you allow to map zero page AFAIK. Nobody reasonable should be
doing that.
Barry Song July 18, 2024, 7:41 a.m. UTC | #6
On Thu, Jul 18, 2024 at 7:27 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 18-07-24 19:22:37, Barry Song wrote:
> [...]
> > For future-proofing and security reasons, returning NULL for NOFAIL
> > still seems incorrect as the callers won't check the ret. If any future or
> > existing in-tree code has a potential bug which might be exploited by
> > hackers, for example
> >
> > ptr = kvmalloc_array(NOFAIL);
> > ptr->callback(); //ptr=NULL;
> >
> > callback could be a privilege escalation?
>
> Only if you allow to map zero page AFAIK. Nobody reasonable should be
> doing that.

ptr->callback could be above /proc/sys/vm/mmap_min_addr ?

>
> --
> Michal Hocko
> SUSE Labs
Hailong Liu July 18, 2024, 7:48 a.m. UTC | #7
On Thu, 18. Jul 11:00, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Overflow in this context is highly unlikely. However, allocations using
> GFP_NOFAIL are guaranteed to succeed, so checking the return value is
> unnecessary. One option to fix this is allowing memory allocation with
> an overflowed size, but it seems pointless. Let's at least issue a
> warning. Likely BUG_ON() seems better as anyway we can't fix it?
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/slab.h | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index a332dd2fa6cd..c6aec311864f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -692,8 +692,10 @@ static inline __alloc_size(1, 2) void *kmalloc_array_noprof(size_t n, size_t siz
>  {
>  	size_t bytes;
>
> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> +		WARN_ON(flags & __GFP_NOFAIL);
Hi Barry:

IMO, using __GFP_NOFAIL guarantees success if and only if the parameters are *correct*.
Maybe we can add warnings here, as in mm/page_alloc.c, to help callers find the reason:

```
	if (gfp_mask & __GFP_NOFAIL) {
		/*
		 * All existing users of the __GFP_NOFAIL are blockable, so warn
		 * of any new users that actually require GFP_NOWAIT
		 */
		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
			goto fail;

		/*
		 * PF_MEMALLOC request from this context is rather bizarre
		 * because we cannot reclaim anything and only can loop waiting
		 * for somebody to do a work for us
		 */
		WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);

		/*
		 * non failing costly orders are a hard requirement which we
		 * are not prepared for much so let's warn about these users
		 * so that we can identify them and convert them to something
		 * else.
		 */
		WARN_ON_ONCE_GFP(costly_order, gfp_mask);
```

>  		return NULL;
> +	}
>  	if (__builtin_constant_p(n) && __builtin_constant_p(size))
>  		return kmalloc_noprof(bytes, flags);
>  	return kmalloc_noprof(bytes, flags);
> @@ -794,8 +796,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>  {
>  	size_t bytes;
>
> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> +		WARN_ON(flags & __GFP_NOFAIL);
>  		return NULL;
> +	}
>
>  	return kvmalloc_node_noprof(bytes, flags, node);
>  }
> --
> 2.34.1
>
>

--
help you, help me,
Hailong.
Michal Hocko July 18, 2024, 7:53 a.m. UTC | #8
On Thu 18-07-24 19:41:33, Barry Song wrote:
> On Thu, Jul 18, 2024 at 7:27 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 18-07-24 19:22:37, Barry Song wrote:
> > [...]
> > > For future-proofing and security reasons, returning NULL for NOFAIL
> > > still seems incorrect as the callers won't check the ret. If any future or
> > > existing in-tree code has a potential bug which might be exploited by
> > > hackers, for example
> > >
> > > ptr = kvmalloc_array(NOFAIL);
> > > ptr->callback(); //ptr=NULL;
> > >
> > > callback could be a privilege escalation?
> >
> > Only if you allow to map zero page AFAIK. Nobody reasonable should be
> > doing that.
> 
> ptr->callback could be above /proc/sys/vm/mmap_min_addr ?

Yes, it can of course but this would require quite a stretch to trigger,
no?

Look at this from a real life code POV. You are allocating an array of
callbacks (or structure of callbacks). In order to have this exploitable
you need to direct the first dereference above mmap_min_addr.

If you really want to protect from a code like that then WARN_ON doesn't
buy you anything because it will stop the exploit only when
panic_on_warn. You would need BUG_ON as mentioned by Christoph.

So the real question is, do you want to stop exploits or do you want to
debug potentially incorrect but mostly harmless buggy code?
Vlastimil Babka July 18, 2024, 8:16 a.m. UTC | #9
On 7/18/24 9:12 AM, Michal Hocko wrote:
> On Thu 18-07-24 00:04:54, Christoph Hellwig wrote:
>> On Thu, Jul 18, 2024 at 08:58:44AM +0200, Michal Hocko wrote:
>> > WARN_ON is effectively BUG_ON with panic_on_warn so if this happens to
>> > be in a user triggerable path then you would have an easy way to panic
>> > the whole machine. It is likely true that the kernel could oops just
>> > right after the failure but that could be recoverable at least.
>> 
>> If you set panic_on_warn you are either debugging in which case it's
>> the right thing, or you are fucked.  So don't do it unless you
>> expect frequent crashes.
> 
> I do agree and I wouldn't recommend running panic_on_warn on anything
> even touching production setups. Reality check disagrees.
> 
>> > If anything I would just pr_warn with caller address or add dump_stack
>> > to capture the full trace. That would give us the caller that needs
>> fixing without panicking the system with panic_on_warn.
>> 
>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
>> allocation failures.  So in fact a straight BUG is the right thing
>> here.

Agreed. It's just not a recoverable situation (WARN_ON is for recoverable
situations). The caller cannot handle allocation failure and at the same
time asked for an impossible allocation. BUG_ON() is a guaranteed oops with
a stack trace etc. We don't need to hope for the later NULL pointer
dereference (which might, if we are really unlucky, happen from a different
context where it's no longer obvious what led to the allocation failing).

> An oops from the NULL ptr could just be a safer option because it could be
> recoverable.

Wonder if people who enable panic_on_warn have any reason to not also enable
panic_on_oops, other than forgetting/not realizing it also exists?

> Anyway, I am questioning that WARN/BUG/pr_warn on overflow check is
> adding any actual value because GFP_NOFAIL large allocations could be
> even more dangerous essentially rendering the system completely
> unusable.
>
Barry Song July 18, 2024, 8:18 a.m. UTC | #10
On Thu, Jul 18, 2024 at 7:55 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 18-07-24 19:41:33, Barry Song wrote:
> > On Thu, Jul 18, 2024 at 7:27 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 18-07-24 19:22:37, Barry Song wrote:
> > > [...]
> > > > For future-proofing and security reasons, returning NULL for NOFAIL
> > > > still seems incorrect as the callers won't check the ret. If any future or
> > > > existing in-tree code has a potential bug which might be exploited by
> > > > hackers, for example
> > > >
> > > > ptr = kvmalloc_array(NOFAIL);
> > > > ptr->callback(); //ptr=NULL;
> > > >
> > > > callback could be a privilege escalation?
> > >
> > > Only if you allow to map zero page AFAIK. Nobody reasonable should be
> > > doing that.
> >
> > ptr->callback could be above /proc/sys/vm/mmap_min_addr ?
>
> Yes, it can of course but this would require quite a stretch to trigger,
> no?
>
> Look at this from a real life code POV. You are allocating an array of
> callbacks (or structure of callbacks). In order to have this exploitable
> you need to direct the first dereference above mmap_min_addr.
>
> If you really want to protect from a code like that then WARN_ON doesn't
> buy you anything because it will stop the exploit only when
> panic_on_warn. You would need BUG_ON as mentioned by Christoph.
>

I actually also mentioned BUG_ON in the changelog "Likely BUG_ON()
seems better as anyway we can't fix it?" though the code is WARN_ON.

> So the real question is, do you want to stop exploits or do you want to
> debug potentially incorrect but mostly harmless buggy code?

I want to ensure that GFP_NOFAIL has consistent semantics: we don't
check the return value, and the allocation must succeed, although
allocating a large amount of NOFAIL memory (I mean a size short of
overflow) might make the system "unusable". In that case GFP_NOFAIL
itself behaves correctly, but it's inappropriate for the caller.

So the purpose is making sure of the semantics: NOFAIL means no failure
and we don't need to check the return value. If we can't really succeed, it
should throw a BUG to stop any potential exploits.

> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Michal Hocko July 18, 2024, 8:32 a.m. UTC | #11
On Thu 18-07-24 20:18:02, Barry Song wrote:
> So the purpose is making sure the semantics - NOFAIL means no failure
> and we don't need to check ret.  If we can't really succeed, it should throw
> a BUG to stop any potential exploits.

This would require panicking consistently on failure in every allocator
path that can bail out, e.g. the page allocator on a GFP_NOWAIT|GFP_NOFAIL
request; not sure how many more.
Barry Song July 18, 2024, 8:33 a.m. UTC | #12
On Thu, Jul 18, 2024 at 7:48 PM Hailong Liu <hailong.liu@oppo.com> wrote:
>
> On Thu, 18. Jul 11:00, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Overflow in this context is highly unlikely. However, allocations using
> > GFP_NOFAIL are guaranteed to succeed, so checking the return value is
> > unnecessary. One option to fix this is allowing memory allocation with
> > an overflowed size, but it seems pointless. Let's at least issue a
> > warning. Likely BUG_ON() seems better as anyway we can't fix it?
> >
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/slab.h | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index a332dd2fa6cd..c6aec311864f 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -692,8 +692,10 @@ static inline __alloc_size(1, 2) void *kmalloc_array_noprof(size_t n, size_t siz
> >  {
> >       size_t bytes;
> >
> > -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> > +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> > +             WARN_ON(flags & __GFP_NOFAIL);
> Hi Barry:
>
> IMO, using __GFP_NOFAIL guarantees success if and only if the parameters are *correct*.
> Maybe we can add here to help callers to find the reason as in mm/page_alloc.c

no, this doesn't make any sense:

 * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. The allocation could block
 * indefinitely but will never return with failure. Testing for
 * failure is pointless.

I believe these are two separate things at different layers.

>
> ```
>         if (gfp_mask & __GFP_NOFAIL) {
>                 /*
>                  * All existing users of the __GFP_NOFAIL are blockable, so warn
>                  * of any new users that actually require GFP_NOWAIT
>                  */
>                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
>                         goto fail;

If the code calling this function doesn't have a retry mechanism, it is
a BUG that needs to be fixed. Even though the above code might return
NULL, its wrapper will still retry, for example:

        while (nr_allocated < nr_pages) {
                if (!nofail && fatal_signal_pending(current))
                        break;

                if (nid == NUMA_NO_NODE)
                        page = alloc_pages_noprof(alloc_gfp, order);
                else
                        page = alloc_pages_node_noprof(nid, alloc_gfp, order);
                if (unlikely(!page)) {
                        if (!nofail)
                                break;

                        /* fall back to the zero order allocations */
                        alloc_gfp |= __GFP_NOFAIL;
                        order = 0;
                        continue;
                }

                /*
                 * Higher order allocations must be able to be treated as
                 * independent small pages by callers (as they can with
                 * small-page vmallocs). Some drivers do their own refcounting
                 * on vmalloc_to_page() pages, some use page->mapping,
                 * page->lru, etc.
                 */
                if (order)
                        split_page(page, order);

                /*
                 * Careful, we allocate and map page-order pages, but
                 * tracking is done per PAGE_SIZE page so as to keep the
                 * vm_struct APIs independent of the physical/mapped size.
                 */
                for (i = 0; i < (1U << order); i++)
                        pages[nr_allocated + i] = page + i;

                cond_resched();
                nr_allocated += 1U << order;
        }


>
>                 /*
>                  * PF_MEMALLOC request from this context is rather bizarre
>                  * because we cannot reclaim anything and only can loop waiting
>                  * for somebody to do a work for us
>                  */
>                 WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);
>
>                 /*
>                  * non failing costly orders are a hard requirement which we
>                  * are not prepared for much so let's warn about these users
>                  * so that we can identify them and convert them to something
>                  * else.
>                  */
>                 WARN_ON_ONCE_GFP(costly_order, gfp_mask);
> ```
>
> >               return NULL;
> > +     }
> >       if (__builtin_constant_p(n) && __builtin_constant_p(size))
> >               return kmalloc_noprof(bytes, flags);
> >       return kmalloc_noprof(bytes, flags);
> > @@ -794,8 +796,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >  {
> >       size_t bytes;
> >
> > -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> > +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> > +             WARN_ON(flags & __GFP_NOFAIL);
> >               return NULL;
> > +     }
> >
> >       return kvmalloc_node_noprof(bytes, flags, node);
> >  }
> > --
> > 2.34.1
> >
> >
>
> --
> help you, help me,
> Hailong.

Thanks
Barry
Barry Song July 18, 2024, 8:43 a.m. UTC | #13
On Thu, Jul 18, 2024 at 8:32 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 18-07-24 20:18:02, Barry Song wrote:
> > So the purpose is making sure the semantics - NOFAIL means no failure
> > and we don't need to check ret.  If we can't really succeed, it should throw
> > a BUG to stop any potential exploits.
>
> This would require to panic consistently on failure in all allocator
> path that can bail out. E.g. page allocator on GFP_NOWAIT|GFP_NOFAIL
> req. not sure how many more.

Right, this GFP_NOFAIL issue seems quite messy. However, at least vmalloc
will retry by itself, even if alloc_pages might have failed with
GFP_NOWAIT | GFP_NOFAIL.

But isn't that the definition of __GFP_NOFAIL?

 * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. The allocation could block
 * indefinitely but will never return with failure. Testing for
 * failure is pointless."

So I believe any code that doesn't retry and ends up returning NULL should be
fixed.

Otherwise, we should rename __GFP_NOFAIL to __GFP_UNLIKELY_FAIL in
the documentation and explain when it might fail.

> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Michal Hocko July 18, 2024, 8:50 a.m. UTC | #14
On Thu 18-07-24 20:43:53, Barry Song wrote:
> On Thu, Jul 18, 2024 at 8:32 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 18-07-24 20:18:02, Barry Song wrote:
> > > So the purpose is making sure the semantics - NOFAIL means no failure
> > > and we don't need to check ret.  If we can't really succeed, it should throw
> > > a BUG to stop any potential exploits.
> >
> > This would require to panic consistently on failure in all allocator
> > path that can bail out. E.g. page allocator on GFP_NOWAIT|GFP_NOFAIL
> > req. not sure how many more.
> 
> Right, this GFP_NOFAIL issue seems quite messy. However, at least vmalloc
> will retry by itself, even if alloc_pages might have failed with
> GFP_NOWAIT | GFP_NOFAIL.
> 
> But isn't that the definition of __GFP_NOFAIL?
> 
>  * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>  * cannot handle allocation failures. The allocation could block
>  * indefinitely but will never return with failure. Testing for
>  * failure is pointless."
> 
> So I believe any code that doesn't retry and ends up returning NULL should be
> fixed.

Yes, those shouldn't really fail. NOWAIT|NOFAIL was something that
should never happen and I really hope it doesn't. Others should really
retry, but it's been some time since I last checked.

These overflow checks were added without any acks by MM people...
Barry Song July 19, 2024, 12:35 a.m. UTC | #15
On Thu, Jul 18, 2024 at 8:50 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 18-07-24 20:43:53, Barry Song wrote:
> > On Thu, Jul 18, 2024 at 8:32 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 18-07-24 20:18:02, Barry Song wrote:
> > > > So the purpose is making sure the semantics - NOFAIL means no failure
> > > > and we don't need to check ret.  If we can't really succeed, it should throw
> > > > a BUG to stop any potential exploits.
> > >
> > > This would require to panic consistently on failure in all allocator
> > > path that can bail out. E.g. page allocator on GFP_NOWAIT|GFP_NOFAIL
> > > req. not sure how many more.
> >
> > Right, this GFP_NOFAIL issue seems quite messy. However, at least vmalloc
> > will retry by itself, even if alloc_pages might have failed with
> > GFP_NOWAIT | GFP_NOFAIL.
> >
> > But isn't that the definition of __GFP_NOFAIL?
> >
> >  * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> >  * cannot handle allocation failures. The allocation could block
> >  * indefinitely but will never return with failure. Testing for
> >  * failure is pointless."
> >
> > So I believe any code that doesn't retry and ends up returning NULL should be
> > fixed.
>
> Yes, those shouldn't really fail. NOWAIT|NOFAIL was something that
> should never happen and I really hope it doesn't. Others should really
> retry but it's been some time since I've checked the last time.


I assume allocations directly using alloc_pages() might not respect GFP_NOFAIL
and violate the semantics of GFP_NOFAIL.

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                                struct alloc_context *ac) {
        /*
         * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
         * we always retry
         */
        if (gfp_mask & __GFP_NOFAIL) {
                /*
                 * All existing users of the __GFP_NOFAIL are blockable, so warn
                 * of any new users that actually require GFP_NOWAIT
                 */
                if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
                        goto fail;
...
}

Additionally, at least drivers/vdpa/vdpa_user/iova_domain.c is incorrect
with GFP_ATOMIC | __GFP_NOFAIL.

void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
{
        ...

        count = domain->bounce_size >> PAGE_SHIFT;
        for (i = 0; i < count; i++) {
                ...

                /* Copy user page to kernel page if it's in use */
                if (map->orig_phys != INVALID_PHYS_ADDR) {
                        page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);
                        memcpy_from_page(page_address(page),
                                         map->bounce_page, 0, PAGE_SIZE);
                }
                put_page(map->bounce_page);
                map->bounce_page = page;
        }
        domain->user_bounce_pages = false;
out:
        write_unlock(&domain->bounce_lock);
}

GFP_NOFAIL things need to be fixed. Let me investigate further.

>
> These overflow checks were added without any acks by MM people...
> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Michal Hocko July 19, 2024, 7:02 a.m. UTC | #16
On Fri 19-07-24 12:35:55, Barry Song wrote:
> On Thu, Jul 18, 2024 at 8:50 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Yes, those shouldn't really fail. NOWAIT|NOFAIL was something that
> > should never happen and I really hope it doesn't. Others should really
> > retry but it's been some time since I've checked the last time.
> 
> 
> I assume allocations directly using alloc_pages() might not respect GFP_NOFAIL
> and violate the semantics of GFP_NOFAIL.

What do you mean?
 
> Additionally, at least drivers/vdpa/vdpa_user/iova_domain.c is
> incorrect with GFP_ATOMIC
> | __GFP_NOFAIL.

This is broken! Please bring this up with the maintainer of the code.

[...]

> GFP_NOFAIL things need to be fixed. Let me investigate further.

Thanks!
Barry Song July 19, 2024, 7:07 a.m. UTC | #17
On Fri, Jul 19, 2024 at 7:02 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-07-24 12:35:55, Barry Song wrote:
> > On Thu, Jul 18, 2024 at 8:50 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > Yes, those shouldn't really fail. NOWAIT|NOFAIL was something that
> > > should never happen and I really hope it doesn't. Others should really
> > > retry but it's been some time since I've checked the last time.
> >
> >
> > I assume allocations directly using alloc_pages() might not respect GFP_NOFAIL
> > and violate the semantics of GFP_NOFAIL.
>
> What do you mean?

I mean, if we are using wrappers like vmalloc(GFP_NOFAIL | GFP_NOWAIT),
then even though alloc_pages() might return NULL, vmalloc itself will retry.

But if you call alloc_pages() directly with GFP_NOFAIL | GFP_NOWAIT,
alloc_pages() may return NULL without retrying at all. I believe alloc_pages()
is also wrong.

>
> > Additionally, at least drivers/vdpa/vdpa_user/iova_domain.c is
> > incorrect with GFP_ATOMIC
> > | __GFP_NOFAIL.
>
> This is broken! Please bring this up with the maintainer of the code.
>
> [...]
>
> > GFP_NOFAIL things need to be fixed. Let me investigate further.
>
> Thanks!
>
> --
> Michal Hocko
> SUSE Labs
Christoph Hellwig July 19, 2024, 7:37 a.m. UTC | #18
On Fri, Jul 19, 2024 at 09:02:01AM +0200, Michal Hocko wrote:
> > Additionally, at least drivers/vdpa/vdpa_user/iova_domain.c is
> > incorrect with GFP_ATOMIC
> > | __GFP_NOFAIL.
> 
> This is broken! Please bring this up with the maintainer of the code.

Not the only issue in this vdpa mess :(

But I wonder if we should have a macro/BUILD_BUG_ON magic to catch
totally obvious bugs like this.
Michal Hocko July 19, 2024, 7:42 a.m. UTC | #19
On Fri 19-07-24 19:07:31, Barry Song wrote:
> On Fri, Jul 19, 2024 at 7:02 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 19-07-24 12:35:55, Barry Song wrote:
> > > On Thu, Jul 18, 2024 at 8:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > Yes, those shouldn't really fail. NOWAIT|NOFAIL was something that
> > > > should never happen and I really hope it doesn't. Others should really
> > > > retry but it's been some time since I've checked the last time.
> > >
> > >
> > > I assume allocations directly using alloc_pages() might not respect GFP_NOFAIL
> > > and violate the semantics of GFP_NOFAIL.
> >
> > What do you mean?
> 
> I mean, if we are using wrappers like vmalloc (GFP_NOFAIL | GFP_NOWAIT),
> though alloc_pages might return NULL, vmalloc for itself will retry.

vmalloc(NOFAIL|NOWAIT) is equally unsupported. This combination of flags
simply cannot be delivered.

> but if you call alloc_pages() directly with GFP_NOFAIL | GFP_NOWAIT,
> alloc_pages() may return NULL without retry at all. I believe alloc_pages()
> is also wrong.

It cannot reclaim itself and it cannot sleep to wait for memory, so the
NOFAIL semantic is simply impossible. We have put a warning in place to
catch abusers but apparently this hasn't been sufficient. There are only
two ways to deal with the failure. Either return NULL and break the
contract and see what happens (implementation now) or BUG_ON and blow up
later if the failed allocation request blows up - potentially
recoverably. Linus tends to be against adding new BUG() calls unless the
failure is absolutely unrecoverable (e.g. corrupted data structures
etc.). I am not sure how he would look at simply incorrect memory
allocator usage to blow up the kernel. Now the argument could be made
that those failures could cause subtle memory corruptions or even be
exploitable which might be a sufficient reason to stop them early. You
can try that.

I do not see a saner way to deal with this particular memory request
type. Unless we require all __GFP_NOFAIL|GFP_NOWAIT requests to check
for the failure but this makes very little sense to me.
Barry Song July 19, 2024, 7:43 a.m. UTC | #20
On Fri, Jul 19, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Jul 19, 2024 at 09:02:01AM +0200, Michal Hocko wrote:
> > > Additionally, at least drivers/vdpa/vdpa_user/iova_domain.c is
> > > incorrect with GFP_ATOMIC
> > > | __GFP_NOFAIL.
> >
> > This is broken! Please bring this up with the maintainer of the code.
>
> No the only issue in this vdpa mess :(
>
> But I wonder if we should have a macro/BUILD_BUG_ON magic to catch
> totally obvious bugs like this.

I doubt this is going to work, as users can build up gfp_flags in a variable.
On the other hand, is it necessarily a bug in vdpa? Why can't it be in mm?

If mm disallows this use of GFP_NOFAIL, there must be documentation saying
so; if it allows it, we should never return NULL.

>
Barry Song July 19, 2024, 7:51 a.m. UTC | #21
On Fri, Jul 19, 2024 at 7:42 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-07-24 19:07:31, Barry Song wrote:
> > On Fri, Jul 19, 2024 at 7:02 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 19-07-24 12:35:55, Barry Song wrote:
> > > > On Thu, Jul 18, 2024 at 8:50 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > Yes, those shouldn't really fail. NOWAIT|NOFAIL was something that
> > > > > should never happen and I really hope it doesn't. Others should really
> > > > > retry but it's been some time since I've checked the last time.
> > > >
> > > >
> > > > I assume allocations directly using alloc_pages() might not respect GFP_NOFAIL
> > > > and violate the semantics of GFP_NOFAIL.
> > >
> > > What do you mean?
> >
> > I mean, if we are using wrappers like vmalloc (GFP_NOFAIL | GFP_NOWAIT),
> > though alloc_pages might return NULL, vmalloc for itself will retry.
>
> vmalloc(NOFAIL|NOWAIT) is equally unsupported. This combination of flags
> simply cannot be delivered.
>
> > but if you call alloc_pages() directly with GFP_NOFAIL | GFP_NOWAIT,
> > alloc_pages() may return NULL without retry at all. I believe alloc_pages()
> > is also wrong.
>
> It cannot reclaim itself and it cannot sleep to wait for the memory so
> NOFAIL semantic is simply impossible. We have put a warning in place to

Retrying infinitely is still the "right" behaviour, at least according
to the documentation of __GFP_NOFAIL. I assume many retries can still
eventually get memory: some other process might be reclaiming or
freeing memory, providing free pages to the one that is stuck.

 * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
 * cannot handle allocation failures. The allocation could block
 * indefinitely but will never return with failure. Testing for
 * failure is pointless.


> catch abusers but apparently this hasn't been sufficient. There are only
> two ways to deal with the failure. Either return NULL and break the
> contract and see what happens (implementation now) or BUG_ON and blow up
> later if the the failed allocation request blows up - potentially
> recoverably. Linus tends to be against adding new BUG() calls unless the
> failure is absolutely unrecoverable (e.g. corrupted data structures
> etc.). I am not sure how he would look at simply incorrect memory
> allocator usage to blow up the kernel. Now the argument could be made
> that those failures could cause subtle memory corruptions or even be
> exploitable which might be a sufficient reason to stop them early. You
> can try that.
>
> I do not see a saner way to deal with this particular memory request
> type. Unless we require all __GFP_NOFAIL|GFP_NOWAIT requests to check
> for the failure but this makes very little sense to me.
>
> --
> Michal Hocko
> SUSE Labs
Christoph Hellwig July 19, 2024, 7:53 a.m. UTC | #22
On Fri, Jul 19, 2024 at 07:43:38PM +1200, Barry Song wrote:
> I doubt this is going to work as users can use a variant to save gfp_flags.
> On the other hand, isn't it necessarily a bug of vdpa, why can't it be mm?
> 
> if mm disallows GFP_NOFAIL,  there must be a doc to say that; if it allows,
> we should never return NULL.

Yeah.  Maybe the right answer is to have separate _nofail variants that
don't take any flags and make GFP_NOFAIL an entirely mm-private internal
flag that is rejected by all external interfaces.  That should also
really help with auditing the users.
Michal Hocko July 19, 2024, 8:01 a.m. UTC | #23
On Fri 19-07-24 19:51:06, Barry Song wrote:
> On Fri, Jul 19, 2024 at 7:42 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > It cannot reclaim itself and it cannot sleep to wait for the memory so
> > NOFAIL semantic is simply impossible. We have put a warning in place to
> 
> this is still "right" behaviour to retry infinitely at least according
> to the doc of
> __GFP_NOFAIL.

I do not agree that implementing busy loop in the kernel is the right
practice!

> I assume getting new memory by many retries is still
> possibly some other processes might be reclaiming or freeing memory
> then providing free memory to this one being stuck.

No, I strongly disagree we should even pretend this is a supported
allocation strategy. NAK to any attempt to legalize it in some form.
Barry Song July 19, 2024, 8:28 a.m. UTC | #24
On Fri, Jul 19, 2024 at 8:01 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-07-24 19:51:06, Barry Song wrote:
> > On Fri, Jul 19, 2024 at 7:42 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > It cannot reclaim itself and it cannot sleep to wait for the memory so
> > > NOFAIL semantic is simply impossible. We have put a warning in place to
> >
> > this is still "right" behaviour to retry infinitely at least according
> > to the doc of
> > __GFP_NOFAIL.
>
> I do not agree that implementing busy loop in the kernel is the right
> practice!
>
> > I assume getting new memory by many retries is still
> > possibly some other processes might be reclaiming or freeing memory
> > then providing free memory to this one being stuck.
>
> No, I strongly disagree we should even pretend this is a supported
> allocation strategy. NAK to any attempt to legalize it in some form.

Fair enough.
I am not trying to legitimize it, just explaining what the documentation says.
If it is illegal, we should clearly and firmly state that it is illegal,
rather than pretending it is legal and returning NULL. That is also wrong.

>
> --
> Michal Hocko
> SUSE Labs
Vlastimil Babka July 19, 2024, 8:35 a.m. UTC | #25
On 7/19/24 2:35 AM, Barry Song wrote:
> On Thu, Jul 18, 2024 at 8:50 PM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Thu 18-07-24 20:43:53, Barry Song wrote:
>> > On Thu, Jul 18, 2024 at 8:32 PM Michal Hocko <mhocko@suse.com> wrote:
>> > >
>> > > On Thu 18-07-24 20:18:02, Barry Song wrote:
>> > > > So the purpose is making sure the semantics - NOFAIL means no failure
>> > > > and we don't need to check ret.  If we can't really succeed, it should throw
>> > > > a BUG to stop any potential exploits.
>> > >
>> > > This would require to panic consistently on failure in all allocator
>> > > path that can bail out. E.g. page allocator on GFP_NOWAIT|GFP_NOFAIL
>> > > req. not sure how many more.
>> >
>> > Right, this GFP_NOFAIL issue seems quite messy. However, at least vmalloc
>> > will retry by itself, even if alloc_pages might have failed with
>> > GFP_NOWAIT | GFP_NOFAIL.
>> >
>> > But isn't that the definition of __GFP_NOFAIL?
>> >
>> >  * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
>> >  * cannot handle allocation failures. The allocation could block
>> >  * indefinitely but will never return with failure. Testing for
>> >  * failure is pointless."
>> >
>> > So I believe any code that doesn't retry and ends up returning NULL should be
>> > fixed.
>>
>> Yes, those shouldn't really fail. NOWAIT|NOFAIL was something that
>> should never happen and I really hope it doesn't. Others should really
>> retry but it's been some time since I've checked the last time.
> 
> 
> I assume allocations directly using alloc_pages() might not respect GFP_NOFAIL
> and violate the semantics of GFP_NOFAIL.
> 
> static inline struct page *
> __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                                                 struct alloc_context *ac) {
>         /*
>          * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
>          * we always retry
>          */
>         if (gfp_mask & __GFP_NOFAIL) {
>                 /*
>                  * All existing users of the __GFP_NOFAIL are blockable, so warn
>                  * of any new users that actually require GFP_NOWAIT
>                  */
>                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
>                         goto fail;
> ...
> }
> 
> Additionally, at least drivers/vdpa/vdpa_user/iova_domain.c is
> incorrect with GFP_ATOMIC
> | __GFP_NOFAIL.
> 
> void vduse_domain_remove_user_bounce_pages(struct vduse_iova_domain *domain)
> {
>         ...
> 
>         count = domain->bounce_size >> PAGE_SHIFT;
>         for (i = 0; i < count; i++) {
>                 ...
> 
>                 /* Copy user page to kernel page if it's in use */
>                 if (map->orig_phys != INVALID_PHYS_ADDR) {
>                         page = alloc_page(GFP_ATOMIC | __GFP_NOFAIL);

This should already be triggering the warning above? If it doesn't,
nobody has yet reached that particular line in the alloc slowpath,
probably thanks to the GFP_ATOMIC reserves.

Maybe we should tighten the warnings then.

>                         memcpy_from_page(page_address(page),
>                                          map->bounce_page, 0, PAGE_SIZE);
>                 }
>                 put_page(map->bounce_page);
>                 map->bounce_page = page;
>         }
>         domain->user_bounce_pages = false;
> out:
>         write_unlock(&domain->bounce_lock);
> }
> 
> GFP_NOFAIL things need to be fixed. Let me investigate further.
> 
>>
>> These overflow checks were added without any acks by MM people...
>> --
>> Michal Hocko
>> SUSE Labs
> 
> Thanks
> Barry
Michal Hocko July 19, 2024, 8:40 a.m. UTC | #26
On Fri 19-07-24 20:28:41, Barry Song wrote:
> On Fri, Jul 19, 2024 at 8:01 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 19-07-24 19:51:06, Barry Song wrote:
> > > On Fri, Jul 19, 2024 at 7:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > It cannot reclaim itself and it cannot sleep to wait for the memory so
> > > > NOFAIL semantic is simply impossible. We have put a warning in place to
> > >
> > > this is still "right" behaviour to retry infinitely at least according
> > > to the doc of
> > > __GFP_NOFAIL.
> >
> > I do not agree that implementing busy loop in the kernel is the right
> > practice!
> >
> > > I assume getting new memory by many retries is still
> > > possibly some other processes might be reclaiming or freeing memory
> > > then providing free memory to this one being stuck.
> >
> > No, I strongly disagree we should even pretend this is a supported
> > allocation strategy. NAK to any attempt to legalize it in some form.
> 
> fare enough.
> I am not trying to legitimize it, just explaining what the documentation says.
> If it is illegal, we should clearly and firmly state that it is
> illegal, rather than
> pretending it is legal and returning NULL. This is also wrong.

Patches to documentation are always welcome of course. Please keep in
mind that our internal interfaces (anything that is not directly
exported to userspace) are not really defensive against users. We do
expect some level of reasonable expectations from users. Think about it:
GFP_NOWAIT|__GFP_NOFAIL or GFP_ATOMIC|__GFP_NOFAIL is simply impossible
with a finite amount of memory, isn't it? You are literally saying that
the request _must not_ fail, yet it shouldn't, and indeed cannot, wait
for any memory to be reclaimed if none is ready for use!

With our gfp flag interface we have quite a few combinations of flags
that we do not support. Do we want to document all of them?
Vlastimil Babka July 19, 2024, 8:50 a.m. UTC | #27
On 7/19/24 10:01 AM, Michal Hocko wrote:
> On Fri 19-07-24 19:51:06, Barry Song wrote:
>> On Fri, Jul 19, 2024 at 7:42 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
>> > It cannot reclaim itself and it cannot sleep to wait for the memory so
>> > NOFAIL semantic is simply impossible. We have put a warning in place to
>> 
>> this is still "right" behaviour to retry infinitely at least according
>> to the doc of
>> __GFP_NOFAIL.
> 
> I do not agree that implementing busy loop in the kernel is the right
> practice!

I think there are multiple aspects to this. One is what's the right
practice, another what's the most graceful recovery from buggy code.

We can all agree that __GFP_NOFAIL together with GFP_NOWAIT or other
variants that can't sleep is wrong and unsupported. Hence we should put
sufficient warnings in place to prevent such combinations from being used in
the first place.

But what if some callsite makes it into mainline without anyone seeing the
warnings in the process and fixing them, because it's rare to hit? And then
an attacker finds a way to hit it?

We could fail the allocation which would probably crash afterwards,
hopefully non-exploitably. We could BUG_ON() which would crash reliably, but
still become a DoS vulnerability. Or we could busy loop, which might be also
a DoS but maybe not so reliable because kswapd or other direct reclaiming
processes could recover the situation (and the user could report the warning
that was produced in the process).

That wouldn't mean the busy loop is a correct and supported practice. It
would just mean it's the least bad of the bad options we have to deal with
an allocation that's wrong but we didn't catch soon enough in the development.

Compared to the original problem at hand, kmalloc_array() of impossible size
with __GFP_NOFAIL is not recoverable by busy looping, so as I've already said
I like the idea of BUG_ON() the most. Yes it's a DoS vector if someone finds
such a user triggerable allocation, but I don't see a better option.

Same thing should probably happen with a __GFP_NOFAIL page allocation that
requests >MAX_ORDER.

>> I assume getting new memory by many retries is still
>> possibly some other processes might be reclaiming or freeing memory
>> then providing free memory to this one being stuck.
> 
> No, I strongly disagree we should even pretend this is a supported
> allocation strategy. NAK to any attempt to legalize it in some form.
>
Michal Hocko July 19, 2024, 9:33 a.m. UTC | #28
On Fri 19-07-24 10:50:07, Vlastimil Babka wrote:
[...]
> That wouldn't mean the busy loop is a correct and supported practice. It
> would just mean it's the least bad of the bad options we have to deal with
> an allocation that's wrong but we didn't catch soon enough in the development.

So you want to make those potential BUG_ONs hard/soft lockups (not sure
all arches have a reliable detection) instead?
Barry Song July 19, 2024, 9:36 a.m. UTC | #29
On Fri, Jul 19, 2024 at 8:40 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-07-24 20:28:41, Barry Song wrote:
> > On Fri, Jul 19, 2024 at 8:01 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 19-07-24 19:51:06, Barry Song wrote:
> > > > On Fri, Jul 19, 2024 at 7:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > It cannot reclaim itself and it cannot sleep to wait for the memory so
> > > > > NOFAIL semantic is simply impossible. We have put a warning in place to
> > > >
> > > > this is still "right" behaviour to retry infinitely at least according
> > > > to the doc of
> > > > __GFP_NOFAIL.
> > >
> > > I do not agree that implementing busy loop in the kernel is the right
> > > practice!
> > >
> > > > I assume getting new memory by many retries is still
> > > > possibly some other processes might be reclaiming or freeing memory
> > > > then providing free memory to this one being stuck.
> > >
> > > No, I strongly disagree we should even pretend this is a supported
> > > allocation strategy. NAK to any attempt to legalize it in some form.
> >
> > fare enough.
> > I am not trying to legitimize it, just explaining what the documentation says.
> > If it is illegal, we should clearly and firmly state that it is
> > illegal, rather than
> > pretending it is legal and returning NULL. This is also wrong.
>
> Patches to docuementation are always welcome of course. Please keep in
> mind that our internal interfaces (something that is not directly
> exported to the userspace) are not really defensive against users. We do
> expect some level of reasonable expectations from users. Think about it.
> GFP_NOWAIT| __GFP_NOFAIL or GDP_ATOMIC|__GFP_NOFAIL is simply impossible
> with a finite amount of memory. Isn't it? You are literally saying that
> the request _must not_ fail yet it shouldn't resp. cannot wait for any
> memory to reclaim if it is not ready for use!
>
> With our gfp flag interface we have quite some combinations of flags
> that we do not support. Do we want to document all of them?

I don't see why not, at least for GFP_NOFAIL, because its current
documentation strongly states that it will loop infinitely even if it can't
get memory. It never mentions the potential for a NULL return. Not
everyone is an MM expert, especially considering we have hundreds
of driver developers who are simply calling APIs. We shouldn't rely on
specialized MM knowledge to implement a driver.
And I believe that even most MM experts have no idea when GFP_NOFAIL
will fail. It is really bad to keep this as it is.

> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Michal Hocko July 19, 2024, 9:45 a.m. UTC | #30
On Fri 19-07-24 21:36:38, Barry Song wrote:
> And I believe that even most MM experts have no idea when GFP_NOFAIL
> will fail. This is so bad to keep it as is.

GFP_NOFAIL doesn't fail in any supported scenario. We are talking about
how to deal with those that are unsupported. I am not sure how helpful
it is to document all the potential gfp combinations that make no sense.
Barry Song July 19, 2024, 9:58 a.m. UTC | #31
On Fri, Jul 19, 2024 at 9:45 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-07-24 21:36:38, Barry Song wrote:
> > And I believe that even most MM experts have no idea when GFP_NOFAIL
> > will fail. This is so bad to keep it as is.
>
> GFP_NOFAIL doesn't fail in any supported scenarios. We are talking how
> to deal with those that are unsupported. I am not sure how much helpful
> it is to document all potential gfp combinations that make no-sense.

Sorry, I don't see the point of what you are saying. You are simply
claiming this is the fault of those calling "unsupported" APIs, while
lacking a valid way to stop this from happening. Bear in mind,
everything which is not forbidden is allowed.

I don't think maintainers outside mm know what is supported and what is
not. An "unsupported" scenario can find its way into mainline easily.

> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Vlastimil Babka July 19, 2024, 10:10 a.m. UTC | #32
On 7/19/24 11:33 AM, Michal Hocko wrote:
> On Fri 19-07-24 10:50:07, Vlastimil Babka wrote:
> [...]
>> That wouldn't mean the busy loop is a correct and supported practice. It
>> would just mean it's the least bad of the bad options we have to deal with
>> an allocation that's wrong but we didn't catch soon enough in the development.
> 
> So you want to make those potential BUG_ONs hard/soft lockups (not sure
> all arches have a reliable detection) instead?

I'd expect that on an SMP machine there's a fair chance of being rescued
by kswapd or another direct reclaimer.
Michal Hocko July 19, 2024, 10:52 a.m. UTC | #33
On Fri 19-07-24 12:10:13, Vlastimil Babka wrote:
> On 7/19/24 11:33 AM, Michal Hocko wrote:
> > On Fri 19-07-24 10:50:07, Vlastimil Babka wrote:
> > [...]
> >> That wouldn't mean the busy loop is a correct and supported practice. It
> >> would just mean it's the least bad of the bad options we have to deal with
> >> an allocation that's wrong but we didn't catch soon enough in the development.
> > 
> > So you want to make those potential BUG_ONs hard/soft lockups (not sure
> > all arches have a reliable detection) instead?
> 
> I'd expect on a SMP machine there's fair chance of being rescued by kswapd
> or other direct reclaimer.

I would expect hard/soft lockups... Anyway, the question remains: what
is the preferred way to express that this scenario is not really
supported?
Michal Hocko July 19, 2024, 10:57 a.m. UTC | #34
On Fri 19-07-24 21:58:51, Barry Song wrote:
> On Fri, Jul 19, 2024 at 9:45 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 19-07-24 21:36:38, Barry Song wrote:
> > > And I believe that even most MM experts have no idea when GFP_NOFAIL
> > > will fail. This is so bad to keep it as is.
> >
> > GFP_NOFAIL doesn't fail in any supported scenarios. We are talking how
> > to deal with those that are unsupported. I am not sure how much helpful
> > it is to document all potential gfp combinations that make no-sense.
> 
> Sorry, I don't see any point from what you are saying. You are simply claiming
> this is the fault of those calling "unsupported" APIs while lacking a valid way
> to stop this from happening. Bear in mind, Everything which is not forbidden
> is allowed.
> 
> I don't think maintainers  outside mm know what are supported and what
> are not supported. an "unsupported" scenario can find a way to come into
> mainline easily.

Good luck documenting all of those in a comprehensible way and
maintaining that going forward.
Barry Song July 19, 2024, 11:05 a.m. UTC | #35
On Fri, Jul 19, 2024 at 10:57 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-07-24 21:58:51, Barry Song wrote:
> > On Fri, Jul 19, 2024 at 9:45 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 19-07-24 21:36:38, Barry Song wrote:
> > > > And I believe that even most MM experts have no idea when GFP_NOFAIL
> > > > will fail. This is so bad to keep it as is.
> > >
> > > GFP_NOFAIL doesn't fail in any supported scenarios. We are talking how
> > > to deal with those that are unsupported. I am not sure how much helpful
> > > it is to document all potential gfp combinations that make no-sense.
> >
> > Sorry, I don't see any point from what you are saying. You are simply claiming
> > this is the fault of those calling "unsupported" APIs while lacking a valid way
> > to stop this from happening. Bear in mind, Everything which is not forbidden
> > is allowed.
> >
> > I don't think maintainers  outside mm know what are supported and what
> > are not supported. an "unsupported" scenario can find a way to come into
> > mainline easily.
>
> Good luck documenting all of those in a comprehensible way and
> maintaining the forward.

I don't mean all unsupported GFP combinations should be documented.
Other unsupported scenarios can reasonably return NULL, and the caller
will check the return value. So, I don't see much value in documenting
them all.
But GFP_NOFAIL is absolutely *different*. Callers won't check the
return value.

BTW, we are really exposing potential exploits. Rather than an
early-stage BUG_ON(), is it reasonable to BUG_ON() at the last moment,
when we really would return NULL for NOFAIL? That will crash the system,
but it is still safer than exposing exploits.

> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Vlastimil Babka July 19, 2024, 11:13 a.m. UTC | #36
On 7/19/24 12:52 PM, Michal Hocko wrote:
> On Fri 19-07-24 12:10:13, Vlastimil Babka wrote:
>> On 7/19/24 11:33 AM, Michal Hocko wrote:
>> > On Fri 19-07-24 10:50:07, Vlastimil Babka wrote:
>> > [...]
>> >> That wouldn't mean the busy loop is a correct and supported practice. It
>> >> would just mean it's the least bad of the bad options we have to deal with
>> >> an allocation that's wrong but we didn't catch soon enough in the development.
>> > 
>> > So you want to make those potential BUG_ONs hard/soft lockups (not sure
>> > all arches have a reliable detection) instead?
>> 
>> I'd expect on a SMP machine there's fair chance of being rescued by kswapd
>> or other direct reclaimer.
> 
> I would expect hard/soft lockups... Anyway, the question remains. What
> is the preferred way to express this is not really supported scenario.

The documentation and warnings? I don't think the warnings are failing us
fundamentally. We could limit the corner cases where they are not being
reached when buggy code is being executed (maybe only in some debug
config if the checks would be too intrusive for a fast path otherwise). That
would leave the corner cases where the buggy code is executed rarely. Maybe
Christoph's suggestion for a build-time check would catch some of those.
Michal Hocko July 19, 2024, 11:19 a.m. UTC | #37
On Fri 19-07-24 23:05:48, Barry Song wrote:
[...]
> BTW, we are really exposing potential exploits. Rather than an early
> stage BUG_ON(), is it reasonable to BUG_ON when we really return
> NULL for NOFAIL at the last moment? This will crash the system but it
> is still safer than exposing exploits.

I believe the whole discussion here revolves around either using BUG_ON
or retrying without any sleep. But I guess you are specifically talking
about those two original k[v]malloc_array* interfaces which have
introduced the early break. For those, BUG_ON is definitely a safer
option than WARN_ON. Please involve Kees, who introduced those.
kvmalloc_node_noprof would require something similar. It checks for
INT_MAX. You can test whether Linus is OK with such a change that way ;)

Good luck
Michal Hocko July 19, 2024, 11:26 a.m. UTC | #38
On Fri 19-07-24 13:13:28, Vlastimil Babka wrote:
> On 7/19/24 12:52 PM, Michal Hocko wrote:
> > On Fri 19-07-24 12:10:13, Vlastimil Babka wrote:
> >> On 7/19/24 11:33 AM, Michal Hocko wrote:
> >> > On Fri 19-07-24 10:50:07, Vlastimil Babka wrote:
> >> > [...]
> >> >> That wouldn't mean the busy loop is a correct and supported practice. It
> >> >> would just mean it's the least bad of the bad options we have to deal with
> >> >> an allocation that's wrong but we didn't catch soon enough in the development.
> >> > 
> >> > So you want to make those potential BUG_ONs hard/soft lockups (not sure
> >> > all arches have a reliable detection) instead?
> >> 
> >> I'd expect on a SMP machine there's fair chance of being rescued by kswapd
> >> or other direct reclaimer.
> > 
> > I would expect hard/soft lockups... Anyway, the question remains. What
> > is the preferred way to express this is not really supported scenario.
> 
> The documentation and warnings? I don't think the warnings are failing us
> fundamentally. We could limit the corner cases where the are not being
> reached in case a buggy code is being executed (maybe only in some debug
> config if the checks would be too intrusive for a fast path otherwise). That
> would leave the corner cases where the buggy code is executed rarely. Maybe
> Christoph's suggestion for a build-time check would catch some of those.

I fail to see what you are actually proposing here. I can see only two
ways

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..d5b06ce04a0f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4391,7 +4391,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		 * of any new users that actually require GFP_NOWAIT
 		 */
 		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
-			goto fail;
+			goto retry;
 
 		/*
 		 * PF_MEMALLOC request from this context is rather bizarre

or

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..6ca9c4d33732 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4387,11 +4387,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (gfp_mask & __GFP_NOFAIL) {
 		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually require GFP_NOWAIT
+		 * All existing users of the __GFP_NOFAIL are blockable
+		 * otherwise we introduce a busy loop inside the page
+		 * allocator from non-sleepable contexts.
 		 */
-		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
-			goto fail;
+		BUG_ON(!can_direct_reclaim);
 
 		/*
 		 * PF_MEMALLOC request from this context is rather bizarre

Choose your own poison ;)

BUILD_BUG_ON will only work for statically defined masks AFAIU (which
would catch the existing abuser) but it will not catch code that
builds up the mask conditionally, so there needs to be a stop gap.
Barry Song July 19, 2024, 1:02 p.m. UTC | #39
On Fri, Jul 19, 2024 at 7:26 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-07-24 13:13:28, Vlastimil Babka wrote:
> > On 7/19/24 12:52 PM, Michal Hocko wrote:
> > > On Fri 19-07-24 12:10:13, Vlastimil Babka wrote:
> > >> On 7/19/24 11:33 AM, Michal Hocko wrote:
> > >> > On Fri 19-07-24 10:50:07, Vlastimil Babka wrote:
> > >> > [...]
> > >> >> That wouldn't mean the busy loop is a correct and supported practice. It
> > >> >> would just mean it's the least bad of the bad options we have to deal with
> > >> >> an allocation that's wrong but we didn't catch soon enough in the development.
> > >> >
> > >> > So you want to make those potential BUG_ONs hard/soft lockups (not sure
> > >> > all arches have a reliable detection) instead?
> > >>
> > >> I'd expect on a SMP machine there's fair chance of being rescued by kswapd
> > >> or other direct reclaimer.
> > >
> > > I would expect hard/soft lockups... Anyway, the question remains. What
> > > is the preferred way to express this is not really supported scenario.
> >
> > The documentation and warnings? I don't think the warnings are failing us
> > fundamentally. We could limit the corner cases where they are not being
> > reached in case buggy code is being executed (maybe only in some debug
> > config if the checks would be too intrusive for a fast path otherwise). That
> > would leave the corner cases where the buggy code is executed rarely. Maybe
> > Christoph's suggestion for a build-time check would catch some of those.
>
> I fail to see what you are actually proposing here. I can see only two
> ways
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9ecf99190ea2..d5b06ce04a0f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4391,7 +4391,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                  * of any new users that actually require GFP_NOWAIT
>                  */
>                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -                       goto fail;
> +                       goto retry;
>
>                 /*
>                  * PF_MEMALLOC request from this context is rather bizarre
>
> or
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9ecf99190ea2..6ca9c4d33732 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4387,11 +4387,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>          */
>         if (gfp_mask & __GFP_NOFAIL) {
>                 /*
> -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> -                * of any new users that actually require GFP_NOWAIT
> +                * All existing users of the __GFP_NOFAIL are blockable
> +                * otherwise we introduce a busy loop inside the page
> +                * allocator from non-sleepable contexts.
>                  */
> -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -                       goto fail;
> +               BUG_ON(!can_direct_reclaim);

what about an earlier WARN_ON, for example, even before we begin to
try the allocation?
This will help developers find their problems even during the development
stage. Then, at the last moment, if we have to return NULL, we do BUG_ON().
A potential way to mix the two poisons is to retry for a bounded time
(for example, 500ms); if we are not rescued by anyone (another direct
reclaimer/kswapd/free) within that budget, we raise BUG_ON(). But that
seems too complex :-) I'd rather simply raise BUG_ON().

>
>                 /*
>                  * PF_MEMALLOC request from this context is rather bizarre
>
> Choose your own poison ;)
>
> BUILD_BUG_ON will only work for statically defined masks AFAIU (which
> would catch the existing abuser) but it will not catch the code that
> builds up the mask conditionally, so there needs to be a stop gap.
> --
> Michal Hocko
> SUSE Labs
Michal Hocko July 19, 2024, 1:30 p.m. UTC | #40
On Fri 19-07-24 21:02:10, Barry Song wrote:
> what about an earlier WARN_ON, for example, even before we begin to
> try the allocation?

Page allocator is a hot path and adding checks for something that
shouldn't really even exist is not a great addition there IMHO.
Barry Song July 20, 2024, 12:36 a.m. UTC | #41
On Fri, Jul 19, 2024 at 9:30 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 19-07-24 21:02:10, Barry Song wrote:
> > what about an earlier WARN_ON, for example, even before we begin to
> > try the allocation?
>
> Page allocator is a hot path and adding checks for something that
> shouldn't really even exist is not a great addition there IMHO.

I would argue adding checks for something that shouldn't really
even exist is precisely the point of having those checks. These
checks help ensure the integrity and robustness of the system
by catching unexpected conditions. I agree that the page allocator
is a hot path, but adding one line might not cause noticeable
overhead, especially considering that alloc_pages() already
contains a few hundred lines of code.

Regardless, let me try to summarize the discussions. Unless
anyone has better ideas, v2 series might start with the following
actions:

1. Update the documentation for GFP_NOFAIL to explicitly state
that non-wait GFP_NOFAIL is not legal and should be avoided.
This will provide a basis for other subsystems to explicitly
reject anyone using GFP_ATOMIC | GFP_NOFAIL, etc.

2. Add BUG_ON() at the existing points where we return NULL
for GFP_NOFAIL -  __alloc_pages_slowpath() , kmalloc, vmalloc
to avoid exposing security vulnerabilities.

3. Raise a bug report to the vdpa maintainer, requesting that they
either drop GFP_ATOMIC or GFP_NOFAIL based on whether
their context is atomic. If GFP_NOFAIL is dropped, ask for an
explicit check on the return value.

Christoph, Michal, Hailong, Vlastimil, if you have any better ideas
or objections, please let me know :-)

>
> --
> Michal Hocko
> SUSE Labs

Thanks
Barry
Barry Song July 20, 2024, 10:14 p.m. UTC | #42
On Fri, Jul 19, 2024 at 7:53 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Fri, Jul 19, 2024 at 07:43:38PM +1200, Barry Song wrote:
> > I doubt this is going to work as users can use a variant to save gfp_flags.
> > On the other hand, isn't it necessarily a bug of vdpa, why can't it be mm?
> >
> > if mm disallows GFP_NOFAIL,  there must be a doc to say that; if it allows,
> > we should never return NULL.
>
> Yeah.  Maybe the right answer is to have separate _nofail variants that
> don't take any flags and make GFP_NOFAIL an entirely mm-private internal
> flags that is rejected by all external interfaces.  That should also
> really help with auditing the users.

Just like Michal has consistently asserted that using GFP_NOFAIL with
non-wait is against the rules, I think we should enforce this policy by:

diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 313be4ad79fd..a5c09f9590f2 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -258,7 +258,7 @@ enum {
 #define __GFP_KSWAPD_RECLAIM   ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
 #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
 #define __GFP_RETRY_MAYFAIL    ((__force gfp_t)___GFP_RETRY_MAYFAIL)
-#define __GFP_NOFAIL   ((__force gfp_t)___GFP_NOFAIL)
+#define __GFP_NOFAIL   ((__force gfp_t)(___GFP_NOFAIL | ___GFP_DIRECT_RECLAIM))
 #define __GFP_NORETRY  ((__force gfp_t)___GFP_NORETRY)

Anyone misusing GFP_ATOMIC | __GFP_NOFAIL in an atomic context
risks experiencing a crash due to sleep in atomic. This is a common
consequence, as all instances of sleep in atomic should result in the
same issue.

>

Thanks
Barry
Michal Hocko July 22, 2024, 7:23 a.m. UTC | #43
On Sat 20-07-24 08:36:16, Barry Song wrote:
> On Fri, Jul 19, 2024 at 9:30 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 19-07-24 21:02:10, Barry Song wrote:
> > > what about an earlier WARN_ON, for example, even before we begin to
> > > try the allocation?
> >
> > Page allocator is a hot path and adding checks for something that
> > shouldn't really even exist is not a great addition there IMHO.
> 
> I would argue adding checks for something that shouldn't really
> even exist is precisely the point of having those checks. These
> checks help ensure the integrity and robustness of the system
> by catching unexpected conditions. I agree that the page allocator
> is a hot path, but adding one line might not cause noticeable
> overhead, especially considering that alloc_pages() already
> contains a few hundred lines of code.

We do not add stuff like that into the hot path.

> Regardless, let me try to summarize the discussions. Unless
> anyone has better ideas, v2 series might start with the following
> actions:
> 
> 1. Update the documentation for GFP_NOFAIL to explicitly state
> that non-wait GFP_NOFAIL is not legal and should be avoided.
> This will provide a basis for other subsystems to explicitly
> reject anyone using GFP_ATOMIC | GFP_NOFAIL, etc.

No objection to that.

> 2. Add BUG_ON() at the existing points where we return NULL
> for GFP_NOFAIL -  __alloc_pages_slowpath() , kmalloc, vmalloc
> to avoid exposing security vulnerabilities.

Let's see how this goes.

> 3. Raise a bug report to the vdpa maintainer, requesting that they
> either drop GFP_ATOMIC or GFP_NOFAIL based on whether
> their context is atomic. If GFP_NOFAIL is dropped, ask for an
> explicit check on the return value.

GFP_ATOMIC is likely used because of write_lock(&domain->bounce_lock);
vduse_domain_remove_user_bounce_pages() is itself called with a spinlock held
from vduse_domain_release(). So you would need some pre-allocation.
Michal Hocko July 22, 2024, 7:26 a.m. UTC | #44
On Sun 21-07-24 10:14:03, Barry Song wrote:
> On Fri, Jul 19, 2024 at 7:53 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Fri, Jul 19, 2024 at 07:43:38PM +1200, Barry Song wrote:
> > > I doubt this is going to work as users can use a variant to save gfp_flags.
> > > On the other hand, isn't it necessarily a bug of vdpa, why can't it be mm?
> > >
> > > if mm disallows GFP_NOFAIL,  there must be a doc to say that; if it allows,
> > > we should never return NULL.
> >
> > Yeah.  Maybe the right answer is to have separate _nofail variants that
> > don't take any flags and make GFP_NOFAIL an entirely mm-private internal
> > flags that is rejected by all external interfaces.  That should also
> > really help with auditing the users.

This would require duplicating many of our allocations APIs.

> Just like Michal has consistently asserted that using GFP_NOFAIL with
> non-wait is against the rules, I think we should enforce this policy by:
> 
> diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> index 313be4ad79fd..a5c09f9590f2 100644
> --- a/include/linux/gfp_types.h
> +++ b/include/linux/gfp_types.h
> @@ -258,7 +258,7 @@ enum {
>  #define __GFP_KSWAPD_RECLAIM   ((__force gfp_t)___GFP_KSWAPD_RECLAIM)
> /* kswapd can wake */
>  #define __GFP_RECLAIM ((__force
> gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
>  #define __GFP_RETRY_MAYFAIL    ((__force gfp_t)___GFP_RETRY_MAYFAIL)
> -#define __GFP_NOFAIL   ((__force gfp_t)___GFP_NOFAIL)
> +#define __GFP_NOFAIL   ((__force gfp_t)(___GFP_NOFAIL | ___GFP_DIRECT_RECLAIM))
>  #define __GFP_NORETRY  ((__force gfp_t)___GFP_NORETRY)
> 
> Anyone misusing GFP_ATOMIC | __GFP_NOFAIL in an atomic context
> risks experiencing a crash due to sleep in atomic. This is a common
> consequence, as all instances of sleep in atomic should result in the
> same issue.

I really dislike any of __GFP_$FOO to have side effects like this.
Please let's not overdo this.
Vlastimil Babka July 22, 2024, 7:34 a.m. UTC | #45
On 7/22/24 9:23 AM, Michal Hocko wrote:
> On Sat 20-07-24 08:36:16, Barry Song wrote:
>> On Fri, Jul 19, 2024 at 9:30 PM Michal Hocko <mhocko@suse.com> wrote:
>> >
>> > On Fri 19-07-24 21:02:10, Barry Song wrote:
>> > > what about an earlier WARN_ON, for example, even before we begin to
>> > > try the allocation?
>> >
>> > Page allocator is a hot path and adding checks for something that
>> > shouldn't really even exist is not a great addition there IMHO.
>> 
>> I would argue adding checks for something that shouldn't really
>> even exist is precisely the point of having those checks. These
>> checks help ensure the integrity and robustness of the system
>> by catching unexpected conditions. I agree that the page allocator
>> is a hot path, but adding one line might not cause noticeable
>> overhead, especially considering that alloc_pages() already
>> contains a few hundred lines of code.
> 
> We do not add stuff like that into hot path.

It should be acceptable with CONFIG_DEBUG_VM. Or in __alloc_pages_slowpath()
it could be placed higher even without CONFIG_DEBUG_VM. Right now it's
rather hard to reach, especially for GFP_ATOMIC.

> 
>> Regardless, let me try to summarize the discussions. Unless
>> anyone has better ideas, v2 series might start with the following
>> actions:
>> 
>> 1. Update the documentation for GFP_NOFAIL to explicitly state
>> that non-wait GFP_NOFAIL is not legal and should be avoided.
>> This will provide a basis for other subsystems to explicitly
>> reject anyone using GFP_ATOMIC | GFP_NOFAIL, etc.
> 
> No objection to that.
> 
>> 2. Add BUG_ON() at the existing points where we return NULL
>> for GFP_NOFAIL -  __alloc_pages_slowpath() , kmalloc, vmalloc
>> to avoid exposing security vulnerabilities.
> 
> Let's see how this goes.

In __alloc_pages_noprof() we also have

        if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
                return NULL;

This should BUG_ON for __GFP_NOFAIL, same as the overflowing kmalloc_array().

> 
>> 3. Raise a bug report to the vdpa maintainer, requesting that they
>> either drop GFP_ATOMIC or GFP_NOFAIL based on whether
>> their context is atomic. If GFP_NOFAIL is dropped, ask for an
>> explicit check on the return value.

You could do it right now, based on the warning we have, although it's hard
to reach.

> GFP_ATOMIC is likely used because of write_lock(&domain->bounce_lock);
> vduse_domain_remove_user_bounce_pages() is itself called with a spinlock held
> from vduse_domain_release(). So you would need some pre-allocation.
Barry Song July 22, 2024, 8:09 a.m. UTC | #46
On Mon, Jul 22, 2024 at 7:26 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 21-07-24 10:14:03, Barry Song wrote:
> > On Fri, Jul 19, 2024 at 7:53 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Fri, Jul 19, 2024 at 07:43:38PM +1200, Barry Song wrote:
> > > > I doubt this is going to work as users can use a variant to save gfp_flags.
> > > > On the other hand, isn't it necessarily a bug of vdpa, why can't it be mm?
> > > >
> > > > if mm disallows GFP_NOFAIL,  there must be a doc to say that; if it allows,
> > > > we should never return NULL.
> > >
> > > Yeah.  Maybe the right answer is to have separate _nofail variants that
> > > don't take any flags and make GFP_NOFAIL an entirely mm-private internal
> > > flags that is rejected by all external interfaces.  That should also
> > > really help with auditing the users.
>
> This would require duplicating many of our allocations APIs.
>
> > Just like Michal has consistently asserted that using GFP_NOFAIL with
> > non-wait is against the rules, I think we should enforce this policy by:
> >
> > diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> > index 313be4ad79fd..a5c09f9590f2 100644
> > --- a/include/linux/gfp_types.h
> > +++ b/include/linux/gfp_types.h
> > @@ -258,7 +258,7 @@ enum {
> >  #define __GFP_KSWAPD_RECLAIM   ((__force gfp_t)___GFP_KSWAPD_RECLAIM)
> > /* kswapd can wake */
> >  #define __GFP_RECLAIM ((__force
> > gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
> >  #define __GFP_RETRY_MAYFAIL    ((__force gfp_t)___GFP_RETRY_MAYFAIL)
> > -#define __GFP_NOFAIL   ((__force gfp_t)___GFP_NOFAIL)
> > +#define __GFP_NOFAIL   ((__force gfp_t)(___GFP_NOFAIL | ___GFP_DIRECT_RECLAIM))
> >  #define __GFP_NORETRY  ((__force gfp_t)___GFP_NORETRY)
> >
> > Anyone misusing GFP_ATOMIC | __GFP_NOFAIL in an atomic context
> > risks experiencing a crash due to sleep in atomic. This is a common
> > consequence, as all instances of sleep in atomic should result in the
> > same issue.
>
> I really dislike any of __GFP_$FOO to have side effects like this.
> Please let's not overdo this.

Okay, but my point is that if GFP_NOFAIL is inevitably blockable, why
not enforce this and let users understand that it is definitively
blockable? Just like when we call alloc_pages(GFP_KERNEL), we know
it might sleep.

My thoughts on this aren't very strong, just my two cents :-)

> --
> Michal Hocko
> SUSE Labs
Michal Hocko July 22, 2024, 9:01 a.m. UTC | #47
On Mon 22-07-24 20:09:37, Barry Song wrote:
> On Mon, Jul 22, 2024 at 7:26 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 21-07-24 10:14:03, Barry Song wrote:
> > > On Fri, Jul 19, 2024 at 7:53 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Fri, Jul 19, 2024 at 07:43:38PM +1200, Barry Song wrote:
> > > > > I doubt this is going to work as users can use a variant to save gfp_flags.
> > > > > On the other hand, isn't it necessarily a bug of vdpa, why can't it be mm?
> > > > >
> > > > > if mm disallows GFP_NOFAIL,  there must be a doc to say that; if it allows,
> > > > > we should never return NULL.
> > > >
> > > > Yeah.  Maybe the right answer is to have separate _nofail variants that
> > > > don't take any flags and make GFP_NOFAIL an entirely mm-private internal
> > > > flags that is rejected by all external interfaces.  That should also
> > > > really help with auditing the users.
> >
> > This would require duplicating many of our allocations APIs.
> >
> > > Just like Michal has consistently asserted that using GFP_NOFAIL with
> > > non-wait is against the rules, I think we should enforce this policy by:
> > >
> > > diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
> > > index 313be4ad79fd..a5c09f9590f2 100644
> > > --- a/include/linux/gfp_types.h
> > > +++ b/include/linux/gfp_types.h
> > > @@ -258,7 +258,7 @@ enum {
> > >  #define __GFP_KSWAPD_RECLAIM   ((__force gfp_t)___GFP_KSWAPD_RECLAIM)
> > > /* kswapd can wake */
> > >  #define __GFP_RECLAIM ((__force
> > > gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
> > >  #define __GFP_RETRY_MAYFAIL    ((__force gfp_t)___GFP_RETRY_MAYFAIL)
> > > -#define __GFP_NOFAIL   ((__force gfp_t)___GFP_NOFAIL)
> > > +#define __GFP_NOFAIL   ((__force gfp_t)(___GFP_NOFAIL | ___GFP_DIRECT_RECLAIM))
> > >  #define __GFP_NORETRY  ((__force gfp_t)___GFP_NORETRY)
> > >
> > > Anyone misusing GFP_ATOMIC | __GFP_NOFAIL in an atomic context
> > > risks experiencing a crash due to sleep in atomic. This is a common
> > > consequence, as all instances of sleep in atomic should result in the
> > > same issue.
> >
> > I really dislike any of __GFP_$FOO to have side effects like this.
> > Please let's not overdo this.
> 
> Okay, but my point is that if GFP_NOFAIL is inevitably blockable, why
> not enforce this and let users understand that it is definitively
> blockable?  ust like when we call alloc_pages(GFP_KERNEL), we know
> it might sleep.

__GFP_$FOO are usually low level. GFP_$FOO are high level and they
combine several subflags to have a specific meaning. So this would need
to be GFP_NOFAIL. Btw. the same applies to __GFP_NORETRY and
__GFP_RETRY_MAYFAIL.

Whether changing existing users of those flags is worth is a different
question.
Christoph Hellwig July 22, 2024, 11:18 p.m. UTC | #48
On Mon, Jul 22, 2024 at 11:01:23AM +0200, Michal Hocko wrote:
> __GFP_$FOO are usually low level. GFP_$FOO are high level and they
> combine several subflags to have a specific meaning. So this would need
> to be GFP_NOFAIL. Btw. the same applies to __GFP_NORETRY and
> __GFP_RETRY_MAYFAIL.

True.  But I think adding GFP_NOFAIL and slowly upping the enforcement
that no one is using __GFP_NOFAIL directly will get us a similar effect
to my *_nofail proposal.  It will require manual or scripted checking
instead of relying on the compiler, but it's much better than what we
have right now.
Barry Song July 22, 2024, 11:22 p.m. UTC | #49
On Tue, Jul 23, 2024 at 11:18 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Jul 22, 2024 at 11:01:23AM +0200, Michal Hocko wrote:
> > __GFP_$FOO are usually low level. GFP_$FOO are high level and they
> > combine several subflags to have a specific meaning. So this would need
> > to be GFP_NOFAIL. Btw. the same applies to __GFP_NORETRY and
> > __GFP_RETRY_MAYFAIL.
>
> True.  But I think adding GFP_NOFAIL and slowly upping the enforcement
> that no one is using __GFP_NOFAIL directly will get us a similar effect
> to my *_nofail proposal.  It will require manual or scripted checking
> instead of relying on the compiler, but it's much better than what we
> have right now.

I agree. My proposal was actually to enforce blocking in GFP_NOFAIL, while
the posted code was an ugly hack at the lower level with __GFP_NOFAIL.
diff mbox series

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index a332dd2fa6cd..c6aec311864f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -692,8 +692,10 @@  static inline __alloc_size(1, 2) void *kmalloc_array_noprof(size_t n, size_t siz
 {
 	size_t bytes;
 
-	if (unlikely(check_mul_overflow(n, size, &bytes)))
+	if (unlikely(check_mul_overflow(n, size, &bytes))) {
+		WARN_ON(flags & __GFP_NOFAIL);
 		return NULL;
+	}
 	if (__builtin_constant_p(n) && __builtin_constant_p(size))
 		return kmalloc_noprof(bytes, flags);
 	return kmalloc_noprof(bytes, flags);
@@ -794,8 +796,10 @@  kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
 {
 	size_t bytes;
 
-	if (unlikely(check_mul_overflow(n, size, &bytes)))
+	if (unlikely(check_mul_overflow(n, size, &bytes))) {
+		WARN_ON(flags & __GFP_NOFAIL);
 		return NULL;
+	}
 
 	return kvmalloc_node_noprof(bytes, flags, node);
 }