diff mbox series

[3/5] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails

Message ID 20240724085544.299090-4-21cnbao@gmail.com (mailing list archive)
State New
Headers show
Series mm: clarify nofail memory allocation | expand

Commit Message

Barry Song July 24, 2024, 8:55 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

We have cases we still fail though callers might have __GFP_NOFAIL.
Since they don't check the return, we are exposed to the security
risks for NULL deference.

Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
situation.

Christoph Hellwig:
The whole freaking point of __GFP_NOFAIL is that callers don't handle
allocation failures.  So in fact a straight BUG is the right thing
here.

Vlastimil Babka:
It's just not a recoverable situation (WARN_ON is for recoverable
situations). The caller cannot handle allocation failure and at the same
time asked for an impossible allocation. BUG_ON() is a guaranteed oops
with stracktrace etc. We don't need to hope for the later NULL pointer
dereference (which might if really unlucky happen from a different
context where it's no longer obvious what lead to the allocation failing).

Michal Hocko:
Linus tends to be against adding new BUG() calls unless the failure is
absolutely unrecoverable (e.g. corrupted data structures etc.). I am
not sure how he would look at simply incorrect memory allocator usage to
blow up the kernel. Now the argument could be made that those failures
could cause subtle memory corruptions or even be exploitable which might
be a sufficient reason to stop them early.

Cc: Michal Hocko <mhocko@suse.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 include/linux/slab.h |  4 +++-
 mm/page_alloc.c      | 10 +++++-----
 mm/util.c            |  1 +
 3 files changed, 9 insertions(+), 6 deletions(-)

Comments

Vlastimil Babka July 24, 2024, 10:03 a.m. UTC | #1
On 7/24/24 10:55 AM, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> We have cases we still fail though callers might have __GFP_NOFAIL.
> Since they don't check the return, we are exposed to the security
> risks for NULL deference.
> 
> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> situation.
> 
> Christoph Hellwig:
> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> allocation failures.  So in fact a straight BUG is the right thing
> here.
> 
> Vlastimil Babka:
> It's just not a recoverable situation (WARN_ON is for recoverable
> situations). The caller cannot handle allocation failure and at the same
> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> with stracktrace etc. We don't need to hope for the later NULL pointer
> dereference (which might if really unlucky happen from a different
> context where it's no longer obvious what lead to the allocation failing).

Note that quote was meant specifically for the "too large" allocation, which
is truly impossible. That includes the kvmalloc_array() overflow, order >
MAX_ORDER etc.

The "can't sleep/reclaim" is a bit more nuanced as there's the alternative
in just warning and looping and hoping kswapd or some other direct reclaimer
saves the day. If yes, great, we have a system that still works and a
warning to repor. If no, there's still a warning, but later soft/hardlockup
hits. These might be eventually worse than an immediate BUG_ON so it's not a
clear cut. At least I think these cases should be handled in two different
patches and not together.

> Michal Hocko:
> Linus tends to be against adding new BUG() calls unless the failure is
> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> not sure how he would look at simply incorrect memory allocator usage to
> blow up the kernel. Now the argument could be made that those failures
> could cause subtle memory corruptions or even be exploitable which might
> be a sufficient reason to stop them early.
> 
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/slab.h |  4 +++-
>  mm/page_alloc.c      | 10 +++++-----
>  mm/util.c            |  1 +
>  3 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index c9cb42203183..4a4d1fdc2afe 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>  {
>  	size_t bytes;
>  
> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> +		BUG_ON(flags & __GFP_NOFAIL);
>  		return NULL;
> +	}
>  
>  	return kvmalloc_node_noprof(bytes, flags, node);
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 45d2f41b4783..4d6af00fccd4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4435,11 +4435,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	if (gfp_mask & __GFP_NOFAIL) {
>  		/*
> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> -		 * of any new users that actually require GFP_NOWAIT
> +		 * All existing users of the __GFP_NOFAIL are blockable
> +		 * otherwise we introduce a busy loop with inside the page
> +		 * allocator from non-sleepable contexts
>  		 */
> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -			goto fail;
> +		BUG_ON(!can_direct_reclaim);
>  
>  		/*
>  		 * PF_MEMALLOC request from this context is rather bizarre
> @@ -4470,7 +4470,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		cond_resched();
>  		goto retry;
>  	}
> -fail:
> +
>  	warn_alloc(gfp_mask, ac->nodemask,
>  			"page allocation failure: order:%u", order);
>  got_pg:
> diff --git a/mm/util.c b/mm/util.c
> index 0ff5898cc6de..a1be50c243f1 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -668,6 +668,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>  	/* Don't even allow crazy sizes */
>  	if (unlikely(size > INT_MAX)) {
>  		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> +		BUG_ON(flags & __GFP_NOFAIL);
>  		return NULL;
>  	}
>
Barry Song July 24, 2024, 10:11 a.m. UTC | #2
On Wed, Jul 24, 2024 at 10:03 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/24/24 10:55 AM, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > We have cases we still fail though callers might have __GFP_NOFAIL.
> > Since they don't check the return, we are exposed to the security
> > risks for NULL deference.
> >
> > Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> > situation.
> >
> > Christoph Hellwig:
> > The whole freaking point of __GFP_NOFAIL is that callers don't handle
> > allocation failures.  So in fact a straight BUG is the right thing
> > here.
> >
> > Vlastimil Babka:
> > It's just not a recoverable situation (WARN_ON is for recoverable
> > situations). The caller cannot handle allocation failure and at the same
> > time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> > with stracktrace etc. We don't need to hope for the later NULL pointer
> > dereference (which might if really unlucky happen from a different
> > context where it's no longer obvious what lead to the allocation failing).
>
> Note that quote was meant specifically for the "too large" allocation, which
> is truly impossible. That includes the kvmalloc_array() overflow, order >
> MAX_ORDER etc.

I equally quote this for two cases because non-sleepable is also returning NULL,
in this means, they are currently facing the same problems.

>
> The "can't sleep/reclaim" is a bit more nuanced as there's the alternative
> in just warning and looping and hoping kswapd or some other direct reclaimer
> saves the day. If yes, great, we have a system that still works and a
> warning to repor. If no, there's still a warning, but later soft/hardlockup
> hits. These might be eventually worse than an immediate BUG_ON so it's not a
> clear cut. At least I think these cases should be handled in two different
> patches and not together.

But I fully agree these two can be separated and judged separately.

After more thinking, I am concerned that this issue might be difficult to be
rescued, as the misuse of GFP_ATOMIC | __GFP_NOFAIL typically occurs
in atomic contexts with strict time requirements. Even if some other
components release memory to satisfy the one busy-looping to obtain
memory, it might already be too late?

>
> > Michal Hocko:
> > Linus tends to be against adding new BUG() calls unless the failure is
> > absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> > not sure how he would look at simply incorrect memory allocator usage to
> > blow up the kernel. Now the argument could be made that those failures
> > could cause subtle memory corruptions or even be exploitable which might
> > be a sufficient reason to stop them early.
> >
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Kees Cook <kees@kernel.org>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/slab.h |  4 +++-
> >  mm/page_alloc.c      | 10 +++++-----
> >  mm/util.c            |  1 +
> >  3 files changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index c9cb42203183..4a4d1fdc2afe 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >  {
> >       size_t bytes;
> >
> > -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> > +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> > +             BUG_ON(flags & __GFP_NOFAIL);
> >               return NULL;
> > +     }
> >
> >       return kvmalloc_node_noprof(bytes, flags, node);
> >  }
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 45d2f41b4783..4d6af00fccd4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4435,11 +4435,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >        */
> >       if (gfp_mask & __GFP_NOFAIL) {
> >               /*
> > -              * All existing users of the __GFP_NOFAIL are blockable, so warn
> > -              * of any new users that actually require GFP_NOWAIT
> > +              * All existing users of the __GFP_NOFAIL are blockable
> > +              * otherwise we introduce a busy loop with inside the page
> > +              * allocator from non-sleepable contexts
> >                */
> > -             if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > -                     goto fail;
> > +             BUG_ON(!can_direct_reclaim);
> >
> >               /*
> >                * PF_MEMALLOC request from this context is rather bizarre
> > @@ -4470,7 +4470,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >               cond_resched();
> >               goto retry;
> >       }
> > -fail:
> > +
> >       warn_alloc(gfp_mask, ac->nodemask,
> >                       "page allocation failure: order:%u", order);
> >  got_pg:
> > diff --git a/mm/util.c b/mm/util.c
> > index 0ff5898cc6de..a1be50c243f1 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -668,6 +668,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> >       /* Don't even allow crazy sizes */
> >       if (unlikely(size > INT_MAX)) {
> >               WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> > +             BUG_ON(flags & __GFP_NOFAIL);
> >               return NULL;
> >       }
> >
>
Michal Hocko July 24, 2024, 12:10 p.m. UTC | #3
On Wed 24-07-24 20:55:42, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> We have cases we still fail though callers might have __GFP_NOFAIL.
> Since they don't check the return, we are exposed to the security
> risks for NULL deference.
> 
> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> situation.
> 
> Christoph Hellwig:
> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> allocation failures.  So in fact a straight BUG is the right thing
> here.
> 
> Vlastimil Babka:
> It's just not a recoverable situation (WARN_ON is for recoverable
> situations). The caller cannot handle allocation failure and at the same
> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> with stracktrace etc. We don't need to hope for the later NULL pointer
> dereference (which might if really unlucky happen from a different
> context where it's no longer obvious what lead to the allocation failing).
> 
> Michal Hocko:
> Linus tends to be against adding new BUG() calls unless the failure is
> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> not sure how he would look at simply incorrect memory allocator usage to
> blow up the kernel. Now the argument could be made that those failures
> could cause subtle memory corruptions or even be exploitable which might
> be a sufficient reason to stop them early.

I think it is worth adding that size checks are not really actionable
because they either cause unexpected failure or BUG_ON. It is not too
much of a stretch to expect some of the user triggerable codepaths could
hit this - e.g. when input is not checked properly. Silent failure is
then a potential security risk.

The page allocator, on the other hand, can chose to keep retrying even
if that means that there is not reclaim going on and essentially cause a
busy loop in the kernel space. That would eventually cause soft/hard
lockup detector to fire (if an architecture offers a reliable one).
So essentially there is choice between two bad solutions and you have
chosen one that reliably bugs on rather than rely on something external
to intervene. The reasoning for that should be mentioned in the
changelog.

[...]
> diff --git a/mm/util.c b/mm/util.c
> index 0ff5898cc6de..a1be50c243f1 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -668,6 +668,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>  	/* Don't even allow crazy sizes */
>  	if (unlikely(size > INT_MAX)) {
>  		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> +		BUG_ON(flags & __GFP_NOFAIL);

I guess you want to switch the ordering. WARNING on top of BUG on seems
rather pointless IMHO.

>  		return NULL;
>  	}
>  
> -- 
> 2.34.1
diff mbox series

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index c9cb42203183..4a4d1fdc2afe 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -827,8 +827,10 @@  kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
 {
 	size_t bytes;
 
-	if (unlikely(check_mul_overflow(n, size, &bytes)))
+	if (unlikely(check_mul_overflow(n, size, &bytes))) {
+		BUG_ON(flags & __GFP_NOFAIL);
 		return NULL;
+	}
 
 	return kvmalloc_node_noprof(bytes, flags, node);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 45d2f41b4783..4d6af00fccd4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4435,11 +4435,11 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (gfp_mask & __GFP_NOFAIL) {
 		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually require GFP_NOWAIT
+		 * All existing users of the __GFP_NOFAIL are blockable
+		 * otherwise we introduce a busy loop with inside the page
+		 * allocator from non-sleepable contexts
 		 */
-		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
-			goto fail;
+		BUG_ON(!can_direct_reclaim);
 
 		/*
 		 * PF_MEMALLOC request from this context is rather bizarre
@@ -4470,7 +4470,7 @@  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		cond_resched();
 		goto retry;
 	}
-fail:
+
 	warn_alloc(gfp_mask, ac->nodemask,
 			"page allocation failure: order:%u", order);
 got_pg:
diff --git a/mm/util.c b/mm/util.c
index 0ff5898cc6de..a1be50c243f1 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -668,6 +668,7 @@  void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
 	/* Don't even allow crazy sizes */
 	if (unlikely(size > INT_MAX)) {
 		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
+		BUG_ON(flags & __GFP_NOFAIL);
 		return NULL;
 	}