diff mbox series

[v3,3/4] mm: BUG_ON to avoid NULL deference while __GFP_NOFAIL fails

Message ID 20240817062449.21164-4-21cnbao@gmail.com (mailing list archive)
State New
Headers show
Series mm: clarify nofail memory allocation | expand

Commit Message

Barry Song Aug. 17, 2024, 6:24 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

We have cases we still fail though callers might have __GFP_NOFAIL.  Since
they don't check the return, we are exposed to the security risks for NULL
deference.

Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
situation.

Christoph Hellwig:
The whole freaking point of __GFP_NOFAIL is that callers don't handle
allocation failures.  So in fact a straight BUG is the right thing
here.

Vlastimil Babka:
It's just not a recoverable situation (WARN_ON is for recoverable
situations). The caller cannot handle allocation failure and at the same
time asked for an impossible allocation. BUG_ON() is a guaranteed oops
with stracktrace etc. We don't need to hope for the later NULL pointer
dereference (which might if really unlucky happen from a different
context where it's no longer obvious what lead to the allocation failing).

Michal Hocko:
Linus tends to be against adding new BUG() calls unless the failure is
absolutely unrecoverable (e.g. corrupted data structures etc.). I am
not sure how he would look at simply incorrect memory allocator usage to
blow up the kernel. Now the argument could be made that those failures
could cause subtle memory corruptions or even be exploitable which might
be a sufficient reason to stop them early.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <kees@kernel.org>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: Hailong.Liu <hailong.liu@oppo.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
---
 include/linux/slab.h | 4 +++-
 mm/page_alloc.c      | 4 +++-
 mm/util.c            | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

Comments

David Hildenbrand Aug. 19, 2024, 9:43 a.m. UTC | #1
On 17.08.24 08:24, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
> they don't check the return, we are exposed to the security risks for NULL
> deference.
> 
> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> situation.
> 
> Christoph Hellwig:
> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> allocation failures.  So in fact a straight BUG is the right thing
> here.
> 
> Vlastimil Babka:
> It's just not a recoverable situation (WARN_ON is for recoverable
> situations). The caller cannot handle allocation failure and at the same
> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> with stracktrace etc. We don't need to hope for the later NULL pointer
> dereference (which might if really unlucky happen from a different
> context where it's no longer obvious what lead to the allocation failing).
> 
> Michal Hocko:
> Linus tends to be against adding new BUG() calls unless the failure is
> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> not sure how he would look at simply incorrect memory allocator usage to
> blow up the kernel. Now the argument could be made that those failures
> could cause subtle memory corruptions or even be exploitable which might
> be a sufficient reason to stop them early.
> 
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> Cc: Hailong.Liu <hailong.liu@oppo.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> ---
>   include/linux/slab.h | 4 +++-
>   mm/page_alloc.c      | 4 +++-
>   mm/util.c            | 1 +
>   3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index c9cb42203183..4a4d1fdc2afe 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>   {
>   	size_t bytes;
>   
> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> +		BUG_ON(flags & __GFP_NOFAIL);
>   		return NULL;
> +	}
>   
>   	return kvmalloc_node_noprof(bytes, flags, node);
>   }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 60742d057b05..d2c37f8f8d09 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>   	 * There are several places where we assume that the order value is sane
>   	 * so bail out early if the request is out of bound.
>   	 */
> -	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> +	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> +		BUG_ON(gfp & __GFP_NOFAIL);
>   		return NULL;
> +	}
>   
>   	gfp &= gfp_allowed_mask;
>   	/*
> diff --git a/mm/util.c b/mm/util.c
> index ac01925a4179..678c647b778f 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>   
>   	/* Don't even allow crazy sizes */
>   	if (unlikely(size > INT_MAX)) {
> +		BUG_ON(flags & __GFP_NOFAIL);

No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
Barry Song Aug. 19, 2024, 9:47 a.m. UTC | #2
On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 17.08.24 08:24, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > We have cases we still fail though callers might have __GFP_NOFAIL.  Since
> > they don't check the return, we are exposed to the security risks for NULL
> > deference.
> >
> > Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> > situation.
> >
> > Christoph Hellwig:
> > The whole freaking point of __GFP_NOFAIL is that callers don't handle
> > allocation failures.  So in fact a straight BUG is the right thing
> > here.
> >
> > Vlastimil Babka:
> > It's just not a recoverable situation (WARN_ON is for recoverable
> > situations). The caller cannot handle allocation failure and at the same
> > time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> > with stracktrace etc. We don't need to hope for the later NULL pointer
> > dereference (which might if really unlucky happen from a different
> > context where it's no longer obvious what lead to the allocation failing).
> >
> > Michal Hocko:
> > Linus tends to be against adding new BUG() calls unless the failure is
> > absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> > not sure how he would look at simply incorrect memory allocator usage to
> > blow up the kernel. Now the argument could be made that those failures
> > could cause subtle memory corruptions or even be exploitable which might
> > be a sufficient reason to stop them early.
> >
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Kees Cook <kees@kernel.org>
> > Cc: "Eugenio Pérez" <eperezma@redhat.com>
> > Cc: Hailong.Liu <hailong.liu@oppo.com>
> > Cc: Jason Wang <jasowang@redhat.com>
> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > ---
> >   include/linux/slab.h | 4 +++-
> >   mm/page_alloc.c      | 4 +++-
> >   mm/util.c            | 1 +
> >   3 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index c9cb42203183..4a4d1fdc2afe 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >   {
> >       size_t bytes;
> >
> > -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> > +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> > +             BUG_ON(flags & __GFP_NOFAIL);
> >               return NULL;
> > +     }
> >
> >       return kvmalloc_node_noprof(bytes, flags, node);
> >   }
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 60742d057b05..d2c37f8f8d09 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> >        * There are several places where we assume that the order value is sane
> >        * so bail out early if the request is out of bound.
> >        */
> > -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> > +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> > +             BUG_ON(gfp & __GFP_NOFAIL);
> >               return NULL;
> > +     }
> >
> >       gfp &= gfp_allowed_mask;
> >       /*
> > diff --git a/mm/util.c b/mm/util.c
> > index ac01925a4179..678c647b778f 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> >
> >       /* Don't even allow crazy sizes */
> >       if (unlikely(size > INT_MAX)) {
> > +             BUG_ON(flags & __GFP_NOFAIL);
>
> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.

Hi David,
WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Aug. 19, 2024, 9:55 a.m. UTC | #3
On 19.08.24 11:47, Barry Song wrote:
> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 17.08.24 08:24, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
>>> they don't check the return, we are exposed to the security risks for NULL
>>> deference.
>>>
>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
>>> situation.
>>>
>>> Christoph Hellwig:
>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
>>> allocation failures.  So in fact a straight BUG is the right thing
>>> here.
>>>
>>> Vlastimil Babka:
>>> It's just not a recoverable situation (WARN_ON is for recoverable
>>> situations). The caller cannot handle allocation failure and at the same
>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
>>> with stracktrace etc. We don't need to hope for the later NULL pointer
>>> dereference (which might if really unlucky happen from a different
>>> context where it's no longer obvious what lead to the allocation failing).
>>>
>>> Michal Hocko:
>>> Linus tends to be against adding new BUG() calls unless the failure is
>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
>>> not sure how he would look at simply incorrect memory allocator usage to
>>> blow up the kernel. Now the argument could be made that those failures
>>> could cause subtle memory corruptions or even be exploitable which might
>>> be a sufficient reason to stop them early.
>>>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Christoph Lameter <cl@linux.com>
>>> Cc: Pekka Enberg <penberg@kernel.org>
>>> Cc: David Rientjes <rientjes@google.com>
>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>>> Cc: Kees Cook <kees@kernel.org>
>>> Cc: "Eugenio Pérez" <eperezma@redhat.com>
>>> Cc: Hailong.Liu <hailong.liu@oppo.com>
>>> Cc: Jason Wang <jasowang@redhat.com>
>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>> ---
>>>    include/linux/slab.h | 4 +++-
>>>    mm/page_alloc.c      | 4 +++-
>>>    mm/util.c            | 1 +
>>>    3 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>> index c9cb42203183..4a4d1fdc2afe 100644
>>> --- a/include/linux/slab.h
>>> +++ b/include/linux/slab.h
>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>>>    {
>>>        size_t bytes;
>>>
>>> -     if (unlikely(check_mul_overflow(n, size, &bytes)))
>>> +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>>                return NULL;
>>> +     }
>>>
>>>        return kvmalloc_node_noprof(bytes, flags, node);
>>>    }
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 60742d057b05..d2c37f8f8d09 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>>>         * There are several places where we assume that the order value is sane
>>>         * so bail out early if the request is out of bound.
>>>         */
>>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
>>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
>>> +             BUG_ON(gfp & __GFP_NOFAIL);
>>>                return NULL;
>>> +     }
>>>
>>>        gfp &= gfp_allowed_mask;
>>>        /*
>>> diff --git a/mm/util.c b/mm/util.c
>>> index ac01925a4179..678c647b778f 100644
>>> --- a/mm/util.c
>>> +++ b/mm/util.c
>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>>>
>>>        /* Don't even allow crazy sizes */
>>>        if (unlikely(size > INT_MAX)) {
>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>
>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
> 
> Hi David,
> WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.

Just return NULL? "shit in shit out" :) ?
Barry Song Aug. 19, 2024, 10:02 a.m. UTC | #4
On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.08.24 11:47, Barry Song wrote:
> > On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 17.08.24 08:24, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
> >>> they don't check the return, we are exposed to the security risks for NULL
> >>> deference.
> >>>
> >>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> >>> situation.
> >>>
> >>> Christoph Hellwig:
> >>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> >>> allocation failures.  So in fact a straight BUG is the right thing
> >>> here.
> >>>
> >>> Vlastimil Babka:
> >>> It's just not a recoverable situation (WARN_ON is for recoverable
> >>> situations). The caller cannot handle allocation failure and at the same
> >>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> >>> with stracktrace etc. We don't need to hope for the later NULL pointer
> >>> dereference (which might if really unlucky happen from a different
> >>> context where it's no longer obvious what lead to the allocation failing).
> >>>
> >>> Michal Hocko:
> >>> Linus tends to be against adding new BUG() calls unless the failure is
> >>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> >>> not sure how he would look at simply incorrect memory allocator usage to
> >>> blow up the kernel. Now the argument could be made that those failures
> >>> could cause subtle memory corruptions or even be exploitable which might
> >>> be a sufficient reason to stop them early.
> >>>
> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >>> Acked-by: Michal Hocko <mhocko@suse.com>
> >>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>> Cc: Christoph Lameter <cl@linux.com>
> >>> Cc: Pekka Enberg <penberg@kernel.org>
> >>> Cc: David Rientjes <rientjes@google.com>
> >>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> >>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >>> Cc: Kees Cook <kees@kernel.org>
> >>> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> >>> Cc: Hailong.Liu <hailong.liu@oppo.com>
> >>> Cc: Jason Wang <jasowang@redhat.com>
> >>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> >>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> >>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>> ---
> >>>    include/linux/slab.h | 4 +++-
> >>>    mm/page_alloc.c      | 4 +++-
> >>>    mm/util.c            | 1 +
> >>>    3 files changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/include/linux/slab.h b/include/linux/slab.h
> >>> index c9cb42203183..4a4d1fdc2afe 100644
> >>> --- a/include/linux/slab.h
> >>> +++ b/include/linux/slab.h
> >>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >>>    {
> >>>        size_t bytes;
> >>>
> >>> -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> >>> +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> >>> +             BUG_ON(flags & __GFP_NOFAIL);
> >>>                return NULL;
> >>> +     }
> >>>
> >>>        return kvmalloc_node_noprof(bytes, flags, node);
> >>>    }
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 60742d057b05..d2c37f8f8d09 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> >>>         * There are several places where we assume that the order value is sane
> >>>         * so bail out early if the request is out of bound.
> >>>         */
> >>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> >>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> >>> +             BUG_ON(gfp & __GFP_NOFAIL);
> >>>                return NULL;
> >>> +     }
> >>>
> >>>        gfp &= gfp_allowed_mask;
> >>>        /*
> >>> diff --git a/mm/util.c b/mm/util.c
> >>> index ac01925a4179..678c647b778f 100644
> >>> --- a/mm/util.c
> >>> +++ b/mm/util.c
> >>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> >>>
> >>>        /* Don't even allow crazy sizes */
> >>>        if (unlikely(size > INT_MAX)) {
> >>> +             BUG_ON(flags & __GFP_NOFAIL);
> >>
> >> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
> >
> > Hi David,
> > WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.
>
> Just return NULL? "shit in shit out" :) ?

Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL,
as it's the caller's responsibility to check the return value. However, with
__GFP_NOFAIL, users will directly dereference *(p + offset) even when
p == NULL. It is how __GFP_NOFAIL is supposed to work.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Aug. 19, 2024, 12:33 p.m. UTC | #5
On 19.08.24 12:02, Barry Song wrote:
> On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.08.24 11:47, Barry Song wrote:
>>> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 17.08.24 08:24, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
>>>>> they don't check the return, we are exposed to the security risks for NULL
>>>>> deference.
>>>>>
>>>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
>>>>> situation.
>>>>>
>>>>> Christoph Hellwig:
>>>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
>>>>> allocation failures.  So in fact a straight BUG is the right thing
>>>>> here.
>>>>>
>>>>> Vlastimil Babka:
>>>>> It's just not a recoverable situation (WARN_ON is for recoverable
>>>>> situations). The caller cannot handle allocation failure and at the same
>>>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
>>>>> with stracktrace etc. We don't need to hope for the later NULL pointer
>>>>> dereference (which might if really unlucky happen from a different
>>>>> context where it's no longer obvious what lead to the allocation failing).
>>>>>
>>>>> Michal Hocko:
>>>>> Linus tends to be against adding new BUG() calls unless the failure is
>>>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
>>>>> not sure how he would look at simply incorrect memory allocator usage to
>>>>> blow up the kernel. Now the argument could be made that those failures
>>>>> could cause subtle memory corruptions or even be exploitable which might
>>>>> be a sufficient reason to stop them early.
>>>>>
>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>> Cc: Christoph Lameter <cl@linux.com>
>>>>> Cc: Pekka Enberg <penberg@kernel.org>
>>>>> Cc: David Rientjes <rientjes@google.com>
>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
>>>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>>>>> Cc: Kees Cook <kees@kernel.org>
>>>>> Cc: "Eugenio Pérez" <eperezma@redhat.com>
>>>>> Cc: Hailong.Liu <hailong.liu@oppo.com>
>>>>> Cc: Jason Wang <jasowang@redhat.com>
>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>> ---
>>>>>     include/linux/slab.h | 4 +++-
>>>>>     mm/page_alloc.c      | 4 +++-
>>>>>     mm/util.c            | 1 +
>>>>>     3 files changed, 7 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>>>> index c9cb42203183..4a4d1fdc2afe 100644
>>>>> --- a/include/linux/slab.h
>>>>> +++ b/include/linux/slab.h
>>>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>>>>>     {
>>>>>         size_t bytes;
>>>>>
>>>>> -     if (unlikely(check_mul_overflow(n, size, &bytes)))
>>>>> +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
>>>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>>>>                 return NULL;
>>>>> +     }
>>>>>
>>>>>         return kvmalloc_node_noprof(bytes, flags, node);
>>>>>     }
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 60742d057b05..d2c37f8f8d09 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>>>>>          * There are several places where we assume that the order value is sane
>>>>>          * so bail out early if the request is out of bound.
>>>>>          */
>>>>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
>>>>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
>>>>> +             BUG_ON(gfp & __GFP_NOFAIL);
>>>>>                 return NULL;
>>>>> +     }
>>>>>
>>>>>         gfp &= gfp_allowed_mask;
>>>>>         /*
>>>>> diff --git a/mm/util.c b/mm/util.c
>>>>> index ac01925a4179..678c647b778f 100644
>>>>> --- a/mm/util.c
>>>>> +++ b/mm/util.c
>>>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>>>>>
>>>>>         /* Don't even allow crazy sizes */
>>>>>         if (unlikely(size > INT_MAX)) {
>>>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>>>
>>>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
>>>
>>> Hi David,
>>> WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.
>>
>> Just return NULL? "shit in shit out" :) ?
> 
> Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL,
> as it's the caller's responsibility to check the return value. However, with
> __GFP_NOFAIL, users will directly dereference *(p + offset) even when
> p == NULL. It is how __GFP_NOFAIL is supposed to work.

If the caller is not supposed to pass that flag combination (shit in), 
we are not obligated to give a reasonable result (shit out).

My point is that we should let the caller (possibly?) crash -- the one 
that did something that is wrong -- instead of forcing a crash using 
BUG_ON in this code here.

It should all be caught during testing either way. And if some OOT 
module does something nasty, that's not our responsibility.

BUG_ON is not a way to write assertions into the code.
Barry Song Aug. 19, 2024, 12:48 p.m. UTC | #6
On Tue, Aug 20, 2024 at 12:33 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.08.24 12:02, Barry Song wrote:
> > On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 19.08.24 11:47, Barry Song wrote:
> >>> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 17.08.24 08:24, Barry Song wrote:
> >>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>
> >>>>> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
> >>>>> they don't check the return, we are exposed to the security risks for NULL
> >>>>> deference.
> >>>>>
> >>>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> >>>>> situation.
> >>>>>
> >>>>> Christoph Hellwig:
> >>>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> >>>>> allocation failures.  So in fact a straight BUG is the right thing
> >>>>> here.
> >>>>>
> >>>>> Vlastimil Babka:
> >>>>> It's just not a recoverable situation (WARN_ON is for recoverable
> >>>>> situations). The caller cannot handle allocation failure and at the same
> >>>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> >>>>> with stracktrace etc. We don't need to hope for the later NULL pointer
> >>>>> dereference (which might if really unlucky happen from a different
> >>>>> context where it's no longer obvious what lead to the allocation failing).
> >>>>>
> >>>>> Michal Hocko:
> >>>>> Linus tends to be against adding new BUG() calls unless the failure is
> >>>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> >>>>> not sure how he would look at simply incorrect memory allocator usage to
> >>>>> blow up the kernel. Now the argument could be made that those failures
> >>>>> could cause subtle memory corruptions or even be exploitable which might
> >>>>> be a sufficient reason to stop them early.
> >>>>>
> >>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
> >>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >>>>> Acked-by: Michal Hocko <mhocko@suse.com>
> >>>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> >>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>>>> Cc: Christoph Lameter <cl@linux.com>
> >>>>> Cc: Pekka Enberg <penberg@kernel.org>
> >>>>> Cc: David Rientjes <rientjes@google.com>
> >>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> >>>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> >>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >>>>> Cc: Kees Cook <kees@kernel.org>
> >>>>> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> >>>>> Cc: Hailong.Liu <hailong.liu@oppo.com>
> >>>>> Cc: Jason Wang <jasowang@redhat.com>
> >>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
> >>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> >>>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> >>>>> ---
> >>>>>     include/linux/slab.h | 4 +++-
> >>>>>     mm/page_alloc.c      | 4 +++-
> >>>>>     mm/util.c            | 1 +
> >>>>>     3 files changed, 7 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
> >>>>> index c9cb42203183..4a4d1fdc2afe 100644
> >>>>> --- a/include/linux/slab.h
> >>>>> +++ b/include/linux/slab.h
> >>>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >>>>>     {
> >>>>>         size_t bytes;
> >>>>>
> >>>>> -     if (unlikely(check_mul_overflow(n, size, &bytes)))
> >>>>> +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
> >>>>> +             BUG_ON(flags & __GFP_NOFAIL);
> >>>>>                 return NULL;
> >>>>> +     }
> >>>>>
> >>>>>         return kvmalloc_node_noprof(bytes, flags, node);
> >>>>>     }
> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>> index 60742d057b05..d2c37f8f8d09 100644
> >>>>> --- a/mm/page_alloc.c
> >>>>> +++ b/mm/page_alloc.c
> >>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> >>>>>          * There are several places where we assume that the order value is sane
> >>>>>          * so bail out early if the request is out of bound.
> >>>>>          */
> >>>>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> >>>>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> >>>>> +             BUG_ON(gfp & __GFP_NOFAIL);
> >>>>>                 return NULL;
> >>>>> +     }
> >>>>>
> >>>>>         gfp &= gfp_allowed_mask;
> >>>>>         /*
> >>>>> diff --git a/mm/util.c b/mm/util.c
> >>>>> index ac01925a4179..678c647b778f 100644
> >>>>> --- a/mm/util.c
> >>>>> +++ b/mm/util.c
> >>>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> >>>>>
> >>>>>         /* Don't even allow crazy sizes */
> >>>>>         if (unlikely(size > INT_MAX)) {
> >>>>> +             BUG_ON(flags & __GFP_NOFAIL);
> >>>>
> >>>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
> >>>
> >>> Hi David,
> >>> WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.
> >>
> >> Just return NULL? "shit in shit out" :) ?
> >
> > Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL,
> > as it's the caller's responsibility to check the return value. However, with
> > __GFP_NOFAIL, users will directly dereference *(p + offset) even when
> > p == NULL. It is how __GFP_NOFAIL is supposed to work.
>
> If the caller is not supposed to pass that flag combination (shit in),
> we are not obligated to give a reasonable result (shit out).
>
> My point is that we should let the caller (possibly?) crash -- the one
> that did something that is wrong -- instead of forcing a crash using
> BUG_ON in this code here.
>
> It should all be caught during testing either way. And if some OOT
> module does something nasty, that's not our responsibility.
>
> BUG_ON is not a way to write assertions into the code.

It seems there was a misunderstanding regarding the purpose of
this change. we actually have many details in changelog.

Its aim is not to write an assertion, but rather to prevent exposing
a security vulnerability.

Returning NULL doesn't necessarily crash the caller's process, p->field,
*(p + offset) deference could be used by hackers to exploit the system.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
Christoph Hellwig Aug. 19, 2024, 12:49 p.m. UTC | #7
On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
> It should all be caught during testing either way. And if some OOT module 
> does something nasty, that's not our responsibility.
>
> BUG_ON is not a way to write assertions into the code.

So you'd rather create exploits than crashing on a fundamental API
violation?  That's exactly what the series is trying to fix.
David Hildenbrand Aug. 19, 2024, 12:49 p.m. UTC | #8
On 19.08.24 14:48, Barry Song wrote:
> On Tue, Aug 20, 2024 at 12:33 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.08.24 12:02, Barry Song wrote:
>>> On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 19.08.24 11:47, Barry Song wrote:
>>>>> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 17.08.24 08:24, Barry Song wrote:
>>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>>
>>>>>>> We have cases we still fail though callers might have __GFP_NOFAIL.  Since
>>>>>>> they don't check the return, we are exposed to the security risks for NULL
>>>>>>> deference.
>>>>>>>
>>>>>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
>>>>>>> situation.
>>>>>>>
>>>>>>> Christoph Hellwig:
>>>>>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle
>>>>>>> allocation failures.  So in fact a straight BUG is the right thing
>>>>>>> here.
>>>>>>>
>>>>>>> Vlastimil Babka:
>>>>>>> It's just not a recoverable situation (WARN_ON is for recoverable
>>>>>>> situations). The caller cannot handle allocation failure and at the same
>>>>>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
>>>>>>> with stracktrace etc. We don't need to hope for the later NULL pointer
>>>>>>> dereference (which might if really unlucky happen from a different
>>>>>>> context where it's no longer obvious what lead to the allocation failing).
>>>>>>>
>>>>>>> Michal Hocko:
>>>>>>> Linus tends to be against adding new BUG() calls unless the failure is
>>>>>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
>>>>>>> not sure how he would look at simply incorrect memory allocator usage to
>>>>>>> blow up the kernel. Now the argument could be made that those failures
>>>>>>> could cause subtle memory corruptions or even be exploitable which might
>>>>>>> be a sufficient reason to stop them early.
>>>>>>>
>>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>>>>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>>>>>>> Acked-by: Michal Hocko <mhocko@suse.com>
>>>>>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
>>>>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>>>> Cc: Christoph Lameter <cl@linux.com>
>>>>>>> Cc: Pekka Enberg <penberg@kernel.org>
>>>>>>> Cc: David Rientjes <rientjes@google.com>
>>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>>>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev>
>>>>>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
>>>>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>>>>>>> Cc: Kees Cook <kees@kernel.org>
>>>>>>> Cc: "Eugenio Pérez" <eperezma@redhat.com>
>>>>>>> Cc: Hailong.Liu <hailong.liu@oppo.com>
>>>>>>> Cc: Jason Wang <jasowang@redhat.com>
>>>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
>>>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>>>>>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
>>>>>>> ---
>>>>>>>      include/linux/slab.h | 4 +++-
>>>>>>>      mm/page_alloc.c      | 4 +++-
>>>>>>>      mm/util.c            | 1 +
>>>>>>>      3 files changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h
>>>>>>> index c9cb42203183..4a4d1fdc2afe 100644
>>>>>>> --- a/include/linux/slab.h
>>>>>>> +++ b/include/linux/slab.h
>>>>>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>>>>>>>      {
>>>>>>>          size_t bytes;
>>>>>>>
>>>>>>> -     if (unlikely(check_mul_overflow(n, size, &bytes)))
>>>>>>> +     if (unlikely(check_mul_overflow(n, size, &bytes))) {
>>>>>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>>>>>>                  return NULL;
>>>>>>> +     }
>>>>>>>
>>>>>>>          return kvmalloc_node_noprof(bytes, flags, node);
>>>>>>>      }
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index 60742d057b05..d2c37f8f8d09 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>>>>>>>           * There are several places where we assume that the order value is sane
>>>>>>>           * so bail out early if the request is out of bound.
>>>>>>>           */
>>>>>>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
>>>>>>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
>>>>>>> +             BUG_ON(gfp & __GFP_NOFAIL);
>>>>>>>                  return NULL;
>>>>>>> +     }
>>>>>>>
>>>>>>>          gfp &= gfp_allowed_mask;
>>>>>>>          /*
>>>>>>> diff --git a/mm/util.c b/mm/util.c
>>>>>>> index ac01925a4179..678c647b778f 100644
>>>>>>> --- a/mm/util.c
>>>>>>> +++ b/mm/util.c
>>>>>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>>>>>>>
>>>>>>>          /* Don't even allow crazy sizes */
>>>>>>>          if (unlikely(size > INT_MAX)) {
>>>>>>> +             BUG_ON(flags & __GFP_NOFAIL);
>>>>>>
>>>>>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
>>>>>
>>>>> Hi David,
>>>>> WARN_ON_ONCE()  might be fine but I don't see how it is possible to recover.
>>>>
>>>> Just return NULL? "shit in shit out" :) ?
>>>
>>> Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL,
>>> as it's the caller's responsibility to check the return value. However, with
>>> __GFP_NOFAIL, users will directly dereference *(p + offset) even when
>>> p == NULL. It is how __GFP_NOFAIL is supposed to work.
>>
>> If the caller is not supposed to pass that flag combination (shit in),
>> we are not obligated to give a reasonable result (shit out).
>>
>> My point is that we should let the caller (possibly?) crash -- the one
>> that did something that is wrong -- instead of forcing a crash using
>> BUG_ON in this code here.
>>
>> It should all be caught during testing either way. And if some OOT
>> module does something nasty, that's not our responsibility.
>>
>> BUG_ON is not a way to write assertions into the code.
> 
> It seems there was a misunderstanding regarding the purpose of
> this change. we actually have many details in changelog.
> 
> Its aim is not to write an assertion, but rather to prevent exposing
> a security vulnerability.
> 
> Returning NULL doesn't necessarily crash the caller's process, p->field,
> *(p + offset) deference could be used by hackers to exploit the system.

See my other reply to Michal: why do we even allow to specify them 
separately and not simply let one enforce the other?

That might result in an issue elsewhere, but likely no security 
vulnerability?

I really hate each and every BUG_ON I have to stare at.
David Hildenbrand Aug. 19, 2024, 12:51 p.m. UTC | #9
On 19.08.24 14:49, Christoph Hellwig wrote:
> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
>> It should all be caught during testing either way. And if some OOT module
>> does something nasty, that's not our responsibility.
>>
>> BUG_ON is not a way to write assertions into the code.
> 
> So you'd rather create exploits than crashing on a fundamental API
> violation?  That's exactly what the series is trying to fix.

I'd rather have a sane API that doesn't even allow this level of 
flexibility with NOFAIL.

But probably I'm missing more details here why this all has to be so 
complicated ;)
Christoph Hellwig Aug. 19, 2024, 12:53 p.m. UTC | #10
On Mon, Aug 19, 2024 at 02:51:06PM +0200, David Hildenbrand wrote:
> On 19.08.24 14:49, Christoph Hellwig wrote:
>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
>>> It should all be caught during testing either way. And if some OOT module
>>> does something nasty, that's not our responsibility.
>>>
>>> BUG_ON is not a way to write assertions into the code.
>>
>> So you'd rather create exploits than crashing on a fundamental API
>> violation?  That's exactly what the series is trying to fix.
>
> I'd rather have a sane API that doesn't even allow this level of 
> flexibility with NOFAIL.
>
> But probably I'm missing more details here why this all has to be so 
> complicated ;)

Well, the only way to do that is to remove (__)GFP_NOFAIL and require
either explicit _nofail variants without a way to pass gfp flags, or
endless loops in the callers.  I suggested the first one earlier, but
no one liked it due to the API complexity and overhead.  And I've not
heard anyone arguing for the latter yet.

One other way might be a compile time check that requires any GPF
flag that contains (__)GFP_NOFAIL to be a compile time constants.
This is true for many but not all callers currently.
Barry Song Aug. 19, 2024, 1:05 p.m. UTC | #11
On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.08.24 14:49, Christoph Hellwig wrote:
> > On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
> >> It should all be caught during testing either way. And if some OOT module
> >> does something nasty, that's not our responsibility.
> >>
> >> BUG_ON is not a way to write assertions into the code.
> >
> > So you'd rather create exploits than crashing on a fundamental API
> > violation?  That's exactly what the series is trying to fix.
>
> I'd rather have a sane API that doesn't even allow this level of
> flexibility with NOFAIL.

yes, i have already sent a RFC enforcing direct_reclamation:
https://www.spinics.net/lists/linux-mm/msg394659.html

somehow, it is not ready yet. i think Christoph prefers scope
api rather than GFP_NOFAIL which definitely has
__GFP_DIRECT_RECLAIM set. I guess you know I have
at least  5 series running, so it will happen soon though.

>
> But probably I'm missing more details here why this all has to be so
> complicated ;)

enforcing direct_reclamation is right and will work for a reasonable size.
but for this overflow size, even if we enforce direct_reclamation
in GFP_NOFAIL, we are still failing.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Aug. 19, 2024, 1:10 p.m. UTC | #12
On 19.08.24 15:05, Barry Song wrote:
> On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.08.24 14:49, Christoph Hellwig wrote:
>>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
>>>> It should all be caught during testing either way. And if some OOT module
>>>> does something nasty, that's not our responsibility.
>>>>
>>>> BUG_ON is not a way to write assertions into the code.
>>>
>>> So you'd rather create exploits than crashing on a fundamental API
>>> violation?  That's exactly what the series is trying to fix.
>>
>> I'd rather have a sane API that doesn't even allow this level of
>> flexibility with NOFAIL.
> 
> yes, i have already sent a RFC enforcing direct_reclamation:
> https://www.spinics.net/lists/linux-mm/msg394659.html
> 
> somehow, it is not ready yet. i think Christoph prefers scope
> api rather than GFP_NOFAIL which definitely has
> __GFP_DIRECT_RECLAIM set. I guess you know I have
> at least  5 series running, so it will happen soon though.

That really sounds like the right thing to do, at least with the "direct 
reclaim" problem ...

> 
>>
>> But probably I'm missing more details here why this all has to be so
>> complicated ;)
> 
> enforcing direct_reclamation is right and will work for a reasonable size.
> but for this overflow size, even if we enforce direct_reclamation
> in GFP_NOFAIL, we are still failing.

Right, someone requested something completely impossible. It's harder to 
do something here that doesn't return NULL. Except WARN_ON_ONCE() and 
loop for all infinity.
David Hildenbrand Aug. 19, 2024, 1:14 p.m. UTC | #13
On 19.08.24 14:53, Christoph Hellwig wrote:
> On Mon, Aug 19, 2024 at 02:51:06PM +0200, David Hildenbrand wrote:
>> On 19.08.24 14:49, Christoph Hellwig wrote:
>>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
>>>> It should all be caught during testing either way. And if some OOT module
>>>> does something nasty, that's not our responsibility.
>>>>
>>>> BUG_ON is not a way to write assertions into the code.
>>>
>>> So you'd rather create exploits than crashing on a fundamental API
>>> violation?  That's exactly what the series is trying to fix.
>>
>> I'd rather have a sane API that doesn't even allow this level of
>> flexibility with NOFAIL.
>>
>> But probably I'm missing more details here why this all has to be so
>> complicated ;)
> 
> Well, the only way to do that is to remove (__)GFP_NOFAIL and require
> either explicit _nofail variants without a way to pass gfp flags, or
> endless loops in the callers.  I suggested the first one earlier, but
> no one liked it due to the API complexity and overhead.  And I've not
> heard anyone arguing for the latter yet.

Right, and and "allocate more than even possible" case is extremely ugly.

> 
> One other way might be a compile time check that requires any GPF
> flag that contains (__)GFP_NOFAIL to be a compile time constants.
> This is true for many but not all callers currently.

Right.
Barry Song Aug. 19, 2024, 1:19 p.m. UTC | #14
On Tue, Aug 20, 2024 at 1:10 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.08.24 15:05, Barry Song wrote:
> > On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 19.08.24 14:49, Christoph Hellwig wrote:
> >>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
> >>>> It should all be caught during testing either way. And if some OOT module
> >>>> does something nasty, that's not our responsibility.
> >>>>
> >>>> BUG_ON is not a way to write assertions into the code.
> >>>
> >>> So you'd rather create exploits than crashing on a fundamental API
> >>> violation?  That's exactly what the series is trying to fix.
> >>
> >> I'd rather have a sane API that doesn't even allow this level of
> >> flexibility with NOFAIL.
> >
> > yes, i have already sent a RFC enforcing direct_reclamation:
> > https://www.spinics.net/lists/linux-mm/msg394659.html
> >
> > somehow, it is not ready yet. i think Christoph prefers scope
> > api rather than GFP_NOFAIL which definitely has
> > __GFP_DIRECT_RECLAIM set. I guess you know I have
> > at least  5 series running, so it will happen soon though.
>
> That really sounds like the right thing to do, at least with the "direct
> reclaim" problem ...
>
> >
> >>
> >> But probably I'm missing more details here why this all has to be so
> >> complicated ;)
> >
> > enforcing direct_reclamation is right and will work for a reasonable size.
> > but for this overflow size, even if we enforce direct_reclamation
> > in GFP_NOFAIL, we are still failing.
>
> Right, someone requested something completely impossible. It's harder to
> do something here that doesn't return NULL. Except WARN_ON_ONCE() and
> loop for all infinity.

Returning NULL can introduce security vulnerabilities. While I’m not a hacker,
it’s hard to predict how they might exploit this. If we want to avoid using
BUG_ON, an alternative approach could be as follows:

while(gfp & __GFP_NOFAIL) some_cpu_relaxed_job;    ?

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Aug. 19, 2024, 1:22 p.m. UTC | #15
On 19.08.24 15:19, Barry Song wrote:
> On Tue, Aug 20, 2024 at 1:10 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.08.24 15:05, Barry Song wrote:
>>> On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 19.08.24 14:49, Christoph Hellwig wrote:
>>>>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote:
>>>>>> It should all be caught during testing either way. And if some OOT module
>>>>>> does something nasty, that's not our responsibility.
>>>>>>
>>>>>> BUG_ON is not a way to write assertions into the code.
>>>>>
>>>>> So you'd rather create exploits than crashing on a fundamental API
>>>>> violation?  That's exactly what the series is trying to fix.
>>>>
>>>> I'd rather have a sane API that doesn't even allow this level of
>>>> flexibility with NOFAIL.
>>>
>>> yes, i have already sent a RFC enforcing direct_reclamation:
>>> https://www.spinics.net/lists/linux-mm/msg394659.html
>>>
>>> somehow, it is not ready yet. i think Christoph prefers scope
>>> api rather than GFP_NOFAIL which definitely has
>>> __GFP_DIRECT_RECLAIM set. I guess you know I have
>>> at least  5 series running, so it will happen soon though.
>>
>> That really sounds like the right thing to do, at least with the "direct
>> reclaim" problem ...
>>
>>>
>>>>
>>>> But probably I'm missing more details here why this all has to be so
>>>> complicated ;)
>>>
>>> enforcing direct_reclamation is right and will work for a reasonable size.
>>> but for this overflow size, even if we enforce direct_reclamation
>>> in GFP_NOFAIL, we are still failing.
>>
>> Right, someone requested something completely impossible. It's harder to
>> do something here that doesn't return NULL. Except WARN_ON_ONCE() and
>> loop for all infinity.
> 
> Returning NULL can introduce security vulnerabilities. While I’m not a hacker,
> it’s hard to predict how they might exploit this. If we want to avoid using
> BUG_ON, an alternative approach could be as follows:
> 
> while(gfp & __GFP_NOFAIL) some_cpu_relaxed_job;    ?

At least for the "allocate something impossible in size", retrying 
forever makes *some* sense (with a warning, of course). We would retry 
until it works for sizes that are possible.

For the other case, we should just make __GFP_NOFAIL imply whatever is 
currently a "must". Returning NULL here is indeed strange.
Michal Hocko Aug. 19, 2024, 5:12 p.m. UTC | #16
On Mon 19-08-24 14:49:55, David Hildenbrand wrote:
> On 19.08.24 14:48, Barry Song wrote:
[...]
> > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > > > index 60742d057b05..d2c37f8f8d09 100644
> > > > > > > > --- a/mm/page_alloc.c
> > > > > > > > +++ b/mm/page_alloc.c
> > > > > > > > @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
> > > > > > > >           * There are several places where we assume that the order value is sane
> > > > > > > >           * so bail out early if the request is out of bound.
> > > > > > > >           */
> > > > > > > > -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
> > > > > > > > +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
> > > > > > > > +             BUG_ON(gfp & __GFP_NOFAIL);
> > > > > > > >                  return NULL;
> > > > > > > > +     }
[...]
> > Returning NULL doesn't necessarily crash the caller's process, p->field,
> > *(p + offset) deference could be used by hackers to exploit the system.
> 
> See my other reply to Michal: why do we even allow to specify them
> separately and not simply let one enforce the other?

Are you replying to this patch? This is not about a combination of
flags. This is about the above (and other similar) boundary checks which
return NULL if the size is deemed incorrect. I think those are potential
problems because it could be a lack of input check which could be turned
into a potentially malicious code. Because unchecked (return value
because NOFAIL never fails, right?) return value might even not OOPs and
become a silent read/write into memory.

Whether to BUG_ON or simply loop for ever in the allocator if somebody
requests non-sleeping NOFAIL allocation is a different story.
Linus Torvalds Aug. 19, 2024, 5:17 p.m. UTC | #17
On Mon, 19 Aug 2024 at 10:12, Michal Hocko <mhocko@suse.com> wrote:
>
> Whether to BUG_ON or simply loop for ever in the allocator if somebody
> requests non-sleeping NOFAIL allocation is a different story.

Just return NULL.

The bug isn't in the VM. It's in the caller. Don't take on other
peoples problems.

It was never valid to say "I want to allocate lots of memory and you
can't fail".

Don't validate that kind of bogus behavior, just tell them "you're
bad" and return NULL.

            Linus
David Hildenbrand Aug. 19, 2024, 8:24 p.m. UTC | #18
On 19.08.24 19:12, Michal Hocko wrote:
> On Mon 19-08-24 14:49:55, David Hildenbrand wrote:
>> On 19.08.24 14:48, Barry Song wrote:
> [...]
>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>> index 60742d057b05..d2c37f8f8d09 100644
>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
>>>>>>>>>            * There are several places where we assume that the order value is sane
>>>>>>>>>            * so bail out early if the request is out of bound.
>>>>>>>>>            */
>>>>>>>>> -     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
>>>>>>>>> +     if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
>>>>>>>>> +             BUG_ON(gfp & __GFP_NOFAIL);
>>>>>>>>>                   return NULL;
>>>>>>>>> +     }
> [...]
>>> Returning NULL doesn't necessarily crash the caller's process, p->field,
>>> *(p + offset) deference could be used by hackers to exploit the system.
>>
>> See my other reply to Michal: why do we even allow to specify them
>> separately and not simply let one enforce the other?
> 
> Are you replying to this patch? This is not about a combination of
> flags. This is about the above (and other similar) boundary checks which
> return NULL if the size is deemed incorrect. 
> I think those are potential
> problems because it could be a lack of input check which could be turned
> into a potentially malicious code. Because unchecked (return value
> because NOFAIL never fails, right?) return value might even not OOPs and
> become a silent read/write into memory.

Right, that's a different kind of issue than the simple "don't pass 
stupid flag combinations" thing where "we'll fix that up for you" is 
more reasonable.

Possibly NOFAIL allocations with an allocation sizes that is not a small 
compile-time constant already has a bad smell to it, but I'm sure there 
are reasonable exceptions ...

> 
> Whether to BUG_ON or simply loop for ever in the allocator if somebody
> requests non-sleeping NOFAIL allocation is a different story.

Right, "warn + loop forever" is one alternative where you could at least 
keep the system alive to some degree. Satisfying a large allocation 
might take a long time, satisfying a "too large" allocation would take 
forever.

But as Linus says, it's all workarounds for other buggy code, to make 
buggy code less exploitable, maybe, ...
Linus Torvalds Aug. 19, 2024, 8:35 p.m. UTC | #19
On Mon, 19 Aug 2024 at 13:24, David Hildenbrand <david@redhat.com> wrote:
>
> Right, "warn + loop forever" is one alternative where you could at least
> keep the system alive to some degree.

Maybe. Or it might just lock up the machine.

For small allocations looping forever is probably fine, because in
practice there's always *something* that can be thrown out.

But anything larger than order-3 (handwavy, but that was our
historical limit, I think, and we call it PAGE_ALLOC_COSTLY_ORDER) has
to fail at _some_ point, and the caller setting GFP_NOFAIL is just
fantasy and "Daddy, I want a pony", and should be ignored.

With a WARN_ON_ONCE(), by all means, so that people can see who the
fantasist is.

            Linus
David Hildenbrand Aug. 19, 2024, 9:57 p.m. UTC | #20
On 19.08.24 22:35, Linus Torvalds wrote:
> On Mon, 19 Aug 2024 at 13:24, David Hildenbrand <david@redhat.com> wrote:
>>
>> Right, "warn + loop forever" is one alternative where you could at least
>> keep the system alive to some degree.
> 
> Maybe. Or it might just lock up the machine.

Yes, regarding the security concerns it might be a bit better that way. 
(no security expert, so I cannot judge ...)

So, instead of saying "No, you can't have a pony, stupid", you would say 
"well, let me talk to your mother about that (WARN)" ... and just never 
reply to the real question ... :)
Linus Torvalds Aug. 19, 2024, 10:13 p.m. UTC | #21
On Mon, 19 Aug 2024 at 14:57, David Hildenbrand <david@redhat.com> wrote:
>
> On 19.08.24 22:35, Linus Torvalds wrote:
> >
> > Maybe. Or it might just lock up the machine.
>
> Yes, regarding the security concerns it might be a bit better that way.
> (no security expert, so I cannot judge ...)

The thing is, in a static universe where time does not flow, that might be true.

But in such a universe, we wouldn't actually be alive, much less do
kernel development.

In *REALITY*, security is not a single point in time. Real security is
about the *future* more than it is about anything else.

From a user perspective, security is very much about things like "keep
your system up-to-date", because we _know_ old systems have (known)
bugs.

But the other side of that is that from a developer perspective, the
#1 thing in security is about fixing bugs.

Yes, you want to be conservative so that bugs you haven't fixed yet
don't cause problems, but that "be conservative" absolutely *has* to
be priority #2.

Because no amount of conservatism is going to help you unless you
actively fix the bugs.

And the thing is, a locked-up machine is really really hard to debug.
I can tell you from experience just what the bug reports will look
like, but I think you can guess.

So locking up is one of the worst things you can do. You are pretty
much always much better off making forward progress and try to report
the issue.

Including for security reasons. Because you'd rather have a security
issue today that you can debug, and fix for tomorrow.

A dead machine with a bug you can't debug is not actually suddenly a
really secure machine. Because somebody will just figure out a way to
take advantage of *that* bug too indirectly.

Being able to kill machines at your leisure is a great way to stress
other parts of your network, and suddenly that dead machine might be a
big security hazard when somebody figures out a weakness elsewhere
("Oh, look, there's a bug in the fail-over that allows me to do X").

              Linus
Michal Hocko Aug. 20, 2024, 6:17 a.m. UTC | #22
On Mon 19-08-24 23:57:29, David Hildenbrand wrote:
> On 19.08.24 22:35, Linus Torvalds wrote:
> > On Mon, 19 Aug 2024 at 13:24, David Hildenbrand <david@redhat.com> wrote:
> > > 
> > > Right, "warn + loop forever" is one alternative where you could at least
> > > keep the system alive to some degree.
> > 
> > Maybe. Or it might just lock up the machine.
> 
> Yes, regarding the security concerns it might be a bit better that way. (no
> security expert, so I cannot judge ...)

Would it make sense to simply enforce and oops? We do expect that an
incorrect usage would trigger one but we have no guarantee because the
actual user could be 
array = kvmalloc(unsupported_size_provided_from_userspace, GFP_NOFAIL)

which might be actually a valid usecase because that the normaly
requested size is large, yet reasonable. A lack of user input checks is
just a sad reality we have to live with. While those bugs absolutely
_need_ to be fixed it is better to not just allow them to
array[large_index] = payload
and make them easier to exploit the kernel. Sure you will get a warning
but your machine has been compromised.

BUG_ON will shoot the whole machine down which I do understand is just
too drastic of a measure. Making the allocation loop for ever with
cond_resched() or a short sleep is slightly better because it contains
the bad user but the process context is still not killabale that way
which is a problem on its own. I am not aware of OOPS_ON that would kill
the calling user process. Yes, this could still be leaving locks behind
but still better than a compromised system.

WDYT?
diff mbox series

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index c9cb42203183..4a4d1fdc2afe 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -827,8 +827,10 @@  kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
 {
 	size_t bytes;
 
-	if (unlikely(check_mul_overflow(n, size, &bytes)))
+	if (unlikely(check_mul_overflow(n, size, &bytes))) {
+		BUG_ON(flags & __GFP_NOFAIL);
 		return NULL;
+	}
 
 	return kvmalloc_node_noprof(bytes, flags, node);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 60742d057b05..d2c37f8f8d09 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4668,8 +4668,10 @@  struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order,
 	 * There are several places where we assume that the order value is sane
 	 * so bail out early if the request is out of bound.
 	 */
-	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
+	if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) {
+		BUG_ON(gfp & __GFP_NOFAIL);
 		return NULL;
+	}
 
 	gfp &= gfp_allowed_mask;
 	/*
diff --git a/mm/util.c b/mm/util.c
index ac01925a4179..678c647b778f 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -667,6 +667,7 @@  void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
 
 	/* Don't even allow crazy sizes */
 	if (unlikely(size > INT_MAX)) {
+		BUG_ON(flags & __GFP_NOFAIL);
 		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
 		return NULL;
 	}