Message ID | 20240817062449.21164-4-21cnbao@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: clarify nofail memory allocation | expand |
On 17.08.24 08:24, Barry Song wrote: > From: Barry Song <v-songbaohua@oppo.com> > > We have cases we still fail though callers might have __GFP_NOFAIL. Since > they don't check the return, we are exposed to the security risks for NULL > deference. > > Though BUG_ON() is not encouraged by Linus, this is an unrecoverable > situation. > > Christoph Hellwig: > The whole freaking point of __GFP_NOFAIL is that callers don't handle > allocation failures. So in fact a straight BUG is the right thing > here. > > Vlastimil Babka: > It's just not a recoverable situation (WARN_ON is for recoverable > situations). The caller cannot handle allocation failure and at the same > time asked for an impossible allocation. BUG_ON() is a guaranteed oops > with stracktrace etc. We don't need to hope for the later NULL pointer > dereference (which might if really unlucky happen from a different > context where it's no longer obvious what lead to the allocation failing). > > Michal Hocko: > Linus tends to be against adding new BUG() calls unless the failure is > absolutely unrecoverable (e.g. corrupted data structures etc.). I am > not sure how he would look at simply incorrect memory allocator usage to > blow up the kernel. Now the argument could be made that those failures > could cause subtle memory corruptions or even be exploitable which might > be a sufficient reason to stop them early. > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > Reviewed-by: Christoph Hellwig <hch@lst.de> > Acked-by: Vlastimil Babka <vbabka@suse.cz> > Acked-by: Michal Hocko <mhocko@suse.com> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > Cc: Christoph Lameter <cl@linux.com> > Cc: Pekka Enberg <penberg@kernel.org> > Cc: David Rientjes <rientjes@google.com> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> > Cc: Roman Gushchin <roman.gushchin@linux.dev> > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Cc: Kees Cook <kees@kernel.org> > Cc: "Eugenio Pérez" <eperezma@redhat.com> > Cc: Hailong.Liu <hailong.liu@oppo.com> > Cc: Jason Wang <jasowang@redhat.com> > Cc: Maxime Coquelin <maxime.coquelin@redhat.com> > Cc: "Michael S. Tsirkin" <mst@redhat.com> > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > --- > include/linux/slab.h | 4 +++- > mm/page_alloc.c | 4 +++- > mm/util.c | 1 + > 3 files changed, 7 insertions(+), 2 deletions(-) > > diff --git a/include/linux/slab.h b/include/linux/slab.h > index c9cb42203183..4a4d1fdc2afe 100644 > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node) > { > size_t bytes; > > - if (unlikely(check_mul_overflow(n, size, &bytes))) > + if (unlikely(check_mul_overflow(n, size, &bytes))) { > + BUG_ON(flags & __GFP_NOFAIL); > return NULL; > + } > > return kvmalloc_node_noprof(bytes, flags, node); > } > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 60742d057b05..d2c37f8f8d09 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, > * There are several places where we assume that the order value is sane > * so bail out early if the request is out of bound. > */ > - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) > + if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) { > + BUG_ON(gfp & __GFP_NOFAIL); > return NULL; > + } > > gfp &= gfp_allowed_mask; > /* > diff --git a/mm/util.c b/mm/util.c > index ac01925a4179..678c647b778f 100644 > --- a/mm/util.c > +++ b/mm/util.c > @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) > > /* Don't even allow crazy sizes */ > if (unlikely(size > INT_MAX)) { > + BUG_ON(flags & __GFP_NOFAIL); No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here.
On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote: > > On 17.08.24 08:24, Barry Song wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > > > We have cases we still fail though callers might have __GFP_NOFAIL. Since > > they don't check the return, we are exposed to the security risks for NULL > > deference. > > > > Though BUG_ON() is not encouraged by Linus, this is an unrecoverable > > situation. > > > > Christoph Hellwig: > > The whole freaking point of __GFP_NOFAIL is that callers don't handle > > allocation failures. So in fact a straight BUG is the right thing > > here. > > > > Vlastimil Babka: > > It's just not a recoverable situation (WARN_ON is for recoverable > > situations). The caller cannot handle allocation failure and at the same > > time asked for an impossible allocation. BUG_ON() is a guaranteed oops > > with stracktrace etc. We don't need to hope for the later NULL pointer > > dereference (which might if really unlucky happen from a different > > context where it's no longer obvious what lead to the allocation failing). > > > > Michal Hocko: > > Linus tends to be against adding new BUG() calls unless the failure is > > absolutely unrecoverable (e.g. corrupted data structures etc.). I am > > not sure how he would look at simply incorrect memory allocator usage to > > blow up the kernel. Now the argument could be made that those failures > > could cause subtle memory corruptions or even be exploitable which might > > be a sufficient reason to stop them early. > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > Acked-by: Vlastimil Babka <vbabka@suse.cz> > > Acked-by: Michal Hocko <mhocko@suse.com> > > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > > Cc: Christoph Lameter <cl@linux.com> > > Cc: Pekka Enberg <penberg@kernel.org> > > Cc: David Rientjes <rientjes@google.com> > > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> > > Cc: Roman Gushchin <roman.gushchin@linux.dev> > > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> > > Cc: Linus Torvalds <torvalds@linux-foundation.org> > > Cc: Kees Cook <kees@kernel.org> > > Cc: "Eugenio Pérez" <eperezma@redhat.com> > > Cc: Hailong.Liu <hailong.liu@oppo.com> > > Cc: Jason Wang <jasowang@redhat.com> > > Cc: Maxime Coquelin <maxime.coquelin@redhat.com> > > Cc: "Michael S. Tsirkin" <mst@redhat.com> > > Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > --- > > include/linux/slab.h | 4 +++- > > mm/page_alloc.c | 4 +++- > > mm/util.c | 1 + > > 3 files changed, 7 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/slab.h b/include/linux/slab.h > > index c9cb42203183..4a4d1fdc2afe 100644 > > --- a/include/linux/slab.h > > +++ b/include/linux/slab.h > > @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node) > > { > > size_t bytes; > > > > - if (unlikely(check_mul_overflow(n, size, &bytes))) > > + if (unlikely(check_mul_overflow(n, size, &bytes))) { > > + BUG_ON(flags & __GFP_NOFAIL); > > return NULL; > > + } > > > > return kvmalloc_node_noprof(bytes, flags, node); > > } > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index 60742d057b05..d2c37f8f8d09 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, > > * There are several places where we assume that the order value is sane > > * so bail out early if the request is out of bound. > > */ > > - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) > > + if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) { > > + BUG_ON(gfp & __GFP_NOFAIL); > > return NULL; > > + } > > > > gfp &= gfp_allowed_mask; > > /* > > diff --git a/mm/util.c b/mm/util.c > > index ac01925a4179..678c647b778f 100644 > > --- a/mm/util.c > > +++ b/mm/util.c > > @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) > > > > /* Don't even allow crazy sizes */ > > if (unlikely(size > INT_MAX)) { > > + BUG_ON(flags & __GFP_NOFAIL); > > No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here. Hi David, WARN_ON_ONCE() might be fine but I don't see how it is possible to recover. > > -- > Cheers, > > David / dhildenb >
On 19.08.24 11:47, Barry Song wrote: > On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 17.08.24 08:24, Barry Song wrote: >>> From: Barry Song <v-songbaohua@oppo.com> >>> >>> We have cases we still fail though callers might have __GFP_NOFAIL. Since >>> they don't check the return, we are exposed to the security risks for NULL >>> deference. >>> >>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable >>> situation. >>> >>> Christoph Hellwig: >>> The whole freaking point of __GFP_NOFAIL is that callers don't handle >>> allocation failures. So in fact a straight BUG is the right thing >>> here. >>> >>> Vlastimil Babka: >>> It's just not a recoverable situation (WARN_ON is for recoverable >>> situations). The caller cannot handle allocation failure and at the same >>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops >>> with stracktrace etc. We don't need to hope for the later NULL pointer >>> dereference (which might if really unlucky happen from a different >>> context where it's no longer obvious what lead to the allocation failing). >>> >>> Michal Hocko: >>> Linus tends to be against adding new BUG() calls unless the failure is >>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am >>> not sure how he would look at simply incorrect memory allocator usage to >>> blow up the kernel. Now the argument could be made that those failures >>> could cause subtle memory corruptions or even be exploitable which might >>> be a sufficient reason to stop them early. >>> >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>> Reviewed-by: Christoph Hellwig <hch@lst.de> >>> Acked-by: Vlastimil Babka <vbabka@suse.cz> >>> Acked-by: Michal Hocko <mhocko@suse.com> >>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >>> Cc: Christoph Lameter <cl@linux.com> >>> Cc: Pekka Enberg <penberg@kernel.org> >>> Cc: David Rientjes <rientjes@google.com> >>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> >>> Cc: Roman Gushchin <roman.gushchin@linux.dev> >>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> >>> Cc: Linus Torvalds <torvalds@linux-foundation.org> >>> Cc: Kees Cook <kees@kernel.org> >>> Cc: "Eugenio Pérez" <eperezma@redhat.com> >>> Cc: Hailong.Liu <hailong.liu@oppo.com> >>> Cc: Jason Wang <jasowang@redhat.com> >>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> >>> Cc: "Michael S. Tsirkin" <mst@redhat.com> >>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> >>> --- >>> include/linux/slab.h | 4 +++- >>> mm/page_alloc.c | 4 +++- >>> mm/util.c | 1 + >>> 3 files changed, 7 insertions(+), 2 deletions(-) >>> >>> diff --git a/include/linux/slab.h b/include/linux/slab.h >>> index c9cb42203183..4a4d1fdc2afe 100644 >>> --- a/include/linux/slab.h >>> +++ b/include/linux/slab.h >>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node) >>> { >>> size_t bytes; >>> >>> - if (unlikely(check_mul_overflow(n, size, &bytes))) >>> + if (unlikely(check_mul_overflow(n, size, &bytes))) { >>> + BUG_ON(flags & __GFP_NOFAIL); >>> return NULL; >>> + } >>> >>> return kvmalloc_node_noprof(bytes, flags, node); >>> } >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>> index 60742d057b05..d2c37f8f8d09 100644 >>> --- a/mm/page_alloc.c >>> +++ b/mm/page_alloc.c >>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, >>> * There are several places where we assume that the order value is sane >>> * so bail out early if the request is out of bound. >>> */ >>> - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) >>> + if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) { >>> + BUG_ON(gfp & __GFP_NOFAIL); >>> return NULL; >>> + } >>> >>> gfp &= gfp_allowed_mask; >>> /* >>> diff --git a/mm/util.c b/mm/util.c >>> index ac01925a4179..678c647b778f 100644 >>> --- a/mm/util.c >>> +++ b/mm/util.c >>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) >>> >>> /* Don't even allow crazy sizes */ >>> if (unlikely(size > INT_MAX)) { >>> + BUG_ON(flags & __GFP_NOFAIL); >> >> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here. > > Hi David, > WARN_ON_ONCE() might be fine but I don't see how it is possible to recover. Just return NULL? "shit in shit out" :) ?
On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote: > > On 19.08.24 11:47, Barry Song wrote: > > On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 17.08.24 08:24, Barry Song wrote: > >>> From: Barry Song <v-songbaohua@oppo.com> > >>> > >>> We have cases we still fail though callers might have __GFP_NOFAIL. Since > >>> they don't check the return, we are exposed to the security risks for NULL > >>> deference. > >>> > >>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable > >>> situation. > >>> > >>> Christoph Hellwig: > >>> The whole freaking point of __GFP_NOFAIL is that callers don't handle > >>> allocation failures. So in fact a straight BUG is the right thing > >>> here. > >>> > >>> Vlastimil Babka: > >>> It's just not a recoverable situation (WARN_ON is for recoverable > >>> situations). The caller cannot handle allocation failure and at the same > >>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops > >>> with stracktrace etc. We don't need to hope for the later NULL pointer > >>> dereference (which might if really unlucky happen from a different > >>> context where it's no longer obvious what lead to the allocation failing). > >>> > >>> Michal Hocko: > >>> Linus tends to be against adding new BUG() calls unless the failure is > >>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am > >>> not sure how he would look at simply incorrect memory allocator usage to > >>> blow up the kernel. Now the argument could be made that those failures > >>> could cause subtle memory corruptions or even be exploitable which might > >>> be a sufficient reason to stop them early. > >>> > >>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > >>> Reviewed-by: Christoph Hellwig <hch@lst.de> > >>> Acked-by: Vlastimil Babka <vbabka@suse.cz> > >>> Acked-by: Michal Hocko <mhocko@suse.com> > >>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> > >>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > >>> Cc: Christoph Lameter <cl@linux.com> > >>> Cc: Pekka Enberg <penberg@kernel.org> > >>> Cc: David Rientjes <rientjes@google.com> > >>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> > >>> Cc: Roman Gushchin <roman.gushchin@linux.dev> > >>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> > >>> Cc: Linus Torvalds <torvalds@linux-foundation.org> > >>> Cc: Kees Cook <kees@kernel.org> > >>> Cc: "Eugenio Pérez" <eperezma@redhat.com> > >>> Cc: Hailong.Liu <hailong.liu@oppo.com> > >>> Cc: Jason Wang <jasowang@redhat.com> > >>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> > >>> Cc: "Michael S. Tsirkin" <mst@redhat.com> > >>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > >>> --- > >>> include/linux/slab.h | 4 +++- > >>> mm/page_alloc.c | 4 +++- > >>> mm/util.c | 1 + > >>> 3 files changed, 7 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/include/linux/slab.h b/include/linux/slab.h > >>> index c9cb42203183..4a4d1fdc2afe 100644 > >>> --- a/include/linux/slab.h > >>> +++ b/include/linux/slab.h > >>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node) > >>> { > >>> size_t bytes; > >>> > >>> - if (unlikely(check_mul_overflow(n, size, &bytes))) > >>> + if (unlikely(check_mul_overflow(n, size, &bytes))) { > >>> + BUG_ON(flags & __GFP_NOFAIL); > >>> return NULL; > >>> + } > >>> > >>> return kvmalloc_node_noprof(bytes, flags, node); > >>> } > >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >>> index 60742d057b05..d2c37f8f8d09 100644 > >>> --- a/mm/page_alloc.c > >>> +++ b/mm/page_alloc.c > >>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, > >>> * There are several places where we assume that the order value is sane > >>> * so bail out early if the request is out of bound. > >>> */ > >>> - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) > >>> + if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) { > >>> + BUG_ON(gfp & __GFP_NOFAIL); > >>> return NULL; > >>> + } > >>> > >>> gfp &= gfp_allowed_mask; > >>> /* > >>> diff --git a/mm/util.c b/mm/util.c > >>> index ac01925a4179..678c647b778f 100644 > >>> --- a/mm/util.c > >>> +++ b/mm/util.c > >>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) > >>> > >>> /* Don't even allow crazy sizes */ > >>> if (unlikely(size > INT_MAX)) { > >>> + BUG_ON(flags & __GFP_NOFAIL); > >> > >> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here. > > > > Hi David, > > WARN_ON_ONCE() might be fine but I don't see how it is possible to recover. > > Just return NULL? "shit in shit out" :) ? Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL, as it's the caller's responsibility to check the return value. However, with __GFP_NOFAIL, users will directly dereference *(p + offset) even when p == NULL. It is how __GFP_NOFAIL is supposed to work. > > -- > Cheers, > > David / dhildenb >
On 19.08.24 12:02, Barry Song wrote: > On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 19.08.24 11:47, Barry Song wrote: >>> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 17.08.24 08:24, Barry Song wrote: >>>>> From: Barry Song <v-songbaohua@oppo.com> >>>>> >>>>> We have cases we still fail though callers might have __GFP_NOFAIL. Since >>>>> they don't check the return, we are exposed to the security risks for NULL >>>>> deference. >>>>> >>>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable >>>>> situation. >>>>> >>>>> Christoph Hellwig: >>>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle >>>>> allocation failures. So in fact a straight BUG is the right thing >>>>> here. >>>>> >>>>> Vlastimil Babka: >>>>> It's just not a recoverable situation (WARN_ON is for recoverable >>>>> situations). The caller cannot handle allocation failure and at the same >>>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops >>>>> with stracktrace etc. We don't need to hope for the later NULL pointer >>>>> dereference (which might if really unlucky happen from a different >>>>> context where it's no longer obvious what lead to the allocation failing). >>>>> >>>>> Michal Hocko: >>>>> Linus tends to be against adding new BUG() calls unless the failure is >>>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am >>>>> not sure how he would look at simply incorrect memory allocator usage to >>>>> blow up the kernel. Now the argument could be made that those failures >>>>> could cause subtle memory corruptions or even be exploitable which might >>>>> be a sufficient reason to stop them early. >>>>> >>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> >>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz> >>>>> Acked-by: Michal Hocko <mhocko@suse.com> >>>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> >>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >>>>> Cc: Christoph Lameter <cl@linux.com> >>>>> Cc: Pekka Enberg <penberg@kernel.org> >>>>> Cc: David Rientjes <rientjes@google.com> >>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> >>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev> >>>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> >>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org> >>>>> Cc: Kees Cook <kees@kernel.org> >>>>> Cc: "Eugenio Pérez" <eperezma@redhat.com> >>>>> Cc: Hailong.Liu <hailong.liu@oppo.com> >>>>> Cc: Jason Wang <jasowang@redhat.com> >>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> >>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> >>>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> >>>>> --- >>>>> include/linux/slab.h | 4 +++- >>>>> mm/page_alloc.c | 4 +++- >>>>> mm/util.c | 1 + >>>>> 3 files changed, 7 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h >>>>> index c9cb42203183..4a4d1fdc2afe 100644 >>>>> --- a/include/linux/slab.h >>>>> +++ b/include/linux/slab.h >>>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node) >>>>> { >>>>> size_t bytes; >>>>> >>>>> - if (unlikely(check_mul_overflow(n, size, &bytes))) >>>>> + if (unlikely(check_mul_overflow(n, size, &bytes))) { >>>>> + BUG_ON(flags & __GFP_NOFAIL); >>>>> return NULL; >>>>> + } >>>>> >>>>> return kvmalloc_node_noprof(bytes, flags, node); >>>>> } >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>> index 60742d057b05..d2c37f8f8d09 100644 >>>>> --- a/mm/page_alloc.c >>>>> +++ b/mm/page_alloc.c >>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, >>>>> * There are several places where we assume that the order value is sane >>>>> * so bail out early if the request is out of bound. >>>>> */ >>>>> - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) >>>>> + if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) { >>>>> + BUG_ON(gfp & __GFP_NOFAIL); >>>>> return NULL; >>>>> + } >>>>> >>>>> gfp &= gfp_allowed_mask; >>>>> /* >>>>> diff --git a/mm/util.c b/mm/util.c >>>>> index ac01925a4179..678c647b778f 100644 >>>>> --- a/mm/util.c >>>>> +++ b/mm/util.c >>>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) >>>>> >>>>> /* Don't even allow crazy sizes */ >>>>> if (unlikely(size > INT_MAX)) { >>>>> + BUG_ON(flags & __GFP_NOFAIL); >>>> >>>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here. >>> >>> Hi David, >>> WARN_ON_ONCE() might be fine but I don't see how it is possible to recover. >> >> Just return NULL? "shit in shit out" :) ? > > Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL, > as it's the caller's responsibility to check the return value. However, with > __GFP_NOFAIL, users will directly dereference *(p + offset) even when > p == NULL. It is how __GFP_NOFAIL is supposed to work. If the caller is not supposed to pass that flag combination (shit in), we are not obligated to give a reasonable result (shit out). My point is that we should let the caller (possibly?) crash -- the one that did something that is wrong -- instead of forcing a crash using BUG_ON in this code here. It should all be caught during testing either way. And if some OOT module does something nasty, that's not our responsibility. BUG_ON is not a way to write assertions into the code.
On Tue, Aug 20, 2024 at 12:33 AM David Hildenbrand <david@redhat.com> wrote: > > On 19.08.24 12:02, Barry Song wrote: > > On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 19.08.24 11:47, Barry Song wrote: > >>> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote: > >>>> > >>>> On 17.08.24 08:24, Barry Song wrote: > >>>>> From: Barry Song <v-songbaohua@oppo.com> > >>>>> > >>>>> We have cases we still fail though callers might have __GFP_NOFAIL. Since > >>>>> they don't check the return, we are exposed to the security risks for NULL > >>>>> deference. > >>>>> > >>>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable > >>>>> situation. > >>>>> > >>>>> Christoph Hellwig: > >>>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle > >>>>> allocation failures. So in fact a straight BUG is the right thing > >>>>> here. > >>>>> > >>>>> Vlastimil Babka: > >>>>> It's just not a recoverable situation (WARN_ON is for recoverable > >>>>> situations). The caller cannot handle allocation failure and at the same > >>>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops > >>>>> with stracktrace etc. We don't need to hope for the later NULL pointer > >>>>> dereference (which might if really unlucky happen from a different > >>>>> context where it's no longer obvious what lead to the allocation failing). > >>>>> > >>>>> Michal Hocko: > >>>>> Linus tends to be against adding new BUG() calls unless the failure is > >>>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am > >>>>> not sure how he would look at simply incorrect memory allocator usage to > >>>>> blow up the kernel. Now the argument could be made that those failures > >>>>> could cause subtle memory corruptions or even be exploitable which might > >>>>> be a sufficient reason to stop them early. > >>>>> > >>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> > >>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> > >>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz> > >>>>> Acked-by: Michal Hocko <mhocko@suse.com> > >>>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> > >>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> > >>>>> Cc: Christoph Lameter <cl@linux.com> > >>>>> Cc: Pekka Enberg <penberg@kernel.org> > >>>>> Cc: David Rientjes <rientjes@google.com> > >>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> > >>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev> > >>>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> > >>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org> > >>>>> Cc: Kees Cook <kees@kernel.org> > >>>>> Cc: "Eugenio Pérez" <eperezma@redhat.com> > >>>>> Cc: Hailong.Liu <hailong.liu@oppo.com> > >>>>> Cc: Jason Wang <jasowang@redhat.com> > >>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> > >>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> > >>>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > >>>>> --- > >>>>> include/linux/slab.h | 4 +++- > >>>>> mm/page_alloc.c | 4 +++- > >>>>> mm/util.c | 1 + > >>>>> 3 files changed, 7 insertions(+), 2 deletions(-) > >>>>> > >>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h > >>>>> index c9cb42203183..4a4d1fdc2afe 100644 > >>>>> --- a/include/linux/slab.h > >>>>> +++ b/include/linux/slab.h > >>>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node) > >>>>> { > >>>>> size_t bytes; > >>>>> > >>>>> - if (unlikely(check_mul_overflow(n, size, &bytes))) > >>>>> + if (unlikely(check_mul_overflow(n, size, &bytes))) { > >>>>> + BUG_ON(flags & __GFP_NOFAIL); > >>>>> return NULL; > >>>>> + } > >>>>> > >>>>> return kvmalloc_node_noprof(bytes, flags, node); > >>>>> } > >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >>>>> index 60742d057b05..d2c37f8f8d09 100644 > >>>>> --- a/mm/page_alloc.c > >>>>> +++ b/mm/page_alloc.c > >>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, > >>>>> * There are several places where we assume that the order value is sane > >>>>> * so bail out early if the request is out of bound. > >>>>> */ > >>>>> - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) > >>>>> + if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) { > >>>>> + BUG_ON(gfp & __GFP_NOFAIL); > >>>>> return NULL; > >>>>> + } > >>>>> > >>>>> gfp &= gfp_allowed_mask; > >>>>> /* > >>>>> diff --git a/mm/util.c b/mm/util.c > >>>>> index ac01925a4179..678c647b778f 100644 > >>>>> --- a/mm/util.c > >>>>> +++ b/mm/util.c > >>>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) > >>>>> > >>>>> /* Don't even allow crazy sizes */ > >>>>> if (unlikely(size > INT_MAX)) { > >>>>> + BUG_ON(flags & __GFP_NOFAIL); > >>>> > >>>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here. > >>> > >>> Hi David, > >>> WARN_ON_ONCE() might be fine but I don't see how it is possible to recover. > >> > >> Just return NULL? "shit in shit out" :) ? > > > > Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL, > > as it's the caller's responsibility to check the return value. However, with > > __GFP_NOFAIL, users will directly dereference *(p + offset) even when > > p == NULL. It is how __GFP_NOFAIL is supposed to work. > > If the caller is not supposed to pass that flag combination (shit in), > we are not obligated to give a reasonable result (shit out). > > My point is that we should let the caller (possibly?) crash -- the one > that did something that is wrong -- instead of forcing a crash using > BUG_ON in this code here. > > It should all be caught during testing either way. And if some OOT > module does something nasty, that's not our responsibility. > > BUG_ON is not a way to write assertions into the code. It seems there was a misunderstanding regarding the purpose of this change. we actually have many details in changelog. Its aim is not to write an assertion, but rather to prevent exposing a security vulnerability. Returning NULL doesn't necessarily crash the caller's process, p->field, *(p + offset) deference could be used by hackers to exploit the system. > > -- > Cheers, > > David / dhildenb > Thanks Barry
On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote: > It should all be caught during testing either way. And if some OOT module > does something nasty, that's not our responsibility. > > BUG_ON is not a way to write assertions into the code. So you'd rather create exploits than crashing on a fundamental API violation? That's exactly what the series is trying to fix.
On 19.08.24 14:48, Barry Song wrote: > On Tue, Aug 20, 2024 at 12:33 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 19.08.24 12:02, Barry Song wrote: >>> On Mon, Aug 19, 2024 at 9:55 PM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 19.08.24 11:47, Barry Song wrote: >>>>> On Mon, Aug 19, 2024 at 9:43 PM David Hildenbrand <david@redhat.com> wrote: >>>>>> >>>>>> On 17.08.24 08:24, Barry Song wrote: >>>>>>> From: Barry Song <v-songbaohua@oppo.com> >>>>>>> >>>>>>> We have cases we still fail though callers might have __GFP_NOFAIL. Since >>>>>>> they don't check the return, we are exposed to the security risks for NULL >>>>>>> deference. >>>>>>> >>>>>>> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable >>>>>>> situation. >>>>>>> >>>>>>> Christoph Hellwig: >>>>>>> The whole freaking point of __GFP_NOFAIL is that callers don't handle >>>>>>> allocation failures. So in fact a straight BUG is the right thing >>>>>>> here. >>>>>>> >>>>>>> Vlastimil Babka: >>>>>>> It's just not a recoverable situation (WARN_ON is for recoverable >>>>>>> situations). The caller cannot handle allocation failure and at the same >>>>>>> time asked for an impossible allocation. BUG_ON() is a guaranteed oops >>>>>>> with stracktrace etc. We don't need to hope for the later NULL pointer >>>>>>> dereference (which might if really unlucky happen from a different >>>>>>> context where it's no longer obvious what lead to the allocation failing). >>>>>>> >>>>>>> Michal Hocko: >>>>>>> Linus tends to be against adding new BUG() calls unless the failure is >>>>>>> absolutely unrecoverable (e.g. corrupted data structures etc.). I am >>>>>>> not sure how he would look at simply incorrect memory allocator usage to >>>>>>> blow up the kernel. Now the argument could be made that those failures >>>>>>> could cause subtle memory corruptions or even be exploitable which might >>>>>>> be a sufficient reason to stop them early. >>>>>>> >>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com> >>>>>>> Reviewed-by: Christoph Hellwig <hch@lst.de> >>>>>>> Acked-by: Vlastimil Babka <vbabka@suse.cz> >>>>>>> Acked-by: Michal Hocko <mhocko@suse.com> >>>>>>> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> >>>>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> >>>>>>> Cc: Christoph Lameter <cl@linux.com> >>>>>>> Cc: Pekka Enberg <penberg@kernel.org> >>>>>>> Cc: David Rientjes <rientjes@google.com> >>>>>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> >>>>>>> Cc: Roman Gushchin <roman.gushchin@linux.dev> >>>>>>> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> >>>>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org> >>>>>>> Cc: Kees Cook <kees@kernel.org> >>>>>>> Cc: "Eugenio Pérez" <eperezma@redhat.com> >>>>>>> Cc: Hailong.Liu <hailong.liu@oppo.com> >>>>>>> Cc: Jason Wang <jasowang@redhat.com> >>>>>>> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> >>>>>>> Cc: "Michael S. Tsirkin" <mst@redhat.com> >>>>>>> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> >>>>>>> --- >>>>>>> include/linux/slab.h | 4 +++- >>>>>>> mm/page_alloc.c | 4 +++- >>>>>>> mm/util.c | 1 + >>>>>>> 3 files changed, 7 insertions(+), 2 deletions(-) >>>>>>> >>>>>>> diff --git a/include/linux/slab.h b/include/linux/slab.h >>>>>>> index c9cb42203183..4a4d1fdc2afe 100644 >>>>>>> --- a/include/linux/slab.h >>>>>>> +++ b/include/linux/slab.h >>>>>>> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node) >>>>>>> { >>>>>>> size_t bytes; >>>>>>> >>>>>>> - if (unlikely(check_mul_overflow(n, size, &bytes))) >>>>>>> + if (unlikely(check_mul_overflow(n, size, &bytes))) { >>>>>>> + BUG_ON(flags & __GFP_NOFAIL); >>>>>>> return NULL; >>>>>>> + } >>>>>>> >>>>>>> return kvmalloc_node_noprof(bytes, flags, node); >>>>>>> } >>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>>> index 60742d057b05..d2c37f8f8d09 100644 >>>>>>> --- a/mm/page_alloc.c >>>>>>> +++ b/mm/page_alloc.c >>>>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, >>>>>>> * There are several places where we assume that the order value is sane >>>>>>> * so bail out early if the request is out of bound. >>>>>>> */ >>>>>>> - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) >>>>>>> + if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) { >>>>>>> + BUG_ON(gfp & __GFP_NOFAIL); >>>>>>> return NULL; >>>>>>> + } >>>>>>> >>>>>>> gfp &= gfp_allowed_mask; >>>>>>> /* >>>>>>> diff --git a/mm/util.c b/mm/util.c >>>>>>> index ac01925a4179..678c647b778f 100644 >>>>>>> --- a/mm/util.c >>>>>>> +++ b/mm/util.c >>>>>>> @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) >>>>>>> >>>>>>> /* Don't even allow crazy sizes */ >>>>>>> if (unlikely(size > INT_MAX)) { >>>>>>> + BUG_ON(flags & __GFP_NOFAIL); >>>>>> >>>>>> No new BUG_ON please. WARN_ON_ONCE() + recovery code might be suitable here. >>>>> >>>>> Hi David, >>>>> WARN_ON_ONCE() might be fine but I don't see how it is possible to recover. >>>> >>>> Just return NULL? "shit in shit out" :) ? >>> >>> Returning NULL is perfectly right if gfp doesn't include __GFP_NOFAIL, >>> as it's the caller's responsibility to check the return value. However, with >>> __GFP_NOFAIL, users will directly dereference *(p + offset) even when >>> p == NULL. It is how __GFP_NOFAIL is supposed to work. >> >> If the caller is not supposed to pass that flag combination (shit in), >> we are not obligated to give a reasonable result (shit out). >> >> My point is that we should let the caller (possibly?) crash -- the one >> that did something that is wrong -- instead of forcing a crash using >> BUG_ON in this code here. >> >> It should all be caught during testing either way. And if some OOT >> module does something nasty, that's not our responsibility. >> >> BUG_ON is not a way to write assertions into the code. > > It seems there was a misunderstanding regarding the purpose of > this change. we actually have many details in changelog. > > Its aim is not to write an assertion, but rather to prevent exposing > a security vulnerability. > > Returning NULL doesn't necessarily crash the caller's process, p->field, > *(p + offset) deference could be used by hackers to exploit the system. See my other reply to Michal: why do we even allow to specify them separately and not simply let one enforce the other? That might result in an issue elsewhere, but likely no security vulnerability? I really hate each and every BUG_ON I have to stare at.
On 19.08.24 14:49, Christoph Hellwig wrote: > On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote: >> It should all be caught during testing either way. And if some OOT module >> does something nasty, that's not our responsibility. >> >> BUG_ON is not a way to write assertions into the code. > > So you'd rather create exploits than crashing on a fundamental API > violation? That's exactly what the series is trying to fix. I'd rather have a sane API that doesn't even allow this level of flexibility with NOFAIL. But probably I'm missing more details here why this all has to be so complicated ;)
On Mon, Aug 19, 2024 at 02:51:06PM +0200, David Hildenbrand wrote: > On 19.08.24 14:49, Christoph Hellwig wrote: >> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote: >>> It should all be caught during testing either way. And if some OOT module >>> does something nasty, that's not our responsibility. >>> >>> BUG_ON is not a way to write assertions into the code. >> >> So you'd rather create exploits than crashing on a fundamental API >> violation? That's exactly what the series is trying to fix. > > I'd rather have a sane API that doesn't even allow this level of > flexibility with NOFAIL. > > But probably I'm missing more details here why this all has to be so > complicated ;) Well, the only way to do that is to remove (__)GFP_NOFAIL and require either explicit _nofail variants without a way to pass gfp flags, or endless loops in the callers. I suggested the first one earlier, but no one liked it due to the API complexity and overhead. And I've not heard anyone arguing for the latter yet. One other way might be a compile time check that requires any GPF flag that contains (__)GFP_NOFAIL to be a compile time constants. This is true for many but not all callers currently.
On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote: > > On 19.08.24 14:49, Christoph Hellwig wrote: > > On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote: > >> It should all be caught during testing either way. And if some OOT module > >> does something nasty, that's not our responsibility. > >> > >> BUG_ON is not a way to write assertions into the code. > > > > So you'd rather create exploits than crashing on a fundamental API > > violation? That's exactly what the series is trying to fix. > > I'd rather have a sane API that doesn't even allow this level of > flexibility with NOFAIL. yes, i have already sent a RFC enforcing direct_reclamation: https://www.spinics.net/lists/linux-mm/msg394659.html somehow, it is not ready yet. i think Christoph prefers scope api rather than GFP_NOFAIL which definitely has __GFP_DIRECT_RECLAIM set. I guess you know I have at least 5 series running, so it will happen soon though. > > But probably I'm missing more details here why this all has to be so > complicated ;) enforcing direct_reclamation is right and will work for a reasonable size. but for this overflow size, even if we enforce direct_reclamation in GFP_NOFAIL, we are still failing. > > -- > Cheers, > > David / dhildenb >
On 19.08.24 15:05, Barry Song wrote: > On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 19.08.24 14:49, Christoph Hellwig wrote: >>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote: >>>> It should all be caught during testing either way. And if some OOT module >>>> does something nasty, that's not our responsibility. >>>> >>>> BUG_ON is not a way to write assertions into the code. >>> >>> So you'd rather create exploits than crashing on a fundamental API >>> violation? That's exactly what the series is trying to fix. >> >> I'd rather have a sane API that doesn't even allow this level of >> flexibility with NOFAIL. > > yes, i have already sent a RFC enforcing direct_reclamation: > https://www.spinics.net/lists/linux-mm/msg394659.html > > somehow, it is not ready yet. i think Christoph prefers scope > api rather than GFP_NOFAIL which definitely has > __GFP_DIRECT_RECLAIM set. I guess you know I have > at least 5 series running, so it will happen soon though. That really sounds like the right thing to do, at least with the "direct reclaim" problem ... > >> >> But probably I'm missing more details here why this all has to be so >> complicated ;) > > enforcing direct_reclamation is right and will work for a reasonable size. > but for this overflow size, even if we enforce direct_reclamation > in GFP_NOFAIL, we are still failing. Right, someone requested something completely impossible. It's harder to do something here that doesn't return NULL. Except WARN_ON_ONCE() and loop for all infinity.
On 19.08.24 14:53, Christoph Hellwig wrote: > On Mon, Aug 19, 2024 at 02:51:06PM +0200, David Hildenbrand wrote: >> On 19.08.24 14:49, Christoph Hellwig wrote: >>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote: >>>> It should all be caught during testing either way. And if some OOT module >>>> does something nasty, that's not our responsibility. >>>> >>>> BUG_ON is not a way to write assertions into the code. >>> >>> So you'd rather create exploits than crashing on a fundamental API >>> violation? That's exactly what the series is trying to fix. >> >> I'd rather have a sane API that doesn't even allow this level of >> flexibility with NOFAIL. >> >> But probably I'm missing more details here why this all has to be so >> complicated ;) > > Well, the only way to do that is to remove (__)GFP_NOFAIL and require > either explicit _nofail variants without a way to pass gfp flags, or > endless loops in the callers. I suggested the first one earlier, but > no one liked it due to the API complexity and overhead. And I've not > heard anyone arguing for the latter yet. Right, and and "allocate more than even possible" case is extremely ugly. > > One other way might be a compile time check that requires any GPF > flag that contains (__)GFP_NOFAIL to be a compile time constants. > This is true for many but not all callers currently. Right.
On Tue, Aug 20, 2024 at 1:10 AM David Hildenbrand <david@redhat.com> wrote: > > On 19.08.24 15:05, Barry Song wrote: > > On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 19.08.24 14:49, Christoph Hellwig wrote: > >>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote: > >>>> It should all be caught during testing either way. And if some OOT module > >>>> does something nasty, that's not our responsibility. > >>>> > >>>> BUG_ON is not a way to write assertions into the code. > >>> > >>> So you'd rather create exploits than crashing on a fundamental API > >>> violation? That's exactly what the series is trying to fix. > >> > >> I'd rather have a sane API that doesn't even allow this level of > >> flexibility with NOFAIL. > > > > yes, i have already sent a RFC enforcing direct_reclamation: > > https://www.spinics.net/lists/linux-mm/msg394659.html > > > > somehow, it is not ready yet. i think Christoph prefers scope > > api rather than GFP_NOFAIL which definitely has > > __GFP_DIRECT_RECLAIM set. I guess you know I have > > at least 5 series running, so it will happen soon though. > > That really sounds like the right thing to do, at least with the "direct > reclaim" problem ... > > > > >> > >> But probably I'm missing more details here why this all has to be so > >> complicated ;) > > > > enforcing direct_reclamation is right and will work for a reasonable size. > > but for this overflow size, even if we enforce direct_reclamation > > in GFP_NOFAIL, we are still failing. > > Right, someone requested something completely impossible. It's harder to > do something here that doesn't return NULL. Except WARN_ON_ONCE() and > loop for all infinity. Returning NULL can introduce security vulnerabilities. While I’m not a hacker, it’s hard to predict how they might exploit this. If we want to avoid using BUG_ON, an alternative approach could be as follows: while(gfp & __GFP_NOFAIL) some_cpu_relaxed_job; ? > > -- > Cheers, > > David / dhildenb >
On 19.08.24 15:19, Barry Song wrote: > On Tue, Aug 20, 2024 at 1:10 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 19.08.24 15:05, Barry Song wrote: >>> On Tue, Aug 20, 2024 at 12:51 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 19.08.24 14:49, Christoph Hellwig wrote: >>>>> On Mon, Aug 19, 2024 at 02:33:06PM +0200, David Hildenbrand wrote: >>>>>> It should all be caught during testing either way. And if some OOT module >>>>>> does something nasty, that's not our responsibility. >>>>>> >>>>>> BUG_ON is not a way to write assertions into the code. >>>>> >>>>> So you'd rather create exploits than crashing on a fundamental API >>>>> violation? That's exactly what the series is trying to fix. >>>> >>>> I'd rather have a sane API that doesn't even allow this level of >>>> flexibility with NOFAIL. >>> >>> yes, i have already sent a RFC enforcing direct_reclamation: >>> https://www.spinics.net/lists/linux-mm/msg394659.html >>> >>> somehow, it is not ready yet. i think Christoph prefers scope >>> api rather than GFP_NOFAIL which definitely has >>> __GFP_DIRECT_RECLAIM set. I guess you know I have >>> at least 5 series running, so it will happen soon though. >> >> That really sounds like the right thing to do, at least with the "direct >> reclaim" problem ... >> >>> >>>> >>>> But probably I'm missing more details here why this all has to be so >>>> complicated ;) >>> >>> enforcing direct_reclamation is right and will work for a reasonable size. >>> but for this overflow size, even if we enforce direct_reclamation >>> in GFP_NOFAIL, we are still failing. >> >> Right, someone requested something completely impossible. It's harder to >> do something here that doesn't return NULL. Except WARN_ON_ONCE() and >> loop for all infinity. > > Returning NULL can introduce security vulnerabilities. While I’m not a hacker, > it’s hard to predict how they might exploit this. If we want to avoid using > BUG_ON, an alternative approach could be as follows: > > while(gfp & __GFP_NOFAIL) some_cpu_relaxed_job; ? At least for the "allocate something impossible in size", retrying forever makes *some* sense (with a warning, of course). We would retry until it works for sizes that are possible. For the other case, we should just make __GFP_NOFAIL imply whatever is currently a "must". Returning NULL here is indeed strange.
On Mon 19-08-24 14:49:55, David Hildenbrand wrote: > On 19.08.24 14:48, Barry Song wrote: [...] > > > > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > > > > > > index 60742d057b05..d2c37f8f8d09 100644 > > > > > > > > --- a/mm/page_alloc.c > > > > > > > > +++ b/mm/page_alloc.c > > > > > > > > @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, > > > > > > > > * There are several places where we assume that the order value is sane > > > > > > > > * so bail out early if the request is out of bound. > > > > > > > > */ > > > > > > > > - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) > > > > > > > > + if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) { > > > > > > > > + BUG_ON(gfp & __GFP_NOFAIL); > > > > > > > > return NULL; > > > > > > > > + } [...] > > Returning NULL doesn't necessarily crash the caller's process, p->field, > > *(p + offset) deference could be used by hackers to exploit the system. > > See my other reply to Michal: why do we even allow to specify them > separately and not simply let one enforce the other? Are you replying to this patch? This is not about a combination of flags. This is about the above (and other similar) boundary checks which return NULL if the size is deemed incorrect. I think those are potential problems because it could be a lack of input check which could be turned into a potentially malicious code. Because unchecked (return value because NOFAIL never fails, right?) return value might even not OOPs and become a silent read/write into memory. Whether to BUG_ON or simply loop for ever in the allocator if somebody requests non-sleeping NOFAIL allocation is a different story.
On Mon, 19 Aug 2024 at 10:12, Michal Hocko <mhocko@suse.com> wrote: > > Whether to BUG_ON or simply loop for ever in the allocator if somebody > requests non-sleeping NOFAIL allocation is a different story. Just return NULL. The bug isn't in the VM. It's in the caller. Don't take on other peoples problems. It was never valid to say "I want to allocate lots of memory and you can't fail". Don't validate that kind of bogus behavior, just tell them "you're bad" and return NULL. Linus
On 19.08.24 19:12, Michal Hocko wrote: > On Mon 19-08-24 14:49:55, David Hildenbrand wrote: >> On 19.08.24 14:48, Barry Song wrote: > [...] >>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>>>>>> index 60742d057b05..d2c37f8f8d09 100644 >>>>>>>>> --- a/mm/page_alloc.c >>>>>>>>> +++ b/mm/page_alloc.c >>>>>>>>> @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, >>>>>>>>> * There are several places where we assume that the order value is sane >>>>>>>>> * so bail out early if the request is out of bound. >>>>>>>>> */ >>>>>>>>> - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) >>>>>>>>> + if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) { >>>>>>>>> + BUG_ON(gfp & __GFP_NOFAIL); >>>>>>>>> return NULL; >>>>>>>>> + } > [...] >>> Returning NULL doesn't necessarily crash the caller's process, p->field, >>> *(p + offset) deference could be used by hackers to exploit the system. >> >> See my other reply to Michal: why do we even allow to specify them >> separately and not simply let one enforce the other? > > Are you replying to this patch? This is not about a combination of > flags. This is about the above (and other similar) boundary checks which > return NULL if the size is deemed incorrect. > I think those are potential > problems because it could be a lack of input check which could be turned > into a potentially malicious code. Because unchecked (return value > because NOFAIL never fails, right?) return value might even not OOPs and > become a silent read/write into memory. Right, that's a different kind of issue than the simple "don't pass stupid flag combinations" thing where "we'll fix that up for you" is more reasonable. Possibly NOFAIL allocations with an allocation sizes that is not a small compile-time constant already has a bad smell to it, but I'm sure there are reasonable exceptions ... > > Whether to BUG_ON or simply loop for ever in the allocator if somebody > requests non-sleeping NOFAIL allocation is a different story. Right, "warn + loop forever" is one alternative where you could at least keep the system alive to some degree. Satisfying a large allocation might take a long time, satisfying a "too large" allocation would take forever. But as Linus says, it's all workarounds for other buggy code, to make buggy code less exploitable, maybe, ...
On Mon, 19 Aug 2024 at 13:24, David Hildenbrand <david@redhat.com> wrote: > > Right, "warn + loop forever" is one alternative where you could at least > keep the system alive to some degree. Maybe. Or it might just lock up the machine. For small allocations looping forever is probably fine, because in practice there's always *something* that can be thrown out. But anything larger than order-3 (handwavy, but that was our historical limit, I think, and we call it PAGE_ALLOC_COSTLY_ORDER) has to fail at _some_ point, and the caller setting GFP_NOFAIL is just fantasy and "Daddy, I want a pony", and should be ignored. With a WARN_ON_ONCE(), by all means, so that people can see who the fantasist is. Linus
On 19.08.24 22:35, Linus Torvalds wrote: > On Mon, 19 Aug 2024 at 13:24, David Hildenbrand <david@redhat.com> wrote: >> >> Right, "warn + loop forever" is one alternative where you could at least >> keep the system alive to some degree. > > Maybe. Or it might just lock up the machine. Yes, regarding the security concerns it might be a bit better that way. (no security expert, so I cannot judge ...) So, instead of saying "No, you can't have a pony, stupid", you would say "well, let me talk to your mother about that (WARN)" ... and just never reply to the real question ... :)
On Mon, 19 Aug 2024 at 14:57, David Hildenbrand <david@redhat.com> wrote: > > On 19.08.24 22:35, Linus Torvalds wrote: > > > > Maybe. Or it might just lock up the machine. > > Yes, regarding the security concerns it might be a bit better that way. > (no security expert, so I cannot judge ...) The thing is, in a static universe where time does not flow, that might be true. But in such a universe, we wouldn't actually be alive, much less do kernel development. In *REALITY*, security is not a single point in time. Real security is about the *future* more than it is about anything else. From a user perspective, security is very much about things like "keep your system up-to-date", because we _know_ old systems have (known) bugs. But the other side of that is that from a developer perspective, the #1 thing in security is about fixing bugs. Yes, you want to be conservative so that bugs you haven't fixed yet don't cause problems, but that "be conservative" absolutely *has* to be priority #2. Because no amount of conservatism is going to help you unless you actively fix the bugs. And the thing is, a locked-up machine is really really hard to debug. I can tell you from experience just what the bug reports will look like, but I think you can guess. So locking up is one of the worst things you can do. You are pretty much always much better off making forward progress and try to report the issue. Including for security reasons. Because you'd rather have a security issue today that you can debug, and fix for tomorrow. A dead machine with a bug you can't debug is not actually suddenly a really secure machine. Because somebody will just figure out a way to take advantage of *that* bug too indirectly. Being able to kill machines at your leisure is a great way to stress other parts of your network, and suddenly that dead machine might be a big security hazard when somebody figures out a weakness elsewhere ("Oh, look, there's a bug in the fail-over that allows me to do X"). Linus
On Mon 19-08-24 23:57:29, David Hildenbrand wrote: > On 19.08.24 22:35, Linus Torvalds wrote: > > On Mon, 19 Aug 2024 at 13:24, David Hildenbrand <david@redhat.com> wrote: > > > > > > Right, "warn + loop forever" is one alternative where you could at least > > > keep the system alive to some degree. > > > > Maybe. Or it might just lock up the machine. > > Yes, regarding the security concerns it might be a bit better that way. (no > security expert, so I cannot judge ...) Would it make sense to simply enforce and oops? We do expect that an incorrect usage would trigger one but we have no guarantee because the actual user could be array = kvmalloc(unsupported_size_provided_from_userspace, GFP_NOFAIL) which might be actually a valid usecase because that the normaly requested size is large, yet reasonable. A lack of user input checks is just a sad reality we have to live with. While those bugs absolutely _need_ to be fixed it is better to not just allow them to array[large_index] = payload and make them easier to exploit the kernel. Sure you will get a warning but your machine has been compromised. BUG_ON will shoot the whole machine down which I do understand is just too drastic of a measure. Making the allocation loop for ever with cond_resched() or a short sleep is slightly better because it contains the bad user but the process context is still not killabale that way which is a problem on its own. I am not aware of OOPS_ON that would kill the calling user process. Yes, this could still be leaving locks behind but still better than a compromised system. WDYT?
diff --git a/include/linux/slab.h b/include/linux/slab.h index c9cb42203183..4a4d1fdc2afe 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node) { size_t bytes; - if (unlikely(check_mul_overflow(n, size, &bytes))) + if (unlikely(check_mul_overflow(n, size, &bytes))) { + BUG_ON(flags & __GFP_NOFAIL); return NULL; + } return kvmalloc_node_noprof(bytes, flags, node); } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 60742d057b05..d2c37f8f8d09 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4668,8 +4668,10 @@ struct page *__alloc_pages_noprof(gfp_t gfp, unsigned int order, * There are several places where we assume that the order value is sane * so bail out early if the request is out of bound. */ - if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) + if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp)) { + BUG_ON(gfp & __GFP_NOFAIL); return NULL; + } gfp &= gfp_allowed_mask; /* diff --git a/mm/util.c b/mm/util.c index ac01925a4179..678c647b778f 100644 --- a/mm/util.c +++ b/mm/util.c @@ -667,6 +667,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node) /* Don't even allow crazy sizes */ if (unlikely(size > INT_MAX)) { + BUG_ON(flags & __GFP_NOFAIL); WARN_ON_ONCE(!(flags & __GFP_NOWARN)); return NULL; }