Message ID | 20240724085544.299090-4-21cnbao@gmail.com (mailing list archive) |
---|---|
State | New |
Series | mm: clarify nofail memory allocation |
On 7/24/24 10:55 AM, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> We have cases we still fail though callers might have __GFP_NOFAIL.
> Since they don't check the return, we are exposed to the security
> risks for NULL dereference.
>
> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> situation.
>
> Christoph Hellwig:
> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> allocation failures. So in fact a straight BUG is the right thing
> here.
>
> Vlastimil Babka:
> It's just not a recoverable situation (WARN_ON is for recoverable
> situations). The caller cannot handle allocation failure and at the same
> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> with stacktrace etc. We don't need to hope for the later NULL pointer
> dereference (which might if really unlucky happen from a different
> context where it's no longer obvious what led to the allocation failing).

Note that quote was meant specifically for the "too large" allocation, which
is truly impossible. That includes the kvmalloc_array() overflow, order >
MAX_ORDER etc.

The "can't sleep/reclaim" is a bit more nuanced as there's the alternative
in just warning and looping and hoping kswapd or some other direct reclaimer
saves the day. If yes, great, we have a system that still works and a
warning to report. If no, there's still a warning, but later soft/hardlockup
hits. These might be eventually worse than an immediate BUG_ON so it's not a
clear cut. At least I think these cases should be handled in two different
patches and not together.

> Michal Hocko:
> Linus tends to be against adding new BUG() calls unless the failure is
> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> not sure how he would look at simply incorrect memory allocator usage to
> blow up the kernel. Now the argument could be made that those failures
> could cause subtle memory corruptions or even be exploitable which might
> be a sufficient reason to stop them early.
>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Roman Gushchin <roman.gushchin@linux.dev>
> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <kees@kernel.org>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  include/linux/slab.h |  4 +++-
>  mm/page_alloc.c      | 10 +++++-----
>  mm/util.c            |  1 +
>  3 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index c9cb42203183..4a4d1fdc2afe 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
>  {
>  	size_t bytes;
>
> -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> +		BUG_ON(flags & __GFP_NOFAIL);
>  		return NULL;
> +	}
>
>  	return kvmalloc_node_noprof(bytes, flags, node);
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 45d2f41b4783..4d6af00fccd4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4435,11 +4435,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	if (gfp_mask & __GFP_NOFAIL) {
>  		/*
> -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> -		 * of any new users that actually require GFP_NOWAIT
> +		 * All existing users of the __GFP_NOFAIL are blockable
> +		 * otherwise we introduce a busy loop with inside the page
> +		 * allocator from non-sleepable contexts
>  		 */
> -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -			goto fail;
> +		BUG_ON(!can_direct_reclaim);
>
>  		/*
>  		 * PF_MEMALLOC request from this context is rather bizarre
> @@ -4470,7 +4470,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		cond_resched();
>  		goto retry;
>  	}
> -fail:
> +
>  	warn_alloc(gfp_mask, ac->nodemask,
>  			"page allocation failure: order:%u", order);
>  got_pg:
> diff --git a/mm/util.c b/mm/util.c
> index 0ff5898cc6de..a1be50c243f1 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -668,6 +668,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>  	/* Don't even allow crazy sizes */
>  	if (unlikely(size > INT_MAX)) {
>  		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> +		BUG_ON(flags & __GFP_NOFAIL);
>  		return NULL;
>  	}
>
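The "too large" class of requests described in the reply above can be illustrated with a small userspace sketch. This is not kernel code: `classify_array_request()` and the `nofail_request` enum are hypothetical names invented for this example, and `__builtin_mul_overflow()` is the GCC/Clang builtin, used here on the assumption that it reports overflow the same way the kernel's `check_mul_overflow()` does (non-zero on overflow).

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical classification, for illustration only. */
enum nofail_request {
	REQ_OK,		/* size is sane, the allocator may proceed */
	REQ_IMPOSSIBLE,	/* overflow or absurd size: can never succeed */
};

/*
 * Userspace model of the two size checks touched by the patch:
 * the n * size overflow check from kvmalloc_array_node_noprof() and
 * the "crazy sizes" (size > INT_MAX) check from mm/util.c.
 */
static enum nofail_request classify_array_request(size_t n, size_t size)
{
	size_t bytes;

	/* Returns true when n * size does not fit in size_t. */
	if (__builtin_mul_overflow(n, size, &bytes))
		return REQ_IMPOSSIBLE;

	/* No amount of waiting or reclaim makes this request satisfiable. */
	if (bytes > (size_t)INT_MAX)
		return REQ_IMPOSSIBLE;

	return REQ_OK;
}
```

A request classified as `REQ_IMPOSSIBLE` stays impossible regardless of how long the caller waits, which is what separates it from the "can't sleep/reclaim" case, where another context might still free memory.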
On Wed, Jul 24, 2024 at 10:03 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/24/24 10:55 AM, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > We have cases we still fail though callers might have __GFP_NOFAIL.
> > Since they don't check the return, we are exposed to the security
> > risks for NULL dereference.
> >
> > Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> > situation.
> >
> > Christoph Hellwig:
> > The whole freaking point of __GFP_NOFAIL is that callers don't handle
> > allocation failures. So in fact a straight BUG is the right thing
> > here.
> >
> > Vlastimil Babka:
> > It's just not a recoverable situation (WARN_ON is for recoverable
> > situations). The caller cannot handle allocation failure and at the same
> > time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> > with stacktrace etc. We don't need to hope for the later NULL pointer
> > dereference (which might if really unlucky happen from a different
> > context where it's no longer obvious what led to the allocation failing).
>
> Note that quote was meant specifically for the "too large" allocation, which
> is truly impossible. That includes the kvmalloc_array() overflow, order >
> MAX_ORDER etc.

I applied this quote to both cases because the non-sleepable case also
returns NULL, which means both currently face the same problem.

>
> The "can't sleep/reclaim" is a bit more nuanced as there's the alternative
> in just warning and looping and hoping kswapd or some other direct reclaimer
> saves the day. If yes, great, we have a system that still works and a
> warning to report. If no, there's still a warning, but later soft/hardlockup
> hits. These might be eventually worse than an immediate BUG_ON so it's not a
> clear cut. At least I think these cases should be handled in two different
> patches and not together.

But I fully agree that these two cases can be separated and judged
separately. After more thought, I am concerned that this situation might be
difficult to rescue, as the misuse of GFP_ATOMIC | __GFP_NOFAIL typically
occurs in atomic contexts with strict time requirements. Even if some other
component releases memory to satisfy the caller that is busy-looping to
obtain it, it might already be too late?

>
> > Michal Hocko:
> > Linus tends to be against adding new BUG() calls unless the failure is
> > absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> > not sure how he would look at simply incorrect memory allocator usage to
> > blow up the kernel. Now the argument could be made that those failures
> > could cause subtle memory corruptions or even be exploitable which might
> > be a sufficient reason to stop them early.
> >
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Lorenzo Stoakes <lstoakes@gmail.com>
> > Cc: Christoph Lameter <cl@linux.com>
> > Cc: Pekka Enberg <penberg@kernel.org>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Kees Cook <kees@kernel.org>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  include/linux/slab.h |  4 +++-
> >  mm/page_alloc.c      | 10 +++++-----
> >  mm/util.c            |  1 +
> >  3 files changed, 9 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index c9cb42203183..4a4d1fdc2afe 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
> >  {
> >  	size_t bytes;
> >
> > -	if (unlikely(check_mul_overflow(n, size, &bytes)))
> > +	if (unlikely(check_mul_overflow(n, size, &bytes))) {
> > +		BUG_ON(flags & __GFP_NOFAIL);
> >  		return NULL;
> > +	}
> >
> >  	return kvmalloc_node_noprof(bytes, flags, node);
> >  }
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 45d2f41b4783..4d6af00fccd4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4435,11 +4435,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	 */
> >  	if (gfp_mask & __GFP_NOFAIL) {
> >  		/*
> > -		 * All existing users of the __GFP_NOFAIL are blockable, so warn
> > -		 * of any new users that actually require GFP_NOWAIT
> > +		 * All existing users of the __GFP_NOFAIL are blockable
> > +		 * otherwise we introduce a busy loop with inside the page
> > +		 * allocator from non-sleepable contexts
> >  		 */
> > -		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > -			goto fail;
> > +		BUG_ON(!can_direct_reclaim);
> >
> >  		/*
> >  		 * PF_MEMALLOC request from this context is rather bizarre
> > @@ -4470,7 +4470,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		cond_resched();
> >  		goto retry;
> >  	}
> > -fail:
> > +
> >  	warn_alloc(gfp_mask, ac->nodemask,
> >  			"page allocation failure: order:%u", order);
> >  got_pg:
> > diff --git a/mm/util.c b/mm/util.c
> > index 0ff5898cc6de..a1be50c243f1 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -668,6 +668,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
> >  	/* Don't even allow crazy sizes */
> >  	if (unlikely(size > INT_MAX)) {
> >  		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> > +		BUG_ON(flags & __GFP_NOFAIL);
> >  		return NULL;
> >  	}
> >
>
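The behavioural change to the __GFP_NOFAIL branch of __alloc_pages_slowpath() that this exchange is about can be summarised as a small decision table. The sketch below is a deliberately simplified userspace model, not the allocator: the enum and the two function names are hypothetical, and everything the real slowpath does between the check and the retry is left out.

```c
#include <stdbool.h>

/* Hypothetical outcomes for this model only. */
enum slowpath_outcome {
	OUTCOME_FAIL,		/* ordinary failure path: warn_alloc(), return NULL */
	OUTCOME_WARN_FAIL,	/* WARN_ON_ONCE_GFP() then "goto fail": NULL to a nofail caller */
	OUTCOME_BUG,		/* BUG_ON(): immediate oops */
	OUTCOME_RETRY,		/* cond_resched() and "goto retry" */
};

/* Model of the branch before the patch. */
static enum slowpath_outcome nofail_before_patch(bool nofail,
						 bool can_direct_reclaim)
{
	if (!nofail)
		return OUTCOME_FAIL;
	if (!can_direct_reclaim)
		return OUTCOME_WARN_FAIL;
	return OUTCOME_RETRY;
}

/* Model of the branch with the patch applied. */
static enum slowpath_outcome nofail_after_patch(bool nofail,
						bool can_direct_reclaim)
{
	if (!nofail)
		return OUTCOME_FAIL;
	if (!can_direct_reclaim)
		return OUTCOME_BUG;
	return OUTCOME_RETRY;
}
```

Only the (nofail, cannot reclaim) row differs between the two functions. The third option raised in the thread, warning and looping in the hope that kswapd or another reclaimer frees memory, would map that row to `OUTCOME_RETRY` instead.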
On Wed 24-07-24 20:55:42, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> We have cases we still fail though callers might have __GFP_NOFAIL.
> Since they don't check the return, we are exposed to the security
> risks for NULL dereference.
>
> Though BUG_ON() is not encouraged by Linus, this is an unrecoverable
> situation.
>
> Christoph Hellwig:
> The whole freaking point of __GFP_NOFAIL is that callers don't handle
> allocation failures. So in fact a straight BUG is the right thing
> here.
>
> Vlastimil Babka:
> It's just not a recoverable situation (WARN_ON is for recoverable
> situations). The caller cannot handle allocation failure and at the same
> time asked for an impossible allocation. BUG_ON() is a guaranteed oops
> with stacktrace etc. We don't need to hope for the later NULL pointer
> dereference (which might if really unlucky happen from a different
> context where it's no longer obvious what led to the allocation failing).
>
> Michal Hocko:
> Linus tends to be against adding new BUG() calls unless the failure is
> absolutely unrecoverable (e.g. corrupted data structures etc.). I am
> not sure how he would look at simply incorrect memory allocator usage to
> blow up the kernel. Now the argument could be made that those failures
> could cause subtle memory corruptions or even be exploitable which might
> be a sufficient reason to stop them early.

I think it is worth adding that size checks are not really actionable
because they either cause unexpected failure or BUG_ON. It is not too much
of a stretch to expect some of the user triggerable codepaths could hit
this - e.g. when input is not checked properly. Silent failure is then a
potential security risk.

The page allocator, on the other hand, can choose to keep retrying even if
that means that there is no reclaim going on and essentially cause a busy
loop in the kernel space. That would eventually cause the soft/hard lockup
detector to fire (if an architecture offers a reliable one). So essentially
there is a choice between two bad solutions and you have chosen one that
reliably bugs on rather than relying on something external to intervene.
The reasoning for that should be mentioned in the changelog.

[...]
> diff --git a/mm/util.c b/mm/util.c
> index 0ff5898cc6de..a1be50c243f1 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -668,6 +668,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
>  	/* Don't even allow crazy sizes */
>  	if (unlikely(size > INT_MAX)) {
>  		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
> +		BUG_ON(flags & __GFP_NOFAIL);

I guess you want to switch the ordering. WARNING on top of BUG_ON seems
rather pointless IMHO.

>  		return NULL;
>  	}
>
> --
> 2.34.1
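The reordering suggested in the review comment above can be sketched as a pure decision function. This is a userspace model under stated assumptions, not a proposed replacement hunk: `MODEL_GFP_NOWARN`, `MODEL_GFP_NOFAIL`, `check_size()` and the result enum are all hypothetical names, and the flag bit values are arbitrary rather than the real gfp_t encoding.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical flag bits for this model; the real gfp_t values differ. */
#define MODEL_GFP_NOWARN	(1u << 0)
#define MODEL_GFP_NOFAIL	(1u << 1)

enum size_check_result {
	SIZE_OK,		/* size accepted */
	SIZE_FAIL_SILENT,	/* return NULL, caller asked for no warning */
	SIZE_FAIL_WARN,		/* warn once, then return NULL */
	SIZE_BUG,		/* nofail caller passed an impossible size */
};

static enum size_check_result check_size(size_t size, unsigned int flags)
{
	if (size <= (size_t)INT_MAX)
		return SIZE_OK;

	/* Test the fatal case first so no warning precedes the oops. */
	if (flags & MODEL_GFP_NOFAIL)
		return SIZE_BUG;

	if (flags & MODEL_GFP_NOWARN)
		return SIZE_FAIL_SILENT;

	return SIZE_FAIL_WARN;
}
```

With the fatal check first, a nofail caller gets a single report (the oops) rather than a warning immediately followed by a BUG, while callers without the nofail flag keep the existing warn-or-silent behaviour.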
diff --git a/include/linux/slab.h b/include/linux/slab.h
index c9cb42203183..4a4d1fdc2afe 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -827,8 +827,10 @@ kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
 {
 	size_t bytes;
 
-	if (unlikely(check_mul_overflow(n, size, &bytes)))
+	if (unlikely(check_mul_overflow(n, size, &bytes))) {
+		BUG_ON(flags & __GFP_NOFAIL);
 		return NULL;
+	}
 
 	return kvmalloc_node_noprof(bytes, flags, node);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 45d2f41b4783..4d6af00fccd4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4435,11 +4435,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (gfp_mask & __GFP_NOFAIL) {
 		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually require GFP_NOWAIT
+		 * All existing users of the __GFP_NOFAIL are blockable
+		 * otherwise we introduce a busy loop with inside the page
+		 * allocator from non-sleepable contexts
 		 */
-		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
-			goto fail;
+		BUG_ON(!can_direct_reclaim);
 
 		/*
 		 * PF_MEMALLOC request from this context is rather bizarre
@@ -4470,7 +4470,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		cond_resched();
 		goto retry;
 	}
-fail:
+
 	warn_alloc(gfp_mask, ac->nodemask,
 			"page allocation failure: order:%u", order);
 got_pg:
diff --git a/mm/util.c b/mm/util.c
index 0ff5898cc6de..a1be50c243f1 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -668,6 +668,7 @@ void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
 	/* Don't even allow crazy sizes */
 	if (unlikely(size > INT_MAX)) {
 		WARN_ON_ONCE(!(flags & __GFP_NOWARN));
+		BUG_ON(flags & __GFP_NOFAIL);
 		return NULL;
 	}
 