diff mbox series

mm/page_alloc: avoid high-order page allocation warn with __GFP_NOFAIL

Message ID 20230305053035.1911-1-hsiangkao@linux.alibaba.com (mailing list archive)
State New
Headers show
Series mm/page_alloc: avoid high-order page allocation warn with __GFP_NOFAIL | expand

Commit Message

Gao Xiang March 5, 2023, 5:30 a.m. UTC
My knowledge of this is somewhat limited, however, since vmalloc already
supported __GFP_NOFAIL in commit 9376130c390a ("mm/vmalloc: add
support for __GFP_NOFAIL").  __GFP_NOFAIL could trigger the following
stack and allocate high-order pages when CONFIG_HAVE_ARCH_HUGE_VMALLOC
is enabled:

 __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5549
 alloc_pages+0x1aa/0x270 mm/mempolicy.c:2286
 vm_area_alloc_pages mm/vmalloc.c:2989 [inline]
 __vmalloc_area_node mm/vmalloc.c:3057 [inline]
 __vmalloc_node_range+0x978/0x13c0 mm/vmalloc.c:3227
 kvmalloc_node+0x156/0x1a0 mm/util.c:606
 kvmalloc include/linux/slab.h:737 [inline]
 kvmalloc_array include/linux/slab.h:755 [inline]
 kvcalloc include/linux/slab.h:760 [inline]
 (codebase: Linux 6.2-rc2)

Don't warn such cases since high-order pages with __GFP_NOFAIL is
somewhat legel.

Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
---
 mm/page_alloc.c | 6 ------
 1 file changed, 6 deletions(-)

Comments

Michal Hocko March 6, 2023, 7:51 a.m. UTC | #1
[Cc couple of more people recently involved with vmalloc code]

On Sun 05-03-23 13:30:35, Gao Xiang wrote:
> My knowledge of this is somewhat limited, however, since vmalloc already
> supported __GFP_NOFAIL in commit 9376130c390a ("mm/vmalloc: add
> support for __GFP_NOFAIL").  __GFP_NOFAIL could trigger the following
> stack and allocate high-order pages when CONFIG_HAVE_ARCH_HUGE_VMALLOC
> is enabled:
> 
>  __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5549
>  alloc_pages+0x1aa/0x270 mm/mempolicy.c:2286
>  vm_area_alloc_pages mm/vmalloc.c:2989 [inline]
>
>  __vmalloc_area_node mm/vmalloc.c:3057 [inline]
>  __vmalloc_node_range+0x978/0x13c0 mm/vmalloc.c:3227
>  kvmalloc_node+0x156/0x1a0 mm/util.c:606
>  kvmalloc include/linux/slab.h:737 [inline]
>  kvmalloc_array include/linux/slab.h:755 [inline]
>  kvcalloc include/linux/slab.h:760 [inline]
>  (codebase: Linux 6.2-rc2)
> 
> Don't warn such cases since high-order pages with __GFP_NOFAIL is
> somewhat legel.

OK, this is definitely a bug and it seems my 9376130c390a was
incomplete because it hasn't covered the high order case. Not sure how
that happened but removing the warning is not the right thing to do
here. The higher order allocation is an optimization rather than a must.
So it is perfectly fine to fail that allocation and retry rather than
go into a very expensive and potentially impossible higher order
allocation that must not fail.

The proper fix should look like this unless I am missing something. I
would appreciate another pair of eyes on this because I am not fully
familiar with the high order optimization part much.

Thanks!
--- 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ef910bf349e1..a8aa2765618a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2883,6 +2883,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		unsigned int order, unsigned int nr_pages, struct page **pages)
 {
 	unsigned int nr_allocated = 0;
+	gfp_t alloc_gfp = gfp;
+	bool nofail = false;
 	struct page *page;
 	int i;
 
@@ -2931,20 +2933,30 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			if (nr != nr_pages_request)
 				break;
 		}
+	} else {
+		alloc_gfp &= ~__GFP_NOFAIL;
+		nofail = true;
 	}
 
 	/* High-order pages or fallback path if "bulk" fails. */
-
 	while (nr_allocated < nr_pages) {
 		if (fatal_signal_pending(current))
 			break;
 
 		if (nid == NUMA_NO_NODE)
-			page = alloc_pages(gfp, order);
+			page = alloc_pages(alloc_gfp, order);
 		else
-			page = alloc_pages_node(nid, gfp, order);
-		if (unlikely(!page))
-			break;
+			page = alloc_pages_node(nid, alloc_gfp, order);
+		if (unlikely(!page)) {
+			if (!nofail)
+				break;
+
+			/* fall back to the zero order allocations */
+			alloc_gfp |= __GFP_NOFAIL;
+			order = 0;
+			continue;
+		}
+
 		/*
 		 * Higher order allocations must be able to be treated as
 		 * indepdenent small pages by callers (as they can with
Gao Xiang March 6, 2023, 8:03 a.m. UTC | #2
On 2023/3/6 15:51, Michal Hocko wrote:
> [Cc couple of more people recently involved with vmalloc code]
> 
> On Sun 05-03-23 13:30:35, Gao Xiang wrote:
>> My knowledge of this is somewhat limited, however, since vmalloc already
>> supported __GFP_NOFAIL in commit 9376130c390a ("mm/vmalloc: add
>> support for __GFP_NOFAIL").  __GFP_NOFAIL could trigger the following
>> stack and allocate high-order pages when CONFIG_HAVE_ARCH_HUGE_VMALLOC
>> is enabled:
>>
>>   __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5549
>>   alloc_pages+0x1aa/0x270 mm/mempolicy.c:2286
>>   vm_area_alloc_pages mm/vmalloc.c:2989 [inline]
>>
>>   __vmalloc_area_node mm/vmalloc.c:3057 [inline]
>>   __vmalloc_node_range+0x978/0x13c0 mm/vmalloc.c:3227
>>   kvmalloc_node+0x156/0x1a0 mm/util.c:606
>>   kvmalloc include/linux/slab.h:737 [inline]
>>   kvmalloc_array include/linux/slab.h:755 [inline]
>>   kvcalloc include/linux/slab.h:760 [inline]
>>   (codebase: Linux 6.2-rc2)
>>
>> Don't warn such cases since high-order pages with __GFP_NOFAIL is
>> somewhat legel.
> 
> OK, this is definitely a bug and it seems my 9376130c390a was
> incomplete because it hasn't covered the high order case. Not sure how
> that happened but removing the warning is not the right thing to do
> here. The higher order allocation is an optimization rather than a must.
> So it is perfectly fine to fail that allocation and retry rather than
> go into a very expensive and potentially impossible higher order
> allocation that must not fail.
> 
> The proper fix should look like this unless I am missing something. I
> would appreciate another pair of eyes on this because I am not fully
> familiar with the high order optimization part much.

I'm fine with the fix. Although I'm not familiar with such vmalloc
allocation, I thought about this possibility as well.

The original issue was:
https://lore.kernel.org/r/0000000000007796bd05f1852ec2@google.com

which I used kvcalloc with __GFP_NOFAIL but it warned, and I made
a fix (which now seems wrong) to use kcalloc() but it now warns
the same:
https://lore.kernel.org/r/00000000000072eb6505f376dd4b@google.com

And I then realized it's a bug in kvmalloc() with __GFP_NOFAIL...

Thanks,
Gao Xiang

> 
> Thanks!
> ---
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ef910bf349e1..a8aa2765618a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2883,6 +2883,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>   		unsigned int order, unsigned int nr_pages, struct page **pages)
>   {
>   	unsigned int nr_allocated = 0;
> +	gfp_t alloc_gfp = gfp;
> +	bool nofail = false;
>   	struct page *page;
>   	int i;
>   
> @@ -2931,20 +2933,30 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>   			if (nr != nr_pages_request)
>   				break;
>   		}
> +	} else {
> +		alloc_gfp &= ~__GFP_NOFAIL;
> +		nofail = true;
>   	}
>   
>   	/* High-order pages or fallback path if "bulk" fails. */
> -
>   	while (nr_allocated < nr_pages) {
>   		if (fatal_signal_pending(current))
>   			break;
>   
>   		if (nid == NUMA_NO_NODE)
> -			page = alloc_pages(gfp, order);
> +			page = alloc_pages(alloc_gfp, order);
>   		else
> -			page = alloc_pages_node(nid, gfp, order);
> -		if (unlikely(!page))
> -			break;
> +			page = alloc_pages_node(nid, alloc_gfp, order);
> +		if (unlikely(!page)) {
> +			if (!nofail)
> +				break;
> +
> +			/* fall back to the zero order allocations */
> +			alloc_gfp |= __GFP_NOFAIL;
> +			order = 0;
> +			continue;
> +		}
> +
>   		/*
>   		 * Higher order allocations must be able to be treated as
>   		 * indepdenent small pages by callers (as they can with
Uladzislau Rezki March 6, 2023, 12:14 p.m. UTC | #3
On Mon, Mar 06, 2023 at 08:51:40AM +0100, Michal Hocko wrote:
> [Cc couple of more people recently involved with vmalloc code]
> 
> On Sun 05-03-23 13:30:35, Gao Xiang wrote:
> > My knowledge of this is somewhat limited, however, since vmalloc already
> > supported __GFP_NOFAIL in commit 9376130c390a ("mm/vmalloc: add
> > support for __GFP_NOFAIL").  __GFP_NOFAIL could trigger the following
> > stack and allocate high-order pages when CONFIG_HAVE_ARCH_HUGE_VMALLOC
> > is enabled:
> > 
> >  __alloc_pages+0x1cb/0x5b0 mm/page_alloc.c:5549
> >  alloc_pages+0x1aa/0x270 mm/mempolicy.c:2286
> >  vm_area_alloc_pages mm/vmalloc.c:2989 [inline]
> >
> >  __vmalloc_area_node mm/vmalloc.c:3057 [inline]
> >  __vmalloc_node_range+0x978/0x13c0 mm/vmalloc.c:3227
> >  kvmalloc_node+0x156/0x1a0 mm/util.c:606
> >  kvmalloc include/linux/slab.h:737 [inline]
> >  kvmalloc_array include/linux/slab.h:755 [inline]
> >  kvcalloc include/linux/slab.h:760 [inline]
> >  (codebase: Linux 6.2-rc2)
> > 
> > Don't warn such cases since high-order pages with __GFP_NOFAIL is
> > somewhat legel.
> 
> OK, this is definitely a bug and it seems my 9376130c390a was
> incomplete because it hasn't covered the high order case. Not sure how
> that happened but removing the warning is not the right thing to do
> here. The higher order allocation is an optimization rather than a must.
> So it is perfectly fine to fail that allocation and retry rather than
> go into a very expensive and potentially impossible higher order
> allocation that must not fail.
>
> 
> The proper fix should look like this unless I am missing something. I
> would appreciate another pair of eyes on this because I am not fully
> familiar with the high order optimization part much.
> 
> Thanks!
> --- 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ef910bf349e1..a8aa2765618a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2883,6 +2883,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		unsigned int order, unsigned int nr_pages, struct page **pages)
>  {
>  	unsigned int nr_allocated = 0;
> +	gfp_t alloc_gfp = gfp;
> +	bool nofail = false;
>  	struct page *page;
>  	int i;
>  
> @@ -2931,20 +2933,30 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  			if (nr != nr_pages_request)
>  				break;
>  		}
> +	} else {
> +		alloc_gfp &= ~__GFP_NOFAIL;
> +		nofail = true;
>  	}
>  
>  	/* High-order pages or fallback path if "bulk" fails. */
> -
>  	while (nr_allocated < nr_pages) {
>  		if (fatal_signal_pending(current))
>  			break;
>  
>  		if (nid == NUMA_NO_NODE)
> -			page = alloc_pages(gfp, order);
> +			page = alloc_pages(alloc_gfp, order);
>  		else
> -			page = alloc_pages_node(nid, gfp, order);
> -		if (unlikely(!page))
> -			break;
> +			page = alloc_pages_node(nid, alloc_gfp, order);
> +		if (unlikely(!page)) {
> +			if (!nofail)
> +				break;
> +
> +			/* fall back to the zero order allocations */
> +			alloc_gfp |= __GFP_NOFAIL;
> +			order = 0;
> +			continue;
> +		}
> +
>  		/*
>  		 * Higher order allocations must be able to be treated as
>  		 * indepdenent small pages by callers (as they can with

Some questions:

1. Could you please add a comment why you want the bulk_gfp without the __GFP_NOFAIL(bulk path)?
2. Could you please add a comment why a high order pages do not want __GFP_NOFAIL? You have already explained.
3. Looking at the patch:

<snip>
+       } else {
+               alloc_gfp &= ~__GFP_NOFAIL;
+               nofail = true;
<snip>

if user does not want to go with __GFP_NOFAIL flag why you force it in
case a high order allocation fails and you switch to 0 order allocations? 
(for high order-pages scenario you always use __GFP_NOFAIL in the order-0 recovery path).

Thanks!

--
Uladzislau Rezki
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0745aedebb37..0618716c49df 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3822,12 +3822,6 @@  struct page *rmqueue(struct zone *preferred_zone,
 {
 	struct page *page;
 
-	/*
-	 * We most definitely don't want callers attempting to
-	 * allocate greater than order-1 page units with __GFP_NOFAIL.
-	 */
-	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
-
 	if (likely(pcp_allowed_order(order))) {
 		/*
 		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and