
[1/2] mm: page_alloc: speed up fallbacks in rmqueue_bulk()

Message ID 20250407180154.63348-1-hannes@cmpxchg.org (mailing list archive)
State New
Series [1/2] mm: page_alloc: speed up fallbacks in rmqueue_bulk()

Commit Message

Johannes Weiner April 7, 2025, 6:01 p.m. UTC
The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal
single pages from biggest buddy") as the root cause of a 56.4%
regression in vm-scalability::lru-file-mmap-read.

Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix
freelist movement during block conversion"), as the root cause for a
regression in worst-case zone->lock+irqoff hold times.

Both of these patches modify the page allocator's fallback path to be
less greedy in an effort to stave off fragmentation. The flip side of
this is that fallbacks are also less productive each time around,
which means the fallback search can run much more frequently.

Carlos' traces point to rmqueue_bulk() specifically, which tries to
refill the percpu cache by allocating a large batch of pages in a
loop. It highlights how once the native freelists are exhausted, the
fallback code first scans orders top-down for whole blocks to claim,
then falls back to a bottom-up search for the smallest buddy to steal.
For the next batch page, it goes through the same thing again.

This can be made more efficient. Since rmqueue_bulk() holds the
zone->lock over the entire batch, the freelists are not subject to
outside changes; when the search for a block to claim has already
failed, there is no point in trying again for the next page.

Modify __rmqueue() to remember the last successful fallback mode, and
restart directly from there on the next rmqueue_bulk() iteration.
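
In outline, the new control flow looks like this (a condensed sketch of the
diff in the Patch section at the bottom of this page, with comments added for
this summary; see the full diff for the actual logic):

  enum rmqueue_mode {
  	RMQUEUE_NORMAL,	/* __rmqueue_smallest() on the requested type */
  	RMQUEUE_CMA,	/* __rmqueue_cma_fallback() */
  	RMQUEUE_CLAIM,	/* __rmqueue_claim() a whole foreign block */
  	RMQUEUE_STEAL,	/* __rmqueue_steal() a single page */
  };

  /*
   * __rmqueue() takes an extra 'enum rmqueue_mode *mode' argument and
   * starts its switch() at the cached stage, falling through to the more
   * expensive stages only as needed. rmqueue_bulk() keeps one mode
   * variable across the whole batch, so a claim/steal search that has
   * already failed under this zone->lock hold is not repeated for every
   * subsequent page.
   */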

Oliver confirms that this improves beyond the regression that the test
robot reported against c2f6ea38fc1b:

commit:
  f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
  c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
  acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
  2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()")   <--- your patch

f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \          |                \
  25525364 ±  3%     -56.4%   11135467           -57.8%   10779336           +31.6%   33581409        vm-scalability.throughput

Carlos confirms that worst-case times are almost fully recovered
compared to before the earlier culprit patch:

  2dd482ba627d (before freelist hygiene):    1ms
  c0cd6f557b90  (after freelist hygiene):   90ms
 next-20250319    (steal smallest buddy):  280ms
    this patch                          :    8ms

Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Carlos Song <carlos.song@nxp.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy")
Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com
Cc: stable@vger.kernel.org	# 6.10+
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 100 +++++++++++++++++++++++++++++++++++-------------
 1 file changed, 74 insertions(+), 26 deletions(-)

Comments

Brendan Jackman April 8, 2025, 5:22 p.m. UTC | #1
On Mon Apr 7, 2025 at 6:01 PM UTC, Johannes Weiner wrote:
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2194,11 +2194,11 @@ try_to_claim_block(struct zone *zone, struct page *page,
>   * The use of signed ints for order and current_order is a deliberate
>   * deviation from the rest of this file, to make the for loop
>   * condition simpler.
> - *
> - * Return the stolen page, or NULL if none can be found.
>   */

This commentary is pretty confusing now, there's a block of text that
kinda vaguely applies to the aggregate of __rmqueue_steal(),
__rmqueue_fallback() and half of __rmqueue(). I think this new code does
a better job of speaking for itself so I think we should just delete
this block comment and replace it with some more verbosity elsewhere.

> +
> +/* Try to claim a whole foreign block, take a page, expand the remainder */

Also on the commentary front, I am not a fan of "foreign" and "native":

- "Foreign" is already used in this file to mean NUMA-nonlocal.

- We already have "start" and "fallback" being used in identifiers
  as adjectives to describe the migratetype concept.

  I wouldn't say those are _better_, "native" and "foreign" might be
  clearer, but it's not worth introducing inconsistency IMO.

>  static __always_inline struct page *
> -__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> +__rmqueue_claim(struct zone *zone, int order, int start_migratetype,
>  						unsigned int alloc_flags)
>  {
>  	struct free_area *area;

[pasting in more context that wasn't in the original diff..]
>	/*
>	 * Find the largest available free page in the other list. This roughly
>	 * approximates finding the pageblock with the most free pages, which
>	 * would be too costly to do exactly.
>	 */
>	for (current_order = MAX_PAGE_ORDER; current_order >= min_order;
>				--current_order) {

IIUC we could go one step further here and also avoid repeating this
iteration? Maybe something for a separate patch though?

Anyway, the approach seems like a clear improvement, thanks. I will need
to take a closer look at it tomorrow, I've run out of brain juice today.

Here's what I got from redistributing the block comment and flipping
the terminology:

diff --git i/mm/page_alloc.c w/mm/page_alloc.c
index dfb2b3f508af..b8142d605691 100644
--- i/mm/page_alloc.c
+++ w/mm/page_alloc.c
@@ -2183,21 +2183,13 @@ try_to_claim_block(struct zone *zone, struct page *page,
 }
 
 /*
- * Try finding a free buddy page on the fallback list.
- *
- * This will attempt to claim a whole pageblock for the requested type
- * to ensure grouping of such requests in the future.
- *
- * If a whole block cannot be claimed, steal an individual page, regressing to
- * __rmqueue_smallest() logic to at least break up as little contiguity as
- * possible.
+ * Try to allocate from some fallback migratetype by claiming the entire block,
+ * i.e. converting it to the allocation's start migratetype.
  *
  * The use of signed ints for order and current_order is a deliberate
  * deviation from the rest of this file, to make the for loop
  * condition simpler.
  */
-
-/* Try to claim a whole foreign block, take a page, expand the remainder */
 static __always_inline struct page *
 __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
                                                unsigned int alloc_flags)
@@ -2247,7 +2239,10 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
        return NULL;
 }
 
-/* Try to steal a single page from a foreign block */
+/*
+ * Try to steal a single page from some fallback migratetype. Leave the rest of
+ * the block as its current migratetype, potentially causing fragmentation.
+ */
 static __always_inline struct page *
 __rmqueue_steal(struct zone *zone, int order, int start_migratetype)
 {
@@ -2307,7 +2302,9 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
        }
 
        /*
-        * Try the different freelists, native then foreign.
+        * First try the freelists of the requested migratetype, then try
+        * fallbacks. Roughly, each fallback stage poses more of a fragmentation
+        * risk.
         *
         * The fallback logic is expensive and rmqueue_bulk() calls in
         * a loop with the zone->lock held, meaning the freelists are
@@ -2332,7 +2329,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
        case RMQUEUE_CLAIM:
                page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
                if (page) {
-                       /* Replenished native freelist, back to normal mode */
+                       /* Replenished requested migratetype's freelist, back to normal mode */
                        *mode = RMQUEUE_NORMAL;
                        return page;
                }
Johannes Weiner April 8, 2025, 6:50 p.m. UTC | #2
On Tue, Apr 08, 2025 at 05:22:00PM +0000, Brendan Jackman wrote:
> On Mon Apr 7, 2025 at 6:01 PM UTC, Johannes Weiner wrote:
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2194,11 +2194,11 @@ try_to_claim_block(struct zone *zone, struct page *page,
> >   * The use of signed ints for order and current_order is a deliberate
> >   * deviation from the rest of this file, to make the for loop
> >   * condition simpler.
> > - *
> > - * Return the stolen page, or NULL if none can be found.
> >   */
> 
> This commentary is pretty confusing now, there's a block of text that
> kinda vaguely applies to the aggregate of __rmqueue_steal(),
> __rmqueue_fallback() and half of __rmqueue(). I think this new code does
> a better job of speaking for itself so I think we should just delete
> this block comment and replace it with some more verbosity elsewhere.

I'm glad you think so, let's remove it then!

> > +/* Try to claim a whole foreign block, take a page, expand the remainder */
> 
> Also on the commentary front, I am not a fan of "foreign" and "native":
> 
> - "Foreign" is already used in this file to mean NUMA-nonlocal.
> 
> - We already have "start" and "fallback" being used in identifiers
>   as adjectives to describe the migratetype concept.
> 
>   I wouldn't say those are _better_, "native" and "foreign" might be
>   clearer, but it's not worth introducing inconsistency IMO.

That's a fair point, no objection to renaming them.

> >  static __always_inline struct page *
> > -__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
> > +__rmqueue_claim(struct zone *zone, int order, int start_migratetype,
> >  						unsigned int alloc_flags)
> >  {
> >  	struct free_area *area;
> 
> [pasting in more context that wasn't in the original diff..]
> >	/*
> >	 * Find the largest available free page in the other list. This roughly
> >	 * approximates finding the pageblock with the most free pages, which
> >	 * would be too costly to do exactly.
> >	 */
> >	for (current_order = MAX_PAGE_ORDER; current_order >= min_order;
> >				--current_order) {
> 
> IIUC we could go one step further here and also avoid repeating this
> iteration? Maybe something for a separate patch though?

That might be worth a test, but agree this should be a separate patch.

AFAICS, in the most common configurations MAX_PAGE_ORDER is only one
step above pageblock_order or even the same. It might not be worth the
complication.

> Anyway, the approach seems like a clear improvement, thanks. I will need
> to take a closer look at it tomorrow, I've run out of brain juice today.

Much appreciate you taking a look, thanks.

> Here's what I got from redistributing the block comment and flipping
> the terminology:
> 
> diff --git i/mm/page_alloc.c w/mm/page_alloc.c
> index dfb2b3f508af..b8142d605691 100644
> --- i/mm/page_alloc.c
> +++ w/mm/page_alloc.c
> @@ -2183,21 +2183,13 @@ try_to_claim_block(struct zone *zone, struct page *page,
>  }
>  
>  /*
> - * Try finding a free buddy page on the fallback list.
> - *
> - * This will attempt to claim a whole pageblock for the requested type
> - * to ensure grouping of such requests in the future.
> - *
> - * If a whole block cannot be claimed, steal an individual page, regressing to
> - * __rmqueue_smallest() logic to at least break up as little contiguity as
> - * possible.
> + * Try to allocate from some fallback migratetype by claiming the entire block,
> + * i.e. converting it to the allocation's start migratetype.
>   *
>   * The use of signed ints for order and current_order is a deliberate
>   * deviation from the rest of this file, to make the for loop
>   * condition simpler.
>   */
> -
> -/* Try to claim a whole foreign block, take a page, expand the remainder */
>  static __always_inline struct page *
>  __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
>                                                 unsigned int alloc_flags)
> @@ -2247,7 +2239,10 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
>         return NULL;
>  }
>  
> -/* Try to steal a single page from a foreign block */
> +/*
> + * Try to steal a single page from some fallback migratetype. Leave the rest of
> + * the block as its current migratetype, potentially causing fragmentation.
> + */
>  static __always_inline struct page *
>  __rmqueue_steal(struct zone *zone, int order, int start_migratetype)
>  {
> @@ -2307,7 +2302,9 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>         }
>  
>         /*
> -        * Try the different freelists, native then foreign.
> +        * First try the freelists of the requested migratetype, then try
> +        * fallbacks. Roughly, each fallback stage poses more of a fragmentation
> +        * risk.

How about "then try fallback modes with increasing levels of
fragmentation risk."

>          * The fallback logic is expensive and rmqueue_bulk() calls in
>          * a loop with the zone->lock held, meaning the freelists are
> @@ -2332,7 +2329,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>         case RMQUEUE_CLAIM:
>                 page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
>                 if (page) {
> -                       /* Replenished native freelist, back to normal mode */
> +                       /* Replenished requested migratetype's freelist, back to normal mode */
>                         *mode = RMQUEUE_NORMAL;

This line is kind of long now. How about:

			/* Replenished preferred freelist, back to normal mode */

But yeah, I like your proposed changes. Would you care to send a
proper patch?

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Yunsheng Lin April 9, 2025, 8:02 a.m. UTC | #3
On 2025/4/8 2:01, Johannes Weiner wrote:

...

>  
> @@ -2934,6 +2981,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  {
>  	struct page *page;
>  	unsigned long flags;
> +	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
>  
>  	do {
>  		page = NULL;
> @@ -2945,7 +2993,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  		if (alloc_flags & ALLOC_HIGHATOMIC)
>  			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
>  		if (!page) {
> -			page = __rmqueue(zone, order, migratetype, alloc_flags);
> +			page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm);
>  
>  			/*
>  			 * If the allocation fails, allow OOM handling and

It was not in the diff, but it seems the zone->lock is taken and released inside the
do..while loop body - doesn't that mean the freelists are subject to outside changes
between iterations, leaving rmqm stale?
Johannes Weiner April 9, 2025, 2 p.m. UTC | #4
On Wed, Apr 09, 2025 at 04:02:39PM +0800, Yunsheng Lin wrote:
> On 2025/4/8 2:01, Johannes Weiner wrote:
> > @@ -2934,6 +2981,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
> >  {
> >  	struct page *page;
> >  	unsigned long flags;
> > +	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
> >  
> >  	do {
> >  		page = NULL;
> > @@ -2945,7 +2993,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
> >  		if (alloc_flags & ALLOC_HIGHATOMIC)
> >  			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> >  		if (!page) {
> > -			page = __rmqueue(zone, order, migratetype, alloc_flags);
> > +			page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm);
> >  
> >  			/*
> >  			 * If the allocation fails, allow OOM handling and
> 
> It was not in the diff, but it seems the zone->lock is taken and released inside the
> do..while loop body - doesn't that mean the freelists are subject to outside changes
> between iterations, leaving rmqm stale?

Yes. Note that it only loops when there is a bug/corrupted page, so it
won't make much difference in practice. But it's still kind of weird.

Thanks for your review, Yunsheng!

Andrew, could you please fold the below fixlet?

---

From 71b1eea7ded41c1f674909f9755c23b9ee9bcb6a Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Wed, 9 Apr 2025 09:56:52 -0400
Subject: [PATCH] mm: page_alloc: speed up fallbacks in rmqueue_bulk() fix

reset rmqueue_mode in rmqueue_buddy() error loop, per Yunsheng Lin

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dfb2b3f508af..7ffeeb0f62d3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2983,7 +2983,6 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 {
 	struct page *page;
 	unsigned long flags;
-	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
 
 	do {
 		page = NULL;
@@ -2996,6 +2995,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
+			enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
+
 			page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm);
 
 			/*
Brendan Jackman April 9, 2025, 5:30 p.m. UTC | #5
On Tue Apr 8, 2025 at 6:50 PM UTC, Johannes Weiner wrote:
>> >	/*
>> >	 * Find the largest available free page in the other list. This roughly
>> >	 * approximates finding the pageblock with the most free pages, which
>> >	 * would be too costly to do exactly.
>> >	 */
>> >	for (current_order = MAX_PAGE_ORDER; current_order >= min_order;
>> >				--current_order) {
>> 
>> IIUC we could go one step further here and also avoid repeating this
>> iteration? Maybe something for a separate patch though?
>
> That might be worth a test, but agree this should be a separate patch.
>
> AFAICS, in the most common configurations MAX_PAGE_ORDER is only one
> step above pageblock_order or even the same. It might not be worth the
> complication.

Oh yeah, makes sense.

>>         /*
>> -        * Try the different freelists, native then foreign.
>> +        * First try the freelists of the requested migratetype, then try
>> +        * fallbacks. Roughly, each fallback stage poses more of a fragmentation
>> +        * risk.
>
> How about "then try fallback modes with increasing levels of
> fragmentation risk."

Yep, nice thanks.

>>          * The fallback logic is expensive and rmqueue_bulk() calls in
>>          * a loop with the zone->lock held, meaning the freelists are
>> @@ -2332,7 +2329,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>>         case RMQUEUE_CLAIM:
>>                 page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
>>                 if (page) {
>> -                       /* Replenished native freelist, back to normal mode */
>> +                       /* Replenished requested migratetype's freelist, back to normal mode */
>>                         *mode = RMQUEUE_NORMAL;
>
> This line is kind of long now. How about:
>
> 			/* Replenished preferred freelist, back to normal mode */

Yep, sounds good - though it's still 81 characters, and the rest of this
file sticks to 80 for comments; I guess I'll leave it to Andrew to decide
whether that's an issue?

> But yeah, I like your proposed changes. Would you care to send a
> proper patch?

Sure, pasting below. Andrew, could you fold this in? Also, I haven't
done this style of patch sending before, please let me know if I'm doing
something to make your life difficult.

> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Aside from the comment stuff fixed by the patch below:

Reviewed-by: Brendan Jackman <jackmanb@google.com>

---

From 8ff20dbb52770d082e182482d2b47e521de028d1 Mon Sep 17 00:00:00 2001
From: Brendan Jackman <jackmanb@google.com>
Date: Wed, 9 Apr 2025 17:22:14 +0000
Subject: [PATCH] page_alloc: speed up fallbacks in rmqueue_bulk() - comment updates

Tidy up some terminology and redistribute commentary.

Signed-off-by: Brendan Jackman <jackmanb@google.com>
---
 mm/page_alloc.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dfb2b3f508af4..220bd0bcc38c3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2183,21 +2183,13 @@ try_to_claim_block(struct zone *zone, struct page *page,
 }
  
 /*
- * Try finding a free buddy page on the fallback list.
- *
- * This will attempt to claim a whole pageblock for the requested type
- * to ensure grouping of such requests in the future.
- *
- * If a whole block cannot be claimed, steal an individual page, regressing to
- * __rmqueue_smallest() logic to at least break up as little contiguity as
- * possible.
+ * Try to allocate from some fallback migratetype by claiming the entire block,
+ * i.e. converting it to the allocation's start migratetype.
  *
  * The use of signed ints for order and current_order is a deliberate 
  * deviation from the rest of this file, to make the for loop
  * condition simpler.
  */
-
-/* Try to claim a whole foreign block, take a page, expand the remainder */
 static __always_inline struct page *
 __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
                                                unsigned int alloc_flags)
@@ -2247,7 +2239,10 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
        return NULL;
 }
  
-/* Try to steal a single page from a foreign block */
+/*
+ * Try to steal a single page from some fallback migratetype. Leave the rest of
+ * the block as its current migratetype, potentially causing fragmentation.
+ */
 static __always_inline struct page *
 __rmqueue_steal(struct zone *zone, int order, int start_migratetype)
 {
@@ -2307,7 +2302,8 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
        }
  
        /*
-        * Try the different freelists, native then foreign.
+        * First try the freelists of the requested migratetype, then try
+        * fallback modes with increasing levels of fragmentation risk.
         *
         * The fallback logic is expensive and rmqueue_bulk() calls in
         * a loop with the zone->lock held, meaning the freelists are
@@ -2332,7 +2328,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
        case RMQUEUE_CLAIM:
                page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
                if (page) {
-                       /* Replenished native freelist, back to normal mode */
+                       /* Replenished preferred freelist, back to normal mode. */
                        *mode = RMQUEUE_NORMAL;
                        return page;
                }

base-commit: aa42382db4e2a4ed1f4ba97ffc50e2ce45accb0c
Zi Yan April 10, 2025, 2:02 a.m. UTC | #6
On Mon Apr 7, 2025 at 2:01 PM EDT, Johannes Weiner wrote:
> The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal
> single pages from biggest buddy") as the root cause of a 56.4%
> regression in vm-scalability::lru-file-mmap-read.
>
> Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix
> freelist movement during block conversion"), as the root cause for a
> regression in worst-case zone->lock+irqoff hold times.
>
> Both of these patches modify the page allocator's fallback path to be
> less greedy in an effort to stave off fragmentation. The flip side of
> this is that fallbacks are also less productive each time around,
> which means the fallback search can run much more frequently.
>
> Carlos' traces point to rmqueue_bulk() specifically, which tries to
> refill the percpu cache by allocating a large batch of pages in a
> loop. It highlights how once the native freelists are exhausted, the
> fallback code first scans orders top-down for whole blocks to claim,
> then falls back to a bottom-up search for the smallest buddy to steal.
> For the next batch page, it goes through the same thing again.
>
> This can be made more efficient. Since rmqueue_bulk() holds the
> zone->lock over the entire batch, the freelists are not subject to
> outside changes; when the search for a block to claim has already
> failed, there is no point in trying again for the next page.
>
> Modify __rmqueue() to remember the last successful fallback mode, and
> restart directly from there on the next rmqueue_bulk() iteration.
>
> Oliver confirms that this improves beyond the regression that the test
> robot reported against c2f6ea38fc1b:
>
> commit:
>   f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
>   c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
>   acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
>   2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()")   <--- your patch
>
> f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \          |                \
>   25525364 ±  3%     -56.4%   11135467           -57.8%   10779336           +31.6%   33581409        vm-scalability.throughput
>
> Carlos confirms that worst-case times are almost fully recovered
> compared to before the earlier culprit patch:
>
>   2dd482ba627d (before freelist hygiene):    1ms
>   c0cd6f557b90  (after freelist hygiene):   90ms
>  next-20250319    (steal smallest buddy):  280ms
>     this patch                          :    8ms
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Reported-by: Carlos Song <carlos.song@nxp.com>
> Tested-by: kernel test robot <oliver.sang@intel.com>
> Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
> Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy")
> Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com
> Cc: stable@vger.kernel.org	# 6.10+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 100 +++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 74 insertions(+), 26 deletions(-)
>

It is a really nice cleanup. It improves my understanding of the rmqueue*()
and the whole flow a lot. Thank you for the patch.

Acked-by: Zi Yan <ziy@nvidia.com>
Carlos Song April 10, 2025, 7:03 a.m. UTC | #7
Hi,

That is a nice fix! I tested it on imx7d.

Tested-by: Carlos Song <carlos.song@nxp.com>


> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Tuesday, April 8, 2025 2:02 AM
> To: Andrew Morton <akpm@linux-foundation.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>; Brendan Jackman
> <jackmanb@google.com>; Mel Gorman <mgorman@techsingularity.net>;
> Carlos Song <carlos.song@nxp.com>; linux-mm@kvack.org;
> linux-kernel@vger.kernel.org; kernel test robot <oliver.sang@intel.com>;
> stable@vger.kernel.org
> Subject: [EXT] [PATCH 1/2] mm: page_alloc: speed up fallbacks in
> rmqueue_bulk()
>
> The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal single
> pages from biggest buddy") as the root cause of a 56.4% regression in
> vm-scalability::lru-file-mmap-read.
>
> Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix freelist
> movement during block conversion"), as the root cause for a regression in
> worst-case zone->lock+irqoff hold times.
>
> Both of these patches modify the page allocator's fallback path to be less greedy
> in an effort to stave off fragmentation. The flip side of this is that fallbacks are
> also less productive each time around, which means the fallback search can run
> much more frequently.
>
> Carlos' traces point to rmqueue_bulk() specifically, which tries to refill the
> percpu cache by allocating a large batch of pages in a loop. It highlights how
> once the native freelists are exhausted, the fallback code first scans orders
> top-down for whole blocks to claim, then falls back to a bottom-up search for the
> smallest buddy to steal.
> For the next batch page, it goes through the same thing again.
>
> This can be made more efficient. Since rmqueue_bulk() holds the
> zone->lock over the entire batch, the freelists are not subject to
> outside changes; when the search for a block to claim has already failed, there is
> no point in trying again for the next page.
>
> Modify __rmqueue() to remember the last successful fallback mode, and restart
> directly from there on the next rmqueue_bulk() iteration.
>
> Oliver confirms that this improves beyond the regression that the test robot
> reported against c2f6ea38fc1b:
>
> commit:
>   f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
>   c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
>   acc4d5ff0b ("Merge tag 'net-6.15-rc0' of
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
>   2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()")   <---
> your patch
>
> f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \          |                \
>   25525364 ±  3%     -56.4%   11135467           -57.8%   10779336           +31.6%   33581409        vm-scalability.throughput
>
> Carlos confirms that worst-case times are almost fully recovered compared to
> before the earlier culprit patch:
>
>   2dd482ba627d (before freelist hygiene):    1ms
>   c0cd6f557b90  (after freelist hygiene):   90ms
>  next-20250319    (steal smallest buddy):  280ms
>     this patch                          :    8ms
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Reported-by: Carlos Song <carlos.song@nxp.com>
> Tested-by: kernel test robot <oliver.sang@intel.com>
> Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block
> conversion")
> Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest
> buddy")
> Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com
> Cc: stable@vger.kernel.org      # 6.10+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 100 +++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 74 insertions(+), 26 deletions(-)
>
> [The original mail quoted the full patch here; see the Patch section at the end of this page.]
Vlastimil Babka April 10, 2025, 8:12 a.m. UTC | #8
On 4/7/25 20:01, Johannes Weiner wrote:
> The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal
> single pages from biggest buddy") as the root cause of a 56.4%
> regression in vm-scalability::lru-file-mmap-read.
> 
> Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix
> freelist movement during block conversion"), as the root cause for a
> regression in worst-case zone->lock+irqoff hold times.
> 
> Both of these patches modify the page allocator's fallback path to be
> less greedy in an effort to stave off fragmentation. The flip side of
> this is that fallbacks are also less productive each time around,
> which means the fallback search can run much more frequently.
> 
> Carlos' traces point to rmqueue_bulk() specifically, which tries to
> refill the percpu cache by allocating a large batch of pages in a
> loop. It highlights how once the native freelists are exhausted, the
> fallback code first scans orders top-down for whole blocks to claim,
> then falls back to a bottom-up search for the smallest buddy to steal.
> For the next batch page, it goes through the same thing again.
> 
> This can be made more efficient. Since rmqueue_bulk() holds the
> zone->lock over the entire batch, the freelists are not subject to
> outside changes; when the search for a block to claim has already
> failed, there is no point in trying again for the next page.
> 
> Modify __rmqueue() to remember the last successful fallback mode, and
> restart directly from there on the next rmqueue_bulk() iteration.
> 
> Oliver confirms that this improves beyond the regression that the test
> robot reported against c2f6ea38fc1b:
> 
> commit:
>   f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
>   c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
>   acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
>   2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()")   <--- your patch
> 
> f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \          |                \
>   25525364 ±  3%     -56.4%   11135467           -57.8%   10779336           +31.6%   33581409        vm-scalability.throughput
> 
> Carlos confirms that worst-case times are almost fully recovered
> compared to before the earlier culprit patch:
> 
>   2dd482ba627d (before freelist hygiene):    1ms
>   c0cd6f557b90  (after freelist hygiene):   90ms
>  next-20250319    (steal smallest buddy):  280ms
>     this patch                          :    8ms
> 
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Reported-by: Carlos Song <carlos.song@nxp.com>
> Tested-by: kernel test robot <oliver.sang@intel.com>
> Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
> Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy")
> Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com
> Cc: stable@vger.kernel.org	# 6.10+

Might be going to be fun with dependency commits.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Cool stuff.

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Vlastimil Babka April 10, 2025, 8:16 a.m. UTC | #9
On 4/9/25 19:30, Brendan Jackman wrote:
> 
> From 8ff20dbb52770d082e182482d2b47e521de028d1 Mon Sep 17 00:00:00 2001
> From: Brendan Jackman <jackmanb@google.com>
> Date: Wed, 9 Apr 2025 17:22:14 +0000
> Subject: [PATCH] page_alloc: speed up fallbacks in rmqueue_bulk() - comment updates
> 
> Tidy up some terminology and redistribute commentary.
> 
> Signed-off-by: Brendan Jackman <jackmanb@google.com>

LGTM (assuming folding)

> ---
>  mm/page_alloc.c | 22 +++++++++-------------
>  1 file changed, 9 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dfb2b3f508af4..220bd0bcc38c3 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2183,21 +2183,13 @@ try_to_claim_block(struct zone *zone, struct page *page,
>  }
>   
>  /*
> - * Try finding a free buddy page on the fallback list.
> - *
> - * This will attempt to claim a whole pageblock for the requested type
> - * to ensure grouping of such requests in the future.
> - *
> - * If a whole block cannot be claimed, steal an individual page, regressing to
> - * __rmqueue_smallest() logic to at least break up as little contiguity as
> - * possible.
> + * Try to allocate from some fallback migratetype by claiming the entire block,
> + * i.e. converting it to the allocation's start migratetype.
>   *
>   * The use of signed ints for order and current_order is a deliberate 
>   * deviation from the rest of this file, to make the for loop
>   * condition simpler.
>   */
> -
> -/* Try to claim a whole foreign block, take a page, expand the remainder */
>  static __always_inline struct page *
>  __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
>                                                 unsigned int alloc_flags)
> @@ -2247,7 +2239,10 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype,
>         return NULL;
>  }
>   
> -/* Try to steal a single page from a foreign block */
> +/*
> + * Try to steal a single page from some fallback migratetype. Leave the rest of
> + * the block as its current migratetype, potentially causing fragmentation.
> + */
>  static __always_inline struct page *
>  __rmqueue_steal(struct zone *zone, int order, int start_migratetype)
>  {
> @@ -2307,7 +2302,8 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>         }
>   
>         /*
> -        * Try the different freelists, native then foreign.
> +        * First try the freelists of the requested migratetype, then try
> +        * fallback modes with increasing levels of fragmentation risk.
>          *
>          * The fallback logic is expensive and rmqueue_bulk() calls in
>          * a loop with the zone->lock held, meaning the freelists are
> @@ -2332,7 +2328,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
>         case RMQUEUE_CLAIM:
>                 page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
>                 if (page) {
> -                       /* Replenished native freelist, back to normal mode */
> +                       /* Replenished preferred freelist, back to normal mode. */
>                         *mode = RMQUEUE_NORMAL;
>                         return page;
>                 }
> 
> base-commit: aa42382db4e2a4ed1f4ba97ffc50e2ce45accb0c
Shivank Garg April 10, 2025, 10:48 a.m. UTC | #10
On 4/7/2025 11:31 PM, Johannes Weiner wrote:
> The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal
> single pages from biggest buddy") as the root cause of a 56.4%
> regression in vm-scalability::lru-file-mmap-read.
> 
> Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix
> freelist movement during block conversion"), as the root cause for a
> regression in worst-case zone->lock+irqoff hold times.
> 
> Both of these patches modify the page allocator's fallback path to be
> less greedy in an effort to stave off fragmentation. The flip side of
> this is that fallbacks are also less productive each time around,
> which means the fallback search can run much more frequently.
> 
> Carlos' traces point to rmqueue_bulk() specifically, which tries to
> refill the percpu cache by allocating a large batch of pages in a
> loop. It highlights how once the native freelists are exhausted, the
> fallback code first scans orders top-down for whole blocks to claim,
> then falls back to a bottom-up search for the smallest buddy to steal.
> For the next batch page, it goes through the same thing again.
> 
> This can be made more efficient. Since rmqueue_bulk() holds the
> zone->lock over the entire batch, the freelists are not subject to
> outside changes; when the search for a block to claim has already
> failed, there is no point in trying again for the next page.
> 
> Modify __rmqueue() to remember the last successful fallback mode, and
> restart directly from there on the next rmqueue_bulk() iteration.
> 
> Oliver confirms that this improves beyond the regression that the test
> robot reported against c2f6ea38fc1b:
> 
> commit:
>   f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
>   c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
>   acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
>   2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()")   <--- your patch
> 
> f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5
> ---------------- --------------------------- --------------------------- ---------------------------
>          %stddev     %change         %stddev     %change         %stddev     %change         %stddev
>              \          |                \          |                \          |                \
>   25525364 ±  3%     -56.4%   11135467           -57.8%   10779336           +31.6%   33581409        vm-scalability.throughput
> 
> Carlos confirms that worst-case times are almost fully recovered
> compared to before the earlier culprit patch:
> 
>   2dd482ba627d (before freelist hygiene):    1ms
>   c0cd6f557b90  (after freelist hygiene):   90ms
>  next-20250319    (steal smallest buddy):  280ms
>     this patch                          :    8ms
> 
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Reported-by: Carlos Song <carlos.song@nxp.com>
> Tested-by: kernel test robot <oliver.sang@intel.com>
> Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion")
> Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy")
> Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com
> Cc: stable@vger.kernel.org	# 6.10+
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---

Tested on AMD Zen 3 EPYC (2-socket and 1 NUMA node, 64 CPUs on each socket)
vm-scalability/300s-lru-file-mmap-read.

		Vanilla		Patched		Change
Throughput	32267451	36112127 	+11.9% improvement
Stddev% 	2477.36		2969.18		+19.8%
Free Time	0.144072	0.148774	+3.2%
Median		227967		249851		+9.6%

Tested-by: Shivank Garg <shivankg@amd.com>

Thanks,
Shivank

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f51aa6051a99..03b0d45ed45a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2194,11 +2194,11 @@  try_to_claim_block(struct zone *zone, struct page *page,
  * The use of signed ints for order and current_order is a deliberate
  * deviation from the rest of this file, to make the for loop
  * condition simpler.
- *
- * Return the stolen page, or NULL if none can be found.
  */
+
+/* Try to claim a whole foreign block, take a page, expand the remainder */
 static __always_inline struct page *
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
+__rmqueue_claim(struct zone *zone, int order, int start_migratetype,
 						unsigned int alloc_flags)
 {
 	struct free_area *area;
@@ -2236,14 +2236,26 @@  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 		page = try_to_claim_block(zone, page, current_order, order,
 					  start_migratetype, fallback_mt,
 					  alloc_flags);
-		if (page)
-			goto got_one;
+		if (page) {
+			trace_mm_page_alloc_extfrag(page, order, current_order,
+						    start_migratetype, fallback_mt);
+			return page;
+		}
 	}
 
-	if (alloc_flags & ALLOC_NOFRAGMENT)
-		return NULL;
+	return NULL;
+}
+
+/* Try to steal a single page from a foreign block */
+static __always_inline struct page *
+__rmqueue_steal(struct zone *zone, int order, int start_migratetype)
+{
+	struct free_area *area;
+	int current_order;
+	struct page *page;
+	int fallback_mt;
+	bool claim_block;
 
-	/* No luck claiming pageblock. Find the smallest fallback page */
 	for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) {
 		area = &(zone->free_area[current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
@@ -2253,25 +2265,28 @@  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
 
 		page = get_page_from_free_area(area, fallback_mt);
 		page_del_and_expand(zone, page, order, current_order, fallback_mt);
-		goto got_one;
+		trace_mm_page_alloc_extfrag(page, order, current_order,
+					    start_migratetype, fallback_mt);
+		return page;
 	}
 
 	return NULL;
-
-got_one:
-	trace_mm_page_alloc_extfrag(page, order, current_order,
-		start_migratetype, fallback_mt);
-
-	return page;
 }
 
+enum rmqueue_mode {
+	RMQUEUE_NORMAL,
+	RMQUEUE_CMA,
+	RMQUEUE_CLAIM,
+	RMQUEUE_STEAL,
+};
+
 /*
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
 static __always_inline struct page *
 __rmqueue(struct zone *zone, unsigned int order, int migratetype,
-						unsigned int alloc_flags)
+	  unsigned int alloc_flags, enum rmqueue_mode *mode)
 {
 	struct page *page;
 
@@ -2290,16 +2305,47 @@  __rmqueue(struct zone *zone, unsigned int order, int migratetype,
 		}
 	}
 
-	page = __rmqueue_smallest(zone, order, migratetype);
-	if (unlikely(!page)) {
-		if (alloc_flags & ALLOC_CMA)
+	/*
+	 * Try the different freelists, native then foreign.
+	 *
+	 * The fallback logic is expensive and rmqueue_bulk() calls in
+	 * a loop with the zone->lock held, meaning the freelists are
+	 * not subject to any outside changes. Remember in *mode where
+	 * we found pay dirt, to save us the search on the next call.
+	 */
+	switch (*mode) {
+	case RMQUEUE_NORMAL:
+		page = __rmqueue_smallest(zone, order, migratetype);
+		if (page)
+			return page;
+		fallthrough;
+	case RMQUEUE_CMA:
+		if (alloc_flags & ALLOC_CMA) {
 			page = __rmqueue_cma_fallback(zone, order);
-
-		if (!page)
-			page = __rmqueue_fallback(zone, order, migratetype,
-						  alloc_flags);
+			if (page) {
+				*mode = RMQUEUE_CMA;
+				return page;
+			}
+		}
+		fallthrough;
+	case RMQUEUE_CLAIM:
+		page = __rmqueue_claim(zone, order, migratetype, alloc_flags);
+		if (page) {
+			/* Replenished native freelist, back to normal mode */
+			*mode = RMQUEUE_NORMAL;
+			return page;
+		}
+		fallthrough;
+	case RMQUEUE_STEAL:
+		if (!(alloc_flags & ALLOC_NOFRAGMENT)) {
+			page = __rmqueue_steal(zone, order, migratetype);
+			if (page) {
+				*mode = RMQUEUE_STEAL;
+				return page;
+			}
+		}
 	}
-	return page;
+	return NULL;
 }
 
 /*
@@ -2311,6 +2357,7 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
 			int migratetype, unsigned int alloc_flags)
 {
+	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
 	unsigned long flags;
 	int i;
 
@@ -2321,7 +2368,7 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
-								alloc_flags);
+					      alloc_flags, &rmqm);
 		if (unlikely(page == NULL))
 			break;
 
@@ -2934,6 +2981,7 @@  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 {
 	struct page *page;
 	unsigned long flags;
+	enum rmqueue_mode rmqm = RMQUEUE_NORMAL;
 
 	do {
 		page = NULL;
@@ -2945,7 +2993,7 @@  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
-			page = __rmqueue(zone, order, migratetype, alloc_flags);
+			page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm);
 
 			/*
 			 * If the allocation fails, allow OOM handling and