Message ID | 20250407180154.63348-1-hannes@cmpxchg.org |
---|---|
State | New |
Series | [1/2] mm: page_alloc: speed up fallbacks in rmqueue_bulk() |
On Mon Apr 7, 2025 at 6:01 PM UTC, Johannes Weiner wrote: > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2194,11 +2194,11 @@ try_to_claim_block(struct zone *zone, struct page *page, > * The use of signed ints for order and current_order is a deliberate > * deviation from the rest of this file, to make the for loop > * condition simpler. > - * > - * Return the stolen page, or NULL if none can be found. > */ This commentary is pretty confusing now, there's a block of text that kinda vaguely applies to the aggregate of __rmqueue_steal(), __rmqueue_fallback() and half of __rmqueue(). I think this new code does a better job of speaking for itself so I think we should just delete this block comment and replace it with some more verbosity elsewhere. > + > +/* Try to claim a whole foreign block, take a page, expand the remainder */ Also on the commentary front, I am not a fan of "foreign" and "native": - "Foreign" is already used in this file to mean NUMA-nonlocal. - We already have "start" and "fallback" being used in identifiers as adjectives to describe the mitegratetype concept. I wouldn't say those are _better_, "native" and "foreign" might be clearer, but it's not worth introducing inconsistency IMO. > static __always_inline struct page * > -__rmqueue_fallback(struct zone *zone, int order, int start_migratetype, > +__rmqueue_claim(struct zone *zone, int order, int start_migratetype, > unsigned int alloc_flags) > { > struct free_area *area; [pasting in more context that wasn't in the original diff..] > /* > * Find the largest available free page in the other list. This roughly > * approximates finding the pageblock with the most free pages, which > * would be too costly to do exactly. > */ > for (current_order = MAX_PAGE_ORDER; current_order >= min_order; > --current_order) { IIUC we could go one step further here and also avoid repeating this iteration? Maybe something for a separate patch though? Anyway, the approach seems like a clear improvement, thanks. I will need to take a closer look at it tomorrow, I've run out of brain juice today. Here's what I got from redistributing the block comment and flipping the terminology: diff --git i/mm/page_alloc.c w/mm/page_alloc.c index dfb2b3f508af..b8142d605691 100644 --- i/mm/page_alloc.c +++ w/mm/page_alloc.c @@ -2183,21 +2183,13 @@ try_to_claim_block(struct zone *zone, struct page *page, } /* - * Try finding a free buddy page on the fallback list. - * - * This will attempt to claim a whole pageblock for the requested type - * to ensure grouping of such requests in the future. - * - * If a whole block cannot be claimed, steal an individual page, regressing to - * __rmqueue_smallest() logic to at least break up as little contiguity as - * possible. + * Try to allocate from some fallback migratetype by claiming the entire block, + * i.e. converting it to the allocation's start migratetype. * * The use of signed ints for order and current_order is a deliberate * deviation from the rest of this file, to make the for loop * condition simpler. */ - -/* Try to claim a whole foreign block, take a page, expand the remainder */ static __always_inline struct page * __rmqueue_claim(struct zone *zone, int order, int start_migratetype, unsigned int alloc_flags) @@ -2247,7 +2239,10 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype, return NULL; } -/* Try to steal a single page from a foreign block */ +/* + * Try to steal a single page from some fallback migratetype. 
Leave the rest of + * the block as its current migratetype, potentially causing fragmentation. + */ static __always_inline struct page * __rmqueue_steal(struct zone *zone, int order, int start_migratetype) { @@ -2307,7 +2302,9 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, } /* - * Try the different freelists, native then foreign. + * First try the freelists of the requested migratetype, then try + * fallbacks. Roughly, each fallback stage poses more of a fragmentation + * risk. * * The fallback logic is expensive and rmqueue_bulk() calls in * a loop with the zone->lock held, meaning the freelists are @@ -2332,7 +2329,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, case RMQUEUE_CLAIM: page = __rmqueue_claim(zone, order, migratetype, alloc_flags); if (page) { - /* Replenished native freelist, back to normal mode */ + /* Replenished requested migratetype's freelist, back to normal mode */ *mode = RMQUEUE_NORMAL; return page; }
On Tue, Apr 08, 2025 at 05:22:00PM +0000, Brendan Jackman wrote: > On Mon Apr 7, 2025 at 6:01 PM UTC, Johannes Weiner wrote: > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -2194,11 +2194,11 @@ try_to_claim_block(struct zone *zone, struct page *page, > > * The use of signed ints for order and current_order is a deliberate > > * deviation from the rest of this file, to make the for loop > > * condition simpler. > > - * > > - * Return the stolen page, or NULL if none can be found. > > */ > > This commentary is pretty confusing now, there's a block of text that > kinda vaguely applies to the aggregate of __rmqueue_steal(), > __rmqueue_fallback() and half of __rmqueue(). I think this new code does > a better job of speaking for itself so I think we should just delete > this block comment and replace it with some more verbosity elsewhere. I'm glad you think so, let's remove it then! > > +/* Try to claim a whole foreign block, take a page, expand the remainder */ > > Also on the commentary front, I am not a fan of "foreign" and "native": > > - "Foreign" is already used in this file to mean NUMA-nonlocal. > > - We already have "start" and "fallback" being used in identifiers > as adjectives to describe the mitegratetype concept. > > I wouldn't say those are _better_, "native" and "foreign" might be > clearer, but it's not worth introducing inconsistency IMO. That's a fair point, no objection to renaming them. > > static __always_inline struct page * > > -__rmqueue_fallback(struct zone *zone, int order, int start_migratetype, > > +__rmqueue_claim(struct zone *zone, int order, int start_migratetype, > > unsigned int alloc_flags) > > { > > struct free_area *area; > > [pasting in more context that wasn't in the original diff..] > > /* > > * Find the largest available free page in the other list. This roughly > > * approximates finding the pageblock with the most free pages, which > > * would be too costly to do exactly. > > */ > > for (current_order = MAX_PAGE_ORDER; current_order >= min_order; > > --current_order) { > > IIUC we could go one step further here and also avoid repeating this > iteration? Maybe something for a separate patch though? That might be worth a test, but agree this should be a separate patch. AFAICS, in the most common configurations MAX_PAGE_ORDER is only one step above pageblock_order or even the same. It might not be worth the complication. > Anyway, the approach seems like a clear improvement, thanks. I will need > to take a closer look at it tomorrow, I've run out of brain juice today. Much appreciate you taking a look, thanks. > Here's what I got from redistributing the block comment and flipping > the terminology: > > diff --git i/mm/page_alloc.c w/mm/page_alloc.c > index dfb2b3f508af..b8142d605691 100644 > --- i/mm/page_alloc.c > +++ w/mm/page_alloc.c > @@ -2183,21 +2183,13 @@ try_to_claim_block(struct zone *zone, struct page *page, > } > > /* > - * Try finding a free buddy page on the fallback list. > - * > - * This will attempt to claim a whole pageblock for the requested type > - * to ensure grouping of such requests in the future. > - * > - * If a whole block cannot be claimed, steal an individual page, regressing to > - * __rmqueue_smallest() logic to at least break up as little contiguity as > - * possible. > + * Try to allocate from some fallback migratetype by claiming the entire block, > + * i.e. converting it to the allocation's start migratetype. 
> * > * The use of signed ints for order and current_order is a deliberate > * deviation from the rest of this file, to make the for loop > * condition simpler. > */ > - > -/* Try to claim a whole foreign block, take a page, expand the remainder */ > static __always_inline struct page * > __rmqueue_claim(struct zone *zone, int order, int start_migratetype, > unsigned int alloc_flags) > @@ -2247,7 +2239,10 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype, > return NULL; > } > > -/* Try to steal a single page from a foreign block */ > +/* > + * Try to steal a single page from some fallback migratetype. Leave the rest of > + * the block as its current migratetype, potentially causing fragmentation. > + */ > static __always_inline struct page * > __rmqueue_steal(struct zone *zone, int order, int start_migratetype) > { > @@ -2307,7 +2302,9 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, > } > > /* > - * Try the different freelists, native then foreign. > + * First try the freelists of the requested migratetype, then try > + * fallbacks. Roughly, each fallback stage poses more of a fragmentation > + * risk. How about "then try fallback modes with increasing levels of fragmentation risk." > * The fallback logic is expensive and rmqueue_bulk() calls in > * a loop with the zone->lock held, meaning the freelists are > @@ -2332,7 +2329,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, > case RMQUEUE_CLAIM: > page = __rmqueue_claim(zone, order, migratetype, alloc_flags); > if (page) { > - /* Replenished native freelist, back to normal mode */ > + /* Replenished requested migratetype's freelist, back to normal mode */ > *mode = RMQUEUE_NORMAL; This line is kind of long now. How about: /* Replenished preferred freelist, back to normal mode */ But yeah, I like your proposed changes. Would you care to send a proper patch? Acked-by: Johannes Weiner <hannes@cmpxchg.org>
On 2025/4/8 2:01, Johannes Weiner wrote: ... > > @@ -2934,6 +2981,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, > { > struct page *page; > unsigned long flags; > + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; > > do { > page = NULL; > @@ -2945,7 +2993,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, > if (alloc_flags & ALLOC_HIGHATOMIC) > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); > if (!page) { > - page = __rmqueue(zone, order, migratetype, alloc_flags); > + page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm); > > /* > * If the allocation fails, allow OOM handling and It was not in the diff, but it seems the zone->lock is held inside the do..while loop, doesn't it mean that the freelists are subject to outside changes and rmqm is stale?
On Wed, Apr 09, 2025 at 04:02:39PM +0800, Yunsheng Lin wrote: > On 2025/4/8 2:01, Johannes Weiner wrote: > > @@ -2934,6 +2981,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, > > { > > struct page *page; > > unsigned long flags; > > + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; > > > > do { > > page = NULL; > > @@ -2945,7 +2993,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, > > if (alloc_flags & ALLOC_HIGHATOMIC) > > page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); > > if (!page) { > > - page = __rmqueue(zone, order, migratetype, alloc_flags); > > + page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm); > > > > /* > > * If the allocation fails, allow OOM handling and > > It was not in the diff, but it seems the zone->lock is held inside the do..while loop, > doesn't it mean that the freelists are subject to outside changes and rmqm is stale? Yes. Note that it only loops when there is a bug/corrupted page, so it won't make much difference in practice. But it's still kind of weird. Thanks for your review, Yunsheng! Andrew, could you please fold the below fixlet? --- From 71b1eea7ded41c1f674909f9755c23b9ee9bcb6a Mon Sep 17 00:00:00 2001 From: Johannes Weiner <hannes@cmpxchg.org> Date: Wed, 9 Apr 2025 09:56:52 -0400 Subject: [PATCH] mm: page_alloc: speed up fallbacks in rmqueue_bulk() fix reset rmqueue_mode in rmqueue_buddy() error loop, per Yunsheng Lin Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> --- mm/page_alloc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dfb2b3f508af..7ffeeb0f62d3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2983,7 +2983,6 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, { struct page *page; unsigned long flags; - enum rmqueue_mode rmqm = RMQUEUE_NORMAL; do { page = NULL; @@ -2996,6 +2995,8 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, if (alloc_flags & ALLOC_HIGHATOMIC) page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); if (!page) { + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; + page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm); /*
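The fixlet above scopes the cached rmqueue_mode to a single allocation attempt, so the rare retry after a rejected page restarts the fallback search from the preferred freelist instead of reusing a cache built under a zone->lock that has since been dropped. A minimal sketch of that pattern, with hypothetical stubs (rmqueue_one(), page_is_corrupted()) standing in for the real allocator and sanity checks, and the HIGHATOMIC path omitted:

#include <stddef.h>

enum rmqueue_mode { RMQUEUE_NORMAL, RMQUEUE_CMA, RMQUEUE_CLAIM, RMQUEUE_STEAL };

struct page { int dummy; };

/* Hypothetical stubs: the real work happens in __rmqueue() and in the
 * page sanity checks that make rmqueue_buddy() loop. */
static struct page *rmqueue_one(enum rmqueue_mode *mode) { (void)mode; return NULL; }
static int page_is_corrupted(const struct page *page)    { (void)page; return 0; }

static struct page *rmqueue_retry_sketch(void)
{
	struct page *page;

	do {
		/* Declared per attempt: each retry starts at RMQUEUE_NORMAL. */
		enum rmqueue_mode rmqm = RMQUEUE_NORMAL;

		page = rmqueue_one(&rmqm);
	} while (page && page_is_corrupted(page));

	return page;
}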
On Tue Apr 8, 2025 at 6:50 PM UTC, Johannes Weiner wrote: >> > /* >> > * Find the largest available free page in the other list. This roughly >> > * approximates finding the pageblock with the most free pages, which >> > * would be too costly to do exactly. >> > */ >> > for (current_order = MAX_PAGE_ORDER; current_order >= min_order; >> > --current_order) { >> >> IIUC we could go one step further here and also avoid repeating this >> iteration? Maybe something for a separate patch though? > > That might be worth a test, but agree this should be a separate patch. > > AFAICS, in the most common configurations MAX_PAGE_ORDER is only one > step above pageblock_order or even the same. It might not be worth the > complication. Oh yeah, makes sense. >> /* >> - * Try the different freelists, native then foreign. >> + * First try the freelists of the requested migratetype, then try >> + * fallbacks. Roughly, each fallback stage poses more of a fragmentation >> + * risk. > > How about "then try fallback modes with increasing levels of > fragmentation risk." Yep, nice thanks. >> * The fallback logic is expensive and rmqueue_bulk() calls in >> * a loop with the zone->lock held, meaning the freelists are >> @@ -2332,7 +2329,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, >> case RMQUEUE_CLAIM: >> page = __rmqueue_claim(zone, order, migratetype, alloc_flags); >> if (page) { >> - /* Replenished native freelist, back to normal mode */ >> + /* Replenished requested migratetype's freelist, back to normal mode */ >> *mode = RMQUEUE_NORMAL; > > This line is kind of long now. How about: > > /* Replenished preferred freelist, back to normal mode */ Yep, sounds good - it's still 81 characters, the rest of this file sticks to 80 for comments, I guess I'll leave it to Andrew to decide if that is an issue? > But yeah, I like your proposed changes. Would you care to send a > proper patch? Sure, pasting below. Andrew, could you fold this in? Also, I haven't done this style of patch sending before, please let me know if I'm doing something to make your life difficult. > Acked-by: Johannes Weiner <hannes@cmpxchg.org> Aside from the commen stuff fixed by the patch below: Reviewed-by: Brendan Jackman <jackmanb@google.com> --- From 8ff20dbb52770d082e182482d2b47e521de028d1 Mon Sep 17 00:00:00 2001 From: Brendan Jackman <jackmanb@google.com> Date: Wed, 9 Apr 2025 17:22:14 +000 Subject: [PATCH] page_alloc: speed up fallbacks in rmqueue_bulk() - comment updates Tidy up some terminology and redistribute commentary. Signed-off-by: Brendan Jackman <jackmanb@google.com> --- mm/page_alloc.c | 22 +++++++++------------- 1 file changed, 9 insertions(+), 13 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index dfb2b3f508af4..220bd0bcc38c3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2183,21 +2183,13 @@ try_to_claim_block(struct zone *zone, struct page *page, } /* - * Try finding a free buddy page on the fallback list. - * - * This will attempt to claim a whole pageblock for the requested type - * to ensure grouping of such requests in the future. - * - * If a whole block cannot be claimed, steal an individual page, regressing to - * __rmqueue_smallest() logic to at least break up as little contiguity as - * possible. + * Try to allocate from some fallback migratetype by claiming the entire block, + * i.e. converting it to the allocation's start migratetype. 
* * The use of signed ints for order and current_order is a deliberate * deviation from the rest of this file, to make the for loop * condition simpler. */ - -/* Try to claim a whole foreign block, take a page, expand the remainder */ static __always_inline struct page * __rmqueue_claim(struct zone *zone, int order, int start_migratetype, unsigned int alloc_flags) @@ -2247,7 +2239,10 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype, return NULL; } -/* Try to steal a single page from a foreign block */ +/* + * Try to steal a single page from some fallback migratetype. Leave the rest of + * the block as its current migratetype, potentially causing fragmentation. + */ static __always_inline struct page * __rmqueue_steal(struct zone *zone, int order, int start_migratetype) { @@ -2307,7 +2302,8 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, } /* - * Try the different freelists, native then foreign. + * First try the freelists of the requested migratetype, then try + * fallbacks modes with increasing levels of fragmentation risk. * * The fallback logic is expensive and rmqueue_bulk() calls in * a loop with the zone->lock held, meaning the freelists are @@ -2332,7 +2328,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, case RMQUEUE_CLAIM: page = __rmqueue_claim(zone, order, migratetype, alloc_flags); if (page) { - /* Replenished native freelist, back to normal mode */ + /* Replenished preferred freelist, back to normal mode. */ *mode = RMQUEUE_NORMAL; return page; } base-commit: aa42382db4e2a4ed1f4ba97ffc50e2ce45accb0c
On Mon Apr 7, 2025 at 2:01 PM EDT, Johannes Weiner wrote: > The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal > single pages from biggest buddy") as the root cause of a 56.4% > regression in vm-scalability::lru-file-mmap-read. > > Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix > freelist movement during block conversion"), as the root cause for a > regression in worst-case zone->lock+irqoff hold times. > > Both of these patches modify the page allocator's fallback path to be > less greedy in an effort to stave off fragmentation. The flip side of > this is that fallbacks are also less productive each time around, > which means the fallback search can run much more frequently. > > Carlos' traces point to rmqueue_bulk() specifically, which tries to > refill the percpu cache by allocating a large batch of pages in a > loop. It highlights how once the native freelists are exhausted, the > fallback code first scans orders top-down for whole blocks to claim, > then falls back to a bottom-up search for the smallest buddy to steal. > For the next batch page, it goes through the same thing again. > > This can be made more efficient. Since rmqueue_bulk() holds the > zone->lock over the entire batch, the freelists are not subject to > outside changes; when the search for a block to claim has already > failed, there is no point in trying again for the next page. > > Modify __rmqueue() to remember the last successful fallback mode, and > restart directly from there on the next rmqueue_bulk() iteration. > > Oliver confirms that this improves beyond the regression that the test > robot reported against c2f6ea38fc1b: > > commit: > f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap") > c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy") > acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") > 2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch > > f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5 > ---------------- --------------------------- --------------------------- --------------------------- > %stddev %change %stddev %change %stddev %change %stddev > \ | \ | \ | \ > 25525364 ± 3% -56.4% 11135467 -57.8% 10779336 +31.6% 33581409 vm-scalability.throughput > > Carlos confirms that worst-case times are almost fully recovered > compared to before the earlier culprit patch: > > 2dd482ba627d (before freelist hygiene): 1ms > c0cd6f557b90 (after freelist hygiene): 90ms > next-20250319 (steal smallest buddy): 280ms > this patch : 8ms > > Reported-by: kernel test robot <oliver.sang@intel.com> > Reported-by: Carlos Song <carlos.song@nxp.com> > Tested-by: kernel test robot <oliver.sang@intel.com> > Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") > Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") > Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com > Cc: stable@vger.kernel.org # 6.10+ > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > mm/page_alloc.c | 100 +++++++++++++++++++++++++++++++++++------------- > 1 file changed, 74 insertions(+), 26 deletions(-) > It is a really nice cleanup. It improves my understanding of the rmqueue*() and the whole flow a lot. Thank you for the patch. Acked-by: Zi Yan <ziy@nvidia.com>
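The mechanism the commit message describes is a small state machine: remember which stage of the fallback search last produced a page, and resume from that stage for the next page of the same zone->lock-held batch. Below is a simplified, self-contained sketch of that idea, not the mm/page_alloc.c code: the enum values and the claim-resets-to-normal detail mirror the patch, every helper is a hypothetical stub, and the ALLOC_CMA/ALLOC_NOFRAGMENT gating is omitted.

#include <stddef.h>

enum rmqueue_mode { RMQUEUE_NORMAL, RMQUEUE_CMA, RMQUEUE_CLAIM, RMQUEUE_STEAL };

struct page { int dummy; };

/* Hypothetical stand-ins for __rmqueue_smallest(), __rmqueue_cma_fallback(),
 * __rmqueue_claim() and __rmqueue_steal(). */
static struct page *take_preferred(void)    { return NULL; }
static struct page *take_cma(void)          { return NULL; }
static struct page *claim_whole_block(void) { return NULL; }
static struct page *steal_single_page(void) { return NULL; }

static struct page *rmqueue_one(enum rmqueue_mode *mode)
{
	struct page *page;

	/* Resume at the stage that last succeeded; fall through on failure. */
	switch (*mode) {
	case RMQUEUE_NORMAL:
		page = take_preferred();
		if (page)
			return page;
		/* fall through */
	case RMQUEUE_CMA:
		page = take_cma();
		if (page) {
			*mode = RMQUEUE_CMA;
			return page;
		}
		/* fall through */
	case RMQUEUE_CLAIM:
		page = claim_whole_block();
		if (page) {
			/* A claimed block refills the preferred list: back to normal. */
			*mode = RMQUEUE_NORMAL;
			return page;
		}
		/* fall through */
	case RMQUEUE_STEAL:
		page = steal_single_page();
		if (page) {
			*mode = RMQUEUE_STEAL;
			return page;
		}
	}
	return NULL;
}

/* One mode variable for the whole batch, as rmqueue_bulk() does with &rmqm. */
static size_t rmqueue_batch_sketch(struct page **pages, size_t count)
{
	enum rmqueue_mode mode = RMQUEUE_NORMAL;
	size_t i;

	for (i = 0; i < count; i++) {
		pages[i] = rmqueue_one(&mode);
		if (!pages[i])
			break;
	}
	return i;
}

Within one batch a failed claim stage is never retried, and a successful claim drops the mode back to RMQUEUE_NORMAL because the claimed block now feeds the preferred freelist.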
Hi, That is a nice fix! I tested it on imx7d. Tested-by: Carlos Song <carlos.song@nxp.com> > -----Original Message----- > From: Johannes Weiner <hannes@cmpxchg.org> > Sent: Tuesday, April 8, 2025 2:02 AM > To: Andrew Morton <akpm@linux-foundation.org> > Cc: Vlastimil Babka <vbabka@suse.cz>; Brendan Jackman > <jackmanb@google.com>; Mel Gorman <mgorman@techsingularity.net>; > Carlos Song <carlos.song@nxp.com>; linux-mm@kvack.org; > linux-kernel@vger.kernel.org; kernel test robot <oliver.sang@intel.com>; > stable@vger.kernel.org > Subject: [EXT] [PATCH 1/2] mm: page_alloc: speed up fallbacks in > rmqueue_bulk() > > The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal single > pages from biggest buddy") as the root cause of a 56.4% regression in > vm-scalability::lru-file-mmap-read. > > Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix freelist > movement during block conversion"), as the root cause for a regression in > worst-case zone->lock+irqoff hold times. > > Both of these patches modify the page allocator's fallback path to be less greedy > in an effort to stave off fragmentation. The flip side of this is that fallbacks are > also less productive each time around, which means the fallback search can run > much more frequently. > > Carlos' traces point to rmqueue_bulk() specifically, which tries to refill the > percpu cache by allocating a large batch of pages in a loop. It highlights how > once the native freelists are exhausted, the fallback code first scans orders > top-down for whole blocks to claim, then falls back to a bottom-up search for the > smallest buddy to steal. > For the next batch page, it goes through the same thing again. > > This can be made more efficient. Since rmqueue_bulk() holds the > zone->lock over the entire batch, the freelists are not subject to > outside changes; when the search for a block to claim has already failed, there is > no point in trying again for the next page. > > Modify __rmqueue() to remember the last successful fallback mode, and restart > directly from there on the next rmqueue_bulk() iteration.
> > Oliver confirms that this improves beyond the regression that the test robot > reported against c2f6ea38fc1b: > > commit: > f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap") > c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy") > acc4d5ff0b ("Merge tag 'net-6.15-rc0' of > git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") > 2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- > your patch > > f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 > acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5 > ---------------- --------------------------- --------------------------- --------------------------- > %stddev %change %stddev %change > %stddev %change %stddev > \ | \ | > \ | \ > 25525364 ± 3% -56.4% 11135467 -57.8% > 10779336 +31.6% 33581409 > vm-scalability.throughput > > Carlos confirms that worst-case times are almost fully recovered compared to > before the earlier culprit patch: > > 2dd482ba627d (before freelist hygiene): 1ms > c0cd6f557b90 (after freelist hygiene): 90ms > next-20250319 (steal smallest buddy): 280ms > this patch : 8ms > > Reported-by: kernel test robot <oliver.sang@intel.com> > Reported-by: Carlos Song <carlos.song@nxp.com> > Tested-by: kernel test robot <oliver.sang@intel.com> > Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block > conversion") > Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest > buddy") > Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com > Cc: stable@vger.kernel.org # 6.10+ > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- > mm/page_alloc.c | 100 +++++++++++++++++++++++++++++++++++------------- > 1 file changed, 74 insertions(+), 26 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c index > f51aa6051a99..03b0d45ed45a 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2194,11 +2194,11 @@ try_to_claim_block(struct zone *zone, struct page > *page, > * The use of signed ints for order and current_order is a deliberate > * deviation from the rest of this file, to make the for loop > * condition simpler. > - * > - * Return the stolen page, or NULL if none can be found.
> */ > + > +/* Try to claim a whole foreign block, take a page, expand the > +remainder */ > static __always_inline struct page * > -__rmqueue_fallback(struct zone *zone, int order, int start_migratetype, > +__rmqueue_claim(struct zone *zone, int order, int start_migratetype, > unsigned int > alloc_flags) { > struct free_area *area; > @@ -2236,14 +2236,26 @@ __rmqueue_fallback(struct zone *zone, int order, > int start_migratetype, > page = try_to_claim_block(zone, page, current_order, order, > start_migratetype, > fallback_mt, > alloc_flags); > - if (page) > - goto got_one; > + if (page) { > + trace_mm_page_alloc_extfrag(page, order, > current_order, > + > start_migratetype, fallback_mt); > + return page; > + } > } > > - if (alloc_flags & ALLOC_NOFRAGMENT) > - return NULL; > + return NULL; > +} > + > +/* Try to steal a single page from a foreign block */ static > +__always_inline struct page * __rmqueue_steal(struct zone *zone, int > +order, int start_migratetype) { > + struct free_area *area; > + int current_order; > + struct page *page; > + int fallback_mt; > + bool claim_block; > > - /* No luck claiming pageblock. Find the smallest fallback page */ > for (current_order = order; current_order < NR_PAGE_ORDERS; > current_order++) { > area = &(zone->free_area[current_order]); > fallback_mt = find_suitable_fallback(area, current_order, > @@ -2253,25 +2265,28 @@ __rmqueue_fallback(struct zone *zone, int order, > int start_migratetype, > > page = get_page_from_free_area(area, fallback_mt); > page_del_and_expand(zone, page, order, current_order, > fallback_mt); > - goto got_one; > + trace_mm_page_alloc_extfrag(page, order, current_order, > + start_migratetype, > fallback_mt); > + return page; > } > > return NULL; > - > -got_one: > - trace_mm_page_alloc_extfrag(page, order, current_order, > - start_migratetype, fallback_mt); > - > - return page; > } > > +enum rmqueue_mode { > + RMQUEUE_NORMAL, > + RMQUEUE_CMA, > + RMQUEUE_CLAIM, > + RMQUEUE_STEAL, > +}; > + > /* > * Do the hard work of removing an element from the buddy allocator. > * Call me with the zone->lock already held. > */ > static __always_inline struct page * > __rmqueue(struct zone *zone, unsigned int order, int migratetype, > - unsigned int > alloc_flags) > + unsigned int alloc_flags, enum rmqueue_mode *mode) > { > struct page *page; > > @@ -2290,16 +2305,47 @@ __rmqueue(struct zone *zone, unsigned int order, > int migratetype, > } > } > > - page = __rmqueue_smallest(zone, order, migratetype); > - if (unlikely(!page)) { > - if (alloc_flags & ALLOC_CMA) > + /* > + * Try the different freelists, native then foreign. > + * > + * The fallback logic is expensive and rmqueue_bulk() calls in > + * a loop with the zone->lock held, meaning the freelists are > + * not subject to any outside changes. Remember in *mode where > + * we found pay dirt, to save us the search on the next call. 
> + */ > + switch (*mode) { > + case RMQUEUE_NORMAL: > + page = __rmqueue_smallest(zone, order, migratetype); > + if (page) > + return page; > + fallthrough; > + case RMQUEUE_CMA: > + if (alloc_flags & ALLOC_CMA) { > page = __rmqueue_cma_fallback(zone, order); > - > - if (!page) > - page = __rmqueue_fallback(zone, order, > migratetype, > - alloc_flags); > + if (page) { > + *mode = RMQUEUE_CMA; > + return page; > + } > + } > + fallthrough; > + case RMQUEUE_CLAIM: > + page = __rmqueue_claim(zone, order, migratetype, > alloc_flags); > + if (page) { > + /* Replenished native freelist, back to normal > mode */ > + *mode = RMQUEUE_NORMAL; > + return page; > + } > + fallthrough; > + case RMQUEUE_STEAL: > + if (!(alloc_flags & ALLOC_NOFRAGMENT)) { > + page = __rmqueue_steal(zone, order, > migratetype); > + if (page) { > + *mode = RMQUEUE_STEAL; > + return page; > + } > + } > } > - return page; > + return NULL; > } > > /* > @@ -2311,6 +2357,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned > int order, > unsigned long count, struct list_head *list, > int migratetype, unsigned int alloc_flags) { > + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; > unsigned long flags; > int i; > > @@ -2321,7 +2368,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned > int order, > } > for (i = 0; i < count; ++i) { > struct page *page = __rmqueue(zone, order, migratetype, > - > alloc_flags); > + alloc_flags, &rmqm); > if (unlikely(page == NULL)) > break; > > @@ -2934,6 +2981,7 @@ struct page *rmqueue_buddy(struct zone > *preferred_zone, struct zone *zone, { > struct page *page; > unsigned long flags; > + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; > > do { > page = NULL; > @@ -2945,7 +2993,7 @@ struct page *rmqueue_buddy(struct zone > *preferred_zone, struct zone *zone, > if (alloc_flags & ALLOC_HIGHATOMIC) > page = __rmqueue_smallest(zone, order, > MIGRATE_HIGHATOMIC); > if (!page) { > - page = __rmqueue(zone, order, migratetype, > alloc_flags); > + page = __rmqueue(zone, order, migratetype, > + alloc_flags, &rmqm); > > /* > * If the allocation fails, allow OOM handling and > -- > 2.49.0
On 4/7/25 20:01, Johannes Weiner wrote: > The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal > single pages from biggest buddy") as the root cause of a 56.4% > regression in vm-scalability::lru-file-mmap-read. > > Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix > freelist movement during block conversion"), as the root cause for a > regression in worst-case zone->lock+irqoff hold times. > > Both of these patches modify the page allocator's fallback path to be > less greedy in an effort to stave off fragmentation. The flip side of > this is that fallbacks are also less productive each time around, > which means the fallback search can run much more frequently. > > Carlos' traces point to rmqueue_bulk() specifically, which tries to > refill the percpu cache by allocating a large batch of pages in a > loop. It highlights how once the native freelists are exhausted, the > fallback code first scans orders top-down for whole blocks to claim, > then falls back to a bottom-up search for the smallest buddy to steal. > For the next batch page, it goes through the same thing again. > > This can be made more efficient. Since rmqueue_bulk() holds the > zone->lock over the entire batch, the freelists are not subject to > outside changes; when the search for a block to claim has already > failed, there is no point in trying again for the next page. > > Modify __rmqueue() to remember the last successful fallback mode, and > restart directly from there on the next rmqueue_bulk() iteration. > > Oliver confirms that this improves beyond the regression that the test > robot reported against c2f6ea38fc1b: > > commit: > f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap") > c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy") > acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") > 2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch > > f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5 > ---------------- --------------------------- --------------------------- --------------------------- > %stddev %change %stddev %change %stddev %change %stddev > \ | \ | \ | \ > 25525364 ± 3% -56.4% 11135467 -57.8% 10779336 +31.6% 33581409 vm-scalability.throughput > > Carlos confirms that worst-case times are almost fully recovered > compared to before the earlier culprit patch: > > 2dd482ba627d (before freelist hygiene): 1ms > c0cd6f557b90 (after freelist hygiene): 90ms > next-20250319 (steal smallest buddy): 280ms > this patch : 8ms > > Reported-by: kernel test robot <oliver.sang@intel.com> > Reported-by: Carlos Song <carlos.song@nxp.com> > Tested-by: kernel test robot <oliver.sang@intel.com> > Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") > Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") > Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com > Cc: stable@vger.kernel.org # 6.10+ Might be going to be fun with dependency commits. > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cool stuff. Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
On 4/9/25 19:30, Brendan Jackman wrote: > > From 8ff20dbb52770d082e182482d2b47e521de028d1 Mon Sep 17 00:00:00 2001 > From: Brendan Jackman <jackmanb@google.com> > Date: Wed, 9 Apr 2025 17:22:14 +000 > Subject: [PATCH] page_alloc: speed up fallbacks in rmqueue_bulk() - comment updates > > Tidy up some terminology and redistribute commentary. > Signed-off-by: Brendan Jackman <jackmanb@google.com> LGTM (assuming folding) > --- > mm/page_alloc.c | 22 +++++++++------------- > 1 file changed, 9 insertions(+), 13 deletions(-) > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index dfb2b3f508af4..220bd0bcc38c3 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -2183,21 +2183,13 @@ try_to_claim_block(struct zone *zone, struct page *page, > } > > /* > - * Try finding a free buddy page on the fallback list. > - * > - * This will attempt to claim a whole pageblock for the requested type > - * to ensure grouping of such requests in the future. > - * > - * If a whole block cannot be claimed, steal an individual page, regressing to > - * __rmqueue_smallest() logic to at least break up as little contiguity as > - * possible. > + * Try to allocate from some fallback migratetype by claiming the entire block, > + * i.e. converting it to the allocation's start migratetype. > * > * The use of signed ints for order and current_order is a deliberate > * deviation from the rest of this file, to make the for loop > * condition simpler. > */ > - > -/* Try to claim a whole foreign block, take a page, expand the remainder */ > static __always_inline struct page * > __rmqueue_claim(struct zone *zone, int order, int start_migratetype, > unsigned int alloc_flags) > @@ -2247,7 +2239,10 @@ __rmqueue_claim(struct zone *zone, int order, int start_migratetype, > return NULL; > } > > -/* Try to steal a single page from a foreign block */ > +/* > + * Try to steal a single page from some fallback migratetype. Leave the rest of > + * the block as its current migratetype, potentially causing fragmentation. > + */ > static __always_inline struct page * > __rmqueue_steal(struct zone *zone, int order, int start_migratetype) > { > @@ -2307,7 +2302,8 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, > } > > /* > - * Try the different freelists, native then foreign. > + * First try the freelists of the requested migratetype, then try > + * fallbacks modes with increasing levels of fragmentation risk. > * > * The fallback logic is expensive and rmqueue_bulk() calls in > * a loop with the zone->lock held, meaning the freelists are > @@ -2332,7 +2328,7 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, > case RMQUEUE_CLAIM: > page = __rmqueue_claim(zone, order, migratetype, alloc_flags); > if (page) { > - /* Replenished native freelist, back to normal mode */ > + /* Replenished preferred freelist, back to normal mode. */ > *mode = RMQUEUE_NORMAL; > return page; > } > > base-commit: aa42382db4e2a4ed1f4ba97ffc50e2ce45accb0c
On 4/7/2025 11:31 PM, Johannes Weiner wrote: > The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal > single pages from biggest buddy") as the root cause of a 56.4% > regression in vm-scalability::lru-file-mmap-read. > > Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix > freelist movement during block conversion"), as the root cause for a > regression in worst-case zone->lock+irqoff hold times. > > Both of these patches modify the page allocator's fallback path to be > less greedy in an effort to stave off fragmentation. The flip side of > this is that fallbacks are also less productive each time around, > which means the fallback search can run much more frequently. > > Carlos' traces point to rmqueue_bulk() specifically, which tries to > refill the percpu cache by allocating a large batch of pages in a > loop. It highlights how once the native freelists are exhausted, the > fallback code first scans orders top-down for whole blocks to claim, > then falls back to a bottom-up search for the smallest buddy to steal. > For the next batch page, it goes through the same thing again. > > This can be made more efficient. Since rmqueue_bulk() holds the > zone->lock over the entire batch, the freelists are not subject to > outside changes; when the search for a block to claim has already > failed, there is no point in trying again for the next page. > > Modify __rmqueue() to remember the last successful fallback mode, and > restart directly from there on the next rmqueue_bulk() iteration. > > Oliver confirms that this improves beyond the regression that the test > robot reported against c2f6ea38fc1b: > > commit: > f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap") > c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy") > acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") > 2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch > > f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5 > ---------------- --------------------------- --------------------------- --------------------------- > %stddev %change %stddev %change %stddev %change %stddev > \ | \ | \ | \ > 25525364 ± 3% -56.4% 11135467 -57.8% 10779336 +31.6% 33581409 vm-scalability.throughput > > Carlos confirms that worst-case times are almost fully recovered > compared to before the earlier culprit patch: > > 2dd482ba627d (before freelist hygiene): 1ms > c0cd6f557b90 (after freelist hygiene): 90ms > next-20250319 (steal smallest buddy): 280ms > this patch : 8ms > > Reported-by: kernel test robot <oliver.sang@intel.com> > Reported-by: Carlos Song <carlos.song@nxp.com> > Tested-by: kernel test robot <oliver.sang@intel.com> > Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") > Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") > Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com > Cc: stable@vger.kernel.org # 6.10+ > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> > --- Tested on AMD Zen 3 EPYC (2-socket and 1 NUMA node, 64 CPUs on each socket) vm-scalability/300s-lru-file-mmap-read. Vanilla Patched Change Throughput 32267451 36112127 +11.9% improvement Stddev% 2477.36 2969.18 +19.8% Free Time 0.144072 0.148774 +3.2% Median 227967 249851 +9.6% Tested-by: Shivank Garg <shivankg@amd.com> Thanks, Shivank
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f51aa6051a99..03b0d45ed45a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2194,11 +2194,11 @@ try_to_claim_block(struct zone *zone, struct page *page, * The use of signed ints for order and current_order is a deliberate * deviation from the rest of this file, to make the for loop * condition simpler. - * - * Return the stolen page, or NULL if none can be found. */ + +/* Try to claim a whole foreign block, take a page, expand the remainder */ static __always_inline struct page * -__rmqueue_fallback(struct zone *zone, int order, int start_migratetype, +__rmqueue_claim(struct zone *zone, int order, int start_migratetype, unsigned int alloc_flags) { struct free_area *area; @@ -2236,14 +2236,26 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, page = try_to_claim_block(zone, page, current_order, order, start_migratetype, fallback_mt, alloc_flags); - if (page) - goto got_one; + if (page) { + trace_mm_page_alloc_extfrag(page, order, current_order, + start_migratetype, fallback_mt); + return page; + } } - if (alloc_flags & ALLOC_NOFRAGMENT) - return NULL; + return NULL; +} + +/* Try to steal a single page from a foreign block */ +static __always_inline struct page * +__rmqueue_steal(struct zone *zone, int order, int start_migratetype) +{ + struct free_area *area; + int current_order; + struct page *page; + int fallback_mt; + bool claim_block; - /* No luck claiming pageblock. Find the smallest fallback page */ for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, @@ -2253,25 +2265,28 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, page = get_page_from_free_area(area, fallback_mt); page_del_and_expand(zone, page, order, current_order, fallback_mt); - goto got_one; + trace_mm_page_alloc_extfrag(page, order, current_order, + start_migratetype, fallback_mt); + return page; } return NULL; - -got_one: - trace_mm_page_alloc_extfrag(page, order, current_order, - start_migratetype, fallback_mt); - - return page; } +enum rmqueue_mode { + RMQUEUE_NORMAL, + RMQUEUE_CMA, + RMQUEUE_CLAIM, + RMQUEUE_STEAL, +}; + /* * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ static __always_inline struct page * __rmqueue(struct zone *zone, unsigned int order, int migratetype, - unsigned int alloc_flags) + unsigned int alloc_flags, enum rmqueue_mode *mode) { struct page *page; @@ -2290,16 +2305,47 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, } } - page = __rmqueue_smallest(zone, order, migratetype); - if (unlikely(!page)) { - if (alloc_flags & ALLOC_CMA) + /* + * Try the different freelists, native then foreign. + * + * The fallback logic is expensive and rmqueue_bulk() calls in + * a loop with the zone->lock held, meaning the freelists are + * not subject to any outside changes. Remember in *mode where + * we found pay dirt, to save us the search on the next call. 
+ */ + switch (*mode) { + case RMQUEUE_NORMAL: + page = __rmqueue_smallest(zone, order, migratetype); + if (page) + return page; + fallthrough; + case RMQUEUE_CMA: + if (alloc_flags & ALLOC_CMA) { page = __rmqueue_cma_fallback(zone, order); - - if (!page) - page = __rmqueue_fallback(zone, order, migratetype, - alloc_flags); + if (page) { + *mode = RMQUEUE_CMA; + return page; + } + } + fallthrough; + case RMQUEUE_CLAIM: + page = __rmqueue_claim(zone, order, migratetype, alloc_flags); + if (page) { + /* Replenished native freelist, back to normal mode */ + *mode = RMQUEUE_NORMAL; + return page; + } + fallthrough; + case RMQUEUE_STEAL: + if (!(alloc_flags & ALLOC_NOFRAGMENT)) { + page = __rmqueue_steal(zone, order, migratetype); + if (page) { + *mode = RMQUEUE_STEAL; + return page; + } + } } - return page; + return NULL; } /* @@ -2311,6 +2357,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, unsigned long count, struct list_head *list, int migratetype, unsigned int alloc_flags) { + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; unsigned long flags; int i; @@ -2321,7 +2368,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, } for (i = 0; i < count; ++i) { struct page *page = __rmqueue(zone, order, migratetype, - alloc_flags); + alloc_flags, &rmqm); if (unlikely(page == NULL)) break; @@ -2934,6 +2981,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, { struct page *page; unsigned long flags; + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; do { page = NULL; @@ -2945,7 +2993,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, if (alloc_flags & ALLOC_HIGHATOMIC) page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); if (!page) { - page = __rmqueue(zone, order, migratetype, alloc_flags); + page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm); /* * If the allocation fails, allow OOM handling and