[1/6] mm: page_alloc: remove pcppage migratetype caching

Message ID 20230911195023.247694-2-hannes@cmpxchg.org (mailing list archive)
State New
Series: mm: page_alloc: freelist migratetype hygiene

Commit Message

Johannes Weiner Sept. 11, 2023, 7:41 p.m. UTC
The idea behind the cache is to save get_pageblock_migratetype()
lookups during bulk freeing. A microbenchmark suggests this isn't
helping, though. The pcp migratetype can get stale, which means that
bulk freeing has an extra branch to check if the pageblock was
isolated while on the pcp.

While the variance overlaps, the cache write and the branch seem to
make this a net negative. The following test allocates and frees
batches of 10,000 pages (~3x the pcp high marks to trigger flushing):

Before:
          8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
                19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
                 0      cpu-migrations                   #    0.000 /sec
            17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
    41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
   126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
    25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
        33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )

         0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )

After:
          8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
                22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
                 0      cpu-migrations                   #    0.000 /sec
            17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
    40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
   126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
    25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
        32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )

         0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )

A side effect is that this also ensures that pages whose pageblock
gets stolen while on the pcplist end up on the right freelist and we
don't perform potentially type-incompatible buddy merges (or skip
merges when we shouldn't), which is likely beneficial to long-term
fragmentation management, although the effects would be harder to
measure. Settle for simpler and faster code as justification here.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 61 ++++++++++++-------------------------------------
 1 file changed, 14 insertions(+), 47 deletions(-)
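
The changelog describes the test only in outline. A userspace loop of
the following shape (an assumption about the setup, not the actual
benchmark source, which was not posted) exercises the same path: faulting
in a batch of ~10,000 anonymous pages allocates through the pcp lists,
and the munmap() frees the whole batch back, overflowing the pcp high
mark and hitting the bulk-freeing code discussed above.

/*
 * Hypothetical reconstruction of the kind of test described in the
 * changelog. Batch size and iteration count are illustrative.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const long page_size = sysconf(_SC_PAGESIZE);
	const size_t pages = 10000;	/* ~3x the pcp high mark */
	const size_t len = pages * page_size;

	for (int i = 0; i < 100; i++) {
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* allocation side: fault in every page */
		for (size_t off = 0; off < len; off += page_size)
			buf[off] = 1;
		/* freeing side: release the whole batch at once */
		munmap(buf, len);
	}
	return 0;
}

A loop like this can be run under perf stat -r to collect the kind of
counters quoted in the changelog.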

Comments

Zi Yan Sept. 11, 2023, 7:59 p.m. UTC | #1
On 11 Sep 2023, at 15:41, Johannes Weiner wrote:

> The idea behind the cache is to save get_pageblock_migratetype()
> lookups during bulk freeing. A microbenchmark suggests this isn't
> helping, though. The pcp migratetype can get stale, which means that
> bulk freeing has an extra branch to check if the pageblock was
> isolated while on the pcp.
>
> While the variance overlaps, the cache write and the branch seem to
> make this a net negative. The following test allocates and frees
> batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
>
> Before:
>           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
>
>          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
>
> After:
>           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
>
>          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
>
> A side effect is that this also ensures that pages whose pageblock
> gets stolen while on the pcplist end up on the right freelist and we
> don't perform potentially type-incompatible buddy merges (or skip
> merges when we shouldn't), whis is likely beneficial to long-term

s/whis/this

> fragmentation management, although the effects would be harder to
> measure. Settle for simpler and faster code as justification here.
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/page_alloc.c | 61 ++++++++++++-------------------------------------
>  1 file changed, 14 insertions(+), 47 deletions(-)
>

Acked-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi
Andrew Morton Sept. 11, 2023, 9:09 p.m. UTC | #2
On Mon, 11 Sep 2023 15:59:43 -0400 Zi Yan <ziy@nvidia.com> wrote:

> > A side effect is that this also ensures that pages whose pageblock
> > gets stolen while on the pcplist end up on the right freelist and we
> > don't perform potentially type-incompatible buddy merges (or skip
> > merges when we shouldn't), whis is likely beneficial to long-term
> 
> s/whis/this

Thanks, I did s/whis/which/
Vlastimil Babka Sept. 12, 2023, 1:47 p.m. UTC | #3
On 9/11/23 21:41, Johannes Weiner wrote:
> The idea behind the cache is to save get_pageblock_migratetype()
> lookups during bulk freeing. A microbenchmark suggests this isn't
> helping, though. The pcp migratetype can get stale, which means that
> bulk freeing has an extra branch to check if the pageblock was
> isolated while on the pcp.
> 
> While the variance overlaps, the cache write and the branch seem to
> make this a net negative. The following test allocates and frees
> batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
> 
> Before:
>           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
> 
>          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
> 
> After:
>           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
> 
>          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
> 
> A side effect is that this also ensures that pages whose pageblock
> gets stolen while on the pcplist end up on the right freelist and we
> don't perform potentially type-incompatible buddy merges (or skip
> merges when we shouldn't), whis is likely beneficial to long-term
> fragmentation management, although the effects would be harder to
> measure. Settle for simpler and faster code as justification here.

Makes sense to me, so

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Some notes below.

> @@ -1577,7 +1556,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>  			continue;
>  		del_page_from_free_list(page, zone, current_order);
>  		expand(zone, page, order, current_order, migratetype);
> -		set_pcppage_migratetype(page, migratetype);

Hm interesting, just noticed that __rmqueue_fallback() never did this
AFAICS, sounds like a bug.

>  		trace_mm_page_alloc_zone_locked(page, order, migratetype,
>  				pcp_allowed_order(order) &&
>  				migratetype < MIGRATE_PCPTYPES);
> @@ -2145,7 +2123,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  		 * pages are ordered properly.
>  		 */
>  		list_add_tail(&page->pcp_list, list);
> -		if (is_migrate_cma(get_pcppage_migratetype(page)))
> +		if (is_migrate_cma(get_pageblock_migratetype(page)))
>  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>  					      -(1 << order));

This is potentially a source of overhead, I assume patch 6/6 might
change that.

>  	}
> @@ -2304,19 +2282,6 @@ void drain_all_pages(struct zone *zone)
>  	__drain_all_pages(zone, false);
>  }
>  
> -static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
> -							unsigned int order)
> -{
> -	int migratetype;
> -
> -	if (!free_pages_prepare(page, order, FPI_NONE))
> -		return false;
> -
> -	migratetype = get_pfnblock_migratetype(page, pfn);
> -	set_pcppage_migratetype(page, migratetype);
> -	return true;
> -}
> -
>  static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
>  {
>  	int min_nr_free, max_nr_free;
> @@ -2402,7 +2367,7 @@ void free_unref_page(struct page *page, unsigned int order)
>  	unsigned long pfn = page_to_pfn(page);
>  	int migratetype, pcpmigratetype;
>  
> -	if (!free_unref_page_prepare(page, pfn, order))
> +	if (!free_pages_prepare(page, order, FPI_NONE))
>  		return;
>  
>  	/*
> @@ -2412,7 +2377,7 @@ void free_unref_page(struct page *page, unsigned int order)
>  	 * get those areas back if necessary. Otherwise, we may have to free
>  	 * excessively into the page allocator
>  	 */
> -	migratetype = pcpmigratetype = get_pcppage_migratetype(page);
> +	migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn);
>  	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
>  		if (unlikely(is_migrate_isolate(migratetype))) {
>  			free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
> @@ -2448,7 +2413,8 @@ void free_unref_page_list(struct list_head *list)
>  	/* Prepare pages for freeing */
>  	list_for_each_entry_safe(page, next, list, lru) {
>  		unsigned long pfn = page_to_pfn(page);
> -		if (!free_unref_page_prepare(page, pfn, 0)) {
> +
> +		if (!free_pages_prepare(page, 0, FPI_NONE)) {
>  			list_del(&page->lru);
>  			continue;
>  		}
> @@ -2457,7 +2423,7 @@ void free_unref_page_list(struct list_head *list)
>  		 * Free isolated pages directly to the allocator, see
>  		 * comment in free_unref_page.
>  		 */
> -		migratetype = get_pcppage_migratetype(page);
> +		migratetype = get_pfnblock_migratetype(page, pfn);
>  		if (unlikely(is_migrate_isolate(migratetype))) {
>  			list_del(&page->lru);
>  			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);

I think after this change we should move the isolated pages handling to
the second loop below, so that we wouldn't have to call
get_pfnblock_migratetype() twice per page. Dunno yet if some later patch
does that. It would need to unlock pcp when necessary.

> @@ -2466,10 +2432,11 @@ void free_unref_page_list(struct list_head *list)
>  	}
>  
>  	list_for_each_entry_safe(page, next, list, lru) {
> +		unsigned long pfn = page_to_pfn(page);
>  		struct zone *zone = page_zone(page);
>  
>  		list_del(&page->lru);
> -		migratetype = get_pcppage_migratetype(page);
> +		migratetype = get_pfnblock_migratetype(page, pfn);
>  
>  		/*
>  		 * Either different zone requiring a different pcp lock or
> @@ -2492,7 +2459,7 @@ void free_unref_page_list(struct list_head *list)
>  			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
>  			if (unlikely(!pcp)) {
>  				pcp_trylock_finish(UP_flags);
> -				free_one_page(zone, page, page_to_pfn(page),
> +				free_one_page(zone, page, pfn,
>  					      0, migratetype, FPI_NONE);
>  				locked_zone = NULL;
>  				continue;
> @@ -2661,7 +2628,7 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
>  			}
>  		}
>  		__mod_zone_freepage_state(zone, -(1 << order),
> -					  get_pcppage_migratetype(page));
> +					  get_pageblock_migratetype(page));
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  	} while (check_new_pages(page, order));
>
Johannes Weiner Sept. 12, 2023, 2:50 p.m. UTC | #4
On Tue, Sep 12, 2023 at 03:47:45PM +0200, Vlastimil Babka wrote:
> On 9/11/23 21:41, Johannes Weiner wrote:
> > The idea behind the cache is to save get_pageblock_migratetype()
> > lookups during bulk freeing. A microbenchmark suggests this isn't
> > helping, though. The pcp migratetype can get stale, which means that
> > bulk freeing has an extra branch to check if the pageblock was
> > isolated while on the pcp.
> > 
> > While the variance overlaps, the cache write and the branch seem to
> > make this a net negative. The following test allocates and frees
> > batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
> > 
> > Before:
> >           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
> >                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
> >                  0      cpu-migrations                   #    0.000 /sec
> >             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
> >     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
> >    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
> >     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
> >         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
> > 
> >          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
> > 
> > After:
> >           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
> >                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
> >                  0      cpu-migrations                   #    0.000 /sec
> >             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
> >     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
> >    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
> >     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
> >         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
> > 
> >          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
> > 
> > A side effect is that this also ensures that pages whose pageblock
> > gets stolen while on the pcplist end up on the right freelist and we
> > don't perform potentially type-incompatible buddy merges (or skip
> > merges when we shouldn't), whis is likely beneficial to long-term
> > fragmentation management, although the effects would be harder to
> > measure. Settle for simpler and faster code as justification here.
> 
> Makes sense to me, so
> 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> > @@ -1577,7 +1556,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> >  			continue;
> >  		del_page_from_free_list(page, zone, current_order);
> >  		expand(zone, page, order, current_order, migratetype);
> > -		set_pcppage_migratetype(page, migratetype);
> 
> Hm interesting, just noticed that __rmqueue_fallback() never did this
> AFAICS, sounds like a bug.

I don't quite follow. Which part?

Keep in mind that at this point __rmqueue_fallback() doesn't return a
page. It just moves pages to the desired freelist, and then
__rmqueue_smallest() gets called again. This changes in 5/6, but until
now at least all of the above would apply to fallback pages.
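
To make the flow concrete, the allocation slow path at this point in the
series looks roughly like this (paraphrased from my reading of
__rmqueue(); a sketch, not a verbatim quote of the tree):

	/*
	 * __rmqueue_fallback() only converts a block and moves its pages
	 * onto the requested freelist; the page itself is still handed
	 * out by the subsequent __rmqueue_smallest() call, so the
	 * set_pcppage_migratetype() there covered fallback pages too.
	 */
retry:
	page = __rmqueue_smallest(zone, order, migratetype);
	if (unlikely(!page)) {
		if (alloc_flags & ALLOC_CMA)
			page = __rmqueue_cma_fallback(zone, order);
		if (!page && __rmqueue_fallback(zone, order, migratetype,
						alloc_flags))
			goto retry;
	}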

> > @@ -2145,7 +2123,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >  		 * pages are ordered properly.
> >  		 */
> >  		list_add_tail(&page->pcp_list, list);
> > -		if (is_migrate_cma(get_pcppage_migratetype(page)))
> > +		if (is_migrate_cma(get_pageblock_migratetype(page)))
> >  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> >  					      -(1 << order));
> 
> This is potentially a source of overhead, I assume patch 6/6 might
> change that.

Yes, 6/6 removes it altogether.

But the test results in this patch's changelog are from this patch in
isolation, so it doesn't appear to be a concern even on its own.

> > @@ -2457,7 +2423,7 @@ void free_unref_page_list(struct list_head *list)
> >  		 * Free isolated pages directly to the allocator, see
> >  		 * comment in free_unref_page.
> >  		 */
> > -		migratetype = get_pcppage_migratetype(page);
> > +		migratetype = get_pfnblock_migratetype(page, pfn);
> >  		if (unlikely(is_migrate_isolate(migratetype))) {
> >  			list_del(&page->lru);
> >  			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
> 
> I think after this change we should move the isolated pages handling to
> the second loop below, so that we wouldn't have to call
> get_pfnblock_migratetype() twice per page. Dunno yet if some later patch
> does that. It would need to unlock pcp when necessary.

That sounds like a great idea. Something like the following?

Lightly tested. If you're good with it, I'll beat some more on it and
submit it as a follow-up.

---

From 429d13322819ab38b3ba2fad6d1495997819ccc2 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 12 Sep 2023 10:16:10 -0400
Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()

Move direct freeing of isolated pages to the lock-breaking block in
the second loop. This saves an unnecessary migratetype reassessment.

Minor comment and local variable scoping cleanups.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 49 +++++++++++++++++++++----------------------------
 1 file changed, 21 insertions(+), 28 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3f1c777feed..9cad31de1bf5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2408,48 +2408,41 @@ void free_unref_page_list(struct list_head *list)
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
 	int batch_count = 0;
-	int migratetype;
-
-	/* Prepare pages for freeing */
-	list_for_each_entry_safe(page, next, list, lru) {
-		unsigned long pfn = page_to_pfn(page);
 
-		if (!free_pages_prepare(page, 0, FPI_NONE)) {
+	list_for_each_entry_safe(page, next, list, lru)
+		if (!free_pages_prepare(page, 0, FPI_NONE))
 			list_del(&page->lru);
-			continue;
-		}
-
-		/*
-		 * Free isolated pages directly to the allocator, see
-		 * comment in free_unref_page.
-		 */
-		migratetype = get_pfnblock_migratetype(page, pfn);
-		if (unlikely(is_migrate_isolate(migratetype))) {
-			list_del(&page->lru);
-			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
-			continue;
-		}
-	}
 
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
 		struct zone *zone = page_zone(page);
+		int migratetype;
 
 		list_del(&page->lru);
 		migratetype = get_pfnblock_migratetype(page, pfn);
 
 		/*
-		 * Either different zone requiring a different pcp lock or
-		 * excessive lock hold times when freeing a large list of
-		 * pages.
+		 * Zone switch, batch complete, or non-pcp freeing?
+		 * Drop the pcp lock and evaluate.
 		 */
-		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
+		if (unlikely(zone != locked_zone ||
+			     batch_count == SWAP_CLUSTER_MAX ||
+			     is_migrate_isolate(migratetype))) {
 			if (pcp) {
 				pcp_spin_unlock(pcp);
 				pcp_trylock_finish(UP_flags);
+				locked_zone = NULL;
 			}
 
-			batch_count = 0;
+			/*
+			 * Free isolated pages directly to the
+			 * allocator, see comment in free_unref_page.
+			 */
+			if (is_migrate_isolate(migratetype)) {
+				free_one_page(zone, page, pfn, 0,
+					      migratetype, FPI_NONE);
+				continue;
+			}
 
 			/*
 			 * trylock is necessary as pages may be getting freed
@@ -2459,12 +2452,12 @@ void free_unref_page_list(struct list_head *list)
 			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
-				free_one_page(zone, page, pfn,
-					      0, migratetype, FPI_NONE);
-				locked_zone = NULL;
+				free_one_page(zone, page, pfn, 0,
+					      migratetype, FPI_NONE);
 				continue;
 			}
 			locked_zone = zone;
+			batch_count = 0;
 		}
 
 		/*
Johannes Weiner Sept. 12, 2023, 3:03 p.m. UTC | #5
On Tue, Sep 12, 2023 at 03:47:45PM +0200, Vlastimil Babka wrote:
> I think after this change we should [...]

Speaking of follow-ups, AFAICS we no longer need those either:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cad31de1bf5..bea499fbca58 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1751,13 +1751,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 
 	old_block_type = get_pageblock_migratetype(page);
 
-	/*
-	 * This can happen due to races and we want to prevent broken
-	 * highatomic accounting.
-	 */
-	if (is_migrate_highatomic(old_block_type))
-		goto single_page;
-
 	/* Take ownership for orders >= pageblock_order */
 	if (current_order >= pageblock_order) {
 		change_pageblock_range(page, current_order, start_type);
@@ -1926,24 +1919,15 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 				continue;
 
 			/*
-			 * In page freeing path, migratetype change is racy so
-			 * we can counter several free pages in a pageblock
-			 * in this loop although we changed the pageblock type
-			 * from highatomic to ac->migratetype. So we should
-			 * adjust the count once.
+			 * It should never happen but changes to
+			 * locking could inadvertently allow a per-cpu
+			 * drain to add pages to MIGRATE_HIGHATOMIC
+			 * while unreserving so be safe and watch for
+			 * underflows.
 			 */
-			if (is_migrate_highatomic_page(page)) {
-				/*
-				 * It should never happen but changes to
-				 * locking could inadvertently allow a per-cpu
-				 * drain to add pages to MIGRATE_HIGHATOMIC
-				 * while unreserving so be safe and watch for
-				 * underflows.
-				 */
-				zone->nr_reserved_highatomic -= min(
-						pageblock_nr_pages,
-						zone->nr_reserved_highatomic);
-			}
+			zone->nr_reserved_highatomic -= min(
+				pageblock_nr_pages,
+				zone->nr_reserved_highatomic);
 
 			/*
 			 * Convert to ac->migratetype and avoid the normal

I think they were only in place because we could change the highatomic
status of pages on the pcplist, and those pages would then end up on
some other freelist due to the stale pcppage cache.

I replaced them locally with WARNs and ran an hour or so of kernel
builds under pressure. It didn't trigger. So I would send a follow up
to remove them.

Unless you can point me to a good reason why they're definitely still
needed - in which case this proposal is moot, but then we should make
the comments more specific.
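
For illustration, the local experiment mentioned above could look
roughly like this in steal_suitable_fallback() (a hypothetical sketch,
not the diff that was actually tested):

 	old_block_type = get_pageblock_migratetype(page);
 
-	/*
-	 * This can happen due to races and we want to prevent broken
-	 * highatomic accounting.
-	 */
-	if (is_migrate_highatomic(old_block_type))
-		goto single_page;
+	/* should be unreachable now that freeing uses the block type */
+	WARN_ON_ONCE(is_migrate_highatomic(old_block_type));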
Vlastimil Babka Sept. 13, 2023, 9:33 a.m. UTC | #6
On 9/12/23 16:50, Johannes Weiner wrote:
> On Tue, Sep 12, 2023 at 03:47:45PM +0200, Vlastimil Babka wrote:
>> On 9/11/23 21:41, Johannes Weiner wrote:
> 
>> > @@ -1577,7 +1556,6 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>> >  			continue;
>> >  		del_page_from_free_list(page, zone, current_order);
>> >  		expand(zone, page, order, current_order, migratetype);
>> > -		set_pcppage_migratetype(page, migratetype);
>> 
>> Hm interesting, just noticed that __rmqueue_fallback() never did this
>> AFAICS, sounds like a bug.
> 
> I don't quite follow. Which part?
> 
> Keep in mind that at this point __rmqueue_fallback() doesn't return a
> page. It just moves pages to the desired freelist, and then
> __rmqueue_smallest() gets called again. This changes in 5/6, but until
> now at least all of the above would apply to fallback pages.

Yep, missed that "doesn't return a page", thanks.

>> > @@ -2145,7 +2123,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>> >  		 * pages are ordered properly.
>> >  		 */
>> >  		list_add_tail(&page->pcp_list, list);
>> > -		if (is_migrate_cma(get_pcppage_migratetype(page)))
>> > +		if (is_migrate_cma(get_pageblock_migratetype(page)))
>> >  			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
>> >  					      -(1 << order));
>> 
>> This is potentially a source of overhead, I assume patch 6/6 might
>> change that.
> 
> Yes, 6/6 removes it altogether.
> 
> But the test results in this patch's changelog are from this patch in
> isolation, so it doesn't appear to be a concern even on its own.
> 
>> > @@ -2457,7 +2423,7 @@ void free_unref_page_list(struct list_head *list)
>> >  		 * Free isolated pages directly to the allocator, see
>> >  		 * comment in free_unref_page.
>> >  		 */
>> > -		migratetype = get_pcppage_migratetype(page);
>> > +		migratetype = get_pfnblock_migratetype(page, pfn);
>> >  		if (unlikely(is_migrate_isolate(migratetype))) {
>> >  			list_del(&page->lru);
>> >  			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
>> 
>> I think after this change we should move the isolated pages handling to
>> the second loop below, so that we wouldn't have to call
>> get_pfnblock_migratetype() twice per page. Dunno yet if some later patch
>> does that. It would need to unlock pcp when necessary.
> 
> That sounds like a great idea. Something like the following?
> 
> Lightly tested. If you're good with it, I'll beat some more on it and
> submit it as a follow-up.
> 
> ---
> 
> From 429d13322819ab38b3ba2fad6d1495997819ccc2 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Tue, 12 Sep 2023 10:16:10 -0400
> Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()
> 
> Move direct freeing of isolated pages to the lock-breaking block in
> the second loop. This saves an unnecessary migratetype reassessment.
> 
> Minor comment and local variable scoping cleanups.

Looks like batch_count and locked_zone could be moved to the loop scope as well.

> 
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 49 +++++++++++++++++++++----------------------------
>  1 file changed, 21 insertions(+), 28 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e3f1c777feed..9cad31de1bf5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2408,48 +2408,41 @@ void free_unref_page_list(struct list_head *list)
>  	struct per_cpu_pages *pcp = NULL;
>  	struct zone *locked_zone = NULL;
>  	int batch_count = 0;
> -	int migratetype;
> -
> -	/* Prepare pages for freeing */
> -	list_for_each_entry_safe(page, next, list, lru) {
> -		unsigned long pfn = page_to_pfn(page);
>  
> -		if (!free_pages_prepare(page, 0, FPI_NONE)) {
> +	list_for_each_entry_safe(page, next, list, lru)
> +		if (!free_pages_prepare(page, 0, FPI_NONE))
>  			list_del(&page->lru);
> -			continue;
> -		}
> -
> -		/*
> -		 * Free isolated pages directly to the allocator, see
> -		 * comment in free_unref_page.
> -		 */
> -		migratetype = get_pfnblock_migratetype(page, pfn);
> -		if (unlikely(is_migrate_isolate(migratetype))) {
> -			list_del(&page->lru);
> -			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
> -			continue;
> -		}
> -	}
>  
>  	list_for_each_entry_safe(page, next, list, lru) {
>  		unsigned long pfn = page_to_pfn(page);
>  		struct zone *zone = page_zone(page);
> +		int migratetype;
>  
>  		list_del(&page->lru);
>  		migratetype = get_pfnblock_migratetype(page, pfn);
>  
>  		/*
> -		 * Either different zone requiring a different pcp lock or
> -		 * excessive lock hold times when freeing a large list of
> -		 * pages.
> +		 * Zone switch, batch complete, or non-pcp freeing?
> +		 * Drop the pcp lock and evaluate.
>  		 */
> -		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
> +		if (unlikely(zone != locked_zone ||
> +			     batch_count == SWAP_CLUSTER_MAX ||
> +			     is_migrate_isolate(migratetype))) {
>  			if (pcp) {
>  				pcp_spin_unlock(pcp);
>  				pcp_trylock_finish(UP_flags);
> +				locked_zone = NULL;
>  			}
>  
> -			batch_count = 0;
> +			/*
> +			 * Free isolated pages directly to the
> +			 * allocator, see comment in free_unref_page.
> +			 */
> +			if (is_migrate_isolate(migratetype)) {
> +				free_one_page(zone, page, pfn, 0,
> +					      migratetype, FPI_NONE);
> +				continue;
> +			}
>  
>  			/*
>  			 * trylock is necessary as pages may be getting freed
> @@ -2459,12 +2452,12 @@ void free_unref_page_list(struct list_head *list)
>  			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
>  			if (unlikely(!pcp)) {
>  				pcp_trylock_finish(UP_flags);
> -				free_one_page(zone, page, pfn,
> -					      0, migratetype, FPI_NONE);
> -				locked_zone = NULL;
> +				free_one_page(zone, page, pfn, 0,
> +					      migratetype, FPI_NONE);
>  				continue;
>  			}
>  			locked_zone = zone;
> +			batch_count = 0;
>  		}
>  
>  		/*
Johannes Weiner Sept. 13, 2023, 1:24 p.m. UTC | #7
Hello Vlastimil,

On Wed, Sep 13, 2023 at 11:33:52AM +0200, Vlastimil Babka wrote:
> On 9/12/23 16:50, Johannes Weiner wrote:
> > From 429d13322819ab38b3ba2fad6d1495997819ccc2 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Tue, 12 Sep 2023 10:16:10 -0400
> > Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()
> > 
> > Move direct freeing of isolated pages to the lock-breaking block in
> > the second loop. This saves an unnecessary migratetype reassessment.
> > 
> > Minor comment and local variable scoping cleanups.
> 
> Looks like batch_count and locked_zone could be moved to the loop scope as well.

Hm they both maintain values over multiple iterations, so I don't
think that's possible. Am I missing something?
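
(A trivial standalone illustration of why: a counter that decides when
to break the batch has to survive from one iteration to the next, so it
can't be re-declared per iteration. Values and names below are made up.)

#include <stdio.h>

int main(void)
{
	int batch_count = 0;	/* carries state across iterations */

	for (int i = 0; i < 10; i++) {
		if (batch_count == 4) {	/* think SWAP_CLUSTER_MAX */
			printf("break batch before item %d\n", i);
			batch_count = 0;
		}
		batch_count++;
	}
	return 0;
}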

> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

Thanks! I'll send this out properly with your tag.
Vlastimil Babka Sept. 13, 2023, 1:34 p.m. UTC | #8
On 9/13/23 15:24, Johannes Weiner wrote:
> Hello Vlastimil,
> 
> On Wed, Sep 13, 2023 at 11:33:52AM +0200, Vlastimil Babka wrote:
>> On 9/12/23 16:50, Johannes Weiner wrote:
>> > From 429d13322819ab38b3ba2fad6d1495997819ccc2 Mon Sep 17 00:00:00 2001
>> > From: Johannes Weiner <hannes@cmpxchg.org>
>> > Date: Tue, 12 Sep 2023 10:16:10 -0400
>> > Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()
>> > 
>> > Move direct freeing of isolated pages to the lock-breaking block in
>> > the second loop. This saves an unnecessary migratetype reassessment.
>> > 
>> > Minor comment and local variable scoping cleanups.
>> 
>> Looks like batch_count and locked_zone could be moved to the loop scope as well.
> 
> Hm they both maintain values over multiple iterations, so I don't
> think that's possible. Am I missing something?

True, disregard :D

>> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
>> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>> 
>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Thanks! I'll send this out properly with your tag.

np!
Vlastimil Babka Sept. 14, 2023, 7:29 a.m. UTC | #9
On 9/12/23 17:03, Johannes Weiner wrote:
> On Tue, Sep 12, 2023 at 03:47:45PM +0200, Vlastimil Babka wrote:
>> I think after this change we should [...]
> 
> Speaking of follow-ups, AFAICS we no longer need those either:

Seems so, but the comments do talk about races, so once those are sorted out :)

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9cad31de1bf5..bea499fbca58 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1751,13 +1751,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
>  
>  	old_block_type = get_pageblock_migratetype(page);
>  
> -	/*
> -	 * This can happen due to races and we want to prevent broken
> -	 * highatomic accounting.
> -	 */
> -	if (is_migrate_highatomic(old_block_type))
> -		goto single_page;
> -
>  	/* Take ownership for orders >= pageblock_order */
>  	if (current_order >= pageblock_order) {
>  		change_pageblock_range(page, current_order, start_type);
> @@ -1926,24 +1919,15 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
>  				continue;
>  
>  			/*
> -			 * In page freeing path, migratetype change is racy so
> -			 * we can counter several free pages in a pageblock
> -			 * in this loop although we changed the pageblock type
> -			 * from highatomic to ac->migratetype. So we should
> -			 * adjust the count once.
> +			 * It should never happen but changes to
> +			 * locking could inadvertently allow a per-cpu
> +			 * drain to add pages to MIGRATE_HIGHATOMIC
> +			 * while unreserving so be safe and watch for
> +			 * underflows.
>  			 */
> -			if (is_migrate_highatomic_page(page)) {
> -				/*
> -				 * It should never happen but changes to
> -				 * locking could inadvertently allow a per-cpu
> -				 * drain to add pages to MIGRATE_HIGHATOMIC
> -				 * while unreserving so be safe and watch for
> -				 * underflows.
> -				 */
> -				zone->nr_reserved_highatomic -= min(
> -						pageblock_nr_pages,
> -						zone->nr_reserved_highatomic);
> -			}
> +			zone->nr_reserved_highatomic -= min(
> +				pageblock_nr_pages,
> +				zone->nr_reserved_highatomic);
>  
>  			/*
>  			 * Convert to ac->migratetype and avoid the normal
> 
> I think they were only in place because we could change the highatomic
> status of pages on the pcplist, and those pages would then end up on
> some other freelist due to the stale pcppage cache.
> 
> I replaced them locally with WARNs and ran an hour or so of kernel
> builds under pressure. It didn't trigger. So I would send a follow up
> to remove them.
> 
> Unless you point me to a good reason why they're definitely still
> needed - in which case this is a moot proposal - but then we should
> make the comments more specific.
Mel Gorman Sept. 14, 2023, 9:56 a.m. UTC | #10
On Mon, Sep 11, 2023 at 03:41:42PM -0400, Johannes Weiner wrote:
> The idea behind the cache is to save get_pageblock_migratetype()
> lookups during bulk freeing. A microbenchmark suggests this isn't
> helping, though. The pcp migratetype can get stale, which means that
> bulk freeing has an extra branch to check if the pageblock was
> isolated while on the pcp.
> 
> While the variance overlaps, the cache write and the branch seem to
> make this a net negative. The following test allocates and frees
> batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
> 
> Before:
>           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
> 
>          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
> 
> After:
>           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
> 
>          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
> 
> A side effect is that this also ensures that pages whose pageblock
> gets stolen while on the pcplist end up on the right freelist and we
> don't perform potentially type-incompatible buddy merges (or skip
> merges when we shouldn't), whis is likely beneficial to long-term
> fragmentation management, although the effects would be harder to
> measure. Settle for simpler and faster code as justification here.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I've no specific objection and other minor corrections have already been
suggested. I don't recall specifically but I think
get_pageblock_migratetype might have been called redundantly once upon a
time when there were concerns about page allocator overhead for
high-speed networking. Now that there is bulk allocation and the flow has
changed significantly, it's feasible to simply avoid calling
get_pageblock_migratetype unnecessarily.
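
For readers following along, that lookup boils down to reading a few
bits for the page's block out of a packed per-zone bitmap - one memory
access plus shift/mask per call. A self-contained model of the idea
(constants and names here are illustrative, not the kernel's):

#include <stdio.h>

#define BITS_PER_BLOCK	4UL			/* a few bits per pageblock */
#define MT_MASK		0x7UL			/* low bits hold the migratetype */
#define BITS_PER_LONG	(8 * sizeof(unsigned long))

static unsigned long pageblock_bits[1024];	/* stand-in for the zone bitmap */

static unsigned long block_migratetype(unsigned long block)
{
	unsigned long bitidx = block * BITS_PER_BLOCK;
	unsigned long word = pageblock_bits[bitidx / BITS_PER_LONG];

	return (word >> (bitidx % BITS_PER_LONG)) & MT_MASK;
}

int main(void)
{
	unsigned long bitidx = 5 * BITS_PER_BLOCK;

	/* mark block 5 with migratetype 2 */
	pageblock_bits[bitidx / BITS_PER_LONG] |= 2UL << (bitidx % BITS_PER_LONG);
	printf("block 5 migratetype: %lu\n", block_migratetype(5));
	return 0;
}

The series' position is that doing this once per page at free time is
cheap enough that caching the result in struct page (and keeping that
cache coherent) isn't worth it.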

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Huang, Ying Sept. 27, 2023, 5:42 a.m. UTC | #11
Johannes Weiner <hannes@cmpxchg.org> writes:

> The idea behind the cache is to save get_pageblock_migratetype()
> lookups during bulk freeing. A microbenchmark suggests this isn't
> helping, though. The pcp migratetype can get stale, which means that
> bulk freeing has an extra branch to check if the pageblock was
> isolated while on the pcp.
>
> While the variance overlaps, the cache write and the branch seem to
> make this a net negative. The following test allocates and frees
> batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
>
> Before:
>           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
>
>          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
>
> After:
>           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>                  0      cpu-migrations                   #    0.000 /sec
>             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
>
>          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
>
> A side effect is that this also ensures that pages whose pageblock
> gets stolen while on the pcplist end up on the right freelist and we
> don't perform potentially type-incompatible buddy merges (or skip
> merges when we shouldn't), whis is likely beneficial to long-term
> fragmentation management, although the effects would be harder to
> measure. Settle for simpler and faster code as justification here.

I suspected that the PCP allocating/freeing path may be influenced (that
is, the allocating/freeing batch is less than the PCP high mark).  So I
tested one-process will-it-scale/page_fault1 with sysctl
percpu_pagelist_high_fraction=8, so that pages will be allocated/freed
from/to the PCP only.  The test results are as follows,

Before:
will-it-scale.1.processes                        618364.3      (+-  0.075%)
perf-profile.children.get_pfnblock_flags_mask         0.13     (+-  9.350%)

After:
will-it-scale.1.processes                        616512.0      (+-  0.057%)
perf-profile.children.get_pfnblock_flags_mask         0.41     (+-  22.44%)

The change isn't large: -0.3%.  Perf profiling shows the cycles% of
get_pfnblock_flags_mask() increases.

--
Best Regards,
Huang, Ying
Johannes Weiner Sept. 27, 2023, 2:51 p.m. UTC | #12
On Wed, Sep 27, 2023 at 01:42:25PM +0800, Huang, Ying wrote:
> Johannes Weiner <hannes@cmpxchg.org> writes:
> 
> > The idea behind the cache is to save get_pageblock_migratetype()
> > lookups during bulk freeing. A microbenchmark suggests this isn't
> > helping, though. The pcp migratetype can get stale, which means that
> > bulk freeing has an extra branch to check if the pageblock was
> > isolated while on the pcp.
> >
> > While the variance overlaps, the cache write and the branch seem to
> > make this a net negative. The following test allocates and frees
> > batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
> >
> > Before:
> >           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
> >                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
> >                  0      cpu-migrations                   #    0.000 /sec
> >             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
> >     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
> >    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
> >     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
> >         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
> >
> >          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
> >
> > After:
> >           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
> >                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
> >                  0      cpu-migrations                   #    0.000 /sec
> >             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
> >     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
> >    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
> >     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
> >         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
> >
> >          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
> >
> > A side effect is that this also ensures that pages whose pageblock
> > gets stolen while on the pcplist end up on the right freelist and we
> > don't perform potentially type-incompatible buddy merges (or skip
> > merges when we shouldn't), whis is likely beneficial to long-term
> > fragmentation management, although the effects would be harder to
> > measure. Settle for simpler and faster code as justification here.
> 
> I suspected the PCP allocating/freeing path may be influenced (that is,
> allocating/freeing batch is less than PCP high).  So I tested
> one-process will-it-scale/page_fault1 with sysctl
> percpu_pagelist_high_fraction=8.  So pages will be allocated/freed
> from/to PCP only.  The test results are as follows,
> 
> Before:
> will-it-scale.1.processes                        618364.3      (+-  0.075%)
> perf-profile.children.get_pfnblock_flags_mask         0.13     (+-  9.350%)
> 
> After:
> will-it-scale.1.processes	                 616512.0      (+-  0.057%)
> perf-profile.children.get_pfnblock_flags_mask	      0.41     (+-  22.44%)
> 
> The change isn't large: -0.3%.  Perf profiling shows the cycles% of
> get_pfnblock_flags_mask() increases.

Ah, this is going through the free_unref_page_list() path that
Vlastimil had pointed out as well. I made another change on top that
eliminates the second lookup. After that, both pcp fast paths have the
same number of lookups as before: 1. This fixes the regression for me.

Would you mind confirming this as well?

--

From f5d032019ed832a1a50454347a33b00ca6abeb30 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 15 Sep 2023 16:03:24 -0400
Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()

Move direct freeing of isolated pages to the lock-breaking block in
the second loop. This saves an unnecessary migratetype reassessment.

Minor comment and local variable scoping cleanups.

Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 44 ++++++++++++++++++--------------------------
 1 file changed, 18 insertions(+), 26 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bfffc1af94cd..665930ffe22a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2466,48 +2466,40 @@ void free_unref_page_list(struct list_head *list)
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
 	int batch_count = 0;
-	int migratetype;
-
-	/* Prepare pages for freeing */
-	list_for_each_entry_safe(page, next, list, lru) {
-		unsigned long pfn = page_to_pfn(page);
-
-		if (!free_pages_prepare(page, 0, FPI_NONE)) {
-			list_del(&page->lru);
-			continue;
-		}
 
-		/*
-		 * Free isolated pages directly to the allocator, see
-		 * comment in free_unref_page.
-		 */
-		migratetype = get_pfnblock_migratetype(page, pfn);
-		if (unlikely(is_migrate_isolate(migratetype))) {
+	list_for_each_entry_safe(page, next, list, lru)
+		if (!free_pages_prepare(page, 0, FPI_NONE))
 			list_del(&page->lru);
-			free_one_page(page_zone(page), page, pfn, 0, FPI_NONE);
-			continue;
-		}
-	}
 
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
 		struct zone *zone = page_zone(page);
+		int migratetype;
 
 		list_del(&page->lru);
 		migratetype = get_pfnblock_migratetype(page, pfn);
 
 		/*
-		 * Either different zone requiring a different pcp lock or
-		 * excessive lock hold times when freeing a large list of
-		 * pages.
+		 * Zone switch, batch complete, or non-pcp freeing?
+		 * Drop the pcp lock and evaluate.
 		 */
-		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
+		if (unlikely(zone != locked_zone ||
+			     batch_count == SWAP_CLUSTER_MAX ||
+			     is_migrate_isolate(migratetype))) {
 			if (pcp) {
 				pcp_spin_unlock(pcp);
 				pcp_trylock_finish(UP_flags);
+				locked_zone = NULL;
 			}
 
-			batch_count = 0;
+			/*
+			 * Free isolated pages directly to the
+			 * allocator, see comment in free_unref_page.
+			 */
+			if (is_migrate_isolate(migratetype)) {
+				free_one_page(zone, page, pfn, 0, FPI_NONE);
+				continue;
+			}
 
 			/*
 			 * trylock is necessary as pages may be getting freed
@@ -2518,10 +2510,10 @@ void free_unref_page_list(struct list_head *list)
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
 				free_one_page(zone, page, pfn, 0, FPI_NONE);
-				locked_zone = NULL;
 				continue;
 			}
 			locked_zone = zone;
+			batch_count = 0;
 		}
 
 		/*
Huang, Ying Sept. 30, 2023, 4:26 a.m. UTC | #13
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Wed, Sep 27, 2023 at 01:42:25PM +0800, Huang, Ying wrote:
>> Johannes Weiner <hannes@cmpxchg.org> writes:
>> 
>> > The idea behind the cache is to save get_pageblock_migratetype()
>> > lookups during bulk freeing. A microbenchmark suggests this isn't
>> > helping, though. The pcp migratetype can get stale, which means that
>> > bulk freeing has an extra branch to check if the pageblock was
>> > isolated while on the pcp.
>> >
>> > While the variance overlaps, the cache write and the branch seem to
>> > make this a net negative. The following test allocates and frees
>> > batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
>> >
>> > Before:
>> >           8,668.48 msec task-clock                       #   99.735 CPUs utilized               ( +-  2.90% )
>> >                 19      context-switches                 #    4.341 /sec                        ( +-  3.24% )
>> >                  0      cpu-migrations                   #    0.000 /sec
>> >             17,440      page-faults                      #    3.984 K/sec                       ( +-  2.90% )
>> >     41,758,692,473      cycles                           #    9.541 GHz                         ( +-  2.90% )
>> >    126,201,294,231      instructions                     #    5.98  insn per cycle              ( +-  2.90% )
>> >     25,348,098,335      branches                         #    5.791 G/sec                       ( +-  2.90% )
>> >         33,436,921      branch-misses                    #    0.26% of all branches             ( +-  2.90% )
>> >
>> >          0.0869148 +- 0.0000302 seconds time elapsed  ( +-  0.03% )
>> >
>> > After:
>> >           8,444.81 msec task-clock                       #   99.726 CPUs utilized               ( +-  2.90% )
>> >                 22      context-switches                 #    5.160 /sec                        ( +-  3.23% )
>> >                  0      cpu-migrations                   #    0.000 /sec
>> >             17,443      page-faults                      #    4.091 K/sec                       ( +-  2.90% )
>> >     40,616,738,355      cycles                           #    9.527 GHz                         ( +-  2.90% )
>> >    126,383,351,792      instructions                     #    6.16  insn per cycle              ( +-  2.90% )
>> >     25,224,985,153      branches                         #    5.917 G/sec                       ( +-  2.90% )
>> >         32,236,793      branch-misses                    #    0.25% of all branches             ( +-  2.90% )
>> >
>> >          0.0846799 +- 0.0000412 seconds time elapsed  ( +-  0.05% )
>> >
>> > A side effect is that this also ensures that pages whose pageblock
>> > gets stolen while on the pcplist end up on the right freelist and we
>> > don't perform potentially type-incompatible buddy merges (or skip
>> > merges when we shouldn't), whis is likely beneficial to long-term
>> > fragmentation management, although the effects would be harder to
>> > measure. Settle for simpler and faster code as justification here.
>> 
>> I suspected the PCP allocating/freeing path may be influenced (that is,
>> allocating/freeing batch is less than PCP high).  So I tested
>> one-process will-it-scale/page_fault1 with sysctl
>> percpu_pagelist_high_fraction=8.  So pages will be allocated/freed
>> from/to PCP only.  The test results are as follows,
>> 
>> Before:
>> will-it-scale.1.processes                        618364.3      (+-  0.075%)
>> perf-profile.children.get_pfnblock_flags_mask         0.13     (+-  9.350%)
>> 
>> After:
>> will-it-scale.1.processes	                 616512.0      (+-  0.057%)
>> perf-profile.children.get_pfnblock_flags_mask	      0.41     (+-  22.44%)
>> 
>> The change isn't large: -0.3%.  Perf profiling shows the cycles% of
>> get_pfnblock_flags_mask() increases.
>
> Ah, this is going through the free_unref_page_list() path that
> Vlastimil had pointed out as well. I made another change on top that
> eliminates the second lookup. After that, both pcp fast paths have the
> same number of lookups as before: 1. This fixes the regression for me.
>
> Would you mind confirming this as well?

I have done more tests for the series and addon patches.  The test
results are as follows,

base
perf-profile.children.get_pfnblock_flags_mask	     0.15	(+- 32.62%)
will-it-scale.1.processes			618621.7	(+-  0.18%)

mm: page_alloc: remove pcppage migratetype caching
perf-profile.children.get_pfnblock_flags_mask	     0.40	(+- 21.55%)
will-it-scale.1.processes			616350.3	(+-  0.27%)

mm: page_alloc: fix up block types when merging compatible blocks
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+-  8.36%)
will-it-scale.1.processes			617121.0	(+-  0.17%)

mm: page_alloc: move free pages when converting block during isolation
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 15.10%)
will-it-scale.1.processes			615578.0	(+-  0.18%)

mm: page_alloc: fix move_freepages_block() range error
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 12.78%)
will-it-scale.1.processes			615364.7	(+-  0.27%)

mm: page_alloc: fix freelist movement during block conversion
perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 10.52%)
will-it-scale.1.processes			617834.8	(+-  0.52%)

mm: page_alloc: consolidate free page accounting
perf-profile.children.get_pfnblock_flags_mask	     0.39	(+-  8.27%)
will-it-scale.1.processes			621000.0	(+-  0.13%)

mm: page_alloc: close migratetype race between freeing and stealing
perf-profile.children.get_pfnblock_flags_mask	     0.37	(+-  5.87%)
will-it-scale.1.processes			618378.8	(+-  0.17%)

mm: page_alloc: optimize free_unref_page_list()
perf-profile.children.get_pfnblock_flags_mask	     0.20	(+- 14.96%)
will-it-scale.1.processes			618136.3	(+-  0.16%)

It seems that the will-it-scale score is influenced by some other
factors too.  But anyway, the series + addon patches restore the score
of will-it-scale.  And the cycles% of get_pfnblock_flags_mask() is
almost restored by the final patch (mm: page_alloc: optimize
free_unref_page_list()).

Feel free to add my "Tested-by" for these patches.

--
Best Regards,
Huang, Ying
Johannes Weiner Oct. 2, 2023, 2:58 p.m. UTC | #14
On Sat, Sep 30, 2023 at 12:26:01PM +0800, Huang, Ying wrote:
> I have done more tests for the series and the addon patches.  The test
> results are as follows,
> 
> base
> perf-profile.children.get_pfnblock_flags_mask	     0.15	(+- 32.62%)
> will-it-scale.1.processes			618621.7	(+-  0.18%)
> 
> mm: page_alloc: remove pcppage migratetype caching
> perf-profile.children.get_pfnblock_flags_mask	     0.40	(+- 21.55%)
> will-it-scale.1.processes			616350.3	(+-  0.27%)
> 
> mm: page_alloc: fix up block types when merging compatible blocks
> perf-profile.children.get_pfnblock_flags_mask	     0.36	(+-  8.36%)
> will-it-scale.1.processes			617121.0	(+-  0.17%)
> 
> mm: page_alloc: move free pages when converting block during isolation
> perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 15.10%)
> will-it-scale.1.processes			615578.0	(+-  0.18%)
> 
> mm: page_alloc: fix move_freepages_block() range error
> perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 12.78%)
> will-it-scale.1.processes			615364.7	(+-  0.27%)
> 
> mm: page_alloc: fix freelist movement during block conversion
> perf-profile.children.get_pfnblock_flags_mask	     0.36	(+- 10.52%)
> will-it-scale.1.processes			617834.8	(+-  0.52%)
> 
> mm: page_alloc: consolidate free page accounting
> perf-profile.children.get_pfnblock_flags_mask	     0.39	(+-  8.27%)
> will-it-scale.1.processes			621000.0	(+-  0.13%)
> 
> mm: page_alloc: close migratetype race between freeing and stealing
> perf-profile.children.get_pfnblock_flags_mask	     0.37	(+-  5.87%)
> will-it-scale.1.processes			618378.8	(+-  0.17%)
> 
> mm: page_alloc: optimize free_unref_page_list()
> perf-profile.children.get_pfnblock_flags_mask	     0.20	(+- 14.96%)
> will-it-scale.1.processes			618136.3	(+-  0.16%)
> 
> It seems that the will-it-scale score is influenced by some other
> factors too.  In any case, the series plus the addon patches restores
> the will-it-scale score, and the cycles% of get_pfnblock_flags_mask()
> is almost fully restored by the final patch (mm: page_alloc: optimize
> free_unref_page_list()).
> 
> Feel free to add my "Tested-by" for these patches.

Thanks, I'll add those!
diff mbox series

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f376302..e3f1c777feed 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -204,24 +204,6 @@  EXPORT_SYMBOL(node_states);
 
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
-/*
- * A cached value of the page's pageblock's migratetype, used when the page is
- * put on a pcplist. Used to avoid the pageblock migratetype lookup when
- * freeing from pcplists in most cases, at the cost of possibly becoming stale.
- * Also the migratetype set in the page does not necessarily match the pcplist
- * index, e.g. page might have MIGRATE_CMA set but be on a pcplist with any
- * other index - this ensures that it will be put on the correct CMA freelist.
- */
-static inline int get_pcppage_migratetype(struct page *page)
-{
-	return page->index;
-}
-
-static inline void set_pcppage_migratetype(struct page *page, int migratetype)
-{
-	page->index = migratetype;
-}
-
 #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
 unsigned int pageblock_order __read_mostly;
 #endif
@@ -1186,7 +1168,6 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 {
 	unsigned long flags;
 	unsigned int order;
-	bool isolated_pageblocks;
 	struct page *page;
 
 	/*
@@ -1199,7 +1180,6 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 	pindex = pindex - 1;
 
 	spin_lock_irqsave(&zone->lock, flags);
-	isolated_pageblocks = has_isolate_pageblock(zone);
 
 	while (count > 0) {
 		struct list_head *list;
@@ -1215,10 +1195,12 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 		order = pindex_to_order(pindex);
 		nr_pages = 1 << order;
 		do {
+			unsigned long pfn;
 			int mt;
 
 			page = list_last_entry(list, struct page, pcp_list);
-			mt = get_pcppage_migratetype(page);
+			pfn = page_to_pfn(page);
+			mt = get_pfnblock_migratetype(page, pfn);
 
 			/* must delete to avoid corrupting pcp list */
 			list_del(&page->pcp_list);
@@ -1227,11 +1209,8 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 
 			/* MIGRATE_ISOLATE page should not go to pcplists */
 			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-			/* Pageblock could have been isolated meanwhile */
-			if (unlikely(isolated_pageblocks))
-				mt = get_pageblock_migratetype(page);
 
-			__free_one_page(page, page_to_pfn(page), zone, order, mt, FPI_NONE);
+			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
 			trace_mm_page_pcpu_drain(page, order, mt);
 		} while (count > 0 && !list_empty(list));
 	}
@@ -1577,7 +1556,6 @@  struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 			continue;
 		del_page_from_free_list(page, zone, current_order);
 		expand(zone, page, order, current_order, migratetype);
-		set_pcppage_migratetype(page, migratetype);
 		trace_mm_page_alloc_zone_locked(page, order, migratetype,
 				pcp_allowed_order(order) &&
 				migratetype < MIGRATE_PCPTYPES);
@@ -2145,7 +2123,7 @@  static int rmqueue_bulk(struct zone *zone, unsigned int order,
 		 * pages are ordered properly.
 		 */
 		list_add_tail(&page->pcp_list, list);
-		if (is_migrate_cma(get_pcppage_migratetype(page)))
+		if (is_migrate_cma(get_pageblock_migratetype(page)))
 			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
 					      -(1 << order));
 	}
@@ -2304,19 +2282,6 @@  void drain_all_pages(struct zone *zone)
 	__drain_all_pages(zone, false);
 }
 
-static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
-							unsigned int order)
-{
-	int migratetype;
-
-	if (!free_pages_prepare(page, order, FPI_NONE))
-		return false;
-
-	migratetype = get_pfnblock_migratetype(page, pfn);
-	set_pcppage_migratetype(page, migratetype);
-	return true;
-}
-
 static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 {
 	int min_nr_free, max_nr_free;
@@ -2402,7 +2367,7 @@  void free_unref_page(struct page *page, unsigned int order)
 	unsigned long pfn = page_to_pfn(page);
 	int migratetype, pcpmigratetype;
 
-	if (!free_unref_page_prepare(page, pfn, order))
+	if (!free_pages_prepare(page, order, FPI_NONE))
 		return;
 
 	/*
@@ -2412,7 +2377,7 @@  void free_unref_page(struct page *page, unsigned int order)
 	 * get those areas back if necessary. Otherwise, we may have to free
 	 * excessively into the page allocator
 	 */
-	migratetype = pcpmigratetype = get_pcppage_migratetype(page);
+	migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
 			free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
@@ -2448,7 +2413,8 @@  void free_unref_page_list(struct list_head *list)
 	/* Prepare pages for freeing */
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
-		if (!free_unref_page_prepare(page, pfn, 0)) {
+
+		if (!free_pages_prepare(page, 0, FPI_NONE)) {
 			list_del(&page->lru);
 			continue;
 		}
@@ -2457,7 +2423,7 @@  void free_unref_page_list(struct list_head *list)
 		 * Free isolated pages directly to the allocator, see
 		 * comment in free_unref_page.
 		 */
-		migratetype = get_pcppage_migratetype(page);
+		migratetype = get_pfnblock_migratetype(page, pfn);
 		if (unlikely(is_migrate_isolate(migratetype))) {
 			list_del(&page->lru);
 			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
@@ -2466,10 +2432,11 @@  void free_unref_page_list(struct list_head *list)
 	}
 
 	list_for_each_entry_safe(page, next, list, lru) {
+		unsigned long pfn = page_to_pfn(page);
 		struct zone *zone = page_zone(page);
 
 		list_del(&page->lru);
-		migratetype = get_pcppage_migratetype(page);
+		migratetype = get_pfnblock_migratetype(page, pfn);
 
 		/*
 		 * Either different zone requiring a different pcp lock or
@@ -2492,7 +2459,7 @@  void free_unref_page_list(struct list_head *list)
 			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
-				free_one_page(zone, page, page_to_pfn(page),
+				free_one_page(zone, page, pfn,
 					      0, migratetype, FPI_NONE);
 				locked_zone = NULL;
 				continue;
@@ -2661,7 +2628,7 @@  struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 			}
 		}
 		__mod_zone_freepage_state(zone, -(1 << order),
-					  get_pcppage_migratetype(page));
+					  get_pageblock_migratetype(page));
 		spin_unlock_irqrestore(&zone->lock, flags);
 	} while (check_new_pages(page, order));