[V4,00/10] mm: page_alloc: freelist migratetype hygiene

Message ID 20240320180429.678181-1-hannes@cmpxchg.org

Message

Johannes Weiner March 20, 2024, 6:02 p.m. UTC
V4:
- fixed !pcp_order_allowed() case in free_unref_folios()
- reworded the patch 0 changelog a bit for the git log
- rebased to mm-everything-2024-03-19-23-01
- runtime-tested again with various CONFIG_DEBUG_FOOs enabled

---

The page allocator's mobility grouping is intended to keep unmovable
pages separate from reclaimable/compactable ones to allow on-demand
defragmentation for higher-order allocations and huge pages.

Currently, there are several places where accidental type mixing
occurs: an allocation asks for a page of a certain migratetype and
receives another. This ruins pageblocks for compaction, which in turn
makes allocating huge pages more expensive and less reliable.

The series addresses those causes. The last patch adds type checks on
all freelist movements to prevent new violations being introduced.
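
To illustrate the idea behind those checks, here is a minimal userspace
sketch (not the kernel implementation; block_type[] and move_to_freelist()
are made-up names for this example): every freelist movement passes the
migratetype the caller believes the block has, and the helper cross-checks
that claim against the type recorded for the pageblock, so a violation is
flagged at the point where it happens instead of silently polluting a
freelist:

#include <stdio.h>

enum migratetype {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_TYPES
};

/* toy model: one recorded type per pageblock */
static enum migratetype block_type[4] = {
	MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE
};

/* toy freelists: just a free-page count per type */
static unsigned long nr_free[MIGRATE_TYPES];

/* the caller states which list it believes it is moving from; verify it */
static void move_to_freelist(int block, enum migratetype old_mt,
			     enum migratetype new_mt)
{
	if (block_type[block] != old_mt)
		fprintf(stderr, "type violation: block %d is %d, caller claimed %d\n",
			block, block_type[block], old_mt);

	nr_free[old_mt]--;		/* leave the old type's list */
	nr_free[new_mt]++;		/* join the new type's list */
	block_type[block] = new_mt;	/* keep the block's record in sync */
}

int main(void)
{
	nr_free[MIGRATE_MOVABLE] = 2;

	/* consistent move: no warning */
	move_to_freelist(1, MIGRATE_MOVABLE, MIGRATE_UNMOVABLE);

	/* buggy caller: claims block 0 is movable although it is not */
	move_to_freelist(0, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE);

	return 0;
}

The series wires an equivalent invariant check into the page allocator's
freelist helpers themselves, so all callers are covered without having to
audit each one individually.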

The benefits can be seen in a mixed workload that stresses the machine
with a memcache-type workload and a kernel build job while
periodically attempting to allocate batches of THP. The following data
is aggregated over 50 consecutive defconfig builds:

                                                        VANILLA                 PATCHED
Hugealloc Time mean                      165843.93 (    +0.00%)  113025.88 (   -31.85%)
Hugealloc Time stddev                    158957.35 (    +0.00%)  114716.07 (   -27.83%)
Kbuild Real time                            310.24 (    +0.00%)     300.73 (    -3.06%)
Kbuild User time                           1271.13 (    +0.00%)    1259.42 (    -0.92%)
Kbuild System time                          582.02 (    +0.00%)     559.79 (    -3.81%)
THP fault alloc                           30585.14 (    +0.00%)   40853.62 (   +33.57%)
THP fault fallback                        36626.46 (    +0.00%)   26357.62 (   -28.04%)
THP fault fail rate %                        54.49 (    +0.00%)      39.22 (   -27.53%)
Pagealloc fallback                         1328.00 (    +0.00%)       1.00 (   -99.85%)
Pagealloc type mismatch                  181009.50 (    +0.00%)       0.00 (  -100.00%)
Direct compact stall                        434.56 (    +0.00%)     257.66 (   -40.61%)
Direct compact fail                         421.70 (    +0.00%)     249.94 (   -40.63%)
Direct compact success                       12.86 (    +0.00%)       7.72 (   -37.09%)
Direct compact success rate %                 2.86 (    +0.00%)       2.82 (    -0.96%)
Compact daemon scanned migrate          3370059.62 (    +0.00%) 3612054.76 (    +7.18%)
Compact daemon scanned free             7718439.20 (    +0.00%) 5386385.02 (   -30.21%)
Compact direct scanned migrate           309248.62 (    +0.00%)  176721.04 (   -42.85%)
Compact direct scanned free              433582.84 (    +0.00%)  315727.66 (   -27.18%)
Compact migrate scanned daemon %             91.20 (    +0.00%)      94.48 (    +3.56%)
Compact free scanned daemon %                94.58 (    +0.00%)      94.42 (    -0.16%)
Compact total migrate scanned           3679308.24 (    +0.00%) 3788775.80 (    +2.98%)
Compact total free scanned              8152022.04 (    +0.00%) 5702112.68 (   -30.05%)
Alloc stall                                 872.04 (    +0.00%)    5156.12 (  +490.71%)
Pages kswapd scanned                     510645.86 (    +0.00%)    3394.94 (   -99.33%)
Pages kswapd reclaimed                   134811.62 (    +0.00%)    2701.26 (   -98.00%)
Pages direct scanned                      99546.06 (    +0.00%)  376407.52 (  +278.12%)
Pages direct reclaimed                    62123.40 (    +0.00%)  289535.70 (  +366.06%)
Pages total scanned                      610191.92 (    +0.00%)  379802.46 (   -37.76%)
Pages scanned kswapd %                       76.36 (    +0.00%)       0.10 (   -98.58%)
Swap out                                  12057.54 (    +0.00%)   15022.98 (   +24.59%)
Swap in                                     209.16 (    +0.00%)     256.48 (   +22.52%)
File refaults                             17701.64 (    +0.00%)   11765.40 (   -33.53%)

Huge page success rate is higher, and allocation latencies are shorter and
more predictable.

Stealing (fallback) rate is drastically reduced. Notably, while the
vanilla kernel keeps doing fallbacks on an ongoing basis, the patched
kernel enters a steady state once the distribution of block types is
adequate for the workload. Steals over 50 runs:

VANILLA         PATCHED
1504.0		227.0
1557.0		6.0
1391.0		13.0
1080.0		26.0
1057.0		40.0
1156.0		6.0
805.0		46.0
736.0		20.0
1747.0		2.0
1699.0		34.0
1269.0		13.0
1858.0		12.0
907.0		4.0
727.0		2.0
563.0		2.0
3094.0		2.0
10211.0		3.0
2621.0		1.0
5508.0		2.0
1060.0		2.0
538.0		3.0
5773.0		2.0
2199.0		0.0
3781.0		2.0
1387.0		1.0
4977.0		0.0
2865.0		1.0
1814.0		1.0
3739.0		1.0
6857.0		0.0
382.0		0.0
407.0		1.0
3784.0		0.0
297.0		0.0
298.0		0.0
6636.0		0.0
4188.0		0.0
242.0		0.0
9960.0		0.0
5816.0		0.0
354.0		0.0
287.0		0.0
261.0		0.0
140.0		1.0
2065.0		0.0
312.0		0.0
331.0		0.0
164.0		0.0
465.0		1.0
219.0		0.0

Type mismatches are down too. Those count every time an allocation
request asks for one migratetype and gets another. This can still
occur minimally in the patched kernel due to non-stealing fallbacks,
but it's quite rare and follows the pattern of overall fallbacks -
once the block type distribution settles, mismatches cease as well:

VANILLA:        PATCHED:
182602.0	268.0
135794.0	20.0
88619.0		19.0
95973.0		0.0
129590.0	0.0
129298.0	0.0
147134.0	0.0
230854.0	0.0
239709.0	0.0
137670.0	0.0
132430.0	0.0
65712.0		0.0
57901.0		0.0
67506.0		0.0
63565.0		4.0
34806.0		0.0
42962.0		0.0
32406.0		0.0
38668.0		0.0
61356.0		0.0
57800.0		0.0
41435.0		0.0
83456.0		0.0
65048.0		0.0
28955.0		0.0
47597.0		0.0
75117.0		0.0
55564.0		0.0
38280.0		0.0
52404.0		0.0
26264.0		0.0
37538.0		0.0
19671.0		0.0
30936.0		0.0
26933.0		0.0
16962.0		0.0
44554.0		0.0
46352.0		0.0
24995.0		0.0
35152.0		0.0
12823.0		0.0
21583.0		0.0
18129.0		0.0
31693.0		0.0
28745.0		0.0
33308.0		0.0
31114.0		0.0
35034.0		0.0
12111.0		0.0
24885.0		0.0

Compaction work is markedly reduced despite much better THP rates.

In the vanilla kernel, reclaim seems to have been driven primarily by
watermark boosting that happens as a result of fallbacks. With those
all but eliminated, watermarks average lower and kswapd does less
work. The uptick in direct reclaim is because THP requests have to
fend for themselves more often - which is intended policy right
now. Aggregate reclaim activity is lowered significantly, though.

---

V3:
- fixed freelist type violations from non-atomic page isolation
  updates (Zi Yan)
- fixed incorrect migratetype update ordering during merge (Vlastimil Babka)
- reject moving a zone-straddling block altogether (Vlastimil Babka)
- fixed freelist type violations from lockless migratetype lookups in
  cornercase freeing paths (Vlastimil Babka)
- fixed erroneous WARN in the bulk freeing path that was intended to catch
  mistakes in the now-removed pcpcache (Mike Kravetz)
- fixed typo in patch 1's changelog (Zi Yan)
- optimized migratetype lookup in free_unref_page_list() (Vlastimil Babka)
- batched vmstat updates in page merging hot path (Vlastimil Babka)
- rebased to mm-everything-2024-03-05-20-43 (v6.8-rc5+)

V2:
- dropped the get_pfnblock_migratetype() optimization
  patchlet since somebody else beat me to it (thanks Zi)
- broke out pcp bypass fix since somebody else reported the bug:
  https://lore.kernel.org/linux-mm/20230911181108.GA104295@cmpxchg.org/
- fixed the CONFIG_UNACCEPTED_MEMORY build (lkp)
- rebased to v6.6-rc1

 include/linux/mm.h             |  18 +-
 include/linux/page-isolation.h |   5 +-
 include/linux/vmstat.h         |   8 -
 mm/debug_page_alloc.c          |  12 +-
 mm/internal.h                  |   9 -
 mm/page_alloc.c                | 650 +++++++++++++++++++++------------------
 mm/page_isolation.c            | 122 +++-----
 7 files changed, 415 insertions(+), 409 deletions(-)

Based on mm-everything-2024-03-19-23-01.

Comments

Vlastimil Babka March 27, 2024, 9:30 a.m. UTC | #1
On 3/20/24 7:02 PM, Johannes Weiner wrote:
> [...]

Great stuff. What would you say to the following on top?

----8<----
From 84f8a6d3a9e34c7ed8b438c3152d56e359a4ffb4 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Wed, 27 Mar 2024 10:19:47 +0100
Subject: [PATCH] mm: page_alloc: change move_freepages() to
 __move_freepages_block()

The function is now supposed to be called only on a single pageblock and
checks start_pfn and end_pfn accordingly. Rename it to make this more
obvious and drop the end_pfn parameter which can be determined trivially
and none of the callers use it for anything else.

Also make the (now internal) end_pfn exclusive, which is more common.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 34c84ef16b66..75aefbd52ef9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1566,18 +1566,18 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
  * Change the type of a block and move all its free pages to that
  * type's freelist.
  */
-static int move_freepages(struct zone *zone, unsigned long start_pfn,
-			  unsigned long end_pfn, int old_mt, int new_mt)
+static int __move_freepages_block(struct zone *zone, unsigned long start_pfn,
+				  int old_mt, int new_mt)
 {
 	struct page *page;
-	unsigned long pfn;
+	unsigned long pfn, end_pfn;
 	unsigned int order;
 	int pages_moved = 0;
 
 	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
-	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
+	end_pfn = pageblock_end_pfn(start_pfn);
 
-	for (pfn = start_pfn; pfn <= end_pfn;) {
+	for (pfn = start_pfn; pfn < end_pfn;) {
 		page = pfn_to_page(pfn);
 		if (!PageBuddy(page)) {
 			pfn++;
@@ -1603,14 +1603,13 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 
 static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 				      unsigned long *start_pfn,
-				      unsigned long *end_pfn,
 				      int *num_free, int *num_movable)
 {
 	unsigned long pfn, start, end;
 
 	pfn = page_to_pfn(page);
 	start = pageblock_start_pfn(pfn);
-	end = pageblock_end_pfn(pfn) - 1;
+	end = pageblock_end_pfn(pfn);
 
 	/*
 	 * The caller only has the lock for @zone, don't touch ranges
@@ -1621,16 +1620,15 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 	 */
 	if (!zone_spans_pfn(zone, start))
 		return false;
-	if (!zone_spans_pfn(zone, end))
+	if (!zone_spans_pfn(zone, end - 1))
 		return false;
 
 	*start_pfn = start;
-	*end_pfn = end;
 
 	if (num_free) {
 		*num_free = 0;
 		*num_movable = 0;
-		for (pfn = start; pfn <= end;) {
+		for (pfn = start; pfn < end;) {
 			page = pfn_to_page(pfn);
 			if (PageBuddy(page)) {
 				int nr = 1 << buddy_order(page);
@@ -1656,13 +1654,12 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 static int move_freepages_block(struct zone *zone, struct page *page,
 				int old_mt, int new_mt)
 {
-	unsigned long start_pfn, end_pfn;
+	unsigned long start_pfn;
 
-	if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
-				       NULL, NULL))
+	if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL))
 		return -1;
 
-	return move_freepages(zone, start_pfn, end_pfn, old_mt, new_mt);
+	return __move_freepages_block(zone, start_pfn, old_mt, new_mt);
 }
 
 #ifdef CONFIG_MEMORY_ISOLATION
@@ -1733,10 +1730,9 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 bool move_freepages_block_isolate(struct zone *zone, struct page *page,
 				  int migratetype)
 {
-	unsigned long start_pfn, end_pfn, pfn;
+	unsigned long start_pfn, pfn;
 
-	if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
-				       NULL, NULL))
+	if (!prep_move_freepages_block(zone, page, &start_pfn, NULL, NULL))
 		return false;
 
 	/* No splits needed if buddies can't span multiple blocks */
@@ -1767,8 +1763,9 @@ bool move_freepages_block_isolate(struct zone *zone, struct page *page,
 		return true;
 	}
 move:
-	move_freepages(zone, start_pfn, end_pfn,
-		       get_pfnblock_migratetype(page, start_pfn), migratetype);
+	__move_freepages_block(zone, start_pfn,
+			       get_pfnblock_migratetype(page, start_pfn),
+			       migratetype);
 	return true;
 }
 #endif /* CONFIG_MEMORY_ISOLATION */
@@ -1868,7 +1865,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
 			unsigned int alloc_flags, bool whole_block)
 {
 	int free_pages, movable_pages, alike_pages;
-	unsigned long start_pfn, end_pfn;
+	unsigned long start_pfn;
 	int block_type;
 
 	block_type = get_pageblock_migratetype(page);
@@ -1901,8 +1898,8 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
 		goto single_page;
 
 	/* moving whole block can fail due to zone boundary conditions */
-	if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
-				       &free_pages, &movable_pages))
+	if (!prep_move_freepages_block(zone, page, &start_pfn, &free_pages,
+				       &movable_pages))
 		goto single_page;
 
 	/*
@@ -1932,7 +1929,7 @@ steal_suitable_fallback(struct zone *zone, struct page *page,
 	 */
 	if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
 			page_group_by_mobility_disabled) {
-		move_freepages(zone, start_pfn, end_pfn, block_type, start_type);
+		__move_freepages_block(zone, start_pfn, block_type, start_type);
 		return __rmqueue_smallest(zone, order, start_type);
 	}
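
To make the boundary convention above concrete, here is a minimal
userspace sketch (assuming a 512-page pageblock; block_start_pfn() and
block_end_pfn() are illustrative stand-ins rather than the kernel's
pageblock_start_pfn()/pageblock_end_pfn()): with an exclusive end pfn,
the iteration uses the usual half-open '<' bound instead of '<=':

#include <stdio.h>

#define PAGEBLOCK_NR_PAGES 512UL	/* assumption: pageblock_order == 9 */

/* round a pfn down to the first pfn of its pageblock */
static unsigned long block_start_pfn(unsigned long pfn)
{
	return pfn & ~(PAGEBLOCK_NR_PAGES - 1);
}

/* exclusive end: the first pfn of the *next* pageblock */
static unsigned long block_end_pfn(unsigned long pfn)
{
	return block_start_pfn(pfn) + PAGEBLOCK_NR_PAGES;
}

int main(void)
{
	unsigned long pfn = 1234;
	unsigned long start = block_start_pfn(pfn);
	unsigned long end = block_end_pfn(pfn);	/* exclusive */
	unsigned long n = 0;

	for (unsigned long p = start; p < end; p++)	/* '<', not '<=' */
		n++;

	printf("block of pfn %lu: [%lu, %lu), %lu pages\n", pfn, start, end, n);
	return 0;
}
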
Zi Yan March 27, 2024, 1:10 p.m. UTC | #2
On 27 Mar 2024, at 5:30, Vlastimil Babka wrote:

> On 3/20/24 7:02 PM, Johannes Weiner wrote:
>> [...]
>
> Great stuff. What would you say to the following on top?
>
> ----8<----
> From 84f8a6d3a9e34c7ed8b438c3152d56e359a4ffb4 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed, 27 Mar 2024 10:19:47 +0100
> Subject: [PATCH] mm: page_alloc: change move_freepages() to
>  __move_freepages_block()
>
> The function is now supposed to be called only on a single pageblock and
> checks start_pfn and end_pfn accordingly. Rename it to make this more
> obvious and drop the end_pfn parameter which can be determined trivially
> and none of the callers use it for anything else.
>
> Also make the (now internal) end_pfn exclusive, which is more common.
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 43 ++++++++++++++++++++-----------------------
>  1 file changed, 20 insertions(+), 23 deletions(-)
>
This looks good to me and makes sense. Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi
Johannes Weiner March 27, 2024, 2:29 p.m. UTC | #3
On Wed, Mar 27, 2024 at 10:30:30AM +0100, Vlastimil Babka wrote:
> On 3/20/24 7:02 PM, Johannes Weiner wrote:
> > [...]
> 
> Great stuff. What would you say to the following on top?
> 
> ----8<----
> From 84f8a6d3a9e34c7ed8b438c3152d56e359a4ffb4 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed, 27 Mar 2024 10:19:47 +0100
> Subject: [PATCH] mm: page_alloc: change move_freepages() to
>  __move_freepages_block()
> 
> The function is now supposed to be called only on a single pageblock and
> checks start_pfn and end_pfn accordingly. Rename it to make this more
> obvious and drop the end_pfn parameter which can be determined trivially
> and none of the callers use it for anything else.
> 
> Also make the (now internal) end_pfn exclusive, which is more common.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Nice, that's better.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Baolin Wang April 8, 2024, 9:30 a.m. UTC | #4
On 2024/3/21 02:02, Johannes Weiner wrote:
> [...]

With my 2 fixes, the whole series works well on my platform, so please 
feel free to add:
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Johannes Weiner April 8, 2024, 2:24 p.m. UTC | #5
On Mon, Apr 08, 2024 at 05:30:04PM +0800, Baolin Wang wrote:
> On 2024/3/21 02:02, Johannes Weiner wrote:
> > [...]
> 
> With my 2 fixes, the whole series works well on my platform, so please 
> feel free to add:
> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Very much appreciate your testing and the two fixes. Thank you!
Yu Zhao May 11, 2024, 5:14 a.m. UTC | #6
On Wed, Mar 20, 2024 at 12:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> [...]
>
> In the vanilla kernel, reclaim seems to have been driven primarily by
> watermark boosting that happens as a result of fallbacks. With those
> all but eliminated, watermarks average lower and kswapd does less
> work. The uptick in direct reclaim is because THP requests have to
> fend for themselves more often - which is intended policy right
> now. Aggregate reclaim activity is lowered significantly, though.

This series significantly regresses Android and ChromeOS under memory
pressure. THPs are virtually nonexistent on client devices, and IIRC,
it was mentioned in the early discussions that potential regressions
for such a case are somewhat expected?

On Android (ARMv8.2), app launch time regressed by about 7%; On
ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
(full and some) on both platforms increased by over 20%. I could post
the details of the benchmarks and the metrics they measure, but I
doubt they would mean much to you. I did ask our test teams to save
extra kernel logs that might be more helpful, and I could forward them
to you.

Note that the numbers above were from the default LRU, not MGLRU,
which I specifically asked our test teams to disable to double check
the regressions.

Given the merge window will be open soon, I don't plan to stand in its
way. If we can't fix the regression after a reasonable amount of time,
can we find a way to disable this series runtime/build time?
Johannes Weiner May 13, 2024, 4:03 p.m. UTC | #7
On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> On Wed, Mar 20, 2024 at 12:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > [...]
> 
> This series significantly regresses Android and ChromeOS under memory
> pressure. THPs are virtually nonexistent on client devices, and IIRC,
> it was mentioned in the early discussions that potential regressions
> for such a case are somewhat expected?

This is not expected for the 10 patches here. You might be referring
to the discussion around the huge page allocator series, which had
fallback restrictions and many changes to reclaim and compaction.

Can you confirm that you were testing the latest patches that are in
mm-stable as of today? There was a series of follow-up fixes.

Especially, please double check you have the follow-up fixes to
compaction capturing and the CMA fallback policy. It sounds like the
behavior Baolin described before the CMA fix.

Lastly, what's the base you backported this series to?

> On Android (ARMv8.2), app launch time regressed by about 7%; On
> ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> (full and some) on both platforms increased by over 20%. I could post
> the details of the benchmarks and the metrics they measure, but I
> doubt they would mean much to you. I did ask our test teams to save
> extra kernel logs that might be more helpful, and I could forward them
> to you.

If the issue persists with the latest patches in -mm, a kernel config
and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
before/during/after the problematic behavior would be very helpful.
Yu Zhao May 13, 2024, 6:10 p.m. UTC | #8
On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > On Wed, Mar 20, 2024 at 12:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > [...]
> > >
> > > In the vanilla kernel, reclaim seems to have been driven primarily by
> > > watermark boosting that happens as a result of fallbacks. With those
> > > all but eliminated, watermarks average lower and kswapd does less
> > > work. The uptick in direct reclaim is because THP requests have to
> > > fend for themselves more often - which is intended policy right
> > > now. Aggregate reclaim activity is lowered significantly, though.
> >
> > This series significantly regresses Android and ChromeOS under memory
> > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > it was mentioned in the early discussions that potential regressions
> > for such a case are somewhat expected?
>
> This is not expected for the 10 patches here. You might be referring
> to the discussion around the huge page allocator series, which had
> fallback restrictions and many changes to reclaim and compaction.

Right, now I remember.

> Can you confirm that you were testing the latest patches that are in
> mm-stable as of today? There was a series of follow-up fixes.

Here is what I have on top of 6.8.y, which I think includes all the
follow-up fixes. The performance delta was measured between patches 5
and 22 below.

     1 mm: convert free_unref_page_list() to use folios
     2 mm: add free_unref_folios()
     3 mm: handle large folios in free_unref_folios()
     4 mm/page_alloc: remove unused fpi_flags in free_pages_prepare()
     5 mm: add alloc_contig_migrate_range allocation statistics
     6 mm: page_alloc: remove pcppage migratetype caching
     7 mm: page_alloc: optimize free_unref_folios()
     8 mm: page_alloc: fix up block types when merging compatible blocks
     9 mm: page_alloc: move free pages when converting block during isolation
    10 mm: page_alloc: fix move_freepages_block() range error
    11 mm: page_alloc: fix freelist movement during block conversion
    12 mm-page_alloc-fix-freelist-movement-during-block-conversion-fix
    13 mm: page_alloc: close migratetype race between freeing and stealing
    14 mm: page_alloc: set migratetype inside move_freepages()
    15 mm: page_isolation: prepare for hygienic freelists
    16 mm-page_isolation-prepare-for-hygienic-freelists-fix
    17 mm: page_alloc: consolidate free page accounting
    18 mm: page_alloc: consolidate free page accounting fix
    19 mm: page_alloc: consolidate free page accounting fix 2
    20 mm: page_alloc: consolidate free page accounting fix 3
    21 mm: page_alloc: change move_freepages() to __move_freepages_block()
    22 mm: page_alloc: batch vmstat updates in expand()

> Especially, please double check you have the follow-up fixes to
> compaction capturing and the CMA fallback policy. It sounds like the
> behavior Baolin described before the CMA fix.

Yes, that one was included.

> Lastly, what's the base you backported this series to?

It was 6.8; we can potentially try 6.9 this week and 6.10-rc in a few
weeks when it's in good shape for performance benchmarks.

> > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > (full and some) on both platforms increased by over 20%. I could post
> > the details of the benchmarks and the metrics they measure, but I
> > doubt they would mean much to you. I did ask our test teams to save
> > extra kernel logs that might be more helpful, and I could forward them
> > to you.
>
> If the issue persists with the latest patches in -mm, a kernel config
> and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> before/during/after the problematic behavior would be very helpful.

Assuming all the fixes were included, do you want the logs from 6.8?
We have them available now.
Johannes Weiner May 13, 2024, 7:04 p.m. UTC | #9
On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > On Wed, Mar 20, 2024 at 12:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:

[...]

> > >
> > > This series significantly regresses Android and ChromeOS under memory
> > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > it was mentioned in the early discussions that potential regressions
> > > for such a case are somewhat expected?
> >
> > This is not expected for the 10 patches here. You might be referring
> > to the discussion around the huge page allocator series, which had
> > fallback restrictions and many changes to reclaim and compaction.
> 
> Right, now I remember.
> 
> > Can you confirm that you were testing the latest patches that are in
> > mm-stable as of today? There was a series of follow-up fixes.
> 
> Here is what I have on top of 6.8.y, which I think includes all the
> follow-up fixes. The performance delta was measured between 5 & 22.
> 
>      1 mm: convert free_unref_page_list() to use folios
>      2 mm: add free_unref_folios()
>      3 mm: handle large folios in free_unref_folios()
>      4 mm/page_alloc: remove unused fpi_flags in free_pages_prepare()
>      5 mm: add alloc_contig_migrate_range allocation statistics
>      6 mm: page_alloc: remove pcppage migratetype caching
>      7 mm: page_alloc: optimize free_unref_folios()
>      8 mm: page_alloc: fix up block types when merging compatible blocks
>      9 mm: page_alloc: move free pages when converting block during isolation
>     10 mm: page_alloc: fix move_freepages_block() range error
>     11 mm: page_alloc: fix freelist movement during block conversion
>     12 mm-page_alloc-fix-freelist-movement-during-block-conversion-fix
>     13 mm: page_alloc: close migratetype race between freeing and stealing
>     14 mm: page_alloc: set migratetype inside move_freepages()
>     15 mm: page_isolation: prepare for hygienic freelists
>     16 mm-page_isolation-prepare-for-hygienic-freelists-fix
>     17 mm: page_alloc: consolidate free page accounting
>     18 mm: page_alloc: consolidate free page accounting fix
>     19 mm: page_alloc: consolidate free page accounting fix 2
>     20 mm: page_alloc: consolidate free page accounting fix 3
>     21 mm: page_alloc: change move_freepages() to __move_freepages_block()
>     22 mm: page_alloc: batch vmstat updates in expand()

It does look complete to me. Did you encounter any conflicts during
the backport? Is there any chance you can fold the fixes into their
respective main patches and bisect the sequence?

Again, it's not expected behavior given the fairly conservative
changes above. It sounds like a bug.

> > Especially, please double check you have the follow-up fixes to
> > compaction capturing and the CMA fallback policy. It sounds like the
> > behavior Baolin described before the CMA fix.
> 
> Yes, that one was included.

Ok.

> > Lastly, what's the base you backported this series to?
> 
> It was 6.8, we can potentially try 6.9 this week and 6.10-rc in a few
> weeks when it's in good shape for performance benchmarks.

If you could try 6.9 as well, that would be great. I backported the
series to 6.9 the other day (git cherry-picks from mm-stable) and I
didn't encounter any conflicts.

> > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > (full and some) on both platforms increased by over 20%. I could post
> > > the details of the benchmarks and the metrics they measure, but I
> > > doubt they would mean much to you. I did ask our test teams to save
> > > extra kernel logs that might be more helpful, and I could forward them
> > > to you.
> >
> > If the issue persists with the latest patches in -mm, a kernel config
> > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > before/during/after the problematic behavior would be very helpful.
> 
> Assuming all the fixes were included, do you want the logs from 6.8?
> We have them available now.

Yes, that would be helpful.

If you have them, it would also be quite useful to have the vmstat
before-after-test delta from a good kernel, for baseline comparison.
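
(In case it helps, here is the sort of thing I mean -- a trivial
sketch that diffs two /proc/vmstat snapshots taken right before and
right after a run; the "vmstat.before"/"vmstat.after" names are just
placeholders:)

/*
 * Sketch: print per-counter deltas between two /proc/vmstat
 * snapshots, e.g. one taken before and one after a benchmark run.
 */
#include <stdio.h>
#include <string.h>

struct counter {
	char name[64];
	long long val;
};

static int load(const char *path, struct counter *c, int max)
{
	FILE *fp = fopen(path, "r");
	int n = 0;

	if (!fp) {
		perror(path);
		return -1;
	}
	while (n < max && fscanf(fp, "%63s %lld", c[n].name, &c[n].val) == 2)
		n++;
	fclose(fp);
	return n;
}

int main(void)
{
	static struct counter before[1024], after[1024];
	int nb = load("vmstat.before", before, 1024);
	int na = load("vmstat.after", after, 1024);
	int i, j;

	if (nb < 0 || na < 0)
		return 1;

	for (i = 0; i < na; i++) {
		for (j = 0; j < nb; j++) {
			if (strcmp(after[i].name, before[j].name))
				continue;
			/* only print counters that actually moved */
			if (after[i].val != before[j].val)
				printf("%-32s %+lld\n", after[i].name,
				       after[i].val - before[j].val);
			break;
		}
	}
	return 0;
}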

Thanks
Yu Zhao June 5, 2024, 4:53 a.m. UTC | #10
On Mon, May 13, 2024 at 1:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> > On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > > On Wed, Mar 20, 2024 at 12:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> [...]
>
> > > >
> > > > This series significantly regresses Android and ChromeOS under memory
> > > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > > it was mentioned in the early discussions that potential regressions
> > > > for such a case are somewhat expected?
> > >
> > > This is not expected for the 10 patches here. You might be referring
> > > to the discussion around the huge page allocator series, which had
> > > fallback restrictions and many changes to reclaim and compaction.
> >
> > Right, now I remember.
> >
> > > Can you confirm that you were testing the latest patches that are in
> > > mm-stable as of today? There was a series of follow-up fixes.
> >
> > Here is what I have on top of 6.8.y, which I think includes all the
> > follow-up fixes. The performance delta was measured between 5 & 22.
> >
> >      1 mm: convert free_unref_page_list() to use folios
> >      2 mm: add free_unref_folios()
> >      3 mm: handle large folios in free_unref_folios()
> >      4 mm/page_alloc: remove unused fpi_flags in free_pages_prepare()
> >      5 mm: add alloc_contig_migrate_range allocation statistics
> >      6 mm: page_alloc: remove pcppage migratetype caching
> >      7 mm: page_alloc: optimize free_unref_folios()
> >      8 mm: page_alloc: fix up block types when merging compatible blocks
> >      9 mm: page_alloc: move free pages when converting block during isolation
> >     10 mm: page_alloc: fix move_freepages_block() range error
> >     11 mm: page_alloc: fix freelist movement during block conversion
> >     12 mm-page_alloc-fix-freelist-movement-during-block-conversion-fix
> >     13 mm: page_alloc: close migratetype race between freeing and stealing
> >     14 mm: page_alloc: set migratetype inside move_freepages()
> >     15 mm: page_isolation: prepare for hygienic freelists
> >     16 mm-page_isolation-prepare-for-hygienic-freelists-fix
> >     17 mm: page_alloc: consolidate free page accounting
> >     18 mm: page_alloc: consolidate free page accounting fix
> >     19 mm: page_alloc: consolidate free page accounting fix 2
> >     20 mm: page_alloc: consolidate free page accounting fix 3
> >     21 mm: page_alloc: change move_freepages() to __move_freepages_block()
> >     22 mm: page_alloc: batch vmstat updates in expand()
>
> It does look complete to me. Did you encounter any conflicts during
> the backport? Is there any chance you can fold the fixes into their
> respective main patches and bisect the sequence?
>
> Again, it's not expected behavior given the fairly conservative
> changes above. It sounds like a bug.
>
> > > Especially, please double check you have the follow-up fixes to
> > > compaction capturing and the CMA fallback policy. It sounds like the
> > > behavior Baolin described before the CMA fix.
> >
> > Yes, that one was included.
>
> Ok.
>
> > > Lastly, what's the base you backported this series to?
> >
> > It was 6.8, we can potentially try 6.9 this week and 6.10-rc in a few
> > weeks when it's in good shape for performance benchmarks.
>
> If you could try 6.9 as well, that would be great. I backported the
> series to 6.9 the other day (git cherry-picks from mm-stable) and I
> didn't encounter any conflicts.
>
> > > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > > (full and some) on both platforms increased by over 20%. I could post
> > > > the details of the benchmarks and the metrics they measure, but I
> > > > doubt they would mean much to you. I did ask our test teams to save
> > > > extra kernel logs that might be more helpful, and I could forward them
> > > > to you.
> > >
> > > If the issue persists with the latest patches in -mm, a kernel config
> > > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > > before/during/after the problematic behavior would be very helpful.
> >
> > Assuming all the fixes were included, do you want the logs from 6.8?
> > We have them available now.
>
> Yes, that would be helpful.
>
> If you have them, it would also be quite useful to have the vmstat
> before-after-test delta from a good kernel, for baseline comparison.

Sorry for taking this long -- I wanted to see if the regression is
still reproducible on v6.9.

Apparently we got similar results on v6.9 with the following
patches cherry-picked cleanly from v6.10-rc1:

     1  mm: page_alloc: remove pcppage migratetype caching
     2  mm: page_alloc: optimize free_unref_folios()
     3  mm: page_alloc: fix up block types when merging compatible blocks
     4  mm: page_alloc: move free pages when converting block during isolation
     5  mm: page_alloc: fix move_freepages_block() range error
     6  mm: page_alloc: fix freelist movement during block conversion
     7  mm: page_alloc: close migratetype race between freeing and stealing
     8  mm: page_alloc: set migratetype inside move_freepages()
     9  mm: page_isolation: prepare for hygienic freelists
    10  mm: page_alloc: consolidate free page accounting
    11  mm: page_alloc: change move_freepages() to __move_freepages_block()
    12  mm: page_alloc: batch vmstat updates in expand()

Unfortunately I just realized that the automated benchmark doesn't
collect the kernel stats before it starts (since it always starts on a
freshly booted device). While this is being fixed, I'm attaching the
kernel stats collected after the benchmark finished. I grabbed 10 runs
for each (baseline/patched), and if you need more, please let me know.
(And we should have the stats before the benchmark soon.)
Johannes Weiner June 10, 2024, 3:28 p.m. UTC | #11
On Tue, Jun 04, 2024 at 10:53:55PM -0600, Yu Zhao wrote:
> On Mon, May 13, 2024 at 1:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> > > On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > > > This series significantly regresses Android and ChromeOS under memory
> > > > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > > > it was mentioned in the early discussions that potential regressions
> > > > > for such a case are somewhat expected?
> > > >
> > > > This is not expected for the 10 patches here. You might be referring
> > > > to the discussion around the huge page allocator series, which had
> > > > fallback restrictions and many changes to reclaim and compaction.
> > >
> > > Right, now I remember.
> > >
> > > > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > > > (full and some) on both platforms increased by over 20%. I could post
> > > > > the details of the benchmarks and the metrics they measure, but I
> > > > > doubt they would mean much to you. I did ask our test teams to save
> > > > > extra kernel logs that might be more helpful, and I could forward them
> > > > > to you.
> > > >
> > > > If the issue persists with the latest patches in -mm, a kernel config
> > > > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > > > before/during/after the problematic behavior would be very helpful.
> > >
> > > Assuming all the fixes were included, do you want the logs from 6.8?
> > > We have them available now.
> >
> > Yes, that would be helpful.
> >
> > If you have them, it would also be quite useful to have the vmstat
> > before-after-test delta from a good kernel, for baseline comparison.
> 
> Sorry for taking this long -- I wanted to see if the regression is
> still reproducible on v6.9.
> 
> Apparently we got similar results on v6.9 with the following
> patches cherry-picked cleanly from v6.10-rc1:
> 
>      1  mm: page_alloc: remove pcppage migratetype caching
>      2  mm: page_alloc: optimize free_unref_folios()
>      3  mm: page_alloc: fix up block types when merging compatible blocks
>      4  mm: page_alloc: move free pages when converting block during isolation
>      5  mm: page_alloc: fix move_freepages_block() range error
>      6  mm: page_alloc: fix freelist movement during block conversion
>      7  mm: page_alloc: close migratetype race between freeing and stealing
>      8  mm: page_alloc: set migratetype inside move_freepages()
>      9  mm: page_isolation: prepare for hygienic freelists
>     10  mm: page_alloc: consolidate free page accounting
>     11  mm: page_alloc: change move_freepages() to __move_freepages_block()
>     12  mm: page_alloc: batch vmstat updates in expand()
> 
> Unfortunately I just realized that that automated benchmark didn't
> collect the kernel stats before it starts (since it always starts on a
> freshly booted device). While this is being fixed, I'm attaching the
> kernel stats collected after the benchmark finished. I grabbed 10 runs
> for each (baseline/patched), and if you need more, please let me know.
> (And we should have the stats before the benchmark soon.)

Thanks for grabbing these, and sorry about the delay; I was traveling
last week.

You mentioned "THPs are virtually non-existant". But the workload
doesn't seem to allocate anon THPs at all. For file THP, the patched
kernel's median for allocation success is 90% of baseline, but the
inter-run min/max deviation from the median in baseline is 85%/108%
and in patched and 85%/112% in patched, so this is quite noisy. Was
that initial comment regarding a different workload?

This other data point has me stumped. Comparing medians, there is a
1.5% reduction in anon refaults and a 4.8% increase in file
refaults. And indeed, there are more file pages and less anon being scanned.
I think this could explain the PSI delta, since AFAIK you have zram or
zswap, and anon decompression loads are cheaper than filesystem IO.

The above patches don't do anything that directly influences the
anon-file reclaim balance. However, if file THPs fall back to 4k file
pages more, that *might* be able to explain a shift in reclaim
balance, if some hot subpages in those THPs were protecting colder
subpages from being reclaimed and refaulting.

In that case, the root cause would still be a simple THP success rate
regression. To confirm this theory, could you run the baseline and the
patched sets both with THP disabled entirely?

Can you elaborate more on what the workload is doing exactly? What are
the parameters of the test machine (CPUs, memory size)? It'd be
helpful if I could reproduce this locally as well.
Yu Zhao June 12, 2024, 6:52 p.m. UTC | #12
On Mon, Jun 10, 2024 at 9:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Jun 04, 2024 at 10:53:55PM -0600, Yu Zhao wrote:
> > On Mon, May 13, 2024 at 1:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> > > > On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > > > > This series significantly regresses Android and ChromeOS under memory
> > > > > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > > > > it was mentioned in the early discussions that potential regressions
> > > > > > for such a case are somewhat expected?
> > > > >
> > > > > This is not expected for the 10 patches here. You might be referring
> > > > > to the discussion around the huge page allocator series, which had
> > > > > fallback restrictions and many changes to reclaim and compaction.
> > > >
> > > > Right, now I remember.
> > > >
> > > > > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > > > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > > > > (full and some) on both platforms increased by over 20%. I could post
> > > > > > the details of the benchmarks and the metrics they measure, but I
> > > > > > doubt they would mean much to you. I did ask our test teams to save
> > > > > > extra kernel logs that might be more helpful, and I could forward them
> > > > > > to you.
> > > > >
> > > > > If the issue persists with the latest patches in -mm, a kernel config
> > > > > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > > > > before/during/after the problematic behavior would be very helpful.
> > > >
> > > > Assuming all the fixes were included, do you want the logs from 6.8?
> > > > We have them available now.
> > >
> > > Yes, that would be helpful.
> > >
> > > If you have them, it would also be quite useful to have the vmstat
> > > before-after-test delta from a good kernel, for baseline comparison.
> >
> > Sorry for taking this long -- I wanted to see if the regression is
> > still reproducible on v6.9.
> >
> > Apparently we got similar results on v6.9 with the following
> > patches cherry-picked cleanly from v6.10-rc1:
> >
> >      1  mm: page_alloc: remove pcppage migratetype caching
> >      2  mm: page_alloc: optimize free_unref_folios()
> >      3  mm: page_alloc: fix up block types when merging compatible blocks
> >      4  mm: page_alloc: move free pages when converting block during isolation
> >      5  mm: page_alloc: fix move_freepages_block() range error
> >      6  mm: page_alloc: fix freelist movement during block conversion
> >      7  mm: page_alloc: close migratetype race between freeing and stealing
> >      8  mm: page_alloc: set migratetype inside move_freepages()
> >      9  mm: page_isolation: prepare for hygienic freelists
> >     10  mm: page_alloc: consolidate free page accounting
> >     11  mm: page_alloc: change move_freepages() to __move_freepages_block()
> >     12  mm: page_alloc: batch vmstat updates in expand()
> >
> > Unfortunately I just realized that that automated benchmark didn't
> > collect the kernel stats before it starts (since it always starts on a
> > freshly booted device). While this is being fixed, I'm attaching the
> > kernel stats collected after the benchmark finished. I grabbed 10 runs
> > for each (baseline/patched), and if you need more, please let me know.
> > (And we should have the stats before the benchmark soon.)
>
> Thanks for grabbing these, and sorry about the delay, I was traveling
> last week.
>
> You mentioned "THPs are virtually non-existant". But the workload
> doesn't seem to allocate anon THPs at all.

Sorry for not being clear there: you are correct.

I meant that client devices rarely use 2MB THPs or __GFP_COMP. (They
simply can't due to both internal and external fragmentation, but we
are trying!)

> For file THP, the patched
> kernel's median for allocation success is 90% of baseline, but the
> inter-run min/max deviation from the median is 85%/108% in baseline
> and 85%/112% in patched, so this is quite noisy. Was
> that initial comment regarding a different workload?

No, in both cases we tried (Android and ChromeOS), we were hoping the
series could help with mTHP (64KB and 32KB). But we hit the
regressions with their baseline (4KB). Again, 2MB THPs, if they are
used, are reserved (allocated and mlocked to hold text/code sections
after a reboot). So they shouldn't matter, and I highly doubt the
regressions are because of them.

> This other data point has me stumped. Comparing medians, there is a
> 1.5% reduction in anon refaults and a 4.8% increase in file
> refaults. And indeed, there are more file pages and less anon being scanned.
> I think this could explain the PSI delta, since AFAIK you have zram or
> zswap, and anon decompression loads are cheaper than filesystem IO.
>
> The above patches don't do anything that directly influences the
> anon-file reclaim balance. However, if file THPs fall back to 4k file
> pages more, that *might* be able to explain a shift in reclaim
> balance, if some hot subpages in those THPs were protecting colder
> subpages from being reclaimed and refaulting.
>
> In that case, the root cause would still be a simple THP success rate
> regression. To confirm this theory, could you run the baseline and the
> patched sets both with THP disabled entirely?

Will try this. And is bisecting within this series possible?

> Can you elaborate more on what the workload is doing exactly?

These are simple benchmarks that measure the system and foreground
app/tab performance under memory pressure, e.g., [1]. They open a
bunch of apps/tabs (respectively on Android/ChromeOS) and switch
between them. At a given time, one of them is foreground and the rest
are background, obviously. When an app/tab has been in the background
for a while, the userspace may call madvise(PAGEOUT) to reclaim (most
of) its LRU pages, leaving unmovable kernel memory there. This
strategy allows client systems to cache more apps/tabs in the
background and reduce their startup/switch time. But it's also a major
source of fragmentation (I'm sure you get why, so I won't go into
details here. And userspace also tries to make a better decision
between reclaim/compact/kill based on fragmentation, but it's not
easy.)

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/refs/heads/main/src/go.chromium.org/tast-tests/cros/local/bundles/cros/platform/memory_pressure.go
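
(Purely for illustration, a minimal self-contained sketch of the
madvise(MADV_PAGEOUT) pattern mentioned above -- this only shows the
mechanism on a process's own mapping; the real systems act on
background processes, e.g. via process_madvise(), and none of this is
our actual code:)

/*
 * Map an anonymous region, fault it in, then ask the kernel to
 * reclaim it with MADV_PAGEOUT (Linux 5.4+). The pages stay mapped
 * and refault from zram/swap on the next access.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21		/* kernel UAPI value, for older libcs */
#endif

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB anonymous region */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(buf, 0xaa, len);		/* fault the pages in */

	/* "App went to the background": reclaim its cold LRU pages. */
	if (madvise(buf, len, MADV_PAGEOUT))
		perror("madvise(MADV_PAGEOUT)");

	munmap(buf, len);
	return 0;
}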

> What are
> the parameters of the test machine (CPUs, memory size)? It'd be
> helpful if I could reproduce this locally as well.

The data I shared previously is from an Intel i7-1255U + 4GB Chromebook.

More data attached -- it contains vmstat, zoneinfo and pagetypeinfo
files collected before the benchmark (after fresh reboots) and after
the benchmark.
Johannes Weiner June 13, 2024, 3:39 p.m. UTC | #13
On Wed, Jun 12, 2024 at 12:52:20PM -0600, Yu Zhao wrote:
> On Mon, Jun 10, 2024 at 9:28 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Tue, Jun 04, 2024 at 10:53:55PM -0600, Yu Zhao wrote:
> > > On Mon, May 13, 2024 at 1:04 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > On Mon, May 13, 2024 at 12:10:04PM -0600, Yu Zhao wrote:
> > > > > On Mon, May 13, 2024 at 10:03 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > > > On Fri, May 10, 2024 at 11:14:43PM -0600, Yu Zhao wrote:
> > > > > > > This series significantly regresses Android and ChromeOS under memory
> > > > > > > pressure. THPs are virtually nonexistent on client devices, and IIRC,
> > > > > > > it was mentioned in the early discussions that potential regressions
> > > > > > > for such a case are somewhat expected?
> > > > > >
> > > > > > This is not expected for the 10 patches here. You might be referring
> > > > > > to the discussion around the huge page allocator series, which had
> > > > > > fallback restrictions and many changes to reclaim and compaction.
> > > > >
> > > > > Right, now I remember.
> > > > >
> > > > > > > On Android (ARMv8.2), app launch time regressed by about 7%; On
> > > > > > > ChromeOS (Intel ADL), tab switch time regressed by about 8%. Also PSI
> > > > > > > (full and some) on both platforms increased by over 20%. I could post
> > > > > > > the details of the benchmarks and the metrics they measure, but I
> > > > > > > doubt they would mean much to you. I did ask our test teams to save
> > > > > > > extra kernel logs that might be more helpful, and I could forward them
> > > > > > > to you.
> > > > > >
> > > > > > If the issue persists with the latest patches in -mm, a kernel config
> > > > > > and snapshots of /proc/vmstat, /proc/pagetypeinfo, /proc/zoneinfo
> > > > > > before/during/after the problematic behavior would be very helpful.
> > > > >
> > > > > Assuming all the fixes were included, do you want the logs from 6.8?
> > > > > We have them available now.
> > > >
> > > > Yes, that would be helpful.
> > > >
> > > > If you have them, it would also be quite useful to have the vmstat
> > > > before-after-test delta from a good kernel, for baseline comparison.
> > >
> > > Sorry for taking this long -- I wanted to see if the regression is
> > > still reproducible on v6.9.
> > >
> > > Apparently we got similar results on v6.9 with the following
> > > patches cherry-picked cleanly from v6.10-rc1:
> > >
> > >      1  mm: page_alloc: remove pcppage migratetype caching
> > >      2  mm: page_alloc: optimize free_unref_folios()
> > >      3  mm: page_alloc: fix up block types when merging compatible blocks
> > >      4  mm: page_alloc: move free pages when converting block during isolation
> > >      5  mm: page_alloc: fix move_freepages_block() range error
> > >      6  mm: page_alloc: fix freelist movement during block conversion
> > >      7  mm: page_alloc: close migratetype race between freeing and stealing
> > >      8  mm: page_alloc: set migratetype inside move_freepages()
> > >      9  mm: page_isolation: prepare for hygienic freelists
> > >     10  mm: page_alloc: consolidate free page accounting
> > >     11  mm: page_alloc: change move_freepages() to __move_freepages_block()
> > >     12  mm: page_alloc: batch vmstat updates in expand()
> > >
> > > Unfortunately I just realized that that automated benchmark didn't
> > > collect the kernel stats before it starts (since it always starts on a
> > > freshly booted device). While this is being fixed, I'm attaching the
> > > kernel stats collected after the benchmark finished. I grabbed 10 runs
> > > for each (baseline/patched), and if you need more, please let me know.
> > > (And we should have the stats before the benchmark soon.)
> >
> > Thanks for grabbing these, and sorry about the delay, I was traveling
> > last week.
> >
> > You mentioned "THPs are virtually non-existant". But the workload
> > doesn't seem to allocate anon THPs at all.
> 
> Sorry for not being clear there: you are correct.
> 
> I meant that client devices rarely use 2MB THPs or __GFP_COMP. (They
> simply can't due to both internal and external fragmentation, but we
> are trying!)

Ah, understood. So this is nominally a non-THP workload, and we're
suspecting a simple 4k allocation issue in low memory conditions.

Thanks for clarifying.

However, I don't think 4k alone would explain pressure just yet. PSI
is triggered by reclaim and compaction, but with this series, type
fallbacks are still allowed to the full extent before entering any
such remediation. The series merely fixes type safety and eliminates
avoidable/accidental mixing.

So I'm thinking something else must still be going on: either THP
(however limited its use in this workload), or the userspace feedback
mechanism you mention below...

> > For file THP, the patched
> > kernel's median for allocation success is 90% of baseline, but the
> > inter-run min/max deviation from the median is 85%/108% in baseline
> > and 85%/112% in patched, so this is quite noisy. Was
> > that initial comment regarding a different workload?
> 
> No, in both cases (Android and ChromeOS) we tried, we were hoping the
> series could help with mTHP (64KB and 32KB). But we hit the
> regressions with their baseline (4KB). Again, 2MB THPs, if they are
> used, are reserved (allocated and mlocked to hold text/code sections
> after a reboot). So they shouldn't matter, and I highly doubt the
> regressions are because of them.

Ok.

> > This other data point has me stumped. Comparing medians, there is a
> > 1.5% reduction in anon refaults and a 4.8% increase in file
> > refaults. And indeed, there are more file pages and less anon being scanned.
> > I think this could explain the PSI delta, since AFAIK you have zram or
> > zswap, and anon decompression loads are cheaper than filesystem IO.
> >
> > The above patches don't do anything that directly influences the
> > anon-file reclaim balance. However, if file THPs fall back to 4k file
> > pages more, that *might* be able to explain a shift in reclaim
> > balance, if some hot subpages in those THPs were protecting colder
> > subpages from being reclaimed and refaulting.
> >
> > In that case, the root cause would still be a simple THP success rate
> > regression. To confirm this theory, could you run the baseline and the
> > patched sets both with THP disabled entirely?
> 
> Will try this. And is bisecting within this series possible?

Yes. I built and put each commit incrementally through my test
machinery before sending them out. I can't vouch for all
configurations, of course, but I'd expect it to work.

> > Can you elaborate more on what the workload is doing exactly?
> 
> These are simple benchmarks that measure the system and foreground
> app/tab performance under memory pressure, e.g., [1]. They open a
> bunch of apps/tabs (respectively on Android/ChromeOS) and switch
> between them. At a given time, one of them is foreground and the rest
> are background, obviously. When an app/tab has been in the background
> for a while, the userspace may call madvise(PAGEOUT) to reclaim (most
> of) its LRU pages, leaving unmovable kernel memory there. This
> strategy allows client systems to cache more apps/tabs in the
> background and reduce their startup/switch time. But it's also a major
> source of fragmentation (I'm sure you get why so I won't go into
> details here. And userspace also tries to make a better decision
> between reclaim/compact/kill based on fragmentation, but it's not
> easy.)

Thanks for the detailed explanation.

That last bit is interesting: how does it determine "fragmentation"?
The series might well affect this metric.
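
For reference, one common userspace proxy for external fragmentation
-- and I'm only guessing this is the kind of thing the metric is based
on -- is the unusable free space index over /proc/buddyinfo: the
fraction of free memory sitting in blocks smaller than the order of
interest. The kernel exposes the same idea in
/sys/kernel/debug/extfrag/unusable_index. A rough sketch, with the
target order of 9 (2MB pageblocks with 4k base pages) hardcoded as an
assumption:

/*
 * For a target order o:
 *	index(o) = 1 - (free pages in blocks of order >= o) / (total free)
 * 0 means all free memory is usable at that order, 1 means none is.
 */
#include <stdio.h>
#include <string.h>

#define NR_ORDERS 11	/* orders 0..10 on common configs */

int main(void)
{
	char zone[32], line[512];
	FILE *fp = fopen("/proc/buddyinfo", "r");
	int target = 9;

	if (!fp) {
		perror("/proc/buddyinfo");
		return 1;
	}

	while (fgets(line, sizeof(line), fp)) {
		unsigned long blocks, total = 0, usable = 0;
		int node, order, len;
		char *p;

		/* lines look like: "Node 0, zone   Normal  12  34 ..." */
		if (sscanf(line, "Node %d, zone %31s", &node, zone) != 2)
			continue;

		p = strstr(line, zone) + strlen(zone);
		for (order = 0; order < NR_ORDERS; order++) {
			if (sscanf(p, "%lu%n", &blocks, &len) != 1)
				break;
			p += len;
			total += blocks << order;
			if (order >= target)
				usable += blocks << order;
		}

		if (total)
			printf("node %d zone %-8s unusable index(order %d) = %.3f\n",
			       node, zone, target, 1.0 - (double)usable / total);
	}

	fclose(fp);
	return 0;
}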

> [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/refs/heads/main/src/go.chromium.org/tast-tests/cros/local/bundles/cros/platform/memory_pressure.go
> 
> > What are
> > the parameters of the test machine (CPUs, memory size)? It'd be
> > helpful if I could reproduce this locally as well.
> 
> The data I shared previously is from an Intel i7-1255U + 4GB Chromebook.
> 
> More data attached -- it contains vmstat, zoneinfo and pagetypeinfo
> files collected before the benchmark (after fresh reboots) and after
> the benchmark.

Thanks, I'll take a look.