
[V2,0/6] mm: page_alloc: freelist migratetype hygiene

Message ID 20230911195023.247694-1-hannes@cmpxchg.org (mailing list archive)
Series mm: page_alloc: freelist migratetype hygiene

Message

Johannes Weiner Sept. 11, 2023, 7:41 p.m. UTC
V2:
- dropped the get_pfnblock_migratetype() optimization
  patchlet since somebody else beat me to it (thanks Zi)
- broke out pcp bypass fix since somebody else reported the bug:
  https://lore.kernel.org/linux-mm/20230911181108.GA104295@cmpxchg.org/
- fixed the CONFIG_UNACCEPTED_MEMORY build (lkp)
- rebased to v6.6-rc1

The series is based on v6.6-rc1 plus the pcp bypass fix above ^

---

This is a breakout series from the huge page allocator patches[1].

While testing and benchmarking the series incrementally, as requested
by reviewers, it became apparent that there are several sources of
freelist migratetype violations, which later patches in the series had
been hiding.

Those violations occur when pages of one migratetype end up on the
freelists of another type. This encourages incompatible page mixing
down the line, where allocation requests ask for one migratetype but
receive pages of another. This defeats the mobility grouping.

The series addresses those causes. The last patch adds type checks on
all freelist movements to rule out any violations. I used these checks
to identify the violations fixed up in the preceding patches.
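
As an illustration, the shape of such a check is roughly the following
(a sketch only; the helper name and placement are made up here, the
real patch hooks equivalent checks into the freelist add/move/delete
helpers in mm/page_alloc.c):

    /*
     * Sketch: warn when a page is placed on (or taken off) a freelist
     * whose migratetype does not match the pageblock's migratetype.
     */
    static void freelist_check_mt(struct page *page, int nr_pages,
                                  int migratetype)
    {
            int mt = get_pageblock_migratetype(page);

            WARN_ONCE(mt != migratetype,
                      "page type is %d, passed migratetype is %d (nr=%d)\n",
                      mt, migratetype, nr_pages);
    }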

The series is a breakout, but has merit on its own: less type mixing
means better grouping, which means less work for compaction, which
means higher THP success rates and lower allocation latencies. The
results can be
seen in a mixed workload that stresses the machine with a kernel build
job while periodically attempting to allocate batches of THP. The data
is averaged over 50 consecutive defconfig builds:

                                                        VANILLA      PATCHED-CLEANLISTS
Hugealloc Time median                     14642.00 (    +0.00%)   10506.00 (   -28.25%)
Hugealloc Time min                         4820.00 (    +0.00%)    4783.00 (    -0.77%)
Hugealloc Time max                      6786868.00 (    +0.00%) 6556624.00 (    -3.39%)
Kbuild Real time                            240.03 (    +0.00%)     241.45 (    +0.59%)
Kbuild User time                           1195.49 (    +0.00%)    1195.69 (    +0.02%)
Kbuild System time                           96.44 (    +0.00%)      97.03 (    +0.61%)
THP fault alloc                           11490.00 (    +0.00%)   11802.30 (    +2.72%)
THP fault fallback                          782.62 (    +0.00%)     478.88 (   -38.76%)
THP fault fail rate %                         6.38 (    +0.00%)       3.90 (   -33.52%)
Direct compact stall                        297.70 (    +0.00%)     224.56 (   -24.49%)
Direct compact fail                         265.98 (    +0.00%)     191.56 (   -27.87%)
Direct compact success                       31.72 (    +0.00%)      33.00 (    +3.91%)
Direct compact success rate %                13.11 (    +0.00%)      17.26 (   +29.43%)
Compact daemon scanned migrate          1673661.58 (    +0.00%) 1591682.18 (    -4.90%)
Compact daemon scanned free             2711252.80 (    +0.00%) 2615217.78 (    -3.54%)
Compact direct scanned migrate           384998.62 (    +0.00%)  261689.42 (   -32.03%)
Compact direct scanned free              966308.94 (    +0.00%)  667459.76 (   -30.93%)
Compact migrate scanned daemon %             80.86 (    +0.00%)      83.34 (    +3.02%)
Compact free scanned daemon %                74.41 (    +0.00%)      78.26 (    +5.10%)
Alloc stall                                 338.06 (    +0.00%)     440.72 (   +30.28%)
Pages kswapd scanned                    1356339.42 (    +0.00%) 1402313.42 (    +3.39%)
Pages kswapd reclaimed                   581309.08 (    +0.00%)  587956.82 (    +1.14%)
Pages direct scanned                      56384.18 (    +0.00%)  141095.04 (  +150.24%)
Pages direct reclaimed                    17055.54 (    +0.00%)   22427.96 (   +31.50%)
Pages scanned kswapd %                       96.38 (    +0.00%)      93.60 (    -2.86%)
Swap out                                  41528.00 (    +0.00%)   47969.92 (   +15.51%)
Swap in                                    6541.42 (    +0.00%)    9093.30 (   +39.01%)
File refaults                            127666.50 (    +0.00%)  135766.84 (    +6.34%)

 include/linux/mm.h             |  18 +-
 include/linux/page-isolation.h |   2 +-
 include/linux/vmstat.h         |   8 -
 mm/debug_page_alloc.c          |  12 +-
 mm/internal.h                  |   5 -
 mm/page_alloc.c                | 357 ++++++++++++++++++---------------
 mm/page_isolation.c            |  23 ++-
 7 files changed, 217 insertions(+), 208 deletions(-)

Comments

Mike Kravetz Sept. 14, 2023, 11:52 p.m. UTC | #1
In next-20230913, I started hitting the following BUG.  Seems related
to this series.  And, if series is reverted I do not see the BUG.

I can easily reproduce on a small 16G VM.  kernel command line contains
"hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
while true; do
 echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
 echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
 echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
done

For the BUG below I believe it was the first (or second) 1G page creation from
CMA that triggered:  cma_alloc of 1G.

Sorry, have not looked deeper into the issue.

[   28.643019] page:ffffea0004fb4280 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ed0a
[   28.645455] flags: 0x200000000000000(node=0|zone=2)
[   28.646835] page_type: 0xffffffff()
[   28.647886] raw: 0200000000000000 dead000000000100 dead000000000122 0000000000000000
[   28.651170] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[   28.653124] page dumped because: VM_BUG_ON_PAGE(is_migrate_isolate(mt))
[   28.654769] ------------[ cut here ]------------
[   28.655972] kernel BUG at mm/page_alloc.c:1231!
[   28.657139] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[   28.658354] CPU: 2 PID: 885 Comm: bash Not tainted 6.6.0-rc1-next-20230913+ #3
[   28.660090] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[   28.662054] RIP: 0010:free_pcppages_bulk+0x192/0x240
[   28.663284] Code: 22 48 89 45 08 8b 44 24 0c 41 29 44 24 04 41 29 c6 41 83 f8 05 0f 85 4c ff ff ff 48 c7 c6 20 a5 22 82 48 89 df e8 4e cf fc ff <0f> 0b 65 8b 05 41 8b d3 7e 89 c0 48 0f a3 05 fb 35 39 01 0f 83 40
[   28.667422] RSP: 0018:ffffc90003b9faf0 EFLAGS: 00010046
[   28.668643] RAX: 000000000000003b RBX: ffffea0004fb4280 RCX: 0000000000000000
[   28.670245] RDX: 0000000000000000 RSI: ffffffff8224dace RDI: 00000000ffffffff
[   28.671920] RBP: ffffea0004fb4288 R08: 0000000000009ffb R09: 00000000ffffdfff
[   28.673614] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: ffff888477c30540
[   28.675213] R13: ffff888477c30550 R14: 00000000000012f5 R15: 000000000013ed0a
[   28.676832] FS:  00007f60039b9740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[   28.678709] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   28.680046] CR2: 00005615f9bf3048 CR3: 00000003128b6005 CR4: 0000000000370ee0
[   28.682897] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   28.684501] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   28.686098] Call Trace:
[   28.686792]  <TASK>
[   28.687414]  ? die+0x32/0x80
[   28.688197]  ? do_trap+0xd6/0x100
[   28.689069]  ? free_pcppages_bulk+0x192/0x240
[   28.690135]  ? do_error_trap+0x6a/0x90
[   28.691082]  ? free_pcppages_bulk+0x192/0x240
[   28.692187]  ? exc_invalid_op+0x49/0x60
[   28.693154]  ? free_pcppages_bulk+0x192/0x240
[   28.694225]  ? asm_exc_invalid_op+0x16/0x20
[   28.695291]  ? free_pcppages_bulk+0x192/0x240
[   28.696405]  drain_pages_zone+0x3f/0x50
[   28.697404]  __drain_all_pages+0xe2/0x1e0
[   28.698472]  alloc_contig_range+0x143/0x280
[   28.699581]  ? bitmap_find_next_zero_area_off+0x3d/0x90
[   28.700902]  cma_alloc+0x156/0x470
[   28.701852]  ? kernfs_fop_write_iter+0x160/0x1f0
[   28.703053]  alloc_fresh_hugetlb_folio+0x7e/0x270
[   28.704272]  alloc_pool_huge_page+0x7d/0x100
[   28.705448]  set_max_huge_pages+0x162/0x390
[   28.706530]  nr_hugepages_store_common+0x91/0xf0
[   28.707689]  kernfs_fop_write_iter+0x108/0x1f0
[   28.708819]  vfs_write+0x207/0x400
[   28.709743]  ksys_write+0x63/0xe0
[   28.710640]  do_syscall_64+0x37/0x90
[   28.712649]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[   28.713919] RIP: 0033:0x7f6003aade87
[   28.714879] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[   28.719096] RSP: 002b:00007ffdfd9d2e98 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   28.720945] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f6003aade87
[   28.722626] RDX: 0000000000000002 RSI: 00005615f9bac620 RDI: 0000000000000001
[   28.724288] RBP: 00005615f9bac620 R08: 000000000000000a R09: 00007f6003b450c0
[   28.725939] R10: 00007f6003b44fc0 R11: 0000000000000246 R12: 0000000000000002
[   28.727611] R13: 00007f6003b81520 R14: 0000000000000002 R15: 00007f6003b81720
[   28.729285]  </TASK>
[   28.729944] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs joydev snd_pcm snd_timer 9pnet_virtio snd soundcore virtio_balloon 9pnet virtio_console virtio_net virtio_blk net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel virtio_pci ghash_clmulni_intel serio_raw virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[   28.739325] ---[ end trace 0000000000000000 ]---
Johannes Weiner Sept. 15, 2023, 2:16 p.m. UTC | #2
On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> In next-20230913, I started hitting the following BUG.  Seems related
> to this series.  And, if series is reverted I do not see the BUG.
> 
> I can easily reproduce on a small 16G VM.  kernel command line contains
> "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
> while true; do
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> done
> 
> For the BUG below I believe it was the first (or second) 1G page creation from
> CMA that triggered:  cma_alloc of 1G.
> 
> Sorry, have not looked deeper into the issue.

Thanks for the report, and sorry about the breakage!

I was scratching my head at this:

                        /* MIGRATE_ISOLATE page should not go to pcplists */
                        VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);

because there is nothing in page isolation that prevents setting
MIGRATE_ISOLATE on something that's on the pcplist already. So why
didn't this trigger before already?

Then it clicked: it used to only check the *pcpmigratetype* determined
by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.

Pages that get isolated while *already* on the pcplist are fine, and
are handled properly:

                        mt = get_pcppage_migratetype(page);

                        /* MIGRATE_ISOLATE page should not go to pcplists */
                        VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);

                        /* Pageblock could have been isolated meanwhile */
                        if (unlikely(isolated_pageblocks))
                                mt = get_pageblock_migratetype(page);

So this was purely a sanity check against the pcpmigratetype cache
operations. With that gone, we can remove it.
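
For reference, this is roughly why the cached pcpmigratetype can never
be MIGRATE_ISOLATE in the first place: free_unref_page() diverts
isolated pages straight to the buddy before they ever reach a pcplist
(abridged sketch of the pre-series code, details elided):

    migratetype = get_pcppage_migratetype(page);
    if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
            /* isolated pages bypass the pcplists entirely */
            if (unlikely(is_migrate_isolate(migratetype))) {
                    free_one_page(page_zone(page), page, pfn, order,
                                  migratetype, FPI_NONE);
                    return;
            }
            /* HIGHATOMIC and CMA frees go onto the MOVABLE pcplist */
            migratetype = MIGRATE_MOVABLE;
    }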

---

From b0cb92ed10b40fab0921002effa8b726df245790 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 15 Sep 2023 09:59:52 -0400
Subject: [PATCH] mm: page_alloc: remove pcppage migratetype caching fix

Mike reports the following crash in -next:

[   28.643019] page:ffffea0004fb4280 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x13ed0a
[   28.645455] flags: 0x200000000000000(node=0|zone=2)
[   28.646835] page_type: 0xffffffff()
[   28.647886] raw: 0200000000000000 dead000000000100 dead000000000122 0000000000000000
[   28.651170] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[   28.653124] page dumped because: VM_BUG_ON_PAGE(is_migrate_isolate(mt))
[   28.654769] ------------[ cut here ]------------
[   28.655972] kernel BUG at mm/page_alloc.c:1231!

This VM_BUG_ON() used to check that the cached pcppage_migratetype set
by free_unref_page() wasn't MIGRATE_ISOLATE.

When I removed the caching, I erroneously changed the assert to check
that no isolated pages are on the pcplist. This is quite different,
because pages can be isolated *after* they had been put on the
freelist already (which is handled just fine).

IOW, this was purely a sanity check on the migratetype caching. With
that gone, the check should have been removed as well. Do that now.

Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3f1c777feed..9469e4660b53 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1207,9 +1207,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			count -= nr_pages;
 			pcp->count -= nr_pages;
 
-			/* MIGRATE_ISOLATE page should not go to pcplists */
-			VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-
 			__free_one_page(page, pfn, zone, order, mt, FPI_NONE);
 			trace_mm_page_pcpu_drain(page, order, mt);
 		} while (count > 0 && !list_empty(list));
Mike Kravetz Sept. 15, 2023, 3:05 p.m. UTC | #3
On 09/15/23 10:16, Johannes Weiner wrote:
> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > In next-20230913, I started hitting the following BUG.  Seems related
> > to this series.  And, if series is reverted I do not see the BUG.
> > 
> > I can easily reproduce on a small 16G VM.  kernel command line contains
> > "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
> > while true; do
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> > 
> > For the BUG below I believe it was the first (or second) 1G page creation from
> > CMA that triggered:  cma_alloc of 1G.
> > 
> > Sorry, have not looked deeper into the issue.
> 
> Thanks for the report, and sorry about the breakage!
> 
> I was scratching my head at this:
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
> because there is nothing in page isolation that prevents setting
> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> didn't this trigger before already?
> 
> Then it clicked: it used to only check the *pcpmigratetype* determined
> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> 
> Pages that get isolated while *already* on the pcplist are fine, and
> are handled properly:
> 
>                         mt = get_pcppage_migratetype(page);
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
>                         /* Pageblock could have been isolated meanwhile */
>                         if (unlikely(isolated_pageblocks))
>                                 mt = get_pageblock_migratetype(page);
> 
> So this was purely a sanity check against the pcpmigratetype cache
> operations. With that gone, we can remove it.

Thanks!  That makes sense.

Glad my testing (for something else) triggered it.
Mike Kravetz Sept. 16, 2023, 7:57 p.m. UTC | #4
On 09/15/23 10:16, Johannes Weiner wrote:
> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > In next-20230913, I started hitting the following BUG.  Seems related
> > to this series.  And, if series is reverted I do not see the BUG.
> > 
> > I can easily reproduce on a small 16G VM.  kernel command line contains
> > "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
> > while true; do
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> > 
> > For the BUG below I believe it was the first (or second) 1G page creation from
> > CMA that triggered:  cma_alloc of 1G.
> > 
> > Sorry, have not looked deeper into the issue.
> 
> Thanks for the report, and sorry about the breakage!
> 
> I was scratching my head at this:
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
> because there is nothing in page isolation that prevents setting
> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> didn't this trigger before already?
> 
> Then it clicked: it used to only check the *pcpmigratetype* determined
> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> 
> Pages that get isolated while *already* on the pcplist are fine, and
> are handled properly:
> 
>                         mt = get_pcppage_migratetype(page);
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
>                         /* Pageblock could have been isolated meanwhile */
>                         if (unlikely(isolated_pageblocks))
>                                 mt = get_pageblock_migratetype(page);
> 
> So this was purely a sanity check against the pcpmigratetype cache
> operations. With that gone, we can remove it.

With the patch below applied, a slightly different workload triggers the
following warnings.  It seems related, and appears to go away when
reverting the series.

[  331.595382] ------------[ cut here ]------------
[  331.596665] page type is 5, passed migratetype is 1 (nr=512)
[  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
[  331.600549] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs 9pnet_virtio snd_pcm joydev snd_timer virtio_balloon snd soundcore 9pnet virtio_blk virtio_console virtio_net net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[  331.609530] CPU: 2 PID: 935 Comm: bash Tainted: G        W          6.6.0-rc1-next-20230913+ #26
[  331.611603] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  331.613527] RIP: 0010:expand+0x1c9/0x200
[  331.614492] Code: 89 ef be 07 00 00 00 c6 05 c9 b1 35 01 01 e8 de f7 ff ff 8b 4c 24 30 8b 54 24 0c 48 c7 c7 68 9f 22 82 48 89 c6 e8 97 b3 df ff <0f> 0b e9 db fe ff ff 48 c7 c6 f8 9f 22 82 48 89 df e8 41 e3 fc ff
[  331.618540] RSP: 0018:ffffc90003c97a88 EFLAGS: 00010086
[  331.619801] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
[  331.621331] RDX: 0000000000000005 RSI: ffffffff8224dce6 RDI: 00000000ffffffff
[  331.622914] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
[  331.624712] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: ffff88827fffcd80
[  331.626317] R13: 0000000000000009 R14: 0000000000000200 R15: 000000000000000a
[  331.627810] FS:  00007f24b3932740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[  331.630593] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  331.631865] CR2: 0000560a53875018 CR3: 000000017eee8003 CR4: 0000000000370ee0
[  331.633382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  331.634873] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  331.636324] Call Trace:
[  331.636934]  <TASK>
[  331.637521]  ? expand+0x1c9/0x200
[  331.638320]  ? __warn+0x7d/0x130
[  331.639116]  ? expand+0x1c9/0x200
[  331.639957]  ? report_bug+0x18d/0x1c0
[  331.640832]  ? handle_bug+0x41/0x70
[  331.641635]  ? exc_invalid_op+0x13/0x60
[  331.642522]  ? asm_exc_invalid_op+0x16/0x20
[  331.643494]  ? expand+0x1c9/0x200
[  331.644264]  ? expand+0x1c9/0x200
[  331.645007]  rmqueue_bulk+0xf4/0x530
[  331.645847]  get_page_from_freelist+0x3ed/0x1040
[  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
[  331.647977]  __alloc_pages+0xec/0x240
[  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
[  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
[  331.650938]  alloc_pool_huge_folio+0xad/0x110
[  331.651909]  set_max_huge_pages+0x17d/0x390
[  331.652760]  nr_hugepages_store_common+0x91/0xf0
[  331.653825]  kernfs_fop_write_iter+0x108/0x1f0
[  331.654986]  vfs_write+0x207/0x400
[  331.655925]  ksys_write+0x63/0xe0
[  331.656832]  do_syscall_64+0x37/0x90
[  331.657793]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  331.660398] RIP: 0033:0x7f24b3a26e87
[  331.661342] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  331.665673] RSP: 002b:00007ffccd603de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  331.667541] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f24b3a26e87
[  331.669197] RDX: 0000000000000005 RSI: 0000560a5381bb50 RDI: 0000000000000001
[  331.670883] RBP: 0000560a5381bb50 R08: 000000000000000a R09: 00007f24b3abe0c0
[  331.672536] R10: 00007f24b3abdfc0 R11: 0000000000000246 R12: 0000000000000005
[  331.674175] R13: 00007f24b3afa520 R14: 0000000000000005 R15: 00007f24b3afa720
[  331.675841]  </TASK>
[  331.676450] ---[ end trace 0000000000000000 ]---
[  331.677659] ------------[ cut here ]------------
[  331.679109] page type is 5, passed migratetype is 1 (nr=512)
[  331.680376] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
[  331.682314] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs 9pnet_virtio snd_pcm joydev snd_timer virtio_balloon snd soundcore 9pnet virtio_blk virtio_console virtio_net net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[  331.691852] CPU: 2 PID: 935 Comm: bash Tainted: G        W          6.6.0-rc1-next-20230913+ #26
[  331.694026] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  331.696162] RIP: 0010:del_page_from_free_list+0x137/0x170
[  331.697589] Code: c6 05 a0 b5 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 68 9f 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 69 b7 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 a0 9f 22 82 48 89 df e8 13 e7 fc ff
[  331.702060] RSP: 0018:ffffc90003c97ac8 EFLAGS: 00010086
[  331.703430] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
[  331.705284] RDX: 0000000000000005 RSI: ffffffff8224dce6 RDI: 00000000ffffffff
[  331.707101] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
[  331.708933] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: 0000000000000001
[  331.710754] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 0000000000000009
[  331.712637] FS:  00007f24b3932740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
[  331.714861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  331.716466] CR2: 0000560a53875018 CR3: 000000017eee8003 CR4: 0000000000370ee0
[  331.718441] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  331.720372] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  331.723583] Call Trace:
[  331.724351]  <TASK>
[  331.725045]  ? del_page_from_free_list+0x137/0x170
[  331.726370]  ? __warn+0x7d/0x130
[  331.727326]  ? del_page_from_free_list+0x137/0x170
[  331.728637]  ? report_bug+0x18d/0x1c0
[  331.729688]  ? handle_bug+0x41/0x70
[  331.730707]  ? exc_invalid_op+0x13/0x60
[  331.731798]  ? asm_exc_invalid_op+0x16/0x20
[  331.733007]  ? del_page_from_free_list+0x137/0x170
[  331.734317]  ? del_page_from_free_list+0x137/0x170
[  331.735649]  rmqueue_bulk+0xdf/0x530
[  331.736741]  get_page_from_freelist+0x3ed/0x1040
[  331.738069]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
[  331.739578]  __alloc_pages+0xec/0x240
[  331.740666]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
[  331.742135]  __alloc_fresh_hugetlb_folio+0x157/0x230
[  331.743521]  alloc_pool_huge_folio+0xad/0x110
[  331.744768]  set_max_huge_pages+0x17d/0x390
[  331.745988]  nr_hugepages_store_common+0x91/0xf0
[  331.747306]  kernfs_fop_write_iter+0x108/0x1f0
[  331.748651]  vfs_write+0x207/0x400
[  331.749735]  ksys_write+0x63/0xe0
[  331.750808]  do_syscall_64+0x37/0x90
[  331.753203]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  331.754857] RIP: 0033:0x7f24b3a26e87
[  331.756184] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  331.760239] RSP: 002b:00007ffccd603de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  331.761935] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f24b3a26e87
[  331.763524] RDX: 0000000000000005 RSI: 0000560a5381bb50 RDI: 0000000000000001
[  331.765102] RBP: 0000560a5381bb50 R08: 000000000000000a R09: 00007f24b3abe0c0
[  331.766740] R10: 00007f24b3abdfc0 R11: 0000000000000246 R12: 0000000000000005
[  331.768344] R13: 00007f24b3afa520 R14: 0000000000000005 R15: 00007f24b3afa720
[  331.769949]  </TASK>
[  331.770559] ---[ end trace 0000000000000000 ]---
Andrew Morton Sept. 16, 2023, 8:13 p.m. UTC | #5
On Sat, 16 Sep 2023 12:57:39 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > So this was purely a sanity check against the pcpmigratetype cache
> > operations. With that gone, we can remove it.
> 
> With the patch below applied, a slightly different workload triggers the
> following warnings.  It seems related, and appears to go away when
> reverting the series.

Thanks, I've dropped this v2 series from mm.git.
Vlastimil Babka Sept. 18, 2023, 7:07 a.m. UTC | #6
On 9/15/23 16:16, Johannes Weiner wrote:
> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>> In next-20230913, I started hitting the following BUG.  Seems related
>> to this series.  And, if series is reverted I do not see the BUG.
>> 
>> I can easily reproduce on a small 16G VM.  kernel command line contains
>> "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
>> while true; do
>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>> done
>> 
>> For the BUG below I believe it was the first (or second) 1G page creation from
>> CMA that triggered:  cma_alloc of 1G.
>> 
>> Sorry, have not looked deeper into the issue.
> 
> Thanks for the report, and sorry about the breakage!
> 
> I was scratching my head at this:
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
> because there is nothing in page isolation that prevents setting
> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> didn't this trigger before already?
> 
> Then it clicked: it used to only check the *pcpmigratetype* determined
> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> 
> Pages that get isolated while *already* on the pcplist are fine, and
> are handled properly:
> 
>                         mt = get_pcppage_migratetype(page);
> 
>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> 
>                         /* Pageblock could have been isolated meanwhile */
>                         if (unlikely(isolated_pageblocks))
>                                 mt = get_pageblock_migratetype(page);
> 
> So this was purely a sanity check against the pcpmigratetype cache
> operations. With that gone, we can remove it.

Agreed, I assume you'll fold it in 1/6 in v3.
Vlastimil Babka Sept. 18, 2023, 7:16 a.m. UTC | #7
On 9/16/23 21:57, Mike Kravetz wrote:
> On 09/15/23 10:16, Johannes Weiner wrote:
>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>> > In next-20230913, I started hitting the following BUG.  Seems related
>> > to this series.  And, if series is reverted I do not see the BUG.
>> > 
>> > I can easily reproduce on a small 16G VM.  kernel command line contains
>> > "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
>> > while true; do
>> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>> > done
>> > 
>> > For the BUG below I believe it was the first (or second) 1G page creation from
>> > CMA that triggered:  cma_alloc of 1G.
>> > 
>> > Sorry, have not looked deeper into the issue.
>> 
>> Thanks for the report, and sorry about the breakage!
>> 
>> I was scratching my head at this:
>> 
>>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>> 
>> because there is nothing in page isolation that prevents setting
>> MIGRATE_ISOLATE on something that's on the pcplist already. So why
>> didn't this trigger before already?
>> 
>> Then it clicked: it used to only check the *pcpmigratetype* determined
>> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
>> 
>> Pages that get isolated while *already* on the pcplist are fine, and
>> are handled properly:
>> 
>>                         mt = get_pcppage_migratetype(page);
>> 
>>                         /* MIGRATE_ISOLATE page should not go to pcplists */
>>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
>> 
>>                         /* Pageblock could have been isolated meanwhile */
>>                         if (unlikely(isolated_pageblocks))
>>                                 mt = get_pageblock_migratetype(page);
>> 
>> So this was purely a sanity check against the pcpmigratetype cache
>> operations. With that gone, we can remove it.
> 
> With the patch below applied, a slightly different workload triggers the
> following warnings.  It seems related, and appears to go away when
> reverting the series.
> 
> [  331.595382] ------------[ cut here ]------------
> [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200

Initially I thought this demonstrates the possible race I was suggesting in
reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
are trying to get a MOVABLE page from a CMA page block, which is something
that's normally done and the pageblock stays CMA. So yeah if the warnings
are to stay, they need to handle this case. Maybe the same can happen with
HIGHATOMIC blocks?

> [  331.600549] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_seq 9p snd_seq_device netfs 9pnet_virtio snd_pcm joydev snd_timer virtio_balloon snd soundcore 9pnet virtio_blk virtio_console virtio_net net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
> [  331.609530] CPU: 2 PID: 935 Comm: bash Tainted: G        W          6.6.0-rc1-next-20230913+ #26
> [  331.611603] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [  331.613527] RIP: 0010:expand+0x1c9/0x200
> [  331.614492] Code: 89 ef be 07 00 00 00 c6 05 c9 b1 35 01 01 e8 de f7 ff ff 8b 4c 24 30 8b 54 24 0c 48 c7 c7 68 9f 22 82 48 89 c6 e8 97 b3 df ff <0f> 0b e9 db fe ff ff 48 c7 c6 f8 9f 22 82 48 89 df e8 41 e3 fc ff
> [  331.618540] RSP: 0018:ffffc90003c97a88 EFLAGS: 00010086
> [  331.619801] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
> [  331.621331] RDX: 0000000000000005 RSI: ffffffff8224dce6 RDI: 00000000ffffffff
> [  331.622914] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
> [  331.624712] R10: 00000000ffffdfff R11: ffffffff824660c0 R12: ffff88827fffcd80
> [  331.626317] R13: 0000000000000009 R14: 0000000000000200 R15: 000000000000000a
> [  331.627810] FS:  00007f24b3932740(0000) GS:ffff888477c00000(0000) knlGS:0000000000000000
> [  331.630593] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  331.631865] CR2: 0000560a53875018 CR3: 000000017eee8003 CR4: 0000000000370ee0
> [  331.633382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  331.634873] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  331.636324] Call Trace:
> [  331.636934]  <TASK>
> [  331.637521]  ? expand+0x1c9/0x200
> [  331.638320]  ? __warn+0x7d/0x130
> [  331.639116]  ? expand+0x1c9/0x200
> [  331.639957]  ? report_bug+0x18d/0x1c0
> [  331.640832]  ? handle_bug+0x41/0x70
> [  331.641635]  ? exc_invalid_op+0x13/0x60
> [  331.642522]  ? asm_exc_invalid_op+0x16/0x20
> [  331.643494]  ? expand+0x1c9/0x200
> [  331.644264]  ? expand+0x1c9/0x200
> [  331.645007]  rmqueue_bulk+0xf4/0x530
> [  331.645847]  get_page_from_freelist+0x3ed/0x1040
> [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
> [  331.647977]  __alloc_pages+0xec/0x240
> [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
> [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
> [  331.650938]  alloc_pool_huge_folio+0xad/0x110
> [  331.651909]  set_max_huge_pages+0x17d/0x390
> [  331.652760]  nr_hugepages_store_common+0x91/0xf0
> [  331.653825]  kernfs_fop_write_iter+0x108/0x1f0
> [  331.654986]  vfs_write+0x207/0x400
> [  331.655925]  ksys_write+0x63/0xe0
> [  331.656832]  do_syscall_64+0x37/0x90
> [  331.657793]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [  331.660398] RIP: 0033:0x7f24b3a26e87
> [  331.661342] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> [  331.665673] RSP: 002b:00007ffccd603de8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  331.667541] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f24b3a26e87
> [  331.669197] RDX: 0000000000000005 RSI: 0000560a5381bb50 RDI: 0000000000000001
> [  331.670883] RBP: 0000560a5381bb50 R08: 000000000000000a R09: 00007f24b3abe0c0
> [  331.672536] R10: 00007f24b3abdfc0 R11: 0000000000000246 R12: 0000000000000005
> [  331.674175] R13: 00007f24b3afa520 R14: 0000000000000005 R15: 00007f24b3afa720
> [  331.675841]  </TASK>
> [  331.676450] ---[ end trace 0000000000000000 ]---
> [  331.677659] ------------[ cut here ]------------
> 
>
Johannes Weiner Sept. 18, 2023, 2:09 p.m. UTC | #8
On Mon, Sep 18, 2023 at 09:07:53AM +0200, Vlastimil Babka wrote:
> On 9/15/23 16:16, Johannes Weiner wrote:
> > On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> >> In next-20230913, I started hitting the following BUG.  Seems related
> >> to this series.  And, if series is reverted I do not see the BUG.
> >> 
> >> I can easily reproduce on a small 16G VM.  kernel command line contains
> >> "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
> >> while true; do
> >>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >> done
> >> 
> >> For the BUG below I believe it was the first (or second) 1G page creation from
> >> CMA that triggered:  cma_alloc of 1G.
> >> 
> >> Sorry, have not looked deeper into the issue.
> > 
> > Thanks for the report, and sorry about the breakage!
> > 
> > I was scratching my head at this:
> > 
> >                         /* MIGRATE_ISOLATE page should not go to pcplists */
> >                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> > 
> > because there is nothing in page isolation that prevents setting
> > MIGRATE_ISOLATE on something that's on the pcplist already. So why
> > didn't this trigger before already?
> > 
> > Then it clicked: it used to only check the *pcpmigratetype* determined
> > by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> > 
> > Pages that get isolated while *already* on the pcplist are fine, and
> > are handled properly:
> > 
> >                         mt = get_pcppage_migratetype(page);
> > 
> >                         /* MIGRATE_ISOLATE page should not go to pcplists */
> >                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> > 
> >                         /* Pageblock could have been isolated meanwhile */
> >                         if (unlikely(isolated_pageblocks))
> >                                 mt = get_pageblock_migratetype(page);
> > 
> > So this was purely a sanity check against the pcpmigratetype cache
> > operations. With that gone, we can remove it.
> 
> Agreed, I assume you'll fold it in 1/6 in v3.

Yes, will do.
Johannes Weiner Sept. 18, 2023, 2:52 p.m. UTC | #9
On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> On 9/16/23 21:57, Mike Kravetz wrote:
> > On 09/15/23 10:16, Johannes Weiner wrote:
> >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> >> > In next-20230913, I started hitting the following BUG.  Seems related
> >> > to this series.  And, if series is reverted I do not see the BUG.
> >> > 
> >> > I can easily reproduce on a small 16G VM.  kernel command line contains
> >> > "hugetlb_free_vmemmap=on hugetlb_cma=4G".  Then run the script,
> >> > while true; do
> >> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >> > done
> >> > 
> >> > For the BUG below I believe it was the first (or second) 1G page creation from
> >> > CMA that triggered:  cma_alloc of 1G.
> >> > 
> >> > Sorry, have not looked deeper into the issue.
> >> 
> >> Thanks for the report, and sorry about the breakage!
> >> 
> >> I was scratching my head at this:
> >> 
> >>                         /* MIGRATE_ISOLATE page should not go to pcplists */
> >>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> >> 
> >> because there is nothing in page isolation that prevents setting
> >> MIGRATE_ISOLATE on something that's on the pcplist already. So why
> >> didn't this trigger before already?
> >> 
> >> Then it clicked: it used to only check the *pcpmigratetype* determined
> >> by free_unref_page(), which of course mustn't be MIGRATE_ISOLATE.
> >> 
> >> Pages that get isolated while *already* on the pcplist are fine, and
> >> are handled properly:
> >> 
> >>                         mt = get_pcppage_migratetype(page);
> >> 
> >>                         /* MIGRATE_ISOLATE page should not go to pcplists */
> >>                         VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> >> 
> >>                         /* Pageblock could have been isolated meanwhile */
> >>                         if (unlikely(isolated_pageblocks))
> >>                                 mt = get_pageblock_migratetype(page);
> >> 
> >> So this was purely a sanity check against the pcpmigratetype cache
> >> operations. With that gone, we can remove it.
> > 
> > With the patch below applied, a slightly different workload triggers the
> > following warnings.  It seems related, and appears to go away when
> > reverting the series.
> > 
> > [  331.595382] ------------[ cut here ]------------
> > [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> 
> Initially I thought this demonstrates the possible race I was suggesting in
> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> are trying to get a MOVABLE page from a CMA page block, which is something
> that's normally done and the pageblock stays CMA. So yeah if the warnings
> are to stay, they need to handle this case. Maybe the same can happen with
> HIGHATOMIC blocks?

Hm I don't think that's quite it.

CMA and HIGHATOMIC have their own freelists. When MOVABLE requests dip
into CMA and HIGHATOMIC, we explicitly pass that migratetype to
__rmqueue_smallest(). This takes a chunk of e.g. CMA, expands the
remainder to the CMA freelist, then returns the page. While you get a
different mt than requested, the freelist typing should be consistent.
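
Roughly, abridged from the pre-series __rmqueue_smallest() (details
elided):

    for (current_order = order; current_order <= MAX_ORDER; ++current_order) {
            area = &(zone->free_area[current_order]);
            page = get_page_from_free_area(area, migratetype);
            if (!page)
                    continue;
            del_page_from_free_list(page, zone, current_order);
            /* remainder stays on freelists of the same migratetype */
            expand(zone, page, order, current_order, migratetype);
            set_pcppage_migratetype(page, migratetype);
            return page;
    }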

In this splat, the migratetype passed to __rmqueue_smallest() is
MOVABLE. There is no preceding warning from del_page_from_freelist()
(Mike, correct me if I'm wrong), so we got a confirmed MOVABLE
order-10 block from the MOVABLE list. So far so good. However, when we
expand() the order-9 tail of this block to the MOVABLE list, it warns
that its pageblock type is CMA.

This means we have an order-10 page where one half is MOVABLE and the
other is CMA.

I don't see how the merging code in __free_one_page() could have done
that. The CMA buddy would have failed the migrate_is_mergeable() test
and we should have left it at order-9s.
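
That guard in __free_one_page() looks roughly like this (abridged):

    if (unlikely(order >= pageblock_order)) {
            int buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);

            /*
             * Don't merge across differently typed pageblocks unless
             * both types are freely convertible.
             */
            if (migratetype != buddy_mt &&
                (!migratetype_is_mergeable(migratetype) ||
                 !migratetype_is_mergeable(buddy_mt)))
                    goto done_merging;
    }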

I also don't see how the CMA setup could have done this because
MIGRATE_CMA is set on the range before the pages are fed to the buddy.

Mike, could you describe the workload that is triggering this?

Does this reproduce instantly and reliably?

Is there high load on the system, or is it requesting the huge page
with not much else going on?

Do you see compact_* history in /proc/vmstat after this triggers?

Could you please also provide /proc/zoneinfo, /proc/pagetypeinfo and
the hugetlb_cma= parameter you're using?

Thanks!
Mike Kravetz Sept. 18, 2023, 5:40 p.m. UTC | #10
On 09/18/23 10:52, Johannes Weiner wrote:
> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > On 9/16/23 21:57, Mike Kravetz wrote:
> > > On 09/15/23 10:16, Johannes Weiner wrote:
> > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > > 
> > > With the patch below applied, a slightly different workload triggers the
> > > following warnings.  It seems related, and appears to go away when
> > > reverting the series.
> > > 
> > > [  331.595382] ------------[ cut here ]------------
> > > [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> > 
> > Initially I thought this demonstrates the possible race I was suggesting in
> > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> > are trying to get a MOVABLE page from a CMA page block, which is something
> > that's normally done and the pageblock stays CMA. So yeah if the warnings
> > are to stay, they need to handle this case. Maybe the same can happen with
> > HIGHATOMIC blocks?
> 
> Hm I don't think that's quite it.
> 
> CMA and HIGHATOMIC have their own freelists. When MOVABLE requests dip
> into CMA and HIGHATOMIC, we explicitly pass that migratetype to
> __rmqueue_smallest(). This takes a chunk of e.g. CMA, expands the
> remainder to the CMA freelist, then returns the page. While you get a
> different mt than requested, the freelist typing should be consistent.
> 
> In this splat, the migratetype passed to __rmqueue_smallest() is
> MOVABLE. There is no preceding warning from del_page_from_freelist()
> (Mike, correct me if I'm wrong), so we got a confirmed MOVABLE
> order-10 block from the MOVABLE list. So far so good. However, when we
> expand() the order-9 tail of this block to the MOVABLE list, it warns
> that its pageblock type is CMA.
> 
> This means we have an order-10 page where one half is MOVABLE and the
> other is CMA.
> 
> I don't see how the merging code in __free_one_page() could have done
> that. The CMA buddy would have failed the migrate_is_mergeable() test
> and we should have left it at order-9s.
> 
> I also don't see how the CMA setup could have done this because
> MIGRATE_CMA is set on the range before the pages are fed to the buddy.
> 
> Mike, could you describe the workload that is triggering this?

This 'slightly different workload' is actually a slightly different
environment.  Sorry for mis-speaking!  The slight difference is that this
environment does not use the 'alloc hugetlb gigantic pages from CMA'
(hugetlb_cma) feature that triggered the previous issue.

This is still on a 16G VM.  Kernel command line here is:
"BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
hugetlb_free_vmemmap=on"

The workload is just running this script:
while true; do
 echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
 echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
 echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
done

> 
> Does this reproduce instantly and reliably?
> 

It is not 'instant' but will reproduce fairly reliably within a minute
or so.

Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
will eventually be freed via __free_pages(folio, 9).

> Is there high load on the system, or is it requesting the huge page
> with not much else going on?

Only the script was running.

> Do you see compact_* history in /proc/vmstat after this triggers?

As one might expect, compact_isolated continually increases during this
run.

> Could you please also provide /proc/zoneinfo, /proc/pagetypeinfo and
> the hugetlb_cma= parameter you're using?

As mentioned above, hugetlb_cma is not used in this environment.  Strangely
enough, this does not reproduce (easily at least) if I use hugetlb_cma as
in the previous report.

The following are during a run after WARNING is triggered.

# cat /proc/zoneinfo
Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 11800
      nr_active_anon 109
      nr_inactive_file 38161
      nr_active_file 10007
      nr_unevictable 12
      nr_slab_reclaimable 2766
      nr_slab_unreclaimable 6881
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 0
      workingset_refault_anon 0
      workingset_refault_file 0
      workingset_activate_anon 0
      workingset_activate_file 0
      workingset_restore_anon 0
      workingset_restore_file 0
      workingset_nodereclaim 0
      nr_anon_pages 11750
      nr_mapped    18402
      nr_file_pages 48339
      nr_dirty     0
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     166
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 6
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   14766
      nr_written   7701
      nr_throttled_written 0
      nr_kernel_misc_reclaimable 0
      nr_foll_pin_acquired 96
      nr_foll_pin_released 96
      nr_kernel_stack 1816
      nr_page_table_pages 1100
      nr_sec_page_table_pages 0
      nr_swapcached 0
  pages free     3840
        boost    0
        min      21
        low      26
        high     31
        spanned  4095
        present  3998
        managed  3840
        cma      0
        protection: (0, 1908, 7923, 7923)
      nr_free_pages 3840
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     0
      numa_miss    0
      numa_foreign 0
      numa_interleave 0
      numa_local   0
      numa_other   0
  pagesets
    cpu: 0
              count: 0
              high:  13
              batch: 1
  vm stats threshold: 6
    cpu: 1
              count: 0
              high:  13
              batch: 1
  vm stats threshold: 6
    cpu: 2
              count: 0
              high:  13
              batch: 1
  vm stats threshold: 6
    cpu: 3
              count: 0
              high:  13
              batch: 1
  vm stats threshold: 6
  node_unreclaimable:  0
  start_pfn:           1
Node 0, zone    DMA32
  pages free     495317
        boost    0
        min      2687
        low      3358
        high     4029
        spanned  1044480
        present  520156
        managed  496486
        cma      0
        protection: (0, 0, 6015, 6015)
      nr_free_pages 495317
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     0
      numa_miss    0
      numa_foreign 0
      numa_interleave 0
      numa_local   0
      numa_other   0
  pagesets
    cpu: 0
              count: 913
              high:  1679
              batch: 63
  vm stats threshold: 30
    cpu: 1
              count: 0
              high:  1679
              batch: 63
  vm stats threshold: 30
    cpu: 2
              count: 0
              high:  1679
              batch: 63
  vm stats threshold: 30
    cpu: 3
              count: 256
              high:  1679
              batch: 63
  vm stats threshold: 30
  node_unreclaimable:  0
  start_pfn:           4096
Node 0, zone   Normal
  pages free     1360836
        boost    0
        min      8473
        low      10591
        high     12709
        spanned  1572864
        present  1572864
        managed  1552266
        cma      0
        protection: (0, 0, 0, 0)
      nr_free_pages 1360836
      nr_zone_inactive_anon 11800
      nr_zone_active_anon 109
      nr_zone_inactive_file 38161
      nr_zone_active_file 10007
      nr_zone_unevictable 12
      nr_zone_write_pending 0
      nr_mlock     12
      nr_bounce    0
      nr_zspages   3
      nr_free_cma  0
      numa_hit     10623572
      numa_miss    0
      numa_foreign 0
      numa_interleave 1357
      numa_local   6902986
      numa_other   3720586
  pagesets
    cpu: 0
              count: 156
              high:  5295
              batch: 63
  vm stats threshold: 42
    cpu: 1
              count: 210
              high:  5295
              batch: 63
  vm stats threshold: 42
    cpu: 2
              count: 4956
              high:  5295
              batch: 63
  vm stats threshold: 42
    cpu: 3
              count: 1
              high:  5295
              batch: 63
  vm stats threshold: 42
  node_unreclaimable:  0
  start_pfn:           1048576
Node 0, zone  Movable
  pages free     0
        boost    0
        min      32
        low      32
        high     32
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0)
Node 1, zone      DMA
  pages free     0
        boost    0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0)
Node 1, zone    DMA32
  pages free     0
        boost    0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0)
Node 1, zone   Normal
  per-node stats
      nr_inactive_anon 15381
      nr_active_anon 81
      nr_inactive_file 66550
      nr_active_file 25965
      nr_unevictable 421
      nr_slab_reclaimable 4069
      nr_slab_unreclaimable 7836
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 0
      workingset_refault_anon 0
      workingset_refault_file 0
      workingset_activate_anon 0
      workingset_activate_file 0
      workingset_restore_anon 0
      workingset_restore_file 0
      workingset_nodereclaim 0
      nr_anon_pages 15420
      nr_mapped    24331
      nr_file_pages 92978
      nr_dirty     0
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     100
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 11
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   6217
      nr_written   2902
      nr_throttled_written 0
      nr_kernel_misc_reclaimable 0
      nr_foll_pin_acquired 0
      nr_foll_pin_released 0
      nr_kernel_stack 1656
      nr_page_table_pages 756
      nr_sec_page_table_pages 0
      nr_swapcached 0
  pages free     1829073
        boost    0
        min      11345
        low      14181
        high     17017
        spanned  2097152
        present  2097152
        managed  2086594
        cma      0
        protection: (0, 0, 0, 0)
      nr_free_pages 1829073
      nr_zone_inactive_anon 15381
      nr_zone_active_anon 81
      nr_zone_inactive_file 66550
      nr_zone_active_file 25965
      nr_zone_unevictable 421
      nr_zone_write_pending 0
      nr_mlock     421
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     10522401
      numa_miss    0
      numa_foreign 0
      numa_interleave 961
      numa_local   4057399
      numa_other   6465002
  pagesets
    cpu: 0
              count: 0
              high:  7090
              batch: 63
  vm stats threshold: 42
    cpu: 1
              count: 17
              high:  7090
              batch: 63
  vm stats threshold: 42
    cpu: 2
              count: 6997
              high:  7090
              batch: 63
  vm stats threshold: 42
    cpu: 3
              count: 0
              high:  7090
              batch: 63
  vm stats threshold: 42
  node_unreclaimable:  0
  start_pfn:           2621440
Node 1, zone  Movable
  pages free     0
        boost    0
        min      32
        low      32
        high     32
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0)

# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      1      0      0 
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3 
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type      Movable      1      0      1      2      2      3      3      3      4      4    480 
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type    Unmovable    566     14     22      7      8      8      9      4      7      0      1 
Node    0, zone   Normal, type      Movable    214    299    120     53     15     10      6      6      1      4   1159 
Node    0, zone   Normal, type  Reclaimable      0      9     18     11      6      1      0      0      0      0      0 
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0 
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0 

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate 
Node 0, zone      DMA            1            7            0            0            0            0 
Node 0, zone    DMA32            0         1016            0            0            0            0 
Node 0, zone   Normal           71         2995            6            0            0            0 
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10 
Node    1, zone   Normal, type    Unmovable    459     12      5      6      6      5      5      5      6      2      1 
Node    1, zone   Normal, type      Movable   1287    502    171     85     34     14     13      8      2      5   1861 
Node    1, zone   Normal, type  Reclaimable      1      5     12      6      9      3      1      1      0      1      0 
Node    1, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0 
Node    1, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0 
Node    1, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      3 

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate 
Node 1, zone   Normal          101         3977           10            0            0            8
Johannes Weiner Sept. 19, 2023, 6:49 a.m. UTC | #11
On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
> On 09/18/23 10:52, Johannes Weiner wrote:
> > On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > > On 9/16/23 21:57, Mike Kravetz wrote:
> > > > On 09/15/23 10:16, Johannes Weiner wrote:
> > > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > > > 
> > > > With the patch below applied, a slightly different workload triggers the
> > > > following warnings.  It seems related, and appears to go away when
> > > > reverting the series.
> > > > 
> > > > [  331.595382] ------------[ cut here ]------------
> > > > [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > > [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> > > 
> > > Initially I thought this demonstrates the possible race I was suggesting in
> > > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> > > are trying to get a MOVABLE page from a CMA page block, which is something
> > > that's normally done and the pageblock stays CMA. So yeah if the warnings
> > > are to stay, they need to handle this case. Maybe the same can happen with
> > > HIGHATOMIC blocks?

Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
show any CMA pages.

5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
and HIGHATOMIC.
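
For reference, the enum lays out roughly like this with CONFIG_CMA and
CONFIG_MEMORY_ISOLATION enabled (paraphrased from include/linux/mmzone.h,
comments abbreviated):

        enum migratetype {
                MIGRATE_UNMOVABLE,                      /* 0 */
                MIGRATE_MOVABLE,                        /* 1 */
                MIGRATE_RECLAIMABLE,                    /* 2 */
                MIGRATE_PCPTYPES,                       /* 3, number of pcp list types */
                MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,  /* also 3 */
        #ifdef CONFIG_CMA
                MIGRATE_CMA,                            /* 4 */
        #endif
        #ifdef CONFIG_MEMORY_ISOLATION
                MIGRATE_ISOLATE,                        /* 5 here, 4 without CMA */
        #endif
                MIGRATE_TYPES
        };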

> > This means we have an order-10 page where one half is MOVABLE and the
> > other is CMA.

This means the scenario is different:

We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
that the first pageblock is indeed MOVABLE. During the expand, the
second pageblock turns out to be of type MIGRATE_ISOLATE.
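
For illustration, the kind of check that fires here looks roughly like
the sketch below (the helper name and exact placement are mine, not the
series' actual code): a page taken off or put on a freelist has its
pageblock type compared against the migratetype the caller thinks it is
operating on.

        /* Hypothetical sketch of the freelist type check */
        static inline void check_freelist_mt(struct page *page,
                                             int migratetype,
                                             unsigned long nr)
        {
                int mt = get_pageblock_migratetype(page);

                if (unlikely(mt != migratetype))
                        WARN(1, "page type is %d, passed migratetype is %d (nr=%lu)\n",
                             mt, migratetype, nr);
        }

In this case the head pageblock passes the check on removal, but the
tail pageblock trips it during expand().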

The page allocator wouldn't have merged those types. It triggers a bit
too fast to be a race condition.

It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
while the head is on the list, and then stranded there.

Could this be an issue in the page_isolation code? Maybe a range
rounding error?

Zi Yan, does this ring a bell for you?

I don't quite see how my patches could have caused this. But AFAICS we
also didn't have warnings for this scenario so it could be an old bug.

> > Mike, could you describe the workload that is triggering this?
> 
> This 'slightly different workload' is actually a slightly different
> environment.  Sorry for mis-speaking!  The slight difference is that this
> environment does not use the 'alloc hugetlb gigantic pages from CMA'
> (hugetlb_cma) feature that triggered the previous issue.
> 
> This is still on a 16G VM.  Kernel command line here is:
> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
> hugetlb_free_vmemmap=on"
> 
> The workload is just running this script:
> while true; do
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> done
> 
> > 
> > Does this reproduce instantly and reliably?
> > 
> 
> It is not 'instant' but will reproduce fairly reliably within a minute
> or so.
> 
> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
> to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
> will eventually be freed via __free_pages(folio, 9).

No luck reproducing this yet, but I have a question. In that crash
stack trace, the expand() is called via this:

 [  331.645847]  get_page_from_freelist+0x3ed/0x1040
 [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
 [  331.647977]  __alloc_pages+0xec/0x240
 [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
 [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
 [  331.650938]  alloc_pool_huge_folio+0xad/0x110
 [  331.651909]  set_max_huge_pages+0x17d/0x390

I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
alloc_fresh_hugetlb_folio(), which has this:

        if (hstate_is_gigantic(h))
                folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
        else
                folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
                                nid, nmask, node_alloc_noretry);

where gigantic is defined as the order exceeding MAX_ORDER, which
should be the case for 1G pages on x86.
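
For reference, that check is roughly the following (quoting from memory,
the exact form may differ in the tree under test):

        static inline bool hstate_is_gigantic(struct hstate *h)
        {
                /* 1G pages are order 18 on x86-64, vs MAX_ORDER == 10 */
                return huge_page_order(h) > MAX_ORDER;
        }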

So the crashing stack must be from a 2M allocation, no? I'm confused
how that could happen with the above test case.
Zi Yan Sept. 19, 2023, 12:37 p.m. UTC | #12
On 19 Sep 2023, at 2:49, Johannes Weiner wrote:

> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
>> On 09/18/23 10:52, Johannes Weiner wrote:
>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
>>>> On 9/16/23 21:57, Mike Kravetz wrote:
>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>>>>>
>>>>> With the patch below applied, a slightly different workload triggers the
>>>>> following warnings.  It seems related, and appears to go away when
>>>>> reverting the series.
>>>>>
>>>>> [  331.595382] ------------[ cut here ]------------
>>>>> [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
>>>>> [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
>>>>
>>>> Initially I thought this demonstrates the possible race I was suggesting in
>>>> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
>>>> are trying to get a MOVABLE page from a CMA page block, which is something
>>>> that's normally done and the pageblock stays CMA. So yeah if the warnings
>>>> are to stay, they need to handle this case. Maybe the same can happen with
>>>> HIGHATOMIC blocks?
>
> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
> show any CMA pages.
>
> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
> and HIGHATOMIC.
>
>>> This means we have an order-10 page where one half is MOVABLE and the
>>> other is CMA.
>
> This means the scenario is different:
>
> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
> that the first pageblock is indeed MOVABLE. During the expand, the
> second pageblock turns out to be of type MIGRATE_ISOLATE.
>
> The page allocator wouldn't have merged those types. It triggers a bit
> too fast to be a race condition.
>
> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
> while the head is on the list, and then stranded there.
>
> Could this be an issue in the page_isolation code? Maybe a range
> rounding error?
>
> Zi Yan, does this ring a bell for you?

Since isolation code works on pageblocks, a scenario I can think of
is that alloc_contig_range() is given a range starting from that tail
pageblock.

Hmm, I also notice that with your change, move_freepages_block() called
by set_migratetype_isolate() might operate on a different isolation
range than before.
I wonder if reverting that behavior would fix the issue. Basically,
do

	if (!zone_spans_pfn(zone, start))
		start = pfn;

in prep_move_freepages_block(). Just a wild guess. Mike, do you mind
giving it a try?

Meanwhile, let me try to reproduce it and look into it deeper.

>
> I don't quite see how my patches could have caused this. But AFAICS we
> also didn't have warnings for this scenario so it could be an old bug.
>
>>> Mike, could you describe the workload that is triggering this?
>>
>> This 'slightly different workload' is actually a slightly different
>> environment.  Sorry for mis-speaking!  The slight difference is that this
>> environment does not use the 'alloc hugetlb gigantic pages from CMA'
>> (hugetlb_cma) feature that triggered the previous issue.
>>
>> This is still on a 16G VM.  Kernel command line here is:
>> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
>> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
>> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
>> hugetlb_free_vmemmap=on"
>>
>> The workload is just running this script:
>> while true; do
>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>> done
>>
>>>
>>> Does this reproduce instantly and reliably?
>>>
>>
>> It is not 'instant' but will reproduce fairly reliably within a minute
>> or so.
>>
>> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
>> to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
>> will eventually be freed via __free_pages(folio, 9).
>
> No luck reproducing this yet, but I have a question. In that crash
> stack trace, the expand() is called via this:
>
>  [  331.645847]  get_page_from_freelist+0x3ed/0x1040
>  [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
>  [  331.647977]  __alloc_pages+0xec/0x240
>  [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
>  [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
>  [  331.650938]  alloc_pool_huge_folio+0xad/0x110
>  [  331.651909]  set_max_huge_pages+0x17d/0x390
>
> I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
> alloc_fresh_hugetlb_folio(), which has this:
>
>         if (hstate_is_gigantic(h))
>                 folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
>         else
>                 folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
>                                 nid, nmask, node_alloc_noretry);
>
> where gigantic is defined as the order exceeding MAX_ORDER, which
> should be the case for 1G pages on x86.
>
> So the crashing stack must be from a 2M allocation, no? I'm confused
> how that could happen with the above test case.

That matches my thinking too. Why did the crash happen during the 1GB
page allocation? The range should be 1GB-aligned, so of course it cannot
start in the middle of a MAX_ORDER free page block.
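
As a quick back-of-the-envelope check of that alignment argument (a
sketch with made-up macro names, assuming 4kB base pages and
MAX_ORDER == 10):

        #define PFNS_PER_1G     (1UL << 18)     /* 1G / 4kB = 262144 pages */
        #define PFNS_PER_BUDDY  (1UL << 10)     /* one MAX_ORDER buddy     */

        /* 1G alignment implies MAX_ORDER-buddy alignment, so a 1G-aligned
         * alloc_contig_range() cannot start in the middle of such a page. */
        _Static_assert(PFNS_PER_1G % PFNS_PER_BUDDY == 0,
                       "1G-aligned pfn is also order-10 aligned");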


--
Best Regards,
Yan, Zi
Zi Yan Sept. 19, 2023, 3:22 p.m. UTC | #13
On 19 Sep 2023, at 8:37, Zi Yan wrote:

> On 19 Sep 2023, at 2:49, Johannes Weiner wrote:
>
>> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
>>> On 09/18/23 10:52, Johannes Weiner wrote:
>>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
>>>>> On 9/16/23 21:57, Mike Kravetz wrote:
>>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
>>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>>>>>>
>>>>>> With the patch below applied, a slightly different workload triggers the
>>>>>> following warnings.  It seems related, and appears to go away when
>>>>>> reverting the series.
>>>>>>
>>>>>> [  331.595382] ------------[ cut here ]------------
>>>>>> [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
>>>>>> [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
>>>>>
>>>>> Initially I thought this demonstrates the possible race I was suggesting in
>>>>> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
>>>>> are trying to get a MOVABLE page from a CMA page block, which is something
>>>>> that's normally done and the pageblock stays CMA. So yeah if the warnings
>>>>> are to stay, they need to handle this case. Maybe the same can happen with
>>>>> HIGHATOMIC blocks?
>>
>> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
>> show any CMA pages.
>>
>> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
>> and HIGHATOMIC.
>>
>>>> This means we have an order-10 page where one half is MOVABLE and the
>>>> other is CMA.
>>
>> This means the scenario is different:
>>
>> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
>> that the first pageblock is indeed MOVABLE. During the expand, the
>> second pageblock turns out to be of type MIGRATE_ISOLATE.
>>
>> The page allocator wouldn't have merged those types. It triggers a bit
>> too fast to be a race condition.
>>
>> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
>> while the head is on the list, and then stranded there.
>>
>> Could this be an issue in the page_isolation code? Maybe a range
>> rounding error?
>>
>> Zi Yan, does this ring a bell for you?
>
> Since isolation code works on pageblocks, a scenario I can think of
> is that alloc_contig_range() is given a range starting from that tail
> pageblock.
>
> Hmm, I also notice that with your change, move_freepages_block() called
> by set_migratetype_isolate() might operate on a different isolation
> range than before.
> I wonder if reverting that behavior would fix the issue. Basically,
> do
>
> 	if (!zone_spans_pfn(zone, start))
> 		start = pfn;
>
> in prep_move_freepages_block(). Just a wild guess. Mike, do you mind
> giving it a try?
>
> Meanwhile, let me try to reproduce it and look into it deeper.
>
>>
>> I don't quite see how my patches could have caused this. But AFAICS we
>> also didn't have warnings for this scenario so it could be an old bug.
>>
>>>> Mike, could you describe the workload that is triggering this?
>>>
>>> This 'slightly different workload' is actually a slightly different
>>> environment.  Sorry for mis-speaking!  The slight difference is that this
>>> environment does not use the 'alloc hugetlb gigantic pages from CMA'
>>> (hugetlb_cma) feature that triggered the previous issue.
>>>
>>> This is still on a 16G VM.  Kernel command line here is:
>>> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
>>> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
>>> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
>>> hugetlb_free_vmemmap=on"
>>>
>>> The workload is just running this script:
>>> while true; do
>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>> done
>>>
>>>>
>>>> Does this reproduce instantly and reliably?
>>>>
>>>
>>> It is not 'instant' but will reproduce fairly reliably within a minute
>>> or so.
>>>
>>> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
>>> to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
>>> will eventually be freed via __free_pages(folio, 9).
>>
>> No luck reproducing this yet, but I have a question. In that crash
>> stack trace, the expand() is called via this:

I cannot reproduce it locally either. Do you mind sharing your config file?

Thanks.

--
Best Regards,
Yan, Zi
Mike Kravetz Sept. 19, 2023, 6:47 p.m. UTC | #14
On 09/19/23 02:49, Johannes Weiner wrote:
> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
> > On 09/18/23 10:52, Johannes Weiner wrote:
> > > On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> > > > On 9/16/23 21:57, Mike Kravetz wrote:
> > > > > On 09/15/23 10:16, Johannes Weiner wrote:
> > > > >> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> > > > > 
> > > > > With the patch below applied, a slightly different workload triggers the
> > > > > following warnings.  It seems related, and appears to go away when
> > > > > reverting the series.
> > > > > 
> > > > > [  331.595382] ------------[ cut here ]------------
> > > > > [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
> > > > > [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
> > > > 
> > > > Initially I thought this demonstrates the possible race I was suggesting in
> > > > reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
> > > > are trying to get a MOVABLE page from a CMA page block, which is something
> > > > that's normally done and the pageblock stays CMA. So yeah if the warnings
> > > > are to stay, they need to handle this case. Maybe the same can happen with
> > > > HIGHATOMIC blocks?
> 
> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
> show any CMA pages.
> 
> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
> and HIGHATOMIC.
> 
> > > This means we have an order-10 page where one half is MOVABLE and the
> > > other is CMA.
> 
> This means the scenario is different:
> 
> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
> that the first pageblock is indeed MOVABLE. During the expand, the
> second pageblock turns out to be of type MIGRATE_ISOLATE.
> 
> The page allocator wouldn't have merged those types. It triggers a bit
> too fast to be a race condition.
> 
> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
> while the head is on the list, and then stranded there.
> 
> Could this be an issue in the page_isolation code? Maybe a range
> rounding error?
> 
> Zi Yan, does this ring a bell for you?
> 
> I don't quite see how my patches could have caused this. But AFAICS we
> also didn't have warnings for this scenario so it could be an old bug.
> 
> > > Mike, could you describe the workload that is triggering this?
> > 
> > This 'slightly different workload' is actually a slightly different
> > environment.  Sorry for mis-speaking!  The slight difference is that this
> > environment does not use the 'alloc hugetlb gigantic pages from CMA'
> > (hugetlb_cma) feature that triggered the previous issue.
> > 
> > This is still on a 16G VM.  Kernel command line here is:
> > "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
> > root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
> > console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
> > hugetlb_free_vmemmap=on"
> > 
> > The workload is just running this script:
> > while true; do
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> > 
> > > 
> > > Does this reproduce instantly and reliably?
> > > 
> > 
> > It is not 'instant' but will reproduce fairly reliably within a minute
> > or so.
> > 
> > Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
> > to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
> > will eventually be freed via __free_pages(folio, 9).
> 
> No luck reproducing this yet, but I have a question. In that crash
> stack trace, the expand() is called via this:
> 
>  [  331.645847]  get_page_from_freelist+0x3ed/0x1040
>  [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
>  [  331.647977]  __alloc_pages+0xec/0x240
>  [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
>  [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
>  [  331.650938]  alloc_pool_huge_folio+0xad/0x110
>  [  331.651909]  set_max_huge_pages+0x17d/0x390
> 
> I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
> alloc_fresh_hugetlb_folio(), which has this:
> 
>         if (hstate_is_gigantic(h))
>                 folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
>         else
>                 folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
>                                 nid, nmask, node_alloc_noretry);
> 
> where gigantic is defined as the order exceeding MAX_ORDER, which
> should be the case for 1G pages on x86.
> 
> So the crashing stack must be from a 2M allocation, no? I'm confused
> how that could happen with the above test case.

Sorry for causing the confusion!

When I originally saw the warnings pop up, I was running the above script
as well as another that only allocated order 9 hugetlb pages:

while true; do
	echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
	echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
done

The warnings were actually triggered by allocations in this second script.

However, when reporting the warnings I wanted to include the simplest
way to recreate.  And, I noticed that that second script running in
parallel was not required.  Again, sorry for the confusion!  Here is a
warning triggered via the alloc_contig_range path only running the one
script.

[  107.275821] ------------[ cut here ]------------
[  107.277001] page type is 0, passed migratetype is 1 (nr=512)
[  107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
[  107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
[  107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
[  107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
[  107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
[  107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
[  107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
[  107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
[  107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
[  107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
[  107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
[  107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
[  107.311839] FS:  00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
[  107.314695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
[  107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  107.321575] Call Trace:
[  107.322314]  <TASK>
[  107.323002]  ? del_page_from_free_list+0x137/0x170
[  107.324380]  ? __warn+0x7d/0x130
[  107.325341]  ? del_page_from_free_list+0x137/0x170
[  107.326627]  ? report_bug+0x18d/0x1c0
[  107.327632]  ? prb_read_valid+0x17/0x20
[  107.328711]  ? handle_bug+0x41/0x70
[  107.329685]  ? exc_invalid_op+0x13/0x60
[  107.330787]  ? asm_exc_invalid_op+0x16/0x20
[  107.331937]  ? del_page_from_free_list+0x137/0x170
[  107.333189]  __free_one_page+0x2ab/0x6f0
[  107.334375]  free_pcppages_bulk+0x169/0x210
[  107.335575]  drain_pages_zone+0x3f/0x50
[  107.336691]  __drain_all_pages+0xe2/0x1e0
[  107.337843]  alloc_contig_range+0x143/0x280
[  107.339026]  alloc_contig_pages+0x210/0x270
[  107.340200]  alloc_fresh_hugetlb_folio+0xa6/0x270
[  107.341529]  alloc_pool_huge_page+0x7d/0x100
[  107.342745]  set_max_huge_pages+0x162/0x340
[  107.345059]  nr_hugepages_store_common+0x91/0xf0
[  107.346329]  kernfs_fop_write_iter+0x108/0x1f0
[  107.347547]  vfs_write+0x207/0x400
[  107.348543]  ksys_write+0x63/0xe0
[  107.349511]  do_syscall_64+0x37/0x90
[  107.350543]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[  107.351940] RIP: 0033:0x7fabb8daee87
[  107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
[  107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
[  107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
[  107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
[  107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
[  107.365968]  </TASK>
[  107.366534] ---[ end trace 0000000000000000 ]---
[  121.542474] ------------[ cut here ]------------

Perhaps that is another piece of information in that the warning can be
triggered via both allocation paths.

To be perfectly clear, here is what I did today:
- built next-20230919.  It does not contain your series
  	I could not recreate the issue.
- Added your series and the patch to remove
  VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk
	I could recreate the issue while running only the one script.
	The warning above is from that run.
- Added this suggested patch from Zi
	diff --git a/mm/page_alloc.c b/mm/page_alloc.c
	index 1400e674ab86..77a4aea31a7f 100644
	--- a/mm/page_alloc.c
	+++ b/mm/page_alloc.c
	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 		end = pageblock_end_pfn(pfn) - 1;
 
 		/* Do not cross zone boundaries */
	+#if 0
 		if (!zone_spans_pfn(zone, start))
			start = zone->zone_start_pfn;
	+#else
	+	if (!zone_spans_pfn(zone, start))
	+		start = pfn;
	+#endif
	 	if (!zone_spans_pfn(zone, end))
	 		return false;
	I can still trigger warnings.

One idea about recreating the issue is that it may have to do with size
of my VM (16G) and the requested allocation sizes 4G.  However, I tried
to really stress the allocations by increasing the number of hugetlb
pages requested and that did not help.  I also noticed that I only seem
to get two warnings and then they stop, even if I continue to run the
script.
 
Zi asked about my config, so it is attached.
Zi Yan Sept. 19, 2023, 8:57 p.m. UTC | #15
On 19 Sep 2023, at 14:47, Mike Kravetz wrote:

> On 09/19/23 02:49, Johannes Weiner wrote:
>> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
>>> On 09/18/23 10:52, Johannes Weiner wrote:
>>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
>>>>> On 9/16/23 21:57, Mike Kravetz wrote:
>>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
>>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>>>>>>
>>>>>> With the patch below applied, a slightly different workload triggers the
>>>>>> following warnings.  It seems related, and appears to go away when
>>>>>> reverting the series.
>>>>>>
>>>>>> [  331.595382] ------------[ cut here ]------------
>>>>>> [  331.596665] page type is 5, passed migratetype is 1 (nr=512)
>>>>>> [  331.598121] WARNING: CPU: 2 PID: 935 at mm/page_alloc.c:662 expand+0x1c9/0x200
>>>>>
>>>>> Initially I thought this demonstrates the possible race I was suggesting in
>>>>> reply to 6/6. But, assuming you have CONFIG_CMA, page type 5 is cma and we
>>>>> are trying to get a MOVABLE page from a CMA page block, which is something
>>>>> that's normally done and the pageblock stays CMA. So yeah if the warnings
>>>>> are to stay, they need to handle this case. Maybe the same can happen with
>>>>> HIGHATOMIC blocks?
>>
>> Ok, the CMA thing gave me pause because Mike's pagetypeinfo didn't
>> show any CMA pages.
>>
>> 5 is actually MIGRATE_ISOLATE - see the double use of 3 for PCPTYPES
>> and HIGHATOMIC.
>>
>>>> This means we have an order-10 page where one half is MOVABLE and the
>>>> other is CMA.
>>
>> This means the scenario is different:
>>
>> We get a MAX_ORDER page off the MOVABLE freelist. The removal checks
>> that the first pageblock is indeed MOVABLE. During the expand, the
>> second pageblock turns out to be of type MIGRATE_ISOLATE.
>>
>> The page allocator wouldn't have merged those types. It triggers a bit
>> too fast to be a race condition.
>>
>> It appears that MIGRATE_ISOLATE is simply set on the tail pageblock
>> while the head is on the list, and then stranded there.
>>
>> Could this be an issue in the page_isolation code? Maybe a range
>> rounding error?
>>
>> Zi Yan, does this ring a bell for you?
>>
>> I don't quite see how my patches could have caused this. But AFAICS we
>> also didn't have warnings for this scenario so it could be an old bug.
>>
>>>> Mike, could you describe the workload that is triggering this?
>>>
>>> This 'slightly different workload' is actually a slightly different
>>> environment.  Sorry for mis-speaking!  The slight difference is that this
>>> environment does not use the 'alloc hugetlb gigantic pages from CMA'
>>> (hugetlb_cma) feature that triggered the previous issue.
>>>
>>> This is still on a 16G VM.  Kernel command line here is:
>>> "BOOT_IMAGE=(hd0,msdos1)/vmlinuz-6.6.0-rc1-next-20230913+
>>> root=UUID=49c13301-2555-44dc-847b-caabe1d62bdf ro console=tty0
>>> console=ttyS0,115200 audit=0 selinux=0 transparent_hugepage=always
>>> hugetlb_free_vmemmap=on"
>>>
>>> The workload is just running this script:
>>> while true; do
>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>> done
>>>
>>>>
>>>> Does this reproduce instantly and reliably?
>>>>
>>>
>>> It is not 'instant' but will reproduce fairly reliably within a minute
>>> or so.
>>>
>>> Note that the 'echo 4 > .../hugepages-1048576kB/nr_hugepages' is going
>>> to end up calling alloc_contig_pages -> alloc_contig_range.  Those pages
>>> will eventually be freed via __free_pages(folio, 9).
>>
>> No luck reproducing this yet, but I have a question. In that crash
>> stack trace, the expand() is called via this:
>>
>>  [  331.645847]  get_page_from_freelist+0x3ed/0x1040
>>  [  331.646837]  ? prepare_alloc_pages.constprop.0+0x197/0x1b0
>>  [  331.647977]  __alloc_pages+0xec/0x240
>>  [  331.648783]  alloc_buddy_hugetlb_folio.isra.0+0x6a/0x150
>>  [  331.649912]  __alloc_fresh_hugetlb_folio+0x157/0x230
>>  [  331.650938]  alloc_pool_huge_folio+0xad/0x110
>>  [  331.651909]  set_max_huge_pages+0x17d/0x390
>>
>> I don't see an __alloc_fresh_hugetlb_folio() in my tree. Only
>> alloc_fresh_hugetlb_folio(), which has this:
>>
>>         if (hstate_is_gigantic(h))
>>                 folio = alloc_gigantic_folio(h, gfp_mask, nid, nmask);
>>         else
>>                 folio = alloc_buddy_hugetlb_folio(h, gfp_mask,
>>                                 nid, nmask, node_alloc_noretry);
>>
>> where gigantic is defined as the order exceeding MAX_ORDER, which
>> should be the case for 1G pages on x86.
>>
>> So the crashing stack must be from a 2M allocation, no? I'm confused
>> how that could happen with the above test case.
>
> Sorry for causing the confusion!
>
> When I originally saw the warnings pop up, I was running the above script
> as well as another that only allocated order 9 hugetlb pages:
>
> while true; do
> 	echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> 	echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> done
>
> The warnings were actually triggered by allocations in this second script.
>
> However, when reporting the warnings I wanted to include the simplest
> way to recreate.  And, I noticed that that second script running in
> parallel was not required.  Again, sorry for the confusion!  Here is a
> warning triggered via the alloc_contig_range path only running the one
> script.
>
> [  107.275821] ------------[ cut here ]------------
> [  107.277001] page type is 0, passed migratetype is 1 (nr=512)
> [  107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
> [  107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
> [  107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
> [  107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> [  107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
> [  107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
> [  107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
> [  107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
> [  107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
> [  107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
> [  107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
> [  107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
> [  107.311839] FS:  00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
> [  107.314695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
> [  107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  107.321575] Call Trace:
> [  107.322314]  <TASK>
> [  107.323002]  ? del_page_from_free_list+0x137/0x170
> [  107.324380]  ? __warn+0x7d/0x130
> [  107.325341]  ? del_page_from_free_list+0x137/0x170
> [  107.326627]  ? report_bug+0x18d/0x1c0
> [  107.327632]  ? prb_read_valid+0x17/0x20
> [  107.328711]  ? handle_bug+0x41/0x70
> [  107.329685]  ? exc_invalid_op+0x13/0x60
> [  107.330787]  ? asm_exc_invalid_op+0x16/0x20
> [  107.331937]  ? del_page_from_free_list+0x137/0x170
> [  107.333189]  __free_one_page+0x2ab/0x6f0
> [  107.334375]  free_pcppages_bulk+0x169/0x210
> [  107.335575]  drain_pages_zone+0x3f/0x50
> [  107.336691]  __drain_all_pages+0xe2/0x1e0
> [  107.337843]  alloc_contig_range+0x143/0x280
> [  107.339026]  alloc_contig_pages+0x210/0x270
> [  107.340200]  alloc_fresh_hugetlb_folio+0xa6/0x270
> [  107.341529]  alloc_pool_huge_page+0x7d/0x100
> [  107.342745]  set_max_huge_pages+0x162/0x340
> [  107.345059]  nr_hugepages_store_common+0x91/0xf0
> [  107.346329]  kernfs_fop_write_iter+0x108/0x1f0
> [  107.347547]  vfs_write+0x207/0x400
> [  107.348543]  ksys_write+0x63/0xe0
> [  107.349511]  do_syscall_64+0x37/0x90
> [  107.350543]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [  107.351940] RIP: 0033:0x7fabb8daee87
> [  107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> [  107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
> [  107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
> [  107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
> [  107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
> [  107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
> [  107.365968]  </TASK>
> [  107.366534] ---[ end trace 0000000000000000 ]---
> [  121.542474] ------------[ cut here ]------------
>
> Perhaps that is another piece of information in that the warning can be
> triggered via both allocation paths.
>
> To be perfectly clear, here is what I did today:
> - built next-20230919.  It does not contain your series
>   	I could not recreate the issue.
> - Added your series and the patch to remove
>   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk
> 	I could recreate the issue while running only the one script.
> 	The warning above is from that run.
> - Added this suggested patch from Zi
> 	diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> 	index 1400e674ab86..77a4aea31a7f 100644
> 	--- a/mm/page_alloc.c
> 	+++ b/mm/page_alloc.c
> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>  		end = pageblock_end_pfn(pfn) - 1;
>
>  		/* Do not cross zone boundaries */
> 	+#if 0
>  		if (!zone_spans_pfn(zone, start))
> 			start = zone->zone_start_pfn;
> 	+#else
> 	+	if (!zone_spans_pfn(zone, start))
> 	+		start = pfn;
> 	+#endif
> 	 	if (!zone_spans_pfn(zone, end))
> 	 		return false;
> 	I can still trigger warnings.

OK. One thing to note is that the page type in the warning changed from
5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.

>
> One idea about recreating the issue is that it may have to do with size
> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
> to really stress the allocations by increasing the number of hugetlb
> pages requested and that did not help.  I also noticed that I only seem
> to get two warnings and then they stop, even if I continue to run the
> script.
>
> Zi asked about my config, so it is attached.

With your config, I still have no luck reproducing the issue. I will keep
trying. Thanks.


--
Best Regards,
Yan, Zi
Mike Kravetz Sept. 20, 2023, 12:32 a.m. UTC | #16
On 09/19/23 16:57, Zi Yan wrote:
> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> 
> > On 09/19/23 02:49, Johannes Weiner wrote:
> >> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
> >>> On 09/18/23 10:52, Johannes Weiner wrote:
> >>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
> >>>>> On 9/16/23 21:57, Mike Kravetz wrote:
> >>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
> >>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
> >
> > Sorry for causing the confusion!
> >
> > When I originally saw the warnings pop up, I was running the above script
> > as well as another that only allocated order 9 hugetlb pages:
> >
> > while true; do
> > 	echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > 	echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > done
> >
> > The warnings were actually triggered by allocations in this second script.
> >
> > However, when reporting the warnings I wanted to include the simplest
> > way to recreate.  And, I noticed that that second script running in
> > parallel was not required.  Again, sorry for the confusion!  Here is a
> > warning triggered via the alloc_contig_range path only running the one
> > script.
> >
> > [  107.275821] ------------[ cut here ]------------
> > [  107.277001] page type is 0, passed migratetype is 1 (nr=512)
> > [  107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
> > [  107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
> > [  107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
> > [  107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> > [  107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
> > [  107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
> > [  107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
> > [  107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
> > [  107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
> > [  107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
> > [  107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
> > [  107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
> > [  107.311839] FS:  00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
> > [  107.314695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
> > [  107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [  107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [  107.321575] Call Trace:
> > [  107.322314]  <TASK>
> > [  107.323002]  ? del_page_from_free_list+0x137/0x170
> > [  107.324380]  ? __warn+0x7d/0x130
> > [  107.325341]  ? del_page_from_free_list+0x137/0x170
> > [  107.326627]  ? report_bug+0x18d/0x1c0
> > [  107.327632]  ? prb_read_valid+0x17/0x20
> > [  107.328711]  ? handle_bug+0x41/0x70
> > [  107.329685]  ? exc_invalid_op+0x13/0x60
> > [  107.330787]  ? asm_exc_invalid_op+0x16/0x20
> > [  107.331937]  ? del_page_from_free_list+0x137/0x170
> > [  107.333189]  __free_one_page+0x2ab/0x6f0
> > [  107.334375]  free_pcppages_bulk+0x169/0x210
> > [  107.335575]  drain_pages_zone+0x3f/0x50
> > [  107.336691]  __drain_all_pages+0xe2/0x1e0
> > [  107.337843]  alloc_contig_range+0x143/0x280
> > [  107.339026]  alloc_contig_pages+0x210/0x270
> > [  107.340200]  alloc_fresh_hugetlb_folio+0xa6/0x270
> > [  107.341529]  alloc_pool_huge_page+0x7d/0x100
> > [  107.342745]  set_max_huge_pages+0x162/0x340
> > [  107.345059]  nr_hugepages_store_common+0x91/0xf0
> > [  107.346329]  kernfs_fop_write_iter+0x108/0x1f0
> > [  107.347547]  vfs_write+0x207/0x400
> > [  107.348543]  ksys_write+0x63/0xe0
> > [  107.349511]  do_syscall_64+0x37/0x90
> > [  107.350543]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > [  107.351940] RIP: 0033:0x7fabb8daee87
> > [  107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> > [  107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > [  107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
> > [  107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
> > [  107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
> > [  107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
> > [  107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
> > [  107.365968]  </TASK>
> > [  107.366534] ---[ end trace 0000000000000000 ]---
> > [  121.542474] ------------[ cut here ]------------
> >
> > Perhaps that is another piece of information in that the warning can be
> > triggered via both allocation paths.
> >
> > To be perfectly clear, here is what I did today:
> > - built next-20230919.  It does not contain your series
> >   	I could not recreate the issue.
> > - Added your series and the patch to remove
> >   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk
> > 	I could recreate the issue while running only the one script.
> > 	The warning above is from that run.
> > - Added this suggested patch from Zi
> > 	diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > 	index 1400e674ab86..77a4aea31a7f 100644
> > 	--- a/mm/page_alloc.c
> > 	+++ b/mm/page_alloc.c
> > 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> >  		end = pageblock_end_pfn(pfn) - 1;
> >
> >  		/* Do not cross zone boundaries */
> > 	+#if 0
> >  		if (!zone_spans_pfn(zone, start))
> > 			start = zone->zone_start_pfn;
> > 	+#else
> > 	+	if (!zone_spans_pfn(zone, start))
> > 	+		start = pfn;
> > 	+#endif
> > 	 	if (!zone_spans_pfn(zone, end))
> > 	 		return false;
> > 	I can still trigger warnings.
> 
> OK. One thing to note is that the page type in the warning changed from
> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> 

Just to be really clear,
- the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
- the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
  path WITHOUT your change.

I am guessing the difference here has more to do with the allocation path?

I went back and reran focusing on the specific migrate type.
Without your patch, and coming from the alloc_contig_range call path,
I got two warnings of 'page type is 0, passed migratetype is 1' as above.
With your patch I got one 'page type is 0, passed migratetype is 1'
warning and one 'page type is 1, passed migratetype is 0' warning.

I could be wrong, but I do not think your patch changes things.

> >
> > One idea about recreating the issue is that it may have to do with size
> > of my VM (16G) and the requested allocation sizes 4G.  However, I tried
> > to really stress the allocations by increasing the number of hugetlb
> > pages requested and that did not help.  I also noticed that I only seem
> > to get two warnings and then they stop, even if I continue to run the
> > script.
> >
> > Zi asked about my config, so it is attached.
> 
> With your config, I still have no luck reproducing the issue. I will keep
> trying. Thanks.
> 

Perhaps try running both scripts in parallel?
Adjust the number of hugetlb pages allocated to equal 25% of memory?
Zi Yan Sept. 20, 2023, 1:38 a.m. UTC | #17
On 19 Sep 2023, at 20:32, Mike Kravetz wrote:

> On 09/19/23 16:57, Zi Yan wrote:
>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>
>>> On 09/19/23 02:49, Johannes Weiner wrote:
>>>> On Mon, Sep 18, 2023 at 10:40:37AM -0700, Mike Kravetz wrote:
>>>>> On 09/18/23 10:52, Johannes Weiner wrote:
>>>>>> On Mon, Sep 18, 2023 at 09:16:58AM +0200, Vlastimil Babka wrote:
>>>>>>> On 9/16/23 21:57, Mike Kravetz wrote:
>>>>>>>> On 09/15/23 10:16, Johannes Weiner wrote:
>>>>>>>>> On Thu, Sep 14, 2023 at 04:52:38PM -0700, Mike Kravetz wrote:
>>>
>>> Sorry for causing the confusion!
>>>
>>> When I originally saw the warnings pop up, I was running the above script
>>> as well as another that only allocated order 9 hugetlb pages:
>>>
>>> while true; do
>>> 	echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>> 	echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>> done
>>>
>>> The warnings were actually triggered by allocations in this second script.
>>>
>>> However, when reporting the warnings I wanted to include the simplest
>>> way to recreate.  And, I noticed that that second script running in
>>> parallel was not required.  Again, sorry for the confusion!  Here is a
>>> warning triggered via the alloc_contig_range path only running the one
>>> script.
>>>
>>> [  107.275821] ------------[ cut here ]------------
>>> [  107.277001] page type is 0, passed migratetype is 1 (nr=512)
>>> [  107.278379] WARNING: CPU: 1 PID: 886 at mm/page_alloc.c:699 del_page_from_free_list+0x137/0x170
>>> [  107.280514] Modules linked in: rfkill ip6table_filter ip6_tables sunrpc snd_hda_codec_generic joydev 9p snd_hda_intel netfs snd_intel_dspcfg snd_hda_codec snd_hwdep 9pnet_virtio snd_hda_core snd_seq snd_seq_device 9pnet virtio_balloon snd_pcm snd_timer snd soundcore virtio_net net_failover failover virtio_console virtio_blk crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw virtio_pci virtio virtio_pci_legacy_dev virtio_pci_modern_dev virtio_ring fuse
>>> [  107.291033] CPU: 1 PID: 886 Comm: bash Not tainted 6.6.0-rc2-next-20230919-dirty #35
>>> [  107.293000] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
>>> [  107.295187] RIP: 0010:del_page_from_free_list+0x137/0x170
>>> [  107.296618] Code: c6 05 20 9b 35 01 01 e8 b7 fb ff ff 44 89 f1 44 89 e2 48 c7 c7 d8 ab 22 82 48 89 c6 b8 01 00 00 00 d3 e0 89 c1 e8 e9 99 df ff <0f> 0b e9 03 ff ff ff 48 c7 c6 10 ac 22 82 48 89 df e8 f3 e0 fc ff
>>> [  107.301236] RSP: 0018:ffffc90003ba7a70 EFLAGS: 00010086
>>> [  107.302535] RAX: 0000000000000000 RBX: ffffea0007ff8000 RCX: 0000000000000000
>>> [  107.304467] RDX: 0000000000000004 RSI: ffffffff8224e9de RDI: 00000000ffffffff
>>> [  107.306289] RBP: 00000000001ffe00 R08: 0000000000009ffb R09: 00000000ffffdfff
>>> [  107.308135] R10: 00000000ffffdfff R11: ffffffff824660e0 R12: 0000000000000001
>>> [  107.309956] R13: ffff88827fffcd80 R14: 0000000000000009 R15: 00000000001ffc00
>>> [  107.311839] FS:  00007fabb8cba740(0000) GS:ffff888277d00000(0000) knlGS:0000000000000000
>>> [  107.314695] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  107.316159] CR2: 00007f41ba01acf0 CR3: 0000000282ed4006 CR4: 0000000000370ee0
>>> [  107.317971] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [  107.319783] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [  107.321575] Call Trace:
>>> [  107.322314]  <TASK>
>>> [  107.323002]  ? del_page_from_free_list+0x137/0x170
>>> [  107.324380]  ? __warn+0x7d/0x130
>>> [  107.325341]  ? del_page_from_free_list+0x137/0x170
>>> [  107.326627]  ? report_bug+0x18d/0x1c0
>>> [  107.327632]  ? prb_read_valid+0x17/0x20
>>> [  107.328711]  ? handle_bug+0x41/0x70
>>> [  107.329685]  ? exc_invalid_op+0x13/0x60
>>> [  107.330787]  ? asm_exc_invalid_op+0x16/0x20
>>> [  107.331937]  ? del_page_from_free_list+0x137/0x170
>>> [  107.333189]  __free_one_page+0x2ab/0x6f0
>>> [  107.334375]  free_pcppages_bulk+0x169/0x210
>>> [  107.335575]  drain_pages_zone+0x3f/0x50
>>> [  107.336691]  __drain_all_pages+0xe2/0x1e0
>>> [  107.337843]  alloc_contig_range+0x143/0x280
>>> [  107.339026]  alloc_contig_pages+0x210/0x270
>>> [  107.340200]  alloc_fresh_hugetlb_folio+0xa6/0x270
>>> [  107.341529]  alloc_pool_huge_page+0x7d/0x100
>>> [  107.342745]  set_max_huge_pages+0x162/0x340
>>> [  107.345059]  nr_hugepages_store_common+0x91/0xf0
>>> [  107.346329]  kernfs_fop_write_iter+0x108/0x1f0
>>> [  107.347547]  vfs_write+0x207/0x400
>>> [  107.348543]  ksys_write+0x63/0xe0
>>> [  107.349511]  do_syscall_64+0x37/0x90
>>> [  107.350543]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>>> [  107.351940] RIP: 0033:0x7fabb8daee87
>>> [  107.352819] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>> [  107.356373] RSP: 002b:00007ffc02737478 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [  107.358103] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fabb8daee87
>>> [  107.359695] RDX: 0000000000000002 RSI: 000055fe584a1620 RDI: 0000000000000001
>>> [  107.361258] RBP: 000055fe584a1620 R08: 000000000000000a R09: 00007fabb8e460c0
>>> [  107.362842] R10: 00007fabb8e45fc0 R11: 0000000000000246 R12: 0000000000000002
>>> [  107.364385] R13: 00007fabb8e82520 R14: 0000000000000002 R15: 00007fabb8e82720
>>> [  107.365968]  </TASK>
>>> [  107.366534] ---[ end trace 0000000000000000 ]---
>>> [  121.542474] ------------[ cut here ]------------
>>>
>>> Perhaps that is another piece of information in that the warning can be
>>> triggered via both allocation paths.
>>>
>>> To be perfectly clear, here is what I did today:
>>> - built next-20230919.  It does not contain your series
>>>   	I could not recreate the issue.
>>> - Added your series and the patch to remove
>>>   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page) from free_pcppages_bulk
>>> 	I could recreate the issue while running only the one script.
>>> 	The warning above is from that run.
>>> - Added this suggested patch from Zi
>>> 	diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> 	index 1400e674ab86..77a4aea31a7f 100644
>>> 	--- a/mm/page_alloc.c
>>> 	+++ b/mm/page_alloc.c
>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>  		end = pageblock_end_pfn(pfn) - 1;
>>>
>>>  		/* Do not cross zone boundaries */
>>> 	+#if 0
>>>  		if (!zone_spans_pfn(zone, start))
>>> 			start = zone->zone_start_pfn;
>>> 	+#else
>>> 	+	if (!zone_spans_pfn(zone, start))
>>> 	+		start = pfn;
>>> 	+#endif
>>> 	 	if (!zone_spans_pfn(zone, end))
>>> 	 		return false;
>>> 	I can still trigger warnings.
>>
>> OK. One thing to note is that the page type in the warning changed from
>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>
>
> Just to be really clear,
> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>   path WITHOUT your change.
>
> I am guessing the difference here has more to do with the allocation path?
>
> I went back and reran focusing on the specific migrate type.
> Without your patch, and coming from the alloc_contig_range call path,
> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> With your patch I got one 'page type is 0, passed migratetype is 1'
> warning and one 'page type is 1, passed migratetype is 0' warning.
>
> I could be wrong, but I do not think your patch changes things.

Got it. Thanks for the clarification.
>
>>>
>>> One idea about recreating the issue is that it may have to do with size
>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>> to really stress the allocations by increasing the number of hugetlb
>>> pages requested and that did not help.  I also noticed that I only seem
>>> to get two warnings and then they stop, even if I continue to run the
>>> script.
>>>
>>> Zi asked about my config, so it is attached.
>>
>> With your config, I still have no luck reproducing the issue. I will keep
>> trying. Thanks.
>>
>
> Perhaps try running both scripts in parallel?

Yes. It seems to do the trick.

> Adjust the number of hugetlb pages allocated to equal 25% of memory?

I am able to reproduce it with the script below:

while true; do
 echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
 echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
 wait
 echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
 echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
done

I will look into the issue.

--
Best Regards,
Yan, Zi
Vlastimil Babka Sept. 20, 2023, 6:07 a.m. UTC | #18
On 9/20/23 03:38, Zi Yan wrote:
> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> 
>> On 09/19/23 16:57, Zi Yan wrote:
>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>
>>>> 	--- a/mm/page_alloc.c
>>>> 	+++ b/mm/page_alloc.c
>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>  		end = pageblock_end_pfn(pfn) - 1;
>>>>
>>>>  		/* Do not cross zone boundaries */
>>>> 	+#if 0
>>>>  		if (!zone_spans_pfn(zone, start))
>>>> 			start = zone->zone_start_pfn;
>>>> 	+#else
>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>> 	+		start = pfn;
>>>> 	+#endif
>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>> 	 		return false;
>>>> 	I can still trigger warnings.
>>>
>>> OK. One thing to note is that the page type in the warning changed from
>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>
>>
>> Just to be really clear,
>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>   path WITHOUT your change.
>>
>> I am guessing the difference here has more to do with the allocation path?
>>
>> I went back and reran focusing on the specific migrate type.
>> Without your patch, and coming from the alloc_contig_range call path,
>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>> With your patch I got one 'page type is 0, passed migratetype is 1'
>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>
>> I could be wrong, but I do not think your patch changes things.
> 
> Got it. Thanks for the clarification.
>>
>>>>
>>>> One idea about recreating the issue is that it may have to do with size
>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>> to really stress the allocations by increasing the number of hugetlb
>>>> pages requested and that did not help.  I also noticed that I only seem
>>>> to get two warnings and then they stop, even if I continue to run the
>>>> script.
>>>>
>>>> Zi asked about my config, so it is attached.
>>>
>>> With your config, I still have no luck reproducing the issue. I will keep
>>> trying. Thanks.
>>>
>>
>> Perhaps try running both scripts in parallel?
> 
> Yes. It seems to do the trick.
> 
>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> 
> I am able to reproduce it with the script below:
> 
> while true; do
>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>  wait
>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> done
> 
> I will look into the issue.

With migratetypes 0 and 1, and a somewhat harder to reproduce scenario (= less
deterministic, more racy), it's possible we now see what I suspected can
happen here:
https://lore.kernel.org/all/37dbd4d0-c125-6694-dec4-6322ae5b6dee@suse.cz/
Namely, that there are places reading the migratetype outside of the zone lock.
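
Concretely, the interleaving I suspect would be something like this (rough
sketch only; free_unref_page() is just one example of a path that samples
the pageblock type before taking the lock):

#0 (freeing)                                 #1 (stealing, zone->lock held)
free_unref_page()
  get_pfnblock_migratetype() == MOVABLE      // no zone->lock taken yet
                                             steal_suitable_fallback()
                                               move_freepages_block()
                                               set_pageblock_migratetype(UNMOVABLE)
  pcp_spin_trylock() fails
  free_one_page(..., MOVABLE)
    spin_lock(&zone->lock)
    __free_one_page(..., MOVABLE)            // stale type: a page of the now
                                             // UNMOVABLE block goes onto the
                                             // MOVABLE freelist
    spin_unlock(&zone->lock)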
Johannes Weiner Sept. 20, 2023, 1:48 p.m. UTC | #19
On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
> On 9/20/23 03:38, Zi Yan wrote:
> > On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> > 
> >> On 09/19/23 16:57, Zi Yan wrote:
> >>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> >>>
> >>>> 	--- a/mm/page_alloc.c
> >>>> 	+++ b/mm/page_alloc.c
> >>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> >>>>  		end = pageblock_end_pfn(pfn) - 1;
> >>>>
> >>>>  		/* Do not cross zone boundaries */
> >>>> 	+#if 0
> >>>>  		if (!zone_spans_pfn(zone, start))
> >>>> 			start = zone->zone_start_pfn;
> >>>> 	+#else
> >>>> 	+	if (!zone_spans_pfn(zone, start))
> >>>> 	+		start = pfn;
> >>>> 	+#endif
> >>>> 	 	if (!zone_spans_pfn(zone, end))
> >>>> 	 		return false;
> >>>> 	I can still trigger warnings.
> >>>
> >>> OK. One thing to note is that the page type in the warning changed from
> >>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> >>>
> >>
> >> Just to be really clear,
> >> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> >> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
> >>   path WITHOUT your change.
> >>
> >> I am guessing the difference here has more to do with the allocation path?
> >>
> >> I went back and reran focusing on the specific migrate type.
> >> Without your patch, and coming from the alloc_contig_range call path,
> >> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> >> With your patch I got one 'page type is 0, passed migratetype is 1'
> >> warning and one 'page type is 1, passed migratetype is 0' warning.
> >>
> >> I could be wrong, but I do not think your patch changes things.
> > 
> > Got it. Thanks for the clarification.
> >>
> >>>>
> >>>> One idea about recreating the issue is that it may have to do with size
> >>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
> >>>> to really stress the allocations by increasing the number of hugetlb
> >>>> pages requested and that did not help.  I also noticed that I only seem
> >>>> to get two warnings and then they stop, even if I continue to run the
> >>>> script.
> >>>>
> >>>> Zi asked about my config, so it is attached.
> >>>
> >>> With your config, I still have no luck reproducing the issue. I will keep
> >>> trying. Thanks.
> >>>
> >>
> >> Perhaps try running both scripts in parallel?
> > 
> > Yes. It seems to do the trick.
> > 
> >> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> > 
> > I am able to reproduce it with the script below:
> > 
> > while true; do
> >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
> >  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
> >  wait
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > done
> > 
> > I will look into the issue.

Nice!

I managed to reproduce it ONCE, triggering it not even a second after
starting the script. But I can't seem to do it twice, even after
several reboots and letting it run for minutes.

> With migratetypes 0 and 1 and somewhat harder to reproduce scenario (= less
> deterministic, more racy) it's possible we now see what I suspected can
> happen here:
> https://lore.kernel.org/all/37dbd4d0-c125-6694-dec4-6322ae5b6dee@suse.cz/
> In that there are places reading the migratetype outside of zone lock.

Good point!

I had already written up a fix for this issue. Still trying to get the
reproducer to work, but attaching the fix below in case somebody with
a working environment beats me to it.

---

From 94f67bfa29a602a66014d079431b224cacbf79e9 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Fri, 15 Sep 2023 16:23:38 -0400
Subject: [PATCH] mm: page_alloc: close migratetype race between freeing and
 stealing

There are several freeing paths that read the page's migratetype
optimistically before grabbing the zone lock. When this races with
block stealing, those pages go on the wrong freelist.

The paths in question are:
- when freeing >costly orders that aren't THP
- when freeing pages to the buddy upon pcp lock contention
- when freeing pages that are isolated
- when freeing pages initially during boot
- when freeing the remainder in alloc_pages_exact()
- when "accepting" unaccepted VM host memory before first use
- when freeing pages during unpoisoning

None of these are so hot that they would need this optimization at the
cost of hampering defrag efforts. Especially when contrasted with the
fact that the most common buddy freeing path - free_pcppages_bulk - is
checking the migratetype under the zone->lock just fine.

In addition, isolated pages need to look up the migratetype under the
lock anyway, which adds branches to the locked section, and results in
a double lookup when the pages are in fact isolated.

Move the lookups into the lock.

Reported-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/page_alloc.c | 47 +++++++++++++++++------------------------------
 1 file changed, 17 insertions(+), 30 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0ca999d24a00..d902a8aaa3fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1222,18 +1222,15 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
-static void free_one_page(struct zone *zone,
-				struct page *page, unsigned long pfn,
-				unsigned int order,
-				int migratetype, fpi_t fpi_flags)
+static void free_one_page(struct zone *zone, struct page *page,
+			  unsigned long pfn, unsigned int order,
+			  fpi_t fpi_flags)
 {
 	unsigned long flags;
+	int migratetype;
 
 	spin_lock_irqsave(&zone->lock, flags);
-	if (unlikely(has_isolate_pageblock(zone) ||
-		is_migrate_isolate(migratetype))) {
-		migratetype = get_pfnblock_migratetype(page, pfn);
-	}
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -1249,18 +1246,8 @@ static void __free_pages_ok(struct page *page, unsigned int order,
 	if (!free_pages_prepare(page, order, fpi_flags))
 		return;
 
-	/*
-	 * Calling get_pfnblock_migratetype() without spin_lock_irqsave() here
-	 * is used to avoid calling get_pfnblock_migratetype() under the lock.
-	 * This will reduce the lock holding time.
-	 */
-	migratetype = get_pfnblock_migratetype(page, pfn);
-
 	spin_lock_irqsave(&zone->lock, flags);
-	if (unlikely(has_isolate_pageblock(zone) ||
-		is_migrate_isolate(migratetype))) {
-		migratetype = get_pfnblock_migratetype(page, pfn);
-	}
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	__free_one_page(page, pfn, zone, order, migratetype, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
@@ -2404,7 +2391,7 @@ void free_unref_page(struct page *page, unsigned int order)
 	struct per_cpu_pages *pcp;
 	struct zone *zone;
 	unsigned long pfn = page_to_pfn(page);
-	int migratetype, pcpmigratetype;
+	int migratetype;
 
 	if (!free_pages_prepare(page, order, FPI_NONE))
 		return;
@@ -2416,23 +2403,23 @@ void free_unref_page(struct page *page, unsigned int order)
 	 * get those areas back if necessary. Otherwise, we may have to free
 	 * excessively into the page allocator
 	 */
-	migratetype = pcpmigratetype = get_pfnblock_migratetype(page, pfn);
+	migratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(page_zone(page), page, pfn, order, migratetype, FPI_NONE);
+			free_one_page(page_zone(page), page, pfn, order, FPI_NONE);
 			return;
 		}
-		pcpmigratetype = MIGRATE_MOVABLE;
+		migratetype = MIGRATE_MOVABLE;
 	}
 
 	zone = page_zone(page);
 	pcp_trylock_prepare(UP_flags);
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
-		free_unref_page_commit(zone, pcp, page, pcpmigratetype, order);
+		free_unref_page_commit(zone, pcp, page, migratetype, order);
 		pcp_spin_unlock(pcp);
 	} else {
-		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
+		free_one_page(zone, page, pfn, order, FPI_NONE);
 	}
 	pcp_trylock_finish(UP_flags);
 }
@@ -2465,7 +2452,7 @@ void free_unref_page_list(struct list_head *list)
 		migratetype = get_pfnblock_migratetype(page, pfn);
 		if (unlikely(is_migrate_isolate(migratetype))) {
 			list_del(&page->lru);
-			free_one_page(page_zone(page), page, pfn, 0, migratetype, FPI_NONE);
+			free_one_page(page_zone(page), page, pfn, 0, FPI_NONE);
 			continue;
 		}
 	}
@@ -2498,8 +2485,7 @@ void free_unref_page_list(struct list_head *list)
 			pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
-				free_one_page(zone, page, pfn,
-					      0, migratetype, FPI_NONE);
+				free_one_page(zone, page, pfn, 0, FPI_NONE);
 				locked_zone = NULL;
 				continue;
 			}
@@ -6537,13 +6523,14 @@ bool take_page_off_buddy(struct page *page)
 bool put_page_back_buddy(struct page *page)
 {
 	struct zone *zone = page_zone(page);
-	unsigned long pfn = page_to_pfn(page);
 	unsigned long flags;
-	int migratetype = get_pfnblock_migratetype(page, pfn);
 	bool ret = false;
 
 	spin_lock_irqsave(&zone->lock, flags);
 	if (put_page_testzero(page)) {
+		unsigned long pfn = page_to_pfn(page);
+		int migratetype = get_pfnblock_migratetype(page, pfn);
+
 		ClearPageHWPoisonTakenOff(page);
 		__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
 		if (TestClearPageHWPoison(page)) {
Johannes Weiner Sept. 20, 2023, 4:04 p.m. UTC | #20
On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
> > On 9/20/23 03:38, Zi Yan wrote:
> > > On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> > > 
> > >> On 09/19/23 16:57, Zi Yan wrote:
> > >>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> > >>>
> > >>>> 	--- a/mm/page_alloc.c
> > >>>> 	+++ b/mm/page_alloc.c
> > >>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> > >>>>  		end = pageblock_end_pfn(pfn) - 1;
> > >>>>
> > >>>>  		/* Do not cross zone boundaries */
> > >>>> 	+#if 0
> > >>>>  		if (!zone_spans_pfn(zone, start))
> > >>>> 			start = zone->zone_start_pfn;
> > >>>> 	+#else
> > >>>> 	+	if (!zone_spans_pfn(zone, start))
> > >>>> 	+		start = pfn;
> > >>>> 	+#endif
> > >>>> 	 	if (!zone_spans_pfn(zone, end))
> > >>>> 	 		return false;
> > >>>> 	I can still trigger warnings.
> > >>>
> > >>> OK. One thing to note is that the page type in the warning changed from
> > >>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> > >>>
> > >>
> > >> Just to be really clear,
> > >> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> > >> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
> > >>   path WITHOUT your change.
> > >>
> > >> I am guessing the difference here has more to do with the allocation path?
> > >>
> > >> I went back and reran focusing on the specific migrate type.
> > >> Without your patch, and coming from the alloc_contig_range call path,
> > >> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> > >> With your patch I got one 'page type is 0, passed migratetype is 1'
> > >> warning and one 'page type is 1, passed migratetype is 0' warning.
> > >>
> > >> I could be wrong, but I do not think your patch changes things.
> > > 
> > > Got it. Thanks for the clarification.
> > >>
> > >>>>
> > >>>> One idea about recreating the issue is that it may have to do with size
> > >>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
> > >>>> to really stress the allocations by increasing the number of hugetlb
> > >>>> pages requested and that did not help.  I also noticed that I only seem
> > >>>> to get two warnings and then they stop, even if I continue to run the
> > >>>> script.
> > >>>>
> > >>>> Zi asked about my config, so it is attached.
> > >>>
> > >>> With your config, I still have no luck reproducing the issue. I will keep
> > >>> trying. Thanks.
> > >>>
> > >>
> > >> Perhaps try running both scripts in parallel?
> > > 
> > > Yes. It seems to do the trick.
> > > 
> > >> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> > > 
> > > I am able to reproduce it with the script below:
> > > 
> > > while true; do
> > >  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
> > >  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
> > >  wait
> > >  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> > >  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > > done
> > > 
> > > I will look into the issue.
> 
> Nice!
> 
> I managed to reproduce it ONCE, triggering it not even a second after
> starting the script. But I can't seem to do it twice, even after
> several reboots and letting it run for minutes.

I managed to reproduce it reliably by cutting the nr_hugepages
parameters respectively in half.

The one that triggers for me is always MIGRATE_ISOLATE. With some
printk-tracing, the scenario seems to be this:

#0                                                   #1
start_isolate_page_range()
  isolate_single_pageblock()
    set_migratetype_isolate(tail)
      lock zone->lock
      move_freepages_block(tail) // nop
      set_pageblock_migratetype(tail)
      unlock zone->lock
                                                     del_page_from_freelist(head)
                                                     expand(head, head_mt)
                                                       WARN(head_mt != tail_mt)
    start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
    for (pfn = start_pfn, pfn < end_pfn)
      if (PageBuddy())
        split_free_page(head)

IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
lock. The move_freepages_block() does nothing because the PageBuddy()
is set on the pageblock to the left. Once we drop the lock, the buddy
gets allocated and the expand() puts things on the wrong list. The
splitting code that handles MAX_ORDER blocks runs *after* the tail
type is set and the lock has been dropped, so it's too late.

I think this would work fine if we always set MIGRATE_ISOLATE in a
linear fashion, with start and end aligned to MAX_ORDER. Then we also
wouldn't have to split things.

There are two reasons this doesn't happen today:

1. The isolation range is rounded to pageblocks, not MAX_ORDER. In
   this test case they always seem aligned, but it's not
   guaranteed. However,

2. start_isolate_page_range() explicitly breaks ordering by doing the
   last block in the range before the center. It's that last block
   that triggers the race with __rmqueue_smallest -> expand() for me.
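
To make the alignment point in (1) concrete, here's a tiny standalone
illustration (not kernel code; it assumes the common x86-64 geometry of
pageblock_order = 9 and MAX_ORDER = 10, i.e. 2M pageblocks inside 4M
max-order blocks):

	#include <stdio.h>

	#define PAGEBLOCK_NR_PAGES	(1UL << 9)	/* 512 pages, 2M */
	#define MAX_ORDER_NR_PAGES	(1UL << 10)	/* 1024 pages, 4M */
	#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

	int main(void)
	{
		/* pageblock-aligned, but an odd multiple of 512 */
		unsigned long tail = 1536;

		printf("pageblock start: %lu\n", ALIGN_DOWN(tail, PAGEBLOCK_NR_PAGES)); /* 1536 */
		printf("MAX_ORDER start: %lu\n", ALIGN_DOWN(tail, MAX_ORDER_NR_PAGES)); /* 1024 */
		/*
		 * A fully merged buddy covering pfn 1536 spans [1024, 2048):
		 * its PageBuddy head sits at 1024, in the pageblock to the
		 * left of the one being isolated, which is why
		 * move_freepages_block() on the tail block finds nothing.
		 */
		return 0;
	}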

With the below patch I can no longer reproduce the issue:

---

diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b5c7a9d21257..b7c8730bf0e2 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -538,8 +538,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	unsigned long pfn;
 	struct page *page;
 	/* isolation is done at page block granularity */
-	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
-	unsigned long isolate_end = pageblock_align(end_pfn);
+	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
+	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
 	int ret;
 	bool skip_isolation = false;
 
@@ -549,17 +549,6 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 	if (ret)
 		return ret;
 
-	if (isolate_start == isolate_end - pageblock_nr_pages)
-		skip_isolation = true;
-
-	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
-	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
-			skip_isolation, migratetype);
-	if (ret) {
-		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
-		return ret;
-	}
-
 	/* skip isolated pageblocks at the beginning and end */
 	for (pfn = isolate_start + pageblock_nr_pages;
 	     pfn < isolate_end - pageblock_nr_pages;
@@ -568,12 +557,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 		if (page && set_migratetype_isolate(page, migratetype, flags,
 					start_pfn, end_pfn)) {
 			undo_isolate_page_range(isolate_start, pfn, migratetype);
-			unset_migratetype_isolate(
-				pfn_to_page(isolate_end - pageblock_nr_pages),
-				migratetype);
 			return -EBUSY;
 		}
 	}
+
+	if (isolate_start == isolate_end - pageblock_nr_pages)
+		skip_isolation = true;
+
+	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
+	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
+			skip_isolation, migratetype);
+	if (ret) {
+		undo_isolate_page_range(isolate_start, pfn, migratetype);
+		return ret;
+	}
+
 	return 0;
 }
 
@@ -591,8 +589,8 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 {
 	unsigned long pfn;
 	struct page *page;
-	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
-	unsigned long isolate_end = pageblock_align(end_pfn);
+	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
+	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
 
 	for (pfn = isolate_start;
 	     pfn < isolate_end;
Zi Yan Sept. 20, 2023, 5:23 p.m. UTC | #21
On 20 Sep 2023, at 12:04, Johannes Weiner wrote:

> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>> On 9/20/23 03:38, Zi Yan wrote:
>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>
>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>
>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>  		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>
>>>>>>>  		/* Do not cross zone boundaries */
>>>>>>> 	+#if 0
>>>>>>>  		if (!zone_spans_pfn(zone, start))
>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>> 	+#else
>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>> 	+		start = pfn;
>>>>>>> 	+#endif
>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>> 	 		return false;
>>>>>>> 	I can still trigger warnings.
>>>>>>
>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>
>>>>>
>>>>> Just to be really clear,
>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>   path WITHOUT your change.
>>>>>
>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>
>>>>> I went back and reran focusing on the specific migrate type.
>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>
>>>>> I could be wrong, but I do not think your patch changes things.
>>>>
>>>> Got it. Thanks for the clarification.
>>>>>
>>>>>>>
>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>> script.
>>>>>>>
>>>>>>> Zi asked about my config, so it is attached.
>>>>>>
>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>> trying. Thanks.
>>>>>>
>>>>>
>>>>> Perhaps try running both scripts in parallel?
>>>>
>>>> Yes. It seems to do the trick.
>>>>
>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>
>>>> I am able to reproduce it with the script below:
>>>>
>>>> while true; do
>>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>  wait
>>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>> done
>>>>
>>>> I will look into the issue.
>>
>> Nice!
>>
>> I managed to reproduce it ONCE, triggering it not even a second after
>> starting the script. But I can't seem to do it twice, even after
>> several reboots and letting it run for minutes.
>
> I managed to reproduce it reliably by cutting the nr_hugepages
> parameters respectively in half.
>
> The one that triggers for me is always MIGRATE_ISOLATE. With some
> printk-tracing, the scenario seems to be this:
>
> #0                                                   #1
> start_isolate_page_range()
>   isolate_single_pageblock()
>     set_migratetype_isolate(tail)
>       lock zone->lock
>       move_freepages_block(tail) // nop
>       set_pageblock_migratetype(tail)
>       unlock zone->lock
>                                                      del_page_from_freelist(head)
>                                                      expand(head, head_mt)
>                                                        WARN(head_mt != tail_mt)
>     start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>     for (pfn = start_pfn, pfn < end_pfn)
>       if (PageBuddy())
>         split_free_page(head)
>
> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
> lock. The move_freepages_block() does nothing because the PageBuddy()
> is set on the pageblock to the left. Once we drop the lock, the buddy
> gets allocated and the expand() puts things on the wrong list. The
> splitting code that handles MAX_ORDER blocks runs *after* the tail
> type is set and the lock has been dropped, so it's too late.

Yes, this is the issue I can confirm as well. But it is intentional, to enable
allocating a contiguous range at pageblock granularity instead of MAX_ORDER
granularity. With your changes below, that no longer works, because if there
is an unmovable page in
[ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
the allocation fails, whereas it would succeed with the current implementation.

I think a proper fix would be to make move_freepages_block() split the
MAX_ORDER page and put the split pages in the right migratetype free lists.

I am working on that.
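
For example, assuming 2M pageblocks (order 9) inside 4M MAX_ORDER blocks
(order 10): a request whose isolation range currently starts at pfn 1536
would, with MAX_ORDER rounding, start at pfn 1024 instead, and a single
unmovable page anywhere in the extra [1024, 1536) range would now fail an
allocation that succeeds today.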

>
> I think this would work fine if we always set MIGRATE_ISOLATE in a
> linear fashion, with start and end aligned to MAX_ORDER. Then we also
> wouldn't have to split things.
>
> There are two reasons this doesn't happen today:
>
> 1. The isolation range is rounded to pageblocks, not MAX_ORDER. In
>    this test case they always seem aligned, but it's not
>    guaranteed. However,
>
> 2. start_isolate_page_range() explicitly breaks ordering by doing the
>    last block in the range before the center. It's that last block
>    that triggers the race with __rmqueue_smallest -> expand() for me.
>
> With the below patch I can no longer reproduce the issue:
>
> ---
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b5c7a9d21257..b7c8730bf0e2 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -538,8 +538,8 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  	unsigned long pfn;
>  	struct page *page;
>  	/* isolation is done at page block granularity */
> -	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
> -	unsigned long isolate_end = pageblock_align(end_pfn);
> +	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
> +	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
>  	int ret;
>  	bool skip_isolation = false;
>
> @@ -549,17 +549,6 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  	if (ret)
>  		return ret;
>
> -	if (isolate_start == isolate_end - pageblock_nr_pages)
> -		skip_isolation = true;
> -
> -	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
> -	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
> -			skip_isolation, migratetype);
> -	if (ret) {
> -		unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
> -		return ret;
> -	}
> -
>  	/* skip isolated pageblocks at the beginning and end */
>  	for (pfn = isolate_start + pageblock_nr_pages;
>  	     pfn < isolate_end - pageblock_nr_pages;
> @@ -568,12 +557,21 @@ int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  		if (page && set_migratetype_isolate(page, migratetype, flags,
>  					start_pfn, end_pfn)) {
>  			undo_isolate_page_range(isolate_start, pfn, migratetype);
> -			unset_migratetype_isolate(
> -				pfn_to_page(isolate_end - pageblock_nr_pages),
> -				migratetype);
>  			return -EBUSY;
>  		}
>  	}
> +
> +	if (isolate_start == isolate_end - pageblock_nr_pages)
> +		skip_isolation = true;
> +
> +	/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
> +	ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
> +			skip_isolation, migratetype);
> +	if (ret) {
> +		undo_isolate_page_range(isolate_start, pfn, migratetype);
> +		return ret;
> +	}
> +
>  	return 0;
>  }
>
> @@ -591,8 +589,8 @@ void undo_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  {
>  	unsigned long pfn;
>  	struct page *page;
> -	unsigned long isolate_start = pageblock_start_pfn(start_pfn);
> -	unsigned long isolate_end = pageblock_align(end_pfn);
> +	unsigned long isolate_start = ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES);
> +	unsigned long isolate_end = ALIGN(end_pfn, MAX_ORDER_NR_PAGES);
>
>  	for (pfn = isolate_start;
>  	     pfn < isolate_end;


--
Best Regards,
Yan, Zi
Zi Yan Sept. 21, 2023, 2:31 a.m. UTC | #22
On 20 Sep 2023, at 13:23, Zi Yan wrote:

> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>
>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>
>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>
>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>  		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>
>>>>>>>>  		/* Do not cross zone boundaries */
>>>>>>>> 	+#if 0
>>>>>>>>  		if (!zone_spans_pfn(zone, start))
>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>> 	+#else
>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>> 	+		start = pfn;
>>>>>>>> 	+#endif
>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>> 	 		return false;
>>>>>>>> 	I can still trigger warnings.
>>>>>>>
>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>
>>>>>>
>>>>>> Just to be really clear,
>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>   path WITHOUT your change.
>>>>>>
>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>
>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>
>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>
>>>>> Got it. Thanks for the clarification.
>>>>>>
>>>>>>>>
>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>> script.
>>>>>>>>
>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>
>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>> trying. Thanks.
>>>>>>>
>>>>>>
>>>>>> Perhaps try running both scripts in parallel?
>>>>>
>>>>> Yes. It seems to do the trick.
>>>>>
>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>
>>>>> I am able to reproduce it with the script below:
>>>>>
>>>>> while true; do
>>>>>  echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>  echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>  wait
>>>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>  echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>> done
>>>>>
>>>>> I will look into the issue.
>>>
>>> Nice!
>>>
>>> I managed to reproduce it ONCE, triggering it not even a second after
>>> starting the script. But I can't seem to do it twice, even after
>>> several reboots and letting it run for minutes.
>>
>> I managed to reproduce it reliably by cutting the nr_hugepages
>> parameters respectively in half.
>>
>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>> printk-tracing, the scenario seems to be this:
>>
>> #0                                                   #1
>> start_isolate_page_range()
>>   isolate_single_pageblock()
>>     set_migratetype_isolate(tail)
>>       lock zone->lock
>>       move_freepages_block(tail) // nop
>>       set_pageblock_migratetype(tail)
>>       unlock zone->lock
>>                                                      del_page_from_freelist(head)
>>                                                      expand(head, head_mt)
>>                                                        WARN(head_mt != tail_mt)
>>     start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>     for (pfn = start_pfn, pfn < end_pfn)
>>       if (PageBuddy())
>>         split_free_page(head)
>>
>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>> lock. The move_freepages_block() does nothing because the PageBuddy()
>> is set on the pageblock to the left. Once we drop the lock, the buddy
>> gets allocated and the expand() puts things on the wrong list. The
>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>> type is set and the lock has been dropped, so it's too late.
>
> Yes, this is the issue I can confirm as well. But it is intentional to enable
> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
> granularity. With your changes below, it no longer works, because if there
> is an unmovable page in
> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
> the allocation fails but it would succeed in current implementation.
>
> I think a proper fix would be to make move_freepages_block() split the
> MAX_ORDER page and put the split pages in the right migratetype free lists.
>
> I am working on that.

After spending half a day on this, I think it is much harder than I thought
to get alloc_contig_range() working with the freelist migratetype hygiene
patchset, because alloc_contig_range() relies on racy migratetype changes:

1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
free list yet.

2. later in the process, isolate_freepages_range() is used to actually grab
the free pages.

3. there was no problem when alloc_contig_range() worked on MAX_ORDER aligned
ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
in-use pages. But that is not the case when alloc_contig_range() works on
pageblock aligned ranges. Now, during the isolation phase, free or in-use pages
will need to be split to get their subpages into the right free lists.

4. the hardest case is when an in-use page sits across two pageblocks. Currently,
the code just isolates one pageblock, migrates the page, and lets split_free_page()
correct the free list later. But to strictly enforce freelist migratetype
hygiene, extra work is needed at the free page path to split the free page into
the right freelists.

I need more time to think about how to get alloc_contig_range() working properly.
Help is needed for bullet point 4; a rough sketch of what it would entail follows below.
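
A rough standalone sketch of what point 4 would require on the free path
(hypothetical names and made-up data, not the actual kernel code): walk the
freed chunk in pageblock-sized pieces and queue each piece on the freelist
matching its own pageblock's migratetype.

	#include <stdio.h>

	#define PAGEBLOCK_ORDER		9
	#define PAGEBLOCK_NR_PAGES	(1UL << PAGEBLOCK_ORDER)

	enum mt { MT_UNMOVABLE, MT_MOVABLE, MT_ISOLATE };

	/* pretend per-pageblock migratetypes, indexed by pfn >> PAGEBLOCK_ORDER */
	static enum mt block_mt[4] = { MT_MOVABLE, MT_ISOLATE, MT_MOVABLE, MT_MOVABLE };

	static void free_to_list(unsigned long pfn, unsigned int order, enum mt type)
	{
		printf("free [%lu, %lu) to migratetype %d\n",
		       pfn, pfn + (1UL << order), type);
	}

	/*
	 * Free 1 << order pages starting at pfn.  If the chunk spans
	 * pageblocks of different types, split it at pageblock boundaries
	 * so every piece lands on the freelist of its own pageblock.
	 */
	static void free_split_by_pageblock(unsigned long pfn, unsigned int order)
	{
		unsigned long p;

		if (order <= PAGEBLOCK_ORDER) {
			free_to_list(pfn, order, block_mt[pfn >> PAGEBLOCK_ORDER]);
			return;
		}
		for (p = pfn; p < pfn + (1UL << order); p += PAGEBLOCK_NR_PAGES)
			free_to_list(p, PAGEBLOCK_ORDER, block_mt[p >> PAGEBLOCK_ORDER]);
	}

	int main(void)
	{
		/* an order-10 chunk covering pageblocks 0 (MOVABLE) and 1 (ISOLATE) */
		free_split_by_pageblock(0, 10);
		return 0;
	}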

Thanks.

PS: One observation is that after move_to_free_list(), a page's migratetype
does not match the migratetype of its free list. I might need to make
changes on top of your patchset to get alloc_contig_range() working.


--
Best Regards,
Yan, Zi
David Hildenbrand Sept. 21, 2023, 10:19 a.m. UTC | #23
On 21.09.23 04:31, Zi Yan wrote:
> On 20 Sep 2023, at 13:23, Zi Yan wrote:
> 
>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>
>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>
>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>
>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>
>>>>>>>>>   		/* Do not cross zone boundaries */
>>>>>>>>> 	+#if 0
>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>> 	+#else
>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>> 	+		start = pfn;
>>>>>>>>> 	+#endif
>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>> 	 		return false;
>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>
>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>
>>>>>>>
>>>>>>> Just to be really clear,
>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>    path WITHOUT your change.
>>>>>>>
>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>
>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>
>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>
>>>>>> Got it. Thanks for the clarification.
>>>>>>>
>>>>>>>>>
>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>> script.
>>>>>>>>>
>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>
>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>> trying. Thanks.
>>>>>>>>
>>>>>>>
>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>
>>>>>> Yes. It seems to do the trick.
>>>>>>
>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>
>>>>>> I am able to reproduce it with the script below:
>>>>>>
>>>>>> while true; do
>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>   wait
>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>> done
>>>>>>
>>>>>> I will look into the issue.
>>>>
>>>> Nice!
>>>>
>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>> starting the script. But I can't seem to do it twice, even after
>>>> several reboots and letting it run for minutes.
>>>
>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>> parameters respectively in half.
>>>
>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>> printk-tracing, the scenario seems to be this:
>>>
>>> #0                                                   #1
>>> start_isolate_page_range()
>>>    isolate_single_pageblock()
>>>      set_migratetype_isolate(tail)
>>>        lock zone->lock
>>>        move_freepages_block(tail) // nop
>>>        set_pageblock_migratetype(tail)
>>>        unlock zone->lock
>>>                                                       del_page_from_freelist(head)
>>>                                                       expand(head, head_mt)
>>>                                                         WARN(head_mt != tail_mt)
>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>      for (pfn = start_pfn, pfn < end_pfn)
>>>        if (PageBuddy())
>>>          split_free_page(head)
>>>
>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>> gets allocated and the expand() puts things on the wrong list. The
>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>> type is set and the lock has been dropped, so it's too late.
>>
>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>> granularity. With your changes below, it no longer works, because if there
>> is an unmovable page in
>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>> the allocation fails but it would succeed in current implementation.
>>
>> I think a proper fix would be to make move_freepages_block() split the
>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>
>> I am working on that.
> 
> After spending half a day on this, I think it is much harder than I thought
> to get alloc_contig_range() working with the freelist migratetype hygiene
> patchset. Because alloc_contig_range() relies on racy migratetype changes:
> 
> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
> free list yet.
> 
> 2. later in the process, isolate_freepages_range() is used to actually grab
> the free pages.
> 
> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
> in-use pages. But it is not the case when alloc_contig_range() work on
> pageblock aligned ranges. Now during isolation phase, free or in-use pages
> will need to be split to get their subpages into the right free lists.
> 
> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
> the code just isolate one pageblock, migrate the page, and let split_free_page()
> to correct the free list later. But to strictly enforce freelist migratetype
> hygiene, extra work is needed at free page path to split the free page into
> the right freelists.
> 
> I need more time to think about how to get alloc_contig_range() properly.
> Help is needed for the bullet point 4.


I once raised that we should maybe try making MIGRATE_ISOLATE a flag 
that preserves the original migratetype. Not sure if that would help 
here in any way.

The whole alloc_contig_range() implementation is quite complicated and 
hard to grasp. If we could find ways to clean all that up and make it 
easier to understand and play along, that would be nice.
Zi Yan Sept. 21, 2023, 2:47 p.m. UTC | #24
On 21 Sep 2023, at 6:19, David Hildenbrand wrote:

> On 21.09.23 04:31, Zi Yan wrote:
>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>
>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>
>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>
>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>
>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>
>>>>>>>>>>   		/* Do not cross zone boundaries */
>>>>>>>>>> 	+#if 0
>>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>> 	+#else
>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>> 	+#endif
>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>> 	 		return false;
>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>
>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Just to be really clear,
>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>    path WITHOUT your change.
>>>>>>>>
>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>
>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>
>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>
>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>> script.
>>>>>>>>>>
>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>
>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>> trying. Thanks.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>
>>>>>>> Yes. It seems to do the trick.
>>>>>>>
>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>
>>>>>>> I am able to reproduce it with the script below:
>>>>>>>
>>>>>>> while true; do
>>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>   wait
>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>> done
>>>>>>>
>>>>>>> I will look into the issue.
>>>>>
>>>>> Nice!
>>>>>
>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>> starting the script. But I can't seem to do it twice, even after
>>>>> several reboots and letting it run for minutes.
>>>>
>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>> parameters respectively in half.
>>>>
>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>> printk-tracing, the scenario seems to be this:
>>>>
>>>> #0                                                   #1
>>>> start_isolate_page_range()
>>>>    isolate_single_pageblock()
>>>>      set_migratetype_isolate(tail)
>>>>        lock zone->lock
>>>>        move_freepages_block(tail) // nop
>>>>        set_pageblock_migratetype(tail)
>>>>        unlock zone->lock
>>>>                                                       del_page_from_freelist(head)
>>>>                                                       expand(head, head_mt)
>>>>                                                         WARN(head_mt != tail_mt)
>>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>      for (pfn = start_pfn, pfn < end_pfn)
>>>>        if (PageBuddy())
>>>>          split_free_page(head)
>>>>
>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>> gets allocated and the expand() puts things on the wrong list. The
>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>> type is set and the lock has been dropped, so it's too late.
>>>
>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>> granularity. With your changes below, it no longer works, because if there
>>> is an unmovable page in
>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>> the allocation fails but it would succeed in current implementation.
>>>
>>> I think a proper fix would be to make move_freepages_block() split the
>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>
>>> I am working on that.
>>
>> After spending half a day on this, I think it is much harder than I thought
>> to get alloc_contig_range() working with the freelist migratetype hygiene
>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>
>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>> free list yet.
>>
>> 2. later in the process, isolate_freepages_range() is used to actually grab
>> the free pages.
>>
>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>> in-use pages. But it is not the case when alloc_contig_range() work on
>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>> will need to be split to get their subpages into the right free lists.
>>
>> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
>> the code just isolate one pageblock, migrate the page, and let split_free_page()
>> to correct the free list later. But to strictly enforce freelist migratetype
>> hygiene, extra work is needed at free page path to split the free page into
>> the right freelists.
>>
>> I need more time to think about how to get alloc_contig_range() properly.
>> Help is needed for the bullet point 4.
>
>
> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.

I have had that in my backlog since you asked and have been delaying it. ;) Hopefully
I can do it after I fix this. That change might or might not help, and probably only
if we also redesign how migratetype is managed. If MIGRATE_ISOLATE does not
overwrite the existing migratetype, the code might not need to split a page and move
it to the MIGRATE_ISOLATE freelist?

The fundamental issue in alloc_contig_range() is that, to work at
pageblock level, a page (>pageblock_order) can have one part isolated and
the rest in a different migratetype. {add_to,move_to,del_page_from}_free_list()
now check the first pageblock's migratetype, so such a page needs to be removed
from its free_list, have MIGRATE_ISOLATE set on one of its pageblocks, be split, and
finally be put back onto multiple free lists. This needs to be done at the isolation
stage, before free pages are removed from their free lists (the stage after isolation).
If MIGRATE_ISOLATE were a separate flag and we were OK with leaving isolated pages
in their original migratetype and checking the migratetype before allocating a page,
that might help. But that might add extra work (e.g., splitting a partially
isolated free page before allocation) in the really hot code path, which is not
desirable.
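
For reference, a minimal standalone sketch of the "isolate as a flag" idea
(a made-up encoding for illustration, not the current pageblock flags
layout): the underlying migratetype stays readable while an extra bit marks
the block as isolated, so isolation would not have to overwrite and later
restore the type.

	#include <stdio.h>

	enum mt { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE };

	#define MT_MASK		0x7	/* low bits: the real migratetype */
	#define ISOLATE_FLAG	0x8	/* separate bit: block is isolated */

	static unsigned int block_flags = MT_MOVABLE;

	static void isolate_block(void)   { block_flags |= ISOLATE_FLAG; }
	static void unisolate_block(void) { block_flags &= ~ISOLATE_FLAG; }
	static int  block_isolated(void)  { return !!(block_flags & ISOLATE_FLAG); }
	static enum mt block_type(void)   { return (enum mt)(block_flags & MT_MASK); }

	int main(void)
	{
		isolate_block();
		/* the allocator could skip blocks with the flag set... */
		printf("isolated=%d\n", block_isolated());
		/* ...while the original type is still known without saving it aside */
		printf("type=%d\n", block_type());
		unisolate_block();
		return 0;
	}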

>
> The whole alloc_contig_range() implementation is quite complicated and hard to grasp. If we could find ways to clean all that up and make it easier to understand and play along, that would be nice.

I will try my best to simplify it.

--
Best Regards,
Yan, Zi
Zi Yan Sept. 25, 2023, 9:12 p.m. UTC | #25
On 21 Sep 2023, at 10:47, Zi Yan wrote:

> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
>
>> On 21.09.23 04:31, Zi Yan wrote:
>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>
>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>
>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>
>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>
>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>
>>>>>>>>>>>   		/* Do not cross zone boundaries */
>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>> 	+#else
>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>> 	+#endif
>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>> 	 		return false;
>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>
>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Just to be really clear,
>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>    path WITHOUT your change.
>>>>>>>>>
>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>
>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>
>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>
>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>> script.
>>>>>>>>>>>
>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>
>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>> trying. Thanks.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>
>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>
>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>
>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>
>>>>>>>> while true; do
>>>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>   wait
>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>> done
>>>>>>>>
>>>>>>>> I will look into the issue.
>>>>>>
>>>>>> Nice!
>>>>>>
>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>> several reboots and letting it run for minutes.
>>>>>
>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>> parameters respectively in half.
>>>>>
>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>> printk-tracing, the scenario seems to be this:
>>>>>
>>>>> #0                                                   #1
>>>>> start_isolate_page_range()
>>>>>    isolate_single_pageblock()
>>>>>      set_migratetype_isolate(tail)
>>>>>        lock zone->lock
>>>>>        move_freepages_block(tail) // nop
>>>>>        set_pageblock_migratetype(tail)
>>>>>        unlock zone->lock
>>>>>                                                       del_page_from_freelist(head)
>>>>>                                                       expand(head, head_mt)
>>>>>                                                         WARN(head_mt != tail_mt)
>>>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>      for (pfn = start_pfn, pfn < end_pfn)
>>>>>        if (PageBuddy())
>>>>>          split_free_page(head)
>>>>>
>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>> type is set and the lock has been dropped, so it's too late.
>>>>
>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>> granularity. With your changes below, it no longer works, because if there
>>>> is an unmovable page in
>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>> the allocation fails but it would succeed in current implementation.
>>>>
>>>> I think a proper fix would be to make move_freepages_block() split the
>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>
>>>> I am working on that.
>>>
>>> After spending half a day on this, I think it is much harder than I thought
>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>>
>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>> free list yet.
>>>
>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>> the free pages.
>>>
>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>> in-use pages. But it is not the case when alloc_contig_range() work on
>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>> will need to be split to get their subpages into the right free lists.
>>>
>>> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
>>> the code just isolate one pageblock, migrate the page, and let split_free_page()
>>> to correct the free list later. But to strictly enforce freelist migratetype
>>> hygiene, extra work is needed at free page path to split the free page into
>>> the right freelists.
>>>
>>> I need more time to think about how to get alloc_contig_range() properly.
>>> Help is needed for the bullet point 4.
>>
>>
>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
>
> I have that in my backlog since you asked and have been delaying it. ;) Hopefully
> I can do it after I fix this. That change might or might not help only if we make
> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
> overwrite existing migratetype, the code might not need to split a page and move
> it to MIGRATE_ISOLATE freelist?
>
> The fundamental issue in alloc_contig_range() is that to work at
> pageblock level, a page (>pageblock_order) can have one part is isolated and
> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
> now checks first pageblock migratetype, so such a page needs to be removed
> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
> finally put back to multiple free lists. This needs to be done at isolation stage
> before free pages are removed from their free lists (the stage after isolation).
> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
> in their original migratetype and check migratetype before allocating a page,
> that might help. But that might add extra work (e.g., splitting a partially
> isolated free page before allocation) in the really hot code path, which is not
> desirable.
>
>>
>> The whole alloc_contig_range() implementation is quite complicated and hard to grasp. If we could find ways to clean all that up and make it easier to understand and play along, that would be nice.
>
> I will try my best to simplify it.

Hi Johannes,

I attached three patches to fix the issue; the first two can be folded into
your patchset:

1. __free_one_page() bug you and Vlastimil discussed on the other email.
2. move set_pageblock_migratetype() into move_freepages() to prepare for patch 3.
3. enable move_freepages() to split a free page that is only partially covered
   by the [start_pfn, end_pfn] parameters, and to set the migratetype correctly
   when a >pageblock_order free page is moved. Previously, when a
   >pageblock_order free page was moved, only the first pageblock's migratetype
   was changed; the added WARN_ON_ONCE might be triggered by such pages.

I ran Mike's test with transhuge-stress together with my patches on top of your
"close migratetype race" patch for more than an hour without any warning.
It should unblock your patchset. I will keep working on alloc_contig_range()
simplification.


--
Best Regards,
Yan, Zi
From a18de9a235dc97999fcabdac699f33da9138b0ba Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Fri, 22 Sep 2023 11:11:32 -0400
Subject: [PATCH 1/3] mm: fix __free_one_page().

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7de022bc4c7d..72f27d14c8e7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -787,8 +787,6 @@ static inline void __free_one_page(struct page *page,
 	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
 	while (order < MAX_ORDER) {
-		int buddy_mt;
-
 		if (compaction_capture(capc, page, order, migratetype))
 			return;
 
@@ -796,8 +794,6 @@ static inline void __free_one_page(struct page *page,
 		if (!buddy)
 			goto done_merging;
 
-		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
-
 		if (unlikely(order >= pageblock_order)) {
 			/*
 			 * We want to prevent merge between freepages on pageblock
@@ -827,7 +823,7 @@ static inline void __free_one_page(struct page *page,
 		if (page_is_guard(buddy))
 			clear_page_guard(zone, buddy, order);
 		else
-			del_page_from_free_list(buddy, zone, order, buddy_mt);
+			del_page_from_free_list(buddy, zone, order, migratetype);
 		combined_pfn = buddy_pfn & pfn;
 		page = page + (combined_pfn - pfn);
 		pfn = combined_pfn;
Johannes Weiner Sept. 26, 2023, 5:39 p.m. UTC | #26
On Mon, Sep 25, 2023 at 05:12:38PM -0400, Zi Yan wrote:
> On 21 Sep 2023, at 10:47, Zi Yan wrote:
> 
> > On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
> >
> >> On 21.09.23 04:31, Zi Yan wrote:
> >>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
> >>>
> >>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
> >>>>
> >>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
> >>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
> >>>>>>> On 9/20/23 03:38, Zi Yan wrote:
> >>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
> >>>>>>>>
> >>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
> >>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
> >>>>>>>>>>
> >>>>>>>>>>> 	--- a/mm/page_alloc.c
> >>>>>>>>>>> 	+++ b/mm/page_alloc.c
> >>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
> >>>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
> >>>>>>>>>>>
> >>>>>>>>>>>   		/* Do not cross zone boundaries */
> >>>>>>>>>>> 	+#if 0
> >>>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
> >>>>>>>>>>> 			start = zone->zone_start_pfn;
> >>>>>>>>>>> 	+#else
> >>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
> >>>>>>>>>>> 	+		start = pfn;
> >>>>>>>>>>> 	+#endif
> >>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
> >>>>>>>>>>> 	 		return false;
> >>>>>>>>>>> 	I can still trigger warnings.
> >>>>>>>>>>
> >>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
> >>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Just to be really clear,
> >>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
> >>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
> >>>>>>>>>    path WITHOUT your change.
> >>>>>>>>>
> >>>>>>>>> I am guessing the difference here has more to do with the allocation path?
> >>>>>>>>>
> >>>>>>>>> I went back and reran focusing on the specific migrate type.
> >>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
> >>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
> >>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
> >>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
> >>>>>>>>>
> >>>>>>>>> I could be wrong, but I do not think your patch changes things.
> >>>>>>>>
> >>>>>>>> Got it. Thanks for the clarification.
> >>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
> >>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
> >>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
> >>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
> >>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
> >>>>>>>>>>> script.
> >>>>>>>>>>>
> >>>>>>>>>>> Zi asked about my config, so it is attached.
> >>>>>>>>>>
> >>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
> >>>>>>>>>> trying. Thanks.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Perhaps try running both scripts in parallel?
> >>>>>>>>
> >>>>>>>> Yes. It seems to do the trick.
> >>>>>>>>
> >>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
> >>>>>>>>
> >>>>>>>> I am able to reproduce it with the script below:
> >>>>>>>>
> >>>>>>>> while true; do
> >>>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
> >>>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
> >>>>>>>>   wait
> >>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> >>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> >>>>>>>> done
> >>>>>>>>
> >>>>>>>> I will look into the issue.
> >>>>>>
> >>>>>> Nice!
> >>>>>>
> >>>>>> I managed to reproduce it ONCE, triggering it not even a second after
> >>>>>> starting the script. But I can't seem to do it twice, even after
> >>>>>> several reboots and letting it run for minutes.
> >>>>>
> >>>>> I managed to reproduce it reliably by cutting the nr_hugepages
> >>>>> parameters respectively in half.
> >>>>>
> >>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
> >>>>> printk-tracing, the scenario seems to be this:
> >>>>>
> >>>>> #0                                                   #1
> >>>>> start_isolate_page_range()
> >>>>>    isolate_single_pageblock()
> >>>>>      set_migratetype_isolate(tail)
> >>>>>        lock zone->lock
> >>>>>        move_freepages_block(tail) // nop
> >>>>>        set_pageblock_migratetype(tail)
> >>>>>        unlock zone->lock
> >>>>>                                                       del_page_from_freelist(head)
> >>>>>                                                       expand(head, head_mt)
> >>>>>                                                         WARN(head_mt != tail_mt)
> >>>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
> >>>>>      for (pfn = start_pfn, pfn < end_pfn)
> >>>>>        if (PageBuddy())
> >>>>>          split_free_page(head)
> >>>>>
> >>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
> >>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
> >>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
> >>>>> gets allocated and the expand() puts things on the wrong list. The
> >>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
> >>>>> type is set and the lock has been dropped, so it's too late.
> >>>>
> >>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
> >>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
> >>>> granularity. With your changes below, it no longer works, because if there
> >>>> is an unmovable page in
> >>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
> >>>> the allocation fails but it would succeed in current implementation.
> >>>>
> >>>> I think a proper fix would be to make move_freepages_block() split the
> >>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
> >>>>
> >>>> I am working on that.
> >>>
> >>> After spending half a day on this, I think it is much harder than I thought
> >>> to get alloc_contig_range() working with the freelist migratetype hygiene
> >>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
> >>>
> >>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
> >>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
> >>> free list yet.
> >>>
> >>> 2. later in the process, isolate_freepages_range() is used to actually grab
> >>> the free pages.
> >>>
> >>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
> >>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
> >>> in-use pages. But it is not the case when alloc_contig_range() work on
> >>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
> >>> will need to be split to get their subpages into the right free lists.
> >>>
> >>> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
> >>> the code just isolate one pageblock, migrate the page, and let split_free_page()
> >>> to correct the free list later. But to strictly enforce freelist migratetype
> >>> hygiene, extra work is needed at free page path to split the free page into
> >>> the right freelists.
> >>>
> >>> I need more time to think about how to get alloc_contig_range() properly.
> >>> Help is needed for the bullet point 4.
> >>
> >>
> >> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
> >
> > I have that in my backlog since you asked and have been delaying it. ;) Hopefully
> > I can do it after I fix this. That change might or might not help only if we make
> > some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
> > overwrite existing migratetype, the code might not need to split a page and move
> > it to MIGRATE_ISOLATE freelist?
> >
> > The fundamental issue in alloc_contig_range() is that to work at
> > pageblock level, a page (>pageblock_order) can have one part is isolated and
> > the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
> > now checks first pageblock migratetype, so such a page needs to be removed
> > from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
> > finally put back to multiple free lists. This needs to be done at isolation stage
> > before free pages are removed from their free lists (the stage after isolation).
> > If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
> > in their original migratetype and check migratetype before allocating a page,
> > that might help. But that might add extra work (e.g., splitting a partially
> > isolated free page before allocation) in the really hot code path, which is not
> > desirable.
> >
> >>
> >> The whole alloc_contig_range() implementation is quite complicated and hard to grasp. If we could find ways to clean all that up and make it easier to understand and play along, that would be nice.
> >
> > I will try my best to simplify it.
> 
> Hi Johannes,
> 
> I attached three patches to fix the issue and first two can be folded into
> your patchset:

Hi Zi, thanks for providing these patches! I'll pick them up into the
series.

> 1. __free_one_page() bug you and Vlastimil discussed on the other email.
> 2. move set_pageblock_migratetype() into move_freepages() to prepare for patch 3.
> 3. enable move_freepages() to split a free page that is partially covered by
>    [start_pfn, end_pfn] in the parameter and set migratetype correctly when
>    a >pageblock_order free page is moved. Before when a >pageblock_order
>    free page is moved, only first pageblock migratetype is changed. The added
>    WARN_ON_ONCE might be triggered by these pages.
> 
> I ran Mike's test with transhuge-stress together with my patches on top of your
> "close migratetype race" patch for more than an hour without any warning.
> It should unblock your patchset. I will keep working on alloc_contig_range()
> simplification.
> 
> 
> --
> Best Regards,
> Yan, Zi

> From a18de9a235dc97999fcabdac699f33da9138b0ba Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Fri, 22 Sep 2023 11:11:32 -0400
> Subject: [PATCH 1/3] mm: fix __free_one_page().
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/page_alloc.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7de022bc4c7d..72f27d14c8e7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -787,8 +787,6 @@ static inline void __free_one_page(struct page *page,
>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>  
>  	while (order < MAX_ORDER) {
> -		int buddy_mt;
> -
>  		if (compaction_capture(capc, page, order, migratetype))
>  			return;
>  
> @@ -796,8 +794,6 @@ static inline void __free_one_page(struct page *page,
>  		if (!buddy)
>  			goto done_merging;
>  
> -		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
> -
>  		if (unlikely(order >= pageblock_order)) {
>  			/*
>  			 * We want to prevent merge between freepages on pageblock
> @@ -827,7 +823,7 @@ static inline void __free_one_page(struct page *page,
>  		if (page_is_guard(buddy))
>  			clear_page_guard(zone, buddy, order);
>  		else
> -			del_page_from_free_list(buddy, zone, order, buddy_mt);
> +			del_page_from_free_list(buddy, zone, order, migratetype);
>  		combined_pfn = buddy_pfn & pfn;
>  		page = page + (combined_pfn - pfn);
>  		pfn = combined_pfn;

I had a fix for this that's slightly different. The buddy's type can't
be changed while it's still on the freelist, so I moved that
around. The sequence now is:

	int buddy_mt = migratetype;

	if (unlikely(order >= pageblock_order)) {
		/* This is the only case where buddy_mt can differ */
		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
		// compat checks...
	}

	del_page_from_free_list(buddy, buddy_mt);

	if (unlikely(buddy_mt != migratetype))
		set_pageblock_migratetype(buddy, migratetype);


> From b11a0e3d8f9d7d91a884c90dc9cebb185c3a2bbc Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Mon, 25 Sep 2023 16:27:14 -0400
> Subject: [PATCH 2/3] mm: set migratetype after free pages are moved between
>  free lists.
> 
> This avoids changing migratetype after move_freepages() or
> move_freepages_block(), which is error prone. It also prepares for upcoming
> changes to fix move_freepages() not moving free pages partially in the
> range.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>

This makes the code much cleaner, thank you!

> From 75a4d327efd94230f3b9aab29ef6ec0badd488a6 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Mon, 25 Sep 2023 16:55:18 -0400
> Subject: [PATCH 3/3] mm: enable move_freepages() to properly move part of free
>  pages.
> 
> alloc_contig_range() uses set_migrateype_isolate(), which eventually calls
> move_freepages(), to isolate free pages. But move_freepages() was not able
> to move free pages partially covered by the specified range, leaving a race
> window open[1]. Fix it by teaching move_freepages() to split a free page
> when only part of it is going to be moved.
> 
> In addition, when a >pageblock_order free page is moved, only its first
> pageblock migratetype is changed. It can cause warnings later. Fix it by
> set all pageblocks in a free page to the same migratetype after move.
> 
> split_free_page() is changed to be used in move_freepages() and
> isolate_single_pageblock(). A common code to find the start pfn of a free
> page is refactored in get_freepage_start_pfn().
> 
> [1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/page_alloc.c     | 75 ++++++++++++++++++++++++++++++++++++---------
>  mm/page_isolation.c | 17 +++++++---
>  2 files changed, 73 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7c41cb5d8a36..3fd5ab40b55c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -866,15 +866,15 @@ int split_free_page(struct page *free_page,
>  	struct zone *zone = page_zone(free_page);
>  	unsigned long free_page_pfn = page_to_pfn(free_page);
>  	unsigned long pfn;
> -	unsigned long flags;
>  	int free_page_order;
>  	int mt;
>  	int ret = 0;
>  
> -	if (split_pfn_offset == 0)
> -		return ret;
> +	/* zone lock should be held when this function is called */
> +	lockdep_assert_held(&zone->lock);
>  
> -	spin_lock_irqsave(&zone->lock, flags);
> +	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
> +		return ret;
>  
>  	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
>  		ret = -ENOENT;
> @@ -900,7 +900,6 @@ int split_free_page(struct page *free_page,
>  			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
>  	}
>  out:
> -	spin_unlock_irqrestore(&zone->lock, flags);
>  	return ret;
>  }
>  /*
> @@ -1589,6 +1588,25 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>  					unsigned int order) { return NULL; }
>  #endif
>  
> +/*
> + * Get first pfn of the free page, where pfn is in. If this free page does
> + * not exist, return the given pfn.
> + */
> +static unsigned long get_freepage_start_pfn(unsigned long pfn)
> +{
> +	int order = 0;
> +	unsigned long start_pfn = pfn;
> +
> +	while (!PageBuddy(pfn_to_page(start_pfn))) {
> +		if (++order > MAX_ORDER) {
> +			start_pfn = pfn;
> +			break;
> +		}
> +		start_pfn &= ~0UL << order;
> +	}
> +	return start_pfn;
> +}
> +
>  /*
>   * Move the free pages in a range to the freelist tail of the requested type.
>   * Note that start_page and end_pages are not aligned on a pageblock
> @@ -1598,9 +1616,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>  			  unsigned long end_pfn, int old_mt, int new_mt)
>  {
>  	struct page *page;
> -	unsigned long pfn;
> +	unsigned long pfn, pfn2;
>  	unsigned int order;
>  	int pages_moved = 0;
> +	unsigned long mt_change_pfn = start_pfn;
> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
> +
> +	/* split at start_pfn if it is in the middle of a free page */
> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
> +		struct page *new_page = pfn_to_page(new_start_pfn);
> +		int new_page_order = buddy_order(new_page);
> +
> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
> +			/* change migratetype so that split_free_page can work */
> +			set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
> +			split_free_page(new_page, buddy_order(new_page),
> +					start_pfn - new_start_pfn);
> +
> +			mt_change_pfn = start_pfn;
> +			/* move to next page */
> +			start_pfn = new_start_pfn + (1 << new_page_order);
> +		}
> +	}

Ok, so if there is a straddle from the previous block into our block
of interest, it's split and the migratetype is set only on our block.

> @@ -1615,10 +1653,24 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>  
>  		order = buddy_order(page);
>  		move_to_free_list(page, zone, order, old_mt, new_mt);
> +		/*
> +		 * set page migratetype for all pageblocks within the page and
> +		 * only after we move all free pages in one pageblock
> +		 */
> +		if (pfn + (1 << order) >= pageblock_end_pfn(pfn)) {
> +			for (pfn2 = pfn; pfn2 < pfn + (1 << order);
> +			     pfn2 += pageblock_nr_pages) {
> +				set_pageblock_migratetype(pfn_to_page(pfn2),
> +							  new_mt);
> +				mt_change_pfn = pfn2;
> +			}

But if we have the first block of a MAX_ORDER chunk, then we don't
split but rather move the whole chunk and make sure to update the
chunk's blocks that are outside the range of interest.

It looks like either way would work, but why not split here as well
and keep the move contained to the block? Wouldn't this be a bit more
predictable and easier to understand?

> +		}
>  		pfn += 1 << order;
>  		pages_moved += 1 << order;
>  	}
> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
> +	/* set migratetype for the remaining pageblocks */
> +	for (pfn2 = mt_change_pfn; pfn2 <= end_pfn; pfn2 += pageblock_nr_pages)
> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);

I think I'm missing something for this.

- If there was no straddle, there is only our block of interest to
  update.

- If there was a straddle from the previous block, it was split and
  the block of interest was already updated. Nothing to do here?

- If there was a straddle into the next block, both blocks are updated
  to the new type. Nothing to do here?

What's the case where there are multiple blocks to update in the end?

> @@ -380,8 +380,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>  			int order = buddy_order(page);
>  
>  			if (pfn + (1UL << order) > boundary_pfn) {
> +				int res;
> +				unsigned long flags;
> +
> +				spin_lock_irqsave(&zone->lock, flags);
> +				res = split_free_page(page, order, boundary_pfn - pfn);
> +				spin_unlock_irqrestore(&zone->lock, flags);
> +
>  				/* free page changed before split, check it again */
> -				if (split_free_page(page, order, boundary_pfn - pfn))
> +				if (res)
>  					continue;

At this point, we've already set the migratetype, which has handled
straddling free pages. Is this split still needed?

> @@ -426,9 +433,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>  				/*
>  				 * XXX: mark the page as MIGRATE_ISOLATE so that
>  				 * no one else can grab the freed page after migration.
> -				 * Ideally, the page should be freed as two separate
> -				 * pages to be added into separate migratetype free
> -				 * lists.
> +				 * The page should be freed into separate migratetype
> +				 * free lists, unless the free page order is greater
> +				 * than pageblock order. It is not the case now,
> +				 * since gigantic hugetlb is freed as order-0
> +				 * pages and LRU pages do not cross pageblocks.
>  				 */
>  				if (isolate_page) {
>  					ret = set_migratetype_isolate(page, page_mt,

I hadn't thought about LRU pages being constrained to single
pageblocks before. Does this mean we only ever migrate here in case
there is a movable gigantic page? And since those are already split
during the free, does that mean the "reset pfn to head of the free
page" part after the migration is actually unnecessary?
David Hildenbrand Sept. 26, 2023, 6:19 p.m. UTC | #27
On 21.09.23 16:47, Zi Yan wrote:
> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
> 
>> On 21.09.23 04:31, Zi Yan wrote:
>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>
>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>
>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>
>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>
>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>    		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>
>>>>>>>>>>>    		/* Do not cross zone boundaries */
>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>    		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>> 	+#else
>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>> 	+#endif
>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>> 	 		return false;
>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>
>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Just to be really clear,
>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>     path WITHOUT your change.
>>>>>>>>>
>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>
>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>
>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>
>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>> script.
>>>>>>>>>>>
>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>
>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>> trying. Thanks.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>
>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>
>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>
>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>
>>>>>>>> while true; do
>>>>>>>>    echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>    echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>    wait
>>>>>>>>    echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>    echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>> done
>>>>>>>>
>>>>>>>> I will look into the issue.
>>>>>>
>>>>>> Nice!
>>>>>>
>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>> several reboots and letting it run for minutes.
>>>>>
>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>> parameters respectively in half.
>>>>>
>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>> printk-tracing, the scenario seems to be this:
>>>>>
>>>>> #0                                                   #1
>>>>> start_isolate_page_range()
>>>>>     isolate_single_pageblock()
>>>>>       set_migratetype_isolate(tail)
>>>>>         lock zone->lock
>>>>>         move_freepages_block(tail) // nop
>>>>>         set_pageblock_migratetype(tail)
>>>>>         unlock zone->lock
>>>>>                                                        del_page_from_freelist(head)
>>>>>                                                        expand(head, head_mt)
>>>>>                                                          WARN(head_mt != tail_mt)
>>>>>       start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>       for (pfn = start_pfn, pfn < end_pfn)
>>>>>         if (PageBuddy())
>>>>>           split_free_page(head)
>>>>>
>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>> type is set and the lock has been dropped, so it's too late.
>>>>
>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>> granularity. With your changes below, it no longer works, because if there
>>>> is an unmovable page in
>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>> the allocation fails but it would succeed in current implementation.
>>>>
>>>> I think a proper fix would be to make move_freepages_block() split the
>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>
>>>> I am working on that.
>>>
>>> After spending half a day on this, I think it is much harder than I thought
>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>>
>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>> free list yet.
>>>
>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>> the free pages.
>>>
>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>> in-use pages. But it is not the case when alloc_contig_range() work on
>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>> will need to be split to get their subpages into the right free lists.
>>>
>>> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
>>> the code just isolate one pageblock, migrate the page, and let split_free_page()
>>> to correct the free list later. But to strictly enforce freelist migratetype
>>> hygiene, extra work is needed at free page path to split the free page into
>>> the right freelists.
>>>
>>> I need more time to think about how to get alloc_contig_range() properly.
>>> Help is needed for the bullet point 4.
>>
>>
>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
> 
> I have that in my backlog since you asked and have been delaying it. ;) Hopefully

It's complicated, and I wish I had had more time to review it
back then ... or had time now to clean it up.

Unfortunately, nobody else had the time to review it back then ... maybe we can
do better next time. David doesn't scale.

Doing page migration from inside start_isolate_page_range()->isolate_single_pageblock()
really is sub-optimal (and mostly code duplication from alloc_contig_range).

> I can do it after I fix this. That change might or might not help only if we make
> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
> overwrite existing migratetype, the code might not need to split a page and move
> it to MIGRATE_ISOLATE freelist?

Did someone test how memory offlining plays along with that? (I can try myself
within the next 1-2 weeks)

There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges,
though.

ret = start_isolate_page_range(start_pfn, end_pfn,
			       MIGRATE_MOVABLE,
			       MEMORY_OFFLINE | REPORT_FAILURE,
			       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);

> 
> The fundamental issue in alloc_contig_range() is that to work at
> pageblock level, a page (>pageblock_order) can have one part is isolated and
> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
> now checks first pageblock migratetype, so such a page needs to be removed
> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
> finally put back to multiple free lists. This needs to be done at isolation stage
> before free pages are removed from their free lists (the stage after isolation).

One idea was to always isolate larger chunks, and handle movability checks/splits/etc.
at a later stage. Had isolation been decoupled from the actual/original migratetype,
that could have been easier to handle (especially some corner cases I had in mind back then).

> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
> in their original migratetype and check migratetype before allocating a page,
> that might help. But that might add extra work (e.g., splitting a partially
> isolated free page before allocation) in the really hot code path, which is not
> desirable.

With MIGRATE_ISOLATE being a separate flag, one idea was not to have a single
separate isolate list, but one per "proper migratetype". But again, these are just some
random thoughts I had back then; I never had sufficient time to think it all through.
Zi Yan Sept. 28, 2023, 2:51 a.m. UTC | #28
On 26 Sep 2023, at 13:39, Johannes Weiner wrote:

> On Mon, Sep 25, 2023 at 05:12:38PM -0400, Zi Yan wrote:
>> On 21 Sep 2023, at 10:47, Zi Yan wrote:
>>
>>> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
>>>
>>>> On 21.09.23 04:31, Zi Yan wrote:
>>>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>>>
>>>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>>>
>>>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>>>
>>>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>>>
>>>>>>>>>>>>>   		/* Do not cross zone boundaries */
>>>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>>>> 	+#else
>>>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>>>> 	+#endif
>>>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>>>> 	 		return false;
>>>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>>>
>>>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Just to be really clear,
>>>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>>>    path WITHOUT your change.
>>>>>>>>>>>
>>>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>>>
>>>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>>>
>>>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>>>
>>>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>>>> script.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>>>
>>>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>>>> trying. Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>>>
>>>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>>>
>>>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>>>
>>>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>>>
>>>>>>>>>> while true; do
>>>>>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>>>   wait
>>>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>>>> done
>>>>>>>>>>
>>>>>>>>>> I will look into the issue.
>>>>>>>>
>>>>>>>> Nice!
>>>>>>>>
>>>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>>>> several reboots and letting it run for minutes.
>>>>>>>
>>>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>>>> parameters respectively in half.
>>>>>>>
>>>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>>>> printk-tracing, the scenario seems to be this:
>>>>>>>
>>>>>>> #0                                                   #1
>>>>>>> start_isolate_page_range()
>>>>>>>    isolate_single_pageblock()
>>>>>>>      set_migratetype_isolate(tail)
>>>>>>>        lock zone->lock
>>>>>>>        move_freepages_block(tail) // nop
>>>>>>>        set_pageblock_migratetype(tail)
>>>>>>>        unlock zone->lock
>>>>>>>                                                       del_page_from_freelist(head)
>>>>>>>                                                       expand(head, head_mt)
>>>>>>>                                                         WARN(head_mt != tail_mt)
>>>>>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>>>      for (pfn = start_pfn, pfn < end_pfn)
>>>>>>>        if (PageBuddy())
>>>>>>>          split_free_page(head)
>>>>>>>
>>>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>>>> type is set and the lock has been dropped, so it's too late.
>>>>>>
>>>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>>>> granularity. With your changes below, it no longer works, because if there
>>>>>> is an unmovable page in
>>>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>>>> the allocation fails but it would succeed in current implementation.
>>>>>>
>>>>>> I think a proper fix would be to make move_freepages_block() split the
>>>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>>>
>>>>>> I am working on that.
>>>>>
>>>>> After spending half a day on this, I think it is much harder than I thought
>>>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>>>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>>>>
>>>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>>>> free list yet.
>>>>>
>>>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>>>> the free pages.
>>>>>
>>>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>>>> in-use pages. But it is not the case when alloc_contig_range() work on
>>>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>>>> will need to be split to get their subpages into the right free lists.
>>>>>
>>>>> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
>>>>> the code just isolate one pageblock, migrate the page, and let split_free_page()
>>>>> to correct the free list later. But to strictly enforce freelist migratetype
>>>>> hygiene, extra work is needed at free page path to split the free page into
>>>>> the right freelists.
>>>>>
>>>>> I need more time to think about how to get alloc_contig_range() properly.
>>>>> Help is needed for the bullet point 4.
>>>>
>>>>
>>>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
>>>
>>> I have that in my backlog since you asked and have been delaying it. ;) Hopefully
>>> I can do it after I fix this. That change might or might not help only if we make
>>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>>> overwrite existing migratetype, the code might not need to split a page and move
>>> it to MIGRATE_ISOLATE freelist?
>>>
>>> The fundamental issue in alloc_contig_range() is that to work at
>>> pageblock level, a page (>pageblock_order) can have one part is isolated and
>>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>>> now checks first pageblock migratetype, so such a page needs to be removed
>>> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
>>> finally put back to multiple free lists. This needs to be done at isolation stage
>>> before free pages are removed from their free lists (the stage after isolation).
>>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>>> in their original migratetype and check migratetype before allocating a page,
>>> that might help. But that might add extra work (e.g., splitting a partially
>>> isolated free page before allocation) in the really hot code path, which is not
>>> desirable.
>>>
>>>>
>>>> The whole alloc_contig_range() implementation is quite complicated and hard to grasp. If we could find ways to clean all that up and make it easier to understand and play along, that would be nice.
>>>
>>> I will try my best to simplify it.
>>
>> Hi Johannes,
>>
>> I attached three patches to fix the issue and first two can be folded into
>> your patchset:
>
> Hi Zi, thanks for providing these patches! I'll pick them up into the
> series.
>
>> 1. __free_one_page() bug you and Vlastimil discussed on the other email.
>> 2. move set_pageblock_migratetype() into move_freepages() to prepare for patch 3.
>> 3. enable move_freepages() to split a free page that is partially covered by
>>    [start_pfn, end_pfn] in the parameter and set migratetype correctly when
>>    a >pageblock_order free page is moved. Before when a >pageblock_order
>>    free page is moved, only first pageblock migratetype is changed. The added
>>    WARN_ON_ONCE might be triggered by these pages.
>>
>> I ran Mike's test with transhuge-stress together with my patches on top of your
>> "close migratetype race" patch for more than an hour without any warning.
>> It should unblock your patchset. I will keep working on alloc_contig_range()
>> simplification.
>>
>>
>> --
>> Best Regards,
>> Yan, Zi
>
>> From a18de9a235dc97999fcabdac699f33da9138b0ba Mon Sep 17 00:00:00 2001
>> From: Zi Yan <ziy@nvidia.com>
>> Date: Fri, 22 Sep 2023 11:11:32 -0400
>> Subject: [PATCH 1/3] mm: fix __free_one_page().
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  mm/page_alloc.c | 6 +-----
>>  1 file changed, 1 insertion(+), 5 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 7de022bc4c7d..72f27d14c8e7 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -787,8 +787,6 @@ static inline void __free_one_page(struct page *page,
>>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>>
>>  	while (order < MAX_ORDER) {
>> -		int buddy_mt;
>> -
>>  		if (compaction_capture(capc, page, order, migratetype))
>>  			return;
>>
>> @@ -796,8 +794,6 @@ static inline void __free_one_page(struct page *page,
>>  		if (!buddy)
>>  			goto done_merging;
>>
>> -		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
>> -
>>  		if (unlikely(order >= pageblock_order)) {
>>  			/*
>>  			 * We want to prevent merge between freepages on pageblock
>> @@ -827,7 +823,7 @@ static inline void __free_one_page(struct page *page,
>>  		if (page_is_guard(buddy))
>>  			clear_page_guard(zone, buddy, order);
>>  		else
>> -			del_page_from_free_list(buddy, zone, order, buddy_mt);
>> +			del_page_from_free_list(buddy, zone, order, migratetype);
>>  		combined_pfn = buddy_pfn & pfn;
>>  		page = page + (combined_pfn - pfn);
>>  		pfn = combined_pfn;
>
> I had a fix for this that's slightly different. The buddy's type can't
> be changed while it's still on the freelist, so I moved that
> around. The sequence now is:
>
> 	int buddy_mt = migratetype;
>
> 	if (unlikely(order >= pageblock_order)) {
> 		/* This is the only case where buddy_mt can differ */
> 		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
> 		// compat checks...
> 	}
>
> 	del_page_from_free_list(buddy, buddy_mt);
>
> 	if (unlikely(buddy_mt != migratetype))
> 		set_pageblock_migratetype(buddy, migratetype);
>
>
>> From b11a0e3d8f9d7d91a884c90dc9cebb185c3a2bbc Mon Sep 17 00:00:00 2001
>> From: Zi Yan <ziy@nvidia.com>
>> Date: Mon, 25 Sep 2023 16:27:14 -0400
>> Subject: [PATCH 2/3] mm: set migratetype after free pages are moved between
>>  free lists.
>>
>> This avoids changing migratetype after move_freepages() or
>> move_freepages_block(), which is error prone. It also prepares for upcoming
>> changes to fix move_freepages() not moving free pages partially in the
>> range.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> This makes the code much cleaner, thank you!
>
>> From 75a4d327efd94230f3b9aab29ef6ec0badd488a6 Mon Sep 17 00:00:00 2001
>> From: Zi Yan <ziy@nvidia.com>
>> Date: Mon, 25 Sep 2023 16:55:18 -0400
>> Subject: [PATCH 3/3] mm: enable move_freepages() to properly move part of free
>>  pages.
>>
>> alloc_contig_range() uses set_migrateype_isolate(), which eventually calls
>> move_freepages(), to isolate free pages. But move_freepages() was not able
>> to move free pages partially covered by the specified range, leaving a race
>> window open[1]. Fix it by teaching move_freepages() to split a free page
>> when only part of it is going to be moved.
>>
>> In addition, when a >pageblock_order free page is moved, only its first
>> pageblock migratetype is changed. It can cause warnings later. Fix it by
>> set all pageblocks in a free page to the same migratetype after move.
>>
>> split_free_page() is changed to be used in move_freepages() and
>> isolate_single_pageblock(). A common code to find the start pfn of a free
>> page is refactored in get_freepage_start_pfn().
>>
>> [1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>> ---
>>  mm/page_alloc.c     | 75 ++++++++++++++++++++++++++++++++++++---------
>>  mm/page_isolation.c | 17 +++++++---
>>  2 files changed, 73 insertions(+), 19 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 7c41cb5d8a36..3fd5ab40b55c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -866,15 +866,15 @@ int split_free_page(struct page *free_page,
>>  	struct zone *zone = page_zone(free_page);
>>  	unsigned long free_page_pfn = page_to_pfn(free_page);
>>  	unsigned long pfn;
>> -	unsigned long flags;
>>  	int free_page_order;
>>  	int mt;
>>  	int ret = 0;
>>
>> -	if (split_pfn_offset == 0)
>> -		return ret;
>> +	/* zone lock should be held when this function is called */
>> +	lockdep_assert_held(&zone->lock);
>>
>> -	spin_lock_irqsave(&zone->lock, flags);
>> +	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
>> +		return ret;
>>
>>  	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
>>  		ret = -ENOENT;
>> @@ -900,7 +900,6 @@ int split_free_page(struct page *free_page,
>>  			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
>>  	}
>>  out:
>> -	spin_unlock_irqrestore(&zone->lock, flags);
>>  	return ret;
>>  }
>>  /*
>> @@ -1589,6 +1588,25 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>>  					unsigned int order) { return NULL; }
>>  #endif
>>
>> +/*
>> + * Get first pfn of the free page, where pfn is in. If this free page does
>> + * not exist, return the given pfn.
>> + */
>> +static unsigned long get_freepage_start_pfn(unsigned long pfn)
>> +{
>> +	int order = 0;
>> +	unsigned long start_pfn = pfn;
>> +
>> +	while (!PageBuddy(pfn_to_page(start_pfn))) {
>> +		if (++order > MAX_ORDER) {
>> +			start_pfn = pfn;
>> +			break;
>> +		}
>> +		start_pfn &= ~0UL << order;
>> +	}
>> +	return start_pfn;
>> +}
>> +
>>  /*
>>   * Move the free pages in a range to the freelist tail of the requested type.
>>   * Note that start_page and end_pages are not aligned on a pageblock
>> @@ -1598,9 +1616,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>  			  unsigned long end_pfn, int old_mt, int new_mt)
>>  {
>>  	struct page *page;
>> -	unsigned long pfn;
>> +	unsigned long pfn, pfn2;
>>  	unsigned int order;
>>  	int pages_moved = 0;
>> +	unsigned long mt_change_pfn = start_pfn;
>> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
>> +
>> +	/* split at start_pfn if it is in the middle of a free page */
>> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
>> +		struct page *new_page = pfn_to_page(new_start_pfn);
>> +		int new_page_order = buddy_order(new_page);
>> +
>> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
>> +			/* change migratetype so that split_free_page can work */
>> +			set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>> +			split_free_page(new_page, buddy_order(new_page),
>> +					start_pfn - new_start_pfn);
>> +
>> +			mt_change_pfn = start_pfn;
>> +			/* move to next page */
>> +			start_pfn = new_start_pfn + (1 << new_page_order);
>> +		}
>> +	}
>
> Ok, so if there is a straddle from the previous block into our block
> of interest, it's split and the migratetype is set only on our block.

Correct. For example, start_pfn is 0x200 (2MB) and the free page starting from 0x0 is order-10 (4MB).
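
To make that walk concrete, here is a minimal userspace sketch of just the
masking step in get_freepage_start_pfn() for this example (the PageBuddy()
checks are omitted, and the 4KB page size / order-10 free page are taken
from the example above):

#include <stdio.h>

int main(void)
{
	unsigned long start_pfn = 0x200;	/* start_pfn from the example */
	int order;

	/*
	 * Candidate start pfn after each masking step; the real function
	 * stops as soon as PageBuddy() is set on the candidate page.
	 */
	for (order = 1; order <= 10; order++) {
		start_pfn &= ~0UL << order;
		printf("order %2d: candidate start pfn 0x%lx\n", order, start_pfn);
	}
	/* only at order 10 does the candidate drop to 0x0, the free page's head */
	return 0;
}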

>
>> @@ -1615,10 +1653,24 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>
>>  		order = buddy_order(page);
>>  		move_to_free_list(page, zone, order, old_mt, new_mt);
>> +		/*
>> +		 * set page migratetype for all pageblocks within the page and
>> +		 * only after we move all free pages in one pageblock
>> +		 */
>> +		if (pfn + (1 << order) >= pageblock_end_pfn(pfn)) {
>> +			for (pfn2 = pfn; pfn2 < pfn + (1 << order);
>> +			     pfn2 += pageblock_nr_pages) {
>> +				set_pageblock_migratetype(pfn_to_page(pfn2),
>> +							  new_mt);
>> +				mt_change_pfn = pfn2;
>> +			}
>
> But if we have the first block of a MAX_ORDER chunk, then we don't
> split but rather move the whole chunk and make sure to update the
> chunk's blocks that are outside the range of interest.
>
> It looks like either way would work, but why not split here as well
> and keep the move contained to the block? Wouldn't this be a bit more
> predictable and easier to understand?

Yes, having a split here would be consistent.

Also, I want to spell out the corner case I am handling here (and I will add
it to the comment): since move_to_free_list() checks the page's migratetype
against old_mt, and changing one page's migratetype affects all pages within
the same pageblock, moving more than one free page from the same pageblock
and setting the pageblock migratetype right after each move_to_free_list()
triggers the warning.
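
As a toy userspace illustration of that ordering issue (the pageblock is
reduced to one shared migratetype variable and the warning to a printf;
none of these helper names are the kernel's):

#include <stdio.h>

enum mt { MOVABLE, ISOLATE };

/* toy model: one pageblock, so one shared migratetype for all pages in it */
static enum mt block_mt = MOVABLE;

/* stands in for the old_mt check that move_to_free_list() now performs */
static void move_one_free_page(int page, enum mt old_mt)
{
	if (block_mt != old_mt)
		printf("WARN: page %d: block mt is %d, caller expected %d\n",
		       page, block_mt, old_mt);
}

int main(void)
{
	/* setting the pageblock type right after the first page is moved
	 * makes the second page in the same block trip the check */
	move_one_free_page(0, MOVABLE);
	block_mt = ISOLATE;			/* set too early */
	move_one_free_page(1, MOVABLE);		/* warns */

	/* deferring the type change until the whole block is moved avoids it */
	block_mt = MOVABLE;
	move_one_free_page(0, MOVABLE);
	move_one_free_page(1, MOVABLE);
	block_mt = ISOLATE;			/* no warning */
	return 0;
}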

>> +		}
>>  		pfn += 1 << order;
>>  		pages_moved += 1 << order;
>>  	}
>> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>> +	/* set migratetype for the remaining pageblocks */
>> +	for (pfn2 = mt_change_pfn; pfn2 <= end_pfn; pfn2 += pageblock_nr_pages)
>> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
>
> I think I'm missing something for this.
>
> - If there was no straddle, there is only our block of interest to
>   update.
>
> - If there was a straddle from the previous block, it was split and
>   the block of interest was already updated. Nothing to do here?
>
> - If there was a straddle into the next block, both blocks are updated
>   to the new type. Nothing to do here?
>
> What's the case where there are multiple blocks to update in the end?

When a pageblock has free pages at the beginning and in-use pages at the end.
The pageblock migratetype is not changed in the for loop above, since the free
pages do not cross the pageblock boundary. But those free pages have been moved
to the new_mt free list and would trigger warnings later.

Also, if multiple pageblocks contain only in-use pages, the for loop does
nothing for them either. Their pageblock migratetypes are set at this point
instead. I notice as I write this that it might be a change of behavior, but
the change might be for the better: before, a pageblock's migratetype might or
might not be changed, depending on whether it contained a free page, so there
could be migratetype holes in the specified range. Now the whole range is
changed to new_mt. Let me know if you have a different opinion.
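
As a toy illustration of that case (not kernel code; the pfns and the
free range are made up): if the pageblock is [0x200, 0x400) and only
[0x200, 0x300) is free, the in-loop set_pageblock_migratetype() never
fires, and only the trailing loop covers the block:

#include <stdio.h>

#define pageblock_nr_pages	0x200UL

int main(void)
{
	unsigned long start_pfn = 0x200, end_pfn = 0x3ff;
	/* the main loop never reached a pageblock boundary */
	unsigned long mt_change_pfn = start_pfn;
	unsigned long pfn;

	for (pfn = mt_change_pfn; pfn <= end_pfn; pfn += pageblock_nr_pages)
		printf("set_pageblock_migratetype(%#lx, new_mt)\n", pfn);
	return 0;
}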


>> @@ -380,8 +380,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>  			int order = buddy_order(page);
>>
>>  			if (pfn + (1UL << order) > boundary_pfn) {
>> +				int res;
>> +				unsigned long flags;
>> +
>> +				spin_lock_irqsave(&zone->lock, flags);
>> +				res = split_free_page(page, order, boundary_pfn - pfn);
>> +				spin_unlock_irqrestore(&zone->lock, flags);
>> +
>>  				/* free page changed before split, check it again */
>> -				if (split_free_page(page, order, boundary_pfn - pfn))
>> +				if (res)
>>  					continue;
>
> At this point, we've already set the migratetype, which has handled
> straddling free pages. Is this split still needed?

Good point. I will remove it. Originally, I thought it should stay to handle
a free page coming from the migration below. But unless an in-use page of
order greater than pageblock_order shows up in the system and is freed
directly via __free_pages(), any free page coming from the migration below
should already be put on the right free list.

Such > pageblock_order pages are only possible if we have >PMD-order THPs
or __PageMovable pages of that size. IIRC, neither exists yet.

>
>> @@ -426,9 +433,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>  				/*
>>  				 * XXX: mark the page as MIGRATE_ISOLATE so that
>>  				 * no one else can grab the freed page after migration.
>> -				 * Ideally, the page should be freed as two separate
>> -				 * pages to be added into separate migratetype free
>> -				 * lists.
>> +				 * The page should be freed into separate migratetype
>> +				 * free lists, unless the free page order is greater
>> +				 * than pageblock order. It is not the case now,
>> +				 * since gigantic hugetlb is freed as order-0
>> +				 * pages and LRU pages do not cross pageblocks.
>>  				 */
>>  				if (isolate_page) {
>>  					ret = set_migratetype_isolate(page, page_mt,
>
> I hadn't thought about LRU pages being constrained to single
> pageblocks before. Does this mean we only ever migrate here in case

Initially, I thought a lot about what happens if a high-order folio crosses
two adjacent pageblocks, but in the end I found that __find_buddy_pfn()
does not treat pfns from adjacent pageblocks as buddies unless the order is
greater than pageblock order. So no high-order folio coming from the buddy
allocator crosses pageblocks. That is a relief.
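
To double check the alignment argument, a standalone sketch (not kernel
code; pageblock_order == 9 is an assumption): pages from the buddy
allocator are naturally aligned to their order, so any folio of order
<= pageblock_order starts and ends in the same pageblock:

#include <stdio.h>

#define PAGEBLOCK_ORDER	9U

int main(void)
{
	unsigned int order;

	for (order = 0; order <= PAGEBLOCK_ORDER; order++) {
		/* an arbitrary naturally aligned order-'order' page */
		unsigned long first = 5UL << order;
		unsigned long last = first + (1UL << order) - 1;

		printf("order %2u: first pfn in pageblock %lu, last pfn in pageblock %lu\n",
		       order, first >> PAGEBLOCK_ORDER, last >> PAGEBLOCK_ORDER);
	}
	return 0;
}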

Another (future) possibility is that once anon large folios are merged and
my split-huge-page-to-any-lower-order patches are merged, a high-order folio
might come not directly from the buddy allocator but from a huge page split.
But that requires a > pageblock order folio to exist first, which is not
possible either. So we are good.

> there is a movable gigantic page? And since those are already split
> during the free, does that mean the "reset pfn to head of the free
> page" part after the migration is actually unnecessary?

Yes, the "reset pfn" code could be removed.

Thank you for the review. Really appreciate it. Let me revise my
patch 3 and send it out again.


--
Best Regards,
Yan, Zi
Zi Yan Sept. 28, 2023, 3:22 a.m. UTC | #29
On 26 Sep 2023, at 14:19, David Hildenbrand wrote:

> On 21.09.23 16:47, Zi Yan wrote:
>> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
>>
>>> On 21.09.23 04:31, Zi Yan wrote:
>>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>>
>>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>>
>>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>>
>>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>>
>>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>>    		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>>
>>>>>>>>>>>>    		/* Do not cross zone boundaries */
>>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>>    		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>>> 	+#else
>>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>>> 	+#endif
>>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>>> 	 		return false;
>>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>>
>>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Just to be really clear,
>>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>>     path WITHOUT your change.
>>>>>>>>>>
>>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>>
>>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>>
>>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>>
>>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>>> script.
>>>>>>>>>>>>
>>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>>
>>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>>> trying. Thanks.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>>
>>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>>
>>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>>
>>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>>
>>>>>>>>> while true; do
>>>>>>>>>    echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>>    echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>>    wait
>>>>>>>>>    echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>>    echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>>> done
>>>>>>>>>
>>>>>>>>> I will look into the issue.
>>>>>>>
>>>>>>> Nice!
>>>>>>>
>>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>>> several reboots and letting it run for minutes.
>>>>>>
>>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>>> parameters respectively in half.
>>>>>>
>>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>>> printk-tracing, the scenario seems to be this:
>>>>>>
>>>>>> #0                                                   #1
>>>>>> start_isolate_page_range()
>>>>>>     isolate_single_pageblock()
>>>>>>       set_migratetype_isolate(tail)
>>>>>>         lock zone->lock
>>>>>>         move_freepages_block(tail) // nop
>>>>>>         set_pageblock_migratetype(tail)
>>>>>>         unlock zone->lock
>>>>>>                                                        del_page_from_freelist(head)
>>>>>>                                                        expand(head, head_mt)
>>>>>>                                                          WARN(head_mt != tail_mt)
>>>>>>       start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>>       for (pfn = start_pfn, pfn < end_pfn)
>>>>>>         if (PageBuddy())
>>>>>>           split_free_page(head)
>>>>>>
>>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>>> type is set and the lock has been dropped, so it's too late.
>>>>>
>>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>>> granularity. With your changes below, it no longer works, because if there
>>>>> is an unmovable page in
>>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>>> the allocation fails but it would succeed in current implementation.
>>>>>
>>>>> I think a proper fix would be to make move_freepages_block() split the
>>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>>
>>>>> I am working on that.
>>>>
>>>> After spending half a day on this, I think it is much harder than I thought
>>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>>>
>>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>>> free list yet.
>>>>
>>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>>> the free pages.
>>>>
>>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>>> in-use pages. But it is not the case when alloc_contig_range() work on
>>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>>> will need to be split to get their subpages into the right free lists.
>>>>
>>>> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
>>>> the code just isolate one pageblock, migrate the page, and let split_free_page()
>>>> to correct the free list later. But to strictly enforce freelist migratetype
>>>> hygiene, extra work is needed at free page path to split the free page into
>>>> the right freelists.
>>>>
>>>> I need more time to think about how to get alloc_contig_range() properly.
>>>> Help is needed for the bullet point 4.
>>>
>>>
>>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
>>
>> I have that in my backlog since you asked and have been delaying it. ;) Hopefully
>
> It's complicated and I wish I would have had more time to review it
> back then ... or now to clean it up later.
>
> Unfortunately, nobody else did have the time to review it back then ... maybe we can
> do better next time. David doesn't scale.
>
> Doing page migration from inside start_isolate_page_range()->isolate_single_pageblock()
> really is sub-optimal (and mostly code duplication from alloc_contig_range).

I felt the same when I wrote the code. But I thought it was the only way out.

>
>> I can do it after I fix this. That change might or might not help only if we make
>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>> overwrite existing migratetype, the code might not need to split a page and move
>> it to MIGRATE_ISOLATE freelist?
>
> Did someone test how memory offlining plays along with that? (I can try myself
> within the next 1-2 weeks)
>
> There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges,
> though.
>
> ret = start_isolate_page_range(start_pfn, end_pfn,
> 			       MIGRATE_MOVABLE,
> 			       MEMORY_OFFLINE | REPORT_FAILURE,
> 			       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);

Since a full MAX_ORDER range is passed, no free page split will happen.

>
>>
>> The fundamental issue in alloc_contig_range() is that to work at
>> pageblock level, a page (>pageblock_order) can have one part is isolated and
>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>> now checks first pageblock migratetype, so such a page needs to be removed
>> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
>> finally put back to multiple free lists. This needs to be done at isolation stage
>> before free pages are removed from their free lists (the stage after isolation).
>
> One idea was to always isolate larger chunks, and handle movability checks/split/etc
> at a later stage. Once isolation would be decoupled from the actual/original migratetype,
> the could have been easier to handle (especially some corner cases I had in mind back then).

I think it is a good idea. When I coded alloc_contig_range() up, I tried to
accommodate the existing set_migratetype_isolate(), which calls
has_unmovable_pages(). If these two are decoupled, set_migratetype_isolate()
can work on MAX_ORDER-aligned ranges and has_unmovable_pages() can still work
on pageblock-aligned ranges. Let me give this a try.

>
>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>> in their original migratetype and check migratetype before allocating a page,
>> that might help. But that might add extra work (e.g., splitting a partially
>> isolated free page before allocation) in the really hot code path, which is not
>> desirable.
>
> With MIGRATE_ISOLATE being a separate flag, one idea was to have not a single
> separate isolate list, but one per "proper migratetype". But again, just some random
> thoughts I had back then, I never had sufficient time to think it all through.

Got it. I will think about it.

One question on separate MIGRATE_ISOLATE:

the implementation I have in mind is that MIGRATE_ISOLATE will need a
dedicated flag bit instead of being one of the migratetypes. But right now
there are 5 migratetypes + MIGRATE_ISOLATE and PB_migratetype_bits is 3, so
an extra migratetype_bit is needed. And the current migratetype
implementation is a word-based operation, requiring NR_PAGEBLOCK_BITS to be
a divisor of BITS_PER_LONG. This means NR_PAGEBLOCK_BITS would need to grow
from 4 to 8 to meet the requirement, wasting a lot of space. An alternative
is to have a separate array for MIGRATE_ISOLATE, which requires additional
changes. Let me know if you have a better idea. Thanks.
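
To spell out the packing constraint, a standalone sketch of just the
index math (not the kernel's get/set_pfnblock_flags_mask(); 64-bit longs
assumed): with 4 bits per pageblock every field stays inside one word,
with 5 bits some fields would cross word boundaries, and 8 is the next
size that avoids that:

#include <stdio.h>

#define BITS_PER_LONG	64UL

static void check(unsigned long bits_per_block)
{
	unsigned long block, straddles = 0;

	for (block = 0; block < 64; block++) {
		unsigned long first = block * bits_per_block;
		unsigned long last = first + bits_per_block - 1;

		if (first / BITS_PER_LONG != last / BITS_PER_LONG)
			straddles++;
	}
	printf("%lu bits per block: %lu of 64 blocks straddle a word\n",
	       bits_per_block, straddles);
}

int main(void)
{
	check(4);	/* current NR_PAGEBLOCK_BITS */
	check(5);	/* one extra bit for a separate isolate flag */
	check(8);	/* next size that divides BITS_PER_LONG */
	return 0;
}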



--
Best Regards,
Yan, Zi
David Hildenbrand Oct. 2, 2023, 11:43 a.m. UTC | #30
>>> I can do it after I fix this. That change might or might not help only if we make
>>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>>> overwrite existing migratetype, the code might not need to split a page and move
>>> it to MIGRATE_ISOLATE freelist?
>>
>> Did someone test how memory offlining plays along with that? (I can try myself
>> within the next 1-2 weeks)
>>
>> There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges,
>> though.
>>
>> ret = start_isolate_page_range(start_pfn, end_pfn,
>> 			       MIGRATE_MOVABLE,
>> 			       MEMORY_OFFLINE | REPORT_FAILURE,
>> 			       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
> 
> Since a full MAX_ORDER range is passed, no free page split will happen.

Okay, thanks for verifying that it should not be affected!

> 
>>
>>>
>>> The fundamental issue in alloc_contig_range() is that to work at
>>> pageblock level, a page (>pageblock_order) can have one part is isolated and
>>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>>> now checks first pageblock migratetype, so such a page needs to be removed
>>> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
>>> finally put back to multiple free lists. This needs to be done at isolation stage
>>> before free pages are removed from their free lists (the stage after isolation).
>>
>> One idea was to always isolate larger chunks, and handle movability checks/split/etc
>> at a later stage. Once isolation would be decoupled from the actual/original migratetype,
>> the could have been easier to handle (especially some corner cases I had in mind back then).
> 
> I think it is a good idea. When I coded alloc_contig_range() up, I tried to
> accommodate existing set_migratetype_isolate(), which calls has_unmovable_pages().
> If these two are decoupled, set_migrateype_isolate() can work on MAX_ORDER-aligned
> ranges and has_unmovable_pages() can still work on pageblock-aligned ranges.
> Let me give this a try.
> 

But again, it was just some thought I had back then; maybe it doesn't help 
with anything, and I never found more time to look into the whole thing in 
more detail.

>>
>>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>>> in their original migratetype and check migratetype before allocating a page,
>>> that might help. But that might add extra work (e.g., splitting a partially
>>> isolated free page before allocation) in the really hot code path, which is not
>>> desirable.
>>
>> With MIGRATE_ISOLATE being a separate flag, one idea was to have not a single
>> separate isolate list, but one per "proper migratetype". But again, just some random
>> thoughts I had back then, I never had sufficient time to think it all through.
> 
> Got it. I will think about it.
> 
> One question on separate MIGRATE_ISOLATE:
> 
> the implementation I have in mind is that MIGRATE_ISOLATE will need a dedicated flag
> bit instead of being one of migratetype. But now there are 5 migratetypes +

Exactly what I was concerned about back then ...

> MIGRATE_ISOLATE and PB_migratetype_bits is 3, so an extra migratetype_bit is needed.
> But current migratetype implementation is a word-based operation, requiring
> NR_PAGEBLOCK_BITS to be divisor of BITS_PER_LONG. This means NR_PAGEBLOCK_BITS
> needs to be increased from 4 to 8 to meet the requirement, wasting a lot of space.

... until I did the math. Let's assume a pageblock is 2 MiB.

4/(2* 1024 * 1024 * 8) = 0,00002384185791016 %

8/(2* 1024 * 1024 * 8) -> 1 / (2* 1024 * 1024) = 0,00004768371582031 %

For a 1 TiB machine that means 256 KiB vs. 512 KiB

I concluded that "wasting a lot of space" is not really the right word 
to describe that :)

Just to put it into perspective, the memmap (64/4096) for a 1 TiB 
machine is ... 16 GiB.
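
FWIW, the numbers above can be recomputed with a trivial standalone
snippet (assuming 2 MiB pageblocks and a 64-byte struct page per 4 KiB
page):

#include <stdio.h>

int main(void)
{
	unsigned long long mem = 1ULL << 40;		/* 1 TiB */
	unsigned long long blocks = mem / (2ULL << 20);	/* 2 MiB pageblocks */

	printf("4 bits per block: %llu KiB\n", blocks * 4 / 8 >> 10);
	printf("8 bits per block: %llu KiB\n", blocks * 8 / 8 >> 10);
	printf("memmap: %llu GiB\n", mem / 4096 * 64 >> 30);
	return 0;
}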

> An alternative is to have a separate array for MIGRATE_ISOLATE, which requires
> additional changes. Let me know if you have a better idea. Thanks.

It would probably be cleanest to just use one byte per pageblock. That 
would clean up the whole machinery eventually as well.
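
Just to sketch what that could look like (purely illustrative userspace
code; the names and the flag layout are made up, nothing like this
exists today):

/*
 * Illustrative only: one byte per pageblock, holding the migratetype
 * plus a hypothetical isolate flag, so isolation no longer overwrites
 * the underlying type.
 */
#include <stdio.h>

#define PAGEBLOCK_ORDER		9
#define PB_ISOLATE_FLAG		0x80	/* hypothetical */
#define PB_MIGRATETYPE_MASK	0x07

static unsigned char pb_flags[16];	/* one byte per pageblock */

static int get_pb_migratetype(unsigned long pfn)
{
	return pb_flags[pfn >> PAGEBLOCK_ORDER] & PB_MIGRATETYPE_MASK;
}

static void set_pb_isolate(unsigned long pfn, int isolate)
{
	unsigned char *b = &pb_flags[pfn >> PAGEBLOCK_ORDER];

	if (isolate)
		*b |= PB_ISOLATE_FLAG;
	else
		*b &= ~PB_ISOLATE_FLAG;
}

int main(void)
{
	pb_flags[1] = 1;			/* say, MIGRATE_MOVABLE */
	set_pb_isolate(0x200, 1);
	printf("pfn 0x200: mt %d, isolated %d\n",
	       get_pb_migratetype(0x200),
	       !!(pb_flags[1] & PB_ISOLATE_FLAG));
	return 0;
}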
Zi Yan Oct. 3, 2023, 2:26 a.m. UTC | #31
On 27 Sep 2023, at 22:51, Zi Yan wrote:

> On 26 Sep 2023, at 13:39, Johannes Weiner wrote:
>
>> On Mon, Sep 25, 2023 at 05:12:38PM -0400, Zi Yan wrote:
>>> On 21 Sep 2023, at 10:47, Zi Yan wrote:
>>>
>>>> On 21 Sep 2023, at 6:19, David Hildenbrand wrote:
>>>>
>>>>> On 21.09.23 04:31, Zi Yan wrote:
>>>>>> On 20 Sep 2023, at 13:23, Zi Yan wrote:
>>>>>>
>>>>>>> On 20 Sep 2023, at 12:04, Johannes Weiner wrote:
>>>>>>>
>>>>>>>> On Wed, Sep 20, 2023 at 09:48:12AM -0400, Johannes Weiner wrote:
>>>>>>>>> On Wed, Sep 20, 2023 at 08:07:53AM +0200, Vlastimil Babka wrote:
>>>>>>>>>> On 9/20/23 03:38, Zi Yan wrote:
>>>>>>>>>>> On 19 Sep 2023, at 20:32, Mike Kravetz wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 09/19/23 16:57, Zi Yan wrote:
>>>>>>>>>>>>> On 19 Sep 2023, at 14:47, Mike Kravetz wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 	--- a/mm/page_alloc.c
>>>>>>>>>>>>>> 	+++ b/mm/page_alloc.c
>>>>>>>>>>>>>> 	@@ -1651,8 +1651,13 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>>>>>>>>>>>>>>   		end = pageblock_end_pfn(pfn) - 1;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>   		/* Do not cross zone boundaries */
>>>>>>>>>>>>>> 	+#if 0
>>>>>>>>>>>>>>   		if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>>>> 			start = zone->zone_start_pfn;
>>>>>>>>>>>>>> 	+#else
>>>>>>>>>>>>>> 	+	if (!zone_spans_pfn(zone, start))
>>>>>>>>>>>>>> 	+		start = pfn;
>>>>>>>>>>>>>> 	+#endif
>>>>>>>>>>>>>> 	 	if (!zone_spans_pfn(zone, end))
>>>>>>>>>>>>>> 	 		return false;
>>>>>>>>>>>>>> 	I can still trigger warnings.
>>>>>>>>>>>>>
>>>>>>>>>>>>> OK. One thing to note is that the page type in the warning changed from
>>>>>>>>>>>>> 5 (MIGRATE_ISOLATE) to 0 (MIGRATE_UNMOVABLE) with my suggested change.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Just to be really clear,
>>>>>>>>>>>> - the 5 (MIGRATE_ISOLATE) warning was from the __alloc_pages call path.
>>>>>>>>>>>> - the 0 (MIGRATE_UNMOVABLE) as above was from the alloc_contig_range call
>>>>>>>>>>>>    path WITHOUT your change.
>>>>>>>>>>>>
>>>>>>>>>>>> I am guessing the difference here has more to do with the allocation path?
>>>>>>>>>>>>
>>>>>>>>>>>> I went back and reran focusing on the specific migrate type.
>>>>>>>>>>>> Without your patch, and coming from the alloc_contig_range call path,
>>>>>>>>>>>> I got two warnings of 'page type is 0, passed migratetype is 1' as above.
>>>>>>>>>>>> With your patch I got one 'page type is 0, passed migratetype is 1'
>>>>>>>>>>>> warning and one 'page type is 1, passed migratetype is 0' warning.
>>>>>>>>>>>>
>>>>>>>>>>>> I could be wrong, but I do not think your patch changes things.
>>>>>>>>>>>
>>>>>>>>>>> Got it. Thanks for the clarification.
>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One idea about recreating the issue is that it may have to do with size
>>>>>>>>>>>>>> of my VM (16G) and the requested allocation sizes 4G.  However, I tried
>>>>>>>>>>>>>> to really stress the allocations by increasing the number of hugetlb
>>>>>>>>>>>>>> pages requested and that did not help.  I also noticed that I only seem
>>>>>>>>>>>>>> to get two warnings and then they stop, even if I continue to run the
>>>>>>>>>>>>>> script.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Zi asked about my config, so it is attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With your config, I still have no luck reproducing the issue. I will keep
>>>>>>>>>>>>> trying. Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps try running both scripts in parallel?
>>>>>>>>>>>
>>>>>>>>>>> Yes. It seems to do the trick.
>>>>>>>>>>>
>>>>>>>>>>>> Adjust the number of hugetlb pages allocated to equal 25% of memory?
>>>>>>>>>>>
>>>>>>>>>>> I am able to reproduce it with the script below:
>>>>>>>>>>>
>>>>>>>>>>> while true; do
>>>>>>>>>>>   echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages&
>>>>>>>>>>>   echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages&
>>>>>>>>>>>   wait
>>>>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>>>>>>>>>>>   echo 0 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>>>>>>>>>>> done
>>>>>>>>>>>
>>>>>>>>>>> I will look into the issue.
>>>>>>>>>
>>>>>>>>> Nice!
>>>>>>>>>
>>>>>>>>> I managed to reproduce it ONCE, triggering it not even a second after
>>>>>>>>> starting the script. But I can't seem to do it twice, even after
>>>>>>>>> several reboots and letting it run for minutes.
>>>>>>>>
>>>>>>>> I managed to reproduce it reliably by cutting the nr_hugepages
>>>>>>>> parameters respectively in half.
>>>>>>>>
>>>>>>>> The one that triggers for me is always MIGRATE_ISOLATE. With some
>>>>>>>> printk-tracing, the scenario seems to be this:
>>>>>>>>
>>>>>>>> #0                                                   #1
>>>>>>>> start_isolate_page_range()
>>>>>>>>    isolate_single_pageblock()
>>>>>>>>      set_migratetype_isolate(tail)
>>>>>>>>        lock zone->lock
>>>>>>>>        move_freepages_block(tail) // nop
>>>>>>>>        set_pageblock_migratetype(tail)
>>>>>>>>        unlock zone->lock
>>>>>>>>                                                       del_page_from_freelist(head)
>>>>>>>>                                                       expand(head, head_mt)
>>>>>>>>                                                         WARN(head_mt != tail_mt)
>>>>>>>>      start_pfn = ALIGN_DOWN(MAX_ORDER_NR_PAGES)
>>>>>>>>      for (pfn = start_pfn, pfn < end_pfn)
>>>>>>>>        if (PageBuddy())
>>>>>>>>          split_free_page(head)
>>>>>>>>
>>>>>>>> IOW, we update a pageblock that isn't MAX_ORDER aligned, then drop the
>>>>>>>> lock. The move_freepages_block() does nothing because the PageBuddy()
>>>>>>>> is set on the pageblock to the left. Once we drop the lock, the buddy
>>>>>>>> gets allocated and the expand() puts things on the wrong list. The
>>>>>>>> splitting code that handles MAX_ORDER blocks runs *after* the tail
>>>>>>>> type is set and the lock has been dropped, so it's too late.
>>>>>>>
>>>>>>> Yes, this is the issue I can confirm as well. But it is intentional to enable
>>>>>>> allocating a contiguous range at pageblock granularity instead of MAX_ORDER
>>>>>>> granularity. With your changes below, it no longer works, because if there
>>>>>>> is an unmovable page in
>>>>>>> [ALIGN_DOWN(start_pfn, MAX_ORDER_NR_PAGES), pageblock_start_pfn(start_pfn)),
>>>>>>> the allocation fails but it would succeed in current implementation.
>>>>>>>
>>>>>>> I think a proper fix would be to make move_freepages_block() split the
>>>>>>> MAX_ORDER page and put the split pages in the right migratetype free lists.
>>>>>>>
>>>>>>> I am working on that.
>>>>>>
>>>>>> After spending half a day on this, I think it is much harder than I thought
>>>>>> to get alloc_contig_range() working with the freelist migratetype hygiene
>>>>>> patchset. Because alloc_contig_range() relies on racy migratetype changes:
>>>>>>
>>>>>> 1. pageblocks in the range are first marked as MIGRATE_ISOLATE to prevent
>>>>>> another parallel isolation, but they are not moved to the MIGRATE_ISOLATE
>>>>>> free list yet.
>>>>>>
>>>>>> 2. later in the process, isolate_freepages_range() is used to actually grab
>>>>>> the free pages.
>>>>>>
>>>>>> 3. there was no problem when alloc_contig_range() works on MAX_ORDER aligned
>>>>>> ranges, since MIGRATE_ISOLATE cannot be set in the middle of free pages or
>>>>>> in-use pages. But it is not the case when alloc_contig_range() work on
>>>>>> pageblock aligned ranges. Now during isolation phase, free or in-use pages
>>>>>> will need to be split to get their subpages into the right free lists.
>>>>>>
>>>>>> 4. the hardest case is when a in-use page sits across two pageblocks, currently,
>>>>>> the code just isolate one pageblock, migrate the page, and let split_free_page()
>>>>>> to correct the free list later. But to strictly enforce freelist migratetype
>>>>>> hygiene, extra work is needed at free page path to split the free page into
>>>>>> the right freelists.
>>>>>>
>>>>>> I need more time to think about how to get alloc_contig_range() properly.
>>>>>> Help is needed for the bullet point 4.
>>>>>
>>>>>
>>>>> I once raised that we should maybe try making MIGRATE_ISOLATE a flag that preserves the original migratetype. Not sure if that would help here in any way.
>>>>
>>>> I have that in my backlog since you asked and have been delaying it. ;) Hopefully
>>>> I can do it after I fix this. That change might or might not help only if we make
>>>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>>>> overwrite existing migratetype, the code might not need to split a page and move
>>>> it to MIGRATE_ISOLATE freelist?
>>>>
>>>> The fundamental issue in alloc_contig_range() is that to work at
>>>> pageblock level, a page (>pageblock_order) can have one part is isolated and
>>>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>>>> now checks first pageblock migratetype, so such a page needs to be removed
>>>> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
>>>> finally put back to multiple free lists. This needs to be done at isolation stage
>>>> before free pages are removed from their free lists (the stage after isolation).
>>>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>>>> in their original migratetype and check migratetype before allocating a page,
>>>> that might help. But that might add extra work (e.g., splitting a partially
>>>> isolated free page before allocation) in the really hot code path, which is not
>>>> desirable.
>>>>
>>>>>
>>>>> The whole alloc_contig_range() implementation is quite complicated and hard to grasp. If we could find ways to clean all that up and make it easier to understand and play along, that would be nice.
>>>>
>>>> I will try my best to simplify it.
>>>
>>> Hi Johannes,
>>>
>>> I attached three patches to fix the issue and first two can be folded into
>>> your patchset:
>>
>> Hi Zi, thanks for providing these patches! I'll pick them up into the
>> series.
>>
>>> 1. __free_one_page() bug you and Vlastimil discussed on the other email.
>>> 2. move set_pageblock_migratetype() into move_freepages() to prepare for patch 3.
>>> 3. enable move_freepages() to split a free page that is partially covered by
>>>    [start_pfn, end_pfn] in the parameter and set migratetype correctly when
>>>    a >pageblock_order free page is moved. Before when a >pageblock_order
>>>    free page is moved, only first pageblock migratetype is changed. The added
>>>    WARN_ON_ONCE might be triggered by these pages.
>>>
>>> I ran Mike's test with transhuge-stress together with my patches on top of your
>>> "close migratetype race" patch for more than an hour without any warning.
>>> It should unblock your patchset. I will keep working on alloc_contig_range()
>>> simplification.
>>>
>>>
>>> --
>>> Best Regards,
>>> Yan, Zi
>>
>>> From a18de9a235dc97999fcabdac699f33da9138b0ba Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <ziy@nvidia.com>
>>> Date: Fri, 22 Sep 2023 11:11:32 -0400
>>> Subject: [PATCH 1/3] mm: fix __free_one_page().
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>  mm/page_alloc.c | 6 +-----
>>>  1 file changed, 1 insertion(+), 5 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 7de022bc4c7d..72f27d14c8e7 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -787,8 +787,6 @@ static inline void __free_one_page(struct page *page,
>>>  	VM_BUG_ON_PAGE(bad_range(zone, page), page);
>>>
>>>  	while (order < MAX_ORDER) {
>>> -		int buddy_mt;
>>> -
>>>  		if (compaction_capture(capc, page, order, migratetype))
>>>  			return;
>>>
>>> @@ -796,8 +794,6 @@ static inline void __free_one_page(struct page *page,
>>>  		if (!buddy)
>>>  			goto done_merging;
>>>
>>> -		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
>>> -
>>>  		if (unlikely(order >= pageblock_order)) {
>>>  			/*
>>>  			 * We want to prevent merge between freepages on pageblock
>>> @@ -827,7 +823,7 @@ static inline void __free_one_page(struct page *page,
>>>  		if (page_is_guard(buddy))
>>>  			clear_page_guard(zone, buddy, order);
>>>  		else
>>> -			del_page_from_free_list(buddy, zone, order, buddy_mt);
>>> +			del_page_from_free_list(buddy, zone, order, migratetype);
>>>  		combined_pfn = buddy_pfn & pfn;
>>>  		page = page + (combined_pfn - pfn);
>>>  		pfn = combined_pfn;
>>
>> I had a fix for this that's slightly different. The buddy's type can't
>> be changed while it's still on the freelist, so I moved that
>> around. The sequence now is:
>>
>> 	int buddy_mt = migratetype;
>>
>> 	if (unlikely(order >= pageblock_order)) {
>> 		/* This is the only case where buddy_mt can differ */
>> 		buddy_mt = get_pfnblock_migratetype(buddy, buddy_pfn);
>> 		// compat checks...
>> 	}
>>
>> 	del_page_from_free_list(buddy, buddy_mt);
>>
>> 	if (unlikely(buddy_mt != migratetype))
>> 		set_pageblock_migratetype(buddy, migratetype);
>>
>>
>>> From b11a0e3d8f9d7d91a884c90dc9cebb185c3a2bbc Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <ziy@nvidia.com>
>>> Date: Mon, 25 Sep 2023 16:27:14 -0400
>>> Subject: [PATCH 2/3] mm: set migratetype after free pages are moved between
>>>  free lists.
>>>
>>> This avoids changing migratetype after move_freepages() or
>>> move_freepages_block(), which is error prone. It also prepares for upcoming
>>> changes to fix move_freepages() not moving free pages partially in the
>>> range.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>
>> This makes the code much cleaner, thank you!
>>
>>> From 75a4d327efd94230f3b9aab29ef6ec0badd488a6 Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <ziy@nvidia.com>
>>> Date: Mon, 25 Sep 2023 16:55:18 -0400
>>> Subject: [PATCH 3/3] mm: enable move_freepages() to properly move part of free
>>>  pages.
>>>
>>> alloc_contig_range() uses set_migrateype_isolate(), which eventually calls
>>> move_freepages(), to isolate free pages. But move_freepages() was not able
>>> to move free pages partially covered by the specified range, leaving a race
>>> window open[1]. Fix it by teaching move_freepages() to split a free page
>>> when only part of it is going to be moved.
>>>
>>> In addition, when a >pageblock_order free page is moved, only its first
>>> pageblock migratetype is changed. It can cause warnings later. Fix it by
>>> set all pageblocks in a free page to the same migratetype after move.
>>>
>>> split_free_page() is changed to be used in move_freepages() and
>>> isolate_single_pageblock(). A common code to find the start pfn of a free
>>> page is refactored in get_freepage_start_pfn().
>>>
>>> [1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>  mm/page_alloc.c     | 75 ++++++++++++++++++++++++++++++++++++---------
>>>  mm/page_isolation.c | 17 +++++++---
>>>  2 files changed, 73 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 7c41cb5d8a36..3fd5ab40b55c 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -866,15 +866,15 @@ int split_free_page(struct page *free_page,
>>>  	struct zone *zone = page_zone(free_page);
>>>  	unsigned long free_page_pfn = page_to_pfn(free_page);
>>>  	unsigned long pfn;
>>> -	unsigned long flags;
>>>  	int free_page_order;
>>>  	int mt;
>>>  	int ret = 0;
>>>
>>> -	if (split_pfn_offset == 0)
>>> -		return ret;
>>> +	/* zone lock should be held when this function is called */
>>> +	lockdep_assert_held(&zone->lock);
>>>
>>> -	spin_lock_irqsave(&zone->lock, flags);
>>> +	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
>>> +		return ret;
>>>
>>>  	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
>>>  		ret = -ENOENT;
>>> @@ -900,7 +900,6 @@ int split_free_page(struct page *free_page,
>>>  			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
>>>  	}
>>>  out:
>>> -	spin_unlock_irqrestore(&zone->lock, flags);
>>>  	return ret;
>>>  }
>>>  /*
>>> @@ -1589,6 +1588,25 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>>>  					unsigned int order) { return NULL; }
>>>  #endif
>>>
>>> +/*
>>> + * Get first pfn of the free page, where pfn is in. If this free page does
>>> + * not exist, return the given pfn.
>>> + */
>>> +static unsigned long get_freepage_start_pfn(unsigned long pfn)
>>> +{
>>> +	int order = 0;
>>> +	unsigned long start_pfn = pfn;
>>> +
>>> +	while (!PageBuddy(pfn_to_page(start_pfn))) {
>>> +		if (++order > MAX_ORDER) {
>>> +			start_pfn = pfn;
>>> +			break;
>>> +		}
>>> +		start_pfn &= ~0UL << order;
>>> +	}
>>> +	return start_pfn;
>>> +}
>>> +
>>>  /*
>>>   * Move the free pages in a range to the freelist tail of the requested type.
>>>   * Note that start_page and end_pages are not aligned on a pageblock
>>> @@ -1598,9 +1616,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>  			  unsigned long end_pfn, int old_mt, int new_mt)
>>>  {
>>>  	struct page *page;
>>> -	unsigned long pfn;
>>> +	unsigned long pfn, pfn2;
>>>  	unsigned int order;
>>>  	int pages_moved = 0;
>>> +	unsigned long mt_change_pfn = start_pfn;
>>> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
>>> +
>>> +	/* split at start_pfn if it is in the middle of a free page */
>>> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
>>> +		struct page *new_page = pfn_to_page(new_start_pfn);
>>> +		int new_page_order = buddy_order(new_page);
>>> +
>>> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
>>> +			/* change migratetype so that split_free_page can work */
>>> +			set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>>> +			split_free_page(new_page, buddy_order(new_page),
>>> +					start_pfn - new_start_pfn);
>>> +
>>> +			mt_change_pfn = start_pfn;
>>> +			/* move to next page */
>>> +			start_pfn = new_start_pfn + (1 << new_page_order);
>>> +		}
>>> +	}
>>
>> Ok, so if there is a straddle from the previous block into our block
>> of interest, it's split and the migratetype is set only on our block.
>
> Correct. For example, start_pfn is 0x200 (2MB) and the free page starting from 0x0 is order-10 (4MB).
>
>>
>>> @@ -1615,10 +1653,24 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>
>>>  		order = buddy_order(page);
>>>  		move_to_free_list(page, zone, order, old_mt, new_mt);
>>> +		/*
>>> +		 * set page migratetype for all pageblocks within the page and
>>> +		 * only after we move all free pages in one pageblock
>>> +		 */
>>> +		if (pfn + (1 << order) >= pageblock_end_pfn(pfn)) {
>>> +			for (pfn2 = pfn; pfn2 < pfn + (1 << order);
>>> +			     pfn2 += pageblock_nr_pages) {
>>> +				set_pageblock_migratetype(pfn_to_page(pfn2),
>>> +							  new_mt);
>>> +				mt_change_pfn = pfn2;
>>> +			}
>>
>> But if we have the first block of a MAX_ORDER chunk, then we don't
>> split but rather move the whole chunk and make sure to update the
>> chunk's blocks that are outside the range of interest.
>>
>> It looks like either way would work, but why not split here as well
>> and keep the move contained to the block? Wouldn't this be a bit more
>> predictable and easier to understand?
>
> Yes, having a split here would be consistent.
>
> Also I want to spell out the corner case I am handling here (and I will add
> it to the comment): since move_to_free_list() checks page's migratetype
> with old_mt and changing one page' migratetype affects all pages within
> the same pageblock, if we are moving more than one free pages that are
> in the same pageblock, setting migratetype right after move_to_free_list()
> triggers the warning.
>
>>> +		}
>>>  		pfn += 1 << order;
>>>  		pages_moved += 1 << order;
>>>  	}
>>> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>>> +	/* set migratetype for the remaining pageblocks */
>>> +	for (pfn2 = mt_change_pfn; pfn2 <= end_pfn; pfn2 += pageblock_nr_pages)
>>> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
>>
>> I think I'm missing something for this.
>>
>> - If there was no straddle, there is only our block of interest to
>>   update.
>>
>> - If there was a straddle from the previous block, it was split and
>>   the block of interest was already updated. Nothing to do here?
>>
>> - If there was a straddle into the next block, both blocks are updated
>>   to the new type. Nothing to do here?
>>
>> What's the case where there are multiple blocks to update in the end?
>
> When a pageblock has free pages at the beginning and in-use pages at the end.
> The pageblock migratetype is not changed in the for loop above, since free
> pages do not cross pageblock boundary. But these free pages are moved
> to a new mt free list and will trigger warnings later.
>
> Also if multiple pageblocks are filled with only in-use pages, the for loop
> does nothing either. Their pageblocks will be set at this moment. I notice
> it might be a change of behavior as I am writing, but this change might
> be better. Before, in-page migrateype might or might not be changed,
> depending on if there is a free page in the same pageblock or not, meaning
> there will be migratetype holes in the specified range. Now the whole range
> is changed to new_mt. Let me know if you have a different opinion.
>
>
>>> @@ -380,8 +380,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>  			int order = buddy_order(page);
>>>
>>>  			if (pfn + (1UL << order) > boundary_pfn) {
>>> +				int res;
>>> +				unsigned long flags;
>>> +
>>> +				spin_lock_irqsave(&zone->lock, flags);
>>> +				res = split_free_page(page, order, boundary_pfn - pfn);
>>> +				spin_unlock_irqrestore(&zone->lock, flags);
>>> +
>>>  				/* free page changed before split, check it again */
>>> -				if (split_free_page(page, order, boundary_pfn - pfn))
>>> +				if (res)
>>>  					continue;
>>
>> At this point, we've already set the migratetype, which has handled
>> straddling free pages. Is this split still needed?
>
> Good point. I will remove it. Originally, I thought it should stay to handle
> the free page coming from the migration below. But unless a greater than pageblock
> order in-use page shows up in the system and it is freed directly via __free_pages(),
> any free page coming from the migration below should be put in the right
> free list.
>
> Such > pageblock order pages are possible, only if we have >PMD order THPs
> or __PageMovable. IIRC, both do not exist yet.
>
>>
>>> @@ -426,9 +433,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>>>  				/*
>>>  				 * XXX: mark the page as MIGRATE_ISOLATE so that
>>>  				 * no one else can grab the freed page after migration.
>>> -				 * Ideally, the page should be freed as two separate
>>> -				 * pages to be added into separate migratetype free
>>> -				 * lists.
>>> +				 * The page should be freed into separate migratetype
>>> +				 * free lists, unless the free page order is greater
>>> +				 * than pageblock order. It is not the case now,
>>> +				 * since gigantic hugetlb is freed as order-0
>>> +				 * pages and LRU pages do not cross pageblocks.
>>>  				 */
>>>  				if (isolate_page) {
>>>  					ret = set_migratetype_isolate(page, page_mt,
>>
>> I hadn't thought about LRU pages being constrained to single
>> pageblocks before. Does this mean we only ever migrate here in case
>
> Initially, I thought a lot about what if a high order folio crosses
> two adjacent pageblocks, but at the end I find that __find_buddy_pfn()
> does not treat pfns from adjacent pageblocks as buddy unless order
> is greater than pageblock order. So any high order folio from
> buddy allocator does not cross pageblocks. That is a relief.
>
> Another (future) possibility is once anon large folio is merged and
> my split huge page to any lower order patches are merged, a high order
> folio might not come directly from buddy allocator but from a huge page
> split. But that requires a > pageblock order folio exist first, which
> is not possible either. So we are good.
>
>> there is a movable gigantic page? And since those are already split
>> during the free, does that mean the "reset pfn to head of the free
>> page" part after the migration is actually unnecessary?
>
> Yes. the "reset pfn" code could be removed.
>
> Thank you for the review. Really appreciate it. Let me revise my
> patch 3 and send it out again.

It turns out that there was a bug in my patch 2: set_pageblock_migratetype()
is needed in the isolated_page case too, and thus cannot be removed
unconditionally.

I attached my revised patches 2 and 3 (with all the suggestions above).


--
Best Regards,
Yan, Zi
From 1c8f99cff5f469ee89adc33e9c9499254cad13f2 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 25 Sep 2023 16:27:14 -0400
Subject: [PATCH v2 1/2] mm: set migratetype after free pages are moved between
 free lists.

This avoids changing migratetype after move_freepages() or
move_freepages_block(), which is error prone. It also prepares for upcoming
changes to fix move_freepages() not moving free pages partially in the
range.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/page_alloc.c     | 10 +++-------
 mm/page_isolation.c |  7 +++----
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d839311d7c6e..928bb595d7cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1617,6 +1617,7 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 		pfn += 1 << order;
 		pages_moved += 1 << order;
 	}
+	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
 
 	return pages_moved;
 }
@@ -1838,7 +1839,6 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 	if (free_pages + alike_pages >= (1 << (pageblock_order-1)) ||
 			page_group_by_mobility_disabled) {
 		move_freepages(zone, start_pfn, end_pfn, block_type, start_type);
-		set_pageblock_migratetype(page, start_type);
 		block_type = start_type;
 	}
 
@@ -1910,7 +1910,6 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone)
 	if (migratetype_is_mergeable(mt)) {
 		if (move_freepages_block(zone, page,
 					 mt, MIGRATE_HIGHATOMIC) != -1) {
-			set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
 			zone->nr_reserved_highatomic += pageblock_nr_pages;
 		}
 	}
@@ -1995,7 +1994,6 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			 * not fail on zone boundaries.
 			 */
 			WARN_ON_ONCE(ret == -1);
-			set_pageblock_migratetype(page, ac->migratetype);
 			if (ret > 0) {
 				spin_unlock_irqrestore(&zone->lock, flags);
 				return ret;
@@ -2607,10 +2605,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 			 * Only change normal pageblocks (i.e., they can merge
 			 * with others)
 			 */
-			if (migratetype_is_mergeable(mt) &&
-			    move_freepages_block(zone, page, mt,
-						 MIGRATE_MOVABLE) != -1)
-				set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+			if (migratetype_is_mergeable(mt))
+			    move_freepages_block(zone, page, mt, MIGRATE_MOVABLE);
 		}
 	}
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b5c7a9d21257..5f8c658c0853 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -187,7 +187,6 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 			spin_unlock_irqrestore(&zone->lock, flags);
 			return -EBUSY;
 		}
-		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 		zone->nr_isolate_pageblock++;
 		spin_unlock_irqrestore(&zone->lock, flags);
 		return 0;
@@ -261,10 +260,10 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
 		 * should not fail on zone boundaries.
 		 */
 		WARN_ON_ONCE(nr_pages == -1);
-	}
-	set_pageblock_migratetype(page, migratetype);
-	if (isolated_page)
+	} else {
+		set_pageblock_migratetype(page, migratetype);
 		__putback_isolated_page(page, order, migratetype);
+	}
 	zone->nr_isolate_pageblock--;
 out:
 	spin_unlock_irqrestore(&zone->lock, flags);
Zi Yan Oct. 3, 2023, 2:35 a.m. UTC | #32
On 2 Oct 2023, at 7:43, David Hildenbrand wrote:

>>>> I can do it after I fix this. That change might or might not help only if we make
>>>> some redesign on how migratetype is managed. If MIGRATE_ISOLATE does not
>>>> overwrite existing migratetype, the code might not need to split a page and move
>>>> it to MIGRATE_ISOLATE freelist?
>>>
>>> Did someone test how memory offlining plays along with that? (I can try myself
>>> within the next 1-2 weeks)
>>>
>>> There [mm/memory_hotplug.c:offline_pages] we always cover full MAX_ORDER ranges,
>>> though.
>>>
>>> ret = start_isolate_page_range(start_pfn, end_pfn,
>>> 			       MIGRATE_MOVABLE,
>>> 			       MEMORY_OFFLINE | REPORT_FAILURE,
>>> 			       GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL);
>>
>> Since a full MAX_ORDER range is passed, no free page split will happen.
>
> Okay, thanks for verifying that it should not be affected!
>
>>
>>>
>>>>
>>>> The fundamental issue in alloc_contig_range() is that to work at
>>>> pageblock level, a page (>pageblock_order) can have one part is isolated and
>>>> the rest is a different migratetype. {add_to,move_to,del_page_from}_free_list()
>>>> now checks first pageblock migratetype, so such a page needs to be removed
>>>> from its free_list, set MIGRATE_ISOLATE on one of the pageblock, split, and
>>>> finally put back to multiple free lists. This needs to be done at isolation stage
>>>> before free pages are removed from their free lists (the stage after isolation).
>>>
>>> One idea was to always isolate larger chunks, and handle movability checks/split/etc
>>> at a later stage. Once isolation would be decoupled from the actual/original migratetype,
>>> the could have been easier to handle (especially some corner cases I had in mind back then).
>>
>> I think it is a good idea. When I coded alloc_contig_range() up, I tried to
>> accommodate existing set_migratetype_isolate(), which calls has_unmovable_pages().
>> If these two are decoupled, set_migrateype_isolate() can work on MAX_ORDER-aligned
>> ranges and has_unmovable_pages() can still work on pageblock-aligned ranges.
>> Let me give this a try.
>>
>
> But again, just some thought I had back then, maybe it doesn't help for anything; I found more time to look into the whole thing in more detail.

Sure. The devil is in the details, but I will only know the details and what works
after I code it up. :)

>>>
>>>> If MIGRATE_ISOLATE is a separate flag and we are OK with leaving isolated pages
>>>> in their original migratetype and check migratetype before allocating a page,
>>>> that might help. But that might add extra work (e.g., splitting a partially
>>>> isolated free page before allocation) in the really hot code path, which is not
>>>> desirable.
>>>
>>> With MIGRATE_ISOLATE being a separate flag, one idea was to have not a single
>>> separate isolate list, but one per "proper migratetype". But again, just some random
>>> thoughts I had back then, I never had sufficient time to think it all through.
>>
>> Got it. I will think about it.
>>
>> One question on separate MIGRATE_ISOLATE:
>>
>> the implementation I have in mind is that MIGRATE_ISOLATE will need a dedicated flag
>> bit instead of being one of migratetype. But now there are 5 migratetypes +
>
> Exactly what I was concerned about back then ...
>
>> MIGRATE_ISOLATE and PB_migratetype_bits is 3, so an extra migratetype_bit is needed.
>> But current migratetype implementation is a word-based operation, requiring
>> NR_PAGEBLOCK_BITS to be divisor of BITS_PER_LONG. This means NR_PAGEBLOCK_BITS
>> needs to be increased from 4 to 8 to meet the requirement, wasting a lot of space.
>
> ... until I did the math. Let's assume a pageblock is 2 MiB.
>
> 4/(2* 1024 * 1024 * 8) = 0,00002384185791016 %
>
> 8/(2* 1024 * 1024 * 8) -> 1 / (2* 1024 * 1024) = 0,00004768371582031 %
>
> For a 1 TiB machine that means 256 KiB vs. 512 KiB
>
> I concluded that "wasting a lot of space" is not really the right word to describe that :)
>
> Just to put it into perspective, the memmap (64/4096) for a 1 TiB machine is ... 16 GiB.

You are right. I should have done the math. The absolute increase is not much.

>> An alternative is to have a separate array for MIGRATE_ISOLATE, which requires
>> additional changes. Let me know if you have a better idea. Thanks.
>
> It would probably be cleanest to just use one byte per pageblock. That would cleanup the whole machinery eventually as well.

Let me give this a try and see if it cleans things up.


--
Best Regards,
Yan, Zi
Johannes Weiner Oct. 10, 2023, 9:12 p.m. UTC | #33
Hello!

On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
> On 27 Sep 2023, at 22:51, Zi Yan wrote:
> I attached my revised patch 2 and 3 (with all the suggestions above).

Thanks! It took me a bit to read through them. It's a tricky codebase!

Some comments below.

> From 1c8f99cff5f469ee89adc33e9c9499254cad13f2 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Mon, 25 Sep 2023 16:27:14 -0400
> Subject: [PATCH v2 1/2] mm: set migratetype after free pages are moved between
>  free lists.
> 
> This avoids changing migratetype after move_freepages() or
> move_freepages_block(), which is error prone. It also prepares for upcoming
> changes to fix move_freepages() not moving free pages partially in the
> range.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>

This is great and indeed makes the callsites much simpler. Thanks,
I'll fold this into the series.

> @@ -1597,9 +1615,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>  			  unsigned long end_pfn, int old_mt, int new_mt)
>  {
>  	struct page *page;
> -	unsigned long pfn;
> +	unsigned long pfn, pfn2;
>  	unsigned int order;
>  	int pages_moved = 0;
> +	unsigned long mt_changed_pfn = start_pfn - pageblock_nr_pages;
> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
> +
> +	/* split at start_pfn if it is in the middle of a free page */
> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
> +		struct page *new_page = pfn_to_page(new_start_pfn);
> +		int new_page_order = buddy_order(new_page);

get_freepage_start_pfn() returns start_pfn if it didn't find a large
buddy, so the buddy check shouldn't be necessary, right?

> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {

This *should* be implied according to the comments on
get_freepage_start_pfn(), but it currently isn't. Doing so would help
here, and seemingly also in alloc_contig_range().

How about this version of get_freepage_start_pfn()?

/*
 * Scan the range before this pfn for a buddy that straddles it
 */
static unsigned long find_straddling_buddy(unsigned long start_pfn)
{
	int order = 0;
	struct page *page;
	unsigned long pfn = start_pfn;

	while (!PageBuddy(page = pfn_to_page(pfn))) {
		/* Nothing found */
		if (++order > MAX_ORDER)
			return start_pfn;
		pfn &= ~0UL << order;
	}

	/*
	 * Found a preceding buddy, but does it straddle?
	 */
	if (pfn + (1 << buddy_order(page)) > start_pfn)
		return pfn;

	/* Nothing found */
	return start_pfn;
}

> @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>  
>  		order = buddy_order(page);
>  		move_to_free_list(page, zone, order, old_mt, new_mt);
> +		/*
> +		 * set page migratetype 1) only after we move all free pages in
> +		 * one pageblock and 2) for all pageblocks within the page.
> +		 *
> +		 * for 1), since move_to_free_list() checks page migratetype with
> +		 * old_mt and changing one page migratetype affects all pages
> +		 * within the same pageblock, if we are moving more than
> +		 * one free pages in the same pageblock, setting migratetype
> +		 * right after first move_to_free_list() triggers the warning
> +		 * in the following move_to_free_list().
> +		 *
> +		 * for 2), when a free page order is greater than pageblock_order,
> +		 * all pageblocks within the free page need to be changed after
> +		 * move_to_free_list().

I think this can be somewhat simplified.

There are two assumptions we can make. Buddies always consist of 2^n
pages. And buddies and pageblocks are naturally aligned. This means
that if this pageblock has the start of a buddy that straddles into
the next pageblock(s), it must be the first page in the block. That in
turn means we can move the handling before the loop.

If we split first, it also makes the loop a little simpler because we
know that any buddies that start inside this block cannot extend
beyond it (due to the alignment). The loop how it was originally
written can remain untouched.

> +		 */
> +		if (pfn + (1 << order) > pageblock_end_pfn(pfn)) {
> +			for (pfn2 = pfn;
> +			     pfn2 < min_t(unsigned long,
> +					  pfn + (1 << order),
> +					  end_pfn + 1);
> +			     pfn2 += pageblock_nr_pages) {
> +				set_pageblock_migratetype(pfn_to_page(pfn2),
> +							  new_mt);
> +				mt_changed_pfn = pfn2;

Hm, this seems to assume that start_pfn to end_pfn can be more than
one block. Why is that? This function is only used on single blocks.

> +			}
> +			/* split the free page if it goes beyond the specified range */
> +			if (pfn + (1 << order) > (end_pfn + 1))
> +				split_free_page(page, order, end_pfn + 1 - pfn);
> +		}
>  		pfn += 1 << order;
>  		pages_moved += 1 << order;
>  	}
> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
> +	/* set migratetype for the remaining pageblocks */
> +	for (pfn2 = mt_changed_pfn + pageblock_nr_pages;
> +	     pfn2 <= end_pfn;
> +	     pfn2 += pageblock_nr_pages)
> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);

If I rework the code on the above, I'm arriving at the following:

static int move_freepages(struct zone *zone, unsigned long start_pfn,
			  unsigned long end_pfn, int old_mt, int new_mt)
{
	struct page *start_page = pfn_to_page(start_pfn);
	int pages_moved = 0;
	unsigned long pfn;

	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);

	/*
	 * A free page may be comprised of 2^n blocks, which means our
	 * block of interest could be head or tail in such a page.
	 *
	 * If we're a tail, update the type of our block, then split
	 * the page into pageblocks. The splitting will do the leg
	 * work of sorting the blocks into the right freelists.
	 *
	 * If we're a head, split the page into pageblocks first. This
	 * ensures the migratetypes still match up during the freelist
	 * removal. Then do the regular scan for buddies in the block
	 * of interest, which will handle the rest.
	 *
	 * In theory, we could try to preserve 2^1 and larger blocks
	 * that lie outside our range. In practice, MAX_ORDER is
	 * usually one or two pageblocks anyway, so don't bother.
	 *
	 * Note that this only applies to page isolation, which calls
	 * this on random blocks in the pfn range! When we move stuff
	 * from inside the page allocator, the pages are coming off
	 * the freelist (can't be tail) and multi-block pages are
	 * handled directly in the stealing code (can't be a head).
	 */

	/* We're a tail */
	pfn = find_straddling_buddy(start_pfn);
	if (pfn != start_pfn) {
		struct page *free_page = pfn_to_page(pfn);

		set_pageblock_migratetype(start_page, new_mt);
		split_free_page(free_page, buddy_order(free_page),
				pageblock_nr_pages);
		return pageblock_nr_pages;
	}

	/* We're a head */
	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order)
		split_free_page(start_page, buddy_order(start_page),
				pageblock_nr_pages);

	/* Move buddies within the block */
	while (pfn <= end_pfn) {
		struct page *page = pfn_to_page(pfn);
		int order, nr_pages;

		if (!PageBuddy(page)) {
			pfn++;
			continue;
		}

		/* Make sure we are not inadvertently changing nodes */
		VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
		VM_BUG_ON_PAGE(page_zone(page) != zone, page);

		order = buddy_order(page);
		nr_pages = 1 << order;

		move_to_free_list(page, zone, order, old_mt, new_mt);

		pfn += nr_pages;
		pages_moved += nr_pages;
	}

	set_pageblock_migratetype(start_page, new_mt);

	return pages_moved;
}

Does this look reasonable to you?

Note that the page isolation specific stuff comes first. If this code
holds up, we should be able to move it to page-isolation.c and keep it
out of the regular allocator path.

Thanks!
Johannes Weiner Oct. 11, 2023, 3:25 p.m. UTC | #34
On Tue, Oct 10, 2023 at 05:12:01PM -0400, Johannes Weiner wrote:
> On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
> > @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
> >  
> >  		order = buddy_order(page);
> >  		move_to_free_list(page, zone, order, old_mt, new_mt);
> > +		/*
> > +		 * set page migratetype 1) only after we move all free pages in
> > +		 * one pageblock and 2) for all pageblocks within the page.
> > +		 *
> > +		 * for 1), since move_to_free_list() checks page migratetype with
> > +		 * old_mt and changing one page migratetype affects all pages
> > +		 * within the same pageblock, if we are moving more than
> > +		 * one free pages in the same pageblock, setting migratetype
> > +		 * right after first move_to_free_list() triggers the warning
> > +		 * in the following move_to_free_list().
> > +		 *
> > +		 * for 2), when a free page order is greater than pageblock_order,
> > +		 * all pageblocks within the free page need to be changed after
> > +		 * move_to_free_list().
> 
> I think this can be somewhat simplified.
> 
> There are two assumptions we can make. Buddies always consist of 2^n
> pages. And buddies and pageblocks are naturally aligned. This means
> that if this pageblock has the start of a buddy that straddles into
> the next pageblock(s), it must be the first page in the block. That in
> turn means we can move the handling before the loop.

Eh, scratch that. Obviously, a sub-block buddy can straddle blocks :(

So forget about my version of move_freepages(). Only consider the
changes to find_straddling_buddy() and my question about multiple
blocks inside the requested range.

But I do have another question about your patch then. Say you have an
order-1 buddy that straddles into the block:

+       /* split at start_pfn if it is in the middle of a free page */
+       if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
+               struct page *new_page = pfn_to_page(new_start_pfn);
+               int new_page_order = buddy_order(new_page);
+
+               if (new_start_pfn + (1 << new_page_order) > start_pfn) {
+                       /* change migratetype so that split_free_page can work */
+                       set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+                       split_free_page(new_page, buddy_order(new_page),
+                                       start_pfn - new_start_pfn);
+
+                       mt_changed_pfn = start_pfn;
+                       /* move to next page */
+                       start_pfn = new_start_pfn + (1 << new_page_order);
+               }
+       }

this will have changed the type of the block to new_mt.

But then the buddy scan will do this:

                move_to_free_list(page, zone, order, old_mt, new_mt);
+               /*
+                * set page migratetype 1) only after we move all free pages in
+                * one pageblock and 2) for all pageblocks within the page.
+                *
+                * for 1), since move_to_free_list() checks page migratetype with
+                * old_mt and changing one page migratetype affects all pages
+                * within the same pageblock, if we are moving more than
+                * one free pages in the same pageblock, setting migratetype
+                * right after first move_to_free_list() triggers the warning
+                * in the following move_to_free_list().
+                *
+                * for 2), when a free page order is greater than pageblock_order,
+                * all pageblocks within the free page need to be changed after
+                * move_to_free_list().

That move_to_free_list() will complain that the pages no longer match
old_mt, no?
Johannes Weiner Oct. 11, 2023, 3:45 p.m. UTC | #35
On Wed, Oct 11, 2023 at 11:25:27AM -0400, Johannes Weiner wrote:
> On Tue, Oct 10, 2023 at 05:12:01PM -0400, Johannes Weiner wrote:
> > On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
> > > @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
> > >  
> > >  		order = buddy_order(page);
> > >  		move_to_free_list(page, zone, order, old_mt, new_mt);
> > > +		/*
> > > +		 * set page migratetype 1) only after we move all free pages in
> > > +		 * one pageblock and 2) for all pageblocks within the page.
> > > +		 *
> > > +		 * for 1), since move_to_free_list() checks page migratetype with
> > > +		 * old_mt and changing one page migratetype affects all pages
> > > +		 * within the same pageblock, if we are moving more than
> > > +		 * one free pages in the same pageblock, setting migratetype
> > > +		 * right after first move_to_free_list() triggers the warning
> > > +		 * in the following move_to_free_list().
> > > +		 *
> > > +		 * for 2), when a free page order is greater than pageblock_order,
> > > +		 * all pageblocks within the free page need to be changed after
> > > +		 * move_to_free_list().
> > 
> > I think this can be somewhat simplified.
> > 
> > There are two assumptions we can make. Buddies always consist of 2^n
> > pages. And buddies and pageblocks are naturally aligned. This means
> > that if this pageblock has the start of a buddy that straddles into
> > the next pageblock(s), it must be the first page in the block. That in
> > turn means we can move the handling before the loop.
> 
> Eh, scratch that. Obviously, a sub-block buddy can straddle blocks :(

I apologize for the back and forth, but I think I had it right the
first time. Say we have order-0 frees at pfn 511 and 512. Those can't
merge because their order-0 buddies are 510 and 513 respectively. The
same keeps higher-order merges below block size within the pageblock.
So again, due to the pow2 alignment, the only way for a buddy to
straddle a pageblock boundary is if it's >pageblock_order.
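
For illustration, a minimal userspace sketch of the buddy arithmetic
behind this argument (the XOR mirrors the kernel's __find_buddy_pfn();
the snippet itself is not from this thread):

#include <stdio.h>

/* buddy pfns differ only in the order bit */
static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
	return pfn ^ (1UL << order);
}

int main(void)
{
	/* the order-0 buddies of 511 and 512 are 510 and 513, so these
	 * two frees can never merge across the 512 alignment boundary */
	printf("buddy of 511 at order 0: %lu\n", buddy_pfn(511, 0));
	printf("buddy of 512 at order 0: %lu\n", buddy_pfn(512, 0));
	return 0;
}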

Please double check me on this, because I've stared at your patches
and the allocator code long enough now to thoroughly confuse myself.

My proposal for the follow-up changes still stands for now.
Zi Yan Oct. 11, 2023, 3:57 p.m. UTC | #36
On 11 Oct 2023, at 11:45, Johannes Weiner wrote:

> On Wed, Oct 11, 2023 at 11:25:27AM -0400, Johannes Weiner wrote:
>> On Tue, Oct 10, 2023 at 05:12:01PM -0400, Johannes Weiner wrote:
>>> On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
>>>> @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>>
>>>>  		order = buddy_order(page);
>>>>  		move_to_free_list(page, zone, order, old_mt, new_mt);
>>>> +		/*
>>>> +		 * set page migratetype 1) only after we move all free pages in
>>>> +		 * one pageblock and 2) for all pageblocks within the page.
>>>> +		 *
>>>> +		 * for 1), since move_to_free_list() checks page migratetype with
>>>> +		 * old_mt and changing one page migratetype affects all pages
>>>> +		 * within the same pageblock, if we are moving more than
>>>> +		 * one free pages in the same pageblock, setting migratetype
>>>> +		 * right after first move_to_free_list() triggers the warning
>>>> +		 * in the following move_to_free_list().
>>>> +		 *
>>>> +		 * for 2), when a free page order is greater than pageblock_order,
>>>> +		 * all pageblocks within the free page need to be changed after
>>>> +		 * move_to_free_list().
>>>
>>> I think this can be somewhat simplified.
>>>
>>> There are two assumptions we can make. Buddies always consist of 2^n
>>> pages. And buddies and pageblocks are naturally aligned. This means
>>> that if this pageblock has the start of a buddy that straddles into
>>> the next pageblock(s), it must be the first page in the block. That in
>>> turn means we can move the handling before the loop.
>>
>> Eh, scratch that. Obviously, a sub-block buddy can straddle blocks :(
>
> I apologize for the back and forth, but I think I had it right the
> first time. Say we have order-0 frees at pfn 511 and 512. Those can't
> merge because their order-0 buddies are 510 and 513 respectively. The
> same keeps higher-order merges below block size within the pageblock.
> So again, due to the pow2 alignment, the only way for a buddy to
> straddle a pageblock boundary is if it's >pageblock_order.
>
> Please double check me on this, because I've stared at your patches
> and the allocator code long enough now to thoroughly confuse myself.
>
> My proposal for the follow-up changes still stands for now.

Sure. I admit that the current alloc_contig_range() code is too complicated,
and I am going to refactor it.

find_straddling_buddy() looks good to me. You will need this change in
alloc_contig_range() to replace get_freepage_start_pfn():

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a86025f5e80a..e8ed25c94863 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6209,7 +6209,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
                       unsigned migratetype, gfp_t gfp_mask)
 {
        unsigned long outer_start, outer_end;
-       int order;
        int ret = 0;

        struct compact_control cc = {
@@ -6283,21 +6282,13 @@ int alloc_contig_range(unsigned long start, unsigned long end,
         * isolated thus they won't get removed from buddy.
         */

-       order = 0;
-       outer_start = get_freepage_start_pfn(start);
-
-       if (outer_start != start) {
-               order = buddy_order(pfn_to_page(outer_start));
-
-               /*
-                * outer_start page could be small order buddy page and
-                * it doesn't include start page. Adjust outer_start
-                * in this case to report failed page properly
-                * on tracepoint in test_pages_isolated()
-                */
-               if (outer_start + (1UL << order) <= start)
-                       outer_start = start;
-       }
+       /*
+        * outer_start page could be small order buddy page and it doesn't
+        * include start page. outer_start is set to start in
+        * find_straddling_buddy() to report failed page properly on tracepoint
+        * in test_pages_isolated()
+        */
+       outer_start = find_straddling_buddy(start);

        /* Make sure the range is really isolated. */
        if (test_pages_isolated(outer_start, end, 0)) {

Let me go through your move_freepages() in details and get back to you.

Thank you for the feedback!

--
Best Regards,
Yan, Zi
Zi Yan Oct. 13, 2023, 12:06 a.m. UTC | #37
On 10 Oct 2023, at 17:12, Johannes Weiner wrote:

> Hello!
>
> On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
>> On 27 Sep 2023, at 22:51, Zi Yan wrote:
>> I attached my revised patch 2 and 3 (with all the suggestions above).
>
> Thanks! It took me a bit to read through them. It's a tricky codebase!
>
> Some comments below.
>
>> From 1c8f99cff5f469ee89adc33e9c9499254cad13f2 Mon Sep 17 00:00:00 2001
>> From: Zi Yan <ziy@nvidia.com>
>> Date: Mon, 25 Sep 2023 16:27:14 -0400
>> Subject: [PATCH v2 1/2] mm: set migratetype after free pages are moved between
>>  free lists.
>>
>> This avoids changing migratetype after move_freepages() or
>> move_freepages_block(), which is error prone. It also prepares for upcoming
>> changes to fix move_freepages() not moving free pages partially in the
>> range.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> This is great and indeed makes the callsites much simpler. Thanks,
> I'll fold this into the series.
>
>> @@ -1597,9 +1615,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>  			  unsigned long end_pfn, int old_mt, int new_mt)
>>  {
>>  	struct page *page;
>> -	unsigned long pfn;
>> +	unsigned long pfn, pfn2;
>>  	unsigned int order;
>>  	int pages_moved = 0;
>> +	unsigned long mt_changed_pfn = start_pfn - pageblock_nr_pages;
>> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
>> +
>> +	/* split at start_pfn if it is in the middle of a free page */
>> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
>> +		struct page *new_page = pfn_to_page(new_start_pfn);
>> +		int new_page_order = buddy_order(new_page);
>
> get_freepage_start_pfn() returns start_pfn if it didn't find a large
> buddy, so the buddy check shouldn't be necessary, right?
>
>> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
>
> This *should* be implied according to the comments on
> get_freepage_start_pfn(), but it currently isn't. Doing so would help
> here, and seemingly also in alloc_contig_range().
>
> How about this version of get_freepage_start_pfn()?
>
> /*
>  * Scan the range before this pfn for a buddy that straddles it
>  */
> static unsigned long find_straddling_buddy(unsigned long start_pfn)
> {
> 	int order = 0;
> 	struct page *page;
> 	unsigned long pfn = start_pfn;
>
> 	while (!PageBuddy(page = pfn_to_page(pfn))) {
> 		/* Nothing found */
> 		if (++order > MAX_ORDER)
> 			return start_pfn;
> 		pfn &= ~0UL << order;
> 	}
>
> 	/*
> 	 * Found a preceding buddy, but does it straddle?
> 	 */
> 	if (pfn + (1 << buddy_order(page)) > start_pfn)
> 		return pfn;
>
> 	/* Nothing found */
> 	return start_pfn;
> }
>
>> @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>
>>  		order = buddy_order(page);
>>  		move_to_free_list(page, zone, order, old_mt, new_mt);
>> +		/*
>> +		 * set page migratetype 1) only after we move all free pages in
>> +		 * one pageblock and 2) for all pageblocks within the page.
>> +		 *
>> +		 * for 1), since move_to_free_list() checks page migratetype with
>> +		 * old_mt and changing one page migratetype affects all pages
>> +		 * within the same pageblock, if we are moving more than
>> +		 * one free pages in the same pageblock, setting migratetype
>> +		 * right after first move_to_free_list() triggers the warning
>> +		 * in the following move_to_free_list().
>> +		 *
>> +		 * for 2), when a free page order is greater than pageblock_order,
>> +		 * all pageblocks within the free page need to be changed after
>> +		 * move_to_free_list().
>
> I think this can be somewhat simplified.
>
> There are two assumptions we can make. Buddies always consist of 2^n
> pages. And buddies and pageblocks are naturally aligned. This means
> that if this pageblock has the start of a buddy that straddles into
> the next pageblock(s), it must be the first page in the block. That in
> turn means we can move the handling before the loop.
>
> If we split first, it also makes the loop a little simpler because we
> know that any buddies that start inside this block cannot extend
> beyond it (due to the alignment). The loop as it was originally
> written can remain untouched.
>
>> +		 */
>> +		if (pfn + (1 << order) > pageblock_end_pfn(pfn)) {
>> +			for (pfn2 = pfn;
>> +			     pfn2 < min_t(unsigned long,
>> +					  pfn + (1 << order),
>> +					  end_pfn + 1);
>> +			     pfn2 += pageblock_nr_pages) {
>> +				set_pageblock_migratetype(pfn_to_page(pfn2),
>> +							  new_mt);
>> +				mt_changed_pfn = pfn2;
>
> Hm, this seems to assume that start_pfn to end_pfn can be more than
> one block. Why is that? This function is only used on single blocks.

You are right. I made unnecessary assumptions when I wrote the code.

>
>> +			}
>> +			/* split the free page if it goes beyond the specified range */
>> +			if (pfn + (1 << order) > (end_pfn + 1))
>> +				split_free_page(page, order, end_pfn + 1 - pfn);
>> +		}
>>  		pfn += 1 << order;
>>  		pages_moved += 1 << order;
>>  	}
>> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>> +	/* set migratetype for the remaining pageblocks */
>> +	for (pfn2 = mt_changed_pfn + pageblock_nr_pages;
>> +	     pfn2 <= end_pfn;
>> +	     pfn2 += pageblock_nr_pages)
>> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
>
> If I rework the code on the above, I'm arriving at the following:
>
> static int move_freepages(struct zone *zone, unsigned long start_pfn,
> 			  unsigned long end_pfn, int old_mt, int new_mt)
> {
> 	struct page *start_page = pfn_to_page(start_pfn);
> 	int pages_moved = 0;
> 	unsigned long pfn;
>
> 	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
> 	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
>
> 	/*
> 	 * A free page may be comprised of 2^n blocks, which means our
> 	 * block of interest could be head or tail in such a page.
> 	 *
> 	 * If we're a tail, update the type of our block, then split
> 	 * the page into pageblocks. The splitting will do the leg
> 	 * work of sorting the blocks into the right freelists.
> 	 *
> 	 * If we're a head, split the page into pageblocks first. This
> 	 * ensures the migratetypes still match up during the freelist
> 	 * removal. Then do the regular scan for buddies in the block
> 	 * of interest, which will handle the rest.
> 	 *
> 	 * In theory, we could try to preserve 2^1 and larger blocks
> 	 * that lie outside our range. In practice, MAX_ORDER is
> 	 * usually one or two pageblocks anyway, so don't bother.
> 	 *
> 	 * Note that this only applies to page isolation, which calls
> 	 * this on random blocks in the pfn range! When we move stuff
> 	 * from inside the page allocator, the pages are coming off
> 	 * the freelist (can't be tail) and multi-block pages are
> 	 * handled directly in the stealing code (can't be a head).
> 	 */
>
> 	/* We're a tail */
> 	pfn = find_straddling_buddy(start_pfn);
> 	if (pfn != start_pfn) {
> 		struct page *free_page = pfn_to_page(pfn);
>
> 		set_pageblock_migratetype(start_page, new_mt);
> 		split_free_page(free_page, buddy_order(free_page),
> 				pageblock_nr_pages);
> 		return pageblock_nr_pages;
> 	}
>
> 	/* We're a head */
> 	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order)
> 		split_free_page(start_page, buddy_order(start_page),
> 				pageblock_nr_pages);

This actually can be:

/* We're a head */
if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
        set_pageblock_migratetype(start_page, new_mt);
        split_free_page(start_page, buddy_order(start_page),
                        pageblock_nr_pages);
        return pageblock_nr_pages;
}


>
> 	/* Move buddies within the block */
> 	while (pfn <= end_pfn) {
> 		struct page *page = pfn_to_page(pfn);
> 		int order, nr_pages;
>
> 		if (!PageBuddy(page)) {
> 			pfn++;
> 			continue;
> 		}
>
> 		/* Make sure we are not inadvertently changing nodes */
> 		VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
> 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>
> 		order = buddy_order(page);
> 		nr_pages = 1 << order;
>
> 		move_to_free_list(page, zone, order, old_mt, new_mt);
>
> 		pfn += nr_pages;
> 		pages_moved += nr_pages;
> 	}
>
> 	set_pageblock_migratetype(start_page, new_mt);
>
> 	return pages_moved;
> }
>
> Does this look reasonable to you?

Looks good to me. Thanks.

>
> Note that the page isolation specific stuff comes first. If this code
> holds up, we should be able to move it to page-isolation.c and keep it
> out of the regular allocator path.

You mean move the tail and head handling to set_migratetype_isolate()?
And split move_freepages_block() into prep_move_freepages_block(),
the tail and head code, and move_freepages()? It should work, and it
follows a code pattern similar to steal_suitable_fallback().


--
Best Regards,
Yan, Zi
Zi Yan Oct. 13, 2023, 2:51 p.m. UTC | #38
On 12 Oct 2023, at 20:06, Zi Yan wrote:

> On 10 Oct 2023, at 17:12, Johannes Weiner wrote:
>
>> Hello!
>>
>> On Mon, Oct 02, 2023 at 10:26:44PM -0400, Zi Yan wrote:
>>> On 27 Sep 2023, at 22:51, Zi Yan wrote:
>>> I attached my revised patch 2 and 3 (with all the suggestions above).
>>
>> Thanks! It took me a bit to read through them. It's a tricky codebase!
>>
>> Some comments below.
>>
>>> From 1c8f99cff5f469ee89adc33e9c9499254cad13f2 Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <ziy@nvidia.com>
>>> Date: Mon, 25 Sep 2023 16:27:14 -0400
>>> Subject: [PATCH v2 1/2] mm: set migratetype after free pages are moved between
>>>  free lists.
>>>
>>> This avoids changing migratetype after move_freepages() or
>>> move_freepages_block(), which is error prone. It also prepares for upcoming
>>> changes to fix move_freepages() not moving free pages partially in the
>>> range.
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>
>> This is great and indeed makes the callsites much simpler. Thanks,
>> I'll fold this into the series.
>>
>>> @@ -1597,9 +1615,29 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>  			  unsigned long end_pfn, int old_mt, int new_mt)
>>>  {
>>>  	struct page *page;
>>> -	unsigned long pfn;
>>> +	unsigned long pfn, pfn2;
>>>  	unsigned int order;
>>>  	int pages_moved = 0;
>>> +	unsigned long mt_changed_pfn = start_pfn - pageblock_nr_pages;
>>> +	unsigned long new_start_pfn = get_freepage_start_pfn(start_pfn);
>>> +
>>> +	/* split at start_pfn if it is in the middle of a free page */
>>> +	if (new_start_pfn != start_pfn && PageBuddy(pfn_to_page(new_start_pfn))) {
>>> +		struct page *new_page = pfn_to_page(new_start_pfn);
>>> +		int new_page_order = buddy_order(new_page);
>>
>> get_freepage_start_pfn() returns start_pfn if it didn't find a large
>> buddy, so the buddy check shouldn't be necessary, right?
>>
>>> +		if (new_start_pfn + (1 << new_page_order) > start_pfn) {
>>
>> This *should* be implied according to the comments on
>> get_freepage_start_pfn(), but it currently isn't. Doing so would help
>> here, and seemingly also in alloc_contig_range().
>>
>> How about this version of get_freepage_start_pfn()?
>>
>> /*
>>  * Scan the range before this pfn for a buddy that straddles it
>>  */
>> static unsigned long find_straddling_buddy(unsigned long start_pfn)
>> {
>> 	int order = 0;
>> 	struct page *page;
>> 	unsigned long pfn = start_pfn;
>>
>> 	while (!PageBuddy(page = pfn_to_page(pfn))) {
>> 		/* Nothing found */
>> 		if (++order > MAX_ORDER)
>> 			return start_pfn;
>> 		pfn &= ~0UL << order;
>> 	}
>>
>> 	/*
>> 	 * Found a preceding buddy, but does it straddle?
>> 	 */
>> 	if (pfn + (1 << buddy_order(page)) > start_pfn)
>> 		return pfn;
>>
>> 	/* Nothing found */
>> 	return start_pfn;
>> }
>>
>>> @@ -1614,10 +1652,43 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
>>>
>>>  		order = buddy_order(page);
>>>  		move_to_free_list(page, zone, order, old_mt, new_mt);
>>> +		/*
>>> +		 * set page migratetype 1) only after we move all free pages in
>>> +		 * one pageblock and 2) for all pageblocks within the page.
>>> +		 *
>>> +		 * for 1), since move_to_free_list() checks page migratetype with
>>> +		 * old_mt and changing one page migratetype affects all pages
>>> +		 * within the same pageblock, if we are moving more than
>>> +		 * one free pages in the same pageblock, setting migratetype
>>> +		 * right after first move_to_free_list() triggers the warning
>>> +		 * in the following move_to_free_list().
>>> +		 *
>>> +		 * for 2), when a free page order is greater than pageblock_order,
>>> +		 * all pageblocks within the free page need to be changed after
>>> +		 * move_to_free_list().
>>
>> I think this can be somewhat simplified.
>>
>> There are two assumptions we can make. Buddies always consist of 2^n
>> pages. And buddies and pageblocks are naturally aligned. This means
>> that if this pageblock has the start of a buddy that straddles into
>> the next pageblock(s), it must be the first page in the block. That in
>> turn means we can move the handling before the loop.
>>
>> If we split first, it also makes the loop a little simpler because we
>> know that any buddies that start inside this block cannot extend
>> beyond it (due to the alignment). The loop as it was originally
>> written can remain untouched.
>>
>>> +		 */
>>> +		if (pfn + (1 << order) > pageblock_end_pfn(pfn)) {
>>> +			for (pfn2 = pfn;
>>> +			     pfn2 < min_t(unsigned long,
>>> +					  pfn + (1 << order),
>>> +					  end_pfn + 1);
>>> +			     pfn2 += pageblock_nr_pages) {
>>> +				set_pageblock_migratetype(pfn_to_page(pfn2),
>>> +							  new_mt);
>>> +				mt_changed_pfn = pfn2;
>>
>> Hm, this seems to assume that start_pfn to end_pfn can be more than
>> one block. Why is that? This function is only used on single blocks.
>
> You are right. I made unnecessary assumptions when I wrote the code.
>
>>
>>> +			}
>>> +			/* split the free page if it goes beyond the specified range */
>>> +			if (pfn + (1 << order) > (end_pfn + 1))
>>> +				split_free_page(page, order, end_pfn + 1 - pfn);
>>> +		}
>>>  		pfn += 1 << order;
>>>  		pages_moved += 1 << order;
>>>  	}
>>> -	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
>>> +	/* set migratetype for the remaining pageblocks */
>>> +	for (pfn2 = mt_changed_pfn + pageblock_nr_pages;
>>> +	     pfn2 <= end_pfn;
>>> +	     pfn2 += pageblock_nr_pages)
>>> +		set_pageblock_migratetype(pfn_to_page(pfn2), new_mt);
>>
>> If I rework the code on the above, I'm arriving at the following:
>>
>> static int move_freepages(struct zone *zone, unsigned long start_pfn,
>> 			  unsigned long end_pfn, int old_mt, int new_mt)
>> {
>> 	struct page *start_page = pfn_to_page(start_pfn);
>> 	int pages_moved = 0;
>> 	unsigned long pfn;
>>
>> 	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
>> 	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
>>
>> 	/*
>> 	 * A free page may be comprised of 2^n blocks, which means our
>> 	 * block of interest could be head or tail in such a page.
>> 	 *
>> 	 * If we're a tail, update the type of our block, then split
>> 	 * the page into pageblocks. The splitting will do the leg
>> 	 * work of sorting the blocks into the right freelists.
>> 	 *
>> 	 * If we're a head, split the page into pageblocks first. This
>> 	 * ensures the migratetypes still match up during the freelist
>> 	 * removal. Then do the regular scan for buddies in the block
>> 	 * of interest, which will handle the rest.
>> 	 *
>> 	 * In theory, we could try to preserve 2^1 and larger blocks
>> 	 * that lie outside our range. In practice, MAX_ORDER is
>> 	 * usually one or two pageblocks anyway, so don't bother.
>> 	 *
>> 	 * Note that this only applies to page isolation, which calls
>> 	 * this on random blocks in the pfn range! When we move stuff
>> 	 * from inside the page allocator, the pages are coming off
>> 	 * the freelist (can't be tail) and multi-block pages are
>> 	 * handled directly in the stealing code (can't be a head).
>> 	 */
>>
>> 	/* We're a tail */
>> 	pfn = find_straddling_buddy(start_pfn);
>> 	if (pfn != start_pfn) {
>> 		struct page *free_page = pfn_to_page(pfn);
>>
>> 		set_pageblock_migratetype(start_page, new_mt);
>> 		split_free_page(free_page, buddy_order(free_page),
>> 				pageblock_nr_pages);
>> 		return pageblock_nr_pages;
>> 	}
>>
>> 	/* We're a head */
>> 	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order)
>> 		split_free_page(start_page, buddy_order(start_page),
>> 				pageblock_nr_pages);
>
> This actually can be:
>
> /* We're a head */
> if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
>         set_pageblock_migratetype(start_page, new_mt);
>         split_free_page(start_page, buddy_order(start_page),
>                         pageblock_nr_pages);
>         return pageblock_nr_pages;
> }
>
>
>>
>> 	/* Move buddies within the block */
>> 	while (pfn <= end_pfn) {
>> 		struct page *page = pfn_to_page(pfn);
>> 		int order, nr_pages;
>>
>> 		if (!PageBuddy(page)) {
>> 			pfn++;
>> 			continue;
>> 		}
>>
>> 		/* Make sure we are not inadvertently changing nodes */
>> 		VM_BUG_ON_PAGE(page_to_nid(page) != zone_to_nid(zone), page);
>> 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
>>
>> 		order = buddy_order(page);
>> 		nr_pages = 1 << order;
>>
>> 		move_to_free_list(page, zone, order, old_mt, new_mt);
>>
>> 		pfn += nr_pages;
>> 		pages_moved += nr_pages;
>> 	}
>>
>> 	set_pageblock_migratetype(start_page, new_mt);
>>
>> 	return pages_moved;
>> }
>>
>> Does this look reasonable to you?
>
> Looks good to me. Thanks.
>
>>
>> Note that the page isolation specific stuff comes first. If this code
>> holds up, we should be able to move it to page-isolation.c and keep it
>> out of the regular allocator path.
>
> You mean move the tail and head handling to set_migratetype_isolate()?
> And split move_freepages_block() into prep_move_freepages_block(),
> the tail and head code, and move_freepages()? It should work, and it
> follows a code pattern similar to steal_suitable_fallback().

The attached patch has all the suggested changes, let me know how it
looks to you. Thanks.

--
Best Regards,
Yan, Zi
From 32e7aefe352785b29b31b72ce0bb8b4e608860ca Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 25 Sep 2023 16:55:18 -0400
Subject: [PATCH] mm/page_isolation: split cross-pageblock free pages during
 isolation

alloc_contig_range() uses set_migratetype_isolate(), which eventually calls
move_freepages(), to isolate free pages. But move_freepages() was not able
to move free pages partially covered by the specified range, leaving a race
window open[1]. Fix it by splitting such pages before calling
move_freepages().

Common code to find the start pfn of a free page straddling a given pfn
is factored out into find_straddling_buddy().

[1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/page-isolation.h |  7 +++
 mm/page_alloc.c                | 94 ++++++++++++++++++++--------------
 mm/page_isolation.c            | 90 ++++++++++++++++++++------------
 3 files changed, 121 insertions(+), 70 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 901915747960..4873f1a41792 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,8 +34,15 @@ static inline bool is_migrate_isolate(int migratetype)
 #define REPORT_FAILURE	0x2
 
 void set_pageblock_migratetype(struct page *page, int migratetype);
+unsigned long find_straddling_buddy(unsigned long start_pfn);
 int move_freepages_block(struct zone *zone, struct page *page,
 			 int old_mt, int new_mt);
+bool prep_move_freepages_block(struct zone *zone, struct page *page,
+				      unsigned long *start_pfn,
+				      unsigned long *end_pfn,
+				      int *num_free, int *num_movable);
+int move_freepages(struct zone *zone, unsigned long start_pfn,
+			  unsigned long end_pfn, int old_mt, int new_mt);
 
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			     int migratetype, int flags, gfp_t gfp_flags);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 928bb595d7cc..74831a86f41d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -865,15 +865,15 @@ int split_free_page(struct page *free_page,
 	struct zone *zone = page_zone(free_page);
 	unsigned long free_page_pfn = page_to_pfn(free_page);
 	unsigned long pfn;
-	unsigned long flags;
 	int free_page_order;
 	int mt;
 	int ret = 0;
 
-	if (split_pfn_offset == 0)
-		return ret;
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
+		return ret;
 
 	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
 		ret = -ENOENT;
@@ -899,7 +899,6 @@ int split_free_page(struct page *free_page,
 			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
 	}
 out:
-	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
 /*
@@ -1588,21 +1587,52 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
 					unsigned int order) { return NULL; }
 #endif
 
+/*
+ * Scan the range before this pfn for a buddy that straddles it
+ */
+unsigned long find_straddling_buddy(unsigned long start_pfn)
+{
+	int order = 0;
+	struct page *page;
+	unsigned long pfn = start_pfn;
+
+	while (!PageBuddy(page = pfn_to_page(pfn))) {
+		/* Nothing found */
+		if (++order > MAX_ORDER)
+			return start_pfn;
+		pfn &= ~0UL << order;
+	}
+
+	/*
+	 * Found a preceding buddy, but does it straddle?
+	 */
+	if (pfn + (1 << buddy_order(page)) > start_pfn)
+		return pfn;
+
+	/* Nothing found */
+	return start_pfn;
+}
+
 /*
  * Move the free pages in a range to the freelist tail of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
  * boundary. If alignment is required, use move_freepages_block()
  */
-static int move_freepages(struct zone *zone, unsigned long start_pfn,
+int move_freepages(struct zone *zone, unsigned long start_pfn,
 			  unsigned long end_pfn, int old_mt, int new_mt)
 {
-	struct page *page;
-	unsigned long pfn;
-	unsigned int order;
+	struct page *start_page = pfn_to_page(start_pfn);
 	int pages_moved = 0;
+	unsigned long pfn = start_pfn;
+
+	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
+	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
+
+	/* Move buddies within the block */
+	while (pfn <= end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+		int order, nr_pages;
 
-	for (pfn = start_pfn; pfn <= end_pfn;) {
-		page = pfn_to_page(pfn);
 		if (!PageBuddy(page)) {
 			pfn++;
 			continue;
@@ -1613,16 +1643,20 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		order = buddy_order(page);
+		nr_pages = 1 << order;
+
 		move_to_free_list(page, zone, order, old_mt, new_mt);
-		pfn += 1 << order;
-		pages_moved += 1 << order;
+
+		pfn += nr_pages;
+		pages_moved += nr_pages;
 	}
-	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+
+	set_pageblock_migratetype(start_page, new_mt);
 
 	return pages_moved;
 }
 
-static bool prep_move_freepages_block(struct zone *zone, struct page *page,
+bool prep_move_freepages_block(struct zone *zone, struct page *page,
 				      unsigned long *start_pfn,
 				      unsigned long *end_pfn,
 				      int *num_free, int *num_movable)
@@ -6138,7 +6172,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype, gfp_t gfp_mask)
 {
 	unsigned long outer_start, outer_end;
-	int order;
 	int ret = 0;
 
 	struct compact_control cc = {
@@ -6212,28 +6245,13 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * isolated thus they won't get removed from buddy.
 	 */
 
-	order = 0;
-	outer_start = start;
-	while (!PageBuddy(pfn_to_page(outer_start))) {
-		if (++order > MAX_ORDER) {
-			outer_start = start;
-			break;
-		}
-		outer_start &= ~0UL << order;
-	}
-
-	if (outer_start != start) {
-		order = buddy_order(pfn_to_page(outer_start));
-
-		/*
-		 * outer_start page could be small order buddy page and
-		 * it doesn't include start page. Adjust outer_start
-		 * in this case to report failed page properly
-		 * on tracepoint in test_pages_isolated()
-		 */
-		if (outer_start + (1UL << order) <= start)
-			outer_start = start;
-	}
+	/*
+	 * outer_start page could be small order buddy page and it doesn't
+	 * include start page. outer_start is set to start in
+	 * find_straddling_buddy() to report failed page properly on tracepoint
+	 * in test_pages_isolated()
+	 */
+	outer_start = find_straddling_buddy(start);
 
 	/* Make sure the range is really isolated. */
 	if (test_pages_isolated(outer_start, end, 0)) {
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 5f8c658c0853..c6a4e02ed588 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -178,15 +178,61 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
 			migratetype, isol_flags);
 	if (!unmovable) {
-		int nr_pages;
 		int mt = get_pageblock_migratetype(page);
+		unsigned long start_pfn, end_pfn, free_page_pfn;
+		struct page *start_page;
 
-		nr_pages = move_freepages_block(zone, page, mt, MIGRATE_ISOLATE);
 		/* Block spans zone boundaries? */
-		if (nr_pages == -1) {
+		if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, NULL, NULL)) {
 			spin_unlock_irqrestore(&zone->lock, flags);
 			return -EBUSY;
 		}
+
+		/*
+		 * A free page may be comprised of 2^n blocks, which means our
+		 * block of interest could be head or tail in such a page.
+		 *
+		 * If we're a tail, update the type of our block, then split
+		 * the page into pageblocks. The splitting will do the leg
+		 * work of sorting the blocks into the right freelists.
+		 *
+		 * If we're a head, split the page into pageblocks first. This
+		 * ensures the migratetypes still match up during the freelist
+		 * removal. Then do the regular scan for buddies in the block
+		 * of interest, which will handle the rest.
+		 *
+		 * In theory, we could try to preserve 2^1 and larger blocks
+		 * that lie outside our range. In practice, MAX_ORDER is
+		 * usually one or two pageblocks anyway, so don't bother.
+		 *
+		 * Note that this only applies to page isolation, which calls
+		 * this on random blocks in the pfn range! When we move stuff
+		 * from inside the page allocator, the pages are coming off
+		 * the freelist (can't be tail) and multi-block pages are
+		 * handled directly in the stealing code (can't be a head).
+		 */
+		start_page = pfn_to_page(start_pfn);
+
+		free_page_pfn = find_straddling_buddy(start_pfn);
+		/*
+		 * 1) We're a tail: free_page_pfn != start_pfn
+		 * 2) We're a head: free_page_pfn == start_pfn &&
+		 *		    PageBuddy(start_page) &&
+		 *		    buddy_order(start_page) > pageblock_order
+		 *
+		 * In both cases, the free page needs to be split.
+		 */
+		if (free_page_pfn != start_pfn ||
+		    (PageBuddy(start_page) &&
+		     buddy_order(start_page) > pageblock_order)) {
+			struct page *free_page = pfn_to_page(free_page_pfn);
+
+			set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
+			split_free_page(free_page, buddy_order(free_page),
+					pageblock_nr_pages);
+		} else
+			move_freepages(zone, start_pfn, end_pfn, mt, MIGRATE_ISOLATE);
+
 		zone->nr_isolate_pageblock++;
 		spin_unlock_irqrestore(&zone->lock, flags);
 		return 0;
@@ -380,11 +426,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 		if (PageBuddy(page)) {
 			int order = buddy_order(page);
 
-			if (pfn + (1UL << order) > boundary_pfn) {
-				/* free page changed before split, check it again */
-				if (split_free_page(page, order, boundary_pfn - pfn))
-					continue;
-			}
+			VM_WARN_ONCE(pfn + (1UL << order) > boundary_pfn,
+				"a free page sits across isolation boundary");
 
 			pfn += 1UL << order;
 			continue;
@@ -408,8 +451,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 			 * can be migrated. Otherwise, fail the isolation.
 			 */
 			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
-				int order;
-				unsigned long outer_pfn;
 				int page_mt = get_pageblock_migratetype(page);
 				bool isolate_page = !is_migrate_isolate_page(page);
 				struct compact_control cc = {
@@ -427,9 +468,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				/*
 				 * XXX: mark the page as MIGRATE_ISOLATE so that
 				 * no one else can grab the freed page after migration.
-				 * Ideally, the page should be freed as two separate
-				 * pages to be added into separate migratetype free
-				 * lists.
+				 * The page should be freed into separate migratetype
+				 * free lists, unless the free page order is greater
+				 * than pageblock order. It is not the case now,
+				 * since gigantic hugetlb is freed as order-0
+				 * pages and LRU pages do not cross pageblocks.
 				 */
 				if (isolate_page) {
 					ret = set_migratetype_isolate(page, page_mt,
@@ -451,25 +494,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 
 				if (ret)
 					goto failed;
-				/*
-				 * reset pfn to the head of the free page, so
-				 * that the free page handling code above can split
-				 * the free page to the right migratetype list.
-				 *
-				 * head_pfn is not used here as a hugetlb page order
-				 * can be bigger than MAX_ORDER, but after it is
-				 * freed, the free page order is not. Use pfn within
-				 * the range to find the head of the free page.
-				 */
-				order = 0;
-				outer_pfn = pfn;
-				while (!PageBuddy(pfn_to_page(outer_pfn))) {
-					/* stop if we cannot find the free page */
-					if (++order > MAX_ORDER)
-						goto failed;
-					outer_pfn &= ~0UL << order;
-				}
-				pfn = outer_pfn;
+
+				pfn = head_pfn + nr_pages;
 				continue;
 			} else
 #endif
Zi Yan Oct. 16, 2023, 1:35 p.m. UTC | #39
> The attached patch has all the suggested changes, let me know how it
> looks to you. Thanks.

The one I sent has free page accounting issues. The attached one fixes them.

--
Best Regards,
Yan, Zi
From b428b4919e30dc0556406325d3c173a87f45f135 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Mon, 25 Sep 2023 16:55:18 -0400
Subject: [PATCH v2] mm/page_isolation: split cross-pageblock free pages during
 isolation

alloc_contig_range() uses set_migratetype_isolate(), which eventually calls
move_freepages(), to isolate free pages. But move_freepages() was not able
to move free pages partially covered by the specified range, leaving a race
window open[1]. Fix it by splitting such pages before calling
move_freepages().

Common code to find the start pfn of a free page straddling a given pfn
is factored out into find_straddling_buddy(). split_free_page() is modified
to change the pageblock migratetype inside the function.

[1] https://lore.kernel.org/linux-mm/20230920160400.GC124289@cmpxchg.org/

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 include/linux/page-isolation.h |  12 +++-
 mm/internal.h                  |   3 -
 mm/page_alloc.c                | 103 ++++++++++++++++++------------
 mm/page_isolation.c            | 113 ++++++++++++++++++++++-----------
 4 files changed, 151 insertions(+), 80 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 901915747960..e82ab67867df 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -33,9 +33,17 @@ static inline bool is_migrate_isolate(int migratetype)
 #define MEMORY_OFFLINE	0x1
 #define REPORT_FAILURE	0x2
 
+unsigned long find_straddling_buddy(unsigned long start_pfn);
+int split_free_page(struct page *free_page,
+			unsigned int order, unsigned long split_pfn_offset,
+			int mt1, int mt2);
 void set_pageblock_migratetype(struct page *page, int migratetype);
-int move_freepages_block(struct zone *zone, struct page *page,
-			 int old_mt, int new_mt);
+int move_freepages(struct zone *zone, unsigned long start_pfn,
+			  unsigned long end_pfn, int old_mt, int new_mt);
+bool prep_move_freepages_block(struct zone *zone, struct page *page,
+				      unsigned long *start_pfn,
+				      unsigned long *end_pfn,
+				      int *num_free, int *num_movable);
 
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			     int migratetype, int flags, gfp_t gfp_flags);
diff --git a/mm/internal.h b/mm/internal.h
index 8c90e966e9f8..cda702359c0f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -457,9 +457,6 @@ void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
 		unsigned long, enum meminit_context, struct vmem_altmap *, int);
 
 
-int split_free_page(struct page *free_page,
-			unsigned int order, unsigned long split_pfn_offset);
-
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 928bb595d7cc..e877fbdb700e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -851,6 +851,8 @@ static inline void __free_one_page(struct page *page,
  * @free_page:		the original free page
  * @order:		the order of the page
  * @split_pfn_offset:	split offset within the page
+ * @mt1:		migratetype set before the offset
+ * @mt2:		migratetype set after the offset
  *
  * Return -ENOENT if the free page is changed, otherwise 0
  *
@@ -860,20 +862,21 @@ static inline void __free_one_page(struct page *page,
  * nothing.
  */
 int split_free_page(struct page *free_page,
-			unsigned int order, unsigned long split_pfn_offset)
+			unsigned int order, unsigned long split_pfn_offset,
+			int mt1, int mt2)
 {
 	struct zone *zone = page_zone(free_page);
 	unsigned long free_page_pfn = page_to_pfn(free_page);
 	unsigned long pfn;
-	unsigned long flags;
 	int free_page_order;
 	int mt;
 	int ret = 0;
 
-	if (split_pfn_offset == 0)
-		return ret;
+	/* zone lock should be held when this function is called */
+	lockdep_assert_held(&zone->lock);
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (split_pfn_offset == 0 || split_pfn_offset >= (1 << order))
+		return ret;
 
 	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
 		ret = -ENOENT;
@@ -883,6 +886,10 @@ int split_free_page(struct page *free_page,
 	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
 	del_page_from_free_list(free_page, zone, order, mt);
 
+	set_pageblock_migratetype(free_page, mt1);
+	set_pageblock_migratetype(pfn_to_page(free_page_pfn + split_pfn_offset),
+				  mt2);
+
 	for (pfn = free_page_pfn;
 	     pfn < free_page_pfn + (1UL << order);) {
 		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
@@ -899,7 +906,6 @@ int split_free_page(struct page *free_page,
 			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
 	}
 out:
-	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
 /*
@@ -1588,21 +1594,52 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
 					unsigned int order) { return NULL; }
 #endif
 
+/*
+ * Scan the range before this pfn for a buddy that straddles it
+ */
+unsigned long find_straddling_buddy(unsigned long start_pfn)
+{
+	int order = 0;
+	struct page *page;
+	unsigned long pfn = start_pfn;
+
+	while (!PageBuddy(page = pfn_to_page(pfn))) {
+		/* Nothing found */
+		if (++order > MAX_ORDER)
+			return start_pfn;
+		pfn &= ~0UL << order;
+	}
+
+	/*
+	 * Found a preceding buddy, but does it straddle?
+	 */
+	if (pfn + (1 << buddy_order(page)) > start_pfn)
+		return pfn;
+
+	/* Nothing found */
+	return start_pfn;
+}
+
 /*
  * Move the free pages in a range to the freelist tail of the requested type.
  * Note that start_page and end_pages are not aligned on a pageblock
  * boundary. If alignment is required, use move_freepages_block()
  */
-static int move_freepages(struct zone *zone, unsigned long start_pfn,
+int move_freepages(struct zone *zone, unsigned long start_pfn,
 			  unsigned long end_pfn, int old_mt, int new_mt)
 {
-	struct page *page;
-	unsigned long pfn;
-	unsigned int order;
+	struct page *start_page = pfn_to_page(start_pfn);
 	int pages_moved = 0;
+	unsigned long pfn = start_pfn;
+
+	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
+	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
+
+	/* Move buddies within the block */
+	while (pfn <= end_pfn) {
+		struct page *page = pfn_to_page(pfn);
+		int order, nr_pages;
 
-	for (pfn = start_pfn; pfn <= end_pfn;) {
-		page = pfn_to_page(pfn);
 		if (!PageBuddy(page)) {
 			pfn++;
 			continue;
@@ -1613,16 +1650,20 @@ static int move_freepages(struct zone *zone, unsigned long start_pfn,
 		VM_BUG_ON_PAGE(page_zone(page) != zone, page);
 
 		order = buddy_order(page);
+		nr_pages = 1 << order;
+
 		move_to_free_list(page, zone, order, old_mt, new_mt);
-		pfn += 1 << order;
-		pages_moved += 1 << order;
+
+		pfn += nr_pages;
+		pages_moved += nr_pages;
 	}
-	set_pageblock_migratetype(pfn_to_page(start_pfn), new_mt);
+
+	set_pageblock_migratetype(start_page, new_mt);
 
 	return pages_moved;
 }
 
-static bool prep_move_freepages_block(struct zone *zone, struct page *page,
+bool prep_move_freepages_block(struct zone *zone, struct page *page,
 				      unsigned long *start_pfn,
 				      unsigned long *end_pfn,
 				      int *num_free, int *num_movable)
@@ -6138,7 +6179,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype, gfp_t gfp_mask)
 {
 	unsigned long outer_start, outer_end;
-	int order;
 	int ret = 0;
 
 	struct compact_control cc = {
@@ -6212,28 +6252,13 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * isolated thus they won't get removed from buddy.
 	 */
 
-	order = 0;
-	outer_start = start;
-	while (!PageBuddy(pfn_to_page(outer_start))) {
-		if (++order > MAX_ORDER) {
-			outer_start = start;
-			break;
-		}
-		outer_start &= ~0UL << order;
-	}
-
-	if (outer_start != start) {
-		order = buddy_order(pfn_to_page(outer_start));
-
-		/*
-		 * outer_start page could be small order buddy page and
-		 * it doesn't include start page. Adjust outer_start
-		 * in this case to report failed page properly
-		 * on tracepoint in test_pages_isolated()
-		 */
-		if (outer_start + (1UL << order) <= start)
-			outer_start = start;
-	}
+	/*
+	 * outer_start page could be small order buddy page and it doesn't
+	 * include start page. outer_start is set to start in
+	 * find_straddling_buddy() to report failed page properly on tracepoint
+	 * in test_pages_isolated()
+	 */
+	outer_start = find_straddling_buddy(start);
 
 	/* Make sure the range is really isolated. */
 	if (test_pages_isolated(outer_start, end, 0)) {
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 5f8c658c0853..0500dff477f8 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -139,6 +139,62 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
 	return NULL;
 }
 
+/*
+ * additional steps for moving free pages during page isolation
+ */
+static int move_freepages_for_isolation(struct zone *zone, unsigned long start_pfn,
+			  unsigned long end_pfn, int old_mt, int new_mt)
+{
+	struct page *start_page = pfn_to_page(start_pfn);
+	unsigned long pfn;
+
+	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
+	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
+
+	/*
+	 * A free page may be comprised of 2^n blocks, which means our
+	 * block of interest could be head or tail in such a page.
+	 *
+	 * If we're a tail, update the type of our block, then split
+	 * the page into pageblocks. The splitting will do the leg
+	 * work of sorting the blocks into the right freelists.
+	 *
+	 * If we're a head, split the page into pageblocks first. This
+	 * ensures the migratetypes still match up during the freelist
+	 * removal. Then do the regular scan for buddies in the block
+	 * of interest, which will handle the rest.
+	 *
+	 * In theory, we could try to preserve 2^1 and larger blocks
+	 * that lie outside our range. In practice, MAX_ORDER is
+	 * usually one or two pageblocks anyway, so don't bother.
+	 *
+	 * Note that this only applies to page isolation, which calls
+	 * this on random blocks in the pfn range! When we move stuff
+	 * from inside the page allocator, the pages are coming off
+	 * the freelist (can't be tail) and multi-block pages are
+	 * handled directly in the stealing code (can't be a head).
+	 */
+
+	/* We're a tail */
+	pfn = find_straddling_buddy(start_pfn);
+	if (pfn != start_pfn) {
+		struct page *free_page = pfn_to_page(pfn);
+
+		split_free_page(free_page, buddy_order(free_page),
+				pageblock_nr_pages, old_mt, new_mt);
+		return pageblock_nr_pages;
+	}
+
+	/* We're a head */
+	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
+		split_free_page(start_page, buddy_order(start_page),
+				pageblock_nr_pages, new_mt, old_mt);
+		return pageblock_nr_pages;
+	}
+
+	return 0;
+}
+
 /*
  * This function set pageblock migratetype to isolate if no unmovable page is
  * present in [start_pfn, end_pfn). The pageblock must intersect with
@@ -178,15 +234,17 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
 			migratetype, isol_flags);
 	if (!unmovable) {
-		int nr_pages;
 		int mt = get_pageblock_migratetype(page);
+		unsigned long start_pfn, end_pfn;
 
-		nr_pages = move_freepages_block(zone, page, mt, MIGRATE_ISOLATE);
-		/* Block spans zone boundaries? */
-		if (nr_pages == -1) {
+		if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, NULL, NULL)) {
 			spin_unlock_irqrestore(&zone->lock, flags);
 			return -EBUSY;
 		}
+
+		if (!move_freepages_for_isolation(zone, start_pfn, end_pfn, mt, MIGRATE_ISOLATE))
+			move_freepages(zone, start_pfn, end_pfn, mt, MIGRATE_ISOLATE);
+
 		zone->nr_isolate_pageblock++;
 		spin_unlock_irqrestore(&zone->lock, flags);
 		return 0;
@@ -253,13 +311,16 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
 	 * allocation.
 	 */
 	if (!isolated_page) {
-		int nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE,
-						    migratetype);
+		unsigned long start_pfn, end_pfn;
+
 		/*
 		 * Isolating this block already succeeded, so this
 		 * should not fail on zone boundaries.
 		 */
-		WARN_ON_ONCE(nr_pages == -1);
+		if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn, NULL, NULL))
+			WARN_ON_ONCE(1);
+		else if (!move_freepages_for_isolation(zone, start_pfn, end_pfn, MIGRATE_ISOLATE, migratetype))
+			move_freepages(zone, start_pfn, end_pfn, MIGRATE_ISOLATE, migratetype);
 	} else {
 		set_pageblock_migratetype(page, migratetype);
 		__putback_isolated_page(page, order, migratetype);
@@ -380,11 +441,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 		if (PageBuddy(page)) {
 			int order = buddy_order(page);
 
-			if (pfn + (1UL << order) > boundary_pfn) {
-				/* free page changed before split, check it again */
-				if (split_free_page(page, order, boundary_pfn - pfn))
-					continue;
-			}
+			VM_WARN_ONCE(pfn + (1UL << order) > boundary_pfn,
+				"a free page sits across isolation boundary");
 
 			pfn += 1UL << order;
 			continue;
@@ -408,8 +466,6 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 			 * can be migrated. Otherwise, fail the isolation.
 			 */
 			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
-				int order;
-				unsigned long outer_pfn;
 				int page_mt = get_pageblock_migratetype(page);
 				bool isolate_page = !is_migrate_isolate_page(page);
 				struct compact_control cc = {
@@ -427,9 +483,11 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				/*
 				 * XXX: mark the page as MIGRATE_ISOLATE so that
 				 * no one else can grab the freed page after migration.
-				 * Ideally, the page should be freed as two separate
-				 * pages to be added into separate migratetype free
-				 * lists.
+				 * The page should be freed into separate migratetype
+				 * free lists, unless the free page order is greater
+				 * than pageblock order. It is not the case now,
+				 * since gigantic hugetlb is freed as order-0
+				 * pages and LRU pages do not cross pageblocks.
 				 */
 				if (isolate_page) {
 					ret = set_migratetype_isolate(page, page_mt,
@@ -451,25 +509,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 
 				if (ret)
 					goto failed;
-				/*
-				 * reset pfn to the head of the free page, so
-				 * that the free page handling code above can split
-				 * the free page to the right migratetype list.
-				 *
-				 * head_pfn is not used here as a hugetlb page order
-				 * can be bigger than MAX_ORDER, but after it is
-				 * freed, the free page order is not. Use pfn within
-				 * the range to find the head of the free page.
-				 */
-				order = 0;
-				outer_pfn = pfn;
-				while (!PageBuddy(pfn_to_page(outer_pfn))) {
-					/* stop if we cannot find the free page */
-					if (++order > MAX_ORDER)
-						goto failed;
-					outer_pfn &= ~0UL << order;
-				}
-				pfn = outer_pfn;
+
+				pfn = head_pfn + nr_pages;
 				continue;
 			} else
 #endif
Johannes Weiner Oct. 16, 2023, 2:37 p.m. UTC | #40
On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
> > The attached patch has all the suggested changes, let me know how it
> > looks to you. Thanks.
> 
> The one I sent has free page accounting issues. The attached one fixes them.

Do you still have the warnings? I wonder what went wrong.

> @@ -883,6 +886,10 @@ int split_free_page(struct page *free_page,
>  	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
>  	del_page_from_free_list(free_page, zone, order, mt);
>  
> +	set_pageblock_migratetype(free_page, mt1);
> +	set_pageblock_migratetype(pfn_to_page(free_page_pfn + split_pfn_offset),
> +				  mt2);
> +
>  	for (pfn = free_page_pfn;
>  	     pfn < free_page_pfn + (1UL << order);) {
>  		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);

I don't think this is quite right.

With CONFIG_ARCH_FORCE_MAX_ORDER it's possible that we're dealing with
a buddy that is more than two blocks:

[pageblock 0][pageblock 1][pageblock 2][pageblock 3]
[buddy                                             ]
                                       [isolate range ..

That for loop splits the buddy into 4 blocks. The code above would set
pageblock 0 to old_mt, and pageblock 1 to new_mt. But it should only
set pageblock 3 to new_mt.
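
To put hypothetical numbers on it (pageblock_order 9, so 512 pages per
block, and an order-11 buddy starting at pfn 2048):

	buddy_pfn = 2048;			/* buddy covers pfns 2048..4095     */
	start_pfn = 3584;			/* first pfn of the isolation range */
	block = (start_pfn - buddy_pfn) >> 9;	/* = 3, i.e. pageblock 3            */

Only that one block may flip to new_mt; blocks 0 through 2 have to keep
old_mt.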

My proposal had the mt update in the caller:

> @@ -139,6 +139,62 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>  	return NULL;
>  }
>  
> +/*
> + * additional steps for moving free pages during page isolation
> + */
> +static int move_freepages_for_isolation(struct zone *zone, unsigned long start_pfn,
> +			  unsigned long end_pfn, int old_mt, int new_mt)
> +{
> +	struct page *start_page = pfn_to_page(start_pfn);
> +	unsigned long pfn;
> +
> +	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
> +	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
> +
> +	/*
> +	 * A free page may be comprised of 2^n blocks, which means our
> +	 * block of interest could be head or tail in such a page.
> +	 *
> +	 * If we're a tail, update the type of our block, then split
> +	 * the page into pageblocks. The splitting will do the leg
> +	 * work of sorting the blocks into the right freelists.
> +	 *
> +	 * If we're a head, split the page into pageblocks first. This
> +	 * ensures the migratetypes still match up during the freelist
> +	 * removal. Then do the regular scan for buddies in the block
> +	 * of interest, which will handle the rest.
> +	 *
> +	 * In theory, we could try to preserve 2^1 and larger blocks
> +	 * that lie outside our range. In practice, MAX_ORDER is
> +	 * usually one or two pageblocks anyway, so don't bother.
> +	 *
> +	 * Note that this only applies to page isolation, which calls
> +	 * this on random blocks in the pfn range! When we move stuff
> +	 * from inside the page allocator, the pages are coming off
> +	 * the freelist (can't be tail) and multi-block pages are
> +	 * handled directly in the stealing code (can't be a head).
> +	 */
> +
> +	/* We're a tail */
> +	pfn = find_straddling_buddy(start_pfn);
> +	if (pfn != start_pfn) {
> +		struct page *free_page = pfn_to_page(pfn);
> +
> +		split_free_page(free_page, buddy_order(free_page),
> +				pageblock_nr_pages, old_mt, new_mt);
> +		return pageblock_nr_pages;
> +	}
> +
> +	/* We're a head */
> +	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
> +		split_free_page(start_page, buddy_order(start_page),
> +				pageblock_nr_pages, new_mt, old_mt);
> +		return pageblock_nr_pages;
> +	}

i.e. here ^: set the mt of the block that's in isolation range, then
split the block.

I think I can guess the warning you were getting: in the head case, we
need to change the type of the head pageblock that's on the
freelist. If we do it before calling split, the
del_page_from_freelist() in there warns about the wrong type.

How about pulling the freelist removal out of split_free_page()?

	del_page_from_freelist(huge_buddy);
	set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
	split_free_page(huge_buddy, buddy_order(), pageblock_nr_pages);
	return pageblock_nr_pages;
Zi Yan Oct. 16, 2023, 3 p.m. UTC | #41
On 16 Oct 2023, at 10:37, Johannes Weiner wrote:

> On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
>>> The attached patch has all the suggested changes, let me know how it
>>> looks to you. Thanks.
>>
>> The one I sent has free page accounting issues. The attached one fixes them.
>
> Do you still have the warnings? I wonder what went wrong.

No warnings. But something with the code:

1. in your version, split_free_page() is called without changing any pageblock
migratetypes, then split_free_page() is just a no-op, since the page is
just deleted from the free list, then freed via different orders. Buddy allocator
will merge them back.

2. in my version, I set pageblock migratetype to new_mt before split_free_page(),
but it causes free page accounting issues, since in the case of head, free pages
are deleted from new_mt when they are in old_mt free list and the accounting
decreases new_mt free page number instead of old_mt one.

Basically, split_free_page() is awkward, as it relies on preset migratetypes: the
migratetypes get changed without the free pages being deleted from the list first.
That is why I came up with the new split_free_page() below.

>
>> @@ -883,6 +886,10 @@ int split_free_page(struct page *free_page,
>>  	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
>>  	del_page_from_free_list(free_page, zone, order, mt);
>>
>> +	set_pageblock_migratetype(free_page, mt1);
>> +	set_pageblock_migratetype(pfn_to_page(free_page_pfn + split_pfn_offset),
>> +				  mt2);
>> +
>>  	for (pfn = free_page_pfn;
>>  	     pfn < free_page_pfn + (1UL << order);) {
>>  		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
>
> I don't think this is quite right.
>
> With CONFIG_ARCH_FORCE_MAX_ORDER it's possible that we're dealing with
> a buddy that is more than two blocks:
>
> [pageblock 0][pageblock 1][pageblock 2][pageblock 3]
> [buddy                                             ]
>                                        [isolate range ..
>
> That for loop splits the buddy into 4 blocks. The code above would set
> pageblock 0 to old_mt, and pageblock 1 to new_mt. But it should only
> set pageblock 3 to new_mt.

OK. I think I need to fix split_free_page().

Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy have more than one
pageblock and in turn makes an in-use page have more than one pageblock,
we will have problems. Since in isolate_single_pageblock(), an in-use page
can have part of its pageblock set to a different migratetype and be freed,
leaving the free page with unmatched migratetypes. We might need to
free pages at pageblock_order if their orders are bigger than pageblock_order.

Which arch with CONFIG_ARCH_FORCE_MAX_ORDER can have a buddy containing more
than one pageblock? I would like to run some tests.
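
(Any arch where the forced MAX_ORDER ends up above pageblock_order should
qualify, since a buddy then spans up to 2^(MAX_ORDER - pageblock_order)
pageblocks.)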

>
> My proposal had the mt update in the caller:
>
>> @@ -139,6 +139,62 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>>  	return NULL;
>>  }
>>
>> +/*
>> + * additional steps for moving free pages during page isolation
>> + */
>> +static int move_freepages_for_isolation(struct zone *zone, unsigned long start_pfn,
>> +			  unsigned long end_pfn, int old_mt, int new_mt)
>> +{
>> +	struct page *start_page = pfn_to_page(start_pfn);
>> +	unsigned long pfn;
>> +
>> +	VM_WARN_ON(start_pfn & (pageblock_nr_pages - 1));
>> +	VM_WARN_ON(start_pfn + pageblock_nr_pages - 1 != end_pfn);
>> +
>> +	/*
>> +	 * A free page may be comprised of 2^n blocks, which means our
>> +	 * block of interest could be head or tail in such a page.
>> +	 *
>> +	 * If we're a tail, update the type of our block, then split
>> +	 * the page into pageblocks. The splitting will do the leg
>> +	 * work of sorting the blocks into the right freelists.
>> +	 *
>> +	 * If we're a head, split the page into pageblocks first. This
>> +	 * ensures the migratetypes still match up during the freelist
>> +	 * removal. Then do the regular scan for buddies in the block
>> +	 * of interest, which will handle the rest.
>> +	 *
>> +	 * In theory, we could try to preserve 2^1 and larger blocks
>> +	 * that lie outside our range. In practice, MAX_ORDER is
>> +	 * usually one or two pageblocks anyway, so don't bother.
>> +	 *
>> +	 * Note that this only applies to page isolation, which calls
>> +	 * this on random blocks in the pfn range! When we move stuff
>> +	 * from inside the page allocator, the pages are coming off
>> +	 * the freelist (can't be tail) and multi-block pages are
>> +	 * handled directly in the stealing code (can't be a head).
>> +	 */
>> +
>> +	/* We're a tail */
>> +	pfn = find_straddling_buddy(start_pfn);
>> +	if (pfn != start_pfn) {
>> +		struct page *free_page = pfn_to_page(pfn);
>> +
>> +		split_free_page(free_page, buddy_order(free_page),
>> +				pageblock_nr_pages, old_mt, new_mt);
>> +		return pageblock_nr_pages;
>> +	}
>> +
>> +	/* We're a head */
>> +	if (PageBuddy(start_page) && buddy_order(start_page) > pageblock_order) {
>> +		split_free_page(start_page, buddy_order(start_page),
>> +				pageblock_nr_pages, new_mt, old_mt);
>> +		return pageblock_nr_pages;
>> +	}
>
> i.e. here ^: set the mt of the block that's in isolation range, then
> split the block.
>
> I think I can guess the warning you were getting: in the head case, we
> need to change the type of the head pageblock that's on the
> freelist. If we do it before calling split, the
> del_page_from_freelist() in there warns about the wrong type.
>
> How about pulling the freelist removal out of split_free_page()?
>
> 	del_page_from_freelist(huge_buddy);
> 	set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
> 	split_free_page(huge_buddy, buddy_order(), pageblock_nr_pages);
> 	return pageblock_nr_pages;

Yes, this is better. Let me change to this implementation.

But I would like to test it on an environment where a buddy contains more than
one pageblock first. I can probably change MAX_ORDER of x86_64 to do it locally.
I will report back.

--
Best Regards,
Yan, Zi
Johannes Weiner Oct. 16, 2023, 6:51 p.m. UTC | #42
On Mon, Oct 16, 2023 at 11:00:33AM -0400, Zi Yan wrote:
> On 16 Oct 2023, at 10:37, Johannes Weiner wrote:
> 
> > On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
> >>> The attached patch has all the suggested changes, let me know how it
> >>> looks to you. Thanks.
> >>
> >> The one I sent has free page accounting issues. The attached one fixes them.
> >
> > Do you still have the warnings? I wonder what went wrong.
> 
> No warnings. But something with the code:
> 
> 1. in your version, split_free_page() is called without changing any pageblock
> migratetypes, then split_free_page() is just a no-op, since the page is
> just deleted from the free list, then freed via different orders. Buddy allocator
> will merge them back.

Hm not quite.

If it's the tail block of a buddy, I update its type before
splitting. The splitting loop looks up the type of each block for
sorting it onto freelists.

If it's the head block, yes I split it first according to its old
type. But then I let it fall through to scanning the block, which will
find that buddy, update its type and move it.

> 2. in my version, I set pageblock migratetype to new_mt before split_free_page(),
> but it causes free page accounting issues, since in the case of head, free pages
> are deleted from new_mt when they are in old_mt free list and the accounting
> decreases new_mt free page number instead of old_mt one.

Right, that makes sense.

> Basically, split_free_page() is awkward, as it relies on preset migratetypes: the
> migratetypes get changed without the free pages being deleted from the list first.
> That is why I came up with the new split_free_page() below.

Yeah, the in-between thing is bad. Either it fixes the migratetype
before deletion, or it doesn't do the deletion. I'm thinking it would
be simpler to move the deletion out instead.

> >> @@ -883,6 +886,10 @@ int split_free_page(struct page *free_page,
> >>  	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
> >>  	del_page_from_free_list(free_page, zone, order, mt);
> >>
> >> +	set_pageblock_migratetype(free_page, mt1);
> >> +	set_pageblock_migratetype(pfn_to_page(free_page_pfn + split_pfn_offset),
> >> +				  mt2);
> >> +
> >>  	for (pfn = free_page_pfn;
> >>  	     pfn < free_page_pfn + (1UL << order);) {
> >>  		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
> >
> > I don't think this is quite right.
> >
> > With CONFIG_ARCH_FORCE_MAX_ORDER it's possible that we're dealing with
> > a buddy that is more than two blocks:
> >
> > [pageblock 0][pageblock 1][pageblock 2][pageblock 3]
> > [buddy                                             ]
> >                                        [isolate range ..
> >
> > That for loop splits the buddy into 4 blocks. The code above would set
> > pageblock 0 to old_mt, and pageblock 1 to new_mt. But it should only
> > set pageblock 3 to new_mt.
> 
> OK. I think I need to fix split_free_page().
> 
> Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy have more than one
> pageblock and in turn makes an in-use page have more than one pageblock,
> we will have problems. Since in isolate_single_pageblock(), an in-use page
> can have part of its pageblock set to a different migratetype and be freed,
> leaving the free page with unmatched migratetypes. We might need to
> free pages at pageblock_order if their orders are bigger than pageblock_order.

Is this a practical issue? You mentioned that right now only gigantic
pages can be larger than a pageblock, and those are freed in order-0
chunks.

> > How about pulling the freelist removal out of split_free_page()?
> >
> > 	del_page_from_freelist(huge_buddy);
> > 	set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
> > 	split_free_page(huge_buddy, buddy_order(), pageblock_nr_pages);
> > 	return pageblock_nr_pages;
> 
> Yes, this is better. Let me change to this implementation.
> 
> But I would like to test it on an environment where a buddy contains more than
> one pageblock first. I can probably change MAX_ORDER of x86_64 to do it locally.
> I will report back.

I tweaked my version some more based on our discussion. Would you mind
taking a look? It survived an hour of stressing with a kernel build
and Mike's reproducer that allocates gigantics and demotes them.

Note that it applies *before* the consolidation of the free counts, as
isolation needs to be fixed before the warnings are added, to avoid
bisectability issues. The consolidation patch doesn't change it much,
except removing freepage accounting in move_freepages_block_isolate().

---

From a0460ad30a24cf73816ac40b262af0ba3723a242 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 16 Oct 2023 12:32:21 -0400
Subject: [PATCH] mm: page_isolation: prepare for hygienic freelists

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/page-isolation.h |   4 +-
 mm/internal.h                  |   4 -
 mm/page_alloc.c                | 198 +++++++++++++++++++--------------
 mm/page_isolation.c            |  96 +++++-----------
 4 files changed, 142 insertions(+), 160 deletions(-)

diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
index 8550b3c91480..c16db0067090 100644
--- a/include/linux/page-isolation.h
+++ b/include/linux/page-isolation.h
@@ -34,7 +34,9 @@ static inline bool is_migrate_isolate(int migratetype)
 #define REPORT_FAILURE	0x2
 
 void set_pageblock_migratetype(struct page *page, int migratetype);
-int move_freepages_block(struct zone *zone, struct page *page, int migratetype);
+
+bool move_freepages_block_isolate(struct zone *zone, struct page *page,
+				  int migratetype);
 
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			     int migratetype, int flags, gfp_t gfp_flags);
diff --git a/mm/internal.h b/mm/internal.h
index 3a72975425bb..0681094ad260 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -464,10 +464,6 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
 void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
 		unsigned long, enum meminit_context, struct vmem_altmap *, int);
 
-
-int split_free_page(struct page *free_page,
-			unsigned int order, unsigned long split_pfn_offset);
-
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6185b076cf90..17e9a06027c8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -834,64 +834,6 @@ static inline void __free_one_page(struct page *page,
 		page_reporting_notify_free(order);
 }
 
-/**
- * split_free_page() -- split a free page at split_pfn_offset
- * @free_page:		the original free page
- * @order:		the order of the page
- * @split_pfn_offset:	split offset within the page
- *
- * Return -ENOENT if the free page is changed, otherwise 0
- *
- * It is used when the free page crosses two pageblocks with different migratetypes
- * at split_pfn_offset within the page. The split free page will be put into
- * separate migratetype lists afterwards. Otherwise, the function achieves
- * nothing.
- */
-int split_free_page(struct page *free_page,
-			unsigned int order, unsigned long split_pfn_offset)
-{
-	struct zone *zone = page_zone(free_page);
-	unsigned long free_page_pfn = page_to_pfn(free_page);
-	unsigned long pfn;
-	unsigned long flags;
-	int free_page_order;
-	int mt;
-	int ret = 0;
-
-	if (split_pfn_offset == 0)
-		return ret;
-
-	spin_lock_irqsave(&zone->lock, flags);
-
-	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
-		ret = -ENOENT;
-		goto out;
-	}
-
-	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
-	if (likely(!is_migrate_isolate(mt)))
-		__mod_zone_freepage_state(zone, -(1UL << order), mt);
-
-	del_page_from_free_list(free_page, zone, order);
-	for (pfn = free_page_pfn;
-	     pfn < free_page_pfn + (1UL << order);) {
-		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
-
-		free_page_order = min_t(unsigned int,
-					pfn ? __ffs(pfn) : order,
-					__fls(split_pfn_offset));
-		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
-				mt, FPI_NONE);
-		pfn += 1UL << free_page_order;
-		split_pfn_offset -= (1UL << free_page_order);
-		/* we have done the first part, now switch to second part */
-		if (split_pfn_offset == 0)
-			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
-	}
-out:
-	spin_unlock_irqrestore(&zone->lock, flags);
-	return ret;
-}
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
  * try and check multiple fields with one check. The caller must do a detailed
@@ -1673,8 +1615,8 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
 	return true;
 }
 
-int move_freepages_block(struct zone *zone, struct page *page,
-			 int migratetype)
+static int move_freepages_block(struct zone *zone, struct page *page,
+				int migratetype)
 {
 	unsigned long start_pfn, end_pfn;
 
@@ -1685,6 +1627,117 @@ int move_freepages_block(struct zone *zone, struct page *page,
 	return move_freepages(zone, start_pfn, end_pfn, migratetype);
 }
 
+#ifdef CONFIG_MEMORY_ISOLATION
+/* Look for a multi-block buddy that straddles start_pfn */
+static unsigned long find_large_buddy(unsigned long start_pfn)
+{
+	int order = 0;
+	struct page *page;
+	unsigned long pfn = start_pfn;
+
+	while (!PageBuddy(page = pfn_to_page(pfn))) {
+		/* Nothing found */
+		if (++order > MAX_ORDER)
+			return start_pfn;
+		pfn &= ~0UL << order;
+	}
+
+	/*
+	 * Found a preceding buddy, but does it straddle?
+	 */
+	if (pfn + (1 << buddy_order(page)) > start_pfn)
+		return pfn;
+
+	/* Nothing found */
+	return start_pfn;
+}
+
+/* Split a multi-block buddy into its individual pageblocks */
+static void split_large_buddy(struct page *buddy, int order)
+{
+	unsigned long pfn = page_to_pfn(buddy);
+	unsigned long end = pfn + (1 << order);
+	struct zone *zone = page_zone(buddy);
+
+	lockdep_assert_held(&zone->lock);
+	VM_WARN_ON_ONCE(PageBuddy(buddy));
+
+	while (pfn < end) {
+		int mt = get_pfnblock_migratetype(buddy, pfn);
+
+		__free_one_page(buddy, pfn, zone, pageblock_order, mt, FPI_NONE);
+		pfn += pageblock_nr_pages;
+		buddy = pfn_to_page(pfn);
+	}
+}
+
+/**
+ * move_freepages_block_isolate - move free pages in block for page isolation
+ * @zone: the zone
+ * @page: the pageblock page
+ * @migratetype: migratetype to set on the pageblock
+ *
+ * This is similar to move_freepages_block(), but handles the special
+ * case encountered in page isolation, where the block of interest
+ * might be part of a larger buddy spanning multiple pageblocks.
+ *
+ * Unlike the regular page allocator path, which moves pages while
+ * stealing buddies off the freelist, page isolation is interested in
+ * arbitrary pfn ranges that may have overlapping buddies on both ends.
+ *
+ * This function handles that. Straddling buddies are split into
+ * individual pageblocks. Only the block of interest is moved.
+ *
+ * Returns %true if pages could be moved, %false otherwise.
+ */
+bool move_freepages_block_isolate(struct zone *zone, struct page *page,
+				  int migratetype)
+{
+	unsigned long start_pfn, end_pfn, pfn;
+	int nr_moved, mt;
+
+	if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
+				       NULL, NULL))
+		return false;
+
+	/* We're a tail block in a larger buddy */
+	pfn = find_large_buddy(start_pfn);
+	if (pfn != start_pfn) {
+		struct page *buddy = pfn_to_page(pfn);
+		int order = buddy_order(buddy);
+		int mt = get_pfnblock_migratetype(buddy, pfn);
+
+		if (!is_migrate_isolate(mt))
+			__mod_zone_freepage_state(zone, -(1UL << order), mt);
+		del_page_from_free_list(buddy, zone, order);
+		set_pageblock_migratetype(pfn_to_page(start_pfn), migratetype);
+		split_large_buddy(buddy, order);
+		return true;
+	}
+
+	/* We're the starting block of a larger buddy */
+	if (PageBuddy(page) && buddy_order(page) > pageblock_order) {
+		int mt = get_pfnblock_migratetype(page, pfn);
+		int order = buddy_order(page);
+
+		if (!is_migrate_isolate(mt))
+			__mod_zone_freepage_state(zone, -(1UL << order), mt);
+		del_page_from_free_list(page, zone, order);
+		set_pageblock_migratetype(page, migratetype);
+		split_large_buddy(page, order);
+		return true;
+	}
+
+	mt = get_pfnblock_migratetype(page, start_pfn);
+	nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype);
+	if (!is_migrate_isolate(mt))
+		__mod_zone_freepage_state(zone, -nr_moved, mt);
+	else if (!is_migrate_isolate(migratetype))
+		__mod_zone_freepage_state(zone, nr_moved, migratetype);
+	return true;
+}
+#endif /* CONFIG_MEMORY_ISOLATION */
+
 static void change_pageblock_range(struct page *pageblock_page,
 					int start_order, int migratetype)
 {
@@ -6318,7 +6371,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		       unsigned migratetype, gfp_t gfp_mask)
 {
 	unsigned long outer_start, outer_end;
-	int order;
 	int ret = 0;
 
 	struct compact_control cc = {
@@ -6391,29 +6443,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 	 * We don't have to hold zone->lock here because the pages are
 	 * isolated thus they won't get removed from buddy.
 	 */
-
-	order = 0;
-	outer_start = start;
-	while (!PageBuddy(pfn_to_page(outer_start))) {
-		if (++order > MAX_ORDER) {
-			outer_start = start;
-			break;
-		}
-		outer_start &= ~0UL << order;
-	}
-
-	if (outer_start != start) {
-		order = buddy_order(pfn_to_page(outer_start));
-
-		/*
-		 * outer_start page could be small order buddy page and
-		 * it doesn't include start page. Adjust outer_start
-		 * in this case to report failed page properly
-		 * on tracepoint in test_pages_isolated()
-		 */
-		if (outer_start + (1UL << order) <= start)
-			outer_start = start;
-	}
+	outer_start = find_large_buddy(start);
 
 	/* Make sure the range is really isolated. */
 	if (test_pages_isolated(outer_start, end, 0)) {
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 27ee994a57d3..b4d53545496d 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -178,16 +178,10 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
 	unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
 			migratetype, isol_flags);
 	if (!unmovable) {
-		int nr_pages;
-		int mt = get_pageblock_migratetype(page);
-
-		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
-		/* Block spans zone boundaries? */
-		if (nr_pages == -1) {
+		if (!move_freepages_block_isolate(zone, page, MIGRATE_ISOLATE)) {
 			spin_unlock_irqrestore(&zone->lock, flags);
 			return -EBUSY;
 		}
-		__mod_zone_freepage_state(zone, -nr_pages, mt);
 		zone->nr_isolate_pageblock++;
 		spin_unlock_irqrestore(&zone->lock, flags);
 		return 0;
@@ -254,13 +248,11 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
 	 * allocation.
 	 */
 	if (!isolated_page) {
-		int nr_pages = move_freepages_block(zone, page, migratetype);
 		/*
 		 * Isolating this block already succeeded, so this
 		 * should not fail on zone boundaries.
 		 */
-		WARN_ON_ONCE(nr_pages == -1);
-		__mod_zone_freepage_state(zone, nr_pages, migratetype);
+		WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
 	} else {
 		set_pageblock_migratetype(page, migratetype);
 		__putback_isolated_page(page, order, migratetype);
@@ -373,26 +365,29 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 
 		VM_BUG_ON(!page);
 		pfn = page_to_pfn(page);
-		/*
-		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
-		 * free pages in [start_pfn, boundary_pfn), its head page will
-		 * always be in the range.
-		 */
+
 		if (PageBuddy(page)) {
 			int order = buddy_order(page);
 
-			if (pfn + (1UL << order) > boundary_pfn) {
-				/* free page changed before split, check it again */
-				if (split_free_page(page, order, boundary_pfn - pfn))
-					continue;
-			}
+			/* move_freepages_block_isolate() handled this */
+			VM_WARN_ON_ONCE(pfn + (1 << order) > boundary_pfn);
 
 			pfn += 1UL << order;
 			continue;
 		}
+
 		/*
-		 * migrate compound pages then let the free page handling code
-		 * above do the rest. If migration is not possible, just fail.
+		 * If a compound page is straddling our block, attempt
+		 * to migrate it out of the way.
+		 *
+		 * We don't have to worry about this creating a large
+		 * free page that straddles into our block: gigantic
+		 * pages are freed as order-0 chunks, and LRU pages
+		 * (currently) do not exceed pageblock_order.
+		 *
+		 * The block of interest has already been marked
+		 * MIGRATE_ISOLATE above, so when migration is done it
+		 * will free its pages onto the correct freelists.
 		 */
 		if (PageCompound(page)) {
 			struct page *head = compound_head(page);
@@ -403,16 +398,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				pfn = head_pfn + nr_pages;
 				continue;
 			}
+
+			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 			/*
-			 * hugetlb, lru compound (THP), and movable compound pages
-			 * can be migrated. Otherwise, fail the isolation.
+			 * hugetlb, and movable compound pages can be
+			 * migrated. Otherwise, fail the isolation.
 			 */
-			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
-				int order;
-				unsigned long outer_pfn;
-				int page_mt = get_pageblock_migratetype(page);
-				bool isolate_page = !is_migrate_isolate_page(page);
+			if (PageHuge(page) || __PageMovable(page)) {
 				struct compact_control cc = {
 					.nr_migratepages = 0,
 					.order = -1,
@@ -425,52 +419,12 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				};
 				INIT_LIST_HEAD(&cc.migratepages);
 
-				/*
-				 * XXX: mark the page as MIGRATE_ISOLATE so that
-				 * no one else can grab the freed page after migration.
-				 * Ideally, the page should be freed as two separate
-				 * pages to be added into separate migratetype free
-				 * lists.
-				 */
-				if (isolate_page) {
-					ret = set_migratetype_isolate(page, page_mt,
-						flags, head_pfn, head_pfn + nr_pages);
-					if (ret)
-						goto failed;
-				}
-
 				ret = __alloc_contig_migrate_range(&cc, head_pfn,
 							head_pfn + nr_pages);
-
-				/*
-				 * restore the page's migratetype so that it can
-				 * be split into separate migratetype free lists
-				 * later.
-				 */
-				if (isolate_page)
-					unset_migratetype_isolate(page, page_mt);
-
 				if (ret)
 					goto failed;
-				/*
-				 * reset pfn to the head of the free page, so
-				 * that the free page handling code above can split
-				 * the free page to the right migratetype list.
-				 *
-				 * head_pfn is not used here as a hugetlb page order
-				 * can be bigger than MAX_ORDER, but after it is
-				 * freed, the free page order is not. Use pfn within
-				 * the range to find the head of the free page.
-				 */
-				order = 0;
-				outer_pfn = pfn;
-				while (!PageBuddy(pfn_to_page(outer_pfn))) {
-					/* stop if we cannot find the free page */
-					if (++order > MAX_ORDER)
-						goto failed;
-					outer_pfn &= ~0UL << order;
-				}
-				pfn = outer_pfn;
+
+				pfn = head_pfn + nr_pages;
 				continue;
 			} else
 #endif
Zi Yan Oct. 16, 2023, 7:49 p.m. UTC | #43
On 16 Oct 2023, at 14:51, Johannes Weiner wrote:

> On Mon, Oct 16, 2023 at 11:00:33AM -0400, Zi Yan wrote:
>> On 16 Oct 2023, at 10:37, Johannes Weiner wrote:
>>
>>> On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
>>>>> The attached patch has all the suggested changes, let me know how it
>>>>> looks to you. Thanks.
>>>>
>>>> The one I sent has free page accounting issues. The attached one fixes them.
>>>
>>> Do you still have the warnings? I wonder what went wrong.
>>
>> No warnings. But something with the code:
>>
>> 1. in your version, split_free_page() is called without changing any pageblock
>> migratetypes, then split_free_page() is just a no-op, since the page is
>> just deleted from the free list, then freed via different orders. Buddy allocator
>> will merge them back.
>
> Hm not quite.
>
> If it's the tail block of a buddy, I update its type before
> splitting. The splitting loop looks up the type of each block for
> sorting it onto freelists.
>
> If it's the head block, yes I split it first according to its old
> type. But then I let it fall through to scanning the block, which will
> find that buddy, update its type and move it.

That is the issue, since split_free_page() assumes the pageblocks of
that free page have different types. It basically just frees the page
with different small orders summed up to the original free page order.
If all pageblocks of the free page have the same migratetype, __free_one_page()
will merge these small order pages back to the original order free page.
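
Concretely, with all blocks still the same type, the split loop in
split_free_page() boils down to something like this (pageblock_order 9,
order-11 buddy, so four chunks):

	__free_one_page(pfn_to_page(pfn),        pfn,        zone, 9, mt, FPI_NONE);
	__free_one_page(pfn_to_page(pfn +  512), pfn +  512, zone, 9, mt, FPI_NONE);
	__free_one_page(pfn_to_page(pfn + 1024), pfn + 1024, zone, 9, mt, FPI_NONE);
	__free_one_page(pfn_to_page(pfn + 1536), pfn + 1536, zone, 9, mt, FPI_NONE);

and since every chunk is freed with the same mt, the buddies coalesce as
they go, so the end state is the original order-11 free page again.
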

>
>> 2. in my version, I set pageblock migratetype to new_mt before split_free_page(),
>> but it causes free page accounting issues, since in the case of head, free pages
>> are deleted from new_mt when they are in old_mt free list and the accounting
>> decreases new_mt free page number instead of old_mt one.
>
> Right, that makes sense.
>
>> Basically, split_free_page() is awkward, as it relies on preset migratetypes: the
>> migratetypes get changed without the free pages being deleted from the list first.
>> That is why I came up with the new split_free_page() below.
>
> Yeah, the in-between thing is bad. Either it fixes the migratetype
> before deletion, or it doesn't do the deletion. I'm thinking it would
> be simpler to move the deletion out instead.

Yes and no. After deletion, a free page no longer has PageBuddy set and
has buddy_order information cleared. Either we reset PageBuddy and order
to the deleted free page, or split_free_page() needs to be changed to
accept pages without the information (basically remove the PageBuddy
and order check code).
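
(The second option is essentially what the patch below does: the order is
read while the page is still PageBuddy and then handed to the split, e.g.
in the tail-buddy branch:

	int order = buddy_order(buddy);		/* still PageBuddy here */
	...
	del_page_from_free_list(buddy, zone, order);
	...
	split_large_buddy(buddy, order);	/* expects PageBuddy to be gone */

so split_large_buddy() never has to look the order up itself.)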

>>>> @@ -883,6 +886,10 @@ int split_free_page(struct page *free_page,
>>>>  	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
>>>>  	del_page_from_free_list(free_page, zone, order, mt);
>>>>
>>>> +	set_pageblock_migratetype(free_page, mt1);
>>>> +	set_pageblock_migratetype(pfn_to_page(free_page_pfn + split_pfn_offset),
>>>> +				  mt2);
>>>> +
>>>>  	for (pfn = free_page_pfn;
>>>>  	     pfn < free_page_pfn + (1UL << order);) {
>>>>  		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
>>>
>>> I don't think this is quite right.
>>>
>>> With CONFIG_ARCH_FORCE_MAX_ORDER it's possible that we're dealing with
>>> a buddy that is more than two blocks:
>>>
>>> [pageblock 0][pageblock 1][pageblock 2][pageblock 3]
>>> [buddy                                             ]
>>>                                        [isolate range ..
>>>
>>> That for loop splits the buddy into 4 blocks. The code above would set
>>> pageblock 0 to old_mt, and pageblock 1 to new_mt. But it should only
>>> set pageblock 3 to new_mt.
>>
>> OK. I think I need to fix split_free_page().
>>
>> Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy have more than one
>> pageblock and in turn makes an in-use page have more than one pageblock,
>> we will have problems. Since in isolate_single_pageblock(), an in-use page
>> can have part of its pageblock set to a different migratetype and be freed,
>> leaving the free page with unmatched migratetypes. We might need to
>> free pages at pageblock_order if their orders are bigger than pageblock_order.
>
> Is this a practical issue? You mentioned that right now only gigantic
> pages can be larger than a pageblock, and those are freed in order-0
> chunks.

Only if the system allocates a page (non hugetlb pages) with >pageblock_order
and frees it with the same order. I just do not know if such pages exist on
arches other than x86. Maybe I am just overthinking it.

>
>>> How about pulling the freelist removal out of split_free_page()?
>>>
>>> 	del_page_from_freelist(huge_buddy);
>>> 	set_pageblock_migratetype(start_page, MIGRATE_ISOLATE);
>>> 	split_free_page(huge_buddy, buddy_order(), pageblock_nr_pages);
>>> 	return pageblock_nr_pages;
>>
>> Yes, this is better. Let me change to this implementation.
>>
>> But I would like to test it on an environment where a buddy contains more than
>> one pageblock first. I can probably change MAX_ORDER of x86_64 to do it locally.
>> I will report back.
>
> I tweaked my version some more based on our discussion. Would you mind
> taking a look? It survived an hour of stressing with a kernel build
> and Mike's reproducer that allocates gigantics and demotes them.
>
> Note that it applies *before* the consolidation of the free counts, as
> isolation needs to be fixed before the warnings are added, to avoid
> bisectability issues. The consolidation patch doesn't change it much,
> except removing freepage accounting in move_freepages_block_isolate().
>
> ---
>
> From a0460ad30a24cf73816ac40b262af0ba3723a242 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Mon, 16 Oct 2023 12:32:21 -0400
> Subject: [PATCH] mm: page_isolation: prepare for hygienic freelists
>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/page-isolation.h |   4 +-
>  mm/internal.h                  |   4 -
>  mm/page_alloc.c                | 198 +++++++++++++++++++--------------
>  mm/page_isolation.c            |  96 +++++-----------
>  4 files changed, 142 insertions(+), 160 deletions(-)
>
> diff --git a/include/linux/page-isolation.h b/include/linux/page-isolation.h
> index 8550b3c91480..c16db0067090 100644
> --- a/include/linux/page-isolation.h
> +++ b/include/linux/page-isolation.h
> @@ -34,7 +34,9 @@ static inline bool is_migrate_isolate(int migratetype)
>  #define REPORT_FAILURE	0x2
>
>  void set_pageblock_migratetype(struct page *page, int migratetype);
> -int move_freepages_block(struct zone *zone, struct page *page, int migratetype);
> +
> +bool move_freepages_block_isolate(struct zone *zone, struct page *page,
> +				  int migratetype);
>
>  int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
>  			     int migratetype, int flags, gfp_t gfp_flags);
> diff --git a/mm/internal.h b/mm/internal.h
> index 3a72975425bb..0681094ad260 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -464,10 +464,6 @@ extern void *memmap_alloc(phys_addr_t size, phys_addr_t align,
>  void memmap_init_range(unsigned long, int, unsigned long, unsigned long,
>  		unsigned long, enum meminit_context, struct vmem_altmap *, int);
>
> -
> -int split_free_page(struct page *free_page,
> -			unsigned int order, unsigned long split_pfn_offset);
> -
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6185b076cf90..17e9a06027c8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -834,64 +834,6 @@ static inline void __free_one_page(struct page *page,
>  		page_reporting_notify_free(order);
>  }
>
> -/**
> - * split_free_page() -- split a free page at split_pfn_offset
> - * @free_page:		the original free page
> - * @order:		the order of the page
> - * @split_pfn_offset:	split offset within the page
> - *
> - * Return -ENOENT if the free page is changed, otherwise 0
> - *
> - * It is used when the free page crosses two pageblocks with different migratetypes
> - * at split_pfn_offset within the page. The split free page will be put into
> - * separate migratetype lists afterwards. Otherwise, the function achieves
> - * nothing.
> - */
> -int split_free_page(struct page *free_page,
> -			unsigned int order, unsigned long split_pfn_offset)
> -{
> -	struct zone *zone = page_zone(free_page);
> -	unsigned long free_page_pfn = page_to_pfn(free_page);
> -	unsigned long pfn;
> -	unsigned long flags;
> -	int free_page_order;
> -	int mt;
> -	int ret = 0;
> -
> -	if (split_pfn_offset == 0)
> -		return ret;
> -
> -	spin_lock_irqsave(&zone->lock, flags);
> -
> -	if (!PageBuddy(free_page) || buddy_order(free_page) != order) {
> -		ret = -ENOENT;
> -		goto out;
> -	}
> -
> -	mt = get_pfnblock_migratetype(free_page, free_page_pfn);
> -	if (likely(!is_migrate_isolate(mt)))
> -		__mod_zone_freepage_state(zone, -(1UL << order), mt);
> -
> -	del_page_from_free_list(free_page, zone, order);
> -	for (pfn = free_page_pfn;
> -	     pfn < free_page_pfn + (1UL << order);) {
> -		int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);
> -
> -		free_page_order = min_t(unsigned int,
> -					pfn ? __ffs(pfn) : order,
> -					__fls(split_pfn_offset));
> -		__free_one_page(pfn_to_page(pfn), pfn, zone, free_page_order,
> -				mt, FPI_NONE);
> -		pfn += 1UL << free_page_order;
> -		split_pfn_offset -= (1UL << free_page_order);
> -		/* we have done the first part, now switch to second part */
> -		if (split_pfn_offset == 0)
> -			split_pfn_offset = (1UL << order) - (pfn - free_page_pfn);
> -	}
> -out:
> -	spin_unlock_irqrestore(&zone->lock, flags);
> -	return ret;
> -}
>  /*
>   * A bad page could be due to a number of fields. Instead of multiple branches,
>   * try and check multiple fields with one check. The caller must do a detailed
> @@ -1673,8 +1615,8 @@ static bool prep_move_freepages_block(struct zone *zone, struct page *page,
>  	return true;
>  }
>
> -int move_freepages_block(struct zone *zone, struct page *page,
> -			 int migratetype)
> +static int move_freepages_block(struct zone *zone, struct page *page,
> +				int migratetype)
>  {
>  	unsigned long start_pfn, end_pfn;
>
> @@ -1685,6 +1627,117 @@ int move_freepages_block(struct zone *zone, struct page *page,
>  	return move_freepages(zone, start_pfn, end_pfn, migratetype);
>  }
>
> +#ifdef CONFIG_MEMORY_ISOLATION
> +/* Look for a multi-block buddy that straddles start_pfn */
> +static unsigned long find_large_buddy(unsigned long start_pfn)
> +{
> +	int order = 0;
> +	struct page *page;
> +	unsigned long pfn = start_pfn;
> +
> +	while (!PageBuddy(page = pfn_to_page(pfn))) {
> +		/* Nothing found */
> +		if (++order > MAX_ORDER)
> +			return start_pfn;
> +		pfn &= ~0UL << order;
> +	}
> +
> +	/*
> +	 * Found a preceding buddy, but does it straddle?
> +	 */
> +	if (pfn + (1 << buddy_order(page)) > start_pfn)
> +		return pfn;
> +
> +	/* Nothing found */
> +	return start_pfn;
> +}
> +
> +/* Split a multi-block buddy into its individual pageblocks */
> +static void split_large_buddy(struct page *buddy, int order)
> +{
> +	unsigned long pfn = page_to_pfn(buddy);
> +	unsigned long end = pfn + (1 << order);
> +	struct zone *zone = page_zone(buddy);
> +
> +	lockdep_assert_held(&zone->lock);
> +	VM_WARN_ON_ONCE(PageBuddy(buddy));
> +
> +	while (pfn < end) {
> +		int mt = get_pfnblock_migratetype(buddy, pfn);
> +
> +		__free_one_page(buddy, pfn, zone, pageblock_order, mt, FPI_NONE);
> +		pfn += pageblock_nr_pages;
> +		buddy = pfn_to_page(pfn);
> +	}
> +}
> +
> +/**
> + * move_freepages_block_isolate - move free pages in block for page isolation
> + * @zone: the zone
> + * @page: the pageblock page
> + * @migratetype: migratetype to set on the pageblock
> + *
> + * This is similar to move_freepages_block(), but handles the special
> + * case encountered in page isolation, where the block of interest
> + * might be part of a larger buddy spanning multiple pageblocks.
> + *
> + * Unlike the regular page allocator path, which moves pages while
> + * stealing buddies off the freelist, page isolation is interested in
> + * arbitrary pfn ranges that may have overlapping buddies on both ends.
> + *
> + * This function handles that. Straddling buddies are split into
> + * individual pageblocks. Only the block of interest is moved.
> + *
> + * Returns %true if pages could be moved, %false otherwise.
> + */
> +bool move_freepages_block_isolate(struct zone *zone, struct page *page,
> +				  int migratetype)
> +{
> +	unsigned long start_pfn, end_pfn, pfn;
> +	int nr_moved, mt;
> +
> +	if (!prep_move_freepages_block(zone, page, &start_pfn, &end_pfn,
> +				       NULL, NULL))
> +		return false;
> +
> +	/* We're a tail block in a larger buddy */
> +	pfn = find_large_buddy(start_pfn);
> +	if (pfn != start_pfn) {
> +		struct page *buddy = pfn_to_page(pfn);
> +		int order = buddy_order(buddy);
> +		int mt = get_pfnblock_migratetype(buddy, pfn);
> +
> +		if (!is_migrate_isolate(mt))
> +			__mod_zone_freepage_state(zone, -(1UL << order), mt);
> +		del_page_from_free_list(buddy, zone, order);
> +		set_pageblock_migratetype(pfn_to_page(start_pfn), migratetype);
> +		split_large_buddy(buddy, order);
> +		return true;
> +	}
> +
> +	/* We're the starting block of a larger buddy */
> +	if (PageBuddy(page) && buddy_order(page) > pageblock_order) {
> +		int mt = get_pfnblock_migratetype(page, pfn);
> +		int order = buddy_order(page);
> +
> +		if (!is_migrate_isolate(mt))
> +			__mod_zone_freepage_state(zone, -(1UL << order), mt);
> +		del_page_from_free_list(page, zone, order);
> +		set_pageblock_migratetype(page, migratetype);
> +		split_large_buddy(page, order);
> +		return true;
> +	}
> +
> +	mt = get_pfnblock_migratetype(page, start_pfn);
> +	nr_moved = move_freepages(zone, start_pfn, end_pfn, migratetype);
> +	if (!is_migrate_isolate(mt))
> +		__mod_zone_freepage_state(zone, -nr_moved, mt);
> +	else if (!is_migrate_isolate(migratetype))
> +		__mod_zone_freepage_state(zone, nr_moved, migratetype);
> +	return true;
> +}
> +#endif /* CONFIG_MEMORY_ISOLATION */
> +
>  static void change_pageblock_range(struct page *pageblock_page,
>  					int start_order, int migratetype)
>  {
> @@ -6318,7 +6371,6 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  		       unsigned migratetype, gfp_t gfp_mask)
>  {
>  	unsigned long outer_start, outer_end;
> -	int order;
>  	int ret = 0;
>
>  	struct compact_control cc = {
> @@ -6391,29 +6443,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
>  	 * We don't have to hold zone->lock here because the pages are
>  	 * isolated thus they won't get removed from buddy.
>  	 */
> -
> -	order = 0;
> -	outer_start = start;
> -	while (!PageBuddy(pfn_to_page(outer_start))) {
> -		if (++order > MAX_ORDER) {
> -			outer_start = start;
> -			break;
> -		}
> -		outer_start &= ~0UL << order;
> -	}
> -
> -	if (outer_start != start) {
> -		order = buddy_order(pfn_to_page(outer_start));
> -
> -		/*
> -		 * outer_start page could be small order buddy page and
> -		 * it doesn't include start page. Adjust outer_start
> -		 * in this case to report failed page properly
> -		 * on tracepoint in test_pages_isolated()
> -		 */
> -		if (outer_start + (1UL << order) <= start)
> -			outer_start = start;
> -	}
> +	outer_start = find_large_buddy(start);
>
>  	/* Make sure the range is really isolated. */
>  	if (test_pages_isolated(outer_start, end, 0)) {
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index 27ee994a57d3..b4d53545496d 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -178,16 +178,10 @@ static int set_migratetype_isolate(struct page *page, int migratetype, int isol_
>  	unmovable = has_unmovable_pages(check_unmovable_start, check_unmovable_end,
>  			migratetype, isol_flags);
>  	if (!unmovable) {
> -		int nr_pages;
> -		int mt = get_pageblock_migratetype(page);
> -
> -		nr_pages = move_freepages_block(zone, page, MIGRATE_ISOLATE);
> -		/* Block spans zone boundaries? */
> -		if (nr_pages == -1) {
> +		if (!move_freepages_block_isolate(zone, page, MIGRATE_ISOLATE)) {
>  			spin_unlock_irqrestore(&zone->lock, flags);
>  			return -EBUSY;
>  		}
> -		__mod_zone_freepage_state(zone, -nr_pages, mt);
>  		zone->nr_isolate_pageblock++;
>  		spin_unlock_irqrestore(&zone->lock, flags);
>  		return 0;
> @@ -254,13 +248,11 @@ static void unset_migratetype_isolate(struct page *page, int migratetype)
>  	 * allocation.
>  	 */
>  	if (!isolated_page) {
> -		int nr_pages = move_freepages_block(zone, page, migratetype);
>  		/*
>  		 * Isolating this block already succeeded, so this
>  		 * should not fail on zone boundaries.
>  		 */
> -		WARN_ON_ONCE(nr_pages == -1);
> -		__mod_zone_freepage_state(zone, nr_pages, migratetype);
> +		WARN_ON_ONCE(!move_freepages_block_isolate(zone, page, migratetype));
>  	} else {
>  		set_pageblock_migratetype(page, migratetype);
>  		__putback_isolated_page(page, order, migratetype);
> @@ -373,26 +365,29 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>
>  		VM_BUG_ON(!page);
>  		pfn = page_to_pfn(page);
> -		/*
> -		 * start_pfn is MAX_ORDER_NR_PAGES aligned, if there is any
> -		 * free pages in [start_pfn, boundary_pfn), its head page will
> -		 * always be in the range.
> -		 */
> +
>  		if (PageBuddy(page)) {
>  			int order = buddy_order(page);
>
> -			if (pfn + (1UL << order) > boundary_pfn) {
> -				/* free page changed before split, check it again */
> -				if (split_free_page(page, order, boundary_pfn - pfn))
> -					continue;
> -			}
> +			/* move_freepages_block_isolate() handled this */
> +			VM_WARN_ON_ONCE(pfn + (1 << order) > boundary_pfn);
>
>  			pfn += 1UL << order;
>  			continue;
>  		}
> +
>  		/*
> -		 * migrate compound pages then let the free page handling code
> -		 * above do the rest. If migration is not possible, just fail.
> +		 * If a compound page is straddling our block, attempt
> +		 * to migrate it out of the way.
> +		 *
> +		 * We don't have to worry about this creating a large
> +		 * free page that straddles into our block: gigantic
> +		 * pages are freed as order-0 chunks, and LRU pages
> +		 * (currently) do not exceed pageblock_order.
> +		 *
> +		 * The block of interest has already been marked
> +		 * MIGRATE_ISOLATE above, so when migration is done it
> +		 * will free its pages onto the correct freelists.
>  		 */
>  		if (PageCompound(page)) {
>  			struct page *head = compound_head(page);
> @@ -403,16 +398,15 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>  				pfn = head_pfn + nr_pages;
>  				continue;
>  			}
> +
> +			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
> +
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  			/*
> -			 * hugetlb, lru compound (THP), and movable compound pages
> -			 * can be migrated. Otherwise, fail the isolation.
> +			 * hugetlb, and movable compound pages can be
> +			 * migrated. Otherwise, fail the isolation.
>  			 */
> -			if (PageHuge(page) || PageLRU(page) || __PageMovable(page)) {
> -				int order;
> -				unsigned long outer_pfn;
> -				int page_mt = get_pageblock_migratetype(page);
> -				bool isolate_page = !is_migrate_isolate_page(page);
> +			if (PageHuge(page) || __PageMovable(page)) {
>  				struct compact_control cc = {
>  					.nr_migratepages = 0,
>  					.order = -1,
> @@ -425,52 +419,12 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>  				};
>  				INIT_LIST_HEAD(&cc.migratepages);
>
> -				/*
> -				 * XXX: mark the page as MIGRATE_ISOLATE so that
> -				 * no one else can grab the freed page after migration.
> -				 * Ideally, the page should be freed as two separate
> -				 * pages to be added into separate migratetype free
> -				 * lists.
> -				 */
> -				if (isolate_page) {
> -					ret = set_migratetype_isolate(page, page_mt,
> -						flags, head_pfn, head_pfn + nr_pages);
> -					if (ret)
> -						goto failed;
> -				}
> -
>  				ret = __alloc_contig_migrate_range(&cc, head_pfn,
>  							head_pfn + nr_pages);
> -
> -				/*
> -				 * restore the page's migratetype so that it can
> -				 * be split into separate migratetype free lists
> -				 * later.
> -				 */
> -				if (isolate_page)
> -					unset_migratetype_isolate(page, page_mt);
> -
>  				if (ret)
>  					goto failed;
> -				/*
> -				 * reset pfn to the head of the free page, so
> -				 * that the free page handling code above can split
> -				 * the free page to the right migratetype list.
> -				 *
> -				 * head_pfn is not used here as a hugetlb page order
> -				 * can be bigger than MAX_ORDER, but after it is
> -				 * freed, the free page order is not. Use pfn within
> -				 * the range to find the head of the free page.
> -				 */
> -				order = 0;
> -				outer_pfn = pfn;
> -				while (!PageBuddy(pfn_to_page(outer_pfn))) {
> -					/* stop if we cannot find the free page */
> -					if (++order > MAX_ORDER)
> -						goto failed;
> -					outer_pfn &= ~0UL << order;
> -				}
> -				pfn = outer_pfn;
> +
> +				pfn = head_pfn + nr_pages;
>  				continue;
>  			} else
>  #endif
> -- 
> 2.42.0

It looks good to me. Thanks.

Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi
Johannes Weiner Oct. 16, 2023, 8:26 p.m. UTC | #44
On Mon, Oct 16, 2023 at 03:49:49PM -0400, Zi Yan wrote:
> On 16 Oct 2023, at 14:51, Johannes Weiner wrote:
> 
> > On Mon, Oct 16, 2023 at 11:00:33AM -0400, Zi Yan wrote:
> >> On 16 Oct 2023, at 10:37, Johannes Weiner wrote:
> >>
> >>> On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
> >>>>> The attached patch has all the suggested changes, let me know how it
> >>>>> looks to you. Thanks.
> >>>>
> >>>> The one I sent has free page accounting issues. The attached one fixes them.
> >>>
> >>> Do you still have the warnings? I wonder what went wrong.
> >>
> >> No warnings. But something with the code:
> >>
> >> 1. in your version, split_free_page() is called without changing any pageblock
> >> migratetypes, then split_free_page() is just a no-op, since the page is
> >> just deleted from the free list, then freed via different orders. Buddy allocator
> >> will merge them back.
> >
> > Hm not quite.
> >
> > If it's the tail block of a buddy, I update its type before
> > splitting. The splitting loop looks up the type of each block for
> > sorting it onto freelists.
> >
> > If it's the head block, yes I split it first according to its old
> > type. But then I let it fall through to scanning the block, which will
> > find that buddy, update its type and move it.
> 
> That is the issue, since split_free_page() assumes the pageblocks of
> that free page have different types. It basically just frees the page
> with different small orders summed up to the original free page order.
> If all pageblocks of the free page have the same migratetype, __free_one_page()
> will merge these small order pages back to the original order free page.

duh, of course, you're right. Thanks for patiently explaining this.

> >> 2. in my version, I set pageblock migratetype to new_mt before split_free_page(),
> >> but it causes free page accounting issues, since in the case of head, free pages
> >> are deleted from new_mt when they are in old_mt free list and the accounting
> >> decreases new_mt free page number instead of old_mt one.
> >
> > Right, that makes sense.
> >
> >> Basically, split_free_page() is awkward, as it relies on preset migratetypes: the
> >> migratetypes get changed without the free pages being deleted from the list first.
> >> That is why I came up with the new split_free_page() below.
> >
> > Yeah, the in-between thing is bad. Either it fixes the migratetype
> > before deletion, or it doesn't do the deletion. I'm thinking it would
> > be simpler to move the deletion out instead.
> 
> Yes and no. After deletion, a free page no longer has PageBuddy set and
> has buddy_order information cleared. Either we reset PageBuddy and order
> to the deleted free page, or split_free_page() needs to be changed to
> accept pages without the information (basically remove the PageBuddy
> and order check code).

Good point, that requires extra care.

It's correct in the code now, but it deserves a comment, especially
because of the "buddy" naming in the new split function.
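
Something along these lines, maybe (wording to be refined):

	/*
	 * Split a free page of the given order into pageblock-sized
	 * chunks. The caller must already have taken it off the
	 * freelist: @buddy is not PageBuddy anymore and its order has
	 * to be passed in explicitly.
	 */
	static void split_large_buddy(struct page *buddy, int order)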

> >> Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy have more than one
> >> pageblock and in turn makes an in-use page have more than one pageblock,
> >> we will have problems. Since in isolate_single_pageblock(), an in-use page
> >> can have part of its pageblock set to a different migratetype and be freed,
> >> leaving the free page with unmatched migratetypes. We might need to
> >> free pages at pageblock_order if their orders are bigger than pageblock_order.
> >
> > Is this a practical issue? You mentioned that right now only gigantic
> > pages can be larger than a pageblock, and those are freed in order-0
> > chunks.
> 
> Only if the system allocates a page (non hugetlb pages) with >pageblock_order
> and frees it with the same order. I just do not know if such pages exist on
> arches other than x86. Maybe I am just overthinking it.

Hm, I removed LRU pages from the handling (and added the warning) but
I left in PageMovable(). The only users are z3fold, zsmalloc and
memory ballooning. AFAICS none of them can be bigger than a pageblock.
Let me remove that and add a warning for that case as well.

This way, we only attempt to migrate hugetlb, where we know the free
path - and get warnings for anything else that's larger than expected.

This seems like the safest option. On the off chance that there is a
regression, it won't jeopardize anybody's systems, while the warning
provides all the information we need to debug what's going on.
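
Roughly, on top of the patch (untested sketch): drop __PageMovable() from
the check and widen the warning, i.e.

	VM_WARN_ON_ONCE_PAGE(!PageHuge(page), page);
	...
	if (PageHuge(page)) {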

> > From a0460ad30a24cf73816ac40b262af0ba3723a242 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@cmpxchg.org>
> > Date: Mon, 16 Oct 2023 12:32:21 -0400
> > Subject: [PATCH] mm: page_isolation: prepare for hygienic freelists
> >
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

> It looks good to me. Thanks.
> 
> Reviewed-by: Zi Yan <ziy@nvidia.com>

Thank you for all your help!
Johannes Weiner Oct. 16, 2023, 8:39 p.m. UTC | #45
On Mon, Oct 16, 2023 at 04:26:30PM -0400, Johannes Weiner wrote:
> On Mon, Oct 16, 2023 at 03:49:49PM -0400, Zi Yan wrote:
> > On 16 Oct 2023, at 14:51, Johannes Weiner wrote:
> > 
> > > On Mon, Oct 16, 2023 at 11:00:33AM -0400, Zi Yan wrote:
> > >> On 16 Oct 2023, at 10:37, Johannes Weiner wrote:
> > >>
> > >>> On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
> > >>>>> The attached patch has all the suggested changes, let me know how it
> > >>>>> looks to you. Thanks.
> > >>>>
> > >>>> The one I sent has free page accounting issues. The attached one fixes them.
> > >>>
> > >>> Do you still have the warnings? I wonder what went wrong.
> > >>
> > >> No warnings. But I did notice something with the code:
> > >>
> > >> 1. in your version, split_free_page() is called without changing any pageblock
> > >> migratetypes first, so split_free_page() is effectively a no-op: the page is
> > >> just deleted from the free list and then freed at smaller orders, and the
> > >> buddy allocator will merge them back.
> > >
> > > Hm not quite.
> > >
> > > If it's the tail block of a buddy, I update its type before
> > > splitting. The splitting loop looks up the type of each block for
> > > sorting it onto freelists.
> > >
> > > If it's the head block, yes I split it first according to its old
> > > type. But then I let it fall through to scanning the block, which will
> > > find that buddy, update its type and move it.
> > 
> > That is the issue, since split_free_page() assumes the pageblocks of
> > that free page have different types. It basically just frees the page
> > as smaller-order pages whose orders sum up to the original free page order.
> > If all pageblocks of the free page have the same migratetype, __free_one_page()
> > will merge these smaller-order pages back into the original-order free page.
> 
> duh, of course, you're right. Thanks for patiently explaining this.
> 
> > >> 2. in my version, I set the pageblock migratetype to new_mt before split_free_page(),
> > >> but that causes free page accounting issues: in the head case, the free pages
> > >> are deleted from new_mt while they still sit on the old_mt free list, so the
> > >> accounting decreases the new_mt free page count instead of the old_mt one.
> > >
> > > Right, that makes sense.
> > >
> > >> Basically, split_free_page() is awkward as it relies on preset migratetypes:
> > >> the migratetypes get changed without deleting the free pages from the list first.
> > >> That is why I came up with the new split_free_page() below.
> > >
> > > Yeah, the in-between thing is bad. Either it fixes the migratetype
> > > before deletion, or it doesn't do the deletion. I'm thinking it would
> > > be simpler to move the deletion out instead.
> > 
> > Yes and no. After deletion, a free page no longer has PageBuddy set and
> > its buddy_order information is cleared. Either we restore PageBuddy and the
> > order on the deleted free page, or split_free_page() needs to be changed to
> > accept pages without that information (basically remove the PageBuddy
> > and order check code).
> 
> Good point, that requires extra care.
> 
> It's correct in the code now, but it deserves a comment, especially
> because of the "buddy" naming in the new split function.
> 
> > >> Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy span more than one
> > >> pageblock, and in turn make an in-use page span more than one pageblock,
> > >> we will have problems: in isolate_single_pageblock(), an in-use page
> > >> can have part of its pageblocks set to a different migratetype and then be freed,
> > >> resulting in a free page with mismatched migratetypes. We might need to
> > >> free pages at pageblock_order if their orders are bigger than pageblock_order.
> > >
> > > Is this a practical issue? You mentioned that right now only gigantic
> > > pages can be larger than a pageblock, and those are freed in order-0
> > > chunks.
> > 
> > Only if the system allocates a (non-hugetlb) page with order >pageblock_order
> > and frees it at the same order. I just do not know whether such pages exist on
> > arches other than x86. Maybe I am just overthinking it.
> 
> Hm, I removed LRU pages from the handling (and added the warning) but
> I left in PageMovable(). The only users are z3fold, zsmalloc and
> memory ballooning. AFAICS none of them can be bigger than a pageblock.
> Let me remove that and add a warning for that case as well.
> 
> This way, we only attempt to migrate hugetlb, where we know the free
> path - and get warnings for anything else that's larger than expected.
> 
> This seems like the safest option. On the off chance that there is a
> regression, it won't jeopardize anybody's systems, while the warning
> provides all the information we need to debug what's going on.

This delta on top?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b5292ad9860c..0da7c61af37e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1628,7 +1628,7 @@ static int move_freepages_block(struct zone *zone, struct page *page,
 }
 
 #ifdef CONFIG_MEMORY_ISOLATION
-/* Look for a multi-block buddy that straddles start_pfn */
+/* Look for a buddy that straddles start_pfn */
 static unsigned long find_large_buddy(unsigned long start_pfn)
 {
 	int order = 0;
@@ -1652,7 +1652,7 @@ static unsigned long find_large_buddy(unsigned long start_pfn)
 	return start_pfn;
 }
 
-/* Split a multi-block buddy into its individual pageblocks */
+/* Split a multi-block free page into its individual pageblocks */
 static void split_large_buddy(struct zone *zone, struct page *page,
 			      unsigned long pfn, int order)
 {
@@ -1661,6 +1661,9 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 	VM_WARN_ON_ONCE(order < pageblock_order);
 	VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1));
 
+	/* Caller removed page from freelist, buddy info cleared! */
+	VM_WARN_ON_ONCE(PageBuddy(page));
+
 	while (pfn != end_pfn) {
 		int mt = get_pfnblock_migratetype(page, pfn);
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b4d53545496d..c8b3c0699683 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -399,14 +399,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 				continue;
 			}
 
-			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
-
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
-			/*
-			 * hugetlb, and movable compound pages can be
-			 * migrated. Otherwise, fail the isolation.
-			 */
-			if (PageHuge(page) || __PageMovable(page)) {
+			if (PageHuge(page)) {
 				struct compact_control cc = {
 					.nr_migratepages = 0,
 					.order = -1,
@@ -426,9 +420,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
 
 				pfn = head_pfn + nr_pages;
 				continue;
-			} else
+			}
+
+			/*
+			 * These pages are movable too, but they're
+			 * not expected to exceed pageblock_order.
+			 *
+			 * Let us know when they do, so we can add
+			 * proper free and split handling for them.
+			 */
+			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
+			VM_WARN_ON_ONCE_PAGE(__PageMovable(page), page);
 #endif
-				goto failed;
+			goto failed;
 		}
 
 		pfn++;
Zi Yan Oct. 16, 2023, 8:48 p.m. UTC | #46
On 16 Oct 2023, at 16:39, Johannes Weiner wrote:

> On Mon, Oct 16, 2023 at 04:26:30PM -0400, Johannes Weiner wrote:
>> On Mon, Oct 16, 2023 at 03:49:49PM -0400, Zi Yan wrote:
>>> On 16 Oct 2023, at 14:51, Johannes Weiner wrote:
>>>
>>>> On Mon, Oct 16, 2023 at 11:00:33AM -0400, Zi Yan wrote:
>>>>> On 16 Oct 2023, at 10:37, Johannes Weiner wrote:
>>>>>
>>>>>> On Mon, Oct 16, 2023 at 09:35:34AM -0400, Zi Yan wrote:
>>>>>>>> The attached patch has all the suggested changes, let me know how it
>>>>>>>> looks to you. Thanks.
>>>>>>>
>>>>>>> The one I sent has free page accounting issues. The attached one fixes them.
>>>>>>
>>>>>> Do you still have the warnings? I wonder what went wrong.
>>>>>
>>>>> No warnings. But I did notice something with the code:
>>>>>
>>>>> 1. in your version, split_free_page() is called without changing any pageblock
>>>>> migratetypes first, so split_free_page() is effectively a no-op: the page is
>>>>> just deleted from the free list and then freed at smaller orders, and the
>>>>> buddy allocator will merge them back.
>>>>
>>>> Hm not quite.
>>>>
>>>> If it's the tail block of a buddy, I update its type before
>>>> splitting. The splitting loop looks up the type of each block for
>>>> sorting it onto freelists.
>>>>
>>>> If it's the head block, yes I split it first according to its old
>>>> type. But then I let it fall through to scanning the block, which will
>>>> find that buddy, update its type and move it.
>>>
>>> That is the issue, since split_free_page() assumes the pageblocks of
>>> that free page have different types. It basically just frees the page
>>> as smaller-order pages whose orders sum up to the original free page order.
>>> If all pageblocks of the free page have the same migratetype, __free_one_page()
>>> will merge these smaller-order pages back into the original-order free page.
>>
>> duh, of course, you're right. Thanks for patiently explaining this.
>>
>>>>> 2. in my version, I set the pageblock migratetype to new_mt before split_free_page(),
>>>>> but that causes free page accounting issues: in the head case, the free pages
>>>>> are deleted from new_mt while they still sit on the old_mt free list, so the
>>>>> accounting decreases the new_mt free page count instead of the old_mt one.
>>>>
>>>> Right, that makes sense.
>>>>
>>>>> Basically, split_free_page() is awkward as it relies on preset migratetypes:
>>>>> the migratetypes get changed without deleting the free pages from the list first.
>>>>> That is why I came up with the new split_free_page() below.
>>>>
>>>> Yeah, the in-between thing is bad. Either it fixes the migratetype
>>>> before deletion, or it doesn't do the deletion. I'm thinking it would
>>>> be simpler to move the deletion out instead.
>>>
>>> Yes and no. After deletion, a free page no longer has PageBuddy set and
>>> its buddy_order information is cleared. Either we restore PageBuddy and the
>>> order on the deleted free page, or split_free_page() needs to be changed to
>>> accept pages without that information (basically remove the PageBuddy
>>> and order check code).
>>
>> Good point, that requires extra care.
>>
>> It's correct in the code now, but it deserves a comment, especially
>> because of the "buddy" naming in the new split function.
>>
>>>>> Hmm, if CONFIG_ARCH_FORCE_MAX_ORDER can make a buddy span more than one
>>>>> pageblock, and in turn make an in-use page span more than one pageblock,
>>>>> we will have problems: in isolate_single_pageblock(), an in-use page
>>>>> can have part of its pageblocks set to a different migratetype and then be freed,
>>>>> resulting in a free page with mismatched migratetypes. We might need to
>>>>> free pages at pageblock_order if their orders are bigger than pageblock_order.
>>>>
>>>> Is this a practical issue? You mentioned that right now only gigantic
>>>> pages can be larger than a pageblock, and those are freed in order-0
>>>> chunks.
>>>
>>> Only if the system allocates a (non-hugetlb) page with order >pageblock_order
>>> and frees it at the same order. I just do not know whether such pages exist on
>>> arches other than x86. Maybe I am just overthinking it.
>>
>> Hm, I removed LRU pages from the handling (and added the warning) but
>> I left in PageMovable(). The only users are z3fold, zsmalloc and
>> memory ballooning. AFAICS none of them can be bigger than a pageblock.
>> Let me remove that and add a warning for that case as well.
>>
>> This way, we only attempt to migrate hugetlb, where we know the free
>> path - and get warnings for anything else that's larger than expected.
>>
>> This seems like the safest option. On the off chance that there is a
>> regression, it won't jeopardize anybody's systems, while the warning
>> provides all the information we need to debug what's going on.
>
> This delta on top?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b5292ad9860c..0da7c61af37e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1628,7 +1628,7 @@ static int move_freepages_block(struct zone *zone, struct page *page,
>  }
>
>  #ifdef CONFIG_MEMORY_ISOLATION
> -/* Look for a multi-block buddy that straddles start_pfn */
> +/* Look for a buddy that straddles start_pfn */
>  static unsigned long find_large_buddy(unsigned long start_pfn)
>  {
>  	int order = 0;
> @@ -1652,7 +1652,7 @@ static unsigned long find_large_buddy(unsigned long start_pfn)
>  	return start_pfn;
>  }
>
> -/* Split a multi-block buddy into its individual pageblocks */
> +/* Split a multi-block free page into its individual pageblocks */
>  static void split_large_buddy(struct zone *zone, struct page *page,
>  			      unsigned long pfn, int order)
>  {
> @@ -1661,6 +1661,9 @@ static void split_large_buddy(struct zone *zone, struct page *page,
>  	VM_WARN_ON_ONCE(order < pageblock_order);
>  	VM_WARN_ON_ONCE(pfn & (pageblock_nr_pages - 1));
>
> +	/* Caller removed page from freelist, buddy info cleared! */
> +	VM_WARN_ON_ONCE(PageBuddy(page));
> +
>  	while (pfn != end_pfn) {
>  		int mt = get_pfnblock_migratetype(page, pfn);
>
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b4d53545496d..c8b3c0699683 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -399,14 +399,8 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>  				continue;
>  			}
>
> -			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
> -
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
> -			/*
> -			 * hugetlb, and movable compound pages can be
> -			 * migrated. Otherwise, fail the isolation.
> -			 */
> -			if (PageHuge(page) || __PageMovable(page)) {
> +			if (PageHuge(page)) {
>  				struct compact_control cc = {
>  					.nr_migratepages = 0,
>  					.order = -1,
> @@ -426,9 +420,19 @@ static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
>
>  				pfn = head_pfn + nr_pages;
>  				continue;
> -			} else
> +			}
> +
> +			/*
> +			 * These pages are movable too, but they're
> +			 * not expected to exceed pageblock_order.
> +			 *
> +			 * Let us know when they do, so we can add
> +			 * proper free and split handling for them.
> +			 */
> +			VM_WARN_ON_ONCE_PAGE(PageLRU(page), page);
> +			VM_WARN_ON_ONCE_PAGE(__PageMovable(page), page);
>  #endif
> -				goto failed;
> +			goto failed;
>  		}
>
>  		pfn++;

LGTM.

I was thinking about adding

VM_WARN_ON_ONCE_PAGE(order > pageblock_order, page);

in __free_pages() to catch all possible cases, but that is a really hot path.

And just for the record, if those warnings ever show up, we can probably
fix them easily by freeing >pageblock_order pages in units of
pageblock_order.
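
Something along these lines - just a sketch with a made-up helper name,
assuming the page is pageblock-aligned and order >= pageblock_order,
and glossing over how the per-chunk refcounts get set up:

	static void free_in_pageblock_chunks(struct page *page, unsigned int order)
	{
		unsigned long pfn = page_to_pfn(page);
		unsigned long end_pfn = pfn + (1UL << order);

		/*
		 * Hand each pageblock to the allocator separately, so no
		 * free page ever straddles blocks of different types.
		 */
		while (pfn < end_pfn) {
			__free_pages(pfn_to_page(pfn), pageblock_order);
			pfn += pageblock_nr_pages;
		}
	}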

--
Best Regards,
Yan, Zi