[0/5] Remove some races around folio_test_hugetlb

Message ID: 20240301214712.2853147-1-willy@infradead.org

Message

Matthew Wilcox March 1, 2024, 9:47 p.m. UTC
Oscar and I have been exchanging a bit of email recently about the
bug reported here:
https://lore.kernel.org/all/ZXNhGsX32y19a2Xv@casper.infradead.org

I've come to the conclusion that folio_test_hugetlb() is just too fragile
as it can give both false positives and false negatives, as well as
resulting in the above bug.  With this patch series, it becomes a lot
more robust.  In the memory-failure case, we always hold the hugetlb_lock
so it's perfectly reliable.  In the compaction case, it's unreliable, but
the failures are acceptable and we recheck after taking the hugetlb_lock.

The cost of this reliability is that we now consume the word I recently
freed in folio->page[1].  I think this is acceptable; we've still gained
a completely reliable folio_test_hugetlb() (which we didn't have before
I started messing around with the folio dtors).  Non-hugetlb users
can use large_id as a pointer to something else entirely, or even as a
non-pointer, as long as they can guarantee it can't conflict (ie don't
use it as a bitfield).
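
To illustrate the idea (a rough sketch only, not the code from patch 1/5;
the exact check and the large_id accessor here are just how I'd picture it):

	/*
	 * Sketch: for hugetlb folios, the large_id word in page[1] points
	 * into the hstates[] array.  No other large_id user can produce a
	 * pointer into that array, so the test needs no folio reference.
	 */
	static inline bool folio_test_hugetlb_sketch(struct folio *folio)
	{
		const void *id;

		if (!folio_test_large(folio))
			return false;
		id = READ_ONCE(folio->large_id);
		return id >= (const void *)&hstates[0] &&
		       id < (const void *)&hstates[HUGE_MAX_HSTATE];
	}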

So far, this is working for me.  Some stress testing would be appreciated.

Matthew Wilcox (Oracle) (5):
  hugetlb: Make folio_test_hugetlb safer to call
  hugetlb: Add hugetlb_pfn_folio
  memory-failure: Use hugetlb_pfn_folio
  memory-failure: Reorganise get_huge_page_for_hwpoison()
  compaction: Use hugetlb_pfn_folio in isolate_migratepages_block

 include/linux/hugetlb.h    | 13 ++-----
 include/linux/mm.h         |  8 -----
 include/linux/mm_types.h   |  4 ++-
 include/linux/page-flags.h | 25 +++----------
 kernel/vmcore_info.c       |  3 +-
 mm/compaction.c            | 16 ++++-----
 mm/huge_memory.c           | 10 ++----
 mm/hugetlb.c               | 72 +++++++++++++++++++++++++++++---------
 mm/memory-failure.c        | 14 +++++---
 9 files changed, 87 insertions(+), 78 deletions(-)

Comments

Miaohe Lin March 4, 2024, 9:09 a.m. UTC | #1
On 2024/3/2 5:47, Matthew Wilcox (Oracle) wrote:
> Oscar and I have been exchanging a bit of email recently about the
> bug reported here:
> https://lore.kernel.org/all/ZXNhGsX32y19a2Xv@casper.infradead.org

Thanks for your patch.

> 
> I've come to the conclusion that folio_test_hugetlb() is just too fragile
> as it can give both false positives and false negatives, as well as
> resulting in the above bug.  With this patch series, it becomes a lot
> more robust.  In the memory-failure case, we always hold the hugetlb_lock
> so it's perfectly reliable.  In the compaction case, it's unreliable, but
> the failures are acceptable and we recheck after taking the hugetlb_lock.

I encountered similar issues with PageSwapCache check when doing memory-failure test:

[66258.945079] page:00000000135e1205 refcount:1 mapcount:0 mapping:0000000000000000 index:0x9b pfn:0xa04e9a
[66258.949096] head:0000000038449724 order:9 entire_mapcount:1 nr_pages_mapped:0 pincount:0
[66258.949485] memcg:ffff95fb43379000
[66258.950334] anon flags: 0x6fffc00000a0068(uptodate|lru|head|mappedtodisk|swapbacked|node=1|zone=2|lastcpupid=0x3fff)
[66258.951212] page_type: 0xffffffff()
[66258.951882] raw: 06fffc0000000000 ffffc89628138001 dead000000000122 dead000000000400
[66258.952273] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[66258.952884] head: 06fffc00000a0068 ffffc896218a8008 ffffc89621680008 ffff95fb4349c439
[66258.953239] head: 0000000700000600 0000000000000000 00000001ffffffff ffff95fb43379000
[66258.953725] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
[66258.954497] ------------[ cut here ]------------
[66258.954937] kernel BUG at include/linux/page-flags.h:313!
[66258.956502] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[66258.957001] CPU: 14 PID: 174237 Comm: page-types Kdump: loaded Not tainted 6.8.0-rc1-00162-gd162e170f118 #11
[66258.957001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[66258.958415] RIP: 0010:folio_flags.constprop.0+0x1c/0x50
[66258.958415] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 8b 57 08 48 89 f8 83 e2 01 74 12 48 c7 c6 a0 59 34 a7 48 89 c7 e8 b5 60 e8 ff 90 <0f> 0b 66 90 c3 cc cc cc cc f7 c7 ff 0f 00 00 75 1a 48 8b 17 83 e2
[66258.958415] RSP: 0018:ffffa0f38ae53e00 EFLAGS: 00000282
[66258.958415] RAX: 0000000000000033 RBX: 0000000000000000 RCX: ffff96031fd9c9c8
[66258.958415] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff96031fd9c9c0
[66258.958415] RBP: ffffc8962813a680 R08: ffffffffa7756f88 R09: 0000000000009ffb
[66258.962155] R10: 000000000000054a R11: ffffffffa7726fa0 R12: 06fffc0000000000
[66258.962155] R13: 0000000000000000 R14: 00007fff93bf1348 R15: 0000000000a04e9a
[66258.962155] FS:  00007f47cc5c4740(0000) GS:ffff96031fd80000(0000) knlGS:0000000000000000
[66258.962155] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66258.962155] CR2: 00007fff93c7b000 CR3: 0000000850c28000 CR4: 00000000000006f0
[66258.962155] Call Trace:
[66258.962155]  <TASK>
[66258.965730]  ? die+0x32/0x90
[66258.965730]  ? do_trap+0xdf/0x110
[66258.965730]  ? folio_flags.constprop.0+0x1c/0x50
[66258.965730]  ? do_error_trap+0x8b/0x110
[66258.965730]  ? folio_flags.constprop.0+0x1c/0x50
[66258.965730]  ? folio_flags.constprop.0+0x1c/0x50
[66258.965730]  ? exc_invalid_op+0x53/0x70
[66258.965730]  ? folio_flags.constprop.0+0x1c/0x50
[66258.965730]  ? asm_exc_invalid_op+0x1a/0x20
[66258.965730]  ? folio_flags.constprop.0+0x1c/0x50
[66258.965730]  stable_page_flags+0x210/0x940
[66258.965730]  kpageflags_read+0x97/0xf0
[66258.965730]  vfs_read+0xa0/0x370
[66258.965730]  __x64_sys_pread64+0x90/0xc0
[66258.965730]  do_syscall_64+0xcd/0x1e0
[66258.965730]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[66258.965730] RIP: 0033:0x7f47cc31274a
[66258.969711] Code: 44 24 78 00 00 00 00 e9 2b f1 ff ff 0f 1f 40 00 f3 0f 1e fa 49 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 11 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 5e c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[66258.969711] RSP: 002b:00007fff93af1298 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[66258.969711] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f47cc31274a
[66258.969711] RDX: 0000000000000008 RSI: 00007fff93bf1340 RDI: 0000000000000004
[66258.969711] RBP: 00007fff93af12e0 R08: 0000000000000001 R09: 8100000000a04e99
[66258.969711] R10: 00000000050274d0 R11: 0000000000000246 R12: 00007fff93cf1588
[66258.972680] R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f47cc609040
[66258.972680]  </TASK>
[66258.972680] Modules linked in: mce_inject hwpoison_inject

After debugging, I think the race below leads to the above panic:

 CPU1								CPU2
 kpageflags_read
  stable_page_flags
   PageSwapCache() check 4k page without page refcnt held
    folio_test_swapcache(page_folio(page));
     folio_test_swapbacked(folio) && /* page is swapbacked. */

								 page is freed into buddy and merged into larger order.
								 page is allocated as THP tail page.

     test_bit(PG_swapcache, folio_flags(folio, 0)); /* BUG_ON PageTail check in folio_flags. It's tail page now! */

So the PageSwapCache test is fragile too. Any thoughts on how to fix this 'similar' issue?

Thanks.

> 
> The cost of this reliability is that we now consume the word I recently
> freed in folio->page[1].  I think this is acceptable; we've still gained
> a completely reliable folio_test_hugetlb() (which we didn't have before
> I started messing around with the folio dtors).  Non-hugetlb users
> can use large_id as a pointer to something else entirely, or even as a
> non-pointer, as long as they can guarantee it can't conflict (ie don't
> use it as a bitfield).
> 
> So far, this is working for me.  Some stress testing would be appreciated.
> 
> Matthew Wilcox (Oracle) (5):
>   hugetlb: Make folio_test_hugetlb safer to call
>   hugetlb: Add hugetlb_pfn_folio
>   memory-failure: Use hugetlb_pfn_folio
>   memory-failure: Reorganise get_huge_page_for_hwpoison()
>   compaction: Use hugetlb_pfn_folio in isolate_migratepages_block
> 
>  include/linux/hugetlb.h    | 13 ++-----
>  include/linux/mm.h         |  8 -----
>  include/linux/mm_types.h   |  4 ++-
>  include/linux/page-flags.h | 25 +++----------
>  kernel/vmcore_info.c       |  3 +-
>  mm/compaction.c            | 16 ++++-----
>  mm/huge_memory.c           | 10 ++----
>  mm/hugetlb.c               | 72 +++++++++++++++++++++++++++++---------
>  mm/memory-failure.c        | 14 +++++---
>  9 files changed, 87 insertions(+), 78 deletions(-)
>
Matthew Wilcox March 4, 2024, 5:08 p.m. UTC | #2
On Mon, Mar 04, 2024 at 05:09:58PM +0800, Miaohe Lin wrote:
> I encountered similar issues with PageSwapCache check when doing memory-failure test:
> 
> [66258.945079] page:00000000135e1205 refcount:1 mapcount:0 mapping:0000000000000000 index:0x9b pfn:0xa04e9a
> [66258.949096] head:0000000038449724 order:9 entire_mapcount:1 nr_pages_mapped:0 pincount:0
> [66258.949485] memcg:ffff95fb43379000
> [66258.950334] anon flags: 0x6fffc00000a0068(uptodate|lru|head|mappedtodisk|swapbacked|node=1|zone=2|lastcpupid=0x3fff)
> [66258.951212] page_type: 0xffffffff()
> [66258.951882] raw: 06fffc0000000000 ffffc89628138001 dead000000000122 dead000000000400
> [66258.952273] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> [66258.952884] head: 06fffc00000a0068 ffffc896218a8008 ffffc89621680008 ffff95fb4349c439
> [66258.953239] head: 0000000700000600 0000000000000000 00000001ffffffff ffff95fb43379000
> [66258.953725] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
> [66258.954497] ------------[ cut here ]------------
> [66258.954937] kernel BUG at include/linux/page-flags.h:313!
> [66258.956502] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [66258.957001] CPU: 14 PID: 174237 Comm: page-types Kdump: loaded Not tainted 6.8.0-rc1-00162-gd162e170f118 #11
> [66258.957001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [66258.958415] RIP: 0010:folio_flags.constprop.0+0x1c/0x50
> [66258.958415] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 8b 57 08 48 89 f8 83 e2 01 74 12 48 c7 c6 a0 59 34 a7 48 89 c7 e8 b5 60 e8 ff 90 <0f> 0b 66 90 c3 cc cc cc cc f7 c7 ff 0f 00 00 75 1a 48 8b 17 83 e2
> [66258.958415] RSP: 0018:ffffa0f38ae53e00 EFLAGS: 00000282
> [66258.958415] RAX: 0000000000000033 RBX: 0000000000000000 RCX: ffff96031fd9c9c8
> [66258.958415] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff96031fd9c9c0
> [66258.958415] RBP: ffffc8962813a680 R08: ffffffffa7756f88 R09: 0000000000009ffb
> [66258.962155] R10: 000000000000054a R11: ffffffffa7726fa0 R12: 06fffc0000000000
> [66258.962155] R13: 0000000000000000 R14: 00007fff93bf1348 R15: 0000000000a04e9a
> [66258.962155] FS:  00007f47cc5c4740(0000) GS:ffff96031fd80000(0000) knlGS:0000000000000000
> [66258.962155] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [66258.962155] CR2: 00007fff93c7b000 CR3: 0000000850c28000 CR4: 00000000000006f0
> [66258.962155] Call Trace:
> [66258.962155]  <TASK>
> [66258.965730]  stable_page_flags+0x210/0x940
> [66258.965730]  kpageflags_read+0x97/0xf0

Clearly nobody has loved kpageflags_read() in a long time.  It's
absolutely full of bugs, some harmless, others less so.

The heart of the problem is that nobody has a refcount on the page, so
literally everything can change under us.  The old implementations of
PageSwapCache (etc) would silently give bad information; the folio
reimplementations warn you when you hit this race.

We have a few options:

 - We could grab a reference.  That would probably be unwelcome.
 - We can grab a snapshot (sketched below).  Might be a bit overkill.
 - We can grab the parts of the page/folio we need and open-code our
   tests.  This actually seems easiest.
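
By "grab a snapshot" I mean roughly this (sketch only; it copies the struct
page and tests the copy, so nothing can change under us mid-test):

	static bool page_swapcache_snapshot(const struct page *page)
	{
		struct page snap;

		/*
		 * Work on a local copy so later changes to *page cannot trip
		 * the folio helpers' tail-page assertions.
		 */
		memcpy(&snap, page, sizeof(snap));
		return (snap.flags & (1UL << PG_swapbacked)) &&
		       (snap.flags & (1UL << PG_swapcache));
	}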

Here's option 3.  Compile-tested only.  Some notes ...

 - Slab no longer uses page mapcount, so we can remove that test.
 - We still want to check page_mapped(), not folio_mapped() because
   each page can be mapped individually.
 - Our reporting of THP is probably wrong, but I've preserved the
   current behaviour here.
 - Now report the entire huge zero page with the ZERO_PAGE flag instead of
   just the head page.

diff --git a/fs/proc/page.c b/fs/proc/page.c
index 195b077c0fac..f5e3cc6509be 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -107,10 +107,18 @@ static inline u64 kpf_copy_bit(u64 kflags, int ubit, int kbit)
 	return ((kflags >> kbit) & 1) << ubit;
 }
 
+/*
+ * We do not have a reference on the struct page!  We must be very careful
+ * with what functions we call.  Some inaccuracy is tolerable here but the
+ * helper functions may warn.
+ */
 u64 stable_page_flags(struct page *page)
 {
-	u64 k;
-	u64 u;
+	struct folio *folio;
+	unsigned long k;
+	unsigned long mapping;
+	bool anon;
+	u64 u = 0;
 
 	/*
 	 * pseudo flag: KPF_NOPAGE
@@ -118,45 +126,39 @@ u64 stable_page_flags(struct page *page)
 	 */
 	if (!page)
 		return 1 << KPF_NOPAGE;
+	folio = page_folio(page);
 
-	k = page->flags;
-	u = 0;
+	k = folio->flags;
+	mapping = (unsigned long)folio->mapping;
+	anon = mapping & PAGE_MAPPING_ANON;
 
 	/*
 	 * pseudo flags for the well known (anonymous) memory mapped pages
-	 *
-	 * Note that page->_mapcount is overloaded in SLAB, so the
-	 * simple test in page_mapped() is not enough.
 	 */
-	if (!PageSlab(page) && page_mapped(page))
+	if (page_mapped(page))
 		u |= 1 << KPF_MMAP;
-	if (PageAnon(page))
+	if (anon)
 		u |= 1 << KPF_ANON;
-	if (PageKsm(page))
+	if (mapping & PAGE_MAPPING_KSM)
 		u |= 1 << KPF_KSM;
 
 	/*
 	 * compound pages: export both head/tail info
 	 * they together define a compound page's start/end pos and order
 	 */
-	if (PageHead(page))
-		u |= 1 << KPF_COMPOUND_HEAD;
+	u |= kpf_copy_bit(k, KPF_COMPOUND_HEAD,	PG_head);
 	if (PageTail(page))
 		u |= 1 << KPF_COMPOUND_TAIL;
-	if (PageHuge(page))
+	if (folio_test_hugetlb(folio))
 		u |= 1 << KPF_HUGE;
 	/*
-	 * PageTransCompound can be true for non-huge compound pages (slab
-	 * pages or pages allocated by drivers with __GFP_COMP) because it
-	 * just checks PG_head/PG_tail, so we need to check PageLRU/PageAnon
+	 * We need to check LRU/Anon
 	 * to make sure a given page is a thp, not a non-huge compound page.
 	 */
-	else if (PageTransCompound(page)) {
-		struct page *head = compound_head(page);
-
-		if (PageLRU(head) || PageAnon(head))
+	else if (folio_test_large(folio)) {
+		if ((k & (1 << PG_lru)) || anon)
 			u |= 1 << KPF_THP;
-		else if (is_huge_zero_page(head)) {
+		else if (is_huge_zero_page(&folio->page)) {
 			u |= 1 << KPF_ZERO_PAGE;
 			u |= 1 << KPF_THP;
 		}
@@ -178,15 +180,15 @@ u64 stable_page_flags(struct page *page)
 	if (PageTable(page))
 		u |= 1 << KPF_PGTABLE;
 
-	if (page_is_idle(page))
+#if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
+	u |= kpf_copy_bit(k, KPF_IDLE,		PG_idle);
+#else
+	if (folio_test_idle(folio))
 		u |= 1 << KPF_IDLE;
+#endif
 
 	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
-
 	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
-	if (PageTail(page) && PageSlab(page))
-		u |= 1 << KPF_SLAB;
-
 	u |= kpf_copy_bit(k, KPF_ERROR,		PG_error);
 	u |= kpf_copy_bit(k, KPF_DIRTY,		PG_dirty);
 	u |= kpf_copy_bit(k, KPF_UPTODATE,	PG_uptodate);
@@ -197,7 +199,8 @@ u64 stable_page_flags(struct page *page)
 	u |= kpf_copy_bit(k, KPF_ACTIVE,	PG_active);
 	u |= kpf_copy_bit(k, KPF_RECLAIM,	PG_reclaim);
 
-	if (PageSwapCache(page))
+	if ((k & ((1 << PG_swapbacked) | (1 << PG_swapcache))) ==
+			((1 << PG_swapbacked) | (1 << PG_swapcache)))
 		u |= 1 << KPF_SWAPCACHE;
 	u |= kpf_copy_bit(k, KPF_SWAPBACKED,	PG_swapbacked);
David Hildenbrand March 5, 2024, 9:10 a.m. UTC | #3
On 01.03.24 22:47, Matthew Wilcox (Oracle) wrote:
> Oscar and I have been exchanging a bit of email recently about the
> bug reported here:
> https://lore.kernel.org/all/ZXNhGsX32y19a2Xv@casper.infradead.org
> 
> I've come to the conclusion that folio_test_hugetlb() is just too fragile
> as it can give both false positives and false negatives, as well as
> resulting in the above bug.  With this patch series, it becomes a lot
> more robust.  In the memory-failure case, we always hold the hugetlb_lock
> so it's perfectly reliable.  In the compaction case, it's unreliable, but
> the failures are acceptable and we recheck after taking the hugetlb_lock.
> 
> The cost of this reliability is that we now consume the word I recently
> freed in folio->page[1].  I think this is acceptable; we've still gained
> a completely reliable folio_test_hugetlb() (which we didn't have before
> I started messing around with the folio dtors).  Non-hugetlb users
> can use large_id as a pointer to something else entirely, or even as a
> non-pointer, as long as they can guarantee it can't conflict (ie don't
> use it as a bitfield).

That probably means that we have to always set the lowest bit to use it 
for something else, or use another bit.

I was wondering if

a) We could move that to another subpage. In hugetlb folios we have 
plenty of space for such things. I guess we'd have to be able to detect the 
folio size without holding a reference, to make sure we can touch 
another subpage.

b) We could overload _nr_pages_mapped. We'd effectively have to steal 
one bit from _nr_pages_mapped to make this work.


Maybe what works is using the existing mechanism (hugetlb flag), and 
then storing the pointer in _nr_pages_mapped.

So depending on the hugetlb flag, we can interpret _nr_pages_mapped 
either as the pointer or as the old variant.

Mostly only folio_large_is_mapped() would need care for now, to ignore 
_nr_pages_mapped if the hugetlb flag is set.
Matthew Wilcox March 5, 2024, 8:35 p.m. UTC | #4
On Tue, Mar 05, 2024 at 10:10:08AM +0100, David Hildenbrand wrote:
> > The cost of this reliability is that we now consume the word I recently
> > freed in folio->page[1].  I think this is acceptable; we've still gained
> > a completely reliable folio_test_hugetlb() (which we didn't have before
> > I started messing around with the folio dtors).  Non-hugetlb users
> > can use large_id as a pointer to something else entirely, or even as a
> > non-pointer, as long as they can guarantee it can't conflict (ie don't
> > use it as a bitfield).
> 
> That probably means that we have to always set the lowest bit to use it for
> something else, or use another bit.

Yes, that would work.

> I was wondering if
> 
> a) We could move that to another subpage. In hugetlb folios we have plenty
> > of space for such things. I guess we'd have to be able to detect the folio size
> without holding a reference, to make sure we can touch another subpage.

Yes, that was my concern.  I wanted to put it in page[2] with all the
other hugetlb goop, but I got to thinking about an order-1 compound
page allocated at the end of memmap and got scared.  We could make
folio_test_hugetlb() look at ->flags for the head bit, then look at
->flags_1 for the order and finally at ->hugetlb_id, but now we've looked
at three cachelines to answer a fairly frequent question.  And then what
if the folio got split between looking at ->flags and ->flags_1 and we
get a bogus folio order that makes it look OK?  We can't even look at
->flags, ->flags_1 and recheck ->flags because it might have got split,
freed and reallocated in the meantime.

> b) We could overload _nr_pages_mapped. We'd effectively have to steal one
> bit from _nr_pages_mapped to make this work.
>
> Maybe what works is using the existing mechanism (hugetlb flag), and then
> storing the pointer in __nr_pages_mapped.
> 
> So depending on the hugetlb flag, we can interpret __nr_pages_mapped either
> as the pointer or as the old variant.
> 
> Mostly only folio_large_is_mapped() would need care for now, to ignore
> _nr_pages_mapped if the hugetlb flag is set.

I don't mind that at all.  We wouldn't even need to steal a bit or use the
existing flag; we could just say that -2 means this is a hugetlb folio.
As long as it ends up at the same offset as page->mapping (because that's
always NULL or a pointer possibly with a low bit set so can't ever be a
number between -4095 and -1).

IOW:

word	page0			page1
0	flags			flags
1	lru.next		head
2	lru.prev		entire_mapcount + gap
3	mapping			nr_pages_mapped + gap / hugetlb_id
4	index			pincount + nr_pages
5	private			unused
6	mapcount+refcount	mapcount+refcount(0)
7	memcg_data		-

or on 32-bit

word	page0			page1
0	flags			flags
1	lru.next		head
2	lru.prev		entire_mapcount
3	mapping			nr_pages_mapped / hugetlb_id
4	index			pincount
5	private			unused
6	mapcount		mapcount
7	refcount		refcount
8	memcg_data		-
9+	virtual? last_cpupid? whatever
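
Spelled out as code, that "-2" check could look something like this (purely
illustrative; a real check would presumably read the whole word that shares
page->mapping's offset, not just the 32-bit _nr_pages_mapped shown here):

	#define FOLIO_ID_HUGETLB	(-2)

	static inline bool folio_test_hugetlb_sketch(struct folio *folio)
	{
		/*
		 * A real mapping pointer (or NULL) can never look like a
		 * small negative number, so -2 in this slot is unambiguous.
		 */
		return folio_test_large(folio) &&
		       atomic_read(&folio->_nr_pages_mapped) == FOLIO_ID_HUGETLB;
	}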

Does this fit with your plans?
Miaohe Lin March 6, 2024, 7:58 a.m. UTC | #5
On 2024/3/5 1:08, Matthew Wilcox wrote:
> On Mon, Mar 04, 2024 at 05:09:58PM +0800, Miaohe Lin wrote:
>> I encountered similar issues with PageSwapCache check when doing memory-failure test:
>>
>> [66258.945079] page:00000000135e1205 refcount:1 mapcount:0 mapping:0000000000000000 index:0x9b pfn:0xa04e9a
>> [66258.949096] head:0000000038449724 order:9 entire_mapcount:1 nr_pages_mapped:0 pincount:0
>> [66258.949485] memcg:ffff95fb43379000
>> [66258.950334] anon flags: 0x6fffc00000a0068(uptodate|lru|head|mappedtodisk|swapbacked|node=1|zone=2|lastcpupid=0x3fff)
>> [66258.951212] page_type: 0xffffffff()
>> [66258.951882] raw: 06fffc0000000000 ffffc89628138001 dead000000000122 dead000000000400
>> [66258.952273] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
>> [66258.952884] head: 06fffc00000a0068 ffffc896218a8008 ffffc89621680008 ffff95fb4349c439
>> [66258.953239] head: 0000000700000600 0000000000000000 00000001ffffffff ffff95fb43379000
>> [66258.953725] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
>> [66258.954497] ------------[ cut here ]------------
>> [66258.954937] kernel BUG at include/linux/page-flags.h:313!
>> [66258.956502] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
>> [66258.957001] CPU: 14 PID: 174237 Comm: page-types Kdump: loaded Not tainted 6.8.0-rc1-00162-gd162e170f118 #11
>> [66258.957001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>> [66258.958415] RIP: 0010:folio_flags.constprop.0+0x1c/0x50
>> [66258.958415] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 8b 57 08 48 89 f8 83 e2 01 74 12 48 c7 c6 a0 59 34 a7 48 89 c7 e8 b5 60 e8 ff 90 <0f> 0b 66 90 c3 cc cc cc cc f7 c7 ff 0f 00 00 75 1a 48 8b 17 83 e2
>> [66258.958415] RSP: 0018:ffffa0f38ae53e00 EFLAGS: 00000282
>> [66258.958415] RAX: 0000000000000033 RBX: 0000000000000000 RCX: ffff96031fd9c9c8
>> [66258.958415] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff96031fd9c9c0
>> [66258.958415] RBP: ffffc8962813a680 R08: ffffffffa7756f88 R09: 0000000000009ffb
>> [66258.962155] R10: 000000000000054a R11: ffffffffa7726fa0 R12: 06fffc0000000000
>> [66258.962155] R13: 0000000000000000 R14: 00007fff93bf1348 R15: 0000000000a04e9a
>> [66258.962155] FS:  00007f47cc5c4740(0000) GS:ffff96031fd80000(0000) knlGS:0000000000000000
>> [66258.962155] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [66258.962155] CR2: 00007fff93c7b000 CR3: 0000000850c28000 CR4: 00000000000006f0
>> [66258.962155] Call Trace:
>> [66258.962155]  <TASK>
>> [66258.965730]  stable_page_flags+0x210/0x940
>> [66258.965730]  kpageflags_read+0x97/0xf0
> 
> Clearly nobody has loved kpageflags_read() in a long time.  It's
> absolutely full of bugs, some harmless, others less so.
> 
> The heart of the problem is that nobody has a refcount on the page, so
> literally everything can change under us.  The old implementations of
> PageSwapCache (etc) would silently give bad information; the folio
> reimplementations warn you when you hit this race.
> 
> We have a few options:
> 
>  - We could grab a reference.  That would probably be unwelcome.
>  - We can grab a snapshot.  Might be a bit overkill.
>  - We can grab the parts of the page/folio we need and open-code our
>    tests.  This actually seems easiest.

Option 3 should be the easiest way to fix the issue, though it seems the code would be slightly ugly.
Will you send a formal patch?

Thanks Matthew!
David Hildenbrand March 6, 2024, 3:18 p.m. UTC | #6
On 05.03.24 21:35, Matthew Wilcox wrote:
> On Tue, Mar 05, 2024 at 10:10:08AM +0100, David Hildenbrand wrote:
>>> The cost of this reliability is that we now consume the word I recently
>>> freed in folio->page[1].  I think this is acceptable; we've still gained
>>> a completely reliable folio_test_hugetlb() (which we didn't have before
>>> I started messing around with the folio dtors).  Non-hugetlb users
>>> can use large_id as a pointer to something else entirely, or even as a
>>> non-pointer, as long as they can guarantee it can't conflict (ie don't
>>> use it as a bitfield).
>>
>> That probably means that we have to always set the lowest bit to use it for
>> something else, or use another bit.
> 
> Yes, that would work.
> 
>> I was wondering if
>>
>> a) We could move that to another subpage. In hugetlb folios we have plenty
>> of space for such things. I guess we'd have to be able to detect the folio size
>> without holding a reference, to make sure we can touch another subpage.
> 
> Yes, that was my concern.  I wanted to put it in page[2] with all the
> other hugetlb goop, but I got to thinking about an order-1 compound
> page allocated at the end of memmap and got scared.  We could make
> folio_test_hugetlb() look at ->flags for the head bit, then look at
> ->flags_1 for the order and finally at ->hugetlb_id, but now we've looked
> at three cachelines to answer a fairly frequent question.  And then what
> if the folio got split between looking at ->flags and ->flags_1 and we
> get a bogus folio order that makes it look OK?  We can't even look at
> ->flags, ->flags_1 and recheck ->flags because it might have got split,
> freed and reallocated in the meantime.
> 
>> b) We could overload _nr_pages_mapped. We'd effectively have to steal one
>> bit from _nr_pages_mapped to make this work.
>>
>> Maybe what works is using the existing mechanism (hugetlb flag), and then
>> storing the pointer in __nr_pages_mapped.
>>
>> So depending on the hugetlb flag, we can interpret __nr_pages_mapped either
>> as the pointer or as the old variant.
>>
>> Mostly only folio_large_is_mapped() would need care for now, to ignore
>> _nr_pages_mapped if the hugetlb flag is set.
> 
> I don't mind that at all.  We wouldn't even need to steal a bit or use the
> existing flag; we could just say that -2 means this is a hugetlb folio.
> As long as it ends up at the same offset as page->mapping (because that's
> always NULL or a pointer possibly with a low bit set so can't ever be a
> number between -4095 and -1).

Would hugetlb_id below be 32bit or 64bit on 64-bit?

> 
> IOW:
> 
> word	page0			page1
> 0	flags			flags
> 1	lru.next		head
> 2	lru.prev		entire_mapcount + gap
> 3	mapping			nr_pages_mapped + gap / hugetlb_id
> 4	index			pincount + nr_pages
> 5	private			unused
> 6	mapcount+refcount	mapcount+refcount(0)
> 7	memcg_data		-
> 
> or on 32-bit
> 
> word	page0			page1
> 0	flags			flags
> 1	lru.next		head
> 2	lru.prev		entire_mapcount
> 3	mapping			nr_pages_mapped / hugetlb_id

^ In the worst case, I think, nr_pages_mapped with a lot of entire 
mappings could end up matching hugetlb_id. We add a large value to 
nr_pages_mapped every time we add an entire mapping ... (not sure if 
that could currently be a problem with many entire mappings of a large 
folio)


> 4	index			pincount
> 5	private			unused
> 6	mapcount		mapcount
> 7	refcount		refcount
> 8	memcg_data		-
> 9+	virtual? last_cpupid? whatever
> 
> Does this fit with your plans?

For the total mapcount this would do (and it would be better), but the 
layout gets a bit "sparse" on 64bit that way, which will end up being 
problematic for some other stuff I might want to put in there.

Not that we have to resolve that now, just bringing it up, that maybe we 
can do better right away :)


IIUC, we would not be able to reuse the "gap" in "nr_pages_mapped + gap 
/ hugetlb_id", essentially consuming an additional 32bit compared to 
what we do now, correct?

I was thinking of the following, assuming your example above indicated 
that on 64-bit the hugetlb_id is 64 bits:

hugetlb folios set
* compound page
* flag in subpage 1 (like we do)
* nr_pages_mapped in subpage 1 to e.g., -2.

So to check lockless if we have a hugetlb folio
* Check if compound
* Check if the flag in subpage 1 is set
* Check if nr_pages_mapped matches

Would that still be too unreliable?
Matthew Wilcox March 7, 2024, 4:31 a.m. UTC | #7
On Wed, Mar 06, 2024 at 04:18:45PM +0100, David Hildenbrand wrote:
> On 05.03.24 21:35, Matthew Wilcox wrote:
> > On Tue, Mar 05, 2024 at 10:10:08AM +0100, David Hildenbrand wrote:
> > > > The cost of this reliability is that we now consume the word I recently
> > > > freed in folio->page[1].  I think this is acceptable; we've still gained
> > > > a completely reliable folio_test_hugetlb() (which we didn't have before
> > > > I started messing around with the folio dtors).  Non-hugetlb users
> > > > can use large_id as a pointer to something else entirely, or even as a
> > > > non-pointer, as long as they can guarantee it can't conflict (ie don't
> > > > use it as a bitfield).
> > > 
> > > That probably means that we have to always set the lowest bit to use it for
> > > something else, or use another bit.
> > 
> > Yes, that would work.
> > 
> > > I was wondering if
> > > 
> > > a) We could move that to another subpage. In hugetlb folios we have plenty
> > > of space for such things. I guess we'd have to be able to detect the folio size
> > > without holding a reference, to make sure we can touch another subpage.
> > 
> > Yes, that was my concern.  I wanted to put it in page[2] with all the
> > other hugetlb goop, but I got to thinking about an order-1 compound
> > page allocated at the end of memmap and got scared.  We could make
> > folio_test_hugetlb() look at ->flags for the head bit, then look at
> > ->flags_1 for the order and finally at ->hugetlb_id, but now we've looked
> > at three cachelines to answer a fairly frequent question.  And then what
> > if the folio got split between looking at ->flags and ->flags_1 and we
> > get a bogus folio order that makes it look OK?  We can't even look at
> > ->flags, ->flags_1 and recheck ->flags because it might have got split,
> > freed and reallocated in the meantime.
> > 
> > > b) We could overload _nr_pages_mapped. We'd effectively have to steal one
> > > bit from _nr_pages_mapped to make this work.
> > > 
> > > Maybe what works is using the existing mechanism (hugetlb flag), and then
> > > storing the pointer in __nr_pages_mapped.
> > > 
> > > So depending on the hugetlb flag, we can interpret __nr_pages_mapped either
> > > as the pointer or as the old variant.
> > > 
> > > Mostly only folio_large_is_mapped() would need care for now, to ignore
> > > _nr_pages_mapped if the hugetlb flag is set.
> > 
> > I don't mind that at all.  We wouldn't even need to steal a bit or use the
> > existing flag; we could just say that -2 means this is a hugetlb folio.
> > As long as it ends up at the same offset as page->mapping (because that's
> > always NULL or a pointer possibly with a low bit set so can't ever be a
> > number between -4095 and -1).
> 
> Would hugetlb_id below be 32bit or 64bit on 64-bit?

64-bit, so when it's reused by page->mapping after a split, it isn't
ambiguous.

> > 
> > IOW:
> > 
> > word	page0			page1
> > 0	flags			flags
> > 1	lru.next		head
> > 2	lru.prev		entire_mapcount + gap
> > 3	mapping			nr_pages_mapped + gap / hugetlb_id
> > 4	index			pincount + nr_pages
> > 5	private			unused
> > 6	mapcount+refcount	mapcount+refcount(0)
> > 7	memcg_data		-
> > 
> > or on 32-bit
> > 
> > word	page0			page1
> > 0	flags			flags
> > 1	lru.next		head
> > 2	lru.prev		entire_mapcount
> > 3	mapping			nr_pages_mapped / hugetlb_id
> 
> ^ In the worst case, I think, nr_pages_mapped with a lot of entire mappings
> could end up matching hugetlb_id. We add a large value to nr_pages_mapped
> every time we add an entire mapping ... (not sure if that could currently be
> a problem with many entire mappings of a large folio)

My understanding was that nr_pages_mapped was incremented by one for
each page which has a non-zero mapcount.  It was also incremented by
ENTIRELY_MAPPED the first time that we increment ->entire_mapcount.
As such, I don't think entire_mapcount can get the top bit set.
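
For reference, the relevant definitions currently look roughly like this
(mm/internal.h, around v6.8 -- quoting from memory, so double-check):

	#define ENTIRELY_MAPPED		0x800000
	#define FOLIO_PAGES_MAPPED	(ENTIRELY_MAPPED - 1)

	/*
	 * _nr_pages_mapped counts precisely-mapped pages in the low bits and
	 * has ENTIRELY_MAPPED added once, when the first entire mapping
	 * appears, so in practice it stays well away from the sign bit.
	 */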

> 
> > 4	index			pincount
> > 5	private			unused
> > 6	mapcount		mapcount
> > 7	refcount		refcount
> > 8	memcg_data		-
> > 9+	virtual? last_cpupid? whatever
> > 
> > Does this fit with your plans?
> 
> For the total mapcount this would do (and it would be better), but the
> layout gets a bit "sparse" on 64bit that way, which will end up being
> problematic for some other stuff I might want to put in there.
> 
> Not that we have to resolve that now, just bringing it up, that maybe we can
> do better right away :)

How about this layout?

@@ -350,8 +350,13 @@ struct folio {
                        unsigned long _head_1;
                        unsigned long _folio_avail;
        /* public: */
-                       atomic_t _entire_mapcount;
-                       atomic_t _nr_pages_mapped;
+                       union {
+                               unsigned long _hugetlb_id;
+                               struct {
+                                       atomic_t _entire_mapcount;
+                                       atomic_t _nr_pages_mapped;
+                               };
+                       };
                        atomic_t _pincount;
 #ifdef CONFIG_64BIT
                        unsigned int _folio_nr_pages;

That keeps _folio_avail as, well, available.  It puts _hugetlb_id in
the same bits as ->mapping.  It continues to leave ->private unused
on 64-bit.  I think this does everything we want?
David Hildenbrand March 7, 2024, 9:20 a.m. UTC | #8
>>>
>>> IOW:
>>>
>>> word	page0			page1
>>> 0	flags			flags
>>> 1	lru.next		head
>>> 2	lru.prev		entire_mapcount + gap
>>> 3	mapping			nr_pages_mapped + gap / hugetlb_id
>>> 4	index			pincount + nr_pages
>>> 5	private			unused
>>> 6	mapcount+refcount	mapcount+refcount(0)
>>> 7	memcg_data		-
>>>
>>> or on 32-bit
>>>
>>> word	page0			page1
>>> 0	flags			flags
>>> 1	lru.next		head
>>> 2	lru.prev		entire_mapcount
>>> 3	mapping			nr_pages_mapped / hugetlb_id
>>
>> ^ In the worst case, I think, nr_pages_mapped with a lot of entire mappings
>> could end up matching hugetlb_id. We add a large value to nr_pages_mapped
>> every time we add an entire mapping ... (not sure if that could currently be
>> a problem with many entire mappings of a large folio)
> 
> My understanding was that nr_pages_mapped was incremented by one for
> each page which has a non-zero mapcount.  It was also incremented by
> ENTIRELY_MAPPED the first time that we increment ->entire_mapcount.
> As such, I don't think entire_mapcount can get the top bit set.

Right, I misremembered!

> 
>>
>>> 4	index			pincount
>>> 5	private			unused
>>> 6	mapcount		mapcount
>>> 7	refcount		refcount
>>> 8	memcg_data		-
>>> 9+	virtual? last_cpupid? whatever
>>>
>>> Does this fit with your plans?
>>
>> For the total mapcount this would do (and it would be better), but the
>> layout gets a bit "sparse" on 64bit that way, which will end up being
>> problematic for some other stuff I might want to put in there.
>>
>> Not that we have to resolve that now, just bringing it up, that maybe we can
>> do better right away :)
> 
> How about this layout?
> 
> @@ -350,8 +350,13 @@ struct folio {
>                          unsigned long _head_1;
>                          unsigned long _folio_avail;
>          /* public: */
> -                       atomic_t _entire_mapcount;
> -                       atomic_t _nr_pages_mapped;
> +                       union {
> +                               unsigned long _hugetlb_id;
> +                               struct {
> +                                       atomic_t _entire_mapcount;
> +                                       atomic_t _nr_pages_mapped;
> +                               };
> +                       };
>                          atomic_t _pincount;
>   #ifdef CONFIG_64BIT
>                          unsigned int _folio_nr_pages;
> 
> That keeps _folio_avail as, well, available.  It puts _hugetlb_id in
> the same bits as ->mapping.  It continues to leave ->private unused
> on 64-bit.  I think this does everything we want?

_entire_mapcount is (still) used for hugetlb folios.

With the total mapcount in place, I was thinking about renaming it to 
"_pmd_mapcount" and stop using it for hugetlb folios, just like we'd not 
be using _nr_pages_mapped for hugetlb folios.

[I also thought about moving the _pmd_mapcount to another subpage, where 
we'd also have a _pud_mapcount in the future; but again, stuff for the 
future]

Until then, wouldn't _hugetlb_id be problematic here? [I could move 
_entire_mapcount/_pmd_mapcount later I guess]
Matthew Wilcox March 7, 2024, 9:14 p.m. UTC | #9
On Thu, Mar 07, 2024 at 10:20:15AM +0100, David Hildenbrand wrote:
> > > > IOW:
> > > > 
> > > > word	page0			page1
> > > > 0	flags			flags
> > > > 1	lru.next		head
> > > > 2	lru.prev		entire_mapcount + gap
> > > > 3	mapping			nr_pages_mapped + gap / hugetlb_id
> > > > 4	index			pincount + nr_pages
> > > > 5	private			unused
> > > > 6	mapcount+refcount	mapcount+refcount(0)
> > > > 7	memcg_data		-
> > > > 
> > > > or on 32-bit
> > > > 
> > > > word	page0			page1
> > > > 0	flags			flags
> > > > 1	lru.next		head
> > > > 2	lru.prev		entire_mapcount
> > > > 3	mapping			nr_pages_mapped / hugetlb_id
> > > > 4	index			pincount
> > > > 5	private			unused
> > > > 6	mapcount		mapcount
> > > > 7	refcount		refcount
> > > > 8	memcg_data		-
> > > > 9+	virtual? last_cpupid? whatever
> > 
> > How about this layout?
> > 
> > @@ -350,8 +350,13 @@ struct folio {
> >                          unsigned long _head_1;
> >                          unsigned long _folio_avail;
> >          /* public: */
> > -                       atomic_t _entire_mapcount;
> > -                       atomic_t _nr_pages_mapped;
> > +                       union {
> > +                               unsigned long _hugetlb_id;
> > +                               struct {
> > +                                       atomic_t _entire_mapcount;
> > +                                       atomic_t _nr_pages_mapped;
> > +                               };
> > +                       };
> >                          atomic_t _pincount;
> >   #ifdef CONFIG_64BIT
> >                          unsigned int _folio_nr_pages;
> > 
> > That keeps _folio_avail as, well, available.  It puts _hugetlb_id in
> > the same bits as ->mapping.  It continues to leave ->private unused
> > on 64-bit.  I think this does everything we want?
> 
> _entire_mapcount is (still) used for hugetlb folios.

Oh, duh, of course it is.  I thought we used page[0].mapcount for them,
but we don't and shouldn't.  I suppose we could use a magic value for
page[0].mapcount to indicate hugetlb, but that'd make page_mapcount()
more complex.

> With the total mapcount in place, I was thinking about renaming it to
> "_pmd_mapcount" and stop using it for hugetlb folios, just like we'd not be
> using _nr_pages_mapped for hugetlb folios.
> 
> [I also thought about moving the _pmd_mapcount to another subpage, where
> we'd also have a _pud_mapcount in the future; but again, stuff for the
> future]
> 
> Until then, wouldn't _hugetlb_id be problematic here? [I could move
> _entire_mapcount/_pmd_mapcount later I guess]

New idea then, how about simply:

                        unsigned long _flags_1;
                        unsigned long _head_1;
-                       unsigned long _folio_avail;
+                       unsigned long _hugetlb_id;

We have to check the various other users of struct page to see what we
might conflict with.

slab:
                                struct list_head slab_list;
                                        void *freelist;         /* first free object */
                struct rcu_head rcu_head;

ptdesc:
                struct rcu_head pt_rcu_head;
                struct list_head pt_list;
                        pgtable_t pmd_huge_pte;

So it's always used as a pointer (or a pointer in disguise).  That means
that either using a pointer to something hugetlb-related or a value like
(void *)-2 will be unambiguous.  It'll just be something to document in
each of the types split from struct page.
Matthew Wilcox March 7, 2024, 9:16 p.m. UTC | #10
On Wed, Mar 06, 2024 at 03:58:31PM +0800, Miaohe Lin wrote:
> >  - We could grab a reference.  That would probably be unwelcome.
> >  - We can grab a snapshot.  Might be a bit overkill.
> >  - We can grab the parts of the page/folio we need and open-code our
> >    tests.  This actually seems easiest.
> 
> Option 3 should be the easiest way to fix the issue though it seems the code would be slightly ugly.
> Will you send a formal patch?

Yes; I just want to get the reliable folio_test_hugetlb done first, and
then I have a patch series ready to go.  The timing is a bit suspect
too; I'd like to give this a few weeks in -next before it gets merged
and the merge window starts on Monday.  So I think I'll wait until after
the merge window to get this sent out.
> 
>
David Hildenbrand March 7, 2024, 9:38 p.m. UTC | #11
On 07.03.24 22:14, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 10:20:15AM +0100, David Hildenbrand wrote:
>>>>> IOW:
>>>>>
>>>>> word	page0			page1
>>>>> 0	flags			flags
>>>>> 1	lru.next		head
>>>>> 2	lru.prev		entire_mapcount + gap
>>>>> 3	mapping			nr_pages_mapped + gap / hugetlb_id
>>>>> 4	index			pincount + nr_pages
>>>>> 5	private			unused
>>>>> 6	mapcount+refcount	mapcount+refcount(0)
>>>>> 7	memcg_data		-
>>>>>
>>>>> or on 32-bit
>>>>>
>>>>> word	page0			page1
>>>>> 0	flags			flags
>>>>> 1	lru.next		head
>>>>> 2	lru.prev		entire_mapcount
>>>>> 3	mapping			nr_pages_mapped / hugetlb_id
>>>>> 4	index			pincount
>>>>> 5	private			unused
>>>>> 6	mapcount		mapcount
>>>>> 7	refcount		refcount
>>>>> 8	memcg_data		-
>>>>> 9+	virtual? last_cpupid? whatever
>>>
>>> How about this layout?
>>>
>>> @@ -350,8 +350,13 @@ struct folio {
>>>                           unsigned long _head_1;
>>>                           unsigned long _folio_avail;
>>>           /* public: */
>>> -                       atomic_t _entire_mapcount;
>>> -                       atomic_t _nr_pages_mapped;
>>> +                       union {
>>> +                               unsigned long _hugetlb_id;
>>> +                               struct {
>>> +                                       atomic_t _entire_mapcount;
>>> +                                       atomic_t _nr_pages_mapped;
>>> +                               };
>>> +                       };
>>>                           atomic_t _pincount;
>>>    #ifdef CONFIG_64BIT
>>>                           unsigned int _folio_nr_pages;
>>>
>>> That keeps _folio_avail as, well, available.  It puts _hugetlb_id in
>>> the same bits as ->mapping.  It continues to leave ->private unused
>>> on 64-bit.  I think this does everything we want?
>>
>> _entire_mapcount is (still) used for hugetlb folios.
> 
> Oh, duh, of course it is.  I thought we used page[0].mapcount for them,
> but we don't and shouldn't.  I suppose we could use a magic value for
> page[0].mapcount to indicate hugetlb, but that'd make page_mapcount()
> more complex.
> 
>> With the total mapcount in place, I was thinking about renaming it to
>> "_pmd_mapcount" and stop using it for hugetlb folios, just like we'd not be
>> using _nr_pages_mapped for hugetlb folios.
>>
>> [I also thought about moving the _pmd_mapcount to another subpage, where
>> we'd also have a _pud_mapcount in the future; but again, stuff for the
>> future]
>>
>> Until then, wouldn't _hugetlb_id be problematic here? [I could move
>> _entire_mapcount/_pmd_mapcount later I guess]
> 
> New idea then, how about simply:
> 
>                          unsigned long _flags_1;
>                          unsigned long _head_1;
> -                       unsigned long _folio_avail;
> +                       unsigned long _hugetlb_id;
> 
> We have to check the various other users of struct page to see what we
> might conflict with.

Well, compared to the current version there is no change for me: the 
space I wanted to use for the total_mapcount is gone and I'll have to 
start being creative :)

Free space in subpage1: "The LORD gave and the LORD has taken away"

We seem to be running into space issues in subpage1 again (also, more 
relevant now that we will support order-1 folios ...). This kind of 
wastage+over-complicated layout (again :( ) just to handle hugetlb 
oddities for free pages -- refcount 0 -- feels very wrong.


Stupid idea: Do we really have to identify (possibly free) hugetlb 
folios that way from lockless pfnwalkers without a folio reference?

I mean, with a folio reference held it's all working as expected.

Couldn't we just make hugetlb store them in some kind of xarray that we 
can walk (using RCU?), indexed by the start PFN we want to test?

So if we find the current hugetlb flag to be set on the lockless path, 
simply check that xarray. It's all super racy either way, we can get a 
free+split+reuse anytime.
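
Very roughly (hand-waving sketch; hugetlb_pfn_xa and the helper below are
made up, and xa_load() already provides its own RCU protection):

	#include <linux/xarray.h>

	/* Hypothetical: hugetlb registers each active folio by head PFN. */
	static DEFINE_XARRAY(hugetlb_pfn_xa);

	/*
	 * A lockless pfn walker that sees the hugetlb flag set would confirm
	 * against the xarray instead of trusting the page[1] contents.
	 */
	static bool pfn_is_registered_hugetlb(unsigned long pfn)
	{
		return xa_load(&hugetlb_pfn_xa, pfn) != NULL;
	}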

Or is that completely flawed?
Matthew Wilcox March 8, 2024, 4:31 a.m. UTC | #12
On Thu, Mar 07, 2024 at 09:14:50PM +0000, Matthew Wilcox wrote:
> I suppose we could use a magic value for
> page[0].mapcount to indicate hugetlb, but that'd make page_mapcount()
> more complex.

Actually, not significantly.  I haven't tested this yet, but it should
work ... thoughts?  It's certainly the simplest implementation of
folio_test_hugetlb() that we've ever had.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..32281097f07e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1212,6 +1212,9 @@ static inline int page_mapcount(struct page *page)
 {
 	int mapcount = atomic_read(&page->_mapcount) + 1;
 
+	/* Head page repurposes mapcount; tail page leaves at -1 */
+	if (PageHeadHuge(page))
+		mapcount = 0;
 	if (unlikely(PageCompound(page)))
 		mapcount += folio_entire_mapcount(page_folio(page));
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 735cddc13d20..6ebb573c7195 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -190,7 +190,6 @@ enum pageflags {
 
 	/* At least one page in this folio has the hwpoison flag set */
 	PG_has_hwpoisoned = PG_error,
-	PG_hugetlb = PG_active,
 	PG_large_rmappable = PG_workingset, /* anon or file-backed */
 };
 
@@ -829,29 +828,6 @@ TESTPAGEFLAG_FALSE(LargeRmappable, large_rmappable)
 
 #define PG_head_mask ((1UL << PG_head))
 
-#ifdef CONFIG_HUGETLB_PAGE
-int PageHuge(struct page *page);
-SETPAGEFLAG(HugeTLB, hugetlb, PF_SECOND)
-CLEARPAGEFLAG(HugeTLB, hugetlb, PF_SECOND)
-
-/**
- * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs
- * @folio: The folio to test.
- *
- * Context: Any context.  Caller should have a reference on the folio to
- * prevent it from being turned into a tail page.
- * Return: True for hugetlbfs folios, false for anon folios or folios
- * belonging to other filesystems.
- */
-static inline bool folio_test_hugetlb(struct folio *folio)
-{
-	return folio_test_large(folio) &&
-		test_bit(PG_hugetlb, folio_flags(folio, 1));
-}
-#else
-TESTPAGEFLAG_FALSE(Huge, hugetlb)
-#endif
-
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
  * PageHuge() only returns true for hugetlbfs pages, but not for
@@ -907,18 +883,6 @@ PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
 	TESTSCFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
 #endif
 
-/*
- * Check if a page is currently marked HWPoisoned. Note that this check is
- * best effort only and inherently racy: there is no way to synchronize with
- * failing hardware.
- */
-static inline bool is_page_hwpoison(struct page *page)
-{
-	if (PageHWPoison(page))
-		return true;
-	return PageHuge(page) && PageHWPoison(compound_head(page));
-}
-
 /*
  * For pages that are never mapped to userspace (and aren't PageSlab),
  * page_type may be used.  Because it is initialised to -1, we invert the
@@ -935,6 +899,7 @@ static inline bool is_page_hwpoison(struct page *page)
 #define PG_offline	0x00000100
 #define PG_table	0x00000200
 #define PG_guard	0x00000400
+#define PG_hugetlb	0x00000800
 
 #define PageType(page, flag)						\
 	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -1026,6 +991,37 @@ PAGE_TYPE_OPS(Table, table, pgtable)
  */
 PAGE_TYPE_OPS(Guard, guard, guard)
 
+#ifdef CONFIG_HUGETLB_PAGE
+PAGE_TYPE_OPS(HeadHuge, hugetlb, hugetlb)
+#else
+TESTPAGEFLAG_FALSE(HeadHuge, hugetlb)
+#endif
+
+/**
+ * PageHuge - Determine if the page belongs to hugetlbfs
+ * @page: The page to test.
+ *
+ * Context: Any context.
+ * Return: True for hugetlbfs pages, false for anon pages or pages
+ * belonging to other filesystems.
+ */
+static inline bool PageHuge(struct page *page)
+{
+	return folio_test_hugetlb(page_folio(page));
+}
+
+/*
+ * Check if a page is currently marked HWPoisoned. Note that this check is
+ * best effort only and inherently racy: there is no way to synchronize with
+ * failing hardware.
+ */
+static inline bool is_page_hwpoison(struct page *page)
+{
+	if (PageHWPoison(page))
+		return true;
+	return PageHuge(page) && PageHWPoison(compound_head(page));
+}
+
 extern bool is_free_buddy_page(struct page *page);
 
 PAGEFLAG(Isolated, isolated, PF_ANY);
@@ -1092,7 +1088,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
  */
 #define PAGE_FLAGS_SECOND						\
 	(0xffUL /* order */		| 1UL << PG_has_hwpoisoned |	\
-	 1UL << PG_hugetlb		| 1UL << PG_large_rmappable)
+	 1UL << PG_large_rmappable)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1581b670d4..23b62df21971 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1623,7 +1623,7 @@ static inline void __clear_hugetlb_destructor(struct hstate *h,
 {
 	lockdep_assert_held(&hugetlb_lock);
 
-	folio_clear_hugetlb(folio);
+	__folio_clear_hugetlb(folio);
 }
 
 /*
@@ -1710,7 +1710,7 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
 		h->surplus_huge_pages_node[nid]++;
 	}
 
-	folio_set_hugetlb(folio);
+	__folio_set_hugetlb(folio);
 	folio_change_private(folio, NULL);
 	/*
 	 * We have to set hugetlb_vmemmap_optimized again as above
@@ -2048,7 +2048,7 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
 
 static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio)
 {
-	folio_set_hugetlb(folio);
+	__folio_set_hugetlb(folio);
 	INIT_LIST_HEAD(&folio->lru);
 	hugetlb_set_folio_subpool(folio, NULL);
 	set_hugetlb_cgroup(folio, NULL);
@@ -2158,22 +2158,6 @@ static bool prep_compound_gigantic_folio_for_demote(struct folio *folio,
 	return __prep_compound_gigantic_folio(folio, order, true);
 }
 
-/*
- * PageHuge() only returns true for hugetlbfs pages, but not for normal or
- * transparent huge pages.  See the PageTransHuge() documentation for more
- * details.
- */
-int PageHuge(struct page *page)
-{
-	struct folio *folio;
-
-	if (!PageCompound(page))
-		return 0;
-	folio = page_folio(page);
-	return folio_test_hugetlb(folio);
-}
-EXPORT_SYMBOL_GPL(PageHuge);
-
 /*
  * Find and lock address space (mapping) in write mode.
  *
David Hildenbrand March 8, 2024, 8:46 a.m. UTC | #13
On 08.03.24 05:31, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 09:14:50PM +0000, Matthew Wilcox wrote:
>> I suppose we could use a magic value for
>> page[0].mapcount to indicate hugetlb, but that'd make page_mapcount()
>> more complex.
> 
> Actually, not significantly.  I haven't tested this yet, but it should
> work ... thoughts?  It's certainly the simplest implementation of
> folio_test_hugetlb() that we've ever had.

Yes, yes, yes! So obvious and so simple!

1) In __folio_rmap_sanity_checks() we now VM_WARN_ON_ONCE when we get a 
hugetlb folio.

2) In folio_total_mapcount() we'll never read the subpage mapcount 
because folio_nr_pages_mapped() == 0.


__dump_folio() might require care to just ignore the subpage mapcount as 
well. Maybe we can even use something like "page_has_type" to never read 
subpage mapcounts, and read the entire mapcount only for the exception 
of hugetlb folios.

Apart from that, I did look at users of page_has_type():

1) validate_page_before_insert() now correctly rejects hugetlb folios; 
insert_page_into_pte_locked() -> folio_add_file_rmap_pte() would 
nowadays properly complain about that in __folio_rmap_sanity_checks() 
already.

2) kpagecount_read() will always end up with pcount=0. We either 
special-case hugetlb folios here as well, or just let this interface 
bitrot and hopefully go away at some point :)


Very nice!


> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f5a97dec5169..32281097f07e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1212,6 +1212,9 @@ static inline int page_mapcount(struct page *page)
>   {
>   	int mapcount = atomic_read(&page->_mapcount) + 1;
>   
> +	/* Head page repurposes mapcount; tail page leaves at -1 */
> +	if (PageHeadHuge(page))
> +		mapcount = 0;
>   	if (unlikely(PageCompound(page)))
>   		mapcount += folio_entire_mapcount(page_folio(page));
>   
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 735cddc13d20..6ebb573c7195 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -190,7 +190,6 @@ enum pageflags {
>   
>   	/* At least one page in this folio has the hwpoison flag set */
>   	PG_has_hwpoisoned = PG_error,
> -	PG_hugetlb = PG_active,
>   	PG_large_rmappable = PG_workingset, /* anon or file-backed */
>   };
>   
> @@ -829,29 +828,6 @@ TESTPAGEFLAG_FALSE(LargeRmappable, large_rmappable)
>   
>   #define PG_head_mask ((1UL << PG_head))
>   
> -#ifdef CONFIG_HUGETLB_PAGE
> -int PageHuge(struct page *page);
> -SETPAGEFLAG(HugeTLB, hugetlb, PF_SECOND)
> -CLEARPAGEFLAG(HugeTLB, hugetlb, PF_SECOND)
> -
> -/**
> - * folio_test_hugetlb - Determine if the folio belongs to hugetlbfs
> - * @folio: The folio to test.
> - *
> - * Context: Any context.  Caller should have a reference on the folio to
> - * prevent it from being turned into a tail page.
> - * Return: True for hugetlbfs folios, false for anon folios or folios
> - * belonging to other filesystems.
> - */
> -static inline bool folio_test_hugetlb(struct folio *folio)
> -{
> -	return folio_test_large(folio) &&
> -		test_bit(PG_hugetlb, folio_flags(folio, 1));
> -}
> -#else
> -TESTPAGEFLAG_FALSE(Huge, hugetlb)
> -#endif
> -
>   #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>   /*
>    * PageHuge() only returns true for hugetlbfs pages, but not for
> @@ -907,18 +883,6 @@ PAGEFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
>   	TESTSCFLAG_FALSE(HasHWPoisoned, has_hwpoisoned)
>   #endif
>   
> -/*
> - * Check if a page is currently marked HWPoisoned. Note that this check is
> - * best effort only and inherently racy: there is no way to synchronize with
> - * failing hardware.
> - */
> -static inline bool is_page_hwpoison(struct page *page)
> -{
> -	if (PageHWPoison(page))
> -		return true;
> -	return PageHuge(page) && PageHWPoison(compound_head(page));
> -}
> -
>   /*
>    * For pages that are never mapped to userspace (and aren't PageSlab),
>    * page_type may be used.  Because it is initialised to -1, we invert the
> @@ -935,6 +899,7 @@ static inline bool is_page_hwpoison(struct page *page)
>   #define PG_offline	0x00000100
>   #define PG_table	0x00000200
>   #define PG_guard	0x00000400
> +#define PG_hugetlb	0x00000800
>   
>   #define PageType(page, flag)						\
>   	((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> @@ -1026,6 +991,37 @@ PAGE_TYPE_OPS(Table, table, pgtable)
>    */
>   PAGE_TYPE_OPS(Guard, guard, guard)
>   
> +#ifdef CONFIG_HUGETLB_PAGE
> +PAGE_TYPE_OPS(HeadHuge, hugetlb, hugetlb)
> +#else
> +TESTPAGEFLAG_FALSE(HeadHuge, hugetlb)
> +#endif
> +
> +/**
> + * PageHuge - Determine if the page belongs to hugetlbfs
> + * @page: The page to test.
> + *
> + * Context: Any context.
> + * Return: True for hugetlbfs pages, false for anon pages or pages
> + * belonging to other filesystems.
> + */
> +static inline bool PageHuge(struct page *page)
> +{
> +	return folio_test_hugetlb(page_folio(page));
> +}
> +
> +/*
> + * Check if a page is currently marked HWPoisoned. Note that this check is
> + * best effort only and inherently racy: there is no way to synchronize with
> + * failing hardware.
> + */
> +static inline bool is_page_hwpoison(struct page *page)
> +{
> +	if (PageHWPoison(page))
> +		return true;
> +	return PageHuge(page) && PageHWPoison(compound_head(page));
> +}
> +
>   extern bool is_free_buddy_page(struct page *page);
>   
>   PAGEFLAG(Isolated, isolated, PF_ANY);
> @@ -1092,7 +1088,7 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
>    */
>   #define PAGE_FLAGS_SECOND						\
>   	(0xffUL /* order */		| 1UL << PG_has_hwpoisoned |	\
> -	 1UL << PG_hugetlb		| 1UL << PG_large_rmappable)
> +	 1UL << PG_large_rmappable)
>   
>   #define PAGE_FLAGS_PRIVATE				\
>   	(1UL << PG_private | 1UL << PG_private_2)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed1581b670d4..23b62df21971 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1623,7 +1623,7 @@ static inline void __clear_hugetlb_destructor(struct hstate *h,
>   {
>   	lockdep_assert_held(&hugetlb_lock);
>   
> -	folio_clear_hugetlb(folio);
> +	__folio_clear_hugetlb(folio);
>   }
>   
>   /*
> @@ -1710,7 +1710,7 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
>   		h->surplus_huge_pages_node[nid]++;
>   	}
>   
> -	folio_set_hugetlb(folio);
> +	__folio_set_hugetlb(folio);
>   	folio_change_private(folio, NULL);
>   	/*
>   	 * We have to set hugetlb_vmemmap_optimized again as above
> @@ -2048,7 +2048,7 @@ static void __prep_account_new_huge_page(struct hstate *h, int nid)
>   
>   static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio)
>   {
> -	folio_set_hugetlb(folio);
> +	__folio_set_hugetlb(folio);
>   	INIT_LIST_HEAD(&folio->lru);
>   	hugetlb_set_folio_subpool(folio, NULL);
>   	set_hugetlb_cgroup(folio, NULL);
> @@ -2158,22 +2158,6 @@ static bool prep_compound_gigantic_folio_for_demote(struct folio *folio,
>   	return __prep_compound_gigantic_folio(folio, order, true);
>   }
>   
> -/*
> - * PageHuge() only returns true for hugetlbfs pages, but not for normal or
> - * transparent huge pages.  See the PageTransHuge() documentation for more
> - * details.
> - */
> -int PageHuge(struct page *page)
> -{
> -	struct folio *folio;
> -
> -	if (!PageCompound(page))
> -		return 0;
> -	folio = page_folio(page);
> -	return folio_test_hugetlb(folio);
> -}
> -EXPORT_SYMBOL_GPL(PageHuge);
> -
>   /*
>    * Find and lock address space (mapping) in write mode.
>    *
>