
[v9,2/8] mm/huge_memory: add two new (not yet used) functions for folio_split()

Message ID 20250226210032.2044041-3-ziy@nvidia.com (mailing list archive)
State New
Series Buddy allocator like (or non-uniform) folio split

Commit Message

Zi Yan Feb. 26, 2025, 9 p.m. UTC
This is a preparation patch; the two added functions are not used yet.

The added __split_unmapped_folio() is able to split a folio, with its
mapping removed, in two ways: 1) uniform split (the existing way), and
2) buddy allocator like split.

The added __split_folio_to_order() can split a folio into any lower order.
For uniform split, __split_unmapped_folio() calls it once to split the
given folio to the new order.  For buddy allocator split,
__split_unmapped_folio() calls it (folio_order - new_order) times and each
time splits the folio containing the given page to one lower order.
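
For illustration, a sketch (not the actual implementation, which also has
to handle the xarray, memcg, refcount and LRU work) of how often
__split_folio_to_order() is invoked in each mode:

	/*
	 * Illustrative sketch only: call pattern of __split_folio_to_order()
	 * for the two split modes described above.
	 */
	if (uniform_split) {
		/* one call takes the folio straight to new_order */
		__split_folio_to_order(folio, new_order);
	} else {
		int order;

		/*
		 * (folio_order - new_order) calls, one order lower each
		 * time, always splitting the folio containing @split_at.
		 */
		for (order = folio_order(folio); order > new_order; order--)
			__split_folio_to_order(page_folio(split_at), order - 1);
	}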

Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kairui Song <kasong@tencent.com>
---
 mm/huge_memory.c | 338 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 337 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox Feb. 27, 2025, 5:55 a.m. UTC | #1
On Wed, Feb 26, 2025 at 04:00:25PM -0500, Zi Yan wrote:
> +static int __split_unmapped_folio(struct folio *folio, int new_order,
> +		struct page *split_at, struct page *lock_at,
> +		struct list_head *list, pgoff_t end,
> +		struct xa_state *xas, struct address_space *mapping,
> +		bool uniform_split)
> +{
[...]
> +		/* complete memcg works before add pages to LRU */
> +		split_page_memcg(&folio->page, old_order, split_order);
> +		split_page_owner(&folio->page, old_order, split_order);
> +		pgalloc_tag_split(folio, old_order, split_order);

At least split_page_memcg() needs to become aware of 'uniform_split'.

        if (folio_memcg_kmem(folio))
                obj_cgroup_get_many(__folio_objcg(folio), old_nr / new_nr - 1);

If we're doing uniform_split, that calculation should be
	old_order - new_order - 1
Matthew Wilcox Feb. 27, 2025, 3:14 p.m. UTC | #2
On Thu, Feb 27, 2025 at 05:55:43AM +0000, Matthew Wilcox wrote:
> On Wed, Feb 26, 2025 at 04:00:25PM -0500, Zi Yan wrote:
> > +static int __split_unmapped_folio(struct folio *folio, int new_order,
> > +		struct page *split_at, struct page *lock_at,
> > +		struct list_head *list, pgoff_t end,
> > +		struct xa_state *xas, struct address_space *mapping,
> > +		bool uniform_split)
> > +{
> [...]
> > +		/* complete memcg works before add pages to LRU */
> > +		split_page_memcg(&folio->page, old_order, split_order);
> > +		split_page_owner(&folio->page, old_order, split_order);
> > +		pgalloc_tag_split(folio, old_order, split_order);
> 
> At least split_page_memcg() needs to become aware of 'uniform_split'.
> 
>         if (folio_memcg_kmem(folio))
>                 obj_cgroup_get_many(__folio_objcg(folio), old_nr / new_nr - 1);
> 
> If we're doing uniform_split, that calculation should be
> 	old_order - new_order - 1

umm, old_order - new_order.  Anyway, here's a patch I've done on top of
your work, but it probably needs to be massaged slightly and placed
before your work?


From 190e13ed77e562eb59fa1fa4bfefdefe5d0416ed Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Mon, 28 Oct 2024 16:23:30 -0400
Subject: [PATCH] mm: Separate folio_split_memcg() from split_page_memcg()

Folios always use memcg_data to refer to the mem_cgroup while pages
allocated with GFP_ACCOUNT have a pointer to the obj_cgroup.  Since the
caller already knows what it has, split the function into two and then
we don't need to check.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/memcontrol.h |  7 +++++++
 mm/huge_memory.c           |  6 ++++--
 mm/memcontrol.c            | 18 +++++++++++++++---
 3 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 57664e2a8fb7..155c3f81f4df 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1039,6 +1039,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 }
 
 void split_page_memcg(struct page *head, int old_order, int new_order);
+void folio_split_memcg(struct folio *folio, unsigned old_order,
+		unsigned new_order, bool uniform_split);
 
 static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
 {
@@ -1463,6 +1465,11 @@ static inline void split_page_memcg(struct page *head, int old_order, int new_or
 {
 }
 
+static inline void folio_split_memcg(struct folio *folio, unsigned old_order,
+		unsigned new_order, bool uniform)
+{
+}
+
 static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
 {
 	return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1e45064046a0..75fa9c9d9ec9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3401,6 +3401,9 @@ static void __split_folio_to_order(struct folio *folio, int new_order)
 			folio_set_young(new_folio);
 		if (folio_test_idle(folio))
 			folio_set_idle(new_folio);
+#ifdef CONFIG_MEMCG
+		new_folio->memcg_data = folio->memcg_data;
+#endif
 
 		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
 	}
@@ -3529,8 +3532,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 			}
 		}
 
-		/* complete memcg works before add pages to LRU */
-		split_page_memcg(&folio->page, old_order, split_order);
+		folio_split_memcg(folio, old_order, split_order, uniform_split);
 		split_page_owner(&folio->page, old_order, split_order);
 		pgalloc_tag_split(folio, old_order, split_order);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 16f3bdbd37d8..c2d41e1337cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3064,10 +3064,22 @@ void split_page_memcg(struct page *head, int old_order, int new_order)
 	for (i = new_nr; i < old_nr; i += new_nr)
 		folio_page(folio, i)->memcg_data = folio->memcg_data;
 
-	if (folio_memcg_kmem(folio))
-		obj_cgroup_get_many(__folio_objcg(folio), old_nr / new_nr - 1);
+	obj_cgroup_get_many(__folio_objcg(folio), old_nr / new_nr - 1);
+}
+
+void folio_split_memcg(struct folio *folio, unsigned old_order,
+		unsigned new_order, bool uniform_split)
+{
+	unsigned new_refs;
+
+	if (mem_cgroup_disabled() || !folio_memcg_charged(folio))
+		return;
+
+	if (uniform_split)
+		new_refs = (1 << (old_order - new_order)) - 1;
 	else
-		css_get_many(&folio_memcg(folio)->css, old_nr / new_nr - 1);
+		new_refs = old_order - new_order;
+	css_get_many(&__folio_memcg(folio)->css, new_refs);
 }
 
 unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
Zi Yan Feb. 27, 2025, 3:42 p.m. UTC | #3
On 27 Feb 2025, at 10:14, Matthew Wilcox wrote:

> On Thu, Feb 27, 2025 at 05:55:43AM +0000, Matthew Wilcox wrote:
>> On Wed, Feb 26, 2025 at 04:00:25PM -0500, Zi Yan wrote:
>>> +static int __split_unmapped_folio(struct folio *folio, int new_order,
>>> +		struct page *split_at, struct page *lock_at,
>>> +		struct list_head *list, pgoff_t end,
>>> +		struct xa_state *xas, struct address_space *mapping,
>>> +		bool uniform_split)
>>> +{
>> [...]
>>> +		/* complete memcg works before add pages to LRU */
>>> +		split_page_memcg(&folio->page, old_order, split_order);
>>> +		split_page_owner(&folio->page, old_order, split_order);
>>> +		pgalloc_tag_split(folio, old_order, split_order);
>>
>> At least split_page_memcg() needs to become aware of 'uniform_split'.
>>
>>         if (folio_memcg_kmem(folio))
>>                 obj_cgroup_get_many(__folio_objcg(folio), old_nr / new_nr - 1);
>>
>> If we're doing uniform_split, that calculation should be
>> 	old_order - new_order - 1
>
> umm, old_order - new_order.  Anyway, here's a patch I've done on top of
> your work, but it probably needs to be massaged slightly and placed
> before your work?

Wait. uniform_split is the existing approach of splitting one order-9 folio
into 512 order-0 folios, so split_page_memcg() still works for it. For
!uniform_split, split_page_memcg() is called multiple times, each time with
old_order = new_order + 1, so what split_page_memcg() does is:
1. the two order-8 folios get their memcg, and the refcount is increased by 1;
2. one of the order-8s is split into two order-7 folios, each of which gets
its memcg, and the refcount is increased by 1;
…

8. one of the order-1s is split into two order-0 folios, each of which gets
its memcg, and the refcount is increased by 1.

At the end, the refcount has been increased by old_order - new_order, as
you described above. Let me know if that makes sense to you.
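
A worked example may help (hypothetical trace, assuming an order-9 folio is
split non-uniformly down to order-0 and that each call still adds
old_nr / new_nr - 1 references, i.e. 1 per call here):

	/* each call operates on the head page of the folio being split */
	split_page_memcg(head, 9, 8);	/* 2/1 - 1 == 1 extra ref */
	split_page_memcg(head, 8, 7);	/* +1 */
	/* ... six more one-order steps ... */
	split_page_memcg(head, 1, 0);	/* +1 */

	/*
	 * 9 calls in total, so 9 == old_order - new_order extra references,
	 * compared with (1 << (old_order - new_order)) - 1 == 511 extra
	 * references for a single uniform order-9 -> order-0 split.
	 */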


>
> From 190e13ed77e562eb59fa1fa4bfefdefe5d0416ed Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Date: Mon, 28 Oct 2024 16:23:30 -0400
> Subject: [PATCH] mm: Separate folio_split_memcg() from split_page_memcg()
>
> Folios always use memcg_data to refer to the mem_cgroup while pages
> allocated with GFP_ACCOUNT have a pointer to the obj_cgroup.  Since the
> caller already knows what it has, split the function into two and then
> we don't need to check.
>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  include/linux/memcontrol.h |  7 +++++++
>  mm/huge_memory.c           |  6 ++++--
>  mm/memcontrol.c            | 18 +++++++++++++++---
>  3 files changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 57664e2a8fb7..155c3f81f4df 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1039,6 +1039,8 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  }
>
>  void split_page_memcg(struct page *head, int old_order, int new_order);
> +void folio_split_memcg(struct folio *folio, unsigned old_order,
> +		unsigned new_order, bool uniform_split);
>
>  static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
>  {
> @@ -1463,6 +1465,11 @@ static inline void split_page_memcg(struct page *head, int old_order, int new_or
>  {
>  }
>
> +static inline void folio_split_memcg(struct folio *folio, unsigned old_order,
> +		unsigned new_order, bool uniform)
> +{
> +}
> +
>  static inline u64 cgroup_id_from_mm(struct mm_struct *mm)
>  {
>  	return 0;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1e45064046a0..75fa9c9d9ec9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3401,6 +3401,9 @@ static void __split_folio_to_order(struct folio *folio, int new_order)
>  			folio_set_young(new_folio);
>  		if (folio_test_idle(folio))
>  			folio_set_idle(new_folio);
> +#ifdef CONFIG_MEMCG
> +		new_folio->memcg_data = folio->memcg_data;
> +#endif
>
>  		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
>  	}
> @@ -3529,8 +3532,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  			}
>  		}
>
> -		/* complete memcg works before add pages to LRU */
> -		split_page_memcg(&folio->page, old_order, split_order);
> +		folio_split_memcg(folio, old_order, split_order, uniform_split);
>  		split_page_owner(&folio->page, old_order, split_order);
>  		pgalloc_tag_split(folio, old_order, split_order);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 16f3bdbd37d8..c2d41e1337cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3064,10 +3064,22 @@ void split_page_memcg(struct page *head, int old_order, int new_order)
>  	for (i = new_nr; i < old_nr; i += new_nr)
>  		folio_page(folio, i)->memcg_data = folio->memcg_data;
>
> -	if (folio_memcg_kmem(folio))
> -		obj_cgroup_get_many(__folio_objcg(folio), old_nr / new_nr - 1);
> +	obj_cgroup_get_many(__folio_objcg(folio), old_nr / new_nr - 1);
> +}
> +
> +void folio_split_memcg(struct folio *folio, unsigned old_order,
> +		unsigned new_order, bool uniform_split)
> +{
> +	unsigned new_refs;
> +
> +	if (mem_cgroup_disabled() || !folio_memcg_charged(folio))
> +		return;
> +
> +	if (uniform_split)
> +		new_refs = (1 << (old_order - new_order)) - 1;
>  	else
> -		css_get_many(&folio_memcg(folio)->css, old_nr / new_nr - 1);
> +		new_refs = old_order - new_order;
> +	css_get_many(&__folio_memcg(folio)->css, new_refs);
>  }
>
>  unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
> -- 
> 2.47.2


Best Regards,
Yan, Zi
Hugh Dickins March 4, 2025, 11:49 a.m. UTC | #4
On Wed, 26 Feb 2025, Zi Yan wrote:

> This is a preparation patch, both added functions are not used yet.
> 
> The added __split_unmapped_folio() is able to split a folio with its
> mapping removed in two manners: 1) uniform split (the existing way), and
> 2) buddy allocator like split.
> 
> The added __split_folio_to_order() can split a folio into any lower order.
> For uniform split, __split_unmapped_folio() calls it once to split the
> given folio to the new order.  For buddy allocator split,
> __split_unmapped_folio() calls it (folio_order - new_order) times and each
> time splits the folio containing the given page to one lower order.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>

Sorry, I'm tired and don't really want to be writing this yet, but the
migrate "hotfix" has tipped my hand, and I need to get this out to you
before more days pass.

I'd been unable to complete even a single iteration of my "kernel builds
on huge tmpfs while swapping to SSD" testing during this current 6.14-rc
mm.git cycle (6.14-rc itself fine) - until the last week, when some
important fixes have come in, so I'm no longer getting I/O errors from
ext4-on-loop0-on-huge-tmpfs, and "Huh VM_FAULT_OOM leaked" warnings: good.

But I still can't get beyond a few iterations, a few minutes: there's
some corruption of user data, which usually manifests as a kernel build
failing because fixdep couldn't find some truncated-on-the-left pathname.

While it definitely bisected to your folio_split() series, it's quite
possible that you're merely exposing an existing bug to wider use.

I've spent the last few days trying to track this down, but still not
succeeded: I'm still getting much the same corruption.  But have been
folding in various fixes as I found them, even though they have not
solved the main problem at all.  I'll return to trying to debug the
corruption "tomorrow".

I think (might be wrong, I'm in a rush) my mods are all to this
"add two new (not yet used) functions for folio_split()" patch:
please merge them in if you agree.

1. From source inspection, it looks like a folio_set_order() was missed.

2. Why is swapcache only checked when folio_test_anon? I can see that
   you've just copied that over from the old __split_huge_page(), but
   it seems wrong to me here and there - I guess a relic from before
   shmem could swap out a huge page.

3. Doing folio_next() inside the for(;;) is unsafe in those configs
   which have to look up zone etc, I got an oops from the "new_folio"
   loop; didn't hit an oops from the "release" loop but fixed that too.

4. While correcting anon versus mapping versus swap_cache, shortened
   the lines by avoiding origin_folio->mapping and &release->page.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/huge_memory.c | 39 ++++++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0e45937c0d91..9ce3906672b9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3612,7 +3612,9 @@ static void __split_folio_to_order(struct folio *folio, int new_order)
 		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
 	}
 
-	if (!new_order)
+	if (new_order)
+		folio_set_order(folio, new_order);
+	else
 		ClearPageCompound(&folio->page);
 }
 
@@ -3682,7 +3684,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 	int ret = 0;
 	bool stop_split = false;
 
-	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
+	if (folio_test_swapcache(folio)) {
+		VM_BUG_ON(mapping);
+
 		/* a swapcache folio can only be uniformly split to order-0 */
 		if (!uniform_split || new_order != 0)
 			return -EINVAL;
@@ -3750,9 +3754,8 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		 * is new_order, since the folio will be worked on in next
 		 * iteration.
 		 */
-		for (release = folio, next = folio_next(folio);
-		     release != end_folio;
-		     release = next, next = folio_next(next)) {
+		for (release = folio; release != end_folio; release = next) {
+			next = folio_next(release);
 			/*
 			 * for buddy allocator like split, the folio containing
 			 * page will be split next and should not be released,
@@ -3784,32 +3787,31 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 			lru_add_page_tail(origin_folio, &release->page,
 						lruvec, list);
 
-			/* Some pages can be beyond EOF: drop them from page cache */
+			/* Some pages can be beyond EOF: drop them from cache */
 			if (release->index >= end) {
-				if (shmem_mapping(origin_folio->mapping))
+				if (shmem_mapping(mapping))
 					nr_dropped += folio_nr_pages(release);
 				else if (folio_test_clear_dirty(release))
 					folio_account_cleaned(release,
-						inode_to_wb(origin_folio->mapping->host));
+						inode_to_wb(mapping->host));
 				__filemap_remove_folio(release, NULL);
 				folio_put(release);
-			} else if (!folio_test_anon(release)) {
-				__xa_store(&origin_folio->mapping->i_pages,
-						release->index, &release->page, 0);
+			} else if (mapping) {
+				__xa_store(&mapping->i_pages,
+						release->index, release, 0);
 			} else if (swap_cache) {
 				__xa_store(&swap_cache->i_pages,
 						swap_cache_index(release->swap),
-						&release->page, 0);
+						release, 0);
 			}
 		}
 	}
 
 	unlock_page_lruvec(lruvec);
 
-	if (folio_test_anon(origin_folio)) {
-		if (folio_test_swapcache(origin_folio))
-			xa_unlock(&swap_cache->i_pages);
-	} else
+	if (swap_cache)
+		xa_unlock(&swap_cache->i_pages);
+	if (mapping)
 		xa_unlock(&mapping->i_pages);
 
 	/* Caller disabled irqs, so they are still disabled here */
@@ -3828,9 +3830,8 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 	 * For buddy allocator like split, the first after-split folio is left
 	 * for caller to unlock.
 	 */
-	for (new_folio = origin_folio, next = folio_next(origin_folio);
-	     new_folio != next_folio;
-	     new_folio = next, next = folio_next(next)) {
+	for (new_folio = origin_folio; new_folio != next_folio; new_folio = next) {
+		next = folio_next(new_folio);
 		if (new_folio == page_folio(lock_at))
 			continue;
Zi Yan March 4, 2025, 4:20 p.m. UTC | #5
On 4 Mar 2025, at 6:49, Hugh Dickins wrote:

> On Wed, 26 Feb 2025, Zi Yan wrote:
>
>> This is a preparation patch, both added functions are not used yet.
>>
>> The added __split_unmapped_folio() is able to split a folio with its
>> mapping removed in two manners: 1) uniform split (the existing way), and
>> 2) buddy allocator like split.
>>
>> The added __split_folio_to_order() can split a folio into any lower order.
>> For uniform split, __split_unmapped_folio() calls it once to split the
>> given folio to the new order.  For buddy allocator split,
>> __split_unmapped_folio() calls it (folio_order - new_order) times and each
>> time splits the folio containing the given page to one lower order.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> Sorry, I'm tired and don't really want to be writing this yet, but the
> migrate "hotfix" has tipped my hand, and I need to get this out to you
> before more days pass.

Thank you for taking the time to test my patches. I really appreciate it.

>
> I'd been unable to complete even a single iteration of my "kernel builds
> on huge tmpfs while swapping to SSD" testing during this current 6.14-rc
> mm.git cycle (6.14-rc itself fine) - until the last week, when some
> important fixes have come in, so I'm no longer getting I/O errors from
> ext4-on-loop0-on-huge-tmpfs, and "Huh VM_FAULT_OOM leaked" warnings: good.

This error should be related to the other patch I sent out on using
xas_try_split() in shmem_large_entry_split(). Great to have you confirm
it fixed some of the bugs.

>
> But I still can't get beyond a few iterations, a few minutes: there's
> some corruption of user data, which usually manifests as a kernel build
> failing because fixdep couldn't find some truncated-on-the-left pathname.

This patch might fix it (at least partially):
https://lore.kernel.org/linux-mm/56EBE3B6-99EA-470E-B2B3-92C9C13032DF@nvidia.com/.
Andrew picked it up yesterday.

>
> While it definitely bisected to your folio_split() series, it's quite
> possible that you're merely exposing an existing bug to wider use.
>
> I've spent the last few days trying to track this down, but still not
> succeeded: I'm still getting much the same corruption.  But have been
> folding in various fixes as I found them, even though they have not
> solved the main problem at all.  I'll return to trying to debug the
> corruption "tomorrow".

Thank you very much. This patchset has not received much review yet; your
help is really appreciated.

>
> I think (might be wrong, I'm in a rush) my mods are all to this
> "add two new (not yet used) functions for folio_split()" patch:
> please merge them in if you agree.
>
> 1. From source inspection, it looks like a folio_set_order() was missed.
>
> 2. Why is swapcache only checked when folio_test_anon? I can see that
>    you've just copied that over from the old __split_huge_page(), but
>    it seems wrong to me here and there - I guess a relic from before
>    shmem could swap out a huge page.
>
> 3. Doing folio_next() inside the for(;;) is unsafe in those configs
>    which have to look up zone etc, I got an oops from the "new_folio"
>    loop; didn't hit an oops from the "release" loop but fixed that too.
>
> 4. While correcting anon versus mapping versus swap_cache, shortened
>    the lines by avoiding origin_folio->mapping and &release->page.

All these fixes make sense to me. Thanks again for your effort.

Hi Andrew,

Do you mind folding Hugh’s fixes to this patch? Let me know if you prefer
a V10. Thanks.

>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/huge_memory.c | 39 ++++++++++++++++++++-------------------
>  1 file changed, 20 insertions(+), 19 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0e45937c0d91..9ce3906672b9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3612,7 +3612,9 @@ static void __split_folio_to_order(struct folio *folio, int new_order)
>  		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
>  	}
>
> -	if (!new_order)
> +	if (new_order)
> +		folio_set_order(folio, new_order);
> +	else
>  		ClearPageCompound(&folio->page);
>  }
>
> @@ -3682,7 +3684,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  	int ret = 0;
>  	bool stop_split = false;
>
> -	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
> +	if (folio_test_swapcache(folio)) {
> +		VM_BUG_ON(mapping);
> +
>  		/* a swapcache folio can only be uniformly split to order-0 */
>  		if (!uniform_split || new_order != 0)
>  			return -EINVAL;
> @@ -3750,9 +3754,8 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  		 * is new_order, since the folio will be worked on in next
>  		 * iteration.
>  		 */
> -		for (release = folio, next = folio_next(folio);
> -		     release != end_folio;
> -		     release = next, next = folio_next(next)) {
> +		for (release = folio; release != end_folio; release = next) {
> +			next = folio_next(release);
>  			/*
>  			 * for buddy allocator like split, the folio containing
>  			 * page will be split next and should not be released,
> @@ -3784,32 +3787,31 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  			lru_add_page_tail(origin_folio, &release->page,
>  						lruvec, list);
>
> -			/* Some pages can be beyond EOF: drop them from page cache */
> +			/* Some pages can be beyond EOF: drop them from cache */
>  			if (release->index >= end) {
> -				if (shmem_mapping(origin_folio->mapping))
> +				if (shmem_mapping(mapping))
>  					nr_dropped += folio_nr_pages(release);
>  				else if (folio_test_clear_dirty(release))
>  					folio_account_cleaned(release,
> -						inode_to_wb(origin_folio->mapping->host));
> +						inode_to_wb(mapping->host));
>  				__filemap_remove_folio(release, NULL);
>  				folio_put(release);
> -			} else if (!folio_test_anon(release)) {
> -				__xa_store(&origin_folio->mapping->i_pages,
> -						release->index, &release->page, 0);
> +			} else if (mapping) {
> +				__xa_store(&mapping->i_pages,
> +						release->index, release, 0);
>  			} else if (swap_cache) {
>  				__xa_store(&swap_cache->i_pages,
>  						swap_cache_index(release->swap),
> -						&release->page, 0);
> +						release, 0);
>  			}
>  		}
>  	}
>
>  	unlock_page_lruvec(lruvec);
>
> -	if (folio_test_anon(origin_folio)) {
> -		if (folio_test_swapcache(origin_folio))
> -			xa_unlock(&swap_cache->i_pages);
> -	} else
> +	if (swap_cache)
> +		xa_unlock(&swap_cache->i_pages);
> +	if (mapping)
>  		xa_unlock(&mapping->i_pages);
>
>  	/* Caller disabled irqs, so they are still disabled here */
> @@ -3828,9 +3830,8 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  	 * For buddy allocator like split, the first after-split folio is left
>  	 * for caller to unlock.
>  	 */
> -	for (new_folio = origin_folio, next = folio_next(origin_folio);
> -	     new_folio != next_folio;
> -	     new_folio = next, next = folio_next(next)) {
> +	for (new_folio = origin_folio; new_folio != next_folio; new_folio = next) {
> +		next = folio_next(new_folio);
>  		if (new_folio == page_folio(lock_at))
>  			continue;
>
> -- 
> 2.43.0


Best Regards,
Yan, Zi
Andrew Morton March 4, 2025, 8:29 p.m. UTC | #6
On Tue, 04 Mar 2025 11:20:53 -0500 Zi Yan <ziy@nvidia.com> wrote:

> Do you mind folding Hugh’s fixes to this patch? Let me know if you prefer
> a V10. Thanks.

I think a new series, please. I'll remove the current version from mm.git.

Can I suggest that you repeat Hugh's testing, hopefully see the same
failures and then get in and debug them?
Zi Yan March 4, 2025, 8:34 p.m. UTC | #7
On 4 Mar 2025, at 15:29, Andrew Morton wrote:

> On Tue, 04 Mar 2025 11:20:53 -0500 Zi Yan <ziy@nvidia.com> wrote:
>
>> Do you mind folding Hugh’s fixes to this patch? Let me know if you prefer
>> a V10. Thanks.
>
> I think a new series, please. I'll remove the current version from mm.git.
>
> Can I suggest that you repeat Hugh's testing, hopefully see the same
> failures and then get in and debug them?
Sure. Will do that.

Best Regards,
Yan, Zi
Zi Yan March 5, 2025, 7:45 p.m. UTC | #8
On 4 Mar 2025, at 6:49, Hugh Dickins wrote:

> On Wed, 26 Feb 2025, Zi Yan wrote:
>
>> This is a preparation patch, both added functions are not used yet.
>>
>> The added __split_unmapped_folio() is able to split a folio with its
>> mapping removed in two manners: 1) uniform split (the existing way), and
>> 2) buddy allocator like split.
>>
>> The added __split_folio_to_order() can split a folio into any lower order.
>> For uniform split, __split_unmapped_folio() calls it once to split the
>> given folio to the new order.  For buddy allocator split,
>> __split_unmapped_folio() calls it (folio_order - new_order) times and each
>> time splits the folio containing the given page to one lower order.
>>
>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>
> Sorry, I'm tired and don't really want to be writing this yet, but the
> migrate "hotfix" has tipped my hand, and I need to get this out to you
> before more days pass.
>
> I'd been unable to complete even a single iteration of my "kernel builds
> on huge tmpfs while swapping to SSD" testing during this current 6.14-rc
> mm.git cycle (6.14-rc itself fine) - until the last week, when some
> important fixes have come in, so I'm no longer getting I/O errors from
> ext4-on-loop0-on-huge-tmpfs, and "Huh VM_FAULT_OOM leaked" warnings: good.
>
> But I still can't get beyond a few iterations, a few minutes: there's
> some corruption of user data, which usually manifests as a kernel build
> failing because fixdep couldn't find some truncated-on-the-left pathname.
>
> While it definitely bisected to your folio_split() series, it's quite
> possible that you're merely exposing an existing bug to wider use.
>
> I've spent the last few days trying to track this down, but still not
> succeeded: I'm still getting much the same corruption.  But have been
> folding in various fixes as I found them, even though they have not
> solved the main problem at all.  I'll return to trying to debug the
> corruption "tomorrow".
>
> I think (might be wrong, I'm in a rush) my mods are all to this
> "add two new (not yet used) functions for folio_split()" patch:
> please merge them in if you agree.
>
> 1. From source inspection, it looks like a folio_set_order() was missed.

Actually no. folio_set_order(folio, new_order) is called multiple times
in the for loop above. It is duplicated but not missing.

>
> 2. Why is swapcache only checked when folio_test_anon? I can see that
>    you've just copied that over from the old __split_huge_page(), but
>    it seems wrong to me here and there - I guess a relic from before
>    shmem could swap out a huge page.

Yes, it is a relic, but it is still correct until I change another relic
in __folio_split() or split_huge_page_to_list_to_order() from mainline,
namely "if (!mapping) { ret = -EBUSY; goto out; }", which excludes the
shmem-in-swap-cache case. I will probably leave it as is in my next
folio_split() version to avoid adding more potential bugs, but will come
back to it later in another patch.


Best Regards,
Yan, Zi
Hugh Dickins March 5, 2025, 8:50 p.m. UTC | #9
On Wed, 5 Mar 2025, Zi Yan wrote:
> On 4 Mar 2025, at 6:49, Hugh Dickins wrote:
> >
> > I think (might be wrong, I'm in a rush) my mods are all to this
> > "add two new (not yet used) functions for folio_split()" patch:
> > please merge them in if you agree.
> >
> > 1. From source inspection, it looks like a folio_set_order() was missed.
> 
> Actually no. folio_set_order(folio, new_order) is called multiple times
> in the for loop above. It is duplicated but not missing.

I was about to disagree with you, when at last I saw that, yes,
it is doing that on "folio" at the time of setting up "new_folio".

That is confusing: in all other respects, that loop is reading folio
to set up new_folio.  Do you have a reason for doing it there?

The transient "nested folio" situation is anomalous either way.
I'd certainly prefer it to be done at the point where you
ClearPageCompound when !new_order; but if you think there's an issue
with racing isolate_migratepages_block() or something like that, which
your current placement handles better, then please add a line of comment
both where you do it and where I expected to find it - thanks.

(Historically, there was quite a lot of difficulty in getting the order
of events in __split_huge_page_tail() to be safe: I wonder whether we
shall see a crop of new weird bugs from these changes. I note that your
loops advance forwards, whereas the old ones went backwards: but I don't
have anything to say you're wrong.  I think it's mainly a matter of how
the first tail or two gets handled: which might be why you want to
folio_set_order(folio, new_order) at the earliest opportunity.)

> 
> >
> > 2. Why is swapcache only checked when folio_test_anon? I can see that
> >    you've just copied that over from the old __split_huge_page(), but
> >    it seems wrong to me here and there - I guess a relic from before
> >    shmem could swap out a huge page.
> 
> Yes, it is a relic, but it is still right before I change another relic
> in __folio_split() or split_huge_page_to_list_to_order() from mainline,
> if (!mapping) { ret = -EBUSY; goto out; }. It excludes the shmem in swap
> cache case. I probably will leave it as is in my next folio_split() version
> to avoid adding more potential bugs, but will come back later in another
> patch.

I agree.  The "Truncated ?" check.  Good.  But I do prefer that you use
that part of my patch, referring to mapping and swap_cache instead of anon,
rather than rely on that accident of what's done at the higher level.

Thanks,
Hugh
Hugh Dickins March 5, 2025, 9:03 p.m. UTC | #10
On Tue, 4 Mar 2025, Zi Yan wrote:
> On 4 Mar 2025, at 6:49, Hugh Dickins wrote:
> >
> > I'd been unable to complete even a single iteration of my "kernel builds
> > on huge tmpfs while swapping to SSD" testing during this current 6.14-rc
> > mm.git cycle (6.14-rc itself fine) - until the last week, when some
> > important fixes have come in, so I'm no longer getting I/O errors from
> > ext4-on-loop0-on-huge-tmpfs, and "Huh VM_FAULT_OOM leaked" warnings: good.
> 
> This error should be related to the other patch I sent out on using
> xas_try_split() in shmem_large_entry_split(). Great to have you confirm
> it fixed some of the bugs.
> 
> >
> > But I still can't get beyond a few iterations, a few minutes: there's
> > some corruption of user data, which usually manifests as a kernel build
> > failing because fixdep couldn't find some truncated-on-the-left pathname.
> 
> It is likely that this patch might fix it (partially):
> https://lore.kernel.org/linux-mm/56EBE3B6-99EA-470E-B2B3-92C9C13032DF@nvidia.com/.
> Andrew has picked it yesterday.

No, that's a fix to a truncation issue which I had not hit:
I did try adding that patch, but it has not helped in my case.

Beyond checking that, I didn't have time yesterday to investigate
further, but I'll try again today (still using last weekend's mm.git).

Hugh
Zi Yan March 5, 2025, 9:08 p.m. UTC | #11
On 5 Mar 2025, at 15:50, Hugh Dickins wrote:

> On Wed, 5 Mar 2025, Zi Yan wrote:
>> On 4 Mar 2025, at 6:49, Hugh Dickins wrote:
>>>
>>> I think (might be wrong, I'm in a rush) my mods are all to this
>>> "add two new (not yet used) functions for folio_split()" patch:
>>> please merge them in if you agree.
>>>
>>> 1. From source inspection, it looks like a folio_set_order() was missed.
>>
>> Actually no. folio_set_order(folio, new_order) is called multiple times
>> in the for loop above. It is duplicated but not missing.
>
> I was about to disagree with you, when at last I saw that, yes,
> it is doing that on "folio" at the time of setting up "new_folio".
>
> That is confusing: in all other respects, that loop is reading folio
> to set up new_folio.  Do you have a reason for doing it there?

No. I agree your fix is better. I was just pointing out that
folio_set_order() should not trigger a bug.

>
> The transient "nested folio" situation is anomalous either way.
> I'd certainly prefer it to be done at the point where you
> ClearPageCompound when !new_order; but if you think there's an issue
> with racing isolate_migratepages_block() or something like that, which
> your current placement handles better, then please add a line of comment
> both where you do it and where I expected to find it - thanks.

Sure. I will use your patch unless I find some racing issue.

>
> (Historically, there was quite a lot of difficulty in getting the order
> of events in __split_huge_page_tail() to be safe: I wonder whether we
> shall see a crop of new weird bugs from these changes. I note that your
> loops advance forwards, whereas the old ones went backwards: but I don't
> have anything to say you're wrong.  I think it's mainly a matter of how
> the first tail or two gets handled: which might be why you want to
> folio_set_order(folio, new_order) at the earliest opportunity.)

I am worried about that too. In addition, in __split_huge_page_tail(),
the page refcount is restored right after each new tail folio is split off,
whereas I need to delay that until all new after-split folios
are done, since non-uniform split is iterative and only the after-split
folios NOT containing the split_at page will be released. These
folios are locked and frozen after __split_folio_to_order(), like
the original folio. Maybe the issue is that there are more such locked,
frozen folios than before?


>>
>>>
>>> 2. Why is swapcache only checked when folio_test_anon? I can see that
>>>    you've just copied that over from the old __split_huge_page(), but
>>>    it seems wrong to me here and there - I guess a relic from before
>>>    shmem could swap out a huge page.
>>
>> Yes, it is a relic, but it is still right before I change another relic
>> in __folio_split() or split_huge_page_to_list_to_order() from mainline,
>> if (!mapping) { ret = -EBUSY; goto out; }. It excludes the shmem in swap
>> cache case. I probably will leave it as is in my next folio_split() version
>> to avoid adding more potential bugs, but will come back later in another
>> patch.
>
> I agree.  The "Truncated ?" check.  Good.  But I do prefer that you use
> that part of my patch, referring to mapping and swap_cache instead of anon,
> rather than rely on that accident of what's done at the higher level.

Definitely.


Best Regards,
Yan, Zi
Zi Yan March 5, 2025, 9:10 p.m. UTC | #12
On 5 Mar 2025, at 16:03, Hugh Dickins wrote:

> On Tue, 4 Mar 2025, Zi Yan wrote:
>> On 4 Mar 2025, at 6:49, Hugh Dickins wrote:
>>>
>>> I'd been unable to complete even a single iteration of my "kernel builds
>>> on huge tmpfs while swapping to SSD" testing during this current 6.14-rc
>>> mm.git cycle (6.14-rc itself fine) - until the last week, when some
>>> important fixes have come in, so I'm no longer getting I/O errors from
>>> ext4-on-loop0-on-huge-tmpfs, and "Huh VM_FAULT_OOM leaked" warnings: good.
>>
>> This error should be related to the other patch I sent out on using
>> xas_try_split() in shmem_large_entry_split(). Great to have you confirm
>> it fixed some of the bugs.
>>
>>>
>>> But I still can't get beyond a few iterations, a few minutes: there's
>>> some corruption of user data, which usually manifests as a kernel build
>>> failing because fixdep couldn't find some truncated-on-the-left pathname.
>>
>> It is likely that this patch might fix it (partially):
>> https://lore.kernel.org/linux-mm/56EBE3B6-99EA-470E-B2B3-92C9C13032DF@nvidia.com/.
>> Andrew has picked it yesterday.
>
> No, that's a fix to a truncation issue which I had not hit:
> I did try adding that patch, but it has not helped in my case.

Got it.

>
> Beyond checking that, I didn't have time yesterday to investigate
> further, but I'll try again today (still using last weekend's mm.git).

I am trying to replicate your runs locally. Can you clarify your steps
of “kernel builds on huge tmpfs while swapping to SSD”? Do you impose
a memory limit so that anonymous memory is swapped to SSD or make tmpfs
swap to SSD?

Thanks.


Best Regards,
Yan, Zi
Hugh Dickins March 5, 2025, 9:49 p.m. UTC | #13
On Wed, 5 Mar 2025, Zi Yan wrote:
> On 5 Mar 2025, at 15:50, Hugh Dickins wrote:
> >
> > (Historically, there was quite a lot of difficulty in getting the order
> > of events in __split_huge_page_tail() to be safe: I wonder whether we
> > shall see a crop of new weird bugs from these changes. I note that your
> > loops advance forwards, whereas the old ones went backwards: but I don't
> > have anything to say you're wrong.  I think it's mainly a matter of how
> > the first tail or two gets handled: which might be why you want to
> > folio_set_order(folio, new_order) at the earliest opportunity.)
> 
> I am worried about that too. In addition, in __split_huge_page_tail(),
> page refcount is restored right after new tail folio split is done,
> whereas I needed to delay them until all new after-split folios
> are done, since non-uniform split is iterative and only the after-split
> folios NOT containing the split_at page will be released. These
> folios are locked and frozen after __split_folio_to_order() like
> the original folio. Maybe because there are more such locked frozen
> folios than before?

Sorry, I gave up trying to work out what you're asking me there.
Let's assume I don't know the answer.

Hugh
Hugh Dickins March 5, 2025, 10:38 p.m. UTC | #14
On Wed, 5 Mar 2025, Zi Yan wrote:
> On 5 Mar 2025, at 16:03, Hugh Dickins wrote:
> >
> > Beyond checking that, I didn't have time yesterday to investigate
> > further, but I'll try again today (still using last weekend's mm.git).
> 
> I am trying to replicate your runs locally. Can you clarify your steps
> of “kernel builds on huge tmpfs while swapping to SSD”? Do you impose
> a memory limit so that anonymous memory is swapped to SSD or make tmpfs
> swap to SSD?

Yeah, my heart sank a bit when I saw Andrew (with good intention) asking
you to repeat my testing.

We could spend weeks going back and forth on that, and neither of us has
weeks to spare.

"To fulfil contractual obligations" I'll mail you the tarfile I send
out each time I'm asked for this; but I haven't updated that tarfile
in four years, whereas I'm frequently tweaking things to match what's
needed (most recently and relevantly, I guess enabling 64kB hugepages
for anon and shmem in addition to the PMD-sized).

Please don't waste much of your time over trying to replicate what
I'm doing: just give the scripts a glance, as a source for "oh,
I could exercise something like that in my testing too" ideas.

Yes, I limit physical memory by booting with mem=1G, and also apply
lower memcg v1 limits.

I made a point of saying "SSD" there because I'm not testing zram or
zswap at all, whereas many others are testing those rather than disk.

swapoff, and ext4 on loop0 on tmpfs, feature in what I exercise, but are
NOT relevant to the corruption I'm seeing here - that can occur before
any swapoff, and it's always on the kernel build in tmpfs: the parallel
build in ext4 on loop0 on tmpfs completes successfully.

Hugh
David Hildenbrand March 6, 2025, 9:19 a.m. UTC | #15
On 05.03.25 22:08, Zi Yan wrote:
> On 5 Mar 2025, at 15:50, Hugh Dickins wrote:
> 
>> On Wed, 5 Mar 2025, Zi Yan wrote:
>>> On 4 Mar 2025, at 6:49, Hugh Dickins wrote:
>>>>
>>>> I think (might be wrong, I'm in a rush) my mods are all to this
>>>> "add two new (not yet used) functions for folio_split()" patch:
>>>> please merge them in if you agree.
>>>>
>>>> 1. From source inspection, it looks like a folio_set_order() was missed.
>>>
>>> Actually no. folio_set_order(folio, new_order) is called multiple times
>>> in the for loop above. It is duplicated but not missing.
>>
>> I was about to disagree with you, when at last I saw that, yes,
>> it is doing that on "folio" at the time of setting up "new_folio".
>>
>> That is confusing: in all other respects, that loop is reading folio
>> to set up new_folio.  Do you have a reason for doing it there?
> 
> No. I agree your fix is better. Just point out folio_set_order() should
> not trigger a bug.
> 
>>
>> The transient "nested folio" situation is anomalous either way.
>> I'd certainly prefer it to be done at the point where you
>> ClearPageCompound when !new_order; but if you think there's an issue
>> with racing isolate_migratepages_block() or something like that, which
>> your current placement handles better, then please add a line of comment
>> both where you do it and where I expected to find it - thanks.
> 
> Sure. I will use your patch unless I find some racing issue.
> 
>>
>> (Historically, there was quite a lot of difficulty in getting the order
>> of events in __split_huge_page_tail() to be safe: I wonder whether we
>> shall see a crop of new weird bugs from these changes. I note that your
>> loops advance forwards, whereas the old ones went backwards: but I don't
>> have anything to say you're wrong.  I think it's mainly a matter of how
>> the first tail or two gets handled: which might be why you want to
>> folio_set_order(folio, new_order) at the earliest opportunity.)
> 
> I am worried about that too. In addition, in __split_huge_page_tail(),
> page refcount is restored right after new tail folio split is done,
> whereas I needed to delay them until all new after-split folios
> are done, since non-uniform split is iterative and only the after-split
> folios NOT containing the split_at page will be released. These
> folios are locked and frozen after __split_folio_to_order() like
> the original folio. Maybe because there are more such locked frozen
> folios than before?

What's the general concern here?

A frozen folio cannot be referenced and consequently cannot be trusted. For
example, if we want to speculatively look up a folio in the pagecache and
find it to be frozen, we'll have to spin (retry) until we find a folio
that is unfrozen.

While a folio has a refcount of 0, there are no guarantees. It could 
change its size, it could be freed + reallocated (changed mapping etc) ...

So whoever wants to grab a speculative reference -- using 
folio_try_get() -- must re-verify folio properties after grabbing the 
speculative reference succeeded. Including whether it is small/large, 
number of pages, mapping, ...

The important part is to unfreeze a folio only once it was fully 
prepared (e.g., order set, compound pages links to head set up etc).

I am not sure the sequence in which you process folios during a split
matters here: only that whatever new folio is unfrozen is properly
initialized.
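
For reference, a minimal sketch of the speculative-lookup pattern described
above, loosely modeled on the pagecache lookup in mm/filemap.c (illustrative
only; the xa_state setup and error handling are simplified):

	/* sketch: speculative pagecache lookup that tolerates frozen folios */
	XA_STATE(xas, &mapping->i_pages, index);
	struct folio *folio;

	rcu_read_lock();
repeat:
	xas_reset(&xas);
	folio = xas_load(&xas);
	if (!folio || xa_is_value(folio))
		goto out;
	/* refcount 0 means frozen (e.g. mid-split): spin and retry */
	if (!folio_try_get(folio))
		goto repeat;
	/* folio may have been freed and reused: re-verify after taking a ref */
	if (unlikely(folio != xas_reload(&xas))) {
		folio_put(folio);
		goto repeat;
	}
out:
	rcu_read_unlock();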
Zi Yan March 6, 2025, 4:21 p.m. UTC | #16
On 5 Mar 2025, at 17:38, Hugh Dickins wrote:

> On Wed, 5 Mar 2025, Zi Yan wrote:
>> On 5 Mar 2025, at 16:03, Hugh Dickins wrote:
>>>
>>> Beyond checking that, I didn't have time yesterday to investigate
>>> further, but I'll try again today (still using last weekend's mm.git).
>>
>> I am trying to replicate your runs locally. Can you clarify your steps
>> of “kernel builds on huge tmpfs while swapping to SSD”? Do you impose
>> a memory limit so that anonymous memory is swapped to SSD or make tmpfs
>> swap to SSD?
>
> Yeah, my heart sank a bit when I saw Andrew (with good intention) asking
> you to repeat my testing.
>
> We could spend weeks going back and forth on that, and neither of us has
> weeks to spare.
>
> "To fulfil contractual obligations" I'll mail you the tarfile I send
> out each time I'm asked for this; but I haven't updated that tarfile
> in four years, whereas I'm frequently tweaking things to match what's
> needed (most recently and relevantly, I guess enabling 64kB hugepages
> for anon and shmem in addition to the PMD-sized).
>
> Please don't waste much of your time over trying to replicate what
> I'm doing: just give the scripts a glance, as a source for "oh,
> I could exercise something like that in my testing too" ideas.
>
> Yes, I limit physical memory by booting with mem=1G, and also apply
> lower memcg v1 limits.
>
> I made a point of saying "SSD" there because I'm not testing zram or
> zswap at all, whereas many others are testing those rather than disk.
>
> swapoff, and ext4 on loop0 on tmpfs, feature in what I exercise, but are
> NOT relevant to the corruption I'm seeing here - that can occur before
> any swapoff, and it's always on the kernel build in tmpfs: the parallel
> build in ext4 on loop0 on tmpfs completes successfully.

Thanks for the scripts. I have roughly replicated your setup as follows:

1. boot a VM with 1GB memory and 8 cores;
2. mount a tmpfs with huge=always and 200GB;
3. clone the mainline kernel and use the x86_64 defconfig (my gcc 14 gives
   errors when building the old kernels); this takes about 2GB of space,
   so some of the tmpfs is already swapped to SSD;
4. create a new cgroup v2 and set memory.high to 700MB to induce swapping
   during kernel compilation;
5. run “while true; do echo 1 | sudo tee /proc/sys/vm/compact_memory >/dev/null; done” to trigger compaction all the time;
6. build the kernel with make -j20.

I ran the above on mm-everything-2025-03-05-03-54 plus the xarray fix v3,
folio_split() with your fixes, and the "Minimize xa_node allocation during
xarray split" patches. The repo is at: https://github.com/x-y-z/linux-dev/tree/shmem_fix-mm-everything-2025-03-05-03-54.

It has run overnight for 30 kernel builds and no crash has happened so far.
I wonder if you can give my repo a shot.

I just boosted khugepaged like you did and saw no immediate crash, but I
will let it run for longer.

Thanks.

Best Regards,
Yan, Zi
Zi Yan March 6, 2025, 4:27 p.m. UTC | #17
On 6 Mar 2025, at 4:19, David Hildenbrand wrote:

> On 05.03.25 22:08, Zi Yan wrote:
>> On 5 Mar 2025, at 15:50, Hugh Dickins wrote:
>>
>>> On Wed, 5 Mar 2025, Zi Yan wrote:
>>>> On 4 Mar 2025, at 6:49, Hugh Dickins wrote:
>>>>>
>>>>> I think (might be wrong, I'm in a rush) my mods are all to this
>>>>> "add two new (not yet used) functions for folio_split()" patch:
>>>>> please merge them in if you agree.
>>>>>
>>>>> 1. From source inspection, it looks like a folio_set_order() was missed.
>>>>
>>>> Actually no. folio_set_order(folio, new_order) is called multiple times
>>>> in the for loop above. It is duplicated but not missing.
>>>
>>> I was about to disagree with you, when at last I saw that, yes,
>>> it is doing that on "folio" at the time of setting up "new_folio".
>>>
>>> That is confusing: in all other respects, that loop is reading folio
>>> to set up new_folio.  Do you have a reason for doing it there?
>>
>> No. I agree your fix is better. Just point out folio_set_order() should
>> not trigger a bug.
>>
>>>
>>> The transient "nested folio" situation is anomalous either way.
>>> I'd certainly prefer it to be done at the point where you
>>> ClearPageCompound when !new_order; but if you think there's an issue
>>> with racing isolate_migratepages_block() or something like that, which
>>> your current placement handles better, then please add a line of comment
>>> both where you do it and where I expected to find it - thanks.
>>
>> Sure. I will use your patch unless I find some racing issue.
>>
>>>
>>> (Historically, there was quite a lot of difficulty in getting the order
>>> of events in __split_huge_page_tail() to be safe: I wonder whether we
>>> shall see a crop of new weird bugs from these changes. I note that your
>>> loops advance forwards, whereas the old ones went backwards: but I don't
>>> have anything to say you're wrong.  I think it's mainly a matter of how
>>> the first tail or two gets handled: which might be why you want to
>>> folio_set_order(folio, new_order) at the earliest opportunity.)
>>
>> I am worried about that too. In addition, in __split_huge_page_tail(),
>> page refcount is restored right after new tail folio split is done,
>> whereas I needed to delay them until all new after-split folios
>> are done, since non-uniform split is iterative and only the after-split
>> folios NOT containing the split_at page will be released. These
>> folios are locked and frozen after __split_folio_to_order() like
>> the original folio. Maybe because there are more such locked frozen
>> folios than before?
>
> What's the general concern here?
>
> A frozen folio cannot be referenced and consequently not trusted. For example, if we want to speculatively lookup a folio in the pagecache and find it to be frozen, we'll have to spin (retry) until we find a folio that is unfrozen.
>
> While a folio has a refcount of 0, there are no guarantees. It could change its size, it could be freed + reallocated (changed mapping etc) ...
>
> So whoever wants to grab a speculative reference -- using folio_try_get() -- must re-verify folio properties after grabbing the speculative reference succeeded. Including whether it is small/large, number of pages, mapping, ...
>
> The important part is to unfreeze a folio only once it was fully prepared (e.g., order set, compound pages links to head set up etc).
>
> I am not sure if the sequence in which you process folios during a split matters here when doing a split: only that, whatever new folio  is unfrozen, is properly initialized.

Got it. Thanks for the confirmation.

My worry came from the fact that, after I rebased on
mm-everything-2025-03-05-03-54, which does not have the folio_split()
patches, I saw a crash saying a buddy page was hit in
__split_folio_to_order(). It turns out that I had not added the changes
from your “mm: let _folio_nr_pages overlay memcg_data in first tail page”
patch. With that fixed, no crash has been observed so far.


Best Regards,
Yan, Zi
Zi Yan March 7, 2025, 3:23 p.m. UTC | #18
On 6 Mar 2025, at 11:21, Zi Yan wrote:

> On 5 Mar 2025, at 17:38, Hugh Dickins wrote:
>
>> On Wed, 5 Mar 2025, Zi Yan wrote:
>>> On 5 Mar 2025, at 16:03, Hugh Dickins wrote:
>>>>
>>>> Beyond checking that, I didn't have time yesterday to investigate
>>>> further, but I'll try again today (still using last weekend's mm.git).
>>>
>>> I am trying to replicate your runs locally. Can you clarify your steps
>>> of “kernel builds on huge tmpfs while swapping to SSD”? Do you impose
>>> a memory limit so that anonymous memory is swapped to SSD or make tmpfs
>>> swap to SSD?
>>
>> Yeah, my heart sank a bit when I saw Andrew (with good intention) asking
>> you to repeat my testing.
>>
>> We could spend weeks going back and forth on that, and neither of us has
>> weeks to spare.
>>
>> "To fulfil contractual obligations" I'll mail you the tarfile I send
>> out each time I'm asked for this; but I haven't updated that tarfile
>> in four years, whereas I'm frequently tweaking things to match what's
>> needed (most recently and relevantly, I guess enabling 64kB hugepages
>> for anon and shmem in addition to the PMD-sized).
>>
>> Please don't waste much of your time over trying to replicate what
>> I'm doing: just give the scripts a glance, as a source for "oh,
>> I could exercise something like that in my testing too" ideas.
>>
>> Yes, I limit physical memory by booting with mem=1G, and also apply
>> lower memcg v1 limits.
>>
>> I made a point of saying "SSD" there because I'm not testing zram or
>> zswap at all, whereas many others are testing those rather than disk.
>>
>> swapoff, and ext4 on loop0 on tmpfs, feature in what I exercise, but are
>> NOT relevant to the corruption I'm seeing here - that can occur before
>> any swapoff, and it's always on the kernel build in tmpfs: the parallel
>> build in ext4 on loop0 on tmpfs completes successfully.
>
> Thanks for the scripts. I kinda replicate your setup as follows:
>
> 1. boot a VM with 1GB memory and 8 cores;
> 2. mount a tmpfs with huge=always and 200GB;
> 3. clone the mainline kernel and use x86_64 defconfig (my gcc 14 gives
>    errors during the old kernel builds), this takes about 2GB space,
>    so some of tmpfs is already swapped to SSD;
> 4. create a new cgroupv2 and set memory.high to 700MB to induce memory
>    swap during kernel compilation;
> 5. run “while true; do echo 1 | sudo tee /proc/sys/vm/compact_memory >/dev/null; done” to trigger compaction all the time;
> 6. build the kernel with make -j20.
>
> I ran the above on mm-everything-2025-03-05-03-54 plus the xarray fix v3,
> folio_split() with your fixes, and Minimize xa_node allocation during
> xarry split patches. The repo is at: https://github.com/x-y-z/linux-dev/tree/shmem_fix-mm-everything-2025-03-05-03-54.
>
> It has ran over night for 30 kernel builds and no crash happened so far.
> I wonder if you can give my repo a shot.
>
> I just boosted khugepaged like you did and see no immediate crash. But I will
> let it run for longer.

I have run this overnight and have not seen any crash. I assume it is stable.
I am going to send V10 and resend "Minimize xa_node allocation during xarray split".


Best Regards,
Yan, Zi
David Hildenbrand March 7, 2025, 5:46 p.m. UTC | #19
On 06.03.25 17:27, Zi Yan wrote:
> On 6 Mar 2025, at 4:19, David Hildenbrand wrote:
> 
>> On 05.03.25 22:08, Zi Yan wrote:
>>> On 5 Mar 2025, at 15:50, Hugh Dickins wrote:
>>>
>>>> On Wed, 5 Mar 2025, Zi Yan wrote:
>>>>> On 4 Mar 2025, at 6:49, Hugh Dickins wrote:
>>>>>>
>>>>>> I think (might be wrong, I'm in a rush) my mods are all to this
>>>>>> "add two new (not yet used) functions for folio_split()" patch:
>>>>>> please merge them in if you agree.
>>>>>>
>>>>>> 1. From source inspection, it looks like a folio_set_order() was missed.
>>>>>
>>>>> Actually no. folio_set_order(folio, new_order) is called multiple times
>>>>> in the for loop above. It is duplicated but not missing.
>>>>
>>>> I was about to disagree with you, when at last I saw that, yes,
>>>> it is doing that on "folio" at the time of setting up "new_folio".
>>>>
>>>> That is confusing: in all other respects, that loop is reading folio
>>>> to set up new_folio.  Do you have a reason for doing it there?
>>>
>>> No. I agree your fix is better. Just point out folio_set_order() should
>>> not trigger a bug.
>>>
>>>>
>>>> The transient "nested folio" situation is anomalous either way.
>>>> I'd certainly prefer it to be done at the point where you
>>>> ClearPageCompound when !new_order; but if you think there's an issue
>>>> with racing isolate_migratepages_block() or something like that, which
>>>> your current placement handles better, then please add a line of comment
>>>> both where you do it and where I expected to find it - thanks.
>>>
>>> Sure. I will use your patch unless I find some racing issue.
>>>
>>>>
>>>> (Historically, there was quite a lot of difficulty in getting the order
>>>> of events in __split_huge_page_tail() to be safe: I wonder whether we
>>>> shall see a crop of new weird bugs from these changes. I note that your
>>>> loops advance forwards, whereas the old ones went backwards: but I don't
>>>> have anything to say you're wrong.  I think it's mainly a matter of how
>>>> the first tail or two gets handled: which might be why you want to
>>>> folio_set_order(folio, new_order) at the earliest opportunity.)
>>>
>>> I am worried about that too. In addition, in __split_huge_page_tail(),
>>> page refcount is restored right after new tail folio split is done,
>>> whereas I needed to delay them until all new after-split folios
>>> are done, since non-uniform split is iterative and only the after-split
>>> folios NOT containing the split_at page will be released. These
>>> folios are locked and frozen after __split_folio_to_order() like
>>> the original folio. Maybe because there are more such locked frozen
>>> folios than before?
>>
>> What's the general concern here?
>>
>> A frozen folio cannot be referenced and consequently not trusted. For example, if we want to speculatively lookup a folio in the pagecache and find it to be frozen, we'll have to spin (retry) until we find a folio that is unfrozen.
>>
>> While a folio has a refcount of 0, there are no guarantees. It could change its size, it could be freed + reallocated (changed mapping etc) ...
>>
>> So whoever wants to grab a speculative reference -- using folio_try_get() -- must re-verify folio properties after grabbing the speculative reference succeeded. Including whether it is small/large, number of pages, mapping, ...
>>
>> The important part is to unfreeze a folio only once it was fully prepared (e.g., order set, compound pages links to head set up etc).
>>
>> I am not sure if the sequence in which you process folios during a split matters here when doing a split: only that, whatever new folio  is unfrozen, is properly initialized.
> 
> Got it. Thanks for the confirmation.
> 
> My worry came from that after I rebased on mm-everything-2025-03-05-03-54,
> which does not have folio_split() patches, I see a crash saying a buddy
> page is hit in __split_folio_to_order(). It turns out that I did not
> add the changes from your “mm: let _folio_nr_pages overlay memcg_data in
> first tail page” patch. With that fixed, no crash is observed so far.

Ah, that makes sense. Yes, it must have looked like my original v3, which was 
based on your patches. (Now it's the other way around :) )
Hugh Dickins March 10, 2025, 8:54 a.m. UTC | #20
On Thu, 6 Mar 2025, Zi Yan wrote:
> On 5 Mar 2025, at 17:38, Hugh Dickins wrote:
> > On Wed, 5 Mar 2025, Zi Yan wrote:
> >> On 5 Mar 2025, at 16:03, Hugh Dickins wrote:
> >>>
> >>> Beyond checking that, I didn't have time yesterday to investigate
> >>> further, but I'll try again today (still using last weekend's mm.git).
> >>
> >> I am trying to replicate your runs locally. Can you clarify your steps
> >> of “kernel builds on huge tmpfs while swapping to SSD”? Do you impose
> >> a memory limit so that anonymous memory is swapped to SSD or make tmpfs
> >> swap to SSD?
> >
> > Yeah, my heart sank a bit when I saw Andrew (with good intention) asking
> > you to repeat my testing.
> >
> > We could spend weeks going back and forth on that, and neither of us has
> > weeks to spare.
> >
> > "To fulfil contractual obligations" I'll mail you the tarfile I send
> > out each time I'm asked for this; but I haven't updated that tarfile
> > in four years, whereas I'm frequently tweaking things to match what's
> > needed (most recently and relevantly, I guess enabling 64kB hugepages
> > for anon and shmem in addition to the PMD-sized).
> >
> > Please don't waste much of your time over trying to replicate what
> > I'm doing: just give the scripts a glance, as a source for "oh,
> > I could exercise something like that in my testing too" ideas.
> >
> > Yes, I limit physical memory by booting with mem=1G, and also apply
> > lower memcg v1 limits.
> >
> > I made a point of saying "SSD" there because I'm not testing zram or
> > zswap at all, whereas many others are testing those rather than disk.
> >
> > swapoff, and ext4 on loop0 on tmpfs, feature in what I exercise, but are
> > NOT relevant to the corruption I'm seeing here - that can occur before
> > any swapoff, and it's always on the kernel build in tmpfs: the parallel
> > build in ext4 on loop0 on tmpfs completes successfully.
> 
> Thanks for the scripts. I kinda replicate your setup as follows:
> 
> 1. boot a VM with 1GB memory and 8 cores;
> 2. mount a tmpfs with huge=always and 200GB;
> 3. clone the mainline kernel and use x86_64 defconfig (my gcc 14 gives
>    errors during the old kernel builds), this takes about 2GB space,
>    so some of tmpfs is already swapped to SSD;
> 4. create a new cgroupv2 and set memory.high to 700MB to induce memory
>    swap during kernel compilation;
> 5. run “while true; do echo 1 | sudo tee /proc/sys/vm/compact_memory >/dev/null; done” to trigger compaction all the time;
> 6. build the kernel with make -j20.
> 
> I ran the above on mm-everything-2025-03-05-03-54 plus the xarray fix v3,
> folio_split() with your fixes, and Minimize xa_node allocation during
> xarry split patches. The repo is at: https://github.com/x-y-z/linux-dev/tree/shmem_fix-mm-everything-2025-03-05-03-54.
> 
> It has ran over night for 30 kernel builds and no crash happened so far.
> I wonder if you can give my repo a shot.

Thanks for trying; I hope you didn't waste too much time on it, as
I was not optimistic that it could be adapted easily.

You appeared to be suggesting above that I try your setup, which did
not reproduce the corruption, instead of mine which did.  And later you
appeared to conclude that all is good because you saw no corruption.

No. I continued with my setup (working on mm-everything-2025-03-08-00-43),
have now root-caused the corruption, and have the fix: as so often,
it is obvious in retrospect.

After looking at enough of the fixdep failures and their .o.d files,
I found them consistent with seeing the head page of a large folio
in place of its first tail (presumably while racing a split).

And the debug patch to filemap_get_entry() below (not something we want
to go into the tree, but good for confirmation - maybe it will even
show warnings on your setup) confirmed that - well, head in place of
first tail was the majority of cases, but head in place of any tail
in general.

There's a race between RCU lookup of the xarray and your splitting.
That's something that normally the xas_reload(&xas) check guards
against, but it was not effective.
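
For reference, a simplified sketch of the lookup side this races with,
modeled on filemap_get_entry() (shadow-entry and error handling trimmed,
so treat it as an illustration rather than the exact upstream code).
In the bad interleaving, the slot for a tail index still holds the old
head, which has already been unfrozen, so both checks pass and the lookup
returns the head folio for a tail index:

static struct folio *lookup_sketch(struct address_space *mapping, pgoff_t index)
{
	XA_STATE(xas, &mapping->i_pages, index);
	struct folio *folio;

	rcu_read_lock();
repeat:
	xas_reset(&xas);
	folio = xas_load(&xas);		/* a stale tail slot may still hold the old head */
	if (!folio || xa_is_value(folio)) {
		folio = NULL;
		goto out;
	}
	if (!folio_try_get(folio))	/* fails only while the refcount is frozen */
		goto repeat;
	if (unlikely(folio != xas_reload(&xas))) {
		/* slot changed under us: drop the speculative ref and retry */
		folio_put(folio);
		goto repeat;
	}
out:
	rcu_read_unlock();
	return folio;
}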

I wasted a day getting it exactly back to front: I thought the problem
was that __split_unmapped_folio() needed to __xa_store the former tail
before unfreezing it; but the patch reversing that ordering still
issued some warnings.

No, it's that __split_unmapped_folio() must not unfreeze the original
head until all the xa slots which used to point to it have been
updated with their new contents (as __split_huge_page() always did).
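
In other words, a distilled sketch of just that ordering (the LRU, anon,
swap cache and beyond-EOF handling stay exactly as in the real hunk below;
this only shows where the head's unfreeze has to move):

	for (release = folio; release != end_folio; release = folio_next(release)) {
		if (release == origin_folio)
			continue;	/* keep the old head frozen for now */
		folio_ref_unfreeze(release, 1 +
				((mapping || swap_cache) ? folio_nr_pages(release) : 0));
		/* ... LRU handling ... then repoint the slots this folio now covers */
		__xa_store(&mapping->i_pages, release->index, release, 0);
	}
	/*
	 * Only once no slot can hand out the old head for an index it no
	 * longer covers is it safe to unfreeze the original head.
	 */
	folio_ref_unfreeze(origin_folio, 1 +
			((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0));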

I've taken the liberty of simplifying the unfreeze calculation in terms
of mapping and swap_cache, as we did elsewhere (and after fiddling
around with the comment above it, just dropped it as adding nothing
beyond what the code already does).

And there's one other, unrelated change in there: I've changed the
folio_put() after __filemap_remove_folio() to folio_put_refs().
I believe that is what's correct there, but please check carefully:
I'm a little worried that my testing (which I expect to be exercising
that "beyond EOF" case plenty) has run well both before and after that
change, whereas I thought there should be a noticeable leak of memory
while it was just folio_put().

Or maybe my testing is barely exercising anything more than uniform
splits to 0-order, in which case there's no difference: I imagine
your selftests (I've not tried them) will do much better on that.
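
To spell out the arithmetic behind that, a worked example for an order-2
release folio beyond EOF (assuming the unfreeze above gave it
1 + folio_nr_pages(release) references for the pagecache case):

  after folio_ref_unfreeze():    refcount = 1 + 4 = 5
  __filemap_remove_folio():      clears the slots, drops no references itself
  folio_put_refs(release, 4):    refcount = 1 (dropped later by the final
                                 free_page_and_swap_cache() pass)
  plain folio_put() instead:     refcount = 4, i.e. three references leaked

For an order-0 folio the two calls are identical, which is why a test doing
only uniform splits to order 0 would never notice the difference.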

Please fold in the mm/huge_memory.c mods if you are in agreement.

Signed-off-by: Hugh Dickins <hughd@google.com>

diff --git a/mm/filemap.c b/mm/filemap.c
index f7281ad22743..34b4fdafec40 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1871,6 +1871,15 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
 		folio_put(folio);
 		goto repeat;
 	}
+
+	if (mapping->host /* filter out swap cache */ &&
+	    /* !folio_contains(folio, index), but that requires lock */
+	    WARN_ON(index - folio->index >= folio_nr_pages(folio))) {
+		pr_warn("Mismatched index:%#lx\n", index);
+		dump_page(&folio->page, NULL);
+		folio_put(folio);
+		goto repeat;
+	}
 out:
 	rcu_read_unlock();
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3e05e62fdccb..be0c9873019c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3787,18 +3787,13 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 						MTHP_STAT_NR_ANON, 1);
 			}
 
-			/*
-			 * Unfreeze refcount first. Additional reference from
-			 * page cache.
-			 */
-			folio_ref_unfreeze(release,
-				1 + ((!folio_test_anon(origin_folio) ||
-				     folio_test_swapcache(origin_folio)) ?
-					     folio_nr_pages(release) : 0));
-
 			if (release == origin_folio)
 				continue;
 
+			folio_ref_unfreeze(release, 1 +
+					((mapping || swap_cache) ?
+						folio_nr_pages(release) : 0));
+
 			lru_add_page_tail(origin_folio, &release->page,
 						lruvec, list);
 
@@ -3810,7 +3805,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 					folio_account_cleaned(release,
 						inode_to_wb(mapping->host));
 				__filemap_remove_folio(release, NULL);
-				folio_put(release);
+				folio_put_refs(release, folio_nr_pages(release));
 			} else if (mapping) {
 				__xa_store(&mapping->i_pages,
 						release->index, release, 0);
@@ -3822,6 +3817,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		}
 	}
 
+	folio_ref_unfreeze(origin_folio, 1 +
+		((mapping || swap_cache) ?  folio_nr_pages(origin_folio) : 0));
+
 	unlock_page_lruvec(lruvec);
 
 	if (swap_cache)
Zi Yan March 10, 2025, 3:35 p.m. UTC | #21
On 10 Mar 2025, at 4:54, Hugh Dickins wrote:

> On Thu, 6 Mar 2025, Zi Yan wrote:
>> On 5 Mar 2025, at 17:38, Hugh Dickins wrote:
>>> On Wed, 5 Mar 2025, Zi Yan wrote:
>>>> On 5 Mar 2025, at 16:03, Hugh Dickins wrote:
>>>>>
>>>>> Beyond checking that, I didn't have time yesterday to investigate
>>>>> further, but I'll try again today (still using last weekend's mm.git).
>>>>
>>>> I am trying to replicate your runs locally. Can you clarify your steps
>>>> of “kernel builds on huge tmpfs while swapping to SSD”? Do you impose
>>>> a memory limit so that anonymous memory is swapped to SSD or make tmpfs
>>>> swap to SSD?
>>>
>>> Yeah, my heart sank a bit when I saw Andrew (with good intention) asking
>>> you to repeat my testing.
>>>
>>> We could spend weeks going back and forth on that, and neither of us has
>>> weeks to spare.
>>>
>>> "To fulfil contractual obligations" I'll mail you the tarfile I send
>>> out each time I'm asked for this; but I haven't updated that tarfile
>>> in four years, whereas I'm frequently tweaking things to match what's
>>> needed (most recently and relevantly, I guess enabling 64kB hugepages
>>> for anon and shmem in addition to the PMD-sized).
>>>
>>> Please don't waste much of your time over trying to replicate what
>>> I'm doing: just give the scripts a glance, as a source for "oh,
>>> I could exercise something like that in my testing too" ideas.
>>>
>>> Yes, I limit physical memory by booting with mem=1G, and also apply
>>> lower memcg v1 limits.
>>>
>>> I made a point of saying "SSD" there because I'm not testing zram or
>>> zswap at all, whereas many others are testing those rather than disk.
>>>
>>> swapoff, and ext4 on loop0 on tmpfs, feature in what I exercise, but are
>>> NOT relevant to the corruption I'm seeing here - that can occur before
>>> any swapoff, and it's always on the kernel build in tmpfs: the parallel
>>> build in ext4 on loop0 on tmpfs completes successfully.
>>
>> Thanks for the scripts. I kinda replicate your setup as follows:
>>
>> 1. boot a VM with 1GB memory and 8 cores;
>> 2. mount a tmpfs with huge=always and 200GB;
>> 3. clone the mainline kernel and use x86_64 defconfig (my gcc 14 gives
>>    errors during the old kernel builds), this takes about 2GB space,
>>    so some of tmpfs is already swapped to SSD;
>> 4. create a new cgroupv2 and set memory.high to 700MB to induce memory
>>    swap during kernel compilation;
>> 5. run “while true; do echo 1 | sudo tee /proc/sys/vm/compact_memory >/dev/null; done” to trigger compaction all the time;
>> 6. build the kernel with make -j20.
>>
>> I ran the above on mm-everything-2025-03-05-03-54 plus the xarray fix v3,
>> folio_split() with your fixes, and Minimize xa_node allocation during
>> xarry split patches. The repo is at: https://github.com/x-y-z/linux-dev/tree/shmem_fix-mm-everything-2025-03-05-03-54.
>>
>> It has ran over night for 30 kernel builds and no crash happened so far.
>> I wonder if you can give my repo a shot.
>
> Thanks for trying, I hope you didn't waste too much time on it,
> I was not optimistic that it could be adapted easily.
>
> You appeared to be suggesting above that I try your setup, which did
> not reproduce the corruption, instead of mine which did.  And later you
> appeared to conclude that all is good because you saw no corruption.
>
> No. I continued with my setup (working on mm-everything-2025-03-08-00-43),
> have now root-caused the corruption, and have the fix: as so often,
> it is obvious in retrospect.
>
> After looking at enough of the fixdep failures and their .o.d files,
> I found them consistent with seeing the head page of a large folio
> in place of its first tail (presumably while racing a split).
>
> And the debug patch to filemap_get_entry() below (not something we want
> to go into the tree, but good for confirmation - maybe it will even
> show warnings on your setup) confirmed that - well, head in place of
> first tail was the majority of cases, but head in place of any tail
> in general.
>
> There's a race between RCU lookup of the xarray and your splitting.
> That's something that normally the xas_reload(&xas) check guards
> against, but it was not effective.
>
> I wasted a day getting it exactly back to front: I thought the problem
> was that __split_unmapped_folio() needed to __xa_store the former tail
> before unfreezing it; but the patch reversing that ordering still
> issued some warnings.
>
> No, it's that __split_unmapped_folio() must not unfreeze the original
> head until all the xa slots which used to point to it have been
> updated with their new contents (as __split_huge_page() always did).

Right. It became very clear after seeing your debug code in
filemap_get_entry(): folio_try_get() can get the origin_folio
after it is unfrozen, while the remaining after-split folios have not
yet been stored in their page cache entries.

Thanks a lot for teaching me about this. I will add comments with
your fixes and send a patch soon.

>
> I've taken the liberty of simplifying the unfreeze calculation in terms
> of mapping and swap_cache, as we did elsewhere (and after fiddling
> around with the comment above it, just dropped it as adding nothing
> beyond what the code already does).

Looks good to me.

>
> And there's one other, unrelated change in there: I've changed the
> folio_put() after __filemap_remove_folio() to folio_put_refs().
> I believe that is what's correct there, but please check carefully:
> I'm a little worried that my testing (which I expect to be exercising
> that "beyond EOF" case plenty) has run well both before and after that
> change, whereas I thought there should be a noticeable leak of memory
> while it was just folio_put().

You are right here. I will send a fix for mainline as well. In mainline
it can only be triggered with a filesystem, like XFS, whose block_size
is larger than page_size.

>
> Or maybe my testing is barely exercising anything more than uniform
> splits to 0-order, in which case there's no difference: I imagine
> your selftests (I've not tried them) will do much better on that.

My selftests might be able to reveal it after tons of repeated runs
due to leaked large tail folios.

Thanks again for all your help.

>
> Please fold in the mm/huge_memory.c mods if you are in agreement.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index f7281ad22743..34b4fdafec40 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1871,6 +1871,15 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
>  		folio_put(folio);
>  		goto repeat;
>  	}
> +
> +	if (mapping->host /* filter out swap cache */ &&
> +	    /* !folio_contains(folio, index), but that requires lock */
> +	    WARN_ON(index - folio->index >= folio_nr_pages(folio))) {
> +		pr_warn("Mismatched index:%#lx\n", index);
> +		dump_page(&folio->page, NULL);
> +		folio_put(folio);
> +		goto repeat;
> +	}
>  out:
>  	rcu_read_unlock();
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3e05e62fdccb..be0c9873019c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3787,18 +3787,13 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  						MTHP_STAT_NR_ANON, 1);
>  			}
>
> -			/*
> -			 * Unfreeze refcount first. Additional reference from
> -			 * page cache.
> -			 */
> -			folio_ref_unfreeze(release,
> -				1 + ((!folio_test_anon(origin_folio) ||
> -				     folio_test_swapcache(origin_folio)) ?
> -					     folio_nr_pages(release) : 0));
> -
>  			if (release == origin_folio)
>  				continue;
>
> +			folio_ref_unfreeze(release, 1 +
> +					((mapping || swap_cache) ?
> +						folio_nr_pages(release) : 0));
> +
>  			lru_add_page_tail(origin_folio, &release->page,
>  						lruvec, list);
>
> @@ -3810,7 +3805,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  					folio_account_cleaned(release,
>  						inode_to_wb(mapping->host));
>  				__filemap_remove_folio(release, NULL);
> -				folio_put(release);
> +				folio_put_refs(release, folio_nr_pages(release));
>  			} else if (mapping) {
>  				__xa_store(&mapping->i_pages,
>  						release->index, release, 0);
> @@ -3822,6 +3817,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  		}
>  	}
>
> +	folio_ref_unfreeze(origin_folio, 1 +
> +		((mapping || swap_cache) ?  folio_nr_pages(origin_folio) : 0));
> +
>  	unlock_page_lruvec(lruvec);
>
>  	if (swap_cache)


Best Regards,
Yan, Zi
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 468e8ea76bc9..b0105ba6db94 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3267,7 +3267,6 @@  static void remap_page(struct folio *folio, unsigned long nr, int flags)
 static void lru_add_page_tail(struct folio *folio, struct page *tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
-	VM_BUG_ON_FOLIO(!folio_test_large(folio), folio);
 	VM_BUG_ON_FOLIO(PageLRU(tail), folio);
 	lockdep_assert_held(&lruvec->lru_lock);
 
@@ -3511,6 +3510,343 @@  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 					caller_pins;
 }
 
+/*
+ * It splits @folio into @new_order folios and copies the @folio metadata to
+ * all the resulting folios.
+ */
+static void __split_folio_to_order(struct folio *folio, int new_order)
+{
+	long nr_pages = folio_nr_pages(folio);
+	long new_nr_pages = 1 << new_order;
+	long index;
+
+	/*
+	 * Skip the first new_nr_pages, since the new folio made from them keeps
+	 * all the flags from the original folio.
+	 */
+	for (index = new_nr_pages; index < nr_pages; index += new_nr_pages) {
+		struct page *head = &folio->page;
+		struct page *new_head = head + index;
+
+		/*
+		 * Careful: new_folio is not a "real" folio before we cleared PageTail.
+		 * Don't pass it around before clear_compound_head().
+		 */
+		struct folio *new_folio = (struct folio *)new_head;
+
+		VM_BUG_ON_PAGE(atomic_read(&new_head->_mapcount) != -1, new_head);
+
+		/*
+		 * Clone page flags before unfreezing refcount.
+		 *
+		 * After successful get_page_unless_zero() might follow flags change,
+		 * for example lock_page() which set PG_waiters.
+		 *
+		 * Note that for mapped sub-pages of an anonymous THP,
+		 * PG_anon_exclusive has been cleared in unmap_folio() and is stored in
+		 * the migration entry instead from where remap_page() will restore it.
+		 * We can still have PG_anon_exclusive set on effectively unmapped and
+		 * unreferenced sub-pages of an anonymous THP: we can simply drop
+		 * PG_anon_exclusive (-> PG_mappedtodisk) for these here.
+		 */
+		new_head->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+		new_head->flags |= (head->flags &
+				((1L << PG_referenced) |
+				 (1L << PG_swapbacked) |
+				 (1L << PG_swapcache) |
+				 (1L << PG_mlocked) |
+				 (1L << PG_uptodate) |
+				 (1L << PG_active) |
+				 (1L << PG_workingset) |
+				 (1L << PG_locked) |
+				 (1L << PG_unevictable) |
+#ifdef CONFIG_ARCH_USES_PG_ARCH_2
+				 (1L << PG_arch_2) |
+#endif
+#ifdef CONFIG_ARCH_USES_PG_ARCH_3
+				 (1L << PG_arch_3) |
+#endif
+				 (1L << PG_dirty) |
+				 LRU_GEN_MASK | LRU_REFS_MASK));
+
+		/* ->mapping in first and second tail page is replaced by other uses */
+		VM_BUG_ON_PAGE(new_nr_pages > 2 && new_head->mapping != TAIL_MAPPING,
+			       new_head);
+		new_head->mapping = head->mapping;
+		new_head->index = head->index + index;
+
+		/*
+		 * page->private should not be set in tail pages. Fix up and warn once
+		 * if private is unexpectedly set.
+		 */
+		if (unlikely(new_head->private)) {
+			VM_WARN_ON_ONCE_PAGE(true, new_head);
+			new_head->private = 0;
+		}
+
+		if (folio_test_swapcache(folio))
+			new_folio->swap.val = folio->swap.val + index;
+
+		/* Page flags must be visible before we make the page non-compound. */
+		smp_wmb();
+
+		/*
+		 * Clear PageTail before unfreezing page refcount.
+		 *
+		 * After successful get_page_unless_zero() might follow put_page()
+		 * which needs correct compound_head().
+		 */
+		clear_compound_head(new_head);
+		if (new_order) {
+			prep_compound_page(new_head, new_order);
+			folio_set_large_rmappable(new_folio);
+
+			folio_set_order(folio, new_order);
+		}
+
+		if (folio_test_young(folio))
+			folio_set_young(new_folio);
+		if (folio_test_idle(folio))
+			folio_set_idle(new_folio);
+
+		folio_xchg_last_cpupid(new_folio, folio_last_cpupid(folio));
+	}
+
+	if (!new_order)
+		ClearPageCompound(&folio->page);
+}
+
+/*
+ * It splits an unmapped @folio to lower order smaller folios in two ways.
+ * @folio: the to-be-split folio
+ * @new_order: the smallest order of the after split folios (since buddy
+ *             allocator like split generates folios with orders from @folio's
+ *             order - 1 to new_order).
+ * @split_at: in buddy allocator like split, the folio containing @split_at
+ *            will be split until its order becomes @new_order.
+ * @lock_at: the folio containing @lock_at is left locked for caller.
+ * @list: the after split folios will be added to @list if it is not NULL,
+ *        otherwise to LRU lists.
+ * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
+ * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
+ * @mapping: @folio->mapping
+ * @uniform_split: if the split is uniform or not (buddy allocator like split)
+ *
+ *
+ * 1. uniform split: the given @folio is split into multiple @new_order small
+ *    folios, where all small folios have the same order. This is done when
+ *    uniform_split is true.
+ * 2. buddy allocator like (non-uniform) split: the given @folio is split into
+ *    half and one of the half (containing the given page) is split into half
+ *    until the given @page's order becomes @new_order. This is done when
+ *    uniform_split is false.
+ *
+ * The high level flow for these two methods are:
+ * 1. uniform split: a single __split_folio_to_order() is called to split the
+ *    @folio into @new_order, then we traverse all the resulting folios one by
+ *    one in PFN ascending order and perform stats, unfreeze, adding to list,
+ *    and file mapping index operations.
+ * 2. non-uniform split: in general, folio_order - @new_order calls to
+ *    __split_folio_to_order() are made in a for loop to split the @folio
+ *    to one lower order at a time. The resulting small folios are processed
+ *    like what is done during the traversal in 1, except the one containing
+ *    @page, which is split in next for loop.
+ *
+ * After splitting, the caller's folio reference will be transferred to the
+ * folio containing @page. The other folios may be freed if they are not mapped.
+ *
+ * In terms of locking, after splitting,
+ * 1. uniform split leaves @page (or the folio that contains it) locked;
+ * 2. buddy allocator like (non-uniform) split leaves @folio locked.
+ *
+ *
+ * For !uniform_split, when -ENOMEM is returned, the original folio might be
+ * split. The caller needs to check the input folio.
+ */
+static int __split_unmapped_folio(struct folio *folio, int new_order,
+		struct page *split_at, struct page *lock_at,
+		struct list_head *list, pgoff_t end,
+		struct xa_state *xas, struct address_space *mapping,
+		bool uniform_split)
+{
+	struct lruvec *lruvec;
+	struct address_space *swap_cache = NULL;
+	struct folio *origin_folio = folio;
+	struct folio *next_folio = folio_next(folio);
+	struct folio *new_folio;
+	struct folio *next;
+	int order = folio_order(folio);
+	int split_order;
+	int start_order = uniform_split ? new_order : order - 1;
+	int nr_dropped = 0;
+	int ret = 0;
+	bool stop_split = false;
+
+	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
+		/* a swapcache folio can only be uniformly split to order-0 */
+		if (!uniform_split || new_order != 0)
+			return -EINVAL;
+
+		swap_cache = swap_address_space(folio->swap);
+		xa_lock(&swap_cache->i_pages);
+	}
+
+	if (folio_test_anon(folio))
+		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
+
+	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
+	lruvec = folio_lruvec_lock(folio);
+
+	folio_clear_has_hwpoisoned(folio);
+
+	/*
+	 * split to new_order one order at a time. For uniform split,
+	 * folio is split to new_order directly.
+	 */
+	for (split_order = start_order;
+	     split_order >= new_order && !stop_split;
+	     split_order--) {
+		int old_order = folio_order(folio);
+		struct folio *release;
+		struct folio *end_folio = folio_next(folio);
+
+		/* order-1 anonymous folio is not supported */
+		if (folio_test_anon(folio) && split_order == 1)
+			continue;
+		if (uniform_split && split_order != new_order)
+			continue;
+
+		if (mapping) {
+			/*
+			 * uniform split has xas_split_alloc() called before
+			 * irq is disabled to allocate enough memory, whereas
+			 * non-uniform split can handle ENOMEM.
+			 */
+			if (uniform_split)
+				xas_split(xas, folio, old_order);
+			else {
+				xas_set_order(xas, folio->index, split_order);
+				xas_try_split(xas, folio, old_order);
+				if (xas_error(xas)) {
+					ret = xas_error(xas);
+					stop_split = true;
+					goto after_split;
+				}
+			}
+		}
+
+		/* complete memcg works before add pages to LRU */
+		split_page_memcg(&folio->page, old_order, split_order);
+		split_page_owner(&folio->page, old_order, split_order);
+		pgalloc_tag_split(folio, old_order, split_order);
+
+		__split_folio_to_order(folio, split_order);
+
+after_split:
+		/*
+		 * Iterate through after-split folios and perform related
+		 * operations. But in buddy allocator like split, the folio
+		 * containing the specified page is skipped until its order
+		 * is new_order, since the folio will be worked on in next
+		 * iteration.
+		 */
+		for (release = folio, next = folio_next(folio);
+		     release != end_folio;
+		     release = next, next = folio_next(next)) {
+			/*
+			 * for buddy allocator like split, the folio containing
+			 * page will be split next and should not be released,
+			 * until the folio's order is new_order or stop_split
+			 * is set to true by the above xas_split() failure.
+			 */
+			if (release == page_folio(split_at)) {
+				folio = release;
+				if (split_order != new_order && !stop_split)
+					continue;
+			}
+			if (folio_test_anon(release)) {
+				mod_mthp_stat(folio_order(release),
+						MTHP_STAT_NR_ANON, 1);
+			}
+
+			/*
+			 * Unfreeze refcount first. Additional reference from
+			 * page cache.
+			 */
+			folio_ref_unfreeze(release,
+				1 + ((!folio_test_anon(origin_folio) ||
+				     folio_test_swapcache(origin_folio)) ?
+					     folio_nr_pages(release) : 0));
+
+			if (release == origin_folio)
+				continue;
+
+			lru_add_page_tail(origin_folio, &release->page,
+						lruvec, list);
+
+			/* Some pages can be beyond EOF: drop them from page cache */
+			if (release->index >= end) {
+				if (shmem_mapping(origin_folio->mapping))
+					nr_dropped += folio_nr_pages(release);
+				else if (folio_test_clear_dirty(release))
+					folio_account_cleaned(release,
+						inode_to_wb(origin_folio->mapping->host));
+				__filemap_remove_folio(release, NULL);
+				folio_put(release);
+			} else if (!folio_test_anon(release)) {
+				__xa_store(&origin_folio->mapping->i_pages,
+						release->index, &release->page, 0);
+			} else if (swap_cache) {
+				__xa_store(&swap_cache->i_pages,
+						swap_cache_index(release->swap),
+						&release->page, 0);
+			}
+		}
+	}
+
+	unlock_page_lruvec(lruvec);
+
+	if (folio_test_anon(origin_folio)) {
+		if (folio_test_swapcache(origin_folio))
+			xa_unlock(&swap_cache->i_pages);
+	} else
+		xa_unlock(&mapping->i_pages);
+
+	/* Caller disabled irqs, so they are still disabled here */
+	local_irq_enable();
+
+	if (nr_dropped)
+		shmem_uncharge(mapping->host, nr_dropped);
+
+	remap_page(origin_folio, 1 << order,
+			folio_test_anon(origin_folio) ?
+				RMP_USE_SHARED_ZEROPAGE : 0);
+
+	/*
+	 * At this point, folio should contain the specified page.
+	 * For uniform split, it is left for caller to unlock.
+	 * For buddy allocator like split, the first after-split folio is left
+	 * for caller to unlock.
+	 */
+	for (new_folio = origin_folio, next = folio_next(origin_folio);
+	     new_folio != next_folio;
+	     new_folio = next, next = folio_next(next)) {
+		if (new_folio == page_folio(lock_at))
+			continue;
+
+		folio_unlock(new_folio);
+		/*
+		 * Subpages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		free_page_and_swap_cache(&new_folio->page);
+	}
+	return ret;
+}
+
 /*
  * This function splits a large folio into smaller folios of order @new_order.
  * @page can point to any page of the large folio to split. The split operation