
[v5,1/6] mm: free zapped tail pages when splitting isolated thp

Message ID 20240830100438.3623486-2-usamaarif642@gmail.com (mailing list archive)
State New
Series mm: split underused THPs

Commit Message

Usama Arif Aug. 30, 2024, 10:03 a.m. UTC
From: Yu Zhao <yuzhao@google.com>

If a tail page has only two references left, one inherited from the
isolation of its head and the other from lru_add_page_tail() which we
are about to drop, it means this tail page was concurrently zapped.
Then we can safely free it and save page reclaim or migration the
trouble of trying it.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Tested-by: Shuang Zhai <zhais@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 mm/huge_memory.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

Comments

Hugh Dickins Sept. 5, 2024, 8:46 a.m. UTC | #1
On Fri, 30 Aug 2024, Usama Arif wrote:

> From: Yu Zhao <yuzhao@google.com>
> 
> If a tail page has only two references left, one inherited from the
> isolation of its head and the other from lru_add_page_tail() which we
> are about to drop, it means this tail page was concurrently zapped.
> Then we can safely free it and save page reclaim or migration the
> trouble of trying it.
> 
> Signed-off-by: Yu Zhao <yuzhao@google.com>
> Tested-by: Shuang Zhai <zhais@google.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
it is only an optimization, and unless a persuasive performance case
can be made to extend it, it ought to go (perhaps revisited later).

The problem I kept hitting was that all my work, requiring compaction and
reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
because nr_isolated_anon had grown high - and remained high even when the
load had all been killed.
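
(For anyone following along: the throttling works roughly like the sketch
below - a simplified approximation of the too_many_isolated() /
reclaim_throttle() logic in mm/vmscan.c, not the exact code, and the names
are only illustrative - so an nr_isolated_anon count that never comes back
down keeps direct reclaimers looping here.)

	/*
	 * Simplified sketch, not the real kernel code: the actual checks
	 * live in too_many_isolated() and shrink_inactive_list() in
	 * mm/vmscan.c.  The point is only that reclaim refuses to isolate
	 * more pages, and throttles instead, while NR_ISOLATED_* is "high".
	 */
	static bool too_many_isolated_sketch(struct pglist_data *pgdat, bool file)
	{
		unsigned long inactive, isolated;

		if (file) {
			inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
			isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
		} else {
			inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
			isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
		}
		/* a leaked, never-decremented isolated count keeps this true */
		return isolated > inactive;
	}

	/* caller, roughly as in shrink_inactive_list(): */
	while (too_many_isolated_sketch(pgdat, file)) {
		/* "killably" stuck: only a fatal signal breaks out early */
		if (fatal_signal_pending(current))
			break;
		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
	}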

Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
is the one to blame. I was intending to send this patch to "fix" it:

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
 			folio_clear_active(new_folio);
 			folio_clear_unevictable(new_folio);
 			list_del(&new_folio->lru);
+			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
+						folio_is_file_lru(folio));
 			if (!folio_batch_add(&free_folios, new_folio)) {
 				mem_cgroup_uncharge_folios(&free_folios);
 				free_unref_folios(&free_folios);

And that ran nicely, until I terminated the run and did
grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
at the end: stat_refresh kindly left a pr_warn in dmesg to say
nr_isolated_anon -334013737

My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
know how many pages they isolated and incremented the stats by, and
decrement by that same number at the end; whereas other split_huge_pagers
(migration?) decrement one by one as they go through the list afterwards.
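
Spelling the mismatch out with a hedged sketch (the real callers are in
mm/vmscan.c and mm/migrate.c; function names and arguments below are
simplified, not the actual signatures):

	/* Style A (reclaim-like): batch accounting by a remembered count. */
	nr_taken = isolate_lru_folios(lruvec, &folio_list, sc);	/* simplified */
	mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
	shrink_folio_list(&folio_list, pgdat, sc);		/* may split THPs */
	mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
	/* an extra node_stat_sub_folio() at the split's free point double-counts here */

	/* Style B (migration-like): per-folio accounting over the list afterwards. */
	list_for_each_entry_safe(folio, next, &folio_list, lru) {
		node_stat_mod_folio(folio,
				    NR_ISOLATED_ANON + folio_is_file_lru(folio),
				    -folio_nr_pages(folio));
		folio_putback_lru(folio);
	}
	/*
	 * Folios freed and list_del()'d inside __split_huge_page() never
	 * reach this loop, so without a subtraction somewhere else their
	 * isolated count is simply leaked.
	 */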

I've run out of time (I'm about to take a break): I gave up researching
who needs what, and was already feeling this optimization does too much
second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
rather admits to that).

And I don't think it's as simple as moving the node_stat_sub_folio()
into 2/6 where the zero pte is substituted: that would probably handle
the vast majority of cases, but aren't there others which pass the
folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
or racily truncated now that the folio has been unlocked, for example?

Hugh
Usama Arif Sept. 5, 2024, 10:21 a.m. UTC | #2
On 05/09/2024 09:46, Hugh Dickins wrote:
> On Fri, 30 Aug 2024, Usama Arif wrote:
> 
>> From: Yu Zhao <yuzhao@google.com>
>>
>> If a tail page has only two references left, one inherited from the
>> isolation of its head and the other from lru_add_page_tail() which we
>> are about to drop, it means this tail page was concurrently zapped.
>> Then we can safely free it and save page reclaim or migration the
>> trouble of trying it.
>>
>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>> Tested-by: Shuang Zhai <zhais@google.com>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> 
> I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
> it is only an optimization, and unless a persuasive performance case
> can be made to extend it, it ought to go (perhaps revisited later).
> 

I am OK with just patch 1 being dropped. Patches 2-6 don't depend on it.

It's an optimization, and the underused shrinker doesn't depend on it.
It's possible that the folio -> new_folio change below might fix it, but if it
doesn't, I can retry later to make this work and resend it only if it alone
shows a significant performance improvement.

Thanks a lot for debugging this, and sorry it caused an issue.


> The problem I kept hitting was that all my work, requiring compaction and
> reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
> because nr_isolated_anon had grown high - and remained high even when the
> load had all been killed.
> 
> Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
> is the one to blame. I was intending to send this patch to "fix" it:
> 
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
>  			folio_clear_active(new_folio);
>  			folio_clear_unevictable(new_folio);
>  			list_del(&new_folio->lru);
> +			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
> +						folio_is_file_lru(folio));

Maybe this should have been below? (Notice the folio->new_folio)

+			node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
+						folio_is_file_lru(new_folio));

>  			if (!folio_batch_add(&free_folios, new_folio)) {
>  				mem_cgroup_uncharge_folios(&free_folios);
>  				free_unref_folios(&free_folios);
> 
> And that ran nicely, until I terminated the run and did
> grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
> at the end: stat_refresh kindly left a pr_warn in dmesg to say
> nr_isolated_anon -334013737
> 
> My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
> know how many pages they isolated and incremented the stats by, and
> decrement by that same number at the end; whereas other split_huge_pagers
> (migration?) decrement one by one as they go through the list afterwards.
> 
> I've run out of time (I'm about to take a break): I gave up researching
> who needs what, and was already feeling this optimization does too much
> second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
> rather admits to that).
> 
> And I don't think it's as simple as moving the node_stat_sub_folio()
> into 2/6 where the zero pte is substituted: that would probably handle
> the vast majority of cases, but aren't there others which pass the
> folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
> or racily truncated now that the folio has been unlocked, for example?
> 
> Hugh
Hugh Dickins Sept. 5, 2024, 6:05 p.m. UTC | #3
On Thu, 5 Sep 2024, Usama Arif wrote:
> On 05/09/2024 09:46, Hugh Dickins wrote:
> > On Fri, 30 Aug 2024, Usama Arif wrote:
> > 
> >> From: Yu Zhao <yuzhao@google.com>
> >>
> >> If a tail page has only two references left, one inherited from the
> >> isolation of its head and the other from lru_add_page_tail() which we
> >> are about to drop, it means this tail page was concurrently zapped.
> >> Then we can safely free it and save page reclaim or migration the
> >> trouble of trying it.
> >>
> >> Signed-off-by: Yu Zhao <yuzhao@google.com>
> >> Tested-by: Shuang Zhai <zhais@google.com>
> >> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> > 
> > I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
> > it is only an optimization, and unless a persuasive performance case
> > can be made to extend it, it ought to go (perhaps revisited later).
> > 
> 
> I am OK with just patch 1 being dropped. Patches 2-6 don't depend on it.
> 
> It's an optimization, and the underused shrinker doesn't depend on it.
> It's possible that the folio -> new_folio change below might fix it, but if it
> doesn't, I can retry later to make this work and resend it only if it alone
> shows a significant performance improvement.
> 
> Thanks a lot for debugging this, and sorry it caused an issue.
> 
> 
> > The problem I kept hitting was that all my work, requiring compaction and
> > reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
> > because nr_isolated_anon had grown high - and remained high even when the
> > load had all been killed.
> > 
> > Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
> > is the one to blame. I was intending to send this patch to "fix" it:
> > 
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
> >  			folio_clear_active(new_folio);
> >  			folio_clear_unevictable(new_folio);
> >  			list_del(&new_folio->lru);
> > +			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
> > +						folio_is_file_lru(folio));
> 
> Maybe this should have been below? (Notice the folio->new_folio)
> 
> +			node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
> +						folio_is_file_lru(new_folio));

Yes, how very stupid of me (I'm well aware of the earlier bugfixes here,
and ought not to have done a blind copy and paste of those lines): thanks.

However... it makes no difference. I gave yours a run, expecting a
much smaller negative number, but actually it works out much the same:
because, of course, by this point in the code "folio" is left pointing
to a small folio, and is almost equivalent to new_folio for the
node_stat calculations.

(I say "almost" because I guess there's a chance that the page at
folio got reused as part of a larger folio meanwhile, which would
throw the stats off (if they weren't off already).)

So, even with your fix to my fix, this code doesn't work.
Whereas a revert of this 1/6 does work: nr_isolated_anon and
nr_isolated_file come to 0 when the system is at rest, as expected
(and as silence from vmstat_refresh confirms - /proc/vmstat itself
presents negative stats as 0, in order to hide transient oddities).

Hugh

> 
> >  			if (!folio_batch_add(&free_folios, new_folio)) {
> >  				mem_cgroup_uncharge_folios(&free_folios);
> >  				free_unref_folios(&free_folios);
> > 
> > And that ran nicely, until I terminated the run and did
> > grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
> > at the end: stat_refresh kindly left a pr_warn in dmesg to say
> > nr_isolated_anon -334013737
> > 
> > My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
> > know how many pages they isolated and incremented the stats by, and
> > decrement by that same number at the end; whereas other split_huge_pagers
> > (migration?) decrement one by one as they go through the list afterwards.
> > 
> > I've run out of time (I'm about to take a break): I gave up researching
> > who needs what, and was already feeling this optimization does too much
> > second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
> > rather admits to that).
> > 
> > And I don't think it's as simple as moving the node_stat_sub_folio()
> > into 2/6 where the zero pte is substituted: that would probably handle
> > the vast majority of cases, but aren't there others which pass the
> > folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
> > or racily truncated now that the folio has been unlocked, for example?
> > 
> > Hugh
Usama Arif Sept. 5, 2024, 7:24 p.m. UTC | #4
On 05/09/2024 19:05, Hugh Dickins wrote:
> On Thu, 5 Sep 2024, Usama Arif wrote:
>> On 05/09/2024 09:46, Hugh Dickins wrote:
>>> On Fri, 30 Aug 2024, Usama Arif wrote:
>>>
>>>> From: Yu Zhao <yuzhao@google.com>
>>>>
>>>> If a tail page has only two references left, one inherited from the
>>>> isolation of its head and the other from lru_add_page_tail() which we
>>>> are about to drop, it means this tail page was concurrently zapped.
>>>> Then we can safely free it and save page reclaim or migration the
>>>> trouble of trying it.
>>>>
>>>> Signed-off-by: Yu Zhao <yuzhao@google.com>
>>>> Tested-by: Shuang Zhai <zhais@google.com>
>>>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>
>>> I'm sorry, but I think this patch (just this 1/6) needs to be dropped:
>>> it is only an optimization, and unless a persuasive performance case
>>> can be made to extend it, it ought to go (perhaps revisited later).
>>>
>>
>> I am OK with just patch 1 being dropped. Patches 2-6 don't depend on it.
>>
>> It's an optimization, and the underused shrinker doesn't depend on it.
>> It's possible that the folio -> new_folio change below might fix it, but if it
>> doesn't, I can retry later to make this work and resend it only if it alone
>> shows a significant performance improvement.
>>
>> Thanks a lot for debugging this, and sorry it caused an issue.
>>
>>
>>> The problem I kept hitting was that all my work, requiring compaction and
>>> reclaim, got (killably) stuck in or repeatedly calling reclaim_throttle():
>>> because nr_isolated_anon had grown high - and remained high even when the
>>> load had all been killed.
>>>
>>> Bisection led to the 2/6 (remap to shared zeropage), but I'd say this 1/6
>>> is the one to blame. I was intending to send this patch to "fix" it:
>>>
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3295,6 +3295,8 @@ static void __split_huge_page(struct pag
>>>  			folio_clear_active(new_folio);
>>>  			folio_clear_unevictable(new_folio);
>>>  			list_del(&new_folio->lru);
>>> +			node_stat_sub_folio(folio, NR_ISOLATED_ANON +
>>> +						folio_is_file_lru(folio));
>>
>> Maybe this should have been below? (Notice the folio->new_folio)
>>
>> +			node_stat_sub_folio(new_folio, NR_ISOLATED_ANON +
>> +						folio_is_file_lru(new_folio));
> 
> Yes, how very stupid of me (I'm well aware of the earlier bugfixes here,
> and ought not to have done a blind copy and paste of those lines): thanks.
> 
> However... it makes no difference. I gave yours a run, expecting a
> much smaller negative number, but actually it works out much the same:
> because, of course, by this point in the code "folio" is left pointing
> to a small folio, and is almost equivalent to new_folio for the
> node_stat calculations.
> 
> (I say "almost" because I guess there's a chance that the page at
> folio got reused as part of a larger folio meanwhile, which would
> throw the stats off (if they weren't off already).)
> 
> So, even with your fix to my fix, this code doesn't work.
> Whereas a revert of this 1/6 does work: nr_isolated_anon and
> nr_isolated_file come to 0 when the system is at rest, as expected
> (and as silence from vmstat_refresh confirms - /proc/vmstat itself
> presents negative stats as 0, in order to hide transient oddities).
> 
> Hugh

Thanks for trying. I was hoping you had mTHPs enabled, in which case
the folio -> new_folio change would have fixed it.

I'm happy for just patch 1 to be dropped. I can try to figure it out
later and resend it if I can show meaningful performance numbers
for the fixed version on real-world workloads.

Thanks,
Usama
> 
>>
>>>  			if (!folio_batch_add(&free_folios, new_folio)) {
>>>  				mem_cgroup_uncharge_folios(&free_folios);
>>>  				free_unref_folios(&free_folios);
>>>
>>> And that ran nicely, until I terminated the run and did
>>> grep nr_isolated /proc/sys/vm/stat_refresh /proc/vmstat
>>> at the end: stat_refresh kindly left a pr_warn in dmesg to say
>>> nr_isolated_anon -334013737
>>>
>>> My patch is not good enough. IIUC, some split_huge_pagers (reclaim?)
>>> know how many pages they isolated and incremented the stats by, and
>>> decrement by that same number at the end; whereas other split_huge_pagers
>>> (migration?) decrement one by one as they go through the list afterwards.
>>>
>>> I've run out of time (I'm about to take a break): I gave up researching
>>> who needs what, and was already feeling this optimization does too much
>>> second guessing of what's needed (and its array of VM_WARN_ON_ONCE_FOLIOs
>>> rather admits to that).
>>>
>>> And I don't think it's as simple as moving the node_stat_sub_folio()
>>> into 2/6 where the zero pte is substituted: that would probably handle
>>> the vast majority of cases, but aren't there others which pass the
>>> folio_ref_freeze(new_folio, 2) test - the title's zapped tail pages,
>>> or racily truncated now that the folio has been unlocked, for example?
>>>
>>> Hugh

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 15418ffdd377..0c48806ccb9a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3170,7 +3170,9 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 	unsigned int new_nr = 1 << new_order;
 	int order = folio_order(folio);
 	unsigned int nr = 1 << order;
+	struct folio_batch free_folios;
 
+	folio_batch_init(&free_folios);
 	/* complete memcg works before add pages to LRU */
 	split_page_memcg(head, order, new_order);
 
@@ -3254,6 +3256,27 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 		if (subpage == page)
 			continue;
 		folio_unlock(new_folio);
+		/*
+		 * If a folio has only two references left, one inherited
+		 * from the isolation of its head and the other from
+		 * lru_add_page_tail() which we are about to drop, it means this
+		 * folio was concurrently zapped. Then we can safely free it
+		 * and save page reclaim or migration the trouble of trying it.
+		 */
+		if (list && folio_ref_freeze(new_folio, 2)) {
+			VM_WARN_ON_ONCE_FOLIO(folio_test_lru(new_folio), new_folio);
+			VM_WARN_ON_ONCE_FOLIO(folio_test_large(new_folio), new_folio);
+			VM_WARN_ON_ONCE_FOLIO(folio_mapped(new_folio), new_folio);
+
+			folio_clear_active(new_folio);
+			folio_clear_unevictable(new_folio);
+			list_del(&new_folio->lru);
+			if (!folio_batch_add(&free_folios, new_folio)) {
+				mem_cgroup_uncharge_folios(&free_folios);
+				free_unref_folios(&free_folios);
+			}
+			continue;
+		}
 
 		/*
 		 * Subpages may be freed if there wasn't any mapping
@@ -3264,6 +3287,11 @@  static void __split_huge_page(struct page *page, struct list_head *list,
 		 */
 		free_page_and_swap_cache(subpage);
 	}
+
+	if (free_folios.nr) {
+		mem_cgroup_uncharge_folios(&free_folios);
+		free_unref_folios(&free_folios);
+	}
 }
 
 /* Racy check whether the huge page can be split */