[v3,0/6] mm: split underutilized THPs

Message ID 20240813120328.1275952-1-usamaarif642@gmail.com

Message

Usama Arif Aug. 13, 2024, 12:02 p.m. UTC
The current upstream default policy for THP is always. However, Meta
uses madvise in production as the current THP=always policy vastly
overprovisions THPs in sparsely accessed memory areas, resulting in
excessive memory pressure and premature OOM killing.
Using madvise + relying on khugepaged has certain drawbacks over
THP=always. Using madvise hints means THPs aren't "transparent" and
require userspace changes. Waiting for khugepaged to scan memory and
collapse pages into THPs can be slow and unpredictable in terms of
performance (i.e. you don't know when the collapse will happen), while
production environments require predictable performance. If there is
enough memory available, it's better for both performance and
predictability to have a THP from fault time, i.e. THP=always, rather
than waiting for khugepaged to collapse it, and to deal with sparsely
populated THPs when the system is running out of memory.

This patch-series is an attempt to mitigate the issue of running out of
memory when THP is always enabled. During runtime whenever a THP is being
faulted in or collapsed by khugepaged, the THP is added to a list.
Whenever memory reclaim happens, the kernel runs the deferred_split
shrinker which goes through the list and checks if the THP was underutilized,
i.e. how many of the base 4K pages of the entire THP were zero-filled.
If this number goes above a certain threshold, the shrinker will attempt
to split that THP. Then at remap time, the pages that were zero-filled are
mapped to the shared zeropage, hence saving memory. This method avoids the
downside of wasting memory in areas where THP is sparsely filled when THP
is always enabled, while still providing the upsides of THPs, such as
reduced TLB misses, without having to use madvise.
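
As a rough illustration of the check involved (a sketch only, with made-up
names; the actual logic is in thp_underutilized() in patch 5 and may differ
in detail), the shrinker walks the base pages of a THP-sized folio and
counts how many are entirely zero:

#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

/*
 * Sketch: return true if more than max_ptes_none of the folio's base
 * pages are completely zero-filled, i.e. the THP is worth splitting.
 */
static bool folio_is_underused(struct folio *folio, int max_ptes_none)
{
	int i, nr_zero = 0;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		void *kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
		bool zero_filled = !memchr_inv(kaddr, 0, PAGE_SIZE);

		kunmap_local(kaddr);
		if (zero_filled && ++nr_zero > max_ptes_none)
			return true;
	}
	return false;
}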

Meta production workloads that were CPU bound (>99% CPU utilization) were
tested with THP shrinker. The results after 2 hours are as follows:

                            | THP=madvise |  THP=always   | THP=always
                            |             |               | + shrinker series
                            |             |               | + max_ptes_none=409
-----------------------------------------------------------------------------
Performance improvement     |      -      |    +1.8%      |     +1.7%
(over THP=madvise)          |             |               |
-----------------------------------------------------------------------------
Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
-----------------------------------------------------------------------------
max_ptes_none=409 means that any THP that has more than 409 out of 512
(80%) zero-filled pages will be split.

To test out the patches, the below commands without the shrinker will
invoke OOM killer immediately and kill stress, but will not fail with
the shrinker:

echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke OOM
# killer.
# Without the shrinker, OOM killer is invoked immediately irrespective
# of max_ptes_none value and kills stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K
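
To confirm that the shrinker actually split THPs during the run, the new
vmstat event added by this series can be checked alongside the cgroup's
memory usage (the lowercase spelling in /proc/vmstat is an assumption on
my part, derived from the THP_UNDERUTILIZED_SPLIT_PAGE event the series
adds):

grep thp_underutilized_split_page /proc/vmstat
cat /sys/fs/cgroup/test/memory.current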

v2 -> v3:
- Use my_zero_pfn instead of page_to_pfn(ZERO_PAGE(..)) (Johannes)
- Use flags argument instead of bools in remove_migration_ptes (Johannes)
- Use a new flag in folio->_flags_1 instead of folio->_partially_mapped
  (David Hildenbrand).
- Split out the last patch of v2 into 3, one for introducing the flag,
  one for splitting underutilized THPs on _deferred_list and one for adding
  sysfs entry to disable splitting (David Hildenbrand).

v1 -> v2:
- Turn page checks and operations to folio versions in __split_huge_page.
  This means patches 1 and 2 from v1 are no longer needed.
  (David Hildenbrand)
- Map to shared zeropage in all cases if the base page is zero-filled.
  The uffd selftest was removed.
  (David Hildenbrand).
- rename 'dirty' to 'contains_data' in try_to_map_unused_to_zeropage
  (Rik van Riel).
- Use unsigned long instead of uint64_t (kernel test robot).
 
Alexander Zhu (1):
  mm: selftest to verify zero-filled pages are mapped to zeropage

Usama Arif (3):
  mm: Introduce a pageflag for partially mapped folios
  mm: split underutilized THPs
  mm: add sysfs entry to disable splitting underutilized THPs

Yu Zhao (2):
  mm: free zapped tail pages when splitting isolated thp
  mm: remap unused subpages to shared zeropage when splitting isolated
    thp

 Documentation/admin-guide/mm/transhuge.rst    |   6 +
 include/linux/huge_mm.h                       |   4 +-
 include/linux/khugepaged.h                    |   1 +
 include/linux/page-flags.h                    |   3 +
 include/linux/rmap.h                          |   7 +-
 include/linux/vm_event_item.h                 |   1 +
 mm/huge_memory.c                              | 156 ++++++++++++++++--
 mm/hugetlb.c                                  |   1 +
 mm/internal.h                                 |   4 +-
 mm/khugepaged.c                               |   3 +-
 mm/memcontrol.c                               |   3 +-
 mm/migrate.c                                  |  74 +++++++--
 mm/migrate_device.c                           |   4 +-
 mm/page_alloc.c                               |   5 +-
 mm/rmap.c                                     |   3 +-
 mm/vmscan.c                                   |   3 +-
 mm/vmstat.c                                   |   1 +
 .../selftests/mm/split_huge_page_test.c       |  71 ++++++++
 tools/testing/selftests/mm/vm_util.c          |  22 +++
 tools/testing/selftests/mm/vm_util.h          |   1 +
 20 files changed, 333 insertions(+), 40 deletions(-)

Comments

Andi Kleen Aug. 13, 2024, 5:22 p.m. UTC | #1
Usama Arif <usamaarif642@gmail.com> writes:
>
> This patch-series is an attempt to mitigate the issue of running out of
> memory when THP is always enabled. During runtime whenever a THP is being
> faulted in or collapsed by khugepaged, the THP is added to a list.
> Whenever memory reclaim happens, the kernel runs the deferred_split
> shrinker which goes through the list and checks if the THP was underutilized,
> i.e. how many of the base 4K pages of the entire THP were zero-filled.

Sometimes when writing a benchmark I fill things with zero explicitly
to avoid faults later. For example if you want to measure memory
read bandwidth you need to fault the pages first, but that fault
pattern may well be zero.

With your patch if there is memory pressure there are two effects:

- If things are remapped to the zero page the benchmark
reading memory may give unrealistically good results because
what it thinks is a big memory area is actually only backed
by a single page.

- If I expect to write I may end up with an unexpected zeropage->real
memory fault if the pages got remapped. 

I expect such patterns can happen without benchmarking too.
I could see it being a problem for latency sensitive applications.

Now you could argue that this all should only happen under memory
pressure and when that happens things may be slow anyway and your
patch will still be an improvement.

Maybe that's true but there might still be corner cases
which are negatively impacted by this. I don't have a good solution
other than a tunable, but I expect it will cause problems for someone.

The other problem I have with your patch is that it may cause the kernel
to pollute CPU caches in the background, which again will cause noise in
the system. Instead of plain memchr_inv, you should probably use some
primitive to bypass caches, or at least use an NTA prefetch hint.
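
For concreteness, the kind of thing I mean (just a sketch, not code from
this series) is a zero-check that prefetches with temporal locality 0;
GCC's __builtin_prefetch(addr, 0, 0) is lowered to PREFETCHNTA on x86:

#include <stdbool.h>
#include <stddef.h>

#define PAGE_SZ 4096UL

/* Scan one 4K page for non-zero data while hinting the CPU not to keep
 * the scanned lines in cache. */
static bool page_is_zero_filled_nta(const unsigned long *p)
{
	size_t i, n = PAGE_SZ / sizeof(*p);

	for (i = 0; i < n; i++) {
		/* read prefetch, locality 0 => non-temporal */
		__builtin_prefetch(&p[i + 16], 0, 0);
		if (p[i])
			return false;
	}
	return true;
}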

-Andi
Usama Arif Aug. 14, 2024, 10:13 a.m. UTC | #2
On 13/08/2024 18:22, Andi Kleen wrote:
> Usama Arif <usamaarif642@gmail.com> writes:
>>
>> This patch-series is an attempt to mitigate the issue of running out of
>> memory when THP is always enabled. During runtime whenever a THP is being
>> faulted in or collapsed by khugepaged, the THP is added to a list.
>> Whenever memory reclaim happens, the kernel runs the deferred_split
>> shrinker which goes through the list and checks if the THP was underutilized,
>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> 
> Sometimes when writing a benchmark I fill things with zero explictly
> to avoid faults later. For example if you want to measure memory
> read bandwidth you need to fault the pages first, but that fault
> pattern may well be zero.
> 
> With your patch if there is memory pressure there are two effects:
> 
> - If things are remapped to the zero page the benchmark
> reading memory may give unrealistically good results because
> what is thinks is a big memory area is actually only backed
> by a single page.
> 
> - If I expect to write I may end up with an unexpected zeropage->real
> memory fault if the pages got remapped. 
> 
> I expect such patterns can happen without benchmarking too.
> I could see it being a problem for latency sensitive applications.
> 
> Now you could argue that this all should only happen under memory
> pressure and when that happens things may be slow anyways and your
> patch will still be an improvement.
> 
> Maybe that's true but there might be still corner cases
> which are negatively impacted by this. I don't have a good solution
> other than a tunable, but I expect it will cause problems for someone.
> 

There are currently 2 knobs to control the behaviour of the THP low utilization shrinker introduced in this series.

/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none:
The current default value for this is HPAGE_PMD_NR - 1 (511 for x86). If set to 511, the shrinker will immediately remove the folio from the deferred_list (please see the first if statement in thp_underutilized in patch 5) and no split is attempted. Not a single page is checked at this point and no memory accesses are done that could impact performance.
If someone sets it to 510, it will exit as soon as a single page containing non-zero data is encountered (the else part in thp_underutilized).

/sys/kernel/mm/transparent_hugepage/thp_low_util_shrinker:
Introduced in patch 6, if someone really doesn't want to enable the shrinker, then they can set this to false. The folio will not be added to the _deferred_list at fault or collapse time, and it will be as if these patches didn't exist. Personally, I don't think it's absolutely necessary to have this, but I added it in case someone comes up with some corner case.
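
For illustration, tuning both knobs from the command line would look like
the below (paths as above; whether the new entry takes 0/1 is an assumption
on my part):

# e.g. split THPs that are more than ~80% zero-filled under memory pressure
echo 409 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
# or opt out of the shrinker entirely
echo 0 > /sys/kernel/mm/transparent_hugepage/thp_low_util_shrinker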

For the first effect you mentioned, with the default behaviour of the patches (max_ptes_none set to 511), there will be no splitting of THPs, so you will get the same performance as without the series.
If there is some benchmark that allocates all of the system memory with zero pages (causing the shrinker to run), and someone has changed max_ptes_none, kept thp_low_util_shrinker enabled, and all the benchmark does is read those pages, thus giving good memory results, then that benchmark is not really useful; the good results it gives are not unrealistic but a consequence of these patches. The stress example I have in the cover letter is an example. With these patches you can run stress, or any other benchmark that behaves like this, and still run other applications that consume memory at the same time, so the improvement is not unrealistic.

For the second effect, of memory faults affecting latency-sensitive applications: if THP is always enabled and such applications are running out of memory, causing the shrinker to run, then the higher priority should be having memory to run at all (which the shrinker will provide) rather than stalling under memory pressure, which will result in latency spikes and possibly the OOM killer being invoked and killing the application.

I think we should focus on real-world applications, for which I have posted numbers in the cover letter, and not tailor this for benchmarks. If there is some real-world low-latency application where you can show these patches causing an issue, I would be happy to look into it. But again, with the default max_ptes_none of 511, they wouldn't.
 

> The other problem I have with your patch is that it may cause the kernel
> to pollute CPU caches in the background, which again will cause noise in
> the system. Instead of plain memchr_inv, you should probably use some
> primitive to bypass caches or use a NTA prefetch hint at least.
> 

A few points on this:
- the page is checked in 2 places, at shrink time and at split time, so having the page in cache is useful and needed.
- there is stuff like this already done in the kernel when there is memory pressure, e.g. at swap time [1]. It's not memchr_inv, but it does the exact same thing as memchr_inv.
- At the time the shrinker runs, one of the highest priorities of the kernel/system is to free up memory. We should not try to make this slower by messing around with caches.

I think the current behaviour in the patches is good because of the above points. But also, I don't think there is a standard way of doing an NTA prefetch across all architectures: x86 prefetch does it [2], but arm64 prefetch [3] does pldl1keep, i.e. keeps the data in the L1 cache, which is the opposite of what an NTA prefetch is intended to do.

[1] https://elixir.bootlin.com/linux/v6.10.4/source/mm/zswap.c#L1390
[2] https://elixir.bootlin.com/linux/v6.10.4/source/arch/x86/include/asm/processor.h#L614
[3] https://elixir.bootlin.com/linux/v6.10.4/source/arch/arm64/include/asm/processor.h#L360
Hugh Dickins Aug. 18, 2024, 5:13 a.m. UTC | #3
On Tue, 13 Aug 2024, Usama Arif wrote:

> The current upstream default policy for THP is always. However, Meta
> uses madvise in production as the current THP=always policy vastly
> overprovisions THPs in sparsely accessed memory areas, resulting in
> excessive memory pressure and premature OOM killing.
> Using madvise + relying on khugepaged has certain drawbacks over
> THP=always. Using madvise hints mean THPs aren't "transparent" and
> require userspace changes. Waiting for khugepaged to scan memory and
> collapse pages into THP can be slow and unpredictable in terms of performance
> (i.e. you dont know when the collapse will happen), while production
> environments require predictable performance. If there is enough memory
> available, its better for both performance and predictability to have
> a THP from fault time, i.e. THP=always rather than wait for khugepaged
> to collapse it, and deal with sparsely populated THPs when the system is
> running out of memory.
> 
> This patch-series is an attempt to mitigate the issue of running out of
> memory when THP is always enabled. During runtime whenever a THP is being
> faulted in or collapsed by khugepaged, the THP is added to a list.
> Whenever memory reclaim happens, the kernel runs the deferred_split
> shrinker which goes through the list and checks if the THP was underutilized,
> i.e. how many of the base 4K pages of the entire THP were zero-filled.
> If this number goes above a certain threshold, the shrinker will attempt
> to split that THP. Then at remap time, the pages that were zero-filled are
> mapped to the shared zeropage, hence saving memory. This method avoids the
> downside of wasting memory in areas where THP is sparsely filled when THP
> is always enabled, while still providing the upside THPs like reduced TLB
> misses without having to use madvise.
> 
> Meta production workloads that were CPU bound (>99% CPU utilzation) were
> tested with THP shrinker. The results after 2 hours are as follows:
> 
>                             | THP=madvise |  THP=always   | THP=always
>                             |             |               | + shrinker series
>                             |             |               | + max_ptes_none=409
> -----------------------------------------------------------------------------
> Performance improvement     |      -      |    +1.8%      |     +1.7%
> (over THP=madvise)          |             |               |
> -----------------------------------------------------------------------------
> Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
> -----------------------------------------------------------------------------
> max_ptes_none=409 means that any THP that has more than 409 out of 512
> (80%) zero filled filled pages will be split.
> 
> To test out the patches, the below commands without the shrinker will
> invoke OOM killer immediately and kill stress, but will not fail with
> the shrinker:
> 
> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> mkdir /sys/fs/cgroup/test
> echo $$ > /sys/fs/cgroup/test/cgroup.procs
> echo 20M > /sys/fs/cgroup/test/memory.max
> echo 0 > /sys/fs/cgroup/test/memory.swap.max
> # allocate twice memory.max for each stress worker and touch 40/512 of
> # each THP, i.e. vm-stride 50K.
> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
> # killer.
> # Without the shrinker, OOM killer is invoked immediately irrespective
> # of max_ptes_none value and kills stress.
> stress --vm 1 --vm-bytes 40M --vm-stride 50K
> 
> v2 -> v3:
> - Use my_zero_pfn instead of page_to_pfn(ZERO_PAGE(..)) (Johannes)
> - Use flags argument instead of bools in remove_migration_ptes (Johannes)
> - Use a new flag in folio->_flags_1 instead of folio->_partially_mapped
>   (David Hildenbrand).
> - Split out the last patch of v2 into 3, one for introducing the flag,
>   one for splitting underutilized THPs on _deferred_list and one for adding
>   sysfs entry to disable splitting (David Hildenbrand).
> 
> v1 -> v2:
> - Turn page checks and operations to folio versions in __split_huge_page.
>   This means patches 1 and 2 from v1 are no longer needed.
>   (David Hildenbrand)
> - Map to shared zeropage in all cases if the base page is zero-filled.
>   The uffd selftest was removed.
>   (David Hildenbrand).
> - rename 'dirty' to 'contains_data' in try_to_map_unused_to_zeropage
>   (Rik van Riel).
> - Use unsigned long instead of uint64_t (kernel test robot).
>  
> Alexander Zhu (1):
>   mm: selftest to verify zero-filled pages are mapped to zeropage
> 
> Usama Arif (3):
>   mm: Introduce a pageflag for partially mapped folios
>   mm: split underutilized THPs
>   mm: add sysfs entry to disable splitting underutilized THPs
> 
> Yu Zhao (2):
>   mm: free zapped tail pages when splitting isolated thp
>   mm: remap unused subpages to shared zeropage when splitting isolated
>     thp

Sorry, I don't have time to review this, but notice you're intending
a v4 of the series, so want to bring up four points quickly before that.

1. Even with the two fixes to 1/6 in __split_huge_page(), under load
this series causes system lockup, with interrupts disabled on most CPUs.

The error is in deferred_split_scan(), where the old code just did
a list_splice_tail() under split_queue_lock, but this series ends up
doing more there, including a folio_put(): deadlock when racing, and
that is the final folio_put() which brings refcount down to 0, which
then wants to take split_queue_lock.

The patch I've been using successfully on 6.11-rc3-next-20240816 below:
I do have other problems with current mm commits, so have not been able
to sustain a load for very long, but suspect those problems unrelated
to this series. Please fold this fix, or your own equivalent, into
your next version.

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3270,7 +3270,8 @@ static void __split_huge_page(struct pag
 
 			folio_clear_active(new_folio);
 			folio_clear_unevictable(new_folio);
-			if (!folio_batch_add(&free_folios, folio)) {
+			list_del(&new_folio->lru);
+			if (!folio_batch_add(&free_folios, new_folio)) {
 				mem_cgroup_uncharge_folios(&free_folios);
 				free_unref_folios(&free_folios);
 			}
@@ -3706,42 +3707,37 @@ static unsigned long deferred_split_scan
 		bool did_split = false;
 		bool underutilized = false;
 
-		if (folio_test_partially_mapped(folio))
-			goto split;
-		underutilized = thp_underutilized(folio);
-		if (underutilized)
-			goto split;
-		continue;
-split:
+		if (!folio_test_partially_mapped(folio)) {
+			underutilized = thp_underutilized(folio);
+			if (!underutilized)
+				goto next;
+		}
 		if (!folio_trylock(folio))
-			continue;
-		did_split = !split_folio(folio);
-		folio_unlock(folio);
-		if (did_split) {
-			/* Splitting removed folio from the list, drop reference here */
-			folio_put(folio);
+			goto next;
+		if (!split_folio(folio)) {
+			did_split = true;
 			if (underutilized)
 				count_vm_event(THP_UNDERUTILIZED_SPLIT_PAGE);
 			split++;
 		}
-	}
-
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	/*
-	 * Only add back to the queue if folio is partially mapped.
-	 * If thp_underutilized returns false, or if split_folio fails in
-	 * the case it was underutilized, then consider it used and don't
-	 * add it back to split_queue.
-	 */
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
-		if (folio_test_partially_mapped(folio))
-			list_move(&folio->_deferred_list, &ds_queue->split_queue);
-		else {
+		folio_unlock(folio);
+next:
+		/*
+		 * split_folio() removes folio from list on success.
+		 * Only add back to the queue if folio is partially mapped.
+		 * If thp_underutilized returns false, or if split_folio fails
+		 * in the case it was underutilized, then consider it used and
+		 * don't add it back to split_queue.
+		 */
+		if (!did_split && !folio_test_partially_mapped(folio)) {
 			list_del_init(&folio->_deferred_list);
 			ds_queue->split_queue_len--;
 		}
 		folio_put(folio);
 	}
+
+	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+	list_splice_tail(&list, &ds_queue->split_queue);
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
 	/*

2. I don't understand why there needs to be a new PG_partially_mapped
flag, with all its attendant sets and tests and clears all over.  Why
can't deferred_split_scan() detect that case for itself, using the
criteria from __folio_remove_rmap()? I see folio->_nr_pages_mapped
is commented "Do not use outside of rmap and debug code", and
folio_nr_pages_mapped() is currently only used from mm/debug.c; but
using the info already maintained is preferable to adding a PG_flag
(and perhaps more efficient - skips splitting when _nr_pages_mapped
already fell to 0 and folio will soon be freed).

3. Everything in /sys/kernel/mm/transparent_hugepage/ is about THPs,
so please remove the "thp_" from "thp_low_util_shrinker" -
"shrink_underused" perhaps.  And it needs a brief description in
Documentation/admin-guide/mm/transhuge.rst.

4. Wouldn't "underused" be better than "underutilized" throughout?

Hugh
David Hildenbrand Aug. 18, 2024, 7:45 a.m. UTC | #4
Hi Hugh,

> 2. I don't understand why there needs to be a new PG_partially_mapped
> flag, with all its attendant sets and tests and clears all over.  Why
> can't deferred_split_scan() detect that case for itself, using the
> criteria from __folio_remove_rmap()? I see folio->_nr_pages_mapped
> is commented "Do not use outside of rmap and debug code", and
> folio_nr_pages_mapped() is currently only used from mm/debug.c; but
> using the info already maintained is preferable to adding a PG_flag
> (and perhaps more efficient - skips splitting when _nr_pages_mapped
> already fell to 0 and folio will soon be freed).

No new users of _nr_pages_mapped if easily/cleanly avoidable, please.

I'm currently cleaning up the final patches that introduce a new kernel 
config where we will stop maintaining the page->_mapcount for large 
folios (and consequently have to stop maintaining folio->_nr_pages_mapped).

That's the main reason for the comment -- at one point in my life I 
want to be done with that project ;) .

folio->_nr_pages_mapped will still exist and be maintained without the 
new kernel config enabled. But in the new one, once we detect a 
partial mapping we'll have to flag the folio -- for example as done in 
this series.

Having two ways of handling that, depending on the kernel config, will 
not make the code any better.

But I agree that we should look into minimizing the usage of any new 
such flag: I'd have thought we only have to set the flag once, once we 
detect a partial mapping ... still have to review that patch more 
thoroughly.

> 
> 3. Everything in /sys/kernel/mm/transparent_hugepage/ is about THPs,
> so please remove the "thp_" from "thp_low_util_shrinker" -
> "shrink_underused" perhaps.  And it needs a brief description in
> Documentation/admin-guide/mm/transhuge.rst.

agreed.
Usama Arif Aug. 19, 2024, 2:36 a.m. UTC | #5
On 18/08/2024 06:13, Hugh Dickins wrote:
> On Tue, 13 Aug 2024, Usama Arif wrote:
> 
>> The current upstream default policy for THP is always. However, Meta
>> uses madvise in production as the current THP=always policy vastly
>> overprovisions THPs in sparsely accessed memory areas, resulting in
>> excessive memory pressure and premature OOM killing.
>> Using madvise + relying on khugepaged has certain drawbacks over
>> THP=always. Using madvise hints mean THPs aren't "transparent" and
>> require userspace changes. Waiting for khugepaged to scan memory and
>> collapse pages into THP can be slow and unpredictable in terms of performance
>> (i.e. you dont know when the collapse will happen), while production
>> environments require predictable performance. If there is enough memory
>> available, its better for both performance and predictability to have
>> a THP from fault time, i.e. THP=always rather than wait for khugepaged
>> to collapse it, and deal with sparsely populated THPs when the system is
>> running out of memory.
>>
>> This patch-series is an attempt to mitigate the issue of running out of
>> memory when THP is always enabled. During runtime whenever a THP is being
>> faulted in or collapsed by khugepaged, the THP is added to a list.
>> Whenever memory reclaim happens, the kernel runs the deferred_split
>> shrinker which goes through the list and checks if the THP was underutilized,
>> i.e. how many of the base 4K pages of the entire THP were zero-filled.
>> If this number goes above a certain threshold, the shrinker will attempt
>> to split that THP. Then at remap time, the pages that were zero-filled are
>> mapped to the shared zeropage, hence saving memory. This method avoids the
>> downside of wasting memory in areas where THP is sparsely filled when THP
>> is always enabled, while still providing the upside THPs like reduced TLB
>> misses without having to use madvise.
>>
>> Meta production workloads that were CPU bound (>99% CPU utilzation) were
>> tested with THP shrinker. The results after 2 hours are as follows:
>>
>>                             | THP=madvise |  THP=always   | THP=always
>>                             |             |               | + shrinker series
>>                             |             |               | + max_ptes_none=409
>> -----------------------------------------------------------------------------
>> Performance improvement     |      -      |    +1.8%      |     +1.7%
>> (over THP=madvise)          |             |               |
>> -----------------------------------------------------------------------------
>> Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
>> -----------------------------------------------------------------------------
>> max_ptes_none=409 means that any THP that has more than 409 out of 512
>> (80%) zero filled filled pages will be split.
>>
>> To test out the patches, the below commands without the shrinker will
>> invoke OOM killer immediately and kill stress, but will not fail with
>> the shrinker:
>>
>> echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
>> mkdir /sys/fs/cgroup/test
>> echo $$ > /sys/fs/cgroup/test/cgroup.procs
>> echo 20M > /sys/fs/cgroup/test/memory.max
>> echo 0 > /sys/fs/cgroup/test/memory.swap.max
>> # allocate twice memory.max for each stress worker and touch 40/512 of
>> # each THP, i.e. vm-stride 50K.
>> # With the shrinker, max_ptes_none of 470 and below won't invoke OOM
>> # killer.
>> # Without the shrinker, OOM killer is invoked immediately irrespective
>> # of max_ptes_none value and kills stress.
>> stress --vm 1 --vm-bytes 40M --vm-stride 50K
>>
>> v2 -> v3:
>> - Use my_zero_pfn instead of page_to_pfn(ZERO_PAGE(..)) (Johannes)
>> - Use flags argument instead of bools in remove_migration_ptes (Johannes)
>> - Use a new flag in folio->_flags_1 instead of folio->_partially_mapped
>>   (David Hildenbrand).
>> - Split out the last patch of v2 into 3, one for introducing the flag,
>>   one for splitting underutilized THPs on _deferred_list and one for adding
>>   sysfs entry to disable splitting (David Hildenbrand).
>>
>> v1 -> v2:
>> - Turn page checks and operations to folio versions in __split_huge_page.
>>   This means patches 1 and 2 from v1 are no longer needed.
>>   (David Hildenbrand)
>> - Map to shared zeropage in all cases if the base page is zero-filled.
>>   The uffd selftest was removed.
>>   (David Hildenbrand).
>> - rename 'dirty' to 'contains_data' in try_to_map_unused_to_zeropage
>>   (Rik van Riel).
>> - Use unsigned long instead of uint64_t (kernel test robot).
>>  
>> Alexander Zhu (1):
>>   mm: selftest to verify zero-filled pages are mapped to zeropage
>>
>> Usama Arif (3):
>>   mm: Introduce a pageflag for partially mapped folios
>>   mm: split underutilized THPs
>>   mm: add sysfs entry to disable splitting underutilized THPs
>>
>> Yu Zhao (2):
>>   mm: free zapped tail pages when splitting isolated thp
>>   mm: remap unused subpages to shared zeropage when splitting isolated
>>     thp
> 
> Sorry, I don't have time to review this, but notice you're intending
> a v4 of the series, so want to bring up four points quickly before that.
> 
> 1. Even with the two fixes to 1/6 in __split_huge_page(), under load
> this series causes system lockup, with interrupts disabled on most CPUs.
> 
> The error is in deferred_split_scan(), where the old code just did
> a list_splice_tail() under split_queue_lock, but this series ends up
> doing more there, including a folio_put(): deadlock when racing, and
> that is the final folio_put() which brings refcount down to 0, which
> then wants to take split_queue_lock.
> 
> The patch I've been using successfully on 6.11-rc3-next-20240816 below:
> I do have other problems with current mm commits, so have not been able
> to sustain a load for very long, but suspect those problems unrelated
> to this series. Please fold this fix, or your own equivalent, into
> your next version.
> 
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3270,7 +3270,8 @@ static void __split_huge_page(struct pag
>  
>  			folio_clear_active(new_folio);
>  			folio_clear_unevictable(new_folio);
> -			if (!folio_batch_add(&free_folios, folio)) {
> +			list_del(&new_folio->lru);
> +			if (!folio_batch_add(&free_folios, new_folio)) {
>  				mem_cgroup_uncharge_folios(&free_folios);
>  				free_unref_folios(&free_folios);
>  			}
> @@ -3706,42 +3707,37 @@ static unsigned long deferred_split_scan
>  		bool did_split = false;
>  		bool underutilized = false;
>  
> -		if (folio_test_partially_mapped(folio))
> -			goto split;
> -		underutilized = thp_underutilized(folio);
> -		if (underutilized)
> -			goto split;
> -		continue;
> -split:
> +		if (!folio_test_partially_mapped(folio)) {
> +			underutilized = thp_underutilized(folio);
> +			if (!underutilized)
> +				goto next;
> +		}
>  		if (!folio_trylock(folio))
> -			continue;
> -		did_split = !split_folio(folio);
> -		folio_unlock(folio);
> -		if (did_split) {
> -			/* Splitting removed folio from the list, drop reference here */
> -			folio_put(folio);
> +			goto next;
> +		if (!split_folio(folio)) {
> +			did_split = true;
>  			if (underutilized)
>  				count_vm_event(THP_UNDERUTILIZED_SPLIT_PAGE);
>  			split++;
>  		}
> -	}
> -
> -	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	/*
> -	 * Only add back to the queue if folio is partially mapped.
> -	 * If thp_underutilized returns false, or if split_folio fails in
> -	 * the case it was underutilized, then consider it used and don't
> -	 * add it back to split_queue.
> -	 */
> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> -		if (folio_test_partially_mapped(folio))
> -			list_move(&folio->_deferred_list, &ds_queue->split_queue);
> -		else {
> +		folio_unlock(folio);
> +next:
> +		/*
> +		 * split_folio() removes folio from list on success.
> +		 * Only add back to the queue if folio is partially mapped.
> +		 * If thp_underutilized returns false, or if split_folio fails
> +		 * in the case it was underutilized, then consider it used and
> +		 * don't add it back to split_queue.
> +		 */
> +		if (!did_split && !folio_test_partially_mapped(folio)) {
>  			list_del_init(&folio->_deferred_list);
>  			ds_queue->split_queue_len--;
>  		}
>  		folio_put(folio);
>  	}
> +
> +	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> +	list_splice_tail(&list, &ds_queue->split_queue);
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  
>  	/*
> 
> 2. I don't understand why there needs to be a new PG_partially_mapped
> flag, with all its attendant sets and tests and clears all over.  Why
> can't deferred_split_scan() detect that case for itself, using the
> criteria from __folio_remove_rmap()? I see folio->_nr_pages_mapped
> is commented "Do not use outside of rmap and debug code", and
> folio_nr_pages_mapped() is currently only used from mm/debug.c; but
> using the info already maintained is preferable to adding a PG_flag
> (and perhaps more efficient - skips splitting when _nr_pages_mapped
> already fell to 0 and folio will soon be freed).
> 
> 3. Everything in /sys/kernel/mm/transparent_hugepage/ is about THPs,
> so please remove the "thp_" from "thp_low_util_shrinker" -
> "shrink_underused" perhaps.  And it needs a brief description in
> Documentation/admin-guide/mm/transhuge.rst.
> 
> 4. Wouldn't "underused" be better than "underutilized" throughout?
> 
> Hugh

Thanks for the review, especially for pointing out the deadlock. I have addressed points 1, 3 and 4 in v4; point 2 was addressed by David.
Usama Arif Aug. 19, 2024, 2:38 a.m. UTC | #6
On 18/08/2024 08:45, David Hildenbrand wrote:
> Hi Hugh,
> 
>> 2. I don't understand why there needs to be a new PG_partially_mapped
>> flag, with all its attendant sets and tests and clears all over.  Why
>> can't deferred_split_scan() detect that case for itself, using the
>> criteria from __folio_remove_rmap()? I see folio->_nr_pages_mapped
>> is commented "Do not use outside of rmap and debug code", and
>> folio_nr_pages_mapped() is currently only used from mm/debug.c; but
>> using the info already maintained is preferable to adding a PG_flag
>> (and perhaps more efficient - skips splitting when _nr_pages_mapped
>> already fell to 0 and folio will soon be freed).
> 
> No new users of _nr_pages_mapped if easily/cleanly avoidable, please.
> 
> I'm currently cleaning up the final patches that introduce a new kernel config where we will stop maintaining the page->_mapcount for large folios (and consequently have to stop maintaining folio->_nr_pages_mapped).
> 
> That's the main reasons for the comment -- at one point in my life I want to be done with that project ;) .
> 
> folio->_nr_pages_mapped will still exist and be maintained without the new kernel config enabled. But in the new one, once we'll detect a partial mapping we'll have to flag the folio -- for example as done in this series.
> 
> Having two ways of handling that, depending on the kernel config, will not make the code any better.
> 
> But I agree that we should look into minimizing the usage of any new such flag: I'd have thought we only have to set the flag once, once we detect a partial mapping ... still have to review that patch more thoroughly.

Yes, the flag is set only once, in deferred_split_folio once we detect a partial mapping.
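
To illustrate the shape of that (a sketch only; the helper names assume the
new page flag generates the usual folio_set_partially_mapped() accessor,
and this is not the exact code in the series):

/*
 * The flag is set exactly once, when a partial unmap first queues the
 * folio on the deferred split list; shrinker passes only test it.
 */
void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
	unsigned long flags;

	if (partially_mapped)
		folio_set_partially_mapped(folio);

	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
	if (list_empty(&folio->_deferred_list)) {
		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
		ds_queue->split_queue_len++;
	}
	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
}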