[RFC,v2,0/9] khugepaged: mTHP support

Message ID 20250211003028.213461-1-npache@redhat.com (mailing list archive)
Series khugepaged: mTHP support

Message

Nico Pache Feb. 11, 2025, 12:30 a.m. UTC
The following series provides khugepaged and madvise collapse with the
capability to collapse regions to mTHPs.

To achieve this we generalize the khugepaged functions to no longer depend
on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
(defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
using a bitmap. After the PMD scan is done, we do binary recursion on the
bitmap to find the optimal mTHP sizes for the PMD range. The restriction
on max_ptes_none is removed during the scan, to make sure we account for
the whole PMD range. max_ptes_none will be scaled by the attempted collapse
order to determine how full a THP must be to be eligible. If an mTHP collapse
is attempted but the region contains swapped-out or shared pages, we don't
perform the collapse.
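
To make the bitmap step a bit more concrete, here is a simplified sketch of
the recursion (illustrative pseudocode, not the actual code in patch 6;
chunks_set(), enough_for_order() and collapse_range() are placeholder helpers):

        /*
         * Sketch: each bit covers one MTHP_MIN_ORDER-sized chunk of the PMD
         * range.  Try the current order; if the range is not "full enough"
         * under the scaled max_ptes_none test, split it in half and try the
         * next lower order on each half.
         */
        static void scan_bitmap_sketch(unsigned long *bits, int first,
                                       int nr_chunks, int order)
        {
                if (enough_for_order(chunks_set(bits, first, nr_chunks), order)) {
                        collapse_range(first, order);   /* attempt a collapse of this order */
                        return;
                }
                if (order <= MTHP_MIN_ORDER)            /* nothing smaller left to try */
                        return;
                scan_bitmap_sketch(bits, first, nr_chunks / 2, order - 1);
                scan_bitmap_sketch(bits, first + nr_chunks / 2, nr_chunks / 2, order - 1);
        }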

With the default max_ptes_none=511, the code should keep most of its
original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep", where
khugepaged constantly promotes mTHPs to the next available size.
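
A simplified sketch of one way to do this scaling (illustrative only, not
necessarily the exact formula in the series):

        static int scaled_max_ptes_none(int order)
        {
                /* e.g. on 4k pages, 511 at PMD order becomes 511 >> 5 = 15 at order 4 */
                return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
        }

Assuming that kind of scaling, the "creep" is easy to see on 4k pages: with
max_ptes_none = 384 (> HPAGE_PMD_NR/2), an order-4 collapse needs only
16 - (384 >> 5) = 4 present PTEs, but once it succeeds the 16 now-present PTEs
already exceed the 32 - (384 >> 4) = 8 needed for order 5, which in turn
satisfies order 6, and so on up to PMD order.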

Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
Patch 2:     Refactor/rename hpage_collapse
Patch 3-5:   Generalize khugepaged functions for arbitrary orders
Patch 6-9:   The mTHP patches

---------
 Testing
---------
- Built for x86_64, aarch64, ppc64le, and s390x
- selftests mm
- I created a test script that I used to push khugepaged to its limits while
   monitoring a number of stats and tracepoints. The code is available
   here[1] (run in legacy mode for these changes and set the mTHP sizes to inherit).
   The summary from my testing was that there was no significant regression
   noticed through this test. In some cases my changes had better collapse
   latencies and were able to scan more pages in the same amount of time/work,
   but for the most part the results were consistent.
- Redis testing. I tested these changes along with my defer changes
  (see the follow-up post for more details).
- some basic testing on 64k page size.
- lots of general use. These changes have been running in my VM for some time.

Changes since V1 [2]:
- Minor bug fixes discovered during review and testing
- Removed dynamic allocations for bitmaps, and made them stack based.
- Adjusted bitmap offset from u8 to u16 to support 64k page size.
- Updated trace events to include collapsing order info.
- Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
- No longer require a chunk to be fully utilized before setting the bit. Use
   the same max_ptes_none scaling principle to achieve this (see the sketch
   after this list).
- Skip mTHP collapse that requires swap-in or shared-page handling. This helps
   prevent some of the "creep" that was discovered in v1.

[1] - https://gitlab.com/npache/khugepaged_mthp_test
[2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/

Nico Pache (9):
  introduce khugepaged_collapse_single_pmd to unify khugepaged and
    madvise_collapse
  khugepaged: rename hpage_collapse_* to khugepaged_*
  khugepaged: generalize hugepage_vma_revalidate for mTHP support
  khugepaged: generalize alloc_charge_folio for mTHP support
  khugepaged: generalize __collapse_huge_page_* for mTHP support
  khugepaged: introduce khugepaged_scan_bitmap for mTHP support
  khugepaged: add mTHP support
  khugepaged: improve tracepoints for mTHP orders
  khugepaged: skip collapsing mTHP to smaller orders

 include/linux/khugepaged.h         |   4 +
 include/trace/events/huge_memory.h |  34 ++-
 mm/khugepaged.c                    | 422 +++++++++++++++++++----------
 3 files changed, 306 insertions(+), 154 deletions(-)

Comments

Dev Jain Feb. 11, 2025, 12:49 p.m. UTC | #1
On 11/02/25 6:00 am, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> order to determine how full a THP must be to be eligible. If a mTHP collapse
> is attempted, but contains swapped out, or shared pages, we dont perform the
> collapse.
> 
> With the default max_ptes_none=511, the code should keep its most of its
> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> constantly promote mTHPs to the next available size.
> 
> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> Patch 2:     Refactor/rename hpage_collapse
> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> Patch 6-9:   The mTHP patches
> 
> ---------
>   Testing
> ---------
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - I created a test script that I used to push khugepaged to its limits while
>     monitoring a number of stats and tracepoints. The code is available
>     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>     The summary from my testings was that there was no significant regression
>     noticed through this test. In some cases my changes had better collapse
>     latencies, and was able to scan more pages in the same amount of time/work,
>     but for the most part the results were consistant.
> - redis testing. I tested these changes along with my defer changes
>    (see followup post for more details).
> - some basic testing on 64k page size.
> - lots of general use. These changes have been running in my VM for some time.
> 
> Changes since V1 [2]:
> - Minor bug fixes discovered during review and testing
> - removed dynamic allocations for bitmaps, and made them stack based
> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> - Updated trace events to include collapsing order info.
> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> - No longer require a chunk to be fully utilized before setting the bit. Use
>     the same max_ptes_none scaling principle to achieve this.
> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>     some of the "creep" that was discovered in v1.
> 
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> 
> Nico Pache (9):
>    introduce khugepaged_collapse_single_pmd to unify khugepaged and
>      madvise_collapse
>    khugepaged: rename hpage_collapse_* to khugepaged_*
>    khugepaged: generalize hugepage_vma_revalidate for mTHP support
>    khugepaged: generalize alloc_charge_folio for mTHP support
>    khugepaged: generalize __collapse_huge_page_* for mTHP support
>    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>    khugepaged: add mTHP support
>    khugepaged: improve tracepoints for mTHP orders
>    khugepaged: skip collapsing mTHP to smaller orders
> 
>   include/linux/khugepaged.h         |   4 +
>   include/trace/events/huge_memory.h |  34 ++-
>   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>   3 files changed, 306 insertions(+), 154 deletions(-)
> 

Does this patchset suffer from the problem described here:
https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
Nico Pache Feb. 12, 2025, 4:49 p.m. UTC | #2
On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 11/02/25 6:00 am, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> > order to determine how full a THP must be to be eligible. If a mTHP collapse
> > is attempted, but contains swapped out, or shared pages, we dont perform the
> > collapse.
> >
> > With the default max_ptes_none=511, the code should keep its most of its
> > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> > constantly promote mTHPs to the next available size.
> >
> > Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> > Patch 2:     Refactor/rename hpage_collapse
> > Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> > Patch 6-9:   The mTHP patches
> >
> > ---------
> >   Testing
> > ---------
> > - Built for x86_64, aarch64, ppc64le, and s390x
> > - selftests mm
> > - I created a test script that I used to push khugepaged to its limits while
> >     monitoring a number of stats and tracepoints. The code is available
> >     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >     The summary from my testings was that there was no significant regression
> >     noticed through this test. In some cases my changes had better collapse
> >     latencies, and was able to scan more pages in the same amount of time/work,
> >     but for the most part the results were consistant.
> > - redis testing. I tested these changes along with my defer changes
> >    (see followup post for more details).
> > - some basic testing on 64k page size.
> > - lots of general use. These changes have been running in my VM for some time.
> >
> > Changes since V1 [2]:
> > - Minor bug fixes discovered during review and testing
> > - removed dynamic allocations for bitmaps, and made them stack based
> > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> > - Updated trace events to include collapsing order info.
> > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> > - No longer require a chunk to be fully utilized before setting the bit. Use
> >     the same max_ptes_none scaling principle to achieve this.
> > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >     some of the "creep" that was discovered in v1.
> >
> > [1] - https://gitlab.com/npache/khugepaged_mthp_test
> > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >
> > Nico Pache (9):
> >    introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >      madvise_collapse
> >    khugepaged: rename hpage_collapse_* to khugepaged_*
> >    khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >    khugepaged: generalize alloc_charge_folio for mTHP support
> >    khugepaged: generalize __collapse_huge_page_* for mTHP support
> >    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >    khugepaged: add mTHP support
> >    khugepaged: improve tracepoints for mTHP orders
> >    khugepaged: skip collapsing mTHP to smaller orders
> >
> >   include/linux/khugepaged.h         |   4 +
> >   include/trace/events/huge_memory.h |  34 ++-
> >   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >   3 files changed, 306 insertions(+), 154 deletions(-)
> >
>
> Does this patchset suffer from the problem described here:
> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
Hi Dev,

Sorry I meant to get back to you about that.

I understand your concern, but like I've mentioned before, the scan
with the read lock was done so we don't have to do the more expensive
locking, while still gaining insight into the state. You are right
that this info could become stale if the state changes dramatically,
but the collapse_isolate function will verify it and not collapse.
From my testing I found this rarely happens.

Also, khugepaged, my changes, and your changes are all victims of
this. Once we drop the read lock (either to allocate the folio, or
right before acquiring the write lock), the state can change. In your
case, yes, you are gathering more up-to-date information, but is it
really that important/worth it to retake locks and rescan for each
instance if we are about to reverify with the write lock taken?

So in my eyes, this is not a "problem"

Cheers,
-- Nico


>
Dev Jain Feb. 13, 2025, 8:26 a.m. UTC | #3
On 12/02/25 10:19 pm, Nico Pache wrote:
> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 11/02/25 6:00 am, Nico Pache wrote:
>>> The following series provides khugepaged and madvise collapse with the
>>> capability to collapse regions to mTHPs.
>>>
>>> To achieve this we generalize the khugepaged functions to no longer depend
>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>> on max_ptes_none is removed during the scan, to make sure we account for
>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
>>> is attempted, but contains swapped out, or shared pages, we dont perform the
>>> collapse.
>>>
>>> With the default max_ptes_none=511, the code should keep its most of its
>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>> constantly promote mTHPs to the next available size.
>>>
>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
>>> Patch 2:     Refactor/rename hpage_collapse
>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>> Patch 6-9:   The mTHP patches
>>>
>>> ---------
>>>    Testing
>>> ---------
>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>> - selftests mm
>>> - I created a test script that I used to push khugepaged to its limits while
>>>      monitoring a number of stats and tracepoints. The code is available
>>>      here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>>>      The summary from my testings was that there was no significant regression
>>>      noticed through this test. In some cases my changes had better collapse
>>>      latencies, and was able to scan more pages in the same amount of time/work,
>>>      but for the most part the results were consistant.
>>> - redis testing. I tested these changes along with my defer changes
>>>     (see followup post for more details).
>>> - some basic testing on 64k page size.
>>> - lots of general use. These changes have been running in my VM for some time.
>>>
>>> Changes since V1 [2]:
>>> - Minor bug fixes discovered during review and testing
>>> - removed dynamic allocations for bitmaps, and made them stack based
>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>> - Updated trace events to include collapsing order info.
>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>> - No longer require a chunk to be fully utilized before setting the bit. Use
>>>      the same max_ptes_none scaling principle to achieve this.
>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>>>      some of the "creep" that was discovered in v1.
>>>
>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>
>>> Nico Pache (9):
>>>     introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>       madvise_collapse
>>>     khugepaged: rename hpage_collapse_* to khugepaged_*
>>>     khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>     khugepaged: generalize alloc_charge_folio for mTHP support
>>>     khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>     khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>     khugepaged: add mTHP support
>>>     khugepaged: improve tracepoints for mTHP orders
>>>     khugepaged: skip collapsing mTHP to smaller orders
>>>
>>>    include/linux/khugepaged.h         |   4 +
>>>    include/trace/events/huge_memory.h |  34 ++-
>>>    mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>    3 files changed, 306 insertions(+), 154 deletions(-)
>>>
>>
>> Does this patchset suffer from the problem described here:
>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
> Hi Dev,
> 
> Sorry I meant to get back to you about that.
> 
> I understand your concern, but like I've mentioned before, the scan
> with the read lock was done so we dont have to do the more expensive
> locking, and could still gain insight into the state. You are right
> that this info could become stale if the state changes dramatically,
> but the collapse_isolate function will verify it and not collapse.

If the state changes dramatically, the _isolate function will verify it
and fall back. And this fallback happens after following this costly
path: retrieve a large folio from the buddy allocator -> swap in pages
from disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush
on all CPUs -> fall back in _isolate().
If you do fail in _isolate(), doesn't it make sense to get the updated 
state for the next fallback order immediately, because we have prior 
information that we failed because of PTE state? What your algorithm 
will do is *still* follow the costly path described above, and again 
fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like 
mine would.

The verification of the PTE state by the _isolate() function is the "no 
turning back" point of the algorithm. The verification by 
hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth 
it, before we do costly operations" point of the algorithm.

>  From my testing I found this to rarely happen.

Unfortunately, I am not very familiar with performance testing/load
testing; I am fairly new to kernel programming, so I am getting there.
But it really depends on the type of test you are running, what actually
runs on memory-intensive systems, etc. In fact, on loaded systems I
would expect the PTE state to change dramatically. But still, no opinion
here.

> 
> Also, khugepaged, my changes, and your changes are all a victim of
> this. Once we drop the read lock (to either allocate the folio, or
> right before acquiring the write_lock), the state can change. In your
> case, yes, you are gathering more up to date information, but is it
> really that important/worth it to retake locks and rescan for each
> instance if we are about to reverify with the write lock taken?

You said "reverify": You are removing the verification, so this step 
won't be reverification, it will be verification. We do not want to 
verify *after* we have already done 95% of latency-heavy stuff, only to 
know that we are going to fail.

Algorithms in the kernel, in general, are of the following form: 1) 
Verify if a condition is true, resulting in taking a control path -> 2) 
do a lot of stuff -> "no turning back" step, wherein before committing 
(by taking locks, say), reverify if this is the control path we should 
be in. You are eliminating step 1).

Therefore, I will have to say that I disagree with your approach.

On top of this, in the subjective analysis in [1], point number 7 (along 
with point number 1) remains. And, point number 4 remains.

[1] 
https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/

> 
> So in my eyes, this is not a "problem"

Looks like the kernel scheduled us for a high-priority debate, I hope 
there's no deadlock :)

> 
> Cheers,
> -- Nico
> 
> 
>>
>
Dev Jain Feb. 13, 2025, 11:21 a.m. UTC | #4
On 13/02/25 1:56 pm, Dev Jain wrote:
> 
> 
> On 12/02/25 10:19 pm, Nico Pache wrote:
>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>
>>>
>>> On 11/02/25 6:00 am, Nico Pache wrote:
>>>> The following series provides khugepaged and madvise collapse with the
>>>> capability to collapse regions to mTHPs.
>>>>
>>>> To achieve this we generalize the khugepaged functions to no longer 
>>>> depend
>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of 
>>>> pages
>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>> using a bitmap. After the PMD scan is done, we do binary recursion 
>>>> on the
>>>> bitmap to find the optimal mTHP sizes for the PMD range. The 
>>>> restriction
>>>> on max_ptes_none is removed during the scan, to make sure we account 
>>>> for
>>>> the whole PMD range. max_ptes_none will be scaled by the attempted 
>>>> collapse
>>>> order to determine how full a THP must be to be eligible. If a mTHP 
>>>> collapse
>>>> is attempted, but contains swapped out, or shared pages, we dont 
>>>> perform the
>>>> collapse.
>>>>
>>>> With the default max_ptes_none=511, the code should keep its most of 
>>>> its
>>>> original behavior. To exercise mTHP collapse we need to set 
>>>> max_ptes_none<=255.
>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse 
>>>> "creep" and
>>>> constantly promote mTHPs to the next available size.
>>>>
>>>> Patch 1:     Some refactoring to combine madvise_collapse and 
>>>> khugepaged
>>>> Patch 2:     Refactor/rename hpage_collapse
>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>>> Patch 6-9:   The mTHP patches
>>>>
>>>> ---------
>>>>    Testing
>>>> ---------
>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>> - selftests mm
>>>> - I created a test script that I used to push khugepaged to its 
>>>> limits while
>>>>      monitoring a number of stats and tracepoints. The code is 
>>>> available
>>>>      here[1] (Run in legacy mode for these changes and set mthp 
>>>> sizes to inherit)
>>>>      The summary from my testings was that there was no significant 
>>>> regression
>>>>      noticed through this test. In some cases my changes had better 
>>>> collapse
>>>>      latencies, and was able to scan more pages in the same amount 
>>>> of time/work,
>>>>      but for the most part the results were consistant.
>>>> - redis testing. I tested these changes along with my defer changes
>>>>     (see followup post for more details).
>>>> - some basic testing on 64k page size.
>>>> - lots of general use. These changes have been running in my VM for 
>>>> some time.
>>>>
>>>> Changes since V1 [2]:
>>>> - Minor bug fixes discovered during review and testing
>>>> - removed dynamic allocations for bitmaps, and made them stack based
>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>>> - Updated trace events to include collapsing order info.
>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>>> - No longer require a chunk to be fully utilized before setting the 
>>>> bit. Use
>>>>      the same max_ptes_none scaling principle to achieve this.
>>>> - Skip mTHP collapse that requires swapin or shared handling. This 
>>>> helps prevent
>>>>      some of the "creep" that was discovered in v1.
>>>>
>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1- 
>>>> npache@redhat.com/
>>>>
>>>> Nico Pache (9):
>>>>     introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>>       madvise_collapse
>>>>     khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>     khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>>     khugepaged: generalize alloc_charge_folio for mTHP support
>>>>     khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>>     khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>     khugepaged: add mTHP support
>>>>     khugepaged: improve tracepoints for mTHP orders
>>>>     khugepaged: skip collapsing mTHP to smaller orders
>>>>
>>>>    include/linux/khugepaged.h         |   4 +
>>>>    include/trace/events/huge_memory.h |  34 ++-
>>>>    mm/khugepaged.c                    | 422 ++++++++++++++++++ 
>>>> +----------
>>>>    3 files changed, 306 insertions(+), 154 deletions(-)
>>>>
>>>
>>> Does this patchset suffer from the problem described here:
>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680- 
>>> c2d48d4963b6@arm.com/
>> Hi Dev,
>>
>> Sorry I meant to get back to you about that.
>>
>> I understand your concern, but like I've mentioned before, the scan
>> with the read lock was done so we dont have to do the more expensive
>> locking, and could still gain insight into the state. You are right
>> that this info could become stale if the state changes dramatically,
>> but the collapse_isolate function will verify it and not collapse.
> 
> If the state changes dramatically, the _isolate function will verify it, 
> and fallback. And this fallback happens after following this costly 
> path: retrieve a large folio from the buddy allocator -> swapin pages 
> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush 
> on all CPUs -> fallback in _isolate().
> If you do fail in _isolate(), doesn't it make sense to get the updated 
> state for the next fallback order immediately, because we have prior 
> information that we failed because of PTE state? What your algorithm 
> will do is *still* follow the costly path described above, and again 
> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like 
> mine would.
> 
> The verification of the PTE state by the _isolate() function is the "no 
> turning back" point of the algorithm. The verification by 
> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth 
> it, before we do costly operations" point of the algorithm.
> 
>>  From my testing I found this to rarely happen.
> 
> Unfortunately, I am not very familiar with performance testing/load 
> testing, I am fairly new to kernel programming, so I am getting there. 
> But it really depends on the type of test you are running, what actually 
> runs on memory-intensive systems, etc etc. In fact, on loaded systems I 
> would expect the PTE state to dramatically change. But still, no opinion 
> here.
> 
>>
>> Also, khugepaged, my changes, and your changes are all a victim of
>> this. Once we drop the read lock (to either allocate the folio, or
>> right before acquiring the write_lock), the state can change. In your
>> case, yes, you are gathering more up to date information, but is it
>> really that important/worth it to retake locks and rescan for each
>> instance if we are about to reverify with the write lock taken?
> 
> You said "reverify": You are removing the verification, so this step 
> won't be reverification, it will be verification. We do not want to 
> verify *after* we have already done 95% of latency-heavy stuff, only to 
> know that we are going to fail.
> 
> Algorithms in the kernel, in general, are of the following form: 1) 
> Verify if a condition is true, resulting in taking a control path -> 2) 
> do a lot of stuff -> "no turning back" step, wherein before committing 
> (by taking locks, say), reverify if this is the control path we should 
> be in. You are eliminating step 1).
> 
> Therefore, I will have to say that I disagree with your approach.
> 
> On top of this, in the subjective analysis in [1], point number 7 (along 
> with point number 1) remains. And, point number 4 remains.
> 
> [1] https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b- 
> aba4b1a441b4@arm.com/
> 
>>
>> So in my eyes, this is not a "problem"
> 
> Looks like the kernel scheduled us for a high-priority debate, I hope 
> there's no deadlock :)
> 
>>
>> Cheers,
>> -- Nico
>>
>>

In any case, as Andrew notes, the first step is to justify that this
functionality is beneficial to get in. The second step would be to
compare the implementations. It would be great if someone from the
community could jump in to test these things out :) but in the meantime
I will begin working on the first step.

>>>
>>
> 
>
Nico Pache Feb. 13, 2025, 7:39 p.m. UTC | #5
On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 12/02/25 10:19 pm, Nico Pache wrote:
> > On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >>
> >> On 11/02/25 6:00 am, Nico Pache wrote:
> >>> The following series provides khugepaged and madvise collapse with the
> >>> capability to collapse regions to mTHPs.
> >>>
> >>> To achieve this we generalize the khugepaged functions to no longer depend
> >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> >>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>> on max_ptes_none is removed during the scan, to make sure we account for
> >>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> >>> order to determine how full a THP must be to be eligible. If a mTHP collapse
> >>> is attempted, but contains swapped out, or shared pages, we dont perform the
> >>> collapse.
> >>>
> >>> With the default max_ptes_none=511, the code should keep its most of its
> >>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> >>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> >>> constantly promote mTHPs to the next available size.
> >>>
> >>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> >>> Patch 2:     Refactor/rename hpage_collapse
> >>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >>> Patch 6-9:   The mTHP patches
> >>>
> >>> ---------
> >>>    Testing
> >>> ---------
> >>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>> - selftests mm
> >>> - I created a test script that I used to push khugepaged to its limits while
> >>>      monitoring a number of stats and tracepoints. The code is available
> >>>      here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >>>      The summary from my testings was that there was no significant regression
> >>>      noticed through this test. In some cases my changes had better collapse
> >>>      latencies, and was able to scan more pages in the same amount of time/work,
> >>>      but for the most part the results were consistant.
> >>> - redis testing. I tested these changes along with my defer changes
> >>>     (see followup post for more details).
> >>> - some basic testing on 64k page size.
> >>> - lots of general use. These changes have been running in my VM for some time.
> >>>
> >>> Changes since V1 [2]:
> >>> - Minor bug fixes discovered during review and testing
> >>> - removed dynamic allocations for bitmaps, and made them stack based
> >>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> >>> - Updated trace events to include collapsing order info.
> >>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> >>> - No longer require a chunk to be fully utilized before setting the bit. Use
> >>>      the same max_ptes_none scaling principle to achieve this.
> >>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >>>      some of the "creep" that was discovered in v1.
> >>>
> >>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >>>
> >>> Nico Pache (9):
> >>>     introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >>>       madvise_collapse
> >>>     khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>     khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>     khugepaged: generalize alloc_charge_folio for mTHP support
> >>>     khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>     khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>     khugepaged: add mTHP support
> >>>     khugepaged: improve tracepoints for mTHP orders
> >>>     khugepaged: skip collapsing mTHP to smaller orders
> >>>
> >>>    include/linux/khugepaged.h         |   4 +
> >>>    include/trace/events/huge_memory.h |  34 ++-
> >>>    mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >>>    3 files changed, 306 insertions(+), 154 deletions(-)
> >>>
> >>
> >> Does this patchset suffer from the problem described here:
> >> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
> > Hi Dev,
> >
> > Sorry I meant to get back to you about that.
> >
> > I understand your concern, but like I've mentioned before, the scan
> > with the read lock was done so we dont have to do the more expensive
> > locking, and could still gain insight into the state. You are right
> > that this info could become stale if the state changes dramatically,
> > but the collapse_isolate function will verify it and not collapse.
>
> If the state changes dramatically, the _isolate function will verify it,
> and fallback. And this fallback happens after following this costly
> path: retrieve a large folio from the buddy allocator -> swapin pages
> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush
> on all CPUs -> fallback in _isolate().
> If you do fail in _isolate(), doesn't it make sense to get the updated
> state for the next fallback order immediately, because we have prior
> information that we failed because of PTE state? What your algorithm
> will do is *still* follow the costly path described above, and again
> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like
> mine would.

You do raise a valid point here; I can optimize my solution by
detecting certain collapse failure types and jumping to the next scan.
I'll add that to my solution, thanks!

As for the disagreement around the bitmap, we'll leave that up to the
community to decide since we have differing opinions/solutions.

>
> The verification of the PTE state by the _isolate() function is the "no
> turning back" point of the algorithm. The verification by
> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth
> it, before we do costly operations" point of the algorithm.
>
> >  From my testing I found this to rarely happen.
>
> Unfortunately, I am not very familiar with performance testing/load
> testing, I am fairly new to kernel programming, so I am getting there.
> But it really depends on the type of test you are running, what actually
> runs on memory-intensive systems, etc etc. In fact, on loaded systems I
> would expect the PTE state to dramatically change. But still, no opinion
> here.

Yeah, there are probably some cases where it happens more often,
probably in cases of short-lived allocations, but khugepaged doesn't
run that frequently so those won't be that big of an issue.

Our performance team is currently testing my implementation, so I
should have more real-workload test results soon. The Redis testing
showed some gains and no signs of obvious regressions.

As for the testing, check out
https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
which does the tracing for my testing script. It can help you get
started. There are 3 different traces being applied there: the
bpftrace for collapse latencies, the perf record for the flamegraph
(not actually that useful, but it may help visualize any
weird/long paths that you may not have noticed), and the trace-cmd
run, which records the tracepoints of the scan and the collapse
functions and then processes the data using the awk script; the output
is the scan rate, the pages collapsed, and their result status
(grouped by order).

You can also look into https://github.com/gormanm/mmtests for
testing/comparing kernels. I was running the
config-memdb-redis-benchmark-medium workload.

>
> >
> > Also, khugepaged, my changes, and your changes are all a victim of
> > this. Once we drop the read lock (to either allocate the folio, or
> > right before acquiring the write_lock), the state can change. In your
> > case, yes, you are gathering more up to date information, but is it
> > really that important/worth it to retake locks and rescan for each
> > instance if we are about to reverify with the write lock taken?
>
> You said "reverify": You are removing the verification, so this step
> won't be reverification, it will be verification. We do not want to
> verify *after* we have already done 95% of latency-heavy stuff, only to
> know that we are going to fail.
>
> Algorithms in the kernel, in general, are of the following form: 1)
> Verify if a condition is true, resulting in taking a control path -> 2)
> do a lot of stuff -> "no turning back" step, wherein before committing
> (by taking locks, say), reverify if this is the control path we should
> be in. You are eliminating step 1).
>
> Therefore, I will have to say that I disagree with your approach.
>
> On top of this, in the subjective analysis in [1], point number 7 (along
> with point number 1) remains. And, point number 4 remains.

For 1), your worst case of 1024 is not the worst case. There are 8
possible orders in your implementation; if all are enabled, that is
4096 iterations in the worst case.
This becomes WAY worse on a 64k page size: ~45,000 iterations vs 4096 in my case.
>
> [1]
> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>
> >
> > So in my eyes, this is not a "problem"
>
> Looks like the kernel scheduled us for a high-priority debate, I hope
> there's no deadlock :)
>
> >
> > Cheers,
> > -- Nico
> >
> >
> >>
> >
>
Dev Jain Feb. 14, 2025, 2:01 a.m. UTC | #6
On 14/02/25 1:09 am, Nico Pache wrote:
> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 12/02/25 10:19 pm, Nico Pache wrote:
>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 11/02/25 6:00 am, Nico Pache wrote:
>>>>> The following series provides khugepaged and madvise collapse with the
>>>>> capability to collapse regions to mTHPs.
>>>>>
>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>>>> on max_ptes_none is removed during the scan, to make sure we account for
>>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
>>>>> is attempted, but contains swapped out, or shared pages, we dont perform the
>>>>> collapse.
>>>>>
>>>>> With the default max_ptes_none=511, the code should keep its most of its
>>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>>>> constantly promote mTHPs to the next available size.
>>>>>
>>>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
>>>>> Patch 2:     Refactor/rename hpage_collapse
>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>>>> Patch 6-9:   The mTHP patches
>>>>>
>>>>> ---------
>>>>>     Testing
>>>>> ---------
>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>>> - selftests mm
>>>>> - I created a test script that I used to push khugepaged to its limits while
>>>>>       monitoring a number of stats and tracepoints. The code is available
>>>>>       here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>>>>>       The summary from my testings was that there was no significant regression
>>>>>       noticed through this test. In some cases my changes had better collapse
>>>>>       latencies, and was able to scan more pages in the same amount of time/work,
>>>>>       but for the most part the results were consistant.
>>>>> - redis testing. I tested these changes along with my defer changes
>>>>>      (see followup post for more details).
>>>>> - some basic testing on 64k page size.
>>>>> - lots of general use. These changes have been running in my VM for some time.
>>>>>
>>>>> Changes since V1 [2]:
>>>>> - Minor bug fixes discovered during review and testing
>>>>> - removed dynamic allocations for bitmaps, and made them stack based
>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>>>> - Updated trace events to include collapsing order info.
>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>>>> - No longer require a chunk to be fully utilized before setting the bit. Use
>>>>>       the same max_ptes_none scaling principle to achieve this.
>>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>>>>>       some of the "creep" that was discovered in v1.
>>>>>
>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>>>
>>>>> Nico Pache (9):
>>>>>      introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>>>        madvise_collapse
>>>>>      khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>>      khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>>>      khugepaged: generalize alloc_charge_folio for mTHP support
>>>>>      khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>>>      khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>>      khugepaged: add mTHP support
>>>>>      khugepaged: improve tracepoints for mTHP orders
>>>>>      khugepaged: skip collapsing mTHP to smaller orders
>>>>>
>>>>>     include/linux/khugepaged.h         |   4 +
>>>>>     include/trace/events/huge_memory.h |  34 ++-
>>>>>     mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>>>     3 files changed, 306 insertions(+), 154 deletions(-)
>>>>>
>>>>
>>>> Does this patchset suffer from the problem described here:
>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
>>> Hi Dev,
>>>
>>> Sorry I meant to get back to you about that.
>>>
>>> I understand your concern, but like I've mentioned before, the scan
>>> with the read lock was done so we dont have to do the more expensive
>>> locking, and could still gain insight into the state. You are right
>>> that this info could become stale if the state changes dramatically,
>>> but the collapse_isolate function will verify it and not collapse.
>>
>> If the state changes dramatically, the _isolate function will verify it,
>> and fallback. And this fallback happens after following this costly
>> path: retrieve a large folio from the buddy allocator -> swapin pages
>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush
>> on all CPUs -> fallback in _isolate().
>> If you do fail in _isolate(), doesn't it make sense to get the updated
>> state for the next fallback order immediately, because we have prior
>> information that we failed because of PTE state? What your algorithm
>> will do is *still* follow the costly path described above, and again
>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like
>> mine would.
> 
> You do raise a valid point here, I can optimize my solution by
> detecting certain collapse failure types and jump to the next scan.
> I'll add that to my solution, thanks!
> 
> As for the disagreement around the bitmap, we'll leave that up to the
> community to decide since we have differing opinions/solutions.
> 
>>
>> The verification of the PTE state by the _isolate() function is the "no
>> turning back" point of the algorithm. The verification by
>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth
>> it, before we do costly operations" point of the algorithm.
>>
>>>   From my testing I found this to rarely happen.
>>
>> Unfortunately, I am not very familiar with performance testing/load
>> testing, I am fairly new to kernel programming, so I am getting there.
>> But it really depends on the type of test you are running, what actually
>> runs on memory-intensive systems, etc etc. In fact, on loaded systems I
>> would expect the PTE state to dramatically change. But still, no opinion
>> here.
> 
> Yeah there are probably some cases where it happens more often.
> Probably in cases of short lived allocations, but khugepaged doesn't
> run that frequently so those won't be that big of an issue.
> 
> Our performance team is currently testing my implementation so I
> should have more real workload test results soon. The redis testing
> had some gains and didn't show any signs of obvious regressions.
> 
> As for the testing, check out
> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
> this does the tracing for my testing script. It can help you get
> started. There are 3 different traces being applied there: the
> bpftrace for collapse latencies, the perf record for the flamegraph
> (not actually that useful, but may be useful to visualize any
> weird/long paths that you may not have noticed), and the trace-cmd
> which records the tracepoint of the scan and the collapse functions
> then processes the data using the awk script-- the output being the
> scan rate, the pages collapsed, and their result status (grouped by
> order).
> 
> You can also look into https://github.com/gormanm/mmtests for
> testing/comparing kernels. I was running the
> config-memdb-redis-benchmark-medium workload.

Thanks. I'll take a look.

> 
>>
>>>
>>> Also, khugepaged, my changes, and your changes are all a victim of
>>> this. Once we drop the read lock (to either allocate the folio, or
>>> right before acquiring the write_lock), the state can change. In your
>>> case, yes, you are gathering more up to date information, but is it
>>> really that important/worth it to retake locks and rescan for each
>>> instance if we are about to reverify with the write lock taken?
>>
>> You said "reverify": You are removing the verification, so this step
>> won't be reverification, it will be verification. We do not want to
>> verify *after* we have already done 95% of latency-heavy stuff, only to
>> know that we are going to fail.
>>
>> Algorithms in the kernel, in general, are of the following form: 1)
>> Verify if a condition is true, resulting in taking a control path -> 2)
>> do a lot of stuff -> "no turning back" step, wherein before committing
>> (by taking locks, say), reverify if this is the control path we should
>> be in. You are eliminating step 1).
>>
>> Therefore, I will have to say that I disagree with your approach.
>>
>> On top of this, in the subjective analysis in [1], point number 7 (along
>> with point number 1) remains. And, point number 4 remains.
> 
> for 1) your worst case of 1024 is not the worst case. There are 8
> possible orders in your implementation, if all are enabled, that is
> 4096 iterations in the worst case.

Yes, that is exactly what I wrote in 1). I am still not convinced that
the overhead you produce + 512 iterations is going to beat 4096
iterations. Anyway, that is hand-waving and we should test this.

> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 in my case.

Sorry, I am missing something here; how does the number of iterations 
change with page size? Am I not scanning the PTE table, which is 
invariant to the page size?

>>
>> [1]
>> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>
>>>
>>> So in my eyes, this is not a "problem"
>>
>> Looks like the kernel scheduled us for a high-priority debate, I hope
>> there's no deadlock :)
>>
>>>
>>> Cheers,
>>> -- Nico
>>>
>>>
>>>>
>>>
>>
>
Nico Pache Feb. 15, 2025, 12:52 a.m. UTC | #7
On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 14/02/25 1:09 am, Nico Pache wrote:
> > On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
> >>
> >>
> >>
> >> On 12/02/25 10:19 pm, Nico Pache wrote:
> >>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 11/02/25 6:00 am, Nico Pache wrote:
> >>>>> The following series provides khugepaged and madvise collapse with the
> >>>>> capability to collapse regions to mTHPs.
> >>>>>
> >>>>> To achieve this we generalize the khugepaged functions to no longer depend
> >>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> >>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>>>> on max_ptes_none is removed during the scan, to make sure we account for
> >>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> >>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
> >>>>> is attempted, but contains swapped out, or shared pages, we dont perform the
> >>>>> collapse.
> >>>>>
> >>>>> With the default max_ptes_none=511, the code should keep its most of its
> >>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> >>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> >>>>> constantly promote mTHPs to the next available size.
> >>>>>
> >>>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> >>>>> Patch 2:     Refactor/rename hpage_collapse
> >>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >>>>> Patch 6-9:   The mTHP patches
> >>>>>
> >>>>> ---------
> >>>>>     Testing
> >>>>> ---------
> >>>>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>>>> - selftests mm
> >>>>> - I created a test script that I used to push khugepaged to its limits while
> >>>>>       monitoring a number of stats and tracepoints. The code is available
> >>>>>       here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >>>>>       The summary from my testings was that there was no significant regression
> >>>>>       noticed through this test. In some cases my changes had better collapse
> >>>>>       latencies, and was able to scan more pages in the same amount of time/work,
> >>>>>       but for the most part the results were consistant.
> >>>>> - redis testing. I tested these changes along with my defer changes
> >>>>>      (see followup post for more details).
> >>>>> - some basic testing on 64k page size.
> >>>>> - lots of general use. These changes have been running in my VM for some time.
> >>>>>
> >>>>> Changes since V1 [2]:
> >>>>> - Minor bug fixes discovered during review and testing
> >>>>> - removed dynamic allocations for bitmaps, and made them stack based
> >>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> >>>>> - Updated trace events to include collapsing order info.
> >>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> >>>>> - No longer require a chunk to be fully utilized before setting the bit. Use
> >>>>>       the same max_ptes_none scaling principle to achieve this.
> >>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >>>>>       some of the "creep" that was discovered in v1.
> >>>>>
> >>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >>>>>
> >>>>> Nico Pache (9):
> >>>>>      introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >>>>>        madvise_collapse
> >>>>>      khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>>>      khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>>>      khugepaged: generalize alloc_charge_folio for mTHP support
> >>>>>      khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>>>      khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>>>      khugepaged: add mTHP support
> >>>>>      khugepaged: improve tracepoints for mTHP orders
> >>>>>      khugepaged: skip collapsing mTHP to smaller orders
> >>>>>
> >>>>>     include/linux/khugepaged.h         |   4 +
> >>>>>     include/trace/events/huge_memory.h |  34 ++-
> >>>>>     mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >>>>>     3 files changed, 306 insertions(+), 154 deletions(-)
> >>>>>
> >>>>
> >>>> Does this patchset suffer from the problem described here:
> >>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
> >>> Hi Dev,
> >>>
> >>> Sorry I meant to get back to you about that.
> >>>
> >>> I understand your concern, but like I've mentioned before, the scan
> >>> with the read lock was done so we dont have to do the more expensive
> >>> locking, and could still gain insight into the state. You are right
> >>> that this info could become stale if the state changes dramatically,
> >>> but the collapse_isolate function will verify it and not collapse.
> >>
> >> If the state changes dramatically, the _isolate function will verify it,
> >> and fallback. And this fallback happens after following this costly
> >> path: retrieve a large folio from the buddy allocator -> swapin pages
> >> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush
> >> on all CPUs -> fallback in _isolate().
> >> If you do fail in _isolate(), doesn't it make sense to get the updated
> >> state for the next fallback order immediately, because we have prior
> >> information that we failed because of PTE state? What your algorithm
> >> will do is *still* follow the costly path described above, and again
> >> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like
> >> mine would.
> >
> > You do raise a valid point here, I can optimize my solution by
> > detecting certain collapse failure types and jump to the next scan.
> > I'll add that to my solution, thanks!
> >
> > As for the disagreement around the bitmap, we'll leave that up to the
> > community to decide since we have differing opinions/solutions.
> >
> >>
> >> The verification of the PTE state by the _isolate() function is the "no
> >> turning back" point of the algorithm. The verification by
> >> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth
> >> it, before we do costly operations" point of the algorithm.
> >>
> >>>   From my testing I found this to rarely happen.
> >>
> >> Unfortunately, I am not very familiar with performance testing/load
> >> testing, I am fairly new to kernel programming, so I am getting there.
> >> But it really depends on the type of test you are running, what actually
> >> runs on memory-intensive systems, etc etc. In fact, on loaded systems I
> >> would expect the PTE state to dramatically change. But still, no opinion
> >> here.
> >
> > Yeah there are probably some cases where it happens more often.
> > Probably in cases of short lived allocations, but khugepaged doesn't
> > run that frequently so those won't be that big of an issue.
> >
> > Our performance team is currently testing my implementation so I
> > should have more real workload test results soon. The redis testing
> > had some gains and didn't show any signs of obvious regressions.
> >
> > As for the testing, check out
> > https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
> > this does the tracing for my testing script. It can help you get
> > started. There are 3 different traces being applied there: the
> > bpftrace for collapse latencies, the perf record for the flamegraph
> > (not actually that useful, but may be useful to visualize any
> > weird/long paths that you may not have noticed), and the trace-cmd
> > which records the tracepoint of the scan and the collapse functions
> > then processes the data using the awk script-- the output being the
> > scan rate, the pages collapsed, and their result status (grouped by
> > order).
> >
> > You can also look into https://github.com/gormanm/mmtests for
> > testing/comparing kernels. I was running the
> > config-memdb-redis-benchmark-medium workload.
>
> Thanks. I'll take a look.
>
> >
> >>
> >>>
> >>> Also, khugepaged, my changes, and your changes are all a victim of
> >>> this. Once we drop the read lock (to either allocate the folio, or
> >>> right before acquiring the write_lock), the state can change. In your
> >>> case, yes, you are gathering more up to date information, but is it
> >>> really that important/worth it to retake locks and rescan for each
> >>> instance if we are about to reverify with the write lock taken?
> >>
> >> You said "reverify": You are removing the verification, so this step
> >> won't be reverification, it will be verification. We do not want to
> >> verify *after* we have already done 95% of latency-heavy stuff, only to
> >> know that we are going to fail.
> >>
> >> Algorithms in the kernel, in general, are of the following form: 1)
> >> Verify if a condition is true, resulting in taking a control path -> 2)
> >> do a lot of stuff -> "no turning back" step, wherein before committing
> >> (by taking locks, say), reverify if this is the control path we should
> >> be in. You are eliminating step 1).
> >>
> >> Therefore, I will have to say that I disagree with your approach.
> >>
> >> On top of this, in the subjective analysis in [1], point number 7 (along
> >> with point number 1) remains. And, point number 4 remains.
> >
> > for 1) your worst case of 1024 is not the worst case. There are 8
> > possible orders in your implementation, if all are enabled, that is
> > 4096 iterations in the worst case.
>
> Yes, that is exactly what I wrote in 1). I am still not convinced that
> the overhead you produce + 512 iterations is going to beat 4096
> iterations. Anyways, that is hand-waving and we should test this.
>
> > This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 in my case.
>
> Sorry, I am missing something here; how does the number of iterations
> change with page size? Am I not scanning the PTE table, which is
> invariant to the page size?

I got the calculation wrong the first time and it's actually worse.
Let's hope I got it right this time.
On an arm64 64k kernel:
PMD size = 512M
PTE size = 64k
PTEs per PMD = 8192
log2(8192) = 13, and 13 - 2 = 11 (m)THP sizes including PMD (the
first and second orders are skipped)

Assuming I understand your algorithm correctly, in the worst case you
are scanning the whole PMD for each order.

So you scan 8192 PTEs 11 times. 8192 * 11 = 90112.

Please let me know if I'm missing something here.
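
For reference, a trivial user-space sketch of that worst-case arithmetic
(taking the per-page-size PTE counts and the 8 vs 11 eligible-order counts
from this thread as given, so purely illustrative):

#include <stdio.h>

/* Worst case for a per-order rescan: every eligible order walks the whole PMD. */
static unsigned long worst_case_scans(unsigned long ptes_per_pmd,
				      unsigned long nr_orders)
{
	return ptes_per_pmd * nr_orders;
}

int main(void)
{
	/* 4k pages: 512 PTEs per PMD, 8 eligible (m)THP orders */
	printf("4k : %lu iterations\n", worst_case_scans(512, 8));
	/* 64k pages: 8192 PTEs per PMD, 11 eligible (m)THP orders */
	printf("64k: %lu iterations\n", worst_case_scans(8192, 11));
	return 0;
}
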
>
> >>
> >> [1]
> >> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
> >>
> >>>
> >>> So in my eyes, this is not a "problem"
> >>
> >> Looks like the kernel scheduled us for a high-priority debate, I hope
> >> there's no deadlock :)
> >>
> >>>
> >>> Cheers,
> >>> -- Nico
> >>>
> >>>
> >>>>
> >>>
> >>
> >
>
Dev Jain Feb. 15, 2025, 6:38 a.m. UTC | #8
On 15/02/25 6:22 am, Nico Pache wrote:
> On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 14/02/25 1:09 am, Nico Pache wrote:
>>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 12/02/25 10:19 pm, Nico Pache wrote:
>>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 11/02/25 6:00 am, Nico Pache wrote:
>>>>>>> The following series provides khugepaged and madvise collapse with the
>>>>>>> capability to collapse regions to mTHPs.
>>>>>>>
>>>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>>>>>> on max_ptes_none is removed during the scan, to make sure we account for
>>>>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>>>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
>>>>>>> is attempted, but contains swapped out, or shared pages, we dont perform the
>>>>>>> collapse.
>>>>>>>
>>>>>>> With the default max_ptes_none=511, the code should keep its most of its
>>>>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>>>>>> constantly promote mTHPs to the next available size.
>>>>>>>
>>>>>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
>>>>>>> Patch 2:     Refactor/rename hpage_collapse
>>>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>>>>>> Patch 6-9:   The mTHP patches
>>>>>>>
>>>>>>> ---------
>>>>>>>      Testing
>>>>>>> ---------
>>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>>>>> - selftests mm
>>>>>>> - I created a test script that I used to push khugepaged to its limits while
>>>>>>>        monitoring a number of stats and tracepoints. The code is available
>>>>>>>        here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>>>>>>>        The summary from my testings was that there was no significant regression
>>>>>>>        noticed through this test. In some cases my changes had better collapse
>>>>>>>        latencies, and was able to scan more pages in the same amount of time/work,
>>>>>>>        but for the most part the results were consistant.
>>>>>>> - redis testing. I tested these changes along with my defer changes
>>>>>>>       (see followup post for more details).
>>>>>>> - some basic testing on 64k page size.
>>>>>>> - lots of general use. These changes have been running in my VM for some time.
>>>>>>>
>>>>>>> Changes since V1 [2]:
>>>>>>> - Minor bug fixes discovered during review and testing
>>>>>>> - removed dynamic allocations for bitmaps, and made them stack based
>>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>>>>>> - Updated trace events to include collapsing order info.
>>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>>>>>> - No longer require a chunk to be fully utilized before setting the bit. Use
>>>>>>>        the same max_ptes_none scaling principle to achieve this.
>>>>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>>>>>>>        some of the "creep" that was discovered in v1.
>>>>>>>
>>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>>>>>
>>>>>>> Nico Pache (9):
>>>>>>>       introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>>>>>         madvise_collapse
>>>>>>>       khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>>>>       khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>>>>>       khugepaged: generalize alloc_charge_folio for mTHP support
>>>>>>>       khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>>>>>       khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>>>>       khugepaged: add mTHP support
>>>>>>>       khugepaged: improve tracepoints for mTHP orders
>>>>>>>       khugepaged: skip collapsing mTHP to smaller orders
>>>>>>>
>>>>>>>      include/linux/khugepaged.h         |   4 +
>>>>>>>      include/trace/events/huge_memory.h |  34 ++-
>>>>>>>      mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>>>>>      3 files changed, 306 insertions(+), 154 deletions(-)
>>>>>>>
>>>>>>
>>>>>> Does this patchset suffer from the problem described here:
>>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
>>>>> Hi Dev,
>>>>>
>>>>> Sorry I meant to get back to you about that.
>>>>>
>>>>> I understand your concern, but like I've mentioned before, the scan
>>>>> with the read lock was done so we dont have to do the more expensive
>>>>> locking, and could still gain insight into the state. You are right
>>>>> that this info could become stale if the state changes dramatically,
>>>>> but the collapse_isolate function will verify it and not collapse.
>>>>
>>>> If the state changes dramatically, the _isolate function will verify it,
>>>> and fallback. And this fallback happens after following this costly
>>>> path: retrieve a large folio from the buddy allocator -> swapin pages
>>>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush
>>>> on all CPUs -> fallback in _isolate().
>>>> If you do fail in _isolate(), doesn't it make sense to get the updated
>>>> state for the next fallback order immediately, because we have prior
>>>> information that we failed because of PTE state? What your algorithm
>>>> will do is *still* follow the costly path described above, and again
>>>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like
>>>> mine would.
>>>
>>> You do raise a valid point here, I can optimize my solution by
>>> detecting certain collapse failure types and jump to the next scan.
>>> I'll add that to my solution, thanks!
>>>
>>> As for the disagreement around the bitmap, we'll leave that up to the
>>> community to decide since we have differing opinions/solutions.
>>>
>>>>
>>>> The verification of the PTE state by the _isolate() function is the "no
>>>> turning back" point of the algorithm. The verification by
>>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth
>>>> it, before we do costly operations" point of the algorithm.
>>>>
>>>>>    From my testing I found this to rarely happen.
>>>>
>>>> Unfortunately, I am not very familiar with performance testing/load
>>>> testing, I am fairly new to kernel programming, so I am getting there.
>>>> But it really depends on the type of test you are running, what actually
>>>> runs on memory-intensive systems, etc etc. In fact, on loaded systems I
>>>> would expect the PTE state to dramatically change. But still, no opinion
>>>> here.
>>>
>>> Yeah there are probably some cases where it happens more often.
>>> Probably in cases of short lived allocations, but khugepaged doesn't
>>> run that frequently so those won't be that big of an issue.
>>>
>>> Our performance team is currently testing my implementation so I
>>> should have more real workload test results soon. The redis testing
>>> had some gains and didn't show any signs of obvious regressions.
>>>
>>> As for the testing, check out
>>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
>>> this does the tracing for my testing script. It can help you get
>>> started. There are 3 different traces being applied there: the
>>> bpftrace for collapse latencies, the perf record for the flamegraph
>>> (not actually that useful, but may be useful to visualize any
>>> weird/long paths that you may not have noticed), and the trace-cmd
>>> which records the tracepoint of the scan and the collapse functions
>>> then processes the data using the awk script-- the output being the
>>> scan rate, the pages collapsed, and their result status (grouped by
>>> order).
>>>
>>> You can also look into https://github.com/gormanm/mmtests for
>>> testing/comparing kernels. I was running the
>>> config-memdb-redis-benchmark-medium workload.
>>
>> Thanks. I'll take a look.
>>
>>>
>>>>
>>>>>
>>>>> Also, khugepaged, my changes, and your changes are all a victim of
>>>>> this. Once we drop the read lock (to either allocate the folio, or
>>>>> right before acquiring the write_lock), the state can change. In your
>>>>> case, yes, you are gathering more up to date information, but is it
>>>>> really that important/worth it to retake locks and rescan for each
>>>>> instance if we are about to reverify with the write lock taken?
>>>>
>>>> You said "reverify": You are removing the verification, so this step
>>>> won't be reverification, it will be verification. We do not want to
>>>> verify *after* we have already done 95% of latency-heavy stuff, only to
>>>> know that we are going to fail.
>>>>
>>>> Algorithms in the kernel, in general, are of the following form: 1)
>>>> Verify if a condition is true, resulting in taking a control path -> 2)
>>>> do a lot of stuff -> "no turning back" step, wherein before committing
>>>> (by taking locks, say), reverify if this is the control path we should
>>>> be in. You are eliminating step 1).
>>>>
>>>> Therefore, I will have to say that I disagree with your approach.
>>>>
>>>> On top of this, in the subjective analysis in [1], point number 7 (along
>>>> with point number 1) remains. And, point number 4 remains.
>>>
>>> for 1) your worst case of 1024 is not the worst case. There are 8
>>> possible orders in your implementation, if all are enabled, that is
>>> 4096 iterations in the worst case.
>>
>> Yes, that is exactly what I wrote in 1). I am still not convinced that
>> the overhead you produce + 512 iterations is going to beat 4096
>> iterations. Anyways, that is hand-waving and we should test this.
>>
>>> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 in my case.
>>
>> Sorry, I am missing something here; how does the number of iterations
>> change with page size? Am I not scanning the PTE table, which is
>> invariant to the page size?
> 
> I got the calculation wrong the first time and it's actually worst.
> Lets hope I got this right this time
> on ARM64 64k kernel:
> PMD size = 512M
> PTE= 64k
> PTEs per PMD = 8192

*facepalm* my bad, thanks. I got thrown off thinking HPAGE_PMD_NR wouldn't
depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte) =
PAGE_SIZE / 8. So it does depend. You are correct, the number of PTEs per
PMD is 1 << 13.

> log2(8192) = 13 - 2 = 11 number of (m)THP sizes including PMD (the
> first and second order are skipped)
> 
> Assuming I understand your algorithm correctly, in the worst case you
> are scanning the whole PMD for each order.
> 
> So you scan 8192 PTEs 11 times. 8192 * 11 = 90112.

Yup. Now it seems that the bitmap overhead may just be worth it; for the
worst case the bitmap will give us an 11x saving, and for the average case
it will give us 2x, but still, 8192 is a large number. I'll think of
ways to test this out.

Btw, I was made aware that an LWN article just got posted on our work!
https://lwn.net/Articles/1009039/

> 
> Please let me know if I'm missing something here.
>>
>>>>
>>>> [1]
>>>> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>>>>
>>>>>
>>>>> So in my eyes, this is not a "problem"
>>>>
>>>> Looks like the kernel scheduled us for a high-priority debate, I hope
>>>> there's no deadlock :)
>>>>
>>>>>
>>>>> Cheers,
>>>>> -- Nico
>>>>>
>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
Dev Jain Feb. 17, 2025, 6:39 a.m. UTC | #9
On 11/02/25 6:00 am, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> order to determine how full a THP must be to be eligible. If a mTHP collapse
> is attempted, but contains swapped out, or shared pages, we dont perform the
> collapse.
> 
> With the default max_ptes_none=511, the code should keep its most of its
> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> constantly promote mTHPs to the next available size.

How does creep stop when max_ptes_none <= 255?

> 
> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> Patch 2:     Refactor/rename hpage_collapse
> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> Patch 6-9:   The mTHP patches
> 
> ---------
>   Testing
> ---------
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - I created a test script that I used to push khugepaged to its limits while
>     monitoring a number of stats and tracepoints. The code is available
>     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>     The summary from my testings was that there was no significant regression
>     noticed through this test. In some cases my changes had better collapse
>     latencies, and was able to scan more pages in the same amount of time/work,
>     but for the most part the results were consistant.
> - redis testing. I tested these changes along with my defer changes
>    (see followup post for more details).
> - some basic testing on 64k page size.
> - lots of general use. These changes have been running in my VM for some time.
> 
> Changes since V1 [2]:
> - Minor bug fixes discovered during review and testing
> - removed dynamic allocations for bitmaps, and made them stack based
> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> - Updated trace events to include collapsing order info.
> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> - No longer require a chunk to be fully utilized before setting the bit. Use
>     the same max_ptes_none scaling principle to achieve this.
> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>     some of the "creep" that was discovered in v1.
> 
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> 
> Nico Pache (9):
>    introduce khugepaged_collapse_single_pmd to unify khugepaged and
>      madvise_collapse
>    khugepaged: rename hpage_collapse_* to khugepaged_*
>    khugepaged: generalize hugepage_vma_revalidate for mTHP support
>    khugepaged: generalize alloc_charge_folio for mTHP support
>    khugepaged: generalize __collapse_huge_page_* for mTHP support
>    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>    khugepaged: add mTHP support
>    khugepaged: improve tracepoints for mTHP orders
>    khugepaged: skip collapsing mTHP to smaller orders
> 
>   include/linux/khugepaged.h         |   4 +
>   include/trace/events/huge_memory.h |  34 ++-
>   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>   3 files changed, 306 insertions(+), 154 deletions(-)
>
Dev Jain Feb. 17, 2025, 8:05 a.m. UTC | #10
On 15/02/25 12:08 pm, Dev Jain wrote:
> 
> 
> On 15/02/25 6:22 am, Nico Pache wrote:
>> On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote:
>>>
>>>
>>>
>>> On 14/02/25 1:09 am, Nico Pache wrote:
>>>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 12/02/25 10:19 pm, Nico Pache wrote:
>>>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11/02/25 6:00 am, Nico Pache wrote:
>>>>>>>> The following series provides khugepaged and madvise collapse 
>>>>>>>> with the
>>>>>>>> capability to collapse regions to mTHPs.
>>>>>>>>
>>>>>>>> To achieve this we generalize the khugepaged functions to no 
>>>>>>>> longer depend
>>>>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks 
>>>>>>>> of pages
>>>>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>>>>>> using a bitmap. After the PMD scan is done, we do binary 
>>>>>>>> recursion on the
>>>>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The 
>>>>>>>> restriction
>>>>>>>> on max_ptes_none is removed during the scan, to make sure we 
>>>>>>>> account for
>>>>>>>> the whole PMD range. max_ptes_none will be scaled by the 
>>>>>>>> attempted collapse
>>>>>>>> order to determine how full a THP must be to be eligible. If a 
>>>>>>>> mTHP collapse
>>>>>>>> is attempted, but contains swapped out, or shared pages, we dont 
>>>>>>>> perform the
>>>>>>>> collapse.
>>>>>>>>
>>>>>>>> With the default max_ptes_none=511, the code should keep its 
>>>>>>>> most of its
>>>>>>>> original behavior. To exercise mTHP collapse we need to set 
>>>>>>>> max_ptes_none<=255.
>>>>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse 
>>>>>>>> "creep" and
>>>>>>>> constantly promote mTHPs to the next available size.
>>>>>>>>
>>>>>>>> Patch 1:     Some refactoring to combine madvise_collapse and 
>>>>>>>> khugepaged
>>>>>>>> Patch 2:     Refactor/rename hpage_collapse
>>>>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>>>>>>> Patch 6-9:   The mTHP patches
>>>>>>>>
>>>>>>>> ---------
>>>>>>>>      Testing
>>>>>>>> ---------
>>>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>>>>>> - selftests mm
>>>>>>>> - I created a test script that I used to push khugepaged to its 
>>>>>>>> limits while
>>>>>>>>        monitoring a number of stats and tracepoints. The code is 
>>>>>>>> available
>>>>>>>>        here[1] (Run in legacy mode for these changes and set 
>>>>>>>> mthp sizes to inherit)
>>>>>>>>        The summary from my testings was that there was no 
>>>>>>>> significant regression
>>>>>>>>        noticed through this test. In some cases my changes had 
>>>>>>>> better collapse
>>>>>>>>        latencies, and was able to scan more pages in the same 
>>>>>>>> amount of time/work,
>>>>>>>>        but for the most part the results were consistant.
>>>>>>>> - redis testing. I tested these changes along with my defer changes
>>>>>>>>       (see followup post for more details).
>>>>>>>> - some basic testing on 64k page size.
>>>>>>>> - lots of general use. These changes have been running in my VM 
>>>>>>>> for some time.
>>>>>>>>
>>>>>>>> Changes since V1 [2]:
>>>>>>>> - Minor bug fixes discovered during review and testing
>>>>>>>> - removed dynamic allocations for bitmaps, and made them stack 
>>>>>>>> based
>>>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>>>>>>> - Updated trace events to include collapsing order info.
>>>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 
>>>>>>>> scale.
>>>>>>>> - No longer require a chunk to be fully utilized before setting 
>>>>>>>> the bit. Use
>>>>>>>>        the same max_ptes_none scaling principle to achieve this.
>>>>>>>> - Skip mTHP collapse that requires swapin or shared handling. 
>>>>>>>> This helps prevent
>>>>>>>>        some of the "creep" that was discovered in v1.
>>>>>>>>
>>>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1- 
>>>>>>>> npache@redhat.com/
>>>>>>>>
>>>>>>>> Nico Pache (9):
>>>>>>>>       introduce khugepaged_collapse_single_pmd to unify 
>>>>>>>> khugepaged and
>>>>>>>>         madvise_collapse
>>>>>>>>       khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>>>>>       khugepaged: generalize hugepage_vma_revalidate for mTHP 
>>>>>>>> support
>>>>>>>>       khugepaged: generalize alloc_charge_folio for mTHP support
>>>>>>>>       khugepaged: generalize __collapse_huge_page_* for mTHP 
>>>>>>>> support
>>>>>>>>       khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>>>>>       khugepaged: add mTHP support
>>>>>>>>       khugepaged: improve tracepoints for mTHP orders
>>>>>>>>       khugepaged: skip collapsing mTHP to smaller orders
>>>>>>>>
>>>>>>>>      include/linux/khugepaged.h         |   4 +
>>>>>>>>      include/trace/events/huge_memory.h |  34 ++-
>>>>>>>>      mm/khugepaged.c                    | 422 ++++++++++++++++++ 
>>>>>>>> +----------
>>>>>>>>      3 files changed, 306 insertions(+), 154 deletions(-)
>>>>>>>>
>>>>>>>
>>>>>>> Does this patchset suffer from the problem described here:
>>>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680- 
>>>>>>> c2d48d4963b6@arm.com/
>>>>>> Hi Dev,
>>>>>>
>>>>>> Sorry I meant to get back to you about that.
>>>>>>
>>>>>> I understand your concern, but like I've mentioned before, the scan
>>>>>> with the read lock was done so we dont have to do the more expensive
>>>>>> locking, and could still gain insight into the state. You are right
>>>>>> that this info could become stale if the state changes dramatically,
>>>>>> but the collapse_isolate function will verify it and not collapse.
>>>>>
>>>>> If the state changes dramatically, the _isolate function will 
>>>>> verify it,
>>>>> and fallback. And this fallback happens after following this costly
>>>>> path: retrieve a large folio from the buddy allocator -> swapin pages
>>>>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB 
>>>>> flush
>>>>> on all CPUs -> fallback in _isolate().
>>>>> If you do fail in _isolate(), doesn't it make sense to get the updated
>>>>> state for the next fallback order immediately, because we have prior
>>>>> information that we failed because of PTE state? What your algorithm
>>>>> will do is *still* follow the costly path described above, and again
>>>>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() 
>>>>> like
>>>>> mine would.
>>>>
>>>> You do raise a valid point here, I can optimize my solution by
>>>> detecting certain collapse failure types and jump to the next scan.
>>>> I'll add that to my solution, thanks!
>>>>
>>>> As for the disagreement around the bitmap, we'll leave that up to the
>>>> community to decide since we have differing opinions/solutions.
>>>>
>>>>>
>>>>> The verification of the PTE state by the _isolate() function is the 
>>>>> "no
>>>>> turning back" point of the algorithm. The verification by
>>>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even 
>>>>> worth
>>>>> it, before we do costly operations" point of the algorithm.
>>>>>
>>>>>>    From my testing I found this to rarely happen.
>>>>>
>>>>> Unfortunately, I am not very familiar with performance testing/load
>>>>> testing, I am fairly new to kernel programming, so I am getting there.
>>>>> But it really depends on the type of test you are running, what 
>>>>> actually
>>>>> runs on memory-intensive systems, etc etc. In fact, on loaded 
>>>>> systems I
>>>>> would expect the PTE state to dramatically change. But still, no 
>>>>> opinion
>>>>> here.
>>>>
>>>> Yeah there are probably some cases where it happens more often.
>>>> Probably in cases of short lived allocations, but khugepaged doesn't
>>>> run that frequently so those won't be that big of an issue.
>>>>
>>>> Our performance team is currently testing my implementation so I
>>>> should have more real workload test results soon. The redis testing
>>>> had some gains and didn't show any signs of obvious regressions.
>>>>
>>>> As for the testing, check out
>>>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record- 
>>>> khuge-performance.sh?ref_type=heads
>>>> this does the tracing for my testing script. It can help you get
>>>> started. There are 3 different traces being applied there: the
>>>> bpftrace for collapse latencies, the perf record for the flamegraph
>>>> (not actually that useful, but may be useful to visualize any
>>>> weird/long paths that you may not have noticed), and the trace-cmd
>>>> which records the tracepoint of the scan and the collapse functions
>>>> then processes the data using the awk script-- the output being the
>>>> scan rate, the pages collapsed, and their result status (grouped by
>>>> order).
>>>>
>>>> You can also look into https://github.com/gormanm/mmtests for
>>>> testing/comparing kernels. I was running the
>>>> config-memdb-redis-benchmark-medium workload.
>>>
>>> Thanks. I'll take a look.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Also, khugepaged, my changes, and your changes are all a victim of
>>>>>> this. Once we drop the read lock (to either allocate the folio, or
>>>>>> right before acquiring the write_lock), the state can change. In your
>>>>>> case, yes, you are gathering more up to date information, but is it
>>>>>> really that important/worth it to retake locks and rescan for each
>>>>>> instance if we are about to reverify with the write lock taken?
>>>>>
>>>>> You said "reverify": You are removing the verification, so this step
>>>>> won't be reverification, it will be verification. We do not want to
>>>>> verify *after* we have already done 95% of latency-heavy stuff, 
>>>>> only to
>>>>> know that we are going to fail.
>>>>>
>>>>> Algorithms in the kernel, in general, are of the following form: 1)
>>>>> Verify if a condition is true, resulting in taking a control path - 
>>>>> > 2)
>>>>> do a lot of stuff -> "no turning back" step, wherein before committing
>>>>> (by taking locks, say), reverify if this is the control path we should
>>>>> be in. You are eliminating step 1).
>>>>>
>>>>> Therefore, I will have to say that I disagree with your approach.
>>>>>
>>>>> On top of this, in the subjective analysis in [1], point number 7 
>>>>> (along
>>>>> with point number 1) remains. And, point number 4 remains.
>>>>
>>>> for 1) your worst case of 1024 is not the worst case. There are 8
>>>> possible orders in your implementation, if all are enabled, that is
>>>> 4096 iterations in the worst case.
>>>
>>> Yes, that is exactly what I wrote in 1). I am still not convinced that
>>> the overhead you produce + 512 iterations is going to beat 4096
>>> iterations. Anyways, that is hand-waving and we should test this.
>>>
>>>> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096 
>>>> in my case.
>>>
>>> Sorry, I am missing something here; how does the number of iterations
>>> change with page size? Am I not scanning the PTE table, which is
>>> invariant to the page size?
>>
>> I got the calculation wrong the first time and it's actually worst.
>> Lets hope I got this right this time
>> on ARM64 64k kernel:
>> PMD size = 512M
>> PTE= 64k
>> PTEs per PMD = 8192
> 
> *facepalm* my bad, thanks. I got thrown off thinking HPAGE_PMD_NR won't 
> depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte) = 
> PAGE_SIZE / 8. So it does depend. You are correct, the PTEs per PMD is 1 
> << 13.
> 
>> log2(8192) = 13 - 2 = 11 number of (m)THP sizes including PMD (the
>> first and second order are skipped)
>>
>> Assuming I understand your algorithm correctly, in the worst case you
>> are scanning the whole PMD for each order.
>>
>> So you scan 8192 PTEs 11 times. 8192 * 11 = 90112.
> 
> Yup. Now it seems that the bitmap overhead may just be worth it; for the 
> worst case the bitmap will give us an 11x saving...for the average case, 
> it will give us 2x, but still, 8192 is a large number. I'll think of 

To clarify: the saving is w.r.t. the initial scan. That is, if the time
taken by NP is x + y + collapse_huge_page(), where x is the PMD scan and
y is the bitmap overhead, then the time taken by DJ is 2x +
collapse_huge_page(). In collapse_huge_page(), both perform PTE scans in
_isolate(). Anyhow, we differ in opinion as to where the max_ptes_*
check should be placed; I recalled the following:

https://lore.kernel.org/all/20240809103129.365029-2-dev.jain@arm.com/
https://lore.kernel.org/all/761ba58e-9d6f-4a14-a513-dcc098c2aa94@redhat.com/

One thing you can do to relieve one of my criticisms (though not completely)
is to apply the following patch (this can be done in both methods):

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b589f889bb5a..dc5cb602eaad 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1080,8 +1080,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
  		}

  		vmf.orig_pte = ptep_get_lockless(pte);
-		if (!is_swap_pte(vmf.orig_pte))
+		if (!is_swap_pte(vmf.orig_pte)) {
  			continue;
+		} else {
+			if (order != HPAGE_PMD_ORDER) {
+				result = SCAN_EXCEED_SWAP_PTE;
+				goto out;
+			}
+		}

  		vmf.pte = pte;
  		vmf.ptl = ptl;
Nico Pache Feb. 17, 2025, 7:15 p.m. UTC | #11
On Sun, Feb 16, 2025 at 11:39 PM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 11/02/25 6:00 am, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> > order to determine how full a THP must be to be eligible. If a mTHP collapse
> > is attempted, but contains swapped out, or shared pages, we dont perform the
> > collapse.
> >
> > With the default max_ptes_none=511, the code should keep its most of its
> > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> > constantly promote mTHPs to the next available size.
>
> How does creep stop when max_ptes_none <= 255?

Think of a 512kB mTHP region that is half utilized (but not
collapsed), and max_ptes_none = 255.

threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
                               >> (HPAGE_PMD_ORDER - state.order);

threshold_bits = (512 - 1 - 255) >> (9 - 4) = 256 >> 5 = 8

So more than 8 bits must be set, out of a bitmap section that is at
most 16 bits long for this region.
This means more than half of the mTHP has to be utilized before it
collapses. If a collapse introduced non-none pages amounting to only
half of the next mTHP size, that mTHP would be eligible next round for
another promotion, which is exactly the creep.
When max_ptes_none is 256, threshold_bits = 7, and collapsing creates
enough non-none pages to collapse again.
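
To make that arithmetic easier to poke at, here is a tiny user-space
sketch of the same formula (the order value of 4 and the 16-bit section
length are just the numbers from this example, not taken from the actual
patch):

#include <stdio.h>

#define HPAGE_PMD_ORDER	9
#define HPAGE_PMD_NR	(1 << HPAGE_PMD_ORDER)

/* Same expression as above, parameterized on the tunable and the order. */
static int threshold_bits(int max_ptes_none, int order)
{
	return (HPAGE_PMD_NR - max_ptes_none - 1) >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	/* The example region maps to a 16-bit section of the bitmap (order 4). */
	printf("max_ptes_none=255: need > %d of 16 bits set\n",
	       threshold_bits(255, 4));	/* 8 -> strictly more than half */
	printf("max_ptes_none=256: need > %d of 16 bits set\n",
	       threshold_bits(256, 4));	/* 7 -> half full is already enough */
	return 0;
}
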

>
> >
> > Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> > Patch 2:     Refactor/rename hpage_collapse
> > Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> > Patch 6-9:   The mTHP patches
> >
> > ---------
> >   Testing
> > ---------
> > - Built for x86_64, aarch64, ppc64le, and s390x
> > - selftests mm
> > - I created a test script that I used to push khugepaged to its limits while
> >     monitoring a number of stats and tracepoints. The code is available
> >     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >     The summary from my testings was that there was no significant regression
> >     noticed through this test. In some cases my changes had better collapse
> >     latencies, and was able to scan more pages in the same amount of time/work,
> >     but for the most part the results were consistant.
> > - redis testing. I tested these changes along with my defer changes
> >    (see followup post for more details).
> > - some basic testing on 64k page size.
> > - lots of general use. These changes have been running in my VM for some time.
> >
> > Changes since V1 [2]:
> > - Minor bug fixes discovered during review and testing
> > - removed dynamic allocations for bitmaps, and made them stack based
> > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> > - Updated trace events to include collapsing order info.
> > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> > - No longer require a chunk to be fully utilized before setting the bit. Use
> >     the same max_ptes_none scaling principle to achieve this.
> > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >     some of the "creep" that was discovered in v1.
> >
> > [1] - https://gitlab.com/npache/khugepaged_mthp_test
> > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >
> > Nico Pache (9):
> >    introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >      madvise_collapse
> >    khugepaged: rename hpage_collapse_* to khugepaged_*
> >    khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >    khugepaged: generalize alloc_charge_folio for mTHP support
> >    khugepaged: generalize __collapse_huge_page_* for mTHP support
> >    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >    khugepaged: add mTHP support
> >    khugepaged: improve tracepoints for mTHP orders
> >    khugepaged: skip collapsing mTHP to smaller orders
> >
> >   include/linux/khugepaged.h         |   4 +
> >   include/trace/events/huge_memory.h |  34 ++-
> >   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >   3 files changed, 306 insertions(+), 154 deletions(-)
> >
>
Nico Pache Feb. 17, 2025, 7:19 p.m. UTC | #12
On Mon, Feb 17, 2025 at 1:06 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 15/02/25 12:08 pm, Dev Jain wrote:
> >
> >
> > On 15/02/25 6:22 am, Nico Pache wrote:
> >> On Thu, Feb 13, 2025 at 7:02 PM Dev Jain <dev.jain@arm.com> wrote:
> >>>
> >>>
> >>>
> >>> On 14/02/25 1:09 am, Nico Pache wrote:
> >>>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain <dev.jain@arm.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 12/02/25 10:19 pm, Nico Pache wrote:
> >>>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain <dev.jain@arm.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 11/02/25 6:00 am, Nico Pache wrote:
> >>>>>>>> The following series provides khugepaged and madvise collapse
> >>>>>>>> with the
> >>>>>>>> capability to collapse regions to mTHPs.
> >>>>>>>>
> >>>>>>>> To achieve this we generalize the khugepaged functions to no
> >>>>>>>> longer depend
> >>>>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks
> >>>>>>>> of pages
> >>>>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> >>>>>>>> using a bitmap. After the PMD scan is done, we do binary
> >>>>>>>> recursion on the
> >>>>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The
> >>>>>>>> restriction
> >>>>>>>> on max_ptes_none is removed during the scan, to make sure we
> >>>>>>>> account for
> >>>>>>>> the whole PMD range. max_ptes_none will be scaled by the
> >>>>>>>> attempted collapse
> >>>>>>>> order to determine how full a THP must be to be eligible. If a
> >>>>>>>> mTHP collapse
> >>>>>>>> is attempted, but contains swapped out, or shared pages, we dont
> >>>>>>>> perform the
> >>>>>>>> collapse.
> >>>>>>>>
> >>>>>>>> With the default max_ptes_none=511, the code should keep its
> >>>>>>>> most of its
> >>>>>>>> original behavior. To exercise mTHP collapse we need to set
> >>>>>>>> max_ptes_none<=255.
> >>>>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse
> >>>>>>>> "creep" and
> >>>>>>>> constantly promote mTHPs to the next available size.
> >>>>>>>>
> >>>>>>>> Patch 1:     Some refactoring to combine madvise_collapse and
> >>>>>>>> khugepaged
> >>>>>>>> Patch 2:     Refactor/rename hpage_collapse
> >>>>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >>>>>>>> Patch 6-9:   The mTHP patches
> >>>>>>>>
> >>>>>>>> ---------
> >>>>>>>>      Testing
> >>>>>>>> ---------
> >>>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>>>>>>> - selftests mm
> >>>>>>>> - I created a test script that I used to push khugepaged to its
> >>>>>>>> limits while
> >>>>>>>>        monitoring a number of stats and tracepoints. The code is
> >>>>>>>> available
> >>>>>>>>        here[1] (Run in legacy mode for these changes and set
> >>>>>>>> mthp sizes to inherit)
> >>>>>>>>        The summary from my testings was that there was no
> >>>>>>>> significant regression
> >>>>>>>>        noticed through this test. In some cases my changes had
> >>>>>>>> better collapse
> >>>>>>>>        latencies, and was able to scan more pages in the same
> >>>>>>>> amount of time/work,
> >>>>>>>>        but for the most part the results were consistant.
> >>>>>>>> - redis testing. I tested these changes along with my defer changes
> >>>>>>>>       (see followup post for more details).
> >>>>>>>> - some basic testing on 64k page size.
> >>>>>>>> - lots of general use. These changes have been running in my VM
> >>>>>>>> for some time.
> >>>>>>>>
> >>>>>>>> Changes since V1 [2]:
> >>>>>>>> - Minor bug fixes discovered during review and testing
> >>>>>>>> - removed dynamic allocations for bitmaps, and made them stack
> >>>>>>>> based
> >>>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> >>>>>>>> - Updated trace events to include collapsing order info.
> >>>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100
> >>>>>>>> scale.
> >>>>>>>> - No longer require a chunk to be fully utilized before setting
> >>>>>>>> the bit. Use
> >>>>>>>>        the same max_ptes_none scaling principle to achieve this.
> >>>>>>>> - Skip mTHP collapse that requires swapin or shared handling.
> >>>>>>>> This helps prevent
> >>>>>>>>        some of the "creep" that was discovered in v1.
> >>>>>>>>
> >>>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-
> >>>>>>>> npache@redhat.com/
> >>>>>>>>
> >>>>>>>> Nico Pache (9):
> >>>>>>>>       introduce khugepaged_collapse_single_pmd to unify
> >>>>>>>> khugepaged and
> >>>>>>>>         madvise_collapse
> >>>>>>>>       khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>>>>>>       khugepaged: generalize hugepage_vma_revalidate for mTHP
> >>>>>>>> support
> >>>>>>>>       khugepaged: generalize alloc_charge_folio for mTHP support
> >>>>>>>>       khugepaged: generalize __collapse_huge_page_* for mTHP
> >>>>>>>> support
> >>>>>>>>       khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>>>>>>       khugepaged: add mTHP support
> >>>>>>>>       khugepaged: improve tracepoints for mTHP orders
> >>>>>>>>       khugepaged: skip collapsing mTHP to smaller orders
> >>>>>>>>
> >>>>>>>>      include/linux/khugepaged.h         |   4 +
> >>>>>>>>      include/trace/events/huge_memory.h |  34 ++-
> >>>>>>>>      mm/khugepaged.c                    | 422 ++++++++++++++++++
> >>>>>>>> +----------
> >>>>>>>>      3 files changed, 306 insertions(+), 154 deletions(-)
> >>>>>>>>
> >>>>>>>
> >>>>>>> Does this patchset suffer from the problem described here:
> >>>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-
> >>>>>>> c2d48d4963b6@arm.com/
> >>>>>> Hi Dev,
> >>>>>>
> >>>>>> Sorry I meant to get back to you about that.
> >>>>>>
> >>>>>> I understand your concern, but like I've mentioned before, the scan
> >>>>>> with the read lock was done so we dont have to do the more expensive
> >>>>>> locking, and could still gain insight into the state. You are right
> >>>>>> that this info could become stale if the state changes dramatically,
> >>>>>> but the collapse_isolate function will verify it and not collapse.
> >>>>>
> >>>>> If the state changes dramatically, the _isolate function will
> >>>>> verify it,
> >>>>> and fallback. And this fallback happens after following this costly
> >>>>> path: retrieve a large folio from the buddy allocator -> swapin pages
> >>>>> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB
> >>>>> flush
> >>>>> on all CPUs -> fallback in _isolate().
> >>>>> If you do fail in _isolate(), doesn't it make sense to get the updated
> >>>>> state for the next fallback order immediately, because we have prior
> >>>>> information that we failed because of PTE state? What your algorithm
> >>>>> will do is *still* follow the costly path described above, and again
> >>>>> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd()
> >>>>> like
> >>>>> mine would.
> >>>>
> >>>> You do raise a valid point here, I can optimize my solution by
> >>>> detecting certain collapse failure types and jump to the next scan.
> >>>> I'll add that to my solution, thanks!
> >>>>
> >>>> As for the disagreement around the bitmap, we'll leave that up to the
> >>>> community to decide since we have differing opinions/solutions.
> >>>>
> >>>>>
> >>>>> The verification of the PTE state by the _isolate() function is the
> >>>>> "no
> >>>>> turning back" point of the algorithm. The verification by
> >>>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even
> >>>>> worth
> >>>>> it, before we do costly operations" point of the algorithm.
> >>>>>
> >>>>>>    From my testing I found this to rarely happen.
> >>>>>
> >>>>> Unfortunately, I am not very familiar with performance testing/load
> >>>>> testing, I am fairly new to kernel programming, so I am getting there.
> >>>>> But it really depends on the type of test you are running, what
> >>>>> actually
> >>>>> runs on memory-intensive systems, etc etc. In fact, on loaded
> >>>>> systems I
> >>>>> would expect the PTE state to dramatically change. But still, no
> >>>>> opinion
> >>>>> here.
> >>>>
> >>>> Yeah there are probably some cases where it happens more often.
> >>>> Probably in cases of short lived allocations, but khugepaged doesn't
> >>>> run that frequently so those won't be that big of an issue.
> >>>>
> >>>> Our performance team is currently testing my implementation so I
> >>>> should have more real workload test results soon. The redis testing
> >>>> had some gains and didn't show any signs of obvious regressions.
> >>>>
> >>>> As for the testing, check out
> >>>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-
> >>>> khuge-performance.sh?ref_type=heads
> >>>> this does the tracing for my testing script. It can help you get
> >>>> started. There are 3 different traces being applied there: the
> >>>> bpftrace for collapse latencies, the perf record for the flamegraph
> >>>> (not actually that useful, but may be useful to visualize any
> >>>> weird/long paths that you may not have noticed), and the trace-cmd
> >>>> which records the tracepoint of the scan and the collapse functions
> >>>> then processes the data using the awk script-- the output being the
> >>>> scan rate, the pages collapsed, and their result status (grouped by
> >>>> order).
> >>>>
> >>>> You can also look into https://github.com/gormanm/mmtests for
> >>>> testing/comparing kernels. I was running the
> >>>> config-memdb-redis-benchmark-medium workload.
> >>>
> >>> Thanks. I'll take a look.
> >>>
> >>>>
> >>>>>
> >>>>>>
> >>>>>> Also, khugepaged, my changes, and your changes are all a victim of
> >>>>>> this. Once we drop the read lock (to either allocate the folio, or
> >>>>>> right before acquiring the write_lock), the state can change. In your
> >>>>>> case, yes, you are gathering more up to date information, but is it
> >>>>>> really that important/worth it to retake locks and rescan for each
> >>>>>> instance if we are about to reverify with the write lock taken?
> >>>>>
> >>>>> You said "reverify": You are removing the verification, so this step
> >>>>> won't be reverification, it will be verification. We do not want to
> >>>>> verify *after* we have already done 95% of latency-heavy stuff,
> >>>>> only to
> >>>>> know that we are going to fail.
> >>>>>
> >>>>> Algorithms in the kernel, in general, are of the following form: 1)
> >>>>> Verify if a condition is true, resulting in taking a control path -
> >>>>> > 2)
> >>>>> do a lot of stuff -> "no turning back" step, wherein before committing
> >>>>> (by taking locks, say), reverify if this is the control path we should
> >>>>> be in. You are eliminating step 1).
> >>>>>
> >>>>> Therefore, I will have to say that I disagree with your approach.
> >>>>>
> >>>>> On top of this, in the subjective analysis in [1], point number 7
> >>>>> (along
> >>>>> with point number 1) remains. And, point number 4 remains.
> >>>>
> >>>> for 1) your worst case of 1024 is not the worst case. There are 8
> >>>> possible orders in your implementation, if all are enabled, that is
> >>>> 4096 iterations in the worst case.
> >>>
> >>> Yes, that is exactly what I wrote in 1). I am still not convinced that
> >>> the overhead you produce + 512 iterations is going to beat 4096
> >>> iterations. Anyways, that is hand-waving and we should test this.
> >>>
> >>>> This becomes WAY worse on 64k page size, ~45,000 iterations vs 4096
> >>>> in my case.
> >>>
> >>> Sorry, I am missing something here; how does the number of iterations
> >>> change with page size? Am I not scanning the PTE table, which is
> >>> invariant to the page size?
> >>
> >> I got the calculation wrong the first time and it's actually worst.
> >> Lets hope I got this right this time
> >> on ARM64 64k kernel:
> >> PMD size = 512M
> >> PTE= 64k
> >> PTEs per PMD = 8192
> >
> > *facepalm* my bad, thanks. I got thrown off thinking HPAGE_PMD_NR won't
> > depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte) =
> > PAGE_SIZE / 8. So it does depend. You are correct, the PTEs per PMD is 1
> > << 13.
> >
> >> log2(8192) = 13 - 2 = 11 number of (m)THP sizes including PMD (the
> >> first and second order are skipped)
> >>
> >> Assuming I understand your algorithm correctly, in the worst case you
> >> are scanning the whole PMD for each order.
> >>
> >> So you scan 8192 PTEs 11 times. 8192 * 11 = 90112.
> >
> > Yup. Now it seems that the bitmap overhead may just be worth it; for the
> > worst case the bitmap will give us an 11x saving...for the average case,
> > it will give us 2x, but still, 8192 is a large number. I'll think of
>
> Clearing on this: the saving is w.r.t the initial scan. That is, if time
> taken by NP is x + y + collapse_huge_page(), where x is the PMD scan and
> y is the bitmap overhead, then time taken by DJ is 2x +
> collapse_huge_page(). In collapse_huge_page(), both perform PTE scans in
> _isolate(). Anyhow, we differ in opinion as to where the max_ptes_*
> check should be placed; I recalled the following:
>
> https://lore.kernel.org/all/20240809103129.365029-2-dev.jain@arm.com/
> https://lore.kernel.org/all/761ba58e-9d6f-4a14-a513-dcc098c2aa94@redhat.com/
>
> One thing you can do to relieve one of my criticisms (not completely) is
> apply the following patch (this can be done in both methods):
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b589f889bb5a..dc5cb602eaad 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1080,8 +1080,14 @@ static int __collapse_huge_page_swapin(struct
> mm_struct *mm,
>                 }
>
>                 vmf.orig_pte = ptep_get_lockless(pte);
> -               if (!is_swap_pte(vmf.orig_pte))
> +               if (!is_swap_pte(vmf.orig_pte)) {
>                         continue;
> +               } else {
> +                       if (order != HPAGE_PMD_ORDER) {
> +                               result = SCAN_EXCEED_SWAP_PTE;
> +                               goto out;
> +                       }
> +               }
>
>                 vmf.pte = pte;
>                 vmf.ptl = ptl;
> --
>
> But this really is the same thing being done in the links above :)

Yes, my code already does this:
@@ -1035,6 +1039,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
                if (!is_swap_pte(vmf.orig_pte))
                        continue;

+               if (order != HPAGE_PMD_ORDER) {
+                       result = SCAN_EXCEED_SWAP_PTE;
+                       goto out;
+               }
+
No need for the if/else nesting.
>
>
> > ways to test this out.
> >
> > Btw, I was made aware that an LWN article just got posted on our work!
> > https://lwn.net/Articles/1009039/
> >
> >>
> >> Please let me know if I'm missing something here.
> >>>
> >>>>>
> >>>>> [1]
> >>>>> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-
> >>>>> aba4b1a441b4@arm.com/
> >>>>>
> >>>>>>
> >>>>>> So in my eyes, this is not a "problem"
> >>>>>
> >>>>> Looks like the kernel scheduled us for a high-priority debate, I hope
> >>>>> there's no deadlock :)
> >>>>>
> >>>>>>
> >>>>>> Cheers,
> >>>>>> -- Nico
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> >
>
Ryan Roberts Feb. 18, 2025, 4:07 p.m. UTC | #13
On 11/02/2025 00:30, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> order to determine how full a THP must be to be eligible. If a mTHP collapse
> is attempted, but contains swapped out, or shared pages, we dont perform the
> collapse.
> 
> With the default max_ptes_none=511, the code should keep its most of its 
> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and 

nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
This is making my head hurt, but I *think* I agree with you that if
max_ptes_none is less than half of the number of ptes in a pmd, then creep
doesn't happen.

To make sure I've understood:

 - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
 - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
 - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
 - ...

So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
have 4 PTEs in a 32K area, which is insufficient to collapse to 32K.

Sounds good to me!
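
FWIW, a quick user-space sketch to sanity-check those numbers (assuming the
per-order cutoff is simply max_ptes_none shifted down by the order
difference, which is how I read the cover letter; with max_ptes_none = 255
it reproduces the 3 of 4, 5 of 8 and 9 of 16 figures above):

#include <stdio.h>

#define HPAGE_PMD_ORDER	9

int main(void)
{
	int max_ptes_none = 255;
	int order;

	/* order 2 = 16K, order 3 = 32K, order 4 = 64K with 4k base pages */
	for (order = 2; order <= 4; order++) {
		int nr_ptes = 1 << order;
		int scaled_none = max_ptes_none >> (HPAGE_PMD_ORDER - order);

		printf("order %d: need >= %d of %d PTEs present\n",
		       order, nr_ptes - scaled_none, nr_ptes);
	}
	return 0;
}
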

> constantly promote mTHPs to the next available size.
> 
> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> Patch 2:     Refactor/rename hpage_collapse
> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> Patch 6-9:   The mTHP patches
> 
> ---------
>  Testing
> ---------
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - I created a test script that I used to push khugepaged to its limits while
>    monitoring a number of stats and tracepoints. The code is available 
>    here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>    The summary from my testings was that there was no significant regression
>    noticed through this test. In some cases my changes had better collapse
>    latencies, and was able to scan more pages in the same amount of time/work,
>    but for the most part the results were consistant.
> - redis testing. I tested these changes along with my defer changes
>   (see followup post for more details).
> - some basic testing on 64k page size.
> - lots of general use. These changes have been running in my VM for some time.
> 
> Changes since V1 [2]:
> - Minor bug fixes discovered during review and testing
> - removed dynamic allocations for bitmaps, and made them stack based
> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> - Updated trace events to include collapsing order info.
> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> - No longer require a chunk to be fully utilized before setting the bit. Use
>    the same max_ptes_none scaling principle to achieve this.
> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>    some of the "creep" that was discovered in v1.
> 
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> 
> Nico Pache (9):
>   introduce khugepaged_collapse_single_pmd to unify khugepaged and
>     madvise_collapse
>   khugepaged: rename hpage_collapse_* to khugepaged_*
>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
>   khugepaged: generalize alloc_charge_folio for mTHP support
>   khugepaged: generalize __collapse_huge_page_* for mTHP support
>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>   khugepaged: add mTHP support
>   khugepaged: improve tracepoints for mTHP orders
>   khugepaged: skip collapsing mTHP to smaller orders
> 
>  include/linux/khugepaged.h         |   4 +
>  include/trace/events/huge_memory.h |  34 ++-
>  mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>  3 files changed, 306 insertions(+), 154 deletions(-)
>
Nico Pache Feb. 18, 2025, 10:30 p.m. UTC | #14
On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> > order to determine how full a THP must be to be eligible. If a mTHP collapse
> > is attempted, but contains swapped out, or shared pages, we dont perform the
> > collapse.
> >
> > With the default max_ptes_none=511, the code should keep its most of its
> > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>
> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
> This is making my head hurt, but I *think* I agree with you that if
> max_ptes_none is less than half of the number of ptes in a pmd, then creep
> doesn't happen.
Haha yea the compressed bitmap does not make the math super easy to
follow, but i'm glad we arrived at the same conclusion :)
>
> To make sure I've understood;
>
>  - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
>  - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
>  - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
>  - ...
>
> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
> have 4 PTEs in a 32K area which is insufficient to collapse to 32K.
>
> Sounds good to me!
Great! Another easy way to think about it is, with max_ptes_none =
HPAGE_PMD_NR/2, a collapse will double the size, and we only need half
for it to collapse again. Each size is 2x the last, so if we hit one
collapse, it will be eligible again next round.
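
A quick worked case of that boundary (assuming the limit scales as
max_ptes_none >> (HPAGE_PMD_ORDER - order)): at max_ptes_none =
HPAGE_PMD_NR/2 = 256, the scaled limit for order n is 256 >> (9 - n) = 2^(n-1),
so an order-n collapse needs 2^n - 2^(n-1) = 2^(n-1) present PTEs, i.e. exactly
half. A just-collapsed order-n region contributes 2^n present PTEs, which is
exactly half of the order-(n+1) region, so it meets the threshold again on the
next scan -- which is why the creep kicks in at (and above) that value.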
>
> > constantly promote mTHPs to the next available size.
> >
> > Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> > Patch 2:     Refactor/rename hpage_collapse
> > Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> > Patch 6-9:   The mTHP patches
> >
> > ---------
> >  Testing
> > ---------
> > - Built for x86_64, aarch64, ppc64le, and s390x
> > - selftests mm
> > - I created a test script that I used to push khugepaged to its limits while
> >    monitoring a number of stats and tracepoints. The code is available
> >    here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >    The summary from my testings was that there was no significant regression
> >    noticed through this test. In some cases my changes had better collapse
> >    latencies, and was able to scan more pages in the same amount of time/work,
> >    but for the most part the results were consistant.
> > - redis testing. I tested these changes along with my defer changes
> >   (see followup post for more details).
> > - some basic testing on 64k page size.
> > - lots of general use. These changes have been running in my VM for some time.
> >
> > Changes since V1 [2]:
> > - Minor bug fixes discovered during review and testing
> > - removed dynamic allocations for bitmaps, and made them stack based
> > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> > - Updated trace events to include collapsing order info.
> > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> > - No longer require a chunk to be fully utilized before setting the bit. Use
> >    the same max_ptes_none scaling principle to achieve this.
> > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >    some of the "creep" that was discovered in v1.
> >
> > [1] - https://gitlab.com/npache/khugepaged_mthp_test
> > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >
> > Nico Pache (9):
> >   introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >     madvise_collapse
> >   khugepaged: rename hpage_collapse_* to khugepaged_*
> >   khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >   khugepaged: generalize alloc_charge_folio for mTHP support
> >   khugepaged: generalize __collapse_huge_page_* for mTHP support
> >   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >   khugepaged: add mTHP support
> >   khugepaged: improve tracepoints for mTHP orders
> >   khugepaged: skip collapsing mTHP to smaller orders
> >
> >  include/linux/khugepaged.h         |   4 +
> >  include/trace/events/huge_memory.h |  34 ++-
> >  mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >  3 files changed, 306 insertions(+), 154 deletions(-)
> >
>
Dev Jain Feb. 19, 2025, 9:01 a.m. UTC | #15
On 19/02/25 4:00 am, Nico Pache wrote:
> On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 11/02/2025 00:30, Nico Pache wrote:
>>> The following series provides khugepaged and madvise collapse with the
>>> capability to collapse regions to mTHPs.
>>>
>>> To achieve this we generalize the khugepaged functions to no longer depend
>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>> on max_ptes_none is removed during the scan, to make sure we account for
>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
>>> is attempted, but contains swapped out, or shared pages, we dont perform the
>>> collapse.
>>>
>>> With the default max_ptes_none=511, the code should keep its most of its
>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>
>> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
>> This is making my head hurt, but I *think* I agree with you that if
>> max_ptes_none is less than half of the number of ptes in a pmd, then creep
>> doesn't happen.
> Haha yea the compressed bitmap does not make the math super easy to
> follow, but i'm glad we arrived at the same conclusion :)
>>
>> To make sure I've understood;
>>
>>   - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
>>   - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
>>   - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
>>   - ...
>>
>> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
>> have 4 PTEs in a 32K area which is insufficient to collapse to 32K.
>>
>> Sounds good to me!
> Great! Another easy way to think about it is, with max_ptes_none =
> HPAGE_PMD_NR/2, a collapse will double the size, and we only need half
> for it to collapse again. Each size is 2x the last, so if we hit one
> collapse, it will be eligible again next round.

Please someone correct me if I am wrong.

Consider this: you are collapsing a 256K folio. => #PTEs = 256K/4K = 64 
=> #chunks = 64 / 8 = 8.

Let the PTE state within the chunks be as follows:

Chunk 0: < 5 filled
Chunk 1:   5 filled
Chunk 2:   5 filled
Chunk 3:   5 filled
Chunk 4:   5 filled
Chunk 5: < 5 filled
Chunk 6: < 5 filled
Chunk 7: < 5 filled

Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8 (round down) = 204 
< HPAGE_PMD_NR/2).
=> To collapse we need at least 60% of the PTEs filled.

Your algorithm marks a chunk in the bitmap if 60% of the chunk is filled.
Then, if more than 60% of the chunks are set, we collapse.

Chunk 0 will be marked zero because less than 5 PTEs are filled => 
percentage filled <= 50%

Right now the state is
0111 1000
where the indices are the chunk numbers.
Since #1s = 4 => percent filled = 4/8 * 100 = 50%, 256K folio collapse 
won't happen.

For the first 4 chunks, the percent filled is 75%.  So the state becomes
1111 1000
after 128K collapse, and now 256K collapse will happen.

Either I got this correct, or I do not understand the utility of 
maintaining chunks :) What you are doing is what I am doing except that 
my chunk size = 1.

>>
>>> constantly promote mTHPs to the next available size.
>>>
>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
>>> Patch 2:     Refactor/rename hpage_collapse
>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>> Patch 6-9:   The mTHP patches
>>>
>>> ---------
>>>   Testing
>>> ---------
>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>> - selftests mm
>>> - I created a test script that I used to push khugepaged to its limits while
>>>     monitoring a number of stats and tracepoints. The code is available
>>>     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>>>     The summary from my testings was that there was no significant regression
>>>     noticed through this test. In some cases my changes had better collapse
>>>     latencies, and was able to scan more pages in the same amount of time/work,
>>>     but for the most part the results were consistant.
>>> - redis testing. I tested these changes along with my defer changes
>>>    (see followup post for more details).
>>> - some basic testing on 64k page size.
>>> - lots of general use. These changes have been running in my VM for some time.
>>>
>>> Changes since V1 [2]:
>>> - Minor bug fixes discovered during review and testing
>>> - removed dynamic allocations for bitmaps, and made them stack based
>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>> - Updated trace events to include collapsing order info.
>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>> - No longer require a chunk to be fully utilized before setting the bit. Use
>>>     the same max_ptes_none scaling principle to achieve this.
>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>>>     some of the "creep" that was discovered in v1.
>>>
>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>
>>> Nico Pache (9):
>>>    introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>      madvise_collapse
>>>    khugepaged: rename hpage_collapse_* to khugepaged_*
>>>    khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>    khugepaged: generalize alloc_charge_folio for mTHP support
>>>    khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>    khugepaged: add mTHP support
>>>    khugepaged: improve tracepoints for mTHP orders
>>>    khugepaged: skip collapsing mTHP to smaller orders
>>>
>>>   include/linux/khugepaged.h         |   4 +
>>>   include/trace/events/huge_memory.h |  34 ++-
>>>   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>   3 files changed, 306 insertions(+), 154 deletions(-)
>>>
>>
> 
>
Ryan Roberts Feb. 19, 2025, 5 p.m. UTC | #16
On 11/02/2025 00:30, Nico Pache wrote:
> The following series provides khugepaged and madvise collapse with the
> capability to collapse regions to mTHPs.
> 
> To achieve this we generalize the khugepaged functions to no longer depend
> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> (defined by MTHP_MIN_ORDER) that are utilized. 

Having been through the code, I don't think you can handle the case of a
partially covered PMD? That was never important before because by definition we
only cared about PMD-sized chunks, but it would be good to be able to collapse
memory within a VMA that doesn't cover an entire PMD to an appropriately sized
mTHP. Do you have plans to add that? Is the approach you've taken amenable to
extension in this way?

> This info is tracked
> using a bitmap. After the PMD scan is done, we do binary recursion on the
> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> on max_ptes_none is removed during the scan, to make sure we account for
> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> order to determine how full a THP must be to be eligible. If a mTHP collapse
> is attempted, but contains swapped out, or shared pages, we dont perform the
> collapse.
> 
> With the default max_ptes_none=511, the code should keep its most of its 
> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and 
> constantly promote mTHPs to the next available size.
> 
> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> Patch 2:     Refactor/rename hpage_collapse
> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> Patch 6-9:   The mTHP patches
> 
> ---------
>  Testing
> ---------
> - Built for x86_64, aarch64, ppc64le, and s390x
> - selftests mm
> - I created a test script that I used to push khugepaged to its limits while
>    monitoring a number of stats and tracepoints. The code is available 
>    here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>    The summary from my testings was that there was no significant regression
>    noticed through this test. In some cases my changes had better collapse
>    latencies, and was able to scan more pages in the same amount of time/work,
>    but for the most part the results were consistant.
> - redis testing. I tested these changes along with my defer changes
>   (see followup post for more details).
> - some basic testing on 64k page size.
> - lots of general use. These changes have been running in my VM for some time.
> 
> Changes since V1 [2]:
> - Minor bug fixes discovered during review and testing
> - removed dynamic allocations for bitmaps, and made them stack based
> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> - Updated trace events to include collapsing order info.
> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> - No longer require a chunk to be fully utilized before setting the bit. Use
>    the same max_ptes_none scaling principle to achieve this.
> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>    some of the "creep" that was discovered in v1.
> 
> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> 
> Nico Pache (9):
>   introduce khugepaged_collapse_single_pmd to unify khugepaged and
>     madvise_collapse
>   khugepaged: rename hpage_collapse_* to khugepaged_*
>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
>   khugepaged: generalize alloc_charge_folio for mTHP support
>   khugepaged: generalize __collapse_huge_page_* for mTHP support
>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>   khugepaged: add mTHP support
>   khugepaged: improve tracepoints for mTHP orders
>   khugepaged: skip collapsing mTHP to smaller orders
> 
>  include/linux/khugepaged.h         |   4 +
>  include/trace/events/huge_memory.h |  34 ++-
>  mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>  3 files changed, 306 insertions(+), 154 deletions(-)
>
Nico Pache Feb. 20, 2025, 7:12 p.m. UTC | #17
On Wed, Feb 19, 2025 at 2:01 AM Dev Jain <dev.jain@arm.com> wrote:
>
>
>
> On 19/02/25 4:00 am, Nico Pache wrote:
> > On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 11/02/2025 00:30, Nico Pache wrote:
> >>> The following series provides khugepaged and madvise collapse with the
> >>> capability to collapse regions to mTHPs.
> >>>
> >>> To achieve this we generalize the khugepaged functions to no longer depend
> >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> >>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>> on max_ptes_none is removed during the scan, to make sure we account for
> >>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> >>> order to determine how full a THP must be to be eligible. If a mTHP collapse
> >>> is attempted, but contains swapped out, or shared pages, we dont perform the
> >>> collapse.
> >>>
> >>> With the default max_ptes_none=511, the code should keep its most of its
> >>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> >>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> >>
> >> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
> >> This is making my head hurt, but I *think* I agree with you that if
> >> max_ptes_none is less than half of the number of ptes in a pmd, then creep
> >> doesn't happen.
> > Haha yea the compressed bitmap does not make the math super easy to
> > follow, but i'm glad we arrived at the same conclusion :)
> >>
> >> To make sure I've understood;
> >>
> >>   - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
> >>   - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
> >>   - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
> >>   - ...
> >>
> >> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
> >> have 4 PTEs in a 32K area which is insufficient to collapse to 32K.
> >>
> >> Sounds good to me!
> > Great! Another easy way to think about it is, with max_ptes_none =
> > HPAGE_PMD_NR/2, a collapse will double the size, and we only need half
> > for it to collapse again. Each size is 2x the last, so if we hit one
> > collapse, it will be eligible again next round.
>
> Please someone correct me if I am wrong.
>

max_ptes_none = 204
scaled_none = 204 >> (9 - 3) = 3 (204/64 = ~3.19 before rounding down)
so 4 pages need to be present in each chunk for the bit to be set, not 5.

at 204 the bitmap check is
512 - 1 - 204 = 307
(PMD)   307 >> 3 = 38
(1024k) 307 >> 4 = 19
(512k)  307 >> 5 = 9
(256k)  307 >> 6 = 4
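
(Reading those shifts as the number of set chunk bits required at each level,
that works out to 38 of 64 for a PMD collapse, 19 of 32 for 1024k, 9 of 16 for
512k, and 4 of 8 for 256k -- roughly the same ~60% everywhere, with the integer
rounding making the smaller orders slightly more forgiving.)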

> Consider this; you are collapsing a 256K folio. => #PTEs = 256K/4K = 64
> => #chunks = 64 / 8 = 8.
>
> Let the PTE state within the chunks be as follows:
>
> Chunk 0: < 5 filled   Chunk 1: 5 filled   Chunk 2: 5 filled   Chunk 3: 5
> filled
>
> Chunk 4: 5 filled   Chunk 5: < 5 filled   Chunk 6: < 5 filled   Chunk 7:
> < 5 filled
>
> Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8 (round down) = 204
> < HPAGE_PMD_NR/2).
> => To collapse we need at least 60% of the PTEs filled.
>
> Your algorithm marks chunks in the bitmap if 60% of the chunk is filled.
> Then, if the number of chunks set is greater than 60%, then we will
> collapse.
>
> Chunk 0 will be marked zero because less than 5 PTEs are filled =>
> percentage filled <= 50%
>
> Right now the state is
> 0111 1000
> where the indices are the chunk numbers.
> Since #1s = 4 => percent filled = 4/8 * 100 = 50%, 256K folio collapse
> won't happen.
>
> For the first 4 chunks, the percent filled is 75%.  So the state becomes
> 1111 1000
> after 128K collapse, and now 256K collapse will happen.
>
> Either I got this correct, or I do not understand the utility of
> maintaining chunks :) What you are doing is what I am doing except that
> my chunk size = 1.

Ignoring all the math, and just going off the 0111 1000
We do "creep", but its not the same type of "creep" we've been
describing. The collapse in the first half will allow the collapse in
order++ to collapse but it stops there and doesnt keep getting
promoted to a PMD size. That is unless the adjacent 256k also has some
bits set, then it can collapse to 512k. So I guess we still can creep,
but its way less aggressive and only when there is actual memory being
utilized in the adjacent chunk, so it's not like we are creating a
huge waste.
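
To put numbers on that, using the 9-of-16 threshold from the arithmetic above:
after the 128K and then 256K collapses, the left 256K half has all 8 of its
chunk bits set, but 8 of 16 is still short of the 9 needed for a 512K collapse,
so promotion stops there unless at least one chunk bit is already set in the
adjacent 256K half.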


>
> >>
> >>> constantly promote mTHPs to the next available size.
> >>>
> >>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> >>> Patch 2:     Refactor/rename hpage_collapse
> >>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >>> Patch 6-9:   The mTHP patches
> >>>
> >>> ---------
> >>>   Testing
> >>> ---------
> >>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>> - selftests mm
> >>> - I created a test script that I used to push khugepaged to its limits while
> >>>     monitoring a number of stats and tracepoints. The code is available
> >>>     here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >>>     The summary from my testings was that there was no significant regression
> >>>     noticed through this test. In some cases my changes had better collapse
> >>>     latencies, and was able to scan more pages in the same amount of time/work,
> >>>     but for the most part the results were consistant.
> >>> - redis testing. I tested these changes along with my defer changes
> >>>    (see followup post for more details).
> >>> - some basic testing on 64k page size.
> >>> - lots of general use. These changes have been running in my VM for some time.
> >>>
> >>> Changes since V1 [2]:
> >>> - Minor bug fixes discovered during review and testing
> >>> - removed dynamic allocations for bitmaps, and made them stack based
> >>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> >>> - Updated trace events to include collapsing order info.
> >>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> >>> - No longer require a chunk to be fully utilized before setting the bit. Use
> >>>     the same max_ptes_none scaling principle to achieve this.
> >>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >>>     some of the "creep" that was discovered in v1.
> >>>
> >>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >>>
> >>> Nico Pache (9):
> >>>    introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >>>      madvise_collapse
> >>>    khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>    khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>    khugepaged: generalize alloc_charge_folio for mTHP support
> >>>    khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>    khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>    khugepaged: add mTHP support
> >>>    khugepaged: improve tracepoints for mTHP orders
> >>>    khugepaged: skip collapsing mTHP to smaller orders
> >>>
> >>>   include/linux/khugepaged.h         |   4 +
> >>>   include/trace/events/huge_memory.h |  34 ++-
> >>>   mm/khugepaged.c                    | 422 +++++++++++++++++++----------
> >>>   3 files changed, 306 insertions(+), 154 deletions(-)
> >>>
> >>
> >
> >
>
Dev Jain Feb. 21, 2025, 4:57 a.m. UTC | #18
On 21/02/25 12:42 am, Nico Pache wrote:
> On Wed, Feb 19, 2025 at 2:01 AM Dev Jain <dev.jain@arm.com> wrote:
>>
>>
>>
>> On 19/02/25 4:00 am, Nico Pache wrote:
>>> On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 11/02/2025 00:30, Nico Pache wrote:
>>>>> The following series provides khugepaged and madvise collapse with the
>>>>> capability to collapse regions to mTHPs.
>>>>>
>>>>> To achieve this we generalize the khugepaged functions to no longer depend
>>>>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
>>>>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
>>>>> using a bitmap. After the PMD scan is done, we do binary recursion on the
>>>>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
>>>>> on max_ptes_none is removed during the scan, to make sure we account for
>>>>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
>>>>> order to determine how full a THP must be to be eligible. If a mTHP collapse
>>>>> is attempted, but contains swapped out, or shared pages, we dont perform the
>>>>> collapse.
>>>>>
>>>>> With the default max_ptes_none=511, the code should keep its most of its
>>>>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
>>>>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>>>>
>>>> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
>>>> This is making my head hurt, but I *think* I agree with you that if
>>>> max_ptes_none is less than half of the number of ptes in a pmd, then creep
>>>> doesn't happen.
>>> Haha yea the compressed bitmap does not make the math super easy to
>>> follow, but i'm glad we arrived at the same conclusion :)
>>>>
>>>> To make sure I've understood;
>>>>
>>>>    - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
>>>>    - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
>>>>    - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
>>>>    - ...
>>>>
>>>> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
>>>> have 4 PTEs in a 32K area which is insufficient to collapse to 32K.
>>>>
>>>> Sounds good to me!
>>> Great! Another easy way to think about it is, with max_ptes_none =
>>> HPAGE_PMD_NR/2, a collapse will double the size, and we only need half
>>> for it to collapse again. Each size is 2x the last, so if we hit one
>>> collapse, it will be eligible again next round.
>>
>> Please someone correct me if I am wrong.
>>
> 
> max_ptes_none = 204
> scaled_none = 204 >> (9 - 3) = 3 (204/64 = ~3.19 before rounding down)
> so 4 pages need to be present in each chunk for the bit to be set, not 5.
> 
> at 204 the bitmap check is
> 512 - 1 - 204 = 307
> (PMD)   307 >> 3 = 38
> (1024k) 307 >> 4 = 19
> (512k)  307 >> 5 = 9
> (256k)  307 >> 6 = 4
> 
>> Consider this; you are collapsing a 256K folio. => #PTEs = 256K/4K = 64
>> => #chunks = 64 / 8 = 8.
>>
>> Let the PTE state within the chunks be as follows:
>>
>> Chunk 0: < 5 filled   Chunk 1: 5 filled   Chunk 2: 5 filled   Chunk 3: 5
>> filled
>>
>> Chunk 4: 5 filled   Chunk 5: < 5 filled   Chunk 6: < 5 filled   Chunk 7:
>> < 5 filled
>>
>> Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8 (round down) = 204
>> < HPAGE_PMD_NR/2).
>> => To collapse we need at least 60% of the PTEs filled.
>>
>> Your algorithm marks chunks in the bitmap if 60% of the chunk is filled.
>> Then, if the number of chunks set is greater than 60%, then we will
>> collapse.
>>
>> Chunk 0 will be marked zero because less than 5 PTEs are filled =>
>> percentage filled <= 50%
>>
>> Right now the state is
>> 0111 1000
>> where the indices are the chunk numbers.
>> Since #1s = 4 => percent filled = 4/8 * 100 = 50%, 256K folio collapse
>> won't happen.
>>
>> For the first 4 chunks, the percent filled is 75%.  So the state becomes
>> 1111 1000
>> after 128K collapse, and now 256K collapse will happen.
>>
>> Either I got this correct, or I do not understand the utility of
>> maintaining chunks :) What you are doing is what I am doing except that
>> my chunk size = 1.
> 
> Ignoring all the math, and just going off the 0111 1000 state:
> we do "creep", but it's not the same type of "creep" we've been
> describing. The collapse in the first half allows the collapse at
> order++ to happen, but it stops there and doesn't keep getting
> promoted up to PMD size. That is, unless the adjacent 256k also has
> some bits set, in which case it can collapse to 512k. So I guess we can
> still creep, but it's far less aggressive, and only when there is actual
> memory being utilized in the adjacent chunk, so it's not like we are
> creating a huge waste.

I get you. You will creep when the adjacent chunk has at least 1 bit 
set. I don't really have a strong opinion on this one.

> 
> 
>>
>>>>
>>>>> constantly promote mTHPs to the next available size.
>>>>>
>>>>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
>>>>> Patch 2:     Refactor/rename hpage_collapse
>>>>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
>>>>> Patch 6-9:   The mTHP patches
>>>>>
>>>>> ---------
>>>>>    Testing
>>>>> ---------
>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>>>> - selftests mm
>>>>> - I created a test script that I used to push khugepaged to its limits while
>>>>>      monitoring a number of stats and tracepoints. The code is available
>>>>>      here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
>>>>>      The summary from my testings was that there was no significant regression
>>>>>      noticed through this test. In some cases my changes had better collapse
>>>>>      latencies, and was able to scan more pages in the same amount of time/work,
>>>>>      but for the most part the results were consistant.
>>>>> - redis testing. I tested these changes along with my defer changes
>>>>>     (see followup post for more details).
>>>>> - some basic testing on 64k page size.
>>>>> - lots of general use. These changes have been running in my VM for some time.
>>>>>
>>>>> Changes since V1 [2]:
>>>>> - Minor bug fixes discovered during review and testing
>>>>> - removed dynamic allocations for bitmaps, and made them stack based
>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
>>>>> - Updated trace events to include collapsing order info.
>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
>>>>> - No longer require a chunk to be fully utilized before setting the bit. Use
>>>>>      the same max_ptes_none scaling principle to achieve this.
>>>>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
>>>>>      some of the "creep" that was discovered in v1.
>>>>>
>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>>>
>>>>> Nico Pache (9):
>>>>>     introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>>>       madvise_collapse
>>>>>     khugepaged: rename hpage_collapse_* to khugepaged_*
>>>>>     khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>>>     khugepaged: generalize alloc_charge_folio for mTHP support
>>>>>     khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>>>     khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>>>     khugepaged: add mTHP support
>>>>>     khugepaged: improve tracepoints for mTHP orders
>>>>>     khugepaged: skip collapsing mTHP to smaller orders
>>>>>
>>>>>    include/linux/khugepaged.h         |   4 +
>>>>>    include/trace/events/huge_memory.h |  34 ++-
>>>>>    mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>>>    3 files changed, 306 insertions(+), 154 deletions(-)
>>>>>
>>>>
>>>
>>>
>>
>